
Federated Learning over Wireless Networks: Convergence Analysis and Resource Allocation

Canh T. Dinh, Nguyen H. Tran, Minh N. H. Nguyen, Choong Seon Hong, Wei Bao, Albert Y. Zomaya, Vincent Gramoli

Abstract—There is an increasing interest in a fast-growing communication. For example, location-based services such as technique called Federated Learning (FL), the app Waze [3], can help users avoid heavy-traffic roads and in which the model training is distributed over mobile user thus reduce congestion. However, to use this application, users equipment (UEs), exploiting UEs’ local computation and training data. Despite its advantages such as preserving data privacy, have to share their own locations with the server and it cannot FL still has challenges of heterogeneity across UEs’ data and guarantee that the location of drivers is kept safely. Besides, physical resources. To address these challenges, we first propose in order to suggest the optimal route for drivers, Waze collects FEDL, a FL algorithm which can handle heterogeneous UE data a large number of data including every road driven to transfer without further assumptions except strongly convex and smooth to the data center. Transferring this amount of data requires loss functions. We provide a convergence rate characterizing the trade-off between local computation rounds of each UE to update a high expense in communication and drivers’ devices to be its local model and global communication rounds to update the connected to the Internet continuously. FL global model. We then employ FEDL in wireless networks In order to maintain the privacy of consumer data and as a resource allocation optimization problem that captures the reduce the communication cost, it is necessary to have an trade-off between FEDL convergence wall clock time and energy emergence of a new class of machine learning techniques consumption of UEs with heterogeneous computing and power resources. Even though the wireless resource allocation problem that shifts computation to the edge network where the privacy of FEDL is non-convex, we exploit this problem’s structure to of data is maintained. One such popular technique is called decompose it into three sub-problems and analyze their closed- Federated Learning (FL) [4]. This technology allows users to form solutions as well as insights into problem design. Finally, collaboratively build a shared learning model while preserving we empirically evaluate the convergence of FEDL with PyTorch all training data on their user equipment (UE). In particular, experiments, and provide extensive numerical results for the wireless resource allocation sub-problems. Experimental results a UE computes the updates to the current global model on show that FEDL outperforms the vanilla FedAvg algorithm in its local training data, which is then aggregated and fed-back terms of convergence rate and test accuracy in various settings. by a central server, so that all UEs have access to the same Index Terms—Distributed Machine Learning, Federated global model to compute their new updates. This process Learning, Optimization Decomposition. is repeated until an accuracy level of the learning model is reached. In this way, the user data privacy is well protected because local training data are not shared, which thus differs I.INTRODUCTION FL from conventional approaches in data acquisition, storage, The significant increase in the number of cutting-edge and training. mobile and (IoT) devices results in the There are several reasons why FL is attracting plenty of phenomenal growth of the data volume generated at the edge interests. Firstly, modern smart UEs are now able to handle network. 
It has been predicted that in 2025 there will be heavy computing tasks of intelligent applications as they are 80 billion devices connected to the Internet and the global armed with high-performance central processing units (CPUs), data will achieve 180 trillion gigabytes [2]. However, most graphics processing units (GPUs) and integrated AI chips arXiv:1910.13067v4 [cs.LG] 29 Oct 2020 of this data is privacy-sensitive in nature. It is not only risky called neural processing units (e.g., Snapdragon 845, Kirin to store this data in data centers but also costly in terms of 980 CPU and Apple A12 Bionic CPU [5]). Being equipped with the latest computing resources at the edge, the model C. T. Dinh, N. H. Tran, W. Bao, and A.Y. Zomaya are with the training can be updated locally leading to the reduction in School of Computer Science, The University of Sydney, Sydney, NSW the time to upload raw data to the data center. Secondly, the 2006, Australia (email: [email protected],{nguyen.tran, wei.bao, albert.zomaya}@sydney.edu.au). V. Gramoli is with the School of Computer increase in storage capacity, as well as the plethora of sensors Science, The University of Sydney, Sydney, NSW 2006, Australia and the Dis- (e.g., cameras, microphones, GPS) in UEs enables them to tributed Computing Lab at EPFL, Station 14 CH-1015 Lausanne, Switzerland collect a wealth amount of data and store it locally. This (email: vincent.gramoli@epfl.ch). M. N. H. Nguyen is with the Department of Computer Science and Engineering, Kyung Hee University, South Korea facilitates unprecedented large-scale flexible data collection and also Vietnam - Korea University of Information and Communication and model training. With recent advances in edge computing, Technology, The University of Danang, Vietnam (email: [email protected]). FL can be more easily implemented in reality. For example, a C. S. Hong is with the Department of Computer Science and Engineering, Kyung Hee University, South Korea (email: [email protected]). This research crowd of smart devices can proactively sense and collect data is funded by Vietnam National Foundation for Science and Technology during the day hours, then jointly feedback and update the Development (NAFOSTED) under grant number 102.02-2019.321. Part of this global model during the night hours, to improve the efficiency work was presented at IEEE INFOCOM 2019 [1]. (Corresponding authors: Nguyen H. Tran and Choong Seon Hong.) and accuracy for next-day usage. We envision that such this approach will boost a new generation of smart services, such 2 as smart transportation, smart shopping, and smart hospital. of training loss, convergence rate and test accuracy. Despite its promising benefits, FL comes with new chal- • We propose a resource allocation problem for FEDL lenges to tackle. On one hand, the number of UEs in FL over wireless networks to capture the trade-off between can be large and the data generated by UEs have diverse the wall clock training time of FEDL and UE energy distributions [4]. Designing efficient algorithms to handle consumption by using the Pareto efficiency model. To statistical heterogeneity with convergence guarantee is thus handle the non-convexity of this problem, we exploit its a priority question. Recently, several studies [4], [6], [7] special structure to decompose it into three sub-problems. 
have used de facto optimization algorithms such as Gradient The first two sub-problems relate to UE resource al- Descent (GD), Stochastic (SGD) to enable location over wireless networks, which are transformed devices’ local updates in FL. One of the most well-known to be convex and solved separately; then their solutions methods named FedAvg [4] which uses average SGD updates are used to obtain the solution to the third sub-problem, was experimentally shown to perform well in heterogeneous which gives the optimal η and θ of FEDL. We derive UE data settings. However, this work lacks theoretical con- their closed-form solutions, and characterize the impact vergence analysis. By leveraging edge computing to enable of the Pareto-efficient controlling knob to the optimal: (i) FL, [7] proposed algorithms for heterogeneous FL networks computation and communication training time, (ii) UE by using GD with bounded gradient divergence assumption resource allocation, and (iii) hyper-learning rate and local to facilitate the convergence analysis. In another direction, accuracy. We also provide extensive numerical results to the idea of allowing UEs to solve local problems in FL with examine the impact of UE heterogeneity and Pareto curve arbitrary optimization algorithm to obtain a local accuracy (or of UE energy cost and wall clock training time. inexactness level) has attracted a number of researchers [8], The rest of this paper is organized as follows. Section II dis- [9]. While [8] uses primal-dual analysis to prove the algorithm cusses related works. Section III contains system model. Sec- convergence under any distribution of data, the authors of tions IV and V provide the proposed FL algorithm’s analysis [9] propose adding proximal terms to local functions and use and resource allocation over wireless networks, respectively. primal analysis for convergence proof with a local dissimilarity Experimental performance of FEDL and numerical results of assumption, a similar idea of bounding the gradient divergence the resource allocation problem are provided in Section VI and between local and global loss functions. Section VII, respectively. Section VIII concludes our work. While all of the above FL algorithms’ complexities are measured in terms of the number of local and global update rounds (or iterations), the wall clock time of FL when deployed II.RELATED WORKS in a wireless environment mainly depends on the number of Due to Big Data applications and complex models such as UEs and their diverse characteristics, since UEs may have , training machine learning models needs to be different hardware, energy budget, and wireless connection distributed over multiple machines, giving rise to researches status. Specifically, the total wall-clock training time of FL on decentralized machine learning [10]–[14]. However, most includes not only the UE computation time (which depend on of the algorithms in these works are designed for machines UEs’ CPU types and local data sizes) but also the commu- having balanced and/or independent and identically distributed nication time of all UEs (which depends on UEs’ channel (i.i.d.) data. Realizing the lack of studies in dealing with gains, transmission power, and local data sizes). 
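To make this timing decomposition concrete, the short sketch below computes the wall-clock time of one global round as the synchronous computation phase (set by the slowest UE) plus the sum of time-sharing uplink slots; it anticipates the model T_g = T_co + K_l T_cp formalized in Section V, and all UE names and numbers are illustrative assumptions.

```python
# Illustrative only: per-round wall-clock time of synchronous FL over a
# time-sharing uplink, anticipating Section V's model (T_g = T_co + K_l * T_cp).
comp_time = {"UE1": 0.8, "UE2": 1.5, "UE3": 0.6}     # c_n * D_n / f_n, seconds (assumed)
comm_time = {"UE1": 0.05, "UE2": 0.12, "UE3": 0.07}  # uplink slot tau_n, seconds (assumed)
K_l = 20                                             # local rounds per global round

T_cp = max(comp_time.values())       # slowest UE sets the computation phase
T_co = sum(comm_time.values())       # TDMA-like sharing: slots add up
T_g = T_co + K_l * T_cp              # wall-clock time of one global round
print(f"T_cp={T_cp:.2f}s, T_co={T_co:.2f}s, T_g={T_g:.2f}s")
```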
Thus, to unbalanced and heterogeneous data distribution, an increasing minimize the wall-clock training time of FL, a careful resource number of researchers place interest in studying FL, a state- allocation problem for FL over wireless networks needs to of-the-art distributed machine learning technique [4], [7], [9], consider not only the FL parameters such as accuracy level [15]–[17]. This technique takes advantage of the involvement for computation-communication trade-off, but also allocating of a large number of devices where data are generated locally, the UEs’ resources such as power and CPU cycles with which makes them statistically heterogeneous in nature. As respect to wireless conditions. From the motivations above, a result, designing algorithms with global model’s conver- our contributions are summarized as follows: gence guarantee becomes challenging. There are two main • We propose a new FL algorithm with only assumption of approaches to overcome this problem. strongly convex and smooth loss functions, named FEDL. The first approach is based on de facto algorithm SGD The crux of FEDL is a new local surrogate function, with a fixed number of local iterations on each device [4]. which is designed for each UE to solve its local problem Despite its feasibility, these studies still have limitations as approximately up to a local accuracy level θ, and is lacking the convergence analysis. The work in [7], on the characterized by a hyper-learning rate η. Using primal other hand, used GD and additional assumptions on Lipschitz convergence analysis, we show the linear convergence local functions and bounded gradient divergence to prove the rate of FEDL by controlling η and θ, which also provides algorithm convergence. the trade-off between the number of local computa- Another useful approach to tackling the heterogeneity chal- tion and global communication rounds. We then employ lenge is to allow UEs to solve their primal problems ap- FEDL, using both strongly convex and non-convex loss proximately up to a local accuracy threshold [9], [16], [17]. functions, on PyTorch to verify its performance with Their works show that the main benefit of this approximation several federated datasets. The experimental results show approach is that it allows flexibility in the compromise between that FEDL outperforms the vanilla FedAvg [4] in terms the number of rounds run on the local model update and the 3 communication to the server for the global model update. Then, the learning model is the minimizer of the following [17] exploits proximal stochastic variance reduced gradient global loss function minimization problem methods for both convex and non-convex FL. While the XN authors of [9] use primal convergence analysis with bounded min F (w) := pnFn(w), (1) w∈ d n=1 gradient divergence assumption and show that their algorithm R Dn can apply to non-convex FL setting, [16] uses primal-dual where pn := D , ∀n. convergence analysis, which is only applicable to FL with Assumption 1. Fn(·) is L-smooth and β-strongly convex, ∀n, convex problems. 0 d respectively, as follows, ∀w, w ∈ R : From a different perspective, many researchers have recently 0 0 0 L 0 2 focused on the efficient communications between UEs and Fn(w) ≤ Fn(w )+ ∇Fn(w ), w − w + kw − w k edge servers in FL-supported networks [1], [7], [18]–[23]. 2 0 0 0 β 0 2 The work [7] proposes algorithms for FL in the context Fn(w) ≥ Fn(w )+ ∇Fn(w ), w − w + kw − w k . of edge networks with resource constraints. 
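To illustrate the weighting p_n = D_n/D in (1), the following is a minimal NumPy sketch of the global objective as a data-size-weighted average of local empirical losses; the squared-error loss and the synthetic data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def local_loss(w, X, y):
    # F_n(w) = (1/D_n) * sum_i 0.5 * (<x_i, w> - y_i)^2   (squared-error example)
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def global_loss(w, datasets):
    # F(w) = sum_n p_n * F_n(w), with p_n = D_n / D as in (1)
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    p = sizes / sizes.sum()
    return sum(p_n * local_loss(w, X, y) for p_n, (X, y) in zip(p, datasets))

# toy usage: three UEs with unbalanced synthetic data
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
datasets = []
for D_n in (50, 120, 200):
    X = rng.normal(size=(D_n, 5))
    datasets.append((X, X @ w_true + 0.1 * rng.normal(size=D_n)))
print(global_loss(np.zeros(5), datasets))
```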
While there are 2 0 several works [24], [25] that study minimizing communicated Throughout this paper, hw, w i denotes the inner product of 0 messages for each global iteration update by applying spar- vectors w and w and k·k is Euclidean norm. We note that sification and quantization, it is still a challenge to utilize strong convexity and smoothness in Assumption 1, also used them in FL networks. For example, [18] uses the gradient in [7], can be found in a wide range of applications such as l2- 1 quantization, gradient sparsification, and error accumulation to regularized model with fi(w) = 2 (hxi, wi − 2 β 2 compress gradient message under the wireless multiple-access yi) + 2 kwk , yi ∈ R, and l2-regularized logistic regression  β 2 channel with the assumption of noiseless communication. with fi(w) = log 1 + exp(−yihxi, wi) + 2 kwk , yi ∈ L The work [19] studies a similar quantization technique to {−1, 1}. We also denote ρ := β the condition number of explore convergence guarantee with low-precision training. Fn(·)’s Hessian matrix. [23] considers joint learning with a subset of users and wireless factors such as packet errors and the availability IV. FEDERATED LEARNING ALGORITHM DESIGN of wireless resources while [20] focuses on using cell-free In this section, we propose a FL algorithm, named FEDL, as massive MIMO to support FL. [21], [22] apply a Stackelberg presented in Algorithm 1. To solve problem (1), FEDL uses an game to motivate the participation of the clients during the iterative approach that requires Kg global rounds for global aggregation. Contrary to most of these works which make use model updates. In each global round, there are interactions of existing, standard FL algorithms, our work proposes a new between the UEs and edge server as follows. one. Nevertheless, these works lack studies on unbalanced and UEs update local models: In order to obtain the local heterogeneous data among UEs. We study how the computa- t model wn at a global round t, each UE n first receives the tion and communication characteristics of UEs can affect their feedback information wt−1 and ∇F¯t−1 (which will be defined energy consumption, training time, convergence and accuracy later in (4) and (5), respectively) from the server, and then level of FL, considering heterogeneous UEs in terms of data minimize its following surrogate function (line 3) size, channel gain and computational and transmission power t ¯t−1 t−1 capabilities. min J (w) := Fn(w) + η∇F − ∇Fn(w ), w . (2) d n w∈R One of the key ideas of FEDL is UEs can solve (2) approxi- III.SYSTEM MODEL t mately to obtain an approximation solution wn satisfying t t t t−1 We consider a wireless multi-user system which consists of ∇Jn(wn) ≤ θ ∇Jn(w ) , ∀n, (3) one edge server and a set N of N UEs. Each participating which is parametrized by a local accuracy θ ∈ (0, 1) that is UE n stores a local dataset Dn, with its size denoted by Dn. PN common to all UEs. This local accuracy concept resembles Then, we can define the total data size by D = Dn. In n=1 the approximate factors in [8], [26]. Here θ = 0 means an example of the supervised learning setting, at UE n, Dn defines the collection of data samples given as a set of input- the local problem (2) is required to be solved optimally, Dn d and θ = 1 means no progress for local problem, e.g., by output pairs {xi, yi}i=1, where xi ∈ R is an input sample setting wt = wt−1. The surrogate function J t (.) 
(2) is vector with d features, and yi ∈ R is the labeled output value n n motivated from the scheme Distributed Approximate NEwton for the sample xi. The data can be generated through the usage of UE, for example, via interactions with mobile apps. (DANE) proposed in [11]. However, DANE requires (i) the global gradient ∇F (wt−1) (which is not available at UEs In a typical learning problem, for a sample data {x , y } i i or server in FL context), (ii) additional proximal terms (i.e., with input xi (e.g., the response time of various apps inside µ t−1 2 d w − w ), and (iii) solving local problem (2) exactly the UE), the task is to find the model parameter w ∈ R that 2 (i.e., θ = 0). On the other hand, FEDL uses (i) the global characterizes the output yi (e.g., label of edge server load, such gradient estimate ∇F¯t−1, which can be measured by the as high or low, in next hours) with the loss function fi(w). The loss function on the data set of UE n is defined as server from UE’s information, instead of exact but unrealistic ∇F (wt−1), (ii) avoids using proximal terms to limit additional 1 X controlling parameter (i.e., µ), and (iii) flexibly solves local Fn(w) := fi(w). Dn i∈Dn problem approximately by controlling θ. Furthermore, we 4

Algorithm 1 FEDL Lemma 1. With Assumption 1 and the assumed linear con- t−1 1: Input: w0, θ ∈ [0, 1], η > 0. vergence rate (8) with z0 = w , the number of local rounds K for solving to achieve a θ-approximation condition 2: for t = 1 to Kg do l (2) (3) 3: Computation: Each UE n receives wt−1 and ∇F¯t−1 is from the server, and solves (2) in Kl rounds to achieve 2 C t Kl = log , (9) θ-approximation solution wn satisfying (3). γ θ t t 4: Communication: UE n transmit wn and ∇Fn(wn), where C := cρ. ∀n, to the edge server. Theorem 1. With Assumption 1, the convergence of FEDL is 5: Aggregation and Feedbacks: The edge server updates achieved with linear rate the global model wt and ∇F¯t as in (4) and (5), t ∗ t (0) ∗ respectively, and then fed-backs them to all UEs. F (w ) − F (w ) ≤ (1 − Θ) (F (w ) − F (w )), (10) when Θ ∈ (0, 1), and 2 2 2 t t−1 t−1 η(2(θ − 1) − (θ + 1)θ(3η + 2)ρ − (θ + 1)ηρ ) have ∇J (w) = ∇Fn(w) + η∇F¯ − ∇Fn(w ), which Θ := . n 2 2 2  includes both local and global gradient estimate weighted by 2ρ (1 + θ) η ρ + 1 a controllable parameter η. We will see later how η affects to (11) the convergence of FEDL. Compared to the vanilla FedAvg, Corollary 1. The number of global rounds for FEDL to t FEDL requires more information (UEs sending not only wn achieve the convergence satisfying (6) is but also ∇F (wt )) to obtain the benefits of a) theoretical n n 1 F (w0) − F (w∗) linear convergence and b) experimental faster convergence, Kg = log . (12) which will be shown in later sections. And we will also show Θ  that with Assumption 1, the theoretical analysis of FEDL does The proof of this corollary can be shown similarly to that not require the gradient divergence bound assumption, which of Lemma 1. We have some following remarks: is typically required in non-strongly convex cases as in [7, 1) The convergence of FEDL can always be obtained by set- Definition 1], [9, Assumption 1]. ting sufficiently small values of both η and θ ∈ (0, 1) such Edge server updates global model: that Θ ∈ (0, 1). While the denominator of (11) is greater t t After receiving the local model wn and gradient ∇Fn(wn), than 2, its numerator can be rewritten as 2η(A−B), where ∀n, the edge server aggregates them as follows A = 2(θ −1)2 −(θ +1)θ(3η +2)ρ2 and B = (θ +1)ηρ2. N Since limθ→0 A = 2 and limθ,η→0 B = 0, there exists t X t w := pnwn, (4) small values of θ and η such that A − B > 0, thus n=1 N Θ > 0. On the other hand, we have limη→0 Θ = 0; thus, ¯t X t ∇F := pn∇Fn(wn) (5) n=1 there exists a small value of η such that Θ < 1. We note that Θ ∈ (0, 1) is only the sufficient condition, but not and then broadcast wt and ∇F¯t to all UEs (line 5), which the necessary condition, for the convergence of FEDL. are required for participating UEs to minimize their surrogate Thus, there exist possible hyper-parameter settings such J t+1 in the next global round t+1. We see that the edge server n that FEDL converges but Θ ∈/ (0, 1). does not access the local data D , ∀n, thus preserving data n 2) There is a convergence trade-off between the number privacy. For an arbitrary small constant  > 0, the problem (1) of local and global rounds characterized by θ: small θ achieves a global model convergence wt when its satisfies makes large Kl, yet small Kg, according to (9) and t ∗ F (w ) − F (w ) ≤ , ∀t ≥ Kg, (6) (12), respectively. This trade-off was also observed by authors of [8], though their technique (i.e., primal-dual ∗ where w is the optimal solution to (1). optimization) is different from ours. 
Next, we will provide the convergence analysis for FEDL. 3) While θ affects to both local and global convergence, η t We see that Jn(w) is also β-strongly convex and L-smooth as only affects to the global convergence rate of FEDL. If Fn(·) because they have the same Hessian matrix. With these η is small, then Θ is also small, thus inducing large Kg. t properties of Jn(w), we can use GD to solve (2) as follows However, if η is large enough, Θ may not be in (0, 1), which leads to the divergence of FEDL. We call η the zk+1 = zk − h ∇J t (zk), (7) k n hyper-learning rate for the global problem (1). where zk is the local model update and hk is a predefined 4) The condition number ρ also affects to the FEDL con- learning rate at iteration k, which has been shown to generate vergence: if ρ is large (i.e., poorly conditioned problem a convergent sequence (zk)k≥0 satisfying a linear convergence (2)), both η and θ should be sufficiently small in order for rate [27] as follows Θ ∈ (0, 1) (i.e., slow convergence rate.) This observation t t ∗ k t t ∗  is well-aligned to traditional optimization convergence Jn(zk) − Jn(z ) ≤ c(1 − γ) Jn(z0) − Jn(z ) , (8) analysis [28, Chapter 9]. where z∗ is the optimal solution to the local problem (2), and 5) In this work, the theory of FEDL is applicable to (i) c and γ ∈ (0, 1) are constants depending on ρ. full data passing using GD, (ii) the participation of all UEs, and (iii) strongly convex and smooth loss functions. 5
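To make Algorithm 1 and the roles of η and θ concrete, below is a simplified NumPy sketch of FEDL for the linear-regression example: each UE takes gradient steps (7) on its surrogate J_n^t in (2), and the server aggregates via (4)-(5). For simplicity the θ-stopping rule (3) is replaced by a fixed number K_l of local steps (Lemma 1 relates the two); this is a sketch under these assumptions, not the authors' published PyTorch implementation [34].

```python
import numpy as np

def grad_Fn(w, X, y):
    # gradient of the local squared-error loss F_n(w) = (1/(2 D_n)) ||X w - y||^2
    return X.T @ (X @ w - y) / len(y)

def local_update(w_prev, gradF_bar, X, y, eta, lr, K_l):
    # UE n approximately minimizes the surrogate (2),
    #   J_n(w) = F_n(w) + <eta * gradF_bar - grad F_n(w_prev), w>,
    # via K_l gradient steps (7); grad J_n(w) = grad F_n(w) + eta*gradF_bar - grad F_n(w_prev)
    shift = eta * gradF_bar - grad_Fn(w_prev, X, y)
    w = w_prev.copy()
    for _ in range(K_l):
        w = w - lr * (grad_Fn(w, X, y) + shift)
    return w

def global_round(w_prev, gradF_bar, datasets, p, eta, lr, K_l):
    # server aggregation (4)-(5): data-size-weighted averages of local models and gradients
    w_locals = [local_update(w_prev, gradF_bar, X, y, eta, lr, K_l) for X, y in datasets]
    w_new = sum(p_n * w_n for p_n, w_n in zip(p, w_locals))
    g_new = sum(p_n * grad_Fn(w_n, X, y) for p_n, w_n, (X, y) in zip(p, w_locals, datasets))
    return w_new, g_new

# toy run: three UEs with unbalanced synthetic linear-regression data
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
datasets = []
for D_n in (40, 100, 250):
    X = rng.normal(size=(D_n, 5))
    datasets.append((X, X @ w_true + 0.05 * rng.normal(size=D_n)))
sizes = np.array([len(y) for _, y in datasets], dtype=float)
p = sizes / sizes.sum()

w = np.zeros(5)
gradF_bar = sum(p_n * grad_Fn(w, X, y) for p_n, (X, y) in zip(p, datasets))
for t in range(30):                                   # K_g global rounds
    w, gradF_bar = global_round(w, gradF_bar, datasets, p, eta=0.2, lr=0.2, K_l=20)
print(np.linalg.norm(w - w_true))                     # should shrink toward the noise floor
```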

However, using mini-batch is a common practice in by fn. Then the CPU energy consumption of UE n for one machine learning to reduce the computation load at the local round of computation can be expressed as follows [30] UEs. On the other hand, choosing a subset of participating c D X n n αn 2 αn 2 UEs in each global iteration is a practical approach to En,cp = fn = cnDnfn, (13) i=1 2 2 reduce the straggler effect, in which the run-time of each iteration is limited by the “slowest” UEs (the straggler) where αn/2 is the effective capacitance coefficient of UE because heterogeneous UEs compute and communicate at n’s computing chipset. Furthermore, the computation time per local round of the UE n is cnDn , ∀n. We denote the vector different speeds. Finally, non-convex loss functions cap- fn n ture several essential machine learning tasks using neural of fn by f ∈ R . networks. In Section VII, we will experimentally show 2) Communication Model: In FEDL, regarding to the com- that FEDL works well with (i) mini-batch, (ii) subset of munication phase of UEs, we consider a time-sharing multi- UEs samplings, and (iii) non-convex loss functions. access protocol (similar to TDMA) for UEs. We note that this time-sharing model is not restrictive because other schemes, The time complexity of FEDL is represented by K g such as OFDMA, can also be applied to FEDL. The achievable communication rounds and computation complexity is K K g l transmission rate (nats/s) of UE n is defined as follows: computation rounds. When implementing FEDL over wireless ¯ networks, the wall clock time of each communication round hnpn  rn = B ln 1 + , (14) can be significantly larger than that of computation if the N0 number of UEs increases, due to multi-user contention for where B is the bandwidth, N0 is the background noise, pn is wireless medium. In the next section, we will study the UE ¯ the transmission power, and hn is the average channel gain resource allocation to enable FEDL over wireless networks. of the UE n during the training time of FEDL. Denote the fraction of communication time allocated to UE n by τn, and V. FEDL OVER WIRELESS NETWORKS the data size (in nats) of wn and ∇Fn(wn) by sn. Because the dimension of vectors wn and ∇Fn(wn) is fixed, we assume In this section, we first present the system model and that their sizes are constant throughout the FEDL learning. problem formulation of FEDL over a time-sharing wireless Then the transmission rate of each UE n is environment. We then decompose this problem into three sub-problems, derive their closed-form solutions, reveal the rn = sn/τn, (15) hindsights, and provide numerical support. which is shown to be the most energy-efficient transmission policy [31]. Thus, to transmit sn within a time duration τn, A. System Model the UE n’s energy consumption is At first, we consider synchronous communication which En,co = τn pn(sn/τn), (16) requires all UEs to finish solving their local problems before entering the communication phase. During the communication where the power function is phase, the model’s updates are transferred to the edge server N0  sn/τn  p (s /τ ) := e B − 1 (17) by using a wireless medium sharing scheme. In the com- n n n h¯ munication phase, each global round consists of computation n and communication time which includes uplink and downlink according to (14) and (15). We denote the vector of τn by n ones. In this work, however, we do not consider the downlink τ ∈ R . 
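The computation and communication cost models in (13)-(17) translate directly into code; the sketch below evaluates the per-round CPU energy, computation time, transmission power, and transmission energy of a single UE. All numerical values (channel gain, slot length, data size) are illustrative assumptions, chosen roughly in the range of the simulation setting used later.

```python
import numpy as np

def comp_energy(alpha_n, c_n, D_n_bits, f_n):
    # E_{n,cp} = (alpha_n / 2) * c_n * D_n * f_n^2   (13)
    return 0.5 * alpha_n * c_n * D_n_bits * f_n ** 2

def comp_time(c_n, D_n_bits, f_n):
    # one local round of computation takes c_n * D_n / f_n seconds
    return c_n * D_n_bits / f_n

def tx_power(s_n, tau_n, B, N0, h_n):
    # p_n(s_n / tau_n) = (N0 / h_n) * (exp(s_n / (tau_n * B)) - 1)   (17), from (14)-(15)
    return (N0 / h_n) * (np.exp(s_n / (tau_n * B)) - 1.0)

def comm_energy(s_n, tau_n, B, N0, h_n):
    # E_{n,co} = tau_n * p_n(s_n / tau_n)   (16)
    return tau_n * tx_power(s_n, tau_n, B, N0, h_n)

# illustrative values (assumed): 7.5 MB local data, 20 cycles/bit, 1.5 GHz CPU,
# 1 MHz bandwidth, noise 1e-10 W, average channel gain 1e-7, 25,000-nat update in a 4 ms slot
alpha_n, c_n, D_n_bits, f_n = 2e-28, 20, 7.5e6 * 8, 1.5e9
B, N0, h_n = 1e6, 1e-10, 1e-7
s_n, tau_n = 25_000, 0.004
print(comp_energy(alpha_n, c_n, D_n_bits, f_n), comp_time(c_n, D_n_bits, f_n))  # ~0.27 J, 0.8 s
print(tx_power(s_n, tau_n, B, N0, h_n), comm_energy(s_n, tau_n, B, N0, h_n))    # ~0.52 W, ~2 mJ
```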
communication time as it is negligible compared to the uplink Define the total energy consumption of all UEs for each one. The reason is that the downlink has larger bandwidth global round by Eg, which is expressed as follows: than the uplink and the edge server power is much higher XN Eg := En,co + Kl En,cp. than UE’s transmission power. Besides, the computation time n=1 only depends on the number of local rounds, and thus θ, according to (9). Denoting the time of one local round by B. Problem formulation T , i.e., the time to computing one local round (8), then the cp We consider an optimization problem, abusing the same computation time in one global round is K T . Denoting the l cp name FEDL, as follows communication time in one global round by Tco, the wall clock  time of one global round of FEDL is defined as minimize Kg Eg + κ Tg f,τ,θ,η,Tco,Tcp N Tg := Tco + Kl Tcp. X subject to τn ≤ Tco, (18) n=1 1) Computation Model: We denote the number of CPU cnDn max = Tcp, (19) cycles for UE n to execute one sample of data by cn, which n fn can be measured offline [29] and is known a priori. Since all min max fn ≤ fn ≤ fn , ∀n ∈ N , (20) samples {x , y } have the same size (i.e., number of bits), i i i∈Dn pmin ≤ p (s /τ ) ≤ pmax, ∀n ∈ N , (21) the number of CPU cycles required for UE n to run one local n n n n n 0 ≤ θ ≤ 1. (22) round is cnDn. Denote the CPU-cycle frequency of the UE n 6

Minimize both UEs’ energy consumption and the FL time Algorithm 2 Finding N1, N2, N3 in Lemma 2 are conflicting. For example, the UEs can save the energy by c1D1 c2D2 cN DN 1: Sort UEs such that f min ≤ f min ... ≤ f min setting the lowest frequency level all the time, but this will 1 2 N 2: Input: N = ∅, N = ∅, N = N , T in (28) certainly increase the training time. Therefore, to strike the 1 2 3 N3 3: for i = 1 to N do balance between energy cost and training time, the weight cnDn 4: if maxn∈N f max ≥ TN3 > 0 and N1 == ∅ then κ (Joules/second), used in the objective as an amount of n  cmDm cnDn 5: N1 = N1 ∪ m : max = maxn∈N max additional energy cost that FEDL is willing to bear for one fm fn 6: N = N \N and update T in (28) unit of training time to be reduced, captures the Pareto-optimal 3 3 1 N3 ciDi tradeoff between the UEs’ energy cost and the FL time. For 7: if min ≤ TN3 then fi example, when most of the UEs are plugged in, then UE 8: N2 = N2 ∪ {i} energy is not the main concern, thus κ can be large. According 9: N3 = N3 \{i} and update TN3 in (28) to optimization theory, 1/κ also plays the role of a Lagrange multiplier for a “hard constraint” on UE energy [28]. While constraint (18) captures the time-sharing uplink trans- Lemma 2. The optimal solution to SUB1 is as follows mission of UEs, constraint (19) defines that the computing time in one local round is determined by the “bottleneck”  max UE (e.g., with large data size and low CPU frequency). The  fn , ∀n ∈ N1, ∗  min feasible regions of CPU-frequency and transmit power of UEs fn = fn , ∀n ∈ N2, (26) are imposed by constraints (20) and (21), respectively. We  cnDn  T ∗ , ∀n ∈ N3, note that (20) and (21) also capture the heterogeneity of UEs cp T ∗ = maxT ,T ,T , (27) with different types of CPU and transmit chipsets. The last cp N1 N2 N3 constraint restricts the feasible range of the local accuracy. where N , N , N ⊆ N are three subsets of UEs produced by C. Solutions to FEDL 1 2 3 Algorithm 2 and We see that FEDL is non-convex due to the constraint (19) and several products of two functions in the objective function. cnDn TN = max , However, in this section we will characterize FEDL’s solution 1 n∈N f max by decomposing it into multiple simpler sub-problems. n cnDn We consider the first case when θ and η are fixed, then TN2 = max min , n∈N2 f FEDL can be decomposed into two sub-problems as follows: n P 3 1/3  αn(cnDn)  N n∈N3 X TN3 = . (28) SUB1: minimize En,cp + κTcp κ f,Tcp n=1 cnDn subject to ≤ Tcp, ∀n ∈ N , (23) From Lemma 2, first, we see that the optimal solution fn min max depends not only on the existence of these subsets, but also fn ≤ fn ≤ fn , ∀n ∈ N . on their virtual deadlines TN1 , TN2 , and TN3 , in which the XN longest of them will determine the optimal virtual deadline SUB2: min. En,co + κTco τ,T n=1 ∗ co Tcp. Second, from (26), the optimal frequency of each UE XN will depend on both T ∗ and the subset it belongs to. We s.t. τn ≤ Tco, (24) cp n=1 note that depending on κ, some of the three sets (not all) are min max p ≤ pn(sn/τn) ≤ p , ∀n. (25) n n possibly empty sets, and by default TNi = 0 if Ni is an empty set, i = 1, 2, 3. Next, by varying κ, we observe the following While SUB1 is a CPU-cycle control problem for the special cases. computation time and energy minimization, SUB2 can be considered as an uplink power control to determine the UEs’ Corollary 2. The optimal solution to SUB1 can be divided fraction of time sharing to minimize the UEs energy and into four regions as follows. 
communication time. We note that the constraint (19) of FEDL min 3 is replaced by an equivalent one (23) in SUB1. We can a) κ ≤ minn∈N αn(fn ) : ∗ consider Tcp and Tco as virtual deadlines for UEs to perform N1 and N3 are empty sets. Thus, N2 = N , Tco = TN2 = cnDn ∗ min maxn∈N min , and fn = fn , ∀n ∈ N . their computation and communication updates, respectively. fn 3 3 It can be observed that both SUB1 and SUB2 are convex min cnDn  b) minn∈N αn(fn ) < κ ≤ maxn∈N2 f min : problems. n N2 and N3 are non-empty sets, whereas N1 is 1) SUB1 Solution: We first propose Algorithm 2 in order ∗  ∗ empty. Thus, Tcp = max TN2 ,TN3 , and fn = to categorize UEs into one of three groups: N1 is a group of  cnDn min max T ∗ , fn , ∀n ∈ N . “bottleneck” UEs that always run its maximum frequency; N2 cp P 3 3 αn cnDn is the group of “strong” UEs which can finish their tasks before cnDn  n∈N3 c) maxn∈N2 min < κ ≤ 3 : fn cnDn  the computational virtual deadline even with the minimum maxn∈N fmax n ∗ frequency; and N3 is the group of UEs having the optimal N1 and N2 are empty sets. Thus N3 = N , Tcp = TN3 , ∗ cnDn frequency inside the interior of their feasible sets. and fn = , ∀n ∈ N . TN3 7

Fig. 1: Solution to SUB1 with five UEs: (a) optimal CPU frequency of each UE; (b) three subsets outputted by Alg. 2; (c) optimal computation time. For the wireless communication model, the UE channel gains follow the exponential distribution with mean g0 (d0/d)^4, where g0 = −40 dB and the reference distance d0 = 1 m. The distance between these devices and the wireless access point is uniformly distributed between 2 and 50 m. In addition, B = 1 MHz, σ = 10^−10 W, and the transmission power of devices is limited from 0.2 to 1 W. For the UE computation model, we set the training size Dn of each UE as uniformly distributed in 5−10 MB, cn is uniformly distributed in 10−30 cycles/bit, fn^max is uniformly distributed in 1.0−2.0 GHz, and fn^min = 0.3 GHz. Furthermore, α = 2 × 10^−28 and the UE update size is sn = 25,000 nats (≈4.5 KB).
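Lemma 2 and Algorithm 2 can be sanity-checked numerically by reducing SUB1 to a one-dimensional search over the virtual deadline T_cp: for a fixed T_cp, the energy-minimal feasible frequency of each UE is the projection of c_n D_n / T_cp onto [f_n^min, f_n^max]. The sketch below performs such a brute-force check with illustrative parameters mirroring the setting of Fig. 1; it is not the closed form of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
c = rng.uniform(10, 30, N)                 # cycles/bit (assumed, as in Fig. 1's setting)
D_bits = rng.uniform(5e6, 10e6, N) * 8     # 5-10 MB of local data, in bits
load = c * D_bits                          # c_n * D_n: cycles per local round
f_min, f_max = 0.3e9, rng.uniform(1.0e9, 2.0e9, N)
alpha, kappa = 2e-28, 0.1

def sub1_objective(T_cp):
    # for a fixed virtual deadline T_cp, the energy-minimal feasible frequency is
    # f_n = clip(c_n D_n / T_cp, f_n^min, f_n^max), which Lemma 2 / Alg. 2 exploit
    f = np.clip(load / T_cp, f_min, f_max)
    if np.any(load / f > T_cp + 1e-9):     # some UE cannot finish within T_cp
        return np.inf
    return np.sum(0.5 * alpha * load * f ** 2) + kappa * T_cp

# brute-force one-dimensional search over the virtual deadline (a numerical check, not Alg. 2)
grid = np.linspace((load / f_max).max(), (load / f_min).max(), 20000)
values = [sub1_objective(T) for T in grid]
T_star = grid[int(np.argmin(values))]
print(T_star, min(values))
```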

P 3 αn cnDn n∈N3 2) SUB2 Solution: Before characterizing the solution to d) κ > 3 : cnDn  maxn∈N fmax SUB2, from (17) and (25), we first define two bounded values n ∗ N1 is non-empty. Thus Tcp = TN1 , and for τn as follows s τ max = n , ( max n ¯ −1 min f , ∀n ∈ N1, B ln(hnN p + 1) f ∗ = n (29) 0 n n  cnDn min min sn max T , fn , ∀n ∈ N \ N1. τ = , N1 n ¯ −1 max B ln(hnN0 pn + 1) We illustrate Corollary 2 in Fig. 1 with four regions1 as which are the maximum and minimum possible fractions of follows. Tco that UE n can achieve by transmitting with its minimum and maximum power, respectively. We also define a new κ κ ≤ 0.004 a) Very low (i.e., ): Designed for solely energy function gn : R → R as minimization. In this region, all UE runs their CPU at the s /B lowest cycle frequency f min, thus T ∗ is determined by the n n cp gn(κ) = −1¯ , κN0 hn−1  last UEs that finish their computation with their minimum 1 + W e frequency. where W (·) is the Lambert W -function. We can consider g (·) b) Low κ (i.e., 0.004 ≤ κ ≤ 0.1): Designed for prioritized n as an indirect “power control” function that helps UE n control energy minimization. This region contains UEs of both N 2 the amount of time it should transmit an amount of data s and N . T ∗ is governed by which subset has higher virtual n 3 cp by adjusting the power based on the weight κ. This function computation deadline, which also determines the optimal is strictly decreasing (thus its inverse function g−1(·) exists) CPU-cycle frequency of N . Other UEs with light-loaded n 3 reflecting that when we put more priotity on minimizing the data, if exist, can run at the most energy-saving mode f min n communication time (i.e., high κ), UE n should raise the yet still finish their task before T ∗ (i.e., N ). cp 2 power to finish its transmission with less time (i.e., low τ ). c) Medium κ (i.e., 0.1 ≤ κ ≤ 1): Designed for balancing n computation time and energy minimization. All UEs belong Lemma 3. The solution to SUB2 is as follows to N with their optimal CPU-cycle frequency strictly −1 max 3 a) If κ ≤ gn (τn ), then inside the feasible set. ∗ max d) High κ (i.e., κ ≥ 1): Designed for prioritized computation τn = τn time minimization. High value κ can ensure the existence −1 max −1 min b) If gn (τn ) < κ < gn (τn ), then of N1, consisting the most “bottleneck” UEs (i.e., heavy- max min ∗ max loaded data and/or low fn ) that runs their maximum τn < τn = gn(κ) < τn CPU-cycle in (29) (top) and thus determines the optimal −1 min ∗ c) If κ ≥ gn (τn ), then computation time Tcp. The other “non-bottleneck” UEs ∗ min either (i) adjust a “right” CPU-cycle to save the energy τn = τn , ∗ yet still maintain their computing time the same as Tcp and T ∗ = PN τ ∗. (i.e., N3), or (ii) can finish the computation with minimum co n=1 n frequency before the “bottleneck” UEs (i.e., N2) as in (29) This lemma can be explained through the lens of network (bottom). economics. If we interpret the FEDL system as the buyer and UEs as sellers with the UE powers as commodities, then the 1 −1 All closed-form solutions are also verified by the solver IPOPT [32]. inverse function gn (·) is interpreted as the price of energy 8

3) SUB3 Solution: We observe that the solutions to SUB1 and SUB2 have no dependence on θ, so that the optimal T∗co, T∗cp, f∗, τ∗, and thus the corresponding optimal energy values,
n ∗ ∗ p (Watt) denoted by En,cp and En,cp, can be determined based on κ 0.4 0.2 according to Lemmas 2 and 3. However, these solutions will 0.2 0.0 affect to the third sub-problem of FEDL, as will be shown in 10 3 10 2 10 1 100 101 10 3 10 2 10 1 100 101 what follows. (a) UEs’ optimal transmission (b) UEs’ optimal transmission power. time . SUB3: N 1 X ∗ ∗ ∗ ∗  Fig. 2: The solution to SUB2 with five UEs. The numerical minimize En,co + Kl En,cp + κ Tco + Kl Tcp θ,η>0 Θ n=1 setting is the same as that of Fig. 1. subject to 0 < θ < 1, 0 < Θ < 1. ρ κ θ∗ Θ η∗ 0.1 .033/.015/.002 .094/.042/.003 .253/.177/.036 SUB3 is unfortunately non-convex. However, since there 1.4/2/5 1 .035/.016/.002 .092/.041/.003 .253/.177/.036 are only two variables to optimize, we can employ numerical 10 .035/.016/.002 .092/.041/.003 .253/.177/.036 methods to find the optimal solution. The numerical results in ∗ ∗ TABLE I: The solution to SUB3 with five UEs. The numerical Table I show that the solution θ and η to SUB3 decreases setting is the same as that of Fig. 1. when ρ increases, which makes Θ decreases, as explained by the results of Theorem 1. Also we observe that κ as more effect to the solution to SUB3 when ρ is small. that UE n is willing to accept to provide power service for 4) FEDL Solution: Since we can obtain the stationary FEDL to reduce the training time. There are two properties of points of SUB3 using Successive Convex Approximation this function: (i) the price increases with respect to UE power, techniques such as NOVA [33], then we have: and (ii) the price sensitivity depends on UEs characteristics, Theorem 2. The combined solutions to three sub-problems e.g., UEs with better channel quality can have lower price, SUB1, SUB2, and SUB3 are stationary points of FEDL. whereas UEs with larger data size sn will have higher price. −1 Thus, each UE n will compare its energy price gn (·) with The proof of this theorem is straightforward. The idea is the “offer” price κ by the system to decide how much power it to use the KKT condition to find the stationary points of is willing to “sell”. Then, there are three cases corresponding FEDL. Then we can decompose the KKT condition into three to the solutions to SUB2. independent groups of equations (i.e., no coupling variables a) Low offer: If the offer price κ is lower than the minimum between them), in which the first two groups matches exactly −1 max price request gn (τn ), UE n will sell its lowest service to the KKT conditions of SUB1 and SUB2 that can be solved min by transmitting with the minimum power pn . by closed-form solutions as in Lemmas 2, 3, and the last group b) Medium offer: If the offer price κ is within the range of an for SUB3 is solved by numerical methods. acceptable price range, UE n will find a power level such We then have some discussions on the combined solution that the corresponding energy price will match the offer to FEDL. First, we see that SUB1 and SUB2 solutions price. can be characterized independently, which can be explained c) High offer: If the offer price κ is higher than the maximum that each UE often has two separate processors: one CPU −1 min price request gn (τn ), UE n will sell its highest service for mobile applications and another baseband processor for max by transmitting with the maximum power pn . radio control function. Second, neither SUB1 nor SUB2 Lemma 3 is further illustrated in Fig. 2, showing how the depends on θ because the communication phase in SUB2 solution to SUB2 varies with respect to κ. 
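Lemma 3 can be evaluated directly with the Lambert W-function: since g_n(·) is strictly decreasing in κ, the three cases (low, medium, and high offer) amount to projecting g_n(κ) onto [τ_n^min, τ_n^max]. The sketch below uses illustrative channel and data-size values (not the exact setting of Fig. 2) and prints the resulting τ_n^* for three values of κ that land in the three regimes for these parameters.

```python
import numpy as np
from scipy.special import lambertw

# illustrative parameters (assumed, not the paper's exact values)
B, N0 = 1e6, 1e-10             # bandwidth (Hz), noise power (W)
s_n, h_n = 25_000, 1e-7        # update size (nats), average channel gain
p_min, p_max = 0.2, 1.0        # transmit power limits (W)

def g_n(kappa):
    # "power control" function of Lemma 3: g_n(kappa) = (s_n/B) / (1 + W((kappa*h_n/N0 - 1)/e))
    w = lambertw((kappa * h_n / N0 - 1.0) / np.e).real
    return (s_n / B) / (1.0 + w)

def tau_bounds():
    # tau_n^max / tau_n^min: time needed when transmitting at minimum / maximum power
    tau_max = s_n / (B * np.log(h_n * p_min / N0 + 1.0))
    tau_min = s_n / (B * np.log(h_n * p_max / N0 + 1.0))
    return tau_min, tau_max

def tau_star(kappa):
    # Lemma 3 in one line: project g_n(kappa) onto [tau_n^min, tau_n^max]
    tau_min, tau_max = tau_bounds()
    return float(np.clip(g_n(kappa), tau_min, tau_max))

for kappa in (0.01, 2.0, 50.0):     # low, medium, and high offers for these parameters
    print(kappa, tau_star(kappa))
```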
It is observed from is clearly not affected by the local accuracy, whereas SUB2 this figure that due to the UE heterogeneity of channel gain, considers the computation cost in one local round. However, κ = 0.1 is a medium offer to UEs 2, 3, and 4, but a high offer the solutions to SUB1 and SUB2, which can reveal how much to UE 1, and low offer to UE 5. communication cost is more expensive than computation cost, While SUB1 and SUB2 solutions share the same threshold- are decisive factors to determine the optimal level of local based dependence, we observe their differences as follows. In accuracy. Therefore, we can sequentially solve SUB1 and SUB1 solution, the optimal CPU-cycle frequency of UE n SUB2 first, then SUB3 to achieve the solutions to FEDL. ∗ We also summarize the complexities in the following table: depends on the optimal Tcp, which in turn depends on the loads (i.e., cnDn , ∀n ) of all UEs. Thus all UE load information fn SUB1 SUB2 SUB3 is required for the computation phase. On the other hand, O(N 2) O(1) O(N) in SUB2 solution, each UE n can independently choose its −1 TABLE II: Summary of the sub-problem complexity. optimal power by comparing its price function gn (·) with κ so that collecting UE information is not needed. The reason is that the synchronization of computation time in constraint (23) of SUB1 requires all UE loads, whereas the UEs’ time-sharing VI.EXPERIMENTS constraint (24) of SUB2 can be decoupled by comparing with This section will validate the FEDL’s learning performance the fixed “offer” price κ. in a heterogeneous network. The experimental results show 9

(Plot legend residue: panels "MNIST" and "Linear regression on synthetic dataset"; ρ = 1.4, 2, 5; FEDL and FedAvg with B = 40 and B = ∞; FEDL with η = 0.01, 0.03, 0.05, 0.07; optimal solution.)

Training Loss 0.86 Test Accuracy FEDL : B = 40 0.2 0.20.4 0.2 0.2 0.84 FedAvg : B = 40 FEDL : B = Training Loss 0.82 0.1 FedAvg : B = 0.2 0.1 0.1 0.1 0.80 0 200 400 600 800 0 200 400 600 800 Global rounds Global rounds 0.0 0.00 100 0.2 200 0 0.4 100 0.6200 0 0.8 100 2001.0 Global rounds Kg Fig. 6: FEDL’s performance on non-convex (MNIST dataset). Fig. 3: Effect of η on the convergence of FEDL. Training processes use full-batch gradient descent, full devices partici- FedAvg [4] in terms of training loss convergence rate and test pation (N = S = 100 UEs), K = 200, and K = 20. g l accuracy in various settings. All codes and data are published on GitHub [34]. MNIST 0.921.0 0.92 0.92 Experimental settings: In our setting, the performance of FEDL is examined by 0.900.8 0.90 0.90 both classification and regression tasks. The regression task 0.880.6 0.88 0.88 uses linear regression model with mean square error loss 0.860.4 0.86 0.86 function on a synthetic dataset while the classification task FEDL : B = , = 0.2 Testing Accuracy 0.840.2 0.84 0.84 FedAvg : B = , = 0 uses multinomial logistic regression model with cross-entropy FEDL : B = 20, = 0.2 FEDL : B = 40, = 0.2 FEDL : B = , = 2.0 FedAvg : B = 20, = 0 FedAvg : B = 40, = 0 FEDL : B = , = 4.0 0.820.0 0.82 0.82 error loss function on real federated datasets (MNIST [35], 0.00 250 5000.2 750 0 0.4 250 500 0.6750 0 2500.8 500 7501.0 Global rounds Kg FEMNIST [36]). The loss function for each UE is given below: MNIST 1.0 0.45 FEDL : B = 20, = 0.2 0.45 FEDL : B = 40, = 0.2 0.45 FEDL : B = , = 0.2 FedAvg : B = 20, = 0 FedAvg : B = 40, = 0 FedAvg : B = , = 0 1) Mean square error loss for linear regression (synthetic 0.8 FEDL : B = , = 2.0 0.40 0.40 0.40 FEDL : B = , = 4.0 dataset):

0.6 0.35 0.35 0.35 1 X 2 Fn(w) = (hxi, wi − yi) . 0.4 Dn (x ,y )∈Dn Training Loss 0.30 0.30 0.30 i i 0.2 0.25 0.25 0.25 2) Cross-entropy loss for multinomial logistic regression 0.0 0.00 250 5000.2 750 0 0.4 250 500 0.6750 0 2500.8 500 7501.0 with C class labels (MNIST and FEMNIST datasets): Global rounds Kg  Dn C  −1 X X exp(hxi, wci) K = Fn(w) = 1 log Fig. 4: Effect of different batch-size with fixed value of l {yi=c} PC Dn exp(hxi, wj i) 20, Kg = 800, N = 100, S = 10,(B = ∞ means full batch i=1 c=1 j=1 C size) on FEDL’s performance. β X 2 + kw k . 2 c c=1 FEMNIST 1.0 We consider N = 100 UEs. To verify that FEDL also works 0.8 0.8 0.8 0.8 with UE subset sampling, we allow FEDL to randomly sample 0.7 0.7 0.7 a number of subset of UEs, denoted by S, following a uniform 0.6 0.6 0.6 0.6 distribution as in FedAvg in each global iteration. In order to 0.4 0.5 0.5 0.5 generate datasets capturing the heterogeneous nature of FL,

Testing Accuracy FEDL : B = 20, = 0.2, K = 10 FEDL : B = 20, = 0.2, K = 20 FEDL : B = 20, = 0.2, K = 40 0.40.2 l 0.4 l 0.4 l all datasets have different sample sizes based on the power FedAvg : B = 20, = 0, Kl = 10 FedAvg : B = 20, = 0, Kl = 20 FedAvg : B = 20, = 0, Kl = 40

FEDL : B = , = 0.5, Kl = 10 FEDL : B = , = 0.5, Kl = 20 FEDL : B = , = 0.5, Kl = 40 0.00.3 0.3 0.3 law in [9]. In MNIST, each user contains three of the total of 0.00 250 5000.2 750 0 0.4 250 500 0.6750 0 2500.8 500 7501.0 Global rounds Kg ten labels. FEMNIST is built similar to [9] by partitioning the FEMNIST data in Extended MNIST [37]. For synthetic data, to allow for 1.03.0 3.0 3.0 FEDL : B = 20, = 0.2, Kl = 10 FEDL : B = 20, = 0.2, Kl = 20 FEDL : B = 20, = 0.2, Kl = 40 FedAvg : B = 20, = 0, K = 10 FedAvg : B = 20, = 0, K = 20 FedAvg : B = 20, = 0, K = 40 d = 40 l l l the non-iid setting, each user’s data has the dimension . 0.82.5 FEDL : B = , = 0.5, Kl = 10 2.5 FEDL : B = , = 0.5, Kl = 20 2.5 FEDL : B = , = 0.5, Kl = 40 We control the value of ρ by using the data generation method 0.62.0 2.0 2.0 similar to that in [38], in which user i has data drawn from

0.41.5 1.5 1.5 N (0, σiΣ) where σi ∼ U(1, 10). Here Σ is a diagonal covari- Training Loss ance matrix with Σ = i−p, i ∈ [1, d] and p = log(ρ) , where 0.21.0 1.0 1.0 ii log(d) ρ is considered as multiplicative inverse for the minimum 0.00.5 0.5 0.5 0.00 250 5000.2 750 0 0.4 250 500 0.6750 0 2500.8 500 7501.0 covariance value of Σ. The number of data samples of each Global rounds K g UE is in the ranges [55, 3012], [504, 1056], and [500, 5326] for Fig. 5: Effect of increasing local computation time on the MNIST, FEMNIST, and Synthetic, respectively. All datasets convergence of FEDL (N = 100, S = 10, Kg = 800). are split randomly with 75% for training and 25% for testing. Each experiment is run at least 10 times and the average results are reported. We summarized all parameters used for that FEDL gains performance improvement from the vanilla the experiments in Table. III. 10

TABLE III: Experiment parameters.
ρ: condition number of Fn(·)'s Hessian matrix, ρ = L/β
Kg: global rounds for global model update
Kl: local rounds for local model update
η: hyper-learning rate
hk: local learning rate
B: batch size (B = ∞ means full batch size for GD)
(Fig. 7 legend residue: Lcmp ∈ {0.15, 0.75, 150.0}; Lcom ∈ {0.31, 32.85, 186.8}; vertical axes show FEDL's objective.)
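For reference, the strongly convex classification objective used in the MNIST and FEMNIST experiments (cross-entropy of multinomial logistic regression with an l2 term weighted by β) can be evaluated as below; this NumPy sketch is illustrative only and independent of the published training code [34].

```python
import numpy as np

def local_loss(W, X, y, beta):
    # F_n(W) = -(1/D_n) sum_i log softmax_{y_i}(x_i W) + (beta/2) sum_c ||w_c||^2
    logits = X @ W                                        # shape (D_n, C)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_prob[np.arange(len(y)), y].mean()
    return nll + 0.5 * beta * np.sum(W ** 2)

# toy usage: 64 samples, 10 features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = rng.integers(0, 3, size=64)
W = np.zeros((10, 3))
print(local_loss(W, X, y, beta=1e-2))   # equals log(3) at W = 0
```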

Effect of the hyper-learning rate on FEDL’s conver- (a) Impact of Lcp (b) Impact of Lco gence: We first verify the theoretical finding by predetermining Fig. 7: Impact of UE heterogeneity on FEDL. the value of ρ and observing the impact of changing η on the convergence of FEDL using a synthetic dataset. In Fig. Lcmp=0.15 350 Lcom=0.31 ρ 400 3, we examine four different values of . As can be seen in Lcmp=0.75 Lcom=32.85 =0.1 300 L =186.8 the figure, with all value of ρ, there were exist small enough Lcmp=150.0 =0.1 com 300 values of η that allow FEDL to converge. We also observe 250 200 200 =1.0 =1.0 that using the larger value of η makes FEDL converge faster. Time Cost Time Cost 150

In addition, even if FEDL allows UEs to solve their local 100 =10.0 100 =10.0 problems approximately, the experiment shows that the gap 50 0 100 200 300 400 0 100 200 300 400 between the optimal solution and our approach in Fig. 3 is Energy Cost Energy Cost negligible. It is noted that the optimal solution is obtained by (a) Impact of Lcp and κ. (b) Impact of Lco and κ. solving directly the global loss function (1) as we consider the local loss function at each UE is mean square error loss. Fig. 8: Pareto-optimal points of FEDL. Effect of different gradient descent algorithms on FEDL’s performance: As UEs are allowed to use different gradient descent methods to minimize the local problem (2), positive impact on the convergence time FEDL. However, the the convergence of FEDL can be evaluated on different larger Kl requires higher local computation at UEs, which optimization algorithms: GD and mini-batch SGD by changing costs the EU’s energy consumption. the configuration of the batch size during the local training FEDL’s performance on non-convex problem: Even process. Although our analysis results are based on GD, though the convergence of FEDL is only applicable to strongly we also monitor the behavior of FEDL using SGD in the convex and smooth loss functions in theory, we will show that experiments for the comparison. While a full batch size is FEDL also empirically works well in non-convex case. We applied for GD, mini-batch SGD is trained with a batch size consider a simple non-convex model that has a two-layer deep neural network (DNN) with one hidden layer of size 100 using of 20 and 40 samples. We conducted a grid search on hk to find the value allowing FEDL and FedAvg to obtain the ReLU activation, an output layer using a softmax function, best performance in terms of accuracy and stability. Fig. 4 and the negative log-likelihood loss function. In Fig. 6, we demonstrates that FEDL outperforms FedAvg on all batch size see that by using a more complexed model, both FEDL and settings (the improvement in terms of testing accuracy and FedAvg achieve higher accuracy than the strongly convex case. training loss are approximately 1.3% and 9.1% respectively for Although FEDL still outperforms FedAvg in case of DNN, the batch size 20, 0.7% and -0.2% for the batch size 40, and the performance improvement is negligible compared to the 0.8% and 14% for the full batch size). Besides, FEDL is more strongly convex case, and FEDL become less stable when stable than FedAvg when the small number of devices is sub- using small mini-batch size. sampling randomly during the training process. Even though using larger batch size benefits the stability of both FedAvg VII.NUMERICAL RESULTS and FEDL, very large batch size can make the convergence In this section, both the communication and computation of FEDL slow. However, increasing the value of η allows models follow the same setting as in Fig. 1, except the speeding up the convergence of FEDL in case of GD. number of UEs is increased to 50, and all UEs have the same max Effect of increasing local computation on convergence fn = 2.0 GHz and cn = 20 cycles/bit. Furthermore, we time: In order to validate the performance of FEDL on a define two new parameters, addressing the UE heterogeneity different value of local updates Kl, in the Fig. 5, we use regarding to computation and communication phases in FEDL, cnDn maxn∈N max min both mini-batch SGD algorithm with the fixed batch size of fn maxn∈N τn respectively, as Lcp = cnDn and Lco = min τ max . 
20 and GD for the local update and increase K from 10 minn∈N min n∈N n l fn to 40. For all Kl, even when the learning rate of FedAvg We see that higher values of Lcp and Lco indicate higher levels is tuned carefully, FEDL in all batch size settings achieve a of UE heterogeneity. The level of heterogeneity is controlled significant performance gap over FedAvg in terms of training by two different settings. To vary Lcp, the training size Dn Dmin  loss and testing accuracy. Also in this experiment, FEDL is generated with the fraction Dmax ∈ 1, 0.2, 0.001 but using minibatch outperforms FEDL using GD for FEMNIST the average UE data size is kept at the same value 7.5 MB dataset. While larger Kl does not show the improvement on for varying values of Lcp. On the other hand, to vary Lco, the convergence of FedAvg, the rise of Kl has an appreciably the distance between these devices and the edge server is 11

generated such that $d_{\min}/d_{\max} \in \{1., 0.2, 0.001\}$, but the average distance of all UEs is maintained at 26 m.

We first examine the impact of UE heterogeneity. We observe that a high level of UE heterogeneity has a negative impact on the FEDL system, as illustrated in Figs. 7a and 7b: the total cost increases with higher values of $L_{cp}$ and $L_{co}$, respectively. We next illustrate the Pareto curve in Fig. 8. This curve shows the trade-off between the conflicting goals of minimizing the time cost $K(\theta) T_g$ and the energy cost $K(\theta) E_g$: one type of cost can only be decreased at the expense of increasing the other. This figure also shows that the Pareto curve of FEDL is more efficient when the system has a low level of UE heterogeneity (i.e., small $L_{cp}$ and/or $L_{co}$).

VIII. CONCLUSIONS

In this paper, we studied FL, a learning scheme in which the training model is distributed to participating UEs performing training over their local data. Although FL shows vital advantages in data privacy, the heterogeneity across users' data and UEs' characteristics remains a challenging problem. We proposed an effective algorithm, without the i.i.d. assumption on UEs' data, for strongly convex and smooth FL problems, and then characterized the algorithm's convergence. For the wireless resource allocation problem, we embedded the proposed FL algorithm in wireless networks, considering the trade-offs not only between computation and communication latencies but also between FL time and UE energy consumption. Despite the non-convex nature of this problem, we decomposed it into three sub-problems with convex structure before analyzing their closed-form solutions and quantitative insights into problem design. We then verified the theoretical findings of the new algorithm by PyTorch experiments with several datasets, and the wireless resource allocation sub-problems by extensive numerical results. In addition to validating the theoretical convergence, our experiments also showed that the proposed algorithm can boost the convergence speed compared to an existing baseline approach.

ACKNOWLEDGMENT

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.02-2019.321. Dr. Nguyen H. Tran and Dr. C. S. Hong are the corresponding authors.

REFERENCES

[1] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, "Federated Learning over Wireless Networks: Optimization Model Design and Analysis," in IEEE INFOCOM 2019, Paris, France, Apr. 2019.
[2] D. Reinsel, J. Gantz, and J. Rydning, "The Digitization of the World from Edge to Core," p. 28, 2018.
[3] "Free Community-based GPS, Maps & Traffic Navigation App - Waze," https://www.waze.com/.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Artificial Intelligence and Statistics, Apr. 2017, pp. 1273–1282.
[5] https://www.techradar.com/news/what-does-ai-in-a-phone-really-mean.
[6] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated Optimization: Distributed Machine Learning for On-Device Intelligence," arXiv:1610.02527 [cs], Oct. 2016.
[7] S. Wang et al., "Adaptive Federated Learning in Resource Constrained Edge Computing Systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[8] V. Smith et al., "CoCoA: A General Framework for Communication-Efficient Distributed Optimization," Journal of Machine Learning Research, vol. 18, no. 230, pp. 1–49, 2018.
[9] T. Li et al., "Federated Optimization for Heterogeneous Networks," in Proceedings of the 1st Adaptive & Multitask Learning ICML Workshop, Long Beach, CA, 2019, p. 16.
[10] C. Ma et al., "Distributed optimization with arbitrary local solvers," Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, Jul. 2017.
[11] O. Shamir, N. Srebro, and T. Zhang, "Communication-efficient Distributed Optimization Using an Approximate Newton-type Method," in ICML, Beijing, China, 2014, pp. II-1000–II-1008.
[12] J. Wang and G. Joshi, "Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms," arXiv:1808.07576 [cs, stat], Jan. 2019.
[13] S. U. Stich, "Local SGD Converges Fast and Communicates Little," in ICLR 2019 International Conference on Learning Representations, 2019.
[14] F. Zhou and G. Cong, "On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization," in Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, Jul. 2018, pp. 3219–3227.
[15] J. Konečný et al., "Federated Learning: Strategies for Improving Communication Efficiency," http://arxiv.org/abs/1610.05492, Oct. 2016.
[16] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated Multi-task Learning," in NeurIPS'17, CA, U.S., 2017, pp. 4427–4437.
[17] C. T. Dinh et al., "Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms," in 49th International Conference on Parallel Processing - ICPP, Edmonton, Canada, Aug. 2020, pp. 1–11.
[18] M. M. Amiri and D. Gündüz, "Over-the-Air Machine Learning at the Wireless Edge," in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Jul. 2019, pp. 1–5.
[19] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, "Communication Compression for Decentralized Training," in NeurIPS, 2018, pp. 7663–7673.
[20] T. T. Vu et al., "Cell-Free Massive MIMO for Wireless Federated Learning," IEEE Transactions on Wireless Communications, vol. 19, no. 10, pp. 6377–6392, Oct. 2020.
[21] L. U. Khan et al., "Federated Learning for Edge Networks: Resource Optimization and Incentive Mechanism," IEEE Communications Magazine, Sep. 2020.
[22] S. R. Pandey et al., "A Crowdsourcing Framework for On-Device Federated Learning," IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 3241–3256, May 2020.
[23] M. Chen et al., "A Joint Learning and Communications Framework for Federated Learning over Wireless Networks," arXiv:1909.07972 [cs, stat], Jun. 2020.
[24] H. Wang et al., "ATOMO: Communication-efficient Learning via Atomic Sparsification," in NeurIPS, 2018, pp. 9850–9861.
[25] H. Zhang et al., "ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning," in International Conference on Machine Learning, Jul. 2017, pp. 4035–4043.
[26] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczos, and A. Smola, "AIDE: Fast and Communication Efficient Distributed Optimization," arXiv:1608.06879 [cs, math, stat], Aug. 2016.
[27] Y. Nesterov, Lectures on Convex Optimization. Springer International Publishing, 2018, vol. 137.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, Mar. 2004.
[29] A. P. Miettinen and J. K. Nurminen, "Energy Efficiency of Mobile Clients in Cloud Computing," in USENIX HotCloud'10, Berkeley, CA, USA, 2010, pp. 4-4.
[30] T. D. Burd and R. W. Brodersen, "Processor Design for Portable Systems," Journal of VLSI Signal Processing Systems, vol. 13, no. 2-3, pp. 203–221, Aug. 1996.
[31] B. Prabhakar, E. U. Biyikoglu, and A. E. Gamal, "Energy-efficient transmission over a wireless link via lazy packet scheduling," in IEEE INFOCOM 2001, vol. 1, 2001, pp. 386–394.
[32] A. Wächter and L. T. Biegler, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Mathematical Programming, vol. 106, no. 1, pp. 25–57, Mar. 2006.
[33] G. Scutari, F. Facchinei, and L. Lampariello, "Parallel and Distributed Methods for Constrained Nonconvex Optimization," IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1929–1944, Apr. 2017.
[34] C. T. Dinh, https://github.com/CharlieDinh/FEDL_Pytorch.
[35] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[36] S. Caldas et al., "LEAF: A Benchmark for Federated Settings," arXiv:1812.01097 [cs, stat], Dec. 2018.
[37] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: Extending MNIST to handwritten letters," in 2017 International Joint Conference on Neural Networks (IJCNN), May 2017, pp. 2921–2926.
[38] X. Li, W. Yang, S. Wang, and Z. Zhang, "Communication Efficient Decentralized Training with Multiple Local Updates," arXiv:1910.09126 [cs, math, stat], Oct. 2019.

APPENDIX

A. Review of useful existing results

With Assumption 1 on the $L$-smoothness and $\beta$-strong convexity of $F_n(\cdot)$, according to [27, Theorems 2.1.5, 2.1.10, and 2.1.12], we have the following useful inequalities:
$2L \big( F_n(w) - F_n(w^*) \big) \ge \|\nabla F_n(w)\|^2, \ \forall w$   (30)
$\langle \nabla F_n(w) - \nabla F_n(w'), w - w' \rangle \ge \frac{1}{L} \|\nabla F_n(w) - \nabla F_n(w')\|^2$   (31)
$2\beta \big( F_n(w) - F_n(w^*) \big) \le \|\nabla F_n(w)\|^2, \ \forall w$   (32)
$\|w - w^*\| \le \frac{1}{\beta} \|\nabla F_n(w)\|, \ \forall w,$   (33)
where $w^*$ is the solution to the problem $\min_{w \in \mathbb{R}^d} F_n(w)$.

B. Proof of Lemma 1

Due to the $L$-smoothness and $\beta$-strong convexity of $J_n^t$, from (30) and (32), respectively, we have
$J_n^t(z_k) - J_n^t(z^*) \ge \frac{\|\nabla J_n^t(z_k)\|^2}{2L},$
$J_n^t(z_0) - J_n^t(z^*) \le \frac{\|\nabla J_n^t(z_0)\|^2}{2\beta}.$
Combining these inequalities with (8), and setting $z_0 = w^{t-1}$ and $z_k = w_n^t$, we have
$\|\nabla J_n^t(w_n^t)\|^2 \le c \frac{L}{\beta} (1 - \gamma)^k \|\nabla J_n^t(w^{t-1})\|^2.$
Since $(1 - \gamma)^k \le e^{-k\gamma}$, the $\theta$-approximation condition (3) is satisfied when
$c \frac{L}{\beta} e^{-k\gamma} \le \theta^2.$
Taking the log of both sides of the above, we complete the proof.
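The stopping condition above can be read directly as a minimum number of local solver iterations, $k \ge \frac{1}{\gamma} \ln \frac{cL}{\theta^2 \beta}$. A minimal numeric sketch of this calculation (the helper below is ours, not part of the paper's code release; $c$ and $\gamma$ are the linear-convergence constants of the local solver in (8), and the numeric values are placeholders):

    import math

    def min_local_iterations(c, gamma, L, beta, theta):
        # Smallest integer k with c*(L/beta)*exp(-k*gamma) <= theta**2 (Lemma 1).
        return max(0, math.ceil(math.log(c * L / (theta ** 2 * beta)) / gamma))

    # Placeholder values only: c and gamma come from the local solver's guarantee (8).
    print(min_local_iterations(c=1.0, gamma=0.1, L=10.0, beta=1.0, theta=0.5))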
C. Proof of Theorem 1

We recall the definition of $J_n^t(w)$ as follows:
$J_n^t(w) = F_n(w) + \langle \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1}), w \rangle.$   (34)
Denoting by $\hat{w}_n^t$ the solution to $\min_{w \in \mathbb{R}^d} J_n^t(w)$, we have
$\nabla J_n^t(w^{t-1}) = \eta \nabla \bar{F}^{t-1},$   (35)
$\nabla J_n^t(\hat{w}_n^t) = 0 = \nabla F_n(\hat{w}_n^t) + \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1}).$   (36)
Since $F(\cdot)$ is also $L$-Lipschitz smooth (i.e., $\|\nabla F(w) - \nabla F(w')\| \le \sum_{n=1}^N \frac{D_n}{D} \|\nabla F_n(w) - \nabla F_n(w')\| \le L \|w - w'\|$, $\forall w, w'$, by using Jensen's inequality and $L$-smoothness, respectively), we have
$F(w_n^t) - F(w^{t-1})$
$\le \langle \nabla F(w^{t-1}), w_n^t - w^{t-1} \rangle + \frac{L}{2} \|w_n^t - w^{t-1}\|^2$
$= \langle \nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \langle \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle$   (37)
$\le \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \langle \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle$   (38)
$\overset{(36)}{=} \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 - \frac{1}{\eta} \langle \nabla F_n(\hat{w}_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle$
$= \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 - \frac{1}{\eta} \langle \nabla F_n(\hat{w}_n^t) - \nabla F_n(w_n^t), w_n^t - w^{t-1} \rangle - \frac{1}{\eta} \langle \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle$   (39)
$\le \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \frac{L}{\eta} \|\hat{w}_n^t - w_n^t\| \|w_n^t - w^{t-1}\| - \frac{1}{\eta} \langle \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle$   (40)
$\le \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \frac{L}{\eta} \|\hat{w}_n^t - w_n^t\| \|w_n^t - w^{t-1}\| - \frac{1}{\eta L} \|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\|^2,$   (41)
where (37) is by adding and subtracting $\nabla \bar{F}^{t-1}$, (38) is by the Cauchy-Schwarz inequality, (39) is by adding and subtracting $\nabla F_n(w_n^t)$, (40) is by the Cauchy-Schwarz inequality and the $L$-Lipschitz smoothness of $F_n(\cdot)$, and (41) follows from (31). The next step is to bound the norm terms on the R.H.S. of (41) as follows:
• First, we have
$\|\hat{w}_n^t - w^{t-1}\| \overset{(33)}{\le} \frac{1}{\beta} \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} \frac{\eta}{\beta} \|\nabla \bar{F}^{t-1}\|.$   (42)
• Next,
$\|\hat{w}_n^t - w_n^t\| \overset{(33)}{\le} \frac{1}{\beta} \|\nabla J_n^t(w_n^t)\| \overset{(3)}{\le} \frac{\theta}{\beta} \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} \frac{\theta \eta}{\beta} \|\nabla \bar{F}^{t-1}\|.$   (43)
• Using the triangle inequality, (42), and (43), we have
$\|w_n^t - w^{t-1}\| \le \|w_n^t - \hat{w}_n^t\| + \|\hat{w}_n^t - w^{t-1}\| \le (1 + \theta) \frac{\eta}{\beta} \|\nabla \bar{F}^{t-1}\|.$   (44)
• We also have
$\|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\| \overset{(34)}{=} \|\nabla J_n^t(w_n^t) - \nabla J_n^t(w^{t-1})\| \ge \|\nabla J_n^t(w^{t-1})\| - \|\nabla J_n^t(w_n^t)\| \overset{(3)}{\ge} (1 - \theta) \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} (1 - \theta) \eta \|\nabla \bar{F}^{t-1}\|.$   (45)

• By the definitions of $\nabla F(\cdot)$ and $\nabla \bar{F}^{t-1}$, we have
$\|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| = \big\| \sum_{n=1}^N p_n \big( \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}) \big) \big\|$
$\le \sum_{n=1}^N p_n \|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\|$   (46)
$\le \sum_{n=1}^N p_n L \|w_n^t - w^{t-1}\|$   (47)
$\le (1 + \theta) \eta \frac{L}{\beta} \|\nabla \bar{F}^{t-1}\|,$   (48)
where (46), (47), and (48) are obtained using Jensen's inequality, $L$-Lipschitz smoothness, and (44), respectively.
• Finally, we have
$\|\nabla F(w^{t-1})\|^2 \le 2 \|\nabla \bar{F}^{t-1} - \nabla F(w^{t-1})\|^2 + 2 \|\nabla \bar{F}^{t-1}\|^2$   (49)
$\overset{(48)}{\le} 2 (1 + \theta)^2 \eta^2 \rho^2 \|\nabla \bar{F}^{t-1}\|^2 + 2 \|\nabla \bar{F}^{t-1}\|^2,$
which implies
$\|\nabla \bar{F}^{t-1}\|^2 \ge \frac{1}{2 (1 + \theta)^2 \eta^2 \rho^2 + 2} \|\nabla F(w^{t-1})\|^2,$   (50)
where (49) comes from the fact that $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ for any two vectors $x$ and $y$.
Defining
$Z := \frac{\eta \big( -2(\theta - 1)^2 + (\theta + 1)\theta(3\eta + 2)\rho^2 + (\theta + 1)\eta \rho^2 \big)}{2\rho} < 0$
and substituting (42), (44), (45), and (48) into (41), we have
$F(w_n^t) - F(w^{t-1}) \le \frac{Z}{\beta} \|\nabla \bar{F}^{t-1}\|^2$
$\overset{(50)}{\le} \frac{Z}{2\beta \big( (1 + \theta)^2 \eta^2 \rho^2 + 1 \big)} \|\nabla F(w^{t-1})\|^2$
$\overset{(32)}{\le} -\frac{-Z}{(1 + \theta)^2 \eta^2 \rho^2 + 1} \big( F(w^{t-1}) - F(w^*) \big)$   (51)
$= -\frac{\eta \big( 2(\theta - 1)^2 - (\theta + 1)\theta(3\eta + 2)\rho^2 - (\theta + 1)\eta \rho^2 \big)}{2\rho \big( (1 + \theta)^2 \eta^2 \rho^2 + 1 \big)} \big( F(w^{t-1}) - F(w^*) \big)$
$\overset{(11)}{=} -\Theta \big( F(w^{t-1}) - F(w^*) \big).$   (52)
By subtracting $F(w^*)$ from both sides of (52), we have
$F(w_n^t) - F(w^*) \le (1 - \Theta) \big( F(w^{t-1}) - F(w^*) \big), \ \forall n.$   (53)
Finally, we obtain
$F(w^t) - F(w^*) \le \sum_{n=1}^N p_n \big( F(w_n^t) - F(w^*) \big)$   (54)
$\overset{(53)}{\le} (1 - \Theta) \big( F(w^{t-1}) - F(w^*) \big),$   (55)
where (54) is due to the convexity of $F(\cdot)$.
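The contraction in (55) translates directly into the number of global rounds needed to reach an $\epsilon$-accurate solution, $K \ge \ln\big( (F(w^0) - F(w^*))/\epsilon \big) / \big( -\ln(1 - \Theta) \big)$. A minimal numeric sketch of this consequence (the helper is ours; the values of $\Theta$ and the initial optimality gap below are placeholders, not computed from the paper's constants):

    import math

    def global_rounds(Theta, initial_gap, eps):
        # Smallest K with (1 - Theta)**K * initial_gap <= eps, per the contraction in (55).
        assert 0.0 < Theta < 1.0 and initial_gap > eps > 0.0
        return math.ceil(math.log(initial_gap / eps) / -math.log(1.0 - Theta))

    # Placeholder values: Theta is determined by (theta, eta, rho) through (11).
    print(global_rounds(Theta=0.05, initial_gap=10.0, eps=1e-3))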

D. Proof of Lemma 2

The convexity of SUB1 can be shown by its strictly convex objective in (13) and the fact that its constraints determine a convex set. Thus, the global optimal solution of SUB1 can be found using the KKT condition [28]. In the following, we first provide the KKT condition of SUB1, and then show that the solution in Lemma 2 satisfies this condition.

The Lagrangian of SUB1 is
$\mathcal{L}_1 = \sum_{n=1}^N \Big[ E_{n,cp} + \lambda_n \Big( \frac{c_n D_n}{f_n} - T_{cp} \Big) + \mu_n (f_n - f_n^{\max}) - \nu_n (f_n - f_n^{\min}) \Big] + \kappa T_{cp},$
where $\lambda_n, \mu_n, \nu_n$ are non-negative dual variables with their optimal values denoted by $\lambda_n^*, \mu_n^*, \nu_n^*$, respectively. Then the KKT condition is as follows:
$\frac{\partial \mathcal{L}_1}{\partial f_n} = \frac{\partial E_{n,cp}}{\partial f_n} - \lambda_n \frac{c_n D_n}{f_n^2} + \mu_n - \nu_n = 0, \ \forall n$   (56)
$\frac{\partial \mathcal{L}_1}{\partial T_{cp}} = \kappa - \sum_{n=1}^N \lambda_n = 0,$   (57)
$\mu_n (f_n - f_n^{\max}) = 0, \ \forall n$   (58)
$\nu_n (f_n - f_n^{\min}) = 0, \ \forall n$   (59)
$\lambda_n \Big( \frac{c_n D_n}{f_n} - T_{cp} \Big) = 0, \ \forall n.$   (60)
Next, we will show that the optimal solution according to the KKT condition is the same as that provided by Lemma 2. To do that, we observe that the existence of $\mathcal{N}_1$, $\mathcal{N}_2$, $\mathcal{N}_3$ and their respective $T_{\mathcal{N}_1}$, $T_{\mathcal{N}_2}$, $T_{\mathcal{N}_3}$ produced by Algorithm 2 depends on $\kappa$. Therefore, we will construct the ranges of $\kappa$ such that there exist three subsets $\mathcal{N}'_1$, $\mathcal{N}'_2$, $\mathcal{N}'_3$ of UEs satisfying the KKT condition and having the same solution as that in Lemma 2 in the following cases.

a) $T_{cp}^* = T_{\mathcal{N}_1} \ge \max\{T_{\mathcal{N}_2}, T_{\mathcal{N}_3}\}$: This happens when $\kappa$ is large enough so that the condition in line 4 of Algorithm 2 is satisfied, because $T_{\mathcal{N}_3}$ is decreasing when $\kappa$ increases. Thus we consider $\kappa \ge \sum_{n=1}^N \alpha_n (f_n^{\max})^3$ (which ensures that $\mathcal{N}_1$ of Algorithm 2 is non-empty). From (57), we have
$\kappa = \sum_{n=1}^N \lambda_n^*,$   (61)
thus $\kappa$ in this range can guarantee a non-empty set $\mathcal{N}'_1 = \{ n \mid \lambda_n^* \ge \alpha_n (f_n^{\max})^3 \}$ such that
$\frac{\partial E_{n,cp}(f_n^*)}{\partial f_n} - \lambda_n^* \frac{c_n D_n}{f_n^{*2}} \le 0, \ \forall n \in \mathcal{N}'_1: f_n^* \le f_n^{\max}.$
Then from (56) we must have $\mu_n^* - \nu_n^* \ge 0$; thus, according to (58), $f_n^* = f_n^{\max}, \forall n \in \mathcal{N}'_1$. From (60), we see that $\mathcal{N}'_1 = \{ n: \frac{c_n D_n}{f_n^{\max}} = T_{cp}^* \}$. Hence, by the definition in (19),
$T_{cp}^* = \max_{n \in \mathcal{N}} \frac{c_n D_n}{f_n^{\max}}.$   (62)
On the other hand, if there exists a non-empty set $\mathcal{N}'_2 = \{ n \mid \lambda_n^* = 0 \}$, it must be due to
$\frac{c_n D_n}{f_n^{\min}} \le T_{cp}^*, \ \forall n \in \mathcal{N}'_2,$
according to (60). In this case, from (56) we must have $\mu_n^* - \nu_n^* \le 0 \Rightarrow f_n^* = f_n^{\min}, \forall n \in \mathcal{N}'_2$. Finally, if there exist UEs with $\frac{c_n D_n}{f_n^{\min}} > T_{cp}^*$ and $\frac{c_n D_n}{f_n^{\max}} < T_{cp}^*$, they will belong to the set $\mathcal{N}'_3 = \{ n \mid \alpha_n (f_n^{\min})^3 < \lambda_n^* < \alpha_n (f_n^{\max})^3 \}$ such that $f_n^{\min} < f_n^* < f_n^{\max}$. According to (60), $T_{cp}^*$ must be the same for all $n$ with $\lambda_n^* > 0$, and we obtain
$f_n^* = \frac{c_n D_n}{T_{cp}^*} = \frac{c_n D_n}{\max_n \frac{c_n D_n}{f_n^{\max}}}, \ \forall n \in \mathcal{N}'_3.$   (63)
In summary, we have
$f_n^* = f_n^{\max}, \ \forall n \in \mathcal{N}'_1; \quad f_n^* = f_n^{\min}, \ \forall n \in \mathcal{N}'_2; \quad f_n^* = \frac{c_n D_n}{T_{cp}^*}, \ \forall n \in \mathcal{N}'_3,$
with $T_{cp}^*$ determined in (62).

b) $T_{cp}^* = T_{\mathcal{N}_2} > \max\{T_{\mathcal{N}_1}, T_{\mathcal{N}_3}\}$: This happens when $\kappa$ is small enough such that the condition in line 7 of Algorithm 2 is satisfied. In this case, $\mathcal{N}_1$ is empty and $\mathcal{N}_2$ is non-empty according to line 8 of this algorithm. Thus we consider $\kappa \le \sum_{n=1}^N \alpha_n (f_n^{\min})^3$. Due to the considered small $\kappa$ and (61), there must exist a non-empty set $\mathcal{N}'_2 = \{ n: \lambda_n^* \le \alpha_n (f_n^{\min})^3 \}$ such that
$\frac{\partial E_{n,cp}(f_n^*)}{\partial f_n} - \lambda_n^* \frac{c_n D_n}{f_n^{*2}} \ge 0, \ \forall n \in \mathcal{N}'_2: f_n^* \ge f_n^{\min}.$
Then, from (56) we must have $\mu_n^* - \nu_n^* \le 0$, and thus from (58), $f_n^* = f_n^{\min}, \forall n \in \mathcal{N}'_2$. Therefore, by the definition in (19) we have
$T_{cp}^* = \max_{n \in \mathcal{N}'_2} \frac{c_n D_n}{f_n^{\min}}.$
If we further restrict $\kappa \le \min_{n \in \mathcal{N}} \alpha_n (f_n^{\min})^3$, then we see that $\mathcal{N}'_2 = \mathcal{N}$, i.e., $f_n^* = f_n^{\min}, \forall n \in \mathcal{N}$. On the other hand, if we consider $\min_{n \in \mathcal{N}} \alpha_n (f_n^{\min})^3 < \kappa \le \sum_{n=1}^N \alpha_n (f_n^{\min})^3$, then there may exist UEs with $\frac{c_n D_n}{f_n^{\min}} > T_{cp}^*$ and $\frac{c_n D_n}{f_n^{\max}} < T_{cp}^*$, which will belong to the set $\mathcal{N}'_3 = \{ n \mid \alpha_n (f_n^{\min})^3 < \lambda_n^* < \alpha_n (f_n^{\max})^3 \}$ such that $f_n^{\min} < f_n^* < f_n^{\max}$. Then, according to (60), $T_{cp}^*$ must be the same for all $n$ with $\lambda_n^* > 0$, and we obtain
$f_n^* = \frac{c_n D_n}{T_{cp}^*} = \frac{c_n D_n}{\max_n \frac{c_n D_n}{f_n^{\min}}}, \ \forall n \in \mathcal{N}'_3.$   (64)
In summary, we have
$f_n^* = \frac{c_n D_n}{T_{cp}^*}, \ \forall n \in \mathcal{N}'_3; \quad f_n^* = f_n^{\min}, \ \forall n \in \mathcal{N}'_2.$

c) $T_{cp}^* = T_{\mathcal{N}_3} > \max\{T_{\mathcal{N}_1}, T_{\mathcal{N}_2}\}$: This happens when $\kappa$ is in a range such that, in Algorithm 2, the condition at line 7 is violated at some round while the condition at line 4 is not satisfied. In this case, $\mathcal{N}_1$ is empty and $\mathcal{N}_3$ is non-empty according to line 8 of this algorithm. Thus we consider $\sum_{n=1}^N \alpha_n (f_n^{\min})^3 < \kappa < \sum_{n=1}^N \alpha_n (f_n^{\max})^3$. With this range of $\kappa$ and (61), there must exist a non-empty set $\mathcal{N}'_3 = \{ n \mid \alpha_n (f_n^{\min})^3 < \lambda_n^* < \alpha_n (f_n^{\max})^3 \}$ such that $f_n^{\min} < f_n^* < f_n^{\max}, \forall n \in \mathcal{N}'_3$. Then, from (56) we have $\mu_n^* - \nu_n^* = 0$, and the equation
$\frac{\partial E_{n,cp}(f_n)}{\partial f_n} - \lambda_n^* \frac{c_n D_n}{f_n^2} = 0$
has the solution $f_n^* = \Big( \frac{\lambda_n^*}{\alpha_n} \Big)^{1/3}, \forall n \in \mathcal{N}'_3$. Furthermore, from (60), we have
$\frac{c_n D_n}{f_n^*} = c_n D_n \Big( \frac{\alpha_n}{\lambda_n^*} \Big)^{1/3} = T_{cp}^*, \ \forall n \in \mathcal{N}'_3.$   (65)
Combining (65) with (61), we have
$T_{cp}^* = \Big( \frac{\sum_{n \in \mathcal{N}'_3} \alpha_n (c_n D_n)^3}{\kappa} \Big)^{1/3}.$
On the other hand, if there exists a non-empty set $\mathcal{N}'_2 = \{ n \mid \lambda_n^* = 0 \}$, it must be due to
$\frac{c_n D_n}{f_n^{\min}} < T_{cp}^*, \ \forall n \in \mathcal{N}'_2,$
according to (60). From (56) we must have $\mu_n^* - \nu_n^* \le 0 \Rightarrow f_n^* = f_n^{\min}, \forall n \in \mathcal{N}'_2$. In summary, we have
$f_n^* = \frac{c_n D_n}{T_{cp}^*}, \ \forall n \in \mathcal{N}'_3; \quad f_n^* = f_n^{\min}, \ \forall n \in \mathcal{N}'_2.$
Considering all cases above, we see that the solutions characterized by the KKT condition are exactly the same as those provided in Lemma 2.
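In all three cases above, once the optimal $T_{cp}^*$ is known, the per-UE frequency reduces to clipping $c_n D_n / T_{cp}^*$ into $[f_n^{\min}, f_n^{\max}]$: UEs in $\mathcal{N}'_1$ end up at the upper bound, UEs in $\mathcal{N}'_2$ at the lower bound, and UEs in $\mathcal{N}'_3$ strictly in between. A minimal sketch of that recovery step (the function, array names, and numbers are illustrative only, not taken from the released code):

    import numpy as np

    def sub1_frequencies(c, D, f_min, f_max, T_cp_star):
        # Recover f_n* from the optimal T_cp* by clipping c_n*D_n/T_cp* into [f_min, f_max].
        c, D = np.asarray(c, dtype=float), np.asarray(D, dtype=float)
        return np.clip(c * D / T_cp_star, f_min, f_max)

    # Illustrative values only (CPU cycles per sample, samples, Hz bounds, seconds).
    print(sub1_frequencies(c=[2e4, 3e4], D=[500, 800], f_min=1e8, f_max=2e9, T_cp_star=0.05))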

E. Proof of Lemma 3

According to (16) and (17), the objective of SUB2 is the sum of perspective functions of convex and linear functions, and its constraints determine a convex set; thus SUB2 is a convex problem that can be analyzed using the KKT condition [28].

The Lagrangian of SUB2 is
$\mathcal{L}_2 = \sum_{n=1}^N E_{n,co}(\tau_n) + \lambda \Big( \sum_{n=1}^N \tau_n - T_{co} \Big) + \sum_{n=1}^N \mu_n (\tau_n - \tau_n^{\max}) - \sum_{n=1}^N \nu_n (\tau_n - \tau_n^{\min}) + \kappa T_{co},$
where $\lambda, \mu_n, \nu_n$ are non-negative dual variables. Then the KKT condition is as follows:
$\frac{\partial \mathcal{L}_2}{\partial \tau_n} = \frac{\partial E_{n,co}}{\partial \tau_n} + \lambda + \mu_n - \nu_n = 0, \ \forall n$   (66)
$\frac{\partial \mathcal{L}_2}{\partial T_{co}} = \kappa - \lambda = 0,$   (67)
$\mu_n (\tau_n - \tau_n^{\max}) = 0, \ \forall n$   (68)
$\nu_n (\tau_n - \tau_n^{\min}) = 0, \ \forall n$   (69)
$\lambda \Big( \sum_{n=1}^N \tau_n - T_{co} \Big) = 0.$   (70)
From (67), we see that $\lambda^* = \kappa$. Letting $x := \frac{s_n}{\tau_n B}$, we first consider the equation
$\frac{\partial E_{n,co}}{\partial \tau_n} + \lambda^* = 0 \ \Leftrightarrow \ \frac{N_0}{\bar{h}_n} \big( e^x - 1 - x e^x \big) = -\lambda^* = -\kappa$
$\Leftrightarrow \ e^x (x - 1) = \kappa N_0^{-1} \bar{h}_n - 1 \ \Leftrightarrow \ e^{x-1} (x - 1) = \frac{\kappa N_0^{-1} \bar{h}_n - 1}{e}$
$\Leftrightarrow \ x = 1 + W\Big( \frac{\kappa N_0^{-1} \bar{h}_n - 1}{e} \Big) \ \Leftrightarrow \ \tau_n = g_n(\kappa) = \frac{s_n / B}{1 + W\big( \frac{\kappa N_0^{-1} \bar{h}_n - 1}{e} \big)}.$
Because $W(\cdot)$ is strictly increasing when $W(\cdot) > -\ln 2$, $g_n(\kappa)$ is strictly decreasing and positive, and so is its inverse function
$g_n^{-1}(\tau_n) = -\frac{\partial E_{n,co}(\tau_n)}{\partial \tau_n}.$

Then we have the following cases:

a) If $g_n(\kappa) \le \tau_n^{\min} \Leftrightarrow \kappa \ge g_n^{-1}(\tau_n^{\min})$: then
$\kappa = \lambda^* \ge g_n^{-1}(\tau_n^{\min}) \ge -\frac{\partial E_{n,co}}{\partial \tau_n} \Big|_{\tau_n \le \tau_n^{\min}}.$
Thus, according to (66), $\mu_n^* - \nu_n^* \le 0$. Because both $\mu_n^*$ and $\nu_n^*$ cannot be positive, we have $\mu_n^* = 0$ and $\nu_n^* \ge 0$. Then we consider the two cases of $\nu_n^*$: a) if $\nu_n^* > 0$, from (69), $\tau_n^* = \tau_n^{\min}$; and b) if $\nu_n^* = 0$, from (66), we must have $\kappa = g_n^{-1}(\tau_n^{\min})$, and thus $\tau_n^* = \tau_n^{\min}$.

b) If $g_n(\kappa) \ge \tau_n^{\max} \Leftrightarrow \kappa \le g_n^{-1}(\tau_n^{\max})$, then we have
$\kappa = \lambda^* \le g_n^{-1}(\tau_n^{\max}) \le -\frac{\partial E_{n,co}}{\partial \tau_n} \Big|_{\tau_n \le \tau_n^{\max}}.$
Thus, according to (66), $\mu_n^* - \nu_n^* \ge 0$, inducing $\nu_n^* = 0$ and $\mu_n^* \ge 0$. With similar reasoning as above, we have $\tau_n^* = \tau_n^{\max}$.

c) If $\tau_n^{\min} < g_n(\kappa) < \tau_n^{\max} \Leftrightarrow g_n^{-1}(\tau_n^{\max}) < \kappa < g_n^{-1}(\tau_n^{\min})$, then from (68) and (69) we must have $\mu_n^* = \nu_n^* = 0$, with which and (66) we have $\tau_n^* = g_n(\kappa)$.

Finally, with $\lambda^* = \kappa > 0$, from (70) we have $T_{co}^* = \sum_{n=1}^N \tau_n^*$.
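The closed form above is easy to evaluate numerically: $g_n(\kappa)$ requires only the principal branch of the Lambert function, and cases a)-c) amount to clipping $g_n(\kappa)$ into $[\tau_n^{\min}, \tau_n^{\max}]$. A minimal sketch, assuming SciPy's lambertw and purely illustrative parameter values:

    import numpy as np
    from scipy.special import lambertw

    def sub2_tau(kappa, s_n, B, N0, h_bar, tau_min, tau_max):
        # tau_n* = clip(g_n(kappa), tau_min, tau_max), with
        # g_n(kappa) = (s_n/B) / (1 + W((kappa*h_bar/N0 - 1)/e)), principal branch of W.
        w = lambertw((kappa * h_bar / N0 - 1.0) / np.e).real
        return float(np.clip((s_n / B) / (1.0 + w), tau_min, tau_max))

    # Illustrative values only (data size in bits, bandwidth, noise power, channel gain, seconds).
    print(sub2_tau(kappa=1.0, s_n=2.5e4, B=1e6, N0=1e-9, h_bar=1e-7, tau_min=1e-3, tau_max=1.0))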