A Communication-Efficient Collaborative Learning Framework for Distributed Features

Yang Liu (WeBank), Yan Kang (WeBank), Xinwei Zhang (University of Minnesota), Liping Li (University of Science and Technology of China), Yong Cheng (WeBank), Tianjian Chen (WeBank), Mingyi Hong (University of Minnesota), Qiang Yang (WeBank)

Abstract

We introduce a collaborative learning framework that allows multiple parties, each holding a different set of attributes about the same users, to jointly build models without exposing their raw data or model parameters. In particular, we propose a Federated Stochastic Block Coordinate Descent (FedBCD) algorithm, in which each party conducts multiple local updates before each communication to effectively reduce the number of communication rounds among parties, a principal bottleneck for collaborative learning problems. We theoretically analyze the impact of the number of local updates and show that when the batch size, sample size and local iterations are selected appropriately, within $T$ iterations the algorithm performs $O(\sqrt{T})$ communication rounds and achieves an $O(1/\sqrt{T})$ accuracy (measured by the average of the gradient norm squared). The approach is supported by our empirical evaluations on a variety of tasks and datasets, demonstrating advantages over stochastic gradient descent (SGD) approaches.

1 Introduction

One critical challenge for applying today's Artificial Intelligence (AI) technologies to real-world applications is the common existence of data silos across different organizations. Due to legal, privacy and other practical constraints, data from different organizations cannot be easily integrated. The implementation of user privacy laws such as the GDPR [1] has made sharing data among different organizations more challenging. Collaborative learning has emerged as an attractive solution to the data silo and privacy problems. While distributed learning (DL) frameworks [2] originally aim at parallelizing computing power, distributing data identically across multiple servers, federated learning (FL) [3] focuses on data locality, non-IID distributions and privacy. In most existing collaborative learning frameworks, data are distributed by samples and thus share the same set of attributes. However, a different scenario is cross-organizational collaborative learning, where parties share the same users but hold different sets of features. For example, a local bank and a local retail company in the same city may have a large overlap in user base, and it is beneficial for these parties to build collaborative models with their respective features. FL is further categorized into

horizontal (sample-partitioned) FL, vertical (feature-partitioned) FL and federated transfer learning (FTL) in [4]. Feature-partitioned collaborative learning problems have been studied in the settings of both DL [5–7] and FL [4, 8–10]. However, existing architectures have not sufficiently addressed the communication problem, especially in communication-sensitive scenarios where data are geographically distributed and data locality and privacy are of paramount significance (i.e., in an FL setting). In these approaches, per-iteration communication and computation are often required, since the update of algorithm parameters needs contributions from all parties. In addition, to prevent data leakage, privacy-preserving techniques such as Homomorphic Encryption (HE) [11] and Secure Multi-party Computation (SMPC) [12] are typically applied to transmitted data [4, 10, 9], adding expensive communication overhead to the architectures. Differential Privacy (DP) is also a commonly adopted approach, but such approaches suffer from precision loss [6, 7]. In sample-partitioned FL [3], it has been demonstrated experimentally that multiple local updates can be performed with federated averaging (FedAvg), effectively reducing the number of communication rounds. Whether it is feasible to perform such a multiple-local-update strategy over distributed features is not clear. In this paper, we propose a collaborative learning framework for distributed features named Federated Stochastic Block Coordinate Descent (FedBCD), where parties share only a single value per sample instead of model parameters or raw data for each communication, and can continuously perform local model updates (in either a parallel or sequential manner) without per-iteration communication. In the proposed framework, all raw data and model parameters stay local, and no party learns other parties' data or model parameters either before or after the training. There is no loss in performance of the collaborative model compared to a model trained in a centralized manner. We demonstrate that the communication cost can be significantly reduced by adopting FedBCD. Compared with the existing distributed (stochastic) coordinate descent methods [13–16], we show for the first time that when the number of local updates, mini-batch size and learning rates are selected appropriately, FedBCD converges to an $O(1/\sqrt{T})$ accuracy with $O(\sqrt{T})$ rounds of communication despite performing multiple local updates using stale information. We then perform a comprehensive evaluation of FedBCD against several alternative protocols, including a sequential local update strategy (FedBCD-s), with both the original Stochastic Gradient Descent (SGD) and proximal gradient descent. For evaluation, we apply the proposed algorithms to multiple complex models, including logistic regression and convolutional neural networks, and on multiple datasets, including the privacy-sensitive medical dataset MIMIC-III [17], as well as the MNIST [18] and NUS-WIDE [19] datasets. Finally, we implement the algorithm for federated transfer learning (FTL) [10] to tackle problems with few labeled data and insufficient user overlap.

2 Related Work

Traditional distributed learning adopts a parameter-server architecture [2] to enable a large number of computation nodes to train a shared model by aggregating locally computed updates. The issue of privacy in the DL framework is considered in [20]. FL [3] adopted the FedAvg algorithm, which runs Stochastic Gradient Descent (SGD) for multiple local updates in parallel to achieve better communication efficiency. The authors of [21] studied the FedAvg algorithm under the parallel restarted SGD framework and analyzed the convergence rate and communication savings under IID settings. In [22], the convergence of the FedAvg algorithm under non-IID settings was investigated. All the works above consider the sample-partitioned scenario.

Feature-partitioned learning architectures have been developed for models including trees [9], linear and logistic regression [4, 8, 5, 7], and neural networks [10, 6]. Distributed Coordinate Descent [13] uses balanced partitions and decouples computation at the individual coordinates in each partition; Distributed Block Coordinate Descent [14] assumes the feature partitioning is given and performs synchronous block updates of variables, which is suited to MapReduce systems in high-communication-cost settings. These approaches require synchronization at every iteration. Asynchronous BCD [15] and asynchronous ADMM algorithms [16] try to tame various kinds of asynchronicity using strategies such as small step sizes and careful update scheduling, with the design objective of ensuring that the algorithm still behaves reasonably in non-ideal computing environments. Our approach instead addresses the expensive communication overhead in the FL scenario by systematically adopting BCD with a sufficient number of local updates, guided by theoretical convergence guarantees.

3 Problem Definition

Suppose $K$ data parties collaboratively train a machine learning model based on $N$ data samples $\{x_i, y_i\}_{i=1}^N$, where the feature vector $x_i \in \mathbb{R}^{1\times d}$ is distributed among the $K$ parties, $\{x_i^k \in \mathbb{R}^{1\times d_k}\}_{k=1}^K$, where $d_k$ is the feature dimension of party $k$. Without loss of generality, we assume one party holds the labels, and that it is party $K$. Let us denote the data set as $\mathcal{D}_i^k \triangleq \{x_i^k\}$ for $k \in [K-1]$, $\mathcal{D}_i^K \triangleq \{x_i^K, y_i\}$, and $\mathcal{D}_i \triangleq \{\mathcal{D}_i^k\}_{k=1}^K$ (where $[K-1]$ denotes the set $\{1, \cdots, K-1\}$). Then the collaborative training problem can be formulated as

\[
\min_{\Theta}\; \mathcal{L}(\Theta; \mathcal{D}) \triangleq \frac{1}{N}\sum_{i=1}^N f(\theta_1, \ldots, \theta_K; \mathcal{D}_i) + \lambda\sum_{k=1}^K \gamma(\theta_k) \tag{1}
\]
where $\theta_k \in \mathbb{R}^{d_k}$ denotes the training parameters of the $k$th party; $\Theta = [\theta_1; \ldots; \theta_K]$; $f(\cdot)$ and $\gamma(\cdot)$ denote the loss function and the regularizer, respectively, and $\lambda$ is the hyperparameter. For a wide range of models, such as linear and logistic regression and support vector machines, the loss function has the following form:
\[
f(\theta_1, \ldots, \theta_K; \mathcal{D}_i) = f\Big(\sum_{k=1}^K x_i^k\theta_k,\; y_i^K\Big) \tag{2}
\]
The objective is for each party $k$ to find its $\theta_k$ without sharing its data $\mathcal{D}_i^k$ or parameter $\theta_k$ with the other parties.
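To make the formulation concrete, the following is a minimal NumPy sketch of (1)–(2) for a binary logistic-regression instance with an ℓ2 regularizer; the party count, shapes and constants are illustrative assumptions, not part of the original text.

```python
import numpy as np

def local_score(X_k, theta_k):
    """Party k's contribution H^k = x^k theta_k in equation (2); shape (N,)."""
    return X_k @ theta_k

def joint_loss(scores, y, thetas, lam):
    """Objective (1) for a logistic instance of (2), with gamma = ||.||^2 / 2."""
    H = np.sum(scores, axis=0)               # sum of per-party scores, shape (N,)
    f = np.mean(np.log1p(np.exp(-y * H)))    # logistic loss, labels y in {-1, +1}
    reg = lam * sum(0.5 * np.dot(t, t) for t in thetas)
    return f + reg

# toy data: K=2 parties, N=8 samples, d1=3 and d2=2 features
rng = np.random.default_rng(0)
X = [rng.normal(size=(8, 3)), rng.normal(size=(8, 2))]
thetas = [np.zeros(3), np.zeros(2)]
y = rng.choice([-1.0, 1.0], size=8)
scores = [local_score(Xk, tk) for Xk, tk in zip(X, thetas)]
print(joint_loss(scores, y, thetas, lam=0.01))  # log(2) at theta = 0
```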

4 The FedSGD Approach

If a mini-batch S ⊂ [N] of data is sampled, the stochastic partial gradient w.r.t. θk is given by

\[
g_k(\Theta; S) \triangleq \nabla_k f(\Theta; S) + \lambda\nabla\gamma(\theta_k). \tag{3}
\]

Let $H_i^k = x_i^k\theta_k$ and $H_i = \sum_{k=1}^K H_i^k$; then for the loss function in equation (2), we have
\[
\nabla_k f(\Theta; S) = \frac{1}{N_S}\sum_{i\in S} \frac{\partial f(H_i, y_i^K)}{\partial H_i}(x_i^k)^T \tag{4}
\]

\[
\triangleq \frac{1}{N_S}\sum_{i\in S} g(H_i, y_i^K)(x_i^k)^T \tag{5}
\]

where $N_S$ is the number of samples in $S$. To compute $\nabla_k f(\Theta; S)$ locally, each party $k \in [K-1]$ sends $H^k = \{H_i^k\}_{i\in S}$ to party $K$, which then calculates $H^K = \{g(H_i, y_i^K)\}_{i\in S}$ and sends it to the other parties; finally, all parties can compute their gradient updates with equation (4). See Algorithm 1. For an arbitrary loss function, let us define the collection of information needed to compute $\nabla_k f(\Theta; S)$ as
\[
H_{-k}^S := \{H_q^k(\theta_q, S^q)\}_{q\neq k}, \tag{6}
\]
where $H_q^k(\cdot)$ is a function summarizing the information required by party $k$ from party $q$. The stochastic gradient (3) can then be computed as
\[
g_k(\Theta; S) = \nabla_k f(H_{-k}^S, \theta_k; S) + \lambda\nabla\gamma(\theta_k) \triangleq g_k(H_{-k}, \theta_k; S). \tag{7}
\]

Therefore, the overall stochastic gradient is given as

\[
g(\Theta; S) \triangleq [g_1(H_{-1}, \theta_1; S); \cdots; g_K(H_{-K}, \theta_K; S)]. \tag{8}
\]

A direct approach to optimizing (1) is to use the vanilla stochastic gradient descent (SGD) algorithm given below:
\[
\theta_k \leftarrow \theta_k - \eta\, g_k(H_{-k}, \theta_k; S), \quad \forall\, k. \tag{9}
\]

Algorithm 1 FedSGD implemented on K parties
Input: learning rate η
Output: Model parameters θ_1, θ_2, ..., θ_K
Parties 1, 2, ..., K initialize θ_1, θ_2, ..., θ_K.
for each iteration j = 1, 2, ... do
  Randomly sample S ⊂ [N];
  Exchange({1, 2, ..., K});
  for each party k = 1, 2, ..., K in parallel do
    k computes g_k with equation (7) and updates θ_k^{j+1} = θ_k^j − η g_k;
  end
end

Exchange(U):  (U is the set of party IDs)
if the loss has the form of equation (2) then
  for each party k ∈ U, k ≠ K, in parallel do
    k computes H^k and sends H^k to party K;
  end
  party K computes H^K and sends it to all other parties in U;
else
  for each party k ∈ U in parallel do
    k computes H_k^1, ..., H_k^K and sends H_k^q to party q;
  end
end

The federated implementation of the above SGD iteration is given in Algorithm 1. Algorithm 1 requires communication of intermediate results at every iteration. This can be very inefficient, especially when $K$ is large or the task is communication-heavy. For a task of the form of equation (2), the number of communications per round is $2(K-1)$, but for an arbitrary task the number of communications per round can be $K^2 - K$ if pair-wise communication is required. We note that since Algorithm 1 performs the same iteration as the vanilla SGD algorithm, it converges at a rate of $O(\frac{1}{\sqrt{T}})$ regardless of the choice of $K$ [23]. Since each iteration requires one round of communication among all the parties, $T$ rounds of communication are required to achieve an error of $O(\frac{1}{\sqrt{T}})$.
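To make the message flow of Algorithm 1 concrete, here is a minimal single-process simulation for the logistic instance of (2), assuming labels $y_i \in \{-1,+1\}$ so that the transmitted derivative is $g(H_i, y_i) = -y_i\,\sigma(-y_i H_i)$. All names and constants are illustrative, and a real deployment would additionally protect the exchanged messages (e.g., with HE).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fed_sgd(X, y, lam=0.01, eta=0.1, iters=200, batch=4, seed=0):
    """Single-process simulation of Algorithm 1 (FedSGD) for logistic loss.

    X is a list of per-party feature blocks; only H^k (one scalar per sample)
    and the derivative g(H_i, y_i) cross party boundaries, never X or theta.
    """
    rng = np.random.default_rng(seed)
    K, N = len(X), len(y)
    thetas = [np.zeros(Xk.shape[1]) for Xk in X]
    for _ in range(iters):
        S = rng.choice(N, size=batch, replace=False)
        # Exchange: each party k sends H^k = X^k theta_k on the mini-batch
        H_parts = [X[k][S] @ thetas[k] for k in range(K)]
        H = np.sum(H_parts, axis=0)
        # party K (the label holder) computes g(H_i, y_i) and broadcasts it
        g_msg = -y[S] * sigmoid(-y[S] * H)
        # each party forms its stochastic partial gradient (4)-(5) locally
        for k in range(K):
            grad_k = X[k][S].T @ g_msg / batch + lam * thetas[k]
            thetas[k] -= eta * grad_k
    return thetas
```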

5 The Proposed FedBCD Algorithms

In the parallel version of our proposed algorithm, called FedBCD-p, at each iteration each party $k \in [K]$ performs $Q > 1$ consecutive local gradient updates in parallel, before communicating the intermediate results among each other; see Algorithm 2. Such a "multi-local-step" strategy is strongly motivated by our practical implementation (shown in the Experiments section), where we found that performing multiple local steps can significantly reduce the overall communication cost. Further, such a strategy resembles the FedAvg algorithm [3], where each party performs multiple local steps before aggregation. At each iteration, the $k$th feature block is updated using the direction (where $S$ is a mini-batch of data points)

\[
d_k = -\eta\, g_k(H_{-k}, \theta_k; S). \tag{10}
\]

Because $H_{-k}$ is the intermediate information obtained at the most recent synchronization, it may contain stale information, so it may no longer yield an unbiased estimate of the true partial gradient $\nabla_k L(\Theta)$. On the other hand, no inter-party communication is required during the $Q$ local updates. Therefore, one can expect an interesting tradeoff between communication efficiency and computational efficiency. In the same spirit, a sequential version of the algorithm allows the parties to update their local $\theta_k$'s sequentially, where each update consists of $Q$ local updates without inter-party communication; we term this variant FedBCD-s (Algorithm 3).

Algorithm 2 FedBCD-p: Parallel Federated Stochastic Block Coordinate Descent
Input: learning rate η
Output: Model parameters θ_1, θ_2, ..., θ_K
Parties 1, 2, ..., K initialize θ_1, θ_2, ..., θ_K.
for each outer iteration t = 1, 2, ... do
  Randomly sample a mini-batch S ⊂ D;
  Exchange({1, 2, ..., K});
  for each party k ∈ [K], in parallel do
    for each local iteration j = 1, 2, ..., Q do
      k computes g_k(H_{−k}, θ_k; S) using (7);
      Update θ_k ← θ_k − η g_k(H_{−k}, θ_k; S);
    end
  end
end

Algorithm 3 Sequential FedBCD: Sequential Federated Stochastic Block Coordinate Descent
Input: learning rate η
Output: Model parameters θ_1, θ_2, ..., θ_K
Parties 1, 2, ..., K initialize θ_1, θ_2, ..., θ_K.
Exchange({1, 2, ..., K});
for each iteration i = 1, 2, ... do
  Randomly sample a mini-batch S ⊂ D;
  for each party k = 1, 2, ..., K sequentially do
    for each local iteration j = 1, 2, ..., Q do
      k computes g_k using (7) and updates θ_k ← θ_k − η g_k(H_{−k}, θ_k; S);
    end
    Exchange({k, K});
  end
end
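The following is a minimal sketch of the FedBCD-p inner loop for the same logistic instance as above, highlighting which quantities go stale between synchronizations; the sequential variant would instead refresh the exchange after each party's $Q$ steps. Names, constants and the two-branch staleness handling are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fedbcd_p(X, y, lam=0.01, eta=0.1, rounds=50, Q=5, batch=4, seed=0):
    """Sketch of Algorithm 2 (FedBCD-p) for the logistic case of (2).

    Between synchronizations each party performs Q local steps reusing the
    cached messages from the last Exchange, so its partial gradient is
    computed from stale information about the other parties.
    """
    rng = np.random.default_rng(seed)
    K, N = len(X), len(y)
    thetas = [np.zeros(Xk.shape[1]) for Xk in X]
    label_holder = K - 1                      # party K holds the labels
    for _ in range(rounds):                   # one communication round each
        S = rng.choice(N, size=batch, replace=False)
        H_parts = [X[k][S] @ thetas[k] for k in range(K)]
        H = np.sum(H_parts, axis=0)
        g_msg = -y[S] * sigmoid(-y[S] * H)    # broadcast by the label holder
        for k in range(K):                    # conceptually in parallel
            for _ in range(Q):
                if k == label_holder:
                    # the label holder can refresh g as its own block moves,
                    # holding the other parties' contributions fixed (stale)
                    H_stale = H - H_parts[k] + X[k][S] @ thetas[k]
                    g_local = -y[S] * sigmoid(-y[S] * H_stale)
                else:
                    g_local = g_msg           # stale derivative message
                grad_k = X[k][S].T @ g_local / batch + lam * thetas[k]
                thetas[k] -= eta * grad_k
    return thetas
```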

6 Security Analysis

Here we aim to find out whether one party can learn another party's data from the collection of messages ($H_q^k$) exchanged during training. Whereas previous research studied data leakage from exposing the complete set of model parameters or gradients [24–26], in our protocol the model parameters are kept private, and only the intermediate results (such as the inner product of model parameters and features) are exposed.

Security Definition. Let $S_j$ be the set of data points sampled at the $j$th iteration, and let $i_j$ denote the $i$th sample of the $j$th iteration. $H_{i_j}^k$ is the contribution of the $i_j$th sample to the other parties. At the $(j+1)$th iteration, we update the weight variables according to equation (4):

\[
\theta_k^{j+1} = \theta_k^j - \eta\Big(\frac{1}{N_{S_j}}\sum_{i_j\in S_j} g(H_{i_j}, y_{i_j}^K)(x_{i_j}^k)^T + \lambda\theta_k^j\Big) \tag{11}
\]

The security definition is that, for any party $k$ with undisclosed dataset $\mathcal{D}^k$ and training parameters $\theta_k$ following FedBCD, there exist infinitely many solutions for $\{x_{i_j}^k\}_{i\in S_j,\, j=0,1,\cdots,M}$ that yield the same set of contributions $\{H_{i_j}^k\}_{j=0,1,\cdots,M}$. That is, one cannot determine party $k$'s data uniquely from its exchanged messages $\{H_j^k\}_{j=0,1,\cdots,M}$, regardless of the value of $M$, the total number of iterations. Such a security definition is in line with prior security definitions proposed for privacy-preserving machine learning and secure multi-party computation (SMC), such as [27–29, 4, 30]. Under this security definition, when some prior knowledge about the data is known, an adversary may be able to eliminate some alternative solutions, or certain derived statistical information may be revealed

Figure 1: Illustration of a 2-party collaborative learning framework: (a) with neural network (NN)-based local models; (b) the FedBCD-s and FedBCD-p algorithms.

[28, 29], but it is still impossible to infer the exact raw data ("deep leakage"). However, this practical and heuristic security model provides a flexible tradeoff between privacy and efficiency and allows much more efficient solutions. Our problem is particularly challenging in that the observations made by other parties are serial outputs of the FedBCD algorithm and are all correlated through equation (11). Although it is easy to show that sending one round of $H_{i_j}^k$ is secure due to the protection afforded by the reduced dimensionality, it is unclear whether raw data will be leaked after thousands or millions of rounds of iterative communication.

Theorem 1. For a $K$-party collaborative learning framework following (2) with $K \ge 2$, the FedBCD algorithm is secure for party $k$ if $k$'s feature dimension is at least 2, i.e., $d_k \ge 2$.

The security proof can readily be extended to collaborative systems where parties have arbitrary local sub-models (such as neural networks) but connect at the final prediction layer with loss function (2) (see Figure 1(a)). Let $G^k$ be the local transformation on $x_i^k$, unknown to parties other than $k$. If we choose $G_i^k$ to be the identity map, i.e., $G_i^k(x_i^k) = x_i^k$, then the problem reduces to Theorem 1.
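As an illustration of this extension, here is a hedged NumPy sketch of the two-party architecture of Figure 1(a), where each party applies a private nonlinear transformation $G^k$ before contributing to the final prediction layer; the one-hidden-layer tanh networks and all shapes are assumptions made for the sketch.

```python
import numpy as np

def local_model(X_k, W1, theta_k):
    """Party k's private sub-model G^k followed by its final-layer weights:
    H^k = G^k(x^k) theta_k, where G^k here is a one-hidden-layer tanh net."""
    return np.tanh(X_k @ W1) @ theta_k

rng = np.random.default_rng(0)
X_a, X_b = rng.normal(size=(8, 5)), rng.normal(size=(8, 7))   # two feature blocks
W_a, W_b = rng.normal(size=(5, 4)), rng.normal(size=(7, 4))   # private sub-models
th_a, th_b = rng.normal(size=4), rng.normal(size=4)
# only the scalar per-sample contributions H^A, H^B are ever exchanged;
# the prediction layer combines them exactly as in the linear case (2)
H = local_model(X_a, W_a, th_a) + local_model(X_b, W_b, th_b)
prob = 1.0 / (1.0 + np.exp(-H))
print(prob.shape)  # (8,)
```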

7 Convergence Analysis

In this section, we perform a convergence analysis of the FedBCD algorithm. Our analysis focuses on the parallel version, Algorithm 2; the sequential version can be analyzed using similar techniques. To facilitate the proof, we define the following notation. Let $r$ denote the iteration index, where in each iteration one round of local update is performed. Let $r_0$ denote the latest iteration before $r$ in which synchronization was performed and the intermediate information $H^k$'s were exchanged. Let $y_k^r$ denote the "local vector" that node $k$ uses to compute its local gradient at iteration $r$, that is,
\[
g_k(y_k^r; S) = g_k([\Theta_{-k}^{r_0}, \theta_k^r]; S) \tag{12}
\]
where $[v_{-k}, w]$ denotes the vector $v$ with its $k$th block replaced by $w$. Note that by Algorithm 2, each node $k$ always updates the $k$th block of $y_k^r$, while the information about $\Theta_{-k}^{r_0}$ is obtained from the most recent synchronization step. Further, we use the "global" variable $\Theta^r$ to collect the most updated parameters of each node at each iteration, where $y_{k,j}$ denotes the $j$th block of $y_k$:
\[
\Theta^r = [\theta_1^r; \cdots; \theta_K^r] \triangleq [y_{1,1}^r; \cdots; y_{K,K}^r].
\]
Note that $\{\Theta^r\}$ is only a sequence of "virtual" variables; it is never explicitly formed in the algorithm.

Assumptions

A1: Lipschitz Gradient. Assume that the loss function satisfies the following:

\[
\|\nabla L(\Theta_1) - \nabla L(\Theta_2)\| \le L\,\|\Theta_1 - \Theta_2\|, \quad \forall\, \Theta_1, \Theta_2,
\]
\[
\|\nabla_k L(\Theta_1) - \nabla_k L(\Theta_2)\| \le L_k\,\|\Theta_1 - \Theta_2\|, \quad \forall\, \Theta_1, \Theta_2.
\]

A2: Uniform Sampling. For simplicity, assume that the data set is partitioned into $B$ mini-batches $S_1, \cdots, S_B$, each of size $S$; at a given iteration, $S$ is sampled uniformly from these mini-batches. Our main result is shown below. The detailed analysis is relegated to the Supplementary Material.

Theorem 2. Under Assumptions A1, A2, when the step size in Algorithm 2 satisfies
\[
0 < \eta \le \min\bigg\{\frac{1}{L},\; \frac{\sqrt{2}}{2Q\sqrt{\sum_{j=1}^K L_j^2 + 3L_k^2}}\bigg\},
\]
then for all $T \ge 1$ we have the following bound:
\[
\frac{1}{T}\sum_{r=0}^{T-1} \mathbb{E}[\|\nabla L(\Theta^r)\|^2] \le \frac{2}{\eta T}\big(L(\Theta^{(0)}) - L(\Theta^\star)\big) + 2\eta^2(K+3)Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + 2K\frac{\sigma^2}{S}, \tag{13}
\]
where $L(\Theta^\star)$ denotes the global minimum of problem (1).

Remark 1. It is non-trivial to find an unbiased estimator for the local stochastic gradient $g_k(y_k^r; S)$. This is because after each synchronization step, each agent $k$ performs $Q$ deterministic steps based on the same data set $S$, while fixing all the remaining variable blocks at $\Theta_{-k}^{r_0}$. This is significantly different from FedAvg-type algorithms, where at each iteration a new mini-batch is sampled at each node.

Remark 2. If we pick $\eta = \frac{1}{\sqrt{T}}$ and $S = Q = \sqrt{T}$, then for any fixed $K$ the convergence speed of the algorithm is $O(\frac{1}{\sqrt{T}})$ (in terms of the speed at which the averaged squared gradient norm shrinks). To the best of our knowledge, this is the first time that such an $O(1/\sqrt{T})$ rate has been proven for any algorithm with multiple local steps designed for the feature-partitioned collaborative learning problem.

Remark 3. Compared with the existing distributed stochastic coordinate descent methods [13–16], our results are different. They show that, despite using stochastic gradients and performing multiple local updates using stale information, only $O(\sqrt{T})$ communication rounds are required (out of a total of $T$ iterations) to achieve an $O(1/\sqrt{T})$ rate. We are not aware of any such guarantee for other distributed coordinate descent methods.

Remark 4. If we consider the impact of the number of nodes $K$ and pick $\eta = \frac{1}{\sqrt{KT}}$ and $S = Q = \sqrt{TK}$, then the convergence speed of the algorithm is $O(\frac{\sqrt{K}}{\sqrt{T}})$. This indicates that the proposed algorithm slows down with the number of parties involved. In practice, however, this factor is mild, since the total number of parties involved in a feature-partitioned collaborative learning problem is usually not large.
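For concreteness, substituting the parameter choice of Remark 2 into the bound of Theorem 2 verifies the claimed rate (a quick check of the arithmetic, not part of the original derivation): with $\eta = 1/\sqrt{T}$ and $S = Q = \sqrt{T}$,
\[
\frac{2}{\eta T} = \frac{2}{\sqrt{T}}, \qquad 2\eta^2(K+3)Q^2 \sum_{k=1}^K L_k^2\frac{\sigma^2}{S} = \frac{2(K+3)\sigma^2\sum_{k=1}^K L_k^2}{\sqrt{T}}, \qquad 2K\frac{\sigma^2}{S} = \frac{2K\sigma^2}{\sqrt{T}},
\]
so every term on the right-hand side of (13) is $O(1/\sqrt{T})$ for fixed $K$.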

8 Experiments

Datasets and Models

MIMIC-III. MIMIC-III (Medical Information Mart for Intensive Care) [17] is a large database comprising information related to patients admitted to critical care units at a large tertiary care hospital, including vital signs, medications, survival data, and more. Following the data processing procedures of [31], we compile a subset of the MIMIC-III database containing more than 31 million clinical events that correspond to 17 clinical variables, and obtain final training and test sets of 17,903 and 3,236 ICU stays, respectively. For each variable we compute six different sample statistics on seven different subsequences of a given time series, obtaining 17 × 7 × 6 = 714 features. We focus on the in-hospital mortality prediction task based on the first 48 hours of an ICU stay, with the area under the receiver operating characteristic curve (AUC-ROC) as the main metric. We partition each sample vertically by its clinical features. In practice, the clinical variables may come from different hospitals, or from different departments of the same hospital, which cannot share their data due to patient privacy. This task is referred to as MIMIC-LR.

MNIST. We partition each MNIST [18] image of shape 28 × 28 × 1 vertically into two parts (each of shape 28 × 14 × 1). Each party uses a local CNN model to learn a feature representation from its raw image input. The local CNN model for each party consists of two 3×3 convolution layers, each with 64 channels, followed by a fully connected layer with 256 units. The two feature representations produced by the two local CNN models are then fed into a logistic regression model with 512 parameters for a binary classification task. We refer to this task as MNIST-CNN.
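A small NumPy sketch of the vertical split used for MNIST-CNN (the split itself, not the CNN); shapes follow the description above, and the placeholder batch is an assumption.

```python
import numpy as np

def vertical_split(images):
    """Split a batch of 28x28 MNIST images into two 28x14 halves,
    one per party, mirroring the MNIST-CNN setup described above."""
    assert images.shape[1:] == (28, 28)
    left, right = images[:, :, :14], images[:, :, 14:]
    return left, right

batch = np.zeros((32, 28, 28), dtype=np.float32)  # placeholder batch
part_a, part_b = vertical_split(batch)
print(part_a.shape, part_b.shape)  # (32, 28, 14) (32, 28, 14)
```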

Figure 2: Comparison of AUC and training loss on MIMIC-LR (a, b) and MNIST-CNN (c, d) with varying local iterations, denoted by Q.

Algo.       MIMIC-LR (AUC 84%)      MNIST-CNN (AUC 99.7%)
            Q      rounds           Q      rounds
FedSGD      1      334              1      46
FedBCD-p    5      71               3      16
            50     52               5      8
FedBCD-s    1      407              1      48
            5      74               3      15
            50     52               5      9

Table 1: Number of communication rounds to reach a target AUC-ROC for FedBCD-p, FedBCD-s and FedSGD on MIMIC-LR and MNIST-CNN, respectively.

NUS-WIDE. The NUS-WIDE dataset [19] consists of 634 low-level image features extracted from Flickr images, as well as their associated tags and ground-truth labels. We put the 634 low-level image features on party B and 1000 textual tag features with ground-truth labels on party A. The objective is to perform a federated transfer learning (FTL) task as studied in [10]. FTL aims to predict labels for unlabeled images of party B through transfer learning from A to B. Each party utilizes a neural network with one hidden layer of 64 units to learn a feature representation from its raw inputs. The feature representations of both sides are then fed into the final federated layer to perform federated transfer learning. This task is referred to as NUS-FTL.

Default-Credit. The Default-Credit dataset consists of credit card records including user demographics, history of payments and bill statements, with a user's default payment as the label. We separate the features such that the demographic features are on one side, separated from the payment and balance features. This segregation is common in industry applications, such as retail and car rental businesses leveraging banking data for user credibility prediction and customer segmentation. In our experiments, party A has the labels and 18 features, including six months of payment and bill balance data, whereas party B has 15 features, including user profile data such as education and marriage status. We perform an FTL task as described above, but with homomorphic encryption applied. We refer to this task as Credit-FTL.

For all experiments, we adopt a decaying learning rate strategy with $\eta_t = \eta_0/\sqrt{t+1}$, where $\eta_0$ is optimized for each experiment. We fix the batch size to 64 and 256 for MIMIC-LR and MNIST-CNN, respectively.

Results and Discussion

FedBCD-p vs FedBCD-s. We first study the impact of varying local iterations on the communication efficiency of both the FedBCD-p and FedBCD-s algorithms based on MIMIC-LR and MNIST-CNN (Figure 2). We observe similar convergence for FedBCD-s and FedBCD-p for various values of Q. However, for the same number of communication rounds, the running time of FedBCD-s is double that of FedBCD-p due to its sequential execution. As the number of local iterations increases, the required number of communication rounds reduces dramatically (Table 1). Therefore, by reasonably increasing the number of local iterations, we can take advantage of the parallelism among participants and save overall communication cost by reducing the total number of communication rounds required.

Impact of Q. Theorem 2 suggests that as Q grows, the required number of communication rounds may first decrease and then increase again, and that eventually the algorithm may not converge to the optimal solution. To further investigate the relationship between the convergence rate and the local iteration count Q, we evaluate the FedBCD-p algorithm on NUS-FTL over a large range of Q. The results are shown in Figure 3 and Figure 4(a), which illustrate that FedBCD-p achieves the best AUC with the least number of communication rounds when Q = 15. For each target AUC, there exists an optimal Q. This demonstrates that one needs to carefully select Q to achieve the best communication efficiency, as suggested by Theorem 2.

Figure 3: Comparison of AUC (left) and training loss (right) on the NUS-WIDE dataset with varying local iterations, denoted by Q.

Figure 4: (a) The relationship between communication rounds and varying local iterations, denoted by Q, for three target AUCs; (b) the comparison between FedBCD-p and FedPBCD-p for large local iterations.

Figure 4(b) shows that for very large local iteration counts, Q = 25, 50 and 100, FedBCD-p cannot converge to the AUC of 83.7%. This phenomenon is also supported by Theorem 2: if Q is too large, the right-hand side of (13) may not go to zero. Next, we address this issue by making the algorithm less sensitive to the choice of Q.

Proximal Gradient Descent. [32] proposed adding a proximal term to the local objective function to alleviate potential divergence when the local iteration count is large. Here, we apply this idea to our scenario. We rewrite (12) as follows:

\[
g_k(y_k^r; \mathcal{D}_i) = g_k([\Theta_{-k}^{r_0}, \theta_k^r]; \mathcal{D}_i) + \mu(\theta_k^r - \theta_k^{r_0}) \tag{15}
\]

where $\mu(\theta_k^r - \theta_k^{r_0})$ is the gradient of the proximal term $\frac{\mu}{2}\|\theta_k^r - \theta_k^{r_0}\|^2$, which exploits the initial model $\theta_k^{r_0}$ of party $k$ to limit the impact of local updates by restricting the locally updated model to stay close to $\theta_k^{r_0}$. We denote the proximal version of FedBCD-p by FedPBCD-p. We then apply FedPBCD-p with $\mu = 0.1$ to NUS-FTL for $Q = 25$, 50 and 100, respectively. Figure 4(b) illustrates that when $Q$ is too large, FedBCD-p fails to converge to optimal solutions, whereas FedPBCD-p converges faster and is able to reach a higher test AUC than the corresponding FedBCD-p does.

Increasing the number of parties. In this section, we increase the number of parties to five and seventeen and conduct experiments on the MIMIC-LR task. We partition the data by clinical variables, with each party holding all the features related to the same variable. We adopt a decaying learning rate strategy with $\eta_0/\sqrt{TK}$ according to Theorem 2. The results are shown in Figure 5. The proposed method still performs well when we increase the local iterations for multiple parties. As we increase the number of parties to five and seventeen, FedBCD-p is slightly slower than the two-party case, but the impact of the number of parties $K$ is very mild, which verifies the theoretical analysis in Remark 4.
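Returning to the proximal update (15), the following is a minimal sketch of the FedPBCD-p local step; `grad_fn` stands in for the stale partial gradient $g_k([\Theta_{-k}^{r_0}, \theta_k^r]; S)$ and, like all names here, is an assumption of the sketch.

```python
import numpy as np

def proximal_local_updates(theta_k, grad_fn, eta=0.1, mu=0.1, Q=50):
    """Q local steps with the proximal correction of equation (15):
    the extra mu * (theta_k - theta_sync) term pulls each local step
    back toward the model theta_sync held at the last synchronization."""
    theta_sync = theta_k.copy()               # theta at iteration r0
    for _ in range(Q):
        g = grad_fn(theta_k)                  # stale partial gradient (12)
        theta_k = theta_k - eta * (g + mu * (theta_k - theta_sync))
    return theta_k

# illustrative usage with a quadratic stand-in for the stale gradient
target = np.ones(3)
theta = proximal_local_updates(np.zeros(3), lambda t: t - target)
```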

Figure 5: Comparison of AUC on the MIMIC-III dataset with varying local iterations (denoted by Q) and number of parties (denoted by K). (Left) FedBCD-p; (Right) FedBCD-s.

Implementation with HE. In this section, we investigate the efficiency of the FedBCD-p algorithm with homomorphic encryption (HE) applied. Using HE to protect the transmitted information ensures higher security, but it is extremely computationally expensive to perform computations on encrypted data. In such a scenario, carefully selecting Q may reduce communication rounds but may also introduce computational overhead, because the total number of local iterations (Q × the number of communication rounds) may increase. We integrated the FedBCD-p algorithm into the current FTL implementation on FATE1 and simulated two-party learning on two machines with Intel Xeon Gold CPUs with 20 cores, 80 GB memory and 1 TB hard disk each. The experimental results are summarized in Table 2. FedBCD-p with larger Q costs fewer communication rounds and less total training time to reach a given AUC, with a mild increase in computation time but a more than 70 percent reduction in communication rounds from FedSGD to Q = 10.

Credit-FTL
AUC   Algo.       Q    R    comp.   comm.   total
70%   FedSGD      1    17   11.33   11.34   22.67
      FedBCD-p    5    4    13.40   2.94    16.34
      FedBCD-p    10   2    10.87   2.74    13.61
75%   FedSGD      1    30   20.50   20.10   40.60
      FedBCD-p    5    8    26.78   5.57    32.35
      FedBCD-p    10   4    23.73   2.93    26.66
80%   FedSGD      1    46   32.20   30.69   62.89
      FedBCD-p    5    13   43.52   9.05    52.57
      FedBCD-p    10   7    41.53   5.12    46.65

Table 2: Number of communication rounds and training time to reach target AUCs of 70%, 75% and 80%, respectively, for FedSGD versus FedBCD-p. R, comp. and comm. denote communication rounds, computation time (mins) and communication time (mins), respectively.

9 Conclusions and Future Work

In this paper, we propose a new collaborative learning framework for distributed features based on block coordinate gradient descent, in which parties perform more than one local gradient update before each communication. Our approach significantly reduces the number of communication rounds and the total communication overhead. We theoretically prove that the algorithm achieves global convergence with a decaying learning rate and a proper choice of Q. The approach is supported by our extensive experimental evaluations. We also show that adding a proximal term can further enhance convergence at large values of Q. Future work may include investigating and further improving the communication efficiency of such approaches for more complex and asynchronous collaborative systems.

1 https://github.com/FederatedAI/FATE

References

[1] EU. Regulation (EU) 2016/679 of the European Parliament and of the Council on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Available at: https://eur-lex.europa.eu/legal-content/EN/TXT, 2016.

[2] Dean, J., G. Corrado, R. Monga, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231. 2012.

[3] McMahan, H. B., E. Moore, D. Ramage, et al. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.

[4] Yang, Q., Y. Liu, T. Chen, et al. Federated machine learning: Concept and applications. CoRR, abs/1902.04885, 2019.

[5] Gratton, C., V. D., R. Arablouei, et al. Distributed ridge regression with feature partitioning. pages 1423–1427. 2018.

[6] Hu, Y., D. Niu, J. Yang, et al. Fdml: A collaborative machine learning framework for distributed features. In KDD. 2019.

[7] Hu, Y., P. Liu, L. Kong, et al. Learning privately over distributed features: An ADMM sharing approach. CoRR, abs/1907.07735, 2019.

[8] Hardy, S., W. Henecka, H. Ivey-Law, et al. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. CoRR, abs/1711.10677, 2017.

[9] Cheng, K., T. Fan, Y. Jin, et al. Secureboost: A lossless federated learning framework. CoRR, abs/1901.08755, 2019.

[10] Liu, Y., T. Chen, Q. Yang. Secure federated transfer learning. CoRR, abs/1812.03337, 2018.

[11] Rivest, R. L., L. Adleman, M. L. Dertouzos. On data banks and privacy homomorphisms. Foundations of Secure Computation, Academia Press, pages 169–179, 1978.

[12] Yao, A. C. Protocols for secure computations. In SFCS, pages 160–164. 1982.

[13] Richtárik, P., M. Takáč. Distributed coordinate descent method for learning with big data. J. Mach. Learn. Res., 17(1):2657–2681, 2016.

[14] Mahajan, D., S. S. Keerthi, S. Sundararajan. A distributed block coordinate descent method for training l1-regularized linear classifiers. J. Mach. Learn. Res., 18(1):3167–3201, 2017.

[15] Peng, Z., Y. Xu, M. Yan, et al. Arock: an algorithmic framework for asynchronous parallel coordinate updates. 2015. Preprint, arXiv:1506.02396.

[16] Niu, F., B. Recht, C. Ré, et al. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. 2011. Online at arXiv:1106.5730v2; Hong, M. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach.

[17] Johnson, A. E., T. J. Pollard, L. Shen, et al. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.

[18] LeCun, Y., C. Cortes. MNIST handwritten digit database. 2010.

[19] Chua, T.-S., J. Tang, R. Hong, et al. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR. 2009.

[20] Shokri, R., V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1310–1321. ACM, New York, NY, USA, 2015.

[21] Yu, H., S. Yang, S. Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pages 5693–5700. 2019.

[22] Li, X., K. Huang, W. Yang, et al. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019.

[23] Ghadimi, S., G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[24] Zhu, L., Z. Liu, S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems 32, pages 14774–14784. Curran Associates, Inc., 2019.

[25] Hitaj, B., G. Ateniese, F. Perez-Cruz. Deep models under the GAN: Information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pages 603–618. ACM, New York, NY, USA, 2017.

[26] Melis, L., C. Song, E. D. Cristofaro, et al. Exploiting unintended feature leakage in collaborative learning. 2019 IEEE Symposium on Security and Privacy (SP), pages 691–706, 2018.

[27] Li, Q., Z. Wen, B. He. Practical federated gradient boosting decision trees. In 34th AAAI Conference on Artificial Intelligence (AAAI-20). 2020.

[28] Du, W., Y. Han, S. Chen. Privacy-preserving multivariate statistical analysis: Linear regression and classification. In M. Berry, U. Dayal, C. Kamath, D. Skillicorn, eds., SIAM Proceedings Series, pages 222–233. 2004.

[29] Vaidya, J., C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 639–644. ACM, New York, NY, USA, 2002.

[30] Mangasarian, O. L., E. W. Wild, G. M. Fung. Privacy-preserving classification of vertically partitioned data via random kernels. ACM Trans. Knowl. Discov. Data, 2(3):12:1–12:16, 2008.

[31] Harutyunyan, H., H. Khachatrian, D. C. Kale, et al. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.

[32] Li, T., A. K. Sahu, M. Zaheer, et al. Federated optimization for heterogeneous networks. arXiv preprint arXiv:1812.06127, 2019.

Supplementary Materials

A Proof of Theorem 1

Lemma 1. For any vector $\theta^0 \in \mathbb{R}^{d_k}$, there exist infinitely many non-identity orthogonal matrices $U \in \mathbb{R}^{d_k\times d_k}$ such that
\[
UU^T = I \tag{16}
\]
\[
U\theta^0 = \theta^0 \tag{17}
\]

Proof. First we construct an orthogonal $U_1$ satisfying (16) and (17) for
\[
\theta^0 := e_1 = (1, 0, \cdots, 0)^T \in \mathbb{R}^{d_k}. \tag{18}
\]
Then we complete the proof by generalizing the construction to an arbitrary $\theta^0 \in \mathbb{R}^{d_k}$. With $\theta^0 = e_1$, we construct $U_1$ in the following way:
\[
U_1 := \begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix} \tag{19}
\]

where $V \in \mathbb{R}^{(d_k-1)\times(d_k-1)}$ is any non-identity orthogonal matrix with $d_k > 2$, i.e.,
\[
VV^T = I. \tag{20}
\]
Condition (16) is satisfied since
\[
U_1 U_1^T = \begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & V^T \end{pmatrix} \tag{21}
\]
\[
= \begin{pmatrix} 1 & 0 \\ 0 & VV^T \end{pmatrix} \tag{22}
\]
\[
= I, \tag{23}
\]
and condition (17) is satisfied trivially, i.e.,
\[
U_1 e_1 = [1, 0, \cdots, 0]^T = e_1. \tag{24}
\]
For an arbitrary $\theta^0$, we apply a Householder transformation to "rotate" it to the basis vector $e_1$, i.e.,
\[
\theta^0 = \|\theta^0\|_2\, P e_1 \tag{25}
\]
where $P$ is the Householder transformation operator, which satisfies
\[
P = P^T \tag{26}
\]
\[
PP^T = PP = I. \tag{27}
\]

Therefore from U1 defined in (19) we can construct U by

\[
U = P U_1 P. \tag{28}
\]
Finally, we verify that $U$ satisfies conditions (16) and (17):
\[
UU^T = P U_1 P (P U_1 P)^T \tag{29}
\]
\[
= P U_1 (P P^T) U_1^T P \tag{30}
\]
\[
= P U_1 U_1^T P \tag{31}
\]
\[
= PP = I \tag{32}
\]
\[
U\theta^0 = P U_1 P \theta^0 \tag{33}
\]
\[
= \|\theta^0\|_2\, P (U_1 e_1) \tag{34}
\]
\[
= \|\theta^0\|_2\, P e_1 \tag{35}
\]
\[
= \theta^0 \tag{36}
\]
where (34) holds since from (25) we have
\[
P\theta^0 = \|\theta^0\|_2\, e_1. \tag{37}
\]

A.1 Proof of Theorem 1

Proof: We first show that the conclusion holds for the case $k < K$. With initial weight $\theta_k^0 \in \mathbb{R}^{d_k}$, Lemma 1 shows that we can find infinitely many non-identity matrices $U \in \mathbb{R}^{d_k\times d_k}$ such that
\[
\theta_k^0 = U^T \theta_k^0. \tag{38}
\]
Let $x_{i_j}^k$ denote the $i$th sample of the data set $S_j$ sampled at the $j$th iteration. We show that for any $\{x_{i_j}^k\}_{i\in S_j,\, j=0,1,\cdots}$ that yields the observations $\{H_{i_j}^k\}_{j=0,1,\cdots}$, we can construct another set of data
\[
\tilde x_{i_j}^k := x_{i_j}^k U \tag{39}
\]
where $U$ is chosen to satisfy condition (38). Let $\{\tilde H_{i_j}^k\}$ be the observations generated by $\{\tilde x_{i_j}^k\}$, and $\{\tilde\theta_k^j\}$ the weight variables with
\[
\tilde\theta_k^0 = U^T \theta_k^0. \tag{40}
\]
We show in the following that for $j = 0, 1, \cdots$,
\[
\tilde H_{i_j}^k = H_{i_j}^k \tag{41}
\]
\[
\tilde\theta_k^j = U^T \theta_k^j. \tag{42}
\]
It is easy to verify (41) for $j = 0$, i.e.,
\[
H_{i_0}^k = x_{i_0}^k U U^T \theta_k^0 = (x_{i_0}^k U)(U^T \theta_k^0) = \tilde x_{i_0}^k \tilde\theta_k^0 = \tilde H_{i_0}^k.
\]
From equation (11), we define
\[
g_{i_j} := g(H_{i_j}, y_{i_j}^K). \tag{43}
\]
Now assume that conditions (41) and (42) hold for $j \le J$. Then

\[
g_{i_j} = \tilde g_{i_j} \tag{44}
\]
\[
\tilde\theta_k^j = U^T \theta_k^j. \tag{45}
\]
We first show that (42) holds for $j = J+1$:
\[
\tilde\theta_k^{J+1} \tag{46}
\]
\[
= \tilde\theta_k^J - \eta\Big(\frac{1}{N_{S_J}}\sum_{i\in S_J} \tilde g_{i_J}(\tilde x_{i_J}^k)^T + \lambda\tilde\theta_k^J\Big) \tag{47}
\]
\[
= U^T\theta_k^J - \eta\Big(\frac{1}{N_{S_J}}\sum_{i\in S_J} g_{i_J}(x_{i_J}^k U)^T + \lambda U^T\theta_k^J\Big) \tag{48}
\]
\[
= U^T\Big(\theta_k^J - \eta\Big(\frac{1}{N_{S_J}}\sum_{i\in S_J} g_{i_J}(x_{i_J}^k)^T + \lambda\theta_k^J\Big)\Big) \tag{49}
\]
\[
= U^T\theta_k^{J+1} \tag{50}
\]
where (48) follows from (44) and (45). Note that if $Q$ local updates are performed, locally we have
\[
\theta_k^{j,q+1} - \theta_k^{j,q} = (1 - \eta\lambda)(\theta_k^{j,q} - \theta_k^{j,q-1}) \tag{51}
\]
where $\theta_k^{j,q}$ denotes the $q$th local update of the $j$th iteration. It is thus easy to show that
\[
\tilde\theta_k^{J+1,q} = U^T\theta_k^{J+1,q}. \tag{52}
\]
Next we show that (41) holds for $j = J+1$:
\[
\tilde H_{i_{J+1}}^k = \tilde x_{i_{J+1}}^k \tilde\theta_k^{J+1} \tag{53}
\]
\[
= x_{i_{J+1}}^k U U^T \theta_k^{J+1} \tag{54}
\]
\[
= x_{i_{J+1}}^k \theta_k^{J+1} \tag{55}
\]
\[
= H_{i_{J+1}}^k. \tag{56}
\]

Finally, we complete the proof by showing that the conclusion holds for $k = K$. Similarly, for any $\{x_{i_j}^K\}_{j=0,1,\cdots}$ and $\{y_{i_j}^K\}$, we construct a different solution as follows:
\[
\tilde x_{i_j}^K := x_{i_j}^K U \tag{57}
\]
\[
\tilde y_{i_j}^K := y_{i_j}^K, \tag{58}
\]
where $U \in \mathbb{R}^{d_K\times d_K}$ is a non-identity orthogonal matrix satisfying (39) for $k = K$. Therefore we have
\[
H_{i_j}^K = g(H_{i_j}, y_{i_j}^K) \tag{59}
\]
\[
= g(\tilde H_{i_j}, \tilde y_{i_j}^K) \tag{60}
\]
\[
= \tilde H_{i_j}^K. \tag{61}
\]
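A numerical sanity check of the construction used in Lemma 1 and the proof above (assuming the standard Householder reflector form $P = I - 2vv^T/(v^Tv)$; dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
theta = rng.normal(size=d)

# Householder reflector P with P theta = ||theta||_2 e1   (cf. (25), (37))
e1 = np.zeros(d)
e1[0] = 1.0
v = theta / np.linalg.norm(theta) - e1
P = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)

# U1 = diag(1, V) with V a non-identity orthogonal block  (cf. (19))
V, _ = np.linalg.qr(rng.normal(size=(d - 1, d - 1)))
U1 = np.eye(d)
U1[1:, 1:] = V
U = P @ U1 @ P                                            # (28)

assert np.allclose(U @ U.T, np.eye(d))                    # condition (16)
assert np.allclose(U @ theta, theta)                      # condition (17)

# indistinguishability: x~ = x U with theta~ = U^T theta gives the same H
x = rng.normal(size=(5, d))
assert np.allclose((x @ U) @ (U.T @ theta), x @ theta)
```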

B Proof of Theorem 2

To begin with, let us first introduce some notations.

Notations. For notational simplicity, we divide the total data set $[N]$ into $S_1, \ldots, S_B$, each with $|S_i| = S$. Let $r$ denote the iteration index, where in each iteration one round of local update is performed among all the nodes. Let
\[
\Theta^r = [\theta_1^r; \ldots; \theta_K^r] \triangleq [y_{1,1}^r; \ldots; y_{K,K}^r]
\]
denote the updated parameters at each iteration of each node. Note that $\{\Theta^r\}$ is only a sequence of "virtual" variables that helps the proof; it is never explicitly formed in the algorithm.

Let $r_0$ denote the latest iteration before $r$ with global communication, so $\Theta^{r_0} = y_k^{r_0}$, $k = 1, \ldots, K$, and $y_k^r = [\Theta_{-k}^{r_0}, \theta_k^r]$. We denote the full gradient of the loss function as $\nabla L(\Theta)$, and the stochastic gradient as $G \triangleq [g_1(y_1; S); \ldots; g_K(y_K; S)]$. Using the above notation, the update rule becomes $\Theta^{r+1} = \Theta^r - \eta G^r$. We also need the following important definition. Let us denote the variables updated in the inner loops, using mini-batch $S_l$, $l \in [B]$, as $\Theta_{[l]}^r$. That is, we have the following:

\[
\Theta_{[l],k}^{r_0+1} \triangleq \Theta_k^{r_0} - \eta\, g_k(\Theta^{r_0}; S_l),
\]

\[
\Theta_{[l],k}^{r_0+\tau} \triangleq \Theta_{[l],k}^{r_0+\tau-1} - \eta\, g_k(y_{[l],k}^{r_0+\tau-1}; S_l),
\]
where
\[
y_{[l],k}^r \triangleq [\Theta_{-k}^{r_0}, \theta_{[l],k}^r].
\]
Further define
\[
\bar y_k^r \triangleq [y_{[1],k}^r; \cdots; y_{[B],k}^r]. \tag{62}
\]

Note that if $S_i$ is sampled at $r_0$, then the quantities $\Theta_{[l],k}^{r_0+\tau}$, $l \neq i$, are not realized in the algorithm; they are introduced only for analytical purposes. Using these notations, Algorithm 2 can be written in the compact form given in Algorithm 4. Further, the uniform sampling assumption A2 implies the following:

A2': Unbiased Gradient with Bounded Variance. Suppose Assumption A2 holds. Then we have the following:
\[
\mathbb{E}[g_k^r(y_k^r; S_i) \mid \{y_k^t\}_{t=1}^{r_0}] = \frac{1}{B}\sum_{\ell=1}^B g_k^r(y_{[\ell],k}^r; S_\ell) \triangleq \nabla_k \bar L(\bar y_k^r)
\]
where the expectation is taken over the choice of $i$ at iteration $r_0$, conditioned on all past histories of the algorithm up to iteration $r_0$. Further, the following holds:
\[
\mathbb{E}[\|g_k^r(y_k^r; S_i) - \nabla_k \bar L(\bar y_k^r)\|^2 \mid \{y_k^t\}_{t=1}^{r_0}] \le \frac{\sigma^2}{S}
\]

Algorithm 4 Federated Stochastic BCD Algorithm
Input: $\Theta^{(0)}$, $\eta$, $Q$
Initialize: $y_k^0 \triangleq \Theta^{(0)}$, $k = 1, \ldots, K$
for r = 0, 1, ... do
  if r mod Q = 0 then (global communication)
    $y_k^r \triangleq [(y_{1,1}^r)^T, \ldots, (y_{K,K}^r)^T]^T$;
    Randomly sample a mini-batch $S_i$ for nodes $1, \ldots, K$;
  end
  Calculate $g_k(y_k^r; S_i)$ at each node using $S_i$;
  for k = 1, ..., K do
    Local updates:
    $y_{k,j}^{r+1} \triangleq y_{k,j}^r$ for $j = 1, \ldots, K$, $j \neq k$;   $y_{k,k}^{r+1} \triangleq y_{k,k}^r - \eta\, g_k(y_k^r; S_i)$;
  end
end

where the expectation is taken over the random selection of the data point at iteration $r_0$, and $\sigma^2$ is the variance introduced by sampling from a single data point. To begin our proof, we first present a lemma that bounds the difference between the gradients evaluated at the global vector $\Theta^r$ and at the local vector $y_k^r$.

Lemma 2 (Bounded difference of gradients). Under Assumptions A1, A2, the difference between the partial gradients evaluated at $\Theta^r$ and at $y_k^r$ is bounded by the following:

\[
\sum_{k=1}^K \mathbb{E}\|\nabla_k L(\Theta^r) - g_k(y_k^r; S)\|^2 \le 2(K+3)\eta^2 Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + 2\eta^2 Q \sum_{k=1}^K \Big(\sum_{j=1}^K L_j^2 + 3L_k^2\Big)\sum_{\tau=r-Q}^{r-1} \mathbb{E}\|\nabla_k \bar L(\bar y_k^\tau)\|^2 + 2K\frac{\sigma^2}{S},
\]
where $\sigma$ is a constant related to the variance of the sampling process.

B.1 Proof of Lemma 2

Proof. First we take the expectation and apply Assumptions A1 and A2:

\[
\begin{aligned}
\mathbb{E}[\|\nabla_k L(\Theta^r) - g_k(y_k^r; S_i)\|^2]
&= \mathbb{E}\big[\mathbb{E}_{S_i}[\|\nabla_k L(\Theta_{[i]}^r) - g_k(y_k^r; S_i)\|^2 \mid \Theta^{r_0}]\big] \\
&\le 2\mathbb{E}\big[\mathbb{E}_{S_i}[\|\nabla_k L(\Theta_{[i]}^r) - \nabla_k \bar L(\bar y_k^r)\|^2 \mid \Theta^{r_0}] + \mathbb{E}_{S_i}[\|\nabla_k \bar L(\bar y_k^r) - g_k(y_k^r; S_i)\|^2 \mid \Theta^{r_0}]\big] \\
&\le 2\mathbb{E}\Big[L_k^2\, \mathbb{E}_{S_i}\Big[\frac{1}{B}\sum_{l=1}^B \|\Theta_{[i]}^r - y_{[l],k}^r\|^2 \,\Big|\, \Theta^{r_0}\Big] + \mathbb{E}_{S_i}[\|\nabla_k \bar L(\bar y_k^r) - g_k(y_k^r; S_i)\|^2 \mid \Theta^{r_0}]\Big] \\
&\le 2\mathbb{E}\Big[L_k^2\, \mathbb{E}_{S_i}\Big[\frac{1}{B}\sum_{l=1}^B \|\Theta_{[i]}^r - y_{[l],k}^r\|^2 \,\Big|\, \Theta^{r_0}\Big] + \frac{2\sigma^2}{S}\Big]
\end{aligned} \tag{63}
\]

Note that
\[
\begin{aligned}
\|\Theta_{[i]}^r - y_{[l],k}^r\|^2 &= \sum_{j=1}^K \|\theta_{[i],j}^r - y_{[l],k,j}^r\|^2 \\
&= \sum_{j\neq k} \|\theta_{[i],j}^r - \theta_j^{r_0}\|^2 + \|\theta_{[i],k}^r - \theta_{[l],k}^r\|^2 \\
&= \sum_{j\neq k} \Big\|\theta_j^{r_0} - \eta\sum_{\tau=r_0}^{r-1} g_j^\tau(y_{[i],j}^\tau; S_i) - \theta_j^{r_0}\Big\|^2 + \Big\|\eta\sum_{\tau=r_0}^{r-1} g_k^\tau(y_{[i],k}^\tau; S_i) - \eta\sum_{\tau=r_0}^{r-1} g_k^\tau(y_{[l],k}^\tau; S_l)\Big\|^2 \\
&\le \eta^2\Big(\sum_{j\neq k} \Big\|\sum_{\tau=r_0}^{r-1} g_j^\tau(y_{[i],j}^\tau; S_i)\Big\|^2 + 2\Big\|\sum_{\tau=r_0}^{r-1} g_k^\tau(y_{[l],k}^\tau; S_l)\Big\|^2 + 2\Big\|\sum_{\tau=r_0}^{r-1} g_k^\tau(y_{[i],k}^\tau; S_i)\Big\|^2\Big) \\
&\le \eta^2 (r - r_0 - 1)\sum_{\tau=r_0}^{r-1} \Big(\sum_{j=1}^K \|g_j^\tau(y_{[i],j}^\tau; S_i)\|^2 + 2\|g_k^\tau(y_{[l],k}^\tau; S_l)\|^2 + \|g_k^\tau(y_{[i],k}^\tau; S_i)\|^2\Big).
\end{aligned} \tag{64}
\]

Notice that $r - r_0 - 1 \le Q$; substituting (64) into (63) and taking the expectation over $S_i$, we obtain

\[
\begin{aligned}
&\mathbb{E}[\|\nabla_k L(\Theta^r) - g_k(y_k^r; S_i)\|^2] \\
&\le 2L_k^2\eta^2 (r-r_0-1)\, \mathbb{E}\Big[\mathbb{E}_{S_i}\Big[\frac{1}{B}\sum_{l=1}^B \sum_{\tau=r_0}^{r-1}\Big(\sum_{j=1}^K \|g_j^\tau(y_{[i],j}^\tau; S_i)\|^2 + 2\|g_k^\tau(y_{[l],k}^\tau; S_l)\|^2 + \|g_k^\tau(y_{[i],k}^\tau; S_i)\|^2\Big)\Big]\Big] + \frac{2\sigma^2}{S} \\
&\overset{(i)}{\le} 2L_k^2\eta^2 Q\, \mathbb{E}\Big[\mathbb{E}_{S_i}\Big[\sum_{\tau=r_0}^{r-1}\Big(\sum_{j=1}^K \|g_j^\tau(y_{[i],j}^\tau; S_i)\|^2 + 2\|\nabla_k \bar L(\bar y_k^\tau)\|^2 + \frac{2\sigma^2}{S} + \|g_k^\tau(y_{[i],k}^\tau; S_i)\|^2\Big)\Big]\Big] + \frac{2\sigma^2}{S} \\
&\overset{(ii)}{\le} 2L_k^2\eta^2 Q\, \mathbb{E}\Big[\sum_{\tau=r_0}^{r-1}\Big(\sum_{j=1}^K \Big(\|\nabla_k \bar L(\bar y_k^\tau)\|^2 + \frac{\sigma^2}{S}\Big) + 3\|\nabla_k \bar L(\bar y_k^\tau)\|^2 + \frac{3\sigma^2}{S}\Big)\Big] + \frac{2\sigma^2}{S} \tag{65} \\
&\le 2L_k^2\eta^2 Q\, \mathbb{E}\Big[\sum_{\tau=r-Q}^{r-1}\Big(\sum_{j=1}^K \|\nabla_k \bar L(\bar y_k^\tau)\|^2 + \frac{(K+3)\sigma^2}{S} + 3\|\nabla_k \bar L(\bar y_k^\tau)\|^2\Big)\Big] + \frac{2\sigma^2}{S}
\end{aligned}
\]
where in (i) we used the definition of $\nabla_k \bar L(\bar y_k^r)$; in (ii) we used the fact that $\mathbb{E}[x^2] - (\mathbb{E}[x])^2 = \mathbb{E}[(x - \mathbb{E}[x])^2]$; and the last inequality holds because summing from $r_0$ to $r-1$ is never more than summing from $r-Q$ to $r-1$. Next, summing over $k$, we have

\[
\sum_{k=1}^K \mathbb{E}\|\nabla_k L(\Theta^r) - g_k(y_k^r; S_i)\|^2 \le 2(K+3)\eta^2 Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + 2\eta^2 Q \sum_{k=1}^K \Big(\sum_{j=1}^K L_j^2 + 3L_k^2\Big)\sum_{\tau=r-Q}^{r-1} \mathbb{E}\|\nabla_k \bar L(\bar y_k^\tau)\|^2 + 2K\frac{\sigma^2}{S}. \tag{66}
\]
This completes the proof of this result.

B.2 Proof of Theorem 2

Proof. First, applying the Lipschitz condition of $L$, we have

\[
L(\Theta^{r+1}) \le L(\Theta^r) + \langle \nabla L(\Theta^r),\, \Theta^{r+1} - \Theta^r\rangle + \frac{L}{2}\|\Theta^{r+1} - \Theta^r\|^2. \tag{67}
\]

Next we apply the update step in Algorithm 4 and take the expectation:

\[
\begin{aligned}
\mathbb{E}[L(\Theta^{r+1}) - L(\Theta^r)] &\le -\eta\,\mathbb{E}\langle \nabla L(\Theta^r), G^r\rangle + \frac{L}{2}\eta^2\,\mathbb{E}\|G^r\|^2 \\
&= -\frac{\eta}{2}\mathbb{E}\|\nabla L(\Theta^r)\|^2 - \frac{\eta}{2}\mathbb{E}\|G^r\|^2 + \frac{\eta}{2}\mathbb{E}\|\nabla L(\Theta^r) - G^r\|^2 + \frac{L}{2}\eta^2\,\mathbb{E}\|G^r\|^2 \\
&= -\frac{\eta}{2}\mathbb{E}\|\nabla L(\Theta^r)\|^2 - \frac{\eta}{2}(1 - L\eta)\,\mathbb{E}\|G^r\|^2 + \frac{\eta}{2}\sum_{k=1}^K \mathbb{E}\|\nabla_k L(\Theta^r) - g_k(y_k^r; S_i)\|^2
\end{aligned} \tag{68}
\]

Also we have

\[
\begin{aligned}
\mathbb{E}\|G^r\|^2 &= \mathbb{E}\|\mathbb{E}[G^r]\|^2 + \mathbb{E}\|G^r - \mathbb{E}[G^r]\|^2 \\
&= \sum_{k=1}^K \mathbb{E}\|\nabla_k \bar L(\bar y_k^r)\|^2 + \sum_{k=1}^K \mathbb{E}\|G_k^r - \nabla_k \bar L(\bar y_k^r)\|^2 \\
&\ge \sum_{k=1}^K \mathbb{E}\|\nabla_k \bar L(\bar y_k^r)\|^2. 
\end{aligned} \tag{69}
\]

Substituting (69) into (68), choosing $0 < \eta \le \frac{1}{L}$, and applying Lemma 2, we have

\[
\begin{aligned}
\mathbb{E}[L(\Theta^{r+1}) - L(\Theta^r)]
&\le -\frac{\eta}{2}\mathbb{E}\|\nabla L(\Theta^r)\|^2 - \frac{\eta}{2}(1 - L\eta)\sum_{k=1}^K \mathbb{E}\|\nabla_k \bar L(\bar y_k^r)\|^2 + \frac{\eta}{2}\sum_{k=1}^K \mathbb{E}\|\nabla_k L(\Theta^r) - g_k(y_k^r; S_i)\|^2 \\
&\overset{(66)}{\le} -\frac{\eta}{2}\mathbb{E}\|\nabla L(\Theta^r)\|^2 - \frac{\eta}{2}(1 - L\eta)\sum_{k=1}^K \mathbb{E}\|\nabla_k \bar L(\bar y_k^r)\|^2 \\
&\quad + (K+3)\eta^3 Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + \eta K\frac{\sigma^2}{S} + \eta^3 Q \sum_{k=1}^K \Big(\sum_{j=1}^K L_j^2 + 3L_k^2\Big)\sum_{\tau=r-Q}^{r-1}\mathbb{E}\|\nabla_k \bar L(\bar y_k^\tau)\|^2
\end{aligned} \tag{70}
\]

Averaging over $T$ and reorganizing the terms, letting $\eta$ satisfy

\[
1 - L\eta - 2\eta^2 Q^2 \Big(\sum_{j=1}^K L_j^2 + 3L_k^2\Big) \ge 0
\]
(i.e., $0 < \eta \le \frac{\sqrt{2}}{2Q\sqrt{\sum_{j=1}^K L_j^2 + 3L_k^2}}$), then we obtain
\[
\begin{aligned}
\frac{1}{T}\sum_{r=0}^{T-1}\mathbb{E}\|\nabla L(\Theta^r)\|^2
&\le \frac{2}{\eta T}\mathbb{E}[L(\Theta^0) - L(\Theta^{T+1})] \\
&\quad - \sum_{k=1}^K \Big(1 - L\eta - 2\eta^2 Q^2\Big(\sum_{j=1}^K L_j^2 + 3L_k^2\Big)\Big)\mathbb{E}\|\nabla_k \bar L(\bar y_k^r)\|^2 \\
&\quad + 2(K+3)\eta^2 Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + 2K\frac{\sigma^2}{S} \\
&\le \frac{2}{\eta T}\mathbb{E}[L(\Theta^0) - L(\Theta^{T+1})] + 2(K+3)\eta^2 Q^2 \sum_{k=1}^K L_k^2 \frac{\sigma^2}{S} + 2K\frac{\sigma^2}{S}.
\end{aligned} \tag{71}
\]
The proof is completed.
