The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

RSA: Byzantine-Robust Stochastic Aggregation Methods for Distributed Learning from Heterogeneous Datasets

Liping Li,1 Wei Xu,1 Tianyi Chen,2 Georgios B. Giannakis,2 Qing Ling3
1 Department of Automation, University of Science and Technology of China, Hefei, Anhui, China
2 Digital Technology Center, University of Minnesota, Twin Cities, Minneapolis, Minnesota, USA
3 School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, Guangdong, China

Abstract

In this paper, we propose a class of robust stochastic subgradient methods for distributed learning from heterogeneous datasets in the presence of an unknown number of Byzantine workers. The Byzantine workers, during the learning process, may send arbitrary incorrect messages to the master due to data corruptions, communication failures or malicious attacks, and consequently bias the learned model. The key to the proposed methods is a regularization term incorporated into the objective function so as to robustify the learning task and mitigate the negative effects of Byzantine attacks. The resultant subgradient-based algorithms are termed Byzantine-Robust Stochastic Aggregation methods, justifying our acronym RSA used henceforth. In contrast to most of the existing algorithms, RSA does not rely on the assumption that the data are independent and identically distributed (i.i.d.) on the workers, and hence fits a wider class of applications. Theoretically, we show that: i) RSA converges to a near-optimal solution, with the learning error dependent on the number of Byzantine workers; ii) the convergence rate of RSA under Byzantine attacks is the same as that of the stochastic gradient descent method, which is free of Byzantine attacks. Numerically, experiments on a real dataset corroborate the competitive performance of RSA and a complexity reduction compared to the state-of-the-art alternatives.

Introduction

The past decade has witnessed the proliferation of smart phones and Internet-of-Things (IoT) devices. They generate a huge amount of data every day, from which one can learn models of cyber-physical systems and make decisions to improve the welfare of human beings. Nevertheless, standard machine learning approaches that require centralizing the training data on one machine or in a datacenter may not be suitable for such applications, as data collected from distributed devices and stored at clouds lead to significant privacy risks (Sicari et al. 2015). To alleviate user privacy concerns, a new distributed machine learning framework called federated learning has been proposed by Google and has become popular recently (McMahan and Ramage 2017; Smith et al. 2017). Federated learning allows the training data to be kept locally on the owners' devices. Data samples and computation tasks are distributed across multiple workers, such as Internet-of-Things (IoT) devices in a smart home, which are programmed to collaboratively learn a model. Parallel implementations of popular algorithms, such as stochastic gradient descent (SGD), are applied to learning from the distributed data (Bottou 2010).

However, federated learning still faces two significant challenges: high communication overhead and serious security risk. While several recent approaches have been developed to tackle the communication bottleneck of distributed learning (Li et al. 2014; Liu et al. 2017; Smith et al. 2017; Chen et al. 2018b), the security issue has not been adequately addressed. In federated learning applications, a number of devices may be highly unreliable or even easily compromised by hackers. We call these devices Byzantine workers. In this scenario, the learner lacks secure training ability, which makes it vulnerable to failures, not to mention adversarial attacks (Lynch 1996). For example, SGD, the workhorse of large-scale machine learning, is vulnerable to even one Byzantine worker (Chen, Su, and Xu 2017).

In this context, the present paper studies distributed machine learning under a general Byzantine failure model, where the Byzantine workers can arbitrarily modify the messages transmitted from themselves to the master. Such a model imposes no constraints on the communication failures or attacks. We aim to develop efficient distributed machine learning methods tailored for this setting, with provable performance guarantees.

Related work

Byzantine-robust distributed learning has received increasing attention in recent years. Most of the existing algorithms extend SGD to the Byzantine-robust setting and assume that the data are independent and identically distributed (i.i.d.) on the workers. Under this assumption, stochastic gradients computed by regular workers are presumably distributed around the true gradient, while those sent from the Byzantine workers to the master could be arbitrary. Thus, the master is able to apply robust estimation techniques to aggregate the stochastic gradients. Typical gradient aggregation rules include geometric median (Chen, Su, and Xu 2017), marginal trimmed mean (Yin et al. 2018a; Xie, Koyejo, and Gupta 2018b), dimensional median (Xie, Koyejo, and Gupta 2018a; Alistarh, Allen-Zhu, and Li 2018), etc.

A more sophisticated algorithm termed Krum selects the gradient that has the minimal summation of Euclidean distances from a given number of nearest gradients (Blanchard et al. 2017). Targeting high-dimensional learning, an iterative filtering algorithm is developed in (Su and Xu 2018), which achieves the optimal error rate in the high-dimensional regime. The main disadvantage of these existing algorithms comes from the i.i.d. assumption, which is arguably not the case in federated learning over heterogeneous computing units. Actually, generalizing these algorithms to the non-i.i.d. setting is not straightforward. In addition, some of these algorithms rely on sophisticated gradient selection subroutines, such as those in Krum and geometric median, which incur high computational complexity.

Other related work in this context includes (Yin et al. 2018b), which targets escaping saddle points of nonconvex optimization problems under Byzantine attacks, and (Chen et al. 2018a), which leverages a gradient-coding based algorithm for robust learning. However, the approach in (Chen et al. 2018a) needs to relocate the data points, which is not easy to implement in the federated learning paradigm. Leveraging additional data, (Xie, Koyejo, and Gupta 2018c) studies trustworthy score-based schemes that guarantee efficient learning even when there is only one non-Byzantine worker, but additional data may not always be available in practice. Our algorithms are also related to robust decentralized optimization studied in, e.g., (Ben-Ameur, Bianchi, and Jakubowicz 2016; Xu, Li, and Ling 2018), which consider optimizing a static or dynamic cost function over a decentralized network with unreliable nodes. In contrast, the focus of this work is Byzantine-robust stochastic optimization.

Our contributions

The contributions of this paper are summarized as follows.

c1) We develop a class of robust stochastic methods abbreviated as RSA for distributed learning over heterogeneous datasets and under Byzantine attacks. RSA has several variants, each tailored for an $\ell_p$-norm regularized robustifying objective function.

c2) Performance is rigorously established for the resultant RSA approaches, in terms of the convergence rate as well as the error caused by the Byzantine attacks.

c3) Extensive numerical tests using the MNIST dataset are conducted to corroborate the effectiveness of RSA in terms of both classification accuracy under Byzantine attacks and runtime.

Distributed SGD

We consider a general distributed system consisting of a master and $m$ workers, among which $q$ workers are Byzantine (behaving arbitrarily). The goal is to find the optimizer of the following problem:

$$\min_{\tilde{x} \in \mathbb{R}^d} \; \sum_{i=1}^{m} \mathbb{E}[F(\tilde{x}, \xi_i)] + f_0(\tilde{x}). \quad (1)$$

Here $\tilde{x} \in \mathbb{R}^d$ is the optimization variable, $f_0(\tilde{x})$ is a regularization term, and $F(\tilde{x}, \xi_i)$ is the loss function of worker $i$ with respect to a random variable $\xi_i$. Unlike the previous work, which assumes that the data distributed across the workers are i.i.d., we consider a more practical situation: $\xi_i \sim \mathcal{D}_i$, where $\mathcal{D}_i$ is the data distribution on worker $i$ and could be different from the distributions on other workers.

In the master-worker architecture, at time $k+1$ of the distributed SGD algorithm, every worker $i$ receives the current model $\tilde{x}^k$ from the master, samples a data point from the distribution $\mathcal{D}_i$ with respect to a random variable $\xi_i^k$, and computes the gradient of the local empirical loss $\nabla F(\tilde{x}^k, \xi_i^k)$. Note that this sampling process can be easily generalized to the mini-batch setting, in which every worker samples multiple i.i.d. data points and computes the averaged gradient of the local empirical losses. The master collects and aggregates the gradients sent by the workers, and updates the model. Its update at time $k+1$ is:

$$\tilde{x}^{k+1} = \tilde{x}^k - \alpha^{k+1}\Big(\nabla f_0(\tilde{x}^k) + \sum_{i=1}^{m} \nabla F(\tilde{x}^k, \xi_i^k)\Big) \quad (2)$$

where $\alpha^{k+1}$ is a diminishing learning rate at time $k+1$. The distributed SGD is outlined in Algorithm 1.

Algorithm 1 Distributed SGD
Master:
1: Input: $\tilde{x}^0$, $\alpha^k$. At time $k+1$:
2: Broadcast its current iterate $\tilde{x}^k$ to all workers;
3: Receive all gradients $\nabla F(\tilde{x}^k, \xi_i^k)$ sent by workers;
4: Update the iterate via (2).
Worker i:
1: At time $k+1$:
2: Receive the master's current iterate $\tilde{x}^k$;
3: Compute a local stochastic gradient $\nabla F(\tilde{x}^k, \xi_i^k)$;
4: Send the local stochastic gradient to the master.
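To make the recursion (2) concrete, the following is a minimal NumPy sketch of one master step of Algorithm 1. It is only an illustrative toy: the quadratic per-worker loss, the worker count, and the constant step size are placeholders, and this is not the implementation used in the paper.

```python
import numpy as np

def sgd_master_step(x, worker_grads, grad_f0, lr):
    """Master update (2): x^{k+1} = x^k - lr * (grad f_0(x^k) + sum_i grad_i)."""
    return x - lr * (grad_f0(x) + np.sum(worker_grads, axis=0))

# Toy run: worker i holds data summarized by a target a_i, with loss 0.5*||x - a_i||^2,
# so its local gradient at x is simply x - a_i (exact gradients for simplicity).
rng = np.random.default_rng(0)
d, m = 5, 10
targets = rng.normal(size=(m, d))
x = np.zeros(d)
for k in range(500):
    lr = 0.01                                    # constant step for the toy; the paper uses a diminishing alpha^k
    grads = [x - targets[i] for i in range(m)]
    x = sgd_master_step(x, grads, lambda v: 0.01 * v, lr)  # f_0(x) = 0.005*||x||^2
print(np.round(x - targets.mean(axis=0), 2))     # x ends up near the average of the workers' targets
```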
SGD is vulnerable to Byzantine attacks. While SGD has well-documented performance in conventional large-scale machine learning settings, its performance will significantly degrade in the presence of Byzantine workers (Chen, Su, and Xu 2017). Suppose that some of the workers are Byzantine; they can report arbitrary messages, or strategically send well-designed messages according to the information sent by other workers, so as to bias the learning process. Specifically, if worker $m$ is Byzantine, at time $k+1$ it can choose one of the two following attacks:

a1) sending $\nabla F(\tilde{x}^k, \xi_m^k) = \infty$;

a2) sending $\nabla F(\tilde{x}^k, \xi_m^k) = -\sum_{i=1}^{m-1} \nabla F(\tilde{x}^k, \xi_i^k)$.

In either case, the aggregated gradient $\sum_{i=1}^{m} \nabla F(\tilde{x}^k, \xi_i^k)$ used in the SGD update (2) will be either infinite or null, and thus the learned model $\tilde{x}^k$ will either not converge or converge to an incorrect value. The operation of SGD under Byzantine attacks is illustrated in Figure 1.

Instead of using the simple averaging in (2), robust gradient aggregation rules have been incorporated with SGD in (Blanchard et al. 2017; Chen, Su, and Xu 2017; Xie, Koyejo, and Gupta 2018a; Yin et al. 2018b; 2018a; Xie, Koyejo, and Gupta 2018c). However, in the federated learning setting, these aggregation rules become less effective due to the difficulty of distinguishing the statistical heterogeneity from the Byzantine attacks. In what follows, we develop a counterpart of SGD to address the issue of robust learning from distributed heterogeneous data.
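The effect of attacks a1) and a2) on plain averaging can be reproduced in a couple of lines; the snippet below is only a toy illustration of the vulnerability, with made-up gradient values.

```python
import numpy as np

honest = np.array([[0.9, 1.1], [1.0, 1.0], [1.1, 0.9]])   # regular workers' gradients

# a2) one Byzantine worker sends the negative sum of the others' messages,
#     so the aggregate in update (2) becomes the zero vector and learning stalls.
byz_a2 = -honest.sum(axis=0)
print(honest.sum(axis=0) + byz_a2)        # -> [0. 0.]

# a1) an unbounded message makes the aggregate, and hence the model, diverge.
byz_a1 = np.array([np.inf, np.inf])
print(honest.sum(axis=0) + byz_a1)        # -> [inf inf]
```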

Figure 1: The operation of SGD. There are $m$ workers, $r$ being regular and the rest of $q = m - r$ being Byzantine. The master sends the current iterate to the workers, and the regular workers send back the stochastic gradients. The red devil marks denote the wrong messages that the Byzantine workers send to the master.

RSA for Robust Distributed Learning

Observe that due to the presence of the Byzantine workers, it is meaningless to solve (1), which minimizes the summation of all workers' local expected losses without distinguishing regular and Byzantine workers, because the Byzantine workers can always prevent the learner from accessing their local data and finding the optimal solution. Instead, a less ambitious goal is to find a solution that minimizes the summation of the regular workers' local expected cost functions plus the regularization term:

$$\tilde{x}^* = \arg\min_{\tilde{x} \in \mathbb{R}^d} \; \sum_{i \in \mathcal{R}} \mathbb{E}[F(\tilde{x}, \xi_i)] + f_0(\tilde{x}). \quad (3)$$

Here we denote $\mathcal{B}$ as the set of Byzantine workers and $\mathcal{R}$ as the set of regular workers, with $|\mathcal{B}| = q$ and $|\mathcal{R}| = m - q$. Letting each regular worker $i$ have its local iterate $x_i$ and the master have its local iterate $x_0$, we obtain an equivalent form of (3):

$$\min_{x := [x_i; x_0]} \; \sum_{i \in \mathcal{R}} \mathbb{E}[F(x_i, \xi_i)] + f_0(x_0) \quad (4a)$$
$$\text{s.to} \quad x_0 = x_i, \; \forall i \in \mathcal{R} \quad (4b)$$

where $x := [x_i; x_0] \in \mathbb{R}^{(|\mathcal{R}|+1)d}$ is a vector that stacks the regular workers' local variables $x_i$ and the master's variable $x_0$. The formulation (4) is aligned with the concept of consensus optimization in, e.g., (Shi et al. 2014).

ℓ1-norm RSA

Directly solving (3) or (4) by iteratively updating $\tilde{x}$ or $x$ is impossible since the identities of Byzantine workers are not available to the master. Therefore, we introduce an $\ell_1$-norm regularized form of (4):

$$x^* := \arg\min_{x := [x_i; x_0]} \; \sum_{i \in \mathcal{R}} \big( \mathbb{E}[F(x_i, \xi_i)] + \lambda \|x_i - x_0\|_1 \big) + f_0(x_0) \quad (5)$$

where $\lambda$ is a positive constant. The second term in the cost function (5) is the $\ell_1$-norm penalty, whose minimization forces every $x_i$ to be close to the master's variable $x_0$. We will show next how this relaxed form brings the advantage of robust learning under Byzantine attacks.

In the ideal case that the identities of Byzantine workers are revealed, we can apply a stochastic subgradient method to solve (5). The optimization only involves the regular workers and the master. At time $k+1$, the updates of $x_i^{k+1}$ at regular worker $i$ and $x_0^{k+1}$ at the master are given by:

$$x_i^{k+1} = x_i^k - \alpha^{k+1}\big( \nabla F(x_i^k, \xi_i^k) + \lambda\,\mathrm{sign}(x_i^k - x_0^k) \big) \quad (6a)$$
$$x_0^{k+1} = x_0^k - \alpha^{k+1}\Big( \nabla f_0(x_0^k) + \lambda \sum_{i \in \mathcal{R}} \mathrm{sign}(x_0^k - x_i^k) \Big) \quad (6b)$$

where $\mathrm{sign}(\cdot)$ is the element-wise sign function. Given $a \in \mathbb{R}$, $\mathrm{sign}(a)$ equals $1$ when $a > 0$, $-1$ when $a < 0$, and an arbitrary value within $[-1, 1]$ when $a = 0$. At time $k+1$, each worker $i$ sends the local iterate $x_i^k$ to the master, instead of sending its local stochastic gradient as in distributed SGD. The master aggregates the models sent by the workers to update its own model $x_0^{k+1}$. In this sense, the updates in (6) are based on model aggregation, in contrast to the gradient aggregation in SGD.

Now let us consider how the updates in (6) behave in the presence of Byzantine workers. The update of a regular worker $i$ is the same as (6a), which is:

$$x_i^{k+1} = x_i^k - \alpha^{k+1}\big( \nabla F(x_i^k, \xi_i^k) + \lambda\,\mathrm{sign}(x_i^k - x_0^k) \big). \quad (7)$$

If worker $i$ is Byzantine, instead of sending the value $x_i^k$ computed from (6a) to the master, it sends an arbitrary variable $z_i^k \in \mathbb{R}^d$. The master is unable to distinguish $x_i^k$ sent by a regular worker from $z_i^k$ sent by a Byzantine worker. Therefore, the update of the master at time $k+1$ is no longer (6b), but:

$$x_0^{k+1} = x_0^k - \alpha^{k+1}\Big( \nabla f_0(x_0^k) + \lambda\Big( \sum_{i \in \mathcal{R}} \mathrm{sign}(x_0^k - x_i^k) + \sum_{j \in \mathcal{B}} \mathrm{sign}(x_0^k - z_j^k) \Big)\Big). \quad (8)$$

We term this algorithm $\ell_1$-norm RSA (Byzantine-robust stochastic aggregation).

ℓ1-norm RSA is robust to Byzantine attacks. $\ell_1$-norm RSA is robust to Byzantine attacks due to the introduction of the $\ell_1$-norm regularization term in (5). The regularization term allows every $x_i$ to be different from $x_0$, and the bias is controlled by the parameter $\lambda$. This modification robustifies the objective function when any worker is Byzantine and behaves arbitrarily. From the algorithmic perspective, we can observe from the update (8) that the impacts of a regular worker and a Byzantine worker on $x_0^{k+1}$ are similar, no matter how different the values they send to the master are. Therefore, only the number of Byzantine workers will influence the RSA update (8), rather than the magnitudes of the malicious messages sent by the Byzantine workers. In this sense, $\ell_1$-norm RSA is robust to arbitrary attacks from Byzantine workers. This is in sharp contrast to SGD, which is vulnerable to even a single Byzantine worker.
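Stated in code, the recursions (7)-(8) amount to an SGD step plus a sign-based correction. The helpers below are a hedged NumPy sketch (sampling, communication, and the gradient oracle are abstracted into arguments), not the authors' released implementation.

```python
import numpy as np

def rsa_l1_worker_step(x_i, x0, grad_i, lr, lam):
    """Regular worker update (7): local stochastic gradient step plus the
    subgradient of the l1 penalty lambda * ||x_i - x_0||_1."""
    return x_i - lr * (grad_i + lam * np.sign(x_i - x0))

def rsa_l1_master_step(x0, messages, grad_f0, lr, lam):
    """Master update (8): whatever vector a worker reports (a legitimate x_i or a
    malicious z_j), only its element-wise sign relative to x_0 enters the step,
    so each worker can shift the update by at most lam * lr per coordinate."""
    correction = sum(np.sign(x0 - v) for v in messages)
    return x0 - lr * (grad_f0(x0) + lam * correction)
```

The boundedness of $\mathrm{sign}(\cdot)$ is precisely what the robustness argument rests on: a Byzantine message can flip signs but cannot inflate the magnitude of the master's step.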

Algorithm 2 RSA for Robust Distributed Learning
Master:
1: Input: $x_0^0$, $\lambda > 0$, $\alpha^k$. At time $k+1$:
2: Broadcast its current iterate $x_0^k$ to all workers;
3: Receive all local iterates $x_i^k$ sent by regular workers or faulty values $z_i^k$ sent by Byzantine workers;
4: Update the iterate via (8) or (11).
Regular Worker i:
1: Input: $x_i^0$, $\lambda > 0$, $\alpha^k$. At time $k+1$:
2: Send the current local iterate $x_i^k$ to the master;
3: Receive the master's local iterate $x_0^k$;
4: Update the local iterate via (7) or (10).

Figure 2: The operation of RSA. There are $m$ workers, $r$ being regular and the rest of $q = m - r$ being Byzantine. The master sends its local variable to the workers, and the regular workers send back their local variables. The red devil marks denote the wrong messages that the Byzantine workers send to the master.

Generalization to ℓp-norm RSA

In addition to solving the $\ell_1$-norm regularized problem (5), we can also solve the following $\ell_p$-norm regularized problem:

$$x^* := \arg\min_{x := [x_i; x_0]} \; \sum_{i \in \mathcal{R}} \big( \mathbb{E}[F(x_i, \xi_i)] + \lambda \|x_i - x_0\|_p \big) + f_0(x_0) \quad (9)$$

where $p \geq 1$. Similar to the case of the $\ell_1$-regularized objective in (5), the $\ell_p$-norm penalty helps mitigate the negative influence of the Byzantine workers.

Akin to $\ell_1$-norm RSA, $\ell_p$-norm RSA still operates using subgradient recursions. For each regular worker $i$, its local update at time $k+1$ is:

$$x_i^{k+1} = x_i^k - \alpha^{k+1}\big( \nabla F(x_i^k, \xi_i^k) + \lambda\,\partial_{x_i}\|x_i^k - x_0^k\|_p \big) \quad (10)$$

where $\partial_{x_i}\|x_i^k - x_0^k\|_p$ is a subgradient of $\|x_i - x_0^k\|_p$ at $x_i = x_i^k$. Likewise, for the master, its update at time $k+1$ is:

$$x_0^{k+1} = x_0^k - \alpha^{k+1}\Big( \nabla f_0(x_0^k) + \lambda\Big( \sum_{i \in \mathcal{R}} \partial_{x_0}\|x_0^k - x_i^k\|_p + \sum_{j \in \mathcal{B}} \partial_{x_0}\|x_0^k - z_j^k\|_p \Big)\Big) \quad (11)$$

where $\partial_{x_0}\|x_0^k - x_i^k\|_p$ and $\partial_{x_0}\|x_0^k - z_j^k\|_p$ are subgradients of $\|x_0 - x_i^k\|_p$ and $\|x_0 - z_j^k\|_p$ at $x_0 = x_0^k$, respectively. To compute the subgradient involved in $\ell_p$-norm RSA, we will rely on the following proposition.

Proposition 1. Let $p \geq 1$ and $b$ satisfy $\frac{1}{b} + \frac{1}{p} = 1$. For $x \in \mathbb{R}^d$, we have the subdifferential $\partial_x \|x\|_p = \{ z \in \mathbb{R}^d : \langle z, x \rangle = \|x\|_p, \; \|z\|_b \leq 1 \}$.

Here and thereafter, we slightly abuse the notation by using $\partial$ to denote both a subgradient and the subdifferential. The proof of Proposition 1 is in the supplementary document. Together with $\ell_1$-norm RSA, $\ell_p$-norm RSA for robust distributed stochastic optimization under Byzantine attacks is summarized in Algorithm 2 and illustrated in Figure 2.
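Proposition 1 admits a simple closed-form selection of the subgradient needed in (10) and (11). The helper below is a sketch of one valid choice (it returns the zero vector at $x = 0$, which the subdifferential contains); it is provided for illustration and is not part of the paper's code.

```python
import numpy as np

def lp_subgradient(x, p):
    """Return one element of the subdifferential of ||x||_p (cf. Proposition 1)."""
    x = np.asarray(x, dtype=float)
    if not np.any(x):                  # at x = 0, any z with ||z||_b <= 1 qualifies
        return np.zeros_like(x)
    if p == 1:
        return np.sign(x)              # recovers the sign-based updates of l1-norm RSA
    if np.isinf(p):
        z = np.zeros_like(x)
        j = int(np.argmax(np.abs(x)))  # all weight on one maximizing coordinate
        z[j] = np.sign(x[j])
        return z
    norm = np.linalg.norm(x, ord=p)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 1)
```

Plugging lp_subgradient(x_i - x_0, p) into (10), and lp_subgradient(x_0 - v, p) with v ranging over the received messages into (11), recovers the $\ell_p$-norm RSA recursions; $p = 1$ reduces to (7)-(8).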
2017; ing Byzantine-robust methods are based on gradient ag- Su and Xu 2018). In contrast, the proposed RSA methods en- gregation. Since each worker computes its gradient using joy much lower complexities, which are the same as that of the same iterate, these methods do not have the consen- the standard distributed SGD working in the Byzantine-free sus issue (Blanchard et al. 2017; Chen, Su, and Xu 2017; setting. We shall demonstrate the advantage of RSA in com- Xie, Koyejo, and Gupta 2018a; Yin et al. 2018b; 2018a; putational time in the numerical tests, in comparison with Xie, Koyejo, and Gupta 2018c). However, to enable effi- several state-of-the-art alternatives. cient gradient aggregation, these methods require the data stored in the workers are i.i.d., which is impractical in the Convergence Analysis federated learning setting. Under the assumption that data This section analyzes the performance of the proposed RSA are i.i.d. on all the workers, the stochastic gradients com- methods, with proofs given in (Li et al. 2018). We make the puted by regular workers are also i.i.d. and their expecta- following assumptions. tions are equal. Using this prior knowledge, the master is

Convergence Analysis

This section analyzes the performance of the proposed RSA methods, with proofs given in (Li et al. 2018). We make the following assumptions.

Assumption 1. (Strong convexity) The local cost functions $\mathbb{E}[F(\tilde{x}, \xi_i)]$ and the regularization term $f_0(\tilde{x})$ are strongly convex with constants $\mu_i$ and $\mu_0$, respectively.

Assumption 2. (Lipschitz continuous gradients) The local cost functions $\mathbb{E}[F(\tilde{x}, \xi_i)]$ and the regularization term $f_0(\tilde{x})$ have Lipschitz continuous gradients with constants $L_i$ and $L_0$, respectively.

Assumption 3. (Bounded variance) For every worker $i$, the data sampling is i.i.d. across time such that $\xi_i^k \sim \mathcal{D}_i$. The variance of $\nabla F(\tilde{x}, \xi_i)$ is upper bounded by $\delta_i^2$, namely, $\mathbb{E}[\|\nabla \mathbb{E}[F(\tilde{x}, \xi_i)] - \nabla F(\tilde{x}, \xi_i)\|^2] \leq \delta_i^2$.

Note that Assumptions 1-3 are standard for performance analysis of stochastic gradient-based methods (Nemirovski et al. 2009), and they are satisfied in a wide range of machine learning problems such as $\ell_2$-regularized least squares and logistic regression.

We start by investigating the $\ell_p$-norm regularized problem (9), showing the condition under which the optimal solution of (9) is consensual and identical to that of (3).

Theorem 1. Suppose that Assumptions 1 and 2 hold. If $\lambda \geq \lambda_0 := \max_{i \in \mathcal{R}} \|\nabla \mathbb{E}[F(\tilde{x}^*, \xi_i)]\|_b$ with $p \geq 1$ and $b$ satisfying $\frac{1}{b} + \frac{1}{p} = 1$, then we have $x^* = [\tilde{x}^*]$, where $\tilde{x}^*$ and $x^*$ are the optimal solutions of (3) and (9), respectively.

Theorem 1 asserts that if the penalty constant $\lambda$ is selected to be large enough, the optimal solution of the regularized problem (9) is the same as that of (3). Next, we shall check the convergence properties of the RSA iterates with respect to the optimal solution of (9) under Byzantine attacks.

Theorem 2. Suppose that Assumptions 1, 2 and 3 hold. Set the step size of $\ell_p$-norm RSA ($p \geq 1$) as $\alpha^{k+1} = \min\{\bar{\alpha},\ \underline{\alpha}/(k+1)\}$, where $\bar{\alpha}$ and $\underline{\alpha}$ depend on $\{\mu_i, \mu_0, L_i, L_0\}$. Then, for $k_0 := \min\{k : \bar{\alpha} \geq \underline{\alpha}/(k+1)\}$, we have:

$$\mathbb{E}\|x^{k+1} - x^*\|^2 \leq (1 - \eta\bar{\alpha})^k \|x^0 - x^*\|^2 + \frac{\bar{\alpha}\Delta_0 + \Delta_2}{\eta}, \quad k < k_0 \quad (12)$$

and

$$\mathbb{E}\|x^{k+1} - x^*\|^2 \leq \frac{\Delta_1}{k+1} + \underline{\alpha}\Delta_2, \quad k \geq k_0 \quad (13)$$

where $\eta$, $\Delta_1$ and $\Delta_2 = O(\lambda^2 q^2)$ are certain positive constants.

Theorem 2 shows that the sequence of local iterates converges sublinearly to the near-optimal solution of the regularized problem (9). The asymptotic sub-optimality gap is quadratically dependent on the number of Byzantine workers $q$. Building upon Theorems 1 and 2, we can arrive at the following theorem.

Theorem 3. Under the same assumptions as those in Theorem 2, if we choose $\lambda \geq \lambda_0$ according to Theorem 1, then for a sufficiently large $k \geq k_0$, we have:

$$\mathbb{E}\|x^k - [\tilde{x}^*]\|^2 \leq \frac{\Delta_1}{k+1} + \underline{\alpha}\Delta_2. \quad (14)$$

If we choose $0 < \lambda < \lambda_0$, and suppose that the difference between the optimizer of (9) and that of (3) is bounded by $\|x^* - [\tilde{x}^*]\|^2 \leq \Delta_3$, then for $k \geq k_0$ we have:

$$\mathbb{E}\|x^k - [\tilde{x}^*]\|^2 \leq \frac{2\Delta_1}{k+1} + 2\underline{\alpha}\Delta_2 + 2\Delta_3. \quad (15)$$

Theorem 3 implies that the sequence of local iterates also converges sublinearly to the near-optimal solution of the original problem (3). Under a properly selected $\lambda$, the sub-optimality gap in the limit is proportional to the number of Byzantine workers. Note that since the $O(1/k)$ step size is quite sensitive to its initial value (Nemirovski et al. 2009), we use the $O(1/\sqrt{k})$ step size in our numerical tests. Its corresponding theoretical claim and convergence analysis are given in the supplementary document. Regarding the optimal selection of the penalty constant $\lambda$ and the $\ell_p$ norm, a remark follows next.

Remark 2 (Optimal selection of λ and p). Selecting different penalty constants $\lambda$ and $\ell_p$ norms in RSA generally leads to distinct performance. For a fixed $\lambda$, if a norm $\ell_p$ with a small $p$ is used, the dual norm $\ell_b$ has a large $b$ and thus results in a small $\lambda_0$ in Theorem 1. Therefore, the local solutions are likely to be consensual. From the numerical tests, RSA with the $\ell_\infty$ norm does not provide competitive performance, while those with the $\ell_1$ and $\ell_2$ norms work well. On the other hand, for a fixed $p$, a small $\lambda$ cannot guarantee consensus among local solutions, but it gives a small sub-optimality gap $\Delta_2$. We recommend using a $\lambda$ that is relatively smaller than $\lambda_0$, slightly sacrificing consensus but reducing the sub-optimality gap.

Numerical Tests

In this section, we evaluate the robustness of the proposed RSA methods to Byzantine attacks and compare them with several benchmark algorithms. We conduct experiments on the MNIST dataset, which has 60k training samples and 10k testing samples, and use softmax regression with an $\ell_2$-norm regularization term $f_0(\tilde{x}) = 0.01\|\tilde{x}\|_2^2$. We launch 20 worker processes and 1 master process on a computer with an Intel i7-6700 CPU @ 3.40GHz. In the i.i.d. case, the training samples are randomly and evenly assigned to the workers. In the heterogeneous case, every two workers evenly share the training samples of one digit. At every iteration, every regular worker estimates its local gradient on a mini-batch of 32 samples. The top-1 accuracy (evaluated with $x_0$ in RSA and $\tilde{x}$ in the benchmark algorithms) on the test dataset is used as the performance metric.

Benchmark algorithms

We use the SGD iteration (2) without attacks as the oracle, which is referred to as Ideal SGD. Note that this method is not affected by $q$, the number of Byzantine workers. The other benchmark algorithms implement the following stochastic gradient aggregation recursion:

$$\tilde{x}^{k+1} = \tilde{x}^k - \alpha^{k+1}\tilde{\nabla}(\tilde{x}^k) \quad (16)$$

where $\tilde{\nabla}(\tilde{x}^k)$ is an algorithm-dependent aggregated stochastic gradient that approximates the gradient direction at the point $\tilde{x}^k$ sent by the master to the workers. Let the message sent by worker $i$ to the master be $v_i^k$, which is a stochastic gradient $\nabla F(\tilde{x}^k, \xi_i^k)$ if $i$ is regular, and arbitrary if $i$ is Byzantine. The benchmark algorithms use different rules to calculate the aggregated stochastic gradient.
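All benchmarks below share the skeleton (16) and differ only in the aggregation rule. A minimal sketch of that skeleton, with the rule passed as an argument, is shown here purely for illustration; it is not the code used in the experiments.

```python
import numpy as np

def benchmark_step(x, messages, lr, aggregate):
    """Recursion (16): the master applies an aggregation rule to the messages
    v_i^k and takes a gradient-like step along the aggregated direction."""
    return x - lr * aggregate(np.asarray(messages))

# Plain SGD corresponds to aggregate = lambda V: V.mean(axis=0); the robust rules
# described next (GeoMed, Krum, Median) are drop-in replacements.
```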

GeoMed (Chen, Su, and Xu 2017). The geometric median of $\{v_i^k : i \in [m]\}$ is denoted by:

$$\mathrm{GeoMed}(\{v_i^k\}) = \arg\min_{v \in \mathbb{R}^d} \sum_{i=1}^{m} \|v - v_i^k\|_2. \quad (17)$$

We use a fast Weiszfeld algorithm (Weiszfeld and Plastria 2009) to compute the geometric median in the experiments.

Krum (Blanchard et al. 2017). Krum calculates $\tilde{\nabla}(\tilde{x}^k)$ by:

$$\mathrm{Krum}(\{v_i^k\}) = v_{i^*}^k, \quad i^* = \arg\min_{i \in [m]} \sum_{i \rightarrow j} \|v_i^k - v_j^k\|^2 \quad (18)$$

where $i \rightarrow j$ ($i \neq j$) selects the indexes $j$ of the $m - q - 2$ nearest neighbors of $v_i^k$ in $\{v_j^k : j \in [m]\}$, measured by Euclidean distance. Note that $q$, the number of Byzantine workers, must be known in advance in Krum.

Median (Xie, Koyejo, and Gupta 2018a). The marginal median aggregation rule returns the element-wise median of the vectors $\{v_i^k : i \in [m]\}$.

SGD (Bottou 2010). The classical SGD aggregates $\{v_i^k : i \in [m]\}$ by returning the mean, and is hence not robust to Byzantine attacks.

In the following experiments, the step sizes of the benchmark algorithms are all hand-tuned to the best.
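For reference, the following are simplified sketches of these three aggregation rules: a few Weiszfeld iterations for (17), Krum's selection rule (18), and the coordinate-wise median. They are illustrative implementations under obvious simplifications, not the exact code used in the experiments.

```python
import numpy as np

def geomed(V, iters=50, eps=1e-8):
    """Approximate geometric median (17) of the rows of V via Weiszfeld iterations."""
    V = np.asarray(V, dtype=float)
    v = V.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(V - v, axis=1)
        w = 1.0 / np.maximum(dist, eps)          # guard against division by zero
        v = np.average(V, axis=0, weights=w)
    return v

def krum(V, q):
    """Krum (18): return the message with the smallest summed squared distance
    to its m - q - 2 nearest neighbors; q must be known in advance."""
    V = np.asarray(V, dtype=float)
    m = len(V)
    scores = []
    for i in range(m):
        d = np.sort([np.sum((V[i] - V[j]) ** 2) for j in range(m) if j != i])
        scores.append(np.sum(d[: m - q - 2]))
    return V[int(np.argmin(scores))]

def marginal_median(V):
    """Element-wise (marginal) median aggregation."""
    return np.median(np.asarray(V, dtype=float), axis=0)
```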
Without Byzantine attacks

In this test, we consider learning without Byzantine workers, and show the performance of all algorithms in Figure 3. $\ell_1$-norm RSA chooses the parameter $\lambda = 0.1$ and the step size $\alpha^k = 0.003/\sqrt{k}$. RSA and GeoMed are close to Ideal SGD, and significantly outperform Krum and Median. Therefore, robustifying the cost in RSA, though it introduces bias, does not sacrifice performance in the regular case.

Figure 3: Top-1 accuracy without Byzantine attacks.

Same-value attacks

The same-value attacks set the message sent by a Byzantine worker $i$ as $v_i^k = c\mathbf{1}$. Here $\mathbf{1} \in \mathbb{R}^d$ is an all-one vector and $c$ is a constant, which we set as 100. We consider two different numbers of Byzantine workers, $q = 4$ and $q = 8$, and demonstrate the performance in Figure 4. $\ell_1$-norm RSA chooses the regularization parameter $\lambda = 0.07$ and the step size $\alpha^k = 0.001/\sqrt{k}$. When $q = 4$, SGD fails, while RSA and GeoMed are still close to Ideal SGD and outperform Krum and Median. When $q$ is increased to $q = 8$, Krum and Median perform worse than with $q = 4$, while RSA and GeoMed are almost the same as with $q = 4$.

Figure 4: Top-1 accuracy under same-value attacks: (a) $q = 4$ and $c = 100$; (b) $q = 8$ and $c = 100$.

Sign-flipping attacks

The sign-flipping attacks flip the signs of messages (gradients or local iterates) and enlarge their magnitudes. To be specific, a Byzantine worker $i$ first calculates the true value $\hat{v}_i^k$, and then sends $v_i^k = \sigma\hat{v}_i^k$ to the master, where $\sigma$ is a negative constant. We test $\sigma = -4$ while setting $q = 4$ and $q = 8$, as shown in Figure 5. The parameters are $\lambda = 0.07$ and $\alpha^k = 0.001/\sqrt{k}$ for $q = 4$, while $\lambda = 0.01$ and $\alpha^k = 0.0003/\sqrt{k}$ for $q = 8$. Not surprisingly, SGD fails in both cases. GeoMed, Median and $\ell_1$-norm RSA show similar performance, with Median being slightly worse than the other Byzantine-robust algorithms.

Runtime comparison

We show in Figure 6 the runtime of the algorithms under the same-value attacks with parameter $c = 100$ and $q = 8$ Byzantine workers. The total number of iterations for every algorithm is 5000. Though the algorithms are not implemented in a federated learning platform, the comparison clearly demonstrates the additional per-iteration computational costs incurred in handling Byzantine attacks. GeoMed has the largest per-iteration computational cost due to the difficulty of calculating the geometric median. $\ell_1$-norm RSA and Median are both slightly slower than Ideal SGD, but faster than Krum. The only computational overhead of RSA over Ideal SGD lies in the computation of sign functions, which is light-weight. Therefore, RSA is advantageous in computational complexity compared to the other, more complicated gradient aggregation approaches.

Impact of hyper-parameter λ

We vary the hyper-parameter $\lambda$ and show how it affects the performance. We use the same-value attacks with $c = 100$, vary $\lambda$, run RSA for 5000 iterations, and depict the final top-1 accuracy in Figure 7. The number of Byzantine workers is $q = 8$ and the step sizes are hand-tuned to the best. Observe that when $\lambda$ is small, a regular worker tends to rely on its own data, such that information fusion over the network is slow, which leads to slow convergence and large error. On the other hand, a large $\lambda$ also incurs remarkable error, as we have investigated in the convergence analysis.

Figure 5: Top-1 accuracy under sign-flipping attacks: (a) $q = 4$ and $\sigma = -4$; (b) $q = 8$ and $\sigma = -4$.

Figure 6: Runtime under same-value attacks with $q = 8$ and $c = 100$.


Figure 7: Top-1 accuracy with varying $\lambda$. We use same-value attacks with $q = 8$ and $c = 100$.

Figure 8: RSA with different norms, without Byzantine attacks: (a) top-1 accuracy; (b) variance of regular workers' local iterates.

RSA with different norms

Now we compare RSA methods regularized with different norms. The results without Byzantine attacks and under the same-value attacks with $q = 8$ and $c = 100$ are demonstrated in Figures 8 and 9, respectively. We consider two performance metrics: top-1 accuracy and the variance of the regular workers' local iterates. A small variance means that the regular workers reach similar solutions. Without the Byzantine attacks, $\ell_1$ is with $\lambda = 0.1$ and $\alpha^k = 0.001/\sqrt{k}$, $\ell_2$ is with $\lambda = 1.4$ and $\alpha^k = 0.001/\sqrt{k}$, while $\ell_\infty$ is with $\lambda = 51$ and $\alpha^k = 0.0001/\sqrt{k}$. Under the same-value attacks, $\ell_1$ is with $\lambda = 0.07$ and $\alpha^k = 0.001/\sqrt{k}$, $\ell_2$ is with $\lambda = 1.2$ and $\alpha^k = 0.001/\sqrt{k}$, while $\ell_\infty$ is with $\lambda = 20$ and $\alpha^k = 0.0001/\sqrt{k}$. In both cases, $\ell_1$-norm RSA and $\ell_2$-norm RSA are close in terms of top-1 accuracy, and both of them are better than $\ell_\infty$-norm RSA. This observation coincides with our convergence analysis, namely, $\ell_\infty$-norm RSA needs a large $\lambda$ to ensure consensus, which in turn causes a large error. Indeed, we deliberately choose a not-too-large $\lambda$ for $\ell_\infty$-norm RSA so as to reduce the error, at the cost of sacrificing the consensus property. Therefore, regarding the variance of the regular workers' local iterates, $\ell_\infty$-norm RSA has the largest variance, while $\ell_2$-norm RSA has a smaller variance than $\ell_1$-norm RSA.

Figure 9: RSA with different norms, under same-value attacks with $q = 8$ and $c = 100$: (a) top-1 accuracy; (b) variance of regular workers' local iterates.

Heterogeneous Data

To show the robustness of RSA on heterogeneous data, we re-distribute the MNIST data in the following way: every two workers are associated with the data of the same handwritten digit. In the experiment, each Byzantine worker $i$ transmits $v_i^k = v_r^k$, where worker $r$ is one of the regular workers. We set $r = 1$ in the experiment. The results are shown in Figure 10. When $q = 4$, the data of two handwritten digits are not available in the experiment, such that the best possible accuracy is around 0.8. When $q = 8$, the best possible accuracy is around 0.6. The parameters of $\ell_1$-norm RSA are $\lambda = 0.5$ and $\alpha^k = 0.0005/\sqrt{k}$. Observe that when $q = 4$, Krum fails, while RSA outperforms GeoMed and Median. When $q$ increases to 8, GeoMed, Krum and Median all fail, but RSA still performs well and reaches the near-optimal accuracy.

Figure 10: Top-1 accuracy under attacks on heterogeneous data: (a) $q = 4$; (b) $q = 8$.

Conclusions

This paper dealt with distributed learning under Byzantine attacks. While the existing work mostly focuses on the case of i.i.d. data and relies on costly gradient aggregation rules, we developed an efficient variant of SGD for distributed learning from heterogeneous datasets under Byzantine attacks. The resultant SGD-based methods, which we term RSA, converge to a near-optimal solution at an $O(1/k)$ convergence rate, where the optimality gap depends on the number of Byzantine workers. In the Byzantine-free setting, both SGD and RSA converge to the optimal solution at a sublinear convergence rate. Numerically, experiments on real data corroborate the competitive performance of RSA compared to the state-of-the-art alternatives.

Acknowledgments. The work by T. Chen and G. Giannakis is supported in part by NSF 1500713 and 1711471. The work by Q. Ling is supported in part by NSF China 61573331 and Guangdong IIET 2017ZT07X355.

References

Alistarh, D.; Allen-Zhu, Z.; and Li, J. 2018. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917.

Ben-Ameur, W.; Bianchi, P.; and Jakubowicz, J. 2016. Robust distributed consensus using total variation. IEEE Trans. Automat. Contr. 61(6):1550–1564.

Blanchard, P.; Guerraoui, R.; Stainer, J.; and Mhamdi, E. M. E. 2017. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, 119–129.

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics. Physica-Verlag HD. 177–186.

Chen, L.; Wang, H.; Charles, Z.; and Papailiopoulos, D. 2018a. Draco: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, 902–911.

Chen, T.; Giannakis, G. B.; Sun, T.; and Yin, W. 2018b. LAG: Lazily aggregated gradient for communication-efficient distributed learning. arXiv preprint arXiv:1805.09965.

Chen, Y.; Su, L.; and Xu, J. 2017. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. ACM Conference on Measurement and Analysis of Computing Systems 1(2):44.

Li, M.; Andersen, D. G.; Smola, A. J.; and Yu, K. 2014. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, 19–27.

Li, L.; Xu, W.; Chen, T.; Giannakis, G. B.; and Ling, Q. 2018. RSA: Byzantine-Robust Stochastic Aggregation Methods for Distributed Learning from Heterogeneous Datasets. http://home.ustc.edu.cn/~qingling/papers/C AAAI2019 RSA.pdf.

Liu, Y.; Nowzari, C.; Tian, Z.; and Ling, Q. 2017. Asynchronous periodic event-triggered coordination of multi-agent systems. In IEEE Conference on Decision and Control, 6696–6701.

Lynch, N. A. 1996. Distributed Algorithms. Elsevier.

McMahan, B., and Ramage, D. 2017. Federated learning: Collaborative machine learning without centralized training data. Google Research Blog.

Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM J. Optimization 19(4):1574–1609.

Shi, W.; Ling, Q.; Yuan, K.; Wu, G.; and Yin, W. 2014. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7):1750–1761.

Sicari, S.; Rizzardi, A.; Grieco, L. A.; and Coen-Porisini, A. 2015. Security, privacy and trust in Internet of Things: The road ahead. Computer Networks 76:146–164.

Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems, 4427–4437.

Su, L., and Xu, J. 2018. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140.

Weiszfeld, E., and Plastria, F. 2009. On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research 167(1):7–41.

Xie, C.; Koyejo, O.; and Gupta, I. 2018a. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116.

Xie, C.; Koyejo, O.; and Gupta, I. 2018b. Phocas: Dimensional Byzantine-resilient stochastic gradient descent. arXiv preprint arXiv:1805.09682.

Xie, C.; Koyejo, O.; and Gupta, I. 2018c. Zeno: Byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032.

Xu, W.; Li, Z.; and Ling, Q. 2018. Robust decentralized dynamic optimization at presence of malfunctioning agents. Signal Processing 153:24–33.

Yin, D.; Chen, Y.; Ramchandran, K.; and Bartlett, P. 2018a. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498.

Yin, D.; Chen, Y.; Ramchandran, K.; and Bartlett, P. 2018b. Defending against saddle point attack in Byzantine-robust distributed learning. arXiv preprint arXiv:1806.05358.
