
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Learning to Learn Gradient Aggregation by Gradient Descent

Jinlong Ji1, Xuhui Chen1,2, Qianlong Wang1, Lixing Yu1 and Pan Li1∗
1Case Western Reserve University
2Kent State University
{jxj405, qxw204, lxy257, pxl288}@case.edu, [email protected]
∗Contact Author

Abstract

In the big data era, distributed machine learning emerges as an important learning paradigm to mine large volumes of data by taking advantage of distributed computing resources. In this paper, motivated by learning to learn, we propose a meta-learning approach to coordinate the learning process in the master-slave type of distributed systems. Specifically, we utilize a recurrent neural network (RNN) in the parameter server (the master) to learn to aggregate the gradients from the workers (the slaves). We design a coordinatewise preprocessing and postprocessing method to make the neural network based aggregator more robust. Besides, to address fault tolerance, especially Byzantine attacks, in distributed machine learning systems, we propose an RNN aggregator with additional loss information (ARNN) to improve the system resilience. We conduct extensive experiments to demonstrate the effectiveness of the RNN aggregator, and also show that it can be easily generalized and achieves remarkable performance when transferred to other distributed systems. Moreover, under majoritarian Byzantine attacks, the ARNN aggregator outperforms Krum, the state-of-the-art fault tolerance aggregation method, by 43.14%. In addition, our RNN aggregator enables the server to aggregate gradients from variant local models, which significantly improves the scalability of distributed learning.

1 Introduction

With the onset of the big data era, governments, companies and research institutions for the first time have the opportunity to gain deep insight into their problems and develop better solutions by leveraging massive available data [Dean et al., 2012; Chen et al., 2017]. However, developing machine learning algorithms to analyze a huge amount of data usually requires intensive computing resources, which may not be provided by any single user. Moreover, the massive data are typically generated by multiple parties and stored in a geographically distributed manner. Such scenarios with distributed computing and datasets have spurred the study of distributed machine learning techniques, such as multi-agent optimization [Nedic and Ozdaglar, 2009], multi-party learning [Hamm et al., 2016], and federated learning [Konečný et al., 2016].

The most typical distributed machine learning paradigm is the master-slave pattern, which has been adopted in many industrial-level applications [Abadi et al., 2016; Chen et al., 2018b]. Particularly, each worker (the slave) in the system is a computing node, which has access to its own share of the data. All workers learn in parallel and only send their local model information (e.g., gradients of the model) to a parameter server, which can significantly reduce the communication cost, while the parameter server (the master) aggregates all the received information with hand-designed methods so as to update the global model. This training procedure is applicable to a rich diversity of machine learning algorithms, such as linear regression/classification, support vector machines and deep neural networks.

As mentioned above, the aggregation method plays a vital role in the distributed learning process. So far, the aggregation methods are linear combinations like averaging [Polyak and Juditsky, 1992] and its variants [Zhang et al., 2015; Lian et al., 2015]. However, it still requires study on how to make the aggregators more robust in distributed learning. Particularly, distributed systems are vulnerable to computing errors from the workers, especially when adversarial attackers compromise one or more computing nodes to launch Byzantine attacks. In such scenarios, the distributed learning process may not converge and the system may crash [Blanchard et al., 2017]. In addition, traditionally, in master-slave type distributed machine learning systems, the linear combination aggregator in the parameter server can only work when all workers hold the same local model. However, in many distributed systems, the workers may vary in computing power and memory, and hence can only afford models of different complexities. Therefore, the current aggregators impede the large-scale implementation of distributed machine learning.

In this paper, we take the learning to learn approach to coordinate the distributed learning among the workers. Particularly, learning to learn is a meta-learning strategy [Thrun and Pratt, 1998], where the abstract knowledge extracted from the learning (or optimization) process can enable fast and efficient learning.

We develop a deep meta-learning aggregation method for distributed machine learning. Specifically, we model the aggregation process in distributed learning as a machine learning task. Then we use a recurrent neural network (RNN) to aggregate the gradients collected from the workers. To learn this neural network based aggregator, we choose to optimize it along the horizontal direction (i.e., the number of steps) of the distributed learning process. To make the aggregator robust, we design a coordinatewise preprocessing and postprocessing method. Once the aggregator is well trained, it makes decisions on how to aggregate the scattered local gradients based on the information learned from its previous optimization process. Moreover, in order to address Byzantine attacks, we improve the RNN aggregator with additional loss information to make the system robust and stable.

The contributions of our work are summarized as follows:

• To the best of our knowledge, this is the first study proposing a learning to learn approach to coordinate a distributed learning process. Instead of adopting a pre-defined deterministic aggregation rule, we model the aggregating process as a learning problem and utilize a recurrent neural network to optimize it. In addition, we design a preprocessing and postprocessing method to enhance the stability of the optimization process. We also investigate the robustness and stability of distributed learning, and propose an improved RNN aggregator with additional loss information to address fault tolerance issues, especially Byzantine attacks.

• The experiment results demonstrate that the proposed RNN aggregators outperform the classical averaging aggregator by a large margin on both the MNIST and the CIFAR-10 datasets. Our experiments also demonstrate that the proposed aggregators can be easily generalized and achieve remarkable performance when transferred to other distributed systems with different learning models and different datasets, respectively. In addition, compared to the state-of-the-art fault tolerance algorithm Krum, the proposed ARNN aggregator shows competitive performance and outperforms Krum under majoritarian Byzantine attacks.

• The empirical results show that the proposed scheme enables the aggregator to utilize abstract information across variant local learning models, which significantly improves the scalability of the traditional distributed learning system.

The rest of the paper is organized as follows. In the next section, we briefly introduce the most related works. Then we detail the design of the recurrent neural network based aggregator and its variant, the RNN aggregator with additional loss information. After that, we evaluate the performance of the proposed RNN aggregator through extensive experiments. Finally, we conclude the paper.

2 Related Works

2.1 Learning to Learn

The idea of learning to learn was inspired by psychology studies [Ward, 1937], where researchers discovered that human beings are born able to learn new skills via previous learning experiences. The idea of learning to learn has been widely explored in various machine learning systems such as meta-learning, life-long learning and continuous learning, as well as many optimization problems, including learning to optimize hyperparameters [Maclaurin et al., 2015], learning to optimize neural networks [Ge et al., 2017] and learning to optimize loss functions [Houthooft et al., 2018].

Particularly, learning to learn provides an efficient approach to designing learning systems through learning. For instance, Bengio et al. proposed learning local Hebbian updates by gradient descent [Bengio et al., 1992], which utilizes supervised learning to learn unsupervised learning algorithms. Moreover, learning to learn by gradient descent by gradient descent [Andrychowicz et al., 2016] and learning to learn without gradient descent by gradient descent [Chen et al., 2016] employ supervised learning at the meta level to learn supervised learning algorithms and Bayesian optimization algorithms, respectively. In this paper, motivated by these previous works, we utilize supervised learning at the meta level to learn an aggregation method for distributed machine learning.

2.2 Aggregation Methods

The study of aggregation methods has drawn intensive attention from researchers in machine learning. For example, some researchers study the aggregation of gradients with reduced communication and computation costs [Chen et al., 2018a], and some others focus on accelerating the convergence process [Tsitsiklis et al., 1986]. There are also some works that develop aggregation schemes aiming to improve the stability and robustness of the learning process [Scardovi and Leonard, 2009]. Note that most aggregation methods are pre-defined and deterministic, and hence are difficult to generalize. They are also generally sensitive to errors in the learning system.

In this work, we propose a deep meta-learning system, i.e., an RNN aggregator, that can achieve better optimization performance compared with existing deterministic approaches like the averaging aggregator proposed by [Konečný et al., 2015]. We also address the fault tolerance issue, especially Byzantine attacks, in the proposed RNN aggregator. Moreover, our proposed aggregation method can work across different local models, which enables distributed learning systems to scale.

3 Distributed Machine Learning Model

In this section, we present the master-slave type of distributed learning system and introduce a general learning algorithm based on distributed gradient descent.

As shown in Figure 1, a distributed learning system is composed of a parameter server S and a group of workers {W1, W2, ···, WM}. Distributed machine learning can account for a variety of learning problems, including federated learning [Konečný et al., 2016] and multi-party learning [Hamm et al., 2016], where each worker Wm can only access its own dataset xm and the workers train in parallel on a large-scale dataset; that is, each worker Wm, an individual computing unit, is associated with a specific shard of data xm.


Figure 1: The architecture of the master-slave type of distributed learning system.

The objective of distributed machine learning is to let the M workers solve the following problem in a distributed and collaborative fashion:

\min_{\theta \in \mathbb{R}^d} L(\theta) := \sum_{m \in M} L_m(\theta, x_m),

where θ ∈ R^d is the shared global model, and L and {L_m, m ∈ M} are objective functions with M := {1, 2, ···, M}. Specifically, L_m is computed by the worker W_m as follows:

L_m(\theta, x_m) := \sum_{x_i \in x_m} \ell_m(\theta, x_i),

where ℓ_m is the loss (e.g., hinge loss or cross-entropy loss) and x_m is the data shard that the worker W_m can access. Generally, a learning model is determined by the structure of the model and the corresponding parameters. In the following, we will slightly abuse the notation of θ as the parameter vector of the model.

Note that most learning algorithms are based on a common optimization method called stochastic gradient descent (SGD). Here, we introduce the distributed SGD algorithm for distributed learning systems [Rajkumar and Agarwal, 2012]. Particularly, all the workers jointly learn a common model iteratively. In the tth iteration, the parameter server S broadcasts the current global model θ^t to the workers. Then, each worker W_m computes its local gradient based on its own data shard and uploads it to the server S. Once S receives the gradients from all the workers, it updates the common global model by aggregating all the gradients via an aggregation function A. We summarize the distributed SGD algorithm by two main steps in each iteration t:

• Local gradient calculation: each worker W_m calculates the local gradient g_m^t based on the current global model θ^t and its local data shard x_m, i.e.,

g_m^t = \frac{\partial}{\partial \theta^t} L_m(\theta^t, x_m).

• Global gradients aggregation: the server S aggregates the gradients received from all the workers and updates the global model:

\theta^{t+1} = \theta^t - \gamma_t \cdot A(g_1^t, g_2^t, \cdots, g_M^t),

where γ_t is the step size.

Noticeably, in the literature, the aggregation rule A is typically averaging [Polyak and Juditsky, 1992] and its variants [Zhang et al., 2015; Lian et al., 2015; Tsitsiklis et al., 1986]. A minimal sketch of this baseline procedure is given below.
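To make the baseline concrete, the following is a small Python sketch (not the authors' code) of one round of distributed SGD with the averaging rule A(·) = mean. The least-squares loss, the synthetic shards and the constants are placeholders standing in for the generic L_m and the workers' data.

```python
import numpy as np

def local_gradient(theta, shard_X, shard_y):
    """Worker-side step: gradient of a least-squares loss on the local shard.
    The quadratic loss is only a placeholder for the generic L_m."""
    residual = shard_X @ theta - shard_y
    return shard_X.T @ residual / len(shard_y)

def average_aggregator(gradients):
    """Baseline rule A: coordinatewise mean of the workers' gradients."""
    return np.mean(gradients, axis=0)

# Toy setup: M workers, each holding one shard of synthetic data.
rng = np.random.default_rng(0)
M, d, n = 10, 5, 100
shards = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(M)]
theta = np.zeros(d)            # shared global model
step_size = 0.1                # gamma_t

for t in range(50):
    grads = [local_gradient(theta, X, y) for X, y in shards]   # uploaded g_m^t
    theta = theta - step_size * average_aggregator(grads)      # server update
```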

4 Learning to Learn Gradient Aggregation by Gradient Descent

In this section, we model the gradient aggregation in distributed machine learning as a learning problem and develop a deep neural network aggregator.

4.1 Recurrent Neural Network Based Aggregator

Motivated by the idea of learning to learn, instead of hand-designed aggregation methods, our goal is to develop a learning based aggregator that is able to learn at the meta level to aggregate the information received from the workers in distributed learning.

To achieve this goal, we consider directly parameterizing the aggregation process as a learning model A with φ being its parameter vector. The aggregator A learns from the parameter server's aggregation experiences and updates itself. Thus, in the tth iteration, the aggregator works as follows:

1. The aggregator A at the parameter server S takes the uploaded gradients (g_1^t, ···, g_m^t, ···, g_M^t) as input, which we denote by g_W^t.

2. Given the internally stored hidden knowledge h_t and the current parameter vector φ, which was learned from the past learning experiences, A(φ, g_W^t) outputs its predicted global model θ^{t+1}.

3. The aggregator A updates its internally stored hidden state to obtain h_{t+1}.

We summarize the process in the tth iteration as follows:

h_{t+1}, \theta^{t+1} = A(h_t, g_W^t, \phi).

We notice that this procedure can easily be mapped onto a recurrent neural network (RNN), which learns to utilize its internal memory to store information about previous processes and function evaluations, and learns to access this memory to produce the current output. To address the gradient vanishing or exploding issues when dealing with long sequential data, in this paper we apply two popular and efficient variants of the standard RNN: the RNN with Long Short-Term Memory units (LSTM) [Hochreiter and Schmidhuber, 1997] and the RNN with Gated Recurrent Units (GRU) [Chung et al., 2014].

Subsequently, we can formulate the proposed aggregation procedure as:

h_{t+1}, \theta^{t+1} = A_{RNN}(h_t, g_W^t, \phi),

where A_RNN(·) represents the aggregation method at the parameter server S, and φ represents all the parameters in the RNN network.

4.2 Distributed Learning with an RNN Aggregator

With the RNN aggregator A_RNN, the distributed learning process discussed before works as follows:

• Local gradient calculation: g_m^t = \frac{\partial}{\partial \theta^t} L_m(\theta^t, x_m).

• Global gradients aggregation: h_{t+1}, \theta^{t+1} = A_{RNN}(h_t, g_W^t, \phi).
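The paper does not give implementation code for A_RNN; the following PyTorch-style sketch shows one way the server-side update could be wired, under the assumption that the LSTM is applied coordinatewise (each model coordinate is treated as a batch element whose input features are the M workers' values of that coordinate) and that a linear head maps the hidden state to the predicted θ_i^{t+1}. The layer sizes follow the experimental setup (two layers, 20 hidden units); everything else is illustrative.

```python
import torch
import torch.nn as nn

class RNNAggregator(nn.Module):
    """Sketch of a learned aggregator: a 2-layer LSTM applied per coordinate.

    For each coordinate i of the global model, the input at iteration t is the
    vector of that coordinate's gradients from the M workers; the output is the
    predicted value of theta_i at t+1. The input formatting and output head
    are assumptions, not the authors' exact architecture.
    """

    def __init__(self, num_workers, hidden_size=20, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_workers, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # maps hidden state to theta_i

    def forward(self, grads, hidden=None):
        # grads: (d, M) -- coordinate i's gradients from the M workers.
        inp = grads.unsqueeze(1)                 # (d, seq_len=1, M)
        out, hidden = self.lstm(inp, hidden)     # coordinates act as a batch
        theta_next = self.head(out.squeeze(1)).squeeze(-1)   # (d,)
        return theta_next, hidden

# One server-side aggregation step. The parameter vector phi of the aggregator
# is meta-trained separately with BPTT over a horizon T (not shown here).
M, d = 10, 100
aggregator = RNNAggregator(num_workers=M)
grads_t = torch.randn(d, M)                      # stacked g_W^t (placeholder values)
theta_next, h = aggregator(grads_t)              # h carries the memory to step t+1
```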


Therefore, the objective of distributed learning with the RNN aggregator is to find φ∗:

\phi^* = \arg\min_{\phi} L_{RNN}(\theta^t, x_m, \phi).

Loss Function

A common loss function is the loss of the final iteration T:

L_{RNN}(\theta^t, x_m, \phi) = \mathbb{E}\Big[ \sum_{m \in M} L_m(\theta^t, x_m, \phi) \Big]. \qquad (1)

The objective function in Equation (1) depends only on the parameters in the final iteration. In order to efficiently train the aggregator A_RNN(φ), it is convenient to have an objective that depends on the entire trajectory of optimization for some horizon T. Thus, we employ the same loss function as in previous work [Andrychowicz et al., 2016], which is:

L_{RNN}(\theta^t, x_m, \phi) = \mathbb{E}\Big[ \sum_{m \in M} \sum_{t=1}^{T} L_m(\theta^t, x_m, \phi) \Big].

We minimize the value of L_RNN(θ^t, x_m, φ) using gradient descent on φ. ∂L_RNN/∂φ is estimated based on the computational graph in Figure 2 with Back-propagation Through Time (BPTT). Note that the solid edges in the graph allow the gradients to flow along them, while the dashed edges drop the gradients on them.

Figure 2: The computational graph used for computing the gradient of the RNN aggregator.

Preprocessing and Postprocessing

The magnitudes of the aggregator's inputs and outputs may vary dramatically among the workers because their data shards can be very different. However, neural networks usually work robustly only for inputs and outputs that are neither very small nor very large. Therefore, we rescale the inputs and outputs of the aggregator using suitable constants, which are shared across all iterations and workers, so as to efficiently train a stable aggregator.

In particular, we first develop an algorithm to preprocess the aggregator's inputs. Since the input gradients g_W^t come from several workers, we design a coordinatewise gradient rescaling method to eliminate significant differences among the local gradients.

Recall that the aggregator's input g_W^t is defined as follows:

g_W^t := \{ g_1^t, g_2^t, \cdots, g_M^t \}
       = \begin{pmatrix}
           g_{1,1}^t & g_{2,1}^t & \cdots & g_{M,1}^t \\
           g_{1,2}^t & g_{2,2}^t & \cdots & g_{M,2}^t \\
           \vdots    & \vdots    & \ddots & \vdots    \\
           g_{1,d}^t & g_{2,d}^t & \cdots & g_{M,d}^t
         \end{pmatrix}
       = \{ g_{*,1}^t, g_{*,2}^t, \cdots, g_{*,d}^t \}^T,

where d is the dimension of g_m^t. We conduct the following preprocessing:

A_i = \| g_{*,i}^t \|_2, \qquad g_{m,i}^t := \frac{g_{m,i}^t}{A_i}.

Then we also rescale g_{m,i}^t as:

g_{m,i}^t := \begin{cases} g_{m,i}^t & \text{if } |g_{m,i}^t| \ge e^{-p} \\ e^{-p} & \text{otherwise,} \end{cases}

where p > 0 is a control parameter to disregard very small gradients.

On the other hand, we rescale the output of the aggregator as follows:

\theta_i^{t+1} := \theta_i^{t+1} \cdot A_i.
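To make the rescaling rules above concrete, here is a small NumPy sketch of the preprocessing and postprocessing steps. The zero-norm guard is an added safety check not stated in the paper, and the shapes and names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def preprocess(g_W, p=10.0):
    """Coordinatewise rescaling of the stacked gradients.

    g_W has shape (d, M): row i holds coordinate i's gradient from each of the
    M workers (g_{*,i}^t in the paper's notation). Returns the rescaled
    gradients and the per-coordinate norms A_i, which are needed again by the
    postprocessing step.
    """
    A = np.linalg.norm(g_W, axis=1, keepdims=True)        # A_i = ||g_{*,i}||_2
    A = np.where(A == 0.0, 1.0, A)                        # guard against zero rows
    g = g_W / A                                           # g_{m,i} := g_{m,i} / A_i
    g = np.where(np.abs(g) >= np.exp(-p), g, np.exp(-p))  # disregard tiny magnitudes
    return g, A.squeeze(1)

def postprocess(theta_next, A):
    """Undo the input scaling on the aggregator's output: theta_i := theta_i * A_i."""
    return theta_next * A

# Usage with any aggregator that maps (d, M) inputs to a d-dimensional output.
d, M = 100, 10
g_W = np.random.randn(d, M) * np.logspace(-3, 1, d)[:, None]  # widely varying scales
g_scaled, A = preprocess(g_W)
theta_next = postprocess(g_scaled.mean(axis=1), A)   # averaging stands in for A_RNN
```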


4.3 RNN Aggregator with Additional Loss Information

As discussed in the previous section, in a large-scale distributed machine learning system, some workers may have computing errors. What is worse, adversarial attackers may compromise one or more workers and upload forged, misleading gradients to hamper the learning process. To address these issues, we utilize additional information to improve the efficiency and robustness of the RNN aggregator.

We sample n_r (n_r ≪ n_b^m) data points drawn from a public training dataset, where n_b^m is the training batch size of worker W_m, and then calculate the additional loss information

f_r^m = \frac{1}{n_r} \sum_{i=1}^{n_r} \ell_m(x_i; g_m^t)

for each worker. This additional loss information f_r^m helps assess the descent behavior of the local gradient g_m^t.

In particular, as shown in Figure 3, in the tth iteration, the parameter server S takes the local gradient g_m^t and its corresponding additional loss information as input data from worker W_m to train the RNN aggregator A_RNN.

Figure 3: RNN aggregator with additional loss information.

5 Experiments

In this section, we evaluate the performance of the proposed RNN aggregator and the RNN aggregator with additional loss information (ARNN aggregator).

5.1 Experiments Setup

We conduct experiments on two image classification tasks: handwritten digit classification on the MNIST dataset and object recognition on the CIFAR-10 dataset. We equally divide the training datasets into 10 subsets and assign each subset to an individual worker.

In all experiments, both the RNN aggregators and ARNN aggregators use two-layer LSTMs (or GRUs) with 20 hidden units in each layer, and they are trained with BPTT. The optimization process is performed using ADAM. In addition, we use early stopping to avoid overfitting while training the aggregators. After some fixed number of learning steps, we freeze the aggregator's parameters and evaluate its performance. We pick the best aggregator according to its performance on validation data and report its performance.

5.2 Performance on MNIST and CIFAR-10

In Figure 4, we compare the learning curves of different aggregators, including the RNN aggregator with LSTM units, the RNN aggregator with GRU units, the ARNN aggregator with LSTM units, the ARNN aggregator with GRU units, and the classical averaging aggregator. In the MNIST task, the base network in each worker is a multilayer perceptron (MLP) with one hidden layer of 20 units using a sigmoid activation function. Meanwhile, in the CIFAR-10 task, each worker holds a model including two convolutional layers with max pooling followed by a fully-connected layer. In general, we find that the performance of the LSTM network is similar to that of the GRU network in RNN aggregators, as well as in ARNN aggregators. Therefore, in the following, we only investigate the LSTM-based aggregators. In addition, both the RNN aggregators and ARNN aggregators outperform the current state-of-the-art averaging aggregator, and the ARNN aggregators are more efficient than the RNN aggregators.

Figure 4: Performance among RNN aggregators, ARNN aggregators and the averaging aggregator.

5.3 Generalization Performance

An efficient and effective aggregation method should be able to work on different distributed learning systems. Thus, it is important to consider the generalization performance of the proposed aggregators. Specifically, we take the LSTM-based RNN aggregator as an example and investigate the generalization performance of the proposed scheme in Figure 5.

Generalization Performance on Different Models

We apply the LSTM-based RNN aggregator trained from a system with the previous setting to other distributed systems with different worker models. The modifications of the workers' local models are (1) 40 units: an MLP using one hidden layer of 40 units with the sigmoid activation function, (2) 2 hidden: an MLP using two hidden layers, each of which has 20 units with the sigmoid function, and (3) ReLU: an MLP using one hidden layer of 20 units with the ReLU activation function. As shown in the left plot of Figure 5, the learning curves demonstrate that our learned LSTM-based RNN aggregator works well on these generalization tasks. The RNN aggregator still outperforms the averaging aggregator in (1) and (2), and has competitive performance in (3).

Generalization Performance on Different Datasets

We apply the LSTM-based RNN aggregator trained from a system with the previous setting and transfer it to other datasets. We use two modified datasets, CIFAR-2 and CIFAR-5 (for example, the CIFAR-2 dataset only contains data corresponding to 2 of the labels). The learning curves shown in the right plot of Figure 5 demonstrate that our learned LSTM-based RNN aggregator works well when transferred among disjoint subsets of the data, and it can compete with the classic averaging aggregation rule.

Figure 5: Generalization performance of the LSTM-based RNN aggregator. Left: generalized to different models. Right: generalized to different datasets.

5.4 Fault Resilience

Distributing a computation over a group of nodes may induce a high risk of failure due to unreliable communication and erroneous computation. In this section, we investigate the fault resilience of the RNN aggregators and the improved ARNN aggregators under Byzantine attack, where attackers compromise the workers and actively upload forged gradients [Xie et al., 2018] to jeopardize the distributed learning process.¹

¹In the case of unreliable communications, local gradients may be lost during transmission, which can be considered equivalent to the scenario where the parameter server receives zero gradients. In the case of erroneous computations, the parameter server receives incorrect gradients from the workers. Both cases can be subsumed under Byzantine attacks. Therefore, in this paper, we focus on evaluating the Byzantine attack tolerance of the proposed RNN aggregator and ARNN aggregator.
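For reference, the Krum baseline [Blanchard et al., 2017] used in the comparison below selects the single uploaded gradient whose summed squared distance to its M − f − 2 nearest neighbours is smallest, where f is the number of Byzantine workers the rule is configured to tolerate. A compact sketch follows; the toy check at the end is illustrative and is not the paper's experiment.

```python
import numpy as np

def krum(gradients, f):
    """Krum aggregation [Blanchard et al., 2017]: return the worker gradient
    whose summed squared distance to its M - f - 2 nearest neighbours is
    smallest. `gradients` is a list of M vectors."""
    G = np.stack(gradients)                                          # (M, d)
    M = len(G)
    dists = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)    # (M, M)
    scores = []
    for i in range(M):
        d_i = np.delete(dists[i], i)              # distances to the other workers
        closest = np.sort(d_i)[: M - f - 2]       # M - f - 2 nearest neighbours
        scores.append(closest.sum())
    return G[int(np.argmin(scores))]

# Toy check: 7 honest gradients near +1, 3 Byzantine gradients near -10.
rng = np.random.default_rng(1)
honest = [np.ones(5) + 0.01 * rng.normal(size=5) for _ in range(7)]
byzantine = [-10.0 * np.ones(5) for _ in range(3)]
print(krum(honest + byzantine, f=3))        # stays close to the honest gradients
print(np.mean(honest + byzantine, axis=0))  # plain averaging is pulled off course
```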


Figure 6: Performance comparison among the LSTM aggregator, the averaging aggregator and the Krum aggregator under Byzantine attack.

Figure 7: Performance of the RNN aggregator across different learning models.

We compare our RNN aggregator and ARNN aggregator with the averaging aggregator and the Krum aggregator, a state-of-the-art error tolerant aggregation algorithm [Blanchard et al., 2017], in distributed learning systems with 10 honest workers, with 3 out of 10 Byzantine workers, and with 7 out of 10 Byzantine workers. As shown in Figure 6, the averaging aggregator is unable to tolerate any Byzantine attack. When only a few computing nodes are Byzantine workers, the RNN aggregator and ARNN aggregator have slightly better performance than the Krum aggregator. When Byzantine workers dominate the honest workers, the Krum aggregator and RNN aggregator cannot resist the attackers and the learning system crashes. However, the ARNN aggregator still works well and outperforms Krum by 43.14%. The reason is that Krum aggregates only the gradients it deems reliable and filters out the outlier gradients, so it is fooled by the attackers and filters out the correct gradients when the attackers take the majority. In contrast, the ARNN aggregator utilizes the additional loss information to learn which of the gradients are from the honest participants and aggregates only those.

5.5 System Scalability

In this section, we discuss the performance of the RNN aggregator when collecting gradients from variant local models. We set up 4 distributed learning systems with different types of local models: an MLP with 1 hidden layer of 20 nodes and sigmoid activation function, an MLP with 1 hidden layer of 40 nodes and sigmoid activation function, an MLP with 2 hidden layers of 20 nodes and sigmoid activation function, and an MLP with 1 hidden layer of 20 nodes and ReLU activation function. Each system has 10 workers. We randomly choose 3 workers from each system and send their gradients to the RNN aggregator to update an MLP with 1 hidden layer of 20 nodes and sigmoid activation function.

We demonstrate the learning curves of the 4 distributed learning systems in Figure 7 and find that the RNN aggregator can efficiently utilize gradients from different local models to improve the global learning model's performance. Therefore, our proposed RNN aggregator can easily improve the scalability of distributed machine learning systems by involving workers with different types of local learning models. Meanwhile, the experiment also demonstrates that the RNN aggregator is not only able to coordinate workers in one distributed system, but can also work hierarchically and be used to manage multiple distributed systems. For instance, the RNN aggregator may work as an assistant that utilizes information from other distributed learning systems to improve its own learning performance. Note that the ARNN aggregator would obtain the same performance as the RNN aggregator since it mainly differs in its resilience to Byzantine attackers.

6 Conclusions

In this work, different from traditional linear combination aggregation methods, we attempt an alternative approach, learning to learn, to coordinate the master-slave type of distributed learning process. Specifically, we model the gradient aggregation process at the parameter server as a learning task, which enables us to utilize a recurrent neural network to learn to aggregate the gradients. In addition, we improve the RNN aggregator with additional loss information to address fault tolerance issues, especially Byzantine attacks, in the distributed machine learning system.

The results on the MNIST and CIFAR datasets show that our proposed aggregation methods can outperform the classic averaging aggregation method as well as Krum, the state-of-the-art fault resilience aggregation method. Moreover, the RNN aggregator is the first to utilize gradients from variant models to increase distributed learning performance.

References

[Abadi et al., 2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[Andrychowicz et al., 2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent.


In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[Bengio et al., 1992] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.

[Blanchard et al., 2017] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.

[Chen et al., 2016] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.

[Chen et al., 2017] Xuhui Chen, Jinlong Ji, Kenneth Loparo, and Pan Li. Real-time personalized cardiac arrhythmia detection and diagnosis: A cloud computing architecture. In 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pages 201–204. IEEE, 2017.

[Chen et al., 2018a] Tianyi Chen, Georgios B Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. arXiv preprint arXiv:1805.09965, 2018.

[Chen et al., 2018b] Xuhui Chen, Jinlong Ji, Changqing Luo, Weixian Liao, and Pan Li. When machine learning meets blockchain: A decentralized, privacy-preserving and secure design. In 2018 IEEE International Conference on Big Data (Big Data), pages 1178–1187. IEEE, 2018.

[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[Dean et al., 2012] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[Ge et al., 2017] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[Hamm et al., 2016] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Houthooft et al., 2018] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.

[Konečný et al., 2015] Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

[Konečný et al., 2016] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[Lian et al., 2015] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[Maclaurin et al., 2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.

[Nedic and Ozdaglar, 2009] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[Polyak and Juditsky, 1992] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[Rajkumar and Agarwal, 2012] Arun Rajkumar and Shivani Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In Artificial Intelligence and Statistics, pages 933–941, 2012.

[Scardovi and Leonard, 2009] Luca Scardovi and Naomi Ehrich Leonard. Robustness of aggregation in networked dynamical systems. In Second International Conference on Robot Communication and Coordination (ROBOCOMM '09), pages 1–6. IEEE, 2009.

[Thrun and Pratt, 1998] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.

[Tsitsiklis et al., 1986] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[Ward, 1937] Lewis B Ward. Reminiscence and rote learning. Psychological Monographs, 49(4):i, 1937.

[Xie et al., 2018] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Zeno: Byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032, 2018.

[Zhang et al., 2015] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
