
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Learning to Learn Gradient Aggregation by Gradient Descent

Jinlong Ji1, Xuhui Chen1,2, Qianlong Wang1, Lixing Yu1 and Pan Li1∗
1Case Western Reserve University
2Kent State University
{jxj405, qxw204, lxy257, pxl288}@case.edu, [email protected]
∗Contact Author

Abstract

In the big data era, distributed machine learning emerges as an important learning paradigm to mine large volumes of data by taking advantage of distributed computing resources. In this paper, motivated by learning to learn, we propose a meta-learning approach to coordinate the learning process in the master-slave type of distributed systems. Specifically, we utilize a recurrent neural network (RNN) in the parameter server (the master) to learn to aggregate the gradients from the workers (the slaves). We design a coordinatewise preprocessing and postprocessing method to make the neural network based aggregator more robust. Besides, to address fault tolerance, especially Byzantine attacks, in distributed machine learning systems, we propose an RNN aggregator with additional loss information (ARNN) to improve the system resilience. We conduct extensive experiments to demonstrate the effectiveness of the RNN aggregator, and also show that it can be easily generalized and achieves remarkable performance when transferred to other distributed systems. Moreover, under majoritarian Byzantine attacks, the ARNN aggregator outperforms Krum, the state-of-the-art fault tolerance aggregation method, by 43.14%. In addition, our RNN aggregator enables the server to aggregate gradients from variant local models, which significantly improves the scalability of distributed learning.

1 Introduction

With the onset of the big data era, governments, companies and research institutions for the first time have the opportunity to gain deep insight into their problems and develop better solutions by leveraging massive available data [Dean et al., 2012; Chen et al., 2017]. However, developing machine learning algorithms to analyze a huge amount of data usually requires intensive computing resources, which may not be provided by any single user. Moreover, the massive data are typically generated by multiple parties and stored in a geographically distributed manner. Such scenarios with distributed computing and datasets have spurred the study of distributed machine learning techniques, such as multi-agent optimization [Nedic and Ozdaglar, 2009], multi-party learning [Hamm et al., 2016], and federated learning [Konečný et al., 2016].

The most typical distributed machine learning paradigm is the master-slave pattern, which has been adopted in many industrial-level applications [Abadi et al., 2016; Chen et al., 2018b]. Particularly, each worker (the slave) in the system is a computing node, which has access to its own share of the data. All workers learn in parallel and only send their local model information (e.g., gradients of the model) to a parameter server, which can significantly reduce the communication cost, while the parameter server (the master) aggregates all the received information with hand-designed methods so as to update the global model. This training procedure is applicable to a rich diversity of machine learning algorithms, such as linear regression/classification, support vector machines and deep neural networks.

As mentioned above, the aggregation method plays a vital role in the distributed learning process. So far, the aggregation methods are linear combinations like averaging [Polyak and Juditsky, 1992] and its variants [Zhang et al., 2015; Lian et al., 2015]. However, it still requires study on how to make the aggregators more robust in distributed learning. Particularly, distributed systems are vulnerable to computing errors from the workers, especially when adversarial attackers compromise one or more computing nodes to launch Byzantine attacks. In such scenarios, the distributed learning process may not converge and the system may crash [Blanchard et al., 2017]. In addition, traditionally, in master-slave type distributed machine learning systems, the linear combination aggregator in the parameter server can only work when all workers hold the same local model. However, in many distributed systems, the workers may vary in computing power and memory, and hence can only afford models of different complexities. Therefore, the current aggregators impede the large-scale implementation of distributed machine learning.

In this paper, we take the learning to learn approach to coordinate the distributed learning among the workers. Particularly, learning to learn is a meta-learning strategy [Thrun and Pratt, 1998], where the abstract knowledge extracted from the learning (or optimization) process can enable fast and efficient learning.

We develop a deep meta-learning aggregation method for distributed machine learning. Specifically, we model the aggregation process in distributed learning as a machine learning task. Then we use a recurrent neural network (RNN) to aggregate the gradients collected from the workers. To learn this neural network based aggregator, we choose to optimize it along the horizontal direction (i.e., the number of steps) of the distributed learning process. To make the aggregator robust, we design a coordinatewise preprocessing and postprocessing method. Once the aggregator is well trained, it makes decisions on how to aggregate the scattered local gradients based on the information learned from its previous optimization process. Moreover, in order to address Byzantine attacks, we improve the RNN aggregator with additional loss information to make the system robust and stable.

The contributions of our work are summarized as follows:

• To the best of our knowledge, this is the first study proposing a learning to learn approach to coordinate a distributed learning process. Instead of adopting a pre-defined deterministic aggregation rule, we model the aggregating process as a learning problem and utilize a recurrent neural network to optimize it. In addition, we design a preprocessing and postprocessing method to enhance the stability of the optimization process. We also investigate the robustness and stability of distributed learning, and propose an improved RNN aggregator with additional loss information to address fault tolerance issues, especially Byzantine attacks.

• The experiment results demonstrate that the proposed RNN aggregators outperform the classical averaging aggregator by a large margin on both the MNIST and the CIFAR-10 datasets. Our experiments also demonstrate that the proposed aggregators can be easily generalized and achieve remarkable performance when transferred to other distributed systems with different learning models and different datasets, respectively. In addition, compared to the state-of-the-art fault tolerance algorithm Krum, the proposed ARNN aggregator shows competitive performance and outperforms Krum under majoritarian Byzantine attacks.

• The empirical results show that the proposed scheme enables the aggregator to utilize abstract information across variant local learning models, which significantly improves the scalability of the traditional distributed learning system.

The rest of the paper is organized as follows. In the next section, we briefly introduce the most related works. Then we detail the design of the recurrent neural network based aggregator and its variant, the RNN aggregator with additional loss information. After that, we evaluate the performance of the proposed RNN aggregator through extensive experiments. Finally, we conclude the paper.

2 Related Works

2.1 Learning to Learn

The idea of learning to learn was inspired by psychology studies [Ward, 1937], where researchers discovered that human beings are born able to learn new skills via previous learning experiences. The idea of learning to learn has been widely explored in various machine learning systems such as meta-learning, life-long learning and continuous learning, as well as many optimization problems, including learning to optimize hyperparameters [Maclaurin et al., 2015], learning to optimize neural networks [Ge et al., 2017] and learning to optimize loss functions [Houthooft et al., 2018].

Particularly, learning to learn provides an efficient approach to designing learning systems through learning. For instance, Bengio et al. proposed learning local Hebbian updates by gradient descent [Bengio et al., 1992], which utilizes supervised learning to learn unsupervised learning algorithms. Moreover, learning to learn by gradient descent by gradient descent [Andrychowicz et al., 2016] and learning to learn without gradient descent by gradient descent [Chen et al., 2016] employ supervised learning at the meta level to learn supervised learning algorithms and Bayesian optimization algorithms, respectively. In this paper, motivated by these previous works, we utilize supervised learning at the meta level to learn an aggregation method for distributed machine learning.

2.2 Aggregation Methods

The study of aggregation methods has drawn intensive attention from researchers in machine learning. For example, some researchers study the aggregation of gradients with reduced communication and computation costs [Chen et al., 2018a], and some others focus on accelerating the convergence process [Tsitsiklis et al., 1986]. There are also some works that develop aggregation schemes aiming to improve the stability and robustness of the learning process [Scardovi and Leonard, 2009]. Note that most aggregation methods are pre-defined and deterministic, and hence are difficult to generalize. They are also generally sensitive to errors in the learning system.

In this work, we propose a deep meta-learning system, i.e., an RNN aggregator, that can achieve better optimization performance compared with existing deterministic approaches like the averaging aggregator proposed by [Konečný et al., 2015]. We also address the fault tolerance issue, especially Byzantine attacks, in the proposed RNN aggregator. Moreover, our proposed aggregation method can work across different local models, which enables distributed learning systems to scale.

3 Distributed Machine Learning Model

In this section, we present the master-slave type of distributed learning system and introduce a general learning algorithm based on distributed gradient descent.

As shown in Figure 1, a distributed learning system is composed of a parameter server S and a group of workers {W1, W2, ···, WM}. Distributed machine learning can account for a variety of learning problems, including federated learning [Konečný et al., 2016] and multi-party learning [Hamm et al., 2016], where each worker Wm can only access its own dataset xm and the workers train in parallel on a large-scale dataset; that is, each worker Wm, an individual computing unit, is associated with a specific shard of data xm.


Figure 1: The architecture of the master-slave type of distributed learning system.

The objective of distributed machine learning is to let the M workers solve the following problem in a distributed and collaborative fashion:

\min_{\theta \in \mathbb{R}^d} L(\theta) := \sum_{m \in M} L_m(\theta, x_m),

where θ ∈ R^d is the shared global model, and L and {L_m, m ∈ M} are objective functions with M := {1, 2, ···, M}. Specifically, L_m is computed by the worker W_m as follows:

L_m(\theta, x_m) := \sum_{x_i \in x_m} \ell_m(\theta, x_i),

where ℓ_m is the loss (e.g., hinge loss or cross-entropy loss) and x_m is the data shard that the worker W_m can access. Generally, a learning model is determined by the structure of the model and the corresponding parameters. In the following, we will slightly abuse the notation of θ as the parameter vector of the model.

Note that most learning algorithms are based on a common optimization method called stochastic gradient descent (SGD). Here, we introduce the distributed SGD algorithm for distributed learning systems [Rajkumar and Agarwal, 2012]. Particularly, all the workers jointly learn a common model iteratively. In the tth iteration, the parameter server S broadcasts the current global model θ^t to the workers. Then, each worker W_m computes its local gradient based on its own data shard and uploads it to the server S. Once S receives the gradients from all the workers, it updates the common global model by aggregating all the gradients via an aggregation function A. We summarize the distributed SGD algorithm by two main steps in each iteration t:

• Local gradient calculation: each worker W_m calculates the local gradient g_m^t based on the current global model θ^t and its local data shard x_m, i.e.,

g_m^t = \frac{\partial}{\partial \theta^t} L_m(\theta^t, x_m).

• Global gradients aggregation: the server S aggregates the gradients received from all the workers and updates the global model:

\theta^{t+1} = \theta^t - \gamma_t \cdot A(g_1^t, g_2^t, \cdots, g_M^t),

where γ_t is the step size.

Noticeably, in the literature, the aggregation rule A is typically averaging [Polyak and Juditsky, 1992] and its variants [Zhang et al., 2015; Lian et al., 2015; Tsitsiklis et al., 1986]. A minimal sketch of this baseline procedure is given below.
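To make the baseline concrete, the following is a small Python sketch (not the authors' code) of one round of distributed SGD with the averaging rule A(·) = mean. The least-squares loss, the synthetic shards and the constants are placeholders standing in for the generic L_m and the workers' data.

```python
import numpy as np

def local_gradient(theta, shard_X, shard_y):
    """Worker-side step: gradient of a least-squares loss on the local shard.
    The quadratic loss is only a placeholder for the generic L_m."""
    residual = shard_X @ theta - shard_y
    return shard_X.T @ residual / len(shard_y)

def average_aggregator(gradients):
    """Baseline rule A: coordinatewise mean of the workers' gradients."""
    return np.mean(gradients, axis=0)

# Toy setup: M workers, each holding one shard of synthetic data.
rng = np.random.default_rng(0)
M, d, n = 10, 5, 100
shards = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(M)]
theta = np.zeros(d)            # shared global model
step_size = 0.1                # gamma_t

for t in range(50):
    grads = [local_gradient(theta, X, y) for X, y in shards]   # uploaded g_m^t
    theta = theta - step_size * average_aggregator(grads)      # server update
```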

4 Learning to Learn Gradient Aggregation by Gradient Descent

In this section, we model the gradient aggregation in distributed machine learning as a learning problem and develop a deep neural network aggregator.

4.1 Recurrent Neural Network Based Aggregator

Motivated by the idea of learning to learn, instead of hand-designed aggregation methods, our goal is to develop a learning based aggregator that is able to learn at the meta level to aggregate the information received from the workers in distributed learning.

To achieve this goal, we consider directly parameterizing the aggregation process as a learning model A with φ being its parameter vector. The aggregator A learns from the parameter server's aggregation experiences and updates itself. Thus, in the tth iteration, the aggregator works as follows:

1. The aggregator A at the parameter server S takes the uploaded gradients (g_1^t, ···, g_m^t, ···, g_M^t) as input, which we denote by g_W^t.

2. Given the internally stored hidden knowledge h_t and the current parameter vector φ, which was learned from the past learning experiences, A(φ, g_W^t) outputs its predicted global model θ^{t+1}.

3. The aggregator A updates its internally stored hidden state to obtain h_{t+1}.

We summarize the process in the tth iteration as follows:

h_{t+1}, \theta^{t+1} = A(h_t, g_W^t, \phi).

We notice that this procedure can easily be mapped onto a recurrent neural network (RNN), which learns to utilize its internal memory to store information about previous processes and function evaluations, and learns to access this memory to produce the current output. To address the gradient vanishing or exploding issues when dealing with long sequential data, in this paper we apply two popular and efficient variants of the standard RNN: the RNN with Long Short-Term Memory units (LSTM) [Hochreiter and Schmidhuber, 1997] and the RNN with Gated Recurrent Units (GRU) [Chung et al., 2014].

Subsequently, we can formulate the proposed aggregation procedure as:

h_{t+1}, \theta^{t+1} = A_{RNN}(h_t, g_W^t, \phi),

where A_RNN(·) represents the aggregation method at the parameter server S, and φ represents all the parameters in the RNN network.

4.2 Distributed Learning with an RNN Aggregator

With the RNN aggregator A_RNN, the distributed learning process discussed before works as follows:

• Local gradient calculation: g_m^t = \frac{\partial}{\partial \theta^t} L_m(\theta^t, x_m).

• Global gradients aggregation: h_{t+1}, \theta^{t+1} = A_{RNN}(h_t, g_W^t, \phi).
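The paper does not give implementation code for A_RNN; the following PyTorch-style sketch shows one way the server-side update could be wired, under the assumption that the LSTM is applied coordinatewise (each model coordinate is treated as a batch element whose input features are the M workers' values of that coordinate) and that a linear head maps the hidden state to the predicted θ_i^{t+1}. The layer sizes follow the experimental setup (two layers, 20 hidden units); everything else is illustrative.

```python
import torch
import torch.nn as nn

class RNNAggregator(nn.Module):
    """Sketch of a learned aggregator: a 2-layer LSTM applied per coordinate.

    For each coordinate i of the global model, the input at iteration t is the
    vector of that coordinate's gradients from the M workers; the output is the
    predicted value of theta_i at t+1. The input formatting and output head
    are assumptions, not the authors' exact architecture.
    """

    def __init__(self, num_workers, hidden_size=20, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_workers, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # maps hidden state to theta_i

    def forward(self, grads, hidden=None):
        # grads: (d, M) -- coordinate i's gradients from the M workers.
        inp = grads.unsqueeze(1)                 # (d, seq_len=1, M)
        out, hidden = self.lstm(inp, hidden)     # coordinates act as a batch
        theta_next = self.head(out.squeeze(1)).squeeze(-1)   # (d,)
        return theta_next, hidden

# One server-side aggregation step. The parameter vector phi of the aggregator
# is meta-trained separately with BPTT over a horizon T (not shown here).
M, d = 10, 100
aggregator = RNNAggregator(num_workers=M)
grads_t = torch.randn(d, M)                      # stacked g_W^t (placeholder values)
theta_next, h = aggregator(grads_t)              # h carries the memory to step t+1
```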


Therefore, the objective of distributed learning with the RNN aggregator is to find φ∗:

\phi^* = \arg\min_{\phi} L_{RNN}(\theta^t, x_m, \phi).

Loss Function

A common loss function is the loss of the final iteration T:

L_{RNN}(\theta^t, x_m, \phi) = \mathbb{E}\Big[ \sum_{m \in M} L_m(\theta^t, x_m, \phi) \Big]. \qquad (1)

The objective function in Equation (1) depends only on the parameters in the final iteration. In order to efficiently train the aggregator A_RNN(φ), it is convenient to have an objective that depends on the entire trajectory of optimization for some horizon T. Thus, we employ the same loss function as in previous work [Andrychowicz et al., 2016], which is:

L_{RNN}(\theta^t, x_m, \phi) = \mathbb{E}\Big[ \sum_{m \in M} \sum_{t=1}^{T} L_m(\theta^t, x_m, \phi) \Big].

We minimize the value of L_RNN(θ^t, x_m, φ) using gradient descent on φ. ∂L_RNN/∂φ is estimated based on the computational graph in Figure 2 with Back-propagation Through Time (BPTT). Note that the solid edges in the graph allow the gradients to flow along them, while the dashed edges drop the gradients on them.

Figure 2: The computational graph used for computing the gradient of the RNN aggregator.

Preprocessing and Postprocessing

The magnitudes of the aggregator's inputs and outputs may vary dramatically among the workers because their data shards can be very different. However, neural networks usually work robustly only for inputs and outputs that are neither very small nor very large. Therefore, we rescale the inputs and outputs of the aggregator using suitable constants, which are shared across all iterations and workers, so as to efficiently train a stable aggregator.

In particular, we first develop an algorithm to preprocess the aggregator's inputs. Since the input gradients g_W^t come from several workers, we design a coordinatewise gradient rescaling method to eliminate significant differences among the local gradients.

Recall that the aggregator's input g_W^t is defined as follows:

g_W^t := \{ g_1^t, g_2^t, \cdots, g_M^t \}
       = \begin{pmatrix}
           g_{1,1}^t & g_{2,1}^t & \cdots & g_{M,1}^t \\
           g_{1,2}^t & g_{2,2}^t & \cdots & g_{M,2}^t \\
           \vdots    & \vdots    & \ddots & \vdots    \\
           g_{1,d}^t & g_{2,d}^t & \cdots & g_{M,d}^t
         \end{pmatrix}
       = \{ g_{*,1}^t, g_{*,2}^t, \cdots, g_{*,d}^t \}^T,

where d is the dimension of g_m^t. We conduct the following preprocessing:

A_i = \| g_{*,i}^t \|_2, \qquad g_{m,i}^t := \frac{g_{m,i}^t}{A_i}.

Then we also rescale g_{m,i}^t as:

g_{m,i}^t := \begin{cases} g_{m,i}^t & \text{if } |g_{m,i}^t| \ge e^{-p} \\ e^{-p} & \text{otherwise,} \end{cases}

where p > 0 is a control parameter to disregard very small gradients.

On the other hand, we rescale the output of the aggregator as follows:

\theta_i^{t+1} := \theta_i^{t+1} \cdot A_i.
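To make the rescaling rules above concrete, here is a small NumPy sketch of the preprocessing and postprocessing steps. The zero-norm guard is an added safety check not stated in the paper, and the shapes and names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def preprocess(g_W, p=10.0):
    """Coordinatewise rescaling of the stacked gradients.

    g_W has shape (d, M): row i holds coordinate i's gradient from each of the
    M workers (g_{*,i}^t in the paper's notation). Returns the rescaled
    gradients and the per-coordinate norms A_i, which are needed again by the
    postprocessing step.
    """
    A = np.linalg.norm(g_W, axis=1, keepdims=True)        # A_i = ||g_{*,i}||_2
    A = np.where(A == 0.0, 1.0, A)                        # guard against zero rows
    g = g_W / A                                           # g_{m,i} := g_{m,i} / A_i
    g = np.where(np.abs(g) >= np.exp(-p), g, np.exp(-p))  # disregard tiny magnitudes
    return g, A.squeeze(1)

def postprocess(theta_next, A):
    """Undo the input scaling on the aggregator's output: theta_i := theta_i * A_i."""
    return theta_next * A

# Usage with any aggregator that maps (d, M) inputs to a d-dimensional output.
d, M = 100, 10
g_W = np.random.randn(d, M) * np.logspace(-3, 1, d)[:, None]  # widely varying scales
g_scaled, A = preprocess(g_W)
theta_next = postprocess(g_scaled.mean(axis=1), A)   # averaging stands in for A_RNN
```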


4.3 RNN Aggregator with Additional Loss Information

As discussed in the previous section, in a large-scale distributed machine learning system, some workers may have computing errors. What is worse, adversarial attackers may compromise one or more workers and upload forged, misleading gradients to hamper the learning process. To address these issues, we utilize additional information to improve the efficiency and robustness of the RNN aggregator.

We sample n_r (n_r ≪ n_b^m) data points drawn from a public training dataset, where n_b^m is the training batch size of worker W_m, and then calculate the additional loss information

f_r^m = \frac{1}{n_r} \sum_{i=1}^{n_r} \ell_m(x_i; g_m^t)

for each worker. This additional loss information f_r^m helps assess the descent behavior of the local gradient g_m^t.

In particular, as shown in Figure 3, in the tth iteration, the parameter server S takes the local gradient g_m^t and its corresponding additional loss information as input data from worker W_m to train the RNN aggregator A_RNN.

Figure 3: RNN aggregator with additional loss information.

5 Experiments

In this section, we evaluate the performance of the proposed RNN aggregator and the RNN aggregator with additional loss information (ARNN aggregator).

5.1 Experiments Setup

We conduct experiments on two image classification tasks: handwritten digit classification on the MNIST dataset and object recognition on the CIFAR-10 dataset. We equally divide the training datasets into 10 subsets and assign each subset to an individual worker.

In all experiments, both the RNN aggregators and ARNN aggregators use two-layer LSTMs (or GRUs) with 20 hidden units in each layer, and they are trained with BPTT. The optimization process is performed using ADAM. In addition, we use early stopping to avoid overfitting while training the aggregators. After some fixed number of learning steps, we freeze the aggregator's parameters and evaluate its performance. We pick the best aggregator according to its performance on validation data and report its performance.

5.2 Performance on MNIST and CIFAR-10

In Figure 4, we compare the learning curves of different aggregators, including the RNN aggregator with LSTM units, the RNN aggregator with GRU units, the ARNN aggregator with LSTM units, the ARNN aggregator with GRU units, and the classical averaging aggregator. In the MNIST task, the base network in each worker is a multilayer perceptron (MLP) with one hidden layer of 20 units using a sigmoid activation function. Meanwhile, in the CIFAR-10 task, each worker holds a model including two convolutional layers with max pooling followed by a fully-connected layer. In general, we find that the performance of the LSTM network is similar to that of the GRU network in RNN aggregators, as well as in ARNN aggregators. Therefore, in the following, we only investigate the LSTM-based aggregators. In addition, both the RNN aggregators and ARNN aggregators outperform the current state-of-the-art averaging aggregator, and the ARNN aggregators are more efficient than the RNN aggregators.

Figure 4: Performance among RNN aggregators, ARNN aggregators and the averaging aggregator.

5.3 Generalization Performance

An efficient and effective aggregation method should be able to work on different distributed learning systems. Thus, it is important to consider the generalization performance of the proposed aggregators. Specifically, we take the LSTM-based RNN aggregator as an example and investigate the generalization performance of the proposed scheme in Figure 5.

Generalization Performance on Different Models

We apply the LSTM-based RNN aggregator trained from a system with the previous setting to other distributed systems with different worker models. The modifications of the workers' local models are (1) 40 units: an MLP using one hidden layer of 40 units with the sigmoid activation function, (2) 2 hidden: an MLP using two hidden layers, each of which has 20 units with the sigmoid function, and (3) ReLU: an MLP using one hidden layer of 20 units with the ReLU activation function. As shown in the left plot of Figure 5, the learning curves demonstrate that our learned LSTM-based RNN aggregator works well on these generalization tasks. The RNN aggregator still outperforms the averaging aggregator in (1) and (2), and has competitive performance in (3).

Generalization Performance on Different Datasets

We apply the LSTM-based RNN aggregator trained from a system with the previous setting and transfer it to other datasets. We use two modified datasets, CIFAR-2 and CIFAR-5 (for example, the CIFAR-2 dataset only contains data corresponding to 2 of the labels). The learning curves shown in the right plot of Figure 5 demonstrate that our learned LSTM-based RNN aggregator works well when transferred among disjoint subsets of the data, and it can compete with the classic averaging aggregation rule.

Figure 5: Generalization performance of the LSTM-based RNN aggregator. Left: generalized to different models. Right: generalized to different datasets.

5.4 Fault Resilience

Distributing a computation over a group of nodes may induce a high risk of failure due to unreliable communication and erroneous computation. In this section, we investigate the fault resilience of the RNN aggregators and the improved ARNN aggregators under Byzantine attack, where attackers compromise the workers and actively upload forged gradients [Xie et al., 2018] to jeopardize the distributed learning process.¹

¹In the case of unreliable communications, local gradients may be lost during transmission, which can be considered equivalent to the scenario where the parameter server receives zero gradients. In the case of erroneous computations, the parameter server receives incorrect gradients from the workers. Both cases can be subsumed under Byzantine attacks. Therefore, in this paper, we focus on evaluating the Byzantine attack tolerance of the proposed RNN aggregator and ARNN aggregator.
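For reference, the Krum baseline [Blanchard et al., 2017] used in the comparison below selects the single uploaded gradient whose summed squared distance to its M − f − 2 nearest neighbours is smallest, where f is the number of Byzantine workers the rule is configured to tolerate. A compact sketch follows; the toy check at the end is illustrative and is not the paper's experiment.

```python
import numpy as np

def krum(gradients, f):
    """Krum aggregation [Blanchard et al., 2017]: return the worker gradient
    whose summed squared distance to its M - f - 2 nearest neighbours is
    smallest. `gradients` is a list of M vectors."""
    G = np.stack(gradients)                                          # (M, d)
    M = len(G)
    dists = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)    # (M, M)
    scores = []
    for i in range(M):
        d_i = np.delete(dists[i], i)              # distances to the other workers
        closest = np.sort(d_i)[: M - f - 2]       # M - f - 2 nearest neighbours
        scores.append(closest.sum())
    return G[int(np.argmin(scores))]

# Toy check: 7 honest gradients near +1, 3 Byzantine gradients near -10.
rng = np.random.default_rng(1)
honest = [np.ones(5) + 0.01 * rng.normal(size=5) for _ in range(7)]
byzantine = [-10.0 * np.ones(5) for _ in range(3)]
print(krum(honest + byzantine, f=3))        # stays close to the honest gradients
print(np.mean(honest + byzantine, axis=0))  # plain averaging is pulled off course
```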


Figure 6: Performance comparison among the LSTM aggregator, the averaging aggregator and the Krum aggregator under Byzantine attack.

Figure 7: Performance of the RNN aggregator across different learning models.

We compare our RNN aggregator and ARNN aggregator with the averaging aggregator and the Krum aggregator, a state-of-the-art error tolerant aggregation algorithm [Blanchard et al., 2017], in distributed learning systems with 10 honest workers, with 3 out of 10 Byzantine workers, and with 7 out of 10 Byzantine workers. As shown in Figure 6, the averaging aggregator is unable to tolerate any Byzantine attack. When only a few computing nodes are Byzantine workers, the RNN aggregator and ARNN aggregator have slightly better performance than the Krum aggregator. When Byzantine workers dominate the honest workers, the Krum aggregator and RNN aggregator cannot resist the attackers and the learning system crashes. However, the ARNN aggregator still works well and outperforms Krum by 43.14%. The reason is that Krum aggregates only the gradients it deems reliable and filters out the outlier gradients, so it is fooled by the attackers and filters out the correct gradients when the attackers take the majority. In contrast, the ARNN aggregator utilizes the additional loss information to learn which of the gradients are from the honest participants and aggregates only those.

5.5 System Scalability

In this section, we discuss the performance of the RNN aggregator when collecting gradients from variant local models. We set up 4 distributed learning systems with different types of local models: an MLP with 1 hidden layer of 20 nodes and sigmoid activation function, an MLP with 1 hidden layer of 40 nodes and sigmoid activation function, an MLP with 2 hidden layers of 20 nodes and sigmoid activation function, and an MLP with 1 hidden layer of 20 nodes and ReLU activation function. Each system has 10 workers. We randomly choose 3 workers from each system and send their gradients to the RNN aggregator to update an MLP with 1 hidden layer of 20 nodes and sigmoid activation function.

We demonstrate the learning curves of the 4 distributed learning systems in Figure 7 and find that the RNN aggregator can efficiently utilize gradients from different local models to improve the global learning model's performance. Therefore, our proposed RNN aggregator can easily improve the scalability of distributed machine learning systems by involving workers with different types of local learning models. Meanwhile, the experiment also demonstrates that the RNN aggregator is not only able to coordinate workers in one distributed system, but can also work hierarchically and be used to manage multiple distributed systems. For instance, the RNN aggregator may work as an assistant that utilizes information from other distributed learning systems to improve its own learning performance. Note that the ARNN aggregator would obtain the same performance as the RNN aggregator since it mainly differs in its resilience to Byzantine attackers.

6 Conclusions

In this work, different from traditional linear combination aggregation methods, we attempt an alternative approach, learning to learn, to coordinate the master-slave type of distributed learning process. Specifically, we model the gradient aggregation process at the parameter server as a learning task, which enables us to utilize a recurrent neural network to learn to aggregate the gradients. In addition, we improve the RNN aggregator with additional loss information to address fault tolerance issues, especially Byzantine attacks, in the distributed machine learning system.

The results on the MNIST and CIFAR datasets show that our proposed aggregation methods can outperform the classic averaging aggregation method as well as Krum, the state-of-the-art fault resilience aggregation method. Moreover, the RNN aggregator is the first to utilize gradients from variant models to increase distributed learning performance.

References

[Abadi et al., 2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[Andrychowicz et al., 2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent.


In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[Bengio et al., 1992] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.

[Blanchard et al., 2017] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.

[Chen et al., 2016] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.

[Chen et al., 2017] Xuhui Chen, Jinlong Ji, Kenneth Loparo, and Pan Li. Real-time personalized cardiac arrhythmia detection and diagnosis: A cloud computing architecture. In 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pages 201–204. IEEE, 2017.

[Chen et al., 2018a] Tianyi Chen, Georgios B Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. arXiv preprint arXiv:1805.09965, 2018.

[Chen et al., 2018b] Xuhui Chen, Jinlong Ji, Changqing Luo, Weixian Liao, and Pan Li. When machine learning meets blockchain: A decentralized, privacy-preserving and secure design. In 2018 IEEE International Conference on Big Data (Big Data), pages 1178–1187. IEEE, 2018.

[Chung et al., 2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[Dean et al., 2012] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[Ge et al., 2017] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[Hamm et al., 2016] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Houthooft et al., 2018] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.

[Konečný et al., 2015] Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.

[Konečný et al., 2016] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[Lian et al., 2015] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

[Maclaurin et al., 2015] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.

[Nedic and Ozdaglar, 2009] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[Polyak and Juditsky, 1992] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[Rajkumar and Agarwal, 2012] Arun Rajkumar and Shivani Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In Artificial Intelligence and Statistics, pages 933–941, 2012.

[Scardovi and Leonard, 2009] Luca Scardovi and Naomi Ehrich Leonard. Robustness of aggregation in networked dynamical systems. In Second International Conference on Robot Communication and Coordination (ROBOCOMM '09), pages 1–6. IEEE, 2009.

[Thrun and Pratt, 1998] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.

[Tsitsiklis et al., 1986] John Tsitsiklis, Dimitri Bertsekas, and Michael Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

[Ward, 1937] Lewis B Ward. Reminiscence and rote learning. Psychological Monographs, 49(4):i, 1937.

[Xie et al., 2018] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Zeno: Byzantine-suspicious stochastic gradient descent. arXiv preprint arXiv:1805.10032, 2018.

[Zhang et al., 2015] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
