Arxiv:2005.04828V3 [Cs.LG] 17 Aug 2020 Machine Learning on Big Data Is an Extremely Rnns in H

Pretraining Federated Text Models for Next Word Prediction Joel Stremmel∗ Arjun Singh∗ University of Washington University of Washington Seattle, WA 98105 Seattle, WA 98105 [email protected] [email protected] Abstract conduct federated learning experiments on data grouped by clients but never aggregated (Bonawitz Federated learning is a decentralized approach et al., 2019). We build on the existing body of fed- for training models on distributed devices, by summarizing local changes and sending ag- erated learning experiments, focusing on enhanc- gregate parameters from local models to the ing accuracy and reducing the required number of cloud rather than the data itself. In this re- training rounds for federated text models for NWP search we employ the idea of transfer learning through a variety of pretraining approaches. to federated training for next word prediction (NWP) and conduct a number of experiments demonstrating enhancements to current base- 2 Related Work lines for which federated NWP models have been successful. Specifically, we compare fed- This research builds on related work in language erated training baselines from randomly initial- modeling and federated learning to demonstrate ized models to various combinations of pre- the benefits of pretraining language models in the training approaches including pretrained word federated setting which are designed for next word embeddings and whole model pretraining followed by federated fine-tuning for NWP on prediction. We use an LSTM-RNN language model a dataset of Stack Overflow posts. We real- as in Jing and Xu(2019) for NWP, and take inspi- ize lift in performance using pretrained em- ration from the shallow LSTM (Hochreiter and beddings without exacerbating the number of Schmidhuber, 1997) architecture with 10M param- required training rounds or memory footprint. eters from Melis et al.(2017). We also observe notable differences using centrally pretrained networks, especially depend- ing on the datasets used. Our research of- For federated model training, we use the Feder- fers effective, yet inexpensive, improvements ated Averaging Algorithm from H. Brendan McMa- to federated NWP and paves the way for more han(2017) which averages model parameters after rigorous experimentation of transfer learning applying gradient updates to local models based techniques for federated learning. on individual client datasets. Our network is di- rectly comparable to the network architecture used 1 Introduction by Reddi et al.(2020) 1 and similar to the federated arXiv:2005.04828v3 [cs.LG] 17 Aug 2020 Machine learning on big data is an extremely RNNs in H. Brendan McMahan(2017) and Hard popular and useful field of research and develop- et al.(2018) in that we train on a dataset split by ment. However, there are a variety of limitations to clients. In this case, client datasets are collections centrally aggregating data, such as compromised of posts from Stack Overflow users, and we apply user privacy, single point of failure security risks, our language model to predict the next word in a and the maintenance of often expensive hardware given Stack Overflow post, similar to predicting and compute resources. Federated learning aims the next word of a text message as in Hard et al. to address this and has exhibited promising re- (2018). sults for text completion tasks on mobile devices (Hard et al., 2018). The Tensorflow Federated API 1https://github.com/tensorflow/ provides methods to train federated models and federated/tree/master/tensorflow_ federated/python/research/optimization/ ∗Authors contributed equally to this work. stackoverflow 3 Enhancing Federated Text Models with For the task of model pretraining, we also lever- Pretraining Methods age the collected works of Shakespeare (as in the RNN from H. Brendan McMahan(2017)) from We apply three enhancements to federated training Project Gutenberg released under the Project Guten- of our LSTM-RNN language model, demonstrat- berg license (Shakespeare). We download the full ing increased top-1 accuracy with fewer required text of these collected works totaling 124,788 lines. training rounds. Our enhancements include: 1. Central pretraining followed by federated fine- 5 Model Design tuning. In this study, we train a variety of small and large 2. Using a pretrained word embedding layer in- neural networks with four layers each as in table1. stead of randomly initialized embeddings dur- ing federated training. The output layer represents the top 10,000 most frequently occurring vocab words in the Stack 3. Combining centralized model pretraining and Overflow dataset plus four special tokens used dur- pretrained word embeddings with federated ing training denoting: padding, beginning of a sen- fine-tuning. tence, end of a sentence, and out of vocabulary. The following sections detail the methods we We report accuracy with and without these tokens apply to achieve these enhancements as well as included. our experimental results. All code for this research We train both networks using the Adam opti- is freely available under the MIT license in our mizer and Sparse Categorical Cross Entropy loss GitHub repository2. for batches of size 16 and compare train and valida- 4 Data tion accuracy at each training round for 800 training rounds by sampling 10 non-IID client datasets The main dataset used for these experiments is per round, though we run some initial tests with 500 hosted by Kaggle and made available through the training rounds and a final test with 1,500. Each tff.simulation.datasets module in the Tensorflow client dataset has 5,000 text samples from Stack Federated API (Bonawitz et al., 2019). Stack Over- Overflow at maximum, and a total of 20,000 val- flow owns the data and has released the data under idation samples. Model parameters are averaged the CC BY-SA 3.0 license. The Stack Overflow centrally after each federated training round and the data contains the full body text of all Stack Over- contribution of each client dataset to the Sparse Cat- flow questions and answers along with metadata, egorical Cross Entropy loss function is weighted and the API pointer is updated quarterly. The data by the number of text samples drawn from each is split into the following sets at the time of writing: client. We do not apply additional training rounds • 342,477 distinct users and 135,818,730 train- on the client datasets before averaging parameters ing examples and for this reason use the terminology of rounds and epochs interchangeably. • 38,758 distinct users and 16,491,230 validation examples All models are trained with the Federated Aver- • 204,088 distinct users and 16,586,035 test ex- aging algorithm as in H. Brendan McMahan(2017) amples using the Tensorflow Federated simulated training environment from Bonawitz et al.(2019). The large Challenges with the data include the size of the network outperforms the small network but with data and the distribution of words. As is common about three times the number of trainable param- with text data (Zipf’s law), the most common words eters (7,831,328 vs 2,402,072) and is about three occur with frequency far greater than the least com- times the size (31.3MB vs 9.6MB). See the model mon words. Therefore, in our experiments, we layers in table1. limit the vocab size to exclude very rare words. We provide a notebook of exploratory data analysis in 6 Central Pretraining with Federated our GitHub repository. Fine-Tuning 2https://github.com/ federated-learning-experiments/ The communication and computation costs of train- fl-text-models ing models across distributed devices necessitates Size Embedding Size LSTM Size Dense Layer Size Output Layer Size Small 100 256 100 10,004 Large 300 512 300 10,004 Table 1: Model sizes. limiting the number of federated training rounds as the next section. much as possible. Transfer learning provides a way We fine-tune three different models in the fed- to trade computation time on independent devices erated style for 500 rounds (figure1). Although for computation time on a central server. In this the network remains the same for all three, the key way, we propose that by initializing weights for a difference is whether they are pretrained. The three model to be trained on federated, private data with models are as follows: pretrained weights learned from centralized, public data, it is possible to limit training rounds on dis- 1. Federated training on Stack Overflow without tributed devices, as the federated model will begin any pretraining which yields the two learning training with some information about the sequence curves that exhibit the lowest levels of train of words, that is, which word should follow the and validation accuracy respectively. text observed so far. We recognize that the English in Shakespeare differs greatly from the English in 2. Central pretraining on Shakespeare for 50 Stack Overflow posts, and therefore submit that rounds followed by federated fine-tuning on the value of our work is mostly mechanical in na- Stack Overflow which yields curves exhibit- ture, providing a simple method to extract weights ing marginal lift in both train and validation learned from a centrally trained model and apply accuracy. them to a model to be trained in the federated setting. 3. Pretraining on distinct Stack Overflow IDs with federated fine-tuning. To centrally pretrain our federated model, we The two main takeaways from this experiment first load, preprocess, and fit a model to a pretrain- are as follows: ing dataset using the Keras submodule from Ten- sorflow. In doing so, we fit the same model archi- 1. Pretraining generally improves the perfor- tecture as described above for federated training mance of fine-tuning. but to the entire dataset for a predefined number of pretraining rounds. We then extract the tensors 2. When the source of data is identical for pre- of model weights from the trained model and use training and fine-tuning, fine-tuning adds no these layer tensors to initialize the federated model.

Arxiv:2005.04828V3 [Cs.LG] 17 Aug 2020 Machine Learning on Big Data Is an Extremely Rnns in H

Intro to Tensorflow 2.0 MBL, August 2019

Ways to Use Machine Learning Approaches for Software Development

Keras2c: a Library for Converting Keras Neural Networks to Real-Time Compatible C

Tensorflow, Theano, Keras, Torch, Caffe Vicky Kalogeiton, Stéphane Lathuilière, Pauline Luc, Thomas Lucas, Konstantin Shmelkov Introduction

Deep Learning with Keras I

Intro to Keras

Auto-Keras: an Efficient Neural Architecture Search System

Wavefront Parallelization of Recurrent Neural Networks on Multi-Core

Introduction to Keras Tensorflow

Deep Learning Software Security and Fairness of Deep Learning SP18 Today

Approach Pre-Trained Deep Learning Models with Caution Pre-Trained Models Are Easy to Use, but Are You Glossing Over Details That Could Impact Your Model Performance?

Fake News Detection and Production Using Transformer-Based NLP Models