Efficient Training of BERT by Progressively Stacking
Linyuan Gong 1  Di He 1  Zhuohan Li 1  Tao Qin 2  Liwei Wang 1 3  Tie-Yan Liu 2

The work was done while the first and third authors were visiting Microsoft Research Asia. 1 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; 2 Microsoft Research; 3 Center for Data Science, Peking University, Beijing Institute of Big Data Research. Correspondence to: Tao Qin <taoqin@microsoft.com>. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Unsupervised pre-training is commonly used in natural language processing: a deep neural network trained with proper unsupervised prediction tasks is shown to be effective in many downstream tasks. Because it is easy to create a large monolingual dataset by collecting data from the Web, we can train high-capacity models. Therefore, training efficiency becomes a critical issue even when using high-performance hardware. In this paper, we explore an efficient training method for the state-of-the-art bidirectional Transformer (BERT) model. By visualizing the self-attention distributions of different layers at different positions in a well-trained BERT model, we find that in most layers, the self-attention distribution concentrates locally around its position and the start-of-sentence token. Motivated by this, we propose the stacking algorithm to transfer knowledge from a shallow model to a deep model; then we apply stacking progressively to accelerate BERT training. Experiments show that the models trained with our training strategy achieve similar performance to models trained from scratch, but our algorithm is much faster.

1. Introduction

In recent years, deep neural networks have pushed the limits of many applications, including speech recognition (Hinton et al., 2012), image classification (He et al., 2016), and machine translation (Vaswani et al., 2017). The keys to this success are advanced neural network architectures and massive databases of labeled instances (Deng et al., 2009). However, human annotations may be very costly to collect, especially in domains that require particular expertise.

In natural language processing, using unsupervised pre-trained models is one of the most effective ways to help train tasks for which labeled information is not rich enough. For example, word embeddings learned from a Wikipedia corpus (Mikolov et al., 2013; Pennington et al., 2014) can substantially improve the performance of sentence classification and textual similarity systems (Socher et al., 2011; Tai et al., 2015; Kalchbrenner et al., 2014). Recently, pre-trained contextual representation approaches (Devlin et al., 2018; Radford et al., 2018; Peters et al., 2018) have been developed and shown to be more effective than conventional word embeddings. Different from word embeddings, which only extract local semantic information of individual words, pre-trained contextual representations further learn sentence-level information through sentence-level encoders.

BERT (Devlin et al., 2018) is the current state-of-the-art pre-trained contextual representation; it is based on a huge multi-layer Transformer encoder architecture (BERT-Base has 110M parameters and BERT-Large has 330M parameters) and is trained by masked language modeling and next-sentence prediction tasks. Because these tasks require no human supervision, the size of the available training data easily scales up to billions of tokens. Therefore, the training efficiency of such a model becomes the most critical issue, and the requirement of extremely high-performance hardware becomes a barrier to its practical application.

In this paper, we aim to improve the training efficiency of the BERT model in an algorithmic sense. Our motivation comes from observing the self-attention layers, which are the core component of the BERT model. We visualize a shallow BERT model and a deep BERT model and then study their differences and relationships. By carefully investigating the attention distributions in different layers at different positions, we find some interesting phenomena. First, the attention distributions of the shallow model are quite similar across different positions and layers. At any position, the attention distribution is a mixture of two distributions: one is local attention that focuses on neighbors, and the other focuses on the start-of-sentence token. Second, we find that the attention distribution in the shallow model is similar to that of a deep model. This suggests that such knowledge can be shared from the shallow model to a deep model: once we have a shallow model, we can stack it into a deep model by sharing weights between the top self-attention layers and the bottom self-attention layers, and then fine-tune all the parameters. Because training a shallow model usually requires less time, growing the model from a shallow one to a deep one can largely reduce the total training time.
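To make the stacking step concrete, the following is a minimal PyTorch-style sketch, assuming the encoder is kept as a plain list of layer modules; it copies the trained bottom block on top of itself before all parameters are fine-tuned. The helper name stack_encoder and the layer container are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch.nn as nn

def stack_encoder(shallow_layers: nn.ModuleList) -> nn.ModuleList:
    """Double the encoder depth by duplicating the trained bottom block on top.

    The top k layers of the new 2k-layer encoder start from the weights of the
    bottom k layers; afterwards all parameters are fine-tuned jointly.
    """
    deep_layers = nn.ModuleList()
    for layer in shallow_layers:              # bottom block: reuse trained layers
        deep_layers.append(layer)
    for layer in shallow_layers:              # top block: copies of the same weights
        deep_layers.append(copy.deepcopy(layer))
    return deep_layers

# Usage sketch (EncoderLayer is a hypothetical Transformer-block class):
# shallow = nn.ModuleList([EncoderLayer(...) for _ in range(k)])
# ... pre-train the k-layer model for a while ...
# deep = stack_encoder(shallow)
# ... continue pre-training the 2k-layer model ...
```

Applying this doubling step repeatedly (train a shallow model, stack, continue training) is the progressive part of the strategy described in the abstract.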
We conduct extensive experiments on our proposed method to see (1) whether it can improve the training efficiency and convergence rate at the pre-training step, and (2) whether the trained model can achieve performance similar to the baseline models. According to our results, we find that, first, during pre-training, our proposed method is about 25% faster than several baselines in reaching the same validation accuracy. Second, our final model is competitive with, and even better than, the baseline model on several downstream tasks.

Figure 1. The model architecture of BERT: token embedding, segment embedding, and positional encoding feed into L stacked encoder blocks, each with a multi-head attention sub-layer and a feed-forward sub-layer (with layer normalization and residual connections), followed by a classifier that produces the output probabilities.

2. Related Work

2.1. Unsupervised Pre-training in Natural Language Processing

Pre-trained word vectors (Mikolov et al., 2013; Pennington et al., 2014) have been considered a standard component of most state-of-the-art NLP architectures, especially for tasks where the amount of labeled data is not large enough (Socher et al., 2011; Tai et al., 2015; Kalchbrenner et al., 2014). However, these learned word vectors only capture the semantics of a single word independent of its surrounding context; the rich syntactic and semantic structures of sentences are not effectively exploited.

Pre-trained contextual representations overcome the shortcomings of traditional word vectors by taking the surrounding context into account. Peters et al. (2018) first train language models using stacked LSTMs and then use the hidden states of the stacked LSTMs as the contextual representation. Since an LSTM processes words sequentially, its hidden state at one position contains information about the words at previous positions, and thus the representation captures not only word semantics but also sentence context. Radford et al. (2018) use self-attention units instead of LSTM units in language models. Devlin et al. (2018) further develop a masked language modeling task and achieve state-of-the-art performance on multiple natural language understanding tasks. As (masked) language modeling requires no human labeling effort, billions of sentences from the web can be used to train a very deep network. Therefore, a major challenge in learning such a model is training efficiency.

2.2. Network Training by Knowledge Transfer

Our iterative training method is also closely related to efficiently training deep neural networks using knowledge transfer. Chen et al. (2015) tackle the problem of how to train a deep neural network efficiently when a shallow neural network is already available. In particular, they propose function-preserving initialization, which first initializes a deep neural network that represents the same function as the shallow one, and then continues to train the deep network by standard optimization methods. However, when dealing with sophisticated structures such as the Transformer, function-preserving initialization is usually not effective. For example, the basic component of the Transformer is a composition of a self-attention layer and a feed-forward layer. According to our empirical study, simply setting the feed-forward layer to be near zero and randomly initializing the self-attention layer is a function-preserving initialization, but it is ineffective because most parameters in the self-attention layer stay untrained. In our work, we propose a different and more efficient method to transfer knowledge from shallow models to deep models.
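For illustration, the sketch below shows one generic way to make a newly added Transformer block function-preserving: zero the output projections on both residual branches so the block initially acts as the identity. It uses PyTorch's built-in encoder layer in its pre-norm configuration so that the identity property is exact; this is a sketch of the general idea, not the exact scheme evaluated in the paper (which zeroes only the feed-forward layer).

```python
import torch
import torch.nn as nn

def make_block_near_identity(block: nn.TransformerEncoderLayer) -> None:
    """Zero the output projections on both residual branches so the newly
    added block initially contributes nothing beyond its input."""
    nn.init.zeros_(block.self_attn.out_proj.weight)
    nn.init.zeros_(block.self_attn.out_proj.bias)
    nn.init.zeros_(block.linear2.weight)   # second linear layer of the FFN
    nn.init.zeros_(block.linear2.bias)

block = nn.TransformerEncoderLayer(d_model=768, nhead=12, norm_first=True)
make_block_near_identity(block)
x = torch.randn(4, 16, 768)
print(torch.allclose(block(x), x, atol=1e-5))  # True: block is the identity at init
```

The paper's empirical finding is that function-preserving initializations of this flavor are ineffective for the Transformer, since most self-attention parameters stay close to their initialization; this motivates the stacking approach instead.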
3. Method

The BERT (Bidirectional Encoder Representations from Transformers) model is built on a multi-layer bidirectional Transformer (Vaswani et al., 2017) encoder. The architecture is shown in Figure 1. The encoder consists of L encoder layers, each of which contains a multi-head self-attention sub-layer and a feed-forward sub-layer; both have residual connections (He et al., 2015). The feed-forward layer (FFN) is point-wise, i.e., it is applied independently to each position of the input.

The key component of the Transformer encoder is the multi-head self-attention layer. An attention function can be formulated as querying a dictionary with key-value pairs (Vaswani et al., 2017), e.g.,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \qquad (1)$$

where $Q \in \mathbb{R}^{n_q \times d_k}$, $K \in \mathbb{R}^{n_e \times d_k}$, $V \in \mathbb{R}^{n_e \times d_v}$; $d_k$ is the dimension of each key and each query, $n_q$ is the number of queries, and $n_e$ is the number of key-value entries. $A = \mathrm{softmax}(QK^T/\sqrt{d_k}) \in \mathbb{R}^{n_q \times n_e}$ defines the attention distribution. The output of each query is a weighted average of the rows of $V$, with $A$ as the coefficients. The attention distribution $A$ helps us understand the attention function: $A$ reflects the importance of the $i$-th key-value entry with

Figure 2. Visualization of attention distributions of BERT-Base. For a randomly chosen sample sentence, we visualize the attention distributions of 6 heads from different layers.
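For reference, here is a minimal PyTorch sketch of Eq. (1) for a single head, with no input projections, masking, or multi-head splitting; it returns both the output and the attention distribution A on which visualizations such as Figure 2 are based.

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor):
    """Scaled dot-product attention as in Eq. (1).

    Q: (n_q, d_k), K: (n_e, d_k), V: (n_e, d_v).
    Returns the output (n_q, d_v) and the attention distribution A (n_q, n_e).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    A = torch.softmax(scores, dim=-1)      # each row of A sums to 1
    return A @ V, A

# Usage sketch: in self-attention, n_q = n_e and Q, K, V are linear
# projections of the same input sequence.
Q, K, V = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 32)
out, A = attention(Q, K, V)
print(out.shape, A.shape)  # torch.Size([5, 32]) torch.Size([5, 7])
```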