TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao1∗†, Yichun Yin2∗‡, Lifeng Shang2‡, Xin Jiang2, Xiao Chen2, Linlin Li3, Fang Wang1‡ and Qun Liu2
1Key Laboratory of Information Storage System, Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics
2Huawei Noah's Ark Lab
3Huawei Technologies Co., Ltd.
{jiaoxiaoqi,[email protected]}  {yinyichun,shang.lifeng,[email protected]}  {chen.xiao2,lynn.lilinlin,[email protected]}

∗ Authors contribute equally.
† This work was done while Xiaoqi Jiao was an intern at Huawei Noah's Ark Lab.
‡ Corresponding authors.
1 The code and models are publicly available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and the task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT. TinyBERT4, with 4 layers, is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time. Moreover, TinyBERT6, with 6 layers, performs on par with its teacher BERTBASE.

1 Introduction

Pre-training language models and then fine-tuning them on downstream tasks has become a new paradigm for natural language processing (NLP). Pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2019) and ELECTRA (Clark et al., 2020), have achieved great success in many NLP tasks (e.g., the GLUE benchmark (Wang et al., 2018) and the challenging multi-hop reasoning task (Ding et al., 2019)). However, PLMs usually have a large number of parameters and long inference times, which makes them difficult to deploy on edge devices such as mobile phones. Recent studies (Kovaleva et al., 2019; Michel et al., 2019; Voita et al., 2019) demonstrate that there is redundancy in PLMs. It is therefore both crucial and feasible to reduce the computational overhead and model storage of PLMs while retaining their performance.

Many model compression techniques (Han et al., 2016) have been proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly used techniques include quantization (Gong et al., 2014), weight pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014). In this paper, we focus on knowledge distillation, an idea originating from Hinton et al. (2015), in a teacher-student framework. KD aims to transfer the knowledge embedded in a large teacher network to a small student network, where the student network is trained to reproduce the behaviors of the teacher network. Based on this framework, we propose a novel distillation method specifically for Transformer-based models (Vaswani et al., 2017), and use BERT as an example to investigate the method for large-scale PLMs.

KD has been extensively studied in NLP (Kim and Rush, 2016; Hu et al., 2018) as well as for pre-trained language models (Sanh et al., 2019; Sun et al., 2019, 2020; Wang et al., 2020). The pre-training-then-fine-tuning paradigm first pre-trains BERT on a large-scale unsupervised text corpus and then fine-tunes it on a task-specific dataset, which greatly increases the difficulty of BERT distillation. An effective KD strategy therefore has to be designed for both training stages.

To build a competitive TinyBERT, we first propose a new Transformer distillation method to distill the knowledge embedded in the teacher BERT. Specifically, we design three types of loss functions to fit different representations from BERT layers: 1) the output of the embedding layer; 2) the hidden states and attention matrices derived from the Transformer layers; 3) the logits output by the prediction layer. The attention-based fitting is inspired by recent findings (Clark et al., 2019) that the attention weights learned by BERT can capture substantial linguistic knowledge, and it thus encourages this linguistic knowledge to be transferred well from the teacher BERT to the student TinyBERT.
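As a rough illustration of the three kinds of representation fitting listed above, the sketch below pairs each representation with a plausible loss. The use of plain MSE for embeddings, hidden states and attention matrices, the temperature-scaled soft cross-entropy for logits, and the learnable projection `proj` that maps the student's hidden size to the teacher's are assumptions made for this sketch, not details taken from the text.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the three types of distillation losses described above.
# Loss forms and the projection matrix `proj` are illustrative assumptions.

def embedding_loss(student_emb, teacher_emb, proj):
    # 1) Fit the output of the embedding layer (student dim d' -> teacher dim d).
    return F.mse_loss(student_emb @ proj, teacher_emb)

def hidden_attention_loss(student_hid, teacher_hid, student_attn, teacher_attn, proj):
    # 2) Fit the hidden states and attention matrices of a Transformer layer.
    hid_loss = F.mse_loss(student_hid @ proj, teacher_hid)
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    return hid_loss + attn_loss

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    # 3) Fit the logits of the prediction layer with a soft cross-entropy.
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```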
We then propose a novel two-stage learning framework that includes general distillation and task-specific distillation, as illustrated in Figure 1. At the general distillation stage, the original BERT without fine-tuning acts as the teacher model. The student TinyBERT mimics the teacher's behavior through the proposed Transformer distillation on a general-domain corpus. After that, we obtain a general TinyBERT that is used as the initialization of the student model for further distillation. At the task-specific distillation stage, we first perform data augmentation and then carry out the distillation on the augmented dataset, using the fine-tuned BERT as the teacher model. It should be pointed out that both stages are essential to improve the performance and generalization capability of TinyBERT.

[Figure 1: The illustration of TinyBERT learning. General distillation on a large-scale text corpus yields a general TinyBERT, which is then refined by task-specific distillation, together with data augmentation, on the (augmented) task dataset to produce the fine-tuned TinyBERT.]

The main contributions of this work are as follows: 1) We propose a new Transformer distillation method so that the linguistic knowledge encoded in the teacher BERT can be adequately transferred to TinyBERT; 2) We propose a novel two-stage learning framework that performs the proposed Transformer distillation at both the pre-training and fine-tuning stages, which ensures that TinyBERT can absorb both the general-domain and the task-specific knowledge of the teacher BERT; 3) We show in experiments that our TinyBERT4 can achieve more than 96.8% of the performance of the teacher BERTBASE on GLUE tasks, while having far fewer parameters (∼13.3%) and less inference time (∼10.6%), and that it significantly outperforms other state-of-the-art 4-layer baselines on BERT distillation; 4) We also show that a 6-layer TinyBERT6 can perform on par with the teacher BERTBASE on GLUE.

2 Preliminaries

In this section, we describe the formulation of the Transformer (Vaswani et al., 2017) and of Knowledge Distillation (Hinton et al., 2015). Our proposed Transformer distillation is a KD method specially designed for Transformer-based models.

2.1 Transformer Layer

Most recent pre-trained language models (e.g., BERT, XLNet and RoBERTa) are built with Transformer layers, which can capture long-term dependencies between input tokens through the self-attention mechanism. Specifically, a standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).

Multi-Head Attention (MHA). The calculation of the attention function depends on three components, queries, keys and values, denoted as matrices Q, K and V respectively. The attention function can be formulated as follows:

    A = \frac{QK^\top}{\sqrt{d_k}},    (1)
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(A)\,V,    (2)

where d_k is the dimension of the keys and acts as a scaling factor, and A is the attention matrix calculated from the compatibility of Q and K by a dot-product operation. The final output is calculated as a weighted sum of the values V, where the weights are obtained by applying the softmax(·) operation to each row of the matrix A. According to Clark et al. (2019), the attention matrices in BERT can capture substantial linguistic knowledge, and they thus play an essential role in our proposed distillation method.

Multi-head attention is defined by concatenating the attention heads from different representation subspaces as follows:

    \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_k)\,W,    (3)

where k is the number of attention heads, h_i denotes the i-th attention head, which is calculated by the Attention(·) function with inputs from different representation subspaces, and the matrix W acts as a linear transformation.

Position-wise Feed-Forward Network (FFN). The Transformer layer also contains a fully connected feed-forward network, which is formulated as follows:

    \mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2.    (4)

We can see that the FFN contains two linear transformations and one ReLU activation.
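To make Eqs. (1)–(4) concrete, here is a minimal PyTorch sketch of the two sub-layers of one Transformer layer. The class names, tensor shapes and the choice to return the pre-softmax attention matrix A are ours; the residual connections and layer normalization that wrap each sub-layer in BERT are omitted for brevity, so this is an illustrative sketch rather than the exact BERT implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of Eqs. (1)-(3): scaled dot-product attention and head concatenation."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.d_k = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)  # the matrix W in Eq. (3)

    def forward(self, x):
        batch, seq_len, hidden = x.shape

        # Project and split into heads: (batch, heads, seq_len, d_k).
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Eq. (1): A = QK^T / sqrt(d_k); Eq. (2): softmax(A) V.
        attn = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        context = torch.matmul(torch.softmax(attn, dim=-1), v)
        # Eq. (3): concatenate the heads and apply the linear map W.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, hidden)
        return self.out_proj(context), attn  # A is exposed so it can be used for distillation

class FeedForward(nn.Module):
    """Sketch of Eq. (4): two linear transformations with a ReLU in between."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, intermediate_size)
        self.w2 = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))
```

Returning A alongside the sub-layer output mirrors the role the attention matrices play in the distillation method described above.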
[Figure 2: The details of Transformer-layer distillation, consisting of Attn_loss (attention based distillation) and Hidn_loss (hidden states based distillation). The teacher (BERT) has N Transformer layers with hidden size d, the student (TinyBERT) has M < N layers with hidden size d' < d, and the attention matrices (ℝ^{head×l×l}) and hidden states (ℝ^{l×d} versus ℝ^{l×d'}) of corresponding teacher and student layers are matched.]

2.2 Knowledge Distillation

KD aims to transfer the knowledge of a large teacher network T to a small student network S.
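The excerpt ends shortly after this point, before the rest of Section 2.2. Purely as a hedged illustration of the teacher–student setup just introduced (the function names, the frozen-teacher loop and the choice of loss are ours, not the paper's formulation), a generic KD objective can be read as minimizing a loss between the behaviors of the frozen teacher T and the trainable student S over the training data:

```python
import torch

def kd_objective(student_behavior_fn, teacher_behavior_fn, loss_fn, dataset):
    """Generic teacher->student KD: accumulate a loss between the student's and
    the (frozen) teacher's behaviors, e.g. output logits or internal states."""
    total = torch.tensor(0.0)
    for x in dataset:
        with torch.no_grad():                      # the teacher T is not updated
            teacher_behavior = teacher_behavior_fn(x)
        student_behavior = student_behavior_fn(x)  # the student S is trained
        total = total + loss_fn(student_behavior, teacher_behavior)
    return total

# Usage sketch (names hypothetical): loss_fn could be an MSE on hidden states or a
# soft cross-entropy on logits, matching the loss types listed in the introduction.
```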