MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Linqing Liu,∗1 Huan Wang,2 Jimmy Lin,1 Richard Socher,2 and Caiming Xiong2
1 David R. Cheriton School of Computer Science, University of Waterloo
2 Salesforce Research
{linqing.liu, jimmylin}@uwaterloo.ca, {huan.wang, rsocher, cxiong}@salesforce.com

Abstract

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources required to train such models remain an issue. Knowledge distillation alleviates this problem by learning a light-weight student model. So far, distillation approaches have all been task-specific. In this paper, we explore knowledge distillation under the multi-task learning setting. The student is jointly distilled across different tasks. It acquires more general representation capacity through multi-task distillation and can be further fine-tuned to improve the model in the target domain. Unlike other BERT distillation methods, which are specifically designed for Transformer-based architectures, we provide a general learning framework. Our approach is model agnostic and can easily be applied to different future teacher model architectures. We evaluate our approach on a Transformer-based and an LSTM-based student model. Compared to a strong, similarly LSTM-based approach, we achieve better quality under the same computational constraints. Compared to the present state of the art, we reach comparable results with much faster inference speed.

Figure 1: The left figure represents task-specific KD. The distillation process needs to be performed for each different task. The right figure represents our proposed multi-task KD. The student model consists of shared layers and task-specific layers.

1 Introduction

Pretrained language models learn highly effective language representations from large-scale unlabeled data. A few prominent examples include ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019c), and XLNet (Yang et al., 2019), all of which have achieved state of the art in many natural language processing (NLP) tasks, such as natural language inference, sentiment classification, and semantic textual similarity. However, such models use dozens, if not hundreds, of millions of parameters, invariably leading to resource-intensive inference. The consensus is that we need to cut down the model size and reduce the computational cost while maintaining comparable quality.

One approach to address this problem is knowledge distillation (KD; Ba and Caruana, 2014; Hinton et al., 2015), where a large model functions as a teacher and transfers its knowledge to a small student model. Previous methods focus on task-specific KD, which transfers knowledge from a single-task teacher to its single-task student. Put another way, the knowledge distillation process needs to be conducted all over again for every new NLP task. The inference speed of the large-scale teacher model remains a bottleneck for distillation on various downstream tasks.

Our goal is to find a distill-once-fits-many solution. In this paper, we explore knowledge distillation under the setting of multi-task learning (MTL; Caruana, 1997; Baxter, 2000). We propose to distill the student model from different tasks jointly. The overall framework is illustrated in Figure 1. The reason is twofold: first, the distilled model learns a more universal language representation by leveraging cross-task data. Second, the student model achieves both comparable quality and fast inference speed across multiple tasks.

∗ All work was done while the first author was a research intern at Salesforce Research.
MTL is based on the idea (Maurer et al., 2016) that tasks are related by means of a common low-dimensional representation. We also provide an intuitive explanation of why using a shared structure could possibly help, by assuming some connections over the conditional distributions of different tasks.

We evaluate our approach on two different student model architectures. One uses three layers of Transformers (Vaswani et al., 2017), since most KD works (Sun et al., 2019; Jiao et al., 2019) use Transformers as their students. The other is an LSTM-based network with a bi-attention mechanism. Tang et al. (2019) previously examined the representation capacity of a simple, single-layer BiLSTM only, so we are interested in whether adding previously effective modules, such as an attention mechanism, further improves its effectiveness. This exemplifies that our approach is model agnostic, i.e., the choice of student model does not depend on the teacher model architecture; the teacher model can easily be switched to powerful language models other than BERT.

We further study several important problems in knowledge distillation, such as the choice of modules in the student model, the influence of different tokenization methods, and the influence of MTL on KD. We evaluate our approach on seven datasets across four different tasks. For the LSTM-based student, our approach keeps its advantage in inference speed while maintaining performance comparable to methods specifically designed for Transformers. For our Transformer-based student, it provides a modest gain and outperforms other KD methods that do not use external training data.

2 Related Work

Language model pretraining. Given a sequence of tokens, pretrained language models encode each token as a general language representational embedding. A large body of literature has explored this area. Traditional pretrained word representations (Turian et al., 2010) presume singular word meanings and thus adapt poorly to multiple contexts; some notable examples are word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). For more flexible word representations, a few advancements exist: Neelakantan et al. (2015) learn multiple embeddings per word type; context2vec (Melamud et al., 2016) uses a bidirectional LSTM to encode contexts around target words; CoVe (McCann et al., 2017) trains LSTM encoders on machine translation datasets, showing that these encoders transfer well to other tasks. Prominently, ELMo (Peters et al., 2018) learns deep word representations using a bidirectional language model. It can easily be added to an existing model and boosts performance across six challenging NLP tasks.

Fine-tuning approaches are mostly employed in more recent work. They pretrain the language model on a large-scale unlabeled corpus and then fine-tune it with in-domain labeled data for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018). BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and GPT-2 (Radford et al.) are some of the prominent examples. Following BERT, XLNet (Yang et al., 2019) proposes a generalized autoregressive pretraining method and RoBERTa (Liu et al., 2019c) optimizes the BERT pretraining approach. These pretrained models are large in size and contain millions of parameters. We target the BERT model and aim to address this problem through knowledge distillation; our approach can easily be applied to other models as well.

Knowledge distillation. Knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015) transfers knowledge from a large teacher model to a smaller student model. Since the distillation only matches the output distribution, the student model architecture can be completely different from that of the teacher model. There are already many efforts trying to distill BERT into smaller models. BERT-PKD (Sun et al., 2019) extracts knowledge not only from the last layer of the teacher, but also from previous layers. TinyBERT (Jiao et al., 2019) introduces a two-stage learning framework which performs Transformer distillation at both the pretraining and task-specific stages. Zhao et al. (2019) train a student model with a smaller vocabulary and lower hidden state dimensions. DistilBERT (Sanh et al., 2019) reduces the layers of BERT and uses this small version of BERT as its student model. All the aforementioned distillation methods are performed on a single task and are specifically designed for the Transformer-based teacher architecture, resulting in poor generalizability to other types of models. Our objective is a general distillation framework, applicable to Transformer-based models as well as other architectures. Tang et al. (2019) distill BERT into a single-layer BiLSTM. In our paper, we hope to extract more knowledge from BERT through multi-task learning, while keeping the student model simple.

Multi-task learning. Multi-task learning (MTL) has been successfully applied to different applications (Collobert and Weston, 2008; Deng et al., 2013; Girshick, 2015). MTL helps pretrained language models learn more generalized text representations by sharing the domain-specific information contained in the training signal of each related task (Caruana, 1997). Liu et al. (2019b, 2015) propose a multi-task deep neural network (MT-DNN) for learning representations across multiple tasks. Clark et al. (2019) propose to use knowledge distillation so that single-task models can teach a multi-task model. Liu et al. (2019a) train an ensemble of large DNNs and then distill their knowledge to a single DNN via multi-task learning, so that the single model matches the performance of its teacher ensemble.

3 Model Architecture

In this section, we introduce the teacher model and the student models for our distillation approach. We explore two different student architectures: a traditional bidirectional long short-term memory network (BiLSTM) with a bi-attention mechanism in 3.2, and the popular Transformer in 3.3.

3.1 Multi-Task Refined Teacher Model

We argue that multi-task learning can leverage the regularization of different natural language understanding tasks. Under this setting, language models can be more effective in learning universal language representations. To this end, we consider the bidirectional Transformer language model (BERT; Devlin et al., 2019) as the bottom shared text encoding layers, and fine-tune task-specific top layers for each type of NLU task. There are two main stages in the training procedure: pretraining the shared layers and multi-task refining.

Shared layer pretraining. Following Devlin et al. (2019), each input token is first encoded as the summation of its corresponding token embedding, segmentation embedding, and position embedding. The input embeddings are then mapped into contextual embeddings C through a multi-layer bidirectional Transformer encoder. The pretraining of these shared layers uses the cloze task and the next sentence prediction task. We use the pretrained BERT_LARGE to initialize these shared layers.

Multi-task refining. The contextual embeddings C are then passed through the upper task-specific layers. Following Liu et al. (2019b), our NLU training tasks on GLUE (Wang et al., 2018) can be classified into four categories: single-sentence classification (CoLA and SST-2), pairwise text classification (RTE, MNLI, WNLI, QQP, and MRPC), pairwise text similarity (STS-B), and relevance ranking (QNLI). Each category corresponds to its own output layer.

Here we take the text similarity task as an example to demonstrate the implementation details. Following Devlin et al. (2019), we consider the contextual embedding of the special [CLS] token as the semantic representation of the input sentence pair (X_1, X_2). The similarity score is predicted by the similarity ranking layer:

Sim(X_1, X_2) = W_{STS}^\top x    (1)

where W_{STS} is a task-specific learnable weight vector and x is the contextual embedding of the [CLS] token.

In the multi-task refining stage, all the model parameters, including the bottom shared layers and the task-specific layers, are updated through mini-batch stochastic gradient descent (Li et al., 2014). The training data are packed into mini-batches, and each mini-batch contains samples from only one task. Running all the mini-batches in each epoch approximately optimizes the sum of all the multi-task objectives. In each iteration, the model is updated according to the selected mini-batch and its task-specific objective. Again taking the text similarity task as an example, where each pair of sentences is labeled with a real-valued similarity score y, we use the mean-squared error loss as the objective function:

||y − Sim(X_1, X_2)||_2^2    (2)

For text classification tasks, we use the cross-entropy loss as the objective function. For the relevance ranking task, we minimize the negative log likelihood of the positive examples (Liu et al., 2019b). Other tasks can easily be added by introducing their own task-specific layers.
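To make the split between shared layers and task-specific layers concrete, the following is a minimal PyTorch-style sketch of a multi-task teacher head setup. It is our own illustration rather than the released MT-DNN code; the encoder interface, `hidden_size`, and the task names are assumed placeholders.

```python
import torch.nn as nn

class MultiTaskTeacher(nn.Module):
    """Sketch of Section 3.1: a shared BERT-like encoder with one lightweight head per task.
    In the paper the shared encoder is initialized from pretrained BERT-LARGE."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: dict):
        super().__init__()
        self.encoder = encoder  # shared layers (assumed to return the [CLS] embedding)
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, n)  # classification heads: logits over classes
            for task, n in num_classes.items()
        })
        # STS-B similarity ranking layer, Eq. 1: Sim(X1, X2) = W_STS^T x
        self.heads["sts-b"] = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, task: str, **inputs):
        x = self.encoder(**inputs)       # contextual embedding of [CLS]
        return self.heads[task](x)       # task-specific output

# Per-task objectives from Section 3.1: MSE for STS-B (Eq. 2), cross-entropy for classification.
mse_loss, ce_loss = nn.MSELoss(), nn.CrossEntropyLoss()
```

During multi-task refining, each mini-batch would be scored by the head of its own task and penalized with that task's objective, as described above.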

3.2 LSTM-based Student Model

We are interested in exploring whether a simple architecture, such as an LSTM, has enough representation capacity to transfer knowledge from the teacher model. We also incorporate a bi-attention module, since it is widely used between pairs of sentences (Peters et al., 2018; Wang et al., 2018) and the inputs in our experiments are mostly sentence pairs. Our LSTM-based bi-attentive student model is depicted in Figure 2. In the equations below, the embedding vectors of the input sequences are denoted as w_x and w_y; for single-sentence input tasks, w_y is the same as w_x. ⊕ represents vector concatenation.

w_x and w_y are first converted into ŵ_x and ŵ_y through a feedforward network with the ReLU activation function (Nair and Hinton, 2010). For each token in ŵ_x and ŵ_y, we then use a bidirectional LSTM encoder to compute its hidden states, and stack them over the time axis to form matrices X and Y, respectively.

Next, we apply the bi-attention mechanism (Xiong et al., 2016; Seo et al., 2016) to compute the attention contexts A = XY^\top of the input sequences. The attention weights A_x and A_y are extracted through a column-wise normalization for each sequence. The context vectors C_x and C_y for each token are computed as the multiplication of the corresponding representations and attention weights:

A_x = softmax(A)    C_x = A_x^\top X    (3)

Same as McCann et al. (2017), we concatenate three different computations between the original representations and the context vectors to reinforce their relationships. The concatenated vectors are then passed through one single-layer BiLSTM:

X_y = BiLSTM([X ⊕ (X − C_y) ⊕ (X ⊙ C_y)])
Y_x = BiLSTM([Y ⊕ (Y − C_x) ⊕ (Y ⊙ C_x)])    (4)

The pooling operations are then applied to the outputs of the BiLSTM. We use max, mean, and self-attentive pooling to extract features. These three pooled representations are then concatenated to get one context representation. We feed this context representation through a fully-connected layer to get the final output.

Figure 2: Architecture for the bi-attentive student neural network.
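The bi-attention, integration, and pooling steps above (Eqs. 3–4) can be written out directly. The sketch below is a per-sentence-pair PyTorch illustration under our own naming and shapes (unbatched (length, dim) matrices of BiLSTM states), not the authors' implementation; the element-wise product in the third interaction term is our reading of the "three computations" of McCann et al. (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def biattention(X: torch.Tensor, Y: torch.Tensor):
    """Eq. 3 for one sentence pair. X: (n, d) and Y: (m, d) are stacked BiLSTM states."""
    A = X @ Y.t()                 # (n, m) affinity matrix, A = X Y^T
    Ax = F.softmax(A, dim=0)      # column-wise normalization
    Ay = F.softmax(A.t(), dim=0)
    Cx = Ax.t() @ X               # (m, d) context vectors computed from X, C_x = A_x^T X
    Cy = Ay.t() @ Y               # (n, d) context vectors computed from Y
    return Cx, Cy

def integrate(H: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """The three interactions fed to the single-layer BiLSTM in Eq. 4:
    original, difference, and element-wise product."""
    return torch.cat([H, H - C, H * C], dim=-1)   # (length, 3d)
    # Per Eq. 4: X_y = BiLSTM(integrate(X, Cy)) and Y_x = BiLSTM(integrate(Y, Cx))

def pool(H: torch.Tensor, attn: nn.Linear) -> torch.Tensor:
    """Max, mean, and self-attentive pooling over the BiLSTM outputs, concatenated."""
    weights = F.softmax(attn(H), dim=0)            # (length, 1) self-attention scores
    return torch.cat([H.max(dim=0).values, H.mean(dim=0), (weights * H).sum(dim=0)])
```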

3.3 Transformer-based Student Model

Most of the pretrained language models that can be employed as teachers are built with Transformers. The Transformer (Vaswani et al., 2017) is now a ubiquitous model architecture; it draws global dependencies between input and output by relying entirely on a self-attention mechanism. Our student model uses three layers of Transformers. As in BERT (Devlin et al., 2019), [CLS] is added in front of every input example, and [SEP] is added between two input sentences. We use the average [CLS] representation from each layer as the final output.
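As a purely illustrative reading of this description, a three-layer Transformer student that averages the per-layer [CLS] states might look as follows in PyTorch. The hidden size and head count are assumptions chosen to match BERT-base, whose first three layers initialize MKD-Transformer (see Section 6.2); embedding construction is omitted.

```python
import torch
import torch.nn as nn

class TransformerStudent(nn.Module):
    """Sketch of the three-layer Transformer student (Section 3.3)."""

    def __init__(self, hidden: int = 768, heads: int = 12, layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
            for _ in range(layers)
        ])

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, hidden); position 0 holds [CLS]
        cls_states, h = [], embeddings
        for layer in self.layers:
            h = layer(h)
            cls_states.append(h[:, 0])           # [CLS] vector after this layer
        return torch.stack(cls_states).mean(0)   # average [CLS] across the three layers
```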
4 Multi-Task Distillation

The parameters of the student models introduced in Sections 3.2 and 3.3 are shared across all tasks. Each task has its individual layer on top. We begin by describing the task-specific layers: for each task, the hidden representations are first fed to a fully connected layer with rectified linear units (ReLU), whose outputs are passed to another linear transformation to get the logits z = Wh. During multi-task training, the parameters of both the bottom student network and the upper task-specific layers are jointly updated.

Consider a text classification problem, denoted as task t. A softmax layer performs the following operation on the i-th dimension of z to get the predicted probability of the i-th class:

softmax(z_i^t) = exp(z_i^t) / Σ_j exp(z_j^t)    (5)

According to Ba and Caruana (2014), training the student network on logits makes learning easier. There might be information loss when transferring logits into the probability space, so the teacher model's logits provide more information about the internal model behaviour than its predicted one-hot labels. Our distillation objective is therefore to minimize the mean-squared error (MSE) between the student network's logits z_S^t and the teacher's logits z_T^t:

L_distill^t = ||z_T^t − z_S^t||_2^2    (6)

The training samples are selected from each dataset and packed into task-specific batches. For task t, we denote the currently selected batch as b_t. In each epoch, running the model through all the batches amounts to attending over all the tasks:

L_distill = L_distill^1 + L_distill^2 + ... + L_distill^T    (7)

During training, the teacher model first uses the pretrained BERT model (Devlin et al., 2019) to initialize the parameters of its shared layers. It then follows the multi-task refining procedure described in Section 3.1 to update both the bottom shared layers and the upper task-specific layers.

For the student model, the shared parameters are randomly initialized. During training, for each batch, the teacher model first predicts the teacher logits. The student model then updates both the bottom shared layers and the upper task-specific layers according to the teacher logits. The complete procedure is summarized in Algorithm 1.
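A minimal PyTorch-style sketch of one distillation epoch as described here (Eqs. 6–7), matching the loop that Algorithm 1 below formalizes. The teacher/student call signatures and the `task_batches` structure are our own assumptions, not the released training code.

```python
import random
import torch
import torch.nn.functional as F

def distill_epoch(teacher, student, task_batches: dict, optimizer) -> None:
    """One epoch of multi-task distillation.
    task_batches maps a task name to a list of batches drawn from that task only."""
    batches = [(task, b) for task, bs in task_batches.items() for b in bs]
    random.shuffle(batches)                       # merge all datasets, then shuffle
    for task, batch in batches:
        with torch.no_grad():
            z_teacher = teacher(task, **batch)    # teacher logits z_T for this task
        z_student = student(task, **batch)        # student logits z_S (shared layers + task head)
        loss = F.mse_loss(z_student, z_teacher)   # L_distill^t = ||z_T - z_S||^2 (Eq. 6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Summed over all task-specific batches in an epoch, these per-batch updates approximately optimize the overall objective of Eq. 7.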

Algorithm 1 Multi-task Distillation

  Initialize the shared layers with BERT_LARGE, then multi-task refine the teacher model
  Randomly initialize the student model parameters
  Set the max number of epochs: epoch_max
  // Pack the data for T tasks into batches
  for t ← 1 to T do
      1. Generate augmented data: t_aug
      2. Pack the dataset t and t_aug into batches D_t
  end for
  // Train the student model
  for epoch ← 1 to epoch_max do
      1. Merge all datasets: D = D_1 ∪ D_2 ∪ ... ∪ D_T
      2. Shuffle D
      for b_t in D do
          3. Predict logits z_T from the teacher model
          4. Predict logits z_S from the student model
          5. Compute the loss L_distill(θ)
          6. Update the student model: θ = θ − α∇_θ L_distill
      end for
  end for

5 An Intuitive Explanation

In this section we give an intuitive explanation of why using some shared structure during multi-task training could possibly help. Suppose the samples of task T are independent and identically distributed, x^T, y^T ∼ P_{XY}^T, where x^T and y^T are the features and labels of the samples in task T, respectively. The joint density can be decomposed as p^T(x, y) = p^T(x) p^T(y|x). During the discriminative learning process, one tries to estimate the conditional distribution p^T(·|x). For different tasks, p^T(·|x) could be very different. Indeed, if there were no connections among the p^T(·|x) of different tasks, it would be hard to believe that training on one task may help another. However, if we assume some smoothness over p^T(·|x), then some connections can be built across tasks.

Without loss of generality, we investigate the case of two tasks. For tasks T_1 and T_2, assume there exists some common domain of representations H and two functions h^{T_1}(x), h^{T_2}(x): X → H, such that

p^{T_1}(·|x) = g^{T_1} ∘ h^{T_1}(x),    (8)
p^{T_2}(·|x) = g^{T_2} ∘ h^{T_2}(x),    (9)
∀ x_1, x_2:  ||h^{T_1}(x_1) − h^{T_2}(x_2)|| ≤ η ||x_1 − x_2||,    (10)

where g^T: H → Y^T is a function that maps from the common domain H to the task labels Y^T for task T, ∘ denotes function composition, and η is a smoothness constant.

The Lipschitz-ish inequality (10) suggests that the hidden representation h^{T_1} on task T_1 may help the estimation of h^{T_2}, since h^{T_2}(x_2) will be close to h^{T_1}(x_1) if x_1 and x_2 are close enough. This is implicitly captured if we use one common network to model both h^{T_1} and h^{T_2}, since a neural network with ReLU activations is Lipschitz.

6 Experimental Setup

6.1 Datasets

We conduct the experiments on seven of the most widely used datasets in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018): one sentiment dataset, SST-2 (Socher et al., 2013); two paraphrase identification datasets, QQP1 and MRPC (Dolan and Brockett, 2005); one text similarity dataset, STS-B (Cer et al., 2017); and three natural language inference datasets, MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), and RTE. For the QNLI dataset, version 1 expired on January 30, 2019; the results are evaluated on QNLI version 2.

1 https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs

6.2 Implementation Details

We use the released MT-DNN model2 to initialize our teacher model. We further refine the model against the multi-task learning objective for 1 epoch with the learning rate set to 5e-4. The performance of our refined MT-DNN is lower than the results reported in Liu et al. (2019b); we tried to replicate the results of the released MT-DNN (Liu et al., 2019b) model.

2 https://github.com/namisan/mt-dnn

The LSTM-based student model (MKD-LSTM) is initialized randomly. For multi-task distillation, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-4. The batch size is set to 128, and the maximum number of epochs is 16. We clip the gradient norm within 1 to avoid exploding gradients. The number of BiLSTM hidden units in the student model is set to 256, and the output feature size of the task-specific linear layers is 512. The Transformer-based student model (MKD-Transformer) consists of three layers of Transformers. Following the settings of BERT-PKD, it is initialized with the parameters of the first three layers of pretrained BERT-base.
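For illustration, the optimization setup described above could be wired up as follows. The `student` object here is only a stand-in parameter container; everything except the reported hyperparameter values (Adam, learning rate 5e-4, gradient norm clipped within 1) is an assumption.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

# Stand-in for the MKD-LSTM student's parameters; hyperparameters follow Section 6.2.
student = nn.ParameterList([nn.Parameter(torch.zeros(256))])
optimizer = torch.optim.Adam(student.parameters(), lr=5e-4)

def update(loss: torch.Tensor) -> None:
    """One optimizer step with the gradient norm clipped within 1."""
    optimizer.zero_grad()
    loss.backward()
    clip_grad_norm_(student.parameters(), max_norm=1.0)  # avoid exploding gradients
    optimizer.step()
```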
We also fine-tune the multi-task distilled student model for each task. During fine-tuning, the parameters of both the shared layers and the upper task-specific layers are updated. The learning rate is chosen from {1, 1.5, 5} × 10^-5 according to the validation set loss on each task. Other parameters remain the same as above. For both the teacher and student models, we use WordPiece embeddings (Wu et al., 2016) with a 30,522 token vocabulary.

Data augmentation. The training data for typical natural language understanding tasks is usually very limited. Larger amounts of data are desirable for the teacher model to fully express its knowledge. Tang et al. (2019) propose two methods for text data augmentation: masking and POS-guided word replacement. We employ only the first, masking, technique, which randomly replaces a word in the sentence with [MASK], because, as shown in both Tang et al. (2019) and our own experiments, POS-guided word replacement does not lead to consistent quality improvements across most of the tasks. Following their strategy, for each word in a sentence, we perform masking with probability p_mask = 0.1. We use the combination of the original corpus and the augmentation data in the distillation procedure. For the smaller datasets STS-B, MRPC, and RTE, the size of the augmented dataset is 40 times the size of the original corpus; it is 10 times for the other, larger datasets.
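A small sketch of the masking-based augmentation described above. The helper name and the way multiple corrupted copies are drawn to reach the reported 10x/40x corpus sizes are our own assumptions.

```python
import random

def mask_augment(tokens: list[str], p_mask: float = 0.1, n_copies: int = 10) -> list[list[str]]:
    """Masking augmentation from Section 6.2: each word is independently replaced with
    [MASK] with probability p_mask = 0.1; n_copies controls how many augmented copies
    are generated per original sentence (10x or 40x the original corpus in the paper)."""
    augmented = []
    for _ in range(n_copies):
        augmented.append([t if random.random() > p_mask else "[MASK]" for t in tokens])
    return augmented

# Example: mask_augment("the movie was surprisingly good".split())
```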
man’s ρ for STS-B, 7.6 points in F-1 measure 2https://github.com/namisan/mt-dnn for QQP, and 0.6 to 5.6 points higher for other Model Size SST-2 MRPC STS-B QQP MNLI-m/mm QNLI RTE # Param Acc F1/Acc r/ρ F1/Acc Acc Acc Acc MTL-BERT (Teacher) 303.9M 94.7 84.7/79.7 84.0/83.3 72.3/89.6 85.9/85.7 90.5 77.7 OpenAI GPT 116.5M 91.3 82.3/75.7 82.0/80.0 70.3/88.5 82.1/81.4 - 56.0 ELMo 93.6M 90.4 84.4/78.0 74.2/72.3 63.1/84.3 74.1/74.5 79.8 58.9 Distilled BiLSTM 1.59M 91.6 82.7/75.6 79.6/78.2 68.5/88.4 72.5/72.4 - - BERT-PKD 21.3M 87.5 80.7/72.5 - 68.1/87.8 76.7/76.3 84.7 58.2 TinyBERT 5.0M 92.6 86.4/81.2 81.2/79.9 71.3/89.2 82.5/81.8 87.7 62.9 BERTEXTREME 19.2M 88.4 84.9/78.5 - - 78.2/77.7 - MKD-LSTM 10.2M 91.0 85.4/79.7 80.9/80.9 70.7/88.6 78.6/78.4 85.4 67.3 MKD-Transformer 21.3M 90.1 86.2/79.8 81.5/81.5 71.1/89.4 79.2/78.5 83.5 67.0

Table 1: Results from the GLUE test server. The first group contains large-scale pretrained language models. The second group lists previous knowledge distillation methods for BERT. Our MKD results, based on the LSTM and Transformer student model architectures, are listed in the last group. The number of parameters does not include the embedding layer.

7 Results and Discussion

The results of our model are listed as MKD-LSTM and MKD-Transformer in the tables.

7.1 Model Quality Analysis

Comparison with GPT / ELMo. Our model has better or comparable performance compared with ELMo and OpenAI GPT. MKD-LSTM has higher performance than ELMo over all seven datasets: notably 8.4 points for RTE, 8.6 points in Spearman's ρ for STS-B, 7.6 points in F1 measure for QQP, and 0.6 to 5.6 points higher for the other datasets. Compared with OpenAI GPT, MKD-LSTM is 11.3 points higher for RTE and 4 points higher for MRPC.

Comparison with Distilled BiLSTM / BERT-PKD. While using the same Transformer layers and the same number of parameters, MKD-Transformer significantly outperforms BERT-PKD by a range of 0.4 ∼ 9.1 points. MKD-LSTM leads to significant performance gains over BERT-PKD while using far fewer parameters, and compensates for the effectiveness loss of Distilled BiLSTM.

Comparison with TinyBERT / BERT_EXTREME. These two approaches both use a large-scale unsupervised text corpus, the same as the one used to train the teacher model, to execute their distillation process. In contrast, we only use the data within the downstream tasks. There are two caveats to their methods: (1) due to the massive training data, the KD step still requires intensive computing resources, e.g., BERT_EXTREME takes 4 days on 32 TPU cores to train its student model; (2) the text corpus used to train the teacher model is not always available due to data privacy, and under some conditions where only the pretrained models are accessible, their approach is not applicable.

While not resorting to external training data, our model has the best performance across the state-of-the-art KD baselines (i.e., BERT-PKD). It also achieves performance comparable to KD methods trained intensively on an external large corpus (i.e., BERT_EXTREME).

#   Model                              SST-2   MRPC        STS-B         QQP         MNLI-m/mm   QNLI   RTE
1   Biatt LSTM                         85.8    80.4/69.9   12.24/11.33   81.1/86.5   73.0/73.7   80.3   53.1
2   Single Task Distilled Biatt LSTM   89.2    82.5/72.1   20.2/20.0     84.6/88.4   74.7/75.0   82.0   52.0
3   BiLSTM_MTL                         87.5    83.2/72.8   71.6/72.6     81.6/87.0   70.2/71.3   75.4   56.3
4   MKD-LSTM, Word-level Tokenizer     87.3    84.2/75.7   72.2/72.6     71.1/79.3   69.4/70.9   75.1   54.9
5   MKD-LSTM                           89.3    86.8/81.1   84.5/84.5     85.2/89.0   78.4/79.2   83.0   67.9

Table 2: Ablation studies on the GLUE dev set of different training procedures. None of the models are fine-tuned. Line 1 is our bi-attentive LSTM student model trained without distillation. Line 2 is our bi-attentive LSTM student distilled from a single task. Line 3 is the multi-task distilled BiLSTM. Line 4 is the multi-task distilled model using a word-level tokenizer.

Training tasks   Model      SST-2   MRPC        STS-B       QQP         MNLI-m/mm   QNLI   RTE
Sentiment Task   MKD-LSTM   89.9    81.4/70.8   51.2/49.9   84.9/88.3   74.3/74.7   83.2   50.9
PI Tasks         MKD-LSTM   89.3    85.2/77.2   83.4/83.3   84.9/88.7   73.2/73.9   83.8   59.6
NLI Tasks        MKD-LSTM   90.4    87.9/82.1   84.1/84.1   84.8/88.4   77.1/78.1   84.5   66.8
All Tasks        MKD-LSTM   90.5    86.9/80.2   85.0/84.8   84.8/89.0   77.4/78.3   84.9   68.2

Table 3: Ablation experiments on the dev set using different training tasks in multi-task distillation. The results are reported with the original corpus, without augmentation data. The model is fine-tuned on each individual task.

7.2 Ablation Study

We conduct ablation studies to investigate the contributions of (1) the different training procedures (Table 2) and (2) different training tasks in multi-task distillation (Table 3). We also compare the inference speed of our model and previous distillation approaches (Table 4). The ablation studies are all conducted on the LSTM-based student model, since it has the advantage over Transformers in model size and inference speed.

Do we need attention in the student model? Yes. Tang et al. (2019) distill BERT into a simple BiLSTM network. The results in Table 1 demonstrate that our model is better than Distilled BiLSTM, with an improvement of 2.2 ∼ 6.1 points across six datasets. To make a fair comparison, we also list the results of the multi-task distilled BiLSTM in Line 3 of Table 2. Line 5, the model with the bi-attentive mechanism, significantly outperforms Line 3. We surmise that the attention module is an integral part of the student model for sequence modeling.

Better vocabulary choices? WordPiece works better than word-level tokenizers in our experiments. The WordPiece-tokenized vocabulary size is 30,522, while the word-level tokenized vocabulary is much larger, along with more unknown tokens. WordPiece effectively reduces the vocabulary size and improves rare-word handling. The comparison between Line 4 and Line 5 in Table 2 demonstrates that the method of tokenization influences all the tasks.

The influence of MTL in KD? The single-task distilled results are presented in Line 2 of Table 2. Compared with Line 5, all the tasks benefit from information sharing through multi-task distillation. In particular STS-B, the only regression task, greatly benefits from the joint learning with the other, classification, tasks.

We also illustrate the influence of different numbers of training tasks. In Table 3, the training set incorporates tasks of the same type individually. Even the tasks that are included in the smaller training sets still perform better in the All Tasks training setting. For example, for RTE, the All Tasks setting is 1.4 points higher than the NLI Tasks setting. For the other training settings, in which RTE is excluded from the training set, All Tasks also leads to better performance.

                Distilled BiLSTM   BERT-PKD   TinyBERT   MKD-LSTM
Inf. Time (s)   1.36               8.41       3.68       2.93

Table 4: The inference time (in seconds) for the baselines and our model. The inference is performed on the QNLI training set on a single NVIDIA V100 GPU.

7.3 Inference Efficiency

To test model efficiency, we ran the experiments on the QNLI training set. We perform the inference on a single NVIDIA V100 GPU with a batch size of 128 and a maximum sequence length of 128. The reported inference time is the total running time of 100 batches.

From Table 4, the inference time for our model is 2.93s. We re-implemented Distilled BiLSTM from Tang et al. (2019); its inference time is 1.36s. For a fair comparison, we also ran the inference procedure using the released BERT-PKD and TinyBERT models on the same machine. Our model significantly outperforms Distilled BiLSTM at the same order of magnitude of speed. It also achieves comparable results while being faster than the other distillation models.

8 Conclusion

In this paper, we propose a general framework for multi-task knowledge distillation. The student is jointly distilled across different tasks from a multi-task refined BERT model (the teacher model). We evaluate our approach on Transformer-based and LSTM-based student models. Compared with previous KD methods using only data within the tasks, our approach achieves better performance. In contrast to other KD methods that use a large-scale external text corpus, our approach balances the trade-offs among computational resources, inference speed, performance gains, and availability of training data.

References

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662.

Jonathan Baxter. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. 2019. BAM! Born-again multi-task networks for natural language understanding. arXiv preprint arXiv:1907.04829.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Li Deng, Geoffrey Hinton, and Brian Kingsbury. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. 2014. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670. ACM.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. 2016. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. 2019. Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687.