Lifelong Language Knowledge Distillation

Yung-Sung Chuang   Shang-Yu Su   Yun-Nung Chen
National Taiwan University, Taipei, Taiwan
{b05901033,f05921117}@ntu.edu.tw   [email protected]

Abstract

It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and then pass the knowledge to the LLL model via knowledge distillation. Therefore, the LLL model can better adapt to the new task while keeping the previously learned knowledge. Experiments show that the proposed L2KD consistently improves previous state-of-the-art models, and the degradation compared to multi-task models in LLL tasks is well mitigated for both sequence generation and text classification tasks. The source code and data are available at https://github.com/voidism/L2KD.

Figure 1: The difference between LLL and L2KD. (a) Normal lifelong language learning: the LLL model learns Task 1, Task 2, Task 3, ... one after another. (b) Lifelong Language Knowledge Distillation: Teacher 1, Teacher 2, Teacher 3, ... first learn the corresponding tasks and then pass the knowledge to the LLL model.

1 Introduction

Training a single model to learn a stream of different tasks sequentially usually faces the catastrophic forgetting problem (McCloskey and Cohen, 1989): after learning a new task, the model forgets how to handle the samples from previous tasks. Lifelong learning manages to accumulate the knowledge and retain the performance on previously learned tasks. It is especially important for real-world natural language processing (NLP) applications, because these applications need to interact with many users from different domains every day, and the language usage also evolves over time. Hence, various NLP tasks have been studied for lifelong learning in previous work, including sentiment analysis (Chen et al., 2015; Xia et al., 2017), conversational agents (Lee, 2017), word and sentence representation learning (Xu et al., 2018; Liu et al., 2019), text classification, and question answering (d'Autume et al., 2019).

Recently, LAMOL (Sun et al., 2020) improved the performance of LLL with a general framework: 1) it followed the idea of treating many NLP tasks as question answering (QA) (McCann et al., 2018) and adapted all tasks into the language modeling (LM) form, so that in this unified framework a single language model can perform LLL on many NLP tasks by generating answers based on the contexts and the questions; and 2) it outperformed the previous methods by a considerable margin and is only 2%-3% worse than the multi-tasking upper bound, which jointly learns all tasks on a mixed dataset.

This paper further improves LLL by introducing Lifelong Language Knowledge Distillation (L2KD), which can be flexibly applied upon the LAMOL architecture or other LLL methods for sequence generation learning.

The motivation of our work mainly comes from how to efficiently compress knowledge under a lifelong language learning framework. If the model can learn a new task in an efficient way, the previously learned knowledge may not be affected, and thus the problem of catastrophic forgetting can be mitigated.

We are inspired by knowledge distillation (Bucila et al., 2006; Hinton et al., 2015; Kim and Rush, 2016), in which a student (smaller) model is trained to imitate the behavior of a teacher (larger) model in order to reach performance closer to that of the teacher. The LLL model in L2KD can be seen as a weak learner that needs to compress knowledge from different tasks into a single compact model, so LLL can benefit from a similar knowledge distillation procedure, even though the model size is equal to that of its teacher model. The idea of distilling knowledge from equal-size models has also been studied in born-again neural networks (Furlanello et al., 2018), multitask learning (Clark et al., 2019) and lifelong learning (Hou et al., 2018), but it has never been explored in lifelong language learning research.

In L2KD, we train a new teacher model when facing a new task, and the LLL model imitates the behavior of its teacher at each training stage, as illustrated in Figure 1. This method only needs a little extra time to train a disposable teacher model for each new task, and the teacher model can be discarded when learning the next task; therefore, no extra memory or model capacity is required for the target LLL model, making the proposed method more memory-efficient for real-world usage.

2 Proposed Approach

Before describing how L2KD works, in Section 2.1 we briefly introduce the architecture of LAMOL (Sun et al., 2020), which L2KD is built upon. Then we introduce different knowledge distillation strategies in Section 2.2, and how to apply them to L2KD in Section 2.3.

2.1 LAMOL: Language Modeling for Lifelong Language Learning

LAMOL casts every task as question answering: given a context and a question, the model learns to generate the answer, as illustrated in Figure 2a. Besides generating answers for the given questions, the model simultaneously learns to model the whole training sample, as illustrated in Figure 2b. By doing that, when training on the next task, the model can generate training samples for the previous tasks and train on both the data from the new task and the generated pseudo-data for the prior tasks. Thus the model forgets less when adapting to the new tasks.

Figure 2: Illustration of learning QA and LM in LAMOL. (a) Learning to solve target tasks (QA): the model generates the answer given the context and the question. (b) Learning to generate pseudo-data (LM): the model learns to generate the whole sequence of context, question and answer.

LAMOL can outperform previous regularization-based (Schwarz et al., 2018; Aljundi et al., 2018) or memory-based (Lopez-Paz et al., 2017; Yogatama et al., 2019) LLL methods by a large margin. While most previous methods usually obtain results only slightly better than the finetuning baseline (doing nothing to prevent forgetting), LAMOL already achieves results that are very close to the multitasking upper bound and only 2%-3% worse (Sun et al., 2020). Thus, in this paper, we focus on how to apply L2KD based on LAMOL.

2.2 Knowledge Distillation

Language Modeling: The training objective for normal language modeling is to minimize the negative log-likelihood (NLL) in predicting the next word (hard target):

  L_NLL(x; θ) = − ∑_{t=1}^{T} log P(x_t | x_{<t}; θ),

where T is the length of the sequence x.
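
To make the two LAMOL-style objectives concrete, the following is a minimal PyTorch sketch (not the authors' implementation): `lm` stands for any autoregressive language model that maps a batch of token ids to next-token logits (e.g., a GPT-2 wrapper), and the token ids below are purely illustrative.

```python
import torch
import torch.nn.functional as F

def shift_nll(lm, token_ids, loss_mask):
    """Hard-target NLL of predicting each next token.

    token_ids: LongTensor [T]; loss_mask: BoolTensor [T], True at positions
    whose prediction should contribute to the loss.
    """
    logits = lm(token_ids.unsqueeze(0)).squeeze(0)          # [T, |V|]
    pred, target = logits[:-1], token_ids[1:]               # position t predicts token t+1
    mask = loss_mask[1:].float()
    nll = F.cross_entropy(pred, target, reduction="none")   # [T-1]
    return (nll * mask).sum() / mask.sum()

# One training sample in LAMOL's unified format: [context, question, answer].
context, question, answer = [4, 8, 15], [16, 23], [42, 7]
x = torch.tensor(context + question + answer)

# (a) QA objective: only the answer tokens are supervised.
qa_mask = torch.tensor([False] * (len(context) + len(question)) + [True] * len(answer))
loss_qa = shift_nll(lm, x, qa_mask)

# (b) LM objective: the whole sample is modeled, which later enables pseudo-data generation.
lm_mask = torch.ones_like(x, dtype=torch.bool)
loss_lm = shift_nll(lm, x, lm_mask)
```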

In knowledge distillation, the student model instead learns to match the output of teacher models. The target unit for considering the errors can be either the word level or the sequence level.

Word-Level (Word-KD): We minimize the cross-entropy between the output distributions from the student and teacher models when predicting the next word:

  L_Word-KD(x; θ_S; θ_T) = − ∑_{t=1}^{T} ∑_{k=1}^{|V|} P(V_k | x_{<t}; θ_T) log P(V_k | x_{<t}; θ_S),

where θ_S and θ_T denote the parameters of the student (LLL) model and the teacher model, and V_k ranges over the vocabulary V.

Sequence-Level (Seq-KD): We minimize the negative log-likelihood of the sequence x̂ decoded by the teacher model:

  L_Seq-KD(x̂; θ_S) = − ∑_{t=1}^{T} log P(x̂_t | x̂_{<t}; θ_S).

2.3 L2KD

Assume that in the stream of tasks {D_1, D_2, ...}, our LLL model has learned from D_1 to D_{m−1} and is now adapted to D_m. First we optimize a teacher model on D_m to obtain its parameters θ_T, and sample γ · |D_m| pseudo-data from the LLL model θ_S to form D^prev. The LLL model then learns the new task by distilling knowledge from the teacher while training on the pseudo-data of the previous tasks. Finally we jointly optimize the two losses and obtain the parameters θ*_S for the LLL model:

  θ*_S = argmin_{θ_S} ( ∑_{X_i^m ∈ D_m} L_new + ∑_{X_i^prev ∈ D^prev} L_prev ).

The training procedure is detailed in Algorithm 1.

Algorithm 1 L2KD: Lifelong Language Knowledge Distillation
  Input: current task dataset D_m, teacher model with parameters θ_T, knowledge distillation loss L_KD, pseudo-data sample rate γ
  Output: LLL model parameters θ_S
  Optimize the teacher model on D_m to get parameters θ_T
  Sample γ · |D_m| pseudo-data from θ_S to form D^prev
  for all training samples {X_i^m}_{i=1}^{n} ∈ D_m do
      update θ_S to minimize L_KD(X_i^m; θ_S; θ_T)
  end for
  for all pseudo-data samples {X_i^prev}_{i=1}^{n'} ∈ D^prev do
      update θ_S to minimize L_prev(X_i^prev; θ_S)
  end for
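
The distillation losses above can be sketched in a few lines of PyTorch; this is an illustrative sketch rather than the released implementation, reusing the assumed `lm`-style interface and the `shift_nll` helper from the previous snippet. `lm_student` and `lm_teacher` stand for the LLL and teacher models, `greedy_decode` is a hypothetical decoding helper, and the temperature follows the KD temperature of 2.0 listed in Appendix A. Per Algorithm 1, an update step would apply such a loss to each new-task sample, while the pseudo-data sampled from D^prev are trained with the ordinary hard-target loss.

```python
import torch
import torch.nn.functional as F

KD_TEMPERATURE = 2.0  # Appendix A

def word_kd_loss(lm_student, lm_teacher, token_ids, loss_mask, temp=KD_TEMPERATURE):
    """Word-level KD: cross-entropy between the teacher's and the student's
    next-token distributions (soft targets), over the supervised positions."""
    with torch.no_grad():
        teacher_logits = lm_teacher(token_ids.unsqueeze(0)).squeeze(0)[:-1]
        p_teacher = F.softmax(teacher_logits / temp, dim=-1)           # [T-1, |V|]
    student_logits = lm_student(token_ids.unsqueeze(0)).squeeze(0)[:-1]
    log_p_student = F.log_softmax(student_logits / temp, dim=-1)       # [T-1, |V|]
    ce = -(p_teacher * log_p_student).sum(dim=-1)                      # [T-1]
    mask = loss_mask[1:].float()
    return (ce * mask).sum() / mask.sum()

def seq_kd_loss(lm_student, lm_teacher, context_ids, max_len=64):
    """Sequence-level KD: fit the sequence decoded by the teacher with the
    ordinary hard-target NLL (the teacher's output plays the role of x-hat)."""
    with torch.no_grad():
        x_hat = greedy_decode(lm_teacher, context_ids, max_len)        # assumed helper
    return shift_nll(lm_student, x_hat, torch.ones_like(x_hat, dtype=torch.bool))
```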

3 Experimental Setup

To evaluate the proposed method, we conduct a set of experiments detailed below.

3.1 Model and Training Details

We build our proposed approach based on the implementation of LAMOL (https://github.com/jojotenya/LAMOL) to make the results comparable. We use the same pre-trained small GPT-2 (Radford et al., 2019) for all single-task teacher, multitask and LLL models, and train the GPT-2 for nine epochs on each dataset. We use the best setting in LAMOL: task-specific tokens as begin-of-sentence tokens, and a pseudo-data sample rate of 0.2. During inference, we use greedy decoding to generate sequences. More details can be found in Appendix A.
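
The pseudo-data replay can be sketched as below; this is an illustrative sketch under assumptions, where `sample_sequence` is a hypothetical top-k sampling helper (k=20, as in Appendix A) and `prev_task_bos_ids` holds the task-specific begin-of-sentence token ids of the previously learned tasks.

```python
import random

GAMMA = 0.2  # pseudo-data sample rate used in the experiments

def build_pseudo_dataset(lm_student, prev_task_bos_ids, new_task_size, max_len=128):
    """Generate gamma * |D_m| pseudo-samples of previous tasks from the LLL model
    itself, conditioned on each task's begin-of-sentence token."""
    pseudo = []
    for _ in range(int(GAMMA * new_task_size)):
        bos = random.choice(prev_task_bos_ids)           # pick one previous task
        seq = sample_sequence(lm_student, start_token=bos, k=20, max_len=max_len)
        pseudo.append(seq)                               # a full [context, question, answer] sample
    return pseudo
```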

3.2 Datasets

To evaluate the capability of L2KD on diverse sequence generation tasks, we pick the following three tasks from DecaNLP (McCann et al., 2018):

• WikiSQL (Zhong et al., 2017): a dataset for developing natural language interfaces for relational databases, in which the model needs to generate structured queries from natural language.
• CNN/DailyMail (See et al., 2017): a text summarization dataset collected from online news articles.
• MultiWOZ (Budzianowski et al., 2018): a multi-domain wizard-of-oz dataset for task-oriented dialogue modeling, in which the model needs to generate the semantic state sequences based on the given partial dialogues.

Note that we skip the machine translation dataset (IWSLT) in DecaNLP here, because GPT-2 does not contain a multilingual vocabulary. These three datasets focus on different tasks, representing the most general case in LLL.

However, in real-world scenarios, it is more common that the LLL model is trained to solve the same task in different domains that change through time. Thus we conduct experiments on the following natural language generation (NLG) datasets with five different domains:

• E2E NLG (Novikova et al., 2017): a dataset for training end-to-end natural language generation systems in the restaurant domain.
• RNNLG (Wen et al., 2015): a dataset for NLG in spoken dialogue system application domains. It contains four domains: San Francisco restaurant search (rest.), San Francisco hotel search (hotel), television sale/search (tv), and laptop sale/search (laptop). We use the full dataset for the first three domains and the reduced set for the laptop domain to keep them balanced.

Although our method is mainly designed for sequence generation tasks, we also use five different text classification datasets to evaluate whether the proposed method also benefits text classification tasks. We use the randomly sampled subsets released by Sun et al. (2020), each of which has 115,000 training and 7,600 testing instances:

• AGNews: news articles, including 4 classes for their topics.
• Yelp: customer reviews on Yelp, including 5 classes for their rating scores.
• Amazon: customer reviews on Amazon, including 5 classes for their rating scores.
• DBPedia: articles on Wikipedia, including 14 classes for their categories.
• Yahoo: QA pairs on the Yahoo! platform, including 10 classes for their categories.

Due to the limitation of computational resources and the data imbalance, we reduce the large datasets (WikiSQL, CNN/DailyMail, E2E NLG, RNNLG (laptop)) to a smaller size by random sampling. The reduced data sizes and other data statistics in the experiments are detailed in Table 1.

Table 1: Dataset sizes and the evaluation metrics.

Dataset | Train | Test | Metric
Sequence Generation for Different Domains
E2E NLG | 6,000 | 2,000 | ROUGE
RNNLG (rest.) | 6,228 | 1,039 | ROUGE
RNNLG (hotel) | 6,446 | 1,075 | ROUGE
RNNLG (tv) | 8,442 | 1,407 | ROUGE
RNNLG (laptop) | 7,944 | 2,649 | ROUGE
Text Classification for Different Tasks
AGNews | 115,000 | 7,600 | Exact Match
Yelp | 115,000 | 7,600 | Exact Match
Amazon | 115,000 | 7,600 | Exact Match
DBPedia | 115,000 | 7,600 | Exact Match
Yahoo | 115,000 | 7,600 | Exact Match

Table 2: Detailed experimental results on MultiWOZ (WOZ), CNN/DailyMail (CNN), and WikiSQL (SQL), with six different lifelong learning orders.

Orders: WOZ → CNN → SQL | CNN → SQL → WOZ | SQL → WOZ → CNN
Method | WOZ | CNN | SQL | Avg | WOZ | CNN | SQL | Avg | WOZ | CNN | SQL | Avg
(a) Finetune | 0.0 | 26.3 | 64.3 | 30.2 | 84.6 | 6.8 | 2.1 | 31.2 | 0.1 | 26.0 | 0.0 | 8.7
(b) LAMOL | 67.6 | 27.3 | 62.5 | 52.4 | 83.0 | 27.8 | 60.8 | 57.2 | 76.1 | 26.0 | 55.0 | 52.4
(c) (b) + Word-KD | 82.4 | 27.6 | 65.0 | 58.3 | 86.1 | 27.5 | 63.2 | 59.0 | 79.5 | 26.2 | 59.6 | 55.1
(d) (b) + Seq-KDsoft | 81.0 | 26.9 | 64.7 | 57.5 | 84.1 | 27.6 | 63.4 | 58.4 | 81.7 | 25.9 | 58.4 | 55.3
(e) (b) + Seq-KD | 76.4 | 28.0 | 63.7 | 56.1 | 83.0 | 28.3 | 61.5 | 57.6 | 81.0 | 27.5 | 57.3 | 55.3

Orders: WOZ → SQL → CNN | CNN → WOZ → SQL | SQL → CNN → WOZ
Method | WOZ | CNN | SQL | Avg | WOZ | CNN | SQL | Avg | WOZ | CNN | SQL | Avg
(a) Finetune | 0.0 | 25.8 | 0.0 | 8.6 | 3.6 | 24.5 | 64.0 | 30.7 | 85.0 | 7.3 | 0.0 | 30.8
(b) LAMOL | 76.1 | 26.3 | 59.3 | 53.9 | 79.8 | 27.3 | 64.1 | 57.0 | 84.0 | 27.2 | 58.7 | 56.6
(c) (b) + Word-KD | 81.4 | 26.7 | 59.6 | 55.9 | 83.5 | 27.8 | 65.0 | 58.8 | 78.7 | 26.4 | 59.0 | 54.7
(d) (b) + Seq-KDsoft | 80.4 | 26.1 | 59.9 | 55.5 | 83.7 | 28.6 | 64.8 | 59.0 | 84.7 | 26.2 | 58.8 | 56.6
(e) (b) + Seq-KD | 77.2 | 27.0 | 59.5 | 54.5 | 82.8 | 29.5 | 64.4 | 58.9 | 84.9 | 27.8 | 57.3 | 56.6

4 Results and Discussion

We discuss the results for three settings: 1) different sequence generation tasks, 2) the same task in different domains, and 3) different text classification tasks, in order to validate the effectiveness of the proposed approach.

4.1 Different Sequence Generation Tasks

In these experiments, we perform lifelong learning on the WikiSQL (SQL), CNN/DailyMail (CNN) and MultiWOZ (WOZ) datasets with six different permutation orders, and test the performance at the end of the training streams. The detailed results are shown in Table 2, where the average scores over the three tasks are given for overall comparison. Note that the evaluation metrics of these three tasks all range from 0 to 100. The overall results of the six orders compared with single-task methods and multitask upper bounds are shown in Table 3.

In Table 2, the first baseline is (a) Finetune, in which we directly train the three tasks one after another without preventing catastrophic forgetting. It is obvious that the Finetune model forgets one or two of the tasks learned before the final one. (b) LAMOL, the current state-of-the-art approach that significantly reduces catastrophic forgetting, is included for comparison. Rows (c)-(e) show that applying L2KD upon LAMOL significantly outperforms LAMOL in almost all cases, no matter which knowledge distillation strategy is used: (c) Word-KD, (d) Seq-KDsoft, (e) Seq-KD. We also observe that among the three knowledge distillation strategies, (e) Seq-KD consistently improves the most on the CNN/DailyMail dataset, which is probably caused by the noisy nature of this summarization dataset: sequence-level knowledge distillation produces an easier-to-learn answer compared to the original complex answer, so the LLL model can learn it better.

On the other hand, for the other two tasks (MultiWOZ, WikiSQL), (c) Word-KD and (d) Seq-KDsoft improve more in most cases. The target sequences of these two tasks are relatively simple: MultiWOZ focuses on producing semantic state sequences from dialogues, and WikiSQL produces structured query sequences from the given natural language sentences, so the target sequences usually contain patterns less complex than natural language. In these cases, the soft targets may bring more advantages than teacher-decoded sequences for the LLL model to learn from.

Table 3: Averaged results at the final task of the lifelong learning procedures over six orders, compared to single-task and multitask upper bounds.

Non-Lifelong Methods
Method | WOZ | CNN | SQL | Avg
(1) Single QA | 84.8 | 25.5 | 63.1 | 57.8
(2) Single QA+LM | 82.2 | 25.9 | 63.7 | 57.3
(3) Multi_same QA | 66.2 | 25.6 | 53.0 | 48.3
(4) Multi_same QA+LM | 59.0 | 26.3 | 53.6 | 46.3
(5) Multi_long QA | 82.7 | 26.1 | 61.1 | 56.6
(6) Multi_long QA+LM | 85.4 | 26.7 | 61.3 | 57.8
(7) (6) + Seq-KD | 84.4 | 27.6 | 61.8 | 58.0

Lifelong Methods (averaged over six orders)
(a) Finetune | 28.9 | 19.5 | 21.7 | 23.4
(b) LAMOL | 77.7 | 27.0 | 60.0 | 54.9
(c) (b) + Word-KD | 81.9 | 27.0 | 61.9 | 57.0
(d) (b) + Seq-KDsoft | 82.6 | 26.9 | 61.7 | 57.1
(e) (b) + Seq-KD | 80.9 | 28.0 | 60.6 | 56.5

Standard Deviation of Lifelong Methods (over six orders)
(f) Finetune | 43.3 | 9.6 | 32.9 | 28.6
(g) LAMOL | 6.0 | 0.7 | 3.2 | 3.3
(h) (g) + Word-KD | 2.7 | 0.7 | 2.8 | 2.1
(i) (g) + Seq-KDsoft | 1.8 | 1.0 | 3.0 | 1.9
(j) (g) + Seq-KD | 3.4 | 0.9 | 3.1 | 2.5

In Table 3, the overall performance (averaged over the six permutation orders) is compared with single-task methods and multi-task upper bounds. There are two training methods here: optimizing the QA loss only (rows (1)(3)(5)) or optimizing both the QA and LM losses (rows (2)(4)(6)), as illustrated in Figure 2. For the multi-task models, we find that the same number of training steps (9 epochs on the mixed set) may not lead the models to converge (rows (3)(4)), so we additionally train multi-task models three times longer (27 epochs on the mixed set) in rows (5)(6).

The second part of Table 3 shows the average performance in lifelong learning over the six permutation orders. It is clear that L2KD significantly improves the average score from 54.9 in (b) LAMOL to 57.1 in (d) Seq-KDsoft. The performance of Seq-KDsoft is only 0.7% worse than the multi-task upper bound, 57.8 in (6) Multi_long QA+LM. Hence, the results show that L2KD can bridge the gap between lifelong learning and multi-task learning.

Note that we can also apply a similar distillation strategy to multitask learning to obtain a stronger upper bound, which might be a fairer comparison. Thus, we add Seq-KD to (6) Multi_long QA+LM by making the model learn from five single-task teachers, and the results are shown in row (7). We observe that the improvement on multitask learning is only 0.2%, while L2KD improves LAMOL by 2.2%. This result indicates that the benefits brought by knowledge distillation may be saturated for multitask learning, but are not saturated for L2KD. The gap between lifelong learning and multi-task learning is still reduced even if we apply a similar strategy to both models.

The third part of Table 3 shows the standard deviations over the six permutation orders. As mentioned in Sun et al. (2020), if an algorithm has a smaller standard deviation over different training orders, the algorithm is more robust and less susceptible to learning orders. The average standard deviation of LAMOL is reduced from 3.3 to 1.9 with Seq-KDsoft. Therefore, both soft-target training and teacher-decoded sequences can stabilize the training process of LLL and make it more order-agnostic.

Figure 3: The learning curves of different LLL methods (Word-KD, Seq-KDsoft, Seq-KD, LAMOL, finetune) in the order WOZ → SQL → CNN: (a) learning curve of WOZ (dsEM), (b) learning curve of SQL (lfEM), (c) learning curve of CNN (ROUGE), over 27 training epochs.

To further analyze the performance change when training on different tasks, we plot the testing results during the whole lifelong learning stream with the order WOZ (epochs 1-9) → SQL (epochs 10-18) → CNN (epochs 19-27) in Figure 3. Figure 3a illustrates the performance on WOZ for all methods. The finetune baseline (purple line) significantly degrades when moving to the next task (SQL) in the second training stage, while the other methods keep the performance. We observe that applying soft-target Word-KD (blue) or Seq-KDsoft (red) increases the scores faster than hard-target Seq-KD (yellow) and the LAMOL baseline (green) at the initial epochs, indicating the effectiveness of the proposed L2KD. In terms of the other two tasks, all distillation methods (Word-KD, Seq-KDsoft, Seq-KD) are capable of maintaining the performance of WOZ slightly better than LAMOL, and they finally converge to better points in the third training stage. A similar trend can be observed in Figure 3b, where soft-target Word-KD and Seq-KDsoft rise faster in the second training stage and finally drop less than the original LAMOL in the third training stage, demonstrating the desirable property of our proposed methods as LLL models.

In Figure 3c, in the third stage, the yellow line (Seq-KD) converges to a better point than all other methods, because it reduces the complexity of the noisy summarization dataset. However, although Seq-KDsoft also reduces the complexity, it does not achieve the same performance as Seq-KD. The probable reason is that the teacher-decoded sentences may be easy enough for the LLL model to learn from, and the soft targets here keep the model from completely converging on these easy sentences.

4.2 Same Task in Different Domains

We perform L2KD on the same NLG task in five different domains: restaurant from E2ENLG, and restaurant/hotel/tv/laptop from RNNLG. Note that although both E2ENLG and RNNLG have a restaurant domain, their input formats and label types are totally different. The results are shown in Table 4, where we only show two orders in the experiments: from the hardest task to the easiest one (left-to-right) and its reverse order (right-to-left); these two orders are representative of all the others. The results show that L2KD outperforms the original LAMOL in most cases and improves the averaged ROUGE score by nearly 2 points.

Table 4: Experimental results on NLG datasets from different domains.

Method | e2e | rest | hotel | tv | laptop | Avg
Single (QA) | 48.8 | 64.0 | 65.4 | 70.8 | 73.0 | 64.4
Single (QA+LM) | 48.8 | 64.2 | 65.5 | 71.0 | 72.8 | 64.5
Multi (QA) | 49.2 | 65.6 | 67.2 | 72.7 | 74.8 | 65.9
Multi (QA+LM) | 49.5 | 65.2 | 66.7 | 73.4 | 74.6 | 65.9
Left-to-right (e2e → rest → hotel → tv → laptop)
LAMOL | 50.1 | 58.7 | 61.5 | 73.7 | 72.0 | 63.2
+ Word-KD | 44.9 | 60.0 | 62.8 | 76.7 | 73.3 | 63.5
+ Seq-KDsoft | 46.9 | 58.4 | 63.2 | 76.4 | 73.6 | 63.7
+ Seq-KD | 48.6 | 62.2 | 66.4 | 74.7 | 75.5 | 65.5
Right-to-left (laptop → tv → hotel → rest → e2e)
LAMOL | 49.8 | 65.0 | 65.9 | 75.8 | 77.0 | 66.7
+ Word-KD | 49.3 | 67.6 | 68.7 | 76.8 | 77.7 | 68.0
+ Seq-KDsoft | 49.4 | 66.6 | 68.0 | 76.7 | 77.4 | 67.6
+ Seq-KD | 49.7 | 65.9 | 66.7 | 77.4 | 78.8 | 67.7

We find that different training orders bring slightly different results. In the right-to-left order, the baseline LAMOL can easily outperform the multi-task models due to its easiest-to-hardest order, which helps the model to better transfer the knowledge gradually across these NLG tasks, similar to curriculum learning (Bengio et al., 2009). Therefore, it does not mean that lifelong learning can beat the multi-task model in all experiments.

Figure 4: The learning curves on NLG tasks with the hardest-to-easiest (left-to-right) order: (a) E2ENLG, (b) RNNLG (restaurant), (c) RNNLG (hotel), (d) RNNLG (tv), (e) RNNLG (laptop); the x-axis is the training epoch (0-45) and the y-axis is ROUGE.

We plot the learning curves of these five tasks in the left-to-right order in Figure 4 for further analysis. Except for E2ENLG, Seq-KD (yellow lines) usually gains more performance at the end of the training stream. Also, we observe that when forward transfer exists, Seq-KD usually benefits more. For example, in Figure 4c, when training on RNNLG (restaurant) in the 10th-18th epochs, the ROUGE score on RNNLG (hotel) has already risen even before the model first sees RNNLG (hotel) data at the 19th epoch, indicating that forward transfer exists in this order. The rising trend is more obvious for Seq-KD (yellow), and the same trend can also be observed in Figures 4d and 4e.

Table 5: Experimental results on five text classification datasets.

Method | amazon | yelp | yahoo | ag | dbpedia | Avg
Single (QA) | 55.9 | 63.3 | 70.6 | 93.6 | 99.0 | 76.5
Single (QA+LM) | 56.9 | 64.5 | 70.1 | 93.7 | 99.1 | 76.9
Multi (QA) | 56.6 | 63.3 | 69.2 | 93.7 | 99.0 | 76.4
Multi (QA+LM) | 57.8 | 64.4 | 70.9 | 94.0 | 99.1 | 77.2
Left-to-right (amazon → yelp → yahoo → ag → dbpedia)
LAMOL | 52.7 | 61.6 | 70.3 | 93.6 | 99.1 | 75.5
+ Word-KD | 57.5 | 63.6 | 71.3 | 93.9 | 99.2 | 77.1
+ Seq-KDsoft | 55.7 | 62.0 | 71.3 | 93.9 | 99.2 | 76.4
+ Seq-KD | 56.8 | 62.3 | 71.1 | 93.4 | 99.1 | 76.6
Right-to-left (dbpedia → ag → yahoo → yelp → amazon)
LAMOL | 57.9 | 63.5 | 70.7 | 91.7 | 98.3 | 76.4
+ Word-KD | 57.0 | 64.1 | 73.2 | 92.7 | 98.8 | 77.1
+ Seq-KDsoft | 57.0 | 64.1 | 71.9 | 92.4 | 98.8 | 76.8
+ Seq-KD | 58.4 | 64.4 | 71.7 | 91.5 | 98.8 | 76.9

4.3 Text Classification Tasks

Although our method is mainly designed for sequence generation tasks, we investigate whether this idea also benefits text classification (TC) tasks. Thus we perform L2KD on five TC tasks, where the answers are always very short sequences representing the class labels of the given documents, such as World, Sports, Business, or Sci/Tech in the AGNews dataset. Hence, generating such short answers is not complex for the proposed model, and the performance mainly reflects text understanding rather than generation capability.

We conduct the experiments from the hardest task to the easiest task, and in its reverse order, as shown in Table 5. To our surprise, L2KD also improves LAMOL on TC tasks. The results of these two orders are only 0.1% worse than the multi-task upper bounds. Word-KD improves the most on these TC tasks in most cases, and the improvements are especially obvious for the earlier learned tasks. The learning curves for the TC tasks are shown in Appendix B for reference.

Because the answers in TC tasks are not as complex as in the other sequence generation tasks, we investigate where the improvement mainly comes from during the distillation process. We split each testing set into two groups: (A) questions correctly answered by the teacher model, and (B) questions incorrectly answered by the teacher model. We suspect that the LLL model trained by L2KD may simply copy the behavior of the teacher models and get its improvement mainly from group (A), while failing to answer the questions in group (B). To figure this out, we compute the accuracy of each LLL model (in the left-to-right experiment) on groups (A) and (B) respectively, and the difference between the original LAMOL and the three distillation strategies on the five tasks. The averaged results are shown in Table 6, and more detailed results for each task can be found in Appendix C.

Table 6: The accuracy in groups (A) and (B), averaged over the five classification datasets. The teacher scores are from five single-task models.

Method | Acc | Acc in (A) | Acc in (B)
Teacher | 76.73 | 100.00 | 0.00
LAMOL | 75.48 | 88.15 | 33.69
+ Word-KD | 77.11 | 90.26 (+2.11) | 33.75 (+0.06)
+ Seq-KDsoft | 76.42 | 89.42 (+1.27) | 33.52 (-0.17)
+ Seq-KD | 76.56 | 89.56 (+1.41) | 33.69 (+0.00)

Surprisingly, applying L2KD does not largely degrade the accuracy in group (B) compared to the original LAMOL, and Word-KD even improves it, showing that the LLL model does not fully copy the behavior of its teacher models. On the other hand, the total improvement indeed mainly comes from group (A), and Word-KD again improves the most. The improvement on both groups (A) and (B) for Word-KD indicates that on these TC tasks, the LLL model trained by Word-KD can better reach a balance between the teacher knowledge and the transfer ability. Therefore, it can take advantage of the teacher knowledge while avoiding some false knowledge taught by its teacher, by integrating the knowledge from other tasks.
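
The group (A)/(B) analysis described above can be reproduced with a few lines; the sketch below is illustrative (not the paper's evaluation code) and assumes that predictions and gold labels are given as parallel lists.

```python
def group_accuracy(teacher_preds, student_preds, gold):
    """Split the test set by whether the teacher is correct, then report the
    student's accuracy inside each group: (A) teacher correct, (B) teacher wrong."""
    group_a = [i for i, (t, g) in enumerate(zip(teacher_preds, gold)) if t == g]
    group_b = [i for i in range(len(gold)) if i not in set(group_a)]

    def acc(indices):
        return sum(student_preds[i] == gold[i] for i in indices) / max(len(indices), 1)

    return acc(group_a), acc(group_b)
```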

5 Related Work

Knowledge distillation has been introduced to the field of lifelong learning before: Learning without Forgetting (LwF) (Li and Hoiem, 2017), Generative Replay with Distillation (DGR+distill), Replay-through-Feedback (RtF) (van de Ven and Tolias, 2018), and Lifelong GAN (Zhai et al., 2019) all use knowledge distillation in lifelong learning, but only for computer vision tasks. Different from the prior work, this paper is the first attempt to adopt knowledge distillation for NLP tasks in the lifelong learning framework. Moreover, the prior work used the old model as a teacher to help the current model retain the knowledge about previous tasks. In contrast, our method trains a new teacher model on the incoming new task. Thus, these two directions of applying knowledge distillation are complementary to each other, showing the potential of applying the proposed method to fields other than NLP.

6 Conclusion

This paper presents Lifelong Language Knowledge Distillation (L2KD), a simple method that effectively helps lifelong language learning models maintain performance comparable to their multi-task upper bounds. The experiments show the consistent improvement achieved by L2KD in three different settings, indicating the effectiveness of the proposed method for training robust LLL models. In addition, the proposed approach only requires a little extra time for training the teacher, without any extra memory or capacity needed, showing its potential for practical scenarios.

Acknowledgments

We thank the reviewers for their insightful comments. We are grateful to Cheng-Hao Ho and Fan-Keng Sun for their source code of LAMOL and valuable suggestions. This work was financially supported by the Young Scholar Fellowship Program of the Ministry of Science and Technology (MOST) in Taiwan, under Grant 109-2636-E-002-026.

References

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.

Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In KDD.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2015. Lifelong learning for sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).

Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc Le. 2019. BAM! Born-again multi-task networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5931–5937.

Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. arXiv preprint arXiv:1906.01076.

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In International Conference on Learning Representations.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. 2018. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 437–452.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327.

Sungjin Lee. 2017. Toward continual learning for conversational agents. arXiv preprint.

Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.

Tianlin Liu, Lyle Ungar, and João Sedoc. 2019. Continual learning for sentence representations using conceptors. In NAACL-HLT.

David Lopez-Paz et al. 2017. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Jonathan Schwarz, Jelena Luketina, Wojciech M. Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. 2018. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2020. LAMOL: Language modeling for lifelong language learning. In International Conference on Learning Representations.

Gido M. van de Ven and Andreas S. Tolias. 2018. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721.

R. Xia, J. Jiang, and H. He. 2017. Distantly supervised lifelong learning for large-scale social media sentiment analysis. IEEE Transactions on Affective Computing, 8(4):480–491.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2018. Lifelong domain word embedding via meta-learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.

Mengyao Zhai, Lei Chen, Frederick Tung, Jiawei He, Megha Nawhal, and Greg Mori. 2019. Lifelong GAN: Continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2759–2768.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

Chunting Zhou, Graham Neubig, and Jiatao Gu. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations.

A Training Details

We use a single NVIDIA TESLA V100 (32G) for each experiment. The average runtime of the experiments in Sections 4.1 and 4.2 is 3-8 hours. The experiments in Section 4.3 need about 3 days for a single run.

We did not conduct a hyperparameter search, but follow all the best settings in the official implementation of LAMOL (https://github.com/jojotenya/LAMOL) to keep the results comparable. The main hyperparameters are listed in Table 7. More details can be found in our released code.

Table 7: The main hyperparameters in the experiments.

hyperparameter | value
optimizer | Adam
adam epsilon | 1.0 × 10^-4
learning rate | 6.25 × 10^-5
training epochs / task | 9
max gradient norm | 1.0
learning rate schedule | warmup linear
warmup ratio | 0.005
temperature for KD | 2.0
top-k sampling | k = 20
weight decay | 0.01
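
For convenience, the same settings can also be written as a plain Python configuration; the dictionary below simply mirrors Table 7, and the key names are illustrative rather than those used in the repository.

```python
# Training configuration mirroring Table 7 (key names are illustrative).
CONFIG = {
    "optimizer": "Adam",
    "adam_epsilon": 1e-4,
    "learning_rate": 6.25e-5,
    "training_epochs_per_task": 9,
    "max_gradient_norm": 1.0,
    "lr_schedule": "warmup_linear",
    "warmup_ratio": 0.005,
    "kd_temperature": 2.0,
    "top_k_sampling_k": 20,
    "weight_decay": 0.01,
}
```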

B Learning Curves for Text Classification Tasks

The learning curves for the five text classification tasks are shown in Figure 5.

Figure 5: The learning curves on the five text classification tasks: (a) Amazon, (b) Yelp, (c) Yahoo, (d) AGNews, (e) DBPedia. The x-axis represents epochs (0-45) and the y-axis represents accuracy.

C Detailed Accuracy Analysis for Text Classification Tasks

The detailed accuracy in groups (A) and (B) for the five text classification tasks is shown in Table 8.

Table 8: The accuracy in groups (A) and (B), detailed for the five classification datasets. The teacher scores are from five single-task models.

Method | Acc | Acc in (A) | Acc in (B)
Amazon
Teacher | 55.50 | 100.00 | 0.00
LAMOL | 52.74 | 66.22 | 35.93
+ Word-KD | 57.54 | 73.33 (+7.11) | 37.85 (+1.92)
+ Seq-KDsoft | 55.74 | 70.41 (+4.20) | 37.43 (+1.51)
+ Seq-KD | 56.78 | 71.98 (+5.76) | 37.82 (+1.89)
Yelp
Teacher | 64.11 | 100.00 | 0.00
LAMOL | 61.61 | 75.82 | 36.22
+ Word-KD | 63.59 | 79.92 (+4.10) | 34.43 (-1.80)
+ Seq-KDsoft | 62.00 | 77.50 (+1.68) | 34.32 (-1.91)
+ Seq-KD | 62.32 | 77.79 (+1.97) | 34.68 (-1.54)
Yahoo
Teacher | 71.20 | 100.00 | 0.00
LAMOL | 70.29 | 88.28 | 25.81
+ Word-KD | 71.28 | 89.63 (+1.35) | 25.90 (+0.09)
+ Seq-KDsoft | 71.26 | 89.39 (+1.11) | 26.45 (+0.64)
+ Seq-KD | 71.13 | 89.52 (+1.24) | 25.68 (-0.14)
AGNews
Teacher | 93.76 | 100.00 | 0.00
LAMOL | 93.63 | 97.15 | 40.70
+ Word-KD | 93.91 | 97.67 (+0.52) | 37.32 (-3.37)
+ Seq-KDsoft | 93.89 | 97.81 (+0.66) | 35.00 (-5.69)
+ Seq-KD | 93.45 | 97.15 (+0.00) | 37.74 (-2.95)
DBPedia
Teacher | 99.11 | 100.00 | 0.00
LAMOL | 99.13 | 99.78 | 26.61
+ Word-KD | 99.24 | 99.85 (+0.07) | 31.05 (+4.44)
+ Seq-KDsoft | 99.18 | 99.85 (+0.07) | 25.13 (-1.48)
+ Seq-KD | 99.11 | 99.82 (+0.04) | 19.22 (-7.39)
