
Continual Learning for Natural Language Generation in Task-oriented Dialog Systems

Fei Mi1*, Liangwei Chen1, Mengjie Zhao2, Minlie Huang3* and Boi Faltings1
1LIA, EPFL, Lausanne, Switzerland   2CIS, LMU, Munich, Germany   3CoAI, DCST, Tsinghua University, Beijing, China
{fei.mi,liangwei.chen,boi.faltings}@epfl.ch, [email protected], [email protected]

Abstract

Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches for NLG, they are typically developed in an offline manner for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a "continual learning" setting to expand its knowledge to new domains or functionalities incrementally. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To this end, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay) by replaying prioritized historical exemplars, together with an adaptive regularization technique based on Elastic Weight Consolidation. Extensive experiments to continually learn new domains and intents are conducted on MultiWoZ-2.0 to benchmark ARPER with a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating the detrimental catastrophic forgetting issue.

* Corresponding authors

1 Introduction

As an essential part of task-oriented dialog systems (Wen et al., 2015b; Bordes et al., 2016), the task of Natural Language Generation (NLG) is to produce a natural language utterance containing the desired information given a semantic representation (a so-called dialog act). Existing NLG models (Wen et al., 2015c; Tran and Nguyen, 2017; Tseng et al., 2018) are typically trained offline using annotated data from a single or a fixed set of domains. However, a desirable dialog system in real-life applications often needs to expand its knowledge to new domains and functionalities. Therefore, it is crucial to develop an NLG approach with the capability of continual learning after a dialog system is deployed. Specifically, an NLG model should be able to continually learn new utterance patterns without forgetting the old ones it has already learned.

The major challenge of continual learning lies in catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999): a neural model trained on new data tends to forget the knowledge it has acquired on previous data. We diagnose in Section 4.4 that neural NLG models suffer from such detrimental catastrophic forgetting when continually trained on new domains. A naive solution is to retrain the NLG model using all historical data every time, but this is not scalable due to severe computation and storage overhead.

To this end, we propose storing a small set of representative utterances from previous data, namely exemplars, and replaying them to the NLG model each time it is trained on new data. Methods using exemplars have shown great success in different continual learning (Rebuffi et al., 2017; Castro et al., 2018; Chaudhry et al., 2019) and reinforcement learning (Schaul et al., 2016; Andrychowicz et al., 2017) tasks. In this paper, we propose a prioritized exemplar selection scheme to choose representative and diverse exemplar utterances for NLG. We empirically demonstrate that prioritized exemplar replay alleviates catastrophic forgetting by a large degree.

In practice, the number of exemplars should be reasonably small to maintain a manageable memory footprint, so the resulting constraint against forgetting old utterance patterns is not strong enough on its own. To enforce a stronger constraint, we propose a regularization method based on the well-known Elastic Weight Consolidation technique (EWC; Kirkpatrick et al., 2017). The idea is to use a quadratic term to elastically regularize the parameters that are important for previous data.

Besides its wide application in computer vision, EWC has recently been applied to the domain adaptation task for Neural Machine Translation (Thompson et al., 2019; Saunders et al., 2019). In this paper, we combine EWC with exemplar replay by approximating the Fisher Information Matrix w.r.t. the carefully chosen exemplars, so that not all historical data need to be stored. Furthermore, we propose to adaptively adjust the regularization weight to account for the difference between new and old data, so as to flexibly deal with different new data distributions.

To summarize our contribution: (1) to the best of our knowledge, this is the first attempt to study the practical continual learning configuration for NLG in task-oriented dialog systems; (2) we propose a method called Adaptively Regularized Prioritized Exemplar Replay (ARPER) for this task, and benchmark it against a wide range of state-of-the-art continual learning techniques; (3) extensive experiments are conducted on the MultiWoZ-2.0 (Budzianowski et al., 2018) dataset to continually learn new tasks, including domains and intents, using two base NLG models. Empirical results demonstrate the superior performance of ARPER and its ability to mitigate catastrophic forgetting. Our code is available at https://github.com/MiFei/Continual-Learning-for-NLG

2 Related Work

Continual Learning. The major challenge for continual learning is catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999), where optimization over new data leads to performance degradation on data learned before. Methods designed to mitigate catastrophic forgetting fall into three categories: regularization, exemplar replay, and dynamic architectures. Methods using dynamic architectures (Rusu et al., 2016; Maltoni and Lomonaco, 2019) increase model parameters throughout the continual learning process, which leads to an unfair comparison with other methods. In this work, we focus on the first two categories.

Regularization methods add specific regularization terms to consolidate knowledge learned before. Li and Hoiem (2017) introduced knowledge distillation (Hinton et al., 2015) to penalize model logit change, and it has been widely employed in Rebuffi et al. (2017); Castro et al. (2018); Wu et al. (2019); Hou et al. (2019); Zhao et al. (2019). Another direction is to regularize parameters crucial to old knowledge according to various importance measures (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018).

Exemplar replay methods store past samples, a.k.a. exemplars, and replay them periodically. Instead of selecting exemplars at random, Rebuffi et al. (2017) incorporated the Herding technique (Welling, 2009) to choose the exemplars that best approximate the mean feature vector of a class; this approach is widely used in Castro et al. (2018); Wu et al. (2019); Hou et al. (2019); Zhao et al. (2019); Mi et al. (2020a,b). Ramalho and Garnelo (2019) proposed to store the samples that the model is least confident about. Chaudhry et al. (2019) demonstrated the effectiveness of exemplars for various continual learning tasks in computer vision.

Catastrophic Forgetting in NLP. The catastrophic forgetting issue in NLP tasks has attracted increasing attention recently (Mou et al., 2016; Chronopoulou et al., 2019). Yogatama et al. (2019) and Arora et al. (2019) identified the detrimental catastrophic forgetting issue while fine-tuning ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). To deal with this issue, He et al. (2019) proposed to heavily replay pre-training data during fine-tuning, and Chen et al. (2020) proposed an improved Adam optimizer to recall knowledge captured during pre-training. The catastrophic forgetting issue has also been noticed in domain adaptation setups for neural machine translation (Saunders et al., 2019; Thompson et al., 2019; Varis and Bojar, 2019) and in reading comprehension (Xu et al., 2019).

Lee (2017) first studied the continual learning setting for dialog state tracking in task-oriented dialog systems. However, that setting is still a one-time adaptation process, and the adopted dataset is small. Shen et al. (2019) recently applied progressive networks (Rusu et al., 2016) to the semantic slot filling task from a continual learning perspective similar to ours; however, their method is based on a dynamic architecture, which is beyond the scope of this paper. Liu et al. (2019) proposed a Boolean operation on "conceptor" matrices for continually learning sentence representations using linear encoders. Li et al. (2020) combined continual learning and language systematic compositionality for sequence-to-sequence learning tasks.

Natural Language Generation (NLG). In this paper, we focus on NLG for task-oriented dialog systems. A series of neural methods have been proposed to generate accurate, natural, and diverse utterances, including HLSTM (Wen et al., 2015a), SCLSTM (Wen et al., 2015c), Enc-Dec (Wen et al., 2015b), RALSTM (Tran and Nguyen, 2017), and SC-VAE (Tseng et al., 2018).

Recent works have considered the domain adaptation setting. Tseng et al. (2018) and Tran and Nguyen (2018b) proposed to learn domain-invariant representations using VAEs (Kingma and Welling, 2013); they later designed two domain adaptation critics (Tran and Nguyen, 2018a). Recently, Mi et al. (2019); Qian and Yu (2019); Peng et al. (2020) studied learning new domains with limited training data. However, existing methods only consider a one-time adaptation process. The continual learning setting and the corresponding catastrophic forgetting issue remain to be explored.

[Figure 1: An example of an NLG model continually learning new domains (Attraction → Restaurant → Hotel, trained on Data1, Data2, Data3 in turn). The model needs to perform well on all domains it has seen before; for example, f_θ3 needs to deal with all three previous domains (Attraction, Restaurant, Hotel).]

3 Model

In this section, we first introduce the background of neural NLG models in Section 3.1 and the continual learning formulation in Section 3.2. In Section 3.3, we introduce the proposed method ARPER.

3.1 Background on Neural NLG Models

The NLG component of a task-oriented dialog system produces natural language utterances conditioned on a semantic representation called a dialog act (DA). Specifically, the dialog act d is defined as the combination of an intent I and a set of slot-value pairs S(d) = {(s_i, v_i)}_{i=1}^{p}:

    d = [ I, (s_1, v_1), ..., (s_p, v_p) ],    (1)

where p is the number of slot-value pairs. The intent I controls the utterance functionality, while the slot-value pairs contain the information to express. For example, "There is a restaurant called [La Margherita] that serves [Italian] food." is an utterance corresponding to the DA "[Inform, (name=La Margherita, food=Italian)]".

Neural models have recently shown promising results for NLG tasks. Conditioned on a DA, a neural NLG model generates an utterance containing the desired information word by word. For a DA d with the corresponding ground-truth utterance Y = (y_1, y_2, ..., y_K), the probability of generating Y is factorized as:

    f_θ(Y, d) = ∏_{k=1}^{K} p_{y_k} = ∏_{k=1}^{K} p(y_k | y_{<k}, d),    (2)

where f_θ is the NLG model parameterized by θ, and p_{y_k} is the output probability (i.e., the softmax of the logits) of the ground-truth token y_k at position k. The typical objective function for an utterance Y with DA d is the average cross-entropy loss w.r.t. all tokens in the utterance (Wen et al., 2015c,b; Tran and Nguyen, 2017; Peng et al., 2020):

    L_CE(Y, d, θ) = −(1/K) ∑_{k=1}^{K} log(p_{y_k})    (3)

3.2 Continual Learning of NLG

In practice, an NLG model needs to continually learn new domains or functionalities. Without loss of generality, we assume that new data arrive task by task (Rebuffi et al., 2017; Kirkpatrick et al., 2017). In a new task t, new data D_t are used to train the NLG model f_{θ_{t−1}} obtained till the last task. The updated model f_{θ_t} needs to perform well on all tasks so far. An example setting of continually learning new domains is illustrated in Figure 1. A task can be defined with different modalities to reflect diverse real-life applications; in subsequent experiments, we consider continually learning new domains and new intents in Eq. (1).

We emphasize that the setting of continual learning is different from that of domain adaptation. The latter is a one-time adaptation process, and the focus is to optimize performance on a target domain transferred from source domains, without considering a potential performance drop on the source domains (Mi et al., 2019; Qian and Yu, 2019; Peng et al., 2020). In contrast, continual learning requires an NLG model to continually learn new tasks over multiple transfers, and the goal is to make the model perform well on all tasks learned so far.
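To make the objective concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the dictionary layout and function name are assumptions) that represents a dialog act as in Eq. (1) and computes the average cross-entropy of Eq. (3) from the per-token probabilities a model assigns to a reference utterance:

```python
import math

# A dialog act (Eq. 1): an intent plus slot-value pairs.
dialog_act = {
    "intent": "Inform",
    "slots": [("name", "La Margherita"), ("food", "Italian")],
}

def average_cross_entropy(token_probs):
    """L_CE from Eq. (3): the average negative log-probability assigned to
    the ground-truth tokens y_1..y_K, where each probability is assumed to
    be conditioned on the dialog act and the token prefix as in Eq. (2)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy per-token probabilities a trained model might assign to the
# reference utterance for the dialog act above.
token_probs = [0.9, 0.8, 0.95, 0.7, 0.85]
print(dialog_act["intent"], average_cross_entropy(token_probs))
```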

3.3 ARPER

We introduce the proposed method (ARPER) with prioritized exemplar replay and an adaptive regularization technique to further alleviate the catastrophic forgetting issue.

3.3.1 Prioritized Exemplar Replay

To prevent the NLG model from catastrophically forgetting utterance patterns of earlier tasks, a small subset of each task's utterances is selected as exemplars, and the exemplars of previous tasks are replayed in later tasks. When training the NLG model f_{θ_t} for task t, the set of exemplars from previous tasks, denoted as E_{1:t−1} = {E_1, ..., E_{t−1}}, is replayed by joining it with the data D_t of the current task. The training objective with exemplar replay can therefore be written as:

    L_ER(θ_t) = ∑_{{Y,d} ∈ D_t ∪ E_{1:t−1}} L_CE(Y, d, θ_t).    (4)
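As an illustration of Eq. (4), the sketch below forms the replay training pool D_t ∪ E_{1:t−1} and sums the per-example losses. This is a hedged sketch under assumed data layouts (lists of (utterance, dialog act) pairs and a caller-supplied loss function), not the released implementation:

```python
import itertools

def exemplar_replay_loss(new_data, exemplars_by_task, ce_loss):
    """L_ER from Eq. (4): summed cross-entropy over D_t ∪ E_{1:t-1}.

    `new_data` is a list of (utterance, dialog_act) pairs for task t,
    `exemplars_by_task` maps previous task ids to their exemplar lists, and
    `ce_loss` is a callable computing L_CE for one pair (Eq. 3)."""
    replayed = list(itertools.chain.from_iterable(exemplars_by_task.values()))
    pool = new_data + replayed  # exemplars are simply joined with the new data
    return sum(ce_loss(utt, da) for utt, da in pool)

# Toy usage with a dummy per-example loss:
loss = exemplar_replay_loss(
    new_data=[("utt1", "da1"), ("utt2", "da2")],
    exemplars_by_task={0: [("old_utt", "old_da")]},
    ce_loss=lambda utt, da: 1.0,
)
print(loss)  # 3.0
```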

The set of exemplars of task t, referred to as E_t, is selected after f_{θ_t} has been trained, and it will be replayed in later tasks. The quality of exemplars is crucial for preserving the performance on previous tasks. We propose a prioritized exemplar selection method to select representative and diverse utterances as follows.

Representative utterances. The first criterion is that the exemplars E_t of a task t should be representative of D_t. We propose to select E_t as a priority list from D_t that minimizes the priority score:

    U(Y, d) = L_CE(Y, d, θ_t) · |S(d)|^β,    (5)

where S(d) is the set of slots in Y and β is a hyper-parameter. This formula correlates the representativeness of an utterance with its L_CE. Intuitively, the NLG model f_{θ_t} trained on D_t should be confident on representative utterances of D_t, i.e., assign them a low L_CE. However, L_CE is agnostic to the number of slots: we found that an utterance with many common slots in a task could also have a very low L_CE, yet using such utterances as exemplars may lead to overfitting and thus to forgetting of previously acquired general knowledge. The second term, |S(d)|^β, controls how strongly the number of slots in an utterance affects its priority as an exemplar. We empirically found in Appendix A.1 that the best β is greater than 0.

Diverse utterances. The second criterion is that exemplars should cover diverse slots of the task, rather than being similar or repetitive. A drawback of the above priority score is that similar or duplicated utterances containing the same set of frequent slots could be prioritized over utterances with a diverse set of slots. To encourage diversity of the selected exemplars, we propose an iterative approach that adds data from D_t to the priority list E_t based on the above priority score. At each iteration, if the set of slots of the current utterance is already covered by utterances in E_t, we skip it and move on to the data with the next best priority score.

Algorithm 1 shows the procedure to select m exemplars as a priority list E_t from D_t. The outer loop allows multiple passes through D_t, so that several utterances for the same set of slots S(d) can eventually be selected.

Algorithm 1 select_exemplars: prioritized exemplar selection procedure of ARPER for task t
 1: procedure select_exemplars(D_t, f_{θ_t}, m)
 2:   E_t ← new PriorityList()
 3:   D_t ← sort(D_t, key = U, order = asc)
 4:   while |E_t| < m do
 5:     S_seen ← new Set()
 6:     for {Y, d} ∈ D_t do
 7:       if S(d) ∈ S_seen then continue
 8:       else
 9:         D_t.remove({Y, d})
10:         E_t.insert({Y, d})
11:         S_seen.insert(S(d))
12:       if |E_t| == m then
13:         return E_t

3.3.2 Reducing Exemplars of Previous Tasks

Algorithm 1 requires the number of exemplars to be given. A straightforward choice is to store the same, fixed number of exemplars for each task, as in Castro et al. (2018); Wu et al. (2019); Hou et al. (2019). However, this method has two drawbacks: (1) the memory usage increases linearly with the number of tasks; (2) it does not discriminate between tasks with different difficulty levels.

To this end, we propose to store a fixed total number of exemplars throughout the entire continual learning process to maintain a bounded memory footprint, as in Rebuffi et al. (2017). As more tasks are continually learned, the exemplars of previous tasks are gradually reduced by keeping only the ones at the front of the priority list¹, and the exemplar size of a task is set proportional to the training data size of the task to reflect the task's difficulty.

¹ The priority list implementation allows reducing exemplars in constant time for each task.
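The following Python sketch mirrors the structure of Algorithm 1. It is an illustrative reimplementation under assumed data structures (dicts carrying a precomputed `ce_loss` and a `slots` frozenset), not the authors' released code:

```python
def priority_score(example, beta=0.5):
    """U(Y, d) from Eq. (5); lower scores are selected first."""
    return example["ce_loss"] * len(example["slots"]) ** beta

def select_exemplars(data, m, beta=0.5):
    """Prioritized, slot-diverse selection as in Algorithm 1.

    Each element of `data` holds the utterance's slot set and its
    cross-entropy under the trained model f_{theta_t}. The outer loop
    allows several passes, so utterances sharing a slot set are only
    picked again once every distinct slot set has been covered."""
    remaining = sorted(data, key=lambda ex: priority_score(ex, beta))
    exemplars = []
    while remaining and len(exemplars) < m:
        seen_slot_sets = set()
        kept = []
        for ex in remaining:
            if len(exemplars) == m or ex["slots"] in seen_slot_sets:
                kept.append(ex)            # skip for now; next pass may take it
            else:
                exemplars.append(ex)       # take it and mark its slot set
                seen_slot_sets.add(ex["slots"])
        remaining = kept
    return exemplars

# Toy usage: three utterances, two of which share the same slot set.
pool = [
    {"utterance": "u1", "slots": frozenset({"name", "food"}), "ce_loss": 0.2},
    {"utterance": "u2", "slots": frozenset({"name", "food"}), "ce_loss": 0.3},
    {"utterance": "u3", "slots": frozenset({"area"}), "ce_loss": 0.5},
]
print([ex["utterance"] for ex in select_exemplars(pool, m=2)])  # ['u1', 'u3']
```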

To be specific, suppose M exemplars are kept in total. The number of exemplars for task i is then:

    |E_i| = M · |D_i| / ∑_{j=1}^{t} |D_j|,  ∀i ∈ {1, ..., t},    (6)

where we choose M = 250 or 500 in the experiments.

3.3.3 Constraint with Adaptive Elastic Weight Consolidation

Although exemplars of previous tasks are stored and replayed, the number of exemplars should be reasonably small (M ≪ |D_{1:t}|) to reduce memory overhead. As a consequence, the constraint we have imposed to prevent the NLG model from catastrophically forgetting previous utterance patterns is not strong enough. To enforce a stronger constraint, we propose a regularization method based on the well-known Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) technique.

Elastic Weight Consolidation (EWC). EWC utilizes a quadratic term to elastically regularize the parameters important for previous tasks. The loss function using the EWC regularization together with exemplar replay for task t can be written as:

    L_ER_EWC(θ_t) = L_ER(θ_t) + λ ∑_{i=1}^{N} F_i (θ_{t,i} − θ_{t−1,i})²,    (7)

where N is the number of model parameters; θ_{t−1,i} is the i-th converged parameter of the model trained till the previous task; and F_i = ∇² L_CE(θ_{t−1,i}) is the i-th diagonal element of the Fisher Information Matrix approximated w.r.t. the set of previous exemplars E_{1:t−1}. F_i measures the importance of θ_{t−1,i} to the previous tasks represented by E_{1:t−1}. Typical usages of EWC compute F_i w.r.t. a uniformly sampled subset of the historical data; instead, we propose to compute F_i w.r.t. the carefully chosen E_{1:t−1}, so that not all historical data need to be stored. The scalar λ controls the contribution of the quadratic regularization term. The idea is to elastically penalize changes to parameters that are important (have large F_i) for previous tasks, while more plasticity is assigned to parameters with small F_i.
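The quadratic penalty of Eq. (7) can be sketched as follows. This is a minimal numpy illustration under the common squared-gradient (empirical Fisher) approximation of the diagonal Fisher; the helper names and the stub gradients are our assumptions, not the paper's implementation:

```python
import numpy as np

def fisher_from_exemplars(grad_log_likelihoods):
    """Diagonal Fisher estimate over the stored exemplars E_{1:t-1}:
    the mean squared gradient of the log-likelihood (per-exemplar
    gradients are assumed to be given)."""
    grads = np.stack(grad_log_likelihoods)
    return np.mean(grads ** 2, axis=0)

def ewc_penalty(theta, theta_prev, fisher_diag, lam):
    """The quadratic EWC term in Eq. (7): parameters with large F_i are
    strongly anchored to their previous values theta_prev."""
    return lam * np.sum(fisher_diag * (theta - theta_prev) ** 2)

# Toy usage with a 3-parameter "model" and two exemplar gradients:
theta_prev = np.array([0.5, -1.0, 2.0])
theta = np.array([0.6, -1.1, 2.0])
F = fisher_from_exemplars([np.array([1.0, 0.1, 0.0]),
                           np.array([0.8, 0.2, 0.0])])
print(ewc_penalty(theta, theta_prev, F, lam=10.0))
```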
Adaptive regularization. In practice, new tasks differ in difficulty and in similarity to previous tasks, so the degree to which previous knowledge must be preserved varies. To this end, we propose an adaptive weight λ for the EWC regularization term:

    λ = λ_base · sqrt(V_{1:t−1} / V_t),    (8)

where V_{1:t−1} is the size of the old word vocabulary of previous tasks, V_t is the size of the new word vocabulary of the current task t, and λ_base is a hyper-parameter. In general, λ increases with the ratio of the old vocabulary size to the new one. In other words, the regularization term becomes more important when the new task contains less new vocabulary to learn.

Algorithm 2 learn_task: procedure of ARPER to continually learn task t
 1: procedure learn_task(D_t, E_{1:t−1}, f_{θ_{t−1}}, M)
 2:   θ_t ← θ_{t−1}
 3:   while θ_t not converged do
 4:     θ_t ← update(L_ER_EWC(θ_t))
 5:   m ← M · |D_t| / ∑_{j=1}^{t} |D_j|
 6:   E_t ← select_exemplars(D_t, f_{θ_t}, m)
 7:   for j = 1 to t−1 do
 8:     E_j ← E_j.top(M · |D_j| / ∑_{j=1}^{t} |D_j|)
 9:   return f_{θ_t}, E_t

Algorithm 2 summarizes the continual learning procedure of ARPER for task t. θ_t is initialized with θ_{t−1} and trained with prioritized exemplar replay and adaptive EWC as in Eq. (7). After training θ_t, the exemplars E_t of task t are computed by Algorithm 1, and the exemplars of previous tasks E_{1:t−1} are reduced by keeping the most prioritized ones, preserving the total exemplar size.
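Putting the pieces together, a structural sketch of Algorithm 2 might look as follows. Here `train_step` and `select_exemplars` are hypothetical stand-ins for the optimization of Eq. (7) and for Algorithm 1, and the bookkeeping of past data sizes is our own assumption for illustration:

```python
import math

def adaptive_lambda(lambda_base, old_vocab_size, new_vocab_size):
    """Adaptive EWC weight from Eq. (8): the penalty grows when the new
    task introduces relatively little new vocabulary."""
    return lambda_base * math.sqrt(old_vocab_size / new_vocab_size)

def exemplar_budgets(data_sizes, M):
    """Eq. (6): per-task exemplar counts proportional to training-set
    size, under a fixed total budget M."""
    total = sum(data_sizes)
    return [round(M * n / total) for n in data_sizes]

def learn_task(model, new_data, exemplars, data_sizes, M,
               train_step, select_exemplars):
    """Structural sketch of Algorithm 2 for one task t."""
    model = train_step(model, new_data, exemplars)        # lines 2-4: optimize Eq. (7)
    data_sizes = data_sizes + [len(new_data)]             # remember |D_t|
    budgets = exemplar_budgets(data_sizes, M)             # Eq. (6), line 5
    new_exemplars = select_exemplars(new_data, model, budgets[-1])  # line 6
    # lines 7-8: shrink the previous priority lists to their new budgets
    exemplars = [E[:b] for E, b in zip(exemplars, budgets[:-1])]
    return model, exemplars + [new_exemplars], data_sizes
```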

4 Experiments

4.1 Dataset

We use the MultiWoZ-2.0 dataset² (Budzianowski et al., 2018), which contains six domains (Attraction, Hotel, Restaurant, Booking, Taxi, and Train) and seven DA intents ("Inform", "Request", "Select", "Recommend", "Book", "Offer-Booked", "No-Offer"). The original train/validation/test splits are used. For methods using exemplars, both the training and the validation set are continually expanded with exemplars extracted from previous tasks.

To support experiments on continually learning new domains, we pre-processed the original dataset by segmenting multi-domain utterances into single-domain ones. For instance, the utterance "The ADC Theatre is located on Park Street. Before I find your train, could you tell me where you would like to go?" is split into two utterances with the domains "Attraction" and "Train", respectively. If multiple sentences of the same domain exist in the original utterance, they are kept in one utterance after pre-processing. In each continual learning task, all training data of one domain are used to train the NLG model, as illustrated in Figure 1. Similar pre-processing is done at the granularity of DA intents for the experiments in Section 4.6. The statistics of the pre-processed MultiWoZ-2.0 dataset are illustrated in Figure 2. The resulting datasets and the pre-processing scripts are open-sourced.

[Figure 2: Venn diagram visualizing the intents in different domains. The number of utterances of each domain (bold) and intent (italic) is indicated in parentheses.]

² Extracted for NLG at https://github.com/andy194673/nlg-sclstm-multiwoz

4.2 Evaluation Metrics

Following previous studies, we use the slot error rate (SER) and the BLEU-4 score (Papineni et al., 2002) as evaluation metrics. SER is the ratio of the number of missing and redundant slots in a generated utterance to the total number of ground-truth slots in the DA.

To better evaluate continual learning ability, we use two additional commonly used metrics (Kemker et al., 2018) for both SER and BLEU-4:

    Ω_all = (1/T) ∑_{i=1}^{T} Ω_{all,i},    Ω_first = (1/T) ∑_{i=1}^{T} Ω_{first,i},

where T is the total number of continual learning tasks, Ω_{all,i} is the test performance on all tasks after the i-th task has been learned, and Ω_{first,i} is the test performance on the first task after the i-th task has been learned. Since Ω can be either SER or BLEU-4, both Ω_all and Ω_first have two versions. Ω_all evaluates the overall performance, while Ω_first evaluates the ability to alleviate catastrophic forgetting.

4.3 Baselines

Two methods without exemplars are as follows:

• Finetune: At each task, the NLG model is initialized with the model obtained till the last task and then fine-tuned on the data of the current task.

• Full: At each task, the NLG model is trained with the data from the current and all historical tasks. This provides the "upper bound" performance for continual learning w.r.t. Ω_all.

Several exemplar replay (ER) methods trained with Eq. (4) using different exemplar selection schemes are compared:

• ERherding (Welling, 2009; Rebuffi et al., 2017): This scheme chooses the exemplars that best approximate the mean DA vector over all training examples of the task.

• ERrandom: This scheme selects exemplars at random. Despite its simplicity, the distribution of the selected exemplars equals the distribution of the current task in expectation.

• ERprio: The proposed prioritized scheme (c.f. Algorithm 1) that selects representative and diverse exemplars.

Based on ERprio, four regularization methods (including ours) that further alleviate catastrophic forgetting are compared:

• L2: A static L2 regularization obtained by setting F_i = 1 in Eq. (7). It regularizes all parameters equally.

• KD (Rebuffi et al., 2017; Wu et al., 2019; Hou et al., 2019): The widely used knowledge distillation (KD) loss (Hinton et al., 2015) is adopted by distilling the prediction logits of the current model w.r.t. those of the model trained till the last task. More implementation details are included in Appendix A.1.

• Dropout (Mirzadeh et al., 2020): Dropout (Hinton et al., 2012) was recently shown by Mirzadeh et al. (2020) to effectively alleviate catastrophic forgetting. We tuned different dropout rates assigned to the non-recurrent connections.

• ARPER (c.f. Algorithm 2): The proposed method using adaptive EWC with ERprio.

Table 1: Average performance of continually learning 6 domains using 250/500 exemplars. The best performance excluding Full is in bold in each column; * indicates p < 0.05 and ** indicates p < 0.01 for a one-tailed t-test comparing ARPER to the three top-performing competitors except Full.

                    250 exemplars in total              500 exemplars in total
                    Ωall            Ωfirst              Ωall            Ωfirst
                    SER%   BLEU-4   SER%    BLEU-4      SER%   BLEU-4   SER%    BLEU-4
Finetune            64.46  0.361    107.27  0.253       64.46  0.361    107.27  0.253
ERherding           16.89  0.535    9.89    0.532       12.25  0.555    4.53    0.568
ERrandom            10.93  0.552    6.96    0.553       8.36   0.569    4.41    0.572
ERprio              9.67   0.578    5.28    0.578       7.48   0.597    3.59    0.620
ERprio+L2           14.94  0.579    5.31    0.587       10.51  0.596    4.28    0.605
ERprio+KD           8.65   0.586    6.87    0.601       7.37   0.596    4.89    0.617
ERprio+Dropout      7.15   0.588    5.53    0.594       6.09   0.595    4.51    0.616
ARPER               5.22   0.590    2.99    0.624       5.12   0.598    2.81    0.627
Full                4.26   0.599    3.60    0.616       4.26   0.599    3.60    0.616
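To make the evaluation protocol of Section 4.2 concrete, the sketch below computes Ω_all and Ω_first from a lower-triangular matrix of per-task test scores. Treating Ω_{all,i} as the mean over the tasks seen so far is our simplifying assumption for illustration; the paper reports test performance on all seen tasks:

```python
def continual_metrics(results):
    """Omega_all and Omega_first from Section 4.2.

    `results[i][j]` is the test score (SER or BLEU-4) on task j after
    training on task i has finished, for 0 <= j <= i < T."""
    T = len(results)
    # Performance on all tasks seen so far, averaged over the T steps.
    omega_all = sum(sum(row[: i + 1]) / (i + 1)
                    for i, row in enumerate(results)) / T
    # Performance on the first task, averaged over the T steps.
    omega_first = sum(row[0] for row in results) / T
    return omega_all, omega_first

# Toy usage with T = 3 tasks (scores could be SER in %):
results = [
    [5.0],               # after task 1
    [7.0, 6.0],          # after task 2
    [9.0, 8.0, 6.5],     # after task 3
]
print(continual_metrics(results))
```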

We utilized the well-recognized semantically conditioned LSTM (SCLSTM; Wen et al., 2015c) as the base NLG model f_θ, with one hidden layer of size 128.³ Dropout is not used by default; it is evaluated as a separate regularization technique (c.f. ERprio+Dropout). For all the above methods, the learning rate of Adam is set to 5e-3, the batch size is set to 128, and the maximum number of epochs used to train each task is set to 100. Early stopping is adopted to avoid over-fitting when the validation loss does not decrease for 10 consecutive epochs. To fairly compare different methods, they are trained with the identical configuration on the first task so as to have a consistent starting point. Hyper-parameters of the different methods are included in Appendix A.1.

³ Comparisons based on other base NLG models are included in Section 4.9.

4.4 Diagnose Forgetting in NLG

Before proceeding to our main results, we first diagnose whether the catastrophic forgetting issue exists when training an NLG model continually. As an example, a model pre-trained on the "Attraction" domain is continually trained on the "Train" domain. We present the test performance on "Attraction" at different epochs in Figure 3, using 250 exemplars. We can observe that: (1) catastrophic forgetting indeed exists, as indicated by the sharp performance drop of Finetune; (2) replaying carefully chosen exemplars helps to alleviate catastrophic forgetting by a large degree, and ERprio does a better job than ERrandom; and (3) ARPER greatly mitigates catastrophic forgetting by achieving similar or even better performance compared to Full.

[Figure 3: Diagnosing the catastrophic forgetting issue in NLG. SER (left) and BLEU-4 (right) on the test data of "Attraction" at different epochs when a model pre-trained on the "Attraction" domain is continually trained on another domain, "Train". Curves shown for Full, ARPER, ERprio, ERrandom, and Finetune.]

4.5 Continual Learning New Domains

In this experiment, the data from six domains are presented sequentially. We test 6 runs with different domain order permutations.⁴ Each domain is selected as the first task exactly once, and the remaining five domains are randomly ordered. Results averaged over the 6 runs using 250 and 500 total exemplars are presented in Table 1. Several interesting observations can be noted:

• All methods except Finetune perform worse on all seen tasks (Ω_all) than on the first task (Ω_first). This is due to the diverse knowledge among different tasks, which increases the difficulty of handling all the tasks. Finetune performs poorly in both metrics because of the detrimental catastrophic forgetting issue.

• Replaying exemplars helps to alleviate the catastrophic forgetting issue: the three ER methods substantially outperform Finetune. Moreover, the proposed prioritized exemplar selection scheme is effective, indicated by the superior performance of ERprio over ERherding and ERrandom.

⁴ Exact domain orders are provided in Appendix A.2.

[Figure 4: SER on all seen domains (solid) and on the first domain (dashed) as more domains are continually learned using 250 exemplars, comparing ERprio, ERprio+Dropout, ARPER, and Full.]

Table 2: Performance of continually learning 7 DA intents using 250 exemplars. The best performance excluding Full is in bold.

                  Ωall              Ωfirst
Model             SER%    BLEU-4    SER%    BLEU-4
Finetune          49.94   0.382     44.00   0.375
ERherding         13.96   0.542     8.50    0.545
ERrandom          8.58    0.626     5.53    0.618
ERprio            8.21    0.684     5.20    0.669
ERprio+L2         6.87    0.693     4.92    0.661
ERprio+KD         10.59   0.664     10.87   0.649
ERprio+Dropout    6.32    0.689     5.55    0.658
ARPER             3.63    0.701     3.52    0.685
Full              3.08    0.694     2.98    0.672

• ARPER significantly outperforms the three ER methods and the other regularization-based baselines. Compared to the three closest competitors, ARPER is significantly better with p-value < 0.05 w.r.t. SER.

• The improvement margin of ARPER is significant w.r.t. SER, which is critical for measuring an output's fidelity to a given dialog act. The different methods demonstrate similar performance w.r.t. BLEU-4, where several of them approach Full and are thus very close to the upper-bound performance.

• ARPER achieves performance comparable to the upper bound (Full) on all seen tasks (Ω_all), even with a very limited number of exemplars. Moreover, it outperforms Full on the first task (Ω_first), indicating that ARPER better mitigates forgetting of the first task than Full, the latter still being interfered by data from later domains.

Dynamic Results in Continual Learning. In Figure 4, several representative methods are compared as more domains are continually learned. With more tasks continually learned, ARPER performs consistently better than the other methods on all seen tasks (solid lines), and it is comparable to Full. On the first task (dashed lines), ARPER outperforms all the methods, including Full, at every continual learning step. These results illustrate the advantage of ARPER throughout the entire continual learning process.

4.6 Continual Learning New DA Intents

It is also essential for a task-oriented dialog system to continually learn new functionalities, namely, supporting new DA intents. To test this ability, the data of the seven DA intents are presented sequentially in the order of decreasing data size, i.e., "Inform, Request, Book, Recommend, Offer-Booked, No-Offer, Select". Results using 250 exemplars are presented in Table 2. We can observe that ARPER still largely outperforms the other methods, and observations similar to those made before for ARPER hold. Therefore, we conclude that ARPER is able to learn new functionalities continually.

Compared to the previous experiments, the performance of ERprio+KD degrades, while the performance of ERprio+L2 improves due to the very large data size of the first task ("Inform"); this means that both are sensitive to task orders.

Table 3: Ablation study for ARPER. ER / PE / AR stands for the Exemplar Replay loss / Prioritized Exemplars / Adaptive Regularization, respectively.

          Ωall              Ωfirst
          SER%    BLEU-4    SER%    BLEU-4
ARPER     4.82    0.592     3.88    0.569
w/o ER    6.41    0.584     5.85    0.559
w/o PE    5.53    0.587     5.85    0.562
w/o AR    5.57    0.587     4.57    0.563

4.7 Ablation Study

In Table 3, we compare several simplified versions of ARPER to understand the effects of its different components. Comparisons are based on continually learning 6 domains starting with "Attraction". We can observe that: (1) L_ER is beneficial, because dropping it ("w/o ER") degrades the performance of ARPER; (2) using prioritized exemplars is advantageous, because using random exemplars ("w/o PE") for ARPER impairs its performance; (3) adaptive regularization is also effective, indicated by the superior performance of ARPER compared to using fixed regularization weights ("w/o AR").


Table 4: Sample utterances generated for the first domain (“Attraction”) after the NLG is continually trained on all 6 domains using 250 exemplars. Redundant and missing slots are colored in orange and blue respectively. Obvious grammar mistakes (redundant repetitions) are colored in purple.

SCVAE GPT-2 2, we used the pre-trained model with 12 layers and

Ωall Ωfirst Ωall Ωfirst 117M parameters. As in Peng et al.(2020), exact Finetune 60.83 98.86 28.69 31.76 slot values are not replaced by special placeholders ERherding 17.95 11.48 11.95 10.48 during training as in SCLSTM and SCVAE. The ERrandom 9.31 7.52 9.87 8.85 dialog act is concatenated with the corresponding ERprio 8.92 6.16 8.72 8.20 utterance before feeding into GPT-2. More details ER +L2 12.47 6.67 10.51 9.20 prio are included in Appendix A.1. ERprio+KD 6.32 6.09 8.41 8.09 ERprio+Dropout 8.01 8.77 7.60 7.72 Results of using 250 exemplars to continually ARPER 4.45 4.04 5.32 5.05 learn 6 domains starting with “Attraction” are pre- Full 3.99 4.03 4.75 4.53 sented in Table5. Thanks to the large-scale pre- trained language model, GPT-2 suffers less from Table 5: SER in % of using SCVAE and GPT-2 as fθ. Best Performance excluding “Full” are in bold. the catastrophic forgetting issue because of the bet- ter performance of Finetune. In general, the rela- tive performance patterns of different methods are 4.8 Case Study similar to that we observed in Section 4.5 and 4.6. Table4 shows two examples generated by ARPER Therefore, we can claim that the superior perfor- and the closest competitor (ERprio+Dropout) on the mance of ARPER can generalize to different base first domain (“Attraction”) after the NLG model is NLG models. continually trained on all 6 domains starting with “Attraction”. In both examples, ERprio+Dropout 5 Conclusion fails to generate slot “Fee” or “Type”, instead, it mistakenly generates slots belonging to later do- In this paper, we study the practical continual learn- mains (“Hotel” or “Restaurant”) with several ob- ing setting of language generation in task-oriented vious redundant repetitions colored in purple. It dialog systems. To alleviate catastrophic forget- means that the NLG model is interfered by ut- ting, we present ARPER which replays representa- terance patterns in later domains, and it forgets tive and diverse exemplars selected in a prioritized some old patterns it has learned before. In contrast, manner, and employs an adaptive regularization ARPER succeeds in both cases without forgetting term based on EWC (Elastic Weight Consolida- previously learned patterns. tion). Extensive experiments on MultiWoZ-2.0 in different continual learning scenarios reveal the 4.9 Results using Other NLG Models superior performance of ARPER . The realistic con- In this experiment, we changed the base NLG tinual learning setting and the proposed technique model From SCLSTM to SCVAE (Tseng et al., may inspire further studies towards building more 2018) and GPT-2 (Radford et al., 2019). For GPT- scalable task-oriented dialog systems.

3469 References Computational Linguistics: Human Language Tech- nologies, Volume 1 (Long and Short Papers), pages Rahaf Aljundi, Francesca Babiloni, Mohamed Elho- 4171–4186. seiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) Robert M French. 1999. Catastrophic forgetting in con- to forget. In Proceedings of the European Confer- nectionist networks. Trends in Cognitive Sciences, ence on Computer Vision (ECCV), pages 139–154. pages 128–135.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Schneider, Rachel Fong, Peter Welinder, Bob Mc- Liu, James Glass, and Fuchun Peng. 2019. Mix- Grew, Josh Tobin, OpenAI Pieter Abbeel, and Woj- review: Alleviate forgetting in the pretrain-finetune ciech Zaremba. 2017. Hindsight experience replay. framework for neural language generation models. In Advances in neural information processing sys- arXiv preprint arXiv:1910.07117. tems, pages 5048–5058. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Gaurav Arora, Afshin Rahimi, and Timothy Baldwin. Distilling the knowledge in a neural network. arXiv 2019. Does an lstm forget more than a cnn? an em- preprint arXiv:1503.02531. pirical study of catastrophic forgetting in nlp. In Geoffrey E Hinton, Nitish Srivastava, , Proceedings of the The 17th Annual Workshop of , and Ruslan R Salakhutdinov. 2012. the Australasian Language Technology Association, Improving neural networks by preventing co- pages 77–86. adaptation of feature detectors. arXiv preprint arXiv:1207.0580 Antoine Bordes, Y-Lan Boureau, and Jason Weston. . 2016. Learning end-to-end goal-oriented dialog. Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, arXiv preprint arXiv:1605.07683. and Dahua Lin. 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang IEEE Conference on Computer Vision and Pattern Tseng, Inigo˜ Casanueva, Stefan Ultes, Osman Ra- Recognition, pages 831–839. madan, and Milica Gasic. 2018. Multiwoz-a large- scale multi-domain wizard-of-oz dataset for task- Ronald Kemker, Marc McClure, Angelina Abitino, oriented dialogue modelling. In Proceedings of the Tyler L Hayes, and Christopher Kanan. 2018. Mea- 2018 Conference on Empirical Methods in Natural suring catastrophic forgetting in neural networks. In Language Processing, pages 5016–5026. Thirty-second AAAI conference on artificial intelli- gence. Francisco M Castro, Manuel J Mar´ın-Jimenez,´ Nicolas´ Guil, Cordelia Schmid, and Karteek Alahari. 2018. Diederik P Kingma and Max Welling. 2013. Auto- End-to-end incremental learning. In Proceedings encoding variational bayes. arXiv preprint of the European Conference on Computer Vision arXiv:1312.6114. (ECCV), pages 233–248. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Arslan Chaudhry, Marcus Rohrbach, Mohamed Elho- Joel Veness, Guillaume Desjardins, Andrei A Rusu, seiny, Thalaiyasingam Ajanthan, Puneet K Doka- Kieran Milan, John Quan, Tiago Ramalho, Ag- nia, Philip HS Torr, and Marc’Aurelio Ranzato. nieszka Grabska-Barwinska, et al. 2017. Over- 2019. Continual learning with tiny episodic mem- coming catastrophic forgetting in neural networks. ories. arXiv preprint arXiv:1902.10486. Proceedings of the national academy of sciences, 114(13):3521–3526. Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and learn: Sungjin Lee. 2017. Toward continual learn- Fine-tuning deep pretrained language models with ing for conversational agents. arXiv preprint less forgetting. arXiv preprint arXiv:2004.12651. arXiv:1712.09943. Yuanpeng Li, Liang Zhao, Kenneth Church, and Mo- Alexandra Chronopoulou, Christos Baziotis, and hamed Elhoseiny. 2020. Compositional continual Alexandros Potamianos. 2019. An embarrassingly language learning. simple approach for transfer learning from pre- trained language models. In Proceedings of the Zhizhong Li and Derek Hoiem. 2017. 
Learning with- 2019 Conference of the North American Chapter of out forgetting. IEEE transactions on pattern analy- the Association for Computational Linguistics: Hu- sis and machine intelligence, 40(12):2935–2947. man Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095. Tianlin Liu, Lyle Ungar, and Joao˜ Sedoc. 2019. Contin- ual learning for sentence representations using con- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and ceptors. In Proceedings of the 2019 Conference of Kristina Toutanova. 2019. Bert: Pre-training of the North American Chapter of the Association for deep bidirectional transformers for language under- Computational Linguistics: Human Language Tech- standing. In Proceedings of the 2019 Conference of nologies, Volume 1 (Long and Short Papers), pages the North American Chapter of the Association for 3274–3279.

3470 Davide Maltoni and Vincenzo Lomonaco. 2019. Con- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, tinuous learning in single-incremental-task scenar- Dario Amodei, and Ilya Sutskever. 2019. Language ios. Neural Networks, 116:56–73. models are unsupervised multitask learners.

Michael McCloskey and Neal J Cohen. 1989. Catas- Tiago Ramalho and Marta Garnelo. 2019. Adaptive trophic interference in connectionist networks: The posterior learning: few-shot learning with a surprise- sequential learning problem. In Psychology of learn- based memory module. In International Conference ing and motivation, volume 24, pages 109–165. El- on Learning Representations (ICLR)). sevier. Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. Georg Sperl, and Christoph H Lampert. 2017. 2019. Meta-learning for low-resource natural lan- icarl: Incremental classifier and representation guage generation in task-oriented dialogue systems. learning. In Proceedings of the IEEE Conference In Proceedings of the 28th International Joint Con- on Computer Vision and , pages ference on Artificial Intelligence, pages 3151–3157. 2001–2010. AAAI Press.

Fei Mi, Lingjing Kong, Tao Lin, Kaicheng Yu, and Matthew Riemer, Tim Klinger, Djallel Bouneffouf, and Boi Faltings. 2020a. Generalized class incremental Michele Franceschini. 2019. Scalable recollections learning. In Proceedings of the IEEE/CVF Confer- for continual lifelong learning. In AAAI, volume 33, ence on Computer Vision and Pattern Recognition pages 1352–1359. Workshops, pages 240–241. Andrei A Rusu, Neil C Rabinowitz, Guillaume Des- Fei Mi, Xiaoyu Lin, and Boi Faltings. 2020b. Ader: jardins, Hubert Soyer, James Kirkpatrick, Koray Adaptively distilled exemplar replay towards contin- Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. ual learning for session-based recommendation. In 2016. Progressive neural networks. arXiv preprint Fourteenth ACM Conference on Recommender Sys- arXiv:1606.04671. tems, pages 408–413. Danielle Saunders, Felix Stahlberg, Adria` de Gispert, Seyed-Iman Mirzadeh, Mehrdad Farajtabar, and Has- and Bill Byrne. 2019. Domain adaptive inference san Ghasemzadeh. 2020. Dropout as an implicit for neural machine translation. In Proceedings of gating mechanism for continual learning. arXiv the 57th Annual Meeting of the Association for Com- preprint arXiv:2004.11545. putational Linguistics, pages 222–228.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Tom Schaul, John Quan, Ioannis Antonoglou, and Lu Zhang, and Zhi Jin. 2016. How transferable are David Silver. 2016. Prioritized experience replay. neural networks in nlp applications? In Proceed- ings of the 2016 Conference on Empirical Methods Yilin Shen, Xiangyu Zeng, and Hongxia Jin. 2019. in Natural Language Processing, pages 479–489. A progressive model to enable continual learning for semantic slot filling. In Proceedings of the Kishore Papineni, Salim Roukos, Todd Ward, and Wei- 2019 Conference on Empirical Methods in Natu- Jing Zhu. 2002. Bleu: a method for automatic eval- ral Language Processing and the 9th International uation of machine translation. In Proceedings of the Joint Conference on Natural Language Processing 40th Annual Meeting on Association for Computa- (EMNLP-IJCNLP), pages 1279–1284. tional Linguistics, pages 311–318.

Baolin Peng, Chenguang Zhu, Chunyuan Li, Xi- Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon ujun Li, Jinchao Li, Michael Zeng, and Jian- Kim. 2017. Continual learning with deep generative feng Gao. 2020. Few-shot natural language gen- replay. In Advances in Neural Information Process- eration for task-oriented dialog. arXiv preprint ing Systems, pages 2990–2999. arXiv:2002.12328. Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Kevin Duh, and Philipp Koehn. 2019. Overcoming Gardner, Christopher Clark, Kenton Lee, and Luke catastrophic forgetting during domain adaptation of Zettlemoyer. 2018. Deep contextualized word repre- neural machine translation. In Proceedings of the sentations. In Proceedings of the 2018 Conference 2019 Conference of the North American Chapter of of the North American Chapter of the Association the Association for Computational Linguistics: Hu- for Computational Linguistics: Human Language man Language Technologies, Volume 1 (Long and Technologies, Volume 1 (Long Papers), pages 2227– Short Papers), pages 2062–2068. 2237. Van-Khanh Tran and Le-Minh Nguyen. 2017. Natural Kun Qian and Zhou Yu. 2019. Domain adaptive dialog language generation for spoken dialogue system us- generation via meta learning. In Proceedings of the ing rnn encoder-decoder networks. In Proceedings 57th Annual Meeting of the Association for Compu- of the 21st Conference on Computational Natural tational Linguistics, pages 2639–2649. Language Learning, pages 442–451.

3471 Van-Khanh Tran and Le-Minh Nguyen. 2018a. Adver- Dani Yogatama, Cyprien de Masson d’Autume, Jerome sarial domain adaptation for variational neural lan- Connor, Tomas Kocisky, Mike Chrzanowski, Ling- guage generation in dialogue systems. In Proceed- peng Kong, Angeliki Lazaridou, Wang Ling, Lei ings of the 27th International Conference on Com- Yu, Chris Dyer, et al. 2019. Learning and evaluat- putational Linguistics, pages 1205–1217. ing general linguistic intelligence. arXiv preprint arXiv:1901.11373. Van-Khanh Tran and Le-Minh Nguyen. 2018b. Dual latent variable model for low-resource natural lan- Friedemann Zenke, Ben Poole, and Surya Ganguli. guage generation in dialogue systems. In Proceed- 2017. Continual learning through synaptic intelli- ings of the 22nd Conference on Computational Nat- gence. In Proceedings of the 34th International ural Language Learning, pages 21–30. Conference on , pages 3987–3995. JMLR. org. Bo-Hsiang Tseng, Florian Kreyssig, Paweł Budzianowski, Inigo˜ Casanueva, Yen-chen Wu, Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, Stefan Ultes, and Milica Gasic. 2018. Variational and Shutao Xia. 2019. Maintaining discrimination cross-domain natural language generation for and fairness in class incremental learning. arXiv spoken dialogue systems. In 19th Annual SIG- preprint arXiv:1911.07053. dial Meeting on Discourse and Dialogue, pages 338–343. Dusan Varis and Ondrejˇ Bojar. 2019. Unsupervised pretraining for neural machine translation using elas- tic weight consolidation. In Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop, pages 130–135. Max Welling. 2009. Herding dynamical weights to learn. In Proceedings of the 26th International Conference on Machine Learning, pages 1121–1128. ACM. Tsung-Hsien Wen, Milica Gasic,ˇ Dongho Kim, Nikola Mrksic,ˇ Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic language generation in di- alogue using recurrent neural networks with convo- lutional sentence reranking. In 16th Annual Meeting of the Special Interest Group on Discourse and Dia- logue, page 275. Tsung-Hsien Wen, Milica Gasic,ˇ Nikola Mrksic,ˇ Lina M Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Toward multi- domain language generation using recurrent neural networks. In NIPS Workshop on Machine Learn- ing for Spoken Language Understanding and Inter- action. Tsung-Hsien Wen, Milica Gasic, Nikola Mrksiˇ c,´ Pei- Hao Su, David Vandyke, and Steve Young. 2015c. Semantically conditioned lstm-based natural lan- guage generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721. Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. 2019. Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition, pages 831–839. Ying Xu, Xu Zhong, Antonio Jose Jimeno Yepes, and Jey Han Lau. 2019. Forget me not: Reducing catas- trophic forgetting for domain adaptation in reading comprehension. arXiv preprint arXiv:1911.00202.

Appendix

A Reproducibility Checklist

A.1 Model Details and Hyper-parameters

We first elaborate the implementation details of the KD baseline compared in our paper. We used the loss term:

    L_KD(Y, d, f_{θ_{t−1}}, f_{θ_t}) = − ∑_{k=1}^{K} ∑_{i=1}^{|L|} p̂_{k,i} · log(p_{k,i}),

where L is the vocabulary that appears in previous tasks but not in task t. At each position k of Y, [p̂_{k,1}, ..., p̂_{k,|L|}] is the predicted distribution over L given by f_{θ_{t−1}}, and [p_{k,1}, ..., p_{k,|L|}] is the distribution given by f_{θ_t}. L_KD penalizes prediction changes on the vocabulary specific to earlier tasks. For all {Y, d} ∈ D_t ∪ E_{1:t−1}, L_KD is linearly interpolated with L_ER as L_ER + η · L_KD, with η tuned as a hyper-parameter.⁵

Hyper-parameters of SCVAE reported in Section 4.9 are set to the defaults of https://github.com/andy194673/nlg-scvae, except that the learning rate is set to 2e-3. For GPT-2, we used the implementation pipeline from https://github.com/pengbaolin/SC-GPT. We pre-processed the dialog act d into the format d′ = [ I ( s_1 = v_1, ..., s_p = v_p ) ], and the corresponding utterance Y is turned into Y′ by adding a special start token [BOS] and an end token [EOS]. d′ and Y′ are concatenated before being fed into GPT-2. The learning rate of the Adam optimizer is set to 5e-5 without weight decay. As GPT-2 converges faster, we train at most 10 epochs for each task, with early stopping applied after 3 consecutive epochs.

Hyper-parameters of the different methods are tuned to maximize SER_all using grid search, and the optimal settings for the various experiments are summarized in Table 6.

⁵ The temperature in (Hinton et al., 2015) is set to 1 due to its minimal impact based on our experiments.

Table 6: Optimal hyper-parameters of the methods experimented with in this paper. The four values in the "Domains" column correspond to using 250 exemplars (Tables 1 and 2) / 500 exemplars (Table 1) / SCVAE as f_θ (Table 5) / GPT-2 as f_θ (Table 5), respectively.

                       Domains                 DA Intents
ERprio (β)             0.5/0.5/0.5/0.5         0.5
L2 (weight on L2)      1e-3/1e-3/1e-3/5e-4     1e-2
KD (weight on L_KD)    2.0/3.0/2.0/0.5         5.0
Dropout (rate)         0.25/0.25/0.25/0.1      0.25
ARPER (λ_base)         300k/350k/200k/30k      100k
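A minimal sketch of the KD loss above, assuming the teacher and student distributions over L have already been extracted per token position (the function name and data layout are our illustrative choices, not the released code):

```python
import math

def kd_loss_old_vocab(teacher_dists, student_dists):
    """KD baseline loss from Appendix A.1.

    Both arguments hold, per token position k, probability distributions
    restricted to L: the words seen in previous tasks but absent from
    task t. The teacher is f_{theta_{t-1}}, the student f_{theta_t}."""
    loss = 0.0
    for p_hat, p in zip(teacher_dists, student_dists):
        loss -= sum(q * math.log(s) for q, s in zip(p_hat, p))
    return loss

# Toy usage: two positions, |L| = 3 old-task words.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(kd_loss_old_vocab(teacher, student))
```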

A.2 Domain Order Permutations

In Table 7, we provide the exact domain order permutations of the 6 runs used in the experiments of Table 1 and Figure 4.

Table 7: Each row corresponds to a domain order permutation. The mapping from domain to id is: {"Attraction": 0, "Booking": 1, "Hotel": 2, "Restaurant": 3, "Taxi": 4, "Train": 5}.

Run 1: 0 5 2 1 3 4
Run 2: 1 4 0 5 3 2
Run 3: 2 0 3 1 4 5
Run 4: 3 2 4 0 1 5
Run 5: 4 2 1 5 0 3
Run 6: 5 3 2 0 1 4

A.3 Computation Resource

All experiments are conducted using a single GPU (GeForce GTX TITAN X). In Table 8, we compare the average training time of one epoch for the different methods. Full incurs more than 200s of extra computation overhead per epoch compared to the other methods, which use bounded exemplars. ARPER takes slightly longer to train than all methods except Full. Nevertheless, considering its superior performance, we contend that ARPER achieves a desirable resource-performance trade-off. In addition, 250 exemplars are less than 1% of the historical data at the last task, and the memory needed to store such a small number of exemplars is trivial.

Table 8: Average training time of one epoch at the last task when continually learning 6 domains starting with "Attraction" using 250 exemplars. Methods other than Finetune and Full are applied on top of ERprio.

Finetune   ERprio   L2      KD      Dropout   ARPER   Full
17.5s      18.5s    19.5s   24.6s   15.5s     39.5s   242.5s

B Supplementary Empirical Results

B.1 Comparison to Pseudo Exemplar Replay

Instead of storing raw samples as exemplars, Shin et al. (2017); Riemer et al. (2019) generate "pseudo" samples akin to past data. The NLG model itself can generate such pseudo exemplars. In this experiment, we replace the 500 raw exemplars of ERrandom, ERprio, and ARPER with pseudo samples generated by the continually trained NLG model, using the dialog acts of the same raw exemplars as input.
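A sketch of how such pseudo exemplars could be produced, with a hypothetical `decode` helper standing in for the model's actual decoding routine:

```python
def generate_pseudo_exemplars(nlg_model, stored_dialog_acts, decode):
    """Pseudo-exemplar variant compared in Section B.1.

    Only the dialog acts of the chosen exemplars are kept; the continually
    trained NLG model re-generates the utterance side. `decode` is a
    hypothetical greedy/beam decoding helper, not an API of this paper."""
    return [(decode(nlg_model, da), da) for da in stored_dialog_acts]

# Usage with a stub model and decoder:
pseudo = generate_pseudo_exemplars(
    nlg_model=None,
    stored_dialog_acts=["Inform(name=La Margherita, food=Italian)"],
    decode=lambda model, da: f"<generated utterance for {da}>",
)
print(pseudo)
```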

3473 0 0 32 0.5 32 0.5 64 64 96 96 128 0.4 128 0.4 160 160 192 192 224 0.3 224 0.3 256 256 288 288 320 0.2 320 0.2 352 352 384 384 416 0.1 416 0.1 448 448 480 480 512 0.0 512 0.0 0 0 16 32 48 64 80 96 16 32 48 64 80 96 112 128 112 128 Attraction->Train Train->Hotel

0 0 32 1.4 32 1.4 64 64 96 1.2 96 1.2 128 128 160 1.0 160 1.0 192 192 224 224 0.8 0.8 256 256 288 288 320 0.6 320 0.6 352 352 384 0.4 384 0.4 416 416 448 0.2 448 0.2 480 480 512 0.0 512 0.0 0 0 16 32 48 64 80 96 16 32 48 64 80 96 112 128 112 128 Attraction->Train Train->Hotel

[Figure 5: A visualization of the change of SCLSTM's hidden-layer weights obtained from two consecutive tasks for ARPER (top) and ERprio+Dropout (bottom). Two sample task transitions (from "Attraction" to "Train", and then from "Train" to "Hotel") are shown. High-temperature areas of ARPER are highlighted by red bounding boxes for better visualization.]

Results comparing pseudo and raw exemplars when continually learning 6 domains starting with "Attraction" are shown in Table 9. We can see that using pseudo exemplars performs better for ERrandom but worse for ERprio and ARPER. That is, pseudo exemplars help when exemplars are chosen randomly, while carefully chosen raw exemplars (c.f. Algorithm 1) are better than pseudo exemplars. Exploring pseudo exemplars for NLG is orthogonal to our work, and we leave it for future work.

Table 9: Comparison with pseudo exemplar replay.

                     Ωall              Ωfirst
                     SER%    BLEU-4    SER%    BLEU-4
ERrandom             9.82    0.495     8.64    0.405
Pseudo-ERrandom      9.26    0.551     6.88    0.519
ERprio               7.84    0.573     6.20    0.523
Pseudo-ERprio        8.87    0.557     6.37    0.521
ARPER                4.43    0.597     3.40    0.574
Pseudo-ARPER         5.07    0.590     3.51    0.570

B.2 Flow of Parameter Updates

To further understand the superior performance of ARPER, we investigated how parameters are updated throughout the continual learning process. Specifically, we compared SCLSTM's hidden-layer weights obtained from consecutive tasks; the pairwise L1 differences for two sample transitions are shown in Figure 5. We can observe that ERprio+Dropout tends to update almost all parameters, while ARPER updates only a small fraction of them. Furthermore, ARPER has different sets of important parameters for distinct tasks, indicated by the different high-temperature areas in the weight-update heat maps, whereas the parameters of ERprio+Dropout are updated nearly uniformly across task transitions. These observations verify that ARPER indeed elastically allocates different network parameters to distinct NLG tasks to mitigate catastrophic forgetting.
