
Continual Learning for Natural Language Generation in Task-oriented Dialog Systems

Fei Mi1*, Liangwei Chen1, Mengjie Zhao2, Minlie Huang3* and Boi Faltings1
1LIA, EPFL, Lausanne, Switzerland   2CIS, LMU, Munich, Germany   3CoAI, DCST, Tsinghua University, Beijing, China
{fei.mi,liangwei.chen,boi.faltings}@epfl.ch, [email protected], [email protected]

Abstract

Natural language generation (NLG) is an essential component of task-oriented dialog systems. Despite the recent success of neural approaches for NLG, they are typically developed in an offline manner for particular domains. To better fit real-life applications where new data come in a stream, we study NLG in a "continual learning" setting to expand its knowledge to new domains or functionalities incrementally. The major challenge towards this goal is catastrophic forgetting, meaning that a continually trained model tends to forget the knowledge it has learned before. To this end, we propose a method called ARPER (Adaptively Regularized Prioritized Exemplar Replay) by replaying prioritized historical exemplars, together with an adaptive regularization technique based on Elastic Weight Consolidation. Extensive experiments to continually learn new domains and intents are conducted on MultiWoZ-2.0 to benchmark ARPER with a wide range of techniques. Empirical results demonstrate that ARPER significantly outperforms other methods by effectively mitigating the detrimental catastrophic forgetting issue.

* Corresponding authors

1 Introduction

As an essential part of task-oriented dialog systems (Wen et al., 2015b; Bordes et al., 2016), the task of Natural Language Generation (NLG) is to produce a natural language utterance containing the desired information given a semantic representation (a so-called dialog act). Existing NLG models (Wen et al., 2015c; Tran and Nguyen, 2017; Tseng et al., 2018) are typically trained offline using annotated data from a single or a fixed set of domains. However, a desirable dialog system in real-life applications often needs to expand its knowledge to new domains and functionalities. Therefore, it is crucial to develop an NLG approach with the capability of continual learning after a dialog system is deployed. Specifically, an NLG model should be able to continually learn new utterance patterns without forgetting the old ones it has already learned.

The major challenge of continual learning lies in catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999): a neural model trained on new data tends to forget the knowledge it has acquired on previous data. We diagnose in Section 4.4 that neural NLG models suffer from such detrimental catastrophic forgetting when continually trained on new domains. A naive solution is to retrain the NLG model using all historical data every time, but this is not scalable due to severe computation and storage overhead.

To this end, we propose storing a small set of representative utterances from previous data, namely exemplars, and replaying them to the NLG model each time it is trained on new data. Methods using exemplars have shown great success in different continual learning (Rebuffi et al., 2017; Castro et al., 2018; Chaudhry et al., 2019) and reinforcement learning (Schaul et al., 2016; Andrychowicz et al., 2017) tasks. In this paper, we propose a prioritized exemplar selection scheme to choose representative and diverse exemplar utterances for NLG. We empirically demonstrate that prioritized exemplar replay alleviates catastrophic forgetting by a large degree.

In practice, the number of exemplars should be reasonably small to maintain a manageable memory footprint, so the resulting constraint against forgetting old utterance patterns is not strong enough on its own. To enforce a stronger constraint, we propose a regularization method based on the well-known Elastic Weight Consolidation technique (EWC; Kirkpatrick et al., 2017). The idea is to use a quadratic term to elastically regularize the parameters that are important for previous data.

Besides its wide application in computer vision, EWC has recently been applied to the domain adaptation task for Neural Machine Translation (Thompson et al., 2019; Saunders et al., 2019). In this paper, we combine EWC with exemplar replay by approximating the Fisher Information Matrix w.r.t. the carefully chosen exemplars, so that not all historical data need to be stored. Furthermore, we propose to adaptively adjust the regularization weight to account for the difference between new and old data, so as to flexibly deal with different new data distributions.

To summarize our contribution: (1) to the best of our knowledge, this is the first attempt to study the practical continual learning configuration for NLG in task-oriented dialog systems; (2) we propose a method called Adaptively Regularized Prioritized Exemplar Replay (ARPER) for this task, and benchmark it against a wide range of state-of-the-art continual learning techniques; (3) extensive experiments are conducted on the MultiWoZ-2.0 (Budzianowski et al., 2018) dataset to continually learn new tasks, including domains and intents, using two base NLG models. Empirical results demonstrate the superior performance of ARPER and its ability to mitigate catastrophic forgetting. Our code is available at https://github.com/MiFei/Continual-Learning-for-NLG

2 Related Work

Continual Learning. The major challenge for continual learning is catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999), where optimization over new data leads to performance degradation on data learned before. Methods designed to mitigate catastrophic forgetting fall into three categories: regularization, exemplar replay, and dynamic architectures. Methods using dynamic architectures (Rusu et al., 2016; Maltoni and Lomonaco, 2019) increase model parameters throughout the continual learning process, which leads to an unfair comparison with other methods. In this work, we focus on the first two categories.

Regularization methods add specific regularization terms to consolidate knowledge learned before. Li and Hoiem (2017) introduced knowledge distillation (Hinton et al., 2015) to penalize model logit change, and it has been widely employed in Rebuffi et al. (2017); Castro et al. (2018); Wu et al. (2019); Hou et al. (2019); Zhao et al. (2019). Another direction is to regularize parameters crucial to old knowledge according to various importance measures (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018).

Exemplar replay methods store past samples, a.k.a. exemplars, and replay them periodically. Instead of selecting exemplars at random, Rebuffi et al. (2017) incorporated the Herding technique (Welling, 2009) to choose the exemplars that best approximate the mean feature vector of a class; this approach is widely used in Castro et al. (2018); Wu et al. (2019); Hou et al. (2019); Zhao et al. (2019); Mi et al. (2020a,b). Ramalho and Garnelo (2019) proposed to store the samples that the model is least confident about. Chaudhry et al. (2019) demonstrated the effectiveness of exemplars for various continual learning tasks in computer vision.

Catastrophic Forgetting in NLP. The catastrophic forgetting issue in NLP tasks has attracted increasing attention recently (Mou et al., 2016; Chronopoulou et al., 2019). Yogatama et al. (2019) and Arora et al. (2019) identified the detrimental catastrophic forgetting issue while fine-tuning ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). To deal with this issue, He et al. (2019) proposed to heavily replay pre-training data during fine-tuning, and Chen et al. (2020) proposed an improved Adam optimizer to recall knowledge captured during pre-training. The catastrophic forgetting issue has also been noticed in domain adaptation setups for neural machine translation (Saunders et al., 2019; Thompson et al., 2019; Varis and Bojar, 2019) and in reading comprehension (Xu et al., 2019).

Lee (2017) first studied the continual learning setting for dialog state tracking in task-oriented dialog systems. However, that setting is still a one-time adaptation process, and the adopted dataset is small. Shen et al. (2019) recently applied progressive networks (Rusu et al., 2016) to the semantic slot filling task from a continual learning perspective similar to ours; however, their method is based on a dynamic architecture, which is beyond the scope of this paper. Liu et al. (2019) proposed a Boolean operation on "conceptor" matrices for continually learning sentence representations using linear encoders. Li et al. (2020) combined continual learning and language systematic compositionality for sequence-to-sequence learning tasks.

Natural Language Generation (NLG). In this paper, we focus on NLG for task-oriented dialog systems. A series of neural methods have been proposed to generate accurate, natural, and diverse utterances, including HLSTM (Wen et al., 2015a), SCLSTM (Wen et al., 2015c), Enc-Dec (Wen et al., 2015b), RALSTM (Tran and Nguyen, 2017), and SC-VAE (Tseng et al., 2018).

Recent works have considered the domain adaptation setting. Tseng et al. (2018) and Tran and Nguyen (2018b) proposed to learn domain-invariant representations using VAEs (Kingma and Welling, 2013); they later designed two domain adaptation critics (Tran and Nguyen, 2018a). Recently, Mi et al. (2019); Qian and Yu (2019); Peng et al. (2020) studied learning new domains with limited training data. However, existing methods only consider a one-time adaptation process. The continual learning setting and the corresponding catastrophic forgetting issue remain to be explored.

[Figure 1: An example of an NLG model continually learning new domains (Attraction → Restaurant → Hotel, trained on Data1, Data2, Data3 in turn). The model needs to perform well on all domains it has seen before; for example, f_θ3 needs to deal with all three previous domains (Attraction, Restaurant, Hotel).]

3 Model

In this section, we first introduce the background of neural NLG models in Section 3.1 and the continual learning formulation in Section 3.2. In Section 3.3, we introduce the proposed method ARPER.

3.1 Background on Neural NLG Models

The NLG component of a task-oriented dialog system produces natural language utterances conditioned on a semantic representation called a dialog act (DA). Specifically, the dialog act d is defined as the combination of an intent I and a set of slot-value pairs S(d) = {(s_i, v_i)}_{i=1}^{p}:

    d = [ I, (s_1, v_1), ..., (s_p, v_p) ],    (1)

where p is the number of slot-value pairs. The intent I controls the utterance functionality, while the slot-value pairs contain the information to express. For example, "There is a restaurant called [La Margherita] that serves [Italian] food." is an utterance corresponding to the DA "[Inform, (name=La Margherita, food=Italian)]".

Neural models have recently shown promising results for NLG tasks. Conditioned on a DA, a neural NLG model generates an utterance containing the desired information word by word. For a DA d with the corresponding ground-truth utterance Y = (y_1, y_2, ..., y_K), the probability of generating Y is factorized as:

    f_θ(Y, d) = ∏_{k=1}^{K} p_{y_k} = ∏_{k=1}^{K} p(y_k | y_{<k}, d),    (2)

where f_θ is the NLG model parameterized by θ, and p_{y_k} is the output probability (i.e., the softmax of the logits) of the ground-truth token y_k at position k. The typical objective function for an utterance Y with DA d is the average cross-entropy loss w.r.t. all tokens in the utterance (Wen et al., 2015c,b; Tran and Nguyen, 2017; Peng et al., 2020):

    L_CE(Y, d, θ) = −(1/K) ∑_{k=1}^{K} log(p_{y_k})    (3)

3.2 Continual Learning of NLG

In practice, an NLG model needs to continually learn new domains or functionalities. Without loss of generality, we assume that new data arrive task by task (Rebuffi et al., 2017; Kirkpatrick et al., 2017). In a new task t, new data D_t are used to train the NLG model f_{θ_{t−1}} obtained till the last task. The updated model f_{θ_t} needs to perform well on all tasks so far. An example setting of continually learning new domains is illustrated in Figure 1. A task can be defined with different modalities to reflect diverse real-life applications; in subsequent experiments, we consider continually learning new domains and new intents in Eq. (1).

We emphasize that the setting of continual learning is different from that of domain adaptation. The latter is a one-time adaptation process, and the focus is to optimize performance on a target domain transferred from source domains, without considering a potential performance drop on the source domains (Mi et al., 2019; Qian and Yu, 2019; Peng et al., 2020). In contrast, continual learning requires an NLG model to continually learn new tasks over multiple transfers, and the goal is to make the model perform well on all tasks learned so far.
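To make the objective concrete, here is a minimal Python sketch (our own illustration, not the authors' code; the dictionary layout and function name are assumptions) that represents a dialog act as in Eq. (1) and computes the average cross-entropy of Eq. (3) from the per-token probabilities a model assigns to a reference utterance:

```python
import math

# A dialog act (Eq. 1): an intent plus slot-value pairs.
dialog_act = {
    "intent": "Inform",
    "slots": [("name", "La Margherita"), ("food", "Italian")],
}

def average_cross_entropy(token_probs):
    """L_CE from Eq. (3): the average negative log-probability assigned to
    the ground-truth tokens y_1..y_K, where each probability is assumed to
    be conditioned on the dialog act and the token prefix as in Eq. (2)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy per-token probabilities a trained model might assign to the
# reference utterance for the dialog act above.
token_probs = [0.9, 0.8, 0.95, 0.7, 0.85]
print(dialog_act["intent"], average_cross_entropy(token_probs))
```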

3.3 ARPER

We introduce the proposed method (ARPER) with prioritized exemplar replay and an adaptive regularization technique to further alleviate the catastrophic forgetting issue.

3.3.1 Prioritized Exemplar Replay

To prevent the NLG model from catastrophically forgetting utterance patterns of earlier tasks, a small subset of each task's utterances is selected as exemplars, and the exemplars of previous tasks are replayed in later tasks. When training the NLG model f_{θ_t} for task t, the set of exemplars from previous tasks, denoted as E_{1:t−1} = {E_1, ..., E_{t−1}}, is replayed by joining it with the data D_t of the current task. The training objective with exemplar replay can therefore be written as:

    L_ER(θ_t) = ∑_{{Y,d} ∈ D_t ∪ E_{1:t−1}} L_CE(Y, d, θ_t).    (4)
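As an illustration of Eq. (4), the sketch below forms the replay training pool D_t ∪ E_{1:t−1} and sums the per-example losses. This is a hedged sketch under assumed data layouts (lists of (utterance, dialog act) pairs and a caller-supplied loss function), not the released implementation:

```python
import itertools

def exemplar_replay_loss(new_data, exemplars_by_task, ce_loss):
    """L_ER from Eq. (4): summed cross-entropy over D_t ∪ E_{1:t-1}.

    `new_data` is a list of (utterance, dialog_act) pairs for task t,
    `exemplars_by_task` maps previous task ids to their exemplar lists, and
    `ce_loss` is a callable computing L_CE for one pair (Eq. 3)."""
    replayed = list(itertools.chain.from_iterable(exemplars_by_task.values()))
    pool = new_data + replayed  # exemplars are simply joined with the new data
    return sum(ce_loss(utt, da) for utt, da in pool)

# Toy usage with a dummy per-example loss:
loss = exemplar_replay_loss(
    new_data=[("utt1", "da1"), ("utt2", "da2")],
    exemplars_by_task={0: [("old_utt", "old_da")]},
    ce_loss=lambda utt, da: 1.0,
)
print(loss)  # 3.0
```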

The set of exemplars of task t, referred to as E_t, is selected after f_{θ_t} has been trained, and it will be replayed in later tasks. The quality of exemplars is crucial for preserving the performance on previous tasks. We propose a prioritized exemplar selection method to select representative and diverse utterances as follows.

Representative utterances. The first criterion is that the exemplars E_t of a task t should be representative of D_t. We propose to select E_t as a priority list from D_t that minimizes the priority score:

    U(Y, d) = L_CE(Y, d, θ_t) · |S(d)|^β,    (5)

where S(d) is the set of slots in Y and β is a hyper-parameter. This formula correlates the representativeness of an utterance with its L_CE. Intuitively, the NLG model f_{θ_t} trained on D_t should be confident on representative utterances of D_t, i.e., assign them a low L_CE. However, L_CE is agnostic to the number of slots: we found that an utterance with many common slots in a task could also have a very low L_CE, yet using such utterances as exemplars may lead to overfitting and thus to forgetting of previously acquired general knowledge. The second term, |S(d)|^β, controls how strongly the number of slots in an utterance affects its priority as an exemplar. We empirically found in Appendix A.1 that the best β is greater than 0.

Diverse utterances. The second criterion is that exemplars should cover diverse slots of the task, rather than being similar or repetitive. A drawback of the above priority score is that similar or duplicated utterances containing the same set of frequent slots could be prioritized over utterances with a diverse set of slots. To encourage diversity of the selected exemplars, we propose an iterative approach that adds data from D_t to the priority list E_t based on the above priority score. At each iteration, if the set of slots of the current utterance is already covered by utterances in E_t, we skip it and move on to the data with the next best priority score.

Algorithm 1 shows the procedure to select m exemplars as a priority list E_t from D_t. The outer loop allows multiple passes through D_t, so that several utterances for the same set of slots S(d) can eventually be selected.

Algorithm 1 select_exemplars: prioritized exemplar selection procedure of ARPER for task t
 1: procedure select_exemplars(D_t, f_{θ_t}, m)
 2:   E_t ← new PriorityList()
 3:   D_t ← sort(D_t, key = U, order = asc)
 4:   while |E_t| < m do
 5:     S_seen ← new Set()
 6:     for {Y, d} ∈ D_t do
 7:       if S(d) ∈ S_seen then continue
 8:       else
 9:         D_t.remove({Y, d})
10:         E_t.insert({Y, d})
11:         S_seen.insert(S(d))
12:       if |E_t| == m then
13:         return E_t

3.3.2 Reducing Exemplars of Previous Tasks

Algorithm 1 requires the number of exemplars to be given. A straightforward choice is to store the same, fixed number of exemplars for each task, as in Castro et al. (2018); Wu et al. (2019); Hou et al. (2019). However, this method has two drawbacks: (1) the memory usage increases linearly with the number of tasks; (2) it does not discriminate between tasks with different difficulty levels.

To this end, we propose to store a fixed total number of exemplars throughout the entire continual learning process to maintain a bounded memory footprint, as in Rebuffi et al. (2017). As more tasks are continually learned, the exemplars of previous tasks are gradually reduced by keeping only the ones at the front of the priority list¹, and the exemplar size of a task is set proportional to the training data size of the task to reflect the task's difficulty.

¹ The priority list implementation allows reducing exemplars in constant time for each task.
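The following Python sketch mirrors the structure of Algorithm 1. It is an illustrative reimplementation under assumed data structures (dicts carrying a precomputed `ce_loss` and a `slots` frozenset), not the authors' released code:

```python
def priority_score(example, beta=0.5):
    """U(Y, d) from Eq. (5); lower scores are selected first."""
    return example["ce_loss"] * len(example["slots"]) ** beta

def select_exemplars(data, m, beta=0.5):
    """Prioritized, slot-diverse selection as in Algorithm 1.

    Each element of `data` holds the utterance's slot set and its
    cross-entropy under the trained model f_{theta_t}. The outer loop
    allows several passes, so utterances sharing a slot set are only
    picked again once every distinct slot set has been covered."""
    remaining = sorted(data, key=lambda ex: priority_score(ex, beta))
    exemplars = []
    while remaining and len(exemplars) < m:
        seen_slot_sets = set()
        kept = []
        for ex in remaining:
            if len(exemplars) == m or ex["slots"] in seen_slot_sets:
                kept.append(ex)            # skip for now; next pass may take it
            else:
                exemplars.append(ex)       # take it and mark its slot set
                seen_slot_sets.add(ex["slots"])
        remaining = kept
    return exemplars

# Toy usage: three utterances, two of which share the same slot set.
pool = [
    {"utterance": "u1", "slots": frozenset({"name", "food"}), "ce_loss": 0.2},
    {"utterance": "u2", "slots": frozenset({"name", "food"}), "ce_loss": 0.3},
    {"utterance": "u3", "slots": frozenset({"area"}), "ce_loss": 0.5},
]
print([ex["utterance"] for ex in select_exemplars(pool, m=2)])  # ['u1', 'u3']
```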

To be specific, suppose M exemplars are kept in total. The number of exemplars for task i is then:

    |E_i| = M · |D_i| / ∑_{j=1}^{t} |D_j|,  ∀i ∈ {1, ..., t},    (6)

where we choose M = 250 or 500 in the experiments.

3.3.3 Constraint with Adaptive Elastic Weight Consolidation

Although exemplars of previous tasks are stored and replayed, the number of exemplars should be reasonably small (M ≪ |D_{1:t}|) to reduce memory overhead. As a consequence, the constraint we have imposed to prevent the NLG model from catastrophically forgetting previous utterance patterns is not strong enough. To enforce a stronger constraint, we propose a regularization method based on the well-known Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017) technique.

Elastic Weight Consolidation (EWC). EWC utilizes a quadratic term to elastically regularize the parameters important for previous tasks. The loss function using the EWC regularization together with exemplar replay for task t can be written as:

    L_ER_EWC(θ_t) = L_ER(θ_t) + λ ∑_{i=1}^{N} F_i (θ_{t,i} − θ_{t−1,i})²,    (7)

where N is the number of model parameters; θ_{t−1,i} is the i-th converged parameter of the model trained till the previous task; and F_i = ∇² L_CE(θ_{t−1,i}) is the i-th diagonal element of the Fisher Information Matrix approximated w.r.t. the set of previous exemplars E_{1:t−1}. F_i measures the importance of θ_{t−1,i} to the previous tasks represented by E_{1:t−1}. Typical usages of EWC compute F_i w.r.t. a uniformly sampled subset of the historical data; instead, we propose to compute F_i w.r.t. the carefully chosen E_{1:t−1}, so that not all historical data need to be stored. The scalar λ controls the contribution of the quadratic regularization term. The idea is to elastically penalize changes to parameters that are important (have large F_i) for previous tasks, while more plasticity is assigned to parameters with small F_i.
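The quadratic penalty of Eq. (7) can be sketched as follows. This is a minimal numpy illustration under the common squared-gradient (empirical Fisher) approximation of the diagonal Fisher; the helper names and the stub gradients are our assumptions, not the paper's implementation:

```python
import numpy as np

def fisher_from_exemplars(grad_log_likelihoods):
    """Diagonal Fisher estimate over the stored exemplars E_{1:t-1}:
    the mean squared gradient of the log-likelihood (per-exemplar
    gradients are assumed to be given)."""
    grads = np.stack(grad_log_likelihoods)
    return np.mean(grads ** 2, axis=0)

def ewc_penalty(theta, theta_prev, fisher_diag, lam):
    """The quadratic EWC term in Eq. (7): parameters with large F_i are
    strongly anchored to their previous values theta_prev."""
    return lam * np.sum(fisher_diag * (theta - theta_prev) ** 2)

# Toy usage with a 3-parameter "model" and two exemplar gradients:
theta_prev = np.array([0.5, -1.0, 2.0])
theta = np.array([0.6, -1.1, 2.0])
F = fisher_from_exemplars([np.array([1.0, 0.1, 0.0]),
                           np.array([0.8, 0.2, 0.0])])
print(ewc_penalty(theta, theta_prev, F, lam=10.0))
```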
Adaptive regularization. In practice, new tasks differ in difficulty and in similarity to previous tasks, so the degree to which previous knowledge must be preserved varies. To this end, we propose an adaptive weight λ for the EWC regularization term:

    λ = λ_base · sqrt(V_{1:t−1} / V_t),    (8)

where V_{1:t−1} is the size of the old word vocabulary of previous tasks, V_t is the size of the new word vocabulary of the current task t, and λ_base is a hyper-parameter. In general, λ increases with the ratio of the old vocabulary size to the new one. In other words, the regularization term becomes more important when the new task contains less new vocabulary to learn.

Algorithm 2 learn_task: procedure of ARPER to continually learn task t
 1: procedure learn_task(D_t, E_{1:t−1}, f_{θ_{t−1}}, M)
 2:   θ_t ← θ_{t−1}
 3:   while θ_t not converged do
 4:     θ_t ← update(L_ER_EWC(θ_t))
 5:   m ← M · |D_t| / ∑_{j=1}^{t} |D_j|
 6:   E_t ← select_exemplars(D_t, f_{θ_t}, m)
 7:   for j = 1 to t−1 do
 8:     E_j ← E_j.top(M · |D_j| / ∑_{j=1}^{t} |D_j|)
 9:   return f_{θ_t}, E_t

Algorithm 2 summarizes the continual learning procedure of ARPER for task t. θ_t is initialized with θ_{t−1} and trained with prioritized exemplar replay and adaptive EWC as in Eq. (7). After training θ_t, the exemplars E_t of task t are computed by Algorithm 1, and the exemplars of previous tasks E_{1:t−1} are reduced by keeping the most prioritized ones, preserving the total exemplar size.
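Putting the pieces together, a structural sketch of Algorithm 2 might look as follows. Here `train_step` and `select_exemplars` are hypothetical stand-ins for the optimization of Eq. (7) and for Algorithm 1, and the bookkeeping of past data sizes is our own assumption for illustration:

```python
import math

def adaptive_lambda(lambda_base, old_vocab_size, new_vocab_size):
    """Adaptive EWC weight from Eq. (8): the penalty grows when the new
    task introduces relatively little new vocabulary."""
    return lambda_base * math.sqrt(old_vocab_size / new_vocab_size)

def exemplar_budgets(data_sizes, M):
    """Eq. (6): per-task exemplar counts proportional to training-set
    size, under a fixed total budget M."""
    total = sum(data_sizes)
    return [round(M * n / total) for n in data_sizes]

def learn_task(model, new_data, exemplars, data_sizes, M,
               train_step, select_exemplars):
    """Structural sketch of Algorithm 2 for one task t."""
    model = train_step(model, new_data, exemplars)        # lines 2-4: optimize Eq. (7)
    data_sizes = data_sizes + [len(new_data)]             # remember |D_t|
    budgets = exemplar_budgets(data_sizes, M)             # Eq. (6), line 5
    new_exemplars = select_exemplars(new_data, model, budgets[-1])  # line 6
    # lines 7-8: shrink the previous priority lists to their new budgets
    exemplars = [E[:b] for E, b in zip(exemplars, budgets[:-1])]
    return model, exemplars + [new_exemplars], data_sizes
```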

4 Experiments

4.1 Dataset

We use the MultiWoZ-2.0 dataset² (Budzianowski et al., 2018), which contains six domains (Attraction, Hotel, Restaurant, Booking, Taxi, and Train) and seven DA intents ("Inform", "Request", "Select", "Recommend", "Book", "Offer-Booked", "No-Offer"). The original train/validation/test splits are used. For methods using exemplars, both the training and the validation set are continually expanded with exemplars extracted from previous tasks.

To support experiments on continually learning new domains, we pre-processed the original dataset by segmenting multi-domain utterances into single-domain ones. For instance, the utterance "The ADC Theatre is located on Park Street. Before I find your train, could you tell me where you would like to go?" is split into two utterances with the domains "Attraction" and "Train", respectively. If multiple sentences of the same domain exist in the original utterance, they are kept in one utterance after pre-processing. In each continual learning task, all training data of one domain are used to train the NLG model, as illustrated in Figure 1. Similar pre-processing is done at the granularity of DA intents for the experiments in Section 4.6. The statistics of the pre-processed MultiWoZ-2.0 dataset are illustrated in Figure 2. The resulting datasets and the pre-processing scripts are open-sourced.

[Figure 2: Venn diagram visualizing the intents in different domains. The number of utterances of each domain (bold) and intent (italic) is indicated in parentheses.]

² Extracted for NLG at https://github.com/andy194673/nlg-sclstm-multiwoz

4.2 Evaluation Metrics

Following previous studies, we use the slot error rate (SER) and the BLEU-4 score (Papineni et al., 2002) as evaluation metrics. SER is the ratio of the number of missing and redundant slots in a generated utterance to the total number of ground-truth slots in the DA.

To better evaluate continual learning ability, we use two additional commonly used metrics (Kemker et al., 2018) for both SER and BLEU-4:

    Ω_all = (1/T) ∑_{i=1}^{T} Ω_{all,i},    Ω_first = (1/T) ∑_{i=1}^{T} Ω_{first,i},

where T is the total number of continual learning tasks, Ω_{all,i} is the test performance on all tasks after the i-th task has been learned, and Ω_{first,i} is the test performance on the first task after the i-th task has been learned. Since Ω can be either SER or BLEU-4, both Ω_all and Ω_first have two versions. Ω_all evaluates the overall performance, while Ω_first evaluates the ability to alleviate catastrophic forgetting.

4.3 Baselines

Two methods without exemplars are as follows:

• Finetune: At each task, the NLG model is initialized with the model obtained till the last task and then fine-tuned on the data of the current task.

• Full: At each task, the NLG model is trained with the data from the current and all historical tasks. This provides the "upper bound" performance for continual learning w.r.t. Ω_all.

Several exemplar replay (ER) methods trained with Eq. (4) using different exemplar selection schemes are compared:

• ERherding (Welling, 2009; Rebuffi et al., 2017): This scheme chooses the exemplars that best approximate the mean DA vector over all training examples of the task.

• ERrandom: This scheme selects exemplars at random. Despite its simplicity, the distribution of the selected exemplars equals the distribution of the current task in expectation.

• ERprio: The proposed prioritized scheme (c.f. Algorithm 1) that selects representative and diverse exemplars.

Based on ERprio, four regularization methods (including ours) that further alleviate catastrophic forgetting are compared:

• L2: A static L2 regularization obtained by setting F_i = 1 in Eq. (7). It regularizes all parameters equally.

• KD (Rebuffi et al., 2017; Wu et al., 2019; Hou et al., 2019): The widely used knowledge distillation (KD) loss (Hinton et al., 2015) is adopted by distilling the prediction logits of the current model w.r.t. those of the model trained till the last task. More implementation details are included in Appendix A.1.

• Dropout (Mirzadeh et al., 2020): Dropout (Hinton et al., 2012) was recently shown by Mirzadeh et al. (2020) to effectively alleviate catastrophic forgetting. We tuned different dropout rates assigned to the non-recurrent connections.

• ARPER (c.f. Algorithm 2): The proposed method using adaptive EWC with ERprio.

Table 1: Average performance of continually learning 6 domains using 250/500 exemplars. The best performance excluding Full is in bold in each column; * indicates p < 0.05 and ** indicates p < 0.01 for a one-tailed t-test comparing ARPER to the three top-performing competitors except Full.

                    250 exemplars in total              500 exemplars in total
                    Ωall            Ωfirst              Ωall            Ωfirst
                    SER%   BLEU-4   SER%    BLEU-4      SER%   BLEU-4   SER%    BLEU-4
Finetune            64.46  0.361    107.27  0.253       64.46  0.361    107.27  0.253
ERherding           16.89  0.535    9.89    0.532       12.25  0.555    4.53    0.568
ERrandom            10.93  0.552    6.96    0.553       8.36   0.569    4.41    0.572
ERprio              9.67   0.578    5.28    0.578       7.48   0.597    3.59    0.620
ERprio+L2           14.94  0.579    5.31    0.587       10.51  0.596    4.28    0.605
ERprio+KD           8.65   0.586    6.87    0.601       7.37   0.596    4.89    0.617
ERprio+Dropout      7.15   0.588    5.53    0.594       6.09   0.595    4.51    0.616
ARPER               5.22   0.590    2.99    0.624       5.12   0.598    2.81    0.627
Full                4.26   0.599    3.60    0.616       4.26   0.599    3.60    0.616
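To make the evaluation protocol of Section 4.2 concrete, the sketch below computes Ω_all and Ω_first from a lower-triangular matrix of per-task test scores. Treating Ω_{all,i} as the mean over the tasks seen so far is our simplifying assumption for illustration; the paper reports test performance on all seen tasks:

```python
def continual_metrics(results):
    """Omega_all and Omega_first from Section 4.2.

    `results[i][j]` is the test score (SER or BLEU-4) on task j after
    training on task i has finished, for 0 <= j <= i < T."""
    T = len(results)
    # Performance on all tasks seen so far, averaged over the T steps.
    omega_all = sum(sum(row[: i + 1]) / (i + 1)
                    for i, row in enumerate(results)) / T
    # Performance on the first task, averaged over the T steps.
    omega_first = sum(row[0] for row in results) / T
    return omega_all, omega_first

# Toy usage with T = 3 tasks (scores could be SER in %):
results = [
    [5.0],               # after task 1
    [7.0, 6.0],          # after task 2
    [9.0, 8.0, 6.5],     # after task 3
]
print(continual_metrics(results))
```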

We utilized the well-recognized semantically conditioned LSTM (SCLSTM; Wen et al., 2015c) as the base NLG model f_θ, with one hidden layer of size 128.³ Dropout is not used by default; it is evaluated as a separate regularization technique (c.f. ERprio+Dropout). For all the above methods, the learning rate of Adam is set to 5e-3, the batch size is set to 128, and the maximum number of epochs used to train each task is set to 100. Early stopping is adopted to avoid over-fitting when the validation loss does not decrease for 10 consecutive epochs. To fairly compare different methods, they are trained with the identical configuration on the first task so as to have a consistent starting point. Hyper-parameters of the different methods are included in Appendix A.1.

³ Comparisons based on other base NLG models are included in Section 4.9.

4.4 Diagnose Forgetting in NLG

Before proceeding to our main results, we first diagnose whether the catastrophic forgetting issue exists when training an NLG model continually. As an example, a model pre-trained on the "Attraction" domain is continually trained on the "Train" domain. We present the test performance on "Attraction" at different epochs in Figure 3, using 250 exemplars. We can observe that: (1) catastrophic forgetting indeed exists, as indicated by the sharp performance drop of Finetune; (2) replaying carefully chosen exemplars helps to alleviate catastrophic forgetting by a large degree, and ERprio does a better job than ERrandom; and (3) ARPER greatly mitigates catastrophic forgetting by achieving similar or even better performance compared to Full.

[Figure 3: Diagnosing the catastrophic forgetting issue in NLG. SER (left) and BLEU-4 (right) on the test data of "Attraction" at different epochs when a model pre-trained on the "Attraction" domain is continually trained on another domain, "Train". Curves shown for Full, ARPER, ERprio, ERrandom, and Finetune.]

4.5 Continual Learning New Domains

In this experiment, the data from six domains are presented sequentially. We test 6 runs with different domain order permutations.⁴ Each domain is selected as the first task exactly once, and the remaining five domains are randomly ordered. Results averaged over the 6 runs using 250 and 500 total exemplars are presented in Table 1. Several interesting observations can be noted:

• All methods except Finetune perform worse on all seen tasks (Ω_all) than on the first task (Ω_first). This is due to the diverse knowledge among different tasks, which increases the difficulty of handling all the tasks. Finetune performs poorly in both metrics because of the detrimental catastrophic forgetting issue.

• Replaying exemplars helps to alleviate the catastrophic forgetting issue: the three ER methods substantially outperform Finetune. Moreover, the proposed prioritized exemplar selection scheme is effective, indicated by the superior performance of ERprio over ERherding and ERrandom.

⁴ Exact domain orders are provided in Appendix A.2.

[Figure 4: SER on all seen domains (solid) and on the first domain (dashed) as more domains are continually learned using 250 exemplars, comparing ERprio, ERprio+Dropout, ARPER, and Full.]

Table 2: Performance of continually learning 7 DA intents using 250 exemplars. The best performance excluding Full is in bold.

                  Ωall              Ωfirst
Model             SER%    BLEU-4    SER%    BLEU-4
Finetune          49.94   0.382     44.00   0.375
ERherding         13.96   0.542     8.50    0.545
ERrandom          8.58    0.626     5.53    0.618
ERprio            8.21    0.684     5.20    0.669
ERprio+L2         6.87    0.693     4.92    0.661
ERprio+KD         10.59   0.664     10.87   0.649
ERprio+Dropout    6.32    0.689     5.55    0.658
ARPER             3.63    0.701     3.52    0.685
Full              3.08    0.694     2.98    0.672

• ARPER significantly outperforms the three ER methods and the other regularization-based baselines. Compared to the three closest competitors, ARPER is significantly better with p-value < 0.05 w.r.t. SER.

• The improvement margin of ARPER is significant w.r.t. SER, which is critical for measuring an output's fidelity to a given dialog act. The different methods demonstrate similar performance w.r.t. BLEU-4, where several of them approach Full and are thus very close to the upper-bound performance.

• ARPER achieves performance comparable to the upper bound (Full) on all seen tasks (Ω_all), even with a very limited number of exemplars. Moreover, it outperforms Full on the first task (Ω_first), indicating that ARPER better mitigates forgetting of the first task than Full, the latter still being interfered by data from later domains.

Dynamic Results in Continual Learning. In Figure 4, several representative methods are compared as more domains are continually learned. With more tasks continually learned, ARPER performs consistently better than the other methods on all seen tasks (solid lines), and it is comparable to Full. On the first task (dashed lines), ARPER outperforms all the methods, including Full, at every continual learning step. These results illustrate the advantage of ARPER throughout the entire continual learning process.

4.6 Continual Learning New DA Intents

It is also essential for a task-oriented dialog system to continually learn new functionalities, namely, supporting new DA intents. To test this ability, the data of the seven DA intents are presented sequentially in the order of decreasing data size, i.e., "Inform, Request, Book, Recommend, Offer-Booked, No-Offer, Select". Results using 250 exemplars are presented in Table 2. We can observe that ARPER still largely outperforms the other methods, and observations similar to those made before for ARPER hold. Therefore, we conclude that ARPER is able to learn new functionalities continually.

Compared to the previous experiments, the performance of ERprio+KD degrades, while the performance of ERprio+L2 improves due to the very large data size of the first task ("Inform"); this means that both are sensitive to task orders.

Table 3: Ablation study for ARPER. ER / PE / AR stands for the Exemplar Replay loss / Prioritized Exemplars / Adaptive Regularization, respectively.

          Ωall              Ωfirst
          SER%    BLEU-4    SER%    BLEU-4
ARPER     4.82    0.592     3.88    0.569
w/o ER    6.41    0.584     5.85    0.559
w/o PE    5.53    0.587     5.85    0.562
w/o AR    5.57    0.587     4.57    0.563

4.7 Ablation Study

In Table 3, we compare several simplified versions of ARPER to understand the effects of its different components. Comparisons are based on continually learning 6 domains starting with "Attraction". We can observe that: (1) L_ER is beneficial, because dropping it ("w/o ER") degrades the performance of ARPER; (2) using prioritized exemplars is advantageous, because using random exemplars ("w/o PE") for ARPER impairs its performance; (3) adaptive regularization is also effective, indicated by the superior performance of ARPER compared to using fixed regularization weights ("w/o AR").


Table 4: Sample utterances generated for the first domain (“Attraction”) after the NLG is continually trained on all 6 domains using 250 exemplars. Redundant and missing slots are colored in orange and blue respectively. Obvious grammar mistakes (redundant repetitions) are colored in purple.

SCVAE GPT-2 2, we used the pre-trained model with 12 layers and

Ωall Ωfirst Ωall Ωfirst 117M parameters. As in Peng et al.(2020), exact Finetune 60.83 98.86 28.69 31.76 slot values are not replaced by special placeholders ERherding 17.95 11.48 11.95 10.48 during training as in SCLSTM and SCVAE. The ERrandom 9.31 7.52 9.87 8.85 dialog act is concatenated with the corresponding ERprio 8.92 6.16 8.72 8.20 utterance before feeding into GPT-2. More details ER +L2 12.47 6.67 10.51 9.20 prio are included in Appendix A.1. ERprio+KD 6.32 6.09 8.41 8.09 ERprio+Dropout 8.01 8.77 7.60 7.72 Results of using 250 exemplars to continually ARPER 4.45 4.04 5.32 5.05 learn 6 domains starting with “Attraction” are pre- Full 3.99 4.03 4.75 4.53 sented in Table5. Thanks to the large-scale pre- trained language model, GPT-2 suffers less from Table 5: SER in % of using SCVAE and GPT-2 as fθ. Best Performance excluding “Full” are in bold. the catastrophic forgetting issue because of the bet- ter performance of Finetune. In general, the rela- tive performance patterns of different methods are 4.8 Case Study similar to that we observed in Section 4.5 and 4.6. Table4 shows two examples generated by ARPER Therefore, we can claim that the superior perfor- and the closest competitor (ERprio+Dropout) on the mance of ARPER can generalize to different base first domain (“Attraction”) after the NLG model is NLG models. continually trained on all 6 domains starting with “Attraction”. In both examples, ERprio+Dropout 5 Conclusion fails to generate slot “Fee” or “Type”, instead, it mistakenly generates slots belonging to later do- In this paper, we study the practical continual learn- mains (“Hotel” or “Restaurant”) with several ob- ing setting of language generation in task-oriented vious redundant repetitions colored in purple. It dialog systems. To alleviate catastrophic forget- means that the NLG model is interfered by ut- ting, we present ARPER which replays representa- terance patterns in later domains, and it forgets tive and diverse exemplars selected in a prioritized some old patterns it has learned before. In contrast, manner, and employs an adaptive regularization ARPER succeeds in both cases without forgetting term based on EWC (Elastic Weight Consolida- previously learned patterns. tion). Extensive experiments on MultiWoZ-2.0 in different continual learning scenarios reveal the 4.9 Results using Other NLG Models superior performance of ARPER . The realistic con- In this experiment, we changed the base NLG tinual learning setting and the proposed technique model From SCLSTM to SCVAE (Tseng et al., may inspire further studies towards building more 2018) and GPT-2 (Radford et al., 2019). For GPT- scalable task-oriented dialog systems.

3469 References Computational Linguistics: Human Language Tech- nologies, Volume 1 (Long and Short Papers), pages Rahaf Aljundi, Francesca Babiloni, Mohamed Elho- 4171–4186. seiny, Marcus Rohrbach, and Tinne Tuytelaars. 2018. Memory aware synapses: Learning what (not) Robert M French. 1999. Catastrophic forgetting in con- to forget. In Proceedings of the European Confer- nectionist networks. Trends in Cognitive Sciences, ence on Computer Vision (ECCV), pages 139–154. pages 128–135.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Schneider, Rachel Fong, Peter Welinder, Bob Mc- Liu, James Glass, and Fuchun Peng. 2019. Mix- Grew, Josh Tobin, OpenAI Pieter Abbeel, and Woj- review: Alleviate forgetting in the pretrain-finetune ciech Zaremba. 2017. Hindsight experience replay. framework for neural language generation models. In Advances in neural information processing sys- arXiv preprint arXiv:1910.07117. tems, pages 5048–5058. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Gaurav Arora, Afshin Rahimi, and Timothy Baldwin. Distilling the knowledge in a neural network. arXiv 2019. Does an lstm forget more than a cnn? an em- preprint arXiv:1503.02531. pirical study of catastrophic forgetting in nlp. In Geoffrey E Hinton, Nitish Srivastava, , Proceedings of the The 17th Annual Workshop of , and Ruslan R Salakhutdinov. 2012. the Australasian Language Technology Association, Improving neural networks by preventing co- pages 77–86. adaptation of feature detectors. arXiv preprint arXiv:1207.0580 Antoine Bordes, Y-Lan Boureau, and Jason Weston. . 2016. Learning end-to-end goal-oriented dialog. Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, arXiv preprint arXiv:1605.07683. and Dahua Lin. 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang IEEE Conference on Computer Vision and Pattern Tseng, Inigo˜ Casanueva, Stefan Ultes, Osman Ra- Recognition, pages 831–839. madan, and Milica Gasic. 2018. Multiwoz-a large- scale multi-domain wizard-of-oz dataset for task- Ronald Kemker, Marc McClure, Angelina Abitino, oriented dialogue modelling. In Proceedings of the Tyler L Hayes, and Christopher Kanan. 2018. Mea- 2018 Conference on Empirical Methods in Natural suring catastrophic forgetting in neural networks. In Language Processing, pages 5016–5026. Thirty-second AAAI conference on artificial intelli- gence. Francisco M Castro, Manuel J Mar´ın-Jimenez,´ Nicolas´ Guil, Cordelia Schmid, and Karteek Alahari. 2018. Diederik P Kingma and Max Welling. 2013. Auto- End-to-end incremental learning. In Proceedings encoding variational bayes. arXiv preprint of the European Conference on Computer Vision arXiv:1312.6114. (ECCV), pages 233–248. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Arslan Chaudhry, Marcus Rohrbach, Mohamed Elho- Joel Veness, Guillaume Desjardins, Andrei A Rusu, seiny, Thalaiyasingam Ajanthan, Puneet K Doka- Kieran Milan, John Quan, Tiago Ramalho, Ag- nia, Philip HS Torr, and Marc’Aurelio Ranzato. nieszka Grabska-Barwinska, et al. 2017. Over- 2019. Continual learning with tiny episodic mem- coming catastrophic forgetting in neural networks. ories. arXiv preprint arXiv:1902.10486. Proceedings of the national academy of sciences, 114(13):3521–3526. Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and learn: Sungjin Lee. 2017. Toward continual learn- Fine-tuning deep pretrained language models with ing for conversational agents. arXiv preprint less forgetting. arXiv preprint arXiv:2004.12651. arXiv:1712.09943. Yuanpeng Li, Liang Zhao, Kenneth Church, and Mo- Alexandra Chronopoulou, Christos Baziotis, and hamed Elhoseiny. 2020. Compositional continual Alexandros Potamianos. 2019. An embarrassingly language learning. simple approach for transfer learning from pre- trained language models. In Proceedings of the Zhizhong Li and Derek Hoiem. 2017. 
Learning with- 2019 Conference of the North American Chapter of out forgetting. IEEE transactions on pattern analy- the Association for Computational Linguistics: Hu- sis and machine intelligence, 40(12):2935–2947. man Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095. Tianlin Liu, Lyle Ungar, and Joao˜ Sedoc. 2019. Contin- ual learning for sentence representations using con- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and ceptors. In Proceedings of the 2019 Conference of Kristina Toutanova. 2019. Bert: Pre-training of the North American Chapter of the Association for deep bidirectional transformers for language under- Computational Linguistics: Human Language Tech- standing. In Proceedings of the 2019 Conference of nologies, Volume 1 (Long and Short Papers), pages the North American Chapter of the Association for 3274–3279.

3470 Davide Maltoni and Vincenzo Lomonaco. 2019. Con- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, tinuous learning in single-incremental-task scenar- Dario Amodei, and Ilya Sutskever. 2019. Language ios. Neural Networks, 116:56–73. models are unsupervised multitask learners.

Michael McCloskey and Neal J Cohen. 1989. Catas- Tiago Ramalho and Marta Garnelo. 2019. Adaptive trophic interference in connectionist networks: The posterior learning: few-shot learning with a surprise- sequential learning problem. In Psychology of learn- based memory module. In International Conference ing and motivation, volume 24, pages 109–165. El- on Learning Representations (ICLR)). sevier. Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. Georg Sperl, and Christoph H Lampert. 2017. 2019. Meta-learning for low-resource natural lan- icarl: Incremental classifier and representation guage generation in task-oriented dialogue systems. learning. In Proceedings of the IEEE Conference In Proceedings of the 28th International Joint Con- on Computer Vision and , pages ference on Artificial Intelligence, pages 3151–3157. 2001–2010. AAAI Press.

Fei Mi, Lingjing Kong, Tao Lin, Kaicheng Yu, and Matthew Riemer, Tim Klinger, Djallel Bouneffouf, and Boi Faltings. 2020a. Generalized class incremental Michele Franceschini. 2019. Scalable recollections learning. In Proceedings of the IEEE/CVF Confer- for continual lifelong learning. In AAAI, volume 33, ence on Computer Vision and Pattern Recognition pages 1352–1359. Workshops, pages 240–241. Andrei A Rusu, Neil C Rabinowitz, Guillaume Des- Fei Mi, Xiaoyu Lin, and Boi Faltings. 2020b. Ader: jardins, Hubert Soyer, James Kirkpatrick, Koray Adaptively distilled exemplar replay towards contin- Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. ual learning for session-based recommendation. In 2016. Progressive neural networks. arXiv preprint Fourteenth ACM Conference on Recommender Sys- arXiv:1606.04671. tems, pages 408–413. Danielle Saunders, Felix Stahlberg, Adria` de Gispert, Seyed-Iman Mirzadeh, Mehrdad Farajtabar, and Has- and Bill Byrne. 2019. Domain adaptive inference san Ghasemzadeh. 2020. Dropout as an implicit for neural machine translation. In Proceedings of gating mechanism for continual learning. arXiv the 57th Annual Meeting of the Association for Com- preprint arXiv:2004.11545. putational Linguistics, pages 222–228.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Tom Schaul, John Quan, Ioannis Antonoglou, and Lu Zhang, and Zhi Jin. 2016. How transferable are David Silver. 2016. Prioritized experience replay. neural networks in nlp applications? In Proceed- ings of the 2016 Conference on Empirical Methods Yilin Shen, Xiangyu Zeng, and Hongxia Jin. 2019. in Natural Language Processing, pages 479–489. A progressive model to enable continual learning for semantic slot filling. In Proceedings of the Kishore Papineni, Salim Roukos, Todd Ward, and Wei- 2019 Conference on Empirical Methods in Natu- Jing Zhu. 2002. Bleu: a method for automatic eval- ral Language Processing and the 9th International uation of machine translation. In Proceedings of the Joint Conference on Natural Language Processing 40th Annual Meeting on Association for Computa- (EMNLP-IJCNLP), pages 1279–1284. tional Linguistics, pages 311–318.

Baolin Peng, Chenguang Zhu, Chunyuan Li, Xi- Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon ujun Li, Jinchao Li, Michael Zeng, and Jian- Kim. 2017. Continual learning with deep generative feng Gao. 2020. Few-shot natural language gen- replay. In Advances in Neural Information Process- eration for task-oriented dialog. arXiv preprint ing Systems, pages 2990–2999. arXiv:2002.12328. Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Kevin Duh, and Philipp Koehn. 2019. Overcoming Gardner, Christopher Clark, Kenton Lee, and Luke catastrophic forgetting during domain adaptation of Zettlemoyer. 2018. Deep contextualized word repre- neural machine translation. In Proceedings of the sentations. In Proceedings of the 2018 Conference 2019 Conference of the North American Chapter of of the North American Chapter of the Association the Association for Computational Linguistics: Hu- for Computational Linguistics: Human Language man Language Technologies, Volume 1 (Long and Technologies, Volume 1 (Long Papers), pages 2227– Short Papers), pages 2062–2068. 2237. Van-Khanh Tran and Le-Minh Nguyen. 2017. Natural Kun Qian and Zhou Yu. 2019. Domain adaptive dialog language generation for spoken dialogue system us- generation via meta learning. In Proceedings of the ing rnn encoder-decoder networks. In Proceedings 57th Annual Meeting of the Association for Compu- of the 21st Conference on Computational Natural tational Linguistics, pages 2639–2649. Language Learning, pages 442–451.

3471 Van-Khanh Tran and Le-Minh Nguyen. 2018a. Adver- Dani Yogatama, Cyprien de Masson d’Autume, Jerome sarial domain adaptation for variational neural lan- Connor, Tomas Kocisky, Mike Chrzanowski, Ling- guage generation in dialogue systems. In Proceed- peng Kong, Angeliki Lazaridou, Wang Ling, Lei ings of the 27th International Conference on Com- Yu, Chris Dyer, et al. 2019. Learning and evaluat- putational Linguistics, pages 1205–1217. ing general linguistic intelligence. arXiv preprint arXiv:1901.11373. Van-Khanh Tran and Le-Minh Nguyen. 2018b. Dual latent variable model for low-resource natural lan- Friedemann Zenke, Ben Poole, and Surya Ganguli. guage generation in dialogue systems. In Proceed- 2017. Continual learning through synaptic intelli- ings of the 22nd Conference on Computational Nat- gence. In Proceedings of the 34th International ural Language Learning, pages 21–30. Conference on , pages 3987–3995. JMLR. org. Bo-Hsiang Tseng, Florian Kreyssig, Paweł Budzianowski, Inigo˜ Casanueva, Yen-chen Wu, Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, Stefan Ultes, and Milica Gasic. 2018. Variational and Shutao Xia. 2019. Maintaining discrimination cross-domain natural language generation for and fairness in class incremental learning. arXiv spoken dialogue systems. In 19th Annual SIG- preprint arXiv:1911.07053. dial Meeting on Discourse and Dialogue, pages 338–343. Dusan Varis and Ondrejˇ Bojar. 2019. Unsupervised pretraining for neural machine translation using elas- tic weight consolidation. In Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop, pages 130–135. Max Welling. 2009. Herding dynamical weights to learn. In Proceedings of the 26th International Conference on Machine Learning, pages 1121–1128. ACM. Tsung-Hsien Wen, Milica Gasic,ˇ Dongho Kim, Nikola Mrksic,ˇ Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic language generation in di- alogue using recurrent neural networks with convo- lutional sentence reranking. In 16th Annual Meeting of the Special Interest Group on Discourse and Dia- logue, page 275. Tsung-Hsien Wen, Milica Gasic,ˇ Nikola Mrksic,ˇ Lina M Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Toward multi- domain language generation using recurrent neural networks. In NIPS Workshop on Machine Learn- ing for Spoken Language Understanding and Inter- action. Tsung-Hsien Wen, Milica Gasic, Nikola Mrksiˇ c,´ Pei- Hao Su, David Vandyke, and Steve Young. 2015c. Semantically conditioned lstm-based natural lan- guage generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721. Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. 2019. Large scale incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition, pages 831–839. Ying Xu, Xu Zhong, Antonio Jose Jimeno Yepes, and Jey Han Lau. 2019. Forget me not: Reducing catas- trophic forgetting for domain adaptation in reading comprehension. arXiv preprint arXiv:1911.00202.

Appendix

A Reproducibility Checklist

A.1 Model Details and Hyper-parameters

We first elaborate the implementation details of the KD baseline compared in our paper. We used the loss term:

    L_KD(Y, d, f_{θ_{t−1}}, f_{θ_t}) = − ∑_{k=1}^{K} ∑_{i=1}^{|L|} p̂_{k,i} · log(p_{k,i}),

where L is the vocabulary that appears in previous tasks but not in task t. At each position k of Y, [p̂_{k,1}, ..., p̂_{k,|L|}] is the predicted distribution over L given by f_{θ_{t−1}}, and [p_{k,1}, ..., p_{k,|L|}] is the distribution given by f_{θ_t}. L_KD penalizes prediction changes on the vocabulary specific to earlier tasks. For all {Y, d} ∈ D_t ∪ E_{1:t−1}, L_KD is linearly interpolated with L_ER as L_ER + η · L_KD, with η tuned as a hyper-parameter.⁵

Hyper-parameters of SCVAE reported in Section 4.9 are set to the defaults of https://github.com/andy194673/nlg-scvae, except that the learning rate is set to 2e-3. For GPT-2, we used the implementation pipeline from https://github.com/pengbaolin/SC-GPT. We pre-processed the dialog act d into the format d′ = [ I ( s_1 = v_1, ..., s_p = v_p ) ], and the corresponding utterance Y is turned into Y′ by adding a special start token [BOS] and an end token [EOS]. d′ and Y′ are concatenated before being fed into GPT-2. The learning rate of the Adam optimizer is set to 5e-5 without weight decay. As GPT-2 converges faster, we train at most 10 epochs for each task, with early stopping applied after 3 consecutive epochs.

Hyper-parameters of the different methods are tuned to maximize SER_all using grid search, and the optimal settings for the various experiments are summarized in Table 6.

⁵ The temperature in (Hinton et al., 2015) is set to 1 due to its minimal impact based on our experiments.

Table 6: Optimal hyper-parameters of the methods experimented with in this paper. The four values in the "Domains" column correspond to using 250 exemplars (Tables 1 and 2) / 500 exemplars (Table 1) / SCVAE as f_θ (Table 5) / GPT-2 as f_θ (Table 5), respectively.

                       Domains                 DA Intents
ERprio (β)             0.5/0.5/0.5/0.5         0.5
L2 (weight on L2)      1e-3/1e-3/1e-3/5e-4     1e-2
KD (weight on L_KD)    2.0/3.0/2.0/0.5         5.0
Dropout (rate)         0.25/0.25/0.25/0.1      0.25
ARPER (λ_base)         300k/350k/200k/30k      100k
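A minimal sketch of the KD loss above, assuming the teacher and student distributions over L have already been extracted per token position (the function name and data layout are our illustrative choices, not the released code):

```python
import math

def kd_loss_old_vocab(teacher_dists, student_dists):
    """KD baseline loss from Appendix A.1.

    Both arguments hold, per token position k, probability distributions
    restricted to L: the words seen in previous tasks but absent from
    task t. The teacher is f_{theta_{t-1}}, the student f_{theta_t}."""
    loss = 0.0
    for p_hat, p in zip(teacher_dists, student_dists):
        loss -= sum(q * math.log(s) for q, s in zip(p_hat, p))
    return loss

# Toy usage: two positions, |L| = 3 old-task words.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
print(kd_loss_old_vocab(teacher, student))
```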

A.2 Domain Order Permutations

In Table 7, we provide the exact domain order permutations of the 6 runs used in the experiments of Table 1 and Figure 4.

Table 7: Each row corresponds to a domain order permutation. The mapping from domain to id is: {"Attraction": 0, "Booking": 1, "Hotel": 2, "Restaurant": 3, "Taxi": 4, "Train": 5}.

Run 1: 0 5 2 1 3 4
Run 2: 1 4 0 5 3 2
Run 3: 2 0 3 1 4 5
Run 4: 3 2 4 0 1 5
Run 5: 4 2 1 5 0 3
Run 6: 5 3 2 0 1 4

A.3 Computation Resource

All experiments are conducted using a single GPU (GeForce GTX TITAN X). In Table 8, we compare the average training time of one epoch for the different methods. Full incurs more than 200s of extra computation overhead per epoch compared to the other methods, which use bounded exemplars. ARPER takes slightly longer to train than all methods except Full. Nevertheless, considering its superior performance, we contend that ARPER achieves a desirable resource-performance trade-off. In addition, 250 exemplars are less than 1% of the historical data at the last task, and the memory needed to store such a small number of exemplars is trivial.

Table 8: Average training time of one epoch at the last task when continually learning 6 domains starting with "Attraction" using 250 exemplars. Methods other than Finetune and Full are applied on top of ERprio.

Finetune   ERprio   L2      KD      Dropout   ARPER   Full
17.5s      18.5s    19.5s   24.6s   15.5s     39.5s   242.5s

B Supplementary Empirical Results

B.1 Comparison to Pseudo Exemplar Replay

Instead of storing raw samples as exemplars, Shin et al. (2017); Riemer et al. (2019) generate "pseudo" samples akin to past data. The NLG model itself can generate such pseudo exemplars. In this experiment, we replace the 500 raw exemplars of ERrandom, ERprio, and ARPER with pseudo samples generated by the continually trained NLG model, using the dialog acts of the same raw exemplars as input.
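A sketch of how such pseudo exemplars could be produced, with a hypothetical `decode` helper standing in for the model's actual decoding routine:

```python
def generate_pseudo_exemplars(nlg_model, stored_dialog_acts, decode):
    """Pseudo-exemplar variant compared in Section B.1.

    Only the dialog acts of the chosen exemplars are kept; the continually
    trained NLG model re-generates the utterance side. `decode` is a
    hypothetical greedy/beam decoding helper, not an API of this paper."""
    return [(decode(nlg_model, da), da) for da in stored_dialog_acts]

# Usage with a stub model and decoder:
pseudo = generate_pseudo_exemplars(
    nlg_model=None,
    stored_dialog_acts=["Inform(name=La Margherita, food=Italian)"],
    decode=lambda model, da: f"<generated utterance for {da}>",
)
print(pseudo)
```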

3473 0 0 32 0.5 32 0.5 64 64 96 96 128 0.4 128 0.4 160 160 192 192 224 0.3 224 0.3 256 256 288 288 320 0.2 320 0.2 352 352 384 384 416 0.1 416 0.1 448 448 480 480 512 0.0 512 0.0 0 0 16 32 48 64 80 96 16 32 48 64 80 96 112 128 112 128 Attraction->Train Train->Hotel

0 0 32 1.4 32 1.4 64 64 96 1.2 96 1.2 128 128 160 1.0 160 1.0 192 192 224 224 0.8 0.8 256 256 288 288 320 0.6 320 0.6 352 352 384 0.4 384 0.4 416 416 448 0.2 448 0.2 480 480 512 0.0 512 0.0 0 0 16 32 48 64 80 96 16 32 48 64 80 96 112 128 112 128 Attraction->Train Train->Hotel

[Figure 5: A visualization of the change of SCLSTM's hidden-layer weights obtained from two consecutive tasks for ARPER (top) and ERprio+Dropout (bottom). Two sample task transitions (from "Attraction" to "Train", and then from "Train" to "Hotel") are shown. High-temperature areas of ARPER are highlighted by red bounding boxes for better visualization.]

Results comparing pseudo and raw exemplars when continually learning 6 domains starting with "Attraction" are shown in Table 9. We can see that using pseudo exemplars performs better for ERrandom but worse for ERprio and ARPER. That is, pseudo exemplars help when exemplars are chosen randomly, while carefully chosen raw exemplars (c.f. Algorithm 1) are better than pseudo exemplars. Exploring pseudo exemplars for NLG is orthogonal to our work, and we leave it for future work.

Table 9: Comparison with pseudo exemplar replay.

                     Ωall              Ωfirst
                     SER%    BLEU-4    SER%    BLEU-4
ERrandom             9.82    0.495     8.64    0.405
Pseudo-ERrandom      9.26    0.551     6.88    0.519
ERprio               7.84    0.573     6.20    0.523
Pseudo-ERprio        8.87    0.557     6.37    0.521
ARPER                4.43    0.597     3.40    0.574
Pseudo-ARPER         5.07    0.590     3.51    0.570

B.2 Flow of Parameter Updates

To further understand the superior performance of ARPER, we investigated how parameters are updated throughout the continual learning process. Specifically, we compared SCLSTM's hidden-layer weights obtained from consecutive tasks; the pairwise L1 differences for two sample transitions are shown in Figure 5. We can observe that ERprio+Dropout tends to update almost all parameters, while ARPER updates only a small fraction of them. Furthermore, ARPER has different sets of important parameters for distinct tasks, indicated by the different high-temperature areas in the weight-update heat maps, whereas the parameters of ERprio+Dropout are updated nearly uniformly across task transitions. These observations verify that ARPER indeed elastically allocates different network parameters to distinct NLG tasks to mitigate catastrophic forgetting.
