
Alternating Recurrent Dialog Model with Large-scale Pre-trained Language Models

Qingyang Wu¹, Yichi Zhang², Yu Li¹, Zhou Yu¹
¹University of California, Davis  ²Tsinghua University
{wilwu, yooli, joyu}@ucdavis.edu, [email protected]

Abstract

Existing dialog system models require extensive human annotations and are difficult to generalize to different tasks. The recent success of large pre-trained language models has suggested the effectiveness of incorporating language priors in down-stream NLP tasks. However, how much pre-trained language models can help dialog response generation is still under exploration. In this paper, we propose a simple, general, and effective framework: the Alternating Recurrent Dialog Model (ARDM)¹. ARDM models each speaker separately and takes advantage of large pre-trained language models. It requires no supervision from human annotations such as belief states or dialog acts to achieve effective conversations. ARDM outperforms or is on par with state-of-the-art methods on two popular task-oriented dialog datasets, CamRest676 and MultiWOZ. Moreover, we can generalize ARDM to more challenging, non-collaborative tasks such as persuasion. In the PersuasionForGood task, ARDM is capable of generating human-like responses that persuade people to donate to a charity.

¹ https://github.com/qywu/ARDM

1 Introduction

It has been a long-standing ambition for artificial intelligence researchers to create an intelligent conversational agent that can generate human-like responses. Recently, data-driven dialog models have become more and more popular. However, most current state-of-the-art approaches still heavily rely on extensive human annotations such as belief states and dialog acts (Lei et al., 2018), and dialog content can vary considerably across dialog tasks. Having a different intent or dialog act annotation scheme for each task is costly and even impossible for tasks such as open-domain social chat. Thus, it is difficult to utilize these methods on challenging dialog tasks where dialog states and acts are difficult to annotate, such as persuasion and negotiation.

Eric and Manning (2017) proposed a simple sequence-to-sequence architecture that requires no explicit annotations. The model learns to extract information from dialog history with attention and a copy mechanism. However, due to the limited language modeling capability of that model, Sequicity (Lei et al., 2018), which uses belief states as inputs for supervision, outperforms Eric and Manning (2017)'s method significantly on recent dialog datasets. But with the success of large pre-trained language models such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), we investigate how large-scale pre-trained language models can help dialog tasks.

Previous sequence-to-sequence models are used to tackle documents with only one narrator. However, in dialogs, two speakers have different roles; therefore, their language model distributions are very different from each other. To address this issue, we propose ARDM, a dialog model that encodes and decodes different speaker utterances in alternating order. This structure makes the model more flexible and efficient than traditional sequence-to-sequence models in processing various dialogs. We evaluate our model on three different task-oriented dialog datasets: CamRest676, MultiWOZ, and PersuasionForGood. The first two datasets are traditional information request dialog datasets with well-defined automatic evaluation metrics on task completion. By contrast, PersuasionForGood is a new dataset that focuses on persuading people to donate to a charity. Due to the complexity of dialog content, there is no explicit dialog state defined in this task.

We observe that ARDM is capable of improving task-oriented dialog performance over the previous state-of-the-art methods without incorporating any explicit supervision from belief states or dialog acts.


Also, because of ARDM's simplicity and generality, one can rapidly build a dialog prototype for different types of applications using only conversations, without additional human annotations. We also found that ARDM works well on complex dialogs, such as persuasion. The model generates dialog responses that successfully persuade people to donate to a charity, suggesting the potential of ARDM being used in wide-scale real-world settings.

2 Related Work

Traditional dialog systems consist of a dialog manager to maintain dialog states and control the conversation flow. However, a dialog manager requires extensive manual annotations for training the sub-modules, such as the dialog state tracker or the policy decision-maker. An alternative is to model dialog without explicitly modeling belief states. Specifically, Eric and Manning (2017) proposed a sequence-to-sequence model that utilizes a copy mechanism to copy history information directly from raw dialog history. This method achieved state-of-the-art results on DSTC2 (Henderson et al., 2014), a simple restaurant booking dialog task with abundant data. However, such a method did not perform well on the more complex dialog task datasets CamRest676 (Wen et al., 2017) and KVRET (Eric et al., 2017). Sequicity (Lei et al., 2018) attributed the bad performance of Eric and Manning (2017)'s method to the omission of a belief tracker. They introduced the concept of a belief span, added the belief tracker back to the model, and achieved state-of-the-art performance.

Compared to Sequicity, Eric and Manning (2017)'s method provides a more general framework that reduces manual dialog state, user intent, and dialog act labeling by bypassing any symbolic annotations. Such a model can apply to datasets with no or partial annotations of belief states. In a real-world setting, if the dialog task introduces new slot values in belief states (i.e., a new type of food), Sequicity will suffer from belief span decoder errors in response generation. Thus, Eric and Manning (2017)'s method may be potentially more robust than Sequicity in this situation. Besides, if the task requires belief states for database search, we can treat belief tracking as a separate task. We can train a good belief tracker with only a small amount of annotated data, which reduces the annotation required and makes errors easier to fix. Also, since belief states are a set of important entities condensed from dialog history (i.e., often exact words from utterances), they do not introduce extra information to the model. Therefore, a dialog model with powerful representation learning should learn a form of belief state information automatically, without human annotations as the scaffold.

The recent success of BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) suggests the possibility of applying large pre-trained language models to dialog systems. There are some studies applying large pre-trained language models to dialog generation. TransferTransfo (Wolf et al., 2019) fine-tuned the pre-trained language model GPT (Radford et al., 2018) on the Persona-Chat dataset (Zhang et al., 2018) and obtained significant improvements on chitchat response generation, suggesting the potential of fine-tuning large pre-trained language models on other dialog response generation tasks. A more recent work (Budzianowski and Vulic, 2019) adopted the framework of TransferTransfo. They made the first attempt to leverage the large pre-trained language models GPT and GPT-2 on task-oriented dialog generation, but it included belief state modeling as input and did not achieve better results than the baseline. We propose to model dialogs without any annotation, relying instead on pre-trained large-scale language models that alternate.

3 Alternating Recurrent Dialog Model

We propose the Alternating Recurrent Dialog Model (ARDM) by compositing two separate pre-trained language models in alternating order to learn the user and system utterance distributions through memory recurrence. Figure 1 shows an overview of ARDM.

3.1 Recurrent Modeling for User and System

We aim to model both user and system utterance distributions recurrently. Given a multi-turn dialog d between a user (u) and a system (s), we can represent d as a series of utterances {u_1, s_1, u_2, s_2, ..., u_T, s_T}, where T denotes the total number of turns. We decompose the probability distribution over the utterances in d into two language models for the user and system respectively, denoted as p_u and p_s.


Figure 1: Alternating Recurrent Dialog Model (ARDM) overview. (a) shows how we feed the entire dialog to ARDM. (b) shows the recurrence mechanism we use to preserve memory.

We define the dialog model p(d) as:

    p(d) = ∏_{t=1}^{T} p_u(u_t | u_{<t}, s_{<t}) · p_s(s_t | u_{≤t}, s_{<t})    (1)

Both p_u and p_s are realized with pre-trained Transformer language models, and each utterance is generated token by token conditioned on all preceding utterances. Within each Transformer layer, self-attention operates over query, key, and value features denoted as Q, K, V, where the attention is defined as:

    Attention(Q, K, V) = softmax(QKᵀ) V    (4)

To preserve the dialog history as memory, the keys and values computed for all previous utterances are cached and concatenated with those of the current utterance at every turn t:

    K_{≤t}, V_{≤t} = [K_{≤t−1}; K_t], [V_{≤t−1}; V_t]    (6)
    h_t = Attention(Q_t, K_{≤t}, V_{≤t})    (7)

so that the hidden states h_t of the current utterance attend over the entire dialog history.
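To make the factorization in Eq. (1) and the memory recurrence in Eqs. (6)–(7) concrete, the following is a minimal training-step sketch, assuming two GPT-2 language models from the HuggingFace transformers library (one per speaker) that extend a single shared key/value cache serving as the dialog memory. The checkpoint name, the toy dialog, and the loss handling are illustrative simplifications, not the authors' released implementation.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
user_model = GPT2LMHeadModel.from_pretrained("gpt2")    # p_u, user side
system_model = GPT2LMHeadModel.from_pretrained("gpt2")  # p_s, system side

def utterance_step(model, text, memory):
    """Score one utterance with the speaker's own LM while attending over the
    cached keys/values of all previous utterances (Eqs. 6-7)."""
    ids = tokenizer(text + "\n\n\n", return_tensors="pt").input_ids
    out = model(ids, past_key_values=memory, labels=ids, use_cache=True)
    # out.loss only scores tokens within the current utterance; this is a
    # simplification to keep the sketch short.
    return out.loss, out.past_key_values

dialog = [("user", "I want a cheap french restaurant."),
          ("system", "There is no such restaurant.")]

memory, total_loss = None, 0.0
for role, text in dialog:
    model = user_model if role == "user" else system_model
    loss, memory = utterance_step(model, text, memory)
    total_loss = total_loss + loss

total_loss.backward()  # fine-tunes both speaker models on the whole dialog
```

Because the two models share no weights, the concatenated key/value cache is the only channel through which the user and system sides exchange information, which is exactly the role Section 6 assigns to the memory.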

Each utterance is marked by suffixing the end-of-utterance token "\n\n\n". This "trigger" approach is similar to other zero-shot scenarios mentioned in the GPT-2 paper (e.g., a "TL;DR" token can trigger GPT-2 to summarize the input text). We further fine-tune ARDM on the specific task dataset. We apply the AdamW optimizer (Loshchilov and Hutter, 2019), and the number of warm-up steps is set to the number of batches in one epoch. The learning rate is set to 3 × 10⁻⁵, and the dropout rate is set to 0.1 for all tasks. We conduct all experiments on an 11GB GPU.

3.4 Decoding Details

We decode utterances by nucleus sampling (Holtzman et al., 2019) with different hyper-parameters (top-p, top-k) for down-stream dialog tasks. We also vary the temperature T < 1 to find the best setting for each specific down-stream dialog task. To handle both the evaluation setting and the real-world use case, we have two decoding modes. In evaluation mode, we feed all past ground truth history before turn t to generate the corresponding utterance, so that we can evaluate the quality of generated dialog responses without concern about the conversation flow. In a real-world use case, we do not have ground truth history, and therefore we use the memory from previously generated responses and let the model dynamically interact with a human or another bot in turns. Because dialogs have different lengths, it is hard for ARDM to efficiently decode responses using the traditional batch padding method. As a solution, we develop a dynamic dialog filtering algorithm to support fast decoding in batches, which speeds up generation roughly eight times. Please refer to the Appendix for the method's details.
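As a rough illustration of this decoding setup, the sketch below generates one system utterance with nucleus (top-p) sampling and a temperature below 1 using the transformers generate API. The history string, the particular top_p and temperature values, and the length limit are illustrative stand-ins for the per-task hyper-parameters the paper tunes, not prescribed settings.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
system_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Dialog history serialized as plain text; in ARDM proper this context would
# instead live in the cached key/value memory built up turn by turn.
history = "I am looking for a cheap french restaurant.\n\n\n"
input_ids = tokenizer(history, return_tensors="pt").input_ids

output = system_model.generate(
    input_ids,
    do_sample=True,       # sample instead of greedy or beam decoding
    top_p=0.9,            # nucleus threshold (tuned per task)
    top_k=0,              # disable top-k so only nucleus filtering applies
    temperature=0.7,      # temperature below 1 sharpens the distribution
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(output[0, input_ids.shape[1]:])
print(response)
```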
3.4.1 Belief Tracking

Database state means the database query results. For example, if there is no match in the database satisfying the constraint "cheap french restaurant", the query returns "0" and the domain "restaurant". The system can then condition on this database state to generate "There is no such restaurant." We do not incorporate belief tracking because we want to fit different dialog tasks in which there may be no annotations of belief states. Also, Eric and Manning (2017) pointed out that user utterances already contain the belief state information needed to generate the system response, meaning it is not necessary to explicitly input the belief state into the model. For database search, one can train a separate belief tracker with the small portion of data that is annotated, which is more common in real-world cases.

4 Experiments and Results

Data scarcity is one of the biggest challenges in dialog research. It is costly to collect human-human conversations under a specific setting, and it is even more time-consuming to annotate the corresponding belief states and dialog acts. With the success of transfer learning in NLP, we aim to mitigate this low-resource problem with large pre-trained language models. We validate our proposed ARDM on three task-oriented dialog datasets: CamRest676, MultiWOZ, and PersuasionForGood.

4.1 CamRest676

CamRest676 is a relatively small dataset with 408/136/136 dialogs for train/validation/test. We follow Sequicity (Lei et al., 2018) to delexicalize tokens such as restaurant names, phone numbers, and postcodes by replacing them with their slot names in utterances. We prepend database search results to the system utterance. An example database search result is "restaurant;3", where the first slot indicates the dialog domain, which is always "restaurant" in CamRest676, and the second slot represents the number of matched items in the database (a small serialization sketch follows below). We use greedy sampling for all methods during decoding. We use BLEU-4 to evaluate language generation quality and Success F1 to evaluate task success. Success F1 computes the F1 score of the generated responses on requested slots such as an address, phone number, or food type. Besides Sequicity, we also compare against using GPT-2 alone as a language model for the entire dialog. This baseline is equivalent to ARDM without alternating recurrent modeling.
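To make the "domain;count" database prefix and the delexicalized utterance format above concrete, here is a small hedged sketch of how a single turn could be serialized before it is fed to the language model. Only the prefix format and the "\n\n\n" end-of-utterance suffix come from the paper; the helper name, role argument, and the exact slot token are illustrative assumptions.

```python
def serialize_turn(role: str, utterance: str, db_result: str = None) -> str:
    """Serialize one dialog turn into the plain text fed to the speaker's LM.

    db_result is a "domain;count" string such as "restaurant;3"; it is only
    prepended to system turns, so the model can condition on how many
    database entries match the user's constraints.
    """
    prefix = f"{db_result} " if (role == "system" and db_result) else ""
    return f"{prefix}{utterance}\n\n\n"  # "\n\n\n" marks the end of the utterance

# Delexicalized system utterance: the restaurant name is replaced by its slot name.
print(serialize_turn("user", "I want a cheap restaurant in the north."))
print(serialize_turn("system",
                     "[restaurant_name] is a cheap place in the north part of town.",
                     db_result="restaurant;3"))
```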

4.1.1 Results

We first test our method on the restaurant search dataset CamRest676 (Wen et al., 2017). Table 1 shows all models' results with the ground truth belief state or a generated belief state. We first use the ground truth belief state in all methods to evaluate their response generation quality. ARDM achieves the best BLEU-4 and Success F1 scores. We observe that after fine-tuning GPT-2 on CamRest676, it achieves results similar to the previous state-of-the-art method, Sequicity with reinforcement fine-tuning. This suggests that a pre-trained large-scale language model such as GPT-2 transfers meaningful representations that help fine-tuning.

Model                Entity Match rate   Ground Truth Database State   Generated Database State
                                         BLEU-4     Success F1         BLEU-4     Success F1
Regular Expression   0.960               --         --                 --         --
Sequicity            0.923               21.4       0.852              21.4       0.853
Sequicity (w/o RL)   0.940               22.9       0.821              23.4       0.834
GPT-2-finetune       -                   21.8       0.851              19.2       0.862
ARDM                 -                   26.2       0.864              25.4       0.862
ARDM (50% data)      -                   25.9       0.859              23.4       0.851

Table 1: Results on the CamRest676 dataset. In contrast to Sequicity, our method does not require human annotations of dialog states or dialog acts.

However, without the alternating mechanism, GPT-2 alone does not perform as well as ARDM in terms of both BLEU-4 and Success F1, especially BLEU-4 (a 19% improvement for ARDM). Without the recurrent modeling, the model blends the two speakers' language distributions and ignores the role difference. Moreover, to test whether our model preserves its performance with even less training data, we reduce the training data to 50%, and the performance only drops slightly. With half of the training data, our method still performs significantly better than Sequicity. This result suggests that ARDM is robust in low-resource settings thanks to the large-scale pre-trained language model.

We also evaluate all models with generated belief states instead of ground truth belief states. Sequicity generates belief tracker results, and its Entity Match rate is 0.927. Our model does not have a state tracker, so we write a separate, simple regular expression to extract the occurrences of entities that appear in the database to support our model. Such a state tracker achieves a 0.960 Entity Match rate. This suggests that state tracking may be accomplished in more straightforward ways than training a neural network model on a large set of annotated data. With a simple state tracker, our proposed method still performs better than Sequicity, which trains the belief state and the response generation task jointly.

4.2 MultiWOZ

Here, we only use the ground truth database search result to be consistent with other methods. We perform the delexicalization described in the original MultiWOZ paper (Budzianowski et al., 2018). We prepend the database search results to the system response as conditional input. The database results now also contain information about whether the booking is successful or not (i.e., succeed or fail). Again, we prefix the database results to the system response. Note that we do not use the belief state or dialog act annotations provided by the dataset to train ARDM. We set top-p to 0.2 and the temperature to 0.7.

We normalize the time slot values in all dialogs into the 24-hour format and perform tokenization via spaCy (https://spacy.io/); a small preprocessing sketch is given at the end of this subsection. We found that different papers report results with different versions of the evaluator, which makes it difficult to compare different methods fairly. We explain the differences among all versions of the evaluator in the Appendix. In this paper, we follow LaRL's evaluator implementation, as it is more reasonable than the others. We re-evaluate the results for all methods with the same evaluator to ensure fairness.

We compare our model to the attention-based seq2seq model proposed as the MultiWOZ Baseline (Budzianowski et al., 2018), the HDSA (Chen et al., 2019) model that incorporates dialog act supervision as an inductive prior for the model architecture, and the LaRL (Zhao et al., 2019) model, which leverages latent action modeling and reinforcement learning to improve performance. We do not compare GPT-2-finetune with our model on MultiWOZ because GPT-2-finetune's performance on CamRest676 is significantly worse than our model. Note that GPT-2-finetune is equivalent to ARDM without the alternating parameterization.

The results are evaluated on BLEU-4, Inform Rate, and Success Rate. Inform and Success Rate measure whether the system response provides the recommendations and requested information given in the goal.
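The time normalization and spaCy tokenization mentioned above could look roughly like the sketch below. The regular expression and its rules are illustrative assumptions; the paper only states that time values are normalized to the 24-hour format.

```python
import re
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, no pretrained model needed

def normalize_time(text: str) -> str:
    """Rewrite times such as '5:30 pm' into the 24-hour format '17:30'."""
    def to_24h(match):
        hour = int(match.group(1))
        minute = match.group(2) or "00"
        meridiem = match.group(3).lower()
        if meridiem == "pm" and hour != 12:
            hour += 12
        if meridiem == "am" and hour == 12:
            hour = 0
        return f"{hour:02d}:{minute}"
    return re.sub(r"\b(\d{1,2})(?::(\d{2}))?\s*([ap]m)\b", to_24h, text,
                  flags=re.IGNORECASE)

utterance = "I need a train leaving after 5:30 pm on Friday."
tokens = [token.text for token in nlp(normalize_time(utterance))]
print(tokens)  # tokenized utterance with the time rewritten as 17:30
```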

Model      Dialog State   Dialog Act   Inform (%)   Success (%)   BLEU-4
Human      -              -            98.9         96.5          -
Baseline   ✓              ✗            82.5         72.9          18.9
HDSA       ✓              ✓            87.7         73.4          23.6
LaRL       ✓              ✗            82.8         79.2          12.8
ARDM       ✗              ✗            87.4         72.8          20.6

Table 2: Results on MultiWOZ. Dialog State / Dialog Act denote whether a model leverages dialog state and/or dialog act annotations as supervision. All models use the ground truth dialog state for database search. ARDM, without supervision from annotations, can still achieve comparable results.

4.2.1 Results

The evaluation results are shown in Table 2. Without any supervision from dialog states or dialog acts, ARDM significantly outperforms the MultiWOZ Baseline and LaRL on BLEU-4 and Inform rate, and is on par with HDSA. However, HDSA uses dialog act supervision and a large pre-trained language model, BERT. Our model requires no annotation and achieves similar results. This suggests that our recurrent modeling and large-scale pre-training work similarly to the useful dialog act annotations. All the results show that our method's excellent performance remains consistent in multi-domain dialogs.

We analyze the generated responses and find that if multiple domains have appeared in the conversation history, our model tends to make mistakes in answering the right domain for user requests. This finding suggests that Maximum Likelihood Estimation (MLE) has limitations in directly optimizing the metric, while Reinforcement Learning (RL) can hugely improve task completion in a dialog system. This is why LaRL has a higher Success rate. However, we also observe that LaRL has a low BLEU-4 score, which indicates low readability of its responses. Therefore, there is a trade-off between generation quality and task success rate in the RL setting.

4.3 PersuasionForGood

To showcase ARDM's performance on a dialog dataset where it is much more difficult to obtain belief state and dialog act annotations, we train and evaluate our model on the PersuasionForGood (Wang et al., 2019) dataset. In this dataset, the persuader must persuade an assigned persuadee (i.e., a person who is asked to donate) to donate money (from their task payment) to a charity called "Save the Children".

4.3.1 Setting

This dataset has a much larger vocabulary size (8,141) than the previous task-oriented dialog datasets due to its non-collaborative dialog property. The conversation content is richer because the two speakers are negotiating back and forth. The dataset consists of 1,017 dialogs, of which only 300 are annotated with dialog acts. Therefore, models that require dialog state or dialog act annotations are not applicable to this dataset. As ARDM does not require dialog acts for training, it fits well on the PersuasionForGood dataset. We remove the prefix for system inputs because the task does not involve any database search. We use the TransferTransfo (Wolf et al., 2019) model as a baseline, since it does not require extra human annotations. TransferTransfo is based on a large pre-trained language model, and it uses token type embeddings to encode the role information of the speaker. We fine-tune it on the PersuasionForGood dataset.

To generate diverse responses, we decode the response using nucleus sampling (Holtzman et al., 2019) with a top-p of 0.9 and a temperature of 0.7. It is impossible to conduct an automatic evaluation of task success on this task due to the lack of annotation. We use perplexity to evaluate each model's language generation quality. We also conduct a human evaluation to validate each model's task success rate. We show some generated examples in the Appendix to provide more information on both models' generation quality.

4.3.2 Results

Table 3 shows the results for PersuasionForGood.

Model             Perplexity ↓   Human Preference ↑   Average Donation Amount ↑
TransferTransfo   19.9           34.7%                0.538
ARDM              10.1           65.3%                0.807

Table 3: Automatic evaluation and human evaluation results on the PersuasionForGood dataset.

Because ARDM applies better speaker modeling and a recurrence mechanism, our model achieves lower perplexity compared to TransferTransfo. We did not report BLEU because, on the PersuasionForGood dataset, the BLEU score cannot reflect the actual generation quality: a random sentence with the common tokens "the, of, is, are..." already has a 10.0+ BLEU-1 score. Also, because the validation set only contains 100 samples, the result can have high variance. The only remaining way to evaluate this type of system is by human evaluation.

To comprehensively evaluate each model's performance, we recruit 14 human evaluators to chat with the two persuasive systems ten times to reduce the randomness produced by each model. In total, we collected 140 ratings. We ask them to select a preferred chat-bot and indicate how much they are willing to donate after talking to the chat-bot. As a result, human judges prefer ARDM over TransferTransfo and tend to donate more when talking to the ARDM-produced chat-bot. Our model achieved 27% more donations compared to TransferTransfo. This indicates that our system is more persuasive. In some examples, such as the one in Table 4, our model generates coherent, natural, and persuasive responses.

5 Error Analysis

Since CamRest676 is similar to MultiWOZ in terms of task content and dialog structure, we only describe the errors on MultiWOZ for simplicity. We randomly selected 30 generated error responses from our model with zero inform and success score. To our surprise, we observed that nearly 63.3% of the errors are not really mistakes; this is mainly due to the limitations of the automatic evaluator. For example, at turn one, the user asks about a restaurant, and the ground truth system response is "the [restaurant name] is located at ...", but the generated system response is "what food preference do you have?". Our generated response is correct with respect to the dialog context: it is narrowing down the restaurant choices before providing a recommendation. However, the evaluator sticks to the only possible response it has. Unless the user can dynamically interact with the system, there is no good way to change such mistakes in the automatic evaluator.

We find that another 20% of the errors our model makes occur when the system asks for information the user already provided. This type of error calls for a better history representation. Another 10% of the errors are due to ignoring the user's request for information, such as a phone number. However, when we look at the ground truth responses, some crowd workers also made such errors, so resolving these errors requires a cleaner training dataset. Finally, the remaining 6.7% of errors concern incorrect dialog domain understanding. For example, the user is asking for a hotel, but we present a restaurant recommendation. This is because of data noise during the delexicalization process, in which some domain labels are wrong.

The donation persuasion systems trained with TransferTransfo and our model have some common problems, such as inconsistency, lack of logic, and hallucination. For example, if the persuader provides information about "Save the Children" and the persuadee then asks "Can you tell me more about it?", the system ends up providing the same information as before. It also sometimes makes up facts that have never happened, such as "Save the Children has an operation about a hurricane in Hawaii". All those errors would prevent users from trusting the bot, and therefore result in less donation. However, we also observe that users have a higher tolerance for errors in the persuasion setting than in the customer service setting.

Overall, our model performs better on PersuasionForGood by producing longer and more diversified utterances. This suggests that our model, which utilizes alternating parameters, is more effective than TransferTransfo.

Role   Utterance
SYS    Hi, how are you doing today?
USR    I am good!
SYS    Have you ever heard of the charity Save the Children?
USR    No, I have not heard about it. Would you like to talk about it?
SYS    Save the Children is an international non-governmental organization that promotes children's rights, provides relief and helps support children in developing countries.
USR    That's cool. It is an organization for the good of this world.
SYS    Yes, it is. In the first two months of 2018 alone, 1,000 children were reportedly killed or injured in intensifying violence in Syria.
USR    That's sad. We do need to do something about that.
SYS    One of the things that we can do is make sure that we have a system in place that will ensure that every penny we spend goes directly to the children in need.
USR    I agree.
SYS    How much do you like to donate to the charity now? Your donation will be directly deducted from your task payment. You can choose any amount from $0 to all your payment ($2).
USR    I am willing to donate full my payment $2.
SYS    Wow, that is awesome! That is so kind of you!
USR    You are welcome.

Table 4: A conversation between a machine persuader trained by ARDM and a human persuadee.

6 Discussions and Ethical Consideration

ARDM models speakers separately on top of a large pre-trained language model. Such a simple adaptation demonstrates a substantial performance gain. We suspect this is because the interleaved structure of the two language models provides a collaborative learning framework for modeling both the user and the system language distributions. The memory is the only way for the user and system to communicate, as they do not share any weights in their networks. Thus, the user encoder needs to learn representations useful enough for the system model to understand its intent. Similarly, the system needs to do the same for the user model to improve its understanding. This alternating, repeating process forces both the user and system models to preserve the dialog history effectively in the memory. One can interpret the memory as an implicit representation of belief states or dialog acts.

Another benefit of ARDM is that we obtain both a user and a system utterance generator. We can let the two models talk to each other to generate new self-play dialogs (Silver et al., 2017). With self-play, one can rapidly build a large-scale dialog dataset using adversarial filtering (Zellers et al., 2018). Such models can also potentially be used as user simulators to study complex dialog strategies. We will explore those possibilities in future work.

Persuasion is a double-edged sword. Given the fast development of dialog systems, an ethical design principle must be in place throughout all stages of development and evaluation. We chose the donation task because it is a relatively simple task that benefits children. Second, when deploying persuasive agents in real conversations, we need to keep users informed of the identity of the system. By revealing the identity of the persuasive agent, the user should also have options to communicate directly with the human team behind the system. Lastly, by investigating persuasive dialog systems, we also envision using them as an educational tool for the general public to learn to defend themselves against machine persuasion.

7 Conclusions

We propose the Alternating Recurrent Dialog Model (ARDM), a simple, general, and effective dialog method that models the user and system separately with large-scale pre-trained language models. Since ARDM does not require any annotations, it generalizes to different dialog applications. Experimental results on CamRest676 and MultiWOZ suggest that ARDM outperforms or is on par with the current state-of-the-art methods that use manual annotation information, such as belief states and dialog acts. Furthermore, we find that our model's excellent performance generalizes to more complex, non-collaborative dialog settings: it can generate high-quality responses to persuade people to donate to charity. However, the ease of training ARDM raises concerns about the misuse of the model in scenarios such as sales, harassment, or scams on a mass scale. We caution the public about deploying such systems in the real world.

References

Pawel Budzianowski and Ivan Vulic. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. CoRR, abs/1907.05774.

Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In (Riloff et al., 2018), pages 5016–5026.

Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, and William Yang Wang. 2019. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 3696–3709. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 2978–2988. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, August 15-17, 2017, pages 37–49. Association for Computational Linguistics.

Mihail Eric and Christopher D. Manning. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. CoRR, abs/1701.04024.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the SIGDIAL 2014 Conference, The 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 18-20 June 2014, Philadelphia, PA, USA, pages 263–272. The Association for Computer Linguistics.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Wenqiang Lei, Xisen Jin, Min-Yen Kan, Zhaochun Ren, Xiangnan He, and Dawei Yin. 2018. Sequicity: Simplifying task-oriented dialogue systems with single sequence-to-sequence architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Melbourne, Australia. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors. 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. CoRR, abs/1906.06725.

Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve J. Young. 2017. Latent intention dialogue models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3732–3741. PMLR.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. CoRR, abs/1901.08149.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In (Riloff et al., 2018), pages 93–104.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2204–2213. Association for Computational Linguistics.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1208–1218. Association for Computational Linguistics.
