A Compare Aggregate Transformer for Understanding Document-grounded Dialogue

Longxuan Ma, Weinan Zhang, Runxin Sun, Ting Liu
{lxma,wnzhang,rxsun,tliu}@ir.hit.edu.cn
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
Harbin, Heilongjiang, China

ABSTRACT

Unstructured documents serving as external knowledge for dialogues help to generate more informative responses. Previous research focused on knowledge selection (KS) over the document given the dialogue. However, dialogue history that is not related to the current dialogue may introduce noise into the KS process. In this paper, we propose a Compare Aggregate Transformer (CAT) to jointly denoise the dialogue context and aggregate the document information for response generation. We design two different comparison mechanisms to reduce noise (before and during decoding). In addition, we propose two metrics for evaluating document utilization efficiency based on word overlap. Experimental results on the CMUDoG dataset show that the proposed CAT model outperforms the state-of-the-art approach and strong baselines.

CCS CONCEPTS

• Computing methodologies → Natural language generation; Discourse, dialogue and pragmatics.

KEYWORDS

Dialogue system, Natural language generation, Knowledge selection

Document: Movie Name: The Shape of Water. ... Director: Guillermo del Toro. Genre: Fantasy, Drama. Cast: Sally Hawkins as Elisa Esposito, a mute cleaner who works at a secret government laboratory. ... Critical Response: one of del Toro's most stunningly successful works ...
Dialogue:
S1: I thought The Shape of Water was one of Del Toro's best works. What about you?
S2: Yes, his style really extended the story.
S1: I agree. He has a way with fantasy elements that really helped this story be truly beautiful. It has a very high rating on rotten tomatoes, too.
S2: Sally Hawkins acting was phenomenally expressive. Didn't feel her character was mentally handicapped.
S1: The characterization of her as such was ... off the mark.

Table 1: One DGD example in the CMUDoG dataset. S1/S2 means Speaker-1/Speaker-2, respectively.

1 INTRODUCTION

Dialogue systems (DS) attract great attention from industry and academia because of their wide application prospects. Sequence-to-sequence models (Seq2Seq) [24, 26] have been verified to be an effective framework for the DS task. However, one problem of Seq2Seq models is that they tend to generate generic responses that provide deficient information (Ghazvininejad et al. [5], Li et al. [10]). Previous researchers proposed different methods to alleviate this issue. One way is to focus on models' ability to extract information from conversations. Li et al. [10] introduced Maximum Mutual Information (MMI) as the objective function for generating diverse responses. Serban et al. [25] proposed a latent variable model to capture posterior information of the golden response. Zhao et al. [33] used conditional variational autoencoders to learn discourse-level diversity for neural dialogue models. The other way is to introduce external knowledge, either unstructured knowledge texts (Dinan et al. [4], Ghazvininejad et al. [5], Ye et al. [30]) or structured knowledge triples [13, 31, 36], to help open-domain conversation generation by producing responses conditioned on selected knowledge.

Document-grounded Dialogue (DGD) [11, 34, 37] is a new way to use external knowledge. It establishes a conversation mode in which relevant information can be obtained from the given document. DGD systems can be used in scenarios such as talking over merchandise against the product manual, commenting on news reports, etc. One example of DGD is presented in Table 1: two interlocutors talk about the given document and freely reference its text segments during the conversation.

To address this task, two main challenges need to be considered in a DGD model: 1) determining which of the historical conversations are related to the current conversation; 2) using the current conversation and the related conversation history to select proper document information and to generate an informative response.
Previous work (Arora et al. [2], Qin et al. [20], Ren et al. [21], Tian et al. [27], Zhao et al. [34]) generally focused on selecting knowledge with all the conversations. However, the relationship between historical conversations and the current conversation has not been studied enough. For example, in Table 1, the italic utterance from S2, "Yes, his style really extended the story.", is related to the dialogue history, while the bold utterance from S2, "Sally Hawkins acting was phenomenally expressive. Didn't feel her character was mentally handicapped.", has no direct relationship with the historical utterances. When employing this sentence as the last utterance, the dialogue history is not conducive to generating a response.

In this paper, we propose a novel Transformer-based [28] model for understanding the dialogues and generating informative responses in the DGD, named Compare Aggregate Transformer (CAT). Previous research [22] has shown that the last utterance is the most important guidance for response generation in the multi-turn setting. Hence we divide the dialogue into the last utterance and the dialogue history, then measure the effectiveness of the dialogue history. If the last utterance and the dialogue history are related, we need to consider all the conversations to filter the document information. Otherwise, the existence of the dialogue history amounts to the introduction of noise, and its impact should be eliminated conditionally. For this purpose, on one side, the CAT filters the document information with the last utterance; on the other side, the CAT uses the last utterance to guide the dialogue history and employs the guiding result to filter the given document. We judge the importance of the dialogue history by comparing the two parts, then aggregate the filtered document information to generate the response. Experimental results show that our model can generate more relevant and informative responses than competitive baselines. When the dialogue history is less relevant to the last utterance, our model is verified to be even more effective. The main contributions of this paper are:

(1) We propose a compare aggregate method to determine the relationship between the historical dialogues and the last utterance. Experiments show that our method outperforms strong baselines on the CMUDoG dataset.

(2) We propose two new metrics to evaluate document knowledge utilization in the DGD. They are both based on N-gram overlap among the generated response, the dialogue, and the document.

2 RELATED WORK

The DGD maintains a dialogue pattern where external knowledge can be obtained from the given document. Most recently, several DGD datasets (Gopalakrishnan et al. [6], Moghe et al. [18], Qin et al. [20], Zhou et al. [37]) have been released for exploiting unstructured document information in conversations.

Models trying to address the DGD task can be classified into two categories based on their encoding process with dialogues: one is parallel modeling and the other is incremental modeling. For the first category, Moghe et al. [18] used a generation-based model that learns to copy information from the background knowledge and a span prediction model that predicts the appropriate response span in the background knowledge. Liu et al. [14] claimed to be the first to unify knowledge triples and long texts as a graph, and then employed a reinforcement learning process for flexible multi-hop knowledge graph reasoning. To improve the process of using background knowledge, [32] first adopted the encoder state of the utterance history context as a query to select the most relevant knowledge, and then employed a modified version of BiDAF [23] to point out the most relevant token positions of the background sequence.

However, the relationship between historical conversations and the current conversation is not well studied. In this paper, we propose a compare aggregate method to investigate this problem. It should be pointed out that when the target response changes the topic, the task is to detect whether the topic has ended and to initiate a new topic [1]. We do not study the conversation initiation problem in this paper, although we may take it as future work.

3 THE PROPOSED CAT MODEL

3.1 Problem Statement

The inputs of the CAT model are the given document D = (D_1, D_2, ..., D_d) with d words, the dialogue history H = (H_1, H_2, ..., H_h) with h words, and the last utterance L = (L_1, L_2, ..., L_l) with l words. The task is to generate the response R = (R_1, R_2, ..., R_r) with r tokens with probability:

P(R \mid H, L, D; \Theta) = \prod_{i=1}^{r} P(R_i \mid H, L, D, R_{<i}; \Theta),   (1)

where R_{<i} = (R_1, R_2, ..., R_{i-1}) and \Theta denotes the model parameters.
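To make the factorization in Eq. (1) concrete, the sketch below scores a response token by token. It is a minimal illustration rather than the authors' code; `decoder_step` is a hypothetical stand-in for whatever module produces the next-token distribution from (H, L, D, R_{<i}).

```python
import torch

def response_log_prob(decoder_step, H, L, D, R):
    # Log form of Eq. (1): the product over positions i becomes a
    # sum of per-token log-probabilities log P(R_i | H, L, D, R_{<i}).
    total = torch.tensor(0.0)
    for i in range(len(R)):
        probs = decoder_step(H, L, D, R[:i])  # distribution over the vocabulary
        total = total + torch.log(probs[R[i]])
    return total
```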
3.2 Encoder

The structure of the CAT model is shown in Figure 1. The hidden dimension of the CAT model is h_b. We use the Transformer structure [28]. The self-attention is calculated as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V,   (2)

where Q, K, and V are the query, the key, and the value, respectively, and d_k is the dimension of Q and K. The encoder and the decoder each stack N (N = 3 in our work) identical layers of multi-head attention (MAtt):

\mathrm{MAtt}(Q, K, V) = [A_1, \ldots, A_n] W^{O},   (3)

A_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),   (4)

where W_i^{Q}, W_i^{K}, W_i^{V} (i = 1, ..., n) and W^{O} are learnable parameters.

The encoder of CAT consists of two branches, as shown in Figure 1(a). The left branch learns the information selected by the dialogue history H; the right branch learns the information chosen by the last utterance L. After the self-attention process, we get H_s = MAtt(H, H, H) and L_s = MAtt(L, L, L). Then we employ L_s to guide the history: H_1 = MAtt(L_s, H, H), where H_1 is the hidden state at the first layer.
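As a concrete reading of Eqs. (2)-(4), here is a minimal PyTorch sketch of scaled dot-product attention and the MAtt wrapper. It follows the standard Transformer formulation the paper adopts; fusing the per-head projections into one linear layer per role is a common implementation choice, not something the paper specifies.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Eq. (2): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    return torch.matmul(F.softmax(scores, dim=-1), V)

class MAtt(torch.nn.Module):
    """Multi-head attention, Eqs. (3)-(4): n heads with projections
    W_i^Q, W_i^K, W_i^V, concatenated and mixed by W^O."""
    def __init__(self, hidden_dim, n_heads):
        super().__init__()
        assert hidden_dim % n_heads == 0
        self.n_heads = n_heads
        self.d_k = hidden_dim // n_heads
        self.w_q = torch.nn.Linear(hidden_dim, hidden_dim)
        self.w_k = torch.nn.Linear(hidden_dim, hidden_dim)
        self.w_v = torch.nn.Linear(hidden_dim, hidden_dim)
        self.w_o = torch.nn.Linear(hidden_dim, hidden_dim)  # W^O

    def forward(self, Q, K, V):
        def split(x):  # (batch, seq, hidden) -> (batch, heads, seq, d_k)
            b, s, _ = x.shape
            return x.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        heads = attention(split(self.w_q(Q)), split(self.w_k(K)), split(self.w_v(V)))
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, s, -1)  # [A_1, ..., A_n]
        return self.w_o(concat)
```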
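Building on the MAtt module above, the following sketch walks through the first-layer guiding step just described: self-attention over H and L, then L_s querying the history. The batch size, sequence lengths, and hidden sizes are illustrative assumptions, as is the use of separate MAtt instances for each attention step.

```python
import torch

# Toy inputs: embedded dialogue history H and last utterance L,
# shape (batch, seq_len, hidden). All sizes here are illustrative only.
hidden_dim, n_heads = 512, 8
H = torch.randn(2, 40, hidden_dim)
L = torch.randn(2, 12, hidden_dim)

self_att_h = MAtt(hidden_dim, n_heads)  # separate parameters per attention step
self_att_l = MAtt(hidden_dim, n_heads)
guide_att = MAtt(hidden_dim, n_heads)

H_s = self_att_h(H, H, H)   # self-attention over the history
L_s = self_att_l(L, L, L)   # self-attention over the last utterance
H_1 = guide_att(L_s, H, H)  # L_s queries the history -> (2, 12, hidden_dim)
```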