Conversational Word Embedding for Retrieval-Based Dialog System

Conversational Word Embedding for Retrieval-Based Dialog System Wentao May, Yiming Cuizy, Ting Liuz, Dong Wangy, Shijin Wangyx, Guoping Huy yState Key Laboratory of Cognitive Intelligence, iFLYTEK Research, China zResearch Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, Harbin, China xiFLYTEK AI Research (Hebei), Langfang, China yxfwtma,ymcui,dongwang4,sjwang3,[email protected] zfymcui,[email protected] Abstract 2016; Zhang et al., 2018b). The retrieval-based methods predict the answer based on the similarity Human conversations contain many types of of context and candidate responses, which can be information, e.g., knowledge, common sense, and language habits. In this paper, we pro- divided into single-turn models (Wang et al., 2015) pose a conversational word embedding method and multi-turn models (Wu et al., 2017; Zhou et al., named PR-Embedding, which utilizes the con- 2018; Ma et al., 2019) based on the number of turns versation pairs hpost; replyi 1 to learn word in context. Those methods construct the represen- embedding. Different from previous works, tations of the context and response with a single PR-Embedding uses the vectors from two dif- vector space. Consequently, the models tend to ferent semantic spaces to represent the words select the response with the same words . in post and reply. To catch the information On the other hand, as those static embeddings among the pair, we first introduce the word alignment model from statistical machine can not cope with the phenomenon of polysemy, translation to generate the cross-sentence win- researchers pay more attention to contextual rep- dow, then train the embedding on word-level resentations recently. ELMo (Peters et al., 2018), and sentence-level. We evaluate the method on BERT (Devlin et al., 2019), and XLNet (Yang et al., single-turn and multi-turn response selection 2019) have achieved great success in many NLP tasks for retrieval-based dialog systems. The tasks. However, it is difficult to apply them in the experiment results show that PR-Embedding industrial dialog system due to their low computa- can improve the quality of the selected response. 2 tional efficiency. In this paper, we focus on the static embedding, for it is flexible and efficient. The previous works 1 Introduction learn the embedding from intra-sentence within a Word embedding is one of the most fundamental single space, which is not enough for dialog sys- work in the NLP tasks, where low-dimensional tems. Specifically, the semantic correlation beyond word representations are learned from unlabeled a single sentence in the conversation pair is missing. corpora. The pre-trained embeddings can reflect For example, the words ‘why’ and ‘because’ usu- the semantic and syntactic information of words ally come from different speakers, and we can not and help various downstream tasks get better per- catch their relationship by context window within formance (Collobert et al., 2011; Kim, 2014). the sentence. Furthermore, when the words in post The traditional word embedding methods train and reply are mapped into the same vector space, the models based on the co-occurrence statistics, the model tends to select boring replies with re- such as Word2vec (Mikolov et al., 2013a,b), GloVe peated content because repeated words can easily (Pennington et al., 2014). Those methods are get a high similarity. widely used in dialog systems, not only in retrieval- To tackle this problem, we propose PR- based methods (Wang et al., 2015; Yan et al., 2016) Embedding (Post-Reply Embedding) to learn repre- but also the generation-based models (Serban et al., sentations from the conversation pairs in different spaces. Firstly, we represent the post and the reply 1 In this paper, we name the first utterance in the conversa- in two different spaces similar to the source and tion pair as ‘post,’ and the latter is ‘reply’ 2PR-Embedding source code is available at https:// target languages in the machine translation. Then, github.com/wtma/PR-Embedding . the word alignment model is introduced to gener- 1375 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1375–1380 July 5 - 10, 2020. c 2020 Association for Computational Linguistics Post: P_hi P_, P_where P_are P_you P_from Reply: R_i R_am R_from R_alabama R_, R_how R_about R_you Figure 1: An example of conversational word alignment from the PersonaChat dataset (section 3.1). ‘P ’ and ‘R ’ identify the vocabulary the words come from. For the word ‘where,’ we find the most related word ‘alabama’ based on the alignment model and generate the cross-sentence window with the size of 3 centered on the word. ate the cross-sentence window. Lastly, we train the need to find the most related word from the other embeddings based on the word-level co-occurrence sequence for each word in the pair. In other words, and a sentence-level classification task. we need to build conversational word alignment The main contributions of our work are: (1) we between the post and the reply. propose a new method to learn the conversational In this paper, we solve it by the word alignment word embedding from human dialogue in two dif- model in statistical machine translation (Och and ferent vector spaces; (2) The experimental results Ney, 2003). We treat the post as the source lan- show that PR-Embedding can help the model select guage and the reply as the target language. Then better responses and catch the semantic correlation we align the words in the pair with the word align- among the conversation pair. ment model and generate a cross-sentence window centered on the alignment word. 2 Methods 2.3 Embedding Learning 2.1 Notation We train the conversational word embedding on We consider two vocabularies for the post word and sentence level. p p p p r and the reply V := fv1; v2; :::; vs g;V := Word-level. PR-Embedding learns the word rep- r r r fv1; v2; :::; vsg together with two embedding ma- resentations from the word-level co-occurrence at s×d trices Ep;Er 2 R , where s is the size of the first. Following the previous work (Pennington vocabularity and d is the embedding dimension. et al., 2014), we train the embedding by the global We need to learn the embedding from the conver- log-bilinear regression model sation pair hpost; replyi. They can be formulated T ~ as P = (p1; :::; pm), R = (r1; :::; rn), where m; n wi w~k + bi + bk = log(Xik) (1) are the length of the post and the reply respectively. For each pair in the conversation, we represent the where Xik is the number of times word k occurs in post, reply in two spaces Ep;Er, by which we can the context of word i. w, w~ are the word vector and encode the relationship between the post and reply context word vector, b is the bias. We construct the into the word embeddings. word representations by the summation of w and w~. 2.2 Conversational Word Alignment Sentence-level. To learn the relationship of em- Similar to the previous works (Mikolov et al., beddings from the two spaces, we further train the 2013b; Pennington et al., 2014), we also learn the embedding by a sentence-level classification task. embeddings based on word co-occurrence. The We match the words in the post and reply based difference is that we capture both intra-sentence on the embeddings from word-level learning. Then and cross-sentence co-occurrence. For the single we encode the match features by CNN (Kim, 2014) sentence, the adjacent words usually have a more followed by max-pooling for prediction. We can explicit semantic relation. So we also calculate the formulate it by co-occurrence based on the context window in a M = cosine(p ; r ) (2) fixed size. (i;j) i j ~ However, the relationship among the cross- Mi = tanh(W1 · Mi:i+h−1 + b1) (3) sentence words is no longer related to their dis- ~ m−h+1 ~ M = MaxP oolingi=1 [Mi] (4) tance. As shown in Figure1, the last word in the post ‘from’ is adjacent to the first word ‘i’ in reply, where W1; b1 are trainable parameters, Mi:i+h−1 but they have no apparent semantic relation. So we refers to the concatenation of (Mi; :::; Mi+h−1) and 1376 hits@1 hits@5 hits@10 NDCG NDCG@5 P@1 P@1(s) GloVetrain 12.6 39.6 63.7 GloVetrain 69.97 48.87 51.23 33.48 GloVeemb 18.0 44.6 66.9 DSGemb 70.82 50.45 52.19 35.61 BERTemb 15.4 41.0 62.9 BERTemb 70.06 48.45 51.66 35.08 Fasttext 17.8 44.9 67.2 emb PR-Emb 74.79 58.16 62.03 45.99 PR-Embedding 22.4 60.0 81.1 w/o PR 70.68 50.60 50.48 35.19 IR baseliney 21.4 - - w/o SLL 71.65 52.03 53.48 40.86 Starpacey 31.8 - - Profile Memoryy 31.8 - - Table 2: Experimental results on the Chinese test set. KVMemnn 32.3 62.0 79.2 P@1(s) means only use the response with label ‘good’ +PR-Embedding 35.9 66.1 82.6 as the right one and other metrics treat the label ‘mid- KVMemnn (GloVe) 36.8 68.1 83.6 +PR-Embedding 39.9 72.4 87.0 dle’ and ‘good’ as right. Table 1: Experimental results on the test set of the Tasks. We focus on the response selection tasks PersonaChat dataset. The upper part compares the em- for retrieval-based dialogue systems both in the beddings in the single-turn and the lower one is for the multi-turn task.

Load more