Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network
Xiangyang Zhou∗, Lu Li∗, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao†, Dianhai Yu and Hua Wu
Baidu Inc., Beijing, China
{zhouxiangyang, lilu12, dongdaxiang, liuyi05, chenying04, v_zhaoxin, yudianhai, wu_hua}@baidu.com

∗ Equally contributed.
† Work done as a visiting scholar at Baidu. Wayne Xin Zhao is an associate professor at Renmin University of China and can be reached at batmanfl[email protected].

Abstract

Humans generate responses relying on semantic and functional dependencies, including coreference relations, among dialogue elements and their context. In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), and we extend its attention mechanism in two ways. First, we construct representations of text segments at different granularities solely with stacked self-attention. Second, we try to extract the truly matched segment pairs with attention across the context and response. We jointly introduce those two kinds of attention in one uniform neural network. Experiments on two large-scale multi-turn response selection tasks show that our proposed model significantly outperforms the state-of-the-art models.

1 Introduction

Building a chatbot that can naturally and consistently converse with human beings on open-domain topics has drawn increasing research interest in past years. One important task in chatbots is response selection, which aims to select the best-matched response from a set of candidates given the context of a conversation. Besides playing a critical role in retrieval-based chatbots (Ji et al., 2014), response selection models have been used in the automatic evaluation of dialogue generation (Lowe et al., 2017) and as the discriminator in GAN-based (Generative Adversarial Networks) neural dialogue generation (Li et al., 2017).

Conversation Context
Speaker A: Hi I am looking to see what packages are installed on my system, I don't see a path, is the list being held somewhere else?
Speaker B: Try dpkg --get-selections
Speaker A: What is that like? A database for packages instead of a flat file structure?
Speaker B: dpkg is the debian package manager - get-selections simply shows you what packages are handed by it
Response of Speaker A: No clue what do you need it for, its just reassurance as I don't know the debian package manager

Figure 1: Example of a human conversation on Ubuntu system troubleshooting. Speaker A is seeking a solution for package management on his/her system, and speaker B recommends using the debian package manager, dpkg. But speaker A does not know dpkg and asks a backchannel question (Stolcke et al., 2000), i.e., "no clue what do you need it for?", aiming to double-check whether dpkg could solve his/her problem. Text segments underlined in the same color across context and response can be seen as matched pairs.

Early studies on response selection used only the last utterance in the context for matching a reply, which is referred to as single-turn response selection (Wang et al., 2013). Recent works show that considering a multi-turn context facilitates selecting the next utterance (Zhou et al., 2016; Wu et al., 2017). The reason why richer contextual information works is that human-generated responses depend heavily on previous dialogue segments at different granularities (words, phrases, sentences, etc.), both semantically and functionally, over multiple turns rather than a single turn (Lee et al., 2006; Traum and Heeman, 1996). Figure 1 illustrates semantic connectivities between segment pairs across context and response. As demonstrated, there are generally two kinds of matched segment pairs at different granularities across context and response: (1) surface text relevance, for example the lexical overlap of the words "packages"-"package" and the phrases "debian package manager"-"debian package manager"; and (2) latent dependencies, by which segments are semantically or functionally related to each other, such as the word "it" in the response, which refers to "dpkg" in the context, or the phrase "its just reassurance" in the response, which latently points back to "what packages are installed on my system", the question that speaker A wants to double-check.
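To make the two kinds of matching signals concrete, the sketch below computes a word-word similarity matrix between a context utterance and a response from toy word embeddings. High-valued cells correspond to surface-relevant pairs such as "packages"-"package", while latent pairs such as "it"-"dpkg" get no systematic score from this signal, which is the gap an attention-based model targets. This is an illustrative sketch with made-up embeddings, not code from the paper.

```python
import numpy as np

# Toy word embeddings (dimension 4); the values are illustrative
# assumptions, not trained vectors.
rng = np.random.default_rng(0)
vocab = ["packages", "package", "dpkg", "it", "manager"]
emb = {w: rng.normal(size=4) for w in vocab}
# Make "package" a near-duplicate of "packages" to mimic lexical overlap.
emb["package"] = emb["packages"] + 0.05 * rng.normal(size=4)

utterance = ["packages", "dpkg", "manager"]   # a context utterance
response = ["package", "it"]                  # a candidate response

# Word-word matching matrix M[i, j] = u_i . r_j: the "surface text
# relevance" signal between utterance word i and response word j.
U = np.stack([emb[w] for w in utterance])     # shape (3, 4)
R = np.stack([emb[w] for w in response])      # shape (2, 4)
M = U @ R.T                                   # shape (3, 2)
print(np.round(M, 2))  # lexically overlapping pairs score high; latent pairs need not
```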
Previous studies show that capturing those matched segment pairs at different granularities across context and response is the key to multi-turn response selection (Wu et al., 2017). However, existing models consider only textual relevance, which makes it hard to match a response that latently depends on previous turns. Moreover, Recurrent Neural Networks (RNNs), commonly used for encoding texts (Lowe et al., 2015; Zhou et al., 2016; Wu et al., 2017), are too costly for capturing multi-grained semantic representations. As an alternative, we propose to match a response with its multi-turn context using dependency information based entirely on the attention mechanism. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), which addresses sequence-to-sequence generation using attention alone, and we extend the key attention mechanism of Transformer in two ways (a minimal code sketch of both follows the list):

self-attention: By making a sentence attend to itself, we can capture its intra-sentence word-level dependencies. Phrases, such as "debian package manager", can be modeled with word-level self-attention over word embeddings, and sentence-level representations can be constructed in a similar way with phrase-level self-attention. By hierarchically stacking self-attention from word embeddings, we can gradually construct semantic representations at different granularities.

cross-attention: By making the context and the response attend to each other, we can generally capture the dependencies between those latently matched segment pairs, which provides information complementary to textual relevance for matching a response with its multi-turn context.
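The sketch below illustrates the two extensions with plain scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V (Vaswani et al., 2017): applied with Q = K = V it acts as self-attention, and with queries from one side and keys/values from the other it acts as cross-attention. It is a minimal reconstruction; the paper's full attentive module additionally uses residual connections, layer normalization, and a feed-forward layer, which are omitted here, and all inputs are random toy values.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n_q, d)

d = 8
utterance = np.random.randn(6, d)   # 6 context words, toy embeddings
response = np.random.randn(4, d)    # 4 response words

# Self-attention: the utterance attends to itself, so each word's new
# representation absorbs intra-sentence dependencies (e.g. the words of
# "debian package manager" attend to one another).
utt_l1 = attention(utterance, utterance, utterance)

# Stacking self-attention hierarchically yields representations at
# increasing granularities (word -> phrase -> sentence level).
utt_l2 = attention(utt_l1, utt_l1, utt_l1)

# Cross-attention: the response attends to the context utterance, which
# can surface latent dependencies such as "it" -> "dpkg".
response_cross = attention(response, utterance, utterance)
```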
We jointly introduce self-attention and cross-attention in one uniform neural matching network, namely the Deep Attention Matching Network (DAM), for multi-turn response selection. In practice, DAM takes each single word of an utterance in the context or response as the centric meaning of an abstractive semantic segment, and hierarchically enriches its representation with stacked self-attention, gradually producing more and more sophisticated segment representations surrounding the centric word. Each utterance in the context and the response are matched based on segment pairs at different granularities, considering both textual relevance and dependency information. In this way, DAM captures matching information between the context and the response from word level to sentence level; important matching features are then distilled with convolution and max-pooling operations, and finally fused into one single matching score via a single-layer perceptron.

We test DAM on two large-scale public multi-turn response selection datasets, the Ubuntu Corpus v1 and the Douban Conversation Corpus. Experimental results show that our model significantly outperforms the state-of-the-art models, and the improvement over the best baseline model on R10@1 is over 4%. What is more, DAM is expected to be convenient to deploy in practice, because most attention computation can be fully parallelized (Vaswani et al., 2017).

Our contributions are two-fold: (1) we propose a new matching model for multi-turn response selection with self-attention and cross-attention; (2) empirical results show that our proposed model significantly outperforms state-of-the-art baselines on public datasets, demonstrating the effectiveness of self-attention and cross-attention.

2 Related Work

2.1 Conversational System

Building an automatic conversational agent is a long-cherished goal in Artificial Intelligence (AI) (Turing, 1950). Previous research includes task-oriented dialogue systems, which focus on completing tasks in vertical domains, and chatbots, which aim to consistently and naturally converse with human beings on open-domain topics. Most modern chatbots are data-driven, either in a fashion of information retrieval (Ji et al., 2014; Banchs and Li, 2012; Nio et al., 2014; Ameixa et al., 2014) or sequence generation (Ritter et al., 2011). Retrieval-based systems enjoy the advantage of informative and fluent responses, because they search a large dialogue repository and select the candidate that best matches the current context. Generation-based models, on the other hand, learn patterns of responding from dialogues and can directly generate new responses.

Figure 2: Overview of Deep Attention Matching Network. (Modules, left to right: input word embeddings of the utterances U_i and response R; multi-grained representations built with stacked self-attention; word-word matching with cross-attention, yielding the matrices M_self and M_cross for each utterance-response pair; the 3D matching image Q; and matching aggregation producing the score g(c, r).)

The recently proposed Transformer (Vaswani et al., 2017) is a network architecture for machine translation that is based entirely on attention. Not only can Transformer achieve better translation results than conventional RNN-based models, but it is also very fast in training and prediction, as the computation of attention can be fully parallelized.
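To make the end-to-end flow of Figure 2 concrete, here is a hedged sketch of the matching pipeline described in the introduction: stacked self-attention builds multi-grained representations of each utterance and the response; each utterance-response pair contributes self- and cross-attention matching matrices that stack into the 3D matching image; and the image is reduced to a single score g(c, r). The convolution and max-pooling distillation step is stood in for by simple per-matrix pooling, and the perceptron weights are untrained, so this shows the data flow only, not DAM itself.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (same as the earlier sketch)."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def self_stack(x, depth):
    """`depth` stacked self-attention layers -> depth+1 granularities."""
    reps = [x]
    for _ in range(depth):
        reps.append(attention(reps[-1], reps[-1], reps[-1]))
    return reps

d, depth = 8, 2
context = [np.random.randn(n, d) for n in (5, 7)]  # two utterances (toy)
response = np.random.randn(4, d)

r_reps = self_stack(response, depth)
image = []  # channels of the 3D matching image Q (simplified to a list)
for u in context:
    u_reps = self_stack(u, depth)
    for ur, rr in zip(u_reps, r_reps):
        image.append(ur @ rr.T)  # M_self: textual relevance at this granularity
        # M_cross: dependency-aware matching via cross-attention.
        image.append(attention(ur, rr, rr) @ attention(rr, ur, ur).T)

# The paper distills features from the 3D image with convolution and
# max-pooling; plain max/mean pooling per matrix stands in here.
feats = np.array([[m.max(), m.mean()] for m in image]).ravel()

# Single-layer perceptron fusing the features into one matching score.
w = np.random.randn(feats.size)  # untrained, illustrative weights
score = 1.0 / (1.0 + np.exp(-feats @ w))
print("g(c, r) =", round(float(score), 3))
```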