Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network

Xiangyang Zhou∗, Lu Li∗, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao†, Dianhai Yu and Hua Wu
Baidu Inc., Beijing, China
{zhouxiangyang, lilu12, dongdaxiang, liuyi05, chenying04, v_zhaoxin, yudianhai, wu_hua}@baidu.com

∗ Equally contributed.
† Work done as a visiting scholar at Baidu. Wayne Xin Zhao is an associate professor of Renmin University of China and can be reached at batmanfl[email protected].

Abstract

Humans generate responses relying on semantic and functional dependencies, including coreference relations, among dialogue elements and their context. In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), and we extend the attention mechanism in two ways. First, we construct representations of text segments at different granularities solely with stacked self-attention. Second, we try to extract the truly matched segment pairs with attention across the context and response. We jointly introduce those two kinds of attention in one uniform neural network. Experiments on two large-scale multi-turn response selection tasks show that our proposed model significantly outperforms the state-of-the-art models.

1 Introduction

Building a chatbot that can naturally and consistently converse with human beings on open-domain topics has drawn increasing research interest in past years. One important task in chatbots is response selection, which aims to select the best-matched response from a set of candidates given the context of a conversation. Besides playing a critical role in retrieval-based chatbots (Ji et al., 2014), response selection models have been used in automatic evaluation of dialogue generation (Lowe et al., 2017) and as the discriminator of GAN-based (Generative Adversarial Networks) neural dialogue generation (Li et al., 2017).

Conversation Context
Speaker A: Hi I am looking to see what packages are installed on my system, I don't see a path, is the list being held somewhere else?
Speaker B: Try dpkg --get-selections
Speaker A: What is that like? A database for packages instead of a flat file structure?
Speaker B: dpkg is the debian package manager, --get-selections simply shows you what packages are handed by it
Response of Speaker A: No clue what do you need it for, its just reassurance as I don't know the debian package manager

Figure 1: Example of a human conversation on Ubuntu system troubleshooting. Speaker A is seeking a solution for package management on his/her system and speaker B recommends using the debian package manager, dpkg. But speaker A does not know dpkg, and asks a backchannel question (Stolcke et al., 2000), i.e., "no clue what do you need it for?", aiming to double-check whether dpkg could solve his/her problem. Text segments with underlines in the same color across context and response can be seen as matched pairs.

Early studies on response selection only use the last utterance in the context for matching a reply, which is referred to as single-turn response selection (Wang et al., 2013). Recent works show that considering a multi-turn context can facilitate selecting the next utterance (Zhou et al., 2016; Wu et al., 2017). The reason why richer contextual information works is that human-generated responses are heavily dependent on the previous dialogue segments at different granularities (words, phrases, sentences, etc.), both semantically and functionally, over multiple turns rather than one turn (Lee et al., 2006; Traum and Heeman, 1996). Figure 1 illustrates semantic connectivities between segment pairs across context and response. As demonstrated, there are generally two kinds of matched segment pairs at different granularities across context and response: (1) surface text relevance, for example the lexical overlap of the words "packages"-"package" and the phrases "debian package manager"-"debian package manager";

(2) latent dependencies, where segments are semantically/functionally related to each other, such as the word "it" in the response, which refers to "dpkg" in the context, or the phrase "its just reassurance" in the response, which latently points to "what packages are installed on my system", the question that speaker A wants to double-check.

Previous studies show that capturing those matched segment pairs at different granularities across context and response is the key to multi-turn response selection (Wu et al., 2017). However, existing models only consider textual relevance, which suffers when matching a response that latently depends on previous turns. Moreover, Recurrent Neural Networks (RNNs) are conveniently used for encoding texts, but they are too costly for capturing multi-grained semantic representations (Lowe et al., 2015; Zhou et al., 2016; Wu et al., 2017). As an alternative, we propose to match a response with its multi-turn context using dependency information based entirely on the attention mechanism. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), which addresses sequence-to-sequence generation using only attention, and we extend the key attention mechanism of the Transformer in two ways:

self-attention: By making a sentence attend to itself, we can capture its intra-sentence word-level dependencies. Phrases, such as "debian package manager", can be modeled with word-level self-attention over word embeddings, and sentence-level representations can be constructed in a similar way with phrase-level self-attention. By hierarchically stacking self-attention from word embeddings, we can gradually construct semantic representations at different granularities.

cross-attention: By making context and response attend to each other, we can generally capture dependencies between those latently matched segment pairs, which provides complementary information to textual relevance for matching a response with its multi-turn context.

We jointly introduce self-attention and cross-attention in one uniform neural matching network, namely the Deep Attention Matching Network (DAM), for multi-turn response selection. In practice, DAM takes each single word of an utterance in the context or the response as the centric meaning of an abstractive semantic segment, and hierarchically enriches its representation with stacked self-attention, gradually producing more and more sophisticated segment representations surrounding the centric word. Each utterance in the context and the response are matched based on segment pairs at different granularities, considering both textual relevance and dependency information. In this way, DAM generally captures matching information between the context and the response from word level to sentence level; important matching features are then distilled with convolution and max-pooling operations, and finally fused into one single matching score via a single-layer perceptron.

We test DAM on two large-scale public multi-turn response selection datasets, the Ubuntu Corpus v1 and the Douban Conversation Corpus. Experimental results show that our model significantly outperforms the state-of-the-art models, and the improvement over the best baseline model on R10@1 is over 4%. What is more, DAM is expected to be convenient to deploy in practice because most attention computation can be fully parallelized (Vaswani et al., 2017). Our contributions are two-fold: (1) we propose a new matching model for multi-turn response selection with self-attention and cross-attention; (2) empirical results show that our proposed model significantly outperforms the state-of-the-art baselines on public datasets, demonstrating the effectiveness of self-attention and cross-attention.

2 Related Work

2.1 Conversational System

Building an automatic conversational agent is a long-cherished goal in Artificial Intelligence (AI) (Turing, 1950). Previous research includes task-oriented dialogue systems, which focus on completing tasks in vertical domains, and chatbots, which aim to consistently and naturally converse with human beings on open-domain topics. Most modern chatbots are data-driven, either in the fashion of information retrieval (Ji et al., 2014; Banchs and Li, 2012; Nio et al., 2014; Ameixa et al., 2014) or of sequence generation (Ritter et al., 2011). The retrieval-based systems enjoy the advantage of informative and fluent responses because such a system searches a large dialogue repository and selects the candidate that best matches the current context. The generation-based models, on the other hand, learn patterns of responding from dialogues and can directly generate new responses.

[Figure 2 appears here: the pipeline runs through Input, Representation, Matching and Aggregation stages — word embeddings of each utterance Ui and the response R, multi-grained representations built with stacked self-attention, a word-word matching module with cross-attention producing the self- and cross-attention-match matrices, the 3D matching image Q, and the final matching score g(c, r).]

Figure 2: Overview of Deep Attention Matching Network.

2.2 Response Selection

Research on response selection can be generally categorized into single-turn and multi-turn. Most early studies are single-turn and only consider the last utterance for matching the response (Wang et al., 2013, 2015). Recent works extend this to the multi-turn conversation scenario: Lowe et al. (2015) and Zhou et al. (2016) use RNNs to read the context and the response, take the last hidden states as two semantic vectors representing context and response, and measure their relevance. Instead of only considering the last states of the RNN, Wu et al. (2017) take the hidden state at each time step as a text segment representation, and measure the distance between context and response via segment-segment matching matrices. Nevertheless, matching with dependency information is generally ignored in previous works.

2.3 Attention

Attention has been proven to be very effective in Natural Language Processing (NLP) (Bahdanau et al., 2015; Yin et al., 2016; Lin et al., 2017) and other research areas (Xu et al., 2015). Recently, Vaswani et al. (2017) proposed a novel sequence-to-sequence generation network, the Transformer, which is entirely based on attention. Not only can the Transformer achieve better translation results than conventional RNN-based models, but it is also very fast in training/prediction, as the computation of attention can be fully parallelized. Previous works on the attention mechanism show the superior ability of attention to capture semantic dependencies, which inspires us to improve multi-turn response selection with the attention mechanism.

3 Deep Attention Matching Network

3.1 Problem Formalization

Given a dialogue dataset D = {(c, r, y)_Z}_{Z=1}^{N}, where c = {u_0, ..., u_{n-1}} represents a conversation context with {u_i}_{i=0}^{n-1} as utterances and r as a response candidate, y ∈ {0, 1} is a binary label indicating whether r is a proper response for c. Our goal is to learn a matching model g(c, r) with D, which can measure the relevance between any context c and candidate response r.

3.2 Model Overview

Figure 2 gives an overview of DAM, which generally follows the representation-matching-aggregation framework to match a response with a multi-turn context. For each utterance u_i = [w_{u_i,k}]_{k=0}^{n_{u_i}-1} in a context and its response candidate r = [w_{r,t}]_{t=0}^{n_r-1}, where n_{u_i} and n_r stand for the numbers of words, DAM first looks up a shared word embedding table and represents u_i and r as sequences of word embeddings, namely U_i^0 = [e^0_{u_i,0}, ..., e^0_{u_i,n_{u_i}-1}] and R^0 = [e^0_{r,0}, ..., e^0_{r,n_r-1}] respectively, where e ∈ R^d denotes a d-dimensional word embedding.
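As a concrete illustration of this input layer, the following minimal NumPy sketch (our own illustration, not the released implementation; the toy vocabulary, sizes and variable names are assumptions) builds U_i^0 and R^0 by looking up a shared embedding table for zero-padded word-id sequences.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, d = 1000, 200                 # d: word-embedding size (200, per Section 4.4)
    embedding_table = rng.normal(size=(vocab_size, d))   # shared by context and response

    # a toy context of n = 2 utterances and one response candidate, as zero-padded word ids
    context_ids = np.array([[5, 42, 7, 0, 0],            # u_0
                            [13, 8, 99, 3, 0]])          # u_1
    response_ids = np.array([7, 55, 2, 0, 0])            # r

    U0 = embedding_table[context_ids]   # shape (n, n_ui, d): U_i^0 for every utterance u_i
    R0 = embedding_table[response_ids]  # shape (n_r, d):     R^0 for the response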

A representation module then starts to construct semantic representations at different granularities for u_i and r. Practically, L identical layers of self-attention are hierarchically stacked; each l-th self-attention layer takes the output of the (l-1)-th layer as its input, and composites the input semantic vectors into more sophisticated representations based on self-attention. In this way, multi-grained representations of u_i and r are gradually constructed, denoted as [U_i^l]_{l=0}^{L} and [R^l]_{l=0}^{L} respectively.

Given [U_i^0, ..., U_i^L] and [R^0, ..., R^L], utterance u_i and response r are then matched with each other in the manner of segment-segment similarity matrices. Practically, for each granularity l ∈ [0...L], two kinds of matching matrices are constructed, i.e., the self-attention-match M_self^{u_i,r,l} and the cross-attention-match M_cross^{u_i,r,l}, measuring the relevance between utterance and response with textual information and dependency information respectively.

Those matching scores are finally merged into a 3D matching image Q (we refer to it as Q because it is like a cube). The dimensions of Q index the utterances in the context, the words in each utterance and the words in the response, respectively. Important matching information between segment pairs across the multi-turn context and the candidate response is then extracted via convolution with max-pooling operations, and further fused into one matching score via a single-layer perceptron, representing the matching degree between the response candidate and the whole context.

Specifically, we use a shared component, the Attentive Module, to implement both self-attention in representation and cross-attention in matching. We discuss the implementation of the Attentive Module, and how we use it to implement both self-attention and cross-attention, in the following sections.

3.3 Attentive Module

Figure 3: Attentive Module. (Query, key and value inputs feed a scaled dot-product attention and a weighted sum, followed by a residual addition with normalization, a feed-forward layer, and a second residual addition with normalization.)

Figure 3 shows the structure of the Attentive Module, which is similar to that used in the Transformer (Vaswani et al., 2017). The Attentive Module has three input sentences: the query sentence, the key sentence and the value sentence, namely Q = [e_i]_{i=0}^{n_Q-1}, K = [e_i]_{i=0}^{n_K-1} and V = [e_i]_{i=0}^{n_V-1} respectively, where n_Q, n_K and n_V denote the number of words in each sentence, e_i stands for a d-dimensional embedding, and n_K is equal to n_V. The Attentive Module first takes each word in the query sentence to attend to words in the key sentence via Scaled Dot-Product Attention (Vaswani et al., 2017), then applies those attention results to the value sentence, which is defined as:

Att(Q, K) = \left[ \mathrm{softmax}\left( \frac{Q[i] \cdot K^{T}}{\sqrt{d}} \right) \right]_{i=0}^{n_Q - 1}    (1)

V_{att} = Att(Q, K) \cdot V \in \mathbb{R}^{n_Q \times d}    (2)

where Q[i] is the i-th embedding in the query sentence Q. Each row of V_att, denoted as V_att[i], stores the fused semantic information of words in the value sentence that possibly have dependencies to the i-th word in the query sentence. For each i, V_att[i] and Q[i] are then added up, compositing them into a new representation that contains their joint meanings. A layer normalization operation (Ba et al., 2016) is then applied, which prevents vanishing or exploding gradients. A feed-forward network FFN with ReLU (LeCun et al., 2015) activation is then applied to the normalization result, in order to further process the fused embeddings, defined as:

FFN(x) = \max(0, x W_1 + b_1) W_2 + b_2    (3)

where x is a 2D tensor with the same shape as the query sentence Q, and W_1, b_1, W_2, b_2 are learnt parameters. This kind of activation is empirically useful in other works, and we also adopt it in our model. The result FFN(x) is a 2D tensor that has the same shape as x; FFN(x) is then residually added (He et al., 2016) to x, and the fusion result is normalized as the final output. We refer to the whole Attentive Module as:

AttentiveModule(Q, K, V)    (4)
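To make Eqs. 1-4 concrete, here is a minimal NumPy sketch of the Attentive Module as described above (scaled dot-product attention, residual additions, layer normalization and a ReLU feed-forward network). It is our own sketch, not the released implementation; the parameter arguments and helper names are assumptions, and the learnable gain/bias of layer normalization is omitted for brevity.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # layer normalization (Ba et al., 2016); learnable gain/bias omitted for brevity
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def attentive_module(Q, K, V, W1, b1, W2, b2):
        """AttentiveModule(Q, K, V): Q is (n_Q, d), K is (n_K, d), V is (n_V, d), n_K == n_V."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                 # Eq. 1: scaled dot-products
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        att = np.exp(scores)
        att /= att.sum(axis=-1, keepdims=True)        # row-wise softmax
        V_att = att @ V                               # Eq. 2: one fused row per query word
        x = layer_norm(Q + V_att)                     # residual addition + layer norm
        ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # Eq. 3: ReLU feed-forward network
        return layer_norm(x + ffn)                    # Eq. 4: final residual + layer norm

With this helper, self-attention is simply attentive_module(X, X, X) and cross-attention is attentive_module(U, R, R), as used in the following two subsections.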

As described, the Attentive Module can capture dependencies across the query sentence and the key sentence, and further use that dependency information to composite elements of the query sentence and the value sentence into compositional representations. We exploit this property of the Attentive Module to construct multi-grained semantic representations as well as to match with dependency information.

3.4 Representation

Given U_i^0 or R^0, the word-level embedding representations for utterance u_i or response r, DAM takes U_i^0 or R^0 as input and hierarchically stacks the Attentive Module to construct multi-grained representations of u_i and r, which is formulated as:

U_i^{l+1} = AttentiveModule(U_i^{l}, U_i^{l}, U_i^{l})    (5)

R^{l+1} = AttentiveModule(R^{l}, R^{l}, R^{l})    (6)

where l ranges from 0 to L-1, denoting the different levels of granularity. By this means, the words in each utterance or response repeatedly function together to composite more and more holistic representations; we refer to those multi-grained representations as [U_i^0, ..., U_i^L] and [R^0, ..., R^L] hereafter.

3.5 Utterance-Response Matching

Given [U_i^l]_{l=0}^{L} and [R^l]_{l=0}^{L}, two kinds of segment-segment matching matrices are constructed at each level of granularity l, i.e., the self-attention-match M_self^{u_i,r,l} and the cross-attention-match M_cross^{u_i,r,l}. M_self^{u_i,r,l} is formulated as:

M_{self}^{u_i,r,l} = \{ U_i^{l}[k]^{T} \cdot R^{l}[t] \}_{n_{u_i} \times n_r}    (7)

in which each element of the matrix is the dot-product of U_i^l[k] and R^l[t], the k-th embedding in U_i^l and the t-th embedding in R^l, reflecting the textual relevance between the k-th segment in u_i and the t-th segment in r at the l-th granularity. The cross-attention-match matrix is based on cross-attention, which is defined as:

\tilde{U}_i^{l} = AttentiveModule(U_i^{l}, R^{l}, R^{l})    (8)

\tilde{R}^{l} = AttentiveModule(R^{l}, U_i^{l}, U_i^{l})    (9)

M_{cross}^{u_i,r,l} = \{ \tilde{U}_i^{l}[k]^{T} \cdot \tilde{R}^{l}[t] \}_{n_{u_i} \times n_r}    (10)

where we use the Attentive Module to make U_i^l and R^l crossly attend to each other, constructing two new representations, written as \tilde{U}_i^l and \tilde{R}^l respectively. Both \tilde{U}_i^l and \tilde{R}^l implicitly capture semantic structures that cross the utterance and the response. In this way, inter-dependent segment pairs become close to each other in their representations, and the dot-products between those latently inter-dependent pairs get increased, providing dependency-aware matching information.
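The following sketch strings the attentive_module above into Eqs. 5-10: stacked self-attention for multi-grained representations, plus the self- and cross-attention-match matrices at one granularity. It is again our own illustration under the same assumptions; per-layer parameters are collapsed into a simple params list purely for brevity.

    def multi_grained(X, params):
        """Eqs. 5-6: hierarchically stack L self-attention layers over word-level input X."""
        reps = [X]
        for (W1, b1, W2, b2) in params:        # one learnt parameter set per layer
            X = attentive_module(X, X, X, W1, b1, W2, b2)
            reps.append(X)
        return reps                            # [X^0, X^1, ..., X^L]

    def match_matrices(U_l, R_l, layer_params):
        """Eqs. 7-10: self- and cross-attention-match for one utterance at granularity l."""
        W1, b1, W2, b2 = layer_params
        M_self = U_l @ R_l.T                                        # Eq. 7: textual relevance
        U_tilde = attentive_module(U_l, R_l, R_l, W1, b1, W2, b2)   # Eq. 8
        R_tilde = attentive_module(R_l, U_l, U_l, W1, b1, W2, b2)   # Eq. 9
        M_cross = U_tilde @ R_tilde.T                               # Eq. 10: dependency-aware relevance
        return M_self, M_cross                                      # each of shape (n_ui, n_r)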

3.6 Aggregation

DAM finally aggregates all the segmental matching degrees across each utterance and the response into a 3D matching image Q, which is defined as:

Q = \{ Q_{i,k,t} \}_{n \times n_{u_i} \times n_r}    (11)

where each pixel Q_{i,k,t} is formulated as:

Q_{i,k,t} = [ M_{self}^{u_i,r,l}[k,t] ]_{l=0}^{L} \oplus [ M_{cross}^{u_i,r,l}[k,t] ]_{l=0}^{L}    (12)

where \oplus is the concatenation operation, and each pixel has 2(L+1) channels, storing the matching degrees between one certain segment pair at the different levels of granularity. DAM then leverages two layers of 3D convolution with max-pooling operations to distill important matching features from the whole image. The operation of 3D convolution with max-pooling is the extension of typical 2D convolution, with filters and strides that are 3D cubes (see https://www.tensorflow.org/api_docs/python/tf/nn/conv3d). We finally compute the matching score g(c, r) based on the extracted matching features f_match(c, r) via a single-layer perceptron, which is defined as:

g(c, r) = \sigma( W_3 f_{match}(c, r) + b_3 )    (13)

where W_3 and b_3 are learnt parameters, and \sigma is the sigmoid function that gives the probability that r is a proper candidate for c. The loss function of DAM is the negative log-likelihood, defined as:

p(y|c, r) = g(c, r) y + (1 - g(c, r))(1 - y)    (14)

L(\cdot) = - \sum_{(c,r,y) \in D} \log p(y|c, r)    (15)
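The aggregation step can be sketched with standard Keras layers. This is not the released implementation: the padding and activation choices below are our assumptions, but the filter counts, kernel sizes, strides and pooling settings follow the values reported in Section 4.4, the final sigmoid unit implements Eq. 13, and training this head with binary cross-entropy corresponds to the negative log-likelihood of Eqs. 14-15.

    import tensorflow as tf

    def aggregation_head(L):
        """Distill matching features from the 3D image Q and compute g(c, r).

        Q is shaped (batch, n, n_ui, n_r, 2 * (L + 1)): one channel per matching matrix.
        """
        return tf.keras.Sequential([
            tf.keras.layers.Conv3D(32, (3, 3, 3), strides=(1, 1, 1),
                                   padding="same", activation="relu"),
            tf.keras.layers.MaxPool3D(pool_size=(3, 3, 3), strides=(3, 3, 3)),
            tf.keras.layers.Conv3D(16, (3, 3, 3), strides=(1, 1, 1),
                                   padding="same", activation="relu"),
            tf.keras.layers.MaxPool3D(pool_size=(3, 3, 3), strides=(3, 3, 3)),
            tf.keras.layers.Flatten(),                       # f_match(c, r)
            tf.keras.layers.Dense(1, activation="sigmoid"),  # Eq. 13: single-layer perceptron
        ])

    # model = aggregation_head(L=5)
    # model.compile(optimizer="adam", loss="binary_crossentropy")  # Eqs. 14-15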

4 Experiment

                        |        Ubuntu Corpus           |       Douban Conversation Corpus
                        | R2@1   R10@1  R10@2  R10@5     | MAP    MRR    P@1    R10@1  R10@2  R10@5
    DualEncoder_lstm    | 0.901  0.638  0.784  0.949     | 0.485  0.527  0.320  0.187  0.343  0.720
    DualEncoder_bilstm  | 0.895  0.630  0.780  0.944     | 0.479  0.514  0.313  0.184  0.330  0.716
    MV-LSTM             | 0.906  0.653  0.804  0.946     | 0.498  0.538  0.348  0.202  0.351  0.710
    Match-LSTM          | 0.904  0.653  0.799  0.944     | 0.500  0.537  0.345  0.202  0.348  0.720
    Multiview           | 0.908  0.662  0.801  0.951     | 0.505  0.543  0.342  0.202  0.350  0.729
    DL2R                | 0.899  0.626  0.783  0.944     | 0.488  0.527  0.330  0.193  0.342  0.705
    SMN_dynamic         | 0.926  0.726  0.847  0.961     | 0.529  0.569  0.397  0.233  0.396  0.724
    DAM                 | 0.938  0.767  0.874  0.969     | 0.550  0.601  0.427  0.254  0.410  0.757
    DAM_first           | 0.927  0.736  0.854  0.962     | 0.528  0.579  0.400  0.229  0.396  0.741
    DAM_last            | 0.932  0.752  0.861  0.965     | 0.539  0.583  0.408  0.242  0.407  0.748
    DAM_self            | 0.931  0.741  0.859  0.964     | 0.527  0.574  0.382  0.221  0.403  0.750
    DAM_cross           | 0.932  0.749  0.863  0.966     | 0.535  0.585  0.400  0.234  0.411  0.733

Table 1: Experimental results of DAM and other comparison approaches on Ubuntu Corpus V1 and Douban Conversation Corpus.

4.1 Dataset

We test DAM on two public multi-turn response selection datasets, the Ubuntu Corpus V1 (Lowe et al., 2015) and the Douban Conversation Corpus (Wu et al., 2017). The former contains multi-turn dialogues about Ubuntu system troubleshooting in English, and the latter is crawled from a Chinese social network on open-domain topics. The Ubuntu training set contains 0.5 million multi-turn contexts, and each context has one positive response generated by a human and one negative response that is randomly sampled. Both the validation and testing sets of the Ubuntu Corpus have 50k contexts, where each context is provided with one positive response and nine negative replies. The Douban corpus is constructed in a similar way to the Ubuntu Corpus, except that its validation set contains 50k instances with a 1:1 positive-negative ratio and its testing set consists of 10k instances, where each context has 10 candidate responses, collected via a tiny inverted-index system (Lucene, https://lucene.apache.org/), and labels are manually annotated.

4.2 Evaluation Metric

We use the same evaluation metrics as in previous works (Wu et al., 2017). Each comparison model is asked to select the k best-matched responses from n available candidates for the given conversation context c, and we calculate the recall of the true positive replies among the k selected ones as the main evaluation metric, denoted as

R_n@k = \frac{\sum_{i=1}^{k} y_i}{\sum_{i=1}^{n} y_i}

where y_i is the binary label of each candidate. In addition to R_n@k, we use MAP (Mean Average Precision) (Baeza-Yates et al., 1999), MRR (Mean Reciprocal Rank) (Voorhees et al., 1999), and precision-at-one (P@1), the latter especially for the Douban corpus, following the setting of previous works (Wu et al., 2017).
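For clarity, here is a small sketch of these ranking metrics for a single context, assuming a list of model scores and binary labels over its n candidates (the helper names are ours); corpus-level R_n@k, P@1, MAP and MRR are obtained by averaging the per-context values over all test contexts.

    import numpy as np

    def ranked_labels(scores, labels):
        """Sort a context's binary labels by descending model score g(c, r)."""
        order = np.argsort(-np.asarray(scores))
        return np.asarray(labels)[order]

    def recall_at_k(scores, labels, k):
        ranked = ranked_labels(scores, labels)
        return ranked[:k].sum() / ranked.sum()          # R_n@k (assumes >= 1 positive)

    def precision_at_1(scores, labels):
        return float(ranked_labels(scores, labels)[0])  # P@1

    def reciprocal_rank(scores, labels):
        ranked = ranked_labels(scores, labels)
        return 1.0 / (np.flatnonzero(ranked)[0] + 1)    # averaged over contexts -> MRR

    def average_precision(scores, labels):
        ranked = ranked_labels(scores, labels)
        precisions = np.cumsum(ranked) / (np.arange(len(ranked)) + 1)
        return (precisions * ranked).sum() / ranked.sum()   # averaged over contexts -> MAP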

4.3 Comparison Methods

RNN-based models: The previous best-performing models are based on RNNs. We choose representative models as baselines, including SMN_dynamic (Wu et al., 2017), Multiview (Zhou et al., 2016), DualEncoder_lstm and DualEncoder_bilstm (Lowe et al., 2015), DL2R (Yan et al., 2016), Match-LSTM (Wang and Jiang, 2017) and MV-LSTM (Pang et al., 2016). Among them SMN_dynamic achieves the best scores against all other published works, and we take it as our state-of-the-art baseline.

Ablation: To verify the effects of multi-grained representation, we set up two comparison models, i.e., DAM_first and DAM_last, which dispense with the multi-grained representations in DAM and instead use the representation results from the 0th layer and the Lth layer of self-attention respectively. Moreover, we set up DAM_self and DAM_cross, which only use the self-attention-match or the cross-attention-match respectively, in order to examine the effectiveness of each.

4.4 Model Training

We copy the reported evaluation results of all baselines for comparison. DAM is implemented in TensorFlow (https://www.tensorflow.org; our code and data will be available at https://github.com/baidu/Dialogue/DAM), and the vocabularies and word embedding sizes for the Ubuntu corpus and the Douban corpus are all set the same as in SMN (Wu et al., 2017). We consider at most 9 turns and 50 words for each utterance (response) in our experiments; word embeddings are pre-trained on the training sets via word2vec (Mikolov et al., 2013), similar to previous works. We use zero-padding to handle the variable-sized input, and the parameter sizes in FFN are set to 200, the same as the word-embedding size. We tested stacking 1-7 self-attention layers, and report our results with 5 stacks of self-attention because it gains the best scores on the validation set. The 1st convolution layer has 32 [3,3,3] filters with [1,1,1] stride, and its max-pooling size is [3,3,3] with [3,3,3] stride. The 2nd convolution layer has 16 [3,3,3] filters with [1,1,1] stride, and its max-pooling size is also [3,3,3] with [3,3,3] stride. We tune DAM and the other ablation models with the Adam optimizer (Le et al., 2011) to minimize the loss function defined in Eq. 15. The learning rate is initialized as 1e-3 and gradually decreased during training, and the batch size is 256. We use the validation sets to select the best models and report their performances on the test sets.

4.5 Experiment Result

Table 1 shows the evaluation results of DAM as well as all comparison models. As demonstrated, DAM significantly outperforms the other competitors on both the Ubuntu Corpus and the Douban Conversation Corpus, including SMN_dynamic, the state-of-the-art baseline, demonstrating the superior power of the attention mechanism in matching a response with a multi-turn context. Besides, the performances of both DAM_first and DAM_self decrease a lot compared with DAM, which shows the effectiveness of self-attention and cross-attention. Both DAM_first and DAM_last underperform DAM, which demonstrates the benefit of using multi-grained representations. Also, the absence of the self-attention-match brings down the precision, as shown by DAM_cross, exhibiting the necessity of jointly considering textual relevance and dependency information in response selection.

One notable point is that, while DAM_first is able to achieve performance close to SMN_dynamic, it is about 2.3 times faster than SMN_dynamic in our implementation, as it is very simple in computation. We believe that DAM_first is more suitable for scenarios that have limitations in computation time or memory but require high precision, such as industrial applications, or for working as a component in other neural networks like GANs.

5 Analysis

We use the Ubuntu Corpus to analyze how self-attention and cross-attention work in DAM, through both quantitative analysis and visualization.

5.1 Quantitative Analysis

We first study how DAM performs with different numbers of utterances in the context. The left part of Figure 4 shows the changes of R10@1 on the Ubuntu Corpus across contexts with different numbers of utterances. As demonstrated, while being good at matching a response with a long context that has more than 4 utterances, DAM can still stably deal with short contexts that only have 2 turns.

Figure 4: DAM's performance on the Ubuntu Corpus across different contexts. The left part shows the performance for different numbers of utterances in the context (x-axis: number of turns in context). The right part shows the performance for different average utterance text lengths of the context (x-axis: average number of words in each turn) as well as different self-attention stack depths.

Moreover, the right part of Figure 4 compares performance across contexts with different average utterance text lengths and self-attention stack depths. As demonstrated, stacking self-attention consistently improves matching performance for contexts with different average utterance text lengths, implying the stability advantage of using multi-grained semantic representations. The performance of matching short utterances, those with fewer than 10 words, is obviously lower than for the longer ones. This is because the shorter the utterance text is, the less information it contains, and the more difficult selecting the next utterance becomes, while stacking self-attention can still help in this case. However, for long utterances, e.g., those containing more than 30 words, stacking self-attention significantly improves the matching performance, which means that the more information an utterance contains, the more stacked self-attention it needs to capture its intra-sentence semantic structures.

Figure 5: Visualization of the self-attention-match and cross-attention-match matrices, as well as the distributions of self-attention and cross-attention, in matching the response with the first utterance (turn 0) in Figure 1. Panels show the self-attention-match in stacks 0, 2 and 4, the cross-attention-match in stack 4, the self-attention of the response and of turn 0, and the attention of the response over turn 0 and of turn 0 over the response. Each colored grid represents the matching degree or attention score between two words; the deeper the color, the more important the grid.

5.2 Visualization

We study the case in Figure 1 to analyze in detail how self-attention and cross-attention work. Practically, we apply a softmax operation over the self-attention-match and cross-attention-match matrices, to examine the variance of dominating matching pairs while stacking self-attention or applying cross-attention. Due to space limitations, Figure 5 gives the visualization results of the 0th, 2nd and 4th self-attention-match matrices, the 4th cross-attention-match matrix, as well as the distributions of self-attention and cross-attention in the 4th layer in matching the response with the first utterance (turn 0). As demonstrated, important matching pairs in the self-attention-match in stack 0 are nouns and verbs, like "package" and "packages", which are similar in topic. However, matching scores between preposition or pronoun pairs, such as "do" and "what", become more important in the self-attention-match in stack 4. The visualization of self-attention shows why matching between prepositions or pronouns matters: self-attention generally captures the semantic structure of "no clue what do you need package manager" for "do" in the response and "what packages are installed" for "what" in the utterance, making the segments surrounding "do" and "what" close to each other in representation, which increases their dot-product results.

Also as shown in Figure 5, the self-attention-match and the cross-attention-match capture complementary information in matching an utterance with a response. Words like "reassurance" and "its" in the response get significantly larger matching scores in the cross-attention-match than in the self-attention-match. According to the visualization of cross-attention, "reassurance" generally depends on "system", "don't" and "held" in the utterance, which makes it close to words like "list", "installed" or "held" in the utterance. Scores of the cross-attention-match tend to centralize on several segments, which probably means that those segments in the response generally capture structure-semantic information across the utterance and the response, amplifying their matching scores against the others.

5.3 Error Analysis

To understand the limitations of DAM and where future improvements might lie, we analyze 100 strong bad cases from the test set that fail in R10@5. We find two major kinds of bad cases: (1) fuzzy-candidate, where response candidates are basically proper for the conversation context except for a few improper details; and (2) logical-error, where response candidates are wrong due to logical mismatch.

For example, given a conversation context A: "I just want to stay at home tomorrow.", B: "Why not go hiking? I can go with you.", a response candidate like "Sure, I was planning to go out tomorrow." is logically wrong because it contradicts the first utterance of speaker A. We believe that generating adversarial examples during the training procedure, rather than randomly sampling negatives, may be a good way to address both fuzzy-candidate and logical-error cases, and capturing the logic-level information hidden behind conversation text is also worth studying in the future.

6 Conclusion

In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution extends the attention mechanism of the Transformer in two ways: (1) using stacked self-attention to harvest multi-grained semantic representations, and (2) utilizing cross-attention to match with dependency information. Empirical results on two large-scale datasets demonstrate the effectiveness of self-attention and cross-attention in multi-turn response selection. We believe that both self-attention and cross-attention could benefit other research areas, including spoken language understanding, dialogue state tracking and seq2seq dialogue generation. We would like to explore in depth how attention can help improve neural dialogue modeling for both chatbots and task-oriented dialogue systems in our future work.

Acknowledgement

We gratefully thank the anonymous reviewers for their insightful comments. This work is supported by the National Basic Research Program of China (973 program, No. 2014CB340505).

References

David Ameixa, Luisa Coheur, Pedro Fialho, and Paulo Quaresma. 2014. Luke, I am your father: dealing with out-of-domain requests by using movies subtitles. In International Conference on Intelligent Virtual Agents, pages 13-21. Springer.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. 1999. Modern information retrieval, volume 463. ACM Press, New York.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.

Rafael E Banchs and Haizhou Li. 2012. IRIS: a chat-oriented dialogue system based on the vector space model. In Proceedings of the ACL 2012 System Demonstrations, pages 37-42. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Zongcheng Ji, Zhengdong Lu, and Hang Li. 2014. An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988.

Quoc V Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y Ng. 2011. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265-272.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436-444.

Alan Lee, Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, and Bonnie Webber. 2006. Complexity of dependencies in discourse: Are dependencies in discourse more complex than in syntax. In Proceedings of the 5th International Workshop on Treebanks and Linguistic Theories, Prague, Czech Republic, page 12.

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In EMNLP, pages 372-381.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. International Conference on Learning Representations.

Ryan Lowe, Michael Noseworthy, Iulian V Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In ACL, pages 372-381.

Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285-294.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, and Satoshi Nakamura. 2014. Developing non-goal dialog system based on examples of drama television. In Natural Interaction with Robots, Knowbots and Smartphones, pages 355-361. Springer.

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In AAAI, pages 2793-2799.

Alan Ritter, Colin Cherry, and William B Dolan. 2011. Data-driven response generation in social media. In Proc. EMNLP, pages 583-593. Association for Computational Linguistics.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339-373.

David R Traum and Peter A Heeman. 1996. Utterance units in spoken dialogue. In Workshop on Dialogue Processing in Spoken Language Systems, pages 125-140.

Alan M Turing. 1950. Computing machinery and intelligence. Mind, 59(236):433-460.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, pages 77-82.

Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen. 2013. A dataset for research on short-text conversations. In EMNLP, pages 935–945.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2015. Syntax-based deep matching of short texts. International Joint Conferences on Artificial Intelligence.

Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. International Conference on Learning Representations.

Yu Wu, Wei Wu, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In ACL, pages 372-381.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048-2057.

Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55-64.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association of Computational Linguistics, 4(1):259-272.

Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. 2016. Multi-view response selection for human-computer conversation. In EMNLP, pages 372-381.