Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics

Martin Boyanov, Ivan Koychev
Faculty of Mathematics and Informatics
Sofia University "St. Kliment Ohridski"
Sofia, Bulgaria
[email protected], [email protected]

Preslav Nakov, Alessandro Moschitti, Giovanni Da San Martino
Qatar Computing Research Institute
HBKU, Doha, Qatar
{pnakov, amoschitti}@hbku.edu.qa, [email protected]

Abstract

We propose to use question answering (QA) data from Web forums to train chatbots from scratch, i.e., without dialog training data. First, we extract pairs of question and answer sentences from the typically much longer texts of questions and answers in a forum. We then use these shorter texts to train seq2seq models in a more efficient way. We further improve the parameter optimization using a new model selection strategy based on QA measures. Finally, we propose to use extrinsic evaluation with respect to a QA task as an automatic evaluation method for chatbots. The evaluation shows that the model achieves a MAP of 63.5% on the extrinsic task. Moreover, it can answer correctly 49.5% of the questions when they are similar to questions asked in the forum, and 47.3% of the questions when they are more conversational in style.

1 Introduction

Recently, companies active in diversified business ecosystems have become more and more interested in intelligent methods for interacting with their customers, and even with their employees. Thus, we have seen the development of several general-purpose personal assistants such as Amazon's Alexa, Apple's Siri, Google's Assistant, and Microsoft's Cortana. However, being general-purpose, they are not a good fit for every specific need, e.g., an insurance company that wants to interact with its customers would need a new system trained on specific data; thus, there is a need for specialized assistants. This aspect is a critical bottleneck as such systems must be engineered from scratch. Very recently, models based on neural networks have been developed, e.g., using seq2seq models (Vinyals and Le, 2015). Such models provide shallow solutions, but at the same time are easy to train, provided that a large amount of dialog data is available. Unfortunately, the latter is a critical bottleneck as (i) the specificity of the domain requires the creation of new data; and (ii) this process is rather costly in terms of human effort and time.

Many real-world businesses aiming at acquiring such technology are associated with customer services, e.g., helpdesks or forums, where question answering (QA) sections are often provided, sometimes with user evaluation. Although this data does not follow a dialog format, it is still useful to extract pairs of questions and answers, which are essential to train seq2seq models. Typically, forum or customer care sections contain a lot of content, and thus the requirement about having large datasets is not an issue. The major problem comes from the quality of the text in the pairs that we can extract automatically. One solution is to select data using crowdsourcing, but the task will still be very costly given the required size (hundreds of thousands of pairs) and its complexity.

In this paper, we propose to use data extracted from a standard question answering forum for training chatbots from scratch. The main problem in using such data is that the questions and their associated forum answers are noisy, i.e., not all answers are good. Moreover, many questions and answers are very long, e.g., they can span several paragraphs. This prevents training effective seq2seq models, which can only manage (i.e., achieve effective decoding for) short pieces of text.

We tackle these problems by selecting a pair of sentences from each question–answer pair, using the dot product over averaged word embedding representations. The similarity works both (i) as a filter of noisy text, as the probability that random noise occurs in the same manner in both the question and the answer is very low, and (ii) as a selector of the most salient part of the user communication through the QA interaction.

We further design several approaches to model selection and to the evaluation of the output of the seq2seq models. The main idea is, given a question, (i) to build a classical vector representation of the utterance answered by the model, and (ii) to evaluate it by ranking the answers to the question provided by the forum users. We rank them using several metrics, e.g., the dot product between the utterance and a target answer. This way, we can use the small training, development and test data from a SemEval task (Nakov et al., 2016b) to indirectly evaluate the quality of the utterance in terms of Mean Average Precision (MAP). Moreover, we use this evaluation in order to select the best model on the development set, while training seq2seq models.

We evaluate our approach using (i) our new MAP-based extrinsic automatic evaluation on the SemEval test data, and (ii) manual evaluation carried out by four different annotators on two sets of questions: from the forum and completely new ones, which are more conversational but still related to the topics discussed in the forum (life in Qatar). The results of our experiments demonstrate that our models can learn well from forum data, achieving MAP of 63.45% on the SemEval task and accuracy of 49.50% on manual evaluation. Moreover, the accuracy on new, conversational questions drops very little, to 47.25%, according to our manual evaluation.

2 Related Work

Nowadays, there are two main types of dialog systems: sequence-to-sequence and retrieval-based. Here we focus on the former. Seq2seq is a kind of neural network architecture, initially proposed for machine translation (Sutskever et al., 2014; Cho et al., 2014). Since then, it has been applied to other tasks such as text summarization (See et al., 2017), image captioning (Vinyals et al., 2017), and, of course, dialog modeling (Shang et al., 2015; Li et al., 2016; Gu et al., 2016). The initial seq2seq model assumed that the semantics of the input sequence can be encoded in a single vector, which is hard, especially for longer inputs. Thus, attention mechanisms have been introduced (Bahdanau et al., 2015). This is what we use here as well.

Training seq2seq models for dialog requires large conversational corpora such as Ubuntu (Lowe et al., 2015). Unstructured conversations, e.g., from Twitter, have been used as well (Sordoni et al., 2015). See (Serban et al., 2015) for a survey of corpora for dialog. Unlike typical dialog data, here we extract, filter, and use question–answer pairs from a Web forum.

An important issue with the general seq2seq model is that it tends to generate general answers like "I don't know", which can be given to many questions. This has triggered researchers to explore diversity promotion objectives (Li et al., 2016). Here, we propose a different idea: select training data based on performance with respect to question answering, and also optimize with respect to a question answering task, where giving general answers would be penalized.

It is not clear how dialog systems should be evaluated automatically, but it is common practice to use BLEU (Papineni et al., 2002), and sometimes Meteor (Lavie and Agarwal, 2007): after all, seq2seq models have been proposed for machine translation (MT), so it is natural to try MT evaluation metrics for seq2seq-based dialog systems as well. However, it has been shown that BLEU, as well as some other popular ways to evaluate a dialog system, do not correlate well with human judgments (Liu et al., 2016). Therefore, here we propose to do model selection as well as evaluation extrinsically, with respect to a related task: Community Question Answering.

3 Data Creation

In order to train our chatbot system, we converted an entire Community Question Answering forum into a set of question–answer pairs, containing only one selected sentence for each question and for each answer.¹ We then used these selected pairs in order to train our seq2seq models. Below, we describe in detail our data selection method along with our approach to question–answer sentence pair selection.

¹ We released the data here: http://goo.gl/e6UWV6

3.1 Forum Data Description

We used data from a SemEval task on Community Question Answering (Nakov et al., 2015, 2016b, 2017). The data consists of questions from the Qatar Living forum² and a (potentially truncated) thread of answers for each question. Each answer is annotated as Good, Potentially Useful or Bad, depending on whether it answers the question well, does not answer it well but gives some potentially useful information, or does not address the question at all (e.g., talks about something unrelated, asks a new question, is part of a conversation between the forum users, etc.). The goal of the task is to rank the answers so that Good answers are ranked higher than Potentially Useful and Bad ones. The participating systems are evaluated using Mean Average Precision (MAP) as the official evaluation metric.

The data for SemEval-2016 Task 3, subtask A comes split into training, development and test parts with 2,669/17,900, 500/2,440 and 700/3,270 questions/answers, respectively. In addition, the task organizers provided raw unannotated data, which contains 200K questions and 2M answers. Thus, our QA data consists of roughly 2M answers extracted from the forum. We paired each of these answers with the corresponding question in order to make training question–answer pairs for our seq2seq system. We made sure that the development and the testing datasets for SemEval-2016 Task 3 were excluded from this set of training question–answer pairs.

We used the annotated development and test sets both to carry out our new model selection, as explained in Section 4.2, and our new evaluation, as described in Section 5. Both model selection and our proposed evaluation are based on extrinsic evaluation with respect to the SemEval task.

3.2 Sentence Pair Selection

As the questions and the comments³ in Qatar Living can be quite long, we reduced the question–answer pairs to single-sentence pairs. In particular, given a question–answer pair, we first split the question and the answer from the pair into individual sentences, and then we computed the similarity between each sentence from the question and each sentence from the answer. Ultimately, we kept the most similar pair.

We measured the similarity between two sentences based on the cosine between their embeddings. We computed the latter as the average of the embeddings of the words in a sentence. We used pre-trained word2vec embeddings (Mikolov et al., 2013a,b),⁴ fine-tuned for Qatar Living (Mihaylov and Nakov, 2016), which proved useful in a number of experiments with this dataset (Guzmán et al., 2016; Hoque et al., 2016; Mihaylov et al., 2017; Mihaylova et al., 2016; Nakov et al., 2016a).

More specifically, we generated the vector representation for each sentence by averaging 300-dimensional word2vec vector representations after stopword removal. We assigned a weight to the word2vec vectors with TF×IDF, where IDF is derived from the entire dataset. Note that averaging has the consequence of ignoring the word order in the sentence. We leave for future work the exploration of more sophisticated sentence representation models, e.g., based on long short-term memory (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (Kim, 2014).

4 Model Selection and Evaluation

In this section, we describe our approach to automatic evaluation as well as model selection for seq2seq models.

4.1 Evaluation

Intrinsic evaluation. We evaluated our model intrinsically using BLEU, as is traditionally done in dialog systems.

Extrinsic evaluation. We further performed extrinsic evaluation in terms of how much the answers we generate can help solve the SemEval CQA task. In particular, we input each of the test questions from SemEval to the trained seq2seq model, and we obtained the generated answer. Then, we calculated the similarity, e.g., TF×IDF-based cosine (see below for more detail), between that seq2seq-generated answer and each of the answers in the thread, and we ranked the answers in the thread based on this similarity. Finally, we calculated MAP for the resulting ranking, which evaluates how well we do at ranking the Good answers higher than the not-Good ones (i.e., Bad or Potentially Useful). As a baseline, we used the MAP of the ranking produced by comparing the answers to the question (instead of to the generated answer).
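To make the sentence pair selection of Section 3.2 concrete, the following is a minimal sketch, not the released code of the paper. It assumes whitespace tokenization, a word_vectors dictionary holding the fine-tuned 300-dimensional word2vec vectors, and a precomputed idf table derived from the full forum data; all of these names are illustrative.

    import numpy as np

    def sentence_vector(tokens, word_vectors, idf, dim=300):
        # TF x IDF weighted average of the word2vec vectors of the tokens;
        # stopwords are assumed to have been removed beforehand.
        acc, total = np.zeros(dim), 0.0
        counts = {w: tokens.count(w) for w in set(tokens)}
        for word, tf in counts.items():
            if word in word_vectors and word in idf:
                weight = tf * idf[word]
                acc += weight * word_vectors[word]
                total += weight
        return acc / total if total > 0 else acc

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(np.dot(u, v) / denom) if denom > 0 else 0.0

    def select_pair(question_sents, answer_sents, word_vectors, idf):
        # Keep the single most similar (question sentence, answer sentence) pair.
        best, best_score = None, -1.0
        for q in question_sents:
            q_vec = sentence_vector(q.split(), word_vectors, idf)
            for a in answer_sents:
                a_vec = sentence_vector(a.split(), word_vectors, idf)
                score = cosine(q_vec, a_vec)
                if score > best_score:
                    best, best_score = (q, a), score
        return best

The TF×IDF weighting gives more mass to content words, which is what lets the selected pair capture the salient part of the question and of the answer.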

² Qatar Living: http://www.qatarliving.com
³ In our forum, the comments are considered as answers.
⁴ https://github.com/tbmihailov/semeval2016-task3-CQA#resources
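The extrinsic evaluation described in Section 4.1 can be sketched as follows; this is an illustrative outline rather than the authors' implementation. The generate callable stands in for the trained seq2seq model and similarity for one of the scoring functions of Section 4.2 (e.g., the TF×IDF cosine); both names are assumptions made for the example.

    def average_precision(ranked_labels):
        # AP for one thread: ranked_labels holds True for Good comments,
        # in the order induced by the similarity to the generated answer.
        hits, precision_sum = 0, 0.0
        for rank, is_good in enumerate(ranked_labels, start=1):
            if is_good:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / hits if hits else 0.0

    def map_for_model(threads, generate, similarity):
        # threads: list of (question, [(comment, is_good), ...]) pairs.
        # For each thread, generate an answer, rank the comments by their
        # similarity to it, and average the per-thread AP scores (MAP).
        ap_scores = []
        for question, comments in threads:
            answer = generate(question)
            ranked = sorted(comments, key=lambda c: similarity(answer, c[0]),
                            reverse=True)
            ap_scores.append(average_precision([g for _, g in ranked]))
        return sum(ap_scores) / len(ap_scores)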

4.2 Model Selection

The training step produces a model that evolves over the training iterations. We evaluated that model after each 2,000 minibatch iterations. Then, among these evaluated models, we selected the best one, which we used for the test set. We used three model selection approaches: optimizing for MAP and for BLEU, calculated on the development set of SemEval-2016 Task 3, and for the seq2seq loss on the training dataset.

Seq2seq loss. We consider the loss that the seq2seq model optimizes during the training phase. Notice that in this case no development set is required for model selection.

Machine translation evaluation measure (BLEU). A standard model selection technique for seq2seq models is to optimize BLEU. Here, we calculated multi-reference BLEU between the generated response and the Good answers in the thread (on average, there are four Good answers out of ten in a thread). We then take the average score over all threads in the development set.

Extrinsic evaluation based on MAP. The main idea for this model selection method is the following: given a question, the seq2seq model produces an answer, which we compare to each of the answers in the thread, e.g., using cosine similarity (see below for detail), and we use the score as an estimation of the likelihood that an answer in the thread would be good. In this way, the list of candidate comments for each question can be ranked and evaluated using MAP. We used the gold relevancy labels available in the development dataset to compute MAP.

More formally, given an utterance u_q returned by the seq2seq model in response to a forum question, we rank the comments c_1, ..., c_n from the thread according to the values r(u_q, c_i). We considered the following options for r(u_q, c_i):

- cos: the cosine between the embedding vectors of u_q and c_i, where the embeddings are calculated as the average of the embedding vectors of the words, using the fine-tuned embedding vectors from (Mihaylov and Nakov, 2016);

- BLEU: the sentence-level BLEU+1 score between u_q and c_i;

- bm25: the BM25 score (Robertson and Zaragoza, 2009) between u_q and c_i;

- TF×IDF: we build a TF×IDF vector, where the TF is based on the frequency of the words in u_q, and the IDF is calculated based on the full SemEval data (all 200K questions and all 2M answers); we then repeat the procedure to obtain a vector for c_i, and finally we compute the cosine between these two vectors.

We also define a variant of each of the r(x, y) functions above, where the similarity score is further summed with the TF×IDF-cosine similarity between the question and the comment (+qc-sim). Finally, we define yet another metric, Avg, as the average of all r() functions defined in this section.

5 Experiments

We compare the model selection approaches described in Section 4.2 above, with the goal to devise a seq2seq system that gives fluent, good and informative answers, i.e., avoids answers such as "I don't know".

5.1 Setup

Our model is based on the seq2seq implementation in TensorFlow. However, we differ from the standard setup in terms of preprocessing, postprocessing, model selection, and evaluation.

First, we learned subword units using byte pair encoding (Sennrich et al., 2016) on the full data. Then, we encoded the source and target sequences, i.e., the questions and the answers, using these learned subword units. We reversed the source sequences before feeding them to the encoder in order to diminish the effect of vanishing gradients. We also applied padding and truncation to accommodate for bucketing. We then trained the seq2seq model using stochastic gradient descent.

Every 2,000 iterations, we evaluated the current model with the metrics from Section 4.2. These metrics are later used to select the model that is most suitable for our task, thus avoiding overfitting on the training data (a minimal sketch of this selection loop is shown at the end of this subsection).

In our experiments, we used the following general parameter settings: (i) vocabulary size: 40,000 subword units; (ii) dimensionality of the embedding vectors: 512; (iii) RNN cell: 2-layered GRU cell with 512 units; (iv) minibatch size: 80; (v) learning rate: 0.5; (vi) buckets: [(5, 10), (10, 15), (20, 25), (40, 45)].
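The model selection loop then amounts to scoring every periodic checkpoint on the development set and keeping the best one. The sketch below reuses the hypothetical map_for_model helper from Section 4.1; checkpoints (mapping an iteration number to a restored generate(question) function) is likewise an assumed interface, not the actual TensorFlow code.

    def select_best_checkpoint(checkpoints, dev_threads, similarity):
        # checkpoints: {iteration: generate_fn}, one entry per 2,000
        # minibatch iterations; dev_threads: annotated SemEval dev data.
        best_iteration, best_map = None, -1.0
        for iteration, generate in sorted(checkpoints.items()):
            dev_map = map_for_model(dev_threads, generate, similarity)
            if dev_map > best_map:
                best_iteration, best_map = iteration, dev_map
        return best_iteration, best_map

Selecting by BLEU or by the training loss would only change the score computed inside this loop.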

#  Optimizing for  MAP    BLEU  Iteration  Ans. Len.
1  MAP             63.45  9.18  192,000    10.56
2  BLEU            62.64  8.16   16,000    16.31
3  seq2seq loss    62.81  7.00  200,000     8.73
4  Baseline        52.80  -     -          -

Table 1: Evaluation results using the seq2seq model and optimizing during training for MAP (on DEV) vs. BLEU (on DEV) vs. the seq2seq loss (on TRAIN). The following columns show some results on TEST when selecting the best training model on DEV (for MAP and BLEU) and on TRAIN (for the seq2seq loss). We report BLEU and MAP, as well as the iteration at which the best value was achieved on DEV/TRAIN, and the average length of the generated answers on TEST.

Ranking Metric          MAP (Dev)  MAP (Test)
TF×IDF+qc-sim           63.56      63.45
TF×IDF                  62.46      62.03
cos-embeddings+qc-sim   62.97      62.90
cos-embeddings          62.21      62.13
bm25+qc-sim             62.81      61.96
bm25                    62.88      61.77
BLEU+qc-sim             62.67      62.73
BLEU                    59.94      59.82
Avg                     62.84      62.33

Table 2: MAP score for the ranking strategies defined in Section 4.2, evaluated on the development and on the test datasets.

5.2 Results and Discussion

In our first experiment, we explore the performance of seq2seq models produced by optimizing MAP using the different variants of the similarity function r() from Section 4.2, with MAP for model selection. The results in Table 2 show that TF×IDF+qc-sim performs best. The results are consistent on the development (63.56) and on the test dataset (63.45). The absolute improvement with respect to cos-embeddings+qc-sim is +0.59 on the development and +0.55 on the test dataset, respectively.

In a second experiment, we compared model selection strategies when optimizing for MAP (TF×IDF+qc-sim) vs. BLEU vs. the seq2seq loss. We further report the results for a baseline for SemEval-2016 Task 3, subtask A (Nakov et al., 2016b), which picks a random order for the answers in the target question–answer thread. The results are shown in Table 1. For each model selection criterion, we report its performance and statistics on the test dataset for the model that was best-performing on the development dataset.

We can see that doing model selection according to MAP yielded not only the highest ranking performance of 63.45, but also the best BLEU score. This is even more striking if we consider that BLEU tends to favor longer answers, yet the average length of the seq2seq answers is 10.56 for MAP and 16.31 for BLEU. Thus, we have shown that optimizing for an extrinsic evaluation measure that evaluates how good we are at telling Good from Bad answers works better than optimizing for BLEU.

6 Manual Evaluation and Error Analysis

We evaluated the three approaches in Table 1 on 100 relatively short questions.⁵ First, we randomly selected 50 questions from the test set of SemEval-2016 Task 3. However, we did not use the original questions, as they can contain multiple sentences and thus can be too long; instead, we selected a single sentence that contains the core of the question (and in some cases, we simplified it a bit further). We further created 50 new questions, which are more personal and conversational in nature, but are still generally related to Qatar. The answers produced by the three systems for these 100 questions were evaluated independently by four annotators, who judged whether each of the answers is good.

6.1 Quantitative Analysis

Table 3 reports the number of answers that each of the annotators has judged to be good when the model is selected based on BLEU, MAP, and the seq2seq loss. The average of the four annotators suggests that optimizing for MAP yields the best overall results. Note that all systems perform slightly worse on the second set of questions.

⁵ The questions and the outputs of the different models are available at http://goo.gl/w9MZfv

                                # Good answers according to
#  Optimizing for  Ann. 1  Ann. 2  Ann. 3  Ann. 4  Avg.
   questions 1-50
1  MAP             23      29      21      26      24.75 (49.50%)
2  BLEU             8      15      11       8      10.50 (21.00%)
3  seq2seq loss    18      26      21      28      23.25 (46.50%)
   questions 51-100
4  MAP             28      20      13      29      22.50 (45.00%)
5  BLEU             9       5       2       6       5.50 (11.00%)
6  seq2seq loss    25      14      11      24      18.50 (37.00%)
   questions 1-100
7  MAP             51      49      34      55      47.25 (47.25%)
8  BLEU            17      20      13      14      16.00 (16.00%)
9  seq2seq loss    43      40      32      52      41.75 (41.75%)

Table 3: Number of good answers according to manual annotation of the answers to 50+50 questions by the three models from Table 1.

This should be expected, as the latter are different from those used for training the models. Overall, the MAP-based system appears to be more robust, with only a 2.25-point absolute decrease in performance (compared to 5 and 4.75 points for the systems using BLEU and the seq2seq loss, respectively).

6.2 Qualitative Analysis

We now analyze the quality of the generated answers from the manual evaluation. Instead of looking at overall numbers, here we look at some interesting cases, shown in Tables 4 and 5.

First, we can confirm that the answers generated by the model that was optimized for BLEU seem to be the worst. We attribute this to the relatively early iteration at which the optimal BLEU occurs, and also to the nature of the BLEU metric. BLEU tries to optimize for n-gram matching. Thus, the selected model ultimately prefers longer utterances, while the other two models focus on providing a short, focused answer; this is especially true for the first part of the manual test set, as shown in Table 4, where we can find "safe" answers with stopwords, which do not have much informative content but are a good bet, e.g., "good luck", "I think that", "it is good to", etc.

In examples 1 and 6, we can see that only the MAP-based model addressed the question directly. The other models are a better fit to the language model, and thus failed to produce the target named entity. In example 1, they produced a generic answer, which can be given in response to many questions.

Example 5 shows how the models have trouble handling exclusion/negation. The model was able to copy the named entity, which is generally a good thing to do, but not here. If the question had simply been "Where can I go running?", the answer would have been good.

Note that the responses to the questions from the second group are more personal, e.g., they start with "I think", "I will", etc. Finally, we can see the well-known problem with seq2seq models: repetition. This is evident in examples 3, 13, and 14.

7 Conclusion

Building dialog systems, e.g., in the form of chatbots, has attracted a lot of attention recently and thus has become a crucial investment for many companies. Progress in neural networks, especially in seq2seq models, has made it possible to quickly and directly learn chatbots from data. However, the availability of domain-specific training data coming from real dialogs is a critical bottleneck for the development of dialog systems.

We addressed the problem by producing training data from Community Question Answering (CQA) forums. We further applied sentence selection based on word embeddings in order to retain only meaningful pairs of short texts, which can be more effectively used for training seq2seq models.

Moreover, we introduced the use of extrinsic evaluation based on a CQA task, using MAP to select the most effective models among those generated during training, using a development set of good user answers available from the CQA data.

We also used MAP to perform automatic evaluation of system accuracy against the test set annotated for the CQA task. This was not explored before.

Finally, we carried out manual evaluation with four different annotators on two different sets of questions: the first set used simplified questions taken from the CQA data, whereas the second one was composed of new, conversational-style questions that we generated. Thus, the questions of the second set are rather different from those used to train the systems; yet, they are about topics that are generally discussed in the training data.

We have found that the seq2seq model can learn from CQA-derived data, producing accurate answers when answering forum questions according to automatic and manual evaluation, with a MAP of 63.45 and an accuracy of 49.50%, respectively. Moreover, the accuracy on completely new questions drops by only a few points, i.e., to 47.25%, according to our manual evaluation. Interestingly, our model selection is more accurate than using the loss of the seq2seq model, and it performs much better than BLEU. Indeed, the latter seems not to be very appropriate for evaluating chatbots, as our manual analysis shows.

In future work, we would like to study new methods for selecting data, so that the overall system accuracy can improve further. We also plan to try subword representations (Bojanowski et al., 2017), which could better capture typos, which are common in Web forums, and to experiment with other languages such as Arabic. Last but not least, we want to explore adversarial dialog training and evaluation (Bruni and Fernandez, 2017; Kannan and Vinyals, 2016; Li et al., 2017; Yu et al., 2017).

#  Model    Example
1           How good are Karwa services?
   BLEU     good luck
   seq2seq  good luck
   MAP      karwa is good .
2           Which is the best spa in Doha?
   BLEU     lulu
   seq2seq  al sharq spa
   MAP      al sharq spa
3           Which is better: Qtel or Vodafone?
   BLEU     go for the bank
   seq2seq  vodafone
   MAP      vodafone is better than vodafone .
4           Can you recommend a good Vet in Doha?
   BLEU     good luck
   seq2seq  dr . choc
   MAP      dr . choc
5           Where can I go running; other than the Corniche?
   BLEU     i think it is a good place to go to doha .
   seq2seq  corniche
   MAP      corniche
6           Are there any cameras on the roundabouts in Qatar?
   BLEU     i think it is a good offer for a few months .
   seq2seq  there are many cars in qatar .
   MAP      there is no radar in qatar .
7           What is the best mall in Doha to buy good furniture?
   BLEU     lulu is good
   seq2seq  carrefour
   MAP      ikea

Table 4: Output of our chatbot when using different model selection strategies for some simplified questions from Qatar Living, which we used for the first part of our manual evaluation. The good answers are shown in bold.

#  Model    Example
8           Do you like fast food?
   BLEU     i like the one in doha
   seq2seq  i do n ' t eat food .
   MAP      i hate food .
9           What is your favorite beach in Qatar?
   BLEU     i think it is a good place
   seq2seq  sealine beach
   MAP      sealine beach
10          Tell me bot, what do you know about Qatar?
   BLEU     i think it is a good offer to qatar .
   seq2seq  qatar living
   MAP      qatar is a country .
11          What are the things about Qatar you like the most?
   BLEU     i think it ' s a good one ...
   seq2seq  qatar is a country .
   MAP      qatar is the best place to live .
12          Are you looking forward to the FIFA World Cup in 2022?
   BLEU     i think it was a good one in doha
   seq2seq  i will win the world cup .
   MAP      i will win the world cup .
13          Do you think the fast food in Qatar is of good quality?
   BLEU     i think it is a good place to go to doha .
   seq2seq  food is good .
   MAP      food is good for food .
14          What is the most useful thing that you can tell me about life in Qatar?
   BLEU     i love my life ...
   seq2seq  life is life ... : - rrb -
   MAP      life is beautiful ...
15          Do you need to Google to give me an answer?
   BLEU     i think you can get a good one in your home
   seq2seq  i will give you the answer .
   MAP      i would like to answer .

Table 5: Output of our chatbot for some new, conversational questions related to Qatar, which we created and used for the second part of our manual evaluation. The good answers are shown in bold.

Acknowledgments

This research was performed by the Arabic Language Technologies group at Qatar Computing Research Institute, HBKU, within the Interactive sYstems for Answer Search project (IYAS).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, California, USA, ICLR '15.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

Elia Bruni and Raquel Fernandez. 2017. Adversarial evaluation for open-domain dialogue generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Saarbruecken, Germany, SIGDIAL '17, pages 284–288.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, EMNLP '14, pages 1724–1734.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, ACL '16, pages 1631–1640.

Francisco Guzmán, Lluís Màrquez, and Preslav Nakov. 2016. Machine translation evaluation meets community question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, ACL '16, pages 460–466.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9(8):1735–1780.

Enamul Hoque, Shafiq Joty, Lluís Màrquez, Alberto Barrón-Cedeño, Giovanni Da San Martino, Alessandro Moschitti, Preslav Nakov, Salvatore Romeo, and Giuseppe Carenini. 2016. An interactive system for exploring community question answering forums. In Proceedings of the 26th International Conference on Computational Linguistics. Osaka, Japan, COLING '16, pages 1–5.

Anjuli Kannan and Oriol Vinyals. 2016. Adversarial evaluation of dialogue models. In Proceedings of the NIPS 2016 Workshop on Adversarial Training. Barcelona, Spain.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, EMNLP '14, pages 1746–1751.

Alon Lavie and Abhaya Agarwal. 2007. Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic, WMT '07, pages 228–231.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, USA, NAACL-HLT '16, pages 110–119.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, EMNLP '17, pages 2147–2159.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA, EMNLP '16, pages 2122–2132.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Prague, Czech Republic, SIGDIAL '15, pages 285–294.

Todor Mihaylov, Daniel Balchev, Yasen Kiprov, Ivan Koychev, and Preslav Nakov. 2017. Large-scale goodness polarity lexicons for community question answering. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. Tokyo, Japan, SIGIR '17, pages 1185–1188.

Todor Mihaylov and Preslav Nakov. 2016. SemanticZ at SemEval-2016 task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In Proceedings of the 10th International Workshop on Semantic Evaluation. San Diego, California, USA, SemEval '16, pages 804–811.

Tsvetomila Mihaylova, Pepa Gencheva, Martin Boyanov, Ivana Yovcheva, Todor Mihaylov, Momchil Hardalov, Yasen Kiprov, Daniel Balchev, Ivan Koychev, Preslav Nakov, Ivelina Nikolova, and Galia Angelova. 2016. SUper team at SemEval-2016 task 3: Building a feature-rich system for community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation. San Diego, California, USA, SemEval '16, pages 836–843.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA, NIPS '13, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia, USA, NAACL-HLT '13, pages 746–751.

Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. SemEval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation. Vancouver, British Columbia, Canada, SemEval '17, pages 27–48.

Preslav Nakov, Lluís Màrquez, and Francisco Guzmán. 2016a. It takes three to tango: Triangulation approach to answer ranking in community question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Austin, Texas, USA, EMNLP '16, pages 1586–1597.

Preslav Nakov, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, Jim Glass, and Bilal Randeree. 2015. SemEval-2015 task 3: Answer selection in community question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation. Denver, Colorado, USA, SemEval '15, pages 269–281.

Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016b. SemEval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation. San Diego, California, USA, SemEval '16, pages 525–545.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA, ACL '02, pages 311–318.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4):333–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, British Columbia, Canada, ACL '17, pages 1073–1083.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, ACL '16, pages 1715–1725.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. CoRR abs/1512.05742.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, ACL-IJCNLP '15, pages 1577–1586.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado, USA, NAACL-HLT '15, pages 196–205.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems. Montréal, Québec, Canada, NIPS '14, pages 3104–3112.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR abs/1506.05869.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4):652–663.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, California, USA, AAAI '17, pages 2852–2858.