Evaluating Natural Language Understanding Services for Conversational Question Answering Systems
Daniel Braun, Adrian Hernandez Mendez, Florian Matthes
Technical University of Munich, Department of Informatics
{daniel.braun, adrian.hernandez, matthes}@tum.de

Manfred Langen
Siemens AG, Corporate Technology
manfred.langen@siemens.com

Proceedings of the SIGDIAL 2017 Conference, pages 174-185, Saarbrücken, Germany, 15-17 August 2017. © 2017 Association for Computational Linguistics

Abstract

Conversational interfaces have recently gained a lot of attention. One of the reasons for the current hype is the fact that chatbots (one particularly popular form of conversational interface) can nowadays be created without any programming knowledge, thanks to different toolkits and so-called Natural Language Understanding (NLU) services. While these NLU services are already widely used in both industry and science, they have so far not been analysed systematically. In this paper, we present a method to evaluate the classification performance of NLU services. Moreover, we present two new corpora, one consisting of annotated questions and one consisting of annotated questions with the corresponding answers. Based on these corpora, we conduct an evaluation of some of the most popular NLU services. Thereby we want to enable both researchers and companies to make more educated decisions about which service they should use.

1 Introduction

Long before the terms conversational interface or chatbot were coined, Turing (1950) described them as the ultimate test for artificial intelligence. Despite their long history, there is a recent hype about chatbots in both the scientific community (cf. e.g. Ferrara et al. (2016)) and industry (Gartner, 2016). While there are many related reasons for this development, we think that three key changes were particularly important:

- the rise of universal chat platforms (like Telegram, Facebook Messenger, Slack, etc.)
- advances in machine learning (ML)
- Natural Language Understanding (NLU) as a service

In this paper, we focus on the latter. As we will show in Section 2, NLU services are already used by a number of researchers for building conversational interfaces. However, due to the lack of a systematic evaluation of these services, the decision why one service was preferred over another is usually not well justified. With this paper, we want to bridge this gap and enable both researchers and companies to make more educated decisions about which service they should use. We describe the functioning of NLU services and their role within the general architecture of chatbots. We explain how NLU services can be evaluated, and conduct an evaluation of the most popular services, based on two different corpora consisting of nearly 500 annotated questions.

2 Related Work

Recent publications have discussed the usage of NLU services in different domains and for different purposes, e.g. question answering for localized search (McTear et al., 2016), form-driven dialogue systems (Stoyanchev et al., 2016), dialogue management (Schnelle-Walka et al., 2016), and the internet of things (Kar and Haldar, 2016). However, none of these publications explicitly discusses why one particular NLU service was chosen over another and how this decision may have influenced the performance of the system and hence the results. Moreover, to the best of our knowledge, so far there exists no systematic evaluation of a particular NLU service, let alone a comparison of multiple services. Dale (2015) lists five NLP cloud services and describes their capabilities, but without conducting an evaluation. In the domain of spoken dialog systems, similar evaluations have been conducted
for automatic speech recognizer services, e.g. by Twiefel et al. (2014) and Morbini et al. (2013). Speaking about chatbots in general, Shawar and Atwell (2007) present an approach to conduct end-to-end evaluations; however, they do not take into account the single elements of a system. Resnik and Lin (2010) provide a good overview and evaluation of Natural Language Processing (NLP) systems in general. Many of the principles they apply for their evaluation (e.g. inter-annotator agreement and partitioning of data) play an important role in our evaluation too. A comprehensive and extensive survey of question answering technologies was presented by Kolomiyets and Moens (2011). However, there has been a lot of progress since 2011, including the NLU services presented here.

One of our two corpora was labelled using Amazon Mechanical Turk (AMT, cf. Section 5.2). While there have been long discussions about whether or not AMT can replace the work of experts for labelling linguistic data, the recent consensus is that, given enough annotators, crowd-sourced labels from AMT are as reliable as expert data (Snow et al., 2008; Munro et al., 2010; Callison-Burch, 2009).

3 Chatbot Architecture

In order to understand the role of NLU services for chatbots, one first has to look at the general architecture of chatbots. While there exist different documented chatbot architectures for concrete use cases, no universal model of how a chatbot should be designed has emerged yet. Our proposal for a universal chatbot architecture is shown in Figure 1. It consists of three main parts: Request Interpretation, Response Retrieval, and Message Generation. The Message Generation follows the classical Natural Language Generation (NLG) pipeline described by Reiter and Dale (2000). In the context of Request Interpretation, a "request" is not necessarily a question, but can also be any user input like "My name is John". Equally, a "response" to this input could e.g. be "What a nice name".

[Figure 1: General Architecture for Chatbots. A Discourse Manager with Contextual Information connects Request Interpretation (Analysis, Query Generation), Response Retrieval (Candidate Retrieval, Candidate Selection), and Message Generation (Text Planning, Sentence Planning, Linguistic Realizer), backed by a Knowledge Base.]

4 NLU Services

The general goal of NLU services is the extraction of structured, semantic information from unstructured natural language input, e.g. chat messages. They mainly do this by attaching user-defined labels to messages or parts of messages. At the time of writing, among the most popular NLU services are:

- LUIS [1]
- Watson Conversation [2]
- API.ai [3]
- wit.ai [4]
- Amazon Lex [5]

Moreover, there is a popular open source alternative called RASA [6]. RASA offers the same functionality, while lacking the advantages of cloud-based solutions (managed hosting, scalability, etc.). On the other hand, it offers the typical advantages of self-hosted open source software (adaptability, data control, etc.).

Table 1 shows a comparison of the basic functionality offered by the different services. All of them, except for Amazon Lex, share the same basic concept: based on example data, the user can train a classifier to classify so-called intents (which represent the intent of the whole message and are not bound to a certain position within the message) and entities (which can consist of a single character or multiple characters).

Service   Intents   Entities   Batch import
LUIS      +         +          +
Watson    +         +          +
API.ai    +         +          +
wit.ai    +         +          O
Lex       +         O          -
RASA      +         +          +

Table 1: Comparison of the basic functionality of NLU services

Figure 2 shows a labelled sentence in the LUIS web interface. The intent of this sentence was classified as FindConnection, with a confidence of 97%. The labelled entities are: (next, Criterion), (train, Vehicle), (Universität, StationStart), (Max-Weber-Platz, StationDest).

[Figure 2: Labelled sentence with intent and entities in Microsoft LUIS]

[1] https://www.luis.ai
[2] https://www.ibm.com/watson/developercloud/conversation.html
[3] https://www.api.ai
[4] https://www.wit.ai
[5] https://aws.amazon.com/lex
[6] https://www.rasa.ai
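The intent/entity scheme can be illustrated with a small, vendor-neutral sketch. The JSON-like training record below is hypothetical and does not follow the exact schema of any of the listed services; the utterance itself is invented, reusing entity values from the Figure 2 example:

```python
# Hypothetical, vendor-neutral training example for an intent/entity
# classifier. Field names are illustrative, not any service's API.
example = {
    "text": "when is the next train to Max-Weber-Platz",
    "intent": "FindConnection",
    "entities": [
        {"value": "next", "type": "Criterion"},
        {"value": "train", "type": "Vehicle"},
        {"value": "Max-Weber-Platz", "type": "StationDest"},
    ],
}

def char_spans(example):
    """Resolve each entity value to its (start, end) character span,
    mirroring how entities, unlike intents, are bound to a position
    within the message."""
    spans = []
    for ent in example["entities"]:
        start = example["text"].find(ent["value"])
        spans.append((ent["type"], start, start + len(ent["value"])))
    return spans

spans = char_spans(example)
print(spans)
```

The intent labels the whole message, while each entity is anchored to a concrete character span, which is exactly the distinction the services above draw between the two label types.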
Amazon Lex shares the concept of intents with the other services, but instead of entities, Lex uses so-called slots, which are not trained by concrete examples, but by example patterns like "When is the {Criterion} {Vehicle} to {StationDest}". Moreover, all services, except for Amazon Lex, also offer an export and import functionality which uses a JSON format to export and import the training data. While wit.ai offers this functionality, as of today, it only works reliably for creating backups and restoring them, but not for importing new data [7].

When it comes to the core of the services, the machine learning algorithms and the data on which they are initially trained, all services are very secretive: none of them gives specific information about the used technologies and datasets. The exception in this case is, of course, RASA, which can use either MITIE (Geyer et al., 2016) or spaCy (Choi et al., 2015) as its ML backend.

[7] cf. e.g. https://github.com/wit-ai/wit/issues?q=is%3Aopen+is%3Aissue+label%3Aimport
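The surface behaviour of such pattern-based slots can be approximated with a regular expression. This is only a sketch of the idea, under the assumption (ours, not documented Lex behaviour) that each {Slot} placeholder matches a single, possibly hyphenated, token:

```python
import re

def pattern_to_regex(pattern):
    """Turn a Lex-style slot pattern into a regex with named groups.
    Assumes every {Slot} placeholder matches one hyphenated token;
    this mimics the pattern's surface form, not Lex's actual matcher."""
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[\\w-]+)", pattern)
    return re.compile(regex, re.IGNORECASE)

matcher = pattern_to_regex("When is the {Criterion} {Vehicle} to {StationDest}")
m = matcher.match("When is the next train to Max-Weber-Platz")
print(m.groupdict())
```

This also makes the contrast with the example-driven services concrete: a pattern fixes the slot positions in advance, whereas a trained entity classifier learns positions from labelled examples.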
5 Data Corpus

Our evaluation is based on two very different data corpora. The Chatbot Corpus (cf. Section 5.1) is based on questions gathered by a Telegram chatbot in production use, answering questions about public transport connections. The StackExchange Corpus (cf. Section 5.2) is based on data from two StackExchange [8] platforms: ask ubuntu [9] and Web Applications [10]. Both corpora are available on GitHub under the Creative Commons CC BY-SA 3.0 license [11]: https://github.com/sebischair/NLU-Evaluation-Corpora

[8] https://www.stackexchange.com
[9] https://www.askubuntu.com
[10] https://webapps.stackexchange.com
[11] https://creativecommons.org/licenses/by-sa/3.0/

5.1 Chatbot Corpus

The Chatbot Corpus consists of 206 questions, which were manually labelled by the authors. There are two different intents (Departure Time, Find Connection) in the corpus and five different entity types (StationStart, StationDest, Criterion, Vehicle, Line). The general language of the questions was English, however, mixed with German […] our evaluation, we included them in our corpus, in order to create a corpus that is not only useful for this particular evaluation, but also for research on question answering in general.
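For intent labels like the two in the Chatbot Corpus, the classification performance the paper sets out to measure can be sketched as per-label precision, recall, and F1. The gold and predicted label sequences below are invented toy data, not results from the actual evaluation:

```python
# Toy gold/predicted intent labels shaped like the Chatbot Corpus
# (two intents). Both sequences are invented for illustration only.
gold = ["FindConnection", "DepartureTime", "FindConnection", "DepartureTime"]
pred = ["FindConnection", "FindConnection", "FindConnection", "DepartureTime"]

def prf(gold, pred, label):
    """Precision, recall, and F1 for one intent label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf(gold, pred, "FindConnection")
print(p, r, f)
```

Computing these scores per label and per service, on held-out partitions of the corpora, is the kind of comparison the annotated questions make possible.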