LiU-ITN-TEK-A--20/060-SE

Conversational Chatbots with Memory-based Question and Answer Generation

Mikael Lundell Vinkler
Peilin Yu

2020-11-13

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--20/060-SE

Conversational Chatbots with Memory-based Question and Answer Generation

Thesis work carried out in Media Technology (Medieteknik) at the Institute of Technology, Linköping University

Mikael Lundell Vinkler
Peilin Yu

Norrköping 2020-11-13

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Mikael Lundell Vinkler, Peilin Yu

Abstract

The aim of this study is to contribute to research on maintaining long-term engagingness in chatbots, which is done through rapport building with the help of user- and agent-specific memory. Recent advances in end-to-end trained neural conversational models (fully functional chit-chat chatbots created by training a neural model) present chatbots that converse well with respect to context understanding with the help of their short-term memory. However, these chatbots do not consider long-term memory, which in turn motivates further research.

In this study, short-term memory is developed to allow the chatbot to understand context, such as context-based follow-up questions. Long-term memory is developed to remember information between multiple interactions, such as information about the user and the agent's own persona/personality. By introducing long-term memory, the chatbot is able to generate long-term memory-based questions and to refer to the previous conversation, as well as retain a consistent persona. A question answering chatbot and a question asking chatbot were initially developed in parallel as individual components and finally integrated into one chatbot system. The question answering chatbot was built in Python and consisted of three main components: a generative model using GPT-2, a template structure with a related sentiment memory, and a retrieval structure. The question asking chatbot was built using a framework called Rasa.

User tests were performed to primarily measure perceived engagingness and realness. The aim of the user studies was to compare performance between three chatbots: a) the individual question asking chatbot, b) the individual question answering chatbot and c) the integrated one. The results show that chatbots perceived as more human-like are not necessarily more engaging conversational partners than chatbots with lower perceived human-likeness. While still not near human-level performance on measures such as consistency and engagingness, the developed chatbots achieved scores on these measures similar to those of chatbots in a related task (the Persona-Chat task in ConvAI2). When measuring the effects of long-term memory in question asking, it was found that measures of perceived realness and persona increased when the chatbot asked long-term memory generated questions, referring to the previous interaction with the user.

Acknowledgments

First of all, we would like to thank Dirk Heylen and Mariët Theune for welcoming and giving us the opportunity to perform this research at the Human Media Interaction group at the University of Twente. Thank you Mariët Theune and Jelte van Waterschoot for supervising and providing feedback and ideas throughout the entire project. Special thanks to Jelte van Waterschoot for introducing us to relevant tools and frameworks and for suggesting relevant literature. Furthermore, thanks to Elmira Zohrevandi for taking on the role as our internal supervisor at Linköping University, and for providing helpful feedback and literature.

Contents

Abstract i

Acknowledgments ii

Contents iii

List of Figures vii

List of Tables viii

1 Introduction 1 1.1 Motivation ...... 2 1.2 Purpose ...... 3 1.3 Research Questions ...... 3 1.4 Delimitations ...... 4 1.5 Thesis Structure ...... 4

2 Background and Related Work 5 2.1 Conversational Agents ...... 5 2.1.1 Rule-Based Methods ...... 5 2.1.2 Corpus-Based Methods ...... 5 2.2 Generative Models ...... 6 2.2.1 Sentence and Word Embeddings ...... 6 2.2.2 Fine-Tuning and Transfer Learning ...... 7 2.2.3 Seq2Seq or Encoder-Decoder ...... 8 2.2.4 Transformer ...... 9 2.2.4.1 GPT-2 ...... 10 2.2.4.2 Distillation ...... 10 2.2.4.3 Other Auto-Regressive Models ...... 11 2.3 Persona-Chat Task and Agents ...... 11 2.4 Relevant Conversational Agents ...... 13 2.4.1 Long-Term Engagingness ...... 13 2.4.2 Mitsuku ...... 13 2.4.3 Hugging Face’s Persona Chatbot ...... 13 2.4.4 Microsoft’s XiaoIce ...... 14 2.4.5 Meena ...... 16 2.4.6 Replika ...... 17 2.5 User Testing and Evaluation ...... 18 2.5.1 PARADISE ...... 18 2.5.2 Godspeed ...... 19 2.5.3 SASSI ...... 19 2.5.4 Automatic Evaluation of Responses ...... 20 2.5.5 Conversation-Turns per Session ...... 20 2.6 Open Source Conversational AI, Rasa ...... 21

2.6.1 Rasa NLU ...... 21 2.6.1.1 Tokenization ...... 21 2.6.1.2 Featurization ...... 22 2.6.1.3 Entity Recognition, Intent Classification and Response Selector 22 2.6.2 Rasa Core ...... 22 2.6.2.1 Story ...... 22 2.6.2.2 Domain ...... 23 2.6.2.3 Slot ...... 24 2.6.2.4 Response ...... 24 2.6.3 Rasa X ...... 25 2.7 VADER Sentiment Analysis ...... 25 2.8 Semantic Network ...... 25 2.9 Conclusion ...... 25

3 Development of a Question Answering Chatbot 28 3.1 Architecture ...... 28 3.2 Datasets ...... 28 3.3 Generative Model ...... 30 3.3.1 Preprocessing and Creating New Datasets ...... 30 3.3.2 Manual Cleaning of Data ...... 32 3.3.3 GPT-2 Fine-Tuning ...... 32 3.4 Data Analysis ...... 33 3.5 Templates ...... 35 3.6 Sentiment Memory ...... 36 3.7 Answer Retrieval Structure ...... 37 3.8 Chatbot Development ...... 38 3.9 User test - Environment ...... 41 3.10 User test - Question Answering Chatbot ...... 43 3.10.1 Survey ...... 44 3.10.2 Method ...... 44 3.10.3 Hypothesis ...... 45 3.10.4 Results ...... 45

4 Development - Post User Test 48 4.1 Refactoring ...... 48 4.2 Template Component Improvements ...... 48 4.3 Follow-Up Question Test ...... 49 4.4 Answer Ranking ...... 50 4.4.1 BM25 ...... 50 4.4.2 Neural Network Ranking ...... 51 4.4.3 LDA ...... 51 4.4.4 Cosine Similarity With Penalty and Reward Functions ...... 52 4.4.5 Ranking Tests ...... 56 4.5 Question and Answer Classifiers ...... 57 4.6 Generative Component Improvements ...... 58 4.6.1 Preprocessing and Creating New Datasets II ...... 58 4.6.2 Fine-Tuning new Generative Models ...... 60 4.6.3 Context Testing and Automatic Evaluation ...... 61 4.6.4 Repeated Answer Removal ...... 62 4.6.5 Saving and Re-Using Past Messages ...... 62 4.6.6 Model Selection ...... 63 4.6.7 Response Time Improvement ...... 64 4.7 Comparison Between Previous and Current Agent ...... 65

5 Development of a Question Asking Chatbot 67 5.1 Types of Question ...... 67 5.2 Rasa Framework ...... 67 5.2.1 A Basic Chatbot ...... 68 5.3 Rasa NLU ...... 68 5.3.1 Tokenization ...... 68 5.3.2 Featurization ...... 69 5.3.3 Entity Recognition, Intent Classification and Response Selector . . . . . 69 5.4 VADER Sentiment Analysis ...... 70 5.5 Semantic Network ...... 70 5.6 Rasa Core ...... 70 5.6.1 Story ...... 71 5.6.2 Domain ...... 71 5.6.3 Slot ...... 72 5.6.4 Response ...... 74 5.6.5 Action ...... 74 5.7 Rasa X ...... 75 5.8 Architecture ...... 76 5.9 User test - Question Asking Chatbot ...... 76 5.9.1 Survey ...... 76 5.9.2 Results ...... 77

6 Integration 81 6.1 API Development ...... 81 6.2 Integrated System Rules and Conversation Flow ...... 82 6.3 Final User Test ...... 83

7 Results - Chatbot Comparisons 88

8 Discussion 92 8.1 Results ...... 92 8.1.1 User Test - Question Answering Chatbot ...... 92 8.1.2 User Test - Question Asking Chatbot ...... 94 8.1.3 Final User Test ...... 94 8.1.4 Comparisons of the Three Chatbots ...... 95 8.2 Method ...... 96 8.2.1 Question Answering Agent ...... 96 8.2.2 Question Asking Agent ...... 98 8.2.3 Integration ...... 99 8.3 The Work in a Wider Context ...... 99 8.4 Future Work ...... 100

9 Conclusion 102

Bibliography 105

A Appendix 109 A.1 GPT-2 Text Generation ...... 109 A.2 Fine-Tuned Generative Models and Their Answers ...... 109 A.3 GPT-2 Models Experiment ...... 112 A.4 GTKY - Most Common Nouns ...... 113 A.5 Templates ...... 114 A.6 Front-End - Question Answering Agent ...... 115 A.7 Question Answering User Test Interactions ...... 116

A.8 Questions for Evaluation ...... 118 A.9 Final User Test Interactions ...... 118

List of Figures

2.1 Word embedding similarity matrix using spaCy’s tokenizer and similarity function. 7 2.2 Transfer learning example where the general task model’s knowledge can be transferred to the sub-task model ...... 8 2.3 XLNet generating 59 words in 27.669 seconds...... 11 2.4 GPT-2 (124M) generating 37 words in 1.270 seconds ...... 11 2.5 Illustration of possible types of information that Mitsuku can remember from conversations ...... 14 2.6 A story flow diagram visualized in Rasa X...... 23 2.7 Results from the sentiment analyser for different user inputs...... 25 2.8 Some relations provided by ConceptNet...... 26

3.1 System flow diagram describing how input data flows through the architecture of the chatbot and generates output ...... 29 3.2 Plotting distortion over the number of clusters...... 34 3.3 Feedback using "loading dots" ...... 42

4.1 Training data for a neural network ranker ...... 51 4.2 LDA topic distance when using 10 topics on the GTKY dataset...... 52 4.3 Words and weights for 10 topics as determined with LDA...... 53 4.4 Updated architecture of the question answering agent ...... 66

5.1 Some relations provided by ConceptNet...... 70 5.2 Architecture of the chatbot built using the Rasa framework; both the Rasa Core and Rasa NLU are used for dialog management and natural language understanding. The chatbot is hosted on a virtual machine on Google Compute Engine for user tests...... 77

6.1 Illustrating the disclosure-and-reflect component, given topic as input...... 82

7.1 Box plot of the three chatbots’ engagingness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR)...... 89 7.2 Box plot of the three chatbots’ realness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR)...... 90

A.1 Front page with instructions and terms for the question answering user test. . . . . 115 A.2 Chatbot interaction page, illustrating retrieval based memory...... 115 A.3 Survey page for the question answering user test...... 116

List of Tables

3.1 Fine-tuning GPT-2 models ...... 33 3.2 Example of artificial delay for different sentence lengths...... 41 3.3 Survey result of the first user test, illustrating the mean and standard deviation for different groups...... 47 3.4 P-values when comparing survey results of a) Ideal DELAY group against Ideal NO-DELAY, b) All DELAY group against All NO-DELAY group, c) Ideal group against Non-ideal group...... 47

4.1 Points awarded to Method 1 and Method 2 in a one-on-one style evaluation on relevancy and engagingness of generated answers to 55 questions...... 56 4.2 Points awarded to Method 2 and Method 3 in a one-on-one style evaluation on relevancy and engagingness of generated answers to 55 questions...... 56 4.3 20 (question) clusters formed from the GTKY dataset using agglomerative clustering. 58 4.4 Answer clusters formed from the GTKY dataset using agglomerative clustering. . 59 4.5 Fine-tuning GPT-2 models with history ...... 61 4.6 An overview of the distributed points to three methods over internal automatic and manual tests...... 61 4.7 An overview of the distributed points to four trained models over different internal automatic and manual tests...... 63

5.1 Information that is stored externally in an Excel file...... 74 5.2 Survey result of the first user test with the question asking chatbot, illustrating the mean value for each question and the p-values of t-tests...... 78

6.1 Mean and standard deviation of the survey results from the final user test, which used a 5-point Likert scale. Session 1 (All) represents a group of 14 participants and the remaining columns represent a group of 5 that completed both sessions...... 85 6.2 Results of the final user test on grammar, conversation flow and message quality on a 10-point Likert scale...... 85

7.1 Comparing survey results of the first session of different chatbots by looking at the mean values supplemented with the standard deviation...... 88 7.2 Resulting p-values of ANOVA tests on the chatbots’ engagingness and realness scores. The third column (far-right) excludes the ’All’ group of the Question Answering chatbot...... 88 7.3 Comparing results of shared measures between the integrated chatbot and the question asking chatbot, illustrating for each question the mean value supplemented with the standard deviation. Additionally, the resulting p-values of ANOVA tests are presented...... 89 7.4 Comparing the agents in this thesis with models from the Persona-Chat paper [56] on fluency, engagingness and consistency...... 91 7.5 Comparing engagingness of the agents in this thesis with the top 3 agents presented in ConvAI2, as well as human performance for reference [12]...... 91

A.1 Comparison of training time between different GPT-2 models ...... 112 A.2 Comparison of generation time between different GPT-2 models ...... 112 A.3 Top 87 (84+3) most frequently occurring nouns in extracted questions from the GTKY dataset...... 113 A.4 Template questions developed after analyzing the GTKY dataset, with the addition of two questions after a user test ...... 114 A.5 Template answers that are used together with a sentiment memory, to answer what things the agent likes or dislikes...... 114 A.6 Questions generated by a model accidentally trained on question asking, that were then used for evaluating answer relevancy of fine-tuned generative models. . . . . 118

1 Introduction

The first chatbot (or conversational agent) was developed in the 1960s. It simulated a psychologist by following a set of rules to generate questions from the user's original message [49]. Today, several personal assistants have been developed and are commercially available, such as Amazon's Alexa or Apple's Siri. These virtual assistants are able to perform certain tasks such as playing music, setting an alarm or answering (factual) questions from the user. However, dealing with the social and creative aspects of human language is still challenging. Most of the time, the assistants are unable to hold meaningful conversations with a person, and unable to maintain engagingness over long-term interactions. Part of this problem may be due to the agent's limited or non-existent memory, resulting in the agent being unable to understand context-dependent messages from the user. Moreover, building a relationship with the user is challenging due to the lack of user-specific memory.

Even though conversational agents are important and highly applicable to our society, commercially the focus has been more on virtual assistants. Virtual assistants are task-oriented agents whereas conversational agents are non-task-oriented. The difference is that task-oriented agents are designed to be interacted with briefly to fulfill a task (e.g. book a train ticket) whereas non-task-oriented agents are designed for extended conversations. A few example applications for conversational agents are:

• Fitness trainer, motivating users to exercise regularly [3].

• Helping people overcome addiction.

• Emotional support for people with depression.

• Diverse applications to assist the development of children with special needs.

• Educational agents.

Three existing chatbot examples are Insomnobot3000, Endurance ChatBot and Replika. Insomnobot3000 is an agent that people with insomnia can text with in the middle of the night, when none of their conversational partners may be awake1. Endurance ChatBot is an open-source work in progress, where the aim is to create a robot companion for people who suffer from Alzheimer's disease2. Replika, "The AI companion who cares", is an agent that provides emotional support and is always there for people who feel lonely or sad or just want someone to talk with3. For these types of agents to be able to help the target group, it is vital that the users stay engaged long enough for the agents to help them, which illustrates the importance of research in the area of long-term engagingness for conversational agents. In the case of Endurance ChatBot, it is especially important that the chatbot has memory, to be able to evaluate the state of the user's health.

1 http://insomnobot3000.com/
2 http://endurancerobots.com/azbnmaterial/a-robot-companion-for-senior-people-and-patients-with-alzheimer-s-disease/
3 https://replika.ai/


The chatbots of today generate messages to users either by following a set of rules to return a template based on the recognized sentence pattern (rule-based), or by being trained on a large amount of (human-to-human) conversational data (corpus-based). The rule-based approach can be considered the traditional way of creating conversational agents; notably, the most recent winner of the Loebner Prize was a rule-based chatbot, Mitsuku, for the 4th year in a row4. The Loebner Prize is a contest in which developers' bots are judged on how human-like they are, and the developer with the most human-like bot wins. For rule-based chatbots, handling a broad range of inputs requires the creation of hundreds of templates, which in turn costs time and money.

Another popular method is to use a retrieval database, where a database of replies is kept and queried. The retrieval method works well when the user's input is similar enough to entries in the database, as it may retrieve related, high-quality, and potentially engaging replies. When the user's input is not similar to any entry in the database, the flaws of the method are observed: the agent will reply with something seemingly random, change topic, or give a safe reply like "I don't know". Both methods suffer from repetitiveness, as there is only a limited number of available replies at any given time.

The third method is to train neural models to generate responses. Training a model with the help of machine learning is an attractive solution as it is cheaper than creating hundreds or thousands of templates and, compared to the retrieval structure, it is able to generate an answer to any user input. However, generative models instead suffer from inconsistency as well as a tendency to produce non-specific answers such as "I don't know".

Inconsistency example, where PERSON 1 is a human and PERSON 2 is a chatbot [56]. [PERSON 2:] I love to cook and bake but I rarely make time for it [PERSON 1:] what do you usually bake? [PERSON 2:] I do not cook, I bake cakes at a local restaurant. how about you?

The mentioned methods have their respective pros and cons, and therefore this thesis aims to develop a chatbot in which rule-based and corpus-based methods are combined. With rules and templates, a memory structure will be introduced and used consistently. The corpus-based approach will support the templates by handling any input that is out of scope for the templates. Memory will be used for two main tasks in this thesis:

1. Memory based question generation to ask the user personal questions.

2. Personalization of the agent by storing personal background information about the agent.

By storing memory about the user, the agent will be able to re-use, at a later time, information provided by the user in past interactions to generate personal questions. By storing memory about the agent, the user will be able to ask questions and get to learn about the agent. Both of these tasks are researched for the purpose of contributing to the field and to the future development of personal social chatbots which are capable of long-term interaction; chatbots that users may develop a personal bond with.

1.1 Motivation

The social chatbots of today usually have short-term memory (memory for the current interaction) but have limited (if any) long-term memory (memory over several interactions). Without short-term memory it becomes difficult for the agents to understand context-dependent user utterances, such as follow-up questions or utterances referring to “he/she/it”.

4 https://aisb.org.uk/mitsuku-wins-2019-loebner-prize-and-best-overall-chatbot-at-aisb-x/


Without long-term memory it becomes difficult for the agent to build rapport with users. To build rapport it is necessary to have a personal background, to remember information about the user, and to find common ground such as shared interests.

Example of context dependent utterances: User: What is your favorite sport? Agent: I like football the most. User: Why do you like it?

The project was carried out at the Human Media Interaction (HMI) research group at the University of Twente. HMI partly performs research in the area of dialogue and natural language understanding and generation. The research group is specifically interested in research aimed at social and creative applications, such as conversational agents or social chatbots.

1.2 Purpose

The purpose of this thesis is to investigate how long-term and short-term memory can be used in a chatbot to simulate a consistent persona for question answering and to enable long-term question asking via user modeling. This is done by implementing and testing a chatbot with user-specific and agent-specific memories, where long-term memory data is mainly used through rule-based methods, such as templates, and short-term memory is used in a generative model.

1.3 Research Questions

The main research question is: how can a personal social chatbot be developed with long-term and short-term memory, such that the interaction with the chatbot remains engaging over a longer period of time? In line with the main research question, the current work aims to address the following sub-questions:

Development

1. How can the user and agent specific memory models be developed to extract and store information from user utterances, and apply them to agent utterances?

2. How can short-term memory be developed such that the agent can understand and provide a relevant answer to user utterances, such as follow-up questions, from the context of the conversation history?

User testing

1. How is the user’s experience with an agent affected when the agent generates questions which are personalized based on the user’s utterances from previous interactions?

2. How is the user’s perception of the agent affected by the agent having a personal background, which allows for self-disclosure responses?

3. To what extent would the user’s perception of the agent be affected if the user could ask the agent self-disclosure questions?


1.4 Delimitations

The conversational agent will be developed as a non-task-oriented system. That is, it will not be a type of question-answer (QA) bot that, for example, searches online for factual answers in order to reply to the user, as Siri does. It is not intended to perform tasks such as scheduling, setting reminders or alarms. The chatbot will be developed primarily to investigate what effects memory has on users for question asking and question answering. As such, the final chatbot is not intended to be a stand-alone product that can be deployed outside of the scope of the user tests in the thesis. The chatbot will consist of two components, a question asking component and a question answering component.

The chatbot will focus solely on text input and output. Its main platform is targeted at communication applications, such as Messenger, Discord and WhatsApp, and websites in general, where text exchanges take place. Therefore, features related to speech will not be taken into consideration; this can be left as part of potential future work. It is expected that the user will send messages such that each message is limited to one intent. It is also expected that the users use proper and grammatically correct English, although minor typographical errors are allowed.

The user and agent memory structure will be restricted to storing information on a limited range of general topics, such as sports, hobbies, movies, etc. The structure aims to retain memory about like/dislike preferences, e.g. that the agent likes apples but dislikes oranges, or that the user's favorite animal is the cat.
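To make the delimitation concrete, a minimal sketch of what such a restricted like/dislike memory could look like is given below. The topic names, keys and helper function are hypothetical and only meant to visualize the restriction; they are not the actual structure used in the implementation chapters.

```python
# Hypothetical sketch of a restricted like/dislike memory (illustration only).
# The topics and keys below are examples, not the thesis' actual memory structure.

user_memory = {
    "animals": {"favorite": "cat"},          # e.g. the user's favorite animal is the cat
    "food":    {"likes": [], "dislikes": []},
}

agent_memory = {
    "food": {"likes": ["apples"], "dislikes": ["oranges"]},  # the agent likes apples but dislikes oranges
}

def remember_preference(memory, topic, item, liked=True):
    """Store a like/dislike preference under a general topic."""
    slot = memory.setdefault(topic, {"likes": [], "dislikes": []})
    slot["likes" if liked else "dislikes"].append(item)

remember_preference(user_memory, "hobbies", "hiking", liked=True)
print(user_memory["hobbies"])  # {'likes': ['hiking'], 'dislikes': []}
```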

1.5 Thesis Structure

The rest of the report is structured as follows. Chapter 2 discusses the background of conversational agents, various models, a framework and a few useful resources. Chapter 3 describes the development of a question answering chatbot. The second phase of development of the question answering chatbot, after a user test, is presented in Chapter 4. Chapter 5 describes the development of a question asking chatbot. Chapter 6 explains the development of an integrated chatbot, where the question answering chatbot and the question asking chatbot are combined. After that, Chapter 7 presents the results and compares performance across the three chatbots. Discussion of the results and methodology, as well as future work, is presented in Chapter 8. Concluding remarks follow in Chapter 9.

2 Background and Related Work

This chapter provides theory for key concepts relevant to this thesis, such as the different types of chatbots (task-oriented and non-task-oriented) and the different methods used to create chatbots, such as corpus-based and rule-based approaches. Related work is investigated to find which methods have been tested, and which methods contribute toward the development of engaging social chatbots. Additionally, the chapter provides an overview of existing methods for evaluating the performance of chatbots.

2.1 Conversational Agents

Chatbots can be classified into two types, based on the nature of the interaction. One chatbot type is designed to help accomplish tasks such as answering frequently asked questions or making a reservation at a restaurant. This type of chatbot is referred to as a task-oriented dialogue agent [20]. The other chatbot type is for entertainment, conversation, building a relationship and such, and hence usually carries on longer conversations with the users. These are usually categorized as non-task-oriented chatbots and are what this project will primarily focus on. Chatbots are developed through rule-based or corpus-based methods, sometimes in combination.

2.1.1 Rule-Based Methods

Rule-based chatbots, such as Mitsuku (2.4.2), make use of rules to generate responses, e.g. if given input a, then do action b and return response c. This consists of creating pattern-response pairs or templates. A pattern may be as follows: "My name is *", where the * is a wildcard that can be re-used in the response, "Nice to meet you *". Templates can be created to handle a broader range of inputs by making rules around the semantic meaning with the help of Natural Language Processing methods such as Semantic Role Labeling, Part-of-Speech tagging and Named Entity Recognition, but this may require that the user inputs complete sentences [27]. Rule-based methods may be more time-demanding than corpus-based methods, as they require the creation of many hand-written rules, but they may also be able to handle a wider range of topics because of it.
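A minimal sketch of the pattern-response idea described above, using a regular expression in place of an AIML-style wildcard. The patterns and the fallback reply are illustrative examples, not templates from the thesis.

```python
import re

# Illustrative pattern-response pairs; the "(.+)" group plays the role of the "*" wildcard.
RULES = [
    (re.compile(r"my name is (.+)", re.IGNORECASE), "Nice to meet you, {0}!"),
    (re.compile(r"i like (.+)",     re.IGNORECASE), "Why do you like {0}?"),
]

def rule_based_reply(message: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Re-use the captured wildcard text in the response, as in "Nice to meet you *".
            return template.format(match.group(1).strip("?!. "))
    return "Tell me more."  # fallback when no rule matches

print(rule_based_reply("My name is Alice"))  # Nice to meet you, Alice!
print(rule_based_reply("I like football"))   # Why do you like football?
```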

2.1.2 Corpus-Based Methods

Corpus-based (or data-driven) approaches make use of (human) written text data, usually in a human-to-human conversation setting, but also human-to-machine or simply mined from text online, e.g. news articles [59]. The data can be used either for information retrieval (IR) or for machine learning. In information retrieval systems, the data is maintained in a database, for example by storing key-value pairs such as question-answer pairs. A user's input becomes a query to the database, which retrieves the most relevant pair based on query and key similarity, and then the corresponding value is returned as a response. Information retrieval systems contain high-quality responses for when there is a match between the user's query and existing information in the database.

The responses, however, are limited to the range of data and topics that exist in the database. Different machine learning models have been developed for the task of text generation by training on large amounts of text data (generative models). The trained models can then generate an output to any input by the user, although the quality of the response is usually lower compared to IR systems. It is an attractive solution as it allows for the creation of chatbots simply by training a model on a dataset.
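The information-retrieval idea described above can be sketched as follows. The stored question-answer pairs and the word-overlap scoring are hypothetical stand-ins; an actual IR system would typically use a stronger similarity measure (the thesis later uses embedding-based cosine similarity).

```python
# Toy information-retrieval chatbot: return the answer whose stored question
# overlaps most with the user query. Pairs and scoring are illustrative only.
QA_DATABASE = {
    "what is your favorite sport": "I like football the most.",
    "do you have any pets": "I have a cat.",
}

def retrieve_answer(query: str) -> str:
    query_words = set(query.lower().replace("?", "").split())
    def overlap(question: str) -> int:
        return len(query_words & set(question.split()))
    best_question = max(QA_DATABASE, key=overlap)
    # If nothing overlaps, the retrieved reply may look random -- the flaw noted above.
    return QA_DATABASE[best_question]

print(retrieve_answer("What sport do you like?"))  # I like football the most.
```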

2.2 Generative Models

Generating outputs, such as text, is a sequential task. Deep Neural Networks (DNN) can learn vector representations of text and generate text (a sequence of words) from those representations. The fundamental neural network architecture which uses sequential information is the Recurrent Neural Network (RNN) architecture, which has also previously been used for text generation [43]. The network consists of a number of sequential hidden state vectors, one for each time-step (memory), making the network deep in the time direction. The hidden state at a specific time-step is calculated from the previous hidden state and the input at that time-step, which leads to every future hidden state having some information from the previous hidden states. The model is also cheap to train by a method called 'backpropagation through time' (BPTT), as it shares trainable weights over the entire sequence. A big issue with RNNs, however, is that of information getting lost over time, due to the vanishing (or exploding) gradient problem1. Simply put, gradients increase or decrease exponentially with respect to N in an N-layer network, which leads to long training times and difficulties for the model to remember information from previous time-steps.

Two models that solve this problem are the Long Short-Term Memory (LSTM) model [16] and the Gated Recurrent Unit (GRU) model [8], which are designed to remember information over long periods of time by the use of "memory" units. GRUs are considered a slight variation of LSTMs but function similarly. The main difference is that GRUs are slightly faster, due to a simpler architecture, for better and for worse [9]. LSTMs and GRUs are in turn used in recent generation-related architectures such as the Sequence-to-Sequence (Seq2Seq) model [44], which in turn may have inspired the Transformer [45] architecture which is used today (e.g. GPT-2).
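A minimal PyTorch sketch of the recurrent idea described above: an embedding layer feeds an LSTM whose hidden state carries information from previous time-steps, and a linear layer predicts the next word. The vocabulary size and dimensions are arbitrary illustration values, not settings used in the thesis.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # arbitrary illustration values

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)        # predict the next word at each step

tokens = torch.randint(0, vocab_size, (1, 10))      # a batch with one 10-word sequence
hidden_states, _ = lstm(embedding(tokens))          # one hidden state per time-step
next_word_logits = to_vocab(hidden_states[:, -1])   # prediction after the last word
print(next_word_logits.shape)                       # torch.Size([1, 1000])
```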

2.2.1 Sentence and Word Embeddings

Word embeddings, e.g. Word2vec [28], GloVe [31], etc., are vector representations of words found through neural networks. Another description is that word embeddings allow for representing a word as a vector of real numbers. The embeddings are the result of training a neural network to predict the probability of a word in a context (continuous bag-of-words). A word embedding can express semantic and syntactic aspects of the word to some extent. Word vectors also exhibit additive composition [28], e.g. "king - man + woman ≈ queen". One application for word embeddings is in neural networks, by encoding the words as vectors to be used as input (and potentially as output).

Cosine similarity is a metric commonly used in the context of Natural Language Processing to measure similarity of texts. The similarity is found by transforming the texts into normalized vectors and thereafter calculating the cosine of the angle between the two vectors (or texts). As such, by using word embeddings it is possible to calculate how similar one word is to another, using cosine similarity. This is possible because the models (e.g. Word2vec) learn which words commonly occur in the same context, which then leads to those words being closer together in vector space: "You shall know the word by the company it keeps".

1 http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15%20Exploding%20and%20Vanishing%20Gradients.pdf

The closer the words are in the vector space, the higher the cosine similarity. The range of similarity is between 0.0 and 1.0, or sometimes from -1.0 to 1.0, where a value close to 1.0 means that the texts are identical or very similar. Given the word embeddings of the words king, queen, cat and dog, the embedding of king will be closer to the embedding of queen than it will be to cat or dog, see figure 2.1. Similarly, the embedding of cat will be closer to the embedding of dog.

Figure 2.1: Word embedding similarity matrix using spaCy’s tokenizer and similarity function.

Extending the concept of word embeddings for words into sentence embeddings for sentences, it is possible to calculate the cosine similarity of two sentences. An example of a model that produces sentence embeddings is the Universal Sentence Encoder [6] developed at Google. Finding the similarity between sentences may be useful as part of clustering (3.4), answer ranking (4.4.4) or similar applications. As such, word and sentence embeddings, as well as cosine similarity, will be utilized throughout this thesis.
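The kind of word-similarity comparison shown in figure 2.1 can be reproduced roughly as below with spaCy; a model that ships with word vectors (e.g. en_core_web_md) is assumed to be installed. An explicit cosine computation on the raw vectors is shown for comparison.

```python
import numpy as np
import spacy

# Assumes a spaCy model with word vectors, e.g.: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
king, queen, cat, dog = nlp("king"), nlp("queen"), nlp("cat"), nlp("dog")

# spaCy's similarity() is the cosine of the angle between the word vectors.
print(king.similarity(queen))  # higher: king and queen occur in similar contexts
print(king.similarity(cat))    # lower
print(cat.similarity(dog))     # higher again

def cosine(a, b):
    """Cosine similarity between two (word or sentence) embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king.vector, queen.vector))  # same value as king.similarity(queen)
```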

2.2.2 Fine-Tuning and Transfer Learning

The terms 'fine-tuning' and 'transfer learning' have been observed to be used interchangeably. The concept is to use an existing model that has been pre-trained on a specific task by using a large dataset, and then to apply it to a different but related task by fine-tuning the model or using transfer learning [29]. The difference is that fine-tuning uses the trained weights of the pre-trained model as a starting point and continues training the weights on a new dataset. Transfer learning has more technical implications, as the pre-trained model is used in some way as part of a new model. One example of transfer learning would be to train a new model on a different dataset and use the pre-trained model in a pipeline with the new model, see figure 2.2.

The concept is useful when there is significantly less data available for a sub-task than for a more general task. Consider the general task of recognizing vehicles in images and the sub-task of classifying what type of vehicle it is (e.g. truck or car).

The labeled data (assuming supervised learning) for recognizing vehicles may be significantly larger than a dataset that specifies the vehicle type. As such, the smaller dataset, if used to train a model from scratch, may not be able to train the model to recognize vehicles as well as the larger dataset. Therefore, by first training a model on the larger dataset for the general task, and then using the pre-trained model to fine-tune for the specific sub-task, the results may be better.

Figure 2.2: Transfer learning example where the general task model’s knowledge can be transferred to the sub-task model.

The same concept can be applied to the task of text generation. First, a model is pre-trained to understand and generate coherent text by using a large dataset with many different types of human-written text, and then fine-tuned for the sub-task of dialogue [51]. The concept of fine-tuning will be utilized in this thesis on existing large pre-trained generative models trained for text generation, by fine-tuning the models on the task of question answering.
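A minimal sketch of what fine-tuning a pre-trained GPT-2 model on question-answer pairs could look like with the Hugging Face transformers library (a recent version is assumed) and PyTorch. The example pairs, the way question and answer are concatenated, the hyperparameters and the output directory are illustrative assumptions, not the exact training setup described in chapter 3.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the 124M model
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Illustrative question-answer pairs; the real fine-tuning data is described in chapter 3.
pairs = [
    ("What are your hobbies?", "I enjoy hiking and cooking."),
    ("Do you have any pets?", "Yes, I have a cat."),
]

model.train()
for epoch in range(3):                               # arbitrary number of epochs
    for question, answer in pairs:
        text = question + " " + answer + tokenizer.eos_token
        input_ids = tokenizer(text, return_tensors="pt").input_ids
        # Language-modeling objective: labels equal the inputs (shifted internally).
        loss = model(input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-finetuned-qa")           # hypothetical output directory
```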

2.2.3 Seq2Seq or Encoder-Decoder

The Sequence-to-Sequence (Seq2Seq) model, also known as the Encoder-Decoder model, was first introduced by researchers at Google in 2014. Seq2Seq is a method that makes use of the Long Short-Term Memory (LSTM) model in an encoder and decoder architecture [44]. It takes as input a sequence of words and generates an output sequence of words. The encoder reads the input sequence one time-step at a time and maps the sequence to a large fixed-dimensional vector representation (e.g. a word embedding (2.2.1)). The decoder then gets information from the encoder to decode the next output, while considering the decoder's previous output. The model obtained a BLEU score of 34.81 on the WMT'14 English-to-French translation task, which was the best result achieved by direct translation with large neural networks at the time of its release (2014). BLEU is a common metric used for evaluation in machine translation, based on word overlap [30].

The release of the model led to researchers training and testing the model as a conversational agent [46]. The researchers were able to train an end-to-end conversational agent by using a data-driven approach. End-to-end training in this case means the ability to train the model into a conversational agent from scratch by using a dataset. Although it was possible to train the model for the purpose of acting as a chatbot, the model tended to give short and unengaging responses. Additionally, the model did not capture a consistent personality. The result can be illustrated by one-turn interactions, extracted from a longer conversation. The flaws of short and dull answers, and of an inconsistent personality, have since been further researched in [23] and [24], respectively.

Human-Agent one turn interaction Human: what do you like to talk about ? Machine: nothing .


Human-Agent inconsistency Human: what is your job ? Machine: i ’m a lawyer . Human: what do you do ? Machine: i ’m a doctor .

Today, several tutorials exist that illustrate how a chatbot can easily be created by using Seq2Seq2. A variation of the model was also used by Microsoft as part of their chatbot XiaoIce [59] (2.4.4). While the Seq2Seq model has been used in recent years in chatbot applications, it will not be used in this thesis, as the model has been superseded by Transformer models. At the time of writing, Transformer-type models have state-of-the-art results on several Natural Language Processing tasks.

2.2.4 Transformer

The Transformer model was first introduced in the paper "Attention Is All You Need" [45] in 2017, where it hit new records on the WMT'14 translation task. The model made use of the encoder-decoder concept as seen in Seq2Seq but discarded the use of RNNs, instead using an attention mechanism and feed-forward neural networks. The attention mechanism allows the decoder to have access to all hidden vectors in the input, which allows for better context understanding. At each time-step in the decoder, a prediction of the next word output is made by using the weighted sum of all hidden vectors in the input. Additionally, the model's architecture allows for parallelization, which led to faster training times. Since it no longer uses RNNs, and computation is parallelized, positional encoding is used to inform the model of the word order of sentence sequences.

Since it was first introduced, the architecture has been used in several new models, most notably Bidirectional Encoder Representations from Transformers (BERT) [11] and Generative Pretrained Transformer 2 (GPT-2). While the original Transformer used an encoder-decoder structure, the BERT model only consists of encoders3, while GPT-2 only consists of decoders4. The architecture has led to several state-of-the-art models in the field of natural language processing, some of which score higher than their respective human baseline. The NLP benchmarks GLUE5, SuperGLUE6, SQuAD7, CoQA8 and QuAC9 are, at the time of writing, dominated by transformer models, most commonly by variations of BERT such as ALBERT and RoBERTa.

Transformer models can be abstracted into three groups: Sequence-to-Sequence, auto-regressive, and auto-encoding models. Sequence-to-Sequence models, such as the original Transformer, consist of an encoder and a decoder part; the natural application is translation. Auto-regressive models, such as GPT-2, are pre-trained to predict the next token (word) given the previous inputs. They correspond to the decoder part of the original Transformer, and the natural application is text generation. Auto-encoding models, such as BERT, are pre-trained by first masking input tokens and then attempting to reconstruct the original input. They correspond to the encoder part of the original Transformer, and the natural application is sentence classification or token classification. All three groups of models can be trained for various NLP tasks depending on their characterization.

A library by Hugging Face, Transformers, contains several pre-trained state-of-the-art Transformer models10, such as GPT-2, BERT [11], Text-To-Text Transfer Transformer (T5) [33], etc. As such, the library enables comparison of different models.

2 https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
3 http://jalammar.github.io/illustrated-bert/
4 http://jalammar.github.io/illustrated-gpt2/
5 https://gluebenchmark.com/leaderboard
6 https://super.gluebenchmark.com/leaderboard
7 https://rajpurkar.github.io/SQuAD-explorer/
8 https://stanfordnlp.github.io/coqa/
9 https://quac.ai/
10 https://huggingface.co/transformers/model_summary.html

Hugging Face also hosts an online interactive demo of generative transformers11. This thesis will consider auto-regressive models (GPT-2) for text generation, due to the availability of models which have been pre-trained with the text generation task in mind.

2.2.4.1 GPT-2

Generative Pretrained Transformer 2 (GPT-2) was released as a pre-trained transformer model in 2019 by OpenAI, following the previous model under the same name (GPT). The model achieved state-of-the-art results on 7 language modeling datasets [32]. The main difference from the previous model is that the new model comes in different sizes and is trained on a larger dataset. The models released by OpenAI were pre-trained on 40 GB of internet text [32]. The different versions are ‘124M’, ‘355M’, ‘774M’ and ‘1558M’ [39]. The names represent the size of the models: ‘124M’ consists of 124 million parameters whereas ‘1558M’ consists of 1.5 billion parameters. The models generally generate “better” texts as the number of parameters increases, as observed in a study by [39] on human detection of generated news articles. The models were given a credibility score in the range 1-10, where the score is composed of independent clarity, accuracy and believability scores. The 355M model had a mean credibility score of 6.07, while the 774M model scored 6.72 and the 1558M model scored 6.91. Although the score increases with the size of the models, the improvement from the 355M model to the 774M model is more significant than the improvement from the 774M model to the 1558M model in terms of the credibility score. The larger the model is, the longer it takes to fine-tune (2.2.2), and the longer it takes to generate text once the model is fine-tuned (A.3). Following are three short, cherry-picked examples of text generated by the largest GPT-2 model, given an input prompt; a short code sketch of such prompt-based generation follows the examples. One longer example is provided in appendix A.1.

Text generation example using the 1558M model Prompt: What are your hobbies? Generated: I enjoy hiking, biking, and hanging out with my family. I also love to cook. I love to learn new skills, and I love to share my love of cooking with others.

Another example, with the input prompt in italic Get rich quick with these 5 steps to get rich quick: 1. Write a check for $100,000. 2. Donate the money to charity. 3. Get a job. 4. Get a divorce. 5. Be a millionaire.

Third text generation example The steps to getting famous are: 1. Get a lot of people to know you 2. Start a website 3. Start a blog 4. Start a YouTube channel 5. Make a living from it 6. Be really good at something 7. Use your fame to get things done 8. Do stuff that will get you noticed 9. Make money from it 10. Get famous. That’s a lot of steps to get famous, and I don’t think it’s going to happen to me.
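A hedged sketch of how prompt-based generation similar to the examples above can be reproduced with the Hugging Face transformers library. Loading "gpt2-xl" instead of "gpt2" would correspond to the 1558M model used for the examples; the sampling parameters below are common defaults, not the exact settings used to produce them.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # "gpt2-xl" would load the 1558M model
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "What are your hobbies?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; top-k/top-p values are common defaults, not the thesis settings.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```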

2.2.4.2 Distillation

As language models, especially transformer models, were becoming larger and larger, the concept of distillation was applied to create distilled versions of the large models. The method is to down-scale the architecture into a new (student) model, which is trained to reproduce the behavior of a larger (teacher) model. This results in smaller and faster versions which still retain most of the larger model's capabilities.

11https://transformer.huggingface.co/

DistilBERT [37] is an example of distillation, where the BERT model's size was reduced by 40% (from 110 million parameters to 66 million) and runs 60% faster, while retaining 97% of its language understanding capabilities. The concept was also applied to the RoBERTa model and the 124M-sized GPT-2 model to create DistilRoBERTa and DistilGPT-212. Therefore, distilled models, such as DistilGPT-2, can be an alternative to consider to achieve faster text generation, if the larger models take too long to generate text in a chatbot setting.
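For illustration, a distilled model can be loaded in exactly the same way as the original and its parameter count compared; the snippet below is a small sketch of that comparison, and the exact counts depend on the library version.

```python
from transformers import GPT2LMHeadModel

full = GPT2LMHeadModel.from_pretrained("gpt2")             # the 124M GPT-2 model
distilled = GPT2LMHeadModel.from_pretrained("distilgpt2")  # its distilled student

# Roughly 124M vs. ~82M parameters; the distilled model generates noticeably faster.
print(full.num_parameters(), distilled.num_parameters())
```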

2.2.4.3 Other Auto-Regressive Models

Other than the GPT models, XLNet, Transformer-XL and CTRL are three additional models which are available through Hugging Face's transformer library. CTRL consists of 1.6B parameters and takes too long to generate text to be considered as a generative model for the thesis. XLNet improves upon the Transformer-XL model and is a lot faster. Extensive testing was not performed with either XLNet or Transformer-XL, although it seemed as if XLNet was significantly slower than GPT-2 at text generation, after using Hugging Face's API13. As such, a decision was made to use GPT-2 for text generation.

Figure 2.3: XLNet generating 59 words in 27.669 seconds.

Figure 2.4: GPT-2 (124M) generating 37 words in 1.270 seconds.

2.3 Persona-Chat Task and Agents

A dataset called Persona-Chat was developed with the aim of allowing for the development of more engaging chatbots with consistent and identifiable personas [56]. The dataset consists of conversations between crowdworkers who were randomly paired and asked to act out a persona, based on a short description, while keeping the chat natural. The dataset was used in The Second Conversational Intelligence Challenge (ConvAI2)14, which was a competition held to find approaches to developing engaging chatbots capable of open-domain conversations [12]. The dataset and the competition aim toward finding solutions to some of the common problems with chatbots. The considered problems are:

1. The lack of a consistent personality, due to the models being trained on data consisting of dialogues from different speakers [46] [24].

2. Lack of explicit long-term memory, due to being trained to produce utterances given only recent conversation history.

3. A tendency to produce vague, non-specific answers such as "I don’t know", which are not engaging for human users [23].

12 https://github.com/huggingface/transformers/tree/master/examples/distillation
13 https://huggingface.co/xlnet-base-cased
14 http://convai.io/2018/


The competition contained automatic and human evaluations. Human evaluation was done through crowdworking, following a method similar to that used in the creation of the Persona-Chat dataset. The human users interacted with the agent for 4-6 dialogue turns, following a random persona provided to them. The users were then asked to answer the question "How much did you enjoy talking to this user?" on a scale of 1-4. The users were also tested on whether they could distinguish the persona the model was using from a random one. Automatic evaluation was done through three metrics:

1. Perplexity - a metric of text fluency.

2. F1 score - applied as a metric for word-overlap between the generated response and the gold response (a small sketch of this computation follows the list below).

3. Accuracy of the next utterance selection when given random distractor responses mixed with the gold response (accuracy of ranking).
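As referenced in the list above, the F1 score is used as a word-overlap measure between a generated response and the gold response. The function below is a simplified illustration of that idea, not the exact tokenization or scoring script used in ConvAI2.

```python
def word_overlap_f1(generated: str, gold: str) -> float:
    """Simplified word-overlap F1 between a generated response and the gold response."""
    gen_words, gold_words = generated.lower().split(), gold.lower().split()
    overlap = len(set(gen_words) & set(gold_words))
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_words)   # fraction of generated words found in the gold response
    recall = overlap / len(gold_words)     # fraction of gold words found in the generated response
    return 2 * precision * recall / (precision + recall)

print(word_overlap_f1("i love to cook and bake", "i love baking cakes"))
```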

Hugging Face had the best performing agent (2.4.3) in the automatic evaluation and came second in the human evaluation, with an engagingness score of 2.67. The winner, "Lost In Conversation", achieved a score of 3.11. Analysis was conducted to find out how the highest scoring model from the automatic evaluation was bested in the human evaluation. The conclusion was that Hugging Face's model tended to ask too many questions, which disrupted the balance of question-asking and question-answering.

Automatic evaluations are still flawed in comparison to human evaluation. Notably, always replying with the response "i am you to do and your is like" would outperform the word-overlap measure of all models [12]. This is related to the findings that word-overlap metrics do not correlate well with human judgement [25]. When adding the last utterance of the conversation as one of the distractor responses for the utterance selection metric, it was observed that most models suffered, resulting in parroting. This illustrates that these models may have relied too much on candidate response ranking influenced by word-overlap with the previous utterance.

Through human evaluation of persona detection, Hugging Face achieved a score of 0.98 and Lost In Conversation achieved a score of 0.9. The human-to-human persona detection score was 0.96. It is clear that the persona can be recognized confidently. However, as models tend to repeat (persona) sentences, this may lead to a high persona detection score but a lower engagingness score. As such, training models to use the persona to create engaging responses, rather than copying/repeating it, remains an open problem [12].

A final conclusion from the competition is that the best observed models (in the competition) were variations of the generative Transformer architecture. That being said, the competition was held before the BERT and GPT-2 models were released, which may have further potential. Generative models worked better than retrieval models on this task (out of the observed models). Comparing retrieval and generative models without human judgement remains an open problem for the dialogue task. Models that do not have a balance of question-asking and question-answering in conversations may suffer in human evaluation in terms of engagingness.

Models still suffer from inconsistency, such as generating responses like "I work as a snowboard instructor" followed by "I work for a food company". Additionally, models tend to ask questions that have already been answered previously. These issues may be solved through the development of a memory structure and/or with the help of Natural Language Inference [50]. The competition and the competing models are useful for this thesis both as a guideline as to how chit-chat models are developed, and for comparison of engagingness, allowing results of user tests to be compared to previous chit-chat agents.


2.4 Relevant Conversational Agents

Well-known or established social chatbots, such as Mitsuku, XiaoIce and Replika, are investigated to find the pros and potential cons of the methods used in each chatbot.

2.4.1 Long-Term Engagingness

Two studies were conducted measuring the effects of agent behavior variability and the effects of back stories on subjects in the area of long-term interventions with relational agents [3]. The system was designed to motivate users to exercise, e.g. to take walks regularly. The results were measured both by how often the participants interacted with the system and by requesting the subjects to complete a questionnaire at the end of each interaction. The questionnaire for the first study measured the subjects' desire to continue using the system ("How much would you like to continue working with Karen?") and the repetitiveness ("How repetitive are your conversations with Karen?"). The second study had a questionnaire with three items asking how much they "enjoy the stories that the counselor tells", "look forward to talking to the counselor", and "feel that the counselor is dishonest". Each item was assessed on a 5-point rating scale ranging from "not at all" to "very much".

It was observed that techniques that made the agent more lifelike and human increased user engagement. The first study showed that increased variability in the agent's behavior led to increased engagement. The second study showed that personalizing the agent by giving it an artificial auto-biography also led to increased engagement. These findings motivate the chatbots in this thesis to be developed by considering aspects that may increase perceived human-likeness, such as memory, persona and variability.

2.4.2 Mitsuku

Mitsuku is a rule-based chatbot that is built using the Artificial Intelligence Markup Language (AIML). It is built on top of another famous chatbot called ALICE [48]. The creator of the chatbot, Steve Worswick, has worked on the agent a few hours at a time since around 200415. Mitsuku has won the greatest number of Loebner Prizes (5 times), where it was deemed the most human-like chatbot in the competition.

Mitsuku has a limited memory where it saves information about the user and some contextual keywords, see figure 2.5. By using the keywords, the agent is able to understand a question like "What do you think about him?" after previously talking about Eminem. Some observed flaws with the chatbot are its repetitiveness, where it uses the same template answer repeatedly, and that it sometimes fails to extract information that it uses in future templates16. This thesis will partially utilize memory slots to store information about the user, similar to Mitsuku; a small illustrative sketch of this kind of slot memory follows the example below.

Human: I like eminem Mitsuku: What is your favourite Eminem song? You like Eminem. Human: I like several of his songs. Mitsuku: I like The Bible. You like several of.
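A small sketch of the slot-filling memory described above: a rule extracts what the user says they like and stores it, so that a later pronoun question can be resolved. The rules and slot names are illustrative only, not Mitsuku's actual implementation.

```python
import re

memory = {"likes": [], "last_topic": None}  # illustrative slots, not Mitsuku's internals

def update_memory(message: str) -> None:
    match = re.search(r"i like (.+)", message, re.IGNORECASE)
    if match:
        topic = match.group(1).strip(".!? ")
        memory["likes"].append(topic)
        memory["last_topic"] = topic  # lets "him/her/it" refer back to this topic later

def answer(message: str) -> str:
    if "think about" in message.lower() and memory["last_topic"]:
        return f"You told me you like {memory['last_topic']}."
    return "Tell me something about yourself."

update_memory("I like Eminem")
print(answer("What do you think about him?"))  # You told me you like Eminem.
```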

2.4.3 Hugging Face’s Persona Chatbot

The company called Hugging Face participated in the ConvAI2 competition, where the aim of the competition was to find approaches to developing engaging chatbots capable of open-domain conversations [12]. A variation of the chatbot that was developed for the competition is available (at the time of writing) online17, where the method is described in a chatbot tutorial18.

15 http://aidreams.co.uk/forum/index.php?page=Steve_Worswick_Interview_-_Loebner_2013_winner
16 http://www.square-bear.co.uk/mitsuku/chat.htm

Figure 2.5: Illustration of possible types of information that Mitsuku can remember from conversations

Their agent generates responses by using a transformer model. A persona and conversation history are considered when generating each response [51]. A short interaction with the demo reveals that the agent does illustrate a persona; however, it suffers from severe repetition.

Short interaction example
User: hello, how are you today?
Machine: i am good, just playing my computer games, you?
User: which games?
Machine: i play a lot of video games, i love them
User: what is your favorite game?
Machine: i'm a game person, i love playing games
User: what else do you like doing?
Machine: i like to watch the world series
User: What genre of games is your favorite?
Machine: i like to watch the world series.
User: that is cool. what is your favorite game?
Machine: i love to play video games.

2.4.4 Microsoft's XiaoIce
XiaoIce is a social chatbot developed at Microsoft, which is based on a hybrid AI engine that combines rule-based and corpus-based (or "data-driven") methods [59]. It was designed as an AI companion that considers and recognizes human feelings for the purpose of establishing long-term relationships with human users.

17https://convai.huggingface.co/
18https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313


The team behind XiaoIce has developed over 230 skills for the agent, consisting of both non-task-oriented and task-oriented skills such as question answering, recommending movies or restaurants, storytelling, ticket-booking, etc. Additionally, the chatbot features image commenting, where given an image by a user, the agent will comment on the image, and content creation such as generating poetry together with human users.

The part that is most related to this thesis is their "Core Chat" component, which handles open-domain conversations with users. The component is described as a data-driven response generation system. It consists of two retrieval models, one generative model, an answer ranker, as well as editorial responses. The Core Chat component is combined with a component called "Empathetic Computing". It calculates a dialogue state vector consisting of a contextual query, the conversation context, and query and response empathy vectors. The conversation context consists of a number of the previous messages in the conversation. The contextual query is created by rewriting the user query using information from the conversation context. The empathy vector encodes the user's feelings and states in the conversation. The response empathy vector encodes an empathetic aspect, based on the agent's personality and the current situation with the user, that is expected in the agent's next response.

The first retrieval model stores paired data, consisting of query-response pairs. These pairs are collected from two sources: the internet (social media, forums, news comments, etc.) and the human-machine conversations generated when users interacted with XiaoIce. It is estimated that 70% of the agent's responses nowadays are retrieved from the agent's past conversations. The data collected from the internet is filtered by converting each query-response pair into a tuple, consisting of the contextual query, the response, and the user and response empathy vectors, by using the empathetic computing module. The data is filtered based on the tuples to only retain pairs that contain empathetic responses that fit XiaoIce's persona. Retrieval structures cannot cover all topics, especially new topics, unless the database is regularly maintained.

A test was performed using three models: a retrieval-based model, a neural-generator-based model and a hybrid of the two. During testing, the retrieval-based model achieved the second highest score, and the neural-generator-based model achieved the lowest score. It was found that the retrieval-generative hybrid model achieved a higher score than the two other models by themselves [59]. Therefore, a neural response generator was added to the Core Chat component, to turn it into a retrieval-generative hybrid component. The generator was trained using the paired data from the retrieval database. The model was based on the sequence-to-sequence (seq2seq) framework [44]. A variation of a persona-based model, which is a type of seq2seq model, was used for XiaoIce's neural response generation to reduce the occurrences of inconsistencies in outputs, such as an inconsistent persona [24]. The generator takes the contextual query and a linear combination of the empathy vectors as input to generate responses that consider the context, the user profile, and the agent's persona. Beam search is used to generate up to 20 response candidates.
An additional retrieval model that stores "unpaired" data was introduced as complementary to the other two models, to further improve the coverage of the agent's responses. Although the neural response generator is able to provide a response to any topic, its responses are shorter and contain less useful content with regard to the topic compared to the response candidates from the unpaired database. The unpaired database consists of sentences collected from public lectures and quotes in news articles and reports. The data is yet again filtered to align with the agent's persona. The unpaired database should not be used by itself, or else it risks retrieving responses that repeat what the user just said. Therefore, a knowledge graph was constructed which contains triples of the form head-relation-tail. Each triple contains a pair of related topics (head, tail) that people often discuss in one conversation, e.g. (Einstein, Relativity) or (Quantum Physics, Schrodinger's cat). Then, given a contextual query, a topic is extracted from the query and a related topic is found through the knowledge graph, and thereafter a query can be made to the database with the two topics.


A boosted tree ranker ranks the candidate responses from the three models. A pre-set threshold is used to classify whether the candidate responses are acceptable, determined by whether the candidates' ranking scores are above the threshold. A response is then chosen by randomly sampling from the acceptable candidates. The ranker calculates scores based on four categories of features.

1. Local cohesion features, how related is the candidate response to the contextual query?

2. Global coherence features, how coherent is the candidate response with the conversation context and the contextual query?

3. Empathy matching features, how well does the candidate response match the personality of the agent?

4. Retrieval matching features, for responses from the paired database, how well does the query in the query-response pair match the contextual query?

If no valid response is retrieved or generated, then an editorial response is given. Examples of editorial responses that XiaoIce may give are "Hmmm, difficult to say. What do you think?" or "Let us talk about something else." Although the chatbot sounds ideal when reading through the report, as it considers both a persona and the context of the conversation, investigation of interactions with the agent shows a lack of context understanding19. It attempts to mask this lack of understanding by changing the topic. The research paper does express the need for breakthroughs in memory modeling, as well as many other areas, for chatbots to achieve human-level intelligence [59].
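As an illustration of the threshold-based selection described above, the following minimal Python sketch filters ranked candidates and samples one of the acceptable responses at random. It is not XiaoIce's actual implementation; the ranking function, threshold value and fallback responses are placeholders supplied by the caller.

Sketch: threshold-based candidate selection (illustrative)
import random

def select_response(candidates, rank, threshold, editorial_responses):
    # Keep only candidates whose ranking score is above the threshold and
    # sample one of them at random; otherwise fall back to an editorial response.
    acceptable = [c for c in candidates if rank(c) > threshold]
    if not acceptable:
        return random.choice(editorial_responses)
    return random.choice(acceptable)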

2.4.5 Meena
Meena is a chatbot created by Google [1]. The model is an end-to-end trained neural conversational model with 2.6 billion parameters. It uses a type of Transformer architecture called the Evolved Transformer [38] and was trained on 341GB of filtered social media conversations. Compared to the largest GPT-2 model, Meena has 1.7x greater model capacity and was trained on 8.5x more data. The trained chatbot was presented as being more sensible and specific than existing state-of-the-art chatbots, as backed up by crowd-sourced human evaluation. The user test was done by collecting about 100 conversations for each of the well-known open-domain chatbots Mitsuku, Cleverbot, XiaoIce [59] and DialoGPT [57]. Conversations between humans and Meena, Mitsuku and XiaoIce are available on Google's GitHub20. For consistency, each conversation started with the same greeting from the user, "Hi!". The crowd workers answered two questions for each utterance from the chatbot, "does it make sense?" and "is it specific?", to measure whether the response makes sense in context and whether the response is specific to the conversation or a general response. Using the test data, an evaluation metric was created as the average of the fraction of responses that are marked "specific" and "sensible". Meena scored the highest and XiaoIce scored the lowest21. However, the report noted that due to how the researchers performed the test of XiaoIce, the model's user profiling suffered, as well as its context state. From looking at sample conversations between Meena and human users, it can be observed that the chatbot has impressive context understanding in comparison to other chatbots, but it still illustrates flaws. Two such flaws are repetition and occasionally interacting as if the chatbot's previous response was from the user. Furthermore, the work focused on sensibleness and specificity, and as such does not consider long-term memory or persona.

19https://github.com/google-research/google-research/blob/master/meena/xiaoice.txt
20https://github.com/google-research/google-research/tree/master/meena
21https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html


Repetitive Example - Meena Conversation 92
Human: Hi!
Meena: Hello! What are you up to today?
Human: Do you like sushi?
Meena: I love sushi! What kind of rolls do you enjoy?
Human: I like tuna roll of course
Meena: Tuna Rolls are the best! How do you feel about spicy tuna rolls?
Human: They're great
Meena: Spicy tuna rolls are the best! What kind of rolls do you like best?
Human: I already told you. What do you like?
Meena: I like spicy tuna rolls! Do you like spicy tuna rolls?
Human: do you like any other rolls?
Meena: I like spicy tuna rolls and spicy salmon rolls! Do you like spicy salmon rolls?
Human: yeah. Salmon is so tasty
Meena: What's your favorite kind of roll?
Human: I think my favorite actually is the Salmon roll
Meena: What about spicy tuna rolls?

Inconsistency Example - Meena Conversation 1
Human: Hi!
Meena: Hey there! What's up?
Human: Same old, Monday blues...what are you up to?
Meena: I'm procrastinating on a paper I'm supposed to be writing.
Human: are you thinking of pulling an all nighter?
Meena: Possibly. What's your paper on?
...

Context Understanding Example - Meena Conversation 83
...
Human: What'd you do today?
Meena: Stayed home and played video games.
Human: Oh nice! What games?
Meena: Dying Light, Borderlands 2, and Bioshock Infinite.
Human: Oh interesting. Which is your fav?
Meena: Dying Light, hands down.
Human: Why?
Meena: It's just so fun.
Human: Yeah but like what about it? Is it an FPS?
Meena: It's like Mirror's Edge, only in a zombie apocalypse.
Human: Oh that's kinda cool!
Meena: It's hella fun.
...

2.4.6 Replika
Replika is a chitchat system that, like XiaoIce, combines neural generation and retrieval-based methods. It is important to note that there is no research paper about the system as a whole, and most information is found in slides at the developer's GitHub22. The neural generation considers persona-based embeddings as proposed in [24] and emotional embeddings as proposed in [58] when generating responses23.

22https://github.com/lukalabs/replika-research
23https://github.com/lukalabs/replika-research/blob/master/scai2017/replika_ai.pdf

The retrieval-based model is trained based on a hard negative mining approach, to force the model to produce low matching scores for similar contexts and responses. This is to avoid "echo-responses", where a retrieval-based model retrieves the most semantically similar response rather than the most appropriate one [13]. The system also considers conversation history and context by encoding it and providing it to the retrieval and/or generative models. The generative model used was at one point a Hierarchical Recurrent Encoder-Decoder (HRED) [40], which is an extension to Seq2Seq that considers context. However, in an open source repository24, the developers suggest using Transformer-based models instead. Additionally, the agent has the ability to comment on and ask questions about images sent by the user. Although no official evaluation results have been published measuring the system's engagingness, etc., the agent is available online to interact with25. The system has a built-in feedback mechanism which allows users to give a thumbs up or thumbs down on any response provided by the agent. With the help of this feedback, a reranking component was added to the system by training a BERT transformer model to classify whether a sentence would lead to a thumbs up or a thumbs down26.

2.5 User Testing and Evaluation

Throughout the years, work has been done toward creating standardized measurement tools in the fields of human-robot interaction and dialogue agents. Three examples of this are PARADISE [47], Godspeed [2] and SASSI [17].

2.5.1 PARADISE
PARADISE is a general framework for evaluating and comparing the performance of spoken dialogue agents [47]. The framework is intended for task-oriented, spoken dialogue agents, which differ from the non-task-oriented, chat-based dialogue agents focused on in this thesis. Considering that both agent types are dialogue agents, there may exist evaluation methods that are useful for both of them. In the case studies for the paper, results were collected by three means: recordings of the user interaction, logging of various system information, and a survey. The recordings were used to transcribe user utterances, measure the elapsed time and collect additional information, such as whether there occurred any delay in the agent's responses. Logging was used to record what decisions the system made at each state, as well as additional information. An example of additional logged information was automatic speech recognition (ASR), where the user utterance as perceived by the agent was saved and compared with the actual, transcribed user utterance, to measure the accuracy of the recognition. A survey was used to measure text-to-speech performance, ASR performance, task ease, interaction pace, user expertise, expected behavior, comparable interface and future use. Most of the questions used a 1-5 scoring metric from "almost never" to "almost always". Some questions had yes/no/maybe responses. The questions were as follows, where ELVIS is an agent for accessing email over the phone.

• Was ELVIS easy to understand in this conversation? (TTS performance)

• In this conversation, did ELVIS understand what you said? (ASR performance)

• In this conversation, was it easy to find the message you wanted? (Task Ease)

• Was the pace of interaction with ELVIS appropriate in this conversation? (Interaction Pace)

24https://github.com/lukalabs/cakechat
25replika.ai
26https://github.com/lukalabs/replika-research/blob/master/scai2019/replika_scai_19.pdf

18 2.5. User Testing and Evaluation

• In this conversation, did you know what you could say at each point of the dialogue? (User expertise)

• How often was ELVIS sluggish and slow to reply to you in this conversation? (Expected behavior)

• In this conversation, how did ELVIS's voice interface compare to the touch-tone interface to voice mail? (Comparable interface)

• From your current experience with using ELVIS to get your mail, do you think you’d use ELVIS regularly to access your mail when you are away from your desk? (Future use)

The same three means of result collection can be applied to the user tests in this thesis as well. Instead of audio recordings, the message history is stored with timestamps. Similarly, logging can be used to store system state information for each message/action, to later be analyzed and compared to the message history. Finally, a survey will be used to collect the test users' perception of the agent.

2.5.2 Godspeed
Godspeed is a series of questionnaires to measure users' perception of robots. The questionnaires cover five key concepts in human-robot interaction: anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety [2]. Anthropomorphism refers to the attribution of human behavior, emotions or characteristics to non-human entities such as robots or animals. It is used to measure how human the user perceives the agent to be. Animacy measures life-likeness and perceived safety measures the user's emotional state. Intuitively, likeability and perceived intelligence measure how likeable and how intelligent the user finds the agent. The questionnaires consist of a number of adjectives paired with their corresponding antonyms. Users are then instructed to rate each pair from 1 to 5. An example pair is Fake/Natural, where 1 means fake and 5 means natural. A few more examples of adjective pairs from the questionnaires are Machinelike/Humanlike, Unfriendly/Friendly, Unintelligent/Intelligent and Agitated/Calm. Out of the five concepts, animacy and perceived safety are not as relevant for chatbots as they may be for robots. Anthropomorphism is a useful concept to measure perceived realness, and likeability may correlate with engagingness. Perceived intelligence may be associated with realness, where a more intelligent system seems more human-like. However, as there are intelligent and unintelligent humans, designing a chatbot to be human-like does not necessarily mean making it an intelligent one. As such, anthropomorphism and likeability are considered to some degree when evaluating the chatbots in this thesis.

2.5.3 SASSI
The presence of six main factors in users' perceptions of speech systems was found in the work by Hone and Graham, where they aimed to create a tool for "the Subjective Assessment of Speech System Interfaces" (SASSI) [17]. The six factors were found after performing principal component analysis (PCA) on a number of questionnaire statements (or items). The six factors were named "System Response Accuracy", "Likeability", "Cognitive Demand", "Annoyance", "Habitability" and "Speed". System Response Accuracy was the name given to questionnaire statements related to the system's accuracy, reliability, predictability, etc. Three example statements are "The system is accurate", "The system is unreliable" and "The interaction with the system is unpredictable". Three example statements for likeability are "The system is useful", "The system is pleasant" and "The system is friendly". The name Cognitive Demand relates to the user's emotional state and the perceived difficulty level of using the system, e.g. "I felt calm using the system" and "A high level of concentration is required when using the system".

Annoyance refers to perceived repetitiveness, engagingness and how frustrating the interaction with the system is, e.g. "The interaction with the system is repetitive", "The interaction with the system is boring" and "The interaction with the system is irritating". Habitability measures how confident the user is that they are using and understanding the system correctly, e.g. "I always knew what to say to the system" and "I was not always sure what the system was doing". The sixth factor only contained two questionnaire statements, "The interaction with the system is fast" and "The system responds too slowly", and was thereby named Speed. Not all of the six factors are directly applicable to non-task-oriented chatbots. System response accuracy, habitability and cognitive demand may be more applicable to task-oriented agents. Likeability, annoyance and speed are relevant factors for measuring the engagingness of chatbots. The engagingness as a whole may be affected by how pleasant the interaction with the chatbot is (likeability), the perceived repetitiveness (annoyance) and the system response time (speed). A system that oftentimes repeats itself and takes a long time to respond to input may not be very engaging to chat with.

2.5.4 Automatic Evaluation of Responses
Evaluating responses of dialogue systems such as chatbots without human judgement is a difficult task. Attempts have been made to use metrics similar to those seen for machine translation models, such as the BLEU metric [30]. The problem is that these metrics assume that "valid" responses have significant word overlap with the ground truth responses, which is not necessarily the case for responses in the dialogue task [25].

Example of valid response without word overlap
Question: What are your hobbies?
Ground truth: My hobbies include biking and hiking.
Generated response: I like to draw and play video games.

The question in the example may have many different answers which do not share any word overlap with the ground truth response, which would lead to a low BLEU score, e.g. 0, even though the answers are clearly valid. A survey was performed to determine the correlation between human judgements and automated metrics, such as word-overlap (BLEU) and word-embedding (word2vec [28]) methods. The result was that all metrics show weak or no correlation with human judgement, and the metrics were compared to random noise [25]. Despite the results of the survey and the poor performance of both metric types, the researchers believed that distributed sentence representations (word2vec / word embeddings) hold the most promise for the future. As such, automatic metrics will be tested in this thesis as a complement to manual, human evaluation.
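As a small illustration of the word-overlap problem, the BLEU score of the example above can be computed with NLTK; the score collapses toward zero even though the generated answer is perfectly valid. This is only an illustration of the metric's behavior, not the exact setup used in the cited survey.

Illustration: BLEU on the example above
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "my hobbies include biking and hiking .".split()
candidate = "i like to draw and play video games .".split()

# Smoothing avoids log(0) when higher-order n-grams have no overlap at all.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))  # close to 0 despite the answer being valid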

2.5.5 Conversation-Turns per Session
To evaluate social chatbots on their engagingness, especially long-term engagement, a metric called conversation-turns per session (CPS) was proposed [59]. Chatbots are then considered more engaging the larger the CPS is. The metric is calculated as the average number of conversation-turns between the chatbot and the user in one session. XiaoIce achieved an average CPS of 23, obtained by averaging the CPS of conversations collected from millions of active users over a longer period of time (1-6 months) [59].


2.6 Open Source Conversational AI, Rasa

Rasa27 is an open source machine learning framework, which can be a suitable tool for the aim of this project. It provides infrastructure and tools to build contextual assistants that can automate text- and voice-based conversations. The application can be deployed on various platforms such as Facebook Messenger, Slack, Telegram and so on. Rasa provides an open source version, which is sufficient for this project. It contains two components: NLU (Natural Language Understanding), which determines what the user wants and captures key contextual information, and Core, which selects the next best response and/or action based on the conversation history. Additionally, Rasa provides a free, closed source toolset called Rasa X, which is a layer on top of Rasa Open Source and can be deployed on a virtual machine hosted on, for example, Google Cloud. Rasa X can be used for improving the agent through conversation-driven development (CDD) and allows continuous integration and continuous deployment (CI/CD).

2.6.1 Rasa NLU
NLU stands for Natural Language Understanding; it turns user messages into structured data. With Rasa NLU, this is done by providing training examples which show how the chatbot should understand user messages; a model is then trained by showing it those examples. Below is a sample of NLU data.

A sample NLU data
## intent: greet
- hey
- hello
- hi
- good morning
- good evening
- hey there
- ...

As the chatbot understands the user messages, it is able to classify and extract intents and entities. A user's intent represents the goal or meaning of the input. For example, the message "Hello" can be identified as a "greet" intent, because the meaning of this message is a greeting. An entity is a keyword that the agent takes into consideration. For example, the message "My name is John" contains information related to a personal name; the agent should then extract an entity, in this case labeled "name" with the value "John", and remember it throughout the conversation to keep the interaction natural. Incoming messages are processed by a sequence of components in Rasa. These components are executed one after another in a pipeline. A pipeline commonly contains three main components (a minimal example configuration is sketched after the list):

• Tokenization.

• Featurization.

• Entity Recognition, Intent Classification and Response Selectors.
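To make the pipeline concrete, the following is a minimal sketch of what an NLU pipeline could look like in a Rasa config.yml file. The component names are Rasa's built-in pipeline components at the time of writing; the exact pipeline used for the chatbot in this thesis may differ.

A minimal example NLU pipeline (config.yml)
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100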

2.6.1.1 Tokenization
A tokenizer segments text into words, punctuation marks and so on; it splits a text into tokens. There are different tokenizers that can be used for different types of user inputs.

27https://github.com/RasaHQ/rasa


For example, the Whitespace Tokenizer can be used to process text where words are separated by spaces, which is the case for English and many other common languages. Other tokenizers are also supported if the language is not whitespace-tokenized.

2.6.1.2 Featurization
A featurizer creates a vector representation of the user message and/or response. There are two different types of text featurizers: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors in which most values are zero. Such feature vectors would usually take up a lot of memory, and they are therefore stored as sparse features. Sparse features only store the values that are non-zero, together with their positions in the vector.

2.6.1.3 Entity Recognition, Intent Classification and Response Selector
Entity extractors extract entities from the user input, such as person names, locations and other topics. Intent classifiers assign one (or more) of the pre-defined intents to the user input. Response selectors predict a response from a set of candidate responses.

2.6.2 Rasa Core
Rasa Core is responsible for dialogue management; it decides how the chatbot should respond. It learns from supplied conversation examples, i.e. examples of how a conversation between a user and an agent would take place. These examples are called stories. An instance of a story contains the user's intent and the reply (or replies) that should be triggered. The replies are supplied as templates, called responses, which the agent can use. Finally, an environment is defined where the agent will operate; this is called a domain. A domain is essentially the universe that the agent lives in. Within this domain, all necessary intents, entities, and actions that the agent is able to use are supplied. Additionally, the bot has memory slots which store certain information that the user has provided. However, the slots are cleared after each interaction session; in order to store and use information across sessions, an external dataframe is needed.

2.6.2.1 Story
The training data in Rasa are given in the form of stories, which are used to train the dialogue management models. A story essentially represents a dialogue flow between a user and the chatbot; it contains both the user's intents (and entities, if applicable) and the respective responses and/or actions that the chatbot should take. Below is an example of a simple story that represents the beginning of a conversation. The user's intents are labeled with an asterisk, the chatbot's responses and actions are labeled with a dash, and responses always start with the "utter_" prefix. Figure 2.6 shows the story flow; the graph can get more complex as the story gets more routes and contains different responses and actions.

A sample story used as training data
## example story 1
* greet
  - utter_greet
  - utter_ask_name
* give_name
  - utter_acknowledge
  - action_remember_name

As seen in the example training data above, an advantage of this format is that the specific input of the user does not need to be present. Instead, the output intents and entities from the NLU pipeline are utilized. Therefore, for any message that is classified as greet, the chatbot will take the responses from utter_greet_back and utter_ask_name and reply. All the available responses are provided in a domain file.


Figure 2.6: A story flow diagram visualized in Rasa X.

2.6.2.2 Domain
The domain essentially defines the universe that the chatbot lives in. It specifies all the intents and entities that the chatbot should know about, and all the responses and actions that can be used. In addition, Rasa also supports displaying images and using buttons, which is a nice touch if the application is to be deployed on other platforms such as Facebook Messenger. Below is a sample of the domain file that contains some intents and entities.

A sample of the domain file
intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
entities:
  - name


2.6.2.3 Slot
The chatbot is equipped with memory slots, which together essentially form a key-value database. A slot holds a certain piece of information, such as a personal name or a location, that is provided by the user during the interaction. It is also possible to retrieve information from other sources, for example by querying a database. Slots have various types for storing different kinds of information and for different behaviors. Below is a list of the supported slot types.

Supported slot types
Text
Boolean
Categorical
Float
List
Unfeaturized

In situations where the value itself does not matter, for example when discussing the weather, a text slot named "location" can be used. When the user provides their current location, the value is stored in the slot; the exact location does not matter to Rasa Core, whether it is New York or Paris. Rasa Core only needs to know that a location entity was given, and its value will be stored in this specific slot named "location". The unfeaturized type is another slot type where the value does not matter. If the conversation flow or the responses should depend on the value, then there are other slot types that can be used, including bool, categorical, float and list. For example, the chatbot asks a boolean question and stores the answer in a bool-type slot; later on, the chatbot's responses can vary depending on the boolean value.

2.6.2.4 Response
After the chatbot has understood what the user is saying, it needs to reply accordingly. In order for the chatbot to reply, it needs to be supplied with one or more responses. These responses are also defined in the domain file and are used in the stories. Below are a few examples of responses that the chatbot can use.

A few examples of responses
utter_greet:
  - text: "Hey! How are you?"
utter_bye:
  - text: "See you next time!"

One or more responses are used given one or more certain intents. Below is a sample of training data written in the story format.

An example of a story used as training data
## story path 1
* greet
  - utter_greet
* mood_great
  - utter_happy


2.6.3 Rasa X
Rasa X is a free but closed source toolset that can be used to improve the agent by deploying it on a server and letting users interact with it, gathering additional information which can be used, for example, as more training data, or to manually correct false intent and entity classifications, and so on. It comes with a neat UI and makes it easy to navigate and share the agent. Rasa X is deployed using Docker, which is a tool that uses containers to make the process of creating, deploying and running applications easier.

2.7 VADER Sentiment Analysis

An external component, a sentiment analyzer from NLTK (the Natural Language Toolkit), specifically VADER (Valence Aware Dictionary and sEntiment Reasoner)28, is considered for use as a ranking system included in the pipeline. This component mainly returns a sentiment value, either "POSITIVE", "Neutral" or "NEGATIVE", along with a confidence score. Figure 2.7 shows a few examples of different user inputs and the respective sentiment analysis results.

Figure 2.7: Results from the sentiment analyser for different user inputs.

The purpose of the ranking system is to sort the extracted entities by their sentiment values; the chatbot will then prioritize generating questions related to the topics with higher sentiment values.
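As a sketch of how the sentiment value and score in figure 2.7 can be obtained, the snippet below uses NLTK's VADER analyzer and thresholds its compound score. The ±0.05 cut-offs follow VADER's common convention and are an assumption here, not necessarily the exact mapping used in the chatbot.

Sketch: obtaining a sentiment label with VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

def sentiment_label(text):
    # polarity_scores returns neg/neu/pos values plus a compound score in [-1, 1].
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "POSITIVE", compound
    if compound <= -0.05:
        return "NEGATIVE", compound
    return "Neutral", compound

print(sentiment_label("I really love playing the guitar"))
print(sentiment_label("I hate rainy days"))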

2.8 Semantic Network

A semantic network such as ConceptNet [41] is designed to help computers understand the meanings of the words that people use. It is considered as a way to retrieve information related to the topics brought up by the users, by calling its API with a search term, which returns a list of relations. The search term is the entity extracted from the user's input. Figure 2.8 shows some of the relations that ConceptNet provides.
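A minimal sketch of such an API call is shown below, querying the public ConceptNet endpoint for relations connected to an extracted entity. The fields kept here are illustrative; the chatbot may use a different subset of the returned relations.

Sketch: querying the ConceptNet API
import requests

def related_concepts(term, limit=5):
    # Query the public ConceptNet API for edges connected to the given term.
    url = f"http://api.conceptnet.io/c/en/{term.lower()}"
    data = requests.get(url, params={"limit": limit}).json()
    return [(edge["rel"]["label"], edge["start"]["label"], edge["end"]["label"])
            for edge in data.get("edges", [])]

print(related_concepts("guitar"))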

2.9 Conclusion

XiaoIce illustrated that using a hybrid retrieval and neural generative approach allowed for a more engaging system, as the two models complement some of each other's weaknesses. As the generative model used in XiaoIce was based on Seq2Seq, perhaps even more engagingness can be achieved with a Transformer architecture such as GPT-2. From the work by [3], it is found that avoiding repetition and introducing a persona for the agent leads to a more engaging system in their case. Therefore, a hybrid rule-based and corpus-based approach was chosen for the chatbot in this thesis, with templates, a retrieval structure and a generative model. Templates will be used together with the agent's personal memory, where a number of variations of the templates are created to reduce repetition. A retrieval structure will be made with high-quality responses for typical questions asked when getting to know someone. The generative model will use a GPT-2 model, fine-tuned on the question answering task.

28https://github.com/cjhutto/vaderSentiment


Figure 2.8: Some relations provided by ConceptNet.

Rasa Open Source provides the infrastructure and tools necessary for building a high-performing, resilient and contextual chatbot: the ability to understand messages, to use machine learning to improve conversations, and to integrate the chatbot with existing systems and channels, all in one package. Rasa's NLU provides the technology to understand messages, determine intents, and extract key entities. It has wide applicability, supporting multiple languages and both pre-trained and custom entities. With a first chatbot built, it is then possible to move forward quickly by learning interactively from real conversations with users. Options for deployment include on-prem, private cloud, or a third-party cloud provider, and the chatbot can connect to existing systems, API calls and knowledge bases, and be built into websites, apps, messaging apps or custom channels. It was therefore decided to study and adopt the framework.

Human evaluation was found to be the most reliable form of evaluation, compared to existing automatic metrics, which is why user tests will be the main method of evaluation. A decision was made to not use any one of the standard evaluations (PARADISE, Godspeed, SASSI) but rather to take inspiration from them when creating the user surveys for the dialogue agents of this thesis. The reason is that none of these three standards was specifically developed for measuring users' perception of chat-based dialogue agents. In both SASSI and Godspeed, it can be observed that several questions are used to measure the same type of aspect. For example, SASSI has the statements "The interaction with the system is fast" and "The system responds too slowly", which both measure speed.

This may offer more robustness, as users may interpret questions differently based on the wording of the sentences. Therefore, this concept will be considered when designing the surveys in this thesis. The evaluation metric of CPS will not be used, purely due to the limited scope of the thesis; there is not enough time to develop a fully functional chatbot for long-term usage, such that the metric could be utilized. The evaluation will instead be based on survey results from test users who interact with the agent.

3 Development of a Question Answering Chatbot

This chapter describes the development of the question answering chatbot, followed by the first user test (3.10). A brief overview of the chatbot and its architecture is presented in section (3.1), and then discussed in detail in the sections that follow. The chatbot consists of three main components: a generative model (3.3), a template structure (3.5) with a related sentiment memory (3.6) and a retrieval structure (3.7). The components are first introduced by themselves and then put together (3.8). The creation of each component is based around conversational datasets (3.2) that were preprocessed (3.3.1) and analyzed (3.4).

3.1 Architecture

The architecture for the question answering chatbot can be seen in figure 3.1. The figure depicts an initial overview of how the chatbot's components work together. The chatbot consists of three components: a generative component, a template component and a retrieval component. Furthermore, a front-end was developed to allow users to interact with the chatbot online. The chatbot runs locally on a computer and can be accessed by using Flask, a web application framework in Python, combined with an Ngrok tunnel that redirects users from Ngrok to the computer running the chatbot. The user is then able to send and receive messages to/from the agent. When the user sends a message, the message is sent to the template and retrieval components. The template component attempts to extract any existing nouns; if one exists, the message is transformed into a question template form. The transformed message is then compared with existing template questions to find the most similar template based on a cosine similarity score. The retrieval component then finds the most similar question in its retrieval database. If the template question has a higher similarity to the user message than the retrieved question, then the template is chosen and the related answer template is fetched. Otherwise, the retrieval answer is chosen. If the chosen answer's similarity is above a pre-defined threshold, then the answer is sent out to the user. Otherwise, an answer is generated and then sent out to the user. The memory of the template and retrieval structures is not explicitly illustrated, but it can be considered part of the respective components. The following sections describe in more depth the method applied to develop each individual component, as well as the essential datasets.
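The selection logic described above can be summarized with the following simplified sketch. The matching functions, the generative fallback and the threshold value are placeholders supplied by the caller, not the exact implementation used in the system.

Sketch: choosing between template, retrieval and generative answers
def choose_answer(user_message, template_match, retrieval_match,
                  generate_answer, threshold=0.7):
    """Pick a template, retrieval or generated answer.

    template_match / retrieval_match return (answer, cosine_similarity);
    generate_answer returns a string. All three are supplied by the caller,
    so the routing logic stays independent of the individual components.
    """
    t_answer, t_sim = template_match(user_message)
    r_answer, r_sim = retrieval_match(user_message)

    # Prefer whichever structure matched the user message best.
    answer, sim = (t_answer, t_sim) if t_sim >= r_sim else (r_answer, r_sim)

    # Fall back to the generative model when neither match is good enough.
    if sim < threshold:
        answer = generate_answer(user_message)
    return answer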

3.2 Datasets

To train a neural model on a specific task, it is essential to have relevant data for the task. As such, conversational datasets were used to train generative models to generate answers to questions, and to create a question-answer pair retrieval database. Additionally, conversational datasets were analyzed to create templates based on common sentence structures found in questions.


Figure 3.1: System flow diagram describing how input data flows through the architecture of the chatbot and generates output

Four existing datasets were used during this thesis: the Cornell Movie–Dialogs Corpus (CMDC) [10], Persona-Chat [56], ConvAI2 [12] and a "getting to know you" (referred to as GTKY from now on) dataset from a study by Huang et al. [18]. All datasets except for the ConvAI2 dataset were at one point used for the purpose of training machine learning models. The GTKY and ConvAI2 datasets were used to analyze what types of questions are commonly asked in a casual chit-chat setting, and what topics occurred. Analyzing these datasets is important for the creation of question and answer templates, see section (3.5). The ConvAI2 dataset consists of human-to-bot conversational data. Analyzing this data may be useful to get a better understanding of how humans interact with agents. The other datasets are human-to-human datasets, where the Persona-Chat and GTKY datasets consist of messages gathered from a 1-on-1 chat room environment.

The Cornell Movie–Dialogs Corpus, on the other hand, consists of conversations extracted from raw movie scripts [10]. The Persona-Chat and ConvAI2 datasets are related to the same task. The Persona-Chat dataset consists of conversations between two randomly paired crowdworkers, who were asked to act out a given persona and to get to know each other [56]. The ConvAI2 dataset is the result of a competition where the competitors used the Persona-Chat dataset to try and create the best chatbot that acts out a persona [12]. For the GTKY dataset, participants were told to chat for 15 minutes with the objective to get to know each other and learn about each other's interests [18].

3.3 Generative Model

GPT-2 achieving state-of-the-art results on several language modeling datasets made the model an attractive solution for generating answers in this thesis. Furthermore, GPT-2 was able to generate text faster than other auto-regressive models that were briefly tested (2.2.4.3), which is essential for real-time interaction. A python library called gpt-2-simple1 was used for fine-tuning the GPT-2 models, and for text generation. The maintainer of the gpt-2-simple library, Max Woolf, has also provided a Colaboratory notebook2 illustrating how to use the library. Google Colaboratory was used to fine-tune the different models, since it provides a Tesla T4 GPU with 16GB memory (or similar), in this case to fine-tune the smaller models (124M/355M). The larger models (774M/1558M) can also be trained by changing the runtime type from GPU to TPU and "requesting" more RAM from Google Colaboratory, upon which 35.35GB of RAM may be lent. When using the free version of Google Colaboratory, there is currently no guarantee of what GPU and how much RAM is available3. The 124M and 355M models were chosen to be investigated further. The main reason was that the time it takes to generate text needs to be as low as possible if the model is to be used in a chatbot system where users may expect a reply within a few seconds. Additionally, the larger models not only require a longer time to be fine-tuned, but they also require more VRAM (both during fine-tuning and text generation). A short experiment was conducted to compare the models in terms of how long they take to fine-tune as well as how long they take to generate text (A.3). Eventually, the 124M model was chosen, mainly due to its faster text generation compared to the larger models, such that users get their reply from the chatbot as fast as possible. As an example, the 124M model is over twice as fast as the 355M model.
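The fine-tuning and generation workflow with gpt-2-simple roughly follows the sketch below. The dataset file name, run name, step count and prompt format are assumptions made for illustration; the actual datasets and iteration counts are described in 3.3.1 and 3.3.3.

Sketch: fine-tuning and generating with gpt-2-simple
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")                # fetch the 124M base model once

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="gtky_preprocessed_manual.txt",  # assumed file name
              model_name="124M",
              steps=200,
              run_name="qa_model")

# Generate an answer by prompting with the start token and a question,
# truncating the output at the end-of-text token (see section 3.3.1).
answer = gpt2.generate(sess,
                       run_name="qa_model",
                       prefix="<|startoftext|>What do you like to do in your free time?",
                       truncate="<|endoftext|>",
                       return_as_list=True)[0]
print(answer)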

3.3.1 Preprocessing and Creating New Datasets
OpenAI trained the GPT-2 model on 40GB of various types of texts. Therefore, the released model is already able to generate text and can be used for different natural language processing tasks by fine-tuning the model for a specific task (2.2.2). As such, it is necessary to preprocess conversational data to be able to fine-tune the model for the task of question answering. Existing code for preprocessing the Cornell Movie–Dialogs Corpus was used to create "message pairs"4. For example, given three messages in the chronological order x, y and z, the result would be two message pairs (x, y) and (y, z). This would lead to training data that could train a model to produce both message y if given message x, and message z if given message y. The movie dialogs were thereafter preprocessed, and a new dataset was created: MD-pairs. To train a GPT-2 model on this type of data, it is beneficial to add special tokens at the start and end of a segment.

1https://pypi.org/project/gpt-2-simple/
2https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce
3https://research.google.com/colaboratory/faq.html
4https://github.com/floydhub/textutil-preprocess-cornell-movie-corpus

The GPT-2 model is trained with an unsupervised method and thus the tokens help the model to learn a desired pattern. The tokens were called '<|startoftext|>' and '<|endoftext|>'. A new dataset, MD-token-pairs, was created by adding a SOT (start of text) token at the start of each message pair, and an EOT (end of text) token at the end of the message pair. This entails that the GPT-2 model can learn to recognize the structure of a message pair.

To be able to train a machine learning model on the task of answering questions, it is necessary to preprocess the data into questions and answers. This entails that when the model is given a question as input, it will generate an answer as output. The preprocessing code was altered to extract question and answer pairs instead. The python library NLTK was used to split a message into sentences and, for each sentence, to split it into tokens. If the last sentence in the message ends with a question mark, then the message is classified as a question. The following message is then classified as an answer. The first question and answer dataset used for fine-tuning, MD-token-questions-simple, had the condition to extract the last question in the current message and the first sentence ending with a punctuation mark in the following message. This condition was made because there exist messages with several questions. Therefore, the hypothesis was that the last question is the most relevant one. Similarly, it was hypothesized that the first sentence in the following message that ends with a punctuation mark is the most relevant sentence that answers the question. In this case, it was not desired to extract answers that answer a question with a question, as the model is to be part of a question answering component. As such, the component only handles questions and not answers from the user.

Thereafter, the MD-token-questions dataset was created after adding three conditions to improve the training data. To "improve" the training data here refers to processing the data into a format that is more desirable for this specific task. The question message must consist of more than two words, to avoid non-specific questions like "Why?", "Why not?", "Huh?", etc. These questions were avoided as they are dependent on the previous messages in the conversation; context-dependent follow-up questions. Such questions were deemed out of scope at the time due to the time constraints of developing the necessary context understanding for answering them. Another condition was that the answer must be longer than one word, for the sake of avoiding "Yes/No" answers. Ensuring that answers are longer than one word should lead to more engaging generated answers once the model is fine-tuned. Finally, the question and the answer must be from two different characters and from the same movie (movie dialog corpus). Different question and answer datasets were created with variations on these conditions, where the changing factor was how much of the question and answer messages was extracted as the question and the answer. For example, the MD-token-questions dataset had the condition to only extract the question from the message containing the question, and only the first sentence (that ends with a punctuation mark) from the answer message. One variation, MD-token-questions-2, used more than one sentence in the answer, to allow the answer to be more complex. Compare the two variations in the following example, where MD-token-questions-2 is the "A2" answer.

Q: How do you get your hair to look like that?
A1: Eber's Deep Conditioner every two days.
A2: Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment.

Although MD-token-questions-2 may result in more engaging answers, it also carries the risk of including sentences that are not related to the question. Therefore, models were mainly fine-tuned using MD-token-questions. Another dataset, MD-token-questions-4, extracts the entire question message and the entire answer message. This results in cases where the answer contains a question as well, which may end up being the next question.

Example of MD-token-questions-4 pair extraction
Pair 1


Q: You got something on your mind?
A: I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?

Pair 2
Q: I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?
A: You have my word. As a gentleman
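A simplified sketch of the pair extraction is given below. It approximates the conditions described in this section (a question mark on the last sentence, minimum lengths, and not answering a question with a question) and requires NLTK's punkt tokenizer; it is not the exact preprocessing code that was used.

Sketch: extracting question-answer pairs from consecutive messages
import nltk  # requires nltk.download("punkt")

def extract_qa_pairs(messages):
    """Approximate the MD-token-questions extraction rules on a list of messages."""
    pairs = []
    for current, following in zip(messages, messages[1:]):
        sentences = nltk.sent_tokenize(current)
        if not sentences or not sentences[-1].endswith("?"):
            continue                      # the current message must end with a question
        question = sentences[-1]
        if len(nltk.word_tokenize(question)) <= 2:
            continue                      # skip non-specific questions such as "Why?"
        answers = [s for s in nltk.sent_tokenize(following) if s and s[-1] in ".!"]
        if not answers:
            continue                      # the answer must end with a punctuation mark, not "?"
        answer = answers[0]
        if len(nltk.word_tokenize(answer)) > 1:
            pairs.append((question, answer))
    return pairs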

The GTKY dataset was preprocessed using the same method as when creating the MD-token- questions dataset, henceforth referred to as GTKY-preprocessed. The Persona-Chat dataset was preprocessed into pairs similar to the MD-token-pairs dataset.

3.3.2 Manual Cleaning of Data
The GTKY dataset was not formatted in a way that supported clean automatic extraction of question and answer pairs. This may be because the original data was retrieved from a study in a message chat environment [18]. As such, users may decide to write multiple messages, and to answer a specific message, or part of a message, in arbitrary order.

Example of GTKY data format
User2: How are you today?
User1: fine. how are you?
User2: I am good, enjoying the holidays. Do you celebrate Christmas?
User1: as much as i celebrate any holiday :-)|I enjoy cooking a lot in the winter time|so around XXmas time, i start cooking a lot more|especially soups.|do you like to cook?
User2: I have two small children and my 5 year old is really enjoying finding "elf" this year and is anticipating sanat coming to visit. Are you from aroudn here or will be heading home for the holidays?|Yes, but I finsd I am not doing it as often these days! What is your favorite foods?
User1: this is essentially my home now. no travel for me.
User2: Me as well, I was raised just outside Boston, so I host at our condo in Somerville. Where do you live?
User1: i like soups: kale-white bean-bacon is one of my favs.

“Clean” extraction in this sense refers to extracting the answer sentence related to the specific question. Therefore, after automatically extracting and creating a question and answer pair dataset, GTKY-preprocessed, manual cleaning was done to remove pairs where the answer was not related to the question. Consider the previous example, where automatic extraction would give the unrelated question and answer pair: "Q: What is your favorite foods? A: this is essentially my home now.". The manual cleaning reduced the dataset by roughly 44%, from 2035 pairs to 1147. Even though the manual cleaning decreased the size of the training data, the current work aimed at improving the model performance based on the quality and relevance of the training data rather than its size.

3.3.3 GPT-2 Fine-Tuning
Table 3.1 shows several models that were fine-tuned during this project using the gpt-2-simple python library with its default training parameters (e.g. the Adam optimizer with a learning rate of 1e-4 and a batch size of 1). The first test was done by fine-tuning a 124M parameter model on the MD-pairs dataset, and a 355M parameter model on the MD-token-pairs dataset. The test's purpose was to get familiar with the fine-tuning process. The models were not saved, and no comparison of the models was performed.

Thereafter, a few models were trained on different question and answer datasets, where the datasets were created with different conditions, as mentioned in (3.3.1). Most of the early models were not saved or compared extensively. The main purpose of the early models was to continuously fine-tune models on different datasets, to find which dataset is better at teaching a model the desired behavior of generating personal questions and answers. The output of a model was observed, and thereafter a new model was trained to see if there was a noticeable difference or improvement between the current model and the previous model. After the GTKY-preprocessed dataset was created and a model (#7) was fine-tuned on it, the model started showing the desired behavior of asking and answering more personal questions. An example of four different models answering personal questions is provided in appendix A.2. The MD-token-questions dataset is 3.53MB in size compared to the GTKY-preprocessed dataset's 226kB. Fine-tuning for 400 iterations on the significantly smaller dataset may have resulted in an over-fitted model. The GTKY-preprocessed dataset was thereafter manually cleaned into a new dataset, GTKY-preprocessed-manual. Three models were trained purely on the GTKY-preprocessed-manual dataset: models #8, #9 and #10. After #8, models with 124M parameters were trained instead of 355M to reduce the risk of overfitting, as the dataset was very small. Model #10 was chosen as the generative model for the system, as it seemed to generate relevant answers more often than the other models.

Model  Parameters  Fine-tuning description
#1   124M   1000 iterations on the MD-pairs dataset.
#2   355M   1000 iterations on the MD-token-pairs dataset.
#3   124M   1000 iterations on the MD-token-questions-simple dataset.
#4   355M   1000 iterations on the MD-token-questions-simple dataset.
#5   355M   1000 iterations on the MD-token-questions-2 dataset.
#6   355M   1000 iterations on the MD-token-questions dataset.
#7   355M   1000 iterations on MD-token-questions and thereafter 400 iterations on the GTKY-preprocessed dataset.
#8   355M   100 iterations on the GTKY-preprocessed-manual dataset.
#9   124M   100 iterations on the GTKY-preprocessed-manual dataset.
#10  124M   200 iterations on the GTKY-preprocessed-manual dataset.
Table 3.1: Fine-tuning GPT-2 models

3.4 Data Analysis

The GTKY dataset was analyzed to understand what questions are commonly asked when getting to know someone, such that the chatbot can be developed to handle the most common questions. The analysis was done by going through the conversation data and extracting every sentence that ended with a question mark, which resulted in 5167 questions. This was done by using NLTK's sentence tokenizer and word tokenizer to divide a message into sentences and thereafter divide the sentences into tokens. Then, part-of-speech (POS) tagging was applied to every question to find which nouns most commonly occur in the questions. This was done by using spaCy to tokenize the questions, as it contains a POS tagger (NLTK can also be used). Nouns were extracted from the questions and added to a dictionary together with the number of occurrences of each noun. Nouns with fewer than 10 occurrences in the dictionary were disregarded, which resulted in 87 different nouns. The assumption was that the higher the number of occurrences of a noun, the more often that topic, or the type of question which uses the noun, is asked. A total of 4955 nouns were extracted (not unique), of which the top 87 nouns, with a total of 2653 summed occurrences, make up 53.5% of the extracted nouns.
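A minimal sketch of this noun-frequency analysis is shown below; the spaCy model name and the exact filtering are assumptions made for illustration.

Sketch: counting frequent nouns in the extracted questions
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed English model

def common_nouns(questions, min_occurrences=10):
    counts = Counter()
    for question in questions:
        for token in nlp(question):
            if token.pos_ == "NOUN":
                counts[token.text.lower()] += 1
    # Keep only nouns that occur at least min_occurrences times.
    return {noun: n for noun, n in counts.items() if n >= min_occurrences}

print(common_nouns(["What do you like to do in your free time?",
                    "Do you have a lot of free time?"], min_occurrences=1))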

The appendix includes table A.3, where the top 87 nouns are listed together with a question that contains the noun. Thereafter, questions that contained one of these nouns were grouped together in a list, which resulted in 2754 remaining questions compared to the original 5167. For each noun, the associated questions were analyzed. The analysis was performed by manually looking at the questions for the specific noun to assess what types of questions using the noun were asked. For example, the most common noun, "time", was mostly associated with the question "What do you like to do in your free time?". The questions that were perceived to be most commonly asked for each noun were written down in a txt file, to be used when creating a retrieval database and as a reference when creating templates. A total of 652 questions were written down. When writing down questions, attention was also paid to which questions could be used as template questions by replacing the keyword (or noun) of the sentence.

All the extracted questions (5167) were finally clustered to determine common topics. To be able to cluster the questions, every question was transformed into an embedding vector using the Universal Sentence Encoder5 [6]. The elbow method was used to decide on the number of clusters to use (n). The elbow method is a heuristic which determines an optimal clustering of the data. The method is to plot a metric, such as distortion (y-axis), over the number of clusters (x-axis) to find a point where the rate of improvement per additional cluster decreases. Initially, the metric value will improve steeply (a lower value is better) as the number of clusters increases, but at one point the rate of improvement flattens. A lower number of clusters is desired, and therefore the number can be chosen as a point before the curve flattens. In this case, k-means clustering was called in a for-loop in the range n=[2, 39]. In each iteration, the sum of squared distances of each sample to the closest cluster center was calculated. Figure 3.2 shows how the distortion value changes across different values of n.

Figure 3.2: Plotting distortion over the number of clusters.
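A minimal sketch of the clustering loop described above; the encoder version (the thesis footnote points to version 1), the variable `questions` and the plotting details are assumptions:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow_hub as hub
from sklearn.cluster import KMeans

# Embed all extracted questions with the Universal Sentence Encoder.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = np.array(embed(questions))  # `questions` is the list of extracted question strings

# Elbow method: record the sum of squared distances to the closest cluster centre for each n.
distortions = []
for n in range(2, 40):
    kmeans = KMeans(n_clusters=n, random_state=0).fit(embeddings)
    distortions.append(kmeans.inertia_)

plt.plot(range(2, 40), distortions)
plt.xlabel("number of clusters")
plt.ylabel("distortion")
plt.show()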

From the figure, there was no obvious number of clusters to choose. Therefore, attempts were made with n = 10 and n = 20. After analyzing the clusters and what type of questions existed in each cluster, it could then be observed that n=10 was too few whereas n=20 was too many. Each cluster was analyzed by printing the questions in the cluster and extracting the nouns from the questions. After analyzing the dataset, the most common themes/topics could be abstracted to one of the following (unordered):

• hobbies/interests/fun activities/plans

• personal (age/name/marital status/children/pets/family/residence/country)

• music/instruments

5https://tfhub.dev/google/universal-sentence-encoder/1


• movies/TV shows

• student/study/research

• weather

• food

• sports

• work/job/money/business

• books

• travel/language/vacation

3.5 Templates

Previous work at University of Twente developed a template-based framework for generating follow-up questions [27]. The framework consisted of three major parts: pre-processing, deconstruction, and construction. Pre-processing cleans up user input, deconstruction finds a sentence pattern, and construction constructs a response by matching the sentence pattern obtained in the previous step. The template component in this chatbot contains the same three parts but with different methods, and for question answering instead of question asking. A main difference is that [27] created rules to find patterns in user inputs which were then linked with a template response, whereas this component developed template questions linked with template answers. By using Semantic Role Labeling, [27] is able to detect inputs that have the same semantic meaning but are syntactically different, and match them to the same response. In this component, by contrast, user inputs were compared to template questions by transforming the input and the templates into sentence embeddings and calculating the cosine similarity between the input and the question templates, to detect a matching question. Recurring sentence structures were observed in some questions during the analysis of the GTKY dataset, which led to the creation of 21 question templates. The 21 templates were then divided into 6 types of questions by assigning them an answer ID. Answer templates could then be made for a specific question type by assigning the question type’s answer ID to the answer template as well. The templates were then used in a retrieval manner, where the user’s input is matched to a template question (deconstruction) to retrieve a related template answer, which is then further processed (construction). Many of the question templates have only slight variations, which is how the 21 templates could be divided into 6 types. Having variations of the same question adds robustness as more inputs can be recognized. The question templates are shown in A.4.

The 6 template types
Do you <sentiment> <subject>? - Do you like pizza?
What kind of <topic> do you <sentiment>? - What kind of music do you like?
What is your favorite kind of <topic>? - What is your favorite kind of food?
What’s your favorite <topic> genre? - What’s your favorite music genre?
What do you think about <subject>? - What do you think about cats?
What are some of your favorite <topic>? - What are some of your favorite foods?

A template is retrieved if the user’s input is similar enough to the template. The similarity is calculated by turning the templates and the input into sentence embeddings and calculating the cosine similarity. The templates consist of three token types: <topic>, <subject> and <sentiment>, which enables the templates to be used with different topics, subjects (nouns) and sentiments. Given a user input, a copy of the input is processed to follow the template

sentence structure. Sentiment keywords such as “like/dislike/love/hate” are replaced by the token ‘<sentiment>’. If the sentence contains a noun, as recognized by POS tagging with spaCy or NLTK (both are used for robustness), then the noun is added into a list of extracted nouns. The pluralized, singularized and lemmatized forms of the noun are also added to the list, to allow a direct comparison to nouns in the sentiment memory. Each noun in the list is compared against the subjects and topics in the sentiment memory. The template questions were designed to only consider one noun, and therefore the algorithm stops going through the list after it finds a noun that exists in the sentiment memory. The noun is then replaced by the token <subject> or <topic> depending on whether the noun is a subject or a topic, for example:

Q: Do you like bananas?    banana -> <subject>
Q: What food do you like?  food -> <topic>
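A minimal sketch of this deconstruction step, assuming a simplified in-memory view of the sentiment memory that maps nouns to the role "subject" or "topic" (the real component also adds pluralized and singularized forms and falls back to NLTK):

import spacy

nlp = spacy.load("en_core_web_sm")
SENTIMENT_WORDS = {"like", "dislike", "love", "hate"}

def deconstruct(user_input, sentiment_memory):
    """Rewrite a copy of the input to follow the template sentence structure."""
    tokens = []
    replaced_noun = False
    for token in nlp(user_input):
        if token.lower_ in SENTIMENT_WORDS:
            tokens.append("<sentiment>")
        elif token.pos_ == "NOUN" and not replaced_noun:
            # Compare the noun (and its lemma) against subjects/topics in the sentiment memory.
            role = sentiment_memory.get(token.lower_) or sentiment_memory.get(token.lemma_)
            if role in ("subject", "topic"):
                tokens.append("<" + role + ">")
                replaced_noun = True  # the templates only consider one noun
            else:
                tokens.append(token.text)
        else:
            tokens.append(token.text)
    return " ".join(tokens)

memory = {"banana": "subject", "food": "topic"}  # hypothetical memory entries
print(deconstruct("Do you like bananas?", memory))    # Do you <sentiment> <subject> ?
print(deconstruct("What food do you like?", memory))  # What <topic> do you <sentiment> ?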

Additional analysis was considered to find whether common answer sentence structures recurred in the dataset when answering these types of questions. In this case, however, the answer templates were manually written to fit the question templates. The answer templates are shown in A.5 and are further described in (3.8).

3.6 Sentiment Memory

It is natural for people to like some things and dislike others. One person may love apples but hate oranges, while another person may love oranges and hate apples. A type of memory was constructed to simulate this aspect, such that the agent has a clear sense of preference. The agent should be able to say whether it likes certain things such as apples, and what the agent’s favorite fruit is. Therefore, noun words were manually collected from different categories and stored in a csv file with a randomized value in the range between 0.0 and 1.0. The categories considered in this case are: food, fruits, desserts, sports, hobbies, music (genre), genre (movie/books), books, movies, TV shows, colors and animals. All categories except for colors were decided based on the most common themes/topics found during the previous analysis (3.4). Nouns were manually extracted for each category, resulting in a total of 357 nouns, with a minimum of 8 nouns per category, and with food, fruits and desserts making up the majority of the nouns (214/357). ConceptNet6 was used to find nouns for most categories (food, animals, colors, music (genre), genre (movie/books), sports and hobbies). IMDb7 was used to collect a few popular movies and TV shows. Books were collected from a user-rated top list found on the internet8. Fruits9 and desserts10 were extracted from vocabulary word lists. The random value stored along with the noun determines the sentiment for the noun. Nouns with sentiment values above 0.5 are classified as “like” while values below 0.5 are classified as “dislike”. Choosing a random value entails that the agent’s preferences can be assigned without manual labor. The nouns are stored in the format (topic, subject, sentiment). This allows the agent to answer specific questions such as “Do you like apples?”, by searching for the noun apples in the csv column “subject” in its memory and giving an answer based on the related sentiment. It also allows the agent to answer more open questions such as “What fruits do you like?” by recognizing that fruits is a category or a “topic”, and thereafter returning an answer with the fruit with the highest sentiment in its memory. The sentiment memory could be populated dynamically during conversations by using ConceptNet’s API to find the IsA relationship for each new/unknown noun. The IsA relationship classifies nouns into the topics they belong to, for example football -> is a sport,

6http://conceptnet.io/ 7https://www.imdb.com/ 8https://www.goodreads.com/list/show/1.Best_Books_Ever 9https://www.enchantedlearning.com/wordlist/fruit.shtml 10https://www.enchantedlearning.com/wordlist/desserts.shtml

banana -> is a fruit/food. The sentiment memory could therefore grow larger the more users interact with the system, by saving each new noun under its respective topic with a random sentiment value.
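A minimal sketch of how this (topic, subject, sentiment) memory can be queried, assuming a pandas dataframe loaded from the csv file (the file name and exact column names are assumptions):

import random

import pandas as pd

# The sentiment memory is stored as (topic, subject, sentiment) rows in a csv file.
memory = pd.read_csv("sentiment_memory.csv")  # columns: topic, subject, sentiment

def answer_specific(noun):
    """'Do you like apples?' -> look the noun up as a subject."""
    row = memory[memory["subject"] == noun]
    if row.empty:
        return None
    return "I like it!" if row.iloc[0]["sentiment"] > 0.5 else "Not really, no."

def answer_open(topic):
    """'What fruits do you like?' -> return the highest-sentiment subject in the topic."""
    rows = memory[memory["topic"] == topic]
    if rows.empty:
        return None
    best = rows.loc[rows["sentiment"].idxmax()]
    return "My favorite " + topic + " is " + str(best["subject"]) + "."

def add_unknown(topic, noun):
    """New nouns get a random sentiment, so no manual labelling is needed."""
    memory.loc[len(memory)] = [topic, noun, random.random()]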

3.7 Answer Retrieval Structure

The retrieval structure was created to complement the generative model & template system for three reasons. Firstly, retrieval and generative methods complement each other’s weaknesses. Retrieval methods provide high quality human-written text but are limited by the existing information in the retrieval database. Generative methods can generate responses outside the scope of the retrieval structure but suffer from lower quality responses. Secondly, in the work revolving around XiaoIce [59], it was found through human evaluation that a hybrid retrieval-generative chatbot performed better than retrieval and generative systems individually. The average rating of generated responses in the test was 1.09 for the hybrid system, 0.87 for the retrieval system and 0.40 for the generative system. The possible rating for each response was between 0 and 2, where 1 represented an acceptable answer and 2 represented a good answer. Finally, the fine-tuned GPT-2 model required about 7-8 seconds on average (with the current script and input parameters) to generate text, or in this case, to generate a reply to a question. An issue with the generated answers from the model is that the answers do not always make sense in relation to the question. The goal of creating a retrieval structure that contains answers for common questions was to ensure that the system could correctly answer the most common questions and to reduce the average response time of the system.

Examples of nonsensical generated answers
Q: are you a machine?
A: yes, I’m from the midwest.
Q: what are you called?
A: Um... honest answer is kinda similar.
Q: What is the name of your best friend?
A: hi friend is gonna be nice vanilla bean pancake... so i almost always keep track of which one i keep... i love vanilla... so I know which i like best.
Q: What is your everyday outfit?
A: I’ve been brushing up on the Cali surf since I was a kid, so I slouched on a chair near a coffee bar watching Netflix.

Two possible alternatives were considered when creating the retrieval structure. The first possible approach is to take the GTKY-preprocessed dataset, after it was manually cleaned, and directly use it as the retrieval database. Although this is a possible solution, the dataset was mainly created as training data for machine learning models. Therefore, some questions exist in the dataset mainly to teach models the format of ‘a question followed by an answer’, and additional work would be required to further clean the data. The dataset also does not specifically take into consideration the most common questions, as found by the analysis of the GTKY dataset. Furthermore, it is important to reduce repetition in the chatbot’s answers to increase the perceived engagingness for users who interact with the conversational agent [3]. As such, for each question type, it is desired to have multiple possible answers, which would require manually grouping similar questions such that their answers could be shared. The second alternative is to make use of the common questions found when analyzing the GTKY dataset and to generate answers with the fine-tuned generative model. The second alternative is more desirable when considering the repetition aspect, as it can generate an arbitrary number (n) of answers for each question.


As a result of analyzing the GTKY dataset, 652 questions had been written down into a txt file, ordered by the top nouns (3.4).

Example of questions ordered by noun
-time-
What do you do in your spare time?
What do you like to do in your free time?
...
-kind / kinds-
What kind of music do you like to listen to?
...

To create a retrieval structure, the questions were first ordered into general topics instead of nouns. The topics were interest/hobbies/activities, music, books, movies, games, student/study/work, person/personal, food, sports and miscellaneous. Questions that did not fit into any other topic were placed into miscellaneous, such as “What is your favorite thing to do at the beach?” or “What’s your favorite season?”. Duplicate questions were possible if a question contained more than one of the most common nouns and would therefore be listed more than once. For example, the question “Do you have any fun weekend plans?” is listed under “fun”, “weekend” and “plans”. The retrieval questions were not meant to handle follow-up questions, and as such, follow-up questions were discarded as well as duplicates, whereby 343 questions remained. The remaining questions were iteratively, one by one, input into the fine-tuned GPT-2 model to generate 20 answer samples per question. The results were stored in a csv file, saving each question along with the 20 samples. Out of the 20 samples, some were duplicates and some answers did not make sense for the question. Therefore, (subjectively) suitable answers were manually extracted. Out of the 343 questions, some additional questions were discarded for various reasons, resulting in 259 retrieval questions. The questions were mainly, but not exclusively, discarded after observing flaws in the associated generated answers. Another reason for discarding questions was the question being a type of follow-up question, for example, “What is your dog’s name?”. A similar approach to when the template structure was created was used to assign answer IDs to questions and answers. Questions that have a similar meaning were assigned the same answer ID. Thereafter, the respective answers were assigned the same answer ID. Some answers could also answer a different question type, and were therefore given an additional “optional” ID. The following example illustrates one answer that fits one question, and another answer that fits two questions.

Q1: what do you like to do for fun during the weekend? - ID = 1
Q2: What do you do for fun? - ID = 2

A1: I like to go out to bars on the weekends, but if I’m stressed out then I’ll usually head out to eat. ID = 1
A2: I like to go to the movies. ID = 1, Optional ID = 2
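A minimal sketch of the answer-ID lookup, assuming the questions and answers are stored in two csv files with illustrative file and column names:

import random

import pandas as pd

questions = pd.read_csv("retrieval_questions.csv")  # columns: question, answer_id
answers = pd.read_csv("retrieval_answers.csv")      # columns: answer, answer_id, optional_id

def retrieve_answer(matched_question_index):
    """Return a random stored answer whose answer ID (or optional ID) matches the
    answer ID of the question that best matched the user's input."""
    answer_id = questions.iloc[matched_question_index]["answer_id"]
    candidates = answers[(answers["answer_id"] == answer_id) |
                         (answers["optional_id"] == answer_id)]
    return random.choice(candidates["answer"].tolist())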

3.8 Chatbot Development

The chatbot was developed in the form of an interactive application in python. Given a user input, the input is first compared to the template questions. This is done by processing a copy of the input to follow the template sentence structure, by extracting any existing nouns and comparing them with the sentiment memory, as described in (3.5). Initially, the system was developed to call ConceptNet’s API if no matching noun was found in the user’s input. Then for every noun in the extracted nouns list, a call was made

to find the IsA relationship of the noun. However, this feature was disabled due to time complexity, as it could add an extra two seconds to the agent’s response time in some cases. ConceptNet was then replaced by using spaCy’s word similarity to compare the similarity between extracted nouns and existing topic words in the sentiment memory. This was done by using the concept of word embeddings, e.g. Word2vec, to create a vector representation of each word. The extracted noun was classified as the topic to which it had the highest cosine similarity, which allows users to ask what the agent thinks about any noun in any topic. For example, "Q: What do you think about money?". The subject noun was then saved in the sentiment memory with a random sentiment under the topic that it had the highest similarity to, and the agent was able to say whether it liked or disliked the subject. The processed input is then transformed into an embedding vector, using Universal Sentence Encoder [6]. The template questions are all turned into embedding vectors and cosine similarity is calculated between the processed input and each template question. The question template with the highest similarity is saved along with the cosine similarity value. Then, the original user input is similarly compared to the questions in the retrieval database. Individual threshold values are used for the template and retrieval structures. If both similarity values are lower than their respective threshold, then no similar question is considered to have been found in either structure, and therefore an answer is generated by the GPT-2 model. This can be compared to XiaoIce [59], where a pre-set threshold is used to determine candidate responses. A difference is that XiaoIce randomly selects a response out of a pool of responses with a ranking score above the threshold, compared to selecting the highest ranked response. The threshold values were in this case determined by testing, during development, how low the similarity could be while still returning answers that seemed relevant enough to be considered acceptable as output. If one or both values are above the threshold, then the question with the highest similarity returns its respective answer. If the question from the retrieval database has the highest similarity, a random answer with the corresponding answer ID is returned. If the template question has the higher similarity, then the output needs to be processed further. The output is processed by choosing a random answer template with the answer ID corresponding to the question. A dataframe data structure in the python library Pandas is used to retrieve information from csv files. The answer, when retrieved from its pandas dataframe, contains information about how the answer should be processed. The information tells whether the answer should use the same noun as extracted from the user sentence, whether it should use the same sentiment word (like/dislike/...) as the user and how many nouns it should retrieve in a topic (a number of favorite nouns). The instructions are stored in the columns ’use_noun’, ’use_sentiment’ and ’fetch_count’ of the csv file, which is shown in A.5, where integer 0 is mapped to "False" and 1 is mapped to "True". A user specific memory was developed related to previously answered questions. Whenever a retrieval answer is returned as the reply message of the chatbot, the question, answer and answer ID are saved.
If the user asks a question with the same answer ID, then the saved answer is returned, which ensures consistent answers and a sense of persona.

Example of retrieval memory
Q: What is your name?
A: My name is John.
Q: What is your name?
A: I thought I told you previously. My name is John.

The generative model generates answers whenever the similarities of the template and retrieval questions to the user’s question are too low. The model is called to generate an output with the user’s original text as input, together with a <|startoftext|> token prepended to the user’s text. For example, "Q: <|startoftext|> What is the meaning of life?". The model then generates n answers, where the first non-empty answer is returned as output.
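The dispatch between the three components could look roughly as follows; the thresholds, the encoder version and the three component functions passed in are placeholders rather than the thesis implementation:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")  # version assumed

# Illustrative thresholds; the thesis determined its values empirically during development.
TEMPLATE_THRESHOLD = 0.7
RETRIEVAL_THRESHOLD = 0.6

def encode(text):
    return np.array(embed([text]))[0]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reply(user_input, processed_input, template_questions, retrieval_questions,
          fill_template_answer, retrieve_answer, generate_with_gpt2):
    """Dispatch to the template, retrieval or generative component.

    The last three arguments are placeholders for the components described above."""
    template_sims = [cosine(encode(processed_input), encode(q)) for q in template_questions]
    retrieval_sims = [cosine(encode(user_input), encode(q)) for q in retrieval_questions]
    best_template, best_retrieval = max(template_sims), max(retrieval_sims)

    # Neither structure recognizes the question: fall back to the generative model.
    if best_template < TEMPLATE_THRESHOLD and best_retrieval < RETRIEVAL_THRESHOLD:
        return generate_with_gpt2(user_input)
    # Otherwise the structure with the highest similarity provides the answer.
    if best_retrieval >= best_template:
        return retrieve_answer(int(np.argmax(retrieval_sims)))
    return fill_template_answer(int(np.argmax(template_sims)))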


Internal Testing and Revisions

Once the interactive chatbot was developed, some testing was done internally by letting persons other than the developer interact with the system. It became apparent that the generated answers can sometimes be “faulty” by consisting of <|startoftext|> and <|endoftext|> tokens. The faulty answers are not formatted as only the answer but may also contain the input question or the start of another question as well as the answer generated by the model. This has to do with how the model is trained on question and answer data. This led to the addition of various processing steps. First, the sentence is split at the first occurrence of “endoftext” and all the text after the split is discarded. Then any remaining tokens are removed (<|startoftext|>, <|endoftext|>, <|, |>). Further processing was performed to ensure that the answer does not contain a question, by splitting the text into sentences and keeping any sentences before the question. After preprocessing, the first answer that is longer than 1 character (not empty) is returned as the output. It was also observed that the model would sometimes only generate a dot. This would occur when the text given by the user did not contain punctuation at the end of a sentence (question mark, exclamation mark, full stop). The question answering component was developed for question answering and as such can only be expected to work as intended if the user inputs questions. Therefore, a processing rule was added that appends a question mark to any user input that does not end with a form of punctuation. The question mark was only added to the text given as input to the generative model, whereas the original user input was saved (recorded) when applicable, such as during user tests. Additionally, the agent did not have a greeting or a farewell reply, which were then added to the retrieval structure. Finally, an initial message was displayed as the first message of the chatbot, which greeted and invited the user to ask the agent about anything. The message was then revised to invite the user to ask anything about the agent and its interests, see figure A.2 in the appendix. The reason for changing the message was an attempt to narrow down the questions the users would ask the agent into the personal question domain that it was designed for.

Artificial response delay

The generative component took around 4-8 seconds to generate a response depending on hardware, the number of responses to over-generate, and response lengths. Therefore, artificial response delay was introduced to mask any difference in response time between the generative component and the template and retrieval components, which are near instantaneous (without artificial delay). Artificial delay has been previously studied in the work of [14]. In that study, it was found that users perceived a chatbot (in the context of customer service) with response delay as more human-like and more socially present compared to the same chatbot with near instantaneous responses. Therefore, the testing objective of whether adding artificial delay to the agent’s response increases the user’s perception that the agent is real (realness) was added to the user test. Testing for this objective was done by using delay for every other user (aiming for an equal number of participants in each test group). The hypothesis was that the user group interacting with the agent with artificial delay would perceive the agent to be more human-like.
The artificial delay was dynamically determined by the length of the agent’s reply, see eq. 3.1.

base_delay = number_of_tokens / words_per_second    (3.1)

The words per second was set to 2, which represents a very fast typist (120 wpm). The artificial delay only considers how long it would take a person to type the message and does not take into account the time it would take a person to process the information written by the user, nor the cognitive process of deciding how to reply. In contrast, the delay in the work of [14] is calculated based on the natural language complexity of the previous message (from the user) and the complexity of the chatbot’s response.

40 3.9. User test - Environment

A threshold for the delay was set at 8 seconds; messages that were calculated to take longer than 8 seconds were suppressed using eq. 3.2. Example delays are illustrated in table 3.2. The threshold of 8 seconds was chosen because the generative component usually takes roughly 8 seconds to over-generate 20 answers. If the time it takes for the reply to be retrieved or generated is longer than the calculated dynamic delay, then no extra delay was added. Otherwise, the agent waited until the computation time plus wait time became equal to the calculated delay time.

delay = delay_threshold + ln(base_delay) − ln(delay_threshold)    (3.2)

Words/tokens    Artificial delay
10              5 s
20              8.223 s
40              8.916 s

Table 3.2: Example of artificial delay for different sentence lengths.
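A small sketch of the delay computation in eq. 3.1 and 3.2; splitting the reply on whitespace to count words is an assumption:

import math

WORDS_PER_SECOND = 2   # corresponds to 120 wpm, as described above
DELAY_THRESHOLD = 8.0  # seconds

def artificial_delay(reply):
    """Compute the dynamic delay of eq. 3.1 and 3.2 for a reply string."""
    base_delay = len(reply.split()) / WORDS_PER_SECOND                          # eq. 3.1
    if base_delay <= DELAY_THRESHOLD:
        return base_delay
    # Delays above the threshold are suppressed logarithmically.
    return DELAY_THRESHOLD + math.log(base_delay) - math.log(DELAY_THRESHOLD)   # eq. 3.2

print(artificial_delay("word " * 20))  # about 8.22 s, matching table 3.2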

3.9 User test - Environment

A front-end for the question answering chatbot was made to make the chatbot more easily available for user testing purposes. The front-end was made using Flask11 together with Ngrok12, as inspired by a previous thesis at University of Twente [22]. Flask is a web application framework which makes it possible to locally host python applications. Ngrok is a cloud service that can be used to redirect traffic to a local network. The chatbot is therefore made available by locally running a flask application on a specific port and creating a tunnel through Ngrok to the specified port on the local machine. Ngrok will then generate a link with a randomized subdomain, e.g. https://d332866f.ngrok.io, where “d332866f” is the randomized subdomain. It is also possible to reserve subdomains by upgrading membership from the free version to a paid version. Ngrok also provides a “local inspector” where it is possible to get an overview of all the HTTP requests, e.g. to see the users’ questions and the chatbot’s answers. An initial flask application was created by partially following an online tutorial13. By following the tutorial, an HTML page was developed where it is possible to interact with the question answering chatbot. When a user sends a message on the HTML page, a python function is called that receives the user’s message as input and returns the chatbot’s reply as output. Thereafter, a type of feedback and additional HTML pages were added. Since the chatbot may take a few seconds if it generates a reply with the GPT-2 model, feedback was designed in the form of “loading dots” to reassure the user that the chatbot is working. After the user sends a message, “animated” dots are displayed to illustrate that the agent is working on retrieving or generating a reply. Figure 3.3 depicts the mechanic where, once the agent returns a reply, the dots are replaced by the agent’s message. The feedback can be designed in other forms; in this case, loading dots were chosen as they are easy to implement and common in messaging apps, e.g. Messenger. The additional HTML pages that were added were:

• A starting page

• A survey page

• An end page

11https://flask.palletsprojects.com/en/1.1.x/ 12https://ngrok.com/ 13https://dev.to/sahilrajput/build-a-chatbot-using-flask-in-5-minutes-574i


Figure 3.3: Feedback using "loading dots"

• A statistics page

The appendix shows some of these pages (A.6). The starting page informs the user about the user test in terms of their rights and how their data is saved and used in the study, as well as instructions regarding how to interact with the agent. The starting page includes a checkbox where the user agrees and gives their consent to the terms of the study. If the user gives their consent, then they are redirected to the chatbot interaction page. The user decides how long to interact with the agent, whereafter they are redirected to the survey page. The survey page consists of the survey questions, which are presented in (3.10.1). Once the form is sent in, the user is redirected to the end page, where they are informed that the test is over and that they may leave the page. The statistics page is made for the researchers to have a live overview of the number of unique visits and the number of completed surveys. Cookies are used to temporarily store user information. When a user enters the front page for the first time, they are given a unique integer user ID which is stored as a cookie, as well as a boolean-valued cookie. The boolean-valued cookie decides whether the user interacts with an agent that has artificial response delay or not, as described in (3.8). When the user sends messages to the agent, the message history is also temporarily stored as a cookie. Three additional cookies are used: one cookie stores the timestamp of when the user continued to the survey page, while the other two store information regarding the delay in the chatbot’s response. One keeps track of the maximum delay that the user experienced, and the other keeps track of the average response delay. The message history cookie is updated after every message sent by the user. The cookie is a python list where each entry contains the user’s message, the chatbot’s message, the timestamp when the user’s message was received and the timestamp just before returning the chatbot’s message. The delay cookies are also updated accordingly after each message. Data is saved on the server side at different steps of the user test. Once the user is given a user ID, the user ID is stored in a csv file together with the current timestamp. When the user is done interacting with the chatbot and redirected to the survey page, the message history is saved in another csv file. The data saved to the csv files is: user ID, the value of the timestamp cookie, the value of the message history cookie and the number of messages sent by the user. At a lower level, the chatbot always saves every answer that is obtained through generating with the GPT-2 model, along with the corresponding question, in a pandas dataframe. The chatbot also stores a pandas dataframe related to the retrieval database, such that if a user asks a question which leads to a retrieved answer, the question, answer and answer ID are saved in a user specific memory. The purpose of the memory is to ensure a consistent persona. If the user asks what the agent’s name is, then the agent will first randomly retrieve a name and thereafter remember the name and re-use it if the same user asks again. Therefore, the pandas dataframes are saved into their respective csv files at this point as well. The data could be saved once at the end of the user test instead, but this ensures more safety in case the application runs into any unexpected errors. Finally, once the user finishes the survey and submits the data, the survey answers are saved along with additional data into another csv file.
The additional data contains the user ID, the timestamp from the related cookie, the average delay and max delay from the delay cookies as well as which version of the test the user interacted with.


When multiple threads have access to a shared resource, it is important to lock the resource while it is used by one of the threads to avoid corrupting data; this is known as mutual exclusion (mutex) in concurrent programming. An example of how data may be corrupted is given below.

To update a variable, the variable needs to be read from main memory, incremented in local memory and saved back to main memory. If two threads were to update the same variable concurrently without locks, there may be many different interleavings of these three steps, with different results. For example, with a variable B and two threads called to increment the variable concurrently:

Variable B = 5
Thread 1 reads B, B = 5
Thread 1 increments B locally, B = 6
Thread 2 reads B, B = 5
Thread 1 saves B, B = 6
Thread 2 increments B, B = 6
Thread 2 saves B, B = 6

The expected output is 7 but the result will be 6 in this case.

In the system, each user interacting with the agent has their own thread and each user (or thread) may read or write data from/to a few csv files. Therefore, mutex locks are used to lock shared resources, allowing only one thread to read or write at the same time. This is done by placing a call to acquire the lock before using a shared resource and a call to release the lock once the thread is done using the resource. If a thread tries to acquire a lock while it is already held by another thread, the thread will pause its execution (blocking) and wait in a queue until the resource is ready. The generative model was unable to run concurrently. The reason was later found (4.6.7) to be the "gpt_2.py" script from the gpt-2-simple library, where the function to generate text re-loads the model, which causes problems when called concurrently. Before the discovery of the cause, a workaround was found by loading the model multiple times into separate TensorFlow sessions. This was done by duplicating the model on the computer’s hard disk and thereafter changing the TensorFlow variables into a different “scope”. Then, a queue system was developed by using locks to only allow one user to use one of the models at the same time. Loading multiple models was expensive in terms of VRAM, as one model’s size was near 500MB. Therefore, 4 copies of the generative model were made available in the queue system for the user test. The choice of 4 copies was sufficient for the scope of the user test, where a low number of subjects participated at the same point in time. The gpt_2.py script in the “gpt-2-simple” library was successively changed to work with the impromptu solution, by adding a “scope” parameter to the ‘load’ and ‘generate’ functions, such that the copies of the model could be successfully loaded and used to generate. If more than 4 users, e.g. 5, tried to call the generative model at the same time, then the fifth user would be placed in a queue waiting for a model to become available. This would lead to longer response times if several users interacted with the chatbot concurrently.
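A minimal sketch of such a lock around a shared csv file using Python's threading module (the file name is illustrative):

import csv
import threading

# One lock per shared csv file.
history_lock = threading.Lock()

def append_message_history(row):
    """Append a row to the shared csv file; only one thread may write at a time."""
    history_lock.acquire()
    try:
        with open("message_history.csv", "a", newline="") as f:
            csv.writer(f).writerow(row)
    finally:
        history_lock.release()

# Equivalent and more idiomatic: the context manager acquires and releases the lock.
def append_message_history_ctx(row):
    with history_lock:
        with open("message_history.csv", "a", newline="") as f:
            csv.writer(f).writerow(row)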

3.10 User test - Question Answering Chatbot

The purpose of the test for the question answering chatbot was to see how users would interact with the chatbot, and to find out how the chatbot was perceived, primarily in terms of realness and engagingness, from the users’ perspective. The aim of the user test was also to reveal any potential faults in the system, as well as what parts of the system could be further

improved. An additional testing objective was added during development: measuring how adding artificial response delay affects the users’ perception of the chatbot.

3.10.1 Survey

The survey part of the test instructs users to rate a number of statements on a 5-point Likert scale from 1-“Not at all” to 5-“Very much so”. The statements were:

• The chatbot is engaging/interesting to interact with. (engagingness)

• The time it takes to get an answer negatively affects my chatting experience. (unresponsiveness)

• It feels as if i’m chatting/talking with a real person. (realness)

• The chatbot gave contradicting answers. (inconsistency)

• The chatbot’s answers are related/relevant to my questions. (relevancy)

• It feels as if the chatbot’s answers are repetitive. (repetitiveness)

• The time it takes to get an answer from the chatbot is acceptable. (responsiveness)

• I would like to interact with the chatbot again in the future. (retention)

• It feels as if the chatbot has its own personal background and personality. (persona)

Some of the listed questions are only relevant for evaluating the question answering chatbot, such as whether the chatbot gave contradicting answers. However, to be able to compare the results with the other chatbots in the thesis and to draw conclusions for the research, some questions are applicable across all surveys, such as the questions that measure engagingness and realness. Moreover, the users were asked to enter their age, their perceived English proficiency and whether they had interacted with a chatbot previously: “Have you previously interacted with a chatbot?” (yes/no). The options for choosing English proficiency were given in the following order:

• Native speaker

• Fluent, C1 - C2

• Advanced, B2 - C1

• Intermediate, B1 - B2

• Beginner, A1 - A2

The options were transformed into values, 1-5, where 1 represents ‘beginner’ and 5 represents ‘native speaker’.

3.10.2 Method

The user test was made available online from 29th of April around 20.00 CET until 11th of May around 10.00 CET. The majority of test subjects participated between the dates 29/4 - 1/5. The user test was first posted in a Facebook group consisting of over 500 students (current students and past students from Linköping University) on the 29th. It was then posted in another student group the following day. On the 5th of May, the test was shared with a small group of researchers that work in related areas. The last subject participated on the 8th, a Friday, and the test was thereafter taken down on the following Monday.


The system ran locally on a laptop, where users could access the test through a tunnel by using Ngrok. The laptop had 16 GB of RAM, an NVIDIA GeForce RTX 2060 graphics card with 6 GB of VRAM and an Intel Core i7-9750H CPU @ 2.60 GHz with 6 cores.

Instructions

When the link to the user test was posted in the Facebook groups, it was posted with the following message:

“Hello there, I’m developing a chatbot as part of research for HMI, and I therefore ask you to test my chatbot. I would appreciate it if you could ask at least 5 (personal) questions to my chatbot and thereafter fill in a survey, rating 9 statements on a 0-5 scale. It should take between 5-10 minutes to participate. The test is expected to be available until the end of the week. (Ps. may not be suitable for mobile devices) “

The users were instructed to ask at least 5 questions to try to get to know the agent, as if it was a real person. They were also told to send complete sentences, one at a time. Once they decided that they were done asking the agent questions, the users were then finally instructed to fill in a short 10 question survey.

3.10.3 Hypothesis

Previous work has observed positive effects on users of dynamic response delay in customer service chatbots, compared to instantaneous responses [14]. Therefore, two hypotheses were formed:

1. Users perceive chatbots with artificial response delay to be more human-like (realness), compared to the same chatbots with instantaneous responses.

2. Users perceive chatbots with artificial response delay to be more engaging to interact with, compared to the same chatbots with instantaneous responses.

3.10.4 Results

The user test had 32 participants who interacted with the chatbot and successfully filled in the survey. The participants are assumed to be mostly students and some researchers, as the link to the test was mainly shared in student-related groups on Facebook. When asked “Have you previously interacted with a chatbot?”, 28 participants answered yes. The participants were asked to estimate their English proficiency from beginner to fluent, or if they were native speakers. The majority of participants estimated themselves to be fluent speakers and the average English proficiency was calculated as 3.875 (SD=0.609). The average age was calculated as 24.344 (SD=3.395). The user test was divided into two groups, DELAY and NO-DELAY, with 16 participants in each group. A participant was assigned to one of the groups based on whether their unique integer ID was odd or even. The number of messages sent to the agent was 240 in total, with an average of 7.5 messages per interaction. Out of the 240 messages, 199 were answered with generated answers, 15 with template answers and 26 with retrieved answers. 11 of the 32 subjects interacted purely with the generative model. As for the system’s response time, the time for the agent to send out an output after receiving an input was calculated as 5.983 seconds on average (all messages considered). The average response time for the DELAY group was calculated as 6.646 seconds, compared to 5.103 seconds for the NO-DELAY group. One of the subjects in the NO-DELAY group interacted with the agent for over 100 messages and is therefore considered an outlier. This subject is not included in the 240 messages mentioned previously or in the response time calculations.


After analyzing the message history between the participants and the agent, only 16 (50%) of the participants had interacted with the agent according to the instructions. By not following the instructions, the participants would interact with the agent outside of its designated application. Some individuals requested the chatbot to ask them a question. Other individuals asked non-personal questions, and others would merely interact with the agent to try to understand how the agent worked. The following questions or rules are used as guidelines to divide the samples into “ideal”, “border-line” and “out-of-scope” groups. For the most ideal case, the answer to the first question is ‘yes’ and the answer to the second question is ‘no’.

1. Were the majority of the questions asked, personal questions?

2. Did the user data contain non question messages?

The first rule is mainly considered whereas the second rule is less enforced, to allow some reactionary messages such as laughter or acknowledgement. The “border-line” group consists of interactions where the user partially asked personal questions but also asked task-oriented questions or interacted with the agent to explore its capabilities. The “out-of-scope” group consists of interactions where users clearly broke the two previously mentioned rules, e.g. by not asking enough personal questions and sending multiple non-question messages. Users that did not follow the instructions are not considered a valid representation of the question answering component for its intended application. As such, the “border-line” and “out-of-scope” groups are merged into a “non-ideal” group. The border-line group consisted of 8 participants and the out-of-scope group consisted of 8 participants as well. Examples of conversations from the three groups (ideal, border-line, out-of-scope) are provided in the appendix (A.7). Table 3.3 shows the survey results for the different groups, with the number of participants in each group shown in parentheses in the first row. T-tests were performed to find any significant differences in mean values between groups. The resulting p-values are shown in table 3.4. With a significance level of α = 0.05, it was found that there was a significant difference between the ideal and non-ideal groups on a majority of the measures (responsiveness, realness, relevancy, unresponsiveness and persona). It was hypothesized that the DELAY group would find the agent to be more engaging and human-like than the NO-DELAY group. Instead, the opposite was observed with statistical significance (p=0.037): the ideal NO-DELAY group perceived the agent to be more human-like than the ideal DELAY group. As for engagingness, the null hypothesis could not be rejected as there was no significant difference between the groups. Furthermore, it was found that there were significant differences between the ideal DELAY and ideal NO-DELAY groups on the measures of inconsistency and relevancy. Due to the unexpected findings, further investigation was performed by analyzing the conversations in the ideal group. It was found that roughly 59.6% of the agent’s responses made sense with respect to the user’s input in the ideal DELAY group, compared to 64.4% in the ideal NO-DELAY group. The ideal DELAY group consisted of 7 subjects with 6.71 inputs on average per conversation, compared to 9 subjects in the ideal NO-DELAY group with 6.56 inputs on average. The ideal DELAY group asked (roughly) 15 follow-up questions in total, where each user asked at least one follow-up question. The ideal NO-DELAY group asked (roughly) 13 follow-up questions in total, with 3 out of the 9 subjects not asking a single follow-up question. The ideal DELAY group had 3 subjects who interacted with the template or retrieval components, with 8 responses in total from those components. In the ideal NO-DELAY group, every subject got at least one response from the template or retrieval components, adding up to 23 total responses from those components. The results are discussed in chapter 8.


Measure           All (32)        Ideal (16)      Non-ideal (16)  All DELAY (16)  Ideal DELAY (7)  All NO-DELAY (16)  Ideal NO-DELAY (9)
engagingness      3.281 (1.301)   3.688 (1.138)   2.875 (1.360)   3.188 (1.109)   3.143 (1.215)    3.375 (1.500)      4.111 (0.928)
unresponsiveness  2.250 (1.391)   1.625 (0.885)   2.875 (1.544)   2.062 (1.436)   1.286 (0.488)    2.438 (1.365)      1.889 (1.054)
realness          1.969 (1.177)   2.375 (1.147)   1.562 (1.094)   1.625 (1.025)   1.714 (0.951)    2.312 (1.250)      2.889 (1.054)
inconsistency     3.188 (1.281)   2.750 (1.183)   3.625 (1.258)   3.438 (1.209)   3.429 (1.134)    2.938 (1.340)      2.222 (0.972)
relevancy         2.750 (1.136)   3.250 (1.125)   2.250 (0.931)   2.312 (0.793)   2.429 (0.787)    3.188 (1.276)      3.889 (0.928)
repetitiveness    1.938 (0.982)   1.750 (0.931)   2.125 (1.025)   1.812 (0.750)   1.571 (0.787)    2.062 (1.181)      1.889 (1.054)
responsiveness    3.812 (1.355)   4.312 (1.014)   3.312 (1.493)   3.812 (1.276)   4.143 (1.069)    3.812 (1.471)      4.444 (1.014)
retention         2.938 (1.458)   3.188 (1.515)   2.688 (1.401)   2.688 (1.493)   2.571 (1.718)    3.188 (1.424)      3.667 (1.225)
persona           2.750 (1.391)   3.250 (1.528)   2.250 (1.065)   2.438 (1.263)   2.429 (1.512)    3.062 (1.482)      3.889 (1.269)

Table 3.3: Survey results of the first user test, illustrating the mean and standard deviation (in parentheses) for the different groups.

Group              engagingness  unresponsiveness  realness  inconsistency  relevancy  repetitiveness  responsiveness  retention  persona
Ideal DELAY        0.092         0.185             0.037     0.038          0.005      0.518           0.573           0.158      0.054
All DELAY          0.690         0.455             0.099     0.277          0.027      0.480           1.000           0.340      0.209
Ideal - Non-ideal  0.077         0.009             0.049     0.052          0.010      0.287           0.034           0.340      0.040

Table 3.4: P-values when comparing survey results of a) the Ideal DELAY group against the Ideal NO-DELAY group, b) the All DELAY group against the All NO-DELAY group, c) the Ideal group against the Non-ideal group.

4 Development - Post User Test

After analyzing the user test, it was observed that the vast majority of replies returned by the agent, over 82% (199/240), were from the generative model. It was also observed that users oftentimes asked follow-up questions (A.7). Therefore, a decision was made to focus on improving the generative model by:

• Developing a ranking system for over-generated answers, since some generated answers are more relevant to the question than others.

• Training a new model that takes history into consideration, in hopes of being better able to answer follow-up questions by having short-term memory and context understanding.

Training a new model consists of creating new datasets, training different models on the new datasets and running tests to compare the models against each other and against the existing model (model selection). Other than improving the generative model, time was spent on refactoring the system, updating the template component and creating question and answer classifiers.

4.1 Refactoring

Before working on improving the generative model, some refactoring of the code took place. The agent consists of three components: templates, a retrieval database and the generative model. The components were split into individual python files instead of residing in the same python file. Another file was created to contain common models, functions and data shared between the components. By isolating the components, the aim was to make the system more understandable and to allow for easier component-wise development and testing. If one component had a bug, the component could be run by itself to find out why the bug occurred.

4.2 Template Component Improvements

From the user test, it was observed that one type of question intended for the template and sentiment component had not been taken into consideration during its design. The observed question was “What is your least favorite color?”, which was not recognized as a template question, and therefore the generative model generated an answer. This resulted in two changes. The first change was to add two more template questions with their respective answers.

New question templates:

• What is your least favorite <topic>?

• Do you have a least favorite <topic>?


New answer templates:

• My least favorite <topic> is <subject>.

• I do, my least favorite <topic> is <subject>.

The second change was to implement sentiment analysis to determine whether the user asked if the agent likes (positive sentiment) or dislikes (negative sentiment) something. The initial naive method made the assumption, based on observations from the conversations in the GTKY dataset, that the user is asking what the agent likes. The user’s input sentence is then tokenized and each token is compared to a list of sentiment keywords (like, love, dislike, hate), to naively find whether there is a positive or negative sentiment. The updated method instead implements VADER sentiment analysis. VADER is developed for short-text sentiment analysis [19] and was observed to work well for this specific use case. Two other sentiment analysis tools were tested: Stanford sentiment1 and TextBlob2. Neither of the two was able to give a non-zero sentiment to short texts, such as “Do you like apples?” or “I like apples.”. Previous research has compared different sentiment analysis methods and VADER was one of the best methods on social media related datasets [34].

Example of VADER sentiment analysis:

• What is your least favorite color? Sentiment: NEGATIVE (-0.357)

• Don’t you like apples? Sentiment: NEGATIVE (-0.2755)

• Do you like apples? Sentiment: POSITIVE (0.3612)

• I neither like them nor dislike them. Sentiment: NEUTRAL (0.0191)

• I like sports. Sentiment: POSITIVE (0.3612)

• What food don’t you like? Sentiment: NEGATIVE (-0.2755)

Another change, made while refactoring the code for the template structure, was to allow similar topic words to be used when finding the topic. Previously, a topic was found by matching the exact word with the topic word from the sentiment memory. The change was to instead use spaCy’s token similarity measure. By briefly experimenting during development, a threshold for the similarity was determined as 0.65. Words above the similarity threshold were translated into the topic to which the word had the highest similarity. Thereafter, ’colour’ could be translated into the topic ’color’, ’film’ into the topic ’movie’ and ’series’ into the topic ’TV show’.

1https://nlp.stanford.edu/sentiment/
2https://textblob.readthedocs.io/en/dev/
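A small sketch of the VADER check using the vaderSentiment package; the ±0.05 cut-offs are the commonly recommended defaults and not necessarily the exact thresholds used in the chatbot:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def question_sentiment(text):
    """Classify a short text as positive, negative or neutral using the compound score."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "POSITIVE", compound
    if compound <= -0.05:
        return "NEGATIVE", compound
    return "NEUTRAL", compound

print(question_sentiment("Do you like apples?"))
print(question_sentiment("What is your least favorite color?"))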

4.3 Follow-Up Question Test

A short test was performed on the current model (#10 in table 3.1) to find out if it was able to take in one previous question and answer pair, in addition to the new question, as input when generating an answer. This test was performed very briefly to find out if the model had context understanding without having been fine-tuned for it. It was observed that the model could in fact occasionally generate answers to follow-up questions. In the examples below, most answers are contradictory or make little sense, whereas some could work as an answer. This also illustrates the need for a ranking of answers.

Example 1
Generative model input:
<|startoftext|>What do you think about abortion? Hm...I don’t really like abortion that much. <|endoftext|> <|startoftext|>why not?
Output (possible responses):


• Because it might be safer for the baby.
• because it might help with your ADD and other stuff I guess? It might help with your immune response to certain types of viral infection.
• Because it may be cheaper to have an ectopic pregnancy.
• because it might end up with the baby they’re planning on having, not the mother.
• Because it might be cheaper and easier to get an abortion in NY.
• It would be nice if there were more clinics by the time the local one closed.
• It might be easier to handle if you know what’s best for you.
• Because it can be very difficult to navigate through the abortion industry.

Example 2
Generative model input:
<|startoftext|>What do you think about soccer? I’m not very fond of it. <|endoftext|> <|startoftext|>why not?
Output (possible responses):

• because it’s fun to watch and to relax.
• I don’t really like to play soccer.
• It’s difficult to tell without looking at the scores.
• because u are a guy.
• It’s difficult to tell... it’s pretty simple: ballgames, stats, practice, games, stats.
• because I don’t like the ball hockey has to offer.
• Because it’s too easy a game to miss while still having enough time for the teams play.
• Because it’s very slow-paced and involves lots of movement around corners.

4.4 Answer Ranking

It has been observed throughout the development that, given the same input, the generative model generates some answers that have little to no relevance and some answers that are relevant. Therefore, it was considered necessary to have a ranking. A ranking was developed where answers could be over-generated by the generative model and then ranked to find the most appropriate and/or interesting answer to the question. A related work [54], which developed a hybrid retrieval-generation neural conversation model, made use of BM25 for ranking candidate responses from the retrieval component. The system then used a neural network for re-ranking responses to return the highest ranked response as the output. Inspired by the previous work, BM25 and neural-network-based rankers were investigated, as well as another method called Latent Dirichlet Allocation (LDA). After the investigations, a decision was made to develop a novel ranker based on cosine similarity with additional penalty and reward functions, as described in section (4.4.4).

4.4.1 BM25

Initially, BM25, which is a “best-match” ranker used by search engines [35], was investigated to determine whether it could be used to rank the over-generated answers. A python library called rank-bm25 was used, as it provides a few different versions of BM25, such as Okapi BM25. We consider the generated answer sentences to be the corpus for the algorithm and the user’s question to be the query. The algorithm ranks the documents, or the answers in the corpus,

based on the words in the query (question). If no matching words are found, every answer gets a score of 0, which would result in arbitrarily picking an answer. Ranking answers solely based on matching words is not desired, as it cannot recognize answers that are relevant but lack shared words with the question. Therefore, other options were considered.
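A small sketch of how rank-bm25 scores candidate answers against a question; whitespace tokenization and the example sentences are illustrative:

from rank_bm25 import BM25Okapi

question = "Do you have any pets?"
answers = [
    "Yes, I have a pet.",
    "Yes, I have two collies.",
    "I enjoy hiking on the weekends.",
]

# The library expects pre-tokenized text; the generated answers form the corpus.
bm25 = BM25Okapi([a.lower().split() for a in answers])
scores = bm25.get_scores(question.lower().split())
print(scores)  # answers without shared words get a score of 0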

4.4.2 Neural Network Ranking

While reading about which rankers are used for ranking answers, it became apparent that some researchers worked on creating rankers with the help of neural networks, such as aNMM [53]. A toolkit called MatchZoo3 was found, which provides tools for different types of classification and ranking in natural language processing contexts. The toolkit had a way to train and use various neural network rankers, such as aNMM. Training data was constructed by using the retrieval database. Each question from the database was paired with a number of correct and incorrect answers, see figure 4.1. The correct answers were labeled 1 and were found by matching the answer ID related to the question. Incorrect answers were taken by randomly fetching an answer from the database and labeling the answer 0 if the answer ID did not match the question. Then, by following the steps provided by the toolkit’s tutorial4, different models were trained, such as aNMM as well as the toolkit standard “DenseBaseline”. However, the maximum achieved prediction accuracy (correctly labeling an answer to the question) was only near 68%. This might be explained by the low amount of training data in relation to the number of trainable parameters in the models. However, after reducing the trainable parameters from 319,862 down to 2,162 for the aNMM model, no significant change was observed. The parameters in the ‘DenseBaseline’ model were gradually reduced from 163,713 trainable parameters down to 344, yet no significant change was observed. The idea of using a neural network ranker was therefore put on hold, as the trained models were deemed unreliable.

Figure 4.1: Training data for a neural network ranker

4.4.3 LDA

Latent Dirichlet Allocation (LDA) [4] is a type of topic modeling. It assumes that a set of words makes up a topic, and that each document is a mixture of topics. It was found that LDA had previously been used for finding the topic relevance of over-generated questions [7], which is not too different from over-generating answers.

3https://github.com/NTMC-Community/MatchZoo
4https://github.com/NTMC-Community/MatchZoo/blob/master/tutorials/quick_start.ipynb


As such, the idea was to use LDA to classify the topics of the over-generated answers and the user’s question, and then rank the answers based on their topic distribution similarity to the question. A vector could be created for the question and each answer, and then cosine similarity could be used. Each value of the vector would consist of the percentage of how much a document (text) belongs to a topic. For example, with three topics and a document whose topic distribution is 60% topic 1, 40% topic 2 and 0% topic 3, the vector would be (0.6, 0.4, 0.0). A brief test was performed to see whether LDA could be used to find distinct topics in the GTKY dataset. A tutorial5 was followed that made use of the LDA implementation from a python library called gensim. However, the LDA option ended up not being fully explored, as the ranking aspect of it was abruptly replaced by a new idea which revolved around using the Universal Sentence Encoder [6] to create sentence embeddings and calculating cosine similarity. The topic classification aspect was replaced by clustering sentence embeddings and creating classifiers using the cluster labels. One reason that the LDA option was not fully explored was that it was unable to create a larger number of distinct topics, see figure 4.2. The topics were observed to have some overlap and were less distinct than previously observed in the clusters created during the data analysis part (3.4). The words that make up each topic, with their related weights, are illustrated in figure 4.3. It also became apparent that LDA works better for longer texts, as longer texts may contain more topics, whereas shorter texts, such as tweets or in this case sentences, mainly consist of one topic6. However, altered LDA algorithms [55] have been made specifically for short text topic modeling, which may be an option to consider if LDA were to be used.
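A minimal sketch of how such a topic-distribution vector could be obtained with gensim's LDA implementation; the toy corpus and the number of topics are illustrative (the brief test used 10 topics on the GTKY questions):

from gensim import corpora
from gensim.models import LdaModel

# Toy corpus of tokenized questions; the real test used the GTKY questions.
tokenized_questions = [
    ["what", "music", "do", "you", "like"],
    ["do", "you", "play", "any", "sports"],
    ["what", "is", "your", "favorite", "food"],
]
dictionary = corpora.Dictionary(tokenized_questions)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_questions]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)

def topic_vector(tokens):
    """Topic-distribution vector, e.g. (0.6, 0.4, 0.0) for three topics."""
    bow = dictionary.doc2bow(tokens)
    return [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]

print(topic_vector(["what", "food", "do", "you", "like"]))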

Figure 4.2: LDA topic distance when using 10 topics on the GTKY dataset.

4.4.4 Cosine Similarity With Penalty and Reward Functions

After trying BM25, neural network rankers and LDA, a novel ranking system was created based on the cosine similarity between (generated) answers and the current question by the user. The similarity ranker made use of a word length penalty function, a repeated keyword penalty function, a rare word reward function and a repeated answer removal function.

5https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ 6https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883


Figure 4.3: Words and weights for 10 topics as determined with LDA.

The similarity measure worked by transforming the user's input (question) and the over-generated answers into sentence embeddings (vectors) using the Universal Sentence Encoder [6]. Thereafter, the cosine similarity between each individual answer and the question could be calculated. The concept was extended to compare the similarity of answers not only to the current question but also to the conversation history. A history embedding was created as a vector that is continuously updated with every new question and answer sentence embedding. The ranking is then instead based on the cosine similarity between the answers and the history embedding. The update is a linear interpolation controlled by a parameter α, as seen in equation 4.1.

history_embedding = history_embedding · α + new_embedding · (1 − α)    (4.1)

A small α results in the history embedding being "near-sighted" by mostly considering new input. A larger α results in the history embedding being "far-sighted" by mostly considering the past history. By updating a history embedding in this manner, the hypothesis is that the system will have a sense of the topic being discussed. The aim was to enable the agent to answer (at least) one follow-up question, especially unspecific ones such as "Why?" questions.
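A minimal sketch of the history embedding update in equation 4.1 and the resulting similarity ranking is shown below, assuming the publicly available TF-Hub release of the Universal Sentence Encoder; α is a parameter here (the thesis uses 0.2, see below).

# Sketch: history embedding update (eq. 4.1) and cosine similarity ranking.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def update_history(history_embedding, new_text, alpha=0.2):
    # Linear interpolation between the old history and the new sentence embedding.
    new_embedding = embed([new_text]).numpy()[0]
    if history_embedding is None:
        return new_embedding
    return history_embedding * alpha + new_embedding * (1 - alpha)

def rank_answers(history_embedding, answers):
    # Sort the over-generated answers by similarity to the history embedding.
    answer_embeddings = embed(answers).numpy()
    scored = [(cosine(history_embedding, e), a) for e, a in zip(answer_embeddings, answers)]
    return sorted(scored, reverse=True)

history = update_history(None, "Do you like to watch movies?")
print(rank_answers(history, ["I love watching movies.", "I prefer cooking."]))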

Follow-up question example
Q: Do you like to watch movies?
A: I love watching movies but I don't have a lot of free time as of late.
Q: Why?

The α was set to 0.2; the low α ensures that the agent, when ranking the answers, considers the new question the most while still retaining some sense of the topic. By running experiments, it was observed that the ranking led to relevant answers being returned more often compared to when not using any ranking (4.4.5).

However, an apparent downside of using the similarity ranking is that it prioritizes answers containing the same word(s) as the user's question. This means that answers that are relevant and perhaps more interesting are ranked lower than less interesting answers that reuse those word(s).

Similarity ranking example, similarity score in parentheses
Q: Do you have any pets?
A-1: Yes, I have a pet. (0.594)
A-2: Yes, I have two collies. (0.496)

A-1 is ranked higher as it shares a keyword with the question, but A-2 is considered more interesting and intelligent as it does not use the keyword directly, yet is still related to, and answers, the question. In this thesis, answers are deemed more interesting if they are longer (but not too long, e.g. longer than two sentences) and avoid sharing keywords when possible. The observed downside of this ranking led to the development of penalty and reward functions on top of the similarity measure as a means of achieving more interesting answers.

The following penalty and reward functions were developed to be used together during ranking. Extensive testing may be required to optimize them, as their purpose is to enable answers with slightly smaller similarity scores than the largest one to surpass it in the ranking. To optimize the penalties and reward(s), it may be necessary to let the generative model generate answers to several questions, observe the answers and their respective similarity scores, and then decide what an acceptable similarity range is. By knowing the acceptable similarity range (the border between relevant and irrelevant answers), it is possible to determine how much an answer can be affected by the penalty and reward functions. If this range is not found or known, irrelevant answers risk being ranked higher than relevant ones, which is undesirable.

The first penalty function that was developed was based on sentence length. The function is given an ideal number of words, and any answer with fewer or more words than the ideal number is penalized, see eq. 4.2 and 4.3. In this case the ideal number of words was set to 20, although a dynamic length matching different users' message lengths is recommended for further investigation, inspired by the finding that humans tend to match their response length to that of their conversational partner [36].

token_difference = |ideal_tokens − n_tokens|    (4.2)

length_penalty = ln(token_difference) if token_difference > 0, otherwise 0    (4.3)

Using the length penalty ensures that, when there exist multiple relevant answers with similar similarity scores, the sentence closer to the ideal length is more likely to be chosen. This function's main purpose is to reduce the occurrence of answers that are very relevant to the question but are also very short and therefore deemed less engaging.
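A minimal sketch of the length penalty (eq. 4.2 and 4.3) and its application is given below; the normalization of the penalties and the β weight follow the description in the text, but the exact details are assumptions.

# Sketch: length penalty (eq. 4.2 and 4.3) applied to similarity-scored answers.
import math

IDEAL_TOKENS = 20

def length_penalty(answer):
    # ln of the difference between the answer length and the ideal length.
    token_difference = abs(IDEAL_TOKENS - len(answer.split()))
    return math.log(token_difference) if token_difference > 0 else 0.0

def apply_length_penalty(scored_answers, beta=0.1):
    # Subtract a fraction (beta) of the normalized penalty from each similarity score.
    penalties = [length_penalty(answer) for _, answer in scored_answers]
    max_penalty = max(penalties) or 1.0
    return [(score - beta * (penalty / max_penalty), answer)
            for (score, answer), penalty in zip(scored_answers, penalties)]

answers = [(0.749, "I like all kinds of music."),
           (0.732, "I like all kinds of music but I mostly listen to rap.")]
print(apply_length_penalty(answers))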

Ranking example before length penalty
Q: What music do you like?
A-1: I like all kinds of music. (0.749)
A-2: I like all kinds of music but I mostly listen to rap. (0.732)
A-3: I'm not sure but I think I would have to say that Van Gogh is my favorite artist. (0.350)

In this example, by turning the question and answers into sentence embeddings and calculating the cosine similarity of each answer to the question, the similarities for A-1, A-2 and A-3 are calculated as 0.749, 0.732 and 0.350, respectively. By purely using cosine similarity as a ranking, A-1 would be chosen, even though A-2 contains more information that may make the conversation more engaging by allowing a follow-up question such as "Who is your favorite rapper?".

Therefore, a length penalty is applied. Two options are considered for applying the penalty. One is to first normalize the similarity scores, such that A-1 becomes 1.0 and A-3 becomes 0.0. This creates further "distance" between higher and lower scored answers. The other option, which is used in this case, is to use the similarity score as it is, which increases the risk that answers with smaller similarity scores, e.g. A-3, surpass A-1 or A-2 with the help of the penalty and reward functions. Regardless of the option chosen, the penalty scores are normalized and a percentage of the normalized penalty score is subtracted from the similarity score. The "percentage" subtraction is another parameter β, in this case 0.1. By following the second option, the answer that would be chosen after applying the length penalty is now A-2. Updated scores:

• A-1: 0.749 -> 0.649
• A-2: 0.732 -> 0.652
• A-3: 0.350 -> (unchanged)

The second penalty function was developed to penalize answers that contain keywords from the question. The penalty intensity correlates with the number of unique shared keywords, see eq. 4.4 and 4.5. This type of penalty may lead to the occurrence of more interesting answers. Using a variation of the previous 'pet' example, A-2 would have a higher chance of being the chosen answer when using this penalty function. Although A-2 gets a higher chance of being selected, it would not be selected when only using this penalty function, but it might be selected when it is used together with the other penalty and reward functions.

Keyword penalty example 1
Q: Do you have any pets?
A-1: Yes, I have a pet. (0.594) -> (0.527)
A-2: Yes, I have two collies. (0.496) -> (unchanged)
A-3: No pets. (0.685) -> (0.564)

However, in many cases, re-using keywords from the question in the answer occurs naturally, and as such these answers are also inadvertently penalized.

Keyword penalty example 2
Q: Who is your favorite rapper?
A-1: I don't have a favorite rapper. (0.800) -> (0.678)
A-2: My favorite rapper is Eminem. (0.793) -> (0.707)
A-3: Eminem. (0.645) -> (unchanged)

The function first tokenizes the question and all the over-generated answers. Then stopwords are removed using NLTK's stop-word dictionary. For each answer, the remaining tokens/words are looped through and compared with the remaining tokens in the question. For every unique token in the answer that exists in the question, a counter "shared_tokens" is incremented. A percentage of the calculated penalty is then subtracted from the similarity score. The "percentage" is another parameter ω.
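A rough sketch of the shared-keyword penalty described above (formalized in eq. 4.4 and 4.5 below) is shown here, interpreting val in eq. 4.5 as the penalty magnitude from eq. 4.4; the ω value used in the call is a placeholder.

# Sketch: shared-keyword penalty using NLTK stopwords (eq. 4.4 and 4.5).
# Requires the NLTK 'punkt' and 'stopwords' data packages.
import math
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def content_tokens(text):
    return [t.lower() for t in word_tokenize(text)
            if t.isalpha() and t.lower() not in STOPWORDS]

def keyword_penalty(question, answer, const=2.0):
    q_tokens = content_tokens(question)
    a_tokens = content_tokens(answer)
    shared_tokens = len(set(a_tokens) & set(q_tokens))
    if not q_tokens or not a_tokens or shared_tokens == 0:
        return 0.0
    penalty_magnitude = (shared_tokens / len(q_tokens)) * (shared_tokens / len(a_tokens))
    return math.log(const + penalty_magnitude) - math.log(const)

def apply_keyword_penalty(question, scored_answers, omega=0.15):
    # Subtract a fraction (omega) of the penalty from each similarity score.
    return [(score - omega * keyword_penalty(question, answer), answer)
            for score, answer in scored_answers]

print(apply_keyword_penalty("Do you have any pets?",
                            [(0.594, "Yes, I have a pet."),
                             (0.496, "Yes, I have two collies.")]))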

penalty_magnitude = (shared_tokens / num_q_tokens) · (shared_tokens / num_ans_tokens)    (4.4)

shared_word_penalty = ln(const + val) − ln(const), const = 2    (4.5)

A reward function was developed that rewards answers containing less common words, as an additional way to counteract the prioritization of answers that share keywords with the question. For example, the name of a music genre will be rarer than the word "music".


Example without reward function
Q: What music do you like?
A-1: I like all kinds of music. (0.749)
A-2: I like to listen to a lot of rap. (0.642)

The reward function was created by using the Persona-Chat and the GTKY datasets to create weights for each word by calculating their idf weight (inverse document frequency). Relating to the previous example, the calculated idf weight for music is 7.165 while it is 9.733 for rap. The weights were created by first tokenizing the merged datasets and then lemmatizing each token using spaCy. Two functions from the sklearn library were used: CountVectorizer to create a vocabulary of the data, and TfidfTransformer to calculate the idf weights. Finally, a pandas DataFrame was created containing the words and their respective weights, which was then turned into a csv file. The reward function then looks up the weights for each word in an answer, sums up the weights, and divides the sum first by the number of tokens and then by the mean idf weight. A percentage of the calculated reward is added to the similarity score, based on a parameter θ.
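The following sketch outlines how the idf weight table and the rare-word reward could be built with sklearn; the corpus file and the θ value are placeholders, and lemmatization with spaCy is omitted for brevity.

# Sketch: idf weight table (CountVectorizer + TfidfTransformer) and rare-word reward.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("personachat_plus_gtky.txt").read().splitlines()  # merged datasets, assumed file

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
transformer = TfidfTransformer().fit(counts)

idf_table = pd.DataFrame({"word": vectorizer.get_feature_names_out(),
                          "idf": transformer.idf_})
idf_table.to_csv("idf_weights.csv", index=False)

idf_lookup = dict(zip(idf_table.word, idf_table.idf))
mean_idf = float(np.mean(transformer.idf_))

def rare_word_reward(answer, theta=0.1):
    # Average idf of the answer's words relative to the mean idf, scaled by theta.
    tokens = [t for t in answer.lower().split() if t in idf_lookup]
    if not tokens:
        return 0.0
    return theta * (sum(idf_lookup[t] for t in tokens) / len(tokens) / mean_idf)

print(rare_word_reward("I like to listen to a lot of rap."))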

4.4.5 Ranking Tests

A test was performed by one internal person (a student with a computer science background) to compare answers chosen by three methods:

1. Using no ranking

2. Only similarity ranking

3. Similarity ranking with length penalty.

The test was made by feeding the generative model questions from the user test data, storing the best answers according to the two ranking methods, and storing the top answer for the method without any ranking. The evaluation was done manually by looking at the question and the answers from each method. The evaluation was performed in a one-on-one style where a point was awarded to the answer that was most relevant to the question. If both answers were equally relevant or irrelevant, then no score was awarded (tie). The distributed points of method 1 against method 2 are shown in table 4.1, while method 2 against method 3 is shown in table 4.2.

Method      No ranking   Ranking   Null (Tie)
Relevancy   1            10        44
Engaging    10           16        29

Table 4.1: Points awarded to Method 1 and Method 2 in a one-on-one style evaluation on relevancy and engagingness of generated answers to 55 questions.

Method      Ranking   Ranking + Penalty   Null (Tie)
Relevancy   -         -                   -
Engaging    5         8                   42

Table 4.2: Points awarded to Method 2 and Method 3 in a one-on-one style evaluation on relevancy and engagingness of generated answers to 55 questions.

When comparing method 1 and method 2 on 55 questions, it was deemed that method 2 had relevant answers more often than method 1. For 44 of the questions, no points were awarded. Method 1 was awarded one point, whereas method 2 was awarded 10 points. Note that some answers could not be compared because the question was a follow-up question specific to the original answer from the user test. In such cases, the question was considered a tie.

Method 3 was not compared as it is very similar to method 2 in terms of relevancy.

Another evaluation was done to measure which ranking method had the most interesting answers. Interesting answers are in this case considered to be answers that contain more substance/information/detail, for example "I like all types of music but my favorite is rap music." instead of "I like all types of music." Although the evaluation may have been subjective, it was done for the purpose of making a fast development choice. With more resources, the evaluation could have been performed by utilizing crowdworkers to award points, whereafter an average could be calculated. The score resulted in 10 points for method 1 and 16 points for method 2. When comparing method 2 and 3, method 2 got 5 points and method 3 got 8 points. The lower total number of points in the comparison of method 2 and 3 was due to a larger number of tied answers; as the two methods are both based on a similarity measure, a lot of answers were exactly the same.

A conclusion was made that adding similarity ranking ensures that answers are more often relevant to the question than if no ranking is used. Additionally, by using the length penalty, more interesting answers were found (slightly) more often.

4.5 Question and Answer Classifiers

Previously, during data analysis, clustering had been used to find the most distinct topics of the GTKY dataset. An idea was to make use of the clustering to implement a way to classify which topic a question belongs to. Question and answer classifiers could be useful for ranking answers, e.g. by only considering answers classified as the same topic as the question. Another potential use-case was to create rules for how the chatbot should respond based on different message types (e.g. acknowledgement or laughter, etc.).

The question classifier was developed by first extracting each sentence that ended with a question mark and then turning the sentences into sentence embeddings using the Universal Sentence Encoder. Thereafter, a function called AgglomerativeClustering from the scikit-learn library was used to create 20 clusters. Each cluster was manually checked by printing the sentences that belong to a specific cluster. Thereafter the clusters were manually given descriptive names based on their content as a complement to the automatic integer labels (0-19), as shown in table 4.3. Some clusters were similar but contained different questions, such as 5 and 11, 6 and 16, as well as 3 and 9.

A classifier could be trained by using the cluster (integer valued) labels as output and the sentence embeddings as input. The data was split into 80% training data and 20% testing data. A hypothesis was made that a distance-based classifier would achieve higher accuracy compared to other common classification methods. The hypothesis was grounded in the fact that sentence embeddings were used, meaning that similar sentences are closer together in vector space. Therefore, a K-nearest neighbor (KNN) classifier was initially trained on the training data. Additional classifiers such as 'decision tree' and 'random forest' were also trained. The KNN classifier achieved the best prediction accuracy on the test data, near 90%, and was therefore chosen. Since some clusters were similar, similar clusters were merged one by one and then a comparison was made to see how the prediction accuracy was affected. This resulted in clusters 5 and 11 being merged, as well as 6 and 16. No improvement was observed when merging clusters 3 and 9 and therefore they remained separate. Once the testing was over, a new KNN classifier model was trained on all the data (merged training and testing data); a sketch of this embed-cluster-classify pipeline is shown at the end of this section.

A brief test was held to find out if the classifier could classify answers into the question topics as well. It was observed that the classifier had a more difficult time classifying answers, and therefore a separate answer classifier was developed. Each sentence that did not end with a question mark was extracted from the GTKY dataset and the same method was applied. After examining the 'answer' clusters and the distribution of samples in each cluster, some clusters had a larger number of samples than others; therefore, more clusters were created.


Cluster ID   Descriptive label
0            hobbies/interests
1            where are you from?
2            kids/married/pets/male/female
3            follow up question type
4            job
5            and you? (reflect question)
6            student/studies
7            movies/series/reading
8            sports/games
9            follow up question type
10           how long...?
11           reflect question
12           food/drink
13           weather
14           how are you?
15           what's your name?
16           study
17           music/instrument
18           really? (surprised reaction)
19           traveling

Table 4.3: 20 (question) clusters formed from the GTKY dataset using agglomerative clustering.

A cluster count of 40 was instead chosen in this case. Some clusters were removed (1, 5, 8, 11, 14, 15, 21, 23, 38) due to either consisting of a larger range of mixed topics or nonsensical data. The remaining clusters are presented in table 4.4. Cluster 36 consisted of "Nice to meet you" sentences and was removed as it was deemed out of scope for the question answering chatbot. Cluster 17 consisted of sentences such as "I will try that" and was also removed for the same reason as cluster 36. Cluster 28 was removed as it consisted of only smiley faces '=)'. An idea occurred that some answer clusters could be used to extend the question classifier. The question classifier was trained once again with 11 more clusters from the answer clustering (36, 34, 33, 30, 29, 27, 24, 22, 20, 10, 7). The new clusters offered classification for various acknowledgements and other common messages such as "Hello", "Thank you", "good morning", etc. Thereafter the answer classifier was trained, and the two trained classification models were saved.
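As referenced above, the following sketch outlines the embed-cluster-classify pipeline used for both classifiers; the question list is a small placeholder and the cluster and neighbor counts are reduced accordingly (the thesis used 20 question clusters and 40 answer clusters).

# Sketch: embed questions, cluster them, and train a KNN classifier on the cluster labels.
import tensorflow_hub as hub
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

questions = [  # placeholder for the extracted GTKY questions
    "where are you from?", "what do you do for a living?", "do you have any pets?",
    "what music do you like?", "how are you doing today?", "do you play any sports?",
    "what is your name?", "have you traveled anywhere recently?",
]

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(questions).numpy()

# Cluster the embeddings and use the integer cluster ids as training labels.
labels = AgglomerativeClustering(n_clusters=4).fit_predict(embeddings)

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))

# After evaluation, a final model is refit on all the data before being saved.
knn.fit(embeddings, labels)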

4.6 Generative Component Improvements

Improvements to the generative component consist of developing an answer ranking, creating new datasets with conversation history taken into consideration, fine-tuning new models, model selection, and general testing.

4.6.1 Preprocessing and Creating New Datasets II

In order to fine-tune a model to have short-term memory or context understanding, it is necessary to create new datasets. The main idea was to preprocess the Persona-Chat dataset to retain a history of past questions and answers in a conversation. The Persona-Chat data was first preprocessed into questions and answers, separated by conversation. Thereafter, while looping through the questions and answers in a conversation, the past questions and answers were included in a history section for each future question. This was structured by introducing new tokens called "<|startofhistory|>" and "<|endofhistory|>".


Cluster ID   Descriptive label
0            muscle/car/biking
2            location/residence
3            likes/plans/outdoor
4            reading/studying
6            movies/series
7            good morning / i'm good
9            travel/language
10           LOL, hahaha
12           work
13           years/numbers
16           dance/music
18           hobbies/interests
19           study/Work
20           "awesome" "cool" "excellent" "that's great"
22           "understandable" "fair enough" "interesting"
24           yes/no/indeed
25           pets
26           food/cooking
27           "thank you"
29           hi / hello
30           "good luck" "congrats"
31           age, social status, children
32           sports, sports teams
33           "me too" "same here"
34           my name is
35           weather
37           money/work
39           study/student/teacher/classes

Table 4.4: Answer clusters formed from the GTKY dataset using agglomerative clustering.

Training data format example
<|startofhistory|><|endofhistory|> <|startoftext|>hi , how are you doing today ? i am spending time with my 4 sisters what are you up to<|endoftext|>

<|startofhistory|><|startoftext|>hi , how are you doing today ? i am spending time with my 4 sisters what are you up to<|endoftext|><|endofhistory|> <|startoftext|>what do you do for a living ? i'm a researcher i'm researching the fact that mermaids are real<|endoftext|>

The GTKY-preprocessed dataset was similarly preprocessed to have history tokens. However, as the dataset had been first automatically and then manually processed, there is no known start and end of a conversation, and as such the history tokens are included purely to teach the model the pattern.

GTKY training data format
<|startofhistory|><|endofhistory|> <|startoftext|>What is your favorite foods? i like soups: kale-white bean-bacon is one of my favs.<|endoftext|>

<|startofhistory|><|endofhistory|> <|startoftext|>Have you been to the Tennement Museum in NYC? i have not been to that museum.<|endoftext|>


The datasets were named PC-history and GTKY-history. Some variations of the history-focused datasets were created. One variation mixed all of the questions and answers with history tokens from the GTKY-history dataset into the PC-history dataset. The data was mixed by iteratively taking one "conversation" from the GTKY-history dataset and then one conversation from the PC-history. This was done until the smaller dataset (GTKY-history) ran out of conversations, at which point the remaining conversations from the other dataset were inserted.

Mixing GTKY-history into PC-history
GTKY
PC
GTKY
PC
... (GTKY-history out of conversations)
PC
PC

Similarly, the GTKY-history was merged with a few entries from the PC-history to double the size of the dataset. For every 4 GTKY-history "conversations", one conversation from PC-history was added. Note that the PC-history is 11.1 MB while GTKY-history is 166 kB. The reason to merge was that the models were initially fine-tuned on the PC-history dataset and then fine-tuned on the GTKY-history dataset, as the latter contains cleaner (manually processed) data. Since the GTKY-history does not contain a history of questions per conversation as the PC-history does, it was hypothesized that the model might 'forget' to use history after being fine-tuned on the GTKY-history. Forgetting here refers to the model generating questions and answers without adding previous questions and answers into the history section. Therefore, by mixing GTKY-history and PC-history, it is assumed the model does not 'forget' to use history.

A second variation extended the previous variation by including a limit on the number of past question and answer pairs in the history section. The limit was arbitrarily set to 4, such that only the 4 most recent past pairs were kept in the history for the conversation.

The last variation used the Persona-Chat dataset without extracting questions and answers. Thereby the dataset retains the original conversation order while taking history into consideration. The PC-history dataset contains a history of past questions and answers, but some of the context may be lost between one pair and another due to the automatic extraction method. Therefore, the idea with the last variation was to initially teach the model a better context understanding and thereafter fine-tune it for question answering.

Example of loss of context due to the extraction method
Past pair (history): no kidding ? of course . i love to listen to rock .
New pair: wow , does he live there or work ? live . moved there about ten years ago for a computer tech job .

4.6.2 Fine-Tuning New Generative Models

After creating new datasets, new models could be fine-tuned. An overview of the models is shown in table 4.5. First, model #18 was fine-tuned for 1000 iterations on the PC-history dataset and then for 60 iterations on the GTKY-history dataset. The difference in iterations is due to the size difference between the two datasets. Model #19 was fine-tuned for 1000 iterations on the PC-history mixed with GTKY-history dataset and thereafter for 250 iterations on the GTKY-history that was mixed with a few samples from the PC-history dataset. Model #20 was fine-tuned for 1000 iterations on the Persona-Chat dataset with original context, mixed with GTKY-history.


Model   Fine-tuning description
#18     1000 iterations on the PC-history dataset, thereafter 60 iterations on the GTKY-history dataset.
#19     1000 iterations on the PC-history mixed with GTKY-history dataset, thereafter 250 iterations on GTKY-history mixed with a few samples from PC-history.
#20     1000 iterations on the Persona-Chat dataset (original context), mixed with GTKY-history.
#21     Similar to #19 but with a 4 question and answer pair limit in history.

Table 4.5: Fine-tuning GPT-2 models with history.

The last model was fine-tuned similarly to model #19 but with the 4 question and answer pair history limit. Each model was a 124M parameter model. Additional fine-tuning experiments had previously been done after model #10 and before model #18, but these models were not documented.

4.6.3 Context Testing and Automatic Evaluation

Based on the fact that the use of similarity ranking results in the selection of more relevant answers (4.4.5), an automatic evaluator was created based on similarity scores. As such, the metric has an inherent bias towards answers which are more similar to the question. The automatic method did not consider ties, but in hindsight the concept could be implemented by using a ∆similarity threshold for how large the similarity difference should be for an answer to be considered more relevant. A test was held to compare whether using one message pair as history would lead to more relevant answers than if no previous history/context was given, when generating an answer to the next question. An overview of the resulting scores is provided in table 4.6. Two generative models ran in parallel to generate answers: one was given no message history in the input, while the other received a 1 question and answer pair history. Three results from three methods were saved:

• The “vanilla answer”

• Vanilla answer with ranking

• Answer with 1 message history and ranking

Method             Vanilla   Vanilla w/ ranking   History + ranking
Auto.              10        37                   42
Auto. One-on-One   13        76                   -
Manu. One-on-One   7         22                   -
Auto. One-on-One   -         33                   56
Manu. One-on-One   -         15                   14

Table 4.6: An overview of the distributed points to three methods over internal automatic and manual tests.

Vanilla refers to the method as seen in the user test: no previous context given to the model and simply returning the top answer in the list of over-generated answers. For each answer from each method, there was a separate history embedding that was updated. Cosine similarity was then calculated between the answer and the answer's history embedding, to serve as the metric for automatic evaluation. The answer with the highest similarity was determined as the winner for the current user question. The results when generating answers to 89 questions were [10, 37, 42], meaning that 10 points were given to the vanilla method, 37 points to the second method, and 42 points to the third method.


When running one-on-one automatic evaluation of the vanilla method against the second method, the score was [13, 76], with 76 going to the second method. Manual evaluation was also performed, to confirm whether the automatic evaluation worked, which resulted in [7, 22], meaning that most questions tied. Automatic evaluation of the second method against the third method resulted in [33, 56], with 56 going to the third method. Manual evaluation resulted in [15, 14]. It was observed that using history sometimes led to repeated answers, which affected the evaluation: the model generated, and through ranking selected, the same answer that was already given to the previous question. One idea was therefore to only use history when observing that the question is a follow-up question. Another idea was to remove repeated answers, see section 4.6.4.

Another test was performed where one model always got 1 message pair as history and another model got the entire history of the conversation as input. Automatic evaluation resulted in [36, 40], whereas manual evaluation resulted in [10, 15]. Although the model with the entire history scored better on manual and automatic evaluation (40 and 15), it was observed to generate more repeated answers.

A conclusion of the context testing is that providing the model with one message pair as history improves its chances of answering follow-up questions, although it may lead to repeated answers, which affects the number of relevant answers. Therefore, it may be necessary to identify follow-up questions (4.5) such that history is only provided if the question is a follow-up question, to reduce occurrences of repeated answers, or to attempt to remove the repeated answers (4.6.4).

4.6.4 Repeated Answer Removal

After introducing history that is given to the generative model, it was observed that the model would generate and choose the exact same answer that was already given in the past reply. This is more an issue related to fine-tuning the model on more and better data than it is a ranking and answer selection problem. However, at this point there was no more time for creating datasets to train new models. Therefore, a function was created that takes in the past answer and all newly generated answers and removes any answer that shares a majority of words with the past answer (majority determined as above 66.67%). The function does not take word order into consideration; it simply tokenizes the answer and counts the number of shared tokens/words. If the share of words is above 66.67%, then the answer is removed from the list of answers before the answers go through ranking.
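A minimal sketch of this filter is shown below; whether the share is computed over the new answer's tokens, the past answer's tokens, or both is an assumption.

# Sketch: drop newly generated answers that share most of their words with the past answer.
def remove_repeated_answers(past_answer, new_answers, threshold=2 / 3):
    past_tokens = set(past_answer.lower().split())
    kept = []
    for answer in new_answers:
        tokens = answer.lower().split()
        if not tokens:
            continue
        shared = sum(1 for token in tokens if token in past_tokens)
        if shared / len(tokens) <= threshold:
            kept.append(answer)
    return kept

print(remove_repeated_answers(
    "I love watching movies but I don't have a lot of free time.",
    ["I love watching movies but I don't have much free time.",
     "I have been very busy with work lately."]))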

4.6.5 Saving and Re-Using Past Messages

After training new models that take past questions and answers in a dedicated 'history' section as part of the input to the generative model, a new structure was made to save questions and answers. The question and answer classifiers were used to first classify the question and the answer, and then the information was stored in a pandas dataframe. The information stored was the question, the answer, and their respective classification labels, along with the user id. The history that was saved included all questions and all answers, from any component (template, retrieval, generative). The generative component made use of the user history by first classifying a new question, and then retrieving up to the 4 most recent questions with the same question classification, along with their answers. It also retrieved the question and answer that occurred just before the new question, regardless of its topic. If a question was classified as a follow-up question, then the question was labeled with the previous class when storing it in the user history.
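A minimal sketch of this user-history store and retrieval is given below; the column names are assumptions.

# Sketch: store classified question/answer pairs per user and retrieve recent same-topic pairs.
import pandas as pd

history = pd.DataFrame(columns=["user_id", "question", "answer", "q_class", "a_class"])

def store_pair(user_id, question, answer, q_class, a_class):
    global history
    row = {"user_id": user_id, "question": question, "answer": answer,
           "q_class": q_class, "a_class": a_class}
    history = pd.concat([history, pd.DataFrame([row])], ignore_index=True)

def retrieve_context(user_id, q_class, max_pairs=4):
    # Up to the 4 most recent pairs of the same question class for this user.
    rows = history[(history.user_id == user_id) & (history.q_class == q_class)]
    return rows.tail(max_pairs)[["question", "answer"]].values.tolist()

store_pair("u1", "What is your favorite sport?", "My favorite sport is football.",
           "sports/games", "sports, sports teams")
print(retrieve_context("u1", "sports/games"))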

Question classification example
Q1: What is your favorite sport? (sports/games)


A1: My favorite sport is football.
Q2: Why? (follow-up question type) -> (sports/games)

The history embedding was updated in turn, starting with the oldest question and answer in the history, up until the new question. Then the input, consisting of the history and the new question, was given to the generative model to over-generate a number of answers, in this case 8. The answers were then processed to remove occurrences of tokens. The answer processing function was updated to consider the new history tokens <|startofhistory|> and <|endofhistory|>. The answers were then ranked, and the top ranked answer was returned. The answer classification was never utilized in the agent due to time constraints, although it could ideally be used not only as part of the ranking or a pre-ranking filter but also to find contradictions with the help of Natural Language Inference (NLI) methods. The idea for finding contradictions was to compare the generated answers with the past answers on the topic. If the agent said that it loves being outdoors and hiking, then generated answers that contradict this statement should in the future be removed before ranking. Contradiction detection and avoidance is left as future work.

4.6.6 Model Selection

Four models were fine-tuned to consider history (#18, #19, #20, #21). Therefore, automatic and manual tests were performed internally in order to find which model to use in the system. As such, the points distributed in the tests were used solely for comparing the models to each other. The results of the tests are shown in table 4.7.

Eval. Type   Criteria         Dataset           #18   #19   #20   #21
Automatic    Highest ranked   User data         67    69    70    68
Manual       Acceptable       User data         85    76    64    98
Automatic    One-on-One       User data         131   -     -     123
Automatic    One-on-One       Clean user data   67    -     -     61
Automatic    Highest ranked   Generated         33    17    21    15
Automatic    One-on-One       Generated         52    -     -     34
Manual       Relevancy        Generated         22    -     -     17

Table 4.7: An overview of the distributed points to four trained models over different internal automatic and manual tests.

A script was made that ran the system on every conversation from the user test data by feeding the system the users' utterances. The script ran once for each model, and the inputs (user utterances) and outputs (answers) were stored in a csv file. Automatic evaluation was then performed by, for each question, loading the answer given by the respective models. The four answers were then ranked through the existing ranking system to find which model's answer was ranked the highest. To rank the answers, a history embedding was necessary to find the relevance between the answer, the question and the previous history. In this case, a history embedding was maintained by only updating the embedding on the questions instead of on questions and answers. The automatic evaluation resulted in [67, 69, 70, 68], with the model order (18, 19, 20, 21). The automatic evaluation therefore found the models to be equally good.

A manual evaluation was done where, for each question, a point was given to the models whose answers were acceptable. This means that a point can be given to all answers if all of them are acceptable. An acceptable answer is considered one that does not contradict itself within the same message and is relevant enough to the question that the answer seems natural. It does not take into consideration whether it contradicts the past message. On the data consisting of 239 questions, the scores were [85, 76, 64, 98]. In this case, if an answer was given by a component other than the generative model, then no score was given, since the answer is a template or from the retrieval database.

If the "question" was not a question or if it was a too specific follow-up question, then again, no score was given. The manual evaluation showed that models 18 and 21 may be the best models, with 85 and 98 points respectively. Automatic evaluation was then performed again, only considering model 18 and model 21, which resulted in 131 points for model 18 and 123 points for model 21. By instead skipping non-ideal questions (non-question user inputs and specific follow-up questions) and running automatic evaluation again, the scores were 67 and 61 for model 18 and 21, respectively.

Since the user test data contained non-question utterances and specific follow-up questions, which in turn led to poor automatic evaluation, each generative model was also run on a list of 86 questions, see appendix A.8, to generate answers. The questions had previously been extracted after a model was accidentally trained purely on question asking in the beginning of the development. The automatic evaluation was performed as previously described, by ranking the answers from the different models and giving a point to the model with the highest-ranking answer. The results were [33, 17, 21, 15] when considering all four models, and [52, 34] when considering model 18 and model 21. Manual evaluation was done once again between model 18 and model 21, where points were given based on answer relevancy. The result was 47 tied points, 22 points to model 18 and 17 points to model 21.

The first manual evaluation suggested that model 21 was the best, whereas the following automatic evaluation showed that model 18 was better than model 21. The automatic and manual evaluation of the answers to the 86 questions favored model 18. Therefore, after all the tests and after observing the answers generated by the different models, it was decided to use model 18. It should be noted that both the automatic and the manual evaluations can be further improved. The flaw of manual evaluation is that of human error and the possibility of inconsistent scoring. The flaw of automatic evaluation is that of missing out on semantic meaning and whether or not the answer makes sense [25].

4.6.7 Response Time Improvement

Response time from the generative model depends on the number of answers to generate, their length, and how many answers can be generated in parallel by the GPU (batch size). Generating 8 answers took around 5 seconds on a laptop with an NVIDIA GeForce GTX 1650 graphics card. From the user test it was observed that the generative model took longer and longer to generate over time, seemingly correlating with the number of times the model was called, taking up to 32 seconds during one run. Therefore, the TensorFlow session associated with the model was henceforth restarted after being called to generate a set number of times (10) to avoid this phenomenon. The restart was observed to take around 3 seconds, sometimes a couple of seconds longer.

After testing the GPT-2 model implemented in Hugging Face's library, it was found that the response time was faster, that it could generate answers from concurrent threads without loading duplicated models (3.9), and that the model did not take longer and longer after generating hundreds of messages. Therefore, these issues were deemed to be due to the gpt-2-simple library. The GPT-2 source code from OpenAI was instead used as a base and adapted together with some extracted parts from the "gpt_2.py" script from the gpt-2-simple library. The resulting script was then able to generate without the previous issues. Generating 8 answers then seemed to consistently take less than 2 seconds on the GTX 1650, and closer to 1 second on an NVIDIA GeForce RTX 2060 graphics card. Concurrent generation, e.g. from two (or more) clients, was henceforth possible using one model. However, concurrent generation slows down the response time for both clients, as it can be compared to telling the model to generate 16 answers. As a sanity check, a test was made to see whether using duplicated models would improve concurrent response time, but no significant difference was observed. As such, there was no need for duplicated models with the new generative script.


4.7 Comparison Between Previous and Current Agent

The updated agent has a ranking system and conversation history, which should lead to higher quality responses from the generative component. When provided with history, the model is able to have some context understanding. The following example shows an interaction where the agent is able to stay on topic even when the user asks non-specific questions, although it has problems with properly answering them. When asked "What kind?" the agent was unable to answer what kind of pies it bakes, but it was able to remain on topic ("Pie is a nice side dish."). When asked "which one is your favorite?" the agent answers with its favorite cake, which has some topic relevance (desserts). The updated architecture with answer ranking is illustrated in figure 4.4.

Context understanding in the new generative model
HUMAN: do you bake a lot of pies?
MACHINE: I bake a lot of pies, too.
HUMAN: what kind?
MACHINE: Pie is a nice side dish.
HUMAN: what kind of pies do you bake?
MACHINE: I bake a variety of pies.
HUMAN: which one is your favorite?
MACHINE: I don't have a favorite, but if I do have a favorite, I'd go with string cake
HUMAN: That's not a pie though!
MACHINE: It's a cake.

Data from the first user test was used to compare the updated component with the old generative component. The comparison was made on 28 follow-up questions, by generating new answers to the follow-up questions while providing the original data as conversation history. It was observed that the new component more often outputs relevant responses compared to the original data. However, it still has problems with providing a relevant answer to the question. When rating which answer was more relevant to the question, comparing the original answer and the new answer, it was found that the new answers were more relevant 13 times, the original was more relevant 4 times, and 11 answers were considered equally relevant or irrelevant.

Four samples from the user test data comparing original and new answers to follow-up questions
Q: Do you like Scotch?
A: I do like scotch. I also like tapioca pudding.
Q: What does tapicoa pudding taste like?
Original: It's got lots of pudding but a bit of chocolate chips.
New: Rich and creamy, I especially like the buttery, flavor.

Q: Do you like carrots?
A: I dislike carrots.
Q: Why do you dislike carrots?
Original: They're just too big and aren't golden brown.
New: My boyfriend loves carrots, but he doesn't really like me.

Q: Do you like chocolate?
A: I don't like chocolate but I like tapioca pudding.
Q: what is tapioca pudding?
Original: It's like pudding but without the chocolate chips.
New: i prefer cake with chocolate or cream.


Q: What is your favorite colour
A: I love hues of blue, pink, and orange.
Q: Why?
Original: Well, my dad and I spend a lot of time together, so it's definitely a bonding experience.
New: haha, I have no idea.

Figure 4.4: Updated architecture of the question answering agent

5 Development of a Question Asking Chatbot

In order to make the conversation flow engaging, it is essential for the chatbot to have the ability to ask users questions. By asking the users questions, it prevents one-sided conversations where the user is always the one who initiates the topic. It also shows that the chatbot listens and pays attention to the current topic, and is aware of the current discussion. Asking questions makes the whole conversation flow more realistic, because that is how a real-world conversation between two friends would take place.

5.1 Types of Question

The first type of question would be opening questions, which are general and are asked frequently in everyday life when people meet. Such questions are, for example, "How are you doing today?". Additionally, there were two types of questions to be generated:

• Immediate follow-up questions.

• Questions that are related to the stored information about the user.

For the immediate follow-up questions, such as how, why, when and so on, inspiration was taken from some question templates from [18]. If the current topic relates to, for example, the sport football, several types of general follow-up questions can be generated. An example could be "What do you like about football?" or "Do you happen to play football?", and the user's answers can be stored and utilized when generating future questions.

The concept of memory was introduced in order to generate relevant questions after the first interaction. This was achieved by storing desired extracted information about the user in a dataframe. For example, if the topic football was mentioned in the first interaction, then one or several questions regarding this topic will be generated in the future, such as "Do you happen to also watch football?" or "Did you do it recently?". The chatbot would then be able to "remember" information about the user, and would be able to carry out conversations in a way similar to what people often do in real life.

The available topics were reduced down to two, as the main focus lies on investigating whether or not the use of memory can improve the user experience. It was sufficient to supply the chatbot with knowledge within a few common topics. Sports and animals are the two major topics that were tackled.

5.2 Rasa Framework

The project started off with building a simple prototype chatbot in order to understand how the framework functions, and to get a clear picture of the internal structure.


5.2.1 A Basic Chatbot

Rasa's documentation provides a short tutorial1 on building a first simple friendly chatbot. The story has two routes: it first asks the user how s/he is doing, and if the reply is positive, then it acknowledges this; if the reply is negative, then it sends a picture of a tiger cub. The most important files are introduced here, which are:

• A config.yml file, which contains the configuration of Rasa NLU and Core models.

• A nlu.md file, which includes all NLU training data.

• A stories.md file, which includes all samples of conversations in simplified form.

• A domain.yml file, which defines the chatbot's domain/knowledge.

A few additional important files are discussed in 5.6.

5.3 Rasa NLU

SpacyNLP2 provides pre-trained language models, and various components from this library were used. Pre-trained word embeddings are helpful as they already encode some linguistic knowledge, and the Spacy Featurizer provides pre-trained word embeddings from either GloVe or fastText in many different languages3. This is useful when there is little training data. For example, if a training example "I want to buy apples" is provided, and Rasa is supposed to predict a "get pears" intent, then the model would already know that the words "apples" and "pears" are similar.

The model was constructed after defining a pipeline which specified the components it should use. There were 3 pre-trained statistical models available for English: a small model en_core_web_sm, a medium model en_core_web_md and a large model en_core_web_lg. The medium model was used because of its significantly smaller size compared to the large one, 48 MB and 746 MB respectively. The small model was the smallest at only 11 MB, but it was only trained on OntoNotes4 and does not contain word vectors. The medium and the large models were both English multi-task convolutional neural networks (CNNs) trained on OntoNotes, with GloVe vectors trained on Common Crawl5. Both have the ability to assign word vectors, POS tags, dependency parses and named entities. English was the language that the model would use, and this was specified with "en". The case sensitivity was set to False when retrieving word vectors; this decides whether the casing of a word is relevant. For example, "Hello" and "hello" would retrieve the same vector. There might be other occasions when the casing matters and must be differentiated, in which case it should be set to True.

5.3.1 Tokenization

The Spacy Tokenizer6 was used to create tokens, and it is a part of the Spacy NLP. It segments a text into words, punctuation marks and so on, and creates Doc objects with the discovered segment boundaries. A Doc is a sequence of Token objects. Spacy has introduced a novel tokenization algorithm7, which gives a better balance between performance, ease of definition and ease of alignment with the original text input.

1https://rasa.com/docs/rasa/user-guide/rasa-tutorial 2https://spacy.io 3https://rasa.com/docs/rasa/nlu/choosing-a-pipeline/ 4https://catalog.ldc.upenn.edu/LDC2013T19 5https://nlp.stanford.edu/projects/glove/ 6https://spacy.io/api/tokenizer 7https://spacy.io/usage/linguistic-features#how-tokenizer-works


5.3.2 Featurization

The Spacy Featurizer was used to create features for entity extraction, intent classification, and response classification; it is also a part of the Spacy NLP. It is a type of dense featurizer which saves memory and is thus able to train on larger datasets.

A Regex Featurizer was included by default to create a list of regular expressions defined in the training data format during training. For each regex, a feature is set that contains information on whether the expression was found in the user's input or not. All features are then fed into an intent classifier and/or entity extractor to simplify classification. Regex features for entity extraction are supported by the DIET (Dual Intent Entity Transformer) classifier8 component.

A Lexical Syntactic Featurizer was included by default to create lexical and syntactic features for a user's message to support entity extraction. It moves with a sliding window over every token in the message and creates features according to a default configuration9. It is also possible to configure what kind of lexical and syntactic features it should extract.

Finally, a Count Vectors Featurizer from Scikit-learn10 was used to convert a collection of text documents to a matrix of token counts. It creates a bag-of-words representation of the user message, intent, and response. This featurizer can be configured to use word or character n-grams. By default it is set to "word", counting whole words and using word token counts as features. The featurizer was used a second time set to "char_wb", which looks at sub-word sequences of characters.

5.3.3 Entity Recognition, Intent Classification and Response Selector

The Spacy Entity Extractor was used to predict the entities in the user's input. It uses a statistical BILOU (Begin, In, Last, Unit, Out) transition model11. A potential disadvantage is that this component uses the built-in entity extraction models from Spacy and currently cannot be retrained. Additionally, this extractor does not provide any confidence scores, and is therefore less flexible and customizable. On the other hand, there are several types of predefined entities that the models can recognize12, including person, organization, location, product and many more. Here, three types of entities were configured: "PERSON", "ORG" and "GPE". The reason that "ORG" and "GPE" were also included was that occasionally, certain personal names were identified as companies, agencies, institutions and the like.

The DIET classifier could be used for both intent classification and entity extraction. The architecture13 is based on a transformer which is shared between both tasks. This classifier was used to extract custom entities that were present in the domain, such as sport and animal, while Spacy's entity extractor was used to extract various named entities that were already supported by the models. There are several parameters that can be configured: "epochs" sets the number of times the algorithm will see the training data. As a starting point, it is set to 200 by recommendation. Depending on the situation, the model may need more epochs to learn properly, while at other times adding more epochs does not influence the performance. The lower the number of epochs, the faster the model is trained. "ranking_length" is the number of top actions to normalize scores for when the loss type is softmax, and it is set to 5 as recommended.

8https://blog.rasa.com/introducing-dual-intent-and-entity-transformer-diet-state-of-the-art-performance-on-a-lightweight-architecture 9https://rasa.com/docs/rasa/nlu/components/#lexicalsyntacticfeaturizer 10https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 11https://rasa.com/docs/rasa/nlu/components/#spacyentityextractor 12https://spacy.io/api/annotation#section-named-entities 13https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb


An Entity Synonym Mapper component was used to map synonymous entity values to the same value. With this component, the model has the ability to detect and recognize similar entity values and return just one consistent value. This is useful as users may not refer to a certain object using the exact same name. Finally, a response retrieval model was built by the Response Selector component. It was used to predict the chatbot’s response from a set of candidate responses. It follows the exact same neural network architecture and optimization as the DIET classifier.

5.4 VADER Sentiment Analysis

The purpose of using a sentiment analyser was to sort the extracted entities by their sentiment values. This leads to a ranking system where the chatbot prioritizes generating questions related to the topics that have higher sentiment values. However, since the available topics were narrowed down to sport and animal, this component was no longer required. Since the concept is expandable, it would be more useful if the chatbot handled a wider variety of topics, which can be implemented in the future.
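A minimal sketch of the intended use is shown below, ranking extracted topics by the VADER compound score of the utterances they came from; the example utterances are placeholders.

# Sketch: rank extracted topics by the sentiment of the utterance they were extracted from.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

extracted = {
    "football": "I really love playing football on weekends!",
    "homework": "I guess homework is okay, but it is pretty boring.",
}

ranked = sorted(extracted,
                key=lambda topic: analyzer.polarity_scores(extracted[topic])["compound"],
                reverse=True)
print(ranked)  # topics with more positive sentiment would be prioritized for future questions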

5.5 Semantic Network

With ConceptNet, the number of API calls is limited; however, this is not a major issue for the project since the chatbot will not be commercialized or used publicly. The relations ConceptNet provides could help the chatbot understand a given word, and various questions can be constructed based on these relations. For example, given an entity value football, the is-a relation provides a few suggestions; figure 5.1 shows the top 3 results based on weight. A relevant question such as "Do you like any other game?" can be constructed, although the word "sport" would have been the ideal choice.

Figure 5.1: Some relations provided by ConceptNet.
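A minimal sketch of how such relations could be fetched from ConceptNet's public REST API is given below; the endpoint and JSON fields are assumed from the ConceptNet 5 web API.

# Sketch: query ConceptNet for IsA relations of "football" and keep the top 3 by weight.
import requests

response = requests.get("http://api.conceptnet.io/query",
                        params={"start": "/c/en/football", "rel": "/r/IsA", "limit": 10})
edges = response.json().get("edges", [])

related = sorted(((edge["weight"], edge["end"]["label"]) for edge in edges), reverse=True)[:3]
for weight, concept in related:
    print(f"football IsA {concept} (weight {weight})")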

On the other hand, a disadvantage was that the information may not always be usable or as desired, because it is based on the information in the Open Mind Common Sense (OMCS)14 database, which is built on a large commonsense knowledge base from the contributions of many thousands of people across the Web. This results in potentially inconsistent or missing information and relations for a given search word. Therefore, this tool was not considered.

5.6 Rasa Core

Rasa Core handles the dialog management: it keeps track of a conversation and decides how to proceed. It generates a probabilistic model which decides which actions to perform based on the previous user inputs.

14https://www.media.mit.edu/projects/open-mind-common-sense/overview


5.6.1 Story

Several stories were supplied in order to cover a wide range of general opening and closing conversations, such as responding to the user's greeting, asking how the user is doing, and saying goodbye. Additionally, the chatbot was also taught to handle unexpected scenarios, for example when the user's input does not make sense, is written in another language, or is anything else that the chatbot had not learned and therefore could not reply to properly. The solution was to respond that the chatbot did not understand the user's input, or to ask the user to rephrase the sentence. These stories are usually simple and short, and do not contain more than two responses or actions. Below are a few examples of such stories.

A few examples of stories for common opening and closing conversations
## story - thank
* user.thank
  - utter_reply.to_thank

## story - sad
* user.sad
  - utter_reply.to_sad

## story - good
* user.good
  - utter_reply.to_good

## story - bye
* user.bye
  - utter_greet.bye

## story - reject
* user.reject
  - utter_reply.to.reject
  - utter_greet.bye

After the chatbot was capable of handling the common opening and closing conversations, a few more complex stories were added to take care of conversations that involve more details and topics related to sports and animals. When the chatbot is deployed on a virtual machine later on and open for user interaction, more stories can be generated from the conversation history and used to retrain the chatbot.

5.6.2 Domain

The domain was essentially the universe that the chatbot lives in. It started with defining all the intents that the chatbot was able to classify, such as user.thank, user.sad, user.good, user.bye and user.reject, which were seen previously in 5.6.1. Multiple intents could be created for various topics, such as discuss_sport for sport related discussions, and discuss_animal for animal related discussions. However, the more intents created, the more training stories would be required. Below are two example stories which train the chatbot to ask follow-up questions.

Example stories for topics related to sports and animals
## story - sport topic
* discuss_sport
  - utter_ask_why_sport

## story - animal topic


* discuss_animal
  - utter_ask_why_animal

Instead of creating a unique intent for each topic, a better approach was to generalize and group them as one intent, in this case called enter_data. It consisted of all NLU training examples for topics related to sports and animals. With this approach, fewer intents would be needed, and the design of training stories was simplified and more generalized. Below are a few examples of NLU training data about sport and animal topics, as well as a more generalized story that trains the chatbot to ask the same follow-up question.

A few examples of NLU training data for the enter_data intent

## intent:enter_data
- my favorite sport is [football](sport)
- Well, I do a lot of [yoga](sport)
- i like playing [tennis](sport)
- i would say [skateboarding](sport)
- [cats](animal) are my favorite!
- i'm a [dog](animal) person

One story to cover topics related to sports and animals

## story - discuss topic and follow up with asking why
* enter_data
  - utter_ask_why

In the NLU training data, the entity's value was marked with square brackets, such as [football] and [dog], and the entity type with parentheses. Since the focused topics were sports and animals, the two main entities were (sport) and (animal). This helps the chatbot to map these values to the desired entities. For example, if the user inputs "I like monkey.", then the chatbot is able to extract the entity animal with value monkey. Similarly, if the user inputs "I really like basketball.", then the chatbot is able to extract the entity sport with value basketball. The extracted entities were stored into slots. Slots are essentially the chatbot's memory, and are covered in more detail in 5.6.3.
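As an illustration of how an extracted entity can be copied into a slot, below is a minimal Python sketch of a Rasa custom action; the action name is assumed for illustration and is not taken from the thesis.

from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import SlotSet
from rasa_sdk.executor import CollectingDispatcher


class ActionStoreSport(Action):
    def name(self) -> Text:
        return "action_store_sport"  # assumed action name

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        # Take the latest value extracted for the "sport" entity, if any,
        # and store it in the type_of_sport slot.
        sport = next(tracker.get_latest_entity_values("sport"), None)
        if sport is not None:
            return [SlotSet("type_of_sport", sport)]
        return []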

5.6.3 Slot

The chatbot was equipped with slots, which are its short-term memories. They were considered short-term memories because the data would be lost after each session: when the chatbot restarts, its slots are emptied. To implement long-term memories, certain slots are saved to an external database. Each memory slot stores a piece of information. With the help of slots, the chatbot was able to memorize certain information throughout the conversation, such as the user's name. The number of slots depended on how many pieces of information the chatbot needed to remember. Slots such as email and name were commonly used and were essential in order to have a general knowledge base of the user. Other types of information to be stored should be usable in a follow-up and/or long-term question. For example, if the user inputs that their favorite sport is basketball, then the value "basketball" is stored in an unfeaturized slot named "type_of_sport". The chatbot can then utilize this information and ask a follow-up question such as "Did you play basketball recently?". The answer is expected to be a yes or no, which is converted to a True or False value and stored in a boolean slot named "recent_active_sport". Now the chatbot knows whether or not the user has done the activity recently. During the next session, the chatbot can then utilize this information to ask a related question, depending on the boolean value. For example, if the boolean is True, the chatbot can ask how the activity went; if the boolean is False, the chatbot can ask if the user will perform the activity in the near future. Below are a few examples of slots

that store such information that can be utilized to ask follow-up questions and/or long-term questions.

A few examples of slots that store useful information that can be utilized to ask follow-up questions and/or long-term questions

type_of_animal:
  type: unfeaturized
own_animal:
  type: bool
animal_breed:
  type: unfeaturized

More slots would be needed as the number of topics increases and the deeper the chatbot tries to go into each conversation. Therefore, limiting the topics to as few as two was beneficial, since less work would be needed for the conversation flow design. The number of slots depends on how much information is desired and to be extracted. Certain information from the slots will be stored externally in a dataframe, based on the design of the long-term questions. For example, if the user owns an animal, and the animal is a dog, then a potential long-term question can be "So what breed is your dog?" or "What color is your dog?". In order to ask such a question, certain information, such as the fact that the user owns an animal and that the animal is a dog, is essential to store. If more follow-up questions are required, additional slots such as "animal_color" and "animal_breed" can be added. Essentially, the more information is extracted, the more slots will be required. Below are the slots used for topics related to sports and animals; there were 14 slots in total.

Slots used for topics related to sports and animals

type_of_sport
reason_of_like_sport
recent_active_sport
how_go_sport
play_in_future
watch_sport
type_of_animal
own_animal
animal_breed
plan_to_own_animal
animal_color
animal_size
how_old_is_animal
reason_of_like_animal

There were 6 slots for the sport topic, of which the 2 slots "type_of_sport" and "recent_active_sport" were stored externally; and 8 slots for the animal topic, of which the 2 slots "type_of_animal" and "own_animal" were stored externally. These were stored externally in order for the chatbot to memorize some personal information about the user, so that it was able to ask personalized questions in the next conversation. For example, if the user's favorite animal is a cat, and s/he happens to own one, then based on this information the chatbot is able to ask about the cat's breed, color and size; on the other hand, if the user's favorite animal is a cat but s/he does not own one, then the chatbot can ask if s/he plans to own one in the future. Many more such follow-up and long-term questions could be designed to make the chatbot more friendly and engaging.


5.6.4 Response

Responses are the messages/utterances that the chatbot is able to use to reply to a user. These responses were defined in the domain file. Below are a few examples for the utterances utter_reply.to_thank and utter_greet.bye, which were seen previously in 5.6.1.

A few responses available for different utterances

utter_reply.to_thank:
- text: “No problem!”
- text: “Not at all!”
- text: “Any time!”
utter_greet.bye:
- text: “Ok see ya!”
- text: “Buh-bye.”
- text: “Until next time!”

Each utterance can contain one or more candidate responses (templates), and the chatbot will select one of them at random. With this approach, the specific texts do not need to be provided when designing the training stories; it is sufficient to supply the type of utterance that is suitable given an intent. When the chatbot has classified a particular intent, it knows which utterance to use and picks one of its responses to send. The chatbot's memory can be utilized to make the responses more dynamic. For example, when the user inputs "I really like basketball.", a follow-up question can be "What do you like about basketball?". However, the number of responses would become enormous as there are many types of sports, and this is where slots are useful. The chatbot is able to extract the entity sport with value basketball from the above user input, and it then stores this value in a slot named type_of_sport. The response can utilize the information from this slot, so the general follow-up question becomes "What do you like about {type_of_sport}?". This response changes according to the extracted entity value from the user input. Now the user is able to talk about all kinds of sports, and the chatbot is able to follow up with a related question using just one or a few templates.
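Equivalently, a custom action could build the same dynamic follow-up question in Python. Below is a minimal sketch; the action name is assumed for illustration and is not the thesis implementation.

from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionAskWhySport(Action):
    def name(self) -> Text:
        return "action_ask_why_sport"  # assumed action name

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        # Read the previously stored slot and build the follow-up question.
        sport = tracker.get_slot("type_of_sport")
        if sport:
            dispatcher.utter_message(text=f"What do you like about {sport}?")
        else:
            dispatcher.utter_message(text="What's one sport that you like?")
        return []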

5.6.5 Action

Actions are similar to responses, except that they have more functionality. The chatbot can use responses to reply to the user, but these are simple text replies; they can be customized with the help of slots to make the replies more dynamic, but that is about all they can do. On the other hand, the chatbot can use actions to perform more complicated tasks, such as calling an external API or storing and retrieving information from an external dataframe. To implement the chatbot's memory, an external dataframe was used. Table 5.1 shows the information that is stored externally in an Excel file.

user              name            sport     recent_active  animal  own
REDACTED_EMAIL    REDACTED_NAME   football  false          cat     true

Table 5.1: Information that is stored externally in an Excel file.

Several slots were filled with information, and six of them were useful for generating long-term questions. Since there would be two conversation sessions, where in the first session the chatbot would collect information and then use some of that information to generate personalized questions in the second session, it needed a way to identify and distinguish between new and existing users. The email slot was stored in the user column for simple identification purposes, and for checking whether the user exists in the dataframe in order to distinguish between the first and second interaction. The name slot was stored in the name column, so that the chatbot

had the ability to refer to the users by their names. The other four slots were related to the sports and animals topics. At the end of the first interaction, all slots were cleared, and the information from these six slots was written to an Excel file. At the start of the second interaction, when the user enters his/her email address, it is used to check whether or not the same email is found in this Excel file. If the email address is found, the user has interacted with the chatbot previously; otherwise, it is the user's first interaction. A different story path is used depending on whether it is the user's first or second interaction. Below is an example of the beginning of a user's second session.

Example of a beginning of a user's second session

User: hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Good day! How are you?
User: Doing well, you?
Chatbot: I'm doing pretty well.
...
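To illustrate how this external memory could be implemented, below is a minimal Python sketch (assumed file name, action name and helper; not the thesis code) of a custom action that writes the six externally stored slots to an Excel file with pandas, and a helper that checks whether an e-mail already exists in it.

from typing import Any, Dict, List, Text

import pandas as pd
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

MEMORY_FILE = "user_memory.xlsx"  # assumed file name


class ActionSaveUserMemory(Action):
    def name(self) -> Text:
        return "action_save_user_memory"  # assumed action name

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        # Collect the six externally stored slots at the end of the first session.
        row = {
            "user": tracker.get_slot("email"),
            "name": tracker.get_slot("name"),
            "sport": tracker.get_slot("type_of_sport"),
            "recent_active": tracker.get_slot("recent_active_sport"),
            "animal": tracker.get_slot("type_of_animal"),
            "own": tracker.get_slot("own_animal"),
        }
        try:
            df = pd.read_excel(MEMORY_FILE)
        except FileNotFoundError:
            df = pd.DataFrame(columns=list(row))
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
        df.to_excel(MEMORY_FILE, index=False)
        return []


def is_returning_user(email: str) -> bool:
    """Check whether the given e-mail already exists in the memory file."""
    try:
        df = pd.read_excel(MEMORY_FILE)
    except FileNotFoundError:
        return False
    return email in df["user"].values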

Custom actions such as calling an API were also possible. For example, it is possible to get data from ConceptNet in JSON format, and various pieces of information can be retrieved depending on the relations that are looked for. In general, this helps to widen the chatbot's knowledge about a specific topic, depending on how knowledgeable the chatbot needs to be. An extra functionality was tested, which enabled the chatbot to send a URL where the given search word can be found, to emphasize that the chatbot remembers the previously mentioned topic. It can reply, for example, "By the way, I saw this post the other day, check it out!" and insert the URL that is related to the topic. This is similar to a recommendation system. However, this was not implemented in the final version, as not every search word has available data, although it might make the chatbot more interesting to interact with. Other API calls can be useful depending on what functionalities the chatbot is designed to have.
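As a sketch of such an API call, the snippet below queries ConceptNet's public REST API for terms related to a search word; the helper name and the choice of relation are illustrative, and the response fields ("edges", "end", "label") follow ConceptNet's documented JSON layout rather than the thesis code.

from typing import List

import requests


def conceptnet_related_terms(word: str, relation: str = "IsA",
                             limit: int = 5) -> List[str]:
    """Return up to `limit` terms related to `word` by the given ConceptNet relation."""
    params = {"start": f"/c/en/{word}", "rel": f"/r/{relation}", "limit": limit}
    response = requests.get("http://api.conceptnet.io/query", params=params, timeout=10)
    response.raise_for_status()
    edges = response.json().get("edges", [])
    return [edge["end"]["label"] for edge in edges]


# Example: conceptnet_related_terms("dog", "IsA") might return terms such as "a pet".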

5.7 Rasa X

After a first basic model had been built that was sufficient to cover the most common and general conversation scenarios, development moved on to conversation-driven development using Rasa X. Conversation-driven development allows the chatbot to learn from users, using their new messages as training data to retrain and improve the chatbot. Rasa X is essentially a layer on top of Rasa Open Source, and it was deployed on a virtual machine on Google Cloud15. The virtual machine was a standard type with 2 CPUs, 7.5 GB memory and no GPU, which met the requirements16 to host and run Rasa X. The operating system was Ubuntu 18.0417. Once the virtual machine was set up, Rasa X was installed using Docker Compose18. This is a good choice when the user traffic is not expected to be large19, i.e. fewer than hundreds of concurrent users, which was the case for this study. Rasa X has an Integrated Version Control feature, which automates data synchronization with a Git repository, annotates new data and pushes those changes with Git. The project files were uploaded to GitHub. To connect Rasa X to the remote repository, a public SSH key was generated and added as a deploy key with write access on GitHub. After the connection

15https://cloud.google.com/compute
16https://rasa.com/docs/rasa-x/installation-and-setup/install/docker-compose/#docker-compose-requirements
17https://releases.ubuntu.com/18.04
18https://docs.docker.com/compose
19https://rasa.com/docs/rasa-x/installation-and-setup/installation-guide

had been established, Rasa X could pull from the repository and train the model. For any further adjustments, Rasa X could push the changes to the repository on the master branch directly, or to a new branch and perform a merge later.

5.8 Architecture

Figure 5.2 shows the architecture of the chatbot built using the Rasa framework. The chatbot was built using the open source Rasa framework, which consists of two main components: Rasa Core and Rasa NLU. Rasa Core is the dialogue engine, and the domain file defines the settings for the chatbot, such as what it should understand, what it can use to reply, and so on. Rasa NLU takes in NLU training data and conversation examples in order to train the chatbot; it can then take care of intent classification and entity extraction. The conversation flow design, and other functionalities such as storing and fetching information externally from a dataframe, were defined in the action file. The chatbot was deployed on a virtual machine with the help of the Rasa X tool, which helped improve the chatbot as it interacted with users. The conversation flow started with asking for a means of identification, in this case the user's email address. The chatbot checked the given email against a dataframe to see whether the user existed, in order to decide if it was the first or second interaction. The chatbot started the conversation with five chit chat questions, such as "How should I call you?". Thereafter, the chatbot proceeded to the main topics, which were related to sports and animals. In the first session, there were three questions related to sports, one opening question and two follow-up questions; and a total of four questions related to animals, one opening question and three follow-up questions. In the second session, there were three long-term questions related to sports and four long-term questions related to animals. Not every single question would be asked; it depended on certain replies from the user, namely the ones that return boolean values. Finally, the chatbot would thank the user for the participation and attach a link to the survey.

5.9 User test - Question Asking Chatbot

The purpose of the test for the question asking chatbot was to see how the user would interact with the chatbot, and to find out how the chatbot was perceived when it showed the ability to remember the users. The aim of the user test was to also reveal any potential faults in the system. More importantly, it should improve the design of the survey and the questionnaire in order to receive enough data and feedback to answer the research questions. The users were asked to interact with the chatbot for a first session and fill in a survey, then reset the chatbot and interact with it again for a second session, and finally fill in the same survey which contains the exact same questions. The session reset simulated the passing of time.

5.9.1 Survey

The interaction time for both sessions was estimated to be about five to seven minutes. After each interaction, the user was asked to fill in a survey that consists of five questions, two of which are the same as in the question answering chatbot survey, in order to have some common metric for comparison. The users were asked to rate on a 5-point Likert scale from 1-"Completely disagree" to 5-"Completely agree". The five questions are shown below:

• The grammar of response is correct? (Grammar)

• The response is appropriate to be made in a normal conversation? (Appropriateness)

• The response is related to my text input? (Relevancy)

• The chatbot was engaging/interesting to interact with? (Engagingness)


Figure 5.2: Architecture of the chatbot built using the Rasa framework, both the Rasa Core and Rasa NLU are used for dialog management and natural language understanding. The chatbot is hosted on a virtual machine on Google compute engine for user tests.

• It feels as if I’m chatting with a real person? (Realness)

5.9.2 Results

The user test had five participants in total, one of whom was working as a PhD student, one was a recent PhD graduate, one was a recent master graduate, and the other two were studying for their master's degrees. Further investigation was conducted when one or more questions had low scores. Table 5.2 shows the mean results for each question. As seen in the table, the results for the chatbot's grammar, appropriateness and engagingness were consistent across both sessions. However, there was a decrease in its relevancy and realness.


                 Session 1     Session 2     P-value
Grammar          4.4 (0.489)   4.8 (0.400)   0.242
Appropriateness  3.8 (0.400)   3.6 (1.019)   0.724
Relevancy        4.4 (0.489)   3.8 (1.166)   0.371
Engagingness     3.2 (0.400)   3.2 (0.748)   1.000
Realness         3.6 (0.489)   2.6 (1.019)   0.115

Table 5.2: Survey result of the first user test with the question asking chatbot, illustrating the mean value for each question and the p-values of t-tests.

Therefore, further investigation and discussion were carried out with the users who had entered a lower score for these two metrics in the second session. The first immediate shortcoming was the survey itself. It lacked clear instructions informing the users of what the chatbot does and its limitations: the chatbot mainly generated questions, and the focus was on the usage of memory. This was not explicitly stated, so users might have had different expectations and might have taken unexpected actions such as trying to ask the chatbot questions. The second shortcoming was the opening chit chat questions. The intention of this small talk was to make the chatbot friendly and let the users feel more relaxed before jumping into the main topics. While some of the common chit chat questions worked as intended, others might raise the users' expectations of the chatbot's capability. Therefore, users might be inclined to test the chatbot's limitations and see what it could do, or reply with an answer that is not logical or suitable compared to a normal conversation between two people. Below is a sample of interactions from a user during both sessions, where questions were asked by the user, which was unanticipated.

A sample of the interactions from both of a user's sessions

First session
Chatbot: Now, what's one sport that you like?
User: football
Chatbot: Interesting! What do you like about football?
User: I only like watching football
Chatbot: Did you do it recently (<30 days)?
User: no
Chatbot: I see. Another question, what's your one favorite pet?
User: fish
Chatbot: That's cool! Do you happen to own one?
User: no
Chatbot: Alright. Do you see yourself getting one in the future?
User: I am planning to buy one but not sure what to buy, maybe a shark, what do you think

Second session
Chatbot: I remember that you were interested in football right. You think you will try it out in the future?
User: what's your suggestion?
Chatbot: Do you happen to also watch football?
User: I have told you last time
Chatbot: We were also talking about fish last time. What do you like about it?
User: They are dumb and they taste good
Chatbot: I see. If you get to keep one, would you prefer a big one or a small one?
User: I prefer a small one
Chatbot: Great, it was nice talking to you!


Below is another sample of interactions from a user during both sessions, which is a better representation of a normal conversation between two people. In this case, the user still rated a lower score on the realness metric. The main reason was that the user referred to the dog as a "she", but the chatbot still used "it", which showed that the chatbot was not able to comprehend this fully. This could be improved by introducing an additional slot; however, since it did not contribute significantly to answering the research questions, this part was left out.

A sample of the interactions from both of another user's sessions

First session
Chatbot: How should I call you?
User: REDACTED_NAME
Chatbot: Alright! How are you today?
User: I am fine, Thanks
Chatbot: I see, and where are you from originally?
User: I am from REDACTED_NAME
Chatbot: Mh I'm from the clouds. Do you happen to like American food?
User: Not so much
Chatbot: Haha ok, and what do you do for a living?
User: I am working in an university, as a teacher
Chatbot: Now, what's one sport that you like?
User: Runing
Chatbot: Cool! Why do you like Runing?
User: I can do it with nobody, I mean, I dont like play sports with others
Chatbot: Did you do it recently (<30 days)?
User: Yes, sure
Chatbot: I see. Another question, what's your one favorite pet?
User: dog
Chatbot: That's cool! Do you happen to own one?
User: Yes, I do!
Chatbot: I see. So what breed of dog do you have?
User: It just a normal Chinese dog, I dot really know her bread, sorry

Second session
Chatbot: Welcome!
Chatbot: Haven't seen you since last time we talked. How have you been?
User: Thanks for asking, I am pretty good
Chatbot: Alright! What did you do in your free time?
User: Play some video games
Chatbot: I see, and where do you currently live?
User: I live in REDACTED_NAME
Chatbot: How is the city?
User: Its a good city, I like it. Its clean and quiet
Chatbot: And how is the weather there?
User: The weather is good, not so warm and not so cold.
Chatbot: So last time we talked about Runing, and you did that recently. How did it go?
User: Its a bit hard to do it now, because its summer and I feel too warm to run
Chatbot: Do you happen to also watch Runing?
User: No
Chatbot: I remember that you have a dog. How old is it?
User: She is 3
Chatbot: And what color is it?
User: Black she is


It was unavoidable that users would also ask questions, and the bot was not designed to handle the more complicated ones. However, this problem would be eased in the next stage, when this chatbot is combined with the question answering chatbot. Some of the chit chat questions had also raised the users' expectations of the chatbot's capability. Chitchat questions such as "How are you today?" were considered normal opening questions; on the other hand, questions such as "what do you do for a living?", while still belonging to chitchat, narrow the scope down to topics related to jobs. This could mislead users into thinking that the chatbot was able to handle conversations related to jobs. Therefore, several chitchat questions were removed. Out of context messages were treated normally. For example, when the first user answered that they might get a shark as a pet, the chatbot treated the animal as a normal house pet, which was unrealistic. This happens when users enter unrealistic or out-of-context answers, which do not occur in a normal conversation. The chatbot was not designed to handle such situations. However, this would not be a severe problem when the users answer the questions normally.

6 Integration

The final step of development consisted of combining the question answering agent with the question asking agent into one system. The question asking agent was designed to have more control of the conversation flow and therefore it was decided to use this agent as the base for the system. The question answering agent was turned into a callable API where new rules had to be constructed as to when to make an API call from the base agent.

6.1 API Development

An API was created to be callable from the question asking component, in order to integrate the two components into one system with both question asking and question answering capabilities. The API source code and files were made publicly available under an MIT license1. The question answering component was extended with an additional two components as a result of merging the systems. The components were developed on the basis of the sentiment memory, for the sake of self-disclosure. The first new component was developed to be called after a user has answered the opening question of a topic. The component takes the answer, extracts the subject of the topic, compares the subject with the sentiment memory and returns an answer based on sentiment.

Conversation flow example

Q-chatbot: “What is your favorite animal?”
User: “I love cats!”
A-chatbot: “I like cats too, but I prefer dogs.”
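A minimal Python sketch of this idea is shown below; the sentiment values, function name and reply templates are assumptions made for illustration, not the thesis implementation.

from typing import List, Optional

# Assumed agent sentiment memory: subject -> sentiment score in [0, 1].
sentiment_memory = {"cat": 0.3, "dog": 0.9, "football": 0.8}


def self_disclosure(user_answer: str, topic_subjects: List[str]) -> str:
    """Acknowledge the user's preference and share the agent's own, based on sentiment."""
    # Extract the subject of the topic from the user's answer.
    subject: Optional[str] = next(
        (s for s in topic_subjects if s in user_answer.lower()), None)
    if subject is None:
        return "That sounds nice!"
    # Compare against the sentiment memory to pick the agent's preference.
    favorite = max(topic_subjects, key=lambda s: sentiment_memory.get(s, 0.0))
    if subject == favorite:
        return f"I like {subject}s too, they are my favorite!"
    return f"I like {subject}s too, but I prefer {favorite}s."


# self_disclosure("I love cats!", ["cat", "dog"]) -> "I like cats too, but I prefer dogs."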

The second new component was developed to introduce a new topic by letting the agent mention its favorite subject in the topic and asking the user about theirs, see figure 6.1.

A-chatbot: “My favorite sport is football. Do you have a favorite sport?”

The API consists of three callable functions:

• The question answering function, which takes the user ID and user utterance as input and returns an answer.

• The self-disclosure function, which takes the user ID, user utterance and topic as input. This function is called after a user answers an initial question in some topic (once per topic). The agent then acknowledges and shares its own preference on the topic.

• The disclosure-and-reflect function, which takes the user ID, user utterance and topic as input. The user utterance is only given to be saved in the agent's conversation history; the function returns an output based solely on the topic. The function is called only once for each topic that has not yet been discussed.

1https://github.com/lcebear/memoryDialogueBot


Figure 6.1: Illustrating the disclosure-and-reflect component, given topic as input.

The inputs and outputs to the API are in JSON format. The output from the API consists of an answer field, an error field, and a component field. The answer field is the answer returned by the question answering system. The error field is to inform whether an error occurred on the side of the question answering system. The error field is ‘None’ by default and therefore, an answer should only be considered if the error field is ‘None’. The component field is returned purely to record which component of the system was used to get the response.
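As a sketch of how the question asking component could consume this API, the snippet below posts a request and interprets the JSON output fields described above; the endpoint URL and request keys are assumptions for illustration.

from typing import Optional

import requests

API_URL = "http://localhost:5000/answer"  # assumed endpoint


def ask_answering_api(user_id: str, utterance: str) -> Optional[str]:
    response = requests.post(API_URL, json={"user_id": user_id,
                                            "utterance": utterance}, timeout=30)
    response.raise_for_status()
    data = response.json()
    # The answer should only be used when the error field is 'None'.
    if data.get("error") in (None, "None"):
        return data.get("answer")
    return None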

6.2 Integrated System Rules and Conversation Flow

The first necessary change when integrating the two components was to relax the control that the question asking agent has over the conversation, by allowing the user to ask their own questions before the agent asks another question.

Example flow of the question asking agent

Q-chatbot: What is your favorite animal? (Opening question for topic)
User: I like cats. (Answer)
Q-chatbot: Do you have one as a pet? (follow-up question in topic)

Suggested flow for the integrated system

Q-chatbot: What is your favorite animal? (Opening question for topic)
User: I like cats. (Answer)
A-chatbot: I like cats too, but i prefer dogs. (Acknowledgement and self-disclosure)

In the suggested flow, the user has two options for their next reply: either ask a question, "Why do you prefer dogs?", or acknowledge the agent's answer, "Ok.". If the user asks a question, the API will be called to answer it; otherwise the agent will continue asking questions. Once the user starts asking questions, the user can be said to have control of the conversation, and therefore rules have to be determined as to when the system should retake control and start asking questions again.


One possibility is to keep calling the question answering agent for as long as the user is asking questions. Once the user replies with a message that is classified as "Acknowledgement", the question asking agent will retake control. Another possibility is to allow the user to ask one question by calling the API, getting the answer and adding a question at the end of the answer, such that the next agent response consists of an answer and a new question. This option was chosen in this case due to the scripted nature of the question asking agent. Only allowing one question by the user ensures that the conversation remains on topic. With the other option, the user may ask questions which lead the conversation into another topic, and once the agent gets control again, it asks a question about the original topic, which may seem out of place. Another addition is that, for every user message, a check is performed to see whether or not the message intent is classified as a question. If it is, then an API call is made to the question answering chatbot. This handles cases where users add a question at the end of their message. Below is an example where a user asks a question during the opening question phase.

The user is now able to ask questions at any time during the conversation

Chatbot: Good day REDACTED_NAME! How are you?
User: Good day to you too. I am fine. A bit stressed. What about you?
Chatbot: Mostly I'm doing OK, just fine.
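The rule above can be summarized in a few lines of Python; the intent name and the injected answer function are assumptions for illustration rather than the thesis code.

from typing import Callable, Optional


def build_reply(user_id: str, message: str, intent: str, next_question: str,
                answer_api: Callable[[str, str], Optional[str]]) -> str:
    """If the user asked a question, answer it via the API and append the next scripted question."""
    if intent == "ask_question":  # assumed name of the question intent
        answer = answer_api(user_id, message)
        if answer is not None:
            # The reply contains both an answer and a new question.
            return f"{answer} {next_question}"
    # Otherwise the question asking agent keeps control and simply asks.
    return next_question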

6.3 Final User Test

The final user test was made available online for a two-week period. A difference from the question answering chatbot's user test (3.10) was that this test followed a more scripted/defined structure, which reduced the variability between user interactions. Subjects were told to interact with the agent in two sessions and to fill in a survey after each session. The sessions were held on different days. The reason for having two sessions was to measure how the user perceived the question asking agent and its memory of the user. For the survey, participants were told to first fill in their age, email and English proficiency, and to answer a few questions. The questions were:

• Have you interacted with a chatbot previously?

• Have you interacted with a chatbot in this study before?

Then, if the participant had interacted with a chatbot in this study before, they could fill in which chatbot they had interacted with (mark all that apply). The participants were then provided with a link to where they could interact with the chatbot, as well as brief instructions. The chatbot was trained on a set of conversation flows (Rasa-related) and required that the user first sent a greeting message.

- Start the conversation by sending a "Hi" message.
- During the conversation, the chatbot will ask you a few questions; you will also have opportunities to ask the chatbot some questions as well.

After interacting with the chatbot, the participants were told to rate a number of statements on a scale of 1-5 from "Not at all" to "Very much": three statements measuring the overall agent, and two statements each for the question answering agent and the question asking agent.

• The chatbot is engaging/interesting to interact with. (engagingness)

• It feels as if i’m chatting/talking with a real person. (realness)

• I would like to interact with the chatbot again in the future. (retention)


• The chatbot’s answers are related/relevant to my questions. (answer relevancy)

• It feels as if the chatbot has its own personal background and personality. (persona)

• It feels as if the chatbot remembers (things about) me. (user memory)

• The chatbot’s questions are related/relevant to my inputs. (question relevancy)

Additionally, participants were told to rate on a scale from 1 to 10:

• The chatbot’s grammar from poor to excellent.

• The conversation flow from unnatural to natural.

• The quality of the chatbot’s questions from unsatisfactory to satisfactory.

• The quality of the chatbot's answers (non-question replies) from unsatisfactory to satisfactory.

The survey for the second session was the same, except that there were optional text fields for providing feedback, either if the participant gave a low rating on one measure and wanted to provide context as to why, or if the participant had any other feedback. The questions regarding age, English proficiency and previous experience with chatbots were not asked in the second survey, as they were already provided in the first survey and were not expected to change.

Results

Issues with the test, as observed during it, were that users experienced difficulties with starting a session, mostly the second session. This was due to a number of reasons: either the user did not start with a greeting message to initiate the agent's trained conversation flow, or the agent was unable to identify the user, or the agent's session was not reset between the first and second sessions. The agent was unable to identify the user if the email was not provided to the agent, if the user accidentally used different e-mails, or if a typo or accidental capitalization was present. The agent's session not being fully reset automatically after 60 minutes was a technical error due to an incorrect setting in Rasa. Remaining information that is still left in slots is carried over to the next session; therefore the chatbot may skip asking certain questions if the corresponding slots are filled. It is uncertain whether the setting was incorrect from the start of the user test, or if the setting was lost after restarting the system on Monday the 20th of July. The session not restarting led to users being unable to start the second session unless they hit a reset button, which they were not provided instructions for. The number of participants that interacted with the agent and filled in the first survey, related to the first session, was 17, out of which 5 had interacted with a chatbot from this study previously, and 5 had not interacted with any chatbot previously. The participants had a mean English proficiency value of 4.42. The age of the participants ranged from 18 to the early 40s with a mean of 27.5. Out of the 17 participants, 3 were discarded. One of the discarded participants did not interact with the agent but still filled in the survey, another participant bypassed the agent's trained conversation flow, and the last participant was an outlier, due to interacting with the agent after the session was officially over, with a total of 25 messages during the entire interaction. The average number of messages was 8.375 (between 6-10 messages), where two of the messages consist of greeting and providing the e-mail for identification. Out of the original 17 participants, 10 participants interacted with the agent a second time, 2 of whom were unable to interact with the second session of the agent and were therefore discarded. One more participant was discarded due to filling in the survey over one week after interacting with the agent, which is not considered a reliable result. A reliable result is considered to be one that is taken as soon as possible, such that the subject still has the interaction in memory, e.g. directly after the interaction. An additional two participants were discarded due to faults in the chatbot, where it had failed to extract the user's favorite sport in the first session, resulting in a template response asking

about "fail". The template response should not have been retrieved in this case. Therefore, 5 participants were considered for the second survey.

Consequence of failing to extract entity

Chatbot: Last time you told me about your favorite sport, what was it again? I forgot.
User: Slacklining
Chatbot: Last time we talked about fail, did you do it recently (<30 days)?
User: No
Chatbot: You think you will do it in the near future?
User: Hopefully not

The results are shown in tables 6.1 and 6.2, where the first group "Session 1 (All)" represents the results from the 14 remaining participants, and the remaining columns represent the first and second session of the 5 participants that interacted with both of the agent's sessions. The main results in the tables show a decrease in engagingness and conversation flow from session 1 to session 2, as well as an increase in realness, persona, question quality and user memory (due to long-term memory question asking). T-tests indicate that the only significant difference between the two sessions was the measure of user memory (p=0.008). The results are put in context in chapter 7 with tables comparing all three chatbots.

Group          Session 1 (All)   Session 1        Session 2        P-value
Engagingness   3.571, (0.852)    3.600, (0.548)   3.000, (0.707)   0.172
Realness       2.786, (1.122)    2.400, (0.548)   2.800, (0.837)   0.397
Retention      3.429, (0.756)    3.600, (0.548)   3.400, (0.894)   0.681
Persona        3.071, (1.072)    2.600, (0.894)   3.200, (1.483)   0.461
A-Relevancy    3.714, (1.490)    4.200, (0.837)   4.200, (0.837)   1.000
User Memory    3.286, (0.914)    3.000, (0.707)   4.400, (0.548)   0.008
Q-Relevancy    3.857, (1.406)    4.400, (0.894)   4.200, (0.837)   0.724

Table 6.1: Mean and standard deviation of the survey results from the final user test which used a 5-point Likert scale. Session 1 (All) represents a group of 14 participants and the remaining columns represent a group of 5 that completed both sessions.

Group                 Session 1 (All)   Session 1        Session 2        P-value
Grammar               8.071, (2.093)    7.600, (2.302)   7.800, (1.483)   0.874
Flow                  7.143, (1.512)    7.200, (0.837)   6.400, (2.881)   0.567
Question Quality      6.643, (2.274)    6.400, (1.517)   6.800, (1.304)   0.667
Non-Question Quality  6.286, (2.054)    7.000, (1.871)   7.000, (2.739)   1.000

Table 6.2: Results of the final user test on grammar, conversation flow and message quality on a 10-point Likert scale.
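As a reference for how p-values like those above can be computed, below is a minimal sketch of a two-sample t-test with SciPy; the score lists are hypothetical placeholders, not the actual survey data, and the thesis does not restate which t-test variant was used.

from scipy import stats

# Hypothetical 5-point Likert ratings for the same measure in two sessions.
session1_scores = [3, 3, 3, 4, 2]
session2_scores = [4, 5, 4, 5, 4]

t_statistic, p_value = stats.ttest_ind(session1_scores, session2_scores)
print(f"t = {t_statistic:.3f}, p = {p_value:.3f}")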

Three users got a different conversation flow where they were not given an opportunity to ask the agent questions. This happened when the chatbot was unable to extract the favorite sport of the user. The following quote and interaction are from one of the three.

"It just isn’t how a person would talk. I feel that it is very limited, all the things i can input are one single word. It feels shallow."

Shallow, single word interaction example

User: hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL


Chatbot: Hi! What's your name?
User: REDACTED_NAME
Chatbot: Cool!
Chatbot: What's one sport that you like?
User: Sailing
Chatbot: My favorite sport is soccer.
Chatbot: Here's another question, what's your one favorite pet?
User: Dig
Chatbot: I see. My favorite animal is sheep.
Chatbot: Thank you REDACTED_NAME! You have now finished the first session, please go back to the form for a small survey, and start a second session in a few days. I will send you a reminder via email.

Feedback from three different users regarding conversation flow

"Sometimes the bot will just answer with a closed answer, so you can't really keep the conversation going and it won't start a new subject out of itself. Maybe if he ends with a closed answer he can bring up a new subject himself"

"The way the chatbot asked questions about things we talked about in our previ- ous conversation felt a bit...wooden? I think the questions should be phrased in a more natural "hey we are texting with each other" kind of way."

"The chatbots answers ("Mh") were pretty unengaging, causing the flow of the conversation to come to a halt."

Extracted example illustrating "Mh"-acknowledgement

...
Chatbot: Last time we talked about Yoga, did you do it recently (<30 days)?
User: Not since we last talked, but before (<30 days) - yes.
Chatbot: Mh okay.
User: Did you play soccer?
Chatbot: No I played a little bit at the junior college I grew up in, it was very tough.
Chatbot: We also talked about Cat. Do you happen to own one?
User: I do, yes.
Chatbot: And what color?
User: She's tricolor - white, black, yellow.
Chatbot: Mh.
...

Feedback from two users regarding grammar

"there were sentences like "If I have to choose a favorite sport, I think I would have to choose soccer." ("If I have to choose an X, I choose x" or "If I had to choose X, I'd choose x")."

"Grammar was sometimes wrong, eg. "last time we talked about cat". Responses to my questions were also a bit off."

Extracted example illustrating grammar and response flaws

...
Chatbot: Good day! How are you?
User: Doing well, you?
Chatbot: I'm doing pretty well.


Chatbot: Last time we talked about Skiing, did you do it recently (<30 days)?
User: Hahaha no I didnt go skiing recently
Chatbot: Mh okay.
User: Last time I went was 2 years ago
Chatbot: No, this time I only went once.
Chatbot: We also talked about cat. Do you happen to own one?
User: We have 2 cats
Chatbot: Mh.
User: And you?
Chatbot: Nothing really.
...

Two full interactions (cherry-picked), from session 1 to session 2, are provided in the appendix A.9.

7 Results - Chatbot Comparisons

Three different chatbots were developed, and a user test was held for each chatbot. The user test on the question answering chatbot had 32 participants in total, of whom 16 were deemed non-ideal subjects. The user test on the question asking chatbot had 5 participants. The final user test on the integrated chatbot had 17 participants in the first session, out of which 3 were discarded. The second session had 10 participants, out of which 5 were discarded. The results of the user tests performed on the three chatbots are shown in table 7.1. The table compares the chatbots on one session to measure the user's perception after interacting with the agent once. The results of the user tests from the integrated chatbot and question asking chatbot during both sessions are compared and shown in table 7.3. Table 7.2 shows p-values found from performing ANOVA tests on the measures of engagingness and realness. With a significance level of α = 0.05, it is found that there is a significant difference in realness between the groups when including the Question Answering 'All' group in the test, but not when excluding it. Furthermore, there is no significant difference between the groups in table 7.3, as indicated by the p-values.

Group          Integrated Agent     Question Asking   Question Answering   Question Answering
               (Session 1 - All)    (Session 1)       (Ideal)              (All)
Participants   14                   5                 16                   32
Engagingness   3.571, (0.852)       3.200, (0.447)    3.688, (1.138)       3.281, (1.301)
Realness       2.786, (1.122)       3.600, (0.548)    2.375, (1.147)       1.969, (1.177)
Retention      3.429, (0.756)       -                 3.188, (1.515)       2.938, (1.458)
Persona        3.071, (1.072)       -                 3.250, (1.528)       2.750, (1.391)
A-Relevancy    3.714, (1.490)       -                 3.250, (1.125)       2.750, (1.136)
Q-Relevancy    3.857, (1.406)       4.400, (0.548)    -                    -
Grammar        4.143, (0.930)       4.400, (0.548)    -                    -

Table 7.1: Comparing survey results of the first session of different chatbots by looking at the mean values supplemented with the standard deviation.

               P-value   P-value Excl. Q-Ans (All)
Engagingness   0.623     0.618
Realness       0.012     0.098

Table 7.2: Resulting P-values of ANOVA tests on the chatbots' engagingness and realness scores. The third column (far-right) excludes the 'All' group of the Question Answering chatbot.
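For reference, a one-way ANOVA like the one behind table 7.2 can be computed with SciPy as sketched below; the rating lists are hypothetical placeholders rather than the survey data.

from scipy import stats

# Hypothetical 5-point ratings of one measure for three chatbot groups.
integrated = [4, 3, 4, 3, 4]
question_asking = [3, 3, 4, 3, 3]
question_answering = [2, 3, 4, 5, 4]

f_statistic, p_value = stats.f_oneway(integrated, question_asking, question_answering)
print(f"F = {f_statistic:.3f}, p = {p_value:.3f}")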

Additionally, box plots were created to display the distribution of the test results presented in table 7.1, based on a five number summary: minimum, first quartile (Q1), median, third quartile (Q3) and maximum. Figure 7.1 depicts the engagingness metrics from table 7.1. The integrated chatbot had an evenly spread result. The question asking chatbot did not have a large number of participants, resulting in an incomplete box figure.

               Integrated Agent   Integrated Agent   Question Asking   Question Asking   P-value
               (Session 1)        (Session 2)        (Session 1)       (Session 2)
Participants   5                  5                  5                 5                 -
Engagingness   3.600 (0.548)      3.000 (0.707)      3.200 (0.447)     3.200 (0.837)     0.541
Realness       2.400 (0.548)      2.800 (0.837)      3.600 (0.548)     2.600 (1.140)     0.137
Q-Relevancy    4.400 (0.894)      4.200 (0.837)      4.400 (0.548)     3.800 (1.304)     0.716
Grammar        3.933 (1.023)      4.022 (0.659)      4.400 (0.548)     4.800 (0.447)     0.232

Table 7.3: Comparing results of shared measures between the integrated chatbot and the question asking chatbot, illustrating for each question the mean value supplemented with the standard deviation. Additionally, the resulting p-values of ANOVA tests are presented.

The question answering chatbot with ideal results had a wide spread of scores, with a majority that found it to be engaging, whereas the question answering chatbot with all results had a wider spread and an overall slightly lower score on engagingness. Figure 7.2 depicts the realness metrics from table 7.1. Most participants perceived that the integrated chatbot was not human-like, except for two outliers. The question asking chatbot had a score that was above average on the 5-point Likert scale. The question answering chatbot, both with ideal results and all results, had the same spread of scores (range). The median line illustrates that the 'ideal' group for the question answering chatbot had an overall higher score compared to the 'all' group.

Figure 7.1: Box plot of the three chatbots’ engagingness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR).

To draw any conclusions from the results of the user tests on how engaging the agents are, a comparison is made with other social chatbots. The comparison is based on one session (the first), since the other agents were tested for one session/interaction. The researchers who created the Persona-Chat dataset tested different models trained on the dataset, where they measured engagingness, fluency, consistency and persona detection. Fluency was measured from 1 - "not fluent at all" to 5 - "extremely fluent". Engagingness was measured from 1 - "not engaging at all" to 5 - "extremely engaging".

Figure 7.2: Box plot of the three chatbots' realness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR).

Similarly, consistency was measured from 1 to 5, where the example "I have a dog" followed by "I have no pets" is not consistent. A comparison between their trained models and the agents in this thesis is illustrated in table 7.4. In the table, "profile memory" represents a model that takes into consideration a provided persona. Interestingly enough, the engagingness scores are lower in this case for the models that take persona into consideration. Persona detection was measured by presenting two alternatives and letting the human user say which persona they think they interacted with [56]. In this thesis, a persona is not considered in the same way, e.g. the generative model does not take into consideration a pre-defined persona when generating the next response. The persona measure is not compared because, in this case, it is based on whether the user perceives that the agent has its own personal background and personality, rated on a 5-point Likert scale. In this thesis, inconsistency is measured instead of consistency. The results are (naively) reversed to represent consistency instead, e.g. an inconsistency measure of 2.75 is reversed to a consistency measure of 3.25. The results from the surveys in this thesis measuring engagingness, "The chatbot is engaging/interesting to interact with.", are used for the comparison on engagingness. The measures of grammar are used for the comparison in the fluency column in this case. The final test (Integrated Agent) measured grammar on a 1-10 scale and the result has therefore been scaled down to 1-5 for comparison. It is important to consider the difference in sample size when reading the table, as the Persona-Chat models used 100 dialogues for evaluation, compared to the varying sample size of 5-32 for the chatbots in this thesis. After the release of the Persona-Chat dataset, a competition called ConvAI2 was held, where models were trained on the task of acting out a persona. In table 7.5, a comparison of engagingness is made between the agents in this thesis and agents designed for the Persona-Chat task, as presented in the ConvAI2 challenge [12]. The agents in the challenge were evaluated through human evaluation where the dialogues were of length 4-6 turns each. The human evaluators were asked to answer the question "How much did you enjoy talking to this user?" on a scale of 1-4. In this thesis, scales were mainly from 1-5, and as such, normalized scaling was performed for the sake of comparison.
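The thesis does not restate the exact rescaling; one linear mapping that is consistent with the values in table 7.5 is

x_(1-5) = 1 + (x_(1-4) - 1) * (5 - 1) / (4 - 1),

so, for example, a score of 3.11 on the 1-4 scale corresponds to approximately 3.81 on the 1-5 scale.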

Agent                                 Fluency        Engagingness   Consistency
Seq2Seq                               3.17 (1.10)    3.18 (1.41)    2.98 (1.45)
Seq2Seq - Profile Memory              3.08 (1.40)    3.13 (1.39)    3.14 (1.26)
KV Memory                             3.81 (1.14)    3.88 (0.98)    3.36 (1.37)
KV Profile Memory                     3.97 (0.94)    3.50 (1.17)    3.44 (1.30)
Question Answering Agent (All)        -              3.28 (1.30)    2.81 (1.28)
Question Answering Agent (Ideal)      -              3.69 (1.14)    3.25 (1.18)
Question Asking Agent (Session 1)     4.40 (0.56)    3.20 (0.45)    -
Integrated Agent (Session 1 - All)    4.14 (0.93)    3.57 (0.85)    -
Human                                 4.31 (1.07)    4.25 (1.06)    4.36 (0.92)

Table 7.4: Comparing the agents in this thesis with models from the Persona-Chat paper [56] on fluency, engagingness and consistency.

Agent                                 Engagingness (1-5)   Engagingness (1-4)
Lost in Conversation                  3.81                 3.11
Hugging Face                          3.24                 2.68
Little Baby                           2.92                 2.44
Question Answering Agent (All)        3.28                 -
Question Answering Agent (Ideal)      3.69                 -
Question Asking Agent (Session 1)     3.20                 -
Integrated Agent (Session 1 - All)    3.57                 -
Human                                 4.31                 3.48

Table 7.5: Comparing engagingness of the agents in this thesis with the top 3 agents presented in ConvAI2, as well as human performance for reference [12].

It should be noted that the agents in this thesis differ from the ones developed for the Persona-Chat task, as those agents are able to act out a persona given a few lines of descriptive text. Again, it is important to note the sample size difference, as the Persona-Chat agents were evaluated on 100 dialogues each. The tables show that the engagingness and consistency of the developed agents are similar to those of chatbots in the Persona-Chat task. However, compared to the human baseline (of humans who acted out a persona), the results of the chatbots are still not near human-level engagingness.

8 Discussion

This chapter consists of discussions revolving around what can be observed from the results of the user tests, the results compared with previous works, potential improvements, and ideas for future work.

8.1 Results

This section discusses the results from the user tests of the two individual chatbots in this thesis, and how the results were affected as the chatbots were merged.

8.1.1 User Test - Question Answering Chatbot

The results of the first user test are illustrated in table 3.3. The test consisted of some ideal interactions and some non-ideal interactions, which led to splitting the user group into two groups, "ideal" and "non-ideal", with 16 subjects each. Unsurprisingly, when comparing the two groups, the "ideal" group scored more positively on every measure whereas the "non-ideal" group scored less positively. T-tests showed that there were significant differences between the means of the two groups on most measures. It was also found that the ideal group had 31 out of the 41 non-generated responses, which supports the claim that the non-ideal group consists of lower quality samples for the test and should carry less weight in the result as a whole. This is because non-generated responses (from the template and retrieval components) are the core of what the question answering agent was designed to handle (personal question answering).

Engagingness & Retention

When answering whether the agent was interesting to interact with, the result was rather neutral for the "All" group. A hypothesis was that introducing artificial response delay would lead to higher engagingness compared to instantaneous responses. The hypothesis could not be accepted based on t-tests, which showed that there was no significant difference between the groups in terms of engagingness. Additionally, the NO-DELAY group measured higher engagingness, which could suggest the opposite; instantaneous responses are more engaging for users when interacting with a social chatbot.

Realness

The users did not feel as if they chatted with a real person. For this question, and the user survey in general, an optional text field would have been beneficial to explore the potential for improvements. One reason for the low perceived realness may have been the agent's inability to answer follow-up questions, which led to irrelevant responses that may have ruined the immersion. The score may also have been affected by the fact that the agent was only capable of answering questions, which does not simulate a real conversation consisting of both question asking and question answering. The second hypothesis was that the group with artificial response delay (DELAY group) would perceive the agent as more real or human-like than the group with instantaneous responses. While the results show a

significant difference (p=0.037) between the ideal DELAY and ideal NO-DELAY groups, the opposite effect was observed; the ideal NO-DELAY group perceived the agent to be more human-like than the ideal DELAY group.

Responsiveness / Unresponsiveness

Overall, the users did not feel as if the response time (average 5.983 seconds) negatively affected the chatting experience. This knowledge can be utilized in the future design of chatbots, as it may allow for features, models or components with higher time complexity that may otherwise have been disregarded in order to achieve near instantaneous responses. The measure of unresponsiveness showed the most significant difference (p=0.009) between the ideal and non-ideal groups. The ideal group scored very positively on this measure, meaning that they were fine with the response time. The non-ideal group scored closer to neutral. Such an outcome is expected, as the non-ideal group had different expectations and interacted with the agent differently, and occasionally outside its intended application. As the agent responded with something irrelevant to their input, the response time may have been found irritating, as the user had to wait all that time for an unintelligent reply in relation to their input. The NO-DELAY group scored better on responsiveness while the DELAY group scored better on unresponsiveness. The reason why the DELAY group scored better on unresponsiveness is probably the consistency of the response times. The artificial delay masked the calculation time of the generative model for this group, whereas the group with no delay had the possibility of experiencing instantaneous answers from the template and retrieval components. When this group then received an answer from the generative model, the answer was observably much slower than the non-generated answers. This in turn may have led the user to feel as if the response time of the system was less acceptable.

Repetitiveness

While the NO-DELAY group scored higher on every measure except for unresponsiveness and repetitiveness, it is uncertain as to why the DELAY group perceived the agent to be less repetitive. Perhaps the delay made it seem as if the agent took the time to process the user's input and its next reply. When considering the results of the t-tests, the more likely answer is that the group scored higher by chance (p=0.518).

Inconsistency, Relevancy & Persona

There was a significant difference when comparing the ideal DELAY and ideal NO-DELAY groups on inconsistency (p=0.038) and relevancy (p=0.005) (furthermore, p=0.054 on persona). After analyzing the conversations of the ideal group, the results show that the ideal NO-DELAY sub-group had fundamentally better interactions compared to the ideal DELAY sub-group. This was measured by comparing the number of non-generated responses, the percentage of responses that made sense given the input, and the number of follow-up questions asked. The ideal NO-DELAY group asked fewer follow-up questions, got more non-generated responses, and had a slightly higher percentage of responses that made sense. All of these factors may have contributed to why the group scored higher on most measures compared to the ideal DELAY group. The results show that instantaneous responses result in higher perceived realness compared to artificially delayed responses. However, it should also be noted that the analysis found that the ideal NO-DELAY group by chance had an overall better experience interacting with the chatbot than the ideal DELAY group, which in turn may have impacted the results of several measures. One finding was that the ideal DELAY group was less negatively affected by longer response times, for this specific case where one component (the generative component) takes substantially longer than the other components. Further tests are necessary to come to any general conclusions, due to the small sample size and due to each user having different interactions.

It is important to consider the variability of the user test when looking at the statistical aspect of how artificial delay affected engagement and perceived realness in this study. Furthermore, the average response time of the messages in the NO-DELAY group was not significantly different from that of the DELAY group (5.103 seconds compared to 6.646 seconds) due to the response time of the generative model. Therefore, it may be important in future tests that the NO-DELAY group experiences consistent response times (each response near instantaneous) to determine the effects of artificial/dynamic response delay. Additionally, the artificial delay may have been too large to observe the positive effects found in the study by [14]. A lesson learned from the study was to limit the scope of the users' interaction with the agent to prevent large variance in the outcome, for example by telling the user to ask questions within a limited number of supported topics like hobbies, sport and food.

8.1.2 User Test - Question Asking Chatbot

The result of the user test with the question asking chatbot is illustrated in table 5.2. Of a total of five interactions, three were considered normal conversations, corresponding to how a real life conversation would be carried out, and two were considered unrealistic, as they were also testing the limitations of the chatbot.

Through a preliminary assessment of the results, it was observed that some users did not acquire a clear understanding of what they were supposed to do, or of what the chatbot was able to handle in general. Thus, improving the clarity of this information will be considered in future work. The user interface only stated that the users were invited to "interact with a question generation chatbot"; it should also have mentioned rules such as "refrain from asking questions to the chatbot". When the users started to ask questions, which was not supposed to happen, and did not receive an appropriate answer, the user experience was lowered, leading to a lower score. It became clear that the final survey for the integrated chatbot should include clear instructions such that users have a rough idea of what to expect and what they should do, in order to minimize the number of unserious interactions. This also helps to gather information that is usable for answering the research questions.

Because different users had different experiences with the chatbot, depending on whether the conversations were realistic or not, different scores were given. The users that had a normal conversation felt that the chatbot was fun and interesting to interact with, although some features were missing, such as capitalization of words or identifying a typo. The users that had an unrealistic conversation felt that the chatbot did not really understand them. However, it was still fun to interact with because of how the chatbot replied normally when it was given an out-of-context answer, which happened mostly in the second interaction, likely because the users wanted to try new inputs and were curious to see what the chatbot could handle and what kind of results they would get. A decrease in the realness metric was expected, as the chatbot could not answer such messages. It is debatable whether or not these users should be counted as outliers, because such conversations do not take place during everyday talk between two people. On the other hand, it is common for users to write random and out-of-context messages when interacting online.

8.1.3 Final User Test

By comparing Session 1 and Session 2 in table 6.1 and table 6.2, we find that realness, persona, user memory and question quality increase in the second session. The measure of perceived user memory increased with statistical significance as the chatbot asked long-term memory generated questions, which in turn is believed to have increased the measures of realness, persona and question quality. Although the realness, persona and question quality measures increased with the use of long-term memory, engagingness and the conversation flow decreased. With a less natural conversation flow, it is understandable that engagingness may decrease as a consequence. User feedback also mentioned faults in the conversation flow.


Additionally, the first session made use of the question answering API to get self-disclosure, while the second session only called the API if the user asked questions, which may have affected the engagingness and conversation flow. The measure of question quality increased in the second session, which may be correlated with users finding a greater interest in the long-term memory generated questions.

A couple of unforeseen scenarios happened during the test, such as the system not recognizing the user for a second session. As such, there should have been better instructions emphasizing the importance of using the same e-mail throughout the study. Additionally, some issues occurred due to time constraints, where the development would have required more time for completion. This can be observed in faults in the conversation flow, where some non-question user utterances were incorrectly provided to the question answering API to respond to. Additional faults were related to the question answering component being unable to extract entities, which resulted in a different conversation flow for users in the first session, and in the agent asking about "fail" in the second session.

One aim with the test was to compare the engagingness and realness of the question asking chatbot before and after the final test, to find how the user's perception of the chatbot was affected by the chatbot's personal memory and self-disclosure ability. This question remains partially unanswered because the method rather measured the effect of adding more features to the chatbot, such as question answering. As such, the survey should have included additional, more specific questions about the user's perception of the chatbot's self-disclosure. Examples of more specific questions would be:

• "The chatbot’s personal background contributes to engagingness."

• "The chatbot’s self-disclosure makes the chatbot seem real."

The questions or statements would be rated from "not at all" to "very much". More specific questions would in turn reduce the amount of speculation around the meaning of the results.

8.1.4 Comparisons of the Three Chatbots

Table 7.1 compares the results of the three user tests. Although engagingness is slightly higher in the "Question Answering (Ideal)" group compared to the integrated agent, retention is higher for the integrated agent. By looking at the box plot of the chatbots' engagingness in figure 7.1, it can be observed that the integrated chatbot has lower variance on the measure compared to the question answering chatbot. Based on the greater median and the smaller lower quartile, it can be said that the question answering chatbot has a greater mix of engaging and non-engaging interactions than the integrated agent, which instead has a mix of engaging and neutral interactions (engagingness>=3). A logical assumption could then be made that the increase in retention is correlated with the lower variance of engagingness and the increase of the lower quartile (higher lows). The lower variance may in turn be correlated with the reduced scope of the users' interactions between the two chatbots. In other words, users are more likely to have an acceptable interaction (neutral or better) with the integrated chatbot than with the question answering chatbot, and in turn be more likely to interact again in the future.

The integrated agent suffered a slight loss of persona compared to the question answering agent, which may be due to the question asking component having the main control of the conversation flow, leading to less self-disclosure. Answer relevancy increased in the integrated agent, which may be attributed to the improvements of the generative model.

By comparing the question asking agent to the integrated agent, engagingness increased in the integrated agent, whereas realness, question relevancy and grammar decreased. The decrease may be related to how the question asking agent had full control of the conversation by always asking scripted questions, compared to the integrated agent, which acknowledged the answer, shared its own answer (self-disclosure), and allowed the user to ask questions.

Being able to interact with an agent beyond simply answering its questions exposes the limitations in the chatbot's capabilities and intelligence. It can also be observed that a high realness score does not necessarily imply a more engaging chatbot. However, it is also important to note the low sample size of the user tests, especially for the question asking agent, which makes it difficult to draw generalized conclusions.

8.2 Method

The methods of the three agents are discussed individually in the following sub-sections. More research into the methodology could improve the design and development of the question asking and question answering chatbots.

8.2.1 Question Answering Agent

Retrieval and Generative components

The retrieval structure was created in a very manual fashion, with an ID that connects a question and a potential answer. The ID allowed for remembering which question types had been answered previously, such that the agent could retrieve the past answer and avoid giving two different answers to the same question. This method has several downsides, however:

1. Extending the database requires manual maintenance, since answer IDs must be assigned and kept consistent.

2. New questions that are different but with high cosine similarity to a past question are answered by the saved past answer.

3. Contradictions can still occur in answers from the generative component.

A suggested alternative method is to create a question-answer pair retrieval database. Given a question, retrieve N candidate responses, for example the top 10 questions that are most similar to the user's question. This method is used by XiaoIce [59] to some extent, where the database is filtered to only retain responses that are aligned with the pre-defined persona of their chatbot. Using a question-answer pair retrieval database would allow for easier maintenance, as any question-answer pair can be added as long as the answer makes sense in relation to the question. Generated responses can then extend the database as long as there is an evaluation (manual or otherwise) to filter out bad pairs. Then, by saving all questions and answers in conversations, new questions can be compared to previous questions, e.g. by cosine similarity, to determine whether the question has been answered in the past. If it has been answered in the past, then the agent can re-use the past answer to avoid contradicting itself.

However, consider the question "What do you do for a living?" followed by "What is your job?". The similarity score may not find that the questions are similar enough to retrieve the past answer, and thus the agent may contradict itself by first saying it is a receptionist and then a baker. A suggestion is to have memory slots, similar to what Mitsuku has for user modeling (2.5), but for the agent, where the agent's persona is built up as the user interacts with it. A new ranking then needs to consider whether a response contradicts the persona or not. The ranking may use BERT or another Transformer model trained on Natural Language Inference to detect contradiction between new candidate responses and past answers/persona. Examples of memory slots would be name, age, work, hobbies, nationality, location, etc. Additional slots would be family members and friends, where the family members and friends have their own slots for name, age, etc.

In general, more work could have been spent on the personality aspect, for example by creating template personas. With pre-defined personas it is possible to restrict the amount of contradictions by having persona specific retrieval databases. A measure would still be necessary to ensure that generated answers do not contradict the persona, however.
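As a rough illustration of the suggested look-up, the sketch below assumes a sentence-embedding function embed (for instance the Universal Sentence Encoder discussed later in this chapter), a list of stored question-answer pairs, and a similarity threshold of 0.8; these names and values are assumptions made for the sketch and are not part of the implemented system.

import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_past_answer(question, qa_pairs, embed, threshold=0.8):
    # qa_pairs: list of (past_question, past_answer, past_embedding) tuples
    # embed:    any sentence-embedding function returning a vector
    q_vec = embed(question)
    best_answer, best_score = None, 0.0
    for past_q, past_a, past_vec in qa_pairs:
        score = cosine(q_vec, past_vec)
        if score > best_score:
            best_answer, best_score = past_a, score
    # re-use the old answer only if the new question is similar enough
    return best_answer if best_score >= threshold else None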


With the suggested question-answer pair database, it is possible to keep expanding the database from user interactions. The database can then be used to train the generative model, and the trained model can in turn generate questions and answers that can be added back to the retrieval database. This cycle of expanding the database and training the model can continue as long as the bad question-answer pairs are filtered out. The downside is that manually filtering out bad pairs by going through thousands of question and answer pairs is time consuming.

With the final version of the generative component, the component is fast enough to be used at all times. As such, whenever a question is not considered a template question, the agent could always retrieve a number of candidate responses from the retrieval database and generate some responses, thereafter apply ranking to the retrieved and generated responses, use a threshold to filter out bad responses, and randomly select one of the remaining responses as the answer to output to the user. This method of response selection was used in XiaoIce, where it allows for variability. In comparison, the existing ID based retrieval structure in this thesis also returns a random response out of the answers with the ID matching the question.

As of now, the generative model may generate responses that contradict the sentiment memory. The responses may even contradict themselves, e.g. "I don't like to read but I do enjoy reading". This may be avoided by implementing Semantic Role Labeling to split the sentence into clauses and then using VADER sentiment analysis on each individual clause. Semantic Role Labeling finds the semantic meaning of words or phrases in a sentence. By using sentiment analysis, we can find whether a response contradicts the sentiment memory and, if so, either discard the answer or replace the keyword to fit with the sentiment memory. For example, if the generated answer says that the agent likes basketball whereas the sentiment memory has a negative sentiment for basketball but a positive sentiment for football, then basketball can be replaced by football before the answer is output to the user. The existing templates could then be completely replaced by the generative model, which increases variability and in turn should positively affect engagingness. Some sentences may include mixed sentiment about different subjects, such as "I don't like basketball, but I enjoy football". It may then be possible to split the sentence into clauses or n-grams and run sentiment analysis on the individual clauses to find the corresponding sentiment for each subject, which in turn can detect contradiction if there are mixed sentiments about the same subject in a sentence.
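A minimal sketch of this check, assuming a sentiment_memory dictionary of subject-sentiment pairs and using the VADER analyzer from the vaderSentiment package, is shown below; the clause splitting is simplified to a comma/conjunction split rather than full Semantic Role Labeling.

import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
sentiment_memory = {"basketball": -0.5, "football": 0.6}   # assumed stored sentiments

def contradicts_memory(answer, memory):
    # naive clause split on commas and "but"; real SRL would be more robust
    clauses = re.split(r",| but ", answer.lower())
    for clause in clauses:
        polarity = analyzer.polarity_scores(clause)["compound"]
        for subject, stored in memory.items():
            # opposite signs for the same subject indicate a contradiction
            if subject in clause and polarity * stored < 0:
                return True
    return False

print(contradicts_memory("I really like basketball", sentiment_memory))  # True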
A possible mistake made during fine-tuning was that models #8, #9 and #10 were only fine-tuned on a question answering dataset, while model #7 was first fine-tuned on a movie dialogue dataset and then on question answering. The mistake was stopping the practice of first fine-tuning on the movie dialogues, as they may have contributed to more engaging generated text. A suggestion is therefore to first train on movie dialogue data or Persona-Chat data to attain interesting conversations, and then fine-tune for question answering.

For the ideal length penalty in the ranking system, it may be beneficial to make it more dynamic, for example by trying to match the sentence length of the partner (user). This suggestion is based on findings in [36], where human users roughly matched the sentence length of their conversational partner. Additionally, some questions entail different answer lengths (e.g. yes/no questions), although this may be more difficult to consider and implement.

ConceptNet

ConceptNet's API was initially used to add unseen words to the sentiment memory by calling the API to find the IsA relationship (e.g. an apple is a fruit/food) and adding the unseen word under the appropriate category. This feature was later disabled due to the increase in response time of the agent and replaced by a word embedding and similarity based method, which is less accurate. A solution would be to call the API in another thread such that the results can be used by the agent, not in the current response, but in future responses. As the sentiment memory only stores a noun if the IsA relationship corresponds with existing, pre-defined topics, an additional memory should be created to store every new noun. The agent then does not need to call the API repeatedly for the same noun, e.g. consider the noun "name" when given the common question "What is your name?".
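For reference, the IsA lookup against the public ConceptNet API might look roughly as follows; the /query endpoint and edge structure follow the ConceptNet documentation, while the topic list and matching logic are assumptions made for this sketch.

import requests

TOPICS = {"food", "sport", "animal", "music"}   # assumed pre-defined topic categories

def isa_labels(noun):
    # query ConceptNet for IsA edges of an English noun, e.g. apple -> fruit, food
    resp = requests.get(
        "http://api.conceptnet.io/query",
        params={"start": "/c/en/" + noun.lower(), "rel": "/r/IsA", "limit": 20},
    ).json()
    return {edge["end"]["label"].lower() for edge in resp.get("edges", [])}

def topic_for(noun):
    # store the noun in the sentiment memory only if one of its IsA labels is a known topic
    matches = isa_labels(noun) & TOPICS
    return next(iter(matches), None)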


User test

The user test got mixed results, as the users had different expectations of the agent and its capabilities. Even though the initial message by the agent stated "Feel free to ask anything about me, whether you want to know what food I like or what my hobbies are.", some users still asked task-oriented questions. A consideration for future user tests is to give more specific instructions and to emphasize what the agent can and cannot do, for example by emphasizing that the agent is a question answering component that is only designed to answer personal questions, and by telling the users to pretend that it is a person they are trying to get to know. Additionally, it might have been helpful to give examples of questions that the user can ask, e.g. "Suggestion: Ask a question about sport.". A concern when designing the user test was to not influence the user too much, to allow for a better understanding of how a user may interact with the agent.

After the first user test, some time was spent on creating context memory for the generative model. Another solution, or a complement to it, would be to transform user utterances into contextual utterances, as is done in XiaoIce [59]:

User: What sports do you like?
Agent: I like soccer.
User: Why?  ->  Why do you like soccer?

Automatic evaluation

We have learned that automatic evaluation of response generation does not correlate well with human evaluation [25]. However, similar to [25], we believe that embeddings, such as those from the Universal Sentence Encoder, are a key component for automatic evaluation. The metric should aim to find topic relevance rather than word-overlap with some ground truth, as many different kinds of answers are still relevant to a question even if they are not similar to the ground truth.

A late realization for automatic evaluation, when used in a one-versus-one setting such as comparing two methods or models, was the use of a ∆similarity threshold (the difference in ranking scores). By using a ∆similarity threshold it is possible to let the automatic evaluation decide that two answers are equally relevant (a tie), as was done in manual evaluation. This might lead to slightly better correlation with the manual evaluation, as answers with a small difference in their similarity/ranking scores would then be considered tied.
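The tie rule itself could be as simple as the sketch below, where the 0.05 threshold is an assumed value rather than one used in the thesis.

def judge(score_a, score_b, delta=0.05):
    # compare two answers by their similarity/ranking scores, allowing ties
    if abs(score_a - score_b) < delta:
        return "tie"
    return "answer_a" if score_a > score_b else "answer_b"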

8.2.2 Question Asking Agent

The question asking agent switched topics very abruptly, due to its scripted approach. For future development it would be good to consider topic switching based on the flow of the conversation, for example when the user's or agent's inputs are getting bland, such as "Ok" or "I see". Additionally, the agent would benefit in terms of engagingness from integrating the 1-turn follow-up question generation by [27].

A question generation component could be made using the same GPT-2 model as seen in the question answering agent. The model is trained on question and answer pairs and is therefore able to both generate questions and generate answers. This requires a new ranking system that considers the conversation history to generate topic relevant questions. The ranker would also need to consider whether a question has already been asked and answered by the user, to avoid repeated questions. Other than creating a ranker, there is complexity in how the history should be considered. The agent takes question and answer pairs as history input when generating, but cannot see from the history whether a question was asked by the user or by the agent.

Another approach to tackle this issue is to allow an idle time after the bot acknowledges the user's previous message. When the timer reaches zero and no additional input from the user has been given, the chatbot continues by asking a question or initiating another topic. Such a feature is available in Rasa; however, due to the nature of the framework requiring an intent from the user in order for the chatbot to reply, some sort of activation intent is needed in order to trigger a certain action after the countdown.

Below is an example of the timer flow, where a scheduled intent is triggered in order to prompt the chatbot's reply (Rasa has a feature that allows scheduling an intent to be triggered after a set number of seconds):

...
Chatbot: What's one sport that you like?
User: Nothing
(Timer elapses after 5 seconds)
User: \scheduled_event_trigger
Chatbot: Here's another question, what's your one favorite pet?
...

However, this approach was discarded because the automated intent trigger must be printed out as if it were a user input. It was deemed that this trigger message would break the immersion and thus decrease the user experience. It may be possible to hide the trigger message if the chatbot were deployed on a website or an app, where the developer has more control over the back-end and can decide what is shown in the user interface.
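For completeness, the scheduling itself could be expressed in a Rasa custom action roughly as follows. This is a sketch assuming the Rasa SDK's ReminderScheduled event and a hypothetical EXTERNAL_ask_next_question intent defined in the domain; the details vary between Rasa versions.

import datetime
from rasa_sdk import Action
from rasa_sdk.events import ReminderScheduled

class ActionScheduleNextQuestion(Action):
    def name(self):
        return "action_schedule_next_question"

    def run(self, dispatcher, tracker, domain):
        # trigger the assumed EXTERNAL_ask_next_question intent five seconds from now
        reminder = ReminderScheduled(
            "EXTERNAL_ask_next_question",
            trigger_date_time=datetime.datetime.now() + datetime.timedelta(seconds=5),
            name="idle_timer",
            kill_on_user_message=True,   # cancel the timer if the user writes something
        )
        return [reminder]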

8.2.3 Integration

The final agent is only able to consider one intent per message and may therefore not work ideally for a user input consisting of both a statement and a question. Future work should aim toward using tools such as Semantic Role Labeling to create clauses, identify multiple intents, and potentially handle those intents as individual inputs.
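As a simplified illustration, the sketch below splits a message into sentences with spaCy (rather than full Semantic Role Labeling) so that each part can be routed to the appropriate component; the routing itself is left as a placeholder.

import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline, assumed to be installed

def split_into_units(message):
    # split a user message into sentences so each can be classified separately
    return [sent.text.strip() for sent in nlp(message).sents]

# "I went hiking yesterday. What do you do on weekends?" would yield a statement
# for the question asking component and a question for the question answering API
for unit in split_into_units("I went hiking yesterday. What do you do on weekends?"):
    print(unit)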

8.3 The Work in a Wider Context

For the work to be further developed and used commercially, it is necessary to add additional filters, especially for the generative model in the question asking agent, to avoid inappropriate responses. As the GPT-2 model was trained on 40GB of internet text, it may be capable of generating vulgar slurs or insulting the user. This is an important issue shared between several pre-trained generative models [57] [1] [36]. Additionally, if the user asks what the agent thinks about a subject, the current template component may say it either likes or dislikes it. This may lead to inappropriate responses if the subject noun is e.g. racism, slavery, torture, etc.

If additional measures are added, such as appropriate filters, then the work can be utilized in future chatbots. The question answering agent can be used as an API to extend existing chatbots, providing user specific conversation memory and a user specific agent persona. It is hoped that the work done may be helpful for other developers or researchers toward the development of long-term engaging social chatbots. In particular, it is hoped that the work done around the generative component, with the suggested ranking system for over-generated answers and the context understanding based on conversation history, may prove useful for cost-efficient corpus-based approaches when building chatbots.

One major security improvement that can be made concerns user identification. Currently the chatbot identifies the users by the email address that they have provided. It is a method for the chatbot to recognize the user and decide whether or not it is the user's first-time interaction. This simple solution works in the current scope, since the testers are mostly either students or colleagues from the university. However, there is no verification of the email address, such as sending a confirmation link; therefore, it is possible to enter a false one. Additionally, there is no protection such as a password, so if a user knows someone's email address, it is possible to fake an identity. At the current stage of the work this is not a huge concern, since the number of topics is narrowed down to two and the information given by the users is not highly personal.


8.4 Future Work

It would be interesting for future work to consider a component, either by training a generative model or by using rules, to generate daily "stories", as inspired by [15]. For example, a GPT-2 model could be trained to generate sentences of the sort "Yesterday I went to the beach and played volleyball." or "Tomorrow I want to go to the park and feed ducks.". The generated stories should consider the persona of the agent and should allow for the user to ask follow-up questions. The main time-consuming task for this component is the gathering or creation of the dataset, which is why it was not considered in this thesis due to time constraints.

Replika had an interesting feature where it allowed users to react to each message by the agent with a thumbs up or thumbs down symbol. They then trained a BERT classifier with the help of the feedback from the users. The classifier was then used to rank candidate responses based on whether they were predicted to get a thumbs up or a thumbs down. This would be a good feature to implement in ranking, but it requires an existing active user base that makes use of the feature over a longer period of time to collect the data.

It would be interesting to investigate the concept of empathy vectors as seen in XiaoIce and Replika, together with training a model using the Persona-Chat dataset to achieve persona specific generated responses.

In order for the bot to be used commercially, some security measures should be taken, such as user verification, passwords, or two-step verification using a secondary email address or phone number. The email given by the user during registration should also be verified.

Recent work

New models have been researched and released since the start of this thesis. On June 11th, 2020, OpenAI revealed an API around a new model: GPT-3 (https://openai.com/blog/openai-api/). The API is, at the time of writing, in a closed beta and is intended for commercial use. The largest model in the GPT-3 series has 175 billion parameters, and the training data is over 570GB of text [5]. Among the API beta testers is Replika, which has observed an increase in their users' happiness ratings by 20 percentage points (https://beta.openai.com/?app=creative-gen). The API may be worth considering for future commercial chatbots or other text generative applications. Instead of fine-tuning, the model can be taught a task given a natural language description, a few examples, and a prompt. Transformers suffer from repetition, which was observed to still be the case for GPT-3. Additionally, generated text still tends to lose coherence over long passages and to contradict itself [5].

It is apparent that the Transformer models are growing larger and larger, to the point where some researchers would rather throw more data and computational power at the problem to achieve better results than come up with new innovative architectures. This in turn motivates the need for optimization. One method, as discussed earlier, is to create distilled versions (2.2.4.2) that are smaller, faster, and still retain much of the original performance [37] [42]. In this regard, the Reformer [21] was introduced as an upgraded transformer with lower computational complexity.

ERNIE-GEN [52] is a new model that achieved state of the art results on various Natural Language Generation tasks, such as Persona-Chat. This motivates further investigation regarding its capabilities as part of chatbot systems.

Additional research has been carried out revolving around the Persona-Chat task, such as the development of a receiver-transmitter architecture [26].
Different from the competitors in the ConvAI2 challenge, this architecture takes into consideration the persona of the user in addition to its own persona, which allows it to ask personal questions. The question generation algorithm used in the current work was developed based on this concept.

DialoGPT [57] used the GPT-2 model but trained it extensively on conversational data; the model is essentially a pre-trained chatbot by itself and is available in Hugging Face's library. From brief interactions with the existing pre-trained model, it seems to generate short and not so engaging responses.
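Such a brief interaction can be reproduced roughly as follows with the Hugging Face transformers library; the model size and generation settings here are choices made for this sketch, not a configuration used in the thesis.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# encode one user turn, then let the model generate a reply
input_ids = tokenizer.encode("What do you do in your free time?" + tokenizer.eos_token,
                             return_tensors="pt")
reply_ids = model.generate(input_ids, max_length=100,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))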


It would have been interesting to fine-tune a DialoGPT model with the datasets created in this thesis and compare the results. In theory, the results should be better, as the model is trained on millions of conversations instead of many different types of text.

Researchers at Facebook AI created a "state-of-the-art open-source chatbot" called BlenderBot, released at the end of April 2020, which was claimed to outperform other chatbots in terms of engagingness (https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot). When compared one-to-one with Google's chatbot Meena [1], human evaluation found that users more often preferred BlenderBot. Once again, a transformer architecture was chosen, and differently sized models were trained, the smallest having 90 million parameters and the largest 9.4 billion parameters. The models are trained to be able to display knowledge, personality (Persona-Chat dataset) and empathy, and to blend these skills seamlessly. Had the model been released earlier, it would have been tested and considered as the main chatbot for the project, with more focus spent on introducing the long-term memory aspect and reducing contradictions.

Although the chatbot is observed to be engaging in few-turn dialogue settings, the research paper describes the flaws of the model [36]. The authors mention that the bot would be repetitive and dull if users were to interact with it over the course of several days or weeks. This is due to the chatbot being incapable of remembering earlier conversations, as it has a hard limit on the history that it considers. This supports the need for further work and research on chatbots with memory, especially how memory can be added to generative models beyond context-based understanding. It also illustrates contradictions and inconsistencies, which are commonplace for generative model chatbots.

Similarly to what was used in the ranking in this thesis, researchers at Facebook AI [36] found that encouraging longer generations helps with reducing, but does not solve, the problem of bland/"safe" response generation, e.g. "I don't know", which is another problem with generative models. They also found that the model tends to generate common words too frequently and rare words too infrequently compared to the human distribution (which leads to a less engaging chatbot). This further supports the decision to reward rare words during ranking (4.4.4). It should be noted, however, that the underlying problem is the nature of the models themselves, as they try to find the most likely next word when generating output. Therefore, any ranking is limited by the underlying architecture. It is, however, possible to adjust parameters that affect the text generation such that the model chooses a "less likely" next word to achieve more engaging responses (which in turn may lead to less sensible sentences). As such, over-generating responses and applying length based and rare word based ranking should remain meaningful until the architecture is improved.


9 Conclusion

Conversational agents have prospective applications in society, whether to combat loneliness or to promote exercise. One of several important aspects for these agents to stay engaging is the use of memory to remember the user, retain a consistent personality and avoid repetition. The research question, how can a personal social chatbot be developed with long and short-term memory, such that the interaction with the chatbot remains engaging over a longer period of time, is investigated through five sub-questions, divided into development and user tests & evaluation.

1. How can the user and agent specific memories be developed to extract and store information from user utterances, as well as utilize the stored information in agent utterances? Information can be extracted using Natural Language Processing techniques, such as Part-of-Speech tagging to extract nouns and Named Entity Recognition to extract names. Neural models can also be trained to extract desired information by providing examples of user utterances (input) and the desired information (output). Sentiment analysis (e.g. with VADER) can then be used to find whether the user likes, dislikes or feels neutral towards a subject. The question answering agent's memory was developed to have a sentiment memory with subject-sentiment pairs for a number of topics. Templates were then used to utilize information, either from the agent memory, the user memory or extracted from the user's most recent input.

2. How can short-term memory be developed such that the agent can understand and provide a relevant answer to user utterances, such as follow-up questions, from the context of the conversation history? This question is answered from the research/related work presented in chapter 2. Three methods were observed:

1. Have memory slots for he/she/it, etc., as used in Mitsuku (2.4.2), where information is extracted from user utterances and stored in memory.
2. Transform user utterances into contextual utterances, as seen in XiaoIce [59].
3. When using neural/generative models, either design the architecture or fine-tune the model to explicitly consider history when generating new responses.

This thesis made use of the third method, where a GPT-2 model was fine-tuned for question answering on conversational data in which history was considered. The resulting model illustrates the ability to answer context-dependent follow-up questions. It should be noted that the model also occasionally tends to generate nonsensical answers. Furthermore, recent advances in end-to-end trained neural conversational models [1] [36] present chatbots that converse well with respect to context understanding with the help of their short-term memory, but these chatbots do not consider long-term memory.


3. How is a user's experience with an agent affected when the agent generates questions which are personalized based on the user's utterances from a previous interaction? The users were able to realize when the chatbot asked questions that were related to previously discussed topics. However, for the question asking chatbot, the experience did not improve in terms of the perception of the chatbot's realness. Other metrics such as engagingness and appropriateness have similar scores during both interaction sessions. For the integrated chatbot, the engagingness decreased while the measures of realness and persona increased between sessions. The hypothesis was that the users would find a chatbot more engaging and fun to interact with if it could remember information about them. The hypothesis cannot be accepted due to the results of the t-tests, where no significant difference was observed between sessions. Furthermore, the hypothesis cannot be rejected because the users had personal expectations about the memory module, which resulted in an inconclusive outcome. Moreover, the results indicated that engagingness was independent of chatbot memory.

4. How is the user's perception of the agent affected by the agent having a personal background, which allows for self-disclosure responses? The aim was to answer the question by comparing the engagingness and realness of the question asking chatbot before and after the integration with the question answering chatbot. The integrated chatbot resulted in higher engagingness compared to the question asking chatbot. Although the engagingness increased, the cause may also be the addition of more features, such as question answering, rather than the agent's personal background. The perceived realness was lower for the first session of the integrated chatbot compared to the question asking chatbot, but higher in the second session. The hypothesis was that the realness score would increase from the first to the second session, as the chatbot would ask long-term memory generated questions. The hypothesis was not supported for the question asking chatbot, as the realness decreased between sessions, while the integrated chatbot supports the hypothesis. Furthermore, no significant difference was observed on the measure of realness between the two chatbots and their respective sessions, indicating that the null hypothesis cannot be rejected. The results show that the question asking chatbot may initially give off a higher sense of realness compared to the question answering or integrated chatbot, but leads to a loss of realness after additional interactions.

5. To what extent would the user's perception of the agent be affected if the user could ask the agent self-disclosure questions? By comparing the engagingness measure of the question answering chatbot and the question asking chatbot, the engagingness is higher for the question answering chatbot. Once the two chatbots were integrated, the engagingness was measured to be higher than for the question asking chatbot by itself, while being slightly lower than for the question answering chatbot. Therefore, giving the users the option to ask self-disclosure questions had a positive effect in terms of engagingness. The integrated chatbot had a lower perceived realness compared to the question asking chatbot, but higher compared to the question answering chatbot. Allowing users to ask the agent questions may lead to lower perceived realness due to the possibility of nonsensical and inconsistent responses.

Progress has been made in neural approaches for training end-to-end chatbots with context understanding, but the methods do not consider the long-term memory needed to remain engaging over longer periods of time. This thesis presents how generative-retrieval hybrid chatbots can make use of short and long-term memory to enable rapport building toward maintaining long-term engagingness by personalizing question and answer generation. User tests show that chatbots perceived as more human-like are not necessarily more engaging. The results of the user tests show that the developed chatbots achieved similar scores on engagingness and consistency to those of chatbots in the Persona-Chat task (ConvAI2). However, the chatbots are still not near human level performance on these measures. Furthermore, it is observed that generative models are essential for open-domain conversation, but still suffer from inconsistency and unengaging or nonsensical responses.

Concluding example of the generative model

HUMAN: Do you think your answers make any sense?
MACHINE: This is a very difficult question to answer, I think you will find some that make sense.

Bibliography

[1] Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. “Towards a Human-like Open-Domain Chatbot”. In: arXiv preprint arXiv:2001.09977 (2020). [2] Christoph Bartneck, Dana Kuli´c,Elizabeth Croft, and Susana Zoghbi. “Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots”. In: International journal of social robotics 1.1 (2009), pp. 71–81. [3] Timothy Bickmore, Daniel Schulman, and Langxuan Yin. “Maintaining engagement in long-term interventions with relational agents”. In: Applied Artificial Intelligence 24.6 (2010), pp. 648–666. [4] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In: Journal of machine Learning research 3.Jan (2003), pp. 993–1022. [5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language Models are Few-Shot Learners”. In: arXiv preprint arXiv:2005.14165 (2020). [6] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. “Universal Sen- tence Encoder for English”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018, pp. 169–174. [7] Yllias Chali and Sadid A Hasan. “Towards topic-to-question generation”. In: Computa- tional Linguistics 41.1 (2015), pp. 1–20. [8] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. In: Pro- ceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Trans- lation. 2014, pp. 103–111. [9] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. “Empirical evaluation of gated recurrent neural networks on sequence modeling”. In: NIPS 2014 Workshop on Deep Learning, December 2014. 2014. [10] Cristian Danescu-Niculescu-Mizil and Lillian Lee. “Chameleons in imagined conver- sations: A new approach to understanding coordination of linguistic style in dialogs.” In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011. 2011. [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre- training of Deep Bidirectional Transformers for Language Understanding”. In: Proceed- ings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 4171–4186.


[12] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. “The Sec- ond Conversational Intelligence Challenge (ConvAI2)”. In: The NeurIPS’18 Competition. Springer, 2020, pp. 187–208. [13] Denis Fedorenko, Nikita Smetanin, and Artem Rodichev. “Avoiding echo-responses in a retrieval-based conversation system”. In: Conference on Artificial Intelligence and Natu- ral Language. Springer. 2018, pp. 91–97. [14] Ulrich Gnewuch, Stefan Morana, Marc Adam, and Alexander Maedche. “Faster is not always better: understanding the effect of dynamic response delays in human-chatbot interaction”. In: (2018). [15] Rachel Gockley, Allison Bruce, Jodi Forlizzi, Marek Michalowski, Anne Mundell, Stephanie Rosenthal, Brennan Sellner, Reid Simmons, Kevin Snipes, Alan C Schultz, et al. “Designing robots for long-term social interaction”. In: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2005, pp. 1338–1343. [16] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural com- putation 9.8 (1997), pp. 1735–1780. [17] Kate S Hone and Robert Graham. “Towards a tool for the subjective assessment of speech system interfaces (SASSI)”. In: Natural Language Engineering 6.3-4 (2000), pp. 287–303. [18] Karen Huang, Michael Yeomans, Alison Wood Brooks, Julia Minson, and Francesca Gino. “It doesn’t hurt to ask: Question-asking increases liking.” In: Journal of personality and social psychology 113.3 (2017), p. 430. [19] Clayton J Hutto and Eric Gilbert. “Vader: A parsimonious rule-based model for senti- ment analysis of social media text”. In: Eighth international AAAI conference on weblogs and social media. 2014. [20] Dan Jurafsky and James H. Martin. Speech and Language Processing. 2019. [21] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The Efficient Trans- former”. In: arXiv preprint arXiv:2001.04451 (2020). [22] Iryna Kulatska. “ArgueBot: Enabling debates through a hybrid retrieval-generation- based chatbot”. MA thesis. University of Twente, 2019. [23] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. “A Diversity- Promoting Objective Function for Neural Conversation Models”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies. 2016, pp. 110–119. [24] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. “A Persona-Based Neural Conversation Model”. In: Proceedings of the 54th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016, pp. 994–1003. [25] Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. “How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016, pp. 2122– 2132. [26] Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dong- mei Zhang. “You Impress Me: Dialogue Generation via Mutual Persona Perception”. In: arXiv preprint arXiv:2004.05388 (2020). [27] Yani Mandasari. “Follow-up Question Generation”. MA thesis. University of Twente, 2019.


[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed representations of words and phrases and their compositionality”. In: Advances in neu- ral information processing systems. 2013, pp. 3111–3119. [29] Sinno Jialin Pan and Qiang Yang. “A survey on transfer learning”. In: IEEE Transactions on knowledge and data engineering 22.10 (2009), pp. 1345–1359. [30] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a method for automatic evaluation of machine translation”. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002, pp. 311–318. [31] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove: Global vec- tors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543. [32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. “Language models are unsupervised multitask learners”. In: OpenAI Blog 1.8 (2019), p. 9. [33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: arXiv preprint arXiv:1910.10683 (2019). [34] Filipe N Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and Fabrício Benevenuto. “Sentibench-a benchmark comparison of state-of-the-practice sentiment analysis methods”. In: EPJ Data Science 5.1 (2016), pp. 1–29. [35] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009. [36] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. “Recipes for building an open-domain chatbot”. In: arXiv preprint arXiv:2004.13637 (2020). [37] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. In: arXiv preprint arXiv:1910.01108 (2019). [38] David R So, Chen Liang, and Quoc V Le. “The Evolved Transformer”. In: arXiv preprint arXiv:1901.11117 (2019). [39] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. “Release Strategies and the Social Impacts of Language Models”. In: arXiv preprint arXiv:1908.09203 (2019). [40] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Si- monsen, and Jian-Yun Nie. “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion”. In: Proceedings of the 24th ACM International on Con- ference on Information and Knowledge Management. 2015, pp. 553–562. [41] Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. 2017. URL: http://aaai.org/ocs/index.php/AAAI/ AAAI17/paper/view/14972. [42] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. “MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices”. In: arXiv preprint arXiv:2004.02984 (2020). [43] Ilya Sutskever, James Martens, and Geoffrey E Hinton. “Generating text with recurrent neural networks”. In: ICML. 2011. [44] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104– 3112.


[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need". In: Advances in neural information processing systems. 2017, pp. 5998–6008.
[46] Oriol Vinyals and Quoc Le. "A neural conversational model". In: Proceedings of ICML Deep Learning Workshop. 2015.
[47] Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. "Evaluating spoken dialogue agents with PARADISE: Two case studies". In: Computer speech and language 12.4 (1998), pp. 317–348.
[48] Richard S Wallace. "The anatomy of ALICE". In: Parsing the Turing Test. Springer, 2009, pp. 181–210.
[49] Joseph Weizenbaum. "ELIZA—a computer program for the study of natural language communication between man and machine". In: Communications of the ACM 9.1 (1966), pp. 36–45.
[50] Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. "Dialogue Natural Language Inference". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 3731–3741.
[51] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. "TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents". In: arXiv preprint arXiv:1901.08149 (2019).
[52] Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. "ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation". In: arXiv preprint arXiv:2001.11314 (2020).
[53] Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. "aNMM: Ranking short answer texts with attention-based neural matching model". In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 2016, pp. 287–296.
[54] Liu Yang, Junjie Hu, Minghui Qiu, Chen Qu, Jianfeng Gao, W Bruce Croft, Xiaodong Liu, Yelong Shen, and Jingjing Liu. "A hybrid retrieval-generation neural conversation model". In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019, pp. 1341–1350.
[55] Jianhua Yin and Jianyong Wang. "A dirichlet multinomial mixture model-based approach for short text clustering". In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014, pp. 233–242.
[56] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. "Personalizing Dialogue Agents: I have a dog, do you have pets too?" In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2204–2213.
[57] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation". In: arXiv preprint arXiv:1911.00536 (2019).
[58] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. "Emotional chatting machine: Emotional conversation generation with internal and external memory". In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[59] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. "The Design and Implementation of XiaoIce, an Empathetic Social Chatbot". In: Computational Linguistics 46.1 (2020), pp. 53–93.

A Appendix

A.1 GPT-2 Text Generation

Cherry-picked example of GPT-2’s text generative capabilities, with the prompt in italic. The time of man was about to come to an end, and the time of the world was about to begin. And that time was not far off. The great cities of the world were in a state of decay and decay was in the air. The great temples were crumbling to the ground, the great palaces were in ruins, the great farms were in ruins. And all the nations of the world were in a state of decay and decay. And the world was in a state of decay and decay. And in the midst of this decay and decay the voice of God was calling to men, saying, "Behold, the days are coming, says the Lord, when I will make a new heaven and a new earth, for the former things have passed away." And the voice of God said, "I am the God of Abraham, the God of Isaac, and the God of Jacob." And the voice of God said, "I will put my law in their inward parts, and write it in their hearts." And the voice of God said, "I will not leave you orphans, but will come to you, and will heal your land." And the voice of God said, "I will send the Spirit of my Father upon you, and will be your God, and you shall be my people. And the voice of God said, "I will be with you, and will keep you, and will bring you to your own land."

A.2 Fine-Tuned Generative Models and Their Answers

The following illustrates three different personal questions answered by four different models, where the models over-generated 8 answers.

What do you like to do in your free time?

Model 4 I like to drive. Beer. Well, the money was good for me to go to college. I’m an electrician. Sorta nothing. There’s always new things, silly. I’m a dancer. Swim and kayak.

Model 7 I spend time with my family, though I do enjoy traveling.


I spend most of my time turking. I like to play video games. I usually work out, but if I find something fun to do, then I will do it regardless of whether or not I get to do it in person. I used to be a rock climber so I play rock climbing in my spare time. I like to play video games. I spend time with my family, go to festivals, go rock climbing. I like to play video games.

Model 10 I like to watch movies, travel, go surfing, play video games. Well, I enjoy golf. I like to watch movies, travel, go surfing, play video games. I like to watch movies, travel, go surfing, play video games. I like to gamble, exercise, and go out to bars occasionally. I like to watch movies, travel, go surfing, play video games. I like to watch movies, travel, go surfing, play video games. I like to watch movies and play video games.

Model 18 Well, I like to dance, read, go for hikes, and listen to music. Haha nothing too free-minded. I enjoy hiking and horseback riding. I like to travel. I spend most of my time exercising, especially track and tennis. Well, like anything I will miss the porch in the morning, which is a shame. I like to play video games, listen to music, and chat. I watch, cook, take my dogs along, hike, go out on the weekends, watch a few movies, do some gaming.

Do you have any pets?

Model 4 Yes. Yes, Mr. Milch-born and KAPOne. No. Gross. No, thank you. No, sir. No. No, I don’t have any pets.

Model 7 I have 2 cats. No pets ., never really considered that. I’m not sure. No pets of my own but my husband has 2 cats., I have 2 dogs who are both like my children, large and small. I have 3 dogs.. they are 8 and 7.. no I don’t have pets. yeah, I have 3 dogs. I have 3 dogs.. they are very house trained and love everything. They are also very healthy., Every little bit helps., They are 3½ years old and eating nothing but

carpet sections from their school
Wow, that's a lot.

Model 10 Yeah i have 2 dogs No pets . No pets . Yeah, I have 3 dogs, a rat and a shekel. Yeah i have 3 dogs. Yeah, 2 guinea pigs. Yeah i have 2 cats Yeah i have 2 dogs.

Model 18 I have a dog. No, but I do have a dog. I have 3 dogs. I have 2 dogs that I have with me whenever I have time. I have 2 cats. No, not currently. Nope, no. i’ve a dog as well.

What are your hobbies?

Model 4 To be a minister or a crusader or a magician, or just to have a talk show. I’m into historical sports, and maybe I’m gonna write something about Dolly Parton. Tennis, surfing, wrestling, surf... golf. I’m a comedian. I’m a computer programmer, a novelist, I’m a poet, and now I’m a doctor. Sleep. Spirits. Fernie, I don’t care about anything but her.

Model 7 that’s awesome. that would be traveling all day. That sounds pretty awesome. I like to play video games. I like to ride my bicycle, and I like to play video games. that sounds pretty good. I like to play soccer a lot. that would be awesome.

Model 10 My hobbies include biking, reading, playing video games, cooking, getting out and playing basketball. My hobbies are golf, grand kids, the beach, cards and goofing off. My hobbies are golf, grand kids, the beach, cards and hugs.


I like to cook, if you can call that a hobby and I like all types of craftwork My hobbies would be Netflix and yoga. I like to read, if you can call that a hobby but I also enjoy playing video games as well. My hobbies are golf, grand kids, the beach, cards and goofing off. Well I like to ride my bicycle around the area, play video games, play Magic: The Gathering, listen to music, cook, and probably get pretty cold there too.

Model 18 Well I work a lot, I play music mostly, I watch movies. Well, hobbies include hiking and boxing (which I’m really good at). My hobbies are golf and visiting friends in the area. Well I play tennis and I take walks along the river. Well, besides golf, I like basketball, crocheting, and crocheting on the beach. My hobbies are golf, restaurants and craft beers. So far so good. I enjoy chess, collecting cards, and spending time with friends.

A.3 GPT-2 Models Experiment

A short experiment was conducted to compare the different pre-trained GPT-2 models (124M, 355M, 774M, 1558M) with respect to fine-tuning time and text generation time. The experiment was conducted on the 17th of April, 2020 in a Google Colaboratory notebook, with the runtime type set to TPU and additional RAM requested (35.35 GB). The models were fine-tuned and used to generate text with the python library gpt-2-simple. Each model was trained for 100 “steps” (iterations) on the same dataset, which consisted of 28 MB of dialogue extracted from the Persona-Chat dataset. The results are presented in Table A.1 and Table A.2. It is unknown why the 124M parameter model took almost as long to fine-tune as the 774M parameter model in this case; extrapolating from how the training time grows from 355M to 774M and from 774M to 1558M, the expected training time for the 124M model would be roughly 300-400 s.
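For reference, a fine-tuning run of the kind timed above can be reproduced with a few calls to gpt-2-simple. The snippet below is only a minimal sketch: the dataset filename is a placeholder, and the actual data preparation and Colab setup used in the thesis are not shown.

# Minimal sketch of one fine-tuning run (100 steps), using gpt-2-simple.
# "persona_chat.txt" is a placeholder for the ~28 MB dialogue file extracted
# from the Persona-Chat dataset.
import gpt_2_simple as gpt2

model_name = "124M"  # repeated for "355M", "774M" and "1558M"
gpt2.download_gpt2(model_name=model_name)  # fetch the pre-trained weights

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="persona_chat.txt",
              model_name=model_name,
              steps=100)  # 100 training steps, as in Table A.1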

Model    Elapsed training time (100 steps)
1558M    2689.62 s
774M     1424.34 s
355M     702.16 s
124M     1392.30 s

Table A.1: Comparison of training time between different GPT-2 models

Model    Average elapsed time to generate text
1558M    81.649 s
774M     44.203 s
355M     26.075 s
124M     11.331 s

Table A.2: Comparison of generation time between different GPT-2 models

To test the time it takes for the models to generate text, the pre-trained versions of the models (as released by OpenAI) were loaded without any fine-tuning and used to generate text with gpt-2-simple’s generate() function. The function was called three times for each model, and the average runtime was calculated. Additionally, it was observed that setting a lower value for the length parameter in generate() reduced the generation time: for the 355M model, a length of 40 took around 16 seconds compared to 23-26 seconds when the length was 100, and the 124M model took 8-9 seconds compared to 10-11 seconds. A conclusion of the experiment was that, henceforth, only 124M parameter models would be fine-tuned, for the sake of having the lowest response time.

Parameters used in the generate() function:
prefix = "The secret of life is"
length = 100
temperature = 0.7
top_p = 0.9
nsamples = 5
batch_size = 5
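As an illustration of how the generation timing could be measured, the sketch below loads a pre-trained model without fine-tuning and calls generate() with the parameters listed above; the exact measurement code used in the thesis may differ.

# Sketch of a single timed generation call with the parameters listed above.
import time
import gpt_2_simple as gpt2

model_name = "355M"
gpt2.download_gpt2(model_name=model_name)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name=model_name)  # pre-trained weights, no fine-tuning

start = time.time()
samples = gpt2.generate(sess,
                        model_name=model_name,
                        prefix="The secret of life is",
                        length=100,
                        temperature=0.7,
                        top_p=0.9,
                        nsamples=5,
                        batch_size=5,
                        return_as_list=True)
print(f"Generated {len(samples)} samples in {time.time() - start:.1f} s")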

A.4 GTKY - Most Common Nouns

Occurrences  Noun         Example question
123          Time         What do you like to do in your free/spare time?
106+12       Kind/Kinds   What kind of do you like? (movies/books/food/music/sports/...)
98           School       Do you go to school?
96           Student      Are you a student?
90           Today        How are you today?
80           Day          how was your day?
73           Fun          What do you do for fun?
70           Weather      How’s the weather?
67           Name         What is your name?
66           Work         what do you do for work?
64           Hobbies      What are your hobbies?
56           Things       what things do you enjoy doing?
55           Sports       Do you play any sports?
53           Music        What type of music do you like?
51           Job          What type of job do you do?
49           Area         What area are you from?
48           Part         what part of the country are you from?
47           Plans        What are your plans for the day?
41           Weekend      Do you have any plans for the weekend?
39           College      Are you in college?
39           Lot          Do you go to a lot of movies?
38           Year         What year are you?
37           Family       Do you have a family?
37           Movies       what type of movies do you like?
36           Kids         do you have any kids?
34           Food         what is your favorite food?
30           Type         What type of ?
29           Summer       what are you up to this summer?
29           Country      What country are you from?
28           Place        do you have a favorite place you traveled to?
28           Life         Have you live in PA your entire life?
26           Games        What kind of games do you play?
25           Friends      what do you like to do with your friends?
25           Thing        what is your favorite thing to do?
25           Living       What do you do for a living?
24           State        What state are you from?
23           Studies      Do you like to do studies?
23           Fan          Are you a fan?
23           Years        How many years of Latin?
23           Snow         Did you guys get much snow?
22           Money        how much money do you usually make a day?
22           Interests    what are some of your interests?
22           mturk        How long have you been on mturk?
21           places       Do you know any good Indian places?
21           pets         do you have any pets?
21           stuff        what kind of outdoors stuff do you do?
21           children     Do you have any children?
20           City         what city do you live in?
19           Major        What’s your major?
19+10        Books/Book   What kind of books do you like to read?
19           Field        What field are you in?
19           Way          Is there a better way?
17           Winter       Do you like the winter?
17           Movie        What’s your favorite movie?
17           Person       are you an outdoorsy person?
16           Home         Do you miss home?
16           Study        What do you study?
16           People       where do you find these people?
16           Business     What kind of business?
16           Shows        What are your favorite tv shows?
15           Team         What’s your favorite team?
15           Research     What kind of research are you doing?
15           Morning      how are you this morning?
15           Forums       are you in any forums?
14           Program      What is the program called?
14           Week         What do you usually do during the week?
13           Sort         What sort of work do you do?
13           Campus       What do you do on campus?
13           Guys         Did you guys get much snow?
12           Degree       what is your degree in?
12           Tv           what is your favorite tv show?
12+10        Dog/Dogs     what kind of dog?
11           Show         Whats your favorite show?
11           Basketball   Do you watch basketball at all?
11           Beach        Do you like the beach?
11           Football     Do you like football?
10           Masters      Do you have your masters?
10           Career       What are your career interests?
10           Undergrad    Are you an undergrad?
10           History      What interests you about history?
10           activites    What activities are you interested in?
10           instrument   Do you play an instrument?
10           World        where in the world do you live?
10           Season       What’s your favorite season?

Table A.3: Top 87 (84+3) most frequently occurring nouns in extracted questions from the GTKY dataset.
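The table above only lists the resulting counts. As an illustration of how such counts can be produced, the sketch below tallies noun frequencies over a list of extracted questions; the use of spaCy here is an assumption, and the actual extraction pipeline used for the GTKY questions may differ.

# Illustrative only: counting the most common nouns in a list of questions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed
questions = [
    "What do you like to do in your free time?",
    "Do you have any pets?",
    "What kind of music do you like?",
]

counts = Counter()
for doc in nlp.pipe(questions):
    counts.update(tok.text.lower() for tok in doc if tok.pos_ == "NOUN")

print(counts.most_common(10))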


A.5 Templates

question                              answer_id   default_positive
Do you ?                              1           0
What kind of do you ?                 2           0
What type of are you into?            2           1
What type of do you ?                 2           0
What is your favorite kind of ?       3           1
What is your favorite ?               3           1
What’s your favorite genre?           4           1
What is your favorite type of ?       3           1
What is your favorite kind of ?       3           1
What do you think about ?             5           0
What sort of do you ?                 2           0
What do you ?                         2           0
What are you into right now?          2           1
What are your favorite ?              6           1
What are some of your favorite ?      6           1
Do you have any interest in ?         2           1
Do you have a favorite ?              3           1
Do you at all?                        1           0
Do you have any favorite ?            3           1
Do you any type of ?                  2           0
Do you any ?                          2           0
What is your least favorite ?         7           0
Do you have a least favorite ?        8           0

Table A.4: Template questions developed after analyzing the GTKY dataset, with the addition of two questions after a user test. (Gaps in the question text mark template slots that are filled in at generation time.)

Flag columns, in order: answer_id, same_sentiment, fetch_count, default_negative, default_positive, use_noun, use_sentiment.

answer                                          |  flags
Yes, I .                                        |  1 1 0 0 0 1 1
No, I don’t .                                   |  1 0 0 0 0 1 1
No, I .                                         |  1 0 0 0 0 1 0
I .                                             |  1 0 0 0 0 1 0
I do . I also .                                 |  1 1 1 0 0 1 1
Yes, I do!                                      |  1 1 0 0 0 0 0
No, I don’t.                                    |  1 0 0 0 0 0 0
I don’t but I .                                 |  1 0 1 0 0 1 1
No, I don’t but I .                             |  1 0 1 0 0 1 1
I , and .                                       |  2 1 3 0 0 0 1
I the most! But I also .                        |  2 1 2 0 0 0 1
I .                                             |  2 1 1 0 0 0 1
I and .                                         |  2 1 2 0 0 0 1
I don’t any .                                   |  2 0 0 0 0 1 1
I don’t .                                       |  2 0 0 0 0 1 1
My favorite is .                                |  3 1 1 0 1 1 0
I like the most.                                |  3 1 1 0 1 0 0
I think I like the most.                        |  3 1 1 0 1 0 0
My favorite is probably or .                    |  3 1 2 0 1 1 0
I’m not sure but I really like a lot.           |  3 1 1 0 1 0 0
I don’t have a favorite .                       |  3 0 0 1 0 1 0
I , so I can’t think of a favorite.             |  3 0 0 1 0 1 0
My favorite genre is . But I also like and .    |  4 1 3 0 1 1 0
My absolute favorite genre is .                 |  4 1 1 0 1 1 0
I think I like the most.                        |  4 1 1 0 1 0 0
I .                                             |  5 1 0 0 0 1 0
Hm...I don’t really like that much.             |  5 0 0 1 0 1 0
I like it!                                      |  5 1 0 0 1 0 0
My favorite are and .                           |  6 1 2 0 1 1 0
It has to be and .                              |  6 1 2 0 0 0 0
I .                                             |  6 0 0 1 0 1 0
I don’t have a favorite because I .             |  6 0 0 1 0 1 0
My least favorite is .                          |  7 1 1 1 0 1 0
I do, my least favorite is                      |  8 1 1 1 0 1 0

Table A.5: Template answers that are used together with a sentiment memory, to answer what things the agent likes or dislikes. (Gaps in the answer text mark slots that are filled from the sentiment memory.)
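To make the role of the flag columns concrete, the sketch below shows one possible way a template answer could be selected and filled from a sentiment memory. It is an illustration only: the slot markers, function names and selection logic are hypothetical simplifications, not the thesis implementation.

# Illustrative sketch only: selecting and filling a template answer using the
# flag columns of Table A.5. Names and logic are hypothetical simplifications.
import random

# (answer, answer_id, same_sentiment, fetch_count,
#  default_negative, default_positive, use_noun, use_sentiment)
TEMPLATES = [
    ("Yes, I {sentiment} {noun}.",          1, 1, 0, 0, 0, 1, 1),
    ("No, I don't {sentiment} {noun}.",     1, 0, 0, 0, 0, 1, 1),
    ("My favorite {noun} is {memory}.",     3, 1, 1, 0, 1, 1, 0),
    ("I don't have a favorite {noun}.",     3, 0, 0, 1, 0, 1, 0),
]

def answer(answer_id, noun, sentiment_word, liked, memory_items):
    """Pick a template that matches the question's answer_id and the agent's
    stored sentiment, then fill its slots from the sentiment memory."""
    candidates = [t for t in TEMPLATES
                  if t[1] == answer_id           # matches the question type
                  and bool(t[2]) == liked        # same_sentiment read here as "positive answer"
                  and t[3] <= len(memory_items)] # enough memory items to fetch
    text, _, _, fetch, *_ = random.choice(candidates)
    memory = ", ".join(memory_items[:fetch]) if fetch else ""
    return text.format(noun=noun, sentiment=sentiment_word, memory=memory)

# Example: the agent likes movies and remembers liking "Avengers".
print(answer(answer_id=3, noun="movie", sentiment_word="like",
             liked=True, memory_items=["Avengers"]))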


A.6 Front-End - Question Answering Agent

Figure A.1: Front page with instructions and terms for the question answering user test.

Figure A.2: Chatbot interaction page, illustrating retrieval based memory.


Figure A.3: Survey page for the question answering user test.

A.7 Question Answering User Test Interactions

Ideal example

Q: Who are you?
A: I am Keith Francis.
Q: Do you like to watch movies?
A: I love watching movies but I don’t have a lot of free time as of late.
Q: Why?
A: I also get to travel a lot, so I get to do a lot of my college
Q: What movies do you like?
A: I like Avengers.
Q: Do you have any friends?
A: I have lots of friends.
Q: What is the name of your best friend?
A: hi friend is gonna be nice vanilla bean pancake... so i almost always keep track of which one i keep... i love vanilla... so I know which i like best.
Q: What are your hobbies?
A: Well I like to cook, if you can call that a hobby and I like to read alot.
Q: What do you usually cook?
A: I cook a lot of different things, like pancakes, sausages, fajitas, stews, etc.
Q: What music do you like?
A: I like rap.
Q: Who is your favorite rapper?
A: Nobody, but I HE has his moments.
Q: Do you have any pets?
A: Yes, a baby geese and a guinea pig.
Q: What are their names?
A: I don’t really remember.

Borderline example

Q: Is it going to rain tomorrow?
A: Probably not exactly this morning.
Q: Are you a human=
A: I’m a robot.
Q: why?
A: because its summer so i have to get my legs together pretty quick.
Q: Why are your legs apart?
A: For different things.
Q: What kinds of things?
A: I like to play sports.
Q: Watersports?
A: Yeah, I’ve seen storms, hail and snow.

Out of scope example

Q: What is 3/0
A: Fridays are for fun, not much major other than I play the violin, sing, and am in college so has lots of family.
Q: Yes
A: Yes, I did.
Q: No
A: No, I’m actually quite happy with Cape Cod.
Q: Yes
A: Yes, I did.


A.8 Questions for Evaluation

Questions
what do you do for fun?
Do you have any plans for the weekend?
What do you do for a living?
do you have a favorite kind of music you like to listen to?
What do you like to do in your spare time?
Have you always lived in Texas?
what else do you like doing?
How do you like playing sports?
What are you studying?
do you like soccer?
what do you do for fun in Missouri?
Have you seen ANY of the Hunger games movies?
Do you love American food?
Are you old enough to drink?
Are you ready for the snow to melt and spring to start?
Do you ski?
Are you a student at Harvard?
How are you today?
What kind of education do you have?
do you meditate?
Where do you live?
What’s your favorite beer?
What type of job do you do?
Are you on Facebook?
Have you seen any good movies yet?
Are you also in college?
Do you cook?
have you ever been to italy?
What do you do with your summer?
Do you speak any other languages?
What kind of music do you like?
What’s your favorite kind of food?
What’s your favorite cake you have made?
What is your favorite place in the entire country?
do you have any recommendations for things to do in the area?
what is your name?
do you like The Count from Sesame Street?
What do you do outside of school?
how often do you do these types of studies?
What kind of ice cream flavor would you suggest for me?
do you have any pets?
what sport do you play?
Are you into yoga?
are you religious?
What kind of field do you study in?
you like anything to do with computers?
What do you do?
What’s a typical day like for you?
Do you like reading?
Have you enjoyed the sunshine?
So are you a student or working?
is there anything fun planned for you this weekend?
Any favourite sports?
what do you do that is important to you in your life?
Have you ever seen snow?
Can I ask your age?
How is the weather where you are?
Do you enjoy reading?
Do you like to watch any sports?
What are you planning to do on your vacation?
do you like to travel?
Have you ever been in Asia?
do you have any hobbies?
do you enjoy swimming?
do you have plans for the weekend?
What do you do at home on a Saturday?
are you from here?
What are your favorite genres then?
Did you study abroad at all or plan to?
What do you do when stressed?
What’s your major?
Are you an artist?
What’s your opinion on him?
What do you do when you’re on the computer?
Do you go to school in Boston?
What’s your favorite thing to cook?
You really like to read, don’t you?
where in the world do you live?
When did you start singing?
Are you a monk too?
Do you have children?
What do you do with your winter vacation?
How old are your kids?
Did you ever leave the country before?
How’s your morning going?
Do you want to stay in the area or move to another state?

Table A.6: Questions generated by a model accidentally trained on question asking, that were then used for evaluating answer relevancy of fine-tuned generative models.

A.9 Final User Test Interactions

Sample Interaction 1 - Session 1

User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Hi! What’s your name?
User: REDACTED_NAME
Chatbot: Cool!
Chatbot: What’s one sport that you like?
User: I like to swim
Chatbot: My favorite sport is soccer.
User: What do you like about soccer?
Chatbot: I like watching it, it’s hard to watch with a computer, but I enjoy it.
Chatbot: What do you like about swim?
User: I like swimming because it’s nice to be in the water
Chatbot: Okay.
Chatbot: Here’s another question, what’s your one favorite pet?
User: My cat
Chatbot: Okay. I dislike cat, but I like sheep
User: do you have a sheep as a pet?
Chatbot: No, Sheep is not my pet. I have a dog.
Chatbot: Thank you REDACTED_NAME! You have now finished the first session, please go back to the form for a small survey, and start a second session in a few days. I will send you a reminder via email.

Session 2

User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Good day REDACTED_NAME! How are you?
User: I’m great! How are you?
Chatbot: It has been long.
Chatbot: Last time we talked about swim, did you do it recently (<30 days)?
User: No I didn’t
Chatbot: You think you will do it in the near future?
User: I hope so!
Chatbot: I see.
User: Do you have any hobbies?
Chatbot: I like to ski, I like playing video games, reading, and traveling.
Chatbot: We also talked about cat. Do you happen to own one?
User: That’s nice!
Chatbot: And what color?
User: Yes I do have a cat
Chatbot: Okay!
User: My cat has different colors
Chatbot: lol My cat has a light gray fur.
Chatbot: Great. It was nice talking to you!
Chatbot: Thank you! Please proceed to the form for a small survey to complete the test.

Sample Interaction 2 - Session 1

User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Hi! What’s your name?
User: REDACTED_NAME
Chatbot: Cool!
Chatbot: What’s one sport that you like?
User: Yoga
Chatbot: I think my favorite sport is soccer.
User: Do you play?
Chatbot: No, but I do enjoy playing other sports like basketball, hockey and sculls.
Chatbot: What do you like about Yoga?
User: I can do it whenever I want
Chatbot: Okay.
Chatbot: Here’s another question, what’s your one favorite pet?
User: Cat
Chatbot: Alright. I dislike cat, but I like sheep.
User: Why do you like sheep?
Chatbot: Sheep is soft so I don’t mind it, but I enjoy swimming with it.
Chatbot: Thank you REDACTED_NAME! You have now finished the first session, please go back to the form for a small survey, and start a second session in a few days. I will send you a reminder via email.

Session 2

User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Good day REDACTED_NAME! How are you?
User: I’m fine. How’re you?
Chatbot: I’m fine.
Chatbot: Last time we talked about Yoga, did you do it recently (<30 days)?
User: Not since we last talked, but before (<30 days) - yes.
Chatbot: Mh okay.
User: Did you play soccer?
Chatbot: No I played a little bit at the junior college I grew up in, it was very tough.
Chatbot: We also talked about Cat. Do you happen to own one?
User: I do, yes.
Chatbot: And what color?
User: She’s tricolor - white, black, yellow.
Chatbot: Mh.
User: So, what’re your plans today?
Chatbot: I’d like to go to a movie today.
Chatbot: Great. It was nice talking to you!
Chatbot: Thank you! Please proceed to the form for a small survey to complete the test.
