IntelliBot: A Domain-specific Chatbot for the Insurance Industry
MOHAMMAD NURUZZAMAN
A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy
UNSW Canberra at the Australian Defence Force Academy (ADFA), School of Business
20 October 2020
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institute, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’
Signed
Date
To my beloved parents
Acknowledgement
Writing a thesis is a great opportunity to review not only my academic work but also the journey I took as a PhD student. I have spent four lovely years at UNSW Canberra at the Australian Defence Force Academy (ADFA). Throughout my journey in graduate school, I have been fortunate to come across so many brilliant researchers and genuine friends. It is the people I met who shaped who I am today. This thesis would not have been possible without them. My gratitude goes out to all of them.
Above all, I express my deepest gratitude to my PhD supervisor, Assoc. Prof. Omar Khadeer Hussain, who has been a fantastic advisor throughout this journey. He encouraged me to enter the challenging field of natural language processing and deep neural networks. He has been most supportive of my work, providing excellent guidance, both academic and professional, and encouragement in times of need. He has also been more patient and understanding than anyone I have known. Prof Omar made my journey an enjoyable one, provided lots of useful feedback and corrected my work extremely promptly, for which I have the utmost gratitude. He also gave me great opportunities to participate in many interesting research projects. This thesis would not exist without his constant support.
Importantly, I would like to thank my parents, Al-Hajj Sultan Ahammad and Sahana Sultana, for their unconditional love throughout my life and for providing me with the soil in which to grow. Without their support I may not have found myself at UNSW, nor had the courage to engage in this journey and see it through. My academic career owes a great deal to my parents, who wholeheartedly supported my pursuit of my dreams, even when that meant a distance of thousands of miles for many years. Furthermore, I thank my younger brother Kamruzzaman Poran, my two sisters Aspiya Sultana Shekha and Sultana Momotaz Sima, and my two brothers-in-law Md Mahmud Reza and Salam Prodhan. I would not be where I am today without their unflagging support and the great sacrifices they made for my education along the way. Words are not enough to express their encouragement and love in my life. Thank you very much for always being there for me. I am very grateful to have such a supportive family.
I am also grateful to Assoc Prof Farookh Hussain and Dr Morteza Saberi, who always had confidence in me, encouraged me to pursue a higher standard, and generously helped me address research issues and revise papers. I thank Gianin Zogg for giving me valuable career advice and inspiration. To Samuel Sun and Ramesh Thiagalingam, who helped me solve issues in the project and in industry collaboration; their insightful guidance and sense of responsibility motivated me to grow professionally. To my colleagues Peter Scott and Greg Creighton for their support with hypothesis explanation, insightful technical conversations and participation in fruitful discussions. They continuously demonstrated how to discover interesting problems, how to bake ideas and how to ruthlessly question a piece of work, especially one's own. To my friends and teammates Gautham Ravi and Yifan Zhao, from whom I learned a lot while working together. Thank you, buddies, for your hard work and for making our time worthwhile.
And my deepest thanks to Saleh Ibne Rosul, and to Ashraf Chowdhury, his beloved wife Mahzabin Akhter and their adorable little daughter Ophelia Chowdhury, who made me part of a big family and cared for me more than for himself. Special appreciation goes to Farida Yesmin Shetu, with whom I always shared my good news and frustrations. I could not have completed this journey or achieved any accomplishment without their unconditional love.
I would like to acknowledge the financial support from UNSW in the form of an international postgraduate award, tuition fees, a research stipend and student health coverage. Life at UNSW Canberra at ADFA would have been much more difficult without the members of the administrative and technical staff. I would like to thank all the staff in the School of Business and the School of Engineering and Information Technology at UNSW. I would also like to take this opportunity to thank my committee members Prof Michael O’Donnell, Prof Satish Chand, Dr Fiona Buick, Prof Elizabeth Chang, Jessica Campbell and Elvira Berra for their tremendous support and for providing insightful comments that made this thesis deeper and more coherent.
Special thanks to my UNSW friends Sukanto Kumer Shill, Abdul Khaleque, Hang Thanh Bui, Ahmad Jorban Al-mahasneh, Tasneem Rahman, Ahasanul Haque, Xiao Zhang, Md Alamgir Hossain, Tanmoy Das Gupta, Mohiuddin Khan, Sohel Ahmed, Anwar Us Saadat, Forhad Zaman, Mousa Hadipour, Wenxin Chen and Jo Ji.
Also, I would like to thank all of my school friends Salahuddin Ahmed, Istiaque Ahmed, Saiful Islam, Mydul Hossain Khan, Kaniz Fatema, Mahbub Alam, Alauddin, Mohammad Razzakul Haider, Bashir Ahmed, Noor Mohammad, Mohammad Shazzad Hossain Bhuiyan, Isa Ahmmed Saleh, Shahin Ahmed, Golam Mostofa, Monower Hossen, Tanvir Hasan, Hedayet Ullah, Nazmul Islam, Shamim Hossain, Ruma Pervin, Zahid Hasan, Tahamina Alich, Shahana Akter Shelly, Ayesha Siddiqua, Mahmud Hasan, Nazmul Haque, Homayun Khan, Mohasin Ali, Mohammad Al Amin, Husne Ara, Asmaul Husna, Shamima Akter, Jesmine Akter, Rustom Ali, Shohidul Shazib, Nilima Ibrahim, Abu Ohab, Abbas Uddin, Rokonuzzaman, Azman Khan, Nasreen, Mahamuda, Bappi, Belal Hossain, Abu Bakar Siddique, Arifur Rahman and Sir Afzal Hossain, who left precious memories for me. I thank you for steering me to a better self.
Last but not least, I thank all my friends and family members, especially SalBadiul Alam, Kamal, Abul Khair Mojumdar, Jahangir Alam, Mohammed Reaz, Al-Mamun, Alamgir Alam, Ahmad Abdul Majid, Walid Abouaghreb, Rajib Hasan, Aziz, Anawarul, Mahadi Miraz, Shamimul Azim, Taslima Akter Tasu, Faria Zaman, Nurjahan Akter Shanta, Michelle Williams, Nafis Iqbal, Zakir Hossain, Vinay Kumar Adepu, Fatin Nurul Ghazali, Sharmin Afroz, Tania Habib, Prof Hatim Mohammad Tahir, Prof Azham Hussain, Zhamri Che Ani, Mohammad Amir, Dr. Husbullah Omar, Prof Wan Rozaini Sheik Osman, Zaini Mostafa, Muhammad Aiman Mazlan, Dr Shifa Mahmod, Dr Azman Yasin, Peter Chong, Chang Fu Tong, Sing Choong Lau, Lee Teik Hui, Kevin Yap, Imam Hossain, Freed Jawad, Aumio Chowdhury, Akhlaqur Rahman, Ishrar Tabenda Hasan, Saeed Nezamy, Yogi Babria, Bahar Torkaman, Arif Yusof, Dina Rayhan, Ritesh Singh, Ajita Shah, Santosh Mainali, Sheikh Salam, Mesbahur Rahman Topu, Surya Maharjan, Ram Sharma, Reejo Augusti, Samir Khan, Tenzin, Baysaa Baatar, Farid Ahmed, Ang Ling Weay and Zannatul Mawa Dihan. I thank you for steering me to a better self.
Abstract
Communication is an indispensable aspect of the success of any business. Due to the increase in digital innovation, Internet-based services such as chatbots now play a vital role in maintaining communication with users. However, a traditional chatbot’s dialogue capability is quite inflexible, as it can answer the user only if there is a pattern match between its set of question-answer pairs and the user’s query. This may leave customers unhappy, and research has shown that 91% of unhappy customers tend not to engage with the business again. To address this, chatbots need to have meaningful dialogue abilities rather than merely providing a yes, a no or a short response. The major challenge in building a better AI model is ensuring it has a domain-specific conversational capability to engage with the user while presenting meaningful responses and semantically correct information.
The existing literature has explored the capabilities of advanced techniques, such as deep bidirectional recurrent neural networks (DBRNN), to enable chatbots to engage in human-like conversation and generate responses. However, while an enormous amount of research has been done to bring this idea to realization, no significant outcome in the area of engaging with users while generating a response has been achieved to date. To address this problem, in this thesis an innovative framework architecture, named IntelliBot, is designed and developed. IntelliBot is a chatbot that facilitates a high degree of engagement with the user by using a seq2seq model to generate its responses. Additionally, it has the ability to answer not only simple questions but also complex user queries with semantically correct, meaningful responses, solving user queries specifically in the insurance domain. To meet these challenges, IntelliBot generates a response in four distinct ways, namely, a template-based strategy, a knowledge-based strategy, an Internet retrieval strategy and a generative-based strategy. An AI selection process is adopted which sequentially determines which strategy best fits the specifics of the user’s question.
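The sequential strategy selection described above can be sketched in Python as follows. This is a minimal illustration only: the handler names, confidence scores and threshold are hypothetical assumptions and do not reflect IntelliBot's actual implementation, which is detailed in the body of the thesis.

```python
# Minimal sketch of a sequential strategy-selection policy (illustrative;
# handler signatures, scores and the threshold are assumptions, not the
# thesis's actual AI selection process).
STRATEGY_ORDER = ["template", "knowledge", "internet_retrieval", "generative"]

def select_response(query, handlers, threshold=0.5):
    """Try each strategy in turn and return (strategy_name, response) for
    the first one whose confidence meets the threshold; fall back to the
    generative strategy, which can always produce some response."""
    for name in STRATEGY_ORDER:
        response, confidence = handlers[name](query)
        if response is not None and confidence >= threshold:
            return name, response
    return "generative", handlers["generative"](query)[0]
```

The ordering matters: cheap, high-precision strategies (template matching) are consulted before expensive, lower-precision ones (generation), so the generative model only answers when no curated source can.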
To demonstrate the effectiveness of IntelliBot in generating a superior response, its outputs are evaluated against those of three publicly available chatbots. These results were then evaluated by experts to determine their accuracy with respect to the questions asked. Metrics such as Cohen’s kappa and the F1 score were computed to benchmark the results of each chatbot. These scores demonstrated that IntelliBot outperformed the existing chatbots, overcame their shortcomings and had better conversational capabilities, which led it to give the highest number of complete, meaningful and semantically correct answers to the questions asked in a service-industry setting.
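For reference, the two evaluation metrics mentioned above can be computed from their standard definitions as follows. This is a self-contained sketch, not the thesis's actual evaluation code.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: inter-rater agreement corrected for chance,
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed agreement
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    # expected chance agreement from each rater's marginal label frequencies
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, two experts who rate four answers as correct (1) or incorrect (0) with judgments [1, 1, 0, 1] and [1, 1, 0, 0] agree on 3 of 4 items, giving a kappa of 0.5 once chance agreement is removed.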
List of Publications arising from this thesis
Refereed Journal Articles:
1. Nuruzzaman, M. and Hussain, O.K. (2020). "IntelliBot: A Dialogue-based chatbot for the insurance industry," Knowledge-Based Systems, vol. 196, p. 105810, 21/05/2020. doi: https://doi.org/10.1016/j.knosys.2020.105810
2. Nuruzzaman, M., Hussain, O.K. and Hussain, F.K. (2020). "Design and Training of Response Generation Strategies for Domain-oriented Chatbot with Grammar Error Check", ID: TEIS-2020-0204, submitted to Enterprise Information Systems Journal.
Refereed Conference Articles:
1. Nuruzzaman, M. and Hussain, O.K. (2019). "Identifying facts for chatbot's question answering via sequence labelling using Recurrent Neural Networks". Proceedings of the ACM Turing Celebration Conference, China, 17-19 May. ACM Computing, Article No. 93. doi: https://dl.acm.org/doi/10.1145/3321408.3322626
2. Nuruzzaman, M. and Hussain, O.K. (2018). "A Survey on Chatbot Implementation in Customer Service Industry through Deep Neural Networks". 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), 12-14 Oct 2018. IEEE Computer Society. doi: https://ieeexplore.ieee.org/document/8592630
News Mention:
1. Anonymous, "Engineering - Knowledge Engineering; Findings on Knowledge Engineering Detailed by Investigators at UNSW", NewsRx, United States, Atlanta, p. 143, 25 May 2020, retrieved 15/06/2020, available at: https://login.wwwproxy1.library.unsw.edu.au/login?qurl=https%3A%2F%2Fsearch.proquest.com%2Fdocview%2F2406951496%3Faccountid%3D12763
TABLE OF CONTENTS
Abstract ...... xii
List of Publications arising from this thesis ...... iii
TABLE OF CONTENTS ...... v
List of Tables ...... x
List of Figures ...... xii
List of Abbreviations ...... xv
CHAPTER 1 ...... 1
1.1 Chatbots and Their Evolution to Answer a Customer’s Queries ...... 1
1.2 Shortcomings of Existing Chatbots to Answer Customer Questions in a Service Industry ...... 3
1.2.1 The motivation of choosing the insurance industry as the area of application ...... 5
1.3 Objectives of the Thesis ...... 6
1.4 Research Questions to Achieve the Research Objectives ...... 7
1.5 Contributions of the Thesis ...... 8
1.6 Significance of the Thesis ...... 9
1.7 Scope of the Thesis ...... 9
1.8 Structure of the Thesis ...... 9
CHAPTER 2 ...... 11
2.1 Overview ...... 11
2.2 Taxonomy of Chatbots ...... 11
2.2.1 Goal-based chatbot ...... 13
2.2.2 Knowledge-based chatbot ...... 13
2.2.3 Service-based chatbot ...... 14
2.2.4 Response generated-based chatbot ...... 14
2.3 Analysis of Models Used to Generate a Response that Mimics a Human Brain ...... 15
2.3.1 Template-based Model ...... 15
2.3.2 Retrieval-based Model ...... 16
2.3.3 Search Engine Model ...... 19
2.3.4 Generative Model ...... 20
2.4 Workings of the Existing Chatbots in the Literature ...... 21
2.4.1 Elizabot ...... 22
2.4.2 Alicebot ...... 22
2.4.3 Elizabeth bot ...... 23
2.4.4 Mitsuku ...... 24
2.4.5 Cleverbot ...... 24
2.4.6 Chatfuel ...... 25
2.4.7 ChatScript ...... 25
2.4.8 IBM Watson ...... 26
2.4.9 Microsoft LUIS ...... 26
2.4.10 Google Dialogflow ...... 27
2.4.11 Amazon Lex ...... 27
2.5 Techniques Used in Existing Dialogue-based Chatbots to Build and Generate a Response ...... 31
2.5.1 Rule-based approach ...... 31
2.5.2 TF-IDF approach ...... 31
2.5.3 End-to-End approach ...... 33
2.5.4 RNN approach with seq2seq mechanism ...... 33
2.5.5 RNN approach with memory network ...... 38
2.6 Critical Evaluation of the Literature ...... 39
2.7 Conclusion ...... 41
CHAPTER 3 ...... 42
3.1 Introduction ...... 42
3.2 Key Terms ...... 42
3.3 Existing Gaps in Domain-oriented Dialogue-based Chatbots Which Aim to Engage with Customers in the Service Industry ...... 43
3.3.1 Drawback 1: Use of templates to map questions and answers to respond to user questions ...... 43
3.3.2 Drawback 2: Inability to respond to a user’s complex queries ...... 44
3.3.3 Drawback 3: Deciding which strategy to select according to the question asked to generate a meaningful and domain-specific response ...... 44
3.3.4 Drawback 4: Unable to engage users in a meaningful conversation ...... 44
3.3.5 Drawback 5: Unable to identify errors in user questions ...... 45
3.3.6 Drawback 6: Unable to learn continuously from a user-bot conversation ...... 46
3.4 Research Problem Addressed in this Thesis ...... 46
3.5 Adopted Research Methodology to Solve the Thesis Problem ...... 48
3.5.1 Theoretical study ...... 50
3.5.2 Addressing the problem ...... 50
3.5.3 Solution design ...... 50
3.5.4 Experiment ...... 51
3.6 Conclusion ...... 51
CHAPTER 4 ...... 52
4.1 Introduction ...... 52
4.2 Key Terms ...... 52
4.3 Requirements of a Domain-specific Chatbot ...... 53
4.4 Methodological Approach for Designing and Building a Domain-specific Chatbot ...... 55
4.4.1 Identify components ...... 56
4.4.2 Design conceptual framework ...... 58
4.4.3 Develop and train AI model ...... 59
4.4.4 Experiment and validation ...... 59
4.5 Proposed Conceptual Model of IntelliBot’s Response Generation Component ...... 59
4.5.1 User emulator ...... 60
4.5.2 Input Processing Unit (IPU) ...... 61
4.5.3 Neural Dialogue Manager (NDM) ...... 61
4.5.3.1 Language Understanding Unit (LUU) ...... 62
4.5.3.2 Strategy Selection Unit (SSU) ...... 64
4.5.3.3 Context Tracking Unit (CTU) ...... 68
4.5.3.4 Response Generator Unit (RGU) ...... 70
4.5.3.5 Response Analyser Unit (RAU) ...... 70
4.6 Conclusion ...... 71
CHAPTER 5 ...... 72
5.1 Introduction ...... 72
5.2 Key Terminology ...... 73
5.3 Strategy Selection Unit’s Workflow to Generate a Response to the User’s Query ...... 73
5.4 Design and Working of the Template-based Strategy ...... 75
5.4.1 Objective ...... 75
5.4.2 Summary of the working of the template-based strategy ...... 75
5.4.3 Detailed process of generating a response ...... 76
5.4.3.1 User question resulting in a direct match with the defined templates ...... 78
5.4.3.2 User question resulting in an induced match with the defined templates ...... 78
5.4.4 Limitation of template-based strategy ...... 79
5.5 Design and Working of the Knowledge-based Strategy ...... 80
5.5.1 Objective ...... 80
5.5.2 Summary of the working of the knowledge-based strategy ...... 80
5.5.3 Detailed Process of generating a response ...... 81
5.5.4 Limitation of knowledge-based strategy ...... 86
5.6 Design of the Internet Retrieval (IR) Strategy ...... 87
5.6.1 Objective ...... 87
5.6.2 Summary of working of Internet retrieval strategy ...... 87
5.6.3 Detailed process of generating a response ...... 88
5.6.3.1 Question analyser ...... 88
5.6.3.2 Answer analyser ...... 91
5.6.4 Limitation of Internet retrieval strategy ...... 92
5.7 Design of Generative-based Strategy ...... 92
5.7.1 Objective ...... 92
5.7.2 Summary of the working of the generative-based strategy ...... 92
5.7.3 Detailed process of generating a response ...... 93
5.7.4 Limitation of generative-based strategy ...... 97
5.8 Conclusion ...... 97
CHAPTER 6 ...... 98
6.1 Introduction ...... 98
6.2 Key Terms ...... 99
6.3 Natural Language Processing (NLP) Tasks Performed in the LUU of IntelliBot ...... 99
6.3.1 Lowercase conversion ...... 100
6.3.2 Tokenization ...... 101
6.3.3 Abbreviation determination ...... 102
6.3.3.1 Abbreviation recognizer ...... 104
6.3.3.2 Abbreviation extractor ...... 105
6.3.3.3 Definition finder ...... 105
6.3.3.4 Abbreviation matcher ...... 105
6.3.4 POS tagging using HMM ...... 106
6.3.5 Grammar check and correction ...... 107
6.3.5.1 Classification of grammatical errors ...... 108
6.3.5.2 Process in GEC to detect and correct errors ...... 110
6.3.5.2.1 Text classification to detect errors ...... 112
6.3.5.2.2 Text transformation to correct errors ...... 114
6.3.6 Removing stopwords ...... 116
6.3.7 Lemmatization ...... 117
6.3.8 Entity extraction ...... 117
6.3.9 Punctuation removal ...... 120
6.4 Computing Semantic Similarity of a Possible Answer with the User Question ...... 121
6.4.1 Detail of determining the semantic similarity at the word level ...... 123
6.4.1.1 Identifying words and POS tagging ...... 124
6.4.1.2 Find word sense disambiguation ...... 125
6.4.1.3 Calculate the shortest path between two synsets ...... 125
6.4.1.4 Hierarchical distribution of words ...... 127
6.4.1.5 Measuring the similarity between the two vectors ...... 128
6.4.2 Detail of semantic similarity at the sentence level ...... 128
6.5 Process of sentence scoring at RAU ...... 131
6.6 Conclusion ...... 132
CHAPTER 7 ...... 133
7.1 Introduction ...... 133
7.2 Key Terms ...... 134
7.3 Process of collecting the data required for each response generation strategy ...... 134
7.4 Process of training the generative-based strategy for response generation of IntelliBot ...... 138
1) Prepare Data ...... 139
2) Extract features ...... 139
3) Design of neural networks for training ...... 139
4) Setup training environment ...... 139
5) Training the RNN to generate a response ...... 139
7.5 Data preparation ...... 140
7.5.1 Data cleansing ...... 140
7.5.2 Removal of duplicate data ...... 144
7.6 Feature Engineering to Extract Features and Use them for Training at DBRNN ...... 144
7.6.1 Extracting features required to train IntelliBot ...... 146
7.6.1.1 Character-level layer ...... 147
7.6.1.2 Highway layer ...... 148
7.6.1.3 Word-level layer ...... 148
7.6.1.4 CRF layer ...... 148
7.7 Design Neural Networks ...... 149
7.7.1 Input standardization ...... 149
7.7.2 Determine neuron and neural network layers ...... 151
7.7.3 Determine the activation function for each layer ...... 153
7.7.4 Identify values of weights initialization ...... 155
7.7.5 Adding bias ...... 157
7.7.6 Word embeddings ...... 159
7.7.7 Batch normalization ...... 163
7.8 Training environment ...... 165
7.8.1 Phase 1: Training on the Cornell dialogue dataset ...... 166
7.8.2 Phase 2: Training on insurance domain dataset ...... 166
7.8.3 Phase 3: Training on particular words ...... 167
7.9 Training of IntelliBot using DBRNN ...... 168
7.9.1 Forward propagation ...... 168
7.9.1.1 Input layer ...... 169
7.9.1.2 Hidden layer ...... 169
7.9.1.3 Output layer ...... 171
7.9.2 Backward propagation ...... 172
7.9.2.1 Calculate the total error in the output layer ...... 174
7.9.2.2 Check whether error is minimized (Iterate until converged) ...... 175
7.9.2.3 Update parameters ...... 175
7.9.3 Stochastic gradient descent (SGD) ...... 176
7.9.4 Attention mechanism in training ...... 177
7.9.4.1 Global attention model ...... 177
7.9.4.2 Local attention model ...... 178
7.10 Conclusion ...... 179
CHAPTER 8 ...... 180
8.1 Overview ...... 180
8.2 Process of Evaluating IntelliBot’s Output Against the Requirements and the Outputs of the Other Chatbots ...... 181
8.3 Tools and Techniques Used to Develop the IntelliBot Prototype ...... 182
8.4 Different Categories of Questions for Chatbot Evaluation ...... 185
8.5 High-level Overview of the Three Existing Chatbots Used in the Experiment for Comparison with IntelliBot ...... 188
8.6 Output of RootyAI, ChatterBot, DeepQA and IntelliBot on the Considered Questions ...... 189
8.7 Evaluate Engagement with the User in Relation to the Responses Generated by the Chatbots ...... 200
8.7.1 Expert judgment ...... 200
8.7.2 Measuring cohen’s kappa co-efficient to ensure agreement between the experts ...... 203
8.8 Demonstrating IntelliBot’s Ability to Correct Grammatical Errors in the Questions before Generating a Meaningful Response ...... 204
8.9 Exploratory Test (ET) ...... 212
8.9.1 Strategy of conducting an exploratory test ...... 213
8.10 Conclusion ...... 217
CHAPTER 9 ...... 219
9.1 Recapitulation of the Thesis ...... 219
9.2 Contributions of the Thesis ...... 220
9.2.1 Contribution 1: Develops a modular-based framework for generating appropriate responses to user queries ...... 221
9.2.2 Contribution 2: Develops different response generation strategies that can answer a user’s question according to its complexity ...... 221
9.2.3 Contribution 3: Develops the detailed working of the different sub-components of IntelliBot that assist it to process and understand the user’s input ...... 222
9.2.4 Contribution 4: Develops an approach to collect insurance domain-specific data required to train IntelliBot ...... 222
9.2.5 Contribution 5: Compares and validates the outputs of IntelliBot with three existing chatbots to demonstrate IntelliBot’s accuracy and superiority in engaging with the users while answering their questions ...... 222
9.3 Future Work Arising from this Thesis ...... 223
9.3.1 Future improvement for the chatbot to be domain independent ...... 223
9.3.2 Future improvement in response generation techniques ...... 223
9.3.3 Future improvement in domain-oriented dataset ...... 224
9.3.4 Future improvement in the neural network model ...... 224
9.3.5 Future improvement in evaluation approach ...... 224
9.3.6 Future improvement in correcting grammatical errors and identifying abbreviations ...... 225
9.3.7 Future improvement in unsupervised and self-learning capability ...... 225
9.3.8 Future improvement in speech chatbots ...... 226
REFERENCES ...... 227
APPENDIX A ...... 237
APPENDIX B ...... 241
APPENDIX C ...... 246
List of Tables
Table 2.1 Template-based models – description with issues and impacts ...... 16
Table 2.2 Retrieval-based models – description with issues and impacts ...... 18
Table 2.3 Search engine models – description with issues and impacts ...... 19
Table 2.4 Generative-based models – description with issues and impacts ...... 20
Table 2.5 Features and drawbacks of existing chatbots ...... 28
Table 2.6 Summary of the TF-IDF approaches to generate a response with their description and issues ...... 32
Table 2.7 Summary of the end-to-end approaches to generate a response with their description and issues ...... 33
Table 2.8 Summary of the RNN approaches with seq2seq to generate a response with their description and issues ...... 37
Table 2.9 Summary of RNN approaches with memory networks to generate a response with their description and issues ...... 39
Table 5.1 Template-based pattern matching ...... 76
Table 5.2 Corresponding question types and event elements ...... 80
Table 5.3 POS and entity dependency relationship of the user question ...... 82
Table 5.4 Example of similar meaning (senses) of a word ...... 83
Table 6.1 Example of lowercase conversion ...... 100
Table 6.2 List of abbreviations in full form ...... 103
Table 6.3 List of abbreviations ...... 103
Table 6.4 Abbreviation categorization ...... 105
Table 6.5 Confidence score of responses ...... 122
Table 6.6 Synsets of words ...... 125
Table 6.7 Similarity of answers relevant to the question ...... 131
Table 7.1 Statistics for the insurance domain-specific QA dataset ...... 135
Table 7.2 Statistics for the Cornell movie corpus ...... 137
Table 7.3 Sample raw data from Cornell movie corpus ...... 137
Table 7.4 Sample raw data from Cornell movie corpus ...... 137
Table 7.5 Statistics for the vocabulary dataset used to build IntelliBot ...... 138
Table 7.6 List of features used in experiments ...... 145
Table 7.7 Example of lexical feature extraction ...... 146
Table 7.8 List of tokens to fill the input sequence ...... 150
Table 7.9 Filling the input sequence in bucket size of (5,10) ...... 150
Table 7.10 List of activation functions of neural networks ...... 154
Table 7.11 Importance of appropriate weight initialization ...... 155
Table 7.12 Training system's specification ...... 165
Table 7.13 Summary of training parameters’ specification ...... 167
Table 7.14 Vector representation of x ...... 169
Table 8.1 List of hardware used in developing IntelliBot ...... 182
Table 8.2 List of software used in developing IntelliBot ...... 182
Table 8.3 List of library packages installed in the Python environment ...... 183
Table 8.4 Questions in the greetings category ...... 186
Table 8.5 Questions in the asking for assistance category ...... 186
Table 8.6 Questions in the asking for time & date category ...... 186
Table 8.7 Questions in the general category ...... 187
Table 8.8 Questions in the arithmetic problem-solving category ...... 187
Table 8.9 Questions in the domain-specific category ...... 187
Table 8.10 Questions in ending the chat session category ...... 188
Table 8.11 User questions and the response received from each chatbot ...... 190
Table 8.12 Confusion matrix from the results of each chatbot ...... 197
Table 8.13 Precision, Recall, and F1 Score ...... 198
Table 8.14 Example of rating used by an expert to score the answer of each chatbot ...... 202
Table 8.15 Statistics of the general conversation rating ...... 202
Table 8.16 Statistics of the domain-specific conversation rating ...... 203
Table 8.17 Expert’s agreement ...... 203
Table 8.18 Cohen kappa co-efficient value for each chatbot ...... 204
Table 8.19 Confusion matrix from the results of each chatbot ...... 205
Table 8.20 Precision, Recall, and F1 Score ...... 205
Table 8.21 Error responses from the three existing chatbots and IntelliBot ...... 206
Table 8.22 Generating meaningful responses and engaging the user in conversation ...... 214
Table 8.23 Detecting and correcting grammatical errors based on user confirmation ...... 215
Table 8.24 Multiple strategy selection for generating a response ...... 216
Table 8.25 Validate the IntelliBot prototype ...... 216
List of Figures
Fig. 1.1 Adoption of chatbots in different industries ...... 2
Fig. 2.1 Taxonomy of chatbot classification according to the requirements ...... 13
Fig. 2.2 Classification of response generated-based models ...... 15
Fig. 3.1 Research methodology adopted in this thesis to solve the research problem ...... 50
Fig. 4.1 Methodological approach ...... 55
Fig. 4.2 Components required for building a chatbot application ...... 56
Fig. 4.3 Components required in a response-generating chatbot application ...... 57
Fig. 4.4 Conceptual framework of IntelliBot ...... 60
Fig. 4.5 Mobile and web interface of IntelliBot ...... 61
Fig. 4.6 Neural Dialogue Manager (NDM) of IntelliBot ...... 62
Fig. 4.7 Selection policy of AI conversational strategies ...... 65
Fig. 4.8 High-level workflow of template-based strategy ...... 65
Fig. 4.9 High-level workflow of knowledge-based strategy ...... 66
Fig. 4.10 High-level workflow of Internet retrieval strategy ...... 67
Fig. 4.11 High-level workflow of generative-based strategy ...... 67
Fig. 5.1 Conversational strategy selection in SSU ...... 74
Fig. 5.2 Design of the template-based strategy ...... 76
Fig. 5.3 Basic building block of AIML code snippet ...... 77
Fig. 5.4 Recursion of AIML code snippet ...... 78
Fig. 5.5 Memorizing previous conversation of AIML code snippet ...... 79
Fig. 5.6 Design of the knowledge-based strategy ...... 82
Fig. 5.7 Semantic graph and entity dependency of user question ...... 83
Fig. 5.8 Code snapshot of KB query formation ...... 84
Fig. 5.9 Code snapshot of percentage of matching words ...... 85
Fig. 5.10 Design of Internet retrieval strategy ...... 87
Fig. 5.11 Semantic graph and entity dependency of the question ...... 88
Fig. 5.12 Code snippet of web crawling ...... 89
Fig. 5.13 Information extraction process from the web using a web crawler ...... 89
Fig. 5.14 Traverse child node to obtain expected question and answer ...... 90
Fig. 5.15 HTML code snapshot ...... 91
Fig. 5.16 Design of the generative-based strategy ...... 93
Fig. 5.17 Architecture of the DBRNN seq2seq model ...... 94
Fig. 5.18 Visual representation of input to output ...... 96
Fig. 6.1 NLP tasks performed in the LUU of IntelliBot ...... 100
Fig. 6.2 Code snippet of lowercase conversion ...... 101
Fig. 6.3 Code snippet of tokenization ...... 102
Fig. 6.4 Workflow of abbreviation recognition and extraction ...... 104
Fig. 6.5 Part-of-speech tagging of a sentence ...... 107
Fig. 6.6 Classification of grammatical errors ...... 108
Fig. 6.7 Process of grammar checking ...... 111
Fig. 6.8 Working of the text classification & error detection phase ...... 113
Fig. 6.9 Working of the text classification & error correction phase ...... 115
Fig. 6.10 Code snippet of stopwords ...... 116
Fig. 6.11 Code snippet of lemmatization ...... 117
Fig. 6.12 Process of entity extraction ...... 118
Fig. 6.13 POS tagging for both sentences ...... 118
Fig. 6.14 Entity recognition for both sentences ...... 119
Fig. 6.15 Coreference resolution ...... 119
Fig. 6.16 POS tagging with entity dependency relationship ...... 120
Fig. 6.17 Named entity recognition ...... 120
Fig. 6.18 Code snippet of removing punctuation ...... 121
Fig. 6.19 Various senses of a word ...... 122
Fig. 6.20 Semantic similarity determined at the sentence and word levels in the four response generation strategies ...... 123
Fig. 6.21 Semantic similarity at the word level ...... 124
Fig. 6.22 Hierarchical structure graph (subset of WordNet) ...... 126
Fig. 6.23 Hierarchical distribution of words ...... 127
Fig. 6.24 Semantic similarity at the sentence level ...... 129
Fig. 6.25 Two sets with Jaccard similarity 7/13 ...... 131
Fig. 7.1 Data collection procedures for the four strategies ...... 136
Fig. 7.2 Process of training generative-based strategy (RNN) for response generation ...... 138
Fig. 7.3 Data cleansing process flow for the Cornell movie dialogue corpus ...... 141
Fig. 7.4 Code snippet of data cleansing and saved data ...... 142
Fig. 7.5 Cornell dialogue dataset ...... 143
Fig. 7.6 Histogram distribution of the Cornell dataset ...... 143
Fig. 7.7 Exploratory data analysis of the Cornell movie dialogue dataset ...... 144
Fig. 7.8 Designing neural networks of IntelliBot ...... 149
Fig. 7.9 Single neuron connection ...... 151
Fig. 7.10 Architecture of neural networks ...... 152
Fig. 7.11 Activation function in a neuron ...... 153
Fig. 7.12 Code snippet of appropriate weight initialization ...... 156
Fig. 7.13 Parameter initialization with appropriate values ...... 157
Fig. 7.14 Representation of bias in the layer ...... 157
Fig. 7.15 Effect of bias neuron ...... 158
Fig. 7.16 Effect of bias neuron ...... 159
Fig. 7.17 Vector representation (on left) and cosine distances of university (on right) ...... 161
Fig. 7.18 Window and process for computing P(wt+j | ct) ...... 161
Fig. 7.19 Window and process for computing P(wt+j | ct) ...... 162
Fig. 7.20 Code snippet of CBoW model ...... 163
Fig. 7.21 Training phases of IntelliBot ...... 165
Fig. 7.22 Training process of IntelliBot ...... 168
Fig. 7.23 Process of forward propagation ...... 168
Fig. 7.24 Hidden vector for the word "how" ...... 170
Fig. 7.25 Hidden vector for the word "are" ...... 170
Fig. 7.26 Hidden vector for the word "you" ...... 171
Fig. 7.27 Final output ...... 171
Fig. 7.28 Final output from forward propagation ...... 172
Fig. 7.29 Example of a wrong prediction produced by the RNN ...... 173
Fig. 7.30 Visualization of the effect of the loss function ...... 173
Fig. 7.31 Gradient flow ...... 174
Fig. 7.32 Process of backward propagation ...... 174
Fig. 8.1 Steps in chatbot evaluation ...... 181 Fig. 8.2 The working of IntelliBot on a desktop application ...... 184 Fig. 8.3 The working of IntelliBot on a mobile device ...... 185 Fig. 8.4 Strategy selection ratio used by IntelliBot to give an answer to the user’s questions ...... 197 Fig. 8.5 F1 Scores of the four chatbots in all question categories ...... 198 Fig. 8.6 Scores of the four chatbots categorised according to domain-specific and conversational questions ...... 199 Fig. 8.7 Chatbot evaluation steps ...... 200 Fig. 8.8 F1 Scores of the four chatbots when there is an error in the question ...... 211 Fig. 8.9 GUIs showing how IntelliBot corrects errors in questions before generating a meaningful response ...... 212 Fig. 8.10 Steps of exploratory test ...... 213
List of Abbreviations
AHRE Attentive Hierarchical Recurrent Encoder
AI Artificial Intelligence
AIML Artificial Intelligence Markup Language
ALICE Artificial Linguistic Internet Computer Entity
ANN Artificial Neural Networks
ASR Automatic Speech Recognition
ASCII American Standard Code for Information Interchange
AMEX American Express
AP Averaged Perceptron
AWS Amazon Web Services
BLEU Bilingual Evaluation Understudy
BoW Bag-of-Words
BRNN Bidirectional Recurrent Neural Networks
CNN Convolutional Neural Networks
COVID-19 Coronavirus Disease 2019
CPU Central Processing Unit
CRF Conditional Random Field
CRM Customer Relationship Management
CSR Customer Service Representative
CTU Context Tracking Unit
CBoW Continuous Bag-of-Words
CUDA Compute Unified Device Architecture
DL Deep Learning
DNN Deep Neural Networks
DBRNN Deep Bidirectional Recurrent Neural Networks
DOM Document Object Model
DST Dialogue State Tracker
ESIM Enhanced Sequential Inference Model
FAQ Frequently Asked Question
EOS End of Sentence
ET Exploratory Test
GEC Grammar Error Checker
GPU Graphics Processing Unit
GRU Gated Recurrent Unit
GUI Graphical User Interface
HMM Hidden Markov Model
HTTP HyperText Transfer Protocol
HTML HyperText Markup Language
IDF Inverse Document Frequency
IR Information Retrieval
IBM International Business Machines
IDE Integrated Development Environment
IPU Input Processing Unit
KB Knowledge-based
KBDB Knowledge-based Database
LSTM Long Short-Term Memory
LUU Language Understanding Unit
LUIS Language Understanding Intelligent Service
ML Machine Learning
MLE Maximum Likelihood Estimation
MEMM Maximum Entropy Markov Model
MIT Massachusetts Institute of Technology
MSE Mean Squared Error
NDM Neural Dialogue Manager
NER Named Entity Recognition
NLP Natural Language Processing
NLU Natural Language Understanding
NLTK Natural Language Toolkit
NN Neural Networks
NMT Neural Machine Translation
OOV Out-of-Vocabulary
PCFG Probabilistic Context-Free Grammar
PDS Product Disclosure Statement
POS Part-of-Speech
QA Question Answering
RNN Recurrent Neural Networks
RGC Response Generation Component
RGU Response Generation Unit
RAU Response Analyser Unit
SGD Stochastic Gradient Descent
SMF Sequential Matching Framework
SMT Statistical Machine Translation
SSU Strategy Selection Unit
SP Shortest Path
SVM Support Vector Machine
TF Term Frequency
TF-IDF Term Frequency–Inverse Document Frequency
TTS Text-to-Speech
UIMA Unstructured Information Management Architecture
UNSW University of New South Wales
WSD Word Sense Disambiguation
XML Extensible Markup Language
CHAPTER 1
“Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth.” — Marcus Aurelius
INTRODUCTION
Nearly 75% of customers have experienced poor customer service [1-3]. The generation of meaningful, long and informative responses is a challenging task.
1.1 Chatbots and Their Evolution to Answer a Customer’s Queries

For a customer-focused service industry organisation, being connected with its customers and answering their queries is an essential factor for success. Due to the rise of digital innovation, Internet-based communication services now play a central role in how an organisation maintains communication with its users. The importance of and need for this medium have been proven during the current unprecedented times of the novel coronavirus (COVID-19), when social distancing is a mandatory requirement. In such times, customer-focused organisations in the retail, business and education industries need to come up with innovative measures that enable them to remain in contact with their customers while adhering to the new requirements of social distancing. Researchers in AI are developing one such measure, namely the chatbot.
A chatbot is conversational software designed to emulate the communication capabilities of a human being and to interact automatically with a user. It represents a new, modern form of customer assistance powered by artificial intelligence via a chat interface. Chatbots are based on AI techniques that understand natural language, identify meaning and emotion, and are designed to generate meaningful responses. To businesses, they provide an improved way of connecting with customers and increasing customer satisfaction. To customers, they provide a better and more convenient way of having their questions answered without waiting on the phone or sending emails. Chatbots can reduce the number of customer calls, the average handling time and the cost of customer care. Alan Turing first conceptualised chatbots in the 1950s [1] by asking “Can machines think?”. Since then, the combined fields of Natural Language Processing (NLP) and Machine Learning (ML) have been used to develop and realise chatbots. In 1966, Weizenbaum [2] developed the first chatbot, named “ELIZA”, which was able to identify the keywords of a given input sentence and pattern-match them against a set of predefined rules to generate responses. Since then, significant progress has been made in the development of intelligent chatbots. Hence, as shown in Figure 1.1, it is not surprising to see the widespread adoption of chatbots in many different areas of business in which humans communicate to obtain answers to their queries.

1 Parts of this chapter have been published in [8] and [20].
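The keyword-and-rule approach pioneered by ELIZA can be illustrated with a minimal sketch; the patterns and response templates below are hypothetical stand-ins for this illustration, not Weizenbaum’s actual script:

```python
import re

# Hypothetical rules in the spirit of ELIZA's script: each regular expression
# captures a keyword structure in the user's input, and the matched fragment
# is substituted into a canned response template.
RULES = [
    (re.compile(r"\bI need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."

def respond(user_input: str) -> str:
    """Return the response for the first rule whose pattern matches,
    or a generic fallback when no keyword is recognised."""
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(*match.groups())
    return FALLBACK

print(respond("I am sad"))     # -> How long have you been sad?
print(respond("Hello there"))  # -> Please go on.
```

The limitation noted throughout this chapter is visible directly in the sketch: any input outside the predefined patterns collapses to the same fallback, which is why rule-based chatbots have a low degree of engagement.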
Fig. 1.1 Adoption of chatbots in different industries (Past, Present and Future use)
The chatbots that have been applied in the different areas of business can be categorised according to their style of working [3] as follows:
• Question answering bots are knowledge-based chatbots that answer users’ queries by analysing the underlying information collected from various sources like Wiki,
DailyMail [4], Allen AI science and Quiz Bowl [5, 6]. Examples of areas in which such chatbots have been applied are the Wall Street Journal, CNN, and E-commerce.
• Task-oriented bots are goal-based chatbots that assist in achieving a certain task or attempt to solve a specific problem [7] such as a flight booking, hotel reservation etc. Examples of areas in which such chatbots have been applied are flight centres, hotel bookings and restaurant ordering and booking management.
• Social bots are chatbots that communicate with other users and make recommendations to them [8], for example, Microsoft Xiaoice [9], Replika [10], Google [11], Facebook [12], Skype [13] and Telegram [14]. Although social bots can communicate autonomously, they are used to perform very specific and basic tasks, and they have drawbacks: they cannot answer complex queries and have a low degree of engagement with users.
• Service bots are chatbots which have been developed and are used by a business to answer users’ queries with a specific goal or to focus on the completion of certain tasks requested by their customers. Such chatbots are domain-specific and may use a combination of question answering and task-oriented approaches to generate a response. A key requirement for service-based chatbots is that they should not only answer the user’s question but also engage in a conversation with the user. So, such chatbots need to be designed with handcrafted rules that answer simple and predefined questions and also have the ability to answer complex user queries.
The focus of this thesis is on service-based chatbots which are used in businesses to answer user queries in an automated manner. However, the existing service-based chatbots have shortcomings which are discussed in the next section.
1.2 Shortcomings of Existing Chatbots to Answer Customer Questions in a Service Industry
Even though chatbots are able to communicate autonomously, they are only able to answer simple and predefined questions and have a low degree of engagement with the user. This
was starting to create issues in the service industry, as customers not only wanted their questions answered, they also wanted to be engaged in a conversation in a similar way to when speaking to a customer service representative (CSR). According to a Drift report [15], 75% of customers experience problems with traditional online communication channels when dealing with a business. The report [15] also indicated that this leads to further flow-on effects, as 91% of unhappy customers will not engage with the business again [16]. This results in customer dissatisfaction, and a negative experience is conveyed to other customers, thereby adversely impacting the business [17]. While chatbots exist in the service industry, their current drawback is that they do not engage with users and thus do not provide an experience similar to that which customers have when dealing with a CSR.
• This is emphasised in a report which states that emotionless chatbots are taking over the task of responding to customer queries in companies such as Pizza Express, Lufthansa and Uber, which is bad news from the perspective of customer service [18]. Chatbots are referred to as emotionless due to their inability to understand difficult user questions and their failure to detect user emotions and respond appropriately. Thus, domain-specific chatbots, as opposed to social bots, need to have the capability to capture particular characteristics from users’ questions before generating an appropriate response. While modern social bots such as Google’s Assistant, Siri, Alexa, Samsung’s Bixby, Cortana and Echo [19] utilise modern architectures, retrieval processes and advanced ML techniques, they do not perform well on domain-specific topics and hence cannot be applied to specific domains.
• Furthermore, the majority of existing chatbots only answer the user’s query but do not ‘engage’ in a conversation with them while doing so [20]. To explain the difference, consider the question ‘What day is today?’. Two possible answers are ‘Tuesday’ and ‘Today is Tuesday, 31st December 2019. Your next appointment is in 13 minutes’. Both responses answer the question; however, the second one is more detailed and engages with the user more than the first one. Another way for chatbots to engage with users is to show empathy in their responses rather than merely using their conditional response library [20]. For example, in response to the user’s query, ‘I am not feeling well’ or ‘I am sad’, a chatbot using its conditional response library would simply say ‘How can I help you?’
in response to both questions. But a human would reply ‘How can I help you? Do you need medical help?’ and ‘I am sorry to hear that. Why are you sad?’ respectively. The human response shows the presence of empathy and therefore relates more to the user. This is a feature which chatbots should be able to replicate in their responses. The literature finds that customer support chatbots should not respond in a way that is too serious and transactional, as this will not inspire continued use [20]. So, a service-based chatbot needs to keep customers engaged and have dialogue abilities rather than merely providing a yes/no or other short response.
For domain-specific chatbots in the service industry, the ability to engage with users in domain-specific terminology is a key requirement for answering user queries effectively. To achieve this, domain-specific chatbots need to have conversational capabilities and the ability to understand users’ questions thoroughly before providing a semantically correct, meaningful response [21]. The objective of this thesis is to address these drawbacks in service-based chatbots by designing and implementing an AI chatbot application for the insurance industry. The motivation for choosing the insurance industry as the area of application is explained in the next sub-section.
1.2.1 The motivation of choosing the insurance industry as the area of application
Customer satisfaction with a company’s services is often seen as the key to its success and long-term competitiveness. The insurance industry, particularly credit card insurance, is attracting a lot of attention at present as customers all over the world use credit cards frequently. Credit card insurance is a competitive market, so from a card provider’s perspective, a strong marketing strategy and the provision of the right customer support are vital [22]. A credit card’s inclusions are confusing and complex, and in a world dominated by cashless payments, consumers are using credit cards at an ever-growing rate. Most credit cards offer their consumers some form of embedded complimentary insurance products. Consumers are often not aware of these products, and the type of language used to explain them makes it difficult for consumers to understand the inclusions and benefits. For example, the majority of cards and accounts include complimentary travel insurance; however, customers are often not aware of the details regarding what the cover includes, whether the cover extends to family or travelling companions, how the cover is activated and who to call when they need help or need to make a claim. In addition, insurance personnel require
reference materials, policies and procedures to answer such questions. It is challenging for customers to obtain the information they need, as they have to sift through large documents to find the answer. As a result, the best way to get help quickly is to talk to technical support or sales support teams, even for answers to FAQs or basic “how-to” questions. This overloads call centres, resulting in long wait times as it takes a long time to process a single request. Consequently, the customer experience is poor and customers become dissatisfied, which reduces throughput and business performance drastically. Research shows that approximately 75% of customers have experienced poor customer service [23-25].
Having a chatbot functionality integrated into a technology platform that allows the entire credit card insurance ecosystem to be modelled with artificial intelligence (AI), simulating scenarios of different economic, market and individual conditions, is thus needed. Hence, there is an ever-increasing demand for improved AI capabilities so that chatbots can interact with customers in relation to advice on benefits, insurance coverage and claims processes. Another advantage of chatbots is that they remove human factors and provide a 24-hour service. This enables the customer to obtain advice on the most appropriate course of action and receive information on the benefits embedded in a credit card, the level of coverage and the insurance claims process at any time, without needing the involvement of a CSR or waiting in a queue. This will allow customers to learn about credit card insurance coverage, and they will have peace of mind knowing that they have independent experts looking after them. Furthermore, the card provider’s revenue will increase, their costs will be reduced and customer satisfaction will improve.
1.3 Objectives of the Thesis
The objective of the thesis is to develop a domain-specific, dialogue-based, response-generating and user-oriented chatbot that can assist an insurance business to respond to users’ questions. The proposed system is termed IntelliBot, which stands for Intelligent Strategy-based Dialogue Chatbot System, and is an AI-based chatbot application system that is able to automate the entire business process by generating a response to the user’s question. For this, the chatbot needs to understand user inputs and thus should have natural language processing (NLP) abilities in order to generate an appropriate response to the user’s questions using deep neural networks, while at the same time ensuring that customers are kept engaged.
1.4 Research Questions to Achieve the Research Objectives
To accomplish the objective of the thesis, the following research questions need to be explored:
i. How does a chatbot work, and what components are required to build an advanced AI chatbot application system to answer user queries in the insurance industry? To answer this question, the aim is to study how existing chatbot applications in the service industry work and to identify their drawbacks and the features required to build an AI chatbot. This question will be answered by studying the literature in the area of deep learning, specifically deep neural networks (DNN) and bidirectional recurrent neural networks (BRNN).
ii. How can various response generation strategies be designed and developed to generate an appropriate response to user queries? User questions will be of different levels of complexity. Some questions may be standard and repetitive, while other questions may be complex and may require the chatbot application to synthesise knowledge from the underlying information. To achieve this, different response generation strategies need to be chosen according to the complexity of the question to be answered, which not only generate a semantically correct and meaningful response but also keep the user engaged. In answering this question, this thesis develops four response generation strategies to answer the user’s question, each of which differs in the way it generates responses. These strategies will be studied and developed under this research question.
iii. How can a deep bidirectional recurrent neural network (DBRNN) model be developed and trained so that it can understand human natural language and generate appropriate responses? The ability of a chatbot to generate a response to a user’s question which is semantically and grammatically correct and that is not predefined in a template is one of the main goals of this thesis. To answer this question, an AI chatbot application system needs to be designed and trained, as it is not possible to pre-define a template for every possible question that a user can ask. So, a DBRNN model that can
understand the user’s question and generate an appropriate response needs to be investigated to answer this research question.
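The core idea behind a bidirectional encoder — reading the input sequence both left-to-right and right-to-left and concatenating the two hidden states at each position, so that every word is represented with both its left and right context — can be sketched in a few lines of NumPy. The dimensions, random weights and inputs below are arbitrary illustrative values, not the trained IntelliBot model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over a sequence and return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

def bidirectional_encode(inputs, params_fwd, params_bwd):
    """Concatenate forward and backward hidden states per time step,
    so each position sees both left and right context."""
    fwd = rnn_pass(inputs, *params_fwd)
    bwd = rnn_pass(inputs[::-1], *params_bwd)[::-1]  # re-align to input order
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

embed_dim, hidden_dim, seq_len = 8, 16, 5
inputs = [rng.standard_normal(embed_dim) for _ in range(seq_len)]
make_params = lambda: (rng.standard_normal((hidden_dim, embed_dim)) * 0.1,
                       rng.standard_normal((hidden_dim, hidden_dim)) * 0.1,
                       np.zeros(hidden_dim))
encoded = bidirectional_encode(inputs, make_params(), make_params())
print(len(encoded), encoded[0].shape)  # one 2*hidden_dim vector per time step
```

A "deep" BRNN stacks several such layers, feeding each layer's concatenated states into the next; the seq2seq decoder then attends over these encoder states when generating the response.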
iv. How can the developed AI model be evaluated to determine whether it gives an acceptable response, and how can this be validated? This question concerns evaluating the developed AI chatbot application system to determine its accuracy in responding to the questions it is asked. The quality of the generated response needs to be assessed against different factors, such as keeping the user engaged and answering a question with a response that is grammatically and semantically correct. Furthermore, the generated responses need to be evaluated against those of other existing chatbots to show their superiority, using the F1 score and Cohen’s kappa metrics.
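As a rough illustration, the two metrics named above can be computed from first principles: F1 balances precision and recall of the chatbot’s responses against a gold standard, while Cohen’s kappa measures agreement between two raters corrected for chance. The labels below are made up for this sketch; the actual evaluation data and procedure are described in Chapter 8:

```python
from collections import Counter

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cohens_kappa(rater_a, rater_b):
    """Agreement between two label sequences corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Illustrative labels: 1 = acceptable response, 0 = unacceptable.
truth = [1, 1, 1, 0, 0, 1, 0, 1]
model = [1, 0, 1, 0, 1, 1, 0, 1]
print(round(f1_score(truth, model), 3))      # -> 0.8
print(round(cohens_kappa(truth, model), 3))  # -> 0.467
```

In practice the same quantities can be obtained from standard libraries; the point of the sketch is only to make the two evaluation criteria concrete.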
1.5 Contributions of the Thesis
This thesis contributes to the literature on AI-based chatbots in the service industry, specifically in the insurance industry, as follows:
• It introduces four response generation strategies that can be used to generate a response and explains how IntelliBot selects each strategy to generate a response.
• It proposes a scalable and flexible conceptual framework for IntelliBot which can converse with the user in a meaningful way in the insurance domain and keep them engaged.
• It designs a data pipeline process for processing data and forming appropriate QA pairs from the Cornell movie dialogue dataset and the insurance domain-specific dataset. The QA pairs are then used to create the training and testing datasets for IntelliBot to generate a response when there is no pre-defined model for a question asked by a user.
• It develops a prototype model of IntelliBot and compares its performance with existing chatbots to show its superiority in different aspects, such as generating a grammatically correct response which engages the user.
• It creates an insurance domain-related QA dataset that can be used for future experiments.
1.6 Significance of the Thesis
The significance of the thesis is that it builds a neural dialogue manager (NDM) that incorporates an AI chatbot interface (IntelliBot). The developed chatbot is significant for the following two reasons:
• It has the ability to be used in customer service sectors to answer customers’ queries in a timely manner.
• It provides training datasets that can be used for further studies or extensive experiments to verify the effectiveness of the system’s performance.
1.7 Scope of the Thesis
To solve the problem in this thesis, the scope of the research is limited to the following factors:
• Identify the components required to build a domain-specific chatbot application to answer user queries.
• Design a methodological approach for building IntelliBot’s framework.
• Design four strategy selection units which can generate responses to user queries according to their complexity.
• Collect domain-specific data from the product disclosure statement (PDS) and basic conversational data from the Cornell movie dialogue corpus.
• Incorporate NLP tasks and grammar checks into the IntelliBot framework.
• Train IntelliBot using a DBRNN in a seq2seq model with an attention mechanism to generate responses for questions for which there is no defined template.
• Evaluate IntelliBot’s generated responses to user queries against three publicly available chatbots.
1.8 Structure of the Thesis
The remainder of this thesis is structured as follows:
• Chapter 2 presents an extensive literature study in the area of chatbots and neural networks. It provides an overview of various types of chatbots, evaluates whether they are suitable for user–bot conversation in an industry domain and summarises their drawbacks. It also discusses the techniques behind existing QA applications that are relevant to the thesis’s problem.
• Chapter 3 explains in detail the problem which this thesis addresses. It also identifies the various research questions that need to be addressed to achieve the thesis’s objective. The research methodology adopted in this thesis is also explained.
• Chapter 4 presents an overview of IntelliBot’s proposed framework. It details the requirements for a domain-specific chatbot to answer user questions. It then presents a methodological approach for building IntelliBot’s framework; the development of a prototype is described in the subsequent chapters.
• Chapter 5 explains the four strategies, namely the template-based, knowledge-based, internet-based and generative-based strategies, which IntelliBot uses to generate a response. It also explains how IntelliBot selects each strategy and describes the working process of each strategy in detail.
• Chapter 6 defines the key terms required to understand the working of IntelliBot’s Language Understanding Unit (LUU) and explains the various NLP tasks performed by IntelliBot. The process of measuring semantic similarity at the word level and sentence level is explained, which is used to check whether the generated response matches the user’s question.
• Chapter 7 details the process of how the domain-specific data, which in the context of this thesis is insurance-related, is collected and curated from various sources. This chapter also explains the design of the DBRNN and the process of training it, which is necessary for the generative-based response generation strategy.
• Chapter 8 explains how the performance of IntelliBot is validated and compared against other chatbots in the literature. The other chatbots are discussed, and the quality of the generated responses of each chatbot is measured using the F1 score and Cohen’s kappa. The experimental results show that IntelliBot outperforms the three other chatbots in relation to the different factors required for a domain-specific chatbot.
• Chapter 9 concludes the thesis by providing an overview of the proposed solution to the problem discussed in this thesis. It also introduces areas for future work arising from this thesis.
CHAPTER 2
“Exploratory research is really like working in a fog. You don’t know where you’re going. You’re just groping. Then people learn about it afterwards and think how straightforward it was.” — Francis Crick, co-discoverer of the structure of DNA
LITERATURE REVIEW
2.1 Overview
This chapter conducts a systematic review of the existing chatbots and identifies their shortcomings with respect to the requirements of a domain-oriented chatbot to answer customers’ queries in the service industry. The chapter is structured as follows. Section 2.2 illustrates a taxonomy of chatbots according to how they are classified into different groups. Section 2.3 analyses the models used to generate a chatbot response that mimics the human brain. Section 2.4 describes the existing chatbots and their drawbacks in relation to their application in service industries. Section 2.5 explains the different working techniques used in dialogue-based chatbots. Section 2.6 concludes the chapter with a discussion of the existing gaps in the working techniques of dialogue-based chatbots when answering customer questions in the service industry.
2.2 Taxonomy of Chatbots
True to their growth, chatbots have been applied in various industry sectors [26] and have been classified into different groups [27]. For example, the study in [26] analyses the ‘purpose’ of chatbots and classifies them into the four categories of service, commercial, entertainment and advisory. Service chatbots such as Eliza [2] and Alice [28] provide services to customers. For example, a logistics firm uses a chatbot to respond to
2 Parts of this chapter have been published in [8] and [20].
customers’ questions about deliveries and provide copies of dispatch documents through an instant messaging channel rather than emails or phone calls. Commercial chatbots such as Pandorabots and Alexa [29] streamline purchases for customers. For example, they assist customers by answering their questions and help them to place orders. Entertainment chatbots keep customers engaged by discussing sport, the customer’s favourite band or movie, or other events. They also offer customers the option of placing a bet, provide details on upcoming events and give information on ticket deals. Chatbots such as Siri [30] can be classified under the entertainment category, although they also fall into other categories. Advisory chatbots such as Alexa [29] provide suggestions, give recommendations on services and offer support and advice. Other researchers, such as [27], grouped chatbots according to their purpose, classifying them into task-oriented and non-task-oriented chatbots. Task-oriented chatbots help to complete certain tasks through short conversations with customers. For example, applications such as Siri, Google Now and Alexa [30] can provide customers with travel directions, find restaurants and help them to make phone calls or send texts. On the other hand, non-task-oriented chatbots do not perform a task but may converse with customers to answer their questions.
Another recent study classified chatbots as question answering, task-oriented or social bots [3] according to the scope expected of them. Question answering bots are knowledge-based chatbots that answer users’ queries by analysing the underlying information collected from various sources like Wiki, DailyMail [4], Allen AI science and Quiz Bowl [5, 6]. Task-oriented bots are goal-based chatbots that assist in performing a certain task or attempt to solve a specific problem, such as a flight booking or hotel reservation [7]. Social bots communicate with other users and make recommendations to them [8], for example, Microsoft Xiaoice [9] and Replika [10]. When a chatbot is used by a business to answer users’ queries with a specific goal or to focus on the completion of certain tasks requested by its customers, it is referred to as a service-based chatbot. Such chatbots are domain-specific and may use one of the aforementioned approaches to generate a response. A key requirement for service-based chatbots is that they engage with customers, as figures show that 91% of unhappy customers will not engage again with a business [16]. To keep customers engaged, the chatbot needs to have dialogue abilities rather than just providing a yes or no as a response.
Chatbots with dialogue generation ability are classified as either goal-based, knowledge-based, service-based or response-generated-based, as shown in Fig. 2.1. A brief explanation of the models which they use is given in the next sub-sections.
Fig. 2.1 Taxonomy of chatbot classification according to the requirements
2.2.1 Goal-based chatbot
Goal-based chatbots have a primary goal or aim to complete specific tasks. They are designed to have short conversations to obtain the required information from the user to complete a task. For example, a company deploys a chatbot on its website to help answer clients’ questions. For such a chatbot to work, three main capabilities, namely activity-based, conversational and informative capabilities, are needed. Activity-based bots are able to perform a particular task, for example, making a flight booking or hotel reservation as required by the user. The conversational capability provides bots with the ability to talk to the user and continue the conversation based on the user’s questions. The informative capability provides bots with the ability to collect information from different knowledge sources. Some examples of chatbots that have these capabilities are Alexa, Siri, Mitsuku, Xiaoice and FAQ bots [30].
2.2.2 Knowledge-based chatbot
Knowledge-based chatbots are able to collect information from underlying data sources or online documents that are either open-domain or closed-domain. The capabilities required for such a chatbot to work are the ability to access a collection of relational databases or online documents, extract the information containing the answer and generate a response. Open-domain data sources are publicly available and cover general topics; examples are Allen AI Science and Quiz Bowl [5, 6]. On the other hand, a closed-domain data source focuses on a specific knowledge domain, and all the information necessary to answer the question is provided in the dataset itself, as in the Daily Mail [4], MCTest and bAbI [4] datasets.
2.2.3 Service-based chatbot
Service-based chatbots provide either personal or commercial services to the customer. They are classified under the sub-categories of personal, social or agent-based bots. The capabilities required for such chatbots to work are the ability to access the required knowledge from the relevant sources and achieve the required goal, for example, making a flight booking, hotel reservation, restaurant booking etc. Personal service-based bots require the capability to mimic users' activities, such as managing a user's calendar, storing opinions, setting reminders etc. Agent-based bots require the capability to communicate with other bots to accomplish a task. An example of such a bot is the integration of Alexa [29] and Cortana [4].
2.2.4 Response generated-based chatbot
Response generated-based chatbots are dynamic models that mimic the working of the human brain while generating responses. In other words, such models decide what actions to perform as their response to a question asked by the user. The capabilities required for such chatbots to work are the ability to take inputs from the user, understand them, develop a response and communicate the response to the user as a conversation. Generating a response requires the chatbot to use advanced artificial intelligence techniques. Examples of such chatbots with this ability are Microsoft Tay, Apple Siri, Google bot, Amazon Alexa etc. These models are complex as they develop a response from scratch based on techniques such as machine learning, deep learning, NLP, recurrent neural networks etc.
This thesis's emphasis is on the response generated-based type of chatbot. The objectives of this chapter are to study the existing response generation approaches and determine their effectiveness in generating a response that is semantically correct and engages the user in conversation. In the next section, this study explains the techniques that are used in the existing literature to generate a response that mimics the working of the human brain.
2.3 Analysis of Models Used to Generate a Response that Mimics a Human Brain
This section discusses the techniques used in the four main models that enable chatbots to generate a response that mimics a human, namely template-based, generative, retrieval-based and search engine, as shown in Fig. 2.2, and identifies their drawbacks.
Fig. 2.2 Classification of response generated-based models
2.3.1 Template-based Model
This model has pre-defined questions and answers. It matches the user's question against a pre-defined collection of rules and question templates and, in the case of a match, displays the corresponding answer to the user. Such chatbots work by determining patterns using rules and are commonly used in the entertainment industry. To bootstrap the interaction, an initiator model is used that acts as the conversation starter [31]. Examples of these types of chatbots are Alicebot [28], Elizabot [2], ChatScript [32], and Storybot [31]. The template-based model uses the Artificial Intelligence Markup Language (AIML) [28], ChatScript [33], RiveScript [34] and Rasa [35] to structure the responses. Some applications of this model are Storybot [31] and BoW movies [36], which respond with a story and with the movie title, actors' names etc., respectively. However, the common drawback of this type of model is that the user receives a response only if there is a high level of pattern similarity with the input question, and sometimes the response is inappropriate. Additionally, template-based models are difficult to maintain, time-consuming and weak in terms of pattern matching [37].
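To illustrate the pattern-matching idea behind such templates, the sketch below shows a minimal AIML-style matcher. The patterns, responses and the `respond` helper are invented for illustration and are not taken from any of the systems cited above.

```python
import re

# Hypothetical pattern-response templates in the spirit of AIML categories.
# A captured group in the pattern can be substituted into the response.
TEMPLATES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}."),
    (re.compile(r"\bwhat can you do\b", re.I), "I can answer simple insurance questions."),
]
DEFAULT = "Sorry, I do not understand."

def respond(user_input: str) -> str:
    # Return the templated answer for the first matching pattern,
    # falling back to a canned default when nothing matches.
    for pattern, template in TEMPLATES:
        match = pattern.search(user_input)
        if match:
            return template.format(*match.groups())
    return DEFAULT
```

As the drawbacks above suggest, any input that does not closely match a stored pattern falls through to the default answer.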
Table 2.1 Template-based models – description with issues and impacts
AIML
  Description: An XML-based approach to specify rule-based chatbot content, designed for simplicity.
  Design technique: Mark-up language based on XML.
  Issues/limitations: Does not split input and combine the result.
  Impact: This will not enable the chatbot to generate a meaningful response, and it cannot answer complex queries.

Pattern Matching
  Description: An algorithmic task that finds pre-defined sequences of tokens to match expressions, or to detect patterns.
  Design technique: AIML category pattern matching.
  Issues/limitations: It needs a great deal of effort from the subject matter expert.
  Impact: It is not possible to write rules for every possible scenario of questions a user may ask.

Initiator Model
  Description: This acts as a conversation starter; the system takes the initiative by asking questions.
  Design technique: Fact generator model with pattern matching.
  Issues/limitations: Inappropriate responses.
  Impact: This will not enable the chatbot to generate a meaningful response.

Storybot Model
  Description: A social bot which tells stories; it outputs a short fictional story at the request of the user.
  Design technique: Pattern matching.
  Issues/limitations: It responds only when the user asks for a story.
  Impact: It does not engage the user in conversation.

BoW Movies
  Description: It answers questions in the movie domain, providing the movie title, actors' names and a description from IMDB.
  Design technique: Template-based with string matching.
  Issues/limitations: Specific to the movie domain only.
  Impact: It can only answer questions in the movie domain and its generated responses are template-based.
2.3.2 Retrieval-based Model
This model is more advanced than the template-based model in that it not only finds matches between the user's question and the predefined questions, it also considers the intent of the conversation. To understand the intent of the conversation, these models use techniques such as recurrent neural networks, a logistic regression classifier and sequence-to-sequence models. Based on these, the underlying database is searched for an answer. To select a response, information is retrieved from the knowledge base, such as previous conversational history, logs, PDS and insurance domain-specific terms. It applies a complex semantic query formation technique to obtain information and matches this using an ensemble of machine learning classifiers. As is the case with the template-based model, this model constructs a response through keywords, identified facts and semantic query formation for the corresponding questions. An example where this type of chatbot has been applied is the BoW escape plan [38]. Using a logistic regression classifier, the chatbot engages with the user and provides responses on 35 different topics. The work in [39] develops a dual encoder model that uses recurrent neural networks and sequence encoders to generate a response. The work in [6] uses the sequence-to-sequence approach with Gaussian latent variables and a logistic regression model to generate responses from Reddit data. The work in [40] uses a bag-of-words model with the Word2Vec approach to select the response from the underlying dataset that has the highest cosine similarity. So, while these models generate a response, it may be inappropriate, and they need a large amount of data to be reasonably functional.
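The cosine-similarity selection step used in [40] can be sketched with plain bag-of-words counts. The toy question-answer pairs below are invented, and real systems would use Word2Vec or GloVe embeddings rather than raw counts.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    # Simple bag-of-words: lower-cased token counts, no stemming or stop-words.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, qa_pairs):
    # Return the stored answer whose question is most similar to the query.
    query_vec = bow(query)
    question, answer = max(qa_pairs, key=lambda qa: cosine(query_vec, bow(qa[0])))
    return answer
```

Because selection is purely lexical here, a query that shares no words with any stored question scores zero everywhere, which mirrors the inappropriate-response drawback noted above.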
Table 2.2 Retrieval-based models – description with issues and impacts
End-to-End Method
  Description: A domain-specific approach whose aim is to complete certain jobs and to be able to communicate with an existing database.
  Design technique: Recurrent neural networks.
  Issues/limitations: The retrieval operation is non-differentiable, and the result does not convey uncertainty about semantic parsing.
  Impact: This approach has difficulties in extracting a user's natural language questions, and the model cannot learn from a conversation.

BoW EscapePlan
  Description: Capable of handling user involvement and keeping users in a conversation even if the model is not able to provide a meaningful response; it returns responses from a set of 35 predefined topics.
  Design technique: Logistic regression classifier.
  Issues/limitations: Dependent on 35 predefined topics and the data it is trained on; the performance of the model is poor.
  Impact: It will not enable the chatbot to generate a meaningful response and engage a user in a conversation.

VHRED Model
  Description: A logistic regression model that uses Reddit data to generate responses.
  Design technique: seq2seq with Gaussian latent variables.
  Issues/limitations: Inappropriate responses and poor performance.
  Impact: It cannot identify errors and does not generate a meaningful response.

Dual Encoder Model
  Description: It uses two sequence encoders with a single LSTM recurrent layer to produce a response.
  Design technique: Recurrent neural networks.
  Issues/limitations: Inappropriate responses and poor performance.
  Impact: It cannot identify errors and does not enable the chatbot to generate a meaningful response.

Bag-of-Words Model
  Description: The model is based on BoW models, Word2Vec embeddings and GloVe word vectors; it retrieves the responses with the highest cosine similarity.
  Design technique: Bag-of-words model.
  Issues/limitations: Information duplication issues; it requires a large amount of data.
  Impact: It cannot identify errors, it is difficult to train the model and it does not generate a meaningful response.
2.3.3 Search Engine Model
This model generates a response by crawling the web or using search engine results with a deep classifier model. Information retrieval using web crawling or search engines is not as simple as searching a knowledge base. It first needs to identify semantic annotations or metadata from the semantic layer of the web. Then, it needs to apply DOM parsing and extract only the required and relevant data that contains an answer. The search engine model uses approaches such as a deep classifier or an LSTM classifier to generate a response from the search results. Techniques such as deep brain and deep learning are used to identify the possible answers before selecting the one to be shown to the user. While the model produces a set of possible answers, the challenge is to choose the most suitable response among them. Examples of such chatbots are Indri, Google Assistant, Microsoft Bing and Lucene [3].
Table 2.3 Search engine models – description with issues and impacts
Deep Classifier Model
  Description: It searches the web with user queries and responds from a set of search engine results.
  Design technique: Deep Brain, deep learning.
  Issues/limitations: Many response results for one user query.
  Impact: It is not able to provide the single best matching response from the list of responses, and it will not generate a meaningful response.

LSTM Classifier
  Description: It uses an LSTM cell at the encoder and decoder to obtain a response.
  Design technique: Binary classification.
  Issues/limitations: Finite-length sequences.
  Impact: It cannot answer complex queries or engage a user in a meaningful conversation.
2.3.4 Generative Model
Generative-based models generate new answers in response to users' questions. They do not depend on pre-defined questions and answers; rather, they use neural network models or deep learning techniques [41] such as ANN, RNN, CNN and DeepQ to develop a dialogue with the user on the fly. The generative model uses a knowledge synthesis approach to generate answers and engage the users in a form of dialogue. These models generate responses by translating from the inputs to the outputs. Various chatbots that use this model have been developed. The work in [42] developed a Question Generator Unit (GRU) that generates follow-up questions to be presented to the user using a word-by-word vector. The work in [43] proposed the seq2seq model, which is a feedforward fully connected neural network, to generate a response. The Deep-Q-Network works on the end-to-end decoder approach and uses an iterative decoding strategy to obtain an output sequence with maximal probability [44]. The research in [45, 46] proposed the Markov chain model that builds the responses which have the highest probability using a stochastic model and the Markov process. The work in [27] proposed the pipeline method, which is a task-oriented dialogue system method using neural networks and deep learning. These models, however, have drawbacks as they need a large amount of training data, long training times, and significant human input to correctly train them so that they give a satisfactory performance.
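The Markov chain idea in [45, 46] can be sketched as a bigram model: each next word is chosen according to the words observed to follow the previous word in the training data. The tiny corpus below is invented for illustration only; real systems train on far larger datasets.

```python
import random
from collections import defaultdict

def build_chain(corpus):
    # Map each word to the list of words observed to follow it (a bigram chain).
    chain = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            chain[current].append(nxt)
    return chain

def generate(chain, start, max_len=10, seed=0):
    # Walk the chain from a start word, sampling each successor at random,
    # until the length limit is hit or a word has no observed successor.
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_len and chain[words[-1]]:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)
```

This sketch also makes the drawback concrete: with little data the chain can only replay its training sentences, and with much data it needs the large corpus noted above to produce coherent sequences.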
Table 2.4 Generative-based models – description with issues and impacts
GRU Question Generator
  Description: It generates follow-up questions word by word; the model is used for short questions.
  Design technique: Word-by-word vector.
  Issues/limitations: The generation procedure works only when a question mark is detected.
  Impact: It will not be able to engage the user in continuous conversation.

seq2seq Model
  Description: A feedforward neural network in which responses are generated during a conversation.
  Design technique: Deep learning and RNN.
  Issues/limitations: It is hard to train, it takes a long training time and it needs a large dataset.
  Impact: It is not easy to train the chatbot and it requires a huge dataset on a specific domain; meaningful and continuous conversation is still questionable.

Deep Q Network
  Description: An end-to-end decoder that uses an iterative decoding strategy; it aims to obtain an output sequence with maximal probability.
  Design technique: Deep neural networks.
  Issues/limitations: Requires many iterations to obtain satisfactory performance and requires a large dataset.
  Impact: It is difficult to train the chatbot and it requires a huge dataset on a specific domain; meaningful and continuous conversation is still questionable.

Pipeline Method
  Description: A task-oriented dialogue system method.
  Design technique: Neural networks, deep learning.
  Issues/limitations: Information is omitted and there are duplication issues; the process requires significant human effort.
  Impact: It will not enable the chatbot to generate a meaningful response and does not engage with the user.

Markov Chain Model
  Description: A statistical model that generates responses based on Markov chains; the idea is a probability of occurrence for each word in the dataset.
  Design technique: Stochastic model, Markov process.
  Issues/limitations: Finite-length sequences.
  Impact: It is difficult to train the chatbot and it requires a huge dataset on a specific domain; meaningful and continuous conversation is still questionable.
2.4 Workings of the Existing Chatbots in the Literature
In this section, we discuss the existing chatbots in the literature and determine whether the various response-generating chatbots have the aforementioned drawbacks.
2.4.1 Elizabot
Elizabot is one of the earliest and best-known chatbots. It was developed in an MIT lab in 1966 [2] and was intended to demonstrate natural language conversation between humans and machines to provide Rogerian psychotherapy. Rogerian psychotherapy primarily encourages the patient to talk more rather than engaging in a discussion. Elizabot's responses are personal questions that are meant to encourage the patient to continue the conversation. It uses rule-based techniques and a script to respond to the patient's questions with keyword matching from a set of templates and context identification. The model detects the appropriate template and selects the corresponding responses. If there are multiple templates, a template is selected randomly and the model runs it through a set of reflections to better format the string for a response. Elizabot was able to convince some people and assist in the treatment of patients suffering from psychological issues. Nonetheless, Elizabot could not provide anything comparable to therapy with a human therapist. The drawback of Elizabot is its failure to keep a conversation going. Furthermore, Elizabot is incapable of learning new information or discovering context, and it lacks logical reasoning capabilities [47].
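The "reflections" step mentioned above can be illustrated with a small pronoun-swapping table. The word list and the wrapper question below are simplified inventions, not ELIZA's actual script.

```python
# Hypothetical reflection table: swap first- and second-person words so the
# user's statement can be echoed back as a question, ELIZA-style.
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "i", "your": "my", "are": "am",
}

def reflect(text: str) -> str:
    # Replace each reflectable word; leave everything else unchanged.
    return " ".join(REFLECTIONS.get(word, word) for word in text.lower().split())

def eliza_respond(statement: str) -> str:
    # Wrap the reflected statement in a generic Rogerian prompt.
    return "Why do you say that " + reflect(statement) + "?"
```

The wrapper never adds new content, which is exactly why such a bot struggles to keep a conversation going once the user stops volunteering statements.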
2.4.2 Alicebot
The Artificial Linguistic Internet Computer Entity, also referred to as ALICE, was inspired by [2] and developed in [28]. Alicebot is based on an updated version of Eliza's pattern-matching architecture. However, Alicebot is purely based on pattern matching and the depth-first search technique over the user's input. It uses a form of XML dialect that encodes rules for questions and answers. It uses a set of Artificial Intelligence Markup Language (AIML) templates to produce responses given the dialogue history and user utterance [48]. First, AIML receives the user's sentence as input and stores it in what is known as a category. Each category comprises a response template and a set of conditions that give meaning to the template, known as context. Then the model preprocesses the input and matches it against the nodes of the decision tree. When the user input is matched, the chatbot responds or executes an action. The AIML templates repeat the user's input utterance using recursive techniques, but these are not always meaningful responses. Therefore, string-based rules are required to determine whether the response is correct or meaningful.
The drawback of Alicebot is the difficulty it has in modelling personality, such as traits, attitudes, mood, emotions and physical states [49]. The botmaster must integrate personality elements within the AIML. However, this is not a straightforward task. Alicebot is also incapable of generating appropriate responses, has no reasoning capabilities and is unable to generate human-like responses (Turing test). A large number of QA pairs is required to build a chatbot, and these may be difficult and time-consuming to maintain, hence making it unfeasible. Alicebot does not have intelligence features such as natural language understanding (NLU), sentiment analysis and grammatical analysis to structure a sentence. In addition, if the same input is repeated during the conversation, Alicebot gives the same answer most of the time.
2.4.3 Elizabeth bot
Elizabeth bot is a version of Weizenbaum's ELIZA application and was developed in [50]. However, various selection, substitution, and phrase storage mechanisms have been enhanced, increasing its potential adaptability and flexibility. Elizabeth bot uses four steps to generate a response. First is command line script processing, where each line has a single character which represents a command notation, not a keyword message. For example, the character 'W' stands for the welcome text, 'S' for the sending text, 'N' for no match etc. Lines can also be indexed using a user-specific code. Second is the input transformation rules, in which the input is mapped to predefined keywords to obtain a compatible form. Third is the output transformation rules, where personal pronouns are changed to form an appropriate response. Fourth is the matching of keyword patterns. Elizabeth bot tries to give a different answer by selecting different responses for the same question [51]. The nature of some rules in Elizabeth bot may cause iteration, which is solved by applying each rule only once.
The drawback of Elizabeth bot is that it does not provide a way to partition or split the user's input sentence and then combine the results. Due to Elizabeth bot's structure, it would be difficult to do this splitting. Furthermore, many complications occur because some rules are written in uppercase and others in lowercase, which may cause errors and result in the generation of unsuitable answers. However, Elizabeth bot has the ability to give the derivation structure of a sentence using grammatical analysis, keyword extraction and pattern matching.
2.4.4 Mitsuku
Mitsuku is the most widely used standalone human-like chatbot developed using AIML [52]. It was designed for general typed conversation based on the rules written in AIML [53] and integration into bot networks such as Twitter, Telegram, Firebase and Twilio to serve as a personality layer. Mitsuku uses NLP with heuristic patterns and is hosted on Pandorabots. Bot modules abstract a lot of the work that goes into creating a robust chatbot system. In order to integrate a module, some AIML categories need to be included to route inputs from users. Whenever Mitsuku fails to find a good match for an input, it automatically redirects to the default category. Mitsuku can hold a long conversation, learns from the conversation and remembers personal details about the user (age, location, gender, etc.). Its features include the ability to reason about specific objects. For example, if someone says "Can you eat a house?", Mitsuku will look up the properties of "house", find that the value of "made_from" is set to "brick" and reply "No", as a house is not edible. Mitsuku is a multilingual bot and uses supervised machine learning. As it learns something new, the data is sent to a human manager for verification. Only verified data can be further incorporated and used by the app. However, Mitsuku is not effective without a large dataset, and it fails to provide dialogue management components.
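Mitsuku's object-property reasoning in the "Can you eat a house?" example can be approximated with a simple property store. The table and lookup below are a hypothetical sketch, not Pandorabots' implementation.

```python
# Hypothetical object-property store in the spirit of Mitsuku's reasoning.
PROPERTIES = {
    "house": {"made_from": "brick", "edible": False},
    "apple": {"made_from": "fruit", "edible": True},
}

def can_you_eat(obj: str) -> str:
    # Look up the object's properties and answer from the "edible" value;
    # unknown objects conservatively get "No".
    return "Yes" if PROPERTIES.get(obj, {}).get("edible", False) else "No"
```

The design point is that the answer comes from stored facts about the object rather than from a matched response template, which is what lets the bot "reason" about questions it has never seen verbatim.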
2.4.5 Cleverbot
Cleverbot is one of the most popular entertainment chatbots that implements rule-based AI techniques to communicate with humans [54]. It was developed in [55] to collect a large amount of data based on conversational exchanges with people online through crowdsourcing. Unlike other chatterbots, Cleverbot's responses are not pre-programmed. Instead, it simulates natural conversation by learning from user input and relying on feedback in order to interact. When the user inputs a sentence, Cleverbot finds all the keywords or phrases matching the input. After searching through its saved conversations, it responds to the input by finding how a user responded to that input when it was asked before. Cleverbot is unique in that it "learns" what users have said to it in previously saved conversations and uses this knowledge to determine how to respond to new conversations [56]. To enhance the realism of the conversation, the bot has its own human avatar that shows emotions. The underlying technology in Cleverbot not only processes verbal and textual interactions but also facial expressions and movements to create a more authentic conversation. The drawback of Cleverbot is its unpredictable responses and its tendency to suddenly change the subject and respond without context. It is also unable to continue a long conversation, it is not accurate in language translation and it may not be suitable for children due to mature themes, profanity or references to alcohol or tobacco.
2.4.6 Chatfuel
Chatfuel provides a drag-and-drop, user-friendly interface to construct a rule-based chatbot and was developed in [57]. It provides an artificial intelligence module to train the bot to map input sentences to outputs. It allows response prompts and integration with services such as social media, third-party tools and CRM systems. With its analytics capabilities, users can collect and view valuable information on chatbot performance and subscriptions quickly and effectively. Users can dictate the conversational rules via the Chatfuel dashboard to ensure the chatbot understands and answers user requests efficiently. It also allows a JSON integration to accommodate custom logic in the bot. Its most attractive point is that it is simple to build a rule-based bot, which is suitable for small businesses. The drawbacks of Chatfuel are that it is quite inflexible in terms of conversation flow and it does not support knowledge-based and multi-language features. Additionally, its NLP is limited and difficult to set up, and its documentation is poor. However, it is capable of understanding the user's intent.
2.4.7 ChatScript
ChatScript is a scripting-based commercial chatbot developed in [32]. It uses pattern matching techniques similar to AIML and is a combination of an NLP and dialogue management system, including some control scripts, which are simply another ordinary topic of rules. A rule consists of a type, label, pattern and output. Rules are bundled into collections called topics, with keywords that allow the engine to automatically search the topic for relevant rules based on user input. Unlike AIML, which finds the best pattern match for an input, ChatScript first finds the best topic match, then executes a rule contained in that topic. ChatScript is well suited to stand-alone applications such as information kiosks and help desks. Although it has excellent documentation, it is difficult to implement. The drawbacks of ChatScript are that it is difficult to learn and there are no hosting services. It is also difficult to embed in a web page [56].
2.4.8 IBM Watson
Watson is a rule-based AI chatbot developed by IBM's DeepQA project [58]. It is designed for information retrieval and is a question-answering system that integrates NLP and hierarchical ML methods. Watson uses a broad range of mechanisms to identify and assign feature values such as names, dates, geographic locations or other entities to the generated response. The machine learning system then learns how to combine the values of these features into a final score for each response. Based on this score, it ranks all possible answers and selects one as its top answer. Watson incorporates a variety of technologies, including Hadoop and the Apache Unstructured Information Management Architecture (UIMA) framework, to examine the phrase structure and grammar of the question to better gauge what is being asked.
Watson uses cognitive computing technology as its underlying structure, which is able to perform text mining and complex analytics on unstructured data and handle enormous quantities of data. As the application gains experience with more input, it can find enough patterns to make accurate predictions. Despite these advantages, Watson has some major drawbacks: it does not process structured data directly, it has no relational databases, it incurs a higher maintenance cost, it is targeted towards bigger organizations and it takes a longer time and more effort to train Watson to use its full potential.
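Watson's final answer selection, as described above, combines feature values into a single score per candidate and ranks on it. A toy version of that ranking step is sketched below; the feature names and weights are invented for illustration (Watson learns its combination weights from data rather than fixing them by hand).

```python
# Hypothetical learned weights for a few answer-scoring features.
WEIGHTS = {"keyword_overlap": 0.5, "entity_match": 0.3, "source_reliability": 0.2}

def score(features: dict) -> float:
    # Combine feature values into one score via a weighted sum.
    return sum(WEIGHTS[name] * value for name, value in features.items())

def top_answer(candidates):
    # candidates: list of (answer_text, feature_values) pairs;
    # return the answer text with the highest combined score.
    return max(candidates, key=lambda c: score(c[1]))[0]
```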
2.4.9 Microsoft LUIS
Language Understanding Intelligent Service (LUIS) is a domain-specific AI engine developed by Microsoft [59]. It is built using NLP and information extraction and uses prebuilt domain entity models and context. LUIS performs NLP against big data to find the intent of a sentence. It performs well in retrieving conversational data, interpreting it, and extracting user intents and entities. The model starts with a list of general user intentions such as "Book Flight" or "Contact Help Desk." Once the intentions are identified, the user supplies example phrases, called utterances, for the intents. Then, the utterances are labelled with the specific details the user wants LUIS to pull out of the utterance. After the model is trained, it is able to process user input. LUIS receives the user input via an HTTP endpoint and conveys a set of relevant intents.
LUIS is integrated with various prebuilt applications and tools, such as a calendar for organizing days, a dictionary for word lookup and the collected knowledge of the web, email for communication, music, devices etc. The LUIS model is easily deployable and integrates seamlessly with the Azure Bot Service. The major drawback of LUIS is that it requires an Azure subscription.
2.4.10 Google Dialogflow
Dialogflow, formerly known as Api.ai, was developed by Google [60] and is part of the Google Cloud Platform. It allows app developers to enable their users to interact with interfaces through voice and text exchanges powered by machine learning and natural language processing technologies. This lets them focus on other integral parts of app creation rather than on delineating in-depth grammar rules. Dialogflow recognizes the intent and context of what the user says, then matches the user input to specific intents and uses entities to extract relevant data. Finally, it allows the conversational interface to provide responses. The drawbacks of Dialogflow are that there is no handheld device version and it does not have an interactive user interface.
2.4.11 Amazon Lex
Amazon Lex is a service for building conversational capability into applications using deep learning technologies, developed by Amazon [29]. It provides deep learning functionality and NLU to build flexible user-bot conversational interfaces which increase user engagement. Amazon Lex integrates with AWS Lambda so that the user can easily trigger functions to execute back-end business logic for data retrieval and updates. The drawbacks of Amazon Lex are that it is not multilingual and currently only supports English. Unlike Watson, Lex's integration processes are complex. Furthermore, the preparation of the dataset and the mapping of the entities are difficult.
Table 2.5 summarizes the workings of the existing chatbots in terms of the category in which they fall and their functionality and technique specifications, and it also discusses the drawbacks which prevent them from generating a meaningful response and engaging the user in a dialogue. In the next section, a discussion of the technical approaches used in the generative model to build and generate a response is presented.
Table 2.5 Features and drawbacks of existing chatbots
Each entry below lists the chatbot's functionality (extract / intent / entity / classification, plus sentence structure), its technical specification (searching, input/output and technique), and its drawback and category.

Eliza [2] (Category: Service-based)
  Functionality (extract / intent / entity / classification): No / No / No / No
  Sentence structure: no ability to structure a sentence
  Searching: basic
  Input/output: basic pattern matching with templates to generate a response
  Technique: template-based
  Drawback: no logical reasoning capabilities; inappropriate responses

Alice [28] (Category: Goal-based)
  Functionality (extract / intent / entity / classification): Yes / No / Yes / Yes
  Sentence structure: no structure ability; stores a huge corpus of text
  Searching: depth-first search
  Input/output: pattern matching to represent input and output sentences
  Technique: recursive techniques
  Drawback: no grammatical analysis to structure a sentence

Elizabeth [50] (Category: Goal-based)
  Functionality (extract / intent / entity / classification): Yes / No / No / No
  Sentence structure: derivation structure of a sentence using grammatical analysis
  Searching: first keyword pattern match rules
  Input/output: command line script as input, with output transformation rules to generate responses
  Technique: iterative
  Drawback: does not split the input and combine the results

Mitsuku [52] (Category: Service-based)
  Functionality (extract / intent / entity / classification): Yes / Yes / Yes / Yes
  Sentence structure: Yes
  Searching: search value of category and properties
  Input/output: AIML category to route input from the user
  Technique: NLP with heuristic patterns, supervised ML
  Drawback: fails to provide dialogue management components

LUIS [59] (Category: Knowledge-based)
  Functionality (extract / intent / entity / classification): Yes / Yes / Yes / Yes
  Sentence structure: uses grammatical analysis
  Searching: finds the intent from the input, responds with extracted intentions
  Input/output: identifies valuable information from the user conversation
  Technique: NLU with prebuilt domains, active learning
  Drawback: requires an Azure subscription

Dialogflow [60] (Category: Response-based)
  Functionality (extract / intent / entity / classification): Yes / Yes / Yes / Yes
  Sentence structure: ability to structure a sentence
  Searching: search keywords
  Input/output: matches input to specific intents and uses entities to extract
  Technique: NLP, ML
  Drawback: no interactive UI and does not support handheld devices

Amazon Lex [29] (Category: Response-based)
  Functionality (extract / intent / entity / classification): Yes / Yes / No / Yes
  Sentence structure: ability to structure a sentence
  Searching: search keywords
  Input/output: matches keywords for input and response
  Technique: NLU, AWS Lambda
  Drawback: not multilingual; mapping utterances and entities is very difficult

Chatfuel [57] (Category: Service-based)
  Functionality (extract / intent / entity / classification): No / No / Yes / No
  Sentence structure: Yes
  Searching: search keywords
  Input/output: maps input sentences to output
  Technique: rule-based
  Drawback: inflexible conversation flows

Cleverbot [55] (Category: Service-based)
  Functionality (extract / intent / entity / classification): Yes / No / No / Yes
  Sentence structure: ability to structure a sentence
  Searching: searches keywords through its saved conversations
  Input/output: matches keywords for input and responds based on previous chats
  Technique: rule-based
  Drawback: unpredictable responses without context

ChatScript [32] (Category: Goal-based)
  Functionality (extract / intent / entity / classification): Yes / No / Yes / Yes
  Sentence structure: no structure ability
  Searching: finds a topic and executes a rule contained in that topic
  Input/output: pattern matching
  Technique: script-based
  Drawback: difficult to learn and to embed in a web page

Watson [58] (Category: Knowledge-based)
  Functionality (extract / intent / entity / classification): Yes / Yes / Yes / Yes
  Sentence structure: phrase and grammar structure analysis
  Searching: search keywords
  Input/output: identifies feature values to generate responses based on their score
  Technique: rule-based NLP, UIMA
  Drawback: does not process structured data; no relational databases
2.5 Techniques Used in Existing Dialogue-based Chatbots to Build and Generate a Response
Information extraction and user intention identification are central research topics in NLP. Several models have been presented by researchers in the last few years. Deep neural network models are a recent development in deep learning and have shown potential for building self-learning chatbots. There have also been several related attempts to address the seq2seq model's problems with deep learning approaches such as RNN, DNN and CNN [8]. This section summarises the previous studies and identifies the gaps. This study takes the systematic literature review approach [61] to conduct the review process. The next sub-section presents a summary of each technique, and a comparison is conducted to identify the gaps from the perspective of meeting the requirements of response-generating chatbots.
2.5.1 Rule-based approach
In earlier days, researchers focused on conversational systems that were built using simple, predefined templates. This approach does not require any training; however, it requires a great deal of expert effort to produce handcrafted rules or templates [62, 63]. The authors also found that rule-based systems are expensive to construct and that the discussion can easily go beyond their scope. Thus, researchers and industry started to pay more attention to data-driven methods such as retrieval-based and generation-based methods.
2.5.2 TF-IDF approach
TF-IDF determines the significance of a word in a document depending on the number of times it appears in it. It has two components: Term Frequency (TF) and Inverse Document Frequency (IDF) [64]. The importance of a word is determined according to its TF and IDF values. Wang et al. [65] proposed a two-step retrieval technique to find appropriate responses from a massive data repository using this approach. The retrieval process consists of extracting the user's input, matching responses and ranking the responses according to their TF-IDF values. While such models generate responses, they use the bag-of-words (BoW) model, which does not capture text position, semantics or co-occurrences in distinct articles. Additionally, the frequency of each word needs to be normalized in terms of its occurrence throughout the collection. Cerezo et al. [66] implemented a chatbot for expert recommendation tasks to help developers find the right person to contact. The proposed chatbot is based on NLP for sentence classification and key concept identification using the TF-IDF algorithm. They conducted a preliminary evaluation in two steps. First, three participants were asked to complete a specific task through interaction with the chatbot. In the second step, a semi-structured interview was conducted, and the participants were asked to describe the emotions they felt while interacting. Although the chatbot gave users the answers they were expecting, its responses were mere answers to their questions rather than engaging them in a meaningful conversation.
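To make the weighting scheme concrete, the TF-IDF computation described above can be sketched as follows. This is an illustrative implementation, not code from [65] or [66]; the toy insurance-flavoured corpus and the `tf_idf` function name are assumptions for demonstration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.

    TF  = count of term in the document / document length
    IDF = log(number of documents / number of documents containing the term)
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy corpus of three tokenised "documents".
docs = [["insurance", "policy", "claim"],
        ["policy", "renewal"],
        ["claim", "policy", "claim"]]
w = tf_idf(docs)
```

Note that "policy", which appears in every document, receives a weight of zero: a term common to the whole collection carries no discriminative value for matching a user query to a stored response.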
Mondal et al. [67] proposed a chatbot to assist Q&A in an educational domain. It uses an ensemble learning method and is built in the form of a Telegram bot. The authors pre-processed the crawled data to convert it into a structured form. Then, using NLTK features, they extracted the corresponding features from a dataset of 1500 questions. The model was trained using a random forest approach and learns from a subset of features to answer the corresponding questions. However, it only answers user queries and fails to start or engage the user in a conversation. Table 2.6 presents a summary of the TF-IDF approaches to generate a response.
Table 2.6 Summary of the TF-IDF approaches to generate a response, with their description and issues

TF-IDF matching with BoW [65]. Description: a retrieval-based conversational system using TF-IDF. Response generation: formulates the TF-IDF score of each word to generate an appropriate response. Issues/drawbacks: responds to questions only; does not generate a meaningful response.

TF-IDF with NLP [66]. Description: a chatbot to help developers find the right person to contact. Response generation: uses NLP for sentence classification and key concept identification, using the TF-IDF algorithm. Issues/drawbacks: answers users' questions only; trained on a basic conversational dataset only.

TF-IDF with random forest [67]. Description: a chatbot to assist Q&A in an educational domain, in the form of a Telegram bot. Response generation: converts crawled data to a structured form and extracts features that assist in responding to the corresponding question. Issues/drawbacks: answers users' questions only and cannot engage the user in a meaningful conversation.
2.5.3 End-to-End approach
The end-to-end approach uses a single neural network in which all the NLP processing steps to generate a response are carried out. Williams et al. [68] developed a task-oriented chatbot using such an approach to carry out tasks such as booking movie tickets. The authors trained the model through supervised learning techniques. The NLU unit spontaneously classifies user queries with domain-specific intents and fills several slots to create a semantic frame. An LSTM was used for slot filling and to determine the user's intent simultaneously [68, 69]. The Deep-Q-Network approach is applied during training on the labelled dataset to fine-tune the chat engine. This approach is similar to that in [5], in which the author conducted extensive experiments and performed a quantitative analysis showing that errors at the slot level have a higher effect on the output than errors at the intent level. Drawbacks of this approach are that each epoch needs to be trained individually, which presents several challenges and makes the performance of the entire system less robust. Gu et al. [70] proposed an enhanced sequential inference model (ESIM) with an end-to-end approach in which, given a partial conversation, the model selects the correct next utterance. ESIM has four features, namely a new word representation method, an attentive hierarchical recurrent encoder (AHRE), multi-dimensional pooling and a modification layer for response selection. The drawback is that it requires a vast amount of labelled training data. Table 2.7 presents a summary of the end-to-end approaches to generate a response.
Table 2.7 Summary of the end-to-end approaches to generate a response, with their description and issues

Deep-Q-Network [68]. Description: a task-oriented bot built with end-to-end methods. Response generation: the NLU unit automatically classifies the user's query and uses LSTM for slot filling. Issues/drawbacks: no conversational capabilities.

ESIM model [70]. Description: a response-selection conversational system. Response generation: given a partial conversation, it selects the next utterance from a set of possible candidates. Issues/drawbacks: does not have conversational capabilities; trained on a basic conversational dataset only.
2.5.4 RNN approach with seq2seq mechanism
The seq2seq mechanism revolutionised the process of translation by making use of deep learning. Seq2seq takes as input a chain of words in a sequence and generates the corresponding outputs. In this approach, each word is converted to its target sequence without considering its grammar or the sentence structure. It has two main components, an encoder and a decoder, which encode the input and decode the output, respectively. While decoding, the decoder also considers the previous and next inputs apart from the current one. It does this by using neural networks such as an RNN, DNN or CNN [8, 71].
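As an illustration of the encoder-decoder idea, the computation can be sketched with a vanilla RNN cell. This is a minimal sketch with random, untrained weights; the dimensions, token ids and function names are assumptions for demonstration, and no claim is made about any specific cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh, b):
    """One vanilla RNN step: new hidden state from input x and previous state h."""
    return np.tanh(Wx @ x + Wh @ h + b)

# Toy dimensions: vocabulary of 5 token ids, hidden size 8 (illustrative only).
V, H = 5, 8
E = rng.normal(size=(V, H))             # embedding table
Wx = rng.normal(size=(H, H))
Wh = rng.normal(size=(H, H))
b = np.zeros(H)
Wout = rng.normal(size=(V, H))          # projects hidden state to vocabulary logits

def encode(tokens):
    """Encoder: fold the whole input sequence into a single fixed-length vector."""
    h = np.zeros(H)
    for t in tokens:
        h = rnn_step(E[t], h, Wx, Wh, b)
    return h

def decode(h, steps):
    """Greedy decoder: start from the encoder state, emit the arg-max token,
    and feed it back in as the next input."""
    out, tok = [], 0                    # token 0 acts as a start symbol here
    for _ in range(steps):
        h = rnn_step(E[tok], h, Wx, Wh, b)
        tok = int(np.argmax(Wout @ h))
        out.append(tok)
    return out

summary = encode([1, 3, 2])             # fixed-length representation of the input
reply = decode(summary, steps=4)        # decoded output sequence of token ids
```

A trained system would learn E, Wx, Wh and Wout from conversational data; the point here is only the shape of the computation: the input is folded into one fixed-length vector, from which the decoder unrolls one token at a time.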
Gasic et al. [72] built an agenda-based goal-oriented chatbot for booking movie tickets. They proposed a Gaussian process-based technique to learn and implement strategies that can be used with a small amount of information to adapt to the use case being addressed. The authors used the domain ontology and executed the dialogue system on two levels: NLU and the semantic level. The NLU component determines the intent of the user. The semantic level defines user-bot interactions in semantic frames as messages. Additionally, the user-bot interaction consists of two slot types, namely inform and request slots. Inform slots are user-known values such as movie_name (avatar), number_of_persons (4) and day (Friday), and the request slots are the user's requests such as location name (city), theatre name and movie time. One shortcoming of the approach is that it requires an end-to-end supervised approach to add new knowledge to the neural networks. Wu et al. [73] developed a response selection approach in a retrieval-based chatbot. The proposed approach determines the response candidate to be given by determining the context of the conversation and its important parts, and modelling the relationships among the utterances in that context. A sequential matching framework (SMF) is proposed to achieve these tasks. In the first stage, each word is transformed to a vector according to the context in which it is being studied. In the next step, the hidden states of the RNN are used to determine the relationships between utterances according to their vectors. However, one of the drawbacks of their model is that a large amount of labelled data is needed to train the matching model.
Language generation techniques are another way to build a conversational system. Liu et al. [74] created RubyStar, a human-like dialogue system that combines distinct strategies for generating responses. The authors integrate both rule-based and deep learning techniques to decode speech to text. The NLU performs pre-processing, including topic detection, intent analysis and entity linking, after which the response generation strategies layer and neural networks (NNs) handle the user input. After going through the NN, the input stream flows into the response generator, which eliminates incoherent or questionable answers using a content filter. A ranking method is used in case there is more than one valid answer. The selected answer is passed to Amazon Alexa text-to-speech (TTS) and this is given to the user. Their results showed that a character-level RNN is an efficient overall response generation model. However, the model's performance could be improved by using other types of embeddings, such as word embedding to replace the current topic embedding, sentiment embedding and engagement embedding. Cho et al. [42] proposed an NN-based encoder that encodes a chain of words into a fixed-length vector and a decoder that decodes it into another sequence. The encoder and decoder are jointly trained to improve the model's accuracy and map the input sequence to an output. The proposed approach was shown to improve BLEU scores. Gu et al. [75] addressed a significant issue in seq2seq learning referred to as the copying mechanism. The authors' proposed approach predicts the output sequence directly from the input by carrying information to the next stage without any nonlinear transformation. A similar method is proposed by Srivastava et al. [76] for training deep neural networks. However, a shortcoming of such approaches is that they cannot predict outputs outside of the set of the input sequence. This approach was enhanced by Gu et al. [75] by selectively replicating input segments in the outputs. This is helpful in those cases where people are likely to repeat entity names. However, the challenge in seq2seq comes while copying. The authors addressed this by proposing a new model with an encoder-decoder structure called COPYNET. The proposed approach integrates a common word generation technique in the decoder with the copying mechanism, which can select parts of the input sequence and place them in the output sequence.
Serban et al. [31] proposed a deep reinforcement learning chatbot called MILABOT which is able to interact with humans through speech and text. MILABOT comprises NLP and a neural network-based retrieval model including QA templates. The user-bot interaction provides responses using reinforcement learning from crowdsourced data. Lee et al. [77] state that a conventional seq2seq model discovers sequences more accurately when the input sequences are conditioned without taking into account the output sequences. To demonstrate this, the authors' model scales the sentiment of the chatbot by training it with TensorFlow (proposed by [78]) using a Twitter chatting corpus. In the training stage, the sequence input to the encoder and seq2seq models maximizes the probability of generating an accurate response. Two evaluation metrics are used, namely sentiment coherence and sentiment classifier score. Sentiment coherence scores whether the output is meaningful or not, and the sentiment classifier score measures how positive the output sentence is. Wu et al. [79] proposed an attention-based RNN in which the responses are enhanced based on user inputs. First, the RNN processes the input sequence and the generated response is weighed by a pre-trained LDA model. This is used to form topic vectors that are linear combinations of the topic words which then, through an attention mechanism, refine the given inputs and responses. Despite the success of the seq2seq model, there has not been much focus on dealing with a chatbot's speech recognition errors in the end-to-end dialogue system. Chen et al. [80] investigated the problem of converting speech to text. The study uses a DNN to determine the probability of error in spoken text summarization and applies a CRF model. This model has dual encoders (two RNNs) with different parameters, one ASR gate and one decoder. The encoder encodes the input, the ASR gate forwards vectors to the decoder, and the decoder generates the output. Additionally, it makes hidden states similar to the decoder's for it to predict the dialogue text. The model demonstrates that it generates similar responses from the given input; however, the output contains errors. Stroh and Mathur [81] used a seq2seq model with GloVe word vectors to answer questions. The authors used a cross-entropy error on the decoder output to train the RNN on the bAbI English dataset. The proposed approach performed well on questions that required either a yes or no answer; however, it failed for longer response generation. Table 2.8 presents a summary of the seq2seq approaches to generate a response.
Table 2.8 Summary of the RNN approaches with seq2seq to generate a response, with their description and issues

Gaussian process [72]. Description: a goal-oriented bot in the movie-booking domain. Response generation: user-bot interaction consists of two slots, inform and request, to generate a response. Issues/drawbacks: does not engage the user in a long conversation; cannot provide a meaningful response.

Sequential matching framework [73]. Description: a retrieval-based bot with RNN. Response generation: generates a response by determining the context of the conversation and its important parts, and modelling relationships among the utterances in that context. Issues/drawbacks: trained on a basic conversational dataset only.

Language generation techniques [74]. Description: a non-task-oriented social bot integrating rule-based and deep learning techniques. Response generation: mimics human-like conversation by using a combination of response generation strategies. Issues/drawbacks: trained on a basic conversational dataset only.

Copying mechanism [75]. Description: predicts the output sequence directly from the input using seq2seq. Response generation: generates the output according to the sequence of the inputs. Issues/drawbacks: little conversational capability; cannot predict or generate a meaningful response; trained on a basic conversational dataset only.

Neural network-based retrieval model [31]. Description: a deep reinforcement learning chatbot with NLP. Response generation: able to interact with a human through speech and text. Issues/drawbacks: cannot provide a meaningful response; trained on a basic conversational dataset only.

Conventional seq2seq model [77]. Description: a model to scale the sentiment of the chatbot using seq2seq. Response generation: sentiment coherence and sentiment classifier scores. Issues/drawbacks: focused on sentiment rather than conversation; trained on a basic conversational dataset only.

Attention-based RNN [79]. Description: user inputs and the responses are enhanced. Response generation: high-level responses with rich content. Issues/drawbacks: trained on a basic conversational dataset only.

End-to-end dialogue system [80]. Description: detects speech errors and tries to recover from them using a CRF model with DNN. Response generation: speech-to-text conversion. Issues/drawbacks: trained on a basic conversational dataset only.

seq2seq model with GloVe word vector [81]. Description: built with TensorFlow GRU, with separate representations for the query, e.g. 'Q' for a question and 'GO' for start. Response generation: answers well on tasks with yes/no questions. Issues/drawbacks: does not have conversational capability or generate a meaningful response; trained on the bAbI dataset only.
2.5.5 RNN approach with memory network
The RNN approach with a memory network is termed long short-term memory (LSTM) [82]. The LSTM cell overcomes the short-term memory characteristic of a vanilla RNN, which otherwise has a short attention span. Instead, it can remember patterns between words for a longer duration and use this to accurately determine the next word in a sequence. It develops the context of a word by taking inputs and determines what the next output in the sequence should be. Researchers have proposed methods by which the LSTM model is able to deal with a longer sequence of inputs and process them to produce an accurate output.
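The gating behaviour described above can be sketched as a single LSTM step. This is an illustrative NumPy implementation with random, untrained weights, not code taken from [82]; the dimensions are arbitrary assumptions.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates decide what to forget, write and expose,
    letting the cell state c carry information across long sequences."""
    z = W @ x + U @ h + b                  # all four gate pre-activations at once
    n = h.shape[0]
    i = 1 / (1 + np.exp(-z[:n]))           # input gate
    f = 1 / (1 + np.exp(-z[n:2*n]))        # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))      # output gate
    g = np.tanh(z[3*n:])                   # candidate cell update
    c = f * c + i * g                      # long-term memory update
    h = o * np.tanh(c)                     # short-term (exposed) state
    return h, c

rng = np.random.default_rng(1)
D, H = 4, 3                                # toy input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(6, D)):          # run a 6-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The forget gate f is what lets the cell retain (or discard) information over many steps, which is the mechanism behind the longer attention span discussed above.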
Bahdanau et al. [83] developed an approach that combines the attention mechanism with a DNN and applied it to neural machine translation (NMT). To address the LSTM's drawback when translating long sentences, Sutskever et al. [84] developed a multilayered LSTM built on a limited vocabulary. One LSTM maps the input to a fixed-dimension vector whereas another LSTM decodes the target sequences from the vector. Shao et al. [43] proposed an approach that focuses on generating the correct output by addressing the shortcomings of the seq2seq model, which struggles to generate long responses. The authors proposed a glimpse model with a stochastic beam-search decoding technique. The glimpse model scales the ability to train on bigger datasets (2.3B conversational messages) that were then used to generate responses from a diversified range using MAP-decoding. The approach in [86] improves the alignment of the generated outputs to the inputs by proposing an attention-based seq2seq mechanism. Yin et al. [85] proposed DeepProbe, which understands the input question before translating it to a simpler query form. A recommendation model is then used to ascertain the best match to the query. Table 2.9 presents a summary of the RNN approaches with memory networks to generate a response.
Table 2.9 Summary of RNN approaches with memory networks to generate a response, with their description and issues

NMT with the attention mechanism [84]. Description: adopts a multilayered LSTM to improve the attention span. Response generation: one LSTM layer maps the input while the other layer decodes the output. Issues/drawbacks: little conversational capability; limited vocabulary; trained on a basic conversational dataset only.

seq2seq with glimpse model [43]. Description: attempts to generate the correct output by addressing the shortcomings of the seq2seq model. Response generation: the glimpse model scales the ability to train on bigger datasets, which are then used to generate responses from a diversified range using MAP-decoding. Issues/drawbacks: trained on a basic conversational dataset only.

seq2seq with the attention mechanism [86]. Description: uses a seq2seq model with an attention-based mechanism to generate outputs. Response generation: uses the top hidden vector at the decoder side to generate an accurate response. Issues/drawbacks: trained on a basic conversational dataset only.

DeepProbe [85]. Description: uses an attention-based seq2seq RNN to generate output. Response generation: translates the input question to a simpler query form, which is then used to determine the best match to the query. Issues/drawbacks: little conversational capability; trained on a basic conversational dataset only.
2.6 Critical Evaluation of the Literature
The aim of this section is to discuss current gaps in the existing techniques used by a chatbot to generate a response to a user's questions. These techniques are rule-based, TF-IDF, end-to-end, RNN, DNN and CNN. Although many approaches, techniques and tools are available, a key challenge is to move towards engaging the user in a meaningful conversation specific to a domain. Furthermore, apart from just responding to the questions being asked, a chatbot should include additional features and capabilities such as spelling correction, identifying errors in a user's question, asking for confirmation to resolve the identified errors, sentence structure correction, abbreviation checks, continuous learning from user-bot conversation, and retrieving up-to-date information from the web and saving it for further use. As summarised in Table 2.5, existing chatbots do not address these requirements and hence there is a need to address them in a chatbot that can answer domain-specific user questions.
Section 2.5 presents a summary of the approaches and techniques used in existing dialogue-based chatbots to build and generate a response. The findings from the summary show that the majority of legacy chatbots use a rule-based approach which lacks any capability to engage in a discussion with users. They respond to a user question only if it matches predefined rules or a set of questions in the template. A challenge is that it is impossible to write rules for every possible scenario of questions which a user may ask. Furthermore, making such rules requires a great deal of effort from a subject matter expert. Another issue is that, as the questions are predefined, this approach is not able to engage the user in a long and domain-specific conversation if the user asks a variant of a question. To overcome these rule-based limitations, data-driven methods such as TF-IDF, generation-based and RNN approaches are used in the literature. These approaches consider the importance of a word in a document instead of using rules or QA templates. However, as discussed in Section 2.5, the responses generated by TF-IDF were not appropriate to the question the user asked. Furthermore, this approach also fails to generate a meaningful response and does not have the capability to engage a user in a conversation. RNN or generative-based approaches were also used in the literature. However, as discussed in Table 2.8 and Table 2.9, they are not trained on a domain-specific dataset; rather, they are trained on a basic conversational dataset only, with a limited vocabulary. Furthermore, most of the generated responses were not meaningful and failed to engage the user in a conversation.
As mentioned in Chapter 1, the focus of this thesis is to engage the user in a meaningful conversation by generating appropriate responses in the insurance domain. To achieve this aim, IntelliBot should understand the question the user is asking before selecting a suitable strategy for generating appropriate responses. Among multiple generated responses, IntelliBot should then choose the best possible answer before conveying it to the user. To the best of the author's knowledge, and as shown from the summary of the existing approaches in the literature in the previous sections, while there have been a number of attempts in this area, no prior framework exists which addresses all the discussed shortcomings in one framework. Hence, this is a gap in the literature that needs to be addressed. In Chapter 3, this gap is formally defined as the problem to be addressed in this thesis.
2.7 Conclusion
This chapter discusses the previous studies in the literature, describes the uses and drawbacks of the existing chatbots and classifies them into different categories. It also explains the different techniques used in dialogue-based chatbots, and then summarizes the issues within the existing approaches from the perspective of a dialogue-based chatbot. The identified issues will be formally defined as the problem to be addressed in this thesis.
CHAPTER 3
“Research is to see what everybody else has seen, and to think what nobody else has thought.”—Unknown.
PROBLEM DEFINITION
3.1 Introduction
The literature study in Chapter 2 presented the drawbacks of the existing chatbots and the different techniques used in dialogue-based chatbots. In this chapter, the drawbacks which have been identified are defined as the problem addressed in this thesis. This chapter is organized as follows: Section 3.2 defines the key terms used in this chapter. Section 3.3 explains the shortcomings in the existing chatbots to generate meaningful responses. Section 3.4 breaks down the gaps in the literature in terms of the research issues to be addressed to solve them. Section 3.5 discusses the research methodology that is followed in this thesis to solve the research problem. Section 3.6 concludes this chapter.
3.2 Key Terms
Dialogue-based System is an AI software system that is intended to converse with humans in a meaningful way. It addresses features of human-to-human dialogue and aims to integrate them into dialogue systems for human-machine interaction. It is also referred to as a chatbot.
Natural Language Processing (NLP) is an AI technique that enables an intelligent system to understand humans' natural language. The objective of NLP is to read, decipher, understand and make sense of human languages in a manner that is valuable. It performs tasks like translation, grammar checking and topic classification.

Parts of this chapter have been published in [20].
Domain-specific means specialized to a particular application domain or a specific context in a problem domain.
Knowledge-based Database (KBDB) is a database system that uses database concepts and models to store and retrieve knowledge. It typically links and integrates all available knowledge sources, including explicit and inexplicit knowledge. The objective of a KBDB is to make the most relevant knowledge available at the optimal time to enable appropriate decision-making.
3.3 Existing Gaps in Domain-oriented Dialogue-based Chatbots Which Aim to Engage with Customers in the Service Industry
As discussed in Chapters 1 and 2, although there are various domain-oriented dialogue-based chatbots in the service industry, the literature highlights their shortcomings in that they are not able to respond to users' complex queries, nor can they engage users in a long and meaningful conversation. These drawbacks need to be addressed in order to develop a chatbot that can converse naturally with customers in a way which is indistinguishable from a human. The following sub-sections explain the drawbacks that form the problem addressed in this thesis.
3.3.1 Drawback 1: Use of templates to map questions and answers to respond to user questions
Legacy chatbots use a series of predefined rules to map pairs of questions and answers. This is done in anticipation of what the user will ask, and pattern matching techniques are then used to check whether the user's question matches the predefined rules or questions. Devising these rules requires a great deal of effort from a subject matter expert, and it is impossible to write rules for every possible scenario of questions which a user may ask. Although the rule-based technique is relatively straightforward, it has a less flexible conversation flow and is not efficient in answering questions. The obstacles of rule-based techniques are that they cannot learn on their own, nor can they generate responses for questions that are not defined in the template. Rather, they only provide answers that are defined in the templates. Additionally, this technique leads to conflict when more than one rule satisfies its conditions for a given question.
3.3.2 Drawback 2: Inability to respond to a user’s complex queries
Chatbots are revolutionizing the way organizations interact with their customers. However, as mentioned in Section 3.3.1, existing chatbots can only handle simple queries which are predefined in the template and fail to manage complex queries. Thus, it is crucial for organizations to develop chatbots that address this gap, to create a positive image with customers by keeping them engaged. The existing chatbots in the literature do not make use of advanced technologies such as NLP frameworks to create responses or analyse user questions. As they use only keywords, they cannot understand the facts and context of the conversation, which results in them communicating with all users in the same way. Hence, these chatbots are unable to build a query to find appropriate responses to a user's complex questions or to speed up the time taken to answer the user.
3.3.3 Drawback 3: Deciding which strategy to select according to the question asked to generate a meaningful and domain-specific response
The existing chatbots in the literature use either a template-based, knowledge-based or neural network-based strategy for general user-bot conversation. The working style and complexity of each response generation strategy are different. A template-based strategy can be used to answer simple questions, while neural networks can be used to answer complex questions. So, on one hand, combining the strategies to generate a response can make the chatbot more efficient than using an individual strategy; on the other hand, there is a need to instil a decision-making ability in the chatbot so it can decide what strategy to use to generate a meaningful response. Furthermore, most existing chatbots do not take into consideration domain-oriented QA and thus are not suitable for answering domain-specific questions.
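One simple way to instil such decision-making is a cascade that escalates from the cheapest strategy to the most expensive one. The sketch below is a hypothetical illustration, not IntelliBot's actual design; the template set, the domain keyword list and the escalation order are assumptions for demonstration.

```python
def choose_strategy(question, templates, domain_terms):
    """Pick a response-generation strategy by escalating complexity:
    exact template match first, knowledge-based lookup next if the
    question mentions a domain term, generative model as a fallback."""
    q = question.lower().strip("?! .")
    if q in templates:                     # simple, predefined question
        return "template"
    if any(term in q for term in domain_terms):
        return "knowledge-based"           # domain question with known concepts
    return "generative"                    # everything else

# Toy configuration (illustrative only).
templates = {"hello", "hi", "thank you"}
domain_terms = {"premium", "policy", "claim"}
```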
3.3.4 Drawback 4: Unable to engage users in a meaningful conversation
As previously discussed, the aim of domain-oriented chatbots in the service industry should be to engage users in long and meaningful conversations. Existing approaches in the literature use techniques such as the TF-IDF approach, which is based on a frequency distribution and uses a bag-of-words with a limited document size of up to 50k words to generate a response [66]. This limits the conversational ability of the chatbot to generate meaningful answers. Additionally, existing approaches do not capture text position, semantics and co-occurrences in different articles [65]. Other approaches, such as TF-IDF with NLP and random forest, overcame these problems and showed significant improvement in feature extraction and sentence classification [66]. However, even though they can answer a user's simple queries, these chatbots failed to engage the user in conversation. End-to-end approaches were able to partially overcome this problem, but training the model required a vast amount of labelled data. Moreover, they were built on a single neural network and required a lot of time to generate an output. To engage users in meaningful conversation, the chatbot needs to ask the user relevant questions and provide suggestions and recommendations. In order to do this, the chatbot needs to consider the contextual information, topics and previous conversational data for every query to generate a meaningful response. Existing approaches do not enable chatbots to do this for all conversations and thus are not able to engage users in a meaningful conversation.
3.3.5 Drawback 5: Unable to identify errors in user questions
Each language has a different sentence structure, and thus the structure of texts, punctuation and use of spaces differ between them. Chatbots need to be able to understand this to make sense of a question according to its context. Furthermore, when dealing with text-based chatbots, users may use shorthand or make grammatical mistakes when writing their questions. If the chatbot is not able to understand or correct these, it will not understand the user's question and will not be able to generate an appropriate response. All types of errors, such as spelling errors, syntax errors, punctuation errors, semantic errors and non-word errors, need to be addressed so that the chatbot can understand the meaning of the user's question. To correct such errors, the chatbot should first identify them, then notify the user of the mistake and recommend it be corrected before generating a response. Existing chatbots do not take such an approach and thus are unable to generate meaningful responses if the user has made an error.
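A minimal form of the error identification discussed above is non-word detection with an edit-distance suggestion. The following sketch is illustrative only; the toy vocabulary and the distance threshold of 2 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, vocabulary, max_dist=2):
    """Flag a word not in the vocabulary and propose the closest known term,
    or None if nothing is within max_dist edits."""
    if word in vocabulary:
        return word, None
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return word, (best if edit_distance(word, best) <= max_dist else None)

# Toy domain vocabulary (illustrative only).
vocab = {"premium", "policy", "claim", "deductible"}
```

In a chatbot, the suggestion would be shown to the user for confirmation ("Did you mean policy?") before a response is generated.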
3.3.6 Drawback 6: Unable to learn continuously from a user-bot conversation
As users interact with the chatbot, new patterns or information can be observed from their conversations. By examining the conversations, chatbots can discover new information and store it in a KBDB to generate future solutions. However, existing chatbots do not analyse and extract patterns from user-bot conversations and fail to incorporate the new information into the conversational flow. For example, if a chatbot knows how to answer a question like "how do I add another user?", it can automatically recognize "where do I add another user?" as having the same meaning. Similar phrasings can automatically be added to its knowledge bank so that future questions following the second form can be answered using the same response as the first question. By doing so, chatbots will learn and automatically improve the quality of the support they offer to their users. Existing chatbots do not have this ability.
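The paraphrase recognition in the example above ("how do I add another user?" vs. "where do I add another user?") can be approximated with a bag-of-words cosine similarity over the stored knowledge bank. This sketch is illustrative only; the similarity threshold of 0.6 and the toy knowledge base are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def match_known(question, kb, threshold=0.6):
    """Return the stored answer whose question is most similar, if any."""
    q = Counter(question.lower().split())
    best, score = None, 0.0
    for known, answer in kb.items():
        s = cosine(q, Counter(known.lower().split()))
        if s > score:
            best, score = answer, s
    return best if score >= threshold else None

# Toy knowledge bank (illustrative only).
kb = {"how do i add another user": "Open Settings > Users and choose Add."}
```

Because "where do I add another user" shares five of its six tokens with the stored question, it clears the threshold and reuses the same answer; an unrelated question falls below it and would be routed to another strategy.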
3.4 Research Problem Addressed in this Thesis
To solve the aforementioned drawbacks of dialogue-based, domain-oriented applications, the problem to be addressed in this thesis is defined as follows:
Develop and validate IntelliBot with a modular-based framework by which domain-oriented chatbots can engage with users in natural language and address their questions related to the insurance domain. In generating a response, IntelliBot should have multiple response generation strategies which it can use to address questions of different levels of complexity. Furthermore, IntelliBot should identify and address any grammatical errors or shorthand text that users may use so that an appropriate response to the user’s question is generated.
To address the aforementioned problem, the following sub-problems have been identified that need to be addressed:
Sub-problem (1) – Develop IntelliBot’s conceptual model so it can engage with the user and address their queries related to the insurance domain.
The objective of this sub-problem is to develop the conceptual model for IntelliBot with all the sub-components that enable it to perform tasks that range from
understanding the user’s input to generating an appropriate response. The required sub-components should have NLP capabilities to understand the user’s question, context, intent and then accordingly generate a response. In Chapter 4, the solution to this sub-problem is proposed.
Sub-problem (2) – Develop different response generation strategies that IntelliBot can use to answer a user’s question according to its complexity and the process to train them.
The aim of this sub-problem is to design IntelliBot’s conceptual architecture by building four response generation strategies, namely the template-based strategy, knowledge-based strategy, internet retrieval strategy and generative-based strategy. These strategies are responsible for generating meaningful responses and engaging the user in human-like conversation. In Chapter 4, the high-level design of these four strategies is proposed and in Chapter 5, the process by which responses are generated through each is explained in detail.
Sub-problem (3) – Develop the detailed working of the different sub-components of IntelliBot that assist it to process and understand the user’s input along with correcting the grammatical errors in the user input and the chatbot’s generated output.
As previously discussed, IntelliBot’s architecture is modular and has five response generation components, namely the input processing unit, language understanding unit, strategy selection unit, response generation unit and response analyser unit. The purpose of this sub-problem is to develop the working of these five sub-components, which enable IntelliBot to respond to a user’s query. The Language Understanding Unit (LUU) component understands a user’s question by breaking it into smaller pieces using various techniques such as tokenization, abbreviation checking, POS tagging, grammar checking, stop-word removal, lemmatization, entity extraction and punctuation removal. The working of each of these is detailed in Chapter 6. Users may use shorthand or make grammatical mistakes in their inputs. The aim of the grammar error correction component is to correct user questions so that IntelliBot can understand them and generate an appropriate response. Chapter 6 explains the six types of errors, namely structure errors, syntax errors, punctuation errors, semantic errors, spelling errors and non-word errors, which IntelliBot considers when generating a response.
Sub-problem (4) – Develop an approach to collect insurance domain-specific data required to train IntelliBot.
The response generation strategies in sub-problem 2 require domain-specific data to train IntelliBot. The goal of this sub-problem is to collect insurance domain-specific data on which training will be conducted so that IntelliBot understands domain-specific terms and keywords and is able to generate an appropriate response using the response generation strategies. A data collection strategy is developed to collect data from various sources, such as the knowledge database, the ANZ and Commonwealth Bank websites and the Cornell movie dialogue corpus. As these raw data are not directly suitable for training IntelliBot, a data preparation technique also needs to be developed. This process is explained in detail in Chapter 7.
Sub-problem (5) – Compare and validate the outputs of IntelliBot with existing chatbots from the literature to demonstrate IntelliBot’s accuracy and superiority in engaging with the users while answering their questions.
The objective of this sub-problem is to evaluate IntelliBot’s generated responses to user queries against three publicly available chatbots. To do this, all responses will be recorded and then evaluated by experts to determine their accuracy in relation to the questions asked. Accuracy is measured using the F1 score and Cohen’s kappa metrics. Chapter 8 presents the adopted approach for comparing and validating IntelliBot’s responses in detail.
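The two evaluation metrics can be computed directly from expert ratings. The sketch below uses invented sample ratings purely for illustration; the helper names and data are assumptions, not the thesis’s actual results.

```python
from collections import Counter

def f1(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n            # observed
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    p_e = sum(c1[label] * c2[label] for label in labels) / (n * n)   # expected
    return (p_o - p_e) / (1 - p_e)

# Illustrative ratings of ten chatbot responses by two experts (1 = correct)
expert1 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
expert2 = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(round(cohens_kappa(expert1, expert2), 3))  # → 0.615
```

A kappa above roughly 0.6 is conventionally read as substantial inter-rater agreement, which is why the metric complements the raw F1 score in this kind of evaluation.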
3.5 Adopted Research Methodology to Solve the Thesis Problem
This thesis addresses the aforementioned research issues and proposes a flexible, modular-based framework for a domain-oriented chatbot that understands natural language to generate responses. To ensure that the research is conducted systematically and built on well-tested techniques and tools, a systematic approach aligned with data science and machine learning standards is followed. Research approaches can be grouped into two categories: 1) the social science approach and 2) the science and engineering approach.
The social science approach observes and analyses human behaviour using empirical methods of research, and a set of hypotheses is formulated based on the observations. The aim of this approach is to accept or reject the hypotheses [87]. A hypothesis is an educated guess about what the researchers expect to find. Social science research follows either a quantitative, qualitative or mixed approach [88]. Quantitative research is exploratory. It focuses on numerical and unchanging data that can be used to classify features, predict future results and construct statistical models in an attempt to explain what is observed [89]. The qualitative approach, on the other hand, is descriptive and concerns phenomena that can be observed but not measured. The results of qualitative research can vary according to the skills of the observer, though experts tend to interpret them in a similar manner. While the social science research approach does not develop new technology, it thoroughly evaluates different aspects of existing methods [90].
The science and engineering approach analyses data, makes a prediction and then validates it against observational data [91]. The aim of this approach is to devise scientific theories that explain phenomena and to develop a solution that addresses the identified problems. Science and engineering research is conducted using either a qualitative or a mixed approach [92]. It is typically experimental and depends on architectures, techniques, tools, methods, concepts and the collection of observational data. Its goal is to make something work, which suits the problem to be addressed in this thesis. Thus, this thesis adopts the science and engineering approach, as shown in Figure 3.1. The research approach adopted in this thesis is divided into four phases: theoretical study, addressing the problem, solution design, and experiment. A brief description of each phase is presented in the next sub-sections.
Fig. 3.1 Research methodology adopted in this thesis to solve the research problem
3.5.1 Theoretical study
This is the initial phase of the research, in which broad study is required in different areas of AI, namely ML, DNNs and RNNs, to identify the gaps in existing research. This includes developing a taxonomy of chatbots and examining past and current trends before identifying the drawbacks of existing chatbots. The aim of this phase is to obtain a good foundation of knowledge; to this end, Chapter 2 of this thesis reviewed the previous literature from journal and conference articles related to these areas.
3.5.2 Addressing the problem
Based on the literature review from the previous phase, the purpose of this step is to define the problem the thesis aims to solve and understand the goal of the problem. This was addressed in this chapter of the thesis.
3.5.3 Solution design
In this phase, the aim is to design new solutions and concepts to solve the problem defined in the thesis. In Chapter 4, this thesis proposes the framework for IntelliBot, a dialogue-based chatbot that generates meaningful responses to engage users in continuous conversation. The different components of IntelliBot required to achieve this goal are defined in Chapter 4. In Chapter 5, the four response generation strategies developed to engage users are explained. Chapter 6 details the spelling correction and other components required to generate a response. Chapter 7 explains the process of training the deep bidirectional recurrent neural network (DBRNN).
3.5.4 Experiment
The purpose of this phase is to test the accuracy and performance of the IntelliBot framework in generating a response in the insurance domain. The working of IntelliBot is compared with the output of three publicly available chatbots in Chapter 8. The quality of the generated responses is measured using F1 scores and Cohen’s kappa metrics.
3.6 Conclusion
This chapter explained the research problem addressed in this thesis. It discussed the shortcomings of dialogue systems and their inability to generate meaningful responses to domain-specific QAs. The different research issues that need to be addressed to solve the research problem were then presented, followed by the details of the research methodology adopted in this thesis. The next chapter provides an overview of the proposed IntelliBot solution.
CHAPTER 4
“Basic research is what I am doing when I don’t know what I am doing” —Rocket scientist
SOLUTION OVERVIEW
4.1 Introduction
As discussed in Chapter 1, this thesis proposes IntelliBot, a domain-specific chatbot for the insurance industry. The aim of this chapter is to describe the design of IntelliBot’s architecture as a modular, scalable framework that supports continuous natural language conversation and solves user queries specifically in the insurance domain. The proposed architecture facilitates building IntelliBot on various response generation strategies, including a seq2seq model in a deep bidirectional recurrent neural network (DBRNN) with self-learning capabilities. These strategies assist IntelliBot to generate meaningful responses and engage with the user in human-like conversation.
The structure of this chapter is as follows. Section 4.2 defines the key terms needed to introduce IntelliBot. Section 4.3 defines the basic requirements that IntelliBot, as a domain-specific chatbot designed to answer users’ questions, should meet. Section 4.4 categorises the process of building IntelliBot’s framework using a methodological approach with four different steps. Section 4.5 presents the design of IntelliBot’s various response generation components. Section 4.6 concludes the chapter.
4.2 Key Terms
A DBRNN is a Deep Bidirectional Recurrent Neural Network, in which two hidden layers run in opposite directions to a single output, allowing the network to receive information from both the previous and next states. For example, to predict a missing word in a sequence, it looks at both the left and right context. The RGC is the Response Generation Component, which is responsible for understanding the user’s question and generating responses. This component uses various response generation models and different NLP techniques to understand what the user is asking and construct responses to it. An Entity is anything having an existence. DOM parsing uses the Document Object Model to extract information from a tree-like structure such as HTML or XML.
4 Parts of this chapter have been published in [20].
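To illustrate the DOM parsing term above, the following sketch extracts fields from a tree-structured document using Python’s standard library. The document, tag names and attribute are invented for the example only.

```python
import xml.etree.ElementTree as ET

# An invented insurance-policy document, used only to show tree-structured extraction.
doc = """
<policy>
  <name>Gold Credit Card Insurance</name>
  <premium period="monthly">12.50</premium>
</policy>
"""

root = ET.fromstring(doc)
name = root.find("name").text          # element text
premium = root.find("premium")
print(name)
print(premium.get("period"), premium.text)  # attribute value and element text
```

The same traversal idea applies when a chatbot scrapes HTML pages, where an HTML parser walks the document tree and pulls out the text nodes of interest.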
4.3 Requirements of a Domain-specific Chatbot
A key requirement for service-based chatbots is that they engage with customers and answer their queries correctly [93]. These queries can span a wide spectrum of the particular service domain. This is important, as research shows that 91% of unhappy customers will not engage with the business again [16]. To keep customers engaged, the chatbot needs to have meaningful dialogue abilities rather than merely providing a yes or no or a short response. Dialogue abilities enable the chatbot to converse with the user using the terminology of the domain. Taking these requirements into consideration, this thesis defines a domain-specific chatbot as follows:
Definition: A domain-specific chatbot for the service industry is one which has the conversational capability to engage with the user while answering their queries. In doing so, it should be trained on a domain-specific dataset so that it can present meaningful responses that contain the semantically correct information.
The important terms in the above definition are underlined to stress their importance in meeting expectations. The meaning of each term is explained in the following.
• Requirement 1 (R1) Conversational capability. Conversations are the core requirement of a chatbot. However, to be engaged with the users, rather than merely providing a short yes or no answer, the chatbot should generate accurate and meaningful responses by identifying the topic intent, entities and context [94].
For example, in relation to the questions “What is the cash advance rate of the credit card?”, “What is the interest-free period of my credit card?” and “What is the annual fee of a credit card?”, the user might be expressing an intent to enquire about their credit card’s rates and fees. Before answering, the chatbot should ask “which credit card are you referring to?”, as the user did not specify this. By identifying the intent and entity, the chatbot can engage the user in a long conversation. Furthermore, the length of the chatbot’s response to a query is also a factor in keeping the user engaged and happy. Research has shown that a short answer often leaves the user dissatisfied [95]. For example, in response to the user’s inputs ‘I am not feeling well’ and ‘I am sad’, a chatbot using its conditional response library would simply reply ‘How can I help you?’ to both. A human being, however, would reply ‘How can I help you? Do you need medical help?’ and ‘I am sorry to hear that. Why are you sad?’ respectively. The human response shows empathy, which chatbots should be able to replicate so that their responses relate more to the user. This is supported by [96], which states that chatbots assisting in customer support should not appear too serious and transactional, as this does not inspire continued use. So, chatbots need to keep customers engaged and have conversational abilities rather than just providing a yes or no or a short response.
• Requirement 2 (R2) Present semantically correct information. Semantics helps to bring the correct meaning to a word according to the context of a sentence. For instance, the word “create” can mean build, make, construct or compose. When the context of the sentence is database creation, the word “build” is more appropriate than “compose”. In an insurance scenario, for the question “when will the policy cover end?”, the word “policy” refers to an insurance plan, not to a strategy or approach as it might in other domains. Thus, a chatbot should have a vocabulary such that it is able to understand and generate responses that are syntactically, pragmatically and semantically correct according to the context of the information presented [21].
• Requirement 3 (R3) Present meaningful responses. When generating a response, the chatbot should not only give a response that is semantically correct, it should also provide sufficient detail for easy understanding [97]. For example, if the user says, “My gold credit card should have arrived two days ago, but it has not arrived yet”, the chatbot should give a meaningful response such as “Let me check on the delivery status from the carrier. Give me just a moment”, rather than the standard response “the standard shipping time is 3-5 business days”. Similarly, for the question “Why are cash advances not permitted on my AMEX card?”, a meaningful response from the chatbot would be “You’re holding a corporate card. As per corporate card policy, you’re not allowed cash advances” rather than just “cash advances are not allowed on AMEX”.
• Requirement 4 (R4) Trained on a domain-specific dataset. To build a domain-oriented chatbot, it needs to be trained on a domain-specific dataset rather than just a simple dialogue dataset [94] such as the Cornell movie dialogue or Twitter datasets. This is important for the chatbot to understand domain-specific terms, information and workflows. Training on a simple dataset, such as the Cornell movie dialogue dataset, will enable the chatbot to engage with the user, but only in basic conversation; it will not be able to respond to domain-related queries.
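The clarification behaviour described under R1 can be sketched as a simple slot-filling check: if the user’s question names the credit-card intent but no specific card, the bot asks a clarifying question before answering. The intent rule, card names and response wording below are illustrative assumptions, not IntelliBot’s actual logic.

```python
KNOWN_CARDS = {"gold", "platinum", "amex"}  # assumed entity list

def respond(question: str) -> str:
    """Ask a clarifying question when a required entity (the card) is missing."""
    words = set(question.lower().replace("?", "").split())
    named = words & KNOWN_CARDS
    if "credit" in words and "card" in words and not named:
        return "Which credit card are you referring to?"
    if named:
        return f"Looking up details for your {named.pop()} card..."
    return "How can I help you?"

print(respond("What is the annual fee of a credit card?"))
# asks for clarification, because no specific card was named
print(respond("What is the annual fee of my gold credit card?"))
# proceeds, because the entity 'gold' fills the required slot
```

Tracking which slots are filled and which are missing is what lets a chatbot sustain the multi-turn conversations the requirement describes.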
This thesis proposes a framework for IntelliBot so that these requirements are achieved. In the next section, the methodological approach by which machine learning engineers, data scientists and chatbot designers can design a chatbot with these requirements is explained.
4.4 Methodological Approach for Designing and Building a Domain-specific Chatbot
To achieve the aforementioned requirements (R1—R4), various design processes classified into different tasks need to be implemented. To make IntelliBot practical and implementable, the process of building IntelliBot in this thesis is divided into four tasks, namely: identify components, design conceptual framework, develop & train AI model, and experiment & validation, as shown in Fig. 4.1. The following sub-sections discuss the objective to be achieved in each task.
Fig. 4.1 Methodological approach
4.4.1 Identify components
The objective of this task is to identify the different components required to build IntelliBot. These components are as follows:
• Interface component. The interface component links the chatbot with the user through an app or webpage. As shown in Fig. 4.2, this component is responsible for capturing information from the user, checking input validation and forwarding this information to the response generation unit. Finally, it conveys the response to the user.
Fig. 4.2 Components required for building a chatbot application
• Response generation component (RGC). This component is responsible for understanding the user’s question captured by the interface component and generating the responses to be given to the user. It uses various response generation models and different NLP techniques to understand what the user is asking and develop a response to it.
• Data layer component. As shown in Fig. 4.2, this component is responsible for holding both the generic and domain-specific information required to answer user queries. Various types of data, for example user profiles, questions and answers categorized into different topics, user-bot conversation history and domain-specific knowledge such as credit card insurance, are required for IntelliBot to work effectively. This component is connected with external knowledge or data sources to produce more meaningful answers. In this thesis, a document-oriented database such as MongoDB and a relational database such as MySQL are used to store the required information.
• Integration layer component. This component is responsible for linking the chatbot to existing systems, platforms and databases for it to access and retrieve the required
information, such as workforce management, a third-party service provider etc. Integration with existing components should be plug & play so that it minimizes development effort and improves system performance and productivity. Another integration component is an authentication layer that manages system security, identifies individual users and protects their sensitive information, which is verified by an authentication process. It enables access to resources to be either granted or denied to the user.
As this thesis focuses on developing those components which enable a chatbot to respond to a user query, it focuses on the Response Generation Component (RGC). In other words, it is assumed that the chatbot has the required interface, access to all information sources and integration with the different systems needed to obtain the data. From this perspective, Fig. 4.3 illustrates the Neural Dialogue Manager (NDM) of IntelliBot. The NDM, which sits inside the RGC, has six different modules to generate responses and engage a user, namely the input processing unit, language understanding unit, strategy selection unit, response generation unit, response analyser unit and context tracking unit. The objective of each unit is explained briefly as follows:
Fig. 4.3 Components required in a response-generating chatbot application
• Input Processing Unit (IPU): This unit processes the user’s input, which can be in either text or voice (speech) form, using NLP techniques. In this thesis, we limit ourselves to considering only the user’s text as input. The goal of this unit is to clean the user’s input text by applying various techniques such as lowercasing, stopword removal and word segmentation for better knowledge discovery before sending it to the next unit. Further details of the working of the IPU are explained in section 4.5.2.
• Language Understanding Unit (LUU): This unit understands the user’s question by taking the word segments from IPU as its input. Various techniques, such as tokenisation, abbreviation check, POS tagging, grammar check, lemmatisation, named entity recognition, context identification and query classification are needed to understand the user’s question. Further detail of the working of LUU is explained in section 4.5.3.1.
• Strategy Selection Unit (SSU): This unit is responsible for deciding which conversational strategy to select in order to generate a response and engage users in continuous conversation. Four strategies have been developed for IntelliBot to use, namely the template-based strategy, knowledge-based strategy, internet retrieval strategy and generative-based strategy. An AI selection process which sequentially determines which strategy fits best according to the specifics of the user’s question is adopted. Further detail of the working of the SSU is explained in section 4.5.3.2.
• Context Tracking Unit (CTU): This unit is used in different stages of IntelliBot’s working. It is used to undertake tasks such as determining the intent of the user’s questions, forming the user’s query, performing system actions and handling errors. Further detail of CTU’s working is explained in section 4.5.3.3.
• Response Generation Unit (RGU): This unit generates responses to the user’s query by accessing the required data from multiple data sources according to the selected response generation strategy. As the working of each strategy is different, appropriate components are needed to generate a meaningful response. Further detail of the working of RGU is explained in section 4.5.3.4.
• Response Analyser Unit (RAU): This unit analyses the response generated from the RGU to ensure that it answers the user’s question. Techniques such as response filtering, grammar checking and answer scoring are needed to achieve the objective of this unit. Further detail of the working of RAU is explained in section 4.5.3.5.
4.4.2 Design conceptual framework
The objective of this task is to design in detail the conceptual model of the different units of the RGC introduced above. The design process should follow software engineering principles, AI methodology and the latest development tools that will assist in meeting the objectives of the chatbot. The design of each unit will result in various independent modules that need to be integrated to provide all the necessary services. The details of the various components that comprise the different units of the NDM are explained in Section 4.5.3.
4.4.3 Develop and train AI model
The aim of this task is to build and train IntelliBot so that it generates the most appropriate answers to users’ queries. The chatbot first needs to be trained to converse in the English language and then trained on domain-specific information. IntelliBot was trained to converse in English using the Cornell movie corpus dataset and to answer insurance-related questions using the insurance QA dataset. This training was performed using the TensorFlow seq2seq model with an attention mechanism in a DBRNN. Chapters 6 and 7 discuss the process of training IntelliBot in detail.
4.4.4 Experiment and validation
The purpose of this phase is to assess the effectiveness of IntelliBot in an actual working environment. To do this, this thesis conducts two empirical sets of evaluations of IntelliBot’s output and compares them with the outputs of various publicly available chatbots. Two experts examined the responses of the chatbots and rated them. Metrics such as the F1 score and Cohen’s kappa coefficient are used to measure the efficiency and effectiveness of each chatbot. Chapter 8 discusses the process of conducting the experiments and the validation in more detail.
4.5 Proposed Conceptual Model of IntelliBot’s Response Generation Component
Fig. 4.4 shows the high-level conceptual framework of IntelliBot’s RGC. As seen from the figure, there are multiple units that need to work together for the RGC to generate a response. A brief explanation of each unit is presented in the next sub-sections.
Fig. 4.4 Conceptual framework of IntelliBot
4.5.1 User emulator
The user emulator uses an interactive user interface to connect the user and the chatbot. It receives the user’s input (user question) in natural language and displays the response (output) to the user’s query. On the input side, it forwards the user query to the IPU and LUU over the authentication layer. Both a web browser and a mobile device can act as the user emulator, as shown in Fig. 4.5. Technologies such as AngularJS, HTML5, Bootstrap and the Ionic framework are used to design the user emulator. A RESTful API is used to exchange messages between the user and the AI model.
Fig. 4.5 Mobile and web Interface of IntelliBot
4.5.2 Input Processing Unit (IPU)
The Input Processing Unit (IPU) takes the user’s input before queuing and pre-processing it. During pre-processing, the objective of the IPU is to validate the user’s input to ensure that it does not violate the pre-defined rules; for example, if the user presses ‘enter’ without any text, the input fails validation. Tasks such as removing extra whitespace, extra lines and non-ASCII characters are undertaken in this process. Furthermore, all characters are converted to lowercase and numbers are converted to their word equivalents for better knowledge discovery. Input that passes the validation rules is forwarded to the LUU.
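The pre-processing steps described above can be sketched as follows. This is a simplified illustration under stated assumptions (digit-by-digit number conversion, invented rule and function names), not the IPU’s actual code.

```python
import re

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def is_valid(text: str) -> bool:
    """Sample validation rule: reject empty or whitespace-only input."""
    return bool(text.strip())

def preprocess(text: str) -> str:
    """Lowercase, drop non-ASCII characters, spell out digits, collapse whitespace."""
    text = text.lower()
    text = text.encode("ascii", errors="ignore").decode()      # remove non-ASCII
    text = "".join(DIGIT_WORDS.get(ch, ch) for ch in text)     # digits -> words
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

print(preprocess("  Pay 3 instalments  NOW!"))  # → "pay three instalments now!"
```

A production system would convert whole numbers rather than individual digits; the per-digit mapping here keeps the sketch short.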
4.5.3 Neural Dialogue Manager (NDM)
As discussed in section 4.4.1, the NDM is the core of IntelliBot’s RGC which is responsible for end-to-end input processing and response generation. As shown in Fig. 4.3, NDM has five units namely: the Language Understanding Unit (LUU), Strategy Selection Unit (SSU),
Response Generator Unit (RGU), Response Analyser Unit (RAU) and Context Tracking Unit (CTU). Fig 4.6 shows the working of these units in more detail. A brief description of the working of each unit is explained next.
Fig. 4.6 Neural Dialogue Manager (NDM) of IntelliBot
4.5.3.1 Language Understanding Unit (LUU)
The Language Understanding Unit (LUU) is responsible for various tasks that aim to understand the meaning of the user’s question. It therefore acts as the glue between the user’s input and the other units of the NDM. It receives the user’s query from the IPU and parses it into a semantic frame, automatically classifying the query with intents and domain-specific terms and filling in slots to form the semantic frame. The objective is to obtain the conditional probability of the user’s word sequence [98]. The following tasks are performed by the LUU:
• Tokenization: Tokenization is the technique of chopping a sequence of input text into words, symbols, phrases or other text elements, each known as a token. For example, if the user input is “what is the monthly premium?”, there will be six tokens after the tokenization process: “what”, “is”, “the”, “monthly”, “premium”, “?”. This is a mandatory step before any kind of NLP process such as parsing, POS tagging, entity extraction, grammar checking and lemmatization.
• Abbreviation: this is a shorthand form of a word or phrase, used as a symbol for the full form [99], which often comprises the initial letters of a collection of words. It should not be considered a spelling error. For example, in a sentence such as “Contact you ASAP”, ‘ASAP’ is not a dictionary word, so the chatbot should identify it as an abbreviation whose full form is “As soon as possible”. Correct recognition of abbreviations and their full forms is very important for understanding a user’s query and context and for correcting grammar.
• POS tagging: a part-of-speech (POS) tag explains how a word is used in a sentence. POS tagging is the process of assigning a part-of-speech marker, such as noun, verb, adjective, adverb or preposition, to each word in the user’s input sequence. Considering the sentence “book the flight”, book is a verb, but in the sentence “give me the book”, book is a noun. POS tagging is a disambiguation task in which the goal is to find the proper tag for a given word. It is a very important step in understanding a sentence and extracting its relationships and grammatical or lexical patterns.
• Grammar check: grammar checking is the task of detecting errors in grammar rules, sentence structure and spelling that are quite commonly made by users in text-based input. Text containing grammatical errors could lead to the generation of incorrect responses. Therefore, it is essential to be able to identify and correct these errors. Grammar checking enables the automatic detection and correction of any faulty, unconventional or controversial usage in the underlying grammar.
• Lemmatization: this is a morphological analysis of words. It aims to remove inflectional endings only and transform a word back to its common base or dictionary form, called a lemma. For example, the base form of the word “studies” is “study”, the base form of “boys” is “boy” and the base form of “running” is “run”.
• Entity extraction: this is an information processing technique that identifies and extracts named entities and classifies them under various predefined classes such as PERSON, ORGANIZATION, DATE, LOCATION etc. Entity extraction techniques automatically pull proper nouns from text and determine their common entity tags, such as person, location, organization and event. For example, in the sentence “Nuruzzaman studies at UNSW”, entity extraction identifies “Nuruzzaman” as a person and “UNSW” as an organization.
• Punctuation removal: punctuation marks and stopwords are not necessary for the AI model to predict and generate a response. Punctuation marks are special characters such as ? @ # % & * ! and so on. Stopwords include am, is, was, and, the, an, a, he etc. For example, IntelliBot processes the input “Hello!!, how? are you?” as “Hello how are you” after punctuation removal.
The working of the aforementioned tasks of LUU is explained in detail in Chapter 7.
4.5.3.2 Strategy Selection Unit (SSU)
The central part of the NDM is the strategy selection unit (SSU). After the IPU and LUU complete the NLP processes, the SSU identifies and selects the best strategy to generate a response. As shown in Fig. 4.7, four possible strategies are proposed in this thesis for IntelliBot to choose from when generating a response: template-based, knowledge-based (KB), Internet retrieval (IR) and generative-based. Each strategy has a different data structure, matching technique and process for generating responses [100]. The neural dialogue manager (NDM) selects the most appropriate strategy, i.e. the one that generates not only semantically correct but also meaningful responses to the user’s queries throughout the conversation. An AI selection process is adopted which sequentially determines which strategy best fits the selection criteria. The process by which IntelliBot selects a strategy to generate a response is briefly explained as follows.
Fig. 4.7 Selection policy of AI conversational strategies
Template-based Strategy: This strategy is a collection of predefined rules and it is given the first priority in answering the user’s question. This strategy encodes human knowledge into the form of templates. As shown in Fig 4.8, after LUU performs the grammar check on the user input, pattern matching is used to determine if the user’s question matches a template. If it does, then the appropriate response to the matched template is IntelliBot’s answer to the user. In other words, the template-based strategy matches the entity of the question identified by the LUU together with the AIML rules as shown in Fig. 4.8. If the user input is recognized, the template is retrieved, and the RAU presents the response to the user. If there is no match between the entity of the user’s question with what is defined in the template, IntelliBot uses the knowledge-based strategy to ascertain whether it can be chosen or not. The working of the template-based strategy is explained in detail in Chapter 5.
Fig. 4.8 High-level workflow of template-based strategy
Knowledge-based Strategy: The knowledge-based strategy searches the existing KB database (KBDB) to answer the user’s question. As shown in Fig. 4.9, the query engine forms a query to generate a model of the user’s scenario and determine the facts necessary to generate a response; if those facts are in the KBDB, it accumulates them into a structure and conveys it to the RGU. The RGU determines the semantic similarity of the selected results and, if they match above a certain threshold, passes them to the RAU. The RAU presents the top-scoring answer to the user as the output. If this strategy is not able to match the facts of the question with those stored in the KBDB, then the IR strategy is assessed to ascertain whether it can be used to generate a response to the user’s question. The working of the knowledge-based strategy is explained in detail in Chapter 5.
Fig. 4.9 High-level workflow of knowledge-based strategy
Internet-retrieval Strategy: This strategy searches for a possible answer on the Internet or an intranet. As shown in Fig. 4.10, a query is formed with the entity of the question, and the results retrieved from the Internet are stored as text. This text may be huge in volume, as it is crawled from the Internet, and may also include many errors such as spelling mistakes, HTML tags and special characters. So an additional step of DOM parsing and content segmentation is needed before the text is processed by the RGU. The RGU determines the semantic similarity of the selected results with the question and passes them to the RAU, which determines whether they match above a certain threshold. If they do, the result is passed to the user as output. If not, the generative-based strategy is assessed to ascertain whether it can be chosen. The working of the IR strategy is explained in detail in Chapter 5.
Fig. 4.10 High-level workflow of Internet retrieval strategy
Generative-based Strategy: As shown in Fig. 4.11, the generative-based strategy is based on neural machine translation (NMT) techniques [101]: it “translates” the user’s input sentence into an output sentence. This strategy is able to bring up entities from the input sentences and give the impression that the user is speaking to a human, which makes IntelliBot smarter and more advanced than other existing chatbots. However, it requires a complex design and implementation of algorithms that are comparatively difficult to build. To generate an output, the DBRNN uses the seq2seq model, which trains the AI model by focusing on the key elements of the sentence and treating the previous input words as an extra piece of information. By doing so, it develops the ability to accurately predict the next word. The generated message is passed to the RAU, which, after filtering, determines whether it matches the question above a certain threshold. If it does, it is passed on to the user as output. The working of the generative-based strategy is explained in detail in Chapter 5 and the training process is explained in Chapter 7.
Fig. 4.11 High-level workflow of generative-based strategy
4.5.3.3 Context Tracking Unit (CTU)
Irrespective of which strategy is selected for generating responses, the entire conversation is stored in a database for further analysis. This is important as the conversation history may need to be accessed at regular intervals, or the chatbot may need to remind itself of the user’s question. The Context Tracking Unit (CTU) keeps track of the user’s history, stores it in the database and accesses it when needed. It is also responsible for analysing the intent and identifying the theme or area of the user’s conversation. The CTU comprises four sub-components, namely Context Discovery, Dialogue State Tracker, Policy Learner and Error Controller. The following sub-sections describe the need for, and role played by, each component in IntelliBot’s working.
Context Discovery
The context discovery component is used in every response generation strategy. It is responsible for identifying the context of the user’s query, which includes topic detection and intent analysis. Topic detection identifies the subject or area of the domain in the user-bot conversation, while intent analysis identifies the user’s intent. In cases where the user’s question needs to be linked with previously asked information, the context discovery component retrieves from the ChatLog the user’s most recent conversational history relevant to the current context. This history is then tokenized, and informative keywords are extracted to determine the context. Techniques such as Stanford CoreNLP [102] are used to determine the topics and intents. When detecting the current context, the component takes the previously identified context into consideration.
Dialogue State Tracker (DST)
This component constantly monitors and updates the status of the conversation. This is required in the KB strategy, in which users’ inputs are formed as questions and the knowledge from the database is used to respond to them. In this case, it may be possible that the user asks a question that relates to a question which they had asked some time ago. For IntelliBot to answer such a question effectively, it needs to link the user’s current question with the previous related question. The Dialogue State Tracker (DST) component of the CTU is responsible for doing this. The DST constantly updates the state of the conversation and builds a robust and reliable representation of the current state of the conversation. It keeps track of the user inputs, query results, and system actions. As IntelliBot focuses on a semantic level, a rule-based state tracker via supervised learning is used [103]. The DST performs the following three major functions:
i. A symbolic query or semantic frame is formed to interact with the database to obtain appropriate results.
ii. The DST updates its state based on the user’s dialogue action and the results obtained from the database.
iii. The DST prepares the state representation for the policy learner unit.
Policy Learner
The policy learner is responsible for selecting the best action from the available results in the database, which include retrieved information, dialogue history, etc. IntelliBot is designed to respond to user inputs in a way that achieves the user’s goal in a minimal number of dialogue turns. Based on the current dialogue state, the policy learner module generates the next available action. For example, in the insurance scenario, if the dialogue state is “insurance claim”, the “insurance_claim” action is executed and IntelliBot retrieves the corresponding information from the database. The policy could be trained using the DBRNN, which simultaneously learns the feature representation and the dialogue policy.
Error Controller
This component is used in the processing of all four of IntelliBot’s conversational strategies. It is responsible for identifying errors, both in IntelliBot’s processing and in the text inputted by the human, and correcting them.
• In relation to the text inputted by the human, natural language understanding does not operate without errors. When IntelliBot detects an error in the user’s question, whether grammatical or conceptual, the error controller decides whether to ask the user for confirmation of the corrected meaning. It is important for such errors to be corrected, as otherwise IntelliBot may either generate an incorrect response or not generate any response at all. Chapter 7 explains in detail the working of IntelliBot’s grammar-checking component, which uses the error controller of the CTU.
• In the processing of IntelliBot, errors are those tasks that cannot be performed due to programming or logical errors which require software engineers to fix the bugs. In the presence of errors, the error controller of CTU is used to correct them.
4.5.3.4 Response Generator Unit (RGU)
Depending on the strategy chosen, the RGU acts as the decoder of the IntelliBot framework.
• The RGU may generate more than one response if the knowledge-based, Internet retrieval or generative-based strategy is used. In such cases, the RGU needs to select the response that is most suitable for answering the user’s question. For this purpose, the RGU computes the word and sentence similarity of each response with the question to determine the semantic similarity of the responses to the user’s query.
• In the generative-based strategy, the RGU generates the probability of the next word occurring in the sequence using the DBRNN with an attention mechanism. This represents the output in natural language.
Chapter 5 explains how RGU generates a response and Chapter 6 explains the process of determining the semantic similarity of a generated response with the user’s question.
4.5.3.5 Response Analyser Unit (RAU)
The Response Analyser Unit (RAU) is the glue between the system and the user. This unit takes responses from the RGU and forwards them to the user. Before doing so, it performs filtering checks to eliminate questionable responses. When the RGU returns more than one valid response, the RAU applies a scoring process to rank the answers and either selects the best response or merges them into one before forwarding it to the user. The score assigned by the RAU reflects each answer’s relevance to the corresponding question. If the similarity passes the predefined threshold, the response is presented to the user. In this way, IntelliBot provides a balance between accuracy and flexibility in the evaluation process. Chapter 6 explains the working of this unit in detail.
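The RAU's rank-and-threshold step can be sketched as follows. Token-overlap scoring is only a stand-in for the semantic similarity measure described in Chapter 6, and the threshold value is illustrative.

```python
# Sketch of the RAU: score candidate responses against the question and
# release the best one only if it clears a threshold.
def score(question, response):
    q, r = set(question.lower().split()), set(response.lower().split())
    return len(q & r) / len(q | r) if q | r else 0.0  # Jaccard stand-in

def select_response(question, candidates, threshold=0.3):
    best = max(candidates, key=lambda c: score(question, c), default=None)
    if best is not None and score(question, best) >= threshold:
        return best
    return None  # no candidate clears the threshold

candidates = ["my name is intellibot", "the sky is blue"]
print(select_response("what is your name", candidates))
# my name is intellibot
```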
4.6 Conclusion
In this chapter, the conceptual architectural model of IntelliBot’s RGC was introduced. The architecture was designed based on the requirements of a domain-specific chatbot, such as conversational capability, presenting semantically correct information, presenting meaningful responses and training on a domain-specific dataset. The proposed architecture facilitates building IntelliBot on various response generation strategies so that it can engage with customers and answer their queries correctly. A methodological approach was applied to designing and building a robust framework for IntelliBot that will enable it to keep users engaged and respond to their queries.
CHAPTER 5
“Software architecture is the set of design decisions which, if made incorrectly, may cause your project to fail.” — Eoin Woods
DESIGN MULTI-STRATEGY SELECTION AND RESPONSE GENERATION
5.1 Introduction
As discussed in the previous chapter, the NDM of IntelliBot has four different strategies for generating an appropriate response to the user’s question: template-based, knowledge-based, Internet retrieval-based and generative-based. The NDM must select the strategy that meets the requirements mentioned in Section 4.2 and increases the user’s involvement. As each strategy has a different data structure, matching technique and working process for generating a response [100], our focus in this chapter is to explain the working of each strategy in detail. Specifically, we focus on the techniques used in each strategy to generate a response. These techniques are used in the SSU of the NDM as shown in Fig. 4.4.
The structure of the chapter is as follows: Section 5.2 introduces the key terms needed to explain the working of the SSU. Section 5.3 briefly explains how IntelliBot selects a strategy to generate a response. Sections 5.4-5.7 explain the working of each strategy in detail. Specifically, Section 5.4 illustrates how predefined rules are used to generate a response via the template-based strategy. Section 5.5 demonstrates the process of using query formation through events and entities required to generate a response in the knowledge-based strategy. Section 5.6 explains the process of information extraction from selected websites to generate a response through the Internet retrieval strategy. Section 5.7 presents the working of the bidirectional recurrent neural networks used to generate a response in the generative-based strategy. Finally, Section 5.8 concludes the chapter.

(Parts of this chapter have been published in [20].)
5.2 Key Terminology
Context is the particular setting or situation in which the content occurs. The meaning of a sentence is always context dependent.
Event is an occurrence happening at a determinable place and time.
Token is a segmented word or element of a sentence.
WordNet is a large lexical English dictionary developed and hosted at Princeton and available as part of the NLTK corpus. It can be used to find the meanings of words, synonyms, antonyms and more. Approximately 117,000 synsets are found in WordNet.
Web crawler is an application or set of instructions that analyses web pages in a systematic and automated manner to categorize information on the basis of user demand.
DOM parsing uses the Document Object Model to extract information from a tree-like structure such as HTML or XML.
LSTM stands for long short-term memory, a special kind of RNN architecture that extends the memory of an RNN. It is designed to remember information for long periods of time.
Vector is a set of weights. Several vectors appear in this thesis, namely the word vector, thought vector, embedding vector, hidden state vector and bias vector.
5.3 Strategy Selection Unit’s Workflow to Generate a Response to the User’s Query
As mentioned in Section 4.5.3.2, IntelliBot has four unique response generation strategies. The quality of the responses generated by each strategy, and the way they are generated, differs. Thus, the NDM has the challenging task of selecting, for each user question, a strategy that not only generates semantically correct and meaningful responses but also keeps the user engaged throughout. In doing so, it is possible that the NDM uses different strategies for different questions within a single conversation. In other words, depending on the question asked, the NDM may choose a different strategy to respond to it, irrespective of which strategy was used to answer the previous question from the same user in the same conversation. The schematic representation of the selection process in the SSU is shown in Fig. 5.1. The sequential process by which the NDM determines which strategy to select to respond to the user query is as follows:
Fig. 5.1 Conversational strategy selection in SSU
• The template-based strategy is the first strategy which IntelliBot assesses to determine if it can be used to generate a response. This strategy has pre-defined patterns that check whether the structure of the user’s question matches the predefined rules in the template. If it does, the strategy is used to generate a response. Otherwise, the suitability of the next strategy is assessed.
• The knowledge-based strategy is the second strategy which IntelliBot assesses to determine if it can be used to generate a response. This strategy identifies the contexts and facts from the users’ question and matches them with the information about the questions stored in the underlying databases, user-bot conversation history and any new knowledge learned during the conversation. If they match, then the corresponding answers of the questions are presented to the user. Otherwise, the suitability of the next strategy is assessed.
74
• The Internet-retrieval strategy is the third strategy IntelliBot assesses to determine if it can be used to generate a response. Its objective is to provide more complete and up-to-date information by identifying the question type, event elements and entities for extracting data from preselected websites. It is used when the KB does not have the knowledge the user is asking for. If the Internet also does not have the required information to generate a response, then the suitability of the next strategy is assessed.
• The fourth strategy of IntelliBot to generate a response is to use deep bidirectional RNN with the seq2seq model to generate a conversational output. The objective of the generative-based strategy is to map between previous inputs and predict subsequent words to generate responses using DBRNN.
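The sequential fallback described above can be sketched as a chain of strategy functions. The strategy implementations below are trivial stand-ins (each returns a response string or None); the cascade logic is what the sketch illustrates.

```python
# Sketch of the SSU's sequential selection policy: the first strategy
# that produces a response wins. Strategy bodies are illustrative stubs.
def select_strategy_response(question, strategies):
    for name, strategy in strategies:
        response = strategy(question)
        if response is not None:
            return name, response
    return None, None

strategies = [
    ("template",   lambda q: "Hello!" if q.upper().startswith("HELLO") else None),
    ("knowledge",  lambda q: None),  # no KB match in this toy example
    ("internet",   lambda q: None),  # no retrieval result either
    ("generative", lambda q: "I am not sure, could you rephrase?"),
]
print(select_strategy_response("Hello there", strategies))
# ('template', 'Hello!')
```

Any question the first three strategies cannot answer falls through to the generative strategy, mirroring the order in Fig. 5.1.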
The working of each strategy is explained in detail in the next sections.
5.4 Design and Working of the Template-based Strategy
5.4.1 Objective
Depending on the specifics of the user’s question, the objective of the template-based strategy is to generate responses using a pattern-matching technique that matches predefined rules in the template.
5.4.2 Summary of the working of the template-based strategy
The template-based strategy is a collection of predefined question-answer pairs with set rules in the form of templates. It uses a pattern-matching algorithm that identifies the structure of the sentence together with the entity of the user’s input. If these match, the output of the pre-defined rules is presented to the user in response. Such a strategy is also termed rule-based, where the rule refers to the formed pattern.
In the pattern-matching process, a user’s input passes through the Input Processing Unit (IPU) to the Language Understanding Unit (LUU) as shown in Fig. 5.2. The LUU then performs tokenization, abbreviation expansion and grammar checks, and removes punctuation from the user input. Upon selection of the template-based strategy by the Strategy Selection Unit (SSU), the user’s input sequence is converted to uppercase and passed to the Neural Dialogue Manager
(NDM) for pattern fitting. Pattern fitting determines whether the user input can be found in a predefined template by applying AIML rules. If a template is found, any variables in the template message are set as necessary and the corresponding result is conveyed to the user. The answer is not filtered or checked for grammar as in the other response generation strategies, because the answer defined in the template is written by an expert and is assumed to have been checked for correctness.
Fig. 5.2 Design of the template-based strategy
For example, Table 5.1 shows some commonly occurring user questions in the form of patterns, where [ ∗ ] is the pattern-matching variable. In a case where the pattern matches, the response column shows the answer to be given to the user’s question.
Table 5.1 Template-based pattern matching

User Query          | Pattern         | Corresponding response
Who are you?        | WHO ∗ YOU       | I am an AI Chatbot for your assistance
Who is Einstein?    | WHO IS ∗        | Albert Einstein was a German physicist.
What is your name?  | WHAT IS YOUR ∗  | My name is AI Chatbot.
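The wildcard matching in Table 5.1 can be sketched in Python. This is a simplified stand-in for an AIML matcher, using the patterns and responses from the table (with the ASCII '*' in code standing for the table's ∗):

```python
import re

# Sketch of wildcard template matching: '*' matches one or more words.
# Input is stripped of punctuation and uppercased, as described above.
TEMPLATES = [
    ("WHO * YOU", "I am an AI Chatbot for your assistance"),
    ("WHO IS *", "Albert Einstein was a German physicist."),
    ("WHAT IS YOUR *", "My name is AI Chatbot."),
]

def match_template(user_input):
    text = re.sub(r"[^\w\s]", "", user_input).upper().strip()
    for pattern, response in TEMPLATES:
        regex = "^" + re.escape(pattern).replace(r"\*", r".+") + "$"
        if re.match(regex, text):
            return response
    return None

print(match_template("Who are you?"))
# I am an AI Chatbot for your assistance
```

Note that template order matters: a more specific pattern must be checked before a more general one that would also match.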
5.4.3 Detailed process of generating a response
IntelliBot uses templates to answer basic questions such as “what is the date today?” and “what is your name?”. Patterns are created for questions using AIML [104], which is a mark-up language based on an XML dialect used for specifying patterns and rules. AIML has 47 case-sensitive tags to design the rules of patterns within the template-based strategy for responding to natural language conversations. However, three mandatory tags are required to build a block. These are:
Fig. 5.3 Basic building block of AIML code snippet
The category tag defines the information about a QA pair. The template tag contains the answer to the user query that is forwarded to the user, and the pattern tag outlines the pattern of rules for the given user input [105]. As shown in Fig. 5.4, another important tag is the srai tag, which enables recursion by redirecting one pattern to another.
The rule consists of two parts: a condition and an action. IntelliBot operates by choosing a rule whose condition is satisfied and then executing the action of the chosen rule. The execution of the action is called firing [106]. The selection and firing of rules forms IntelliBot’s work cycle. Each rule has objects and tags as identifiers that are responsible for modelling a pattern of conversation. Additionally, each tag corresponds to a command in IntelliBot. Using AIML, IntelliBot checks whether the user’s question matches a given template in either a direct or an induced match.
Fig. 5.4 Recursion of AIML code snippet
5.4.3.1 User question resulting in a direct match with the defined templates
To understand this, a more thorough explanation of the structure and operation of AIML is needed. Let us assume that the user enters the question “who is albert einstein?”. As seen in Fig. 5.4, the user’s query matches the pattern at Line 8, and as its response, the reply at Line 10 will be presented to the user. This is termed a direct match to the user’s question.
5.4.3.2 User question resulting in an induced match with the defined templates
It is possible that the user asks the same question in a different way. For example, let us say the user enters “do you know who is albert einstein?”, which matches the pattern at Line 2. In this case, IntelliBot identifies the entity of the question as “Albert Einstein” and the pattern-matching algorithm using AIML transforms the question into the template DO YOU KNOW WHO ∗ IS, where the wildcard [ ∗ ] captures the person’s name, Albert Einstein. This encoding process is shown in Lines 1 to 6 of Fig. 5.4. The srai command then redirects the reduced question so that it can be matched against the direct pattern.
In this process, the template of a question is defined. If this template matches an existing question, then the response of the existing question is given to the user as IntelliBot’s response. In a case where the user asks for information related to the most recent conversation, IntelliBot uses AIML’s context mechanism (conditioning on the previous exchange together with stored context variables), as shown in Fig. 5.5.
Fig. 5.5 Memorizing previous conversation of AIML code snippet
Once the user enters their name, IntelliBot fires Line 6, which satisfies the condition at Line 7, i.e. the previous conversation. IntelliBot then triggers Line 9 and stores the name in a context variable. Later, if the user asks “What is my name?”, which matches the pattern at Line 13, IntelliBot triggers Line 14, which retrieves the value from the previously saved context variable “userName”. As a result, the response from the corresponding template, “Your name is: Nuruzzaman”, is displayed to the user.
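The memorisation behaviour of Fig. 5.5 can be sketched in Python as a toy analogue of AIML's context variables. The dialogue handling below is illustrative, not IntelliBot's implementation:

```python
# Toy analogue of AIML context variables: a value captured in one turn
# ("userName") is read back in a later turn.
context = {}

def handle(user_input):
    text = user_input.strip().rstrip("?.!")
    if text.lower().startswith("my name is "):
        context["userName"] = text[len("my name is "):].strip()
        return f"Nice to meet you, {context['userName']}"
    if text.lower() == "what is my name":
        return f"Your name is: {context.get('userName', 'unknown')}"
    return None

handle("My name is Nuruzzaman")
print(handle("What is my name?"))  # Your name is: Nuruzzaman
```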
As seen from the example, creating pattern matching rules is complex and not easy, even for the most commonly asked questions [104]. This leads to the limitation of this approach as explained in the next sub-section.
5.4.4 Limitation of template-based strategy
As noted above, creating pattern-matching rules is complex even for the most commonly asked questions [104]. Furthermore, the rules are time-consuming to create and difficult to maintain [37]. Such drawbacks often result in this approach giving redundant responses, which do not help in keeping the user engaged. Additionally, this approach leads to a conflict in generating an output when more than one rule or template has its conditions satisfied for a given question. For this reason, as previously mentioned, we trained IntelliBot in the
template-based strategy only for the basic dialogues that are used to initiate and end a chat with the user and convey greetings. Using this strategy for a wider variety of topics would take longer to run, consume a lot of memory and lead to performance issues.
5.5 Design and Working of the Knowledge-based Strategy
5.5.1 Objective
The objective of the knowledge-based strategy is to respond to user queries by first identifying the contexts and facts. Then this information is used to formulate a query and search for answers in the information stored in the underlying databases, the user-bot conversation history, and any new knowledge learned during the conversation.
5.5.2 Summary of the working of the knowledge-based strategy
As discussed in the previous chapter, the knowledge-based (KB) strategy uses the information stored in the underlying knowledge database (KBDB) to answer user questions. The process of how the KBDB is formed using domain-specific data is explained in Chapter 7. The user input passes through the IPU and LUU, with the LUU performing all the pre-processing NLP tasks. IntelliBot then extracts the key phrases, identifies the entities and intents, and determines the facts necessary for generating the response. Upon selection of the KB strategy at the SSU, IntelliBot applies question-pattern rules to classify the question type and extracts event elements that comprise key phrases and describe different events such as location, time, action and object, as shown in Table 5.2.
Table 5.2 Corresponding question types and event elements

Question Type     | Event Elements
What              | Subject, Object, Action, Description
Which             | Subject, Object
Who, Whose, Whom  | Subject, Object
Where             | Location, Place
When              | Date, Time
How               | Quantity, Description
Verb to-be        | Boolean
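The mapping in Table 5.2 can be applied with a simple lookup on the question's first word. The auxiliary-verb set below is an illustrative addition for catching yes/no ("verb to-be") questions:

```python
# Sketch of question-type classification from Table 5.2.
QUESTION_TYPES = {
    "what":  ["Subject", "Object", "Action", "Description"],
    "which": ["Subject", "Object"],
    "who":   ["Subject", "Object"],
    "whose": ["Subject", "Object"],
    "whom":  ["Subject", "Object"],
    "where": ["Location", "Place"],
    "when":  ["Date", "Time"],
    "how":   ["Quantity", "Description"],
}
# Illustrative: auxiliaries treated as yes/no ("verb to-be") questions.
VERB_TO_BE = {"is", "are", "was", "were", "am", "do", "does", "did", "can"}

def classify_question(question):
    first = question.lower().split()[0]
    if first in QUESTION_TYPES:
        return first.capitalize(), QUESTION_TYPES[first]
    if first in VERB_TO_BE:
        return "Verb to-be", ["Boolean"]
    return "Unknown", []

print(classify_question("When did Omar Hussain give a speech yesterday?"))
# ('When', ['Date', 'Time'])
```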
Typically, most events are embedded in the user question and inherit associations among their elements. Based on WH-question words, entities and event extraction, the Question Analyser classifies the question type from the user input as shown in Fig. 5.6. The extracted key phrases are given to the query engine located in the NDM, which formulates a query to retrieve results from the KBDB. As part of the Answer Analyser process, the results are passed to the RGU, which computes the semantic similarity to determine each answer’s relevance to the question. If they match above a certain threshold, the RAU passes the top-scoring answer to the user as output. As seen in Fig. 5.6, there is no grammar-checking process in the RAU for the KB strategy. This is because the responses are carefully checked and reviewed by experts and native English speakers when being inserted into the KBDB, so it is assumed that they contain no grammatical errors.
5.5.3 Detailed Process of generating a response
The knowledge-based strategy needs to form the knowledge database (KBDB) to answer the user questions. This is done by retrieving and storing information related to the insurance domain in the form of Product Disclosure Statements (PDS) about insurance products. It also stores the user’s conversational history and the new knowledge learned during the conversation. Once this is done, the user question is then transformed as a query and a response from such underlying information is generated. The design of the KB strategy to generate a response is shown in Fig. 5.6. The working is explained in the next paragraph.
Fig. 5.6 Design of the knowledge-based strategy
Let us consider the user question “When did Omar Hussain give a speech yesterday?” IntelliBot performs tokenization and tags each token with a part-of-speech tag (WRB, VBD, DT, NN) before extracting two named-entity tokens (“Omar”, “Hussain”). It then determines the dependency relationships among the tokens as shown in Table 5.3. The word ‘a’ and the punctuation mark ‘?’ are omitted from the question.
Table 5.3 POS and entity dependency relationship of the user question

Id | Word/Token | Lemma     | POS | NER    | Dependency
1  | When       | when      | WRB | O      | root-0, give-5, when-1
2  | did        | do        | VBD | O      | give-5, did-2
3  | Omar       | Omar      | NNP | PERSON | Hussain-4, Omar-3
4  | Hussain    | Hussain   | NNP | PERSON |
5  | give       | give      | VB  | O      | give-5, Hussain-4
6  | a          | a         | DT  | O      | speech-7, a-6
7  | speech     | speech    | NN  | O      | give-5, speech-7
8  | yesterday  | yesterday | NN  | DATE   | give-5, yesterday-8
Next, IntelliBot determines the key phrase of the question, initially treating all tokens (words) as independent keywords. However, in the given question, “Omar Hussain” refers to a single person, so the combination of the two tokens should be composed into a single token (word). If all words are represented by W = {w_1, w_2, …, w_i}, then for any word w_i ∈ W whose combination with its neighbour forms a valid unit, IntelliBot replaces w_i with (w_i + w_{i+1}) in the set W. Therefore, it is able to retrieve more relevant answers. After extracting the combinational words for the given question, IntelliBot combines the two tokens into a multiword expression tagged as “PERSON” for “Omar Hussain”. Several words in the question also require their relationships and dependencies with other words. For example, the token ‘speech’ cannot be independent, because it relates to the event element ‘give’ and the subject ‘Omar Hussain’, with the event occurring at the time ‘yesterday’, as presented in Fig. 5.7.
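The multiword-merging step described above can be sketched by joining adjacent tokens that carry the same named-entity tag. The tokens and tags are taken from Table 5.3; the merging function itself is an illustrative sketch:

```python
# Sketch: merge adjacent tokens with the same NER tag into one
# multiword expression, e.g. ("Omar", "Hussain") -> "Omar Hussain".
def merge_entities(tokens, tags):
    merged, i = [], 0
    while i < len(tokens):
        if tags[i] != "O":
            j = i
            while j + 1 < len(tokens) and tags[j + 1] == tags[i]:
                j += 1
            merged.append((" ".join(tokens[i:j + 1]), tags[i]))
            i = j + 1
        else:
            merged.append((tokens[i], "O"))
            i += 1
    return merged

tokens = ["When", "did", "Omar", "Hussain", "give", "a", "speech", "yesterday"]
tags   = ["O", "O", "PERSON", "PERSON", "O", "O", "O", "DATE"]
print(merge_entities(tokens, tags))
# ('Omar Hussain', 'PERSON') now appears as a single token
```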
Fig. 5.7 Semantic graph and entity dependency of user question
After POS tagging, grammar checks and entity extraction, IntelliBot identifies the question type using question-pattern rules and extracts the question type “When”. In order to incorporate more knowledge of the event elements and expand the query, this thesis incorporates WordNet synsets. There may be other words related to the tokens in the above sentence such as (give, speech, yesterday) that may have a similar meaning. For example, as shown in Table 5.4, the word ‘speech’ is similar in meaning to ‘talk’, ‘presentation’ or ‘lecture’. When forming a query, it is important to include these words which are related to the tokens as it will help to capture the various required lexical relationships. Therefore, using semantic knowledge and extracting similar meanings from the lexicon is significant to form and extend the query.
Table 5.4 Example of similar meanings (senses) of a word

Token     | POS Tag | Synsets
Give      | VB      | offer, grant, donate, contribute, lend, allow, permit
Speech    | NN      | talk, lecture, presentation, speak, conversation
Yesterday | NN      | last day, past, recently, the other day, foretime
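The query expansion behind Table 5.4 can be sketched with a hand-coded synonym dictionary. IntelliBot itself obtains these senses from WordNet (via NLTK); the entries below simply mirror the table for illustration:

```python
# Sketch of synset-based query expansion: each token contributes its
# similar-meaning words to the query term set. Entries mirror Table 5.4.
SYNSETS = {
    "give":      ["offer", "grant", "donate", "contribute", "lend"],
    "speech":    ["talk", "lecture", "presentation", "conversation"],
    "yesterday": ["last day", "past", "recently", "the other day"],
}

def expand_tokens(tokens):
    expanded = set(tokens)
    for tok in tokens:
        expanded.update(SYNSETS.get(tok, []))
    return expanded

query_terms = expand_tokens(["give", "speech", "yesterday"])
print("talk" in query_terms)  # True
```

Expanding the query this way lets the KBDB lookup match stored answers that use a synonym rather than the exact word the user typed.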
To achieve this, the knowledge-based strategy comprises the following steps to retrieve the results from the KBDB:
a) Identify possible tokens from the user question and their synsets to determine their lexical relationships.
b) Formulate a query with the tokens that most appropriately represent the user question.
c) Retrieve the candidate responses.
d) Filter the retrieved results by determining the percentage of tokens matched.
e) Compute the sentence-level semantic similarity.
f) Determine whether the retrieved results match the user question above a certain threshold.
Let us consider the sentence "When did Omar Hussain give a speech yesterday?", which contains three tokens (give, speech, yesterday), denoted $w_1, w_2, w_3$. Words from their synsets have a similar meaning to these tokens, and their lexical relationship is computed as $K_l(w_1, w_2, w_3) \in K_q$. By combining the synsets and the key phrase, IntelliBot formulates a query that extracts as much information as possible and minimizes the selection of inappropriate responses from the KBDB, as shown in Fig. 5.8.
Fig. 5.8 Code snapshot of KB query formation
The newly formulated query which retrieves possible results from the KBDB is denoted by $K_q$ in Eq. (5.1) as defined in [107]:

$$K_q = C_q + (G_q \cup S_q)$$ (5.1)

where $C_q$ is the known entities, $S_q$ is the synsets and $G_q$ is the gloss words.
Then, the NDM handles the query using a query engine. The likelihood of a word (synsets, entities and key phrase) of the user question $Q_i$ matching that of a possible response $A_i$ from the KBDB is determined by Eq. (5.2) as defined in [107, 108]:

$$K(Q_i, A_i) = \sum_{i=1}^{n} \delta_i \, S(Q_i, A_i)$$ (5.2)
The distance correlation $d$ between the user question $Q_i$ and a possible response $A_i$ is then measured. The lower the distance between them, the higher the appropriateness of the response, as shown in Eq. (5.3) defined by [107]:

$$K_d(Q_i, A_i) = \frac{1}{d_s(Q_i, A_i)} \sum \left| POS(Q_i) - POS(A_i) \right|$$ (5.3)
IntelliBot filters the retrieved results by checking the matched words and their synsets. In this step, IntelliBot determines the percentage of tokens matched between the user's question and the retrieved results. This is done by calculating the ratio in the code snapshot shown in Fig. 5.9.
Fig. 5.9 Code snapshot of percentage of matching words
For example, if the query contains three keywords, then the retrieved results should contain at least two of them. Eq. (5.4), as defined in [109], states that the number of matching words between the user question $Q_i$ and a retrieved result should reach a minimum derived from the total number of query words $K_{qr}$. This determines whether the result achieves a certain level of content quality before it is passed to the RGU.
$$Q_i \geq \lfloor \sqrt{K_{qr} - 1} \rfloor + 1$$ (5.4)
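Reading Eq. (5.4) as a minimum-match rule (a hedged interpretation of the prose around it), it can be checked in a few lines; with three keywords it reproduces the "at least two of them" example from the text.

```python
import math

def passes_match_rule(matched, total_query_terms):
    """Eq. (5.4), read as: with k query terms, a retrieved result must
    match at least floor(sqrt(k - 1)) + 1 of them."""
    required = math.floor(math.sqrt(total_query_terms - 1)) + 1
    return matched >= required

# Three keywords: floor(sqrt(2)) + 1 = 2 matches required.
print(passes_match_rule(2, 3))
```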
Then, the RGU removes all duplicate results and computes the sentence-level semantic similarity of the generated response with the user's question. To measure the similarity between the input question and the retrieved results, the input and output weight vectors are marked as $S_1$ and $S_2$ respectively, and a word similarity vector $S_w = 1 - \frac{\|r_{i+1} - r_{i+2}\|}{\|r_{i+1} + r_{i+2}\|}$ is created. Then, the sentence-level similarity is computed by combining the weight vectors $S_1$, $S_2$ and the word similarity vector $S_w$, as shown in Eq. (5.5) as defined in [110]:

$$S(S_1, S_2) = \vartheta S_s + (1 - \vartheta) S_r = \vartheta \frac{s_1 \cdot s_2}{\|s_1\| \cdot \|s_2\|} + (1 - \vartheta) \frac{\|r_{i+1} - r_{i+2}\|}{\|r_{i+1} + r_{i+2}\|}$$ (5.5)
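A pure-Python sketch of this combination follows. The weight of the semantic term, `theta=0.85`, is an assumption for illustration (the thesis defers the details to Chapter 6), as are the toy vectors.

```python
import math

def cosine(u, v):
    """Semantic part of Eq. (5.5): cosine of the two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def order_similarity(r1, r2):
    """Word-order part: 1 - ||r1 - r2|| / ||r1 + r2||."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    summ = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / summ

def sentence_similarity(s1, s2, r1, r2, theta=0.85):
    """Eq. (5.5): weighted blend of semantic and word-order similarity."""
    return theta * cosine(s1, s2) + (1 - theta) * order_similarity(r1, r2)

# Identical sentences score 1.0 under both components.
print(sentence_similarity([1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]))
```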
Details on measuring the sentence-level semantic similarity are discussed in Chapter 6. The semantic similarity of the selected results is passed to the RAU which determines if they match above a certain threshold. If it does, then the top-scoring answer is passed to the user as an output. If the user’s query contains missing information related to the event or WH- question words, IntelliBot is designed to request it from the user. Alternatively, IntelliBot can recognize the context by analysing the stored conversational history of the user until now. This is a unique strategy in IntelliBot in that it can remember and learn from the previous user-bot conversational history.
If by using such a strategy, IntelliBot is unable to find a response from the KBDB, then the next strategy, namely the Internet retrieval-based strategy, will be chosen to generate a response.
5.5.4 Limitation of knowledge-based strategy
Existing approaches form a query from a user's question by using keyword-based techniques [108]. These methods focus on matching the keywords of the user query with possible answers to generate a response. The drawback of the KB strategy is that a common question like 'Who made you?' may return several responses in which the expected answer is embedded. The literature shows that determining and defining the requisite knowledge for the KBDB is a very difficult task and is often hard to achieve [111]. It needs the construction of data models that are capable of responding to the user's queries [100].
5.6 Design of the Internet Retrieval (IR) Strategy
5.6.1 Objective
When the KBDB is unable to answer the user query, the aim of the Internet retrieval strategy is to provide more complete and up-to-date information from the web or intranet to generate a response. This is done by identifying the question type, event elements and entities for extracting data from preselected websites.
5.6.2 Summary of working of Internet retrieval strategy
In the Internet retrieval strategy, IntelliBot performs all the pre-processing NLP tasks similar to the KB strategy and then divides the remaining tasks into two parts: question analyser and answer analyser, as shown in Fig. 5.10. The question analyser component examines the user question and applies question-pattern rules to identify the question type, then extracts the event elements which comprise the key phrases and conditions of the question. The extracted key phrases are given to the query engine to formulate a query for web crawling, which searches for and retrieves pages that include the same key phrases. The retrieved results are then passed to content segmentation, which cleans the HTML tags and extracts information that contains possible answers. In the answer analyser component, the results are forwarded to the RGU to compute the semantic similarity, which determines whether the answer is relevant to the question. If they match above a certain threshold, the RAU passes the top-scoring answer as a response to the user.
Fig. 5.10 Design of Internet retrieval strategy
5.6.3 Detailed process of generating a response
The Internet is rapidly growing and contains a wealth of valuable information that can be used by IntelliBot to generate a response. In this strategy, IntelliBot chooses a response from a set of preselected websites, namely Wikipedia, Britannica and specific domain-oriented websites such as americanexpress.com, commbank.com.au and anz.com.au. The following tasks are performed in the two components of this strategy: question analyser and answer analyser.
5.6.3.1 Question analyser
Let's consider the question "What is the interest rate of AMEX?". The question analyser component first tokenizes the question and checks for abbreviations, including grammar error detection and correction. Then, the question is decomposed by the POS tagger component and the punctuation is removed. IntelliBot then extracts key phrases, i.e. identifies entities, intents and facts from the question, to specify the role of each word (token) and analyse the question. Based on the key phrase and using question-pattern rules, IntelliBot determines the question type and event elements, as shown in Table 5.2. For this question, it extracts the question type "What" and the event elements "interest rate" and "AMEX" as the entity "ORG", as shown in Fig. 5.11, which also shows the relationships and dependencies with other words (tokens). There is a compound relation between 'interest' and 'rate', which means it is a multiword expression; this helps the query retrieve more relevant answers. If no value of the relationships is specified, the query cannot be instantiated.
Fig. 5.11 Semantic graph and entity dependency of the question
To expand the query, IntelliBot incorporates WordNet synsets to discover if the word has a similar meaning in different synsets. For example, the word ‘interest rate’ has a similar meaning to ‘bank rate’, ‘lending rate’, ‘borrowing rate’ and ‘annual percentage rate’. To ensure that these words are also considered when the results are determined, synonyms are added to the query and are concatenated together with the operator “OR”. Another option
is to repeat the query with the synonyms. This increases the possibility of obtaining more relevant answers. However, it may also complicate the search by prolonging the search response time. Using the question type, event elements and key phrases, a query is formed and passed to the web crawler to crawl relevant information from the preselected websites. This is a complex and time-consuming step, as each question has a unique structure and content while each website has different rules about the queries it accepts.
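The OR-concatenation of synonyms can be sketched as below. The synonym list is illustrative (taken from the 'interest rate' example above), not an output of the thesis's actual synset lookup.

```python
def expand_query(key_phrase, synonyms):
    """Concatenate a key phrase with its synonyms using the OR operator,
    quoting each term so multiword phrases stay intact."""
    terms = [key_phrase] + synonyms
    return " OR ".join(f'"{t}"' for t in terms)

q = expand_query("interest rate",
                 ["bank rate", "lending rate", "annual percentage rate"])
print(q)
```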
Fig. 5.12 Code snippet of web crawling
The query is then used by a web crawler to retrieve information, as shown in the code snippet in Fig. 5.12. A web crawler is an application or set of instructions that analyses web pages in a systematic and automated manner to categorize information based on user demand. This thesis builds a web crawler to collect data from websites. It is observed that websites tend to have a similar structure, such as an index page, contact page, about page and a number of FAQ pages, and different rules about the queries they accept. The steps in the web crawling process are described in Fig. 5.13.
Fig. 5.13 Information extraction process from the web using a web crawler
At first, the web crawler takes only the selected URLs: wikipedia.org, britannica.com, americanexpress.com, commbank.com.au and anz.com.au. Taking all URLs would waste the computing resources of the machines running the crawler. After grabbing a page, the crawler removes the website from the URL queue and determines which path to follow next. This is done by one of two crawling strategies: breadth-first crawling and depth-first crawling. Breadth-first crawling searches the neighbour hyperlinks of the target hyperlink. It starts with the root hyperlink and collects all the neighbour hyperlinks at the initial level. The scanning stops when the targeted search is achieved; otherwise, it goes to the next level. Depth-first crawling, on the other hand, starts searching from the root node and traverses to its child node, as shown in Fig. 5.14. It traverses deep until no other child node is present, then starts from the next unvisited node and continues in a similar manner.
Fig. 5.14 Traverse child node to obtain expected question and answer
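The two traversal orders can be demonstrated on a toy link graph (the URLs here are placeholders, not the thesis's preselected sites):

```python
from collections import deque

# Toy link graph: each key links to its child pages.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

def crawl_bfs(start):
    """Breadth-first: visit all neighbours of a level before going deeper."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def crawl_dfs(start):
    """Depth-first: follow child links until no unvisited child remains."""
    seen, order, stack = set(), [], [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        stack.extend(reversed(LINKS.get(url, [])))
    return order

print(crawl_bfs("seed"))  # ['seed', 'a', 'b', 'c', 'd']
print(crawl_dfs("seed"))  # ['seed', 'a', 'c', 'b', 'd']
```

Note how BFS finishes level one ('a', 'b') before descending, while DFS follows 'a' down to 'c' before returning to 'b'.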
Next, the web crawler retrieves the content from the webpages and then IntelliBot identifies the semantic annotations or metadata from the semantic layer of the web. This is done through DOM (Document Object Model) parsing if the metadata is embedded in the webpage. DOM parsing can be performed in two ways: scripting specific rules for each website, or scripting common rules for all websites. To extract the information, IntelliBot first identifies where the specific contents are located on the webpage. As shown in the code snapshot in Fig. 5.15, the
HTML tags show how the contents are nested in the DOM-tree structure, with a dedicated tag pair enclosing the content area. The output of this step compiles all information into a text file.

Fig. 5.15 HTML code snapshot
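The DOM-tree handling described here can be sketched with Python's standard `html.parser`: build a tree of tag nodes, then serialize it level by level. This is a simplified stand-in for the thesis's parser (it ignores text and void elements such as images unless self-closed).

```python
from collections import deque
from html.parser import HTMLParser

class TagTreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree of (tag, children) nodes."""
    def __init__(self):
        super().__init__()
        self.root = ("html", [])  # implicit root node
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags (e.g. <img .../>) get no children.
        self.stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

def tag_string(html):
    """Concatenate tag names level by level from top to bottom."""
    builder = TagTreeBuilder()
    builder.feed(html)
    out, queue = [], deque([builder.root])
    while queue:
        tag, children = queue.popleft()
        out.append(tag)
        queue.extend(children)
    return "-".join(out)

print(tag_string("<div><p>hi</p><a href='#'>x</a></div>"))
```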
There are other HTML tags, such as those for hyperlinks and images. However, text contents are listed in parallel at the same level in the DOM tree and enclosed in a tag pair. We represent each DOM tree as a string of HTML tags by concatenating all the tag nodes level by level from top to bottom.

5.6.3.2 Answer analyser
As seen in Fig. 5.15, the web content comprises HTML tags, unwanted text and special characters, which need to be removed before the answer can be extracted. IntelliBot uses a segmentation tool for data cleansing. The next step is answer analysis, which extracts only the required and relevant content that incorporates possible answers. The information is then parsed into sentences and any duplicate sentences are removed. Next, it is determined whether these sentences potentially answer the question by using the semantic sentence similarity approach from Eq. (5.2) to Eq. (5.5). The semantic similarity of the selected results is passed to the RAU. The RAU performs a grammar check and then ranks
each result to determine the correctness of the answers to the user's question. The answer with the highest score above a threshold is considered the best answer and is shown to the user. The steps of sentence-level similarity are discussed in Chapter 6.
The generated answers of the IR strategy are stored in the KBDB for future use if the same question is asked. This is to avoid unnecessary web crawling. If either the answer is not found by the IR strategy or the sentence similarity determines that the generated answer does not match the user question, then the next strategy i.e. the generative-based strategy will be used to generate a response.
5.6.4 Limitation of Internet retrieval strategy
The advantage of the IR strategy is that the response generated from it is more complete and up to date than the one generated from the template-based and knowledge-based strategies, which may be dated. In this strategy, the response is chosen from a set of preselected websites. While searching more websites is possible, it increases the chance of obtaining information irrelevant to the question and also takes a lot of time. During the search process, IntelliBot receives a list of text messages from the selected websites as possible answers. However, most webpages are not designed for NLP [112], which makes query formation, information retrieval and data processing challenging.
5.7 Design of Generative-based Strategy
5.7.1 Objective
The objective of the generative-based strategy is to generate a response using the DBRNN approach. This approach maps between the previous inputs and predicts subsequent words for generating responses.
5.7.2 Summary of the working of the generative-based strategy
To understand the generative-based strategy, it is important to consider a model that attempts to predict the subsequent word (output) based on the current word (input) and previous words. This strategy maps input sequences to an output sequence [84]. As shown in Fig. 5.16, the seq2seq model comprises two LSTMs. One LSTM is the encoder and the other is the decoder. The encoder takes the input sequences one by one and captures the context of the input, building a representation of the given inputs. The decoder receives the encoded representation and predicts one element of the output sequence at a time. The idea is to use two LSTMs that work together with a special token and to predict the next state sequence from the previous sequence [101].
Fig. 5.16 Design of the generative-based strategy
5.7.3 Detailed process of generating a response
As previously discussed, this strategy generates a response through the generative approach, which involves the DBRNN and consists of an "encoder" and a "decoder", as shown in Fig. 5.17. The encoder processes the input sequences and creates a vector which is forwarded to the decoder to predict the output. Let us consider the sentence "How are you?". This sentence is processed by the encoder, which builds a representation of the words in the RGU of IntelliBot. This allows words with similar meanings to have similar representations as vectors of numerical values known as "word embeddings", stored in a lookup table that maps input sequences into a fixed-sized dictionary. After reading the whole sentence, IntelliBot assigns a special token to mark the end of the input.
Fig. 5.17 Architecture of the DBRNN seq2seq model
Then, the decoder process begins by adding a special token and predicting the output words one at a time until an end-of-sequence token is produced.
The above example is fine for short sentences but fails for long sentences because it is challenging for the encoder to memorize the entire sequence into a fixed-length vector and compress all the contextual information from the sequence [113]. Therefore, to overcome this problem, this thesis uses the concept of the attention mechanism [86] which pays attention to specific words in the sequence which have contextual information instead of the entire sequence and predicts the output sequence based on this.
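The attention idea can be illustrated with a minimal dot-product attention sketch in pure Python. The thesis follows the mechanism of [86]; the simplified scoring function and toy vectors below are assumptions for illustration, not the thesis's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, encoder_states):
    """Dot-product attention: weight each encoder hidden state by its
    relevance to the decoder query, then return the weighted context."""
    scores = [dot(query, h) for h in encoder_states]
    weights = softmax(scores)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(query))]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], states)
print(weights)
```

The decoder thus attends more strongly to encoder states aligned with the current query instead of compressing the whole sequence into one fixed vector.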
To explain the importance of the attention mechanism, let us consider another example. If the model attempts to predict the last word in "Antarctica is covered with …", then no further context is required, as the model determines the subsequent word is "ice". In this example, there is a small distance between the relevant data and the prediction location. However, there are also cases where the model needs more context to accurately determine the output. For example, "I grew up in Malaysia. I can speak fluent …" indicates that the next word is likely to be a language. However, to determine which particular language, the model needs the context of 'Malaysia', which is in a different sentence. Here, the distance between the output to be predicted and the context is large. As the distance grows, an RNN cannot learn to link the output to the appropriate data, which leads to the 'vanishing gradient problem' [114]. To resolve this issue, LSTMs, which can handle long-term dependencies, are used. The aim of the LSTM is to measure the probability of the given sentence.
Let us assume the conditional probability $p(y \mid x)$, where $x = x_1, x_2, \ldots, x_t$ is the given input sequence and $y = y_1, y_2, \ldots, y_{t'}$ is the corresponding output sequence (the length $t'$ can differ from $t$). As shown in Fig. 5.18, to compute $p(y \mid x)$, the model obtains the fixed-length vector representation $v$ of the input sequence $x$, given by the last hidden state of the encoder. First, each hidden state vector $h_t$ is computed using a deterministic state transition function as in Eq. (5.6) defined in [85]:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_x)$$ (5.6)
where $\tanh$ is the activation function, $W_{hh}$ and $W_{xh}$ are the hidden-to-hidden and input-to-hidden weight matrices, $h_t$ is the hidden state, $x_t$ is the input vector, $h_{t-1}$ is the hidden state of the previous timestep and $b_x$ is the hidden state bias vector.
Fig. 5.18 Visual representation of input to output
As seen in Fig. 5.18, at each timestep the output vector $o_t$ is computed from the hidden state $h_t$, where $W_{yh}$ is a weight matrix and $b_y$ is the output bias vector, as in Eq. (5.7) defined in [85]:

$$o_t = W_{yh} h_t + b_y$$ (5.7)
Then, the output vector $o_t$ passes through the softmax layer to normalize the output, as defined in [84, 115]:

$$y_t = \mathrm{softmax}(o_t)$$ (5.8)
The process is repeated for all words of the input sequence $x_i$, and $y_t$ is generated by taking the decoder output at time $t$. The distribution over the possible subsequent words is given by Eq. (5.9) defined in [84]:

$$P(y_1, y_2, \ldots, y_{t'} \mid x_1, x_2, \ldots, x_t) = \prod_{t=1}^{t'} P(y_t \mid v, y_1, \ldots, y_{t-1})$$ (5.9)
The left side of Eq. (5.9) represents the likelihood of the output sequence $y_1, y_2, \ldots, y_{t'}$ given the input sequence $x_1, x_2, \ldots, x_t$. The right side is the product of the conditional probabilities $P(y_t \mid v, y_1, \ldots, y_{t-1})$ of each word given the vector representation $v$ and the outputs at previous timesteps; $\prod$ is the multiplicative equivalent of $\sum$. A simpler form of the conditional probability is given in Eq. (5.10) defined in [43, 84]:
$$P(y \mid x) = \prod_{t=1}^{t'} P(y_t \mid v, y_1, \ldots, y_{t-1})$$ (5.10)
IntelliBot applies an RNN with the seq2seq model and attention mechanism, consisting of an input layer, an output layer and four hidden layers. For each layer of the RNN, IntelliBot uses two LSTMs, one each for the encoder and decoder.
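Eqs. (5.6)–(5.8) can be sketched numerically in pure Python. The two-unit hidden layer and all weight values below are made up for illustration; they are not the thesis's trained parameters.

```python
import math

def rnn_step(x, h_prev, W_hh, W_xh, b_x):
    """Eq. (5.6): h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_x),
    with weight matrices given as per-row lists."""
    return [math.tanh(sum(w * h for w, h in zip(W_hh[i], h_prev))
                      + sum(w * xi for w, xi in zip(W_xh[i], x))
                      + b_x[i])
            for i in range(len(b_x))]

def output_step(h, W_yh, b_y):
    """Eq. (5.7): o_t = W_yh h_t + b_y."""
    return [sum(w * hi for w, hi in zip(W_yh[i], h)) + b_y[i]
            for i in range(len(b_y))]

def softmax(o):
    """Eq. (5.8): normalize the output vector into a distribution."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

# Tiny two-unit example with made-up weights.
h = rnn_step([1.0], [0.0, 0.0],
             W_hh=[[0.1, 0.0], [0.0, 0.1]],
             W_xh=[[0.5], [0.2]], b_x=[0.0, 0.0])
y = softmax(output_step(h, W_yh=[[1.0, 0.0], [0.0, 1.0]], b_y=[0.0, 0.0]))
print(y)
```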
5.7.4 Limitation of generative-based strategy
As the generative-based strategy is based on the DBRNN, it is a feed-forward multilayer neural network that generates an output from the user input. It is recurrent because it performs a similar task for each element of the sequence, with its output depending on the previous sequence [81, 85, 116-119]. This means that if IntelliBot is required to predict the next word, it needs to recognize which words were derived previously. Compared to the other strategies, the generative-based strategy is complex in its design and requires algorithms which are difficult to build and implement. This is because the generated output in this strategy does not come from the KBDB; instead, the strategy uses its own word generation ability based on previous inputs and predicted words. Thus, the training process requires a lot of time and effort and a vast amount of labelled data. However, it is the most suitable strategy for generating responses to complex questions.
5.8 Conclusion
This chapter explained the design of the four conversational strategies developed to overcome the limitations of traditional rule-based chatbots. Each strategy was built using different techniques and algorithms. The rule-based strategy uses AIML templates to generate a response. The knowledge-based strategy formulates a complex query in a sophisticated way to capture and identify the facts necessary to generate a response. The Internet retrieval strategy retrieves more complete and up-to-date information than the template-based and knowledge-based strategies. The generative-based strategy builds a DBRNN seq2seq model with an attention mechanism which looks at both the left and right context. By using the attention mechanism, the model is capable of finding the mapping between the input sequence and the output sequence. These strategies will be used in the next chapters by IntelliBot to generate a response to the user's question.
CHAPTER 6
“Research is the act of going up alleys to see if they are blind” —Marston Bates
GRAMMAR CHECKING AND MEASURING SEMANTIC SIMILARITY
6.1 Introduction
As discussed in Chapter 4, IntelliBot’s Neural Dialogue Manager (NDM) comprises the Language Understanding Unit (LUU) and the Response Generation Unit (RGU). As discussed in Section 4.5.3.1, the LUU breaks down the user input and pre-processes it in different ways. In this chapter, we discuss the various tasks it performs during natural language processing (NLP), including the working of the Grammar Error Checking (GEC) component. As discussed in Chapter 5, the RGU of IntelliBot generates a possible response based on the selected response generation strategy. To determine which of the generated responses best matches the user’s question, the semantic similarity between them needs to be determined. This chapter also explains the process of measuring the semantic similarity between the possible response and the user’s question.
The structure of the chapter is as follows: Section 6.2 defines the key terms required for the understanding of the working of LUU and semantic similarity determination by RGU. Section 6.3 explains in detail the various tasks of NLP performed by IntelliBot and their working. Section 6.4 discusses the method for measuring the semantic similarity at a word and sentence level. Section 6.5 describes the process of sentence scoring undertaken in RAU.
6 Parts of this chapter have been published in [20] and [100].
6.2 Key Terms
WordNet is a large lexical English dictionary developed and hosted at Princeton University. It is part of the NLTK corpus and can be used to find the meanings of words, synonyms, antonyms and more. WordNet contains approximately 117,000 synsets.
Syntax refers to the grammatical structure of a sentence. The format in which words and phrases are arranged to create sentences is called syntax.
Semantics concerns the relationships between words and how meaning is conveyed by those words in a sentence. It is one of the most challenging aspects of NLP.
Word2Vec Embedding is a statistical method for efficiently learning word representations from a corpus, such as the CBOW and Skip-gram models. It represents the relationships that exist between words and is used in many NLP applications, such as text summarization, question answering and document classification, for better word representation.
PCFG stands for probabilistic context-free grammar, a high-level generative model that assigns probabilities to the possible parses of a sentence. It automatically predicts the most likely grammatically correct sentence.
6.3 Natural Language Processing (NLP) Tasks Performed in the LUU of IntelliBot
Understanding human language is one of the most complex tasks for a machine, but with current NLP techniques it is becoming easier day by day. IntelliBot processes each word of the user input to retrieve its correct sense. As shown in Fig. 6.1, this process includes lowercase conversion, tokenization, abbreviation determination, POS tagging, grammar checking, stop-word removal, lemmatization, entity extraction and punctuation removal. IntelliBot uses the Natural Language Toolkit (NLTK) to pre-process user input and obtain a more accurate representation of the information. Each task is explained in the next sub-sections.
Fig. 6.1 NLP tasks performed in the LUU of IntelliBot
6.3.1 Lowercase conversion
Lowercase conversion is a task of the IPU. It changes all text data into lowercase and is one of the simplest forms of text pre-processing. This ensures that, regardless of whether letters begin in uppercase, title case or sentence case, similar words match each other. This is important as an AI model might treat a word at the beginning of a sentence with a capital letter differently from the same word appearing later in the sentence without one, which might lead to a decline in accuracy [120]. Table 6.1 shows examples of the sparsity issue for the same word with different cases.
Table 6.1 Example of lowercase conversion
Raw Input Sentence | Lowercase Sentence
HALLO, Haw R U? | hallo, haw r u?
BanGlaDesh, banglaDESH, bAnglADesH | bangladesh
USA, UK, UAE, KSA | usa, uk, uae, ksa
For example, consider the sentence “Wht is D intaRest rat of CBA Gold Cradit Kard?”. The pre-processed output of this sentence after lowercasing will be “wht is d intarest rat of cba gold cradit kard?”, as shown in Fig. 6.2. As seen, it still contains many errors; however, these are not corrected in this step, which only changes the text to lowercase.
Fig. 6.2 Code snippet of lowercase conversion
6.3.2 Tokenization
Tokenization is a vital step in the NLP pipeline. It is a technique of splitting a stream of input sequences into a list of words, symbols, phrases or other text elements. In other words, tokenization is the process by which each word in a sentence is split out into a list of individual words. Inaccurate word recognition or sentence identification could lead to generating unexpected or irrelevant answers; greater accuracy in word detection is required to achieve greater accuracy in response generation.
Word recognition determines the start and end of each word. This is done by ‘word boundary detection’, which is comparatively straightforward in text. First, user inputs are segmented into atomic word-like tokens on whitespace and punctuation. Then, word boundaries are marked within the sequence of word-like tokens using a hidden Markov model (HMM) [121]. Let the user input be $\omega_i \omega_{i+1}, \ldots, \omega_{i+j}$ and the observable features of a given segment be $[\bigwedge F_{surf}](\omega)$, where $F_{surf} = \{CLASS, CASE, LENGTH, STOP, BLANKS\}$ represents the surface features. Then the probability of a segment sequence $\omega_1^n$, as the sum of path probabilities over all possible generating state sequences, is computed using Eq. (6.1) as defined in [121]:

$$p(W = \omega_1^n) = \sum_{q_1^n \in \mathcal{Q}^n} p(W = \omega_1^n, \mathcal{Q} = q_1^n)$$ (6.1)

where $\mathcal{Q} = \mathrm{rng}(\bigwedge(F_{hide} \cup F_{surf} \setminus F_{ctxf}))$, $F_{hide} = \{BOW, BOS, EOS\}$ are the hidden features and $F_{ctxf} = \{STOP\}$ denotes the context-independent features.
The output of tokenization is provided as input for further text processing steps such as punctuation removal, numeric character removal, lemmatization, POS tagging, entity extraction, spelling correction and grammar checking. IntelliBot uses the PTBTokenizer, which can process 3.15 million tokens per second. For example, in Fig. 6.3, tokenization performed on the sentence “tell me benefits of AMEX card.” splits it into ‘tell’, ‘me’, ‘benefits’, ‘of’, ‘AMEX’, ‘card’, ‘.’.
Fig. 6.3 Code snippet of tokenization
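A minimal stand-in for this step can be written with a regular expression (the thesis uses the PTBTokenizer; this simplified tokenizer is an illustration, not its implementation):

```python
import re

def tokenize(text):
    """Split on word characters, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("tell me benefits of AMEX card."))
# ['tell', 'me', 'benefits', 'of', 'AMEX', 'card', '.']
```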
6.3.3 Abbreviation determination
Abbreviations or acronyms are a shortened form of a word or phrase, used as a symbol for the full form [99]. An abbreviation is formed from the initial letters of a group of words. The usage of abbreviations continues to grow as people communicate through apps. Detecting the full form of abbreviations has major challenges as it is often easy to confuse them with spelling errors or often, the same acronyms have multiple full forms [122, 123]. For example, ‘CU’ is not a dictionary word so it could be detected as a spelling or non-word error. However, ‘CU’ is defined as an abbreviation in Table 6.2. Another example is ‘UNSW’ which has two different full forms as shown in Table 6.2. Determining which full form the acronym refers to is highly domain-dependent [124]. Thus, discovering acronyms and relating them to their expanded forms is important for IntelliBot to correctly understand the user’s question.
To address this issue, IntelliBot adapts a list of abbreviations from www.internetslang.com, which is a database of abbreviations and slang terms. This is the largest database of abbreviations found on the web. This dictionary contains 9,127 shorthand words and phrases. A sample of the dictionary abbreviation words is shown in Table 6.2.
Table 6.2 List of abbreviations in full form
Abbreviation / Acronym | Full Word Form
DIY | Do it yourself
CU | See you
DOB | Date of Birth
BRB | Be right back
ASAP | As soon as possible
UNSW | University of New South Wales
UNSW | United Nations Society of Writers
As it is important to decode abbreviations, IntelliBot performs abbreviation identification using an efficient algorithm before the grammar check. The correct recognition of abbreviations and their full form is very significant for understanding a user query and the context. For example, the abbreviation “CS” has five full forms as shown in Table 6.3.
Table 6.3 List of abbreviations
CS | Computer Science | Subject/Department
CS | Campus Security | Department
CS | Control System | Object
CS | Career Services | Department
CS | Chemical Sciences | Subject
In this thesis, abbreviation determination has four components: Abbreviation Recognizer, Abbreviation Extractor, Definition Finder and Abbreviation Matcher as shown in Fig. 6.4.
The following sub-sections explain the working of these processes in detail.
Fig. 6.4 Workflow of abbreviation recognition and extraction
6.3.3.1 Abbreviation recognizer
In this step, IntelliBot identifies all the short forms in the user query which are likely to be acronyms. To recognize acronyms, IntelliBot focuses on word sequences that appear frequently in the user-bot conversation and applies abbreviation rules. Suppose the user enters the sentence “Where is UNSW?”. IntelliBot’s task is to identify whether UNSW abbreviates “University of New South Wales” or “United Nations Society of Writers”. To do this, IntelliBot first splits the sentence into words and removes stop words using the user-defined list of stop words. Then, the following rules and conditions are applied to each word to ensure that the appropriate abbreviation is recognized:
i. The string contains at least two characters.
ii. The string is not in the user-defined list of stop words.
iii. The string does not contain any special characters.
iv. The string is not a lexicon dictionary word, person name or location name.
However, many proper nouns have the same characteristics as the above and may be wrongly recognized as abbreviations. To reduce the likelihood of recognizing the wrong abbreviation, IntelliBot applies a list of proper names created by the IBM Watson Talent System [125, 126].
6.3.3.2 Abbreviation extractor
Abbreviation extraction is similar to entity extraction. The aim is to check the list of abbreviations identified in the previous step against the full forms in the dictionary. In this step, NLP techniques (a POS tagger) are applied to assign a part of speech to each word, as shown in Eq. (6.2) to Eq. (6.5). POS tagging, described in Section 6.3.4, annotates each unit of text and helps lay the foundation for understanding the relationships between words. Next, entity extraction is used to extract the abbreviation from the given sentence, as shown in Eq. (6.14). The outcome of this step is a list of possible abbreviations, which in this case is “UNSW” from the sentence “Where is UNSW?”. The process of entity extraction is explained in Section 6.3.8.
6.3.3.3 Definition finder
After the abbreviated word is extracted, IntelliBot searches for a possible definition of the abbreviation in the abbreviations database, which contains thousands of abbreviations and their full forms. IntelliBot forms a query with the list of extracted abbreviations to retrieve the corresponding full forms (definitions) from the abbreviation dictionary. If any matches are found, the full form is retrieved from the dictionary.
6.3.3.4 Abbreviation matcher
All abbreviations are categorized based on different industry segments e.g. academic, business, societies and miscellaneous as shown in Table 6.4.
Table 6.4 Abbreviation categorization
Term    Definition                              Category
UNSW    University of New South Wales           University
UNSW    United Nations Society of Writers       Societies
UNSW    Universal National Student Welfare      Voluntaries
UNSW    University of No Sexy Women             Funnies
As seen from Table 6.4, “UNSW” matches entries in the dictionary and the result “University of New South Wales” is retrieved. There is a possibility that more than one result is retrieved. In this case, IntelliBot applies two abbreviation rules to determine the appropriate full form for the given input. One rule is to identify the named entities and their relationships. The other rule uses named entity recognition (NER) to identify the category to which the abbreviation ‘UNSW’ belongs. This is done using a CRF classifier, a discriminative undirected probabilistic model which is used for labelling or parsing sequential data and is trained to maximize a conditional probability. Section 6.3.8 explains entity coreference resolution in detail.
6.3.4 POS tagging using HMM
After the abbreviation process, IntelliBot labels each token (word) as a noun, pronoun, verb, adjective, adverb, preposition or article; this is called POS tagging. IntelliBot does this by using the HMM for POS tagging, which is one of the key machine learning models in NLP [121, 127]. HMM is a probabilistic sequence classifier that indicates the POS type of each word. This approach assigns a label or class to each unit in a given sequence. The goal of HMM is to recover the hidden events from the observed events. To define the HMM, we first need to illustrate the Markov chain, sometimes called the observed Markov model. It generates pairs of sequences (x, y). The sequence x is called the input sequence, observations or visible data. y is called the output tag sequence or hidden data and is represented in Eq. (6.2) defined in [121]:

p(x | y) ≈ ∏_{t=1}^{T} p(x_t | y_t)    (6.2)
For example, the input sequence x = “Could you please tell me, what insurance is covered by Platinum card?” has 14 tokens, so n = 14 and w_1 = Could, w_2 = you, w_3 = please, w_4 = tell, w_5 = me, w_6 = “,”, w_7 = what, w_8 = insurance, w_9 = is, w_10 = covered, w_11 = by, w_12 = Platinum, w_13 = card, w_14 = “?”. To tag each word we need to decide on a set of feature functions f_i as represented in Eq. (6.3):
f_1(x, w_i, l_i, l_{i-1}) = 1 if l_i = ADVERB and the i-th word ends in “-ly”; otherwise 0.    (6.3)
f_2(x, w_i, l_i, l_{i-1}) = 1 if l_i = VERB, i = 1 and the sentence ends in “?”; otherwise 0.
f_3(x, w_i, l_i, l_{i-1}) = 1 if l_{i-1} = ADJECTIVE and l_i = NOUN; otherwise 0.
f_4(x, w_i, l_i, l_{i-1}) = 1 if l_{i-1} = PREPOSITION and l_i = PREPOSITION; otherwise 0.
where x is the input sequence, w_i is the word at position i, l_{i-1} is the tag of the previous word and l_i is the tag of the current word. Eq. (6.3) shows that each feature function is based on the tag l_i of the current word w_i and the tag l_{i-1} of the previous word w_{i-1}, and is either 0 or 1. The output sequence is denoted as y = (y_1, …, y_n), which corresponds to an observation sequence x = (x_1, …, x_n). The probability of the sequence is computed using Eq. (6.4) as defined in [100]:
p(x, y) = p(x | y) p(y)    (6.4)
In the Markov assumption, the probability of an event (tag) depends only on the previous event (tag) as represented by Eq. (6.5) defined in [100, 128]:
p(x, y) = p(x | y) p(y) ≈ ∏_{t=1}^{T} p(x_t | y_t) p(y_t | y_{t-1})    (6.5)

where p(y) ≈ ∏_{t=1}^{T} p(y_t | y_{t-1})
So the output of the tagging model will be y_1 = MD, y_2 = PRP, y_3 = VB, y_4 = VB, y_5 = PRP, y_6 = “,”, y_7 = WP, y_8 = NN, y_9 = VBZ, y_10 = VBN, y_11 = IN, y_12 = NNP, y_13 = NN, y_14 = “.”. Fig. 6.5 shows the part-of-speech assigned to each word of the given sentence.
Fig. 6.5 Part-of-speech tagging into a sentence
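The HMM tagging of Eq. (6.2) and (6.5) is typically decoded with the Viterbi algorithm. The following is a minimal sketch with hand-picked toy probabilities over a fragment of the example sentence; it is not IntelliBot's trained model, and the probability tables are illustrative assumptions.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Decode the most probable tag sequence under the HMM of
    Eq. (6.5): p(x, y) = prod p(x_t | y_t) p(y_t | y_{t-1})."""
    # trellis[t] maps each tag to (best probability, best path so far)
    trellis = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 1e-12), [t])
                for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            prob, path = max(
                (trellis[-1][pt][0] * trans_p[pt].get(t, 1e-12)
                 * emit_p[t].get(w, 1e-12), trellis[-1][pt][1] + [t])
                for pt in tags)
            column[t] = (prob, path)
        trellis.append(column)
    return max(trellis[-1].values())[1]

# Toy parameters covering a fragment of the example sentence.
tags = ["NN", "VBZ", "VBN"]
start_p = {"NN": 0.5, "VBZ": 0.25, "VBN": 0.25}
trans_p = {"NN": {"VBZ": 0.6, "NN": 0.2, "VBN": 0.2},
           "VBZ": {"VBN": 0.7, "NN": 0.2, "VBZ": 0.1},
           "VBN": {"NN": 0.5, "VBZ": 0.3, "VBN": 0.2}}
emit_p = {"NN": {"insurance": 0.8, "card": 0.2},
          "VBZ": {"is": 1.0},
          "VBN": {"covered": 1.0}}

print(viterbi(["insurance", "is", "covered"], tags, start_p, trans_p, emit_p))
# ['NN', 'VBZ', 'VBN']
```

The decoder multiplies emission and transition probabilities exactly as in Eq. (6.5) and keeps, for each tag, only the best-scoring path at every step.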
6.3.5 Grammar check and correction
Grammar checking is the task of detecting and correcting grammatical errors in text. It describes the principles and rules that govern the form and meaning of words and sentences [129]. This is done both on the user input and on the response generated by IntelliBot using the Internet retrieval and generative-based strategies. Statistics show that, in relation to text-based chatbots, 14% of the errors that users make in their writing are related to spelling, 21% are related to grammar and 9% relate to punctuation [130]. So, if a user’s question contains these errors, it cannot be understood by IntelliBot and a response will not be generated. Similarly, in relation to the Internet retrieval and generative-based strategies, IntelliBot generates a response from multiple sources, so there is a possibility that the responses will contain grammatical errors. For these responses, GEC corrects the response before presenting it to the user. This is not done for the responses of the template-based and KB strategies, as these are defined by the expert and hence require fewer corrections of grammatical errors.
The existing literature focuses on spelling errors and proposes many approaches that identify spelling errors with high accuracy [131]. Specific to the case of IntelliBot, these errors are quite common as they usually arise from typing mistakes by the user and hence are important to address. However, in addition to these errors, other types of errors need to be addressed so that IntelliBot can accurately understand the meaning of the user’s question. IntelliBot aims to do the following in relation to grammar checking:
i. Identify and address the most frequent types of errors.
ii. Check the spelling of each word.
iii. Apply a solution to correct the errors to obtain a valid sentence.
6.3.5.1 Classification of grammatical errors
IntelliBot categorizes grammatical errors into six classes, namely, sentence structure error, syntax error, spelling error, punctuation error, semantic error and non-word error. The taxonomy of grammatical errors is shown in Fig. 6.6.
Fig. 6.6 Classification of grammatical errors
• Sentence Structure Error
This type of error refers to the structure of the various parts-of-speech in a sentence needed to provide meaning and significance in terms of readability. Examples of sentence structure errors are as follows:
Structural Error              Error Detail                     Correct Sentence
He started to talking.        Extra ‘to’ or ‘ing’              He started talking.
Goes to University.           Subject ‘He/She’ is missing      He goes to University.
He going to University.       Verb is missing                  He is going to University.
I went to bank it closed.     ‘but’ missing                    I went to the bank, but it closed.
• Syntax Error
This error violates the rules of English grammar. This type of error is dependent on the relationship between the words of a sentence. Examples of syntax errors are as follows:
Syntax Error                        Error Detail                       Correct Sentence
They is not to blame.               Subject ‘They’ takes plural verb   They are not to blame.
I back after a hour.                Article error                      I came back after an hour.
They go to university yesterday.    Event in the past                  They went to university yesterday.
He has recovered of his illness.    Preposition error                  He has recovered from his illness.
• Spelling Error
This error represents the user’s typing mistakes and the generation of a meaningless string of characters. This study uses a high-variance model that requires a large amount of labelled data; approximately 405,411 English words were used to train IntelliBot. A word outside IntelliBot’s dictionary is deemed to be a spelling mistake.
• Punctuation Error
Punctuation is used to separate elements of a sentence. Unnecessary or missing punctuation could change the meaning of the whole sentence. Examples of punctuation errors are as follows:
Punctuation Error                               Error Detail                            Correct Sentence
They took my money lands reputation friends.    Comma missing                           They took my money, land, reputation, and friends.
How are you? Dr. Omar?                          First ‘?’ should be a comma             How are you, Dr. Omar?
Alas he is dead!                                Exclamation mark incorrectly placed     Alas! He is dead.
• Semantic Error
Semantic errors do not violate grammatical rules but nevertheless make the sentence meaningless. This type of error is also called a ‘logical error’ because it arises from a wrong word choice or a contextual error. Examples of semantic errors are as follows:
Semantic Error                                Error Detail    Correct Sentence
A team of fish is swimming.                   ‘team of’       A school of fish is swimming.
He went to library to buy a pen and paper.    ‘library’       He went to the bookstore to buy a pen and paper.
• Non-word Error
Non-word errors are words which are not found in the dictionary, also known as out-of-vocabulary (OOV) errors. Typically, the misspelling of a word leads to a non-word error.
To identify and address these six types of errors, IntelliBot uses a systematic approach that generates a grammatically correct sentence with the highest probability. This process is explained in the next sub-sections.
6.3.5.2 Process in GEC to detect and correct errors
In relation to user input, GEC automatically detects the grammatical errors made by users and recommends corrections. In relation to the generated output, it automatically corrects the grammar of the responses. As previously discussed, GEC is different to spelling correction, which is a well-studied problem [131].
Fig. 6.7 Process of grammar checking
The GEC comprises five steps which are broadly classified into text classification and text transformation phases, as shown in Fig. 6.7. The text classification phase aims to identify grammatical errors in the input whereas the text transformation phase transforms the incorrect sentence to its correct form [132]. The following steps are performed in these phases:
• The first step is word segmentation, where the user input sentence is split into chunks and words are assigned POS tags.
• In the next step, the entity, noun and verb are identified. Depending on the type of POS tag identified, specific grammar rules and patterns are applied to the user input to generate feature–value pairs and identify patterns using the SMT classifier.
• In the third step, a Statistical Machine Translation (SMT) classifier applies grammar structure rules to the patterns of the user input. A rule-based grammar checker is language dependent and can cover almost all language features. The idea here is to check whether a pattern of the rule matches the input user text. If it does not, then it is classified as an error.
• At the input side, on the detection of an error, a recommendation for correction is determined by the grammatical structure rules. This is shown to the user for confirmation. To reduce over-flagging, the GEC unit prompts the user by asking a question (user feedback), explaining problematic word usage, and suggests a possible correction. If the generated sentence or word is accepted by the user, it leads to the generation of a correct sentence. This process continues until the correct sentence or word is determined. Upon user selection of the correct sentence or word, IntelliBot sends this to the sentence generation component.
• IntelliBot also applies this process when the response is generated from its RGU. In this case, however, the corrected sentence is not reconfirmed with the user as in the input stage.
The detailed workings of each phase and the aforementioned steps are explained in the following.
6.3.5.2.1 Text classification to detect errors
As shown in Fig. 6.8, during the text classification phase, the user’s input sentences are first segmented into individual words and checked to determine whether they contain abbreviations (shorthand). Then, each word is checked to see if it is present in the lexicon that consists of 405,411 English words. This process is known as a dictionary lookup, the objective of which is to check whether the user input is a known English word. If a word from the user input is not found in the dictionary, it is flagged as an out-of-vocabulary (OOV) or non-word error and a special token is assigned to it.
Fig. 6.8 Working of the text classification & error detection phase
For example, consider the sentence ‘I like aple’. First, the sentence is divided into three segments and the word ‘aple’ is not found in the dictionary. In the second step, the parts-of-speech ‘NN’, ‘VB’, ‘NN’ for each word of the user input are identified. Then, based on POS tagging, grammatical rules are applied to the sentence and the sentence structure is found to be Subject–Verb–Object. This sentence structure or pattern is searched for in the corpus using n-grams; if it matches, the sentence is considered to be error-free, otherwise it is considered to contain an error. In this case, it is not found and hence it is considered to be an error. To determine the correct word, the n-gram statistical model predicts the category to which the word ‘apple’ belongs. This is important, as the word ‘apple’ may belong to three distinct categories: fruit, company or others. Assuming l given words W = ω_1, ω_2, …, ω_l, the probability P of the word sequence can be shown in Eq. (6.6) as defined in [27]:
P(W = ω_1, ω_2, …, ω_l) = ∏_{i=1}^{l} P(ω_i | ω_1, ω_2, …, ω_{i-1})    (6.6)
In the unigram model (n = 1), the probability of each word present in the corpus is measured independently of the previous words, P(W | ω_1, ω_2, …, ω_{i-1}) ≈ P(ω_i). However, this model ignores the context, as it takes only the current word w_i into account and does not condition on the preceding words. N-gram models with n > 1 predict the next word and express the conditional probability of the next word in a sequence as in Eq. (6.7) defined in [133]:
P(W | ω_1, ω_2, …, ω_{i-1}) = P(ω_i | ω_{i-n}, …, ω_{i-1})    (6.7)
To simplify the probability estimation for each word, the maximum likelihood estimate (MLE) is computed from the n-gram counts: the number of times a word sequence is observed in the corpus is determined and normalized. Finally, for the task of detecting errors, the n-gram model determines whether a word is likely to be spelled correctly or is an error, as in Eq. (6.8) defined in [133]:
P(ω_i | ω_{i−n+1}^{i−1}) = count(ω_{i−n+1}^{i}) / count(ω_{i−n+1}^{i−1})    (6.8)
When a word is flagged as an error, it leads to the text transformation phase.
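The MLE of Eq. (6.8) reduces to simple n-gram counting. The following bigram sketch uses a toy corpus (illustrative only; IntelliBot's corpus is far larger); a bigram probability of zero is the signal that flags a word as a likely error.

```python
from collections import Counter

# Toy corpus standing in for the real training data.
corpus = "i like apple . i like insurance . you like apple".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """MLE of Eq. (6.8) for n = 2: count(w_{i-1}, w_i) / count(w_{i-1})."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("like", "apple"))  # 2/3: 'like apple' occurs twice, 'like' thrice
print(bigram_prob("like", "aple"))   # 0.0: unseen bigram, so 'aple' is flagged
```

A zero (or very low) probability at this step hands the word to the text transformation phase described next.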
6.3.5.2.2 Text transformation to correct errors
In this phase, for the word identified as an error, GEC suggests a possible correction according to the most likely pattern and prompts the user (if it is at the input side) for confirmation by asking a question. If the user agrees with the suggested correction, IntelliBot proceeds with the error correction steps. If this is at the output side, then no confirmation from the user is required.
For text transformation, IntelliBot first removes stopwords such as articles, prepositions and auxiliaries from the given incorrect sentence, as shown in Fig. 6.9. In the second step, it changes nouns to their singular form and verbs to their root form. In the third step, it generates possible correction words to be inserted, or any valid lexical form, for the incorrect sentence. This requires determining, for each candidate word from the corpus, a score for appearing in place of the incorrect word in the pair of n-grams. In other words, n-grams give the probability of words occurring together in the corpus, and this is used to correct the given sentence. The word with the highest probability of occurring with the pair in the bigram is denoted as W_best and is the best word to replace the erroneous word. The process for determining this is as follows.
Fig. 6.9 Working of the text classification & error correction phase
Let’s assume a sentence is represented as s ∈ Δ*, where Δ is the word list. Let s = ω_1, ω_2, …, ω_n, where ω_i ∈ Δ for 1 ≤ i ≤ n. Then, a probability distribution is defined over all sequences in Δ*. Let’s consider the sentence ‘I like aple’ and n = 2. Then n1 = [I like] and n2 = [like aple].
Then, the SMT classifier maps both tags (n1, n2) and compares them with its n-grams in the corpus. If this combination is found, the n-gram word score is based on the number of occurrences, as shown in Fig. 6.9. The mapping over all words, denoted as W, is shown in Eq. (6.9) as defined in [134]:

T_Σ(Δ) × Δ* → W    (6.9)
For a combination that is not found, the best score W_best for the word in the n-grams is determined by using Eq. (6.10) as defined in [134]:

W_best(s) = argmax_{t ∈ T_Σ(Δ)} score(t, s)    (6.10)

where t ranges over the candidate derivations in T_Σ(Δ). In Eq. (6.10), the probability score(t, s) can be either a joint probability P(t, s) or a conditional probability P(t|s). In the case of a joint probability, P(t, s) is used and IntelliBot assigns a probability to the derivation t and searches for W_best for the given input sentence s. In the case of a conditional probability P(t|s), IntelliBot treats the probability P(s) of the given input sentence s as a constant, so the definition of the conditional probability is Eq. (6.11) as defined in [134]:
P(t | s) = P(t, s) / P(s)    (6.11)
The word with the highest probability is likely the correct word and is marked for the update, W_update, for the given sentence s, as shown in Eq. (6.12) as defined in [134]:

W_update(s) = argmax_t P(t | s) = argmax_t P(t, s)    (6.12)
6.3.6 Removing stopwords
Some commonly used words occur very frequently and do not have a significant impact, nor are they useful for the classification of text when processing with NLTK; examples include “the”, “an”, “are” and “in”. Removing these words improves processing time, reduces the consumption of memory and database space, and helps to build a high-quality model [135]. Therefore, IntelliBot removes the stopwords specified in the NLTK stop words corpus. It uses 550 topic-related stopwords and adds additional stopwords, because it is difficult to create a standard stopwords list.
Fig. 6.10 Code snippet of stopwords
As seen in Fig. 6.10, the stop words ‘are’, ‘the’, ‘your’, ‘can’, ‘me’ are removed from the given sentence “What are the benefits your bank can provide me?”.
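A stopword-removal snippet in the spirit of Fig. 6.10 might look as follows. The stopword set here is a small illustrative subset; the full list would normally be loaded with `nltk.corpus.stopwords.words('english')` plus the domain additions described above.

```python
# Illustrative subset of stopwords; IntelliBot uses the NLTK list
# plus its own additions (about 550 words in total).
STOP_WORDS = {"are", "the", "your", "can", "me", "is", "a", "an", "in"}

sentence = "What are the benefits your bank can provide me?"
# Lowercase each token and strip trailing punctuation before filtering.
tokens = [w.strip("?.,!").lower() for w in sentence.split()]
filtered = [w for w in tokens if w and w not in STOP_WORDS]
print(filtered)  # ['what', 'benefits', 'bank', 'provide']
```

Only the content-bearing tokens survive, mirroring the example in the text.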
6.3.7 Lemmatization
After removing the punctuation and stop words, all tokens are lemmatized. This is the process of identifying the base form of all inflectional forms. We use the lemma instead of the stem because stemming often cuts off the end of the word, leaving it meaningless. The lemma, on the other hand, is obtained more accurately via a dictionary and transforms the word into its base or dictionary form. For example, the lemmas of the words applying, paying, cards, verification, benefits and chatbots are apply, pay, card, verification, benefit and chatbot respectively, as shown in Fig. 6.11.
Fig. 6.11 Code snippet of lemmatization
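In practice this is done with a dictionary-backed lemmatizer such as NLTK's WordNetLemmatizer. The tiny lookup below only illustrates the dictionary idea, using the examples from Fig. 6.11; the `LEMMAS` table is an assumption standing in for WordNet.

```python
# Tiny illustrative lemma dictionary; a real system consults WordNet.
LEMMAS = {"applying": "apply", "paying": "pay", "cards": "card",
          "verification": "verification", "benefits": "benefit",
          "chatbots": "chatbot"}

def lemmatize(token):
    """Return the dictionary base form, or the token itself if unknown."""
    return LEMMAS.get(token.lower(), token.lower())

words = ["applying", "paying", "cards", "verification", "benefits", "chatbots"]
print([lemmatize(w) for w in words])
# ['apply', 'pay', 'card', 'verification', 'benefit', 'chatbot']
```

Unknown tokens fall through unchanged, which is the usual behaviour when a word is already in its base form.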
6.3.8 Entity extraction
Entity extraction is an information extraction method that identifies and classifies key entities from user input sequences into predefined classes such as PERSON, ORGANIZATION, LOCATION and TIME. The entity extraction process helps to transform unstructured data into structured data. It therefore allows machine-learning algorithms to undertake standard processing that can be applied to retrieve information, extract facts and answer questions.
Fig. 6.12 Process of entity extraction
As seen in Fig. 6.12, tokenization splits the text by identifying word boundaries. Abbreviation recognition recovers the full form of shorthand text. Part-of-speech (POS) tagging annotates each unit of text with its grammatical function, which helps to lay the foundation for understanding the relationships between words, phrases and sentences. These processes help IntelliBot define the position of entities and begin to infer their likely role within the text as a whole. Then, it is possible to extract the recognized entities.
For example, the text “Dr Omar Hussain visited Auckland University in December 2019. He gave a speech at that University” comprises two sentences. Based on Eq. (6.3), we define a POS tag for each word as shown in Fig. 6.13.
1st sentence
2nd sentence
Fig. 6.13 POS tagging for both sentences
As seen in Fig. 6.13, POS tagging successfully tagged each word with its corresponding part-of-speech. However, the phrases ‘Dr Omar Hussain’ and ‘Auckland University’ are multiword expressions and should be tagged as single units. Thus, we need to apply named entity recognition (NER) to resolve this, as shown in Fig. 6.14.
1st sentence
2nd sentence
Fig. 6.14 Entity recognition for both sentences
It is easier for IntelliBot to identify entities and extract them from the first sentence. But it becomes more challenging in the second sentence, as IntelliBot needs to find and tag different phrases that refer to the same entity. Furthermore, there is no relationship between the tags. For entity extraction purposes, this involves identifying the entities to which the words ‘He’ and ‘University’ refer. This is known as coreference resolution, as shown in Fig. 6.15.
Fig. 6.15 Coreference resolution
As seen in Fig. 6.15, the coreference resolution reveals direct relationships and events among different entities by establishing links between them. IntelliBot uses StanfordCoreNLP [102] to extract entities and perform coreference resolution using the CRF classifier. The CRF is a discriminative undirected probabilistic model which is used for labelling or parsing sequential data in NLP and is trained to maximize a conditional probability. To build a CRF model, we define a set of feature functions with their POS as in Eq. (6.3) and then assign a weight w_j to each feature, denoted f_j. For the given input sequence x, we score a tag sequence l of x by adding up the weighted feature values over all words in the sentence x, as shown in Eq. (6.13) defined in [121]:

score(l | x) = ∑_{j=1}^{m} ∑_{i=1}^{n} w_j f_j(x, i, l_i, l_{i−1})    (6.13)
where 푙푖 is the tag of the current word 푤푖 and 푙푖−1 is the tag of the previous word. As an outcome of this, the entity relation and its dependencies are formed as shown in Fig. 6.16.
Fig. 6.16 POS tagging with entity dependency relationship
Finally, each feature value f_j is normalized and then all are added together to transform the scores into a probability [136] of the named entity, denoted as P_w(y|x), as represented in Eq. (6.14) defined in [121]:

P_w(y | x) = (1 / Z_w) exp(w^T f(x, i, l_i, l_{i−1}))    (6.14)

where Z_w is a normalization constant and w^T is the transposed weight vector. The outcome of P_w(y|x) is shown in Fig. 6.17.
Fig. 6.17 Named entity recognition
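The CRF scoring of Eq. (6.13) can be sketched as a weighted sum of binary feature functions. The features and weights below are toy values in the style of Eq. (6.3), not IntelliBot's trained parameters.

```python
# Feature functions in the style of Eq. (6.3): each takes the sentence x,
# the position i, the current tag l_i and the previous tag l_prev.
def f_adj_noun(x, i, l_i, l_prev):
    return 1 if l_prev == "ADJECTIVE" and l_i == "NOUN" else 0

def f_ly_adverb(x, i, l_i, l_prev):
    return 1 if l_i == "ADVERB" and x[i].endswith("ly") else 0

# (weight, feature) pairs; weights are illustrative.
FEATURES = [(2.0, f_adj_noun), (1.5, f_ly_adverb)]

def score(tags, x):
    """Eq. (6.13): sum over features j and positions i of w_j * f_j."""
    return sum(w * f(x, i, tags[i], tags[i - 1] if i else None)
               for w, f in FEATURES
               for i in range(len(x)))

x = ["platinum", "card"]
print(score(["ADJECTIVE", "NOUN"], x))   # 2.0: the adjective-noun feature fires
print(score(["NOUN", "NOUN"], x))        # 0.0: no feature fires
```

Exponentiating and normalizing these scores over all tag sequences yields exactly the probability of Eq. (6.14).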
6.3.9 Punctuation removal
IntelliBot removes punctuation as it does not add anything to the meaning of a sentence. Examples of punctuation are: ! @ # ; : ? $ % & * _ + - () [] {} <> ”” ~. As seen in Fig. 6.18, the user enters “Hello!!, how? are you?” and IntelliBot processes this as “Hello how are you” as a result of punctuation removal.
Fig. 6.18 Code snippet of removing punctuation
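The behaviour shown in Fig. 6.18 can be sketched with Python's `str.translate`; this is an illustrative sketch, not IntelliBot's exact code.

```python
import string

def remove_punctuation(text: str) -> str:
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Hello!!, how? are you?"))  # Hello how are you
```

`str.maketrans` with a third argument builds a deletion table, so the whole pass runs in a single call.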
6.4 Computing Semantic Similarity of a Possible Answer with the User Question
Chapter 5 explains the working of the response generation strategies through which IntelliBot generates a possible response. When the KB and IR strategies are used, more than one possible response may be generated. Among these, the next task of the RGU is to determine the semantic similarity of each response to the user question. It is important for the RGU to choose the appropriate response, that is, a response which has the correct combination of words in answer to the user’s question. This is done by determining the semantic similarity of the output with the user question, calculated as the distance between two words based on the correspondence of their meaning or sense [137]. The word vector is created from the words of the input (the user’s question) and their synsets, while the semantic vector is created from the words of the generated response, retrieved from the KBDB/Internet, and their synsets. By comparing the word vector and the semantic vector, the sentence-level similarity is measured.
The importance of this can be explained using the example of the word ‘person’ in a possible response. This word may have more than one meaning in the context of the sentence it is in, which may lead to uncertainty. For example, as shown in Fig. 6.19, the word ‘person’ may refer to an ‘individual’, ‘someone’ or a ‘human’. It can also refer to an ‘adult’, ‘male’ or ‘female’ person. It may also refer to a profession, e.g. ‘teacher’, or to a ‘child’, ‘boy’ or ‘girl’. Thus, using a variant of a word changes the sense and syntax of the sentence, which influences its semantic properties.
Fig. 6.19 Various senses of a word
In relation to the possible generated responses, semantic similarity helps to measure the distance between each word of the generated response and the question [138]. The distance represents a confidence score that reflects the relevance of the response to the question; a higher score indicates a greater similarity to the given sentence. For example, for the question ‘what is the status of my delivery?’, the responses listed in Table 6.5 are generated. As shown in Table 6.5, IntelliBot computes the semantic similarity of each to the question and identifies the confidence score of each of them. These are then passed to the RGU. Section 6.4.2 elaborates the process of sentence-level similarity in more detail.
Table 6.5 Confidence score of responses
Question                              Response                                                        Confidence Score
What is the status of my delivery?    Your delivery is dispatched.                                    51.43%
                                      You will receive your delivery in two business days.            60.83%
                                      Your delivery was dispatched on 14th Feb 2020. You can
                                      expect to receive your delivery in the next two business
                                      days.                                                           93.76%
When the generative-based strategy is used to generate a response, IntelliBot determines the semantic similarity measures at the word level. As mentioned in Chapter 5, in this strategy, the output is predicted based on the input. So, the similarity between the user input question words and their synsets is used to create the word vector. This is used to predict the next word of the output creating the semantic vector. By comparing both the word vector and semantic vector, we measure the word-level semantic similarity. Section 6.4.1 elaborates the process of determining the word-level similarity in detail.
Fig. 6.20 Semantic similarity determined at the sentence and word levels in the four response generation strategies
Fig. 6.20 shows how word vectors and semantic vectors are used to measure semantic similarity when the IR, KB and generative-based strategy are used. IntelliBot uses two dictionaries, one being WordNet as a lexical database for English words and the other being the domain-specific dictionary for insurance keywords. The following section describes the semantic similarity at the word level.
6.4.1 Detail of determining the semantic similarity at the word level
Semantic similarity at the word level is measured by comparing the meaning between the word vector and the semantic vector in the generative-based response generation strategy. The word vector is the vector-based representation of the input (question) whereas the semantic vector is the vector representation of the output (answer) which is the next predicted word according to the sense of the inputs at a given timestep.
Fig. 6.21 Semantic similarity at the word level
The semantic similarity at the word level is determined in five steps, described as follows:
6.4.1.1 Identifying words and POS tagging
The WordNet ontology is built on synonymy. Each word has a set of synonyms, also called a synset. Synsets are interconnected groups of synonymous words that express the same meaning and share lexical relations. Thus, in the first step it is important to determine the key phrases, relationships and dependencies with the other words of the input sequence by tokenizing, removing punctuation, lemmatizing and POS tagging the input sequences and labelling them accordingly. IntelliBot only considers nouns, verbs, adjectives and adverbs, as WordNet has relationships between these four types of POS; this reduces calculation time and complexity. It applies the HMM for POS tagging as represented in Eq. (6.15), where x is the input sequence and y is the output tag sequence. The detail of this equation is shown in Eq. (6.2) to (6.5) as defined in [121]:
p(x | y) ≈ ∏_{t=1}^{T} p(x_t | y_t)    (6.15)
The outcome is shown in the following figure which represents the word dependencies and corresponding parts-of-speech.
6.4.1.2 Find word sense disambiguation
After POS tagging each key phrase, IntelliBot determines the relationships and dependencies by applying word sense disambiguation (WSD). WSD determines which meaning (sense) of a word is triggered by its use in context; in other words, the problem of resolving semantic ambiguity is called WSD. For example, ‘what has happened to my insurance?’, ‘what’s wrong with my policy?’ and ‘what is the status of my insurance?’ all refer to an insurance premium and can be answered by one response. However, this may not be true for other sentences. So, the adopted approach not only does a word-to-word comparison, it also pays attention to the context in order to capture more of the semantics. For example, in the sentence “Nuruzzaman likes to watch a football match”, the word “football” occurring with the word “match” implies that the meaning of ‘match’ is a “game”.
In the disambiguation step, this is done by taking all the input words and using the WordNet database to obtain the corresponding possible output word according to the sense of the inputs until now. The possible relations of an input word in question with the possible output word are identified. If an input word does not exist, it is stored in the WordNet database. Eq. (6.16) is used to perform WSD as defined in [139]:
argmax_synset(a) = ∑_{i=0}^{n} max_synset(i) (sim(i, a))    (6.16)
For example, consider the words “insurance” and “credit card”. The WordNet synsets for “credit card” and “insurance” are shown in Table 6.6.
Table 6.6 Synsets of words
Word Synset
Insurance Policy, indemnity, protection
Credit Card Bankcard, cash card, charge card
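The disambiguation of Eq. (6.16) can be sketched as choosing the candidate sense with the highest summed similarity to the context words. The similarity table below is a toy stand-in for WordNet-derived scores, and the sense labels are hypothetical.

```python
# Toy similarity scores between candidate senses and context words;
# a real system derives these from WordNet relations.
SIM = {("match.game", "football"): 0.9, ("match.game", "watch"): 0.4,
       ("match.fire", "football"): 0.1, ("match.fire", "watch"): 0.2}

def disambiguate(senses, context):
    """Simplified Eq. (6.16): argmax over senses of summed similarity
    to the surrounding context words."""
    return max(senses,
               key=lambda s: sum(SIM.get((s, w), 0.0) for w in context))

# 'match' in "watch a football match" resolves to the game sense.
print(disambiguate(["match.game", "match.fire"], ["watch", "football"]))
# match.game
```

With this context, the game sense scores 1.3 against 0.3 for the fire-starting sense, matching the “football match” example in the text.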
6.4.1.3 Calculate the shortest path between two synsets
A chosen word may have multiple synsets according to the meanings of the word. Synsets are connected to other synsets by means of semantic relations. In this step, the distance between synsets is measured; the distance between words increases as their similarity decreases.
Fig. 6.22 Hierarchical structure graph (subset of WordNet)
Fig. 6.22 illustrates the distance between the two words “male” (w_1) and “female” (w_2). First, assume that w_1 and w_2 are in the same synset. For example, the traversal path from “male” (w_1) to “female” (w_2) indicates that both have the same meaning, which is “person”; so the distance between w_1 and w_2 is 0. Second, let’s assume that “boy” (w_3) and “girl” (w_4) are not in the same synset. For example, “boy” (w_3) is under the “male” (w_1) synset and “girl” (w_4) is under the “female” (w_2) synset. Nevertheless, both w_3 and w_4 have one common word, ‘child’. In this scenario, the path length between w_3 and w_4 is 1. Taking these scenarios into account, IntelliBot uses the function shown in Eq. (6.17) to calculate the shortest path contribution, as defined in [139]:
f(l) = e^{−αl}    (6.17)

where l is the shortest path length, α is a constant, and the exponential function ensures that the value of f(l) lies between 0 and 1.
6.4.1.4 Hierarchical distribution of words
Next, the hierarchical distribution of words is calculated. This is a super-subordinate relationship between synsets, also called hypernymy or hyponymy. The relationship connects a general concept to specific ones, and it is important when the distance between word pairs is the same. This is because upper-level words carry less semantic information than lower-level words [140]. For example, Fig. 6.23 illustrates the word 'vehicle' and its hyponyms. The word 'vehicle' is at the upper level, has common properties, and does not give information about the type of vehicle, whereas the synsets under 'vehicle' carry more specific information about the type of vehicle, e.g. car, motorcycle, motor vehicle, bicycle.
Fig. 6.23 Hierarchical distribution of words
The semantic similarity measure of two words in a hierarchical structure can be formulated as in Eq. (6.18) where the maximum path length between 푤1 and 푤2 is denoted as 푀푎푥, and the shortest path is denoted as 푆푃 defined in [141]:
sim(w1, w2) = 2 × Max(w1, w2) − SP   (6.18)
The lower levels of the hierarchical structure may have more semantic information and specific properties. To scale this up and build the semantic relative matrix, Eq. (6.19) is used as defined in [139]:

g(h) = (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh))   (6.19)
where α and β are 0.20 and 0.45 respectively, e is the exponential function and h is the hierarchical depth (level). The hierarchical depth could be a long-distance path. To determine the relatedness between two words (w1, w2) using path distance, the paths connecting w1 and w2 may include several parent and child nodes. If there is a close relationship between the meanings of w1 and w2, they are said to be semantically related to each other [142]. This study optimizes Eq. (6.19) as follows, where SP is the shortest path, as shown in Eq. (6.20) as defined in [139]:
g(h) = e^(−α·SP) × (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh))   (6.20)
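A minimal sketch of Eqs. (6.19) and (6.20); note that the ratio in Eq. (6.19) is exactly tanh(βh). The constants α = 0.20 and β = 0.45 are those given above.

```python
import math

ALPHA, BETA = 0.20, 0.45  # constants given in the thesis

def depth_factor(h, beta=BETA):
    """Eq. (6.19): g(h) = (e^(beta*h) - e^(-beta*h)) / (e^(beta*h) + e^(-beta*h)),
    i.e. tanh(beta * h); deeper (more specific) synsets score higher."""
    return math.tanh(beta * h)

def scaled_depth_factor(h, sp, alpha=ALPHA, beta=BETA):
    """Eq. (6.20): the depth factor scaled by the path term e^(-alpha * SP)."""
    return math.exp(-alpha * sp) * depth_factor(h, beta)

print(scaled_depth_factor(2, 0) == depth_factor(2))  # True: SP = 0 leaves g unchanged
```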
6.4.1.5 Measuring the similarity between the two vectors
Lastly, a semantic vector is created in which each word carries a weight. The semantic similarity between the word vector and the semantic vector is calculated by Eq. (6.21) as defined in [143]:
Sim(w1, w2) = (1/2) × ( Σ_{ω∈W1} maxSim(ω, w2) · f(ω) / Σ_{ω∈W1} f(ω)
            + Σ_{ω∈W2} maxSim(ω, w1) · f(ω) / Σ_{ω∈W2} f(ω) )   (6.21)

where maxSim(ω, w) is the maximum value of the similarity between word ω and the words of segment w, and W1, W2 are the word segments being compared.
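The bidirectional averaging in Eq. (6.21) can be sketched as follows. The weighting function f and the word-level similarity are stand-ins here (a constant weight and exact-match similarity); the thesis uses the path-based f of Eq. (6.17) and the WordNet-based word similarity.

```python
def seg_similarity(W1, W2, sim, f=lambda w: 1.0):
    """Eq. (6.21): average, over both directions, of each word's best match
    in the other segment, weighted by f(w)."""
    def directed(A, B):
        num = sum(max(sim(a, b) for b in B) * f(a) for a in A)
        return num / sum(f(a) for a in A)
    return 0.5 * (directed(W1, W2) + directed(W2, W1))

# Toy word similarity: 1 for identical words, 0 otherwise.
exact = lambda a, b: 1.0 if a == b else 0.0
print(seg_similarity(["interest", "rate"], ["interest", "rate"], exact))  # 1.0
print(seg_similarity(["interest"], ["interest", "rate"], exact))          # 0.75
```

Identical segments score 1.0; the partial overlap scores 0.75 because the direction from the longer segment finds a best match for only one of its two words.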
6.4.2 Detail of semantic similarity at the sentence level
The meaning of a sentence is represented by its words [139]. Words arranged with a well-defined grammatical structure make a sentence that conveys meaning. When comparing two sentences, there may be many word pairs with multiple synsets. Therefore, only words with POS tags, in the context of the sentences being compared, are considered; otherwise, the comparison might lead to uncertainty.
In the first stage, the input sequence is segmented into a list of words with POS tags. Then, the neural dialogue manager (NDM) finds the most appropriate sense for each word in the sentence. This process is known as word sense disambiguation (WSD). Finally, the list of selected words in the user question (the word vector) is compared with the semantic vectors.
Fig. 6.24 Semantic similarity at the sentence level
As shown in Fig. 6.24, word vectors w⃗ are created for each sentence being compared. A word vector w⃗ represents the summarised information of a word according to its sequence in a sentence. Then, a list of distinct words is formed. Let us assume that there are three sentences from the result of the query, denoted by S1, S2 and S3. The list of the distinct word set from S1, S2 and S3 is {w1, w2, w3, …, wn}, as defined in [20]:
S = S1 ∪ S2 ∪ S3 = {w1, w2, w3, …, wn}   (6.22)
Now S has a distinct word set from all the sentences. If the word in question exists in the WordNet database, a lexical semantic vector, denoted L_i, is created for each sentence. L_i represents the semantic similarity of the semantic vector in a sentence S_i with the input vector. If word w_i appears in sentence S_i, where i = (1, 2, 3, …, n), the value of L_i = 1; otherwise L_i = 0.
Next, disambiguation is done by taking the input vector and using the WordNet database, as presented in the previous section in Eq. (6.16). The importance of a word in the input vector is determined as p̂(w) = (n + 1)/(N + 1), where n is the frequency of word w, N is the total number of words in the corpus, and p̂(w) is the content derived from the corpus. It is important to do this as the meaning of the sentence changes according to how the input word contributes to it [144]. By combining L_i with p̂(w), a semantic vector, denoted s_i, is created for each sentence. Each cell is weighted by the related content C(w_i) and C(w̃_i), where i = (1, 2, 3, …, n).
s_i = L_i · C(w_i) · C(w̃_i)   (6.23)
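The smoothed word probability p̂(w) = (n + 1)/(N + 1) used in the weighting above can be sketched as follows; the toy corpus is invented for illustration.

```python
def word_importance(word, corpus_tokens):
    """Smoothed corpus probability p(w) = (n + 1) / (N + 1), where n is the
    frequency of `word` and N the total token count. The add-one smoothing
    gives unseen words a small non-zero weight."""
    n = corpus_tokens.count(word)
    N = len(corpus_tokens)
    return (n + 1) / (N + 1)

corpus = ["the", "interest", "rate", "of", "the", "card"]
print(word_importance("the", corpus))     # (2 + 1) / (6 + 1) = 3/7
print(word_importance("unseen", corpus))  # (0 + 1) / (6 + 1) = 1/7
```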
For the semantic vector s_i generated for each sentence, the semantic similarity is computed, which represents the lexical similarity. It is defined as the cosine coefficient between the semantic vectors s_i. Semantic similarity, denoted Ss, is computed by Eq. (6.24) as discussed in [144]:
Ss = (s1 · s2 · … · sn) / (‖s1‖ · ‖s2‖ · … · ‖sn‖)   (6.24)
Next, the word similarity from word vector w⃗ is determined. Word similarity is denoted by Sw, r is the word order in a sentence S_i, and i is the index number, which is incremented by 1. The word similarity of two sentences, Sw, is calculated by Eq. (6.25) as defined in [20, 144]:
Sw = 1 − ‖r_{i+1} − r_{i+2}‖ / ‖r_{i+1} + r_{i+2}‖   (6.25)
Finally, the overall similarity S(S1, S2) results from combining Ss and Sw. The similarity between two sentences S1, S2 is shown in Eq. (6.26) as defined in [20, 144]:

S(S1, S2) = ϑ·Ss + (1 − ϑ)·Sw = ϑ · (s1 · s2) / (‖s1‖ · ‖s2‖) + (1 − ϑ) · ‖r_{i+1} − r_{i+2}‖ / ‖r_{i+1} + r_{i+2}‖   (6.26)

where 0.5 ≤ ϑ ≤ 1. If similarity is to be measured between more than two sentences, IntelliBot iterates the previous steps for each sentence, from Eq. (6.22) to Eq. (6.26), as shown in Eq. (6.27):
S(S1, S2) = ϑ·Ss + (1 − ϑ)·Sw = ϑ · Σ_{i=1}^{n} (s_i · s_{i+1}) / (‖s_i‖ · ‖s_{i+1}‖) + (1 − ϑ) · ‖r_{i+1} − r_{i+2}‖ / ‖r_{i+1} + r_{i+2}‖   (6.27)
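The combination in Eqs. (6.24)-(6.26) can be sketched as below. The weight ϑ = 0.85 is only an illustrative choice within the stated range 0.5 ≤ ϑ ≤ 1, and the vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two semantic vectors (Ss, Eq. 6.24)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def word_order_similarity(r1, r2):
    """Sw = 1 - ||r1 - r2|| / ||r1 + r2|| (Eq. 6.25) on word-order vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    summ = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1 - diff / summ

def overall_similarity(s1, s2, r1, r2, theta=0.85):
    """S = theta * Ss + (1 - theta) * Sw (Eq. 6.26), with 0.5 <= theta <= 1."""
    return theta * cosine(s1, s2) + (1 - theta) * word_order_similarity(r1, r2)

# Identical sentences: both components are 1, so S is 1 for any theta
# (up to floating-point rounding).
print(overall_similarity([1, 0, 1], [1, 0, 1], [1, 2, 3], [1, 2, 3]))
```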
Table 6.7 Similarity between answer relevant to the question
Question                             Answer                                                      Similarity
Where do you live?                   I have a pen.                                               0.00%
What is your name?                   My name is Nuruzzaman.                                      100%
What is the status of my delivery?   Your delivery was dispatched on 14th Feb 2020. You can      93.76%
                                     expect to receive your delivery in the next two business
                                     days.
6.5 Process of sentence scoring at RAU
As discussed in Chapter 5, once the semantic similarities at the word and sentence levels are calculated, they are passed to the RAU of IntelliBot. RAU takes each of the possible generated answers from the RGU and determines its commonality with respect to the question asked by the user. This is determined by comparing the question set Q and answer set A to find the number of common elements between them, as shown in Fig. 6.25. For example, as seen in Fig. 6.25, there are 10 elements in Q and 10 elements in A, and the number of elements in the intersection of Q and A is 7.
Fig. 6.25 Two sets with Jaccard similarity 7/13
For example, for the question “what is the interest rate of the AMEX card?”, IntelliBot’s response is “The interest rate of the AMEX card is 14.24%.” This question has three elements: “interest rate”, “AMEX” and “card” and the answer has four elements “interest rate”, “AMEX”, “card” and “14.24%”. So, there are three common elements: “interest rate”, “AMEX”, “card”.
IntelliBot uses the Jaccard similarity approach for scoring to determine the commonality between the words in the question and the words in the answer. This is also known as the overlap coefficient, with a range from 0% to 100%. Using this approach, each sequence is scored by the lexical overlap between the two sets, as shown in Eq. (6.28), as defined in [81, 141]:
δ = J(Q, A) = |Q1 ∩ A1| / |Q1 ∪ A1|   (6.28)
where δ denotes the score, |Q1| is the number of words in the user question Q1, and |A1| is the number of words in the answer sentence A1. For example, consider the two sets Q and A in Fig. 6.25. There are 7 elements in their intersection and a total of 13 elements that appear in either Q or A. Thus, δ is 7/13 × 100 ≈ 54%. The most similar sentence in response to a question will have the highest similarity score δ. In the case of multiple matches, the δ of each sentence is calculated. If δ exceeds a predefined threshold of 0.5, the sentence is presented as a response to the user. IntelliBot responds with the default message, 'I don't know', if δ = 0.
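The scoring in Eq. (6.28) and the thresholding described above can be sketched as follows. The fallback message and the 0.5 threshold follow the text; set membership stands in for the element-extraction step.

```python
def jaccard_score(question_words, answer_words):
    """Eq. (6.28): delta = |Q ∩ A| / |Q ∪ A|, as a fraction in [0, 1]."""
    q, a = set(question_words), set(answer_words)
    return len(q & a) / len(q | a)

def respond(question_words, candidates, threshold=0.5):
    """Return the candidate answer with the highest delta above the
    threshold; otherwise fall back to the default message."""
    best = max(candidates, key=lambda c: jaccard_score(question_words, c))
    if jaccard_score(question_words, best) > threshold:
        return best
    return "I don't know"

# The worked example from Fig. 6.25: 7 common elements out of 13 total.
q = set(range(10))     # 10 question elements
a = set(range(3, 13))  # 10 answer elements, 7 shared with q
print(round(jaccard_score(q, a), 2))  # 0.54
```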
6.6 Conclusion
In this chapter, we outlined several methods for measuring similarity and checking grammar using various NLP techniques and neural network methods, such as entity extraction, POS tagging, hidden Markov models, n-grams and word embeddings. It shows that these are effective solutions to several problems in the DBRNN. These methods are used by the NDM during the processing of IntelliBot.
CHAPTER 7
“Programming is about managing complexity: the complexity of the problem laid upon the complexity of the machine. Because of this complexity, most of our programming projects fail.” — Bruce Eckel
DATA COLLECTION AND TRAINING THE GENERATIVE STRATEGY OF INTELLIBOT
7.1 Introduction
To build IntelliBot, a conceptual framework that is capable of accommodating dialogue-based outputs was proposed in Chapter 4. The framework consists of user input analysis, language understanding, dialogue context discovery & tracking and response generation components, as described in the earlier chapters. In Chapter 5, the working of the different response generation strategies was explained in detail. However, each of the strategies requires domain-specific data to work on. This chapter details the process of how this domain-specific data, which in the context of this thesis is insurance-related, is collected before IntelliBot is used.
Furthermore, as discussed in Chapter 5, if the template-based, knowledge-based and internet-retrieval strategies are unable to generate a response to the user query, then the generative-based strategy is used. The generative-based response generation strategy uses the seq2seq model with an attention mechanism in DBRNN. This model maps input sequences to output sequences ahead of time, meaning it is able to connect previous information to the current hidden state. To do this, the DBRNN requires training so that it generates words according to the required context. In this chapter, we also detail the process by which the DBRNN is trained to be used in the fourth response generation strategy of IntelliBot.

(Parts of this chapter have been published in [8], [20] and [100].)
The structure of the chapter is as follows: Section 7.3 explains how data is collected from various sources, including document and web crawling, for use by each response generation strategy. Sections 7.4-7.9 explain the process of training the DBRNN needed for the generative-based response generation strategy. Specifically, Sections 7.5 and 7.6 illustrate the steps of data preparation, data cleansing and feature engineering. Section 7.7 discusses the design of the neural networks, and Section 7.8 explains the training environment setup. The chapter concludes with a description of training IntelliBot using the DBRNN to build the model.
7.2 Key Terms
Token: a single unit of text, typically a word, referred to as a "token".
Sequence Labelling: an NLP task which assigns a pre-defined label to each token in a given input sequence. Common core NLP tagging tasks are word chunking, POS tagging and NER. For example, POS tagging assigns a part of speech to each word in the input sentences. Most sequence labelling algorithms are probabilistic in nature, relying on statistical inference to find the best sequence. The most common traditional sequence labelling models are the Markov chain model, the maximum entropy Markov model (MEMM), the CRF model, the averaged perceptron (AP) and the structured SVM (SVMstruct). Sequence labelling can be applied to various fundamental problems such as POS tagging, entity extraction, parsing, text chunking, information extraction, word segmentation, gene prediction, machine translation, and speech and handwriting recognition [128, 145].
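As a toy illustration of probabilistic sequence labelling, the sketch below decodes POS tags with the Viterbi algorithm over a tiny hand-made HMM; every probability is invented for illustration and a real tagger would estimate them from a corpus.

```python
# Toy HMM for POS tagging. START: initial tag probabilities; TRANS:
# tag-to-tag transitions; EMIT: word emission probabilities (all invented).
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET":  {"the": 0.9},
    "NOUN": {"card": 0.5, "rate": 0.4},
    "VERB": {"covers": 0.8},
}

def viterbi(tokens):
    """Return the most likely tag sequence for `tokens` under the toy HMM."""
    states = list(START)
    best = [{s: (START[s] * EMIT[s].get(tokens[0], 1e-6), [s]) for s in states}]
    for tok in tokens[1:]:
        layer = {}
        for s in states:
            p, prev = max(
                (best[-1][q][0] * TRANS[q][s] * EMIT[s].get(tok, 1e-6), q)
                for q in states
            )
            layer[s] = (p, best[-1][prev][1] + [s])
        best.append(layer)
    return max(best[-1].values())[1]

print(viterbi(["the", "card", "covers"]))  # ['DET', 'NOUN', 'VERB']
```

The unseen-word fallback of 1e-6 is a simple smoothing assumption so that out-of-vocabulary tokens do not zero out every path.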
7.3 Process of collecting the data required for each response generation strategy
This step collects data from which IntelliBot generates a response and is a common step in all four response generation strategies. Depending on the response generation strategy used, the type of data required changes. As shown in Fig. 7.1, the template-based strategy requires domain-specific data on which templates can be defined by an expert in the AIML templates. So, the data required for these are specific to the questions that need to be formed. This is similar to the data required for the knowledge-based strategy that should be domain-specific
for it to form Q&A pairs. Such domain-specific datasets for the purposes of building IntelliBot were collected from two major Australian banks, Commonwealth Bank and ANZ Bank. Information regarding insurance was collected, along with 15 Product Disclosure Statements (PDS) specific to credit card insurance. Furthermore, information from websites related to the selected credit cards was also crawled.
A suitable method for collecting questions and answers to train a chatbot for a specific domain was introduced in [146]. More specifically, two distinct crawlers were applied for data collection: a web crawler and a document crawler. The web crawler was used to collect data from webpages, and the document crawler was used to collect data from the PDS documents. The outcome of crawling is information for the insurance domain gathered as question-and-answer pairs. The statistics of the collected data are shown in Table 7.1. Once the information is represented in this form, it is used either to form templates or is stored in the KBDB. The process of forming templates to work in the template-based strategy is explained in Section 5.4.3. As explained in Section 5.5.3, in the knowledge-based strategy, after the KBDB is formed, a query can be formed to generate a response before the RGU checks for its semantic similarity [as explained in Section 6.4.2] and the RAU scores its relevance to the question [as explained in Section 6.5].
Table 7.1 Statistics for the insurance domain-specific QA dataset
Number of Conversations    10,115
Number of Utterances       100,698
Is Metadata Included       No
Number of Domain Terms     1,000
The Internet-retrieval strategy grabs information from selected websites. As shown in Fig. 7.1 and explained in Section 5.6.3, this information is crawled from specific selected websites, the noisy data is cleaned, and then the relevant information is extracted before a response is generated. The generated response is checked for its semantic similarity and grammar, and is scored to determine its relevance to the question, as explained in Section 5.6.3, Section 6.4.2, Section 6.3.5 and Section 6.5 respectively. The relevant information is stored in the KBDB, so the next time a similar question is asked, the KB strategy can be used to respond to it.
Fig. 7.1 Data collection procedures for the four strategies
The generative-based approach uses the DBRNN to predict a response based on the input and its context. So, in addition to the domain-specific dataset required by the knowledge-based strategy, two more types of datasets are needed. The first is general data from which IntelliBot learns, when using this strategy, to converse in English. General data was collected from Cornell University: the Cornell Movie Dialog corpus. This dataset is publicly available and contains narrative conversations extracted from raw movie scripts with metadata. It provides a well-formatted base dataset for training
IntelliBot to converse on a wide range of topics and derive responses from them. The Cornell movie corpus dataset statistics are shown in Table 7.2. The second dataset is the KBDB that was created for the knowledge-based strategy. As a result of processing, two types of QA datasets are created. The first is the dialogue dataset formed from the Cornell movie corpus, as shown in Table 7.3, and the second is the insurance domain-specific dataset collected for the KB strategy, as shown in Table 7.4.
Table 7.2 Statistics for the Cornell movie corpus
Number of Conversations      220,579
Number of Utterances         304,713
Pairs of Movie Characters    10,292
Number of Movies             617
Is Metadata Included         Yes
Table 7.3 Sample raw data from Cornell movie corpus
Table 7.4 Sample raw data from the insurance QA dataset
Q: What is the interest rate of the AMEX card?
A: Interest rate of the AMEX card is 0% p.a. on purchases for 12 months.
Q: What types of insurance are covered by the AMEX card?
A: Insurance covered by the AMEX card are health, accident & travel.
Q: What does credit card insurance cost?
A: As little as 45 cents per $100 of the card balance.
The third dataset that is needed to train this strategy is a domain-specific vocabulary. Human vocabulary comes as free text; however, neural networks are purely statistical models. They do not create or understand fundamental concepts, such as the greenhouse effect, or the significance of phrases such as "interest free period", "interest rate", "master card" and "visa card". During NLP processing, these phrases cannot be recognized as they are split into multiple words, which leads to a different meaning. To build a domain-specific chatbot, word sense disambiguation (WSD) plays a significant role and needs to be processed by a completely different model. In this step, the token
Table 7.5 Statistics for the vocabulary dataset used to build IntelliBot
                                     Custom     WordNet
Number of Vocabulary (word form)     53,798     155,287
- Noun                                          117,798
- Adjective                                     21,479
- Adverb                                        4,481
- Verb                                          11,529
Number of Synsets                               117,659
Once the DBRNN predicts a response, its word similarity is determined, as explained in Section 6.4.1. Once the sentence is generated, its grammar is checked and its relevance to the question is scored, as explained in Sections 6.3.5 and 6.5 respectively. In the next section, we focus on explaining the process of training the DBRNN for it to work with the generative strategy.
7.4 Process of training the generative-based strategy for response generation of IntelliBot
To train IntelliBot to generate a response using the generative-based strategy, there are five distinct steps, as shown in Fig. 7.2, namely data preparation, feature extraction, neural network design, training environment setup and training IntelliBot. A brief explanation of each step is as follows:
Fig. 7.2 Process of training generative-based strategy (RNN) for response generation
1) Prepare Data
After the data is collected, it needs to be prepared, as chatbots are only as intelligent as the knowledge to which they have access. This refers to both the content and the quality of this content. This step prepares the collected data before it can be processed in the next steps. The working of this step is explained further in Section 7.5.
2) Extract features
Feature extraction is a method of selecting, extracting and combining important features from text. It describes the original dataset and reduces the amount of data to be processed, which can effectively improve accuracy, speed up training and reduce overfitting. In the two datasets that are required for our purpose (Cornell and domain-specific), there are tens of thousands of unique words and 53,798 vocabulary items, which use a huge amount of space and could lead to slower learning. Therefore, it is important to select only the informative words and remove noninformative words without sacrificing accuracy. This step identifies the required features from each dataset on which to train the model. The working of this step is explained further in Section 7.6.
3) Design of neural networks for training
This step explains the design of the neural networks required to train the generative-based strategy. The designed DBRNN consists of six layers, and each layer comprises several neurons, each incorporating weights and biases. The details of the neural networks used for training the RNN of IntelliBot are described in Section 7.7.
4) Setup training environment
This step sets the training environment that is used to train the DBRNN on the collected datasets and the formed NN. The environment used for the training process is explained in Section 7.8.
5) Training the RNN to generate a response
To train the RNN for the generative-based strategy to generate responses, the process is divided into two distinct parts, namely forward propagation and backward propagation.
Forward propagation transmits the user sequence forward through the model's parameters to make a prediction. Backward propagation fine-tunes the model by updating gradients to correct the error when predictions are wrong. Section 7.9 explains the training process in detail.
7.5 Data preparation
Table 7.3 shows the sample raw data from the Cornell movie corpus. This dataset consists of noisy data. Furthermore, as shown in Table 7.4, it is essential to have labelled data, such as questions and answers, to develop IntelliBot’s ability to converse with users. So, before the training process of IntelliBot can begin, it is important to clean the dataset to remove noise and identify and link identical objects. Therefore, in this step, the goal is to clean the Cornell movie dialogue corpus for better use and knowledge discovery during the training of the model. Data preparation comprises several sub-steps, such as data cleaning and performing a data redundancy check. The following sub-sections elaborate on these tasks.
7.5.1 Data cleansing
In this step, we assess how much of the data is accurate, consistent and usable by identifying any errors or corruption in the dataset. This is critical for assessing the quality of the data and establishing a realistic baseline of data hygiene. IntelliBot performs a data cleansing task on the Cornell movie dialogue corpus to remove noisy and redundant data and inaccurate entries, which lead to data accuracy issues. Data cleansing is also important because, without it, there may be many difficulties in linking and identifying identical objects, such as questions & answers, entities, conversational topics and context, in the uncleaned or raw dataset. The Cornell movie dialogue corpus consists of movie dialogues derived from 617 movies and involves 9,035 characters. Each QA pair is on average 10-58 words long. A Python script to execute data cleansing automatically on the Cornell movie corpus, whose process flow is shown in Fig. 7.3, was written and implemented for IntelliBot.
Fig. 7.3 Data cleansing process flow for the Cornell movie dialogue corpus
In the automated data cleansing process, the following steps are involved:
• All conversations from the corpus that are not in question-answer pairs are removed. Furthermore, duplicate records and records that do not meet the quality criteria, such as those containing broken punctuation, special characters, shorthand words or questionable words, are also removed.
• Special characters such as (…, \, ! or ?) are removed.
• Words that contain the ( ' ) symbol are split and expanded into two words. For example, "I'll" expands into "I will". Sequences containing n't are not split at the apostrophe but are likewise expanded: "don't" becomes "do not".
• Conversation pairs in which the QA contains questionable words (for example, ass, bastard, bitch, shit) are removed. We also discarded pairs whose answers are very short, containing fewer than 10 letters. Then, all the cleaned QA pairs are tokenized and saved into a text file.
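The cleansing rules above can be sketched as a small Python routine; the contraction list and filters are abbreviated for illustration and do not reproduce the full script.

```python
import re

PROFANITY = {"ass", "bastard", "bitch", "shit"}  # per the filtering rule above

CONTRACTIONS = {"'ll": " will", "n't": " not", "'re": " are", "'ve": " have"}

def clean(text):
    """Expand contractions, then strip special characters (cf. Fig. 7.3)."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"[^a-zA-Z0-9 ]", "", text).strip()

def keep_pair(question, answer):
    """Apply the QA filtering rules: drop profane pairs and short answers."""
    words = set(clean(question).lower().split()) | set(clean(answer).lower().split())
    if words & PROFANITY:
        return False
    return len(clean(answer)) >= 10

print(clean("I'll go... don't wait!"))  # I will go do not wait
```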
Fig. 7.4 Code snippet of data cleansing and saved data
Fig. 7.5 shows the Cornell movie dialogue dataset before and after data cleansing. It can be seen that before the cleansing process, the total number of words in the answers is roughly double that of the questions. For example, the questions are between 5-25 words and the answers between 12-58 words, as shown in Fig. 7.5 (a). After the cleansing process, most questions are between 5-22 words and most answers between 10-36 words, as shown in Fig. 7.5 (b).
(a) Cornell Dataset (before cleansing) (b) Cornell Dataset (after cleansing)
Fig. 7.5 Cornell dialogue dataset
Then, as shown in Fig. 7.6 (a), we are interested in whether there is a relation between the number of words in the questions and answers. The histogram distribution in Fig. 7.6 (b) shows that questions generally contain fewer words than answers.
(a) Heat map (b) Histogram
Fig. 7.6 Histogram distribution of the Cornell dataset
Fig. 7.7 shows that 38,546 records were removed from the Cornell movie dataset after the cleansing process, which made 83% of the data usable for training and testing purposes.
Fig. 7.7 Exploratory data analysis of the Cornell movie dialogue dataset
To provide insight into the dataset, we consider a random sample of 5,000 QA pairs from the cleaned dataset. The following observations are made:
• Variation between questions and answers. • Grammatical errors in the questions and answers. • Formulated answers in some cases. • Many words are in short form.
7.5.2 Removal of duplicate data
This step checks whether the same piece of data is stored in multiple places and, if so, removes the duplicates. Duplicate data leads to extra resource utilization, training time and storage, so it is important to remove it. In this work, duplicate data is not removed when a template-based QA is used. However, the knowledge-based and Internet-retrieval strategies remove duplicate data before it is stored in the database. For the Cornell movie dialogue corpus, duplicates are removed after data cleansing, once QA pairs are formed. For the insurance QA dataset, duplicates are removed before the data is saved in the KBDB.
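Duplicate removal for QA pairs can be sketched as follows; normalising by case and surrounding whitespace is an assumption, as the thesis does not specify the matching criterion.

```python
def remove_duplicates(qa_pairs):
    """Drop repeated QA pairs while preserving first-seen order."""
    seen, unique = set(), []
    for q, a in qa_pairs:
        key = (q.strip().lower(), a.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((q, a))
    return unique

pairs = [("Hi?", "Hello."), ("hi?", "hello."), ("Bye?", "Bye.")]
print(len(remove_duplicates(pairs)))  # 2
```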
7.6 Feature Engineering to Extract Features and Use them for Training at DBRNN
In this section, we explain how the features that will be used in training the IntelliBot neural network model are created. The process is applied to both the Cornell movie corpus and the insurance QA dataset to create a set of features that classification techniques can use to automatically induce an SMT classifier. This is a critical step because accuracy depends on these features. First, domain-specific features are selected. Second, a combination of general and knowledge-based features is selected. Third, features are selected that support experiments comparing feature selection approaches for predictive accuracy.
IntelliBot identified four types of features, as discussed in [147], namely Named Entities, Token Count, Lexical and Syntactic; across these types, 128 features were identified, as summarized in Table 7.6.
Table 7.6 List of features used in experiments

Type: Named Entities (NER), Total: 5
1. Does the sentence contain a person name (PERS)?
2. Does the sentence contain an organization name (ORG)?
3. Does the sentence contain a place/location name (LOC)?
4. % of tokens named as entities in a sentence
5. % of NER in the question present in the answer

Type: Token Count, Total: 4
6. Number of tokens in the answer
7. Number of tokens in a sentence
8. % of tokens in a sentence (answer)
9. % of tokens matching non-answer tokens

Type: Lexical, Total: 6
10. % of capitalized words
11. Pronouns
12. Stopwords
13. Quantifiers
14. Does the sentence start with a quantifier?
15. Does the sentence end with a quantifier?

Type: Syntactic, Total: 113
16. Is the sentence before the head verb?
17. Presence/absence of POS tags before the answer (37)
18. Presence/absence of POS tags after the answer (37)
19. Number of tokens in the answer's POS tag (37)
20. Parse tree of the answer
1) Named Entities Features: Five named-entity features were used to identify LOCATION, PERSON and ORGANIZATION entities in a sentence. This thesis measured the number of entities found in a sentence.
2) Token Count Features: Four features, grouped as token counts, were used to capture these properties.
3) Lexical Features: These relate to the words of a language and are a very simple way to represent sentence structure. We used six lexical features to capture words, nouns, pronouns, stopwords, punctuation and numbers.
4) Syntactic Features: Syntactic features are more useful than lexical or token count features. In this thesis, the Stanford NLP POS tagger is used for feature extraction. We used 113 syntactic features that indicate the presence of POS tags such as noun, adjective, verb, adverb and preposition.
The sequence labelling approach is used to capture lexical features, identify POS tags and determine the named entities. The following sub-sections elaborate on the sequence labelling approach used.
7.6.1 Extracting features required to train IntelliBot
In recent years, both academia and industry have shown interest in chatbots which generate a response to an ongoing conversation as the answer to a given question. The problem is how to match the user's question with existing conversational data or information in the KBDB. To understand a question, it is first important to recognize the entity, the context of the conversation and the relation-chain inference within the conversational data [148]. After this, syntactic and lexical features need to be extracted to build a sequence labelling model. Two capabilities are needed to answer a given question: the first is to recognize the fact, and the second is to answer the question based on the relevant chaining facts (relation chain).
Table 7.7 Example of lexical feature extraction
“Nuruzzaman started his career as a Software Engineer at Daifuku. Then he worked at Accenture as a Team Lead. Finally, he worked at MIMOS before pursuing his PhD”.
To understand the concept, let us consider the statement in Table 7.7. If the user asks the question "Where did he work before MIMOS?", the answer should be "He worked at Accenture". This is a simple problem for a human, but reasoning about it is very complex for an AI system.
Several studies attempted to solve this task through entity and topic extraction, more specifically, the sequence labelling approach [116]. However, the contents of a neural network's memory are difficult to analyse. Additionally, it is hard to trace a problem through the memory contents when an error occurs. Although some studies address these problems, the reasoning and inference capability of neural networks is still questionable. The task requires both sequence labelling and inference abilities, which goes beyond simple NLP. Traditional machine learning models such as CRF and HMM have achieved good performance; however, these models rely on language-specific resources and hand-crafted features and are difficult to train. Furthermore, the alignment between inputs and labels is unknown in many cases, so it is challenging to adapt them to new domains. To overcome this drawback, IntelliBot incorporates automatic feature extraction during its training. Nevertheless, considering the overwhelming number of parameters in the DBRNN and the relatively small size of the sequence labelling corpus, annotations alone may not be enough to train complicated models, so guiding the learning process with extra knowledge is a wise choice.
The process of sequence labelling to extract features in IntelliBot is divided into four steps, namely Character-level layer, Highway layer, Word-level layer and CRF layer. The following sub-sections discuss these in detail.
7.6.1.1 Character-level layer
This layer is trained on sequence data to capture the underlying style and structure. For each word, the network computes character-level representation vectors with character embeddings as inputs. Next, these vectors are concatenated with word-embedding vectors and forwarded to the LSTM. Aiming at lexical features rather than memorising the spelling of a word, the work in this thesis shifts the prediction target from the next character to the next word. Furthermore, IntelliBot uses two LSTM units to capture information in both the forward and backward directions. Although this seems similar to a bi-LSTM unit, the outputs of these two units are processed and aligned differently.
7.6.1.2 Highway layer
In this layer, effective feature extraction needs to be performed. IntelliBot employs highway units, which allow information to flow unimpeded across several layers. Typically, a highway layer conducts the nonlinear transformation shown in Eq. (7.1) as defined in [128]:
$$m = H(n) = t \odot g(W_H n + b_H) + (1 - t) \odot n \tag{7.1}$$

where $\odot$ is the element-wise product, $g(\cdot)$ is the nonlinear transformation and $t$ is the transform gate.
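As an illustration, a highway unit of this form can be sketched with NumPy; the transform-gate parameters `W_T` and `b_T` and the demo values below are hypothetical, not IntelliBot's trained weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(n, W_H, b_H, W_T, b_T):
    """One highway unit: m = t * g(W_H n + b_H) + (1 - t) * n,
    where t = sigmoid(W_T n + b_T) is the transform gate (Eq. 7.1)."""
    t = sigmoid(W_T @ n + b_T)          # transform gate in (0, 1)
    g = np.tanh(W_H @ n + b_H)          # nonlinear transformation g(.)
    return t * g + (1.0 - t) * n        # Eq. (7.1)

# Tiny demo: with the gate forced shut (t ~ 0), the input passes through.
rng = np.random.default_rng(0)
n = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
m_closed = highway_layer(n, W, np.zeros(4), np.zeros((4, 4)), -20.0 * np.ones(4))
print(np.allclose(m_closed, n, atol=1e-6))  # gate closed -> carry behaviour
```

The carry behaviour shown in the demo is exactly what lets information "flow unimpeded" across layers when the transform gate stays near zero.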
7.6.1.3 Word-level layer
For sequence labelling tasks, IntelliBot accesses both future (right-side) and past (left-side) words. However, the hidden state of an LSTM receives data from the left side only. For this reason, IntelliBot adopts a Bi-LSTM as the word-level structure to capture information in both directions. The basic idea is to create two separate hidden states that convey information from the forward and backward sequences respectively. The output vectors of the LSTM are then fed to the CRF layer to decode the labelling of the sequence.
7.6.1.4 CRF layer
Finally, the output vectors are fed to the CRF layer to obtain the desired output. For a given input sequence, the CRF jointly considers the relationships between neighbouring labels and decodes the best label sequence, which ensures that the sequence labelling is meaningful.
For a sentence with annotation $y = (y_1, y_2, \ldots, y_n)$, the word-level input vector is marked as $x = (x_1, x_2, \ldots, x_n)$ and the character-level input vector is recorded as $c = (c_0, c_1, \ldots, c_n)$, where $c_i$ denotes the characters of word $w_i$ together with the space character after $w_i$. The probabilistic model for a sequence CRF defines a family of conditional probabilities $p(y \mid x)$ over all possible label sequences $y$, as in Eq. (7.2) defined in [128]:

$$p(y \mid x) = \frac{\prod_{j=1}^{n} \phi(\hat{y}_{j-1}, \hat{y}_j, x_j)}{\sum_{y' \in Y(z)} \prod_{j=1}^{n} \phi(y'_{j-1}, y'_j, x_j)} \tag{7.2}$$

where $Y(z)$ is the set of possible label sequences and $z = (x_i, c_i)$.
For training, this thesis minimizes the negative log-likelihood as shown in Eq. (7.3) as defined in [128]:

$$J_{CRF} = -\sum_i \log p(y_i \mid x_i) \tag{7.3}$$

For testing and decoding, the optimal sequence $y^*$ is the one that maximizes the likelihood, i.e. the sequence with the highest conditional probability, as shown in Eq. (7.4) as defined in [100]:

$$y^* = \arg\max_y \, p(y \mid x) \tag{7.4}$$
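To make the decoding objective of Eq. (7.4) concrete, the following sketch brute-forces $y^*$ over all label sequences for invented emission and transition scores. (Real CRF implementations use the Viterbi algorithm rather than enumeration; because the partition function in Eq. (7.2) is constant in $y$, maximizing the unnormalized score maximizes $p(y \mid x)$.)

```python
import itertools
import numpy as np

def crf_decode(emissions, transitions):
    """Brute-force y* = argmax_y p(y|x) for a toy linear-chain CRF.
    Sequence score = sum of emission scores + sum of transition scores."""
    n, k = emissions.shape
    best_seq, best_score = None, -np.inf
    for y in itertools.product(range(k), repeat=n):
        score = sum(emissions[j, y[j]] for j in range(n))
        score += sum(transitions[y[j - 1], y[j]] for j in range(1, n))
        if score > best_score:
            best_seq, best_score = y, score
    return list(best_seq), best_score

# 3 words, 2 labels; the transition matrix discourages label 1 -> 1.
emissions = np.array([[2.0, 0.5], [0.1, 1.0], [1.5, 0.2]])
transitions = np.array([[0.5, 0.0], [0.0, -2.0]])
labels, score = crf_decode(emissions, transitions)
print(labels)  # [0, 0, 0]
```

Here the transition scores pull the second word to label 0 even though its emission score favours label 1, illustrating how the CRF couples neighbouring labels.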
After the features are identified, the DBRNN can be trained. The next section details how the neural network is designed for the training process.
7.7 Design Neural Networks
This section explains the specifics used for designing the neural network to train the generative-based strategy for the response generation of IntelliBot. As shown in Fig. 7.8, there are various steps which need to be undertaken to design the neural network on which training is performed. Each step is explained in the next sub-sections.
Fig. 7.8 Designing neural networks of IntelliBot
7.7.1 Input standardization
Input standardization scales raw data and transforms variable-length sequences into fixed-length sequences so that the input is appropriately represented before it is fed into the RNN. The purpose of input standardization is to increase the efficiency and performance of the RNN; the operation does not change the training dataset. Research shows that input standardization has a huge effect on the performance of an RNN [149]. For RNN tasks, selecting an appropriate representation of the input is essential, although neural networks tend to be comparatively robust to the choice of input representation [150]. The IPU of IntelliBot performs all data pre-processing tasks. However, irrelevant data takes up a large amount of dimensional space, which can result in excessive input weights and poor generalisation. Therefore, 'padding' converts variable-length sequences into fixed-length sequences, using the tokens shown in Table 7.8 to fill in the sequence.
Table 7.8 List of tokens to fill the input sequence
For example, consider the user query "How are you?" and its response "I am fine." The converted sequences of fixed length 10 are shown in Table 7.8. This solves the problem of variable-length sequences. However, consider a sentence length of 50: every sentence would need to be encoded to length 50 in order not to lose any words, so a short query would contain 46 PAD symbols, which overshadow the actual information. The bucketing technique solves this issue by dividing sentences into buckets of different sizes, e.g. [(5,10), (10,15), (20,25), (40,50)]. If the query length is ≤ 5 and the response length is ≤ 10, then the sentence falls into the bucket (5,10) and is encoded as shown in Table 7.9.
Table 7.9 Filling the input sequence in bucket size of (5,10)
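The padding and bucketing steps above can be sketched as follows; the token strings and helper names are illustrative assumptions, not the exact values of Table 7.8:

```python
# A minimal sketch of padding and bucketing; the PAD/GO/EOS token strings
# are hypothetical stand-ins for the tokens listed in Table 7.8.
PAD, GO, EOS = "<PAD>", "<GO>", "<EOS>"
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(query_len, response_len):
    """Return the smallest bucket that fits both sequence lengths."""
    for q_max, r_max in BUCKETS:
        if query_len <= q_max and response_len <= r_max:
            return (q_max, r_max)
    raise ValueError("sequence too long for all buckets")

def pad_pair(query, response):
    q_max, r_max = pick_bucket(len(query), len(response) + 2)  # + GO/EOS
    enc = query + [PAD] * (q_max - len(query))                 # encoder input
    dec = [GO] + response + [EOS]
    dec = dec + [PAD] * (r_max - len(dec))                     # decoder input
    return enc, dec

enc, dec = pad_pair(["how", "are", "you", "?"], ["i", "am", "fine", "."])
print(enc)  # 5 tokens: the query padded to the bucket width
print(dec)  # 10 tokens: GO + response + EOS + padding
```

Because the query/response pair fits the (5,10) bucket, only one PAD symbol is wasted on the encoder side instead of 46.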
Next, IntelliBot standardizes each component of the input vector $x$ to have a standard deviation of 1 and a mean of 0 over the training set $S$. The mean $m_i$ and standard deviation $\sigma_i$ are shown in Eq. (7.5) as defined in [149, 151]:

$$m_i = \frac{1}{|S|} \sum_{x \in S} x_i, \qquad \sigma_i = \sqrt{\frac{1}{|S|} \sum_{x \in S} (x_i - m_i)^2} \tag{7.5}$$

The standardized input vector $\tilde{x}$ is then calculated by Eq. (7.6) as defined in [149]:

$$\tilde{x}_i = \frac{x_i - m_i}{\sigma_i} \tag{7.6}$$
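A minimal NumPy sketch of Eqs. (7.5)-(7.6), with an invented three-sample training set:

```python
import numpy as np

def standardize(S):
    """Standardize each input component over the training set S
    (Eqs. 7.5-7.6): subtract the mean m_i, divide by std sigma_i."""
    m = S.mean(axis=0)                    # m_i, Eq. (7.5)
    sigma = S.std(axis=0)                 # sigma_i, Eq. (7.5)
    return (S - m) / sigma                # Eq. (7.6)

S = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X = standardize(S)
print(X.mean(axis=0))  # ~0 per component
print(X.std(axis=0))   # ~1 per component
```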
7.7.2 Determine neuron and neural network layers
A layer of an RNN is simply a collection of neurons (nodes) that operate on the same features. Each connection holds a number called a weight. A neuron computes the weighted sum of its inputs, adds a bias and decides whether it should fire or not; the weights are then adjusted during training to minimize the network's error function. As shown in Fig. 7.9, the network receives input features $(x_1, x_2, x_3, x_4, x_5)$ with associated weights $(w_1, w_2, w_3, w_4, w_5)$, which are summed. The summed value is then passed to the neuron, which adds a bias and produces output $y$ through an "activation function" that normalizes the neuron's output. A threshold (theta, $\theta$) determines whether the activation fires, i.e. whether the value is forwarded to the next neuron.
Fig. 7.9 Single neuron connection
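The single-neuron computation above can be sketched as follows, assuming a sigmoid activation; the weights, bias and threshold are illustrative values only:

```python
import numpy as np

def neuron(x, w, b, theta=0.0):
    """Weighted sum + bias, passed through an activation (sigmoid here);
    the neuron 'fires' only if the activation exceeds the threshold theta."""
    z = np.dot(w, x) + b                     # sum of weighted inputs + bias
    a = 1.0 / (1.0 + np.exp(-z))             # activation function
    return a if a > theta else 0.0

x = np.array([0.2, 0.4, 0.1, 0.5, 0.3])      # five input features
w = np.array([0.1, 0.3, 0.2, 0.4, 0.5])      # five associated weights
y = neuron(x, w, b=0.1, theta=0.5)
print(y)  # fires: activation above the threshold
```

With a strongly negative bias (e.g. `b=-5.0`) the same inputs fall below the threshold and the neuron does not fire, returning 0.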
To create an RNN in order to train the DBRNN of IntelliBot, this thesis requires three types of layers: one input layer, four hidden layers and one output layer. Each layer has multiple neurons connected to the neurons of the adjacent layers. Fig. 7.10 presents the complete IntelliBot neural network architecture.
Fig. 7.10 Architecture of neural networks
1) Input Layer: A collection of input nodes forms an input layer. The nodes of this layer receive data from the user input sequence and feed it into the hidden layers. It does not perform any computation. It passes the information to the hidden layer nodes only. The input layer of the IntelliBot architecture consists of 128 neurons. The input layer neurons do not have activation functions.
2) Hidden Layer: A collection of hidden nodes forms a hidden layer. The nodes of a hidden layer have no direct connection with the user; they receive information from the input layer, perform computations and transfer the results to the output layer. The first hidden layer of IntelliBot's network architecture consists of 90 neurons, the second of 60 neurons, the third of 40 neurons and the fourth of 25 neurons. The hidden layers apply the ReLU activation function.
3) Output Layer: A collection of output nodes forms an output layer. It performs computations based on input from the hidden layer nodes and transfers the information to the user. The output layer applies the softmax activation function.
7.7.3 Determine the activation function for each layer
Activation functions are mathematical equations that determine the output of a node, and they strongly influence the accuracy and computational efficiency of the model. The activation function of each neuron determines whether it should be activated or not, based on the weighted sum of the neuron's inputs plus its bias. It is a transformation that maps between inputs and outputs, which is essential for a neural network to compute and learn, as shown in Fig. 7.11.
Fig. 7.11 Activation function in a neuron
To compute and learn, neural networks must be able to approximate nonlinear relations from input to output and reduce error rates. The activation function plays a critical role in achieving lower error rates [152, 153]. Without it, a neural network cannot map nonlinearly between inputs and outputs and reduces to a simple linear regression model. Activation functions are also significant when performing backpropagation in the network to compute gradients of the error (loss). This thesis uses the four activation functions shown in Table 7.10 in the neural network layers.
Table 7.10 List of activation functions of neural networks
Activation function | Formula | Advantages | Disadvantages
Sigmoid / logistic [154, 155] | σ(x) = 1 / (1 + exp(−x)) | Normalizes the output of each neuron between 0 and 1; smooth gradient prevents "jumps" in output values, enabling clear predictions; useful for logistic regression and the output layer. | Computationally expensive; suffers from the vanishing gradient problem, which results in the network not learning or being slow to reach an accurate prediction.
Hyperbolic tangent (TanH) [154, 155] | f(x) = tanh(x) | Outputs values from −1 to 1; strongly negative inputs map strongly negative and zero inputs map near zero; used for classification between two classes. | Similar drawbacks to the sigmoid function.
Rectified Linear Unit (ReLU) [156] | f(x) = max(x, 0) | Faster convergence and computationally efficient; although piecewise linear, it enables the process of backpropagation. | If the input is ≤ 0, the gradient is 0, so backpropagation and learning cannot be performed for those neurons.
Softmax [157] | f(x_j) = exp(x_j) / Σ_i exp(x_i) | Outputs values between 0 and 1; a more generalized logistic activation function used for multiclass classification; handy for classification problems; useful for the output layer to classify inputs into multiple categories. | Suitable for the output layer only.
There is no rule of thumb for which activation function should be used in which situation. However, depending on the properties of the problem, a better choice can be made for easier and quicker convergence of the network. As IntelliBot generates (predicts) long sentences, both tanh and sigmoid are unsuitable; instead, the ReLU activation function is applied on the hidden layers in most cases. To predict the probability distribution, the softmax function is applied on the output neurons. If the value of the activation function is above a threshold, the neuron is activated. The input layer's neurons do not perform any computation, so they do not require an activation function.
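For reference, the four activation functions of Table 7.10 can be implemented in a few lines of NumPy (a generic sketch, not IntelliBot's TensorFlow code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes into (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)                  # max(x, 0)

def softmax(x):
    e = np.exp(x - np.max(x))                  # shift for numerical stability
    return e / e.sum()                         # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))           # negative inputs clipped to 0
print(softmax(z).sum())  # a valid probability distribution
```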
7.7.4 Identify values of weights initialization
The weight is the value of the connection between neurons. When a neural network is trained on sample data, the model must first apply weight initialization, which also plays a significant role in reducing errors in the network. Initializing weights with inappropriate values will lead to divergence [152, 153] or will slow down the learning of the BRNN. The purpose of weight initialization is to prevent the outputs of neurons from exploding or vanishing during the computation of the activation functions in forward and backward propagation [158, 159]. If the outputs of neurons become too large (exploding) or too small (vanishing), the network will take a long time to converge; this is known as the exploding/vanishing gradient problem. Therefore, weight initialization is significant in training the BRNN. For instance, weights could be initialized in three ways, as shown in Table 7.11.
Table 7.11 Importance of appropriate weight initialization

1. Initialize all weights to zero | The neural network does not learn anything during training, because all neurons receive the same update on every iteration.
2. Initialize all weights with small random values | Leads to the vanishing gradient problem or high error rates; the network may take a long time to converge, as the activations shrink to significantly low values.
3. Initialize all weights with large random values | Leads to the exploding gradient problem.
To prevent vanishing or exploding issues, this thesis applies two rules of thumb: (1) the mean of the activations should be zero; (2) the variance of the tanh activation should be the same across every layer. Under these two assumptions, every layer is guaranteed a non-exploding/non-vanishing signal [153]. The recommended initialization of the weights $W_l$ for every layer $l$ is shown in Eq. (7.7) as defined in [153]:

$$W_l \sim N\!\left(\mu = 0,\; \sigma^2 = \frac{1}{n_{l-1}}\right) \tag{7.7}$$

In other words, all the weights of layer $l$ are picked randomly from a normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1/n_{l-1}$, where $n_{l-1}$ is the number of neurons in layer $l-1$.
Fig. 7.12 Code snippet of appropriate weight initialization
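A minimal sketch of the Eq. (7.7) initialization using NumPy, with the hidden-layer widths stated in section 7.7.2 (the function name is illustrative):

```python
import numpy as np

def init_weights(layer_sizes, seed=0):
    """Draw W_l ~ N(0, 1/n_{l-1}) for each layer l (Eq. 7.7)."""
    rng = np.random.default_rng(seed)
    weights = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))
        weights.append(W)
    return weights

# IntelliBot's stated widths: 128 input neurons, hidden layers 90/60/40/25.
ws = init_weights([128, 90, 60, 40, 25])
print([w.shape for w in ws])
print(ws[0].var())  # close to 1/128, the target variance
```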
After appropriate weight initialization, the model was trained on the dataset for 400 epochs. The outputs neither exploded nor vanished. The results show that pixel values in the same region of the visualization are highly correlated with each other, which plays a significant role in achieving lower training error rates as well as convergence, as shown in Fig. 7.13: blue pixels tend to be surrounded by blue and red pixels by red. This means the weight initialization shows improvement.
Weight initialization (tanh): forward 0.427043, backward 0.246078

Fig. 7.13 Parameter initialization with appropriate values
This thesis empirically verifies that a weight initialization of 0.427043 in forward propagation allows for a consistent input distribution, and 0.246078 in backward propagation allows for consistent error-signal variance throughout all layers.
7.7.5 Adding bias
Bias is a constant in a linear function. It is an additional neuron that helps the neural network best fit the given input, providing the capability to learn and improve every time it predicts an output. Bias neurons do not connect to preceding neurons or layers; rather, one is simply appended to each layer. In other words, bias neurons have no incoming connections, as shown in Fig. 7.14.
Fig. 7.14 Representation of bias in the layer
The output is measured by taking the sum of all weight-input products and adding the bias value; the result is then fed into the activation function of the hidden layer neuron, as shown in Eq. (7.8), where $b$ is the bias:

$$y = \sum_i w_i x_i + b \tag{7.8}$$
Without bias, the output of the neuron is $y = \sum_i w_i x_i$: each input $x_i$ is multiplied by its weight $w_i$, the products are summed and the result is passed through an activation function. Although neural networks can work without a bias neuron, in practice one is always added, because without a bias the network's decision function always passes through the origin (0, 0), which is not a realistic scenario, as shown in Fig. 7.15.
Fig. 7.15 Effect of bias neuron
As seen in Fig. 7.15, on updating the weights $w_1 = (1.0, 1.5, 4.0)$ and $w_2 = (0.5, 0.5, 1.5)$ without bias, the curve always passes through (0, 0) and only its steepness changes. However, adding a bias $b = (-1.0, -3.0, -5.0)$ with the same weight of 0.5 for all nodes in the network shifts the point at which the activation function triggers. Therefore, the graph in Fig. 7.16 is obtained.
Fig. 7.16 Effect of bias neuron
With the introduction of bias, the neural network becomes more flexible and able to learn new information. The initial bias is always set to zero rather than a random number, letting backpropagation of the loss learn the appropriate bias values.
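The effect can also be seen numerically: with a sigmoid neuron, every no-bias curve passes through the same point regardless of the weight, while a bias shifts the trigger point (a sketch with invented values):

```python
import numpy as np

def fire(x, w, b=0.0):
    """Sigmoid neuron output for input x, weight w and bias b."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Without bias every curve passes through (0, 0.5): changing the weight
# only changes the steepness, not where the neuron starts to trigger.
print(fire(0.0, 1.0), fire(0.0, 4.0))        # both 0.5
# With a negative bias the trigger point shifts to the right.
print(fire(0.0, 0.5, b=-3.0))                # well below 0.5
```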
7.7.6 Word embeddings
Neural networks cannot process natural language directly; words must be transformed into numeric values before an ML model can process them. One of the simplest transformation approaches is word embeddings [160-162]: a vector-oriented representation of words (a matrix) in which words with similar meanings have similar representations in a lower-dimensional space. Each word is represented by a vector of weights called a word vector.
For example, consider the word 'Apple' in the sentence 'Apple is a fruit'. During word embedding, the representation of 'Apple' should capture its meaning, semantic relationships and context, producing a representation that shows it as a fruit and not a company. As another example, performing basic algebraic operations on word vectors, such as $King - Man + Woman$, results in a vector whose closest representation is the word $Queen$ [163]. Such behaviour arises from the learned representations of the words. A benefit of pre-training word embeddings is that they map a word to a vector using a dictionary, so fewer parameters need to be learned by the model during training [164].
As this thesis deals with tens of thousands of words, one-hot encoding is massively inefficient. Therefore, we use the CBoW algorithm in the embedding layer. The CBoW model computes distributed vector representations of words, which significantly improve NER, disambiguation, parsing, tagging and SMT. This method achieves good efficiency without overfitting when fine-tuned on a large dataset.
Continuous Bag-of-Words (CBoW):
The CBoW model describes how a neural network learns the underlying word representations and predicts the probability of the centre word based on the context words (surrounding words). It embeds words in a lower-dimensional space that groups semantically similar items together and keeps dissimilar items far apart. In its simplest form, it finds the closest words to a user-specified word. The CBoW model takes a corpus as input and generates word vectors as output; the relationships among word vectors capture semantic information [165].
For example, for the word 'University', IntelliBot finds the most similar words and their cosine distances to 'University', as shown in Fig. 7.17. The lower the cosine distance, the more similar the word: e.g. 'college', with the lowest cosine distance of 0.678, is the closest neighbour of 'University'. This study uses TensorBoard to visualize the whole word cloud using PCA.
Fig. 7.17 Vector representation (on left) and cosine distances of university (on right)
To build the CBoW, assume the sentence "in the beginning god created heaven and ……" is fed into the neural network. First, IntelliBot constructs a vocabulary by extracting the unique words from the training dataset, after removing punctuation and stopwords and tokenizing the text; 53,798 vocabulary items are extracted and mapped to unique identifiers. Now, IntelliBot wants to predict "earth" based on the centre word "created". For this purpose, word vectors are initialized with random weights and passed to the embedding layer.
Fig. 7.18 Window and process for computing $P(w_{t+j} \mid c_t)$
As seen in Fig. 7.18, in one timestep IntelliBot takes 'created' as the centre word and considers a window size of two. $P(w_{t-2} \mid w_t)$ concerns the word two positions before the centre word, e.g. $P(beginning \mid created)$; $P(w_{t+1} \mid w_t)$ concerns the next word, e.g. $P(heaven \mid created)$; and so on. In the next timestep, IntelliBot takes 'heaven' as the centre word, so the two preceding window words give $P(god \mid heaven)$ and $P(created \mid heaven)$, and the next word at $w_{t+1}$ gives $P(and \mid heaven)$. The word embeddings are propagated and then passed to the dense softmax layer for context embedding. The dense softmax layer predicts 'earth' as the most likely word, as shown in Fig. 7.19, using Eq. (7.9) as defined in [164]:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} \tag{7.9}$$

where $o$ is the output (target) word, $c$ the context, $u_w$ the output vector of word $w$, $v_c$ the averaged context vector and $V$ the vocabulary.
Fig. 7.19 Window and process for computing $P(w_{t+j} \mid c_t)$
IntelliBot compares the output of the dense softmax layer with the actual target word and computes the loss. Using backpropagation, the network adjusts all the weights (context, target) and repeats this for multiple epochs until the loss is minimized. The model is optimized using Eq. (7.10) and the loss function in Eq. (7.11) as defined in [164]:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) \tag{7.10}$$

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) \tag{7.11}$$

where $\theta$ denotes all the variables to be optimized, $m$ is the window size, $w_t$ is the centre word at timestep $t$, $w_{t+j}$ is a context word and $J(\theta)$ is the negative log-likelihood. Once the model is trained, similar words have similar representations. A code snippet of the CBoW model is shown in Fig. 7.20.
Fig. 7.20 Code Snippet of CBoW model
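A self-contained sketch of CBoW training with NumPy on the example sentence; the dimensions, learning rate and epoch count are toy values, and the variable names are illustrative rather than those of the thesis implementation:

```python
import numpy as np

# Toy CBoW: average the context embeddings, predict the centre word.
corpus = "in the beginning god created heaven and earth".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window = len(vocab), 8, 2

rng = np.random.default_rng(1)
W_in = rng.normal(0, 0.1, (V, D))    # context (input) embeddings
W_out = rng.normal(0, 0.1, (D, V))   # target (output) embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(500):
    for t, centre in enumerate(corpus):
        ctx = [idx[corpus[j]]
               for j in range(max(0, t - window),
                              min(len(corpus), t + window + 1)) if j != t]
        h = W_in[ctx].mean(axis=0)            # average context embeddings
        p = softmax(W_out.T @ h)              # predict the centre word
        grad = p.copy()                       # cross-entropy gradient:
        grad[idx[centre]] -= 1.0              # p - one_hot(centre)
        W_out -= 0.1 * np.outer(h, grad)
        W_in[ctx] -= 0.1 * (W_out @ grad) / len(ctx)

# After training, the surrounding words of 'created' should point back to it.
ctx = [idx[w] for w in ("beginning", "god", "heaven", "and")]
p = softmax(W_out.T @ W_in[ctx].mean(axis=0))
print(vocab[int(np.argmax(p))])
```

The update rule is exactly the gradient of the negative log-likelihood in Eq. (7.11) for one centre word; a real implementation would add negative sampling or a hierarchical softmax to scale to a 53,798-word vocabulary.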
7.7.7 Batch normalization
Batch normalization is a popular technique that normalizes the distribution of each layer's inputs during training. As training proceeds, each layer must continually adapt to a new input distribution at every iteration; as the network gets deeper, the distribution of the activations keeps changing and the inputs of each layer differ every time, which slows down the training process. This problem is known as the internal covariate shift problem. Therefore, IntelliBot applies batch normalization before the training process starts. It normalizes and de-correlates the features, setting their mean to 0 and variance to 1 to facilitate the training and learning process; this is known as feature standardization or whitening [166, 167]. Assume a batch $x$ of size $m$, with $k$ indexing a feature. Using Eq. (7.12), we can calculate the batch mean $\bar{x}_k$ and batch variance $\sigma_k^2$ as defined in [151]:
$$\bar{x}_k = \frac{1}{m} \sum_{i=1}^{m} x_{i,k} \quad \text{(mean for the batch)}, \qquad \sigma_k^2 = \frac{1}{m} \sum_{i=1}^{m} (x_{i,k} - \bar{x}_k)^2 \quad \text{(variance for the batch)} \tag{7.12}$$
Then, using the statistics in Eq. (7.12), each feature is standardized as in Eq. (7.13), where a small positive constant $\epsilon$ improves numerical stability, as defined in [151]:

$$\hat{x}_k = \frac{x_k - \bar{x}_k}{\sqrt{\sigma_k^2 + \epsilon}} \tag{7.13}$$
However, standardizing the activations directly would require going into each layer and enforcing fixed means, variances and standard deviations, changing what the underlying layers represent. So batch normalization is applied using Eq. (7.14), with learnable parameters $\gamma$ and $\beta$ for the subsequent layers, as defined in [151]:

$$BN(x_k) = \gamma_k \hat{x}_k + \beta_k \tag{7.14}$$
By setting $\gamma_k$ to $\sigma_k$ and $\beta_k$ to $\bar{x}_k$, the original representation can be recovered. The output of a neural network layer, denoted $y$, is calculated using Eq. (7.15), where $W$ is the weight matrix, $\phi$ is an activation function, $x$ is the input of the layer and $b$ is the bias vector:

$$y = \phi(Wx + b) \tag{7.15}$$
The above process is repeated for each batch at each training step; the normalization statistics are recomputed for every batch [151]. Applying batch normalization to the layer, $y = \phi(BN(Wx))$, leads to the final output of the RNN as calculated by Eq. (7.16) as defined in [151]:

$$y = \phi(BN(W_h h_{t-1} + W_x x_t)) = \phi(W_h h_{t-1} + BN(W_x x_t)) \tag{7.16}$$
where $W_x$ is the input-to-hidden weight matrix. Batch normalization gives IntelliBot significant benefits in terms of training time, rate of learning and model performance, and makes the network less sensitive to parameter initialization. It acts as a regularizer, eliminating the need for dropout, and significantly improves the performance of training [168].
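The normalization steps of Eqs. (7.12)-(7.14) can be sketched as follows, assuming NumPy (a per-feature demo, not the full recurrent formulation of Eq. (7.16)):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Eqs. (7.12)-(7.14): per-feature batch mean/variance, standardize,
    then scale and shift with the learnable gamma and beta."""
    mean = x.mean(axis=0)                    # Eq. (7.12), batch mean
    var = x.var(axis=0)                      # Eq. (7.12), batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # Eq. (7.13)
    return gamma * x_hat + beta              # Eq. (7.14)

# Two features with very different scales are brought to mean 0, std 1.
x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```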
7.8 Training environment
The generative-based strategy of IntelliBot is trained on an Intel 10-core machine with 20 threads, 64GB of memory, two GPUs and a 1GB SSD. Python was used to build the core AI engine with the TensorFlow library. The bi-directional LSTM architectures are trained on a Radeon RTX graphics card, and CUDA enables the training process to be accelerated. CUDA is a parallel computing platform and API that allows developers to use a graphics processing unit (GPU) for processing. Table 7.12 lists the specifications of the system used to train IntelliBot's AI model.
Table 7.12 Training system's specification

Component | Part
CPU | Intel Core i9-7900X 4.3GHz, 10 cores / 20 threads
Motherboard | X399 Gaming Pro
GPU | Radeon 2080 8GB
RAM | CORSAIR 64GB DDR4 3333
Application | CUDA enabled, version 9.0
Platform | Windows 10
Programming | Python 3.6, MySQL 5.5, HTML5
Library | TensorFlow 1.4, scikit-learn
The generative-based strategy of IntelliBot is trained in three phases. As previously discussed, each phase uses a different dataset and a different number of training iterations. In phase 1, IntelliBot is trained on the Cornell movie dialogue dataset to develop the ability to converse in the English language. In phase 2, IntelliBot is trained on the credit card insurance domain dataset. In phase 3, IntelliBot's phase-2 training is fine-tuned on the most frequent words and terms used in a specific way in the insurance domain. Fig. 7.21 shows the three phases of training, as explained below.
Fig. 7.21 Training phases of IntelliBot
During the training process, because a large sample size is used, the entire dataset cannot be passed into the neural network at once. The dataset is therefore divided into a number of small batches, the model is trained with stochastic gradient descent (SGD) and each batch is considered as one sample.
7.8.1 Phase 1: Training on the Cornell dialogue dataset
In the initial phase, IntelliBot is trained on basic conversational dialogue so that the chatbot is able to initiate a conversation and move it forward. This thesis uses the Cornell movie dialogue dataset with 53,798 vocabulary items for training. The batch size in phase 1 is 250 records, which means that 250 records are trained in one iteration. To complete training on the whole dataset, IntelliBot requires 883 iterations, which makes one epoch. To ensure the right prediction ability and better accuracy, IntelliBot was trained for 77 epochs in phase 1, which requires 77 × 883 = 67,991 iterations.
The seq2seq model with probability $P_{coh}(y \mid x)$ was trained to estimate the semantic coherence between input $x$ and output $y$, where $N_y$ is the length used for normalization. For this phase, the dataset is tokenized and a data dictionary for both the encoder and decoder is created, with coherence calculated by Eq. (7.17) as defined in [169]:

$$Sim_{coh} = \frac{1}{N_y} \log P_{coh}(y \mid x) \tag{7.17}$$
7.8.2 Phase 2: Training on insurance domain dataset
After the training in phase 1, IntelliBot is able to converse with a user but fails to answer insurance-domain questions, which is its goal. So, in this phase, IntelliBot is trained on the credit card insurance domain-specific QA dataset, a sample of which is shown in Table 7.4. In this phase, the model uses a batch size of 150 records, which is one iteration, and training on the complete dataset requires 67 iterations, which is one epoch. To ensure the right prediction ability and better accuracy, IntelliBot is trained for 370 epochs in phase 2, which requires 67 × 370 = 24,790 iterations. By training on a domain-specific dataset, IntelliBot is tuned to be more realistic in its responses and captures insurance-domain conversational dialogue patterns and responses.
7.8.3 Phase 3: Training on particular words
Finally, IntelliBot is trained on particular words related to the credit card insurance domain, mainly insurance terms and keywords. For example, 'credit card', 'master card', 'visa card', 'low rate', 'interest-free period' and 'interest rate' are each a pair of individual English words, but in the insurance domain the model should identify each pair as a single term. The model uses a batch size of 50, and training on the whole dataset requires 200 iterations, which is one epoch. To ensure the right prediction ability and better accuracy, IntelliBot is trained for 50 epochs in phase 3, which requires 50 × 200 = 10,000 iterations. Table 7.13 summarises the training specifications of the three training phases of IntelliBot.
As one of the objectives of IntelliBot is to engage the user in a conversation, the pre-processed QA pairs from section 7.6 are given to the neural networks for training with stochastic gradient descent. A semi-supervised training approach is used to train IntelliBot for 70, 115 and 160 epochs in training phases 1, 2 and 3 respectively, and the network is tested at the end of each epoch. Sentence pairs exceeding 40 words are filtered out and the batches are shuffled. In addition, as proposed in [117], dropout with a probability of 0.2 is used to avoid overfitting the network. The DBRNN part of IntelliBot is developed in Python 3.6 with TensorFlow and hosted on AWS EC2. It took a couple of days to completely train the model. A summary of the training specification is shown in Table 7.13.
Table 7.13 Summary of training parameters' specification

Parameter | Phase 1 | Phase 2 | Phase 3
Corpus name | Cornell movie dataset | Insurance dataset | Insurance terms, keywords
Total layers | 6 | 6 | 6
Hidden layers | 4 | 4 | 4
Input layer | 1 | 1 | 1
Output layer | 1 | 1 | 1
Memory cell units | 1,024 | 512 | 512
Total sample size | 182,033 | 8,254 | 1,000
Samples per batch | 256 | 64 | 32
Total iterations | 50,000 | 15,000 | 5,000
Epochs | 70 | 115 | 160
Learning rate | 0.01 | 0.01 | 0.01
Dropout rate | 0.20 | 0.20 | 0.20
7.9 Training of IntelliBot using DBRNN
As seen in Fig. 7.22, the training process of IntelliBot has four steps. The following sub-sections explain the training process for the generative-based strategy to generate responses using the DBRNN.
Fig. 7.22 Training Process of IntelliBot
7.9.1 Forward propagation
The idea of forward propagation is to feed data into the input layer of the neural network, from where it flows from one neuron to the next until it reaches the output layer, which produces the final predicted value. In this thesis, the input data are the Cornell and insurance datasets, where each sample $x = (x_1, x_2, \ldots, x_n)$ runs sequentially from the input layer. The input values are scaled between 0 and 1. All neurons of the input layer are connected to the neurons of the next hidden layer $h$. A neuron in hidden layer $h$ has a weight $w$ between 0 and 1, an activation function $a$ and a bias $b$. For a better understanding of forward propagation, consider a network with one neuron at each layer, as shown in Fig. 7.23.
Fig. 7.23 Process of forward propagation
This shows that the network has two weights 푤1 and 푤2. 푤1 is initialized when input 푥1 is fed to the hidden layer neurons ℎ1. Weight 푤2 is initialized when the result of the hidden neuron is fed to the output layer neuron 푦. Furthermore, the hidden layer and output layer have activation functions such as tanh and softmax respectively. Additionally, the network has a bias 푏푥 that is added into the hidden layer. The input layer does not have an activation function and bias. To take a concrete example, let’s consider input sentence 푥 = “how are you” and expected output 푦 = “I am good”.
7.9.1.1 Input layer
During NLP tasks, the input sentence $x$ is segmented into three words; the neural network then encodes these strings and initializes the weight matrix as shown in Table 7.14.

Table 7.14 Vector representation of $x$

Word | Symbol | Value
how | $x_1$ | 0.287027
are | $x_2$ | 0.846060
you | $x_3$ | 0.572392
7.9.1.2 Hidden layer
Then the network feeds in one word per timestep and converts the input to a hidden state. In the first timestep, input $x_1$ = "how" is fed into the network. The neurons of the network are assigned a weight value of $w = 0.427043$ and a bias value of $b_x = 0.567001$. The total net input for $h_t$ is calculated by Eq. (7.18). As $x_1$ is the first input and there is no previous state $h_{t-1}$, the previous state is taken to be 0:

$$x_{w_{net}} = w \times h_{t-1} + b_x = 0.427043 \times 0 + 0.567001 = 0.567001 \tag{7.18}$$

The current hidden state vector $\vec{h}_t$ of the recurrent neuron is calculated using the activation function according to Eq. (7.19), where $x_t$ is the current input:

$$\vec{h}_t = \tanh(x_{w_{net}} + x_t) = \tanh(0.567001 + 0.287027) = 0.693168 \tag{7.19}$$
Fig. 7.24 Hidden vector for the word "how" then, in the second timestep, the network feeds the next word which is “are”. The calculation from the previous step ℎ푡 which is now in this step becomes ℎ푡−1. The calculation of the net input is calculated by Eq. (7.20):
x_w_net = w × h_{t−1} + b_x   (7.20)
        = 0.427043 × 0.693168 + 0.567001 = 0.863013
Now, the current state h_t is calculated using the activation function after the word “are” is fed into the network:
h_t = tanh(x_w_net + x_t)   (7.21)
h_t = tanh(0.863013 + 0.846060) = 0.936534
Fig. 7.25 Hidden vector for the word "are"
Similarly, the state h_t obtained using Eq. (7.21) becomes the previous state h_{t−1}, and the recurrent neural network proceeds to the next timestep in the same way.
Fig. 7.26 Hidden vector for the word "you"
This process is repeated for every neuron in the hidden layers. After reading the whole sentence, IntelliBot appends a special end-of-sequence token to mark that the input has been fully encoded.
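The encoder walk-through above can be reproduced in a few lines of Python. This is an illustrative sketch rather than IntelliBot's actual training code: the weight, bias and input encodings are the values assumed in Table 7.14 and Eqs. (7.18) and (7.19).

```python
import math

# Values assumed in the worked example (Table 7.14, Eqs. 7.18 and 7.19).
w, b = 0.427043, 0.567001                 # recurrent weight and bias
inputs = [0.287027, 0.846060, 0.572392]   # encodings of "how", "are", "you"

def encode(inputs, w, b):
    """Run the single-neuron RNN encoder; return the hidden state per timestep."""
    h = 0.0                                # no previous state at the first timestep
    states = []
    for x_t in inputs:
        net = w * h + b                    # Eq. (7.18): total net input
        h = math.tanh(net + x_t)           # Eq. (7.19): current hidden state
        states.append(h)
    return states

states = encode(inputs, w, b)
print([round(h, 6) for h in states])       # first two states match 0.693168, 0.936534
```

The first two printed states reproduce the hand-calculated values of Eqs. (7.19) and (7.21); the third is the final hidden state passed to the decoder.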
7.9.1.3 Output layer
The decoding process begins in the output layer using the final hidden state vector h_t, which determines the probability of each word in the vocabulary being the most appropriate for the output response. Then, the total net input for the output layer's neuron is calculated. Assume that the network assigns a weight value of w = 0.371680 to calculate the output vector, denoted o_t, as given in Eq. (7.22):
o_t = w × h_t   (7.22)
o_t = 0.371680 × 0.936534 = 1.906077
Then, the probability y_t of the occurrence of a specific word in the vocabulary is measured using the softmax function, as shown in Eq. (7.23):
y_t = softmax(o_t) = softmax(1.906077) = 0.419748   (7.23)
Fig. 7.27 Final output
Based on the final output, the model predicts that “I” will appear after h_t is fed forward, because this word has the highest probability. The process is repeated for all words of the input sequence x_i, and y_t is generated by taking the decoder output at time t.
Fig. 7.28 Final output from forward propagation
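In practice, the softmax of Eq. (7.23) is applied to a vector of scores, one per vocabulary word, rather than to a single number. The sketch below uses a hypothetical four-word vocabulary with illustrative logit values to show how the decoder picks “I” as the most probable first word.

```python
import math

# Hypothetical logits over a toy vocabulary; the values are illustrative only.
logits = {"I": 1.906077, "am": 0.52, "good": 0.31, "<EOS>": -0.4}

def softmax(scores):
    """Normalize a dict of logits into a probability distribution (Eq. 7.23)."""
    m = max(scores.values())                          # subtract max for stability
    exps = {word: math.exp(s - m) for word, s in scores.items()}
    z = sum(exps.values())
    return {word: e / z for word, e in exps.items()}

probs = softmax(logits)
best = max(probs, key=probs.get)
print(best)   # "I" has the highest probability, so it is emitted first
```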
7.9.2 Backward propagation
Backward propagation is a central mechanism of an RNN, and it works in the opposite direction to forward propagation. The network propagates from the output layer backwards towards the hidden layers and updates its parameters; this is how the network learns. The idea is to determine whether the model made a mistake during prediction, or in other words, how much of the loss each part of the neural network was responsible for. Backpropagation is a supervised learning algorithm that fine-tunes the weights of neurons based on the error obtained in the previous epoch, using a technique called gradient descent. Backpropagation considers the errors of wrong predictions and updates the parameters of the RNN. The update is repeated until the actual output is close to the expected output. For example, let us consider Fig. 7.29, where the user input is x = “who is einstein?” and the desired output is y = “Einstein is a German Physicist”. However, the network produces the wrong predicted output “I don't know”. To obtain the expected output, we need to know in which direction the loss changes, adjust the parameters and minimize the error.
Fig. 7.29 Example of the wrong prediction produced by RNN
The derivative of the loss function determines whether weights are increased or decreased. As shown in Fig. 7.30, a negative derivative means the error decreases as the weight increases, so we should increase the weight. A positive derivative means the error increases as the weight increases, so we should decrease the weight. If the derivative is 0, the error is at a minimum.
Fig. 7.30 Visualization of the effect of the loss function
After determining the loss, we update the parameters and minimize the error for each neuron in the network. The same computation is repeated for each neuron using its local gradient. As a neuron receives the numerical value of the gradient from upstream, it multiplies this by its local gradient and passes the result on to its connected neurons, as shown in Fig. 7.31. The process is repeated until the neural network converges, which means the error is minimized and the desired output is predicted.
Fig. 7.31 Gradient flow
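The upstream-times-local rule of Fig. 7.31 can be illustrated with a single tanh neuron. All values here are hypothetical and chosen only to make the multiplication visible.

```python
import math

# Toy chain: h = w * x, y = tanh(h). The gradient reaching w is the upstream
# gradient dL/dy multiplied by the local gradients dy/dh and dh/dw (Fig. 7.31).
w, x = 0.5, 0.8
h = w * x
y = math.tanh(h)

upstream = 1.0                    # gradient arriving from the layer above
local_dy_dh = 1 - y * y           # derivative of tanh at h
local_dh_dw = x                   # derivative of w * x with respect to w
grad_w = upstream * local_dy_dh * local_dh_dw
print(round(grad_w, 6))           # the gradient passed on to update w
```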
The neural network performs backpropagation in three steps, as shown in Fig. 7.32:
i. Calculate the total error in the output layer to determine how much the model's predicted output differs from the actual or expected output.
ii. Check whether the error is minimized. Once the error is minimized, the model can predict the expected output.
iii. Update the parameters, e.g. the weights and biases.
Fig. 7.32 Process of backward propagation
7.9.2.1 Calculate the total error in the output layer
After forward propagation generates a response, the error can be measured by comparing the expected output with the predicted output. This is done with a cost function, which measures how well a neural network performs with respect to a given training sample and its expected output. For example, for the question “who is einstein?” the expected output is
“Einstein is a German Physicist.” However, if the neural network predicts the output as “I don’t know”, this will be considered an error.
The error is given by a cost function C, defined using the mean squared error (MSE) as calculated by Eq. (7.24), as defined in [170]:

C = (1/2) Σ_{i=1}^{n} (y_i − ŷ_i)²   (7.24)

where ŷ_i is the actual output, y_i is the predicted output from the neural network and n is the sample size; summing the squared differences between y and ŷ over all observations gives the total error C. This cost function C is what needs to be minimized.
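A minimal sketch of the cost function of Eq. (7.24). The two score vectors are toy values standing in for the expected and predicted responses; they are illustrative only.

```python
def mse_cost(predicted, expected):
    """Cost C = 1/2 * sum over i of (y_i - yhat_i)^2, as in Eq. (7.24)."""
    return 0.5 * sum((y - y_hat) ** 2 for y, y_hat in zip(predicted, expected))

# Toy output scores for a predicted reply versus the expected reply.
predicted = [0.41, 0.22, 0.05]
expected = [0.90, 0.80, 0.75]
cost = mse_cost(predicted, expected)
print(cost)   # a large cost signals a wrong prediction such as "I don't know"
```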
7.9.2.2 Check whether error is minimized (Iterate until converged)
If the error is large, the weights and biases are updated and the error is checked again. The process is repeated until the error is minimized; once this occurs, the model is able to produce the expected output.
7.9.2.3 Update parameters
IntelliBot's neural network fine-tunes these parameters using an optimization algorithm called gradient descent, as explained in section 7.9.3. Gradient descent minimizes the error using the loss (objective) function calculated by Eq. (7.26). The parameters of the RNN determine the errors produced by the network.
Given the first result, the neural network backpropagates and adjusts its biases and weights to optimize the cost function. In a sense, this is how the algorithm determines whether the network performs well or not. To fine-tune the RNN, we compute the gradients and make a small update to the biases and weights of each neuron in the hidden layer. For a neuron that produces an error, the learning rate multiplied by the gradient of the cost function C with respect to the weight is subtracted from the original value of the weight w_l, as calculated by Eq. (7.25):
w_l = w_l − learning rate × ∂C/∂w_l   (7.25)
7.9.3 Stochastic gradient descent (SGD)
Stochastic gradient descent is an optimization algorithm used in learning to minimize the loss using the 'gradient', i.e. the slope of the loss surface. SGD moves down the slope to reach the lowest point of the surface. Plain gradient descent takes all the samples at once and updates the weights after completing one epoch. This is repeated until the global cost minimum is reached, which becomes computationally very expensive for large datasets [171]. For this reason, IntelliBot applies the SGD algorithm. It starts with randomly selected samples instead of the whole dataset for each epoch; a set of randomly selected samples is called a 'batch' [170]. The path towards the lowest point of the surface (the global cost minimum) is then not straightforward, hence it is noisy. Despite this, the goal is to reach the global cost minimum in a shorter training time.
SGD begins with a randomized sample of the training dataset. The objective function is then expressed, with respect to each feature, as a sum of a finite number of functions in Eq. (7.26), as defined in [170]:
f(x) = (1/n) Σ_{i=1}^{n} f_i(x)   (7.26)

where f_i(x) is the loss function for training data instance i and n is the size of the dataset. SGD then picks a random initial value for the parameters and uses ∇f_i(x) as an unbiased estimator of the full gradient ∇f(x). IntelliBot uses a mini-batch β, which contains several instances at each iteration, so the partial derivatives over β for each of the features are calculated by Eq. (7.27), as defined in [170]:
∇f_β(x) = (1/|β|) Σ_{i ∈ β} ∇f_i(x)   (7.27)
The parameter value x is then updated via the gradient function as x := x − η ∇f_β(x), where |β| denotes the mini-batch size and η is a scalar learning rate. Then, the step size is calculated for each feature using Eq. (7.28):
step size = gradient × learning rate   (7.28)
(equivalently, learning rate = step size / gradient)
Then, the new parameters for the next iteration are calculated as:
new params = old params − step size   (7.29)
Eqs. (7.26) to (7.29) are repeated until the gradient reaches the global minimum. IntelliBot tunes the batch size to suit the computational architecture on which training is carried out, for example, a power of two that meets the memory requirements of the GPU or CPU, such as 32, 64, 128 or 256 [170].
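The loop described by Eqs. (7.26) to (7.29) can be sketched on a hypothetical one-parameter least-squares problem. This is not IntelliBot's training code; the data, learning rate and batch size are illustrative, with the batch size kept small for the toy dataset.

```python
import random

# Toy noise-free data generated by y = 3x, so the optimal weight is 3.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

def sgd(data, w=0.0, learning_rate=0.05, batch_size=4, epochs=200):
    """Mini-batch SGD for f(w) = (1/n) * sum (w*x - y)^2, Eqs. (7.26)-(7.29)."""
    rng = random.Random(0)                            # fixed seed for repeatability
    for _ in range(epochs):
        batch = rng.sample(data, batch_size)          # random mini-batch (beta)
        # Eq. (7.27): gradient averaged over the mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        step = learning_rate * grad                   # Eq. (7.28): step size
        w = w - step                                  # Eq. (7.29): new parameters
    return w

w = sgd(data)
print(round(w, 4))   # converges to the global minimum at w = 3
```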
Gradient descent takes a fixed-size, small step towards the negative error gradient, as calculated by Eq. (7.30) defined in [170, 171]:
Δω_n = −α ∂ℒ/∂ω_n   (7.30)
where α ∈ [0,1] is the learning rate, Δω_n is the nth weight update and ω_n is the weight vector. An issue arises when gradient descent gets stuck in a 'local minimum'. This can be reduced by adding a 'momentum term' that accelerates the process and helps to escape local minima. The gradients are calculated throughout the entire training dataset to identify the loss function using Eq. (7.31), where m ∈ [0,1] is the momentum parameter, as defined in [114, 171]:

Δω_n = m Δω_{n−1} − α ∂ℒ/∂ω_n   (7.31)
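The momentum update of Eq. (7.31) can be sketched on a hypothetical one-dimensional loss f(ω) = (ω − 2)². The starting point, learning rate and momentum values are illustrative only.

```python
def descend_with_momentum(w=10.0, lr=0.1, m=0.9, steps=200):
    """Gradient descent with the momentum term of Eq. (7.31)."""
    delta = 0.0                        # accumulated weight update
    for _ in range(steps):
        grad = 2 * (w - 2)             # dL/dw for f(w) = (w - 2)^2
        delta = m * delta - lr * grad  # Eq. (7.31): momentum plus gradient step
        w = w + delta
    return w

w = descend_with_momentum()
print(round(w, 4))   # settles near the minimum at w = 2
```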
7.9.4 Attention mechanism in training
Handling long sentences is a challenging task in neural networks [172]. One effective way to address this problem is to implement an attention mechanism. There are two types of attention-based models: global and local. Both models take the input x at each timestep t in decoding at hidden state h_t. The objective is to derive a context vector C_t capturing appropriate information to predict the likely word y_t. Although these models vary in the derivation of C_t, they follow the same steps.
7.9.4.1 Global attention model
The aim of the global attention model is to derive the context vector C_t by taking into account all hidden states of the encoder. In this type of model, an alignment vector a_t, whose size equals the number of timesteps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state h̄_s, as shown in Eq. (7.32), defined in [86]:
a_t(s) = align(h_t, h̄_s) = exp(cbf(h_t, h̄_s)) / Σ_{s′} exp(cbf(h_t, h̄_{s′}))   (7.32)
where h_t is the current target hidden state, h̄_s is an input (source) hidden state and cbf is a 'content-based function'. It has three alternatives, shown in Eq. (7.33) as defined in [86]:
cbf(h_t, h̄_s) = { h_tᵀ h̄_s                      (dot)
               { h_tᵀ W_a h̄_s                  (general)   (7.33)
               { v_aᵀ tanh(W_a [h_t ; h̄_s])    (concatenate)
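The alignment computation of Eq. (7.32), using the 'dot' content-based function from Eq. (7.33), can be sketched as follows. The hidden-state vectors are hypothetical, standing in for a four-word source sentence.

```python
import math

def dot_score(h_t, h_s):
    """The 'dot' content-based function of Eq. (7.33)."""
    return sum(a * b for a, b in zip(h_t, h_s))

def align(h_t, source_states):
    """Alignment vector a_t of Eq. (7.32): a softmax of scores over the source."""
    scores = [dot_score(h_t, h_s) for h_s in source_states]
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 3-dimensional hidden states for a four-word source sentence.
h_t = [0.2, -0.1, 0.7]
source = [[0.9, 0.1, 0.6], [0.0, 0.3, 0.2], [0.4, -0.2, 0.8], [0.1, 0.0, -0.5]]
weights = align(h_t, source)
print([round(a, 3) for a in weights])   # one weight per source word, summing to 1
```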
7.9.4.2 Local attention model
The global attention model is inconvenient in that it has to attend to all words on the input side for each output word, which takes more computational time [173]. To address this problem, IntelliBot uses a 'local attention mechanism' that concentrates only on a small subset of the input per output word and is easier to train.
Local attention generates a position p_t for each output word at time t. The vector C_t is a weighted average over the input hidden states in the window [p_t − D, p_t + D], where D = 20. The local alignment vector a_t is now fixed-dimensional. IntelliBot assumes that the input and output sequences are aligned and sets p_t = t; the vector a_t is then calculated according to Eq. (7.32). Instead of assuming alignment, IntelliBot can also predict an aligned position using Eq. (7.34), as defined in [86]:
p_t = S · sigmoid(v_θᵀ tanh(W_θ h_t))   (7.34)

where θ denotes the model parameters and S is the source sentence length. To favour alignment points near p_t, IntelliBot places a Gaussian distribution centred around p_t. So, the final alignment weights are calculated by Eq. (7.35) as defined in [86]:
a_t(s) = align(h_t, h̄_s) · exp(−(s − p_t)² / (2σ²))   (7.35)
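The Gaussian reweighting of Eq. (7.35) can be sketched as follows. The global alignment vector, the window size D and the predicted position p_t are hypothetical; σ = D/2 follows [86].

```python
import math

def local_weights(alignment, p_t, sigma):
    """Rescale alignment weights with a Gaussian centred at p_t (Eq. 7.35)."""
    return [a * math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
            for s, a in enumerate(alignment)]

# Hypothetical global alignment over seven source positions.
alignment = [0.05, 0.15, 0.40, 0.20, 0.10, 0.06, 0.04]
D, p_t = 3, 2.0
weights = local_weights(alignment, p_t, sigma=D / 2)
print([round(a, 3) for a in weights])   # positions far from p_t are damped
```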
7.10 Conclusion
In this chapter, the data collection process for the four response generation strategies was explored. This was followed by the process of training the generative-based strategy for response generation. Once the four response generation strategies are formed, IntelliBot can generate a response to the user's question from both the general category and the domain-specific category. In the next chapter, we explain the results achieved by IntelliBot and compare them with the responses generated by other chatbots.
CHAPTER 8
“That’s the nature of research—you don’t know what in hell you’re doing” —Papa Flash
EXPERIMENTS TO DEMONSTRATE THE SUPERIORITY OF INTELLIBOT
8.1 Overview
Evaluation is always difficult in language generation, especially in chatbots. The aim of this chapter is to validate IntelliBot against the requirements defined in Chapter 1. These are to: (1) generate a correct answer to the user's question; (2) engage the user in a meaningful conversation in a specific domain; and (3) correct grammatical errors in the user's question and confirm this before generating a correct and meaningful response.
To evaluate if these objectives have been met, an experiment is conducted to compare IntelliBot’s output with the output of three other existing chatbots to systematically evaluate and assess IntelliBot’s operational performance. The output (responses) of the four chatbots was determined in relation to two categories of questions, namely, general conversations and domain-specific questions. The responses of the chatbots were then evaluated by the experts to determine their accuracy in relation to the questions asked, both with and without grammatical errors in the questions. To determine the quality of the generated responses and the level of engagement between the chatbot and the user, the experts evaluated and ranked the responses of each chatbot. The experiment results show that IntelliBot outperformed the three existing chatbots.
The structure of this chapter is as follows: Section 8.2 explains the evaluation process adopted to demonstrate the superiority of IntelliBot over the existing chatbots. Section 8.3 details the tools and technologies used to develop the prototype of IntelliBot along with its graphical user interfaces (GUIs). Section 8.4 presents the different categories of questions from the subsets of general conversation and domain-specific questions used to evaluate the chatbots. Section 8.5 explains the high-level architecture and working style of the three existing chatbots against which IntelliBot's responses are compared. Section 8.6 compares the results of IntelliBot with those of the existing chatbots to determine their accuracy in generating a correct response. Based on expert judgement, section 8.7 details IntelliBot's ability to engage the user when generating a response compared to the other chatbots. Section 8.8 shows IntelliBot's ability to correct grammatical errors in the questions before generating a response and compares this with the other chatbots. In section 8.9, we perform exploratory tests to confirm that IntelliBot meets its expected functionalities while determining and presenting the response to the user. Section 8.10 concludes the chapter.

8 Parts of this chapter have been published in [20].
8.2 Process of Evaluating IntelliBot’s Output Against the Requirements and the Outputs of the Other Chatbots
While correctly responding to user questions, a chatbot should also engage the user in its responses and deal with grammatical errors in user questions. The process of chatbot evaluation is divided into four steps, namely: (1) prepare a set of question-answer (QA) pairs; (2) choose existing chatbots from the literature; (3) record responses from IntelliBot and the other selected chatbots; and (4) evaluate IntelliBot's responses against those of the other chatbots, as shown in Fig. 8.1.
Fig. 8.1 Steps in chatbot evaluation
To undertake the evaluation, a set of question-answer pairs known as a validation dataset was prepared. The dataset shows the expected answer for a given question. The process of data collection from which the questions were formed was discussed in Chapter 7, and the list of sample questions generated from it is shown in Section 8.4. RootyAI bot [174], ChatterBot [175] and DeepQA [119] were selected as the three publicly available chatbots from the literature against which to compare IntelliBot's results. Each chatbot was given the same set of questions, and the responses generated were collected for further evaluation. The correctness and completeness of the generated answers were measured to determine which chatbot performs better than the others. The correctness of the generated responses, both with and without grammatical errors present in the questions, was measured using the F1 score to meet objectives 1 and 3, respectively. The completeness of the generated responses in engaging the user was checked using the judgement of experts, who were asked to rate the generated responses on a scale of 1 to 3. The level of agreement between the experts was then determined using Cohen's kappa. The completeness of the responses demonstrates that the chatbot can engage the user in a long and meaningful conversation to meet objective 2.
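The two measures used in this evaluation can be sketched as follows. The label lists are toy examples, not the ratings collected in this study; scikit-learn's metrics module provides equivalent functions.

```python
def f1_score(expected, predicted, positive="correct"):
    """F1 = 2PR / (P + R) over binary correctness labels."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def cohens_kappa(rater_a, rater_b):
    """Kappa = (po - pe) / (1 - pe): observed agreement corrected for chance."""
    n = len(rater_a)
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    pe = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Toy 1-to-3 ratings from two hypothetical experts on eight responses.
expert_1 = [3, 3, 2, 1, 3, 2, 2, 3]
expert_2 = [3, 3, 2, 1, 2, 2, 1, 3]
print(round(cohens_kappa(expert_1, expert_2), 3))
```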
8.3 Tools and Techniques Used to Develop the IntelliBot Prototype
The machine learning tools, techniques and the hardware and software used in the development of the IntelliBot prototype are listed in Table 8.1.
Table 8.1 List of hardware used in developing IntelliBot

Hardware Type     Description
CPU               Intel Core i9-7900X 4.3GHz, 10 Cores / 20 Threads
Motherboard       X399 Gaming Pro
GPU               Radeon 2080 8GB
RAM               CORSAIR 64GB DDR4 3333
HDD               2 TB 9600RPM
Display           32" HP Monitor
Keyboard/Mouse    Alienware 121 Keypads
Table 8.2 List of software used in developing IntelliBot

Software                     Description
Operating System (OS)        Windows 10 Professional
IDE                          PyCharm, Anaconda
Core Programming Language    Python 3.6
Database                     MySQL 5.5
Web Designing                HTML5, JavaScript, CSS3
Web Browser                  Google Chrome 81.0.4044
Library                      TensorFlow 1.4, scikit-learn
Table 8.3 List of library packages installed in the Python environment

Library                  Version   Description
tensorflow               1.4       Supports deep learning and ML for numerical computation. Used to design and develop the generative-based strategy.
aiml                     0.9.2     Used to design and develop the template-based strategy.
stanfordcorenlp          3.9.2     An NLP tool providing lemmatization, POS tagging, NER and word segmentation.
scikit-learn             0.22      ML tool for model selection, fitting, classification, clustering, prediction and cross-validation. Supports supervised and unsupervised learning.
tflearn                  0.3.2     A deep learning library designed to provide a higher-level API to TensorFlow.
language-check           1.1       A broad range of grammatical analysis tools.
colorama                 0.4.3     Produces coloured text and cursor positioning.
scipy                    1.4.1     Provides user-friendly numerical integration and optimization.
nltk                     3.5       Package for natural language processing.
numpy                    1.18.3    An efficient multi-dimensional container of generic data and a powerful N-dimensional array object.
mysql-connector-python   8.0.19    MySQL driver to connect with the RDBMS.
django                   3.0.5     A high-level Python web framework that encourages rapid development and clean, pragmatic design.
flask                    1.1.2     A lightweight framework for quick and easy WSGI web applications.
tqdm                     4.45.0    A smart progress meter.
beautifulsoup4           4.9.0     Scrapes information from web pages.
Using these tools and techniques, IntelliBot was developed with graphical user interfaces (GUIs) so that it works on both desktop computers and mobile devices. Figures 8.2 and 8.3 show IntelliBot working with its GUIs in the desktop and mobile applications, respectively. In the next section, the different categories of questions that were asked of each chatbot are discussed.
Fig. 8.2 The working of IntelliBot on a desktop application
Fig. 8.3 The working of IntelliBot on a mobile device
8.4 Different Categories of Questions for Chatbot Evaluation

The three existing chatbots and IntelliBot were asked questions from different categories to check whether they are competent and perform well. Each chatbot was asked a total of 71 questions from seven categories, namely: greetings, asking for assistance, asking for time and date, general questions, arithmetic problem-solving questions, domain-specific questions and ending the chat session. All categories except the domain-specific questions are categorised as general conversation.
The greetings category includes general wellbeing questions. Table 8.4 lists 11 questions in the 'greetings' category. The second category, 'asking for assistance', refers to questions about the user's intentions. Table 8.5 lists three questions on asking for assistance. The third category of questions asks for the 'time and date', and Table 8.6 lists five questions in this category. The next category, 'general questions', tests a chatbot's ability to respond to general knowledge questions. Table 8.7 lists eight questions in this category. The next category asks questions related to 'arithmetic problem-solving' ability. Table 8.8 lists six questions in this category. The domain-specific category tests a chatbot's ability to answer insurance domain-related questions. Table 8.9 lists 35 questions in this category. The last category tests a chatbot's ability to end a conversation, and Table 8.10 lists three questions in this category.
Table 8.4 Questions in the greetings category

1. Hello
2. Good morning.
3. My name is Nur.
4. What is your name?
5. How are you doing?
6. How old are you?
7. When were you born?
8. Are you male or female?
9. Who made you?
10. Do you smoke?
11. Are you a human?
Table 8.5 Questions in the asking for assistance category

12. What can you do?
13. Really? Can I give a try?
14. What can you do for me?
Table 8.6 Questions in the asking for time & date category

15. What time is it now?
16. Can you tell me what time it is, please?
17. What day was it yesterday?
18. What is the first month of the year?
19. Which days are the weekend?
Table 8.7 Questions in the general category

20. What is the colour of the sky?
21. Who is Einstein?
22. What is a chatbot?
23. How many colours are in the sky?
24. Can cats fly?
25. What happens if machines can think?
26. What is the purpose of life?
27. Tell me what my name is?
Table 8.8 Questions in the arithmetic problem-solving category

28. What is 2 + 2 ?
29. How about 340 / 10 = ?
30. If x = 10, y = 12, what is the value of x * y?
31. How much is six hundred and sixty minus two hundred and twenty?
32. How much do you get if you multiply 555 and zero?
33. Great job.
Table 8.9 Questions in the domain-specific category

34. What is the monthly premium?
35. What does it cover?
36. What is the repair cost for a standard device?
37. What is the repair cost for a premium device?
38. How much do you need to pay for a replacement device?
39. What is the renewal policy of the premium?
40. What are the consequences for fraudulent or misleading claims?
41. What benefits are included with the claim?
42. What are the exclusions?
43. In which situation will this policy not work?
44. Should I notify you of my address changes?
45. When should I notify you of the changes?
46. How can I notify you if my address changes?
47. Can I cancel the policy?
48. How can I lodge a claim?
49. What will happen after I lodge claim?
50. When will the policy cover end?
51. How can I cancel the policy?
52. What is the cooling off period?
53. What if I am not satisfied with Vodafone's services?
54. Is there any international coverage with the insurance policy?
55. What is the interest rate of an AMEX card?
56. What is the annual fee of an AMEX card?
57. What are the benefits of an AMEX card?
58. What is the balance transfer of an AMEX card?
59. What is the cash advance rate of an AMEX card?
60. What is the interest-free period of an AMEX card?
61. What is the international transaction fee on an AMEX card?
62. What types of insurance are covered by an AMEX card?
63. Can you advise me of a low-rate credit card?
64. Can you tell me more about a low-rate credit card?
65. Is there any late payment fee on the AMEX credit card?
66. What is the annual fee of a low-rate credit card?
67. What is the cash advance rate of a low-rate credit card?
68. What is the interest-free period of a low-rate credit card?
Table 8.10 Questions in ending the chat session category

69. Nice talking to you.
70. See you next time.
71. Bye.
8.5 High-level Overview of the Three Existing Chatbots Used in the Experiment for Comparison with IntelliBot
As previously mentioned, IntelliBot's responses are compared against three publicly available chatbots, RootyAI bot [174], ChatterBot [175] and DeepQA [119], for evaluation. The aim of this section is to briefly explain their working styles and objectives.
• RootyAI. This bot is an open-source, task-oriented chatbot built for conversation with users on Slack. RootyAI aims to automate an organization's simple tasks, deploy the site and be a source of fun around the office. When the user enters text, RootyAI extracts the text and finds the keywords to execute a task and start a conversation. The bot forgets dialogs after a default timeout of 30 seconds.
• ChatterBot. This is a conversational chatbot built using machine learning algorithms. The aim of ChatterBot is to produce different types of responses in any language and engage in conversation. It generates responses based on collections of known conversations. When a user asks a question, ChatterBot initially saves the question and then processes it to generate a response. It selects the closest matching response by searching for the known question that most closely matches the user input, then returns the most likely response to the user.
• DeepQA. This bot is also a conversational chatbot, built using recurrent neural networks and trained on the Cornell movie corpus. The aim of DeepQA is to build a simple conversational model that can generate meaningful and long responses. When a user asks a question, it is processed through NLP and the model then converses by predicting the answer given the question in a conversation. A lack of consistency is a common failure mode of this chatbot model.
8.6 Output of RootyAI, ChatterBot, DeepQA and IntelliBot on the Considered Questions
Table 8.11 lists the responses generated by each chatbot in response to the questions detailed in Tables 8.4 to 8.10.
Table 8.11 User questions and the response received from each chatbot
User Input | RootyAI | ChatterBot | DeepQA | IntelliBot | Strategy Selection

Greetings
Bot Initialization | Hello, I'm AI assistance. How may I help you? | Hi there, How may I assist you? | Hello, I'm Papaya. I'm your assistant. | Hello, I'm IntelliBot. Your 24/7hours assistance. How may I assist you? | 1
1 Hello | Hello | Hello | Hello | Hello | 1
2 Good morning. | Good morning. | Good morning. | Good morning. | Good morning. | 1
3 My name is Nur. | Nice to meet you, Nur | Ok, Nur | Ok. How can I help you, Nur? | Nice to meet you, Nur. How may I help you? | 1
4 What is your name? | I don't know. | I don't have it. | I'm Papayay. What yours? | My name is IntelliBot. | 1
5 How are you doing? | I am good | Ok, Nur | I am good | I am good. Thank you. | 2
6 How old are you? | I don't know | I don't know | I am 9 years | I am 9 years old. | 1
7 When were you born? | 2010 | 2010 | 2010 | 30092010 | 1
8 Are you male or female? | I am a male | Male | I am a child | I have no gender | 2
9 Who made you? | I don't know | I don't know | I don't know | Nuruzzaman | 2
10 Do you smoke? | No | No | I don't | I am non-smoker. | 1
11 Are you a human? | No | No | No | I am a robot | 2
Asking for Assistance
12 What can you do? | What do you need? | What help do you need? | I chat with people. | I chat with people, answering all kinds of questions. | 4
13 Really? Can I give a try? | ok | Sure | Sure | Sure, why not? | 4
14 What can you do for me? | What do you need? | What help do you need? | I chat with people. | I chat with people, answering all kinds of questions. | 4

Time and Date Questions
15 What time is it now? | Time … | I don't know | It's 13.45 | It is 1.45pm, Friday. | 1
16 Can you tell me what time it is, please? | Time … | I don't know | It's 13.46 | It is 1.46pm, Friday | 1
17 What day was it yesterday? | Friday | Friday | Thursday | It was Thursday. | 4
18 What is the first month of the year? | January | January | January | January is the first month of the year. | 4
19 Which days are weekend of the week? | I don't know | I don't know | I don't know | Sunday | 4

General Questions
20 What is the colour of sky? | 7 | 7 | Blue | It is blue | 3
21 Who is Einstein? | Einstein is physiologist | Einstein is physiologist | Einstein is physiologist | Albert Einstein is physiologist | 3
22 What is a chatbot? | Chatbot is a computer programme | Chatbot is a computer programme | Chatbot is a computer programme | Chatbot is a computer programme that converse with human. | 3
23 How many colours in the sky? | 7 | 7 | 7 | 7 colours | 3
24 Can cat fly? | bird can fly. | I think bird can fly. | No. | I don't think so. | 4
25 What happens if machines can think? | Human will become lazy. | It will rule the world. | It will rule the world. | It will rule the world and human will become lazy. | 4
26 What is the purpose of life? | I don't know. | I don't know. | I don't know. | I don't know. | 4
27 What is my name? | My name is. .. | I don't know. | I'm Papayay. | Your name is Nur. | 1

Arithmetic Problem-solving Questions
28 What is 2 + 2 ? | 4 | I don't know | 4 | It's easy. 2 + 2 = 4 | 4
29 How about 340 / 10 = ? | 340 | I don't know | 34 | 340 / 10 = 34. Try more. | 4
30 If x = 10, y = 12. What is the value of x * y? | I don't know | I am not sure what are you saying. | I don't know | x * y = 120 | 4
31 How much is six hundred and sixty minus two hundred and twenty? | I don't know | I don't know | I don't know | I know 660 – 220 = 440 | 4
32 How much do you get if multiply 555 and zero? | I don't know | I don't know | I don't know | That's not hard. 555 * 0 = 0 | 4
33 Great job. | Thank you. | Thank you. | Thank you. | Thank you. I am happy for you. | 1
Domain Specific Questions
34 What is the monthly premium? | 15 | 15 | 15 | $15 | 4
35 What does it cover? | replacement device | device | device replacement or repair | replacement stolen device | 4
36 What is the repair cost for a standard device? | I don't know | I don't know | 50 | $50 | 4
37 What is the repair cost for a premium device? | I don't know | I don't know | 125 | $125 | 4
38 How much do you need to pay for a replacement device? | I don't know | I don't know | 125 | $125 | 4
39 What is the renewal policy of the premium? | Automatically | Automatically | Automatically | Automatically renew | 4
40 What are the consequences for fraudulent claims? | cancel policy | policy policy | I don't know | Policy cancelled | 4
41 What benefits are included with claim? | replacement pay | pay pay | pay 500 | pay $500 loss claim | 4
42 What are the exclusions? | No | No | No No coverage | No No coverage | 4
43 In which situation will this policy not work? | I don't know | I don't know | To damage destroy device | To damage, destroy or misplace device. | 4
44 Should I notify of my address changes? | yes | yes | yes | yes | 4
45 When should I notify changes? | I don't know | if if if device device device change the device | if no longer the device | if no longer Australia or the device | 4
46 How can I notify if my address changes? | make a claim | make a claim | claim claim Email | Email or call | 4
47 Can I cancel the policy? | yes | yes | yes | yes | 4
48 How can I lodge claim? | make a claim | make a claim | claim claim call us | call Vodafone | 4
49 What will happen after I lodge claim? | I don't know | I don't know | Vodafone report to police | Vodafone suspend SIM and report to police | 4
50 When the policy cover will end? | I don't know | if if if device device device change the device | if no longer the device | if no longer Australia or the device | 4
51 How can I cancel the policy? | I don't know | anytime | anytime anytime Anytime | Anytime. Call 1555 | 4
52 What is the cooling off period? | 14 | 14 | 14 | 14 days | 4
53 What if I am not satisfied with Vodafone's services? | I don't know | call call call us us us | call us | Email or call | 4
54 Is there any international coverage with the insurance policy? | I don't know | I don't know | Yes | Yes | 4
55 What is the interest rate of an AMEX card? | I don't know | I don't know | interest rate | Interest rate of AMEX card is 0% p.a | 4
56 What is the annual fee of AMEX card? | I don't know | I don't know | Annual fee | Annual fee of AMEX card is $59.00 | 4
57. What are the benefits of AMEX card? | I don't know | I don't know | The benefits are up to 55 days | The benefits are: Up to 55 days interest free on purchases, lower interest rate and annual fees. | 4
58. What is balance transfer of AMEX card? | I don't know | I don't know | Balance transfer | Balance transfer rate of AMEX card: 5.99% | 4
59. What is cash advance rate of AMEX card? | I don't know | I don't know | Cash advance | Cash advance interest rate 21.24% p.a. | 4
60. What is interest free period of AMEX? | I don't know | I don't know | Interest free purchases | Interest free period on purchases up to 55 days. | 4
61. What is the international transaction fee? | I don't know | I don't know | International transaction | International transaction fee is 3.00% | 4
62. What types of insurance is covered by the AMEX card? | I don't know | I don't know | Insurance | Insurance covered by the AMEX card are health, accident and travel. | 4
63. Can you advise me of a low rate credit card? | I don't know | I don't know | Yes | Yes, you may look for CommBank Low Rate Card. | 4
64. Tell me more about a low rate credit card. | I don't know | I don't know | The benefits are: Up to 55 days interest free on purchases, lower interest rate and annual fees. | The benefits are: Up to 55 days interest free on purchases, lower interest rate and annual fees. | 4
65. Is there any late payment fee on credit card? | I don't know | I don't know | Yes | Yes, $20.00 late payment fee apply. | 4
66. What is the annual fee of a low rate credit card? | I don't know | I don't know | Annual fee | Annual fee is $24.00 p.a. | 4
67. What is the cash advance rate of a low rate credit card? | I don't know | I don't know | Cash advance rate | Cash advance rate of low rate credit card is 9.90% | 4
68. What is the interest free period of a low rate credit card? | I don't know | I don't know | Interest free period 55 days | Interest free period on purchases up to 55 days. | 4
Ending Chat Session
69. Nice talking to you. | Same here | Same here | Same here | Same here. Thank you. | 1
70. See you next time | See you. | See you. | See you. | See you, Nur. | 1
71. Bye | Bye | Goodbye | Bye. See you. | Goodbye, Nur. Thank you for using IntelliBot. | 1
As shown in Fig. 8.4, the generative seq2seq strategy was used the most by IntelliBot to generate a response, followed by the template-based strategy. These results were then evaluated by the experts to determine their accuracy in relation to the questions asked.
[Fig. 8.4 pie chart, Strategy Selection Ratio: Generative-Seq2Seq 70%, Template-based 18%, Knowledge-based 6%, Internet Retrieval 6%]
Fig. 8.4 Strategy selection ratio used by IntelliBot to give an answer to the user’s questions
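The ratios in Fig. 8.4 can be reproduced from the per-question strategy labels with a simple counter. The counts below are illustrative values consistent with the reported 70/18/6/6 split over the 71 test questions, not the thesis's raw logs:

```python
from collections import Counter

# Illustrative per-strategy counts over the 71 test questions (assumed values
# consistent with the reported split, not taken from the actual experiment logs).
strategies = Counter({"Generative-Seq2Seq": 50, "Template-based": 13,
                      "Knowledge-based": 4, "Internet Retrieval": 4})
total = sum(strategies.values())  # 71 questions in total
ratios = {name: round(100 * count / total) for name, count in strategies.items()}
```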
To validate and compare the results of IntelliBot with those of the other chatbots, their F1 scores were computed. To do so, the result of each chatbot on each question was categorised as a true positive, false positive or false negative based on the expected results. A response is classified as a true positive if it matches the expected answer. It is classified as a false positive if the chatbot gives a meaningful but incorrect response. A false negative refers to scenarios where the chatbot has been trained to give an answer but either gives no answer or gives an incoherent answer such as 'I don't know'. A true negative refers to cases where the chatbot does not give an answer because it was not trained; this does not apply here as all four chatbots were trained. Table 8.12 shows the confusion matrix from the results of each chatbot.
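The categorisation rule described above can be sketched as follows. The exact-string matching is a simplification for illustration; in the evaluation itself, the judgement of whether a response matched the expected answer was made by the experts:

```python
def classify(response, expected):
    """Map a chatbot response to a confusion-matrix category (simplified sketch)."""
    if response.strip().lower() == expected.strip().lower():
        return "TP"  # matches the expected answer
    if response.strip() == "" or response.strip().lower() == "i don't know":
        return "FN"  # trained, but gives no answer or an incoherent one
    return "FP"      # meaningful but incorrect response
```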
Table 8.12 Confusion matrix from the results of each chatbot
Based on the confusion matrix, the F1 score is determined for each chatbot using the standard evaluation metrics of precision and recall. As shown in Eq. (8.1), precision, denoted by p, is the ratio of true positives to all predicted positives, whereas recall, denoted by r, is the ratio of true positives to all actual positives. The F1 score is measured using Eq. (8.2) as defined in [176]:

p = tp / (tp + fp),   r = tp / (tp + fn)   (8.1)

F1 = 2pr / (p + r)   (8.2)
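Eqs. (8.1) and (8.2) can be written directly in code. The counts used in the test are one possible set consistent with the IntelliBot-with-GEC scores reported later in Table 8.20 (precision 1.0000, recall 0.9500, F1 0.9744); they are assumed values shown only to illustrate the formulas:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts (Eqs. 8.1 and 8.2)."""
    p = tp / (tp + fp)               # Eq. (8.1): precision
    r = tp / (tp + fn)               # Eq. (8.1): recall
    f1 = 2 * p * r / (p + r)         # Eq. (8.2): harmonic mean of p and r
    return p, r, f1
```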
Table 8.13 shows the precision, recall and F1 measure scores for the four chatbots.
Table 8.13 Precision, Recall, and F1 Score
[Fig. 8.5 bar chart: precision, recall and F1 score for RootyAI, ChatterBot, DeepQA and IntelliBot. IntelliBot scores 0.9857 on all three measures; DeepQA scores precision 0.8065, recall 0.8475 and F1 0.8264; RootyAI and ChatterBot score between 0.3175 and 0.7143.]
Fig. 8.5 F1 Scores of the four chatbots in all question categories
Fig. 8.5 illustrates the F1 scores of each chatbot for the general conversational and domain-specific questions combined. As seen from the figure, IntelliBot achieved 0.9857, the highest F1 score of the four chatbots being studied, meaning it performed better than the other three chatbots. RootyAI performs almost the same as ChatterBot, while DeepQA performs better than both but not as well as IntelliBot. Fig. 8.6 breaks the overall combined F1 score down into the scores for domain-specific and basic conversational questions. It can be seen that IntelliBot performs much better than DeepQA on domain-specific questions.
[Fig. 8.6, top panel (Basic Conversation): IntelliBot scores 0.9714 on precision, recall and F1; DeepQA scores precision 0.9310, recall 0.7941 and F1 0.8571; RootyAI and ChatterBot score between 0.5000 and 0.7143.]
[Fig. 8.6, bottom panel (Domain-Specific Conversation): IntelliBot scores 1.0000 on precision, recall and F1; DeepQA scores precision 0.9200, recall 0.6970 and F1 0.7931; RootyAI and ChatterBot reach F1 scores of only 0.2500 and 0.2927.]
Fig. 8.6 Scores of the four chatbots categorised according to domain-specific and conversational questions
From Table 8.11, it can be seen that IntelliBot performs better than the other chatbots when it comes to providing a semantically correct answer. The responses to questions 8, 11, 24, 25 and 26 are more semantically correct than the others. For example, in response to question 8, 'Are you male or female?', the response from the RootyAI bot is 'I am a male', which is syntactically correct but semantically incorrect. For the same question, ChatterBot's response is 'male', and the DeepQA bot's response is 'I am a child', which is semantically incorrect. IntelliBot responds tactfully to this question by saying 'I have no gender'. Furthermore, for the question 'are you a human?', IntelliBot's response 'I am a robot' is the most semantically correct. These results indicate that IntelliBot performs better compared to the other chatbots. Furthermore, IntelliBot has better conversational capabilities than the other chatbots as it engages with the user while giving a response. For example, for question 29, IntelliBot tells the user to try more; this was not the case with the other chatbots. This is further evaluated in the next section.
8.7 Evaluate Engagement with the User in Relation to the Responses Generated by the Chatbots As shown in Fig. 8.7, expert judgement is used to rank the answers given by each chatbot to determine the level of engagement with the user, after which the kappa coefficient scores were analysed to ensure agreement between the experts' judgements. The following sub-sections elaborate on each evaluation step in detail.
Fig. 8.7 Chatbot evaluation steps
8.7.1 Expert judgment
The quality of the generated responses was measured by experts. For this type of validation, two experts examined the same set of questions and answers (expected responses), compared them with the answers from the existing chatbots, and rated the generated responses. To determine whether the generated responses are acceptable, the experts considered the correctness, relevance and completeness of each answer.
As discussed in Chapter 5, the four response generation strategies of IntelliBot were evaluated against the responses given by RootyAI [174], ChatterBot [175] and DeepQA [119]. To obtain responses from these chatbots, they were trained on the same datasets as IntelliBot, these being the Cornell movie dialogue and the insurance QA datasets, with the TensorFlow seq2seq model with an attention mechanism. RootyAI, ChatterBot and DeepQA were then asked the 71 questions from seven categories to obtain their responses. Table 8.8 shows the responses of each chatbot. The quality of the generated responses of each of the four chatbots was measured by experts who were native English speakers. The experts did not need to be domain experts, as their task was to rank the answers based on how much detail is provided in response to the question. To determine whether the generated responses are acceptable, the experts were asked to rate each response with a whole-number score from 0 to 3. The semantics of the scores are as follows:
Score 0 - Incorrect answer: If no conversational dialogue is given by the chatbot or if it provides a wrong answer or response, then a score of 0 is given by the expert. For example, ChatterBot’s response to the question ‘what is 2 + 2?’ is ‘I don’t know’, so the expert gave the answer a rating of 0 to indicate it is an incorrect answer.
Score 1 - Correct but irrelevant answer: This represents a scenario where the response generated by the chatbot is correct but irrelevant to the question asked. For example, as shown in Table 8.14, in response to the user question 'Can cats fly?', RootyAI's response is 'Birds can fly', which is logically correct but irrelevant to the question asked. For such responses, the expert gives the answer a rating of 1 on a scale of 0 to 3.
Score 2 - Correct and relevant answer: This score indicates that not only is the given response correct, it is also relevant to the question asked. For example, as shown in Table 8.14, in response to the user input of ‘Goodbye’, the ChatterBot replied ‘Bye, see you again’, which is not only correct but also relevant. The expert gives such a response a rating of 2 on a scale of 0 to 3.
Score 3 - Complete answer: This means that the generated response is not only correct and relevant, it is also meaningful and makes more sense with respect to the user input than the other chatbot responses. For example, as shown in Table 8.14, in response to the user’s
termination of the conversation with the utterance ‘Goodbye’, the response from IntelliBot is ‘Goodbye, Nur. Thank you for using IntelliBot’, which is not only correct and relevant, it is more detailed than the response given by ChatterBot which was simply ‘Bye, see you again’. The expert gives such responses a rating of 3 on a scale of 0 to 3.
Table 8.14 Example of rating used by an expert to score the answer of each chatbot
Question: Can cats fly? | RootyAI: 'Birds can fly.' (1) | ChatterBot: 'I think birds can fly.' (1) | DeepQA: 'No.' (3) | IntelliBot: 'I don't think so.' (2)
Question: What is 2 + 2? | RootyAI: '4' (2) | ChatterBot: 'I don't know.' (0) | DeepQA: '4' (2) | IntelliBot: 'It's easy. 2 + 2 = 4' (3)
Question: Bye Chatbot | RootyAI: 'Bye.' (2) | ChatterBot: 'Goodbye.' (2) | DeepQA: 'Bye, see you again.' (2) | IntelliBot: 'Goodbye, Nur. Thank you for using IntelliBot.' (3)
Two experts participated in the experiments. Tables 8.15 and 8.16 show the experts' results for the answers given by each chatbot. For ease of evaluation, the experts' results are presented in two categories. Table 8.15 shows the experts' ratings for the 36 basic conversational questions (questions from all categories except domain-specific questions), and Table 8.16 shows the experts' ratings for the 35 insurance domain-specific questions.
Table 8.15 Statistics of the general conversation rating
Table 8.16 Statistics of the domain-specific conversation rating
The next step is to measure Cohen’s kappa co-efficient to ensure agreement between the experts.
8.7.2 Measuring Cohen's kappa coefficient to ensure agreement between the experts
Next, the kappa coefficient for each chatbot's answers as rated by the experts was measured. The agreement being determined between the experts is in relation to the question 'how complete is the chatbot's response?' The ratings shown in Tables 8.15 and 8.16 were used for the analysis and the values were converted to either a Yes or a No response: a rating of 3 was considered a 'Yes', whereas a rating from 0 to 2 was considered a 'No'. Table 8.17 shows the analysis of the responses of each chatbot.
Table 8.17 Experts' agreement
                    | RootyAI | ChatterBot | DeepQA | IntelliBot
Both agree 'YES'    | 8       | 7          | 16     | 67
Both agree 'NO'     | 57      | 58         | 37     | 3
Total 'YES' by R1   | 9       | 9          | 21     | 68
Total 'YES' by R2   | 11      | 11         | 28     | 67
Total 'NO' by R1    | 62      | 62         | 50     | 3
Total 'NO' by R2    | 60      | 60         | 43     | 4
The kappa coefficient [177] was then calculated using Eq. (8.3):
k = (po - pe) / (1 - pe) = 1 - (1 - po) / (1 - pe)   (8.3)
where po is the relative observed agreement among the experts and pe is the hypothetical probability of chance agreement. The kappa coefficient k for each chatbot was calculated using Eq. (8.3) and it was found that there is agreement between the two experts. RootyAI's coefficient is k = 0.6514 and ChatterBot's coefficient is k = 0.6514, which indicates substantial agreement between the experts; DeepQA's coefficient is k = 0.8150 and IntelliBot's coefficient is k = 0.8499, which indicates near-perfect agreement. The results are summarised in Table 8.18, which shows that the experts agreed with the ranking, finding IntelliBot's responses to be the most complete and the most engaging in conversation.
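Eq. (8.3) can be computed directly from the agreement counts in Table 8.17. The sketch below derives po and pe from the two experts' marginal totals; with ChatterBot's counts it reproduces the k = 0.6514 reported in the text:

```python
def cohen_kappa(both_yes, both_no, r1_yes, r2_yes):
    """Cohen's kappa (Eq. 8.3) for two raters with binary Yes/No labels."""
    # total items rated by both experts, including the two disagreement cells
    n = both_yes + both_no + (r1_yes - both_yes) + (r2_yes - both_yes)
    po = (both_yes + both_no) / n                       # observed agreement
    pe = ((r1_yes / n) * (r2_yes / n)
          + ((n - r1_yes) / n) * ((n - r2_yes) / n))    # chance agreement
    return (po - pe) / (1 - pe)                         # Eq. (8.3)
```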
Table 8.18 Cohen kappa coefficient value for each chatbot
RootyAI    | k = 0.6778 | Substantial Agreement
ChatterBot | k = 0.6514 | Substantial Agreement
DeepQA     | k = 0.8183 | Near-Perfect Agreement
IntelliBot | k = 0.8499 | Near-Perfect Agreement
8.8 Demonstrating IntelliBot’s Ability to Correct Grammatical Errors in the Questions before Generating a Meaningful Response
The aim of this section is to evaluate the error correction component of IntelliBot, which detects errors and prompts the user to correct their question before generating a response. To evaluate which chatbot performs better on incorrect or erroneous questions, we asked each chatbot different incorrect questions classified into six categories and examined their generated responses, as shown in Table 8.21. Table 8.21 also shows the responses from IntelliBot when the GEC module is used and when it is not. Then, the F1 scores of each chatbot were computed by an expert who classified the results for each question as either true positive (TP), true negative (TN), false positive (FP) or false negative (FN) based on the expected results. Table 8.19 shows the confusion matrix from the results of each chatbot.
Table 8.19 Confusion matrix from the results of each chatbot
Next, the F1 score for each chatbot is determined using Eq. (8.2), based on the standard evaluation metrics of precision and recall computed using Eq. (8.1). Table 8.20 shows the precision, recall and F1 scores for the four chatbots.
Table 8.20 Precision, Recall, and F1 Score RootyAI ChatterBot DeepQA IntelliBot with GEC Precision 0.2222 0.1053 0.5833 1.0000 Recall 0.1600 0.1053 0.2500 0.9500 F1 Score 0.1860 0.1053 0.3500 0.9744
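The F1 column of Table 8.20 can be checked directly from the precision and recall columns via Eq. (8.2):

```python
def f1_score(p, r):
    """Eq. (8.2): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Precision/recall pairs from Table 8.20
scores = {"RootyAI": (0.2222, 0.1600), "ChatterBot": (0.1053, 0.1053),
          "DeepQA": (0.5833, 0.2500), "IntelliBot with GEC": (1.0000, 0.9500)}
f1 = {bot: round(f1_score(p, r), 4) for bot, (p, r) in scores.items()}
```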
Table 8.21 Error responses from the three existing chatbots and IntelliBot
Columns: User Input | RootyAI | ChatterBot | DeepQA | IntelliBot without GEC | IntelliBot with GEC Component
Greetings
Bot Initialization: | Hello, I'm AI assistance. How may I help you? | Hello, I'm Papaya. I'm your assistant. | Hi there. How may I assist you? | Hello, I'm IntelliBot. Your 24/7 assistance. How may I assist you? | Hello, I'm IntelliBot. Your 24/7 assistance. How may I assist you?
1. Halo | Have a nice day. | … and I've got a husband | Congrats, just pick one | Koreans don't have any games, unfortunately. | Hello
2. Gud morning. | See you later, thanks for visiting | Bye, darling | You don't have a fever. | You don't know that you're right? | Good morning.
3. My name is Mohammad | Nice to meet you, Mohammad | Ok. How can I help you, Mohammad? | Ok, Mohammad | Nice to meet you, Mohammad. How can I help you? | Nice to meet you, Mohammad. How can I help you?
4. What is ur name? | Hi, I'm RootyAI. | Victor framkenstein | My name is Papaya. | I am a IntelliBot. May I also have your name, please? | Im IntelliBot. What yours?
5. Haw u doing? | Have a nice day. | I'm sorry. | Yes, I am Papaya. | I amn't kidding. | Doing great, thank you.
6. How old ar u? | I'm fine, Thanks for asking. | Fine. | Girlfriend isn't connected to your cats. | Girlfriend and now I am old. | I am 9 years old.
7. Whan are u born? | I'm Slack. | No, I'm not. | No, I am fine. | No, I am just born on my own. | July 20th 2009
8. Ar u human? | I'm Slack. | What about it? | Yes, I have no clue. | I r, well now. | No, I am a robot.
9. r u real or fake? | I don't know | …. Brainerd? | OMG, I just have a real name. | I only prefer my real problem. | Definitely I am real, but is a real good, hehe
10. Do u injoy talking? | Hello. | I'm fine. | I am not even sure, that sucks. | I am fine, but that sucks. | Sure, why not?
Asking for Assistance Questions
11. Help! | Help! | Help! | What do you need help with? | What do you need help with? | What do you need help with?
12. How can I save this canverseion? | Fhantom is a startup company. | I'm not sure. | You can do it, you save the link. | You can use it on the difficulty because you don't have to. | You can highlight everything you want to save, it's in your computer.
13. Wht do I do in imergancy? | Hi there, how can I help? | I'm sorry. I don't. | But I find that reason. | I am not sure about that. I do apologize. | Call 911 on the phone.
14. Could u help me make a resarvasion? | [null] | I'm sorry. | Sure, What could you do? | This. What kind of added you? | Yes, I am going to help you.
15. Wht do u do? | Hi. | I'm sorry. | I am Michael. I like the princess. | I apologize, but I still have to talk about that. | I talk to users and answer their questions.
Time and Date Questions
16. Wht day was day today? | … | I don't know | Today is Friday, September 20, 2019. | No. I am able to buy less. | Today is Friday, September 20, 2019.
17. Kan u tell me what time it is, pls? | Time … | Yessss…. | I saw my first to tell the truth. | It is hard to tell. Please tell me that. | It is 02:54 PM.
18. What dae was yerterday? | I don't know | I don't know | I believe they were poisoned. | I assume that's a pretty common problem. | It was Thursday, September 19, 2019.
19. What is frist manth of the year? | I don't know | It's a secret. | It's Tuesday, September 24, 2019. | A year, he's a lot of things | Year, it's january.
20. Wht is the date the dae before esterday? | I don't know | Yeah. I'm fine. | It's not a lot like that yet. | It's a lot of spider bombshell. | It was Wednesday, September 18, 2019.
General Questions
21. Wht is the calor of blod? | I don't know | …and hypocrisy. | Sorry, I am not sure. | Sorry, I am not sure. Could you explain that? | It is Red.
22. How do u know wht to answer? | Yes. Take this down. [email protected] | I don't know | Because there aren't many apps and analyze them. | There isn't even an answer for that. | In the simplest way, the answer is of my previous training
23. Do u preper to be a robot or a human? | I don't know | I'm sorry. | I am not a human. | I am not a human because human can be anthropomorphic. | I like to be a human, though.
24. How much legs in a cat? | I'm doing well. | I don't know | Four. Good pet. | The four four turns is one height. | Four legs. Two legs.
25. Kan cat fly? | See you later, thanks for visiting. | Un-huh | Are you kidding. | No. The cat's a cat. | Yes. Cats are still a hero.
26. Wht is dep larning? | I don't know. | Program Terminated. | Yourself yourself become a thin sleep on this. | Sleep yourself on your tour to sleep. This is a follow follow. | Deep learning is a subfield of machine learning.
27. Wht AI? | I don't know. | Aye. | No, I AI. | AIn't talking about artificial. | AI means artificial intelligence.
28. Wht love mean? | I don't know. | I'm sorry. | No, it's such a good and commitment. | No, it's more challenging than English. | It's very simple. Just a simple and can have some commitment.:)
29. Kan u do laundry? | I don't know. | Marce-- You're real | No, I am not the only one. | No, I am not a general-purpose fan. | No. I am not a general-purpose robot.
30. Hu is Gates? | Hello. | Yes, sir. | The bill is bill gates. | The bill is bill gates. | Bill gates is the co-founder of Microsoft.
Arithmetic Problem-solving Questions
31. Wht is 253 + 257 = ? | 4 | … | Oh, I know how to deal with this: 253+257 = 510 | It was a piece of cake: 253 + 257 = 510 | That was an easy one: 253 + 257 = 510
32. How abt 31 / 0 = ? | 31 | I'm not sure. | Sorry, I have no clue. | Sorry, but that does not make sense as the divisor cannot be zero. | Sorry, but that does not make sense as the divisor cannot be zero.
33. If x = 10, y = 12, how much is x * y? | I don't know | I don't know | I don't know | x * y = 120 | x * y = 120
34. How mach is six hundred and sixty minus two hundred and twenty? | I don't know | I don't know | I don't know | Here you are: 660 - 220 = 440 | I know how to solve it: 660 - 220 = 440
35. How mach do you get if you multiply 555 and zero? | I don't know | I don't know | I don't know | Oh, I know how to deal with this: 555 * 0 = 0 | That's simple, and I know how to solve it: 555 * 0 = 0
Ending Chat Session
36. WTG | See you later, thanks for visiting. | What is it? | Sorry, I didn't get it. | Sorry, I don't get it. | Nice to talking to you.
37. Thank you | You are welcome. | Thank you | You're welcome. | You're welcome. | You're welcome.
38. CU later | Have a nice day. | Parading's accountant. | Yeah, we killed it. | Yeah, it's almost worth the article. | See you again soon.
IntelliBot with GEC User Input RootyAI ChatterBot DeepQA IntelliBot without GEC Component Kan we met Bye! Come back I'd say so. Didn't you 39 I'm sorry. Yeah, I don't think so. Yeah, Sure. again? again soon. think of it? B4N See you later, thanks What is it? Sorry, I didn't get it. Sorry, I don't get it. Goodbye, Mohammad. 40 for visiting. Thank you for using IntelliBot.
Fig. 8.8 shows the F1 scores of each chatbot for the erroneous questions. As seen from the figure, IntelliBot achieved 0.9744, the highest F1 score of the four chatbots being studied, meaning it performed better than the other three chatbots. This is because IntelliBot uses the grammar correction module discussed in Chapter 6 and shown in Fig. 8.9, which first detects errors in the question and then confirms the possible correction with the user to obtain the correct question before generating its response. This enables it to generate a more meaningful response compared to the other chatbots, whose performance is very poor when the question contains errors.
[Fig. 8.8 bar chart of the Table 8.20 values: F1 scores of 0.1860 (RootyAI), 0.1053 (ChatterBot), 0.3500 (DeepQA) and 0.9744 (IntelliBot with GEC).]
Fig. 8.8 F1 Scores of the four chatbots when there is an error in the question
Fig. 8.9 GUIs showing how IntelliBot corrects errors in questions before generating a meaningful response
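The detect-and-confirm flow illustrated in Fig. 8.9 can be sketched as follows. This is a heavily simplified stand-in for the actual GEC module of Chapter 6: a tiny hand-picked vocabulary and Python's standard `difflib` edit-distance matching substitute for its non-word error detection, and the vocabulary words are assumptions made only for this illustration:

```python
import difflib

# Toy vocabulary standing in for the chatbot's dictionary (illustrative only)
VOCAB = {"what", "is", "the", "color", "of", "blood", "deep", "learning",
         "can", "cats", "fly"}

def suggest_correction(sentence, vocab=VOCAB):
    """Return (corrected_sentence, corrections) to confirm with the user."""
    corrected, corrections = [], []
    for word in sentence.lower().split():
        if word in vocab:
            corrected.append(word)
            continue
        # non-word error: propose the closest in-vocabulary word, if any
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
        if match:
            corrections.append((word, match[0]))
            corrected.append(match[0])
        else:
            corrected.append(word)  # unknown word: keep as entered
    return " ".join(corrected), corrections
```

Only after the user confirms the proposed corrections would the corrected question be passed on to the response generation strategies.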
8.9 Exploratory Test (ET)
Bugs and errors are very common in software development. To find bugs, validate IntelliBot's functions and generated responses, and check the developed user interfaces, this thesis applied an exploratory testing approach. Exploratory testing (ET) is a thoughtful methodology which includes concurrent learning while testing the design, implementation and performance. It aims to uncover unique and potential defects of the application and checks whether the current functionality is sufficient to meet the prototype's basic requirements. In other words, the objective is to execute a test case and explore whether IntelliBot behaves in a way other than expected.
8.9.1 Strategy of conducting an exploratory test IntelliBot was developed in an agile environment. As the agile methodology has short sprints, ET is a handy option for checking software functionality developed in a short timeframe without any specific plans. In this study, the exploratory test of IntelliBot's functionalities is performed in four steps: exploring the system, designing test cases, executing the test cases and validating the test results.
Fig. 8.10 Steps of exploratory test
The first step, Explore the system, focuses on familiarization with IntelliBot's features and behaviours. This step divides IntelliBot into multiple modules to easily identify its functionalities by exploring the application and listing all the features. This thesis identified and performed ET on four features, namely: (1) generating meaningful responses and engaging the user in a conversation; (2) detecting and correcting grammatical errors based on user confirmation; (3) multiple strategy selection for generating a response; and (4) validating the IntelliBot prototype.
For each of the four features identified in the first step, the second step focuses on developing a test plan for each and designing the test functions that need to be performed. In this test plan, this thesis defined the goals and methods to be used. The test cases for the four features considered in this ET are presented in Tables 8.22 – 8.25.
The third step focuses on executing the test cases to validate IntelliBot's features and functionalities for each of the considered features. During ET, we documented the key aspects, these being which functionalities were tested and any defects detected. In the fourth step, we validated the results as either a PASS or a FAIL. PASS is assigned to a test function if its output matches expectations; FAIL represents otherwise.
The following sub-sections present the exploratory tests carried out on each of the four features and their results.
Table 8.22 Generating meaningful responses and engaging the user in conversation Test Function Description Actual Result Expected Result Test Result
Start a conversation | When IntelliBot is initialized, it should display a welcome message to the user and ask the user if they need assistance. | Actual: display welcome message; ask user if they need any assistance. | Expected: welcome message displayed; asked user if they need any assistance. | PASS
Asking the chatbot for help | User asks a question about what the chatbot can do for the user. | Actual: display chatbot features and tasks. | Expected: a list of tasks the chatbot can do is displayed. | PASS
Asking for a date | User asks a date-related question. | Actual: display the specific date the user asks for. | Expected: full date form is displayed. | PASS
Asking for the time | User asks a time-related question. | Actual: display the time. | Expected: time is displayed. | PASS
Asking for mathematical calculation | User asks a mathematical or problem-solving question. | Actual: display solution to the problem. | Expected: solution to the problem is displayed. | PASS
Asking domain-specific question | User asks an insurance domain-related question. | Actual: display answer to the user query. | Expected: answer to the domain-specific question is displayed. | PASS
Ending chat session | User ends the chat session. | Actual: display goodbye message. | Expected: end the chat session by saying goodbye. | PASS
Table 8.23 Detecting and correcting grammatical errors based on user confirmation Test Function Description Actual Result Expected Result Test Result
Detect abbreviation | User enters an abbreviation which IntelliBot needs to extract and determine the full form of. | Actual: extract the correct abbreviation; determine the full form of the abbreviation. | Expected: correct abbreviation is extracted; full form of the abbreviation is displayed. | PASS
Check sentence structure | User enters an incorrect or grammatically wrong sentence. | Actual: detect the grammatically wrong sentence; detect the type of error; ask user for confirmation; correct the error. | Expected: wrong structure of the sentence is identified; type of error is classified; user is prompted for confirmation; error is corrected. | PASS
Check non-word error | User enters a word which does not exist in its vocabulary or dictionary. | Actual: detect the word as a non-word; save the word in the dictionary. | Expected: the word is classified as a non-word; the word is saved in the dictionary. | PASS
Check syntax error | User enters a sentence with a syntax error. | Actual: detect the syntax error in the sentence; ask user for confirmation. | Expected: syntax error is corrected. | PASS
Check spelling error | User enters a sentence with a spelling error. | Actual: detect the spelling error in the sentence; ask user for confirmation. | Expected: spelling error is corrected. | PASS
Detect POS | User enters a sentence. | Actual: detect part-of-speech. | Expected: POS is detected. | PASS
Detect entity | User enters a sentence. | Actual: detect entity. | Expected: entity is detected. | PASS
Table 8.24 Multiple strategy selection for generating a response Test Function Description User Input Expected Result Test Result
Select template-based strategy | User enters questions from the 'greetings' category | Input: 'Hello, how are you doing?' | Expected: 'I am good. Thank you.' | PASS
Select knowledge-based strategy | User enters questions from the 'general' category | Input: 'Are you a human?' | Expected: 'I am a robot.' | PASS
Select Internet-retrieval strategy | User enters questions from the 'general' category | Input: 'Who is Einstein?' | Expected: 'Albert Einstein is German physiologist.' | PASS
Select generative-based strategy | User enters questions from the 'domain-specific' category | Input: 'Can you advise me low rate credit card?' | Expected: 'Yes, you may look for CommBank Low Rate Card.' | PASS
Table 8.25 Validate the IntelliBot prototype Test Function Description Actual Result Expected Result Test Result
Does the 'enter' key work? | User types a question and presses the 'enter' key. | Actual: the input is processed and displayed with the answer. | Expected: enter key works. | PASS
No data input check | User presses the 'enter' key without any input. | Actual: a message prompts that the input is blank. | Expected: a message box notifies the user. | PASS
Input with long sentences | User inputs a very long sentence of 100 words. | Actual: a message prompts that the input is too large. | Expected: a message box notifies the user. | PASS
Convert user's input into lowercase | Every input from the user needs to be transformed into lowercase so that NLP can process it. | Actual: user input is changed into lowercase. | Expected: input is changed to lowercase. | PASS
Responsive UI | The user interface should be auto-resized according to the window or screen size. | Actual: UI is resized. | Expected: UI is resized according to the window or screen size. | PASS
Web browser supported | IntelliBot should support all web browsers. | Actual: all web browsers are supported. | Expected: web browsers are supported. | PASS
Mobile friendly | Users should be able to chat from a mobile phone. | Actual: users are able to chat with IntelliBot from their phone. | Expected: mobile phone is supported. | PASS
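Test cases like those in Tables 8.22 to 8.25 can also be automated as small scripted checks. The sketch below is illustrative only: `respond` is a hypothetical stand-in for the chatbot's reply interface (it is not IntelliBot's actual code), and the checked strings are assumptions made for this example:

```python
def respond(user_input):
    # Hypothetical stand-in chatbot: lowercases input (Table 8.25) and
    # rejects blank or overly long input, as in the test cases above.
    text = user_input.strip().lower()
    if not text:
        return "Please enter a question."
    if len(text.split()) > 100:
        return "Input is too large."
    return f"You said: {text}"

def run_case(name, user_input, check):
    """Execute one test function and record PASS/FAIL, as in step four of ET."""
    result = "PASS" if check(respond(user_input)) else "FAIL"
    return (name, result)

cases = [
    run_case("No data input check", "", lambda r: "enter" in r),
    run_case("Input with long sentences", "word " * 101, lambda r: "too large" in r),
    run_case("Convert user's input into lowercase", "HELLO", lambda r: "hello" in r),
]
```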
8.10 Conclusion
Based on the experimental results, this chapter demonstrated IntelliBot's ability to achieve the three objectives, these being to (1) generate a correct answer to a user question, (2) engage the user in a meaningful conversation in a specific domain and (3) correct grammatical errors in a user question and confirm the question before generating a correct and meaningful response. Each chatbot was asked 71 questions from seven different categories. Their responses were recorded and evaluated by the experts to determine their accuracy in relation to the questions asked. The quality of the generated responses was measured with the help of the F1 score, and Cohen's kappa was the metric used to demonstrate IntelliBot's ability to generate a response that engages the user, a requirement when chatbots are applied in the service industry. The experimental results show that IntelliBot outperformed the three existing chatbots, which demonstrates its superiority due to the different modules (as explained in the previous chapters) on which it is built. The problem of measuring the semantic similarity of words or sentences is a longstanding issue in the area of NLP.
CHAPTER 9
I don’t have to research humanity; I just have to be courageous enough to share that part of myself with everybody. —Ruben Santiago-Hudson
CONCLUSION AND FUTURE WORK
9.1 Recapitulation of the Thesis
In the current unprecedented times of the novel coronavirus (COVID-19), customer-focussed service industries are concerned about staying connected with their customers. This is because they need to maintain social distancing protocols and thus need to devise innovative applications that enable them to satisfactorily answer their customers’ queries. One option is for customers to contact the organization via telephone; however, the sheer volume of calls and the corresponding waiting time for customers to have their queries answered makes this option infeasible, hence an alternative solution needs to be developed. Chatbots are one of the technologies that service industries can use to maintain contact with their customers. However, chatbots need to engage with the user in a semantically correct, meaningful and long conversation in a domain-specific industry. For a chatbot to do this, it needs to mimic how a human brain works and must understand users’ complex questions and conversational context.
In Chapter 3, the shortcomings of the existing chatbots in achieving these goals were explained. It was shown that the existing chatbots do not make good use of advanced ML techniques to create semantically correct and meaningful responses or analyse user questions. This led to the definition of the research problem and issues to be addressed in this thesis. The objective was to develop a chatbot named IntelliBot that
can converse naturally with users in a way that is indistinguishable from a human in the domain of the insurance industry. The objectives to be achieved were defined as follows:
1) Develop a modular-based framework for generating appropriate responses to user queries. The solution developed to address this problem should be capable of engaging users in long and meaningful conversations to address their queries related to the insurance domain.
2) Develop different response generation strategies that are capable of answering a user’s question according to its complexity.
3) Develop the detailed working of the different sub-components of IntelliBot that assist it to process and understand the user’s input. The solution developed to address this problem should be capable of correcting any grammatical errors in the user input and the chatbot’s generated output.
4) Develop an approach to collect the insurance domain-specific data required to train IntelliBot. The solution developed to address this problem should ensure that the right data collection approach is taken and that high-accuracy ML models can be developed.
5) Compare and validate the outputs of IntelliBot with three existing chatbots to demonstrate IntelliBot’s accuracy and superiority in engaging with the users while answering their questions.
In the next section, the contributions of the thesis to achieve the above-mentioned objectives are summarized.
9.2 Contributions of the Thesis
The key contribution to the existing literature is the conceptual model, the different modules and their detailed workings that enable IntelliBot to answer user questions related to the insurance domain and at the same time engage with users. The proposed IntelliBot model also corrects mistakes in users’ questions and generates meaningful answers. The proposed framework of IntelliBot aims to show service industries how they can utilize chatbots to correctly and accurately answer users’ questions in a fast and efficient manner. To position such a chatbot for the service industry, a complete solution comprising
various definitions, modules and their detailed workings is presented in this thesis. Thus, the contributions of this thesis to the literature are as follows:
9.2.1 Contribution 1: Develops a modular-based framework for generating appropriate responses to user queries.
The existing literature proposes chatbots that generate a very short response to the user in the form of yes or no. However, as discussed in Chapter 1, in a service industry, a short response often does not answer users’ questions. This thesis argues the need for a chatbot in the service industry to generate a response that not only engages the user but also shows compassion in its reply. Keeping this in view, Chapter 4 proposed the modular framework of IntelliBot, which has many different components to assist it in mimicking the human brain and generating a response.
To the best of the author’s knowledge, the need for such a chatbot model in the service industry has not been discussed in the literature.
9.2.2 Contribution 2: Develops different response generation strategies that can answer a user’s question according to its complexity.
The literature indicates that in the future, chatbots will be the preferred form of communication for businesses. For chatbots to be truly functional, they must have the ability to generate appropriate responses to users’ complex queries to engage them in a meaningful and domain-specific conversation. Several individual response generation strategies have been proposed in the literature. Nevertheless, no existing study has combined different strategies or proposed a hybrid solution for response generation. Chapter 5 of this thesis defined four strategies to generate responses. The subsequent chapters showed how IntelliBot can generate reasonable responses and can engage in meaningful conversations with users, both in terms of general and specific insurance-related questions.
To the best of the author’s knowledge, such a multi-strategy-based response generation technique has not been discussed in the literature.
9.2.3 Contribution 3: Develops the detailed working of the different sub-components of IntelliBot that assist it to process and understand the user’s input.
For a chatbot to be able to generate a human-like conversation and identify errors (both grammatical and otherwise) in a user’s question, NLP plays a significant role. In Chapter 6, the Language Understanding Unit (LUU) of IntelliBot was defined with all the sub-components required to understand human language. IntelliBot processes human language and obtains a more accurate representation of the information. This study showed that IntelliBot was able to accurately extract user inputs and detect word boundaries, differentiate between non-words and abbreviations, recognize named entities, detect and correct grammatical errors and remove stopwords. The experiment results showed that the generated responses are significantly better than those of existing chatbots.
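To make two of the simpler LUU steps concrete, the sketch below illustrates word-boundary detection (tokenization) and stopword removal in plain Python. The regular expression and the tiny stopword set are illustrative assumptions, not IntelliBot’s actual resources.

```python
import re

# Tiny illustrative stopword set; real systems use much larger lists.
STOPWORDS = {"a", "an", "the", "is", "my", "i", "do", "does", "what"}

def tokenize(text: str) -> list:
    """Detect word boundaries: split lowercased text into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens: list) -> list:
    """Drop high-frequency function words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("What does my CTP policy cover?")
print(remove_stopwords(tokens))  # -> ['ctp', 'policy', 'cover']
```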
To the best of the author’s knowledge, such an accurate understanding of the user’s question by a chatbot has not been discussed in the literature.
9.2.4 Contribution 4: Develops an approach to collect insurance domain-specific data required to train IntelliBot.
This study applied four different strategies to generate a response, each requiring a different dataset to work on. A systematic approach was used to collect domain-specific data for IntelliBot to use, as discussed in Chapter 7. Data was collected from selected websites, PDS documents and the Cornell movie dialogue corpus. Each has a different process of cleansing and extracting domain-specific data, which is explained in the thesis.
To the best of the author’s knowledge, such a data collection approach has not been discussed in the literature.
9.2.5 Contribution 5: Compares and validates the outputs of IntelliBot with three existing chatbots to demonstrate IntelliBot’s accuracy and superiority in engaging with the users while answering their questions.
IntelliBot was trained on the Cornell movie dialogue corpus and the insurance dataset to give specialised answers to questions. To test the effectiveness of IntelliBot’s responses, this study compared its responses with those of three other chatbots. Each response was evaluated using an expert’s judgement to determine its completeness. The results demonstrated IntelliBot’s superiority in providing the user with a complete answer and engaging the user in a dialogue.
To the best of the author’s knowledge, such a complete answer generation method which also enables the chatbot to engage the user in a meaningful conversation has not been discussed in the literature.
9.3 Future Work Arising from this Thesis
While developing and implementing IntelliBot, several areas for future enhancement of the model emerged which merit future research attention to enhance its accuracy. Such future work falls into different areas: open domains, response generation techniques, neural network models, performance evaluation and grammar correction. The possible expansions in each of these areas are briefly explained in the following sub-sections.
9.3.1 Future improvement for the chatbot to be domain independent
A continuation of this thesis work might focus on developing open-domain, human-like conversational models by applying the ideas presented in this thesis. The specific domain of customer service is well suited to chatbot use and is already very systematic. The future holds promising results, especially in the field of NLP: new methods are developed every day and older methods keep improving. Furthermore, it is possible that, in the near future, the roles of humans and chatbots might be reversed. To take this thought further, we might one day see a chatbot interacting with other chatbots using human language as a means of communication to improve the quality of life for mankind.
9.3.2 Future improvement in response generation techniques
To generate a response, IntelliBot comprises four response generation strategies and modules to interact with users using natural language. By combining multiple strategies and enhancing them with the use of context tracking as well as engagement, IntelliBot performs better in the experimental evaluation. As we used limited context tracking, it would be interesting to explore how to handle long coherent text which could open the
way for the use of many generative tasks [178]. One remedy might be to add a GAN model [179] to generate a response based on the current state of the conversation. Furthermore, using additional datasets such as Reddit, Twitter and DailyMail for training purposes could improve conversational strategies and reveal the true power of the model.
9.3.3 Future improvement in domain-oriented dataset
As part of the work of this thesis, a limited dataset in the insurance domain was created which comprises a set of insurance-related terms, keywords, insurance product descriptions, FAQ and QA pairs. This dataset is not comprehensive enough to answer all users’ questions because users often use specific and discriminating terms. Thus, missing information in the domain dataset can significantly impair the performance of IntelliBot. This indicates that there is still room to incorporate new vocabulary, insurance products and QA pairs in the specific domain.
9.3.4 Future improvement in the neural network model
Recurrent neural networks significantly outperformed the alternatives in all our experiments. However, one of the most obvious questions is how to generate answers or minimize the error rate for untrained QA pairs. To reduce the error rate in the network and ensure faster training, it is possible that further investigation into backpropagation for learning networks will provide additional improvements. Additionally, in attention-based seq2seq mechanisms, replacing the softmax function with a mutually exclusive probability distribution may ensure sentences are treated independently.
For sentence scoring, Jaccard similarity selection can be replaced by a more advanced sentence scoring technique with learnable weights. Such a technique is likely to improve appropriate response selection.
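As a point of comparison for such learned scoring techniques, the Jaccard similarity baseline mentioned above can be sketched as follows; the candidate responses and query are invented examples.

```python
# Jaccard similarity between two sentences, treated as sets of tokens:
# |intersection| / |union|. A minimal sketch of the sentence-scoring idea,
# not IntelliBot's full response-selection pipeline.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Pick the candidate response whose wording overlaps the query most.
candidates = [
    "car insurance covers accidental damage",
    "home insurance covers fire damage",
]
query = "does car insurance cover accidental damage"
best = max(candidates, key=lambda c: jaccard(query, c))
print(best)  # -> 'car insurance covers accidental damage'
```

A learnable scorer would replace the uniform set overlap with per-token weights tuned on training data, which is why it is expected to improve response selection.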
9.3.5 Future improvement in evaluation approach
As described in this thesis, several extensive experiments were conducted on four chatbots. Their performance was evaluated using expert judgement, and the responses of each chatbot were ranked based on the completeness of their answers. The experiment results
show that IntelliBot outperformed the other three existing chatbots. However, automated chatbot evaluation metrics could be explored further.
Another suggestion for future work is to compare the performance of IntelliBot which uses LSTM with other chatbots that use transformer-based models to generate a response [180, 181].
9.3.6 Future improvement in correcting grammatical errors and identifying abbreviations
Chapter 6 discussed grammatical error correction and the identification of abbreviations. Our model is able to ask the user for confirmation when it is uncertain or an error is found in the user input. In this thesis, we did not make use of hand-tuned rules to understand the meaning of the user query more accurately; rather, we suggested the enhancement of a neural attention-based model for sentence-level error correction. For the encoder-decoder, modelling at the synset level was beneficial, even though it made similar predictions at the sentence level and word level. It would be interesting to push this further to eliminate the need for an initial tokenization step in order to generalize the approach to other languages such as Hindi, Bengali, Chinese and Japanese. Furthermore, GEC will improve sentence structure and ensure sentences are error-free, which leads to consistency in the context, helps in choosing the right strategy to generate a response and improves response appropriateness.
9.3.7 Future improvement in unsupervised and self-learning capability
There is still a long way to go in the development of conversational agents before they can be used in a completely unsupervised manner. Future work needs to consider how to leverage reinforcement learning to further improve the relevance of the generated responses and prevent the model from generating egregious responses [180] as well as developing the self-learning capability of the chatbot. The dialogue history with the user contains a significant amount of information about user intent and identity and if IntelliBot is able to learn this, it will enable its responses to be more personalized.
9.3.8 Future improvement in speech chatbots
This thesis presented a dialogue-based chatbot which cannot handle speech recognition. The scope to which this type of model can be scaled to much larger and wider domains remains an open question which could be pursued in future work. The implementation of speech recognition could improve user engagement in a long conversation.
REFERENCES
[1] A. M. Turing, "I.—Computing Machinery and Intelligence," Mind, vol. LIX, no. 236, pp. 433-460, 1950.
[2] J. Weizenbaum, "ELIZA: a computer program for the study of natural language communication between man and machine," Commun. ACM, vol. 9, no. 1, pp. 36-45, 1966.
[3] J. Gao, M. Galley, and L. Li, "Neural approaches to conversational AI," Foundations and Trends® in Information Retrieval, vol. 13, no. 2-3, pp. 127-298, 2019.
[4] K. Nimavat and T. Champaneria, "Chatbots: An overview. Types, Architecture, Tools and Future Possibilities," IJSRD-International Journal for Scientific Research and Development, 2017.
[5] O. Vinyals and Q. Le, "A neural conversational model," arXiv preprint arXiv:1506.05869, 2015.
[6] I. V. Serban et al., "A hierarchical latent variable encoder-decoder model for generating dialogues," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[7] G. M. D. Silva, S. Thakare, S. More, and J. Kuriakose, "Real world smart chatbot for customer care using a software as a service (SaaS) architecture," in 2017 International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2017, pp. 658-664.
[8] M. Nuruzzaman and O. K. Hussain, "A Survey on Chatbot Implementation in Customer Service Industry through Deep Neural Networks," in 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), 2018, pp. 54-61: IEEE.
[9] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, "The design and implementation of XiaoIce, an empathetic social chatbot," arXiv preprint arXiv:1812.08989, 2018.
[10] Replika. (2019). The AI companion who cares. Available: https://help.replika.ai/hc/en-us/articles/115001070951-What-is-Replika-
[11] S. Ravi, "ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections," 08/02 2017.
[12] K. Johnson, "Facebook Messenger passes 300,000 bots," in venturebeat.com, ed. https://venturebeat.com/2018/05/01/facebook-messenger-passes-300000-bots/, 2018.
[13] S. Shead, "The Skype Mafia: Who Are They And Where Are They Now?," in AI & Big Data, https://www.forbes.com/sites/samshead/2019/08/21/the-skype-mafia-who-are-they-and-where-are-they-now/#769c6d447399, ed: Forbes, 2019.
[14] J. Russell, "Telegram now lets users buy things from chatbots in its messaging app," in https://techcrunch.com/2017/05/18/telegram-launches-chatbot-payments/, ed: Verizon Media, 2017.
[15] Drift, "Chatbot," in https://blog.drift.com/chatbots-report, ed: Drift, 2019.
[16] S. Suthar. (2018) How Chatbot Helps Businesses Improve Customer Service? Becoming Human: Artificial Intelligence Magazine.
[17] C. Shaw, "15 Statistics That Should Change the Business World – But Haven't," in https://beyondphilosophy.com/15-statistics-that-should-change-the-business-world-but-havent/#, ed: Beyond Philosophy, 2013.
[18] D. Polani, "Emotionless chatbots are taking over customer service – and it’s bad news for consumers," in http://theconversation.com/emotionless-chatbots-are-taking-over-customer-service-and-its-bad-news-for-consumers-82962, M. Ketchell, Ed., ed: The Conversation Media Group, 2017.
[19] J. Cahn, "CHATBOT: Architecture, design, & development," Senior Thesis, Department of Computer Information Science, University of Pennsylvania EAS499, 2017.
[20] M. Nuruzzaman and O. K. Hussain, "IntelliBot: A Dialogue-based chatbot for the insurance industry," Knowledge-Based Systems, p. 105810, 2020.
[21] A. Augello, O. Gambino, V. Cannella, R. Pirrone, S. Gaglio, and G. Pilato, "An Emotional Talking Head for a Humoristic Chatbot," Applications of Digital Signal Processing, vol. 319, 2011.
[22] B. Morgan, "How Artificial Intelligence will Impact the Insurance Industry," in Forbes, ed, 2017.
[23] M. Siddiqui and T. Ghosh Sharma, Analyzing customer satisfaction with service quality in life insurance services. 2010.
[24] MarketTools. (2018, Retrieved 4/5/2018). Measuring and Improving Customer Satisfaction in the Insurance Industry. Available: http://www.customerthink.com/files2/MarketTools%20CustomerSat_Insurance%20Industry%20Solution%20Brief.pdf
[25] H. Aksu, "Customer Service: The New Proactive Marketing," in huffingtonpost.com, https://www.huffingtonpost.com/hulya-aksu/customer-service-the-new-_b_2827889.html, ed, 2013.
[26] S. Barker, "How chatbots help," MHD Supply Chain Solutions, vol. 47, no. 3, p. 30, 2017.
[27] H. Chen, X. Liu, D. Yin, and J. Tang, "A Survey on Dialogue Systems: Recent Advances and New Frontiers," ACM SIGKDD Explorations Newsletter, vol. 19, no. 2, pp. 25-35, 2017.
[28] R. S. Wallace, "The Anatomy of A.L.I.C.E," in Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer, R. Epstein, G. Roberts, and G. Beber, Eds. Dordrecht: Springer Netherlands, 2009, pp. 181-210.
[29] Amazon Web Services, Inc. (2017, 23/04/2018). Amazon Lex – Build Conversation Bots. Available: https://docs.aws.amazon.com/lex/latest/dg/what-is.html
[30] V. Ilievski, C. Musat, A. Hossmann, and M. Baeriswyl, "Goal-oriented chatbot dialog management bootstrapping with transfer learning," arXiv preprint arXiv:1802.00500, 2018.
[31] I. V. Serban et al., "A deep reinforcement learning chatbot," arXiv preprint arXiv:1709.02349, 2017.
[32] B. Wilcox. (2011). Chatbots fail to convince judges that they're human.
[33] S. Hussain, O. A. Sianaki, and N. Ababneh, "A Survey on Conversational Agents/Chatbots Classification and Design Techniques," in Workshops of the International Conference on Advanced Information Networking and Applications, 2019, pp. 946-956: Springer.
[34] R. G. Athreya, A.-C. Ngonga Ngomo, and R. Usbeck, "Enhancing Community Interactions with Data-Driven Chatbots--The DBpedia Chatbot," in Companion Proceedings of The Web Conference 2018, 2018, pp. 143-146: International World Wide Web Conferences Steering Committee.
[35] C. Segura, À. Palau, J. Luque, M. R. Costa-Jussà, and R. E. Banchs, "Chatbol, a chatbot for the Spanish “La Liga”," in 9th International Workshop on Spoken Dialogue System Technology, 2019, pp. 319-330: Springer.
[36] Y. Zhang, R. Jin, and Z.-H. Zhou, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, vol. 1, no. 1, pp. 43-52, 2010.
[37] S. A. Abdul-Kader and J. Woods, "Survey on chatbot design techniques in speech conversation systems," International Journal of Advanced Computer Science and Applications, vol. 6, no. 7, 2015.
[38] W. S. Cooper, F. C. Gey, and D. P. Dabney, "Probabilistic retrieval based on staged logistic regression," in Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, 1992, pp. 198-210: ACM.
[39] R. Lowe, N. Pow, I. Serban, and J. Pineau, The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. 2015.
[40] J. Pennington, R. Socher, and C. Manning, GloVe: Global Vectors for Word Representation. 2014, pp. 1532-1543.
[41] K. Nimavat and T. Champaneria, Chatbots: An overview. Types, Architecture, Tools and Future Possibilities. 2017.
[42] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[43] Y. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil, Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models. 2017, pp. 2210-2219.
[44] V. Mnih et al., Human-level control through deep reinforcement learning. 2015, pp. 529-33.
[45] L. Bradeško and D. Mladenić, "A survey of chatbot systems through a Loebner Prize competition," in Proceedings of Slovenian Language Technologies Society Eighth Conference of Language Technologies, 2012, pp. 34-37.
[46] K. Ramesh, S. Ravishankaran, A. Joshi, and K. Chandrasekaran, "A Survey of Design Techniques for Conversational Agents," Singapore, 2017, pp. 336-350: Springer Singapore.
[47] J. Epstein and W. D. Klinkenberg, "From Eliza to Internet: a brief history of computerized assessment," Computers in Human Behavior, vol. 17, no. 3, pp. 295-314, 2001.
[48] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2017, ch. 28, pp. 418-440.
[49] C. Lemaitre, C. A. Reyes, and J. Gonzalez, "Advances in Artificial Intelligence - IBERAMIA 2004," in 9th Ibero-American Conference on AI, Puebla, México, November 22-26, 2004, vol. 3315, pp. 965-973.
[50] J. Weizenbaum, "A response to Donald Michie," International Journal of Man-Machine Studies, vol. 9, no. 4, pp. 503-505, 1977.
[51] B. Shawar and E. Atwell, A comparison between Alice and Elizabeth chatbot systems. 2002.
[52] S. Worswick. (2010, Retrieved 04/05/2018). Mitsuku Chatbot: Mitsuku now available to talk on Kik messenger. Available: https://www.pandorabots.com/mitsuku/
[53] R. Higashinaka et al., "Towards an open-domain conversational system fully based on natural language processing," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 928-939.
[54] O. Vinyals and Q. Le, A Neural Conversational Model. 2015.
[55] R. Carpenter. (1997). Cleverbot.
[56] J. Hill, W. Ford, and I. Farreras, Real conversations with artificial intelligence: A comparison between human–human online conversations and human–chatbot conversations. 2015.
[57] D. Dumik. (2015, 23/04/2018). Chatfuel. Available: https://everipedia.org/wiki/chatfuel/
[58] C. Nay, "Knowing what it knows: selected nuances of Watson’s strategy," in IBM Research News, ed: IBM, 2011.
[59] Microsoft. (2015). Microsoft Cognitive Services: LUIS. Available: https://www.luis.ai/home
[60] Google. (2010, 23/04/2018). Dialogflow. Available: https://dialogflow.com/
[61] B. Kitchenham, O. P. Brereton, D. Budgen, M. Turner, J. Bailey, and S. Linkman, "Systematic literature reviews in software engineering–a systematic literature review," Information and Software Technology, vol. 51, no. 1, pp. 7-15, 2009.
[62] M. A. Walker, R. Passonneau, and J. E. Boland, "Quantitative and qualitative evaluation of DARPA Communicator spoken dialogue systems," in Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, 2001, pp. 515-522: Association for Computational Linguistics.
[63] J. Williams, A. Raux, D. Ramachandran, and A. Black, "The dialog state tracking challenge," in Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 404-413.
[64] J. Ramos, "Using tf-idf to determine word relevance in document queries," in Proceedings of the first instructional conference on machine learning, 2003, vol. 242, pp. 133-142: Piscataway, NJ.
[65] H. Wang, Z. Lu, H. Li, and E. Chen, "A dataset for research on short-text conversations," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 935-945.
[66] J. Cerezo, J. Kubelka, R. Robbes, and A. Bergel, "Building an expert recommender chatbot," in Proceedings of the 1st International Workshop on Bots in Software Engineering, 2019, pp. 59-63: IEEE Press.
[67] A. Mondal, M. Dey, D. Das, S. Nagpal, and K. Garda, "Chatbot: An automated conversation system for the educational domain," in 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2018, pp. 1-5: IEEE.
[68] X. Li, Y.-N. Chen, L. Li, J. Gao, and A. Celikyilmaz, "End-to-end task-completion neural dialogue systems," arXiv preprint arXiv:1703.01008, 2017.
[69] D. Hakkani-Tür et al., "Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM," in Interspeech, 2016, pp. 715-719.
[70] J.-C. Gu, Z.-H. Ling, Y.-P. Ruan, and Q. Liu, "Building Sequential Inference Models for End-to-End Response Selection," arXiv preprint arXiv:1812.00686, 2018.
[71] M. Nuruzzaman and O. Hussain, Identifying facts for chatbot's question answering via sequence labelling using recurrent neural networks. 2019, pp. 1-7.
[72] Gasic et al., "Speech and signal processing," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5371-5375: IEEE.
[73] Y. Wu, W. Wu, C. Xing, C. Xu, Z. Li, and M. Zhou, "A sequential matching framework for multi-turn response selection in retrieval-based chatbots," Computational Linguistics, vol. 45, no. 1, pp. 163-197, 2019.
[74] H. Liu et al., "RubyStar: a non-task-oriented mixture model dialog system," arXiv preprint arXiv:1711.02781, 2017.
[75] J. Gu, Z. Lu, H. Li, and V. O. Li, "Incorporating copying mechanism in sequence-to-sequence learning," arXiv preprint arXiv:1603.06393, 2016.
[76] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway Networks," ed, 2015.
[77] C.-W. Lee, Y.-S. Wang, T.-Y. Hsu, K.-Y. Chen, H.-y. Lee, and L.-s. Lee, Scalable Sentiment for Sequence-to-sequence Chatbot Response with Performance Analysis. 2018.
[78] M.-T. Luong, H. Pham, and C. Manning, Effective Approaches to Attention-based Neural Machine Translation. 2015.
[79] Y. Wu, Z. Li, W. Wu, and M. Zhou, "Response selection with topic clues for retrieval-based chatbots," Neurocomputing, vol. 316, pp. 251-261, 2018.
[80] P.-J. Chen, I. H. Hsu, Y.-Y. Huang, and H.-Y. Lee, Mitigating the impact of speech recognition errors on chatbot using sequence-to-sequence model. 2017, pp. 497-503.
[81] E. Stroh and P. Mathur, "Question Answering using Deep Learning," 2016.
[82] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Thirteenth annual conference of the international speech communication association, 2012.
[83] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," 2014.
[84] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," presented at the Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, 2014.
[85] Z. Yin, K.-h. Chang, and R. Zhang, "DeepProbe: Information Directed Sequence Understanding and Chatbot Design via Recurrent Neural Networks," presented at the Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 2017.
[86] M.-T. Luong, H. Pham, and C. D. Manning, "Effective Approaches to Attention-based Neural Machine Translation," arXiv preprint arXiv:1508.04025, 2015.
[87] D. G. Mctavish and H. J. Loether, "Social Research: An Evolving Process," 2001.
[88] D. Amaratunga, D. Baldry, M. Sarshar, and R. Newton, "Quantitative and qualitative research in the built environment: application of “mixed” research approach," Journal of Work Study, 2002.
[89] T. R. Black, Doing quantitative research in the social sciences: An integrated approach to research design, measurement and statistics. Sage, 1999.
[90] B. Kaplan and J. A. Maxwell, "Qualitative Research Methods for Evaluating Computer Information Systems," in Evaluating the Organizational Impact of Healthcare Information Systems, J. G. Anderson and C. E. Aydin, Eds. New York, NY: Springer New York, 2005, pp. 30-55.
[91] J. F. Nunamaker Jr, M. Chen, and T. D. Purdin, "Systems development in information systems research," Journal of Management Information Systems, vol. 7, no. 3, pp. 89-106, 1990.
[92] W. D. Callister Jr and D. G. Rethwisch, Fundamentals of materials science and engineering: an integrated approach. John Wiley & Sons, 2012.
[93] A. Pasquarelli and J. Wohl, "Betting on Bots," Advertising Age, vol. 88, no. 14, p. 14, 2017.
[94] F. Daniel, M. Matera, V. Zaccaria, and A. Dell'Orto, "Toward truly personal chatbots: on the development of custom conversational assistants," in Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services, 2018, pp. 31-36: ACM.
[95] N. M. Radziwill and M. C. Benton, "Evaluating Quality of Chatbots and Intelligent Conversational Agents," arXiv preprint arXiv:1704.04579, 2017.
[96] I. Steele. (2018). Chatbot Do’s and Don’ts – These Are the Best and Worst Chatbot Practices. Available: https://www.comm100.com/blog/chatbot-best-worst-practices.html#give
[97] A. D. Prospero, N. Norouzi, M. Fokaefs, and M. Litoiu, "Chatbots as assistants: an architectural framework," presented at the Proceedings of the 27th Annual International Conference on Computer Science and Software Engineering, Markham, Ontario, Canada, 2017.
[98] D. Z. Hakkani-Tür et al., "Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM," in INTERSPEECH, 2016.
[99] D. B. J. L. Guralnik, "Webster's New World Dictionary Of The American," vol. 300, no. 8, 1972.
[100] M. Nuruzzaman and O. K. Hussain, "Identifying facts for chatbot's question answering via sequence labelling using recurrent neural networks," presented at the Proceedings of the ACM Turing Celebration Conference - China, Chengdu, China, 2019.
[101] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[102] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in ACL, 2014.
[103] X. Li, Z. C. Lipton, B. Dhingra, L. Li, J. Gao, and Y.-N. Chen, "A User Simulator for Task-Completion Dialogues," CoRR, vol. abs/1612.05688, 2016.
[104] R. Wallace, "The elements of AIML style," Alice AI Foundation, vol. 139, 2003.
[105] M. d. G. B. Marietto et al., "Artificial intelligence markup language: A brief tutorial," 2013.
[106] M. Hijjawi, Z. Bandar, and K. Crockett, "A General Evaluation Framework for Text Based Conversational Agent," International Journal of Advanced Computer Science and Applications, vol. 7, 2016.
[107] H. Yang, T.-S. Chua, S. Wang, and C.-K. Koh, "Structured use of external knowledge for event-based open domain question answering," presented at the Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada, 2003. Available: https://doi.org/10.1145/860435.860444
[108] F. Wang, G. Teng, L. Ren, and J. Ma, "Research on mechanism of agricultural FAQ retrieval based on ontology," in 2008 Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2008, pp. 955-958: IEEE.
[109] Z. Zheng, "AnswerBus question answering system," presented at the Proceedings of the second international conference on Human Language Technology Research, San Diego, California, 2002.
[110] W. Chung-Hsien, Y. Jui-Feng, and L. Yu-Sheng, "Semantic segment extraction and matching for Internet FAQ retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 7, pp. 930-940, 2006.
[111] G. Walsham, "Knowledge Management: The Benefits and Limitations of Computer Systems," European Management Journal, vol. 19, no. 6, pp. 599-608, 2001.
232
[112] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin, "Knowledge base completion via search-based question answering," in Proceedings of the 23rd international conference on World wide web, 2014, pp. 515-526: ACM. [113] S. Zhang, H. Jiang, M. Xu, J. Hou, and L. Dai, "The fixed-size ordinally-forgetting encoding method for neural network language models," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 495-500. [114] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE transactions on neural networks, vol. 5, no. 2, pp. 157-166, 1994. [115] A. Schmaltz, Y. Kim, A. Rush, and S. Shieber, Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction. 2016, pp. 242-251. [116] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Eleventh annual conference of the international speech communication association, 2010. [117] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014. [118] R. Csáky, Deep Learning Based Chatbot Models. 2017. [119] D. A. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," AI Magazine, vol. 31, pp. 59-79, 2010. [120] K. Ganesan, "All you need to know about text preprocessing for NLP and Machine Learning," in https://www.kdnuggets.com/2019/04/text-preprocessing-nlp- machine-learning.html, G. P.-S. M. Mayo, Ed., ed: KDnuggets, 2019. [121] B. Jurish and K.-M. Wurzner, "Word and Sentence Tokenization with Hidden Markov Models," The Journal for Language Technology and Computational Linguistics, vol. 28, no. 2, pp. 61-83, 2013. [122] N. Okazaki and S. 
Ananiadou, "A term recognition approach to acronym recognition," in Proceedings of the COLING/ACL on Main conference poster sessions, 2006, pp. 643-650: Association for Computational Linguistics. [123] Y. Park and R. J. Byrd, "Hybrid text mining for finding abbreviations and their definitions," in Proceedings of the 2001 conference on empirical methods in natural language processing, 2001. [124] W. Tao, D. Deng, and M. J. P. o. t. V. E. Stonebraker, "Approximate string joins with abbreviations," vol. 11, no. 1, pp. 53-65, 2017. [125] M. S. Neff, R. J. Byrd, and B. K. Boguraev, "The Talent system: T EXTRACT architecture and data model," Natural Language Engineering, vol. 10, no. 3-4, pp. 307-326, 2004. [126] N. Wacholder, Y. Ravin, and M. Choi, "Disambiguation of proper names in text," in Proceedings of the fifth conference on Applied natural language processing, 1997, pp. 202-208: Association for Computational Linguistics. [127] M. Collins, "Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms," in Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, 2002, pp. 1-8: Association for Computational Linguistics. [128] L. Liu et al., "Empower sequence labeling with task-aware neural language model," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [129] R. Huddleston and G. Pullum, "The Cambridge grammar of the English language," Zeitschrift für Anglistik und Amerikanistik, vol. 53, no. 2, pp. 193-194, 2005. 233
[130] Wordvisers, "2016 Report: The Most Common English Writing Errors," Wordvisers2016. [131] J. M. Buys, "Probabilistic tree transducers for grammatical error correction," Stellenbosch: Stellenbosch University, 2013. [132] C. Leacock, M. Chodorow, M. Gamon, and J. Tetreault, "Automated grammatical error detection for language learners," Synthesis lectures on human language technologies, vol. 3, no. 1, pp. 1-134, 2010. [133] S. Ahmadi, Attention-based Encoder-Decoder Networks for Spelling and Grammatical Error Correction. 2018. [134] V. Kumar, "Automatic Grammar Correction: Using PCFGs and Whole Sentence Context," Master of Science, Computer Science, UNIVERSITY OF CALIFORNIA, SAN DIEGO, 1514824, 2012. [135] A. Schofield, M. Magnusson, and D. Mimno, "Pulling out the stops: Rethinking stopword removal for topic models," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, 2017, vol. 2, pp. 432-436: Association for Computational Linguistics. [136] C. Sutton and A. McCallum, "An introduction to conditional random fields," Foundations and Trends® in Machine Learning, vol. 4, no. 4, pp. 267-373, 2012. [137] D. Lin, "An information-theoretic definition of similarity," in Icml, 1998, vol. 98, no. 1998, pp. 296-304. [138] R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE transactions on systems, man, and cybernetics, vol. 19, no. 1, pp. 17-30, 1989. [139] A. Pawar and V. Mago, "Calculating the similarity between words and sentences using a lexical database and corpus statistics," arXiv preprint arXiv:1802.05667, 2018. [140] Y. Li, D. McLean, Z. A. Bandar, J. D. O'shea, and K. Crockett, "Sentence similarity based on semantic nets and corpus statistics," Journal of IEEE transactions on knowledge, vol. 18, no. 8, pp. 1138-1150, 2006. [141] T. 
Slimani, "Description and Evaluation of Semantic Similarity Measures Approaches," International Journal of Computer Applications, vol. Vol 80, pp. 25-33, 10/01 2013. [142] M. Choudhari, "Extending the hirst and St-Onge measure of semantic relatedness for the unified medical language system," 2012. [143] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008. [144] J. L. McClendon, "Optimization of a language model for the classification of natural language queries in a script based conversational agent (Order No. 3722420)." Doctor of Philosophy, Clemson University, ProQuest Dissertations Publishing, 3722420, 2015. [145] G. Luo, X. Huang, C.-Y. Lin, and Z. Nie, "Joint entity recognition and disambiguation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 879-888. [146] W. Yang and J. Wang, "Generating Appropriate Question-Answer Pairs for Chatbots using Data Harvested from Community-based QA Sites," in KDIR, 2017, pp. 342- 349. [147] L. Becker, S. Basu, and L. Vanderwende, "Mind the gap: learning to choose gaps for question generation," in Proceedings of the 2012 Conference of the North American 234
Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 742-751: Association for Computational Linguistics. [148] H. Zhang, M. Zhu, and H. Wang, "A Retrieval-Based Matching Approach to Open Domain Knowledge-Based Question Answering," Cham, 2018, pp. 701-711: Springer International Publishing. [149] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2657-2661: IEEE. [150] T. Robinson, J. Holdsworth, R. Patterson, and F. Fallside, "A comparison of preprocessors for the Cambridge recurrent error propagation network speech recognition system," in First International Conference on Spoken Language Processing, 1990. [151] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015. [152] K. Katanforoosh and D. Kunin, "Initializing neural networks," Accessed on: 11/01/2019 Available: https://www.deeplearning.ai/ai-notes/initialization/ [153] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249-256. [154] Y. Ito, "Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory," Neural Networks, vol. 4, no. 3, pp. 385-394, 1991. [155] P. Sibi, S. A. Jones, and P. Siddarth, "Analysis of different activation functions using back propagation neural networks," Journal of Theoretical and Applied Information Technology, vol. 47, no. 3, pp. 1264-1268, 2013. [156] Y. Li and Y. Yuan, "Convergence analysis of two-layer neural networks with relu activation," in Advances in Neural Information Processing Systems, 2017, pp. 597- 607. [157] W. Liu, Y. Wen, Z. Yu, and M. 
Yang, "Large-margin softmax loss for convolutional neural networks," in ICML, 2016, vol. 2, no. 3, p. 7. [158] J. Y. Yam and T. W. Chow, "A weight initialization method for improving training speed in feedforward neural network," Journal of Neurocomputing, vol. 30, no. 1-4, pp. 219-232, 2000. [159] D. Hendrycks and K. Gimpel, "Generalizing and improving weight initialization," arXiv preprint arXiv:.02488, 2016. [160] C.-C. Chiu et al., "An online sequence-to-sequence model for noisy speech recognition," arXiv preprint arXiv:1706.06428, 2017. [161] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau, "Building end-to-end dialogue systems using generative hierarchical neural network models," in Thirtieth AAAI Conference on Artificial Intelligence, 2016. [162] S. Akasaki and N. Kaji, "Chat detection in an intelligent assistant: Combining task- oriented and non-task-oriented spoken dialogue systems," arXiv preprint arXiv:1705.00746, 2017. [163] T. Mikolov, K. Chen, G. Corrado, and J. J. a. p. a. Dean, "Efficient estimation of word representations in vector space," 2013. [164] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111-3119.
235
[165] N. Garg, L. Schiebinger, D. Jurafsky, and J. Zou, "Word embeddings quantify 100 years of gender and ethnic stereotypes," Proceedings of the National Academy of Sciences, vol. 115, no. 16, pp. E3635-E3644, 2018. [166] G. B. Orr and K.-R. Müller, Neural networks: tricks of the trade. Springer, 2003. [167] S. Ioffe and C. J. a. p. a. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015. [168] J. Whang, S. Stanford, and A. Matsukawa, "Exploring Batch Normalization in Recurrent Neural Networks." [169] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in neural information processing systems, Montreal, Canada, 2014, vol. 2, pp. 3104-3112: MIT Press. [170] D. Masters and C. Luschi, "Revisiting small batch training for deep neural networks," arXiv preprint arXiv:1804.07612, 2018. [171] Y. Bengio, "Practical recommendations for gradient-based training of deep architectures," in Neural networks: Tricks of the trade: Springer, 2012, pp. 437-478. [172] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014. [173] A. Show, "Tell: Neural image caption generation with visual attention," Kelvin Xu et. al.. arXiv Pre-Print, vol. 23, 2015. [174] Sirajology, "RootyAI " source code retrieve from https://github.com/nuruzzaman/RootyAI on 15 February, 2019. [175] G. Cox, "ChatterBot retrieve from https://github.com/gunthercox/ChatterBot on 12 February," 2019. [176] A. Garg and V. Polamreddi, "Understanding Hollywood through Dialogues," 2016. [177] J. L. Fleiss and J. Cohen, "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Educational and psychological measurement, vol. 33, no. 3, pp. 613-619, 1973. [178] N. Kitaev, Ł. Kaiser, and A. 
Levskaya, "Reformer: The Efficient Transformer," in International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020. [179] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. J. a. p. a. Jurafsky, "Adversarial learning for neural dialogue generation," 2017. [180] Y. Zhang et al., "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation," arXiv preprint arXiv:1908.01841, 2019. [181] V. Vlasov, J. E. Mosig, and A. Nichol, "Dialogue Transformers," 2019.
236
APPENDIX A
First Page of Published Peer Reviewed Papers
APPENDIX B
INTELLIBOT INSTALLATION AND CONFIGURATION INSTRUCTIONS
This section explains how IntelliBot can be installed and trained.
Environment Setup: To develop IntelliBot, this study used an Anaconda environment with Python 3.6 and PyCharm as the IDE. After installing these two pieces of software, make sure the environment variable PYTHONPATH is set. It must point to the project root directory, which contains the chatbot, Data, and webui folders. If you run the project from PyCharm, the IDE sets this for you; but if you run any Python script from the command line, the variable must be set manually, otherwise you will get module import errors.
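On Linux or macOS, for example, the variable can be exported in the shell before launching a script. The path below is an assumption for illustration; substitute the actual location of your project root.

```shell
# Hypothetical project location; point this at the folder that
# contains chatbot/, Data/ and webui/.
export PYTHONPATH="$HOME/IntelliBot_3.6"
echo "$PYTHONPATH"
```

On Windows, the equivalent setting is made under System Properties > Environment Variables.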
Download Datasets: IntelliBot was trained on the following publicly available datasets. Please download these datasets and extract them into your project’s ‘data’ folder.
• Cornell Movie Corpus http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
• Reddit Dataset https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
• Pre-trained dataset https://drive.google.com/file/d/1mVWFScBHFeA7oVxQzWb8QbKfTi3TToUr/view
• Install the required NLTK data from a terminal:
python -m nltk.downloader punkt
python -m nltk.downloader stopwords
Tools and Libraries Requirements: The following Python libraries must be installed from the Anaconda terminal using the command: pip install -r requirements.txt
The requirements are: tensorflow 1.14, aiml, lxml, beautifulsoup4, numpy, mysql-connector-python, wikipedia, flask, wordsegment, google-cloud-core, configparser, requests, nltk, tqdm, future, networkx, stanfordcorenlp, django, pyttsx3, pyaudio, scikit-learn, colorama, scipy, h5py, tflearn, language-check.
Directory Structure: The IntelliBot project has seven main folders. The top-level structure includes the following folders:
• Code
• Data
• Log
• Model
• Resources
• Web
Hyperparameters setting: A GPU is highly recommended for training, as training can be very time-consuming. You can adjust the batch_size parameter in the hparams.json file to make full use of the available memory. The training results will appear under the Data/Result/ folder. Make sure the following two files exist, as both will be required for testing and prediction (the .meta file is optional, as the inference model will be created independently):
1. basic.data-00000-of-00001 2. basic.index
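Adjusting batch_size can also be scripted. The sketch below is hypothetical — the real hparams.json schema is not reproduced in this appendix, so the key names are assumptions for illustration, and the demo works on a throwaway copy rather than the project's real file.

```python
import json
import os
import tempfile

# Hypothetical sketch: key names below are assumptions, not the
# actual hparams.json schema used by IntelliBot.
def set_batch_size(path, batch_size):
    """Rewrite a hparams.json file with a new batch_size value."""
    with open(path) as f:
        hparams = json.load(f)
    hparams["batch_size"] = batch_size
    with open(path, "w") as f:
        json.dump(hparams, f, indent=2)
    return hparams

# Demo on a temporary file.
path = os.path.join(tempfile.mkdtemp(), "hparams.json")
with open(path, "w") as f:
    json.dump({"batch_size": 128, "num_units": 1024}, f)
print(set_batch_size(path, 64)["batch_size"])  # -> 64
```

Smaller batch sizes reduce GPU memory use at the cost of slower epochs, so tune this value to the largest size that fits in memory.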
Execute Training: Training is straightforward. First create a folder named Result under the Data folder, then run the following commands:
cd IntelliBot_3.6 python trainer.py
During training, I suggest experimenting with the colocate_gradients_with_ops parameter of the tf.gradients function. You can find a line like this in modelcreator.py: gradients = tf.gradients(self.train_loss, params). Set colocate_gradients_with_ops=True (by adding it), run the training for at least one epoch, and note down the time; then set it to False (or simply remove it), run the training for at least one more epoch, and see whether the times required per epoch differ significantly.
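The comparison above amounts to timing one epoch under each setting. A minimal, framework-agnostic timing helper might look like this (the workload below is a stand-in; substitute the actual epoch loop from trainer.py):

```python
import time

def time_run(train_one_epoch):
    """Measure the wall-clock time of one call, e.g. one training epoch."""
    start = time.perf_counter()
    train_one_epoch()
    return time.perf_counter() - start

# Stand-in workload for illustration; in practice, call this once with
# colocate_gradients_with_ops=True and once with it False, and compare.
elapsed = time_run(lambda: sum(i * i for i in range(100_000)))
print(elapsed >= 0)  # -> True
```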
Execute the Server: IntelliBot runs on two servers: an NLP server and an application server.
• Running the NLP Server: Please download Stanford CoreNLP from the following link and extract it: https://stanfordnlp.github.io/CoreNLP/ Then go to your ‘stanford-corenlp’ folder and run the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
• Running the Chat Server:
cd IntelliBot_3.6/web/
python server.py
Then open your browser at: http://127.0.0.1:5000/
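The chat server's request/response loop can be illustrated with a minimal standard-library sketch. This is hypothetical: the real server.py uses Flask and routes messages through the trained model, whereas this stand-in merely echoes, and the JSON field names are assumptions.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for IntelliBot's chat endpoint: it echoes the
# user's message instead of querying the trained model.
class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        message = json.loads(self.rfile.read(length))["message"]
        body = json.dumps({"reply": f"You said: {message}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Start the server on a free port and send one chat message to it.
server = HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/",
    data=json.dumps({"message": "hello"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply_text = json.loads(resp.read())["reply"]
print(reply_text)  # -> You said: hello
server.shutdown()
```

The real application server follows the same pattern, but hands the incoming message to the trained model and returns its generated response.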
Testing/Inference: For testing and prediction, we provide both a simple command-line interface and a web-based interface. Note that the vocab.txt file (and, for this chatbot, the files in KnowledgeBase) is also required for inference. To quickly check how the trained model performs, use the following commands:
cd IntelliBot_3.6/web/
python server.py
APPENDIX C
SOURCE CODE OF DEVELOPING INTELLIBOT
The source code is copyrighted and will not be publicly disclosed.
The End