Clustering Semantically Related Questions
International Master's Thesis
Clustering Semantically Related Questions
Nikolaos Karagkiozis
Computer Science
Studies from the Department of Technology at Örebro University
Örebro 2019

© Nikolaos Karagkiozis, 2019
Title: Clustering Semantically Related Questions
ISSN

Abstract

There has been a vast increase in the number of users who use the internet to communicate and interact, and as a result, the amount of data created follows the same upward trend, making data handling overwhelming. Users are often asked to submit their questions on various topics of their interest, and usually that itself creates an information overload that is difficult to organize and process. This research addresses the problem of extracting the information contained in a large set of questions by selecting the most representative ones from the total number of questions asked. The proposed framework attempts to find semantic similarities between questions and group them into clusters. It then selects the most relevant question from each cluster. In this way, the questions selected are the most representative of all the submitted ones. To obtain the semantic similarities between the questions, two sentence embedding approaches, Universal Sentence Encoder (USE) and InferSent, are applied. To form the clusters, the k-means algorithm is used. The framework is evaluated on two large labelled data sets, SQuAD and House of Commons Written Questions. These data sets include ground truth information that is used to evaluate the effectiveness of the proposed approach. The results on both data sets show that the clusters produced with Universal Sentence Encoder (USE) match the class labels of the data sets better than those produced with InferSent.
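To make the pipeline summarized above concrete, the following is a minimal sketch of the embed, cluster, and select steps using publicly available tools: the Universal Sentence Encoder loaded from TensorFlow Hub for sentence embeddings and scikit-learn's k-means for clustering. The toy questions, the number of clusters, and the selection rule (taking the question whose embedding is most cosine-similar to its cluster centroid) are illustrative assumptions, not the thesis's exact data or code; the thesis's own method is detailed in Chapter 3.

```python
# Minimal sketch of the embed -> cluster -> select pipeline (illustrative, not the thesis code).
# Assumes tensorflow, tensorflow_hub and scikit-learn are installed.
import numpy as np
import tensorflow_hub as hub
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy questions standing in for a submitted question set.
questions = [
    "What is the recipe for the White House's beer?",
    "How is the beer served at the White House brewed?",
    "What is your plan for reducing the national debt?",
    "How will the government cut the deficit?",
]

# Step 1: sentence embeddings with the Universal Sentence Encoder (512-dimensional vectors).
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = use_model(questions).numpy()            # shape: (n_questions, 512)

# Step 2: group the embedding vectors with k-means (k chosen here only for the toy example).
k = 2
kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)

# Step 3: pick one representative question per cluster; here, the question whose
# embedding is most cosine-similar to its cluster centroid (an assumed selection rule).
for c in range(k):
    members = np.where(kmeans.labels_ == c)[0]
    centroid = kmeans.cluster_centers_[c].reshape(1, -1)
    sims = cosine_similarity(embeddings[members], centroid).ravel()
    best = members[int(np.argmax(sims))]
    print(f"Cluster {c} representative: {questions[best]}")
```

Swapping InferSent in for USE would only change Step 1 (producing 4096-dimensional vectors); the clustering and selection steps stay the same.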
Acknowledgements

I would first like to thank my thesis supervisor Hadi Banaee for his valuable support and guidance in the execution of this thesis. He was always there to answer my questions and to steer me in the right direction. His guidance was very valuable and important for the outcomes of this work. I would also like to thank all the tutors of the Department of Technology for all the knowledge they have provided me with during my journey through the Robotics and Intelligent Systems Master's programme, and I wish them success in all their future endeavours. Finally, I would like to thank my parents and my wife for the support that they have been offering me over the years. They are always there for me.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Contributions
  1.5 Thesis Outline
2 Related Work
  2.1 Textual Similarity and Clustering
    2.1.1 Text Vectorization
    2.1.2 Textual Similarity Measures
  2.2 Sentence/Question similarity
    2.2.1 Term Frequency-Inverse Document Frequency (Tf-idf)
    2.2.2 Word embeddings
    2.2.3 Unsupervised Sentence Embeddings
    2.2.4 Supervised Sentence Embeddings
    2.2.5 Universal Sentence Encoder (USE)
    2.2.6 Text/sentence/question Clustering
3 Method
  3.1 Framework Overview
  3.2 Preprocessing
    3.2.1 Data cleaning
    3.2.2 Sampling
  3.3 Question semantic similarity
    3.3.1 Text Vectorization
    3.3.2 Cosine Similarity
  3.4 Question Clustering
    3.4.1 k-means Algorithm
    3.4.2 Clustering Performance Measures
  3.5 Representative Question extraction
4 Results
  4.1 SQuAD: Stanford Question Answering
    4.1.1 Preprocessing
    4.1.2 Question semantic similarity
    4.1.3 Question Clustering
    4.1.4 Representative Question extraction
  4.2 House of Commons Written Questions
    4.2.1 Preprocessing
    4.2.2 Question semantic similarity
    4.2.3 Question Clustering
    4.2.4 Representative Question extraction
  4.3 Summary of Results
5 Conclusions
  5.1 Summary of Contributions
  5.2 Limitations
  5.3 Future Research Directions
References

List of Figures

1.1 Obama AMA session
2.1 Bag-of-Words Representation
2.2 Skip-thought model
2.3 Example sentences of the SNLI Corpus
2.4 USE example Heatmap
3.1 Global Approach Diagram
3.2 Framework diagram
3.3 Cosine Similarity Figure
3.4 Similarity matrix - USE
3.5 Similarity matrix - InferSent
3.6 Heatmap - Toy Example Questions, USE
3.7 Heatmap - Toy Example Questions, InferSent
3.8 Entire Toy Data Set Heatmap - USE
3.9 Entire Toy Data Set Heatmap - InferSent
3.10 k-means procedure
3.11 Toy data set UMAP plot - USE
3.12 Toy data set UMAP plot - InferSent
4.1 Heatmap of SQuAD - USE
4.2 Heatmap of SQuAD - InferSent
4.3 UMAP plot of SQuAD - USE
4.4 UMAP plot of SQuAD - InferSent
4.5 Heatmap of House of Commons Written Questions - USE
4.6 Heatmap of House of Commons Written Questions - InferSent
4.7 UMAP plot of House of Commons Written Questions - USE
4.8 UMAP plot of House of Commons Written Questions - InferSent

List of Tables

3.1 Labels per cluster index - USE
3.2 Labels per cluster index - InferSent
3.3 Most Representative Questions, Toy data set - USE
3.4 Most Representative Questions, Toy data set - InferSent
4.1 Number of Questions before and after Sampling - SQuAD
4.2 Labels per cluster index, SQuAD - USE
4.3 Labels per cluster index in SQuAD using InferSent
4.4 Most Representative Questions, SQuAD data set - USE
4.5 Most Representative Questions, SQuAD data set - InferSent
4.6 Number of Questions before and after Sampling - House of Commons Written Questions
4.7 Labels per cluster index, House of Commons Written Questions - USE
4.8 Labels per cluster index in House of Commons Written Questions using InferSent
4.9 Most Representative Questions, House of Commons Written Questions data set - USE
4.10 Most Representative Questions, House of Commons Written Questions data set - InferSent
4.11 Summary of the Similarity Matrix Results
4.12 Summary of the Clustering Performance Results

List of Algorithms

Chapter 1
Introduction

1.1 Motivation

With the evolution of technology during the last decades, user interaction has become more and more prevalent. In the past, the first stage of the World Wide Web allowed only static content sharing, and most of the published content was not user-generated. Nowadays, however, the means exist to make user interaction fast and easy. It is very common for large groups of people to interact and submit their questions or comments on specific topics of their interest, and there are various use cases where this interaction is observed. Examples of such cases include, but are not limited to, occasions where the audience is asked to submit questions to a speaker during a conference, or cases where a TV show presenter hosts a guest and the audience is asked to submit questions to the guest. One popular example is Reddit's "Ask Me Anything" (AMA) sessions [26], where the host of a session is either a famous person or an ordinary user. In these sessions, the interview takes place between the host and all the other everyday users of Reddit who want to ask the person questions. For example, the former president of the United States, Barack Obama, hosted a half-hour AMA session in 2012, during which more than 200,000 concurrent users visited the session page and around 22,000 questions were submitted, according to Reddit (see Figure 1.1). The topics of the questions were very general, ranging from finance-related topics to questions such as the recipe for the White House's beer [31]. Barack Obama answered just 10 of the 22,000 questions, most probably the first ones or the ones he found interesting, rather than a representative subset of questions that reflects most of them.

Figure 1.1: A screenshot of Reddit's AMA session, with Barack Obama as the host [31].

Another related example is the summary of the frequently asked questions on a website (i.e., FAQs), in which the total questions submitted by the users of the website need to be