DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Unsupervised text clustering using survey answers

THERESE STÅLHANDSKE

MATHIAS TÖRNQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Unsupervised text clustering using survey answers

Mathias Törnqvist and Therese Stålhandske

Degree Project in Engineering Physics, First Level at CSC Course code: SA114X

Supervisor: Pawel Herman
Examiner: Martin Viklund

May 20, 2017

Abstract

Text data mining is a growing research field in which machine learning and NLP are important technologies. There are multiple applications concerning the categorization of large sets of documents. The methods differ depending on the size of the documents; for short text documents, the information in each individual document is scant. The aim of this paper is to show how well unsupervised text clustering reflects existing class assignments and how sensitive the clustering is to the choice of text representation and feature selection. The raw data was collected from several national health surveys. Evaluation was made with a conditional entropy-based method called the V-measure, which connects the clusters to the categories. We show that some methods perform significantly better against raw data than others.

Acknowledgement

We would like to thank Gustav Svensson at Zerebra AB for the support with ideas and technical knowledge. Additionally, we want to express our gratitude to our supervisor Pawel Herman for his help and guidance in the subject.

Contents

1 Introduction
1.1 Problem description
1.2 Scope and objectives
1.3 Outline of report

2 Background
2.1 Text Analysis with ML-methods
2.1.1 Natural Language Processing (NLP)
2.1.2 Feature Selection
2.1.3 Clustering algorithms
2.1.4 Validation of clustering outcomes
2.2 Previous work
2.2.1 Short format text
2.2.2 Text representation with ML
2.2.3 Text clustering with K-means

3 Method
3.1 Description of the data set
3.2 Data Handling
3.2.1 Overview
3.2.2 Text representation
3.2.3 Binomial Separation
3.3 K-means clustering
3.3.1 Evaluation
3.3.2 Implementation

4 Results
4.1 Spell correction and statistics
4.2 Categories
4.3 V-measure
4.3.1 TF — TF-IDF — Bigram — Trigram
4.3.2 Final results
4.4 Distribution of answers

5 Discussion
5.1 General findings
5.2 Analysis of methods
5.3 Future work

6 Conclusion

Chapter 1

Introduction

A recognized method for collecting feedback about a service or a product is to use surveys. Survey responses come in different forms and can be divided into qualitative and quantitative sets (Jick, 1979). Quantitative answers, for example in numerical format, are easy to interpret and analyze in contrast to qualitative text answers. Responses to open-ended questions give qualitative information that can be vital for the analysis of a survey as a whole. They give the respondents a way to express ideas that, without detailed answers, would be missed (Salant et. al., 1994). Problems arise in handling a large number of surveys, and extracting information becomes tedious if done manually.

When analyzing a large set of text documents, machine learning (ML) methods are rising in popularity among scientists for the purpose of detecting topics. Grouping the answers into categories, or trying to fit them into categories of interest, are good applications for the field of ML; for example, detecting trending Twitter topics in real time, as shown in the work of (Hong et. al., 2010); (Kolcz et. al., 2014). Tweets show similarities to survey answers, as both have a low word count but a high degree of noise, e.g. in the form of spelling mistakes.

Motivated by the above, and with the idea of improving the analysis process of surveys, unsupervised methods were used with the aim of extracting information from open-ended answers. After clustering open-ended answers and applying different feature representation methods, we wanted to further improve the resulting clusters. A deeper understanding of how different groups answer plays a role in enhancing the information extracted from surveys, as well as in validating the pre-defined categories. It may be used to improve the construction of surveys and as a validation tool by extracting the information of interest from the questions asked. The chosen approach is to use state-of-the-art methods on different specific problems concerning the data set and to combine these in order to produce a good clustering.

1.1 Problem description

Several text classification methods exist, and the choice between them depends both on the desired outcome and on the data set at hand. One approach to classify a set of documents as either positive or negative is to label a small set of randomly selected training documents. Supervised ML methods can then train on the labeled set and use the result to label the remaining unlabeled data as either positive or negative. This is a time-consuming process, as documents have to be labeled manually, and it also comes with the risk of biased labeling. Finding natural clusters within the data set, without explicit labeling by a human, would simplify the process.

This thesis tries to imitate categories that have been manually assigned to a set of text documents by clustering the documents. We wanted to know how well an unsupervised approach performs in labeling short text documents. When differentiating clusters, there is a risk that the clusters may differ depending on the choice of data representation (Duda et. al., 2000). This motivated the examination and validation of different data representations, in order to investigate the effect of the representation on the clustering outcome. In order to evaluate the results, we relied on a portion of labeled data for cross-validation. The data set was provided by a survey company.

1.2 Scope and objectives

The objectives of the report can be described as follows:

• evaluating different text representations for short texts
• improving the clustering of text answers by using different settings, text representations and feature selection

The focus lies on k-means, described in section 2.1.3, while other kinds of clustering methods are not examined closer. The feature selection methods focus on the issues concerning the semantic relationship of words and the skewness of the data set. This leaves out other characteristics of the data: length of answers, connection to closed answers, demographics, and the possibility of multiple labels.

1.3 Outline of report

Firstly, in chapter 2, the subject of machine learning and its benefits in analyzing data are presented. Furthermore, previous work on natural language processing (NLP) and topic detection (TD) is presented. We explore studies that have been done on similar types of data sets, for example Twitter snippets. In chapter 3, the theoretical background for the methods and approaches used is discussed. The methods used, and the motivations behind the choices of these methods, are described in depth. Results are presented in chapter 4. We compare the different methods and investigate the efficiency and accuracy of the clustering in chapter 5. Ultimately, we discuss the use of the clustering method when analyzing short, high-dimensional texts and how feature selection can be used to improve the results.

Chapter 2

Background

2.1 Text Analysis with ML-methods

The ability to analyze texts is an important aspect of many different fields. A large portion of business-relevant information is represented in an unstructured form, primarily as text. Common examples are reports, email, chats, tweets, social media updates or any other documents containing mostly natural-language text (Russom, 2007). Usually, the idea of text analysis is to transform unstructured data and impose a structure on the corpus so that information can be extracted more easily. This can be done with different approaches. ML methods strive to automate the transformation into structured data and, without explicit programming, find patterns in the data set at hand (James et. al., 2013). Applying ML to the task of detecting topics in a document has been shown to be successful and computationally beneficial compared to manually driven processes (Russom, 2007).

ML algorithms can generally be divided into two sub-fields: unsupervised and supervised. The main difference lies in the formulation of the problem. In a supervised approach, labels or categories are known from the start and are used to generate a hypothesis when an unknown data point is introduced. Unsupervised algorithms, such as clustering algorithms, use no prior labeling to differentiate natural clusters in a data set (James et. al., 2013). One example of an application of an unsupervised method is topic detection within a corpus of documents (Blei, 2012).

2.1.1 Natural Language Processing (NLP)

When working with ML methods for text clustering, an important aspect to consider is how the data should be processed and represented. The field of NLP revolves around this interaction between computers and human languages. In this field, there are several interesting problems, which can be dated back to 1950, when Alan Turing published the paper Computing Machinery and Intelligence (Turing, 1950). He proposed what is now called the "Turing test". It was not the first formulation related to NLP, though it was a major turning point in the field. Since then, the research around NLP has mainly been based

around ML, especially since the 'statistical revolution' during the late 1980s (Johnson, 2009). The main objective of NLP is to find structures and patterns so that a computer can understand, generate and manage a natural language. When representing a text, a common approach is to represent words as discrete symbols and vectorize the document with the intention of retaining the most prominent information. Each word is replaced by a weight, and each document can then be quantified as a vector. If the semantic correlation between words is disregarded, treating every word as unique becomes troublesome when it comes to contextual classification. By including sense relations, part-of-speech tagging and combinations of word sequences, rather than analyzing word by word, the information provided can be enhanced, as more aspects of the sentimental meaning of the language are taken into account (Cavnar et. al., 1994).

A supervised feature representation method considers the correlation between features and the categories they belong to: finding strongly predictive features for a specific class. This can be especially useful when working with multiple categories and when the sizes of the categories are skewed. One problem that arises with a skewed data set is that the predictive features for the different categories do not account for the asymmetrical distribution, meaning that for smaller clusters, features important for differentiating these clusters would disappear in the vast number of other features. A method that works around this problem is binomial separation. The method uses the probabilities of a certain feature belonging, versus not belonging, to a certain class to find different weights and numerical representations for each word.

2.1.2 Feature Selection

In feature selection problems, the goal is to find the most relevant features that maximize the information about the input. It has been shown that there is a dependence between the accuracy of the classification and the dimensionality of the data. The accuracy rate can decline significantly if a redundant number of features is used, known as the Hughes phenomenon, making feature reduction a crucial part of a clustering problem (Alonso et. al, 2011). The advantages of feature selection have been shown by comparing accuracy rate and the use of data-processing power (Punch et. al., 1993).

Comparisons between different feature selection methods, as pointed out in the paper 'A survey on feature selection methods' by Chandrashekar (2013), can only be done by tests on the specific data set (Chandrashekar, 2013). He shows that a given algorithm may behave differently on different data sets, which may make it harder to predict an optimal model. In the evaluation of a selection model, Chandrashekar (2013) points out several aspects to consider. In particular, classification accuracy and the number of reduced features are used as comparative measurements between two different feature selection models.

2.1.3 Clustering algorithms

When clustering a data set, the goal is to group data points that show stronger similarities to each other than to the rest of the data set. For this task, different clustering algorithms use different definitions of clusters and different approaches to find them. A common notion of a cluster is that its members lie within close distance of each other, or that there are areas of high density in the data space. Clustering algorithms can give an important insight into which features are more prominent when differentiating groups, and also give a measure of the homogeneity of the data set. When working with text documents, clustering algorithms can be used for detecting topics by grouping the corpus and then separately defining suitable labels. Suitable in this context could mean different things and is a matter of subjective judgment. In general, the choice of algorithm depends on the specific data set at hand and on the intentions, or specifications, of the result (James et. al., 2013).

K-means

K-means is an unsupervised clustering method that has played a prominent role in the ML community due to its simplicity, easy implementation and efficiency (Jain, 2009). The algorithm defines K centroids, one for each class, and repeatedly moves them around to minimize the distance between each data point and the center of its cluster (Macqueen, 1967). The number of centroids is given by the input parameter K, which highly affects the outcome of the clustering.

One critical point when adapting k-means to a data set is the determination of the similarity measure. This depends on how the distance, or the degree of closeness, is defined between a data point and the centroid mean. Metrics such as Euclidean distance, cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient can be used with different degrees of success. Furthermore, the most efficient way of determining the choice of metric is by cross-validating the clustering results (Huang, 2008).
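To make the difference between such metrics concrete, the following minimal sketch (not from the thesis; plain NumPy assumed) compares Euclidean distance and cosine similarity on two toy term-frequency vectors, where one document is simply twice as long as the other.

```python
# Illustrative sketch (not from the thesis): Euclidean distance vs. cosine
# similarity on toy term-frequency vectors.
import numpy as np

a = np.array([2.0, 0.0, 1.0, 3.0])  # hypothetical tf vector for document A
b = np.array([4.0, 0.0, 2.0, 6.0])  # document B: same word proportions, twice the length

euclidean = np.linalg.norm(a - b)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.2f}")  # large, penalizes the length difference
print(f"Cosine similarity:  {cosine:.2f}")     # 1.00, insensitive to document length
```

The example illustrates why cosine similarity is often preferred for text: documents with the same word proportions are treated as identical regardless of their length.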

2.1.4 Validation of clustering outcomes

Validation of the outcomes of the clusters can be done in different ways by emphasizing different metrics (Halkidi et. al., 2001). Common approaches can be divided into external and internal criteria. Internal criteria can be measurements of intra- versus inter-cluster variance or of cluster compactness and separation. External criteria are used when additional information about the formation of the clusters is given beforehand. In a situation where the number of clusters and their respective sizes are approximately known beforehand, validation of the predicted clusters can be done (Novikov, 2012).

2.2 Previous work

Previous work of interest examines topic detection in short text documents. For survey answers in particular, little work has been done, but work handling Twitter tweets faces the same difficulties in terms of the length of the documents and the amount of noise. Noise in this context is not only spelling mistakes and word abbreviations, but also documents with very little or no information. When determining the text representation of the survey answers, we examined previous comparative studies of different feature representation models.

2.2.1 Short format text

Topic detection and tracking of documents have been driven by the increasing production of news and shorter texts, especially since the emergence of social media (Amayri et. al., 2013). A data format that has been the subject of several investigations is Twitter text snippets (Hong et. al., 2010); (Kolcz et. al., 2014). The goal of these studies has been organizing sparse and noisy texts into pre-defined categories. Kolcz et al. (2014) show that topic modeling can be highly influenced by the length of the targeted text, and that topic models learned from aggregated texts written by the same person may lead to superior performance in clustering problems. Concerning the evaluation of cluster quality, Hong et al. (2010) and Kolcz et al. (2014) use two different approaches: Kolcz et al. (2014) focus on statistical measures such as F-score, precision and recall, obtained by comparing the clusters to the human-made labeling, while Hong et al. (2010) presented tweet-topic pairs to humans and asked them to provide binary answers as to whether or not the category is correct for the given tweet. This approach was suggested since binary tasks are easier, so the human participant is less likely to make a mistake. To check the quality of the binary answers, a small set of tweets considered to have a high probability of being correctly labeled is introduced.

Texts extracted from social media, such as Twitter, are infamously noisy: they contain spelling mistakes, short-hand language and colloquialisms. Given such a data set, the handling of unidentified words has been shown to be important for further analysis of the corpus. Desai et al. (2015) propose a method for handling this: first identifying words not belonging to a given vocabulary and then processing them through a word shortening algorithm, common word replacement and lexical matching to find the most statistically probable correction for use in re-phrasing the sentence. Following this structure, the results have been shown to be more efficient compared to traditional text message translators such as Transl8it and Lingo2Text (Desai et. al., 2015).

2.2.2 Text representation with ML

Several articles were studied on the specific subject of representing text in different forms. A common approach for text representation is to use term frequency-inverse document frequency (tf-idf) (Robertson, 2004). It intends to rank words depending on their importance to a document in a corpus. The comparative study conducted by (Zhang et. al., 2011) investigates different representation methods. The focus of the study was comparing tf-idf with other text representations that take the semantic relation between words into account. It shows that the main advantage of a model such as tf-idf is its computational benefits, as other models become more complex. Furthermore, Zhang shows that semantic models improve the accuracy of supervised text classification. Foreman (2014) proposes a method that creates a ranked list of features, using binomial selection, for each class c of a data set, where all features are ranked according to the binary sub-task of discriminating class c versus all other classes combined. By storing this ranking for class c and using scheduling algorithms, for example Round-Robin or Rand-Robin, the accuracy has been shown to increase substantially (Foreman, 2014). Specifying words that have a high probability of belonging to a specific topic can improve the quality of the predictions. Additionally, by extending feature selection methods to also take the correlation between features and documents into account, classification accuracy improves compared with only considering the correlation between features and the categories they belong to (Zong et. al., 2015). Moreover, Foreman (2014) shows that a BNS representation of text documents improves the V-measure compared to tf-idf.
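As an illustration of the round-robin idea described above, the sketch below merges hypothetical per-class feature rankings into one feature set, letting each class contribute its next-best unused feature in turn. The class names, features and list lengths are invented for the example; the thesis does not specify an implementation.

```python
# Illustrative sketch: round-robin selection from per-class feature rankings,
# in the spirit of the scheduling idea attributed to Foreman (2014).
# Class names and features below are hypothetical.

def round_robin(ranked_lists, n_features):
    """Each class contributes its next not-yet-selected feature, in turn."""
    selected = []
    iters = {c: iter(lst) for c, lst in ranked_lists.items()}
    while len(selected) < n_features and iters:
        for cls in list(iters):
            if len(selected) >= n_features:
                break
            for feat in iters[cls]:
                if feat not in selected:
                    selected.append(feat)
                    break
            else:                       # ranking exhausted: drop this class
                del iters[cls]
    return selected

ranked = {  # features ranked best-first for the sub-task "class c vs. all other classes"
    "accessibility": ["phone", "queue", "opening", "busy"],
    "support":       ["comfort", "listen", "calm", "queue"],
    "other":         ["survey", "box", "phone", "tick"],
}
print(round_robin(ranked, 6))  # ['phone', 'comfort', 'survey', 'queue', 'listen', 'box']
```

The point of the round-robin schedule is that small classes keep their most predictive features in the final feature set instead of being drowned out by the larger classes.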

2.2.3 Text clustering with K-means

The k-means algorithm has been the subject of evaluation in several different text classification tasks (Macqueen, 1967). By using a vector representation that takes semantic attributes into consideration, the k-means algorithm can give a better F-value compared to non-semantic-based representations (Liu et. al., 2010); (Ma, 2014). This research also marks the importance of text representation and feature selection for accurately finding similarities between documents and clusters.

Chapter 3

Method

3.1 Description of the data set

The data used consisted of a large set of survey answers from multiple nationwide surveys regarding health care. We disregarded the data from closed questions and demographic variables, as we only wanted the raw text from the open-ended questions. Each answer had earlier been read and categorized manually into one of eight pre-defined categories. In the end, we had a corpus consisting of a set of raw texts with one category label attached to each of them. The different categories and their primary meanings are presented in table 3.1.

Table 3.1: Pre-defined categories
Involvement and participation
Emotional Support
General impression
Information and knowledge
Continuity and coordination
Respect and Consideration
Accessibility
Other

3.2 Data Handling

The data set provided was noisy, containing misspelled words and slang. Our target data was split into "Letters" and "Non-Letters", where all non-letter symbols, such as punctuation, were removed. The remaining words were transformed to lowercase and were then checked against a Swedish dictionary. All the unknown words were run through a spelling algorithm to create a list of corrections. We did this by creating a dictionary from approximately 2 million Swedish Wikipedia articles. The frequencies of the words were then used for predicting the most probable spelling. Subsequently, we replaced all the misspelled words with the predicted spelling. If the algorithm did not find a correction for a specific word, it was removed from the answer.
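A minimal sketch of this kind of frequency-based correction is shown below (Python assumed; the word counts are a tiny hypothetical stand-in for frequencies derived from the Wikipedia dump). It generates all strings one edit away from an unknown word and picks the known candidate with the highest corpus frequency, in the spirit of the procedure described above.

```python
# Illustrative sketch of frequency-based spell correction (Norvig-style).
# WORD_FREQ is a hypothetical stand-in for word counts from Swedish Wikipedia.
from collections import Counter

WORD_FREQ = Counter({"läkare": 900, "läkaren": 400, "vänlig": 300})
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]
    swaps    = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts  = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Return the known candidate with the highest corpus frequency, or None."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else None  # None -> drop word

print(correct("läkaer"))  # -> "läkare"
```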

3.2.1 Overview

The workflow of the preprocessing is depicted in figure 3.1. In total we compare 18 different vectorization methods and their filtering methods. The green (middle) sections are the different text representation types and the yellow (second row from the bottom) sections are the different feature selection methods.

Figure 3.1: Workflow of the preprocessing

3.2.2 Text representation

To retain as much information as possible from a text answer when quantifying it, we tried different vectorization methods and evaluated and compared the accuracy of the models. Tf-idf combines two statistics, the term frequency and the inverse document frequency, with the help of a scalar product. The term frequency uses a raw count of the number of times a term t appears in a document d. The inverse document frequency (idf) is a measure of in how many documents of the whole set a term appears. Equation 3.1 shows the formula for calculating the idf value for each term t over the set of documents D (Robertson, 2004).

\[
\mathrm{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|} \tag{3.1}
\]

where N is the total number of documents in the corpus and $|\{d \in D : t \in d\}|$ is the number of documents in which the term t appears. As several terms only appeared once, we used smooth weighting, meaning that we added 1 to the document frequency of every term. The normalized document frequency of a term ranges between 0 and 1, with the highest value for the most common terms in the corpus. By setting a threshold on the maximum document frequency (maxdf), terms that surpass the threshold are ignored. We used this by iterating over different thresholds and summarizing the results. The same was done with thresholds on the minimum document frequency (mindf).
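The following sketch shows how such a weighting with document-frequency thresholds could look using scikit-learn's TfidfVectorizer (an assumed tool; the thesis does not state its implementation). The answers are hypothetical.

```python
# Illustrative sketch (scikit-learn assumed): tf-idf vectorization with
# smoothed idf and document-frequency thresholds.
from sklearn.feature_extraction.text import TfidfVectorizer

answers = [  # hypothetical, already spell-corrected survey answers
    "bra bemötande av personalen",
    "svårt att nå mottagningen per telefon",
    "mycket vänlig personal och bra information",
]

vectorizer = TfidfVectorizer(
    smooth_idf=True,  # add 1 to every document frequency, as described above
    max_df=0.9,       # ignore terms appearing in more than 90% of the answers
    min_df=1,         # keep terms appearing in at least 1 answer (no lower threshold)
)
X = vectorizer.fit_transform(answers)  # sparse (n_answers x n_terms) matrix
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```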

N-gram

We extended the terms in tf-idf with an n-gram model. The n-gram model pairs each word with its preceding words into sequences of n words, in order to give extra context to the vectorized data. For example, the sentence "the cat in the hat" will be paired as follows:

(n = 2): [the cat], [cat in], [in the], [the hat]

(n = 3): [the cat in], [cat in the], [in the hat]

We only used n = 2 (bigram) and n = 3 (trigram) when vectorizing every answer. After the n-gram model had been created, we vectorized every word and n-gram belonging to each answer.
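A sketch of the n-gram extension, again with scikit-learn assumed, is shown below; ngram_range controls which n-grams are added alongside the single words.

```python
# Illustrative sketch (scikit-learn assumed): extending the vectorizer with n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer

bigram_vectorizer  = TfidfVectorizer(ngram_range=(1, 2))  # single words + bigrams
trigram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # single words + bigrams + trigrams

bigram_vectorizer.fit(["the cat in the hat"])
print(bigram_vectorizer.get_feature_names_out())
# ['cat' 'cat in' 'hat' 'in' 'in the' 'the' 'the cat' 'the hat']
```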

3.2.3 Binomial Separation

The goal of Binomial Separation (BNS) was to adjust the score of the more predictive words belonging to a specific category, to see if there existed a better scoring scheme for our text representation. For each category, every word was separated into being either 'Positive', belonging to that category, or 'Negative', belonging to any other category. The variables used are presented in table 3.2; note that features represent words in this context.

The TP rate and FP rate are defined as

\[
\mathrm{TP\ rate} = P(\mathrm{word} \mid \mathrm{positive\ class}) = \frac{TP}{TP + FN} \tag{3.2}
\]

\[
\mathrm{FP\ rate} = P(\mathrm{word} \mid \mathrm{negative\ class}) = \frac{FP}{FP + TN} \tag{3.3}
\]

Table 3.2: Variables used in BNS
True Positive (TP): features in the specified positive label
False Positive (FP): non-features in the specified positive label
False Negative (FN): features in the specified negative label
True Negative (TN): non-features in the specified negative label

The Binomial Separation score is now calculated as in equation 3.4:

\[
\mathrm{BNS\text{-}score} = \left| F^{-1}(\mathrm{TP\ rate}) - F^{-1}(\mathrm{FP\ rate}) \right| \tag{3.4}
\]

where $F^{-1}$ is the inverse normal cumulative distribution function. To avoid problems when the inverse distribution goes to infinity at zero or one, the TP and FP rates were limited to the range (0.00001, 1 − 0.00001). The method resulted in several sets of words with scaled scores, one set for each labeled class. The metric was applied to a term frequency count, so that the more predictive words become more prominent in the vectorization of the document.
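A minimal sketch of the BNS score in equation 3.4 is given below, using SciPy's inverse normal CDF and the clipping described above. The counts are hypothetical.

```python
# Illustrative sketch of the BNS score (equation 3.4), with SciPy assumed.
from scipy.stats import norm

def bns_score(tp, fp, fn, tn, eps=1e-5):
    """|F^-1(TP rate) - F^-1(FP rate)|, with rates clipped away from 0 and 1."""
    tp_rate = min(max(tp / (tp + fn), eps), 1 - eps)
    fp_rate = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(norm.ppf(tp_rate) - norm.ppf(fp_rate))

# A word occurring in 40 of 200 answers of the positive category
# and in 50 of 5000 answers of the other categories (hypothetical counts):
print(round(bns_score(tp=40, fp=50, fn=160, tn=4950), 3))
```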

3.3 K-means clustering

The k-means algorithm takes a given set of points $(x_1, x_2, \ldots, x_n)$, where each point is an N-dimensional vector, and tries to partition the points into k sets (Macqueen, 1967). The problem was solved using Lloyd's algorithm, with an average complexity of $\mathcal{O}(knT)$, where n is the number of samples and T the number of iterations. With the size and features of our data, speed was not an issue. K, the number of centroids, is assumed to be the same as the number of expected groupings, so K was set to 8.

The algorithm could be described as follows (Duda et. al., 2000):

1. Choose k, the number of clusters to divide the data set into.
2. Choose k data points at random and assign them as the initial cluster centers.
3. Repeat:
   (a) Assign each data point to its closest cluster center.
   (b) Compute new cluster centers by calculating the mean points.
   Do this until the centroids stop changing location, no point changes its cluster, or another criterion, such as a maximum number of iterations, is met.
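The sketch below is a minimal NumPy version of these steps (Lloyd's algorithm), included for illustration only; it is not the implementation used in the thesis, and the data is randomly generated as a stand-in for the vectorized answers.

```python
# Minimal NumPy sketch of Lloyd's algorithm, following the steps listed above.
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random initial centers
    for _ in range(max_iter):                                  # step 3: repeat
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # (a) assign to closest center
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                     # (b) recompute cluster means
        if np.allclose(new_centroids, centroids):              # stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(100, 5))             # hypothetical vectorized answers
labels, centroids = kmeans(X, k=8)
```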

12 3.3.1 Evaluation

We aimed to compare the clusters found by k-means with the existing categories. A good method to evaluate several clusters against several pre-labeled categories is the V-measure (Hirschberg et. al, 2007). The V-measure is the harmonic mean of homogeneity and completeness. Homogeneity requires that each cluster only contains answers that have been labeled with a single category. Completeness requires that all answers belonging to a particular category are grouped into the same cluster. The V-measure is defined as

\[
V = 2 \cdot \frac{h \cdot c}{h + c} \tag{3.5}
\]

Homogeneity, h, is defined as follows:

\[
h =
\begin{cases}
1, & \text{if } H(C, K) = 0,\\
1 - \dfrac{H(C \mid K)}{H(C)}, & \text{otherwise,}
\end{cases} \tag{3.6}
\]

where $H(C \mid K)$ is 0 if all answers within a single cluster have the same label, and $H(C)$ is the maximum reduction in entropy. Completeness is the mirror image of homogeneity, where $H(K \mid C)$ is 0 if all the answers belonging to one category are clustered into the same cluster.
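For reference, homogeneity, completeness and the V-measure can be computed as in the following sketch (scikit-learn assumed; the labels are a toy example, not thesis data).

```python
# Illustrative sketch: homogeneity, completeness and V-measure with scikit-learn.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

categories = [0, 0, 0, 1, 1, 1, 2, 2]  # hypothetical manual labels
clusters   = [0, 0, 1, 1, 1, 1, 2, 2]  # hypothetical k-means output

h = homogeneity_score(categories, clusters)
c = completeness_score(categories, clusters)
v = v_measure_score(categories, clusters)  # harmonic mean 2*h*c/(h+c)
print(f"h={h:.2f}  c={c:.2f}  V={v:.2f}")
```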

3.3.2 Implementation

The k-means algorithm chooses the cluster centroid seeds at random, producing different clusters with every run. This motivated us to run the k-means clustering and calculate the V-measure several times for each specific configuration of the data. We then calculated the resulting mean and standard deviation and show the comparative result. To test the equality of the means we used an independent two-sample t-test on the key results.
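A sketch of this procedure, with scikit-learn and SciPy assumed and hypothetical vectorized data, could look as follows.

```python
# Illustrative sketch: repeated k-means runs per configuration, mean/std of the
# V-measure, and an independent two-sample t-test between two configurations.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def repeated_v_measure(X, labels, runs=50, k=8):
    scores = []
    for run in range(runs):
        km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=run)
        scores.append(v_measure_score(labels, km.fit_predict(X)))
    return np.array(scores)

# X_bigram, X_trigram: two hypothetical vectorizations of the same answers; y: category labels
# scores_a = repeated_v_measure(X_bigram, y)
# scores_b = repeated_v_measure(X_trigram, y)
# print(scores_a.mean(), scores_a.std())
# t, p = ttest_ind(scores_a, scores_b)  # independent two-sample t-test on the means
```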

Chapter 4

Results

4.1 Spell correction and statistics

When running all the answers through our spell correction algorithm we found a total of 85555 misspelled words. Out of those 85555 words, 50252 were corrected and 35303 were not. Some words may have been to

Table 4.1: Statistics of the provided data set
Number of answers: 15461
Number of words after cleaning: 362166
Average number of words per answer / standard deviation: 23.4 / 21.99

In table 4.1, statistics of our data after processing are presented. Notice that the standard deviation of the average number of words per answer is large. This is due to the fact that several answers were quite long and contained many words. In the presentation of our experiments, the categories are represented by a number in the range 0-7. In table 4.2 the different categories and their frequencies of answers are shown. Notice that the first two categories have significantly less labeled data; this leads to a skewed data set, which might have caused prominent words of these two categories to be overlooked or to disappear in the abundant number of features. Notice also that category 7 contains almost one third of all the answers. That category contains general answers that did not fit into one of the more specific categories.

Table 4.2: The size and distribution of labels in the pre-labeled data set. There are 8 categories, labeled 0-7.
Category:            0      1      2       3      4      5       6       7       All
Number of answers:   189    213    3740    1216   1472   1800    1733    5098    15461
Size in percentage:  1.22%  1.38%  24.19%  7.86%  9.52%  11.64%  11.21%  32.97%  100%

4.2 Categories

The BNS algorithm extracted the ten most strongly predictive words for each category, presented in table 4.3. We translated the table into English; the translation is presented in table 4.4.

Table 4.3: Top ten most predictive features for each category in Swedish.
Involvering och delaktighet: inbillar, långtidverkande, tvingats, accepterad, accepterades, fundering, föredragit, konsulterade, kunskapsbrist
Emotionellt stöd: bedövningen, känslomässiga, lugnas, samtalstöd, sövningen, trösta, tvivla, undersökning, samtalskontakt
Helhetsintryck: enkelrum, smutsigt, efterrätt, toaletter, sallad, helhetsintrycket, äckligt, ägg, äcklig
Information och kunskap: gipset, härkomst, infon, förstådd, närvara, okunskap, språkförbistringar, utförligare, anvisningar
Kontinuitet och koordinering: sagts, borttappad, journalsystem, förtroende, hörts, kontaktast, sammarbete, samordnad, systemen
Respekt och bemötande: bemötts, nedlåtande, ovänlig, vänligt, förnedrande, nedvärderande, opassande, välbemött, arroganta
Tillgänglighet: telefontider, telefonkö, telefontiden, uppringning, telefontiderna, telefonsvararen, parkeringsplatser, upptaget, öppetider
Övrigt: grundskola, medverka, rutan, kryss, enkätens, formulerad, föranleddes, menas, svarsalternativ

The words give an indication of the content of each category. The idea behind extracting these words was to increase their respective scores so that the clustering algorithm could more easily differentiate between the answers. Judging subjectively, the most predictive words match their corresponding categories.

Table 4.4: Top ten most predictive features for each category in English, translated with Google Translate.
Involvement and Participation: imagination, long-term, forced, accepted, accepted, reflection, preferred, consulted, lack of knowledge
Emotional Support: anesthesia, emotional, calm, call support, sowing, comfort, doubt, examination, conversation contact
General Impression: single room, dirty, dessert, restrooms, salad, overall impression, disgusting, eggs, disgusting
Information and knowledge: plaster, descent, info, understood, attend, lack of knowledge, language enhancements, more detailed, instructions
Continuity and coordination: said, lost, journal system, trust, heard, contact, collaborate, coordinated, systems
Respect and Treatment: treated, condescending, unfriendly, friendly, degrading, negative, disassemble, well-behaved, arrogant
Accessibility: phone times, phone queue, phone time, call, phone times, answering machine, parking spaces, busy, opening times
Other: elementary school, take part, box, tick, survey, formulated, caused, meant, answer options

4.3 V-measure

The parameters of the k-means algorithm were set to K = 8 clusters, and for a single run the maximum number of iterations was set to 300. K-means was run with centroid seeds placed at 10 different random positions, and the best run in terms of inertia was kept. We did 50 consecutive runs for every type of test and calculated the standard deviation. We show the different V-measures when shifting maxdf and mindf for the different feature selection methods, and compare the best results with the BNS scoring.

4.3.1 TF — TF-IDF — Bigram — Trigram

In this section, comparative results for the V-measure are presented. Every specific configuration with different values of maxdf and mindf is plotted, both for spell-corrected data and for raw data, to see if our spelling algorithm improved the result.


Figure 4.1: Comparative results of V-measure with different thresholds on maxdf. Notice that the rightmost values represent no threshold.

Figure 4.2: Comparative results of V-measure with different thresholds on mindf. Notice that the leftmost values represent no threshold.

Figure 4.1 shows that when lowering the threshold on maxdf the V-measure reaches a maximum peak; beyond that peak the V-measure converges to a very low value. This tells us that some common words make the clustering harder, as the short answers may resemble each other when containing similar words. Compare with figure 4.2, where there is little or no improvement when changing the threshold on mindf, meaning that removing the least common words did not clearly improve the V-measure. Also notice that the bigram representation produced the highest overall mean in both figures when acting on the spell-corrected data.

4.3.2 Final results

Finally, we used the most predictive words calculated by the BNS algorithm to weigh the terms before clustering with k-means. We ran k-means 50 times and calculated the mean and standard deviation of the V-measure. This is plotted in figure 4.3 against the different configurations and their highest obtained V-measures, taken from figure 4.1.

Figure 4.3: The 9 different text representations and their highest obtained V-measure. The V-measure is normalized against the maximum V-measure obtained. Notice that the text representations ending with "raw" refer to processing the data without spell correction.

When testing the equality of the means, there was a significant difference between the trigram and bigram scores (two-sample t-test, p < 0.001). When testing the equality between Bigram and Bigram-Raw there was also a difference, though only at a weaker significance level (two-sample t-test, p < 0.20). The BNS representation performed worst, contradicting our earlier beliefs. With high statistical certainty, our result contradicted the hypothesis stated by Foreman (2014) that BNS scoring would result in a higher V-measure compared to tf-idf. As BNS supposedly works especially well in cases where the data set is skewed, as in our case, BNS should theoretically give a good representation in terms of clustering outcome.

4.4 Distribution of answers

It is interesting to compare the distribution of labels in each cluster and, conversely, how the answers belonging to each category were distributed among the clusters. Figure 4.4 shows the distribution of categories in the 8 clusters. Clusters 0, 1, 4 and 7 have distinct peaks, meaning that these clusters have some correlation to their respective categories. For clusters 2 and 5 it is harder to distinguish any specific relationship to a single category. Figure 4.5 shows the answers in each category and their distribution among the clusters. Notice that categories 0, 1 and 3 resemble each other in their distribution.

Figure 4.4: The figure shows eight plots representing the eight different clusters for a single run. Each bar corresponds to the number of answers belonging to that category. Note that it is only meant to give an idea of how the clusters correspond to the different categories. The method used for this clustering was bigram with maxdf = 0.1, mindf = 0.0006.

Figure 4.5: The figure shows eight plots representing the eight categories. Each bar corresponds to the number of answers that were clustered into one of the eight clusters for a single run. Note that it is only meant to give an idea of how the categories correspond to the clusters found. The method used for this clustering was bigram with maxdf = 0.1, mindf = 0.0006.

Chapter 5

Discussion

5.1 General findings

The primary finding in our results is that the choice of text representation and filtering method is important when clustering short texts. There is a noticeable trend in our results that applying feature selection improves the V-measure. This indicates that some words or features have a negative impact on distinguishing the answers. Removing common words, for instance conjunctions, reduces similarities between the categories. This led to a better clustering in our thesis, which highlights the usefulness of feature selection. Adding features, such as n-grams, that try to extend the text representation by its semantics in a simple way does improve the results. Extracting more relevant information, and thereby increasing the number of features from the text, improves the basis for finding more suitable patterns. Extending the representation of underlying structures in the text, such as semantics, will probably further improve the extraction of information. In the field of NLP there are challenges concerning how to represent the intended meaning of a sentence, a challenge also reflected in our results.

For answers that contain little or no information, it is difficult to determine whether they belong to a specific category or not. In a corpus containing many answers with low information, such answers will likely scatter among several clusters, reducing the V-measure. Answers that lie in between different clusters would have the same reducing effect on the resulting V-measure. It would therefore be interesting to use hierarchical clustering algorithms, where answers with no obvious topic are clustered into an indeterminate category. This would make it possible to single out the answers of most importance, i.e. those answers that strongly belong to a specific category or cluster. These answers would be more valuable in the overall analysis process for the surveys.

Overall, the results showed a poor clustering when compared with the assigned labels. The conclusion can be drawn that the existing categories are either too broad or too narrow. This implies that narrowing the number of clusters to a predetermined number and expecting a connection with human-labeled categories may be too ambitious. On the other hand, the resulting clusters may indicate how well a predefined category represents the intrinsic subjects. Some categories would benefit from being separated and others could be joined, something that could be used to improve the further construction of surveys.

23 5.2 Analysis of methods

Data handling

The information given in each answer was of varying quality and quantity. Occurrences of homonyms and antonyms in a short sentence might have altered the interpretation substantially, which may in turn have led to difficulties in telling answers apart. This can be connected to the challenge of making the computer understand, generate and manage a natural language. Our methods for cleaning the data were elementary and failed to find corrections for every misspelled word. For further improvement of the general result, a more advanced spelling method might have helped, as we could see a general trend that spell correction of the data produced better results. This implies the importance of refinement when handling noisy text, such as survey answers, tweets and messages.

Text representation

We can draw the conclusion that the effects of different text representations extensively alter the outcome of the clustering. Representations that correlate words in a sentence try to extract semantic relations. Our methods did this in a simple way, which might have included correlations that did not exist. This might have resulted in an abundant number of features that did not enhance the information extracted. This is a limitation of simple methods; using more complex methods might have increased the performance, but this is left as an open question. Our approach of using a supervised method to filter the features showed surprisingly bad results. Generally speaking, if information is known beforehand then it should be possible to use it to improve the clustering. From this thesis no conclusion can be drawn from that result. It may have been affected by technical issues in our code or by the data set at hand.

Clustering

The value of k in k-means was pre-determined with no further reflection regarding the data set and its characteristics. The data set's ability to be differentiated into k different clusters poses significant limitations on the results. Imposing such limitations on a corpus can result in a bad clustering. The results imply that there are reasons to consider a different value of k, to better adapt to the data at hand. As no clustering algorithm is in general superior to another, but rather depends on the data set, other clustering algorithms may be more suitable for our problem. This poses limitations on the general conclusions regarding how well predicted clusters correspond to manually labeled documents.

24 Evaluation

Evaluation of the clusters was done under the assumption that the manual labeling of the answers corresponds to the true categorization. The manual labeling can be considered to have a low bias but a high variance. As the entropy is presumably high when handling rather subjective categories, and adding the fact that there are 8 different clusters, the evaluation was a hard task. Not every kind of subjective judgment of the clusters can be validated.

5.3 Future work

When evaluating the results, adding more features improved them. Considering this, it would be interesting to add features such as previously answered questions, information about demographics and the length of the text answers. If the assumption is made that these features have a connection to the open answers, the clustering results may be improved. Further, it would be interesting to investigate whether open text questions could entirely substitute for closed questions, if the same information could be extracted from unstructured as from structured formats.

Chapter 6

Conclusion

This thesis has examined how well an unsupervised approach performs in imitating pre-labeled categories. This approach could not be shown to produce results equivalent to the manual labeling. Our unsupervised method can instead be used to validate the correlation between structures in a data set and imposed structures, such as the predetermined categories of a survey. Hence the construction of surveys and the interpretation of information from open-ended answers may benefit from our findings. Furthermore, we investigated the effects of different text representations on the clustering outcome. The importance of text representation when working with short, noisy text documents has been shown before, and this thesis reinforces that statement. Because of the benefits of a semantic-based text representation, as shown in this thesis, further development should be aligned with this conclusion. The problem of natural language understanding is considered to be among the most difficult problems in the field of artificial intelligence. Nonetheless, representing the text with relevant information that tries to comprehend parts of the natural language is today a sensible approach.

Bibliography

(Jick, 1979) T.D. Jick. Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, 1979.

(Salant et. al., 1994) P. Salant and D.A. Dillman. How to conduct your own survey. New York, Wiley and sons, 1994.

(Hong et. al., 2010) L. Hong and B.D. Davidson. Empirical Study of Topic Modeling in Twitter, 2010.

(Kolcz et. al., 2014) S. Yang, A. Kolcz, A. Schlaikjer and P. Gupta. Large- scale high-precision topic modeling on Twitter. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014.

(Duda et. al., 2000) R.O. Duda, P.E. Hart, D.G. Stork. Pattern Classification. Wiley-Interscience, 2000.

(Russom, 2007) Philip Russom. BI search and text analytics. TDWI Best Practices Report, 2007.

(James et. al., 2013) G. James, D. Witten, T. Hastie and R. Tibshirani. An introduction to Statistical Learning. New York, Springer, 2013.

(Blei, 2012) David Blei. Probabilistic Topic Models. Communications of the ACM, Pages 77-84, 2012.

(Turing, 1950) A.M. Turing. Computing Machinery and Intelligence. Mind 59, Pages 433-460, 1950.

(Johnson, 2009) M. Johnson. How the statistical revolution changes (computational) linguistics. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, 2009.

(Cavnar et. al., 1994) W.B. Cavnar and J.M. Trenkle. N-gram-based text categorization. Environmental Research Institute of Michigan, 1994.

(Alonso et. al, 2011) M. Alonso, J. Malpica and A. Martinez de Agirre. Consequences of the Hughes phenomenon on some classification techniques, 2011.

(Punch et. al., 1993) W.F. Punch, E.D. Goodman, M. Pei, L. Chia-Shun, P. Hovland and R. Enbody. Further Research on Feature Selection and Classification Using Genetic Algorithms, 1993.

(Chandrashekar, 2013) G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers and Electrical Engineering 40, Pages 16–28, 2013.

(Jain, 2009) A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, Volume 31, Issue 8, Pages 651-666, 2009.

(Macqueen, 1967) J. Macqueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Pages 281-297, 1967.

(Huang, 2008) A. Huang. Similarity measures for text document clustering. Proceedings of the New Zealand Computer Science Research Student Conference, Pages 1-8, 2008.

(Halkidi et. al., 2001) M. Halkidi, Y. Batistakis and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 2001.

(Novikov, 2012) E. Sivogolovko, B.A. Novikov. Validating cluster structures in data mining tasks. Proceedings of the 2012 Joint EDBT/ICDT Workshops, 2012.

(Amayri et. al., 2013) O. Amayri and N. Bouguila. Online news TD and tracking via localized feature selection. Neural Networks (IJCNN), The 2013 International Joint Conference Dallas, TX, 2013.

(Desai et. al., 2015) N. Desai, M. Narvekar. Normalization of Noisy Text Data. International Conference on Advanced Computing Technologies and Applica- tions, 2015.

(Robertson, 2004) S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation 60, 2004.

(Zhang et. al., 2011) W. Zhang, T. Yoshida and X. Tang. A Comparative Study of TF-IDF, LSI and multi-words for Text Classification. Expert Systems with Applications 38, Pages 2758-2765, 2011.

(Foreman, 2014) G. Foreman. A Pitfall and Solution in Multi-Class Feature Selection for Text Classification, Proceedings of the twenty-first international conference on Machine learning, Page 38, 2004.

(Zong et. al., 2015) W. Zong, F. Wu, L. Chu and D. Sculli. A discriminative and semantic feature selection method for text categorization. International Journal of Production Economics, Vol 166, Pages 215-222, 2015.

(Liu et. al., 2010) Y. Liu, S. Xiao, X. Lv and S. Shi. Research on K-Means Text Clustering Algorithm Based on Semantic, Proc. 10th International Conference on Computing, Control and Industrial Engineering (CCIE’10), vol.1, Pages 124- 127, 2010.

(Ma, 2014) J. Ma. Improved K-Means Algorithm in Text Semantic Clustering. The Open Cybernetics & Systemics Journal, 8, Pages 530-534, 2014.

(Hirschberg et. al, 2007) J. Hirschberg and A. Rosenberg. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07), Pages 410-420, 2007.
