DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017

Unsupervised text clustering using survey answers

THERESE STÅLHANDSKE

MATHIAS TÖRNQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Unsupervised text clustering using survey answers

Mathias Törnqvist and Therese Stålhandske

Degree Project in Engineering Physics, First Level at CSC Course code: SA114X

Supervisor: Pawel Herman
Examiner: Martin Viklund

May 20, 2017

Abstract

Text data mining is a growing research field in which machine learning and NLP are important technologies. There are multiple applications concerning the categorization of large sets of documents. The methods differ depending on the size of the documents; for short text documents, the information in each individual document is scant. The aim of this paper is to show how well unsupervised text clustering reflects existing class assignments and how sensitive the clustering is to the choice of text representation and feature selection. The raw data was collected from several national health surveys. Evaluation was made with a conditional entropy-based method called the V-measure, which connects the clusters to the categories. We show that some methods perform significantly better against raw data than others.

Acknowledgement

We would like to thank Gustav Svensson at Zerebra AB for the support with ideas and technical knowledge. Additionally, we want to express our gratitude to our supervisor Pawel Herman for his help and guidance in the subject.

Contents

1 Introduction
1.1 Problem description
1.2 Scope and objectives
1.3 Outline of report

2 Background
2.1 Text Analysis with ML-methods
2.1.1 Natural Language Processing (NLP)
2.1.2 Feature Selection
2.1.3 Clustering algorithms
2.1.4 Validation of clustering outcomes
2.2 Previous work
2.2.1 Short format text
2.2.2 Text representation with ML
2.2.3 Text clustering with K-means

3 Method
3.1 Description of the data set
3.2 Data Handling
3.2.1 Overview
3.2.2 Text representation
3.2.3 Binomial Separation
3.3 K-means clustering
3.3.1 Evaluation
3.3.2 Implementation

4 Results
4.1 Spell correction and statistics
4.2 Categories
4.3 V-measure
4.3.1 TF — TF-IDF — Bigram — Trigram
4.3.2 Final results
4.4 Distribution of answers

5 Discussion
5.1 General findings
5.2 Analysis of methods
5.3 Future work

6 Conclusion

Chapter 1

Introduction

A recognized method for collecting feedback about a service or a product is to use surveys. Survey responses come in different forms and can be divided into qualitative and quantitative sets (Jick, 1979). Quantitative answers, for example in numerical format, are easy to interpret and analyze in contrast to qualitative text answers. Responses to open-ended questions give qualitative information that can be vital for the analysis of a survey as a whole. They give the respondents a way to express ideas that, without detailed answers, would be missed (Salant et. al., 1994). Problems arise in handling a large number of surveys, and extracting information becomes tedious if done manually.

When analyzing a large set of text documents, machine learning (ML) methods are rising in popularity among scientists for the purpose of detecting topics. Grouping the answers into categories, or trying to fit them into categories of interest, are good applications for the field of ML; for example, detecting trending Twitter topics in real time, as shown in the work of (Hong et. al., 2010); (Kolcz et. al., 2014). Tweets show similarities to survey answers, as both have a low word count but a high degree of noise, e.g. in the form of spelling mistakes.

Motivated by the above, and with the idea of improving the analysis process of surveys, unsupervised methods were used with the aim of extracting information from open-ended answers. After clustering open-ended answers and applying different feature representation methods, we wanted to further improve the resulting clusters. A deeper understanding of how different groups answer plays a role in enhancing the information extracted from surveys, as well as in validating the pre-defined categories. It may be used to improve the construction of surveys and as a validation tool by extracting the information of interest from the questions asked. The chosen approach is to use state-of-the-art methods on different specific problems concerning the data set and to combine these in order to produce a good clustering.

1.1 Problem description

Several text classification methods exist, and the choice between them depends both on the desired outcome and on the data set at hand. One approach to classify a set of documents as either positive or negative is to label a small set of randomly selected training documents. Supervised ML methods can then train on the labeled set and use the result to label the remaining unlabeled data as either positive or negative. This is a time-consuming process, as documents have to be labeled manually, and it also comes with the risk of biased labeling. Finding natural clusters within the data set, without explicit labeling by a human, would simplify the process.

This thesis tries to imitate categories that have been manually assigned to a set of text documents by clustering the documents. We wanted to know how well an unsupervised approach performs in labeling short text documents. When differentiating clusters, there is a risk that the clusters may differ depending on the choice of data representation (Duda et. al., 2000). This motivated the examination and validation of different data representations, in order to investigate the effect of the representation on the clustering outcome. In order to evaluate the results, we relied on a portion of labeled data for cross-validation. The data set was provided by a survey company.

1.2 Scope and objectives

The objectives of the report can be described as follows:

• evaluating different text representations for short texts
• improving the clustering of text answers by using different settings, text representations and feature selection

The focus lies on k-means, described in section 2.1.3, while other kinds of clustering methods are not examined closer. The feature selection methods focus on the issues concerning the semantic relationship of words and the skewness of the data set. This leaves out other characteristics of the data: length of answers, connection to closed answers, demographics, and the possibility of multiple labels.

1.3 Outline of report

Firstly, in chapter 2, the subject of machine learning and its benefits in analyzing data are presented. Furthermore, previous work on natural language processing (NLP) and topic detection (TD) is presented. We explore studies that have been done on similar types of data sets, for example Twitter snippets. In chapter 3, the theoretical background for the methods and approaches used is discussed. The methods used, and the motivations behind the choices of these methods, are described in depth. Results are presented in chapter 4. We compare the different methods and investigate the efficiency and accuracy of the clustering in chapter 5. Ultimately, we discuss the use of the clustering method when analyzing short, high-dimensional texts and how feature selection can be used to improve the results.

Chapter 2

Background

2.1 Text Analysis with ML-methods

The ability to analyze texts is an important aspect of many different fields. A large portion of business-relevant information is represented in an unstructured form, primarily as text. Common examples are reports, email, chats, tweets, social media updates or any other documents containing mostly natural-language text (Russom, 2007). Usually, the idea of text analysis is to transform unstructured data and impose a structure on the corpus so that information can be extracted more easily. This can be done with different approaches. ML methods strive to automate the transformation into structured data and, without explicit programming, find patterns in the data set at hand (James et. al., 2013). Applying ML to the task of detecting topics in a document has been shown to be successful and computationally beneficial compared to manually driven processes (Russom, 2007).

ML algorithms can generally be divided into two sub-fields: unsupervised and supervised. The main difference lies in the formulation of the problem. In a supervised approach, labels or categories are known from the start and are used to generate a hypothesis when an unknown data point is introduced. Unsupervised algorithms, such as clustering algorithms, use no prior labeling to differentiate natural clusters in a data set (James et. al., 2013). One example of an application of an unsupervised method is topic detection within a corpus of documents (Blei, 2012).

2.1.1 Natural Language Processing (NLP)

When working with ML methods for text clustering, an important aspect to consider is how the data should be processed and represented. The field of NLP revolves around this interaction between computers and human languages. In this field, there are several interesting problems, which can be dated back to 1950, when Alan Turing published the paper Computing Machinery and Intelligence (Turing, 1950). He proposed what is now called the "Turing test". It was not the first formulation related to NLP, though it was a major turning point in the field. Since then, the research around NLP has mainly been based

around ML, especially since the 'statistical revolution' during the late 1980s (Johnson, 2009). The main objective of NLP is to find structures and patterns so that a computer can understand, generate and manage a natural language. When representing a text, a common approach is to represent words as discrete symbols and vectorize the document with the intention of retaining the most prominent information. Each word is replaced by a weight, and each document can then be quantified as a vector. If the semantic correlation between words is disregarded, treating every word as unique becomes troublesome when it comes to contextual classification. By including sense relations, part-of-speech tagging and combinations of word sequences, rather than analyzing word by word, the information provided can be enhanced, as more aspects of the sentimental meaning of the language are taken into account (Cavnar et. al., 1994).

A supervised feature representation method considers the correlation between features and the categories they belong to: finding strongly predictive features for a specific class. This can be especially useful when working with multiple categories and when the sizes of the categories are skewed. One problem that arises with a skewed data set is that the predictive features for the different categories do not account for the asymmetrical distribution, meaning that for smaller clusters, features important for differentiating these clusters would disappear in the vast number of other features. A method that works around this problem is binomial separation. The method uses the probabilities of a certain feature belonging, versus not belonging, to a certain class to find different weights and numerical representations for each word.

2.1.2 Feature Selection

In feature selection problems, the goal is to find the most relevant features that maximize the information about the input. It has been shown that there is a dependence between the accuracy of the classification and the dimensionality of the data. The accuracy rate can decline significantly if a redundant number of features is used, known as the Hughes phenomenon, making feature reduction a crucial part of a clustering problem (Alonso et. al, 2011). The advantages of feature selection have been shown by comparing accuracy rate and the use of data-processing power (Punch et. al., 1993).

Comparisons between different feature selection methods, as pointed out in the paper 'A survey on feature selection methods' by Chandrashekar (2013), can only be done by tests on the specific data set (Chandrashekar, 2013). He shows that a given algorithm may behave differently on different data sets, which may make it harder to predict an optimal model. In the evaluation of a selection model, Chandrashekar (2013) points out several aspects to consider. In particular, classification accuracy and the number of reduced features are used as comparative measurements between two different feature selection models.

2.1.3 Clustering algorithms

When clustering a data set, the goal is to group data points that show stronger similarities to each other than to the rest of the data set. For this task, different clustering algorithms use different definitions of clusters and different approaches to find them. A common notion of a cluster is that its members lie within close distance of each other, or that there are areas of high density in the data space. Clustering algorithms can give an important insight into which features are more prominent when differentiating groups, and also give a measure of the homogeneity of the data set. When working with text documents, clustering algorithms can be used for detecting topics by grouping the corpus and then separately defining suitable labels. Suitable in this context could mean different things and is a matter of subjective judgment. In general, the choice of algorithm depends on the specific data set at hand and on the intentions, or specifications, of the result (James et. al., 2013).

K-means

K-means is an unsupervised clustering method that has played a prominent role in the ML community due to its simplicity, easy implementation and efficiency (Jain, 2009). The algorithm defines K centroids, one for each class, and repeatedly moves them around to minimize the distance between each data point and the center of its cluster (Macqueen, 1967). The number of centroids is given by the input parameter K, which highly affects the outcome of the clustering.

One critical point when adapting k-means to a data set is the determination of the similarity measure. This depends on how the distance, or the degree of closeness, is defined between a data point and the centroid mean. Metrics such as Euclidean distance, cosine similarity, the Jaccard coefficient and the Pearson correlation coefficient can be used with different degrees of success. Furthermore, the most efficient way of determining the choice of metric is by cross-validating the clustering results (Huang, 2008).
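To make the difference between such metrics concrete, the following minimal sketch (not from the thesis; plain NumPy assumed) compares Euclidean distance and cosine similarity on two toy term-frequency vectors, where one document is simply twice as long as the other.

```python
# Illustrative sketch (not from the thesis): Euclidean distance vs. cosine
# similarity on toy term-frequency vectors.
import numpy as np

a = np.array([2.0, 0.0, 1.0, 3.0])  # hypothetical tf vector for document A
b = np.array([4.0, 0.0, 2.0, 6.0])  # document B: same word proportions, twice the length

euclidean = np.linalg.norm(a - b)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.2f}")  # large, penalizes the length difference
print(f"Cosine similarity:  {cosine:.2f}")     # 1.00, insensitive to document length
```

The example illustrates why cosine similarity is often preferred for text: documents with the same word proportions are treated as identical regardless of their length.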

2.1.4 Validation of clustering outcomes

Validation of the outcomes of the clusters can be done in different ways by emphasizing different metrics (Halkidi et. al., 2001). Common approaches can be divided into external and internal criteria. Internal criteria can be measurements of intra- versus inter-cluster variance or of cluster compactness and separation. External criteria are used when additional information about the formation of the clusters is given beforehand. In a situation where the number of clusters and their respective sizes are approximately known beforehand, validation of the predicted clusters can be done (Novikov, 2012).

2.2 Previous work

Previous work of interest examines topic detection in short text documents. For survey answers in particular, little work has been done, but work handling Twitter tweets faces the same difficulties in terms of the length of the documents and the amount of noise. Noise in this context is not only spelling mistakes and word abbreviations, but also documents with very little or no information. When determining the text representation of the survey answers, we examined previous comparative studies of different feature representation models.

2.2.1 Short format text

Topic detection and tracking of documents have been driven by the increasing production of news and shorter texts, especially since the emergence of social media (Amayri et. al., 2013). A data format that has been the subject of several investigations is Twitter text snippets (Hong et. al., 2010); (Kolcz et. al., 2014). The goal of these studies has been organizing sparse and noisy texts into pre-defined categories. Kolcz et al. (2014) show that topic modeling can be highly influenced by the length of the targeted text, and that topic models learned from aggregated texts written by the same person may lead to superior performance in clustering problems. Concerning the evaluation of cluster quality, Hong et al. (2010) and Kolcz et al. (2014) use two different approaches: Kolcz et al. (2014) focus on statistical measures such as F-score, precision and recall, obtained by comparing the clusters to the human-made labeling, while Hong et al. (2010) presented tweet-topic pairs to humans and asked them to provide binary answers as to whether or not the category is correct for the given tweet. This approach was suggested since binary tasks are easier, so the human participant is less likely to make a mistake. To check the quality of the binary answers, a small set of tweets considered to have a high probability of being correctly labeled is introduced.

Texts extracted from social media, such as Twitter, are infamously noisy: they contain spelling mistakes, short-hand language and colloquialisms. Given such a data set, the handling of unidentified words has been shown to be important for further analysis of the corpus. Desai et al. (2015) propose a method for handling this: first identifying words not belonging to a given vocabulary and then processing them through a word shortening algorithm, common word replacement and lexical matching to find the most statistically probable correction for use in re-phrasing the sentence. Following this structure, the results have been shown to be more efficient compared to traditional text message translators such as Transl8it and Lingo2Text (Desai et. al., 2015).

2.2.2 Text representation with ML

Several articles were studied on the specific subject of representing text in different forms. A common approach for text representation is to use term frequency-inverse document frequency (tf-idf) (Robertson, 2004). It intends to rank words depending on their importance to a document in a corpus. The comparative study conducted by (Zhang et. al., 2011) investigates different representation methods. The focus of the study was comparing tf-idf with other text representations that take the semantic relation between words into account. It shows that the main advantage of a model such as tf-idf is its computational benefits, as other models become more complex. Furthermore, Zhang shows that semantic models improve the accuracy of supervised text classification. Foreman (2014) proposes a method that creates a ranked list of features, using binomial selection, for each class c of a data set, where all features are ranked according to the binary sub-task of discriminating class c versus all other classes combined. By storing this ranking for class c and using scheduling algorithms, for example Round-Robin or Rand-Robin, the accuracy has been shown to increase substantially (Foreman, 2014). Specifying words that have a high probability of belonging to a specific topic can improve the quality of the predictions. Additionally, by extending feature selection methods to also take the correlation between features and documents into account, classification accuracy improves compared with only considering the correlation between features and the categories they belong to (Zong et. al., 2015). Moreover, Foreman (2014) shows that a BNS representation of text documents improves the V-measure compared to tf-idf.
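As an illustration of the round-robin idea described above, the sketch below merges hypothetical per-class feature rankings into one feature set, letting each class contribute its next-best unused feature in turn. The class names, features and list lengths are invented for the example; the thesis does not specify an implementation.

```python
# Illustrative sketch: round-robin selection from per-class feature rankings,
# in the spirit of the scheduling idea attributed to Foreman (2014).
# Class names and features below are hypothetical.

def round_robin(ranked_lists, n_features):
    """Each class contributes its next not-yet-selected feature, in turn."""
    selected = []
    iters = {c: iter(lst) for c, lst in ranked_lists.items()}
    while len(selected) < n_features and iters:
        for cls in list(iters):
            if len(selected) >= n_features:
                break
            for feat in iters[cls]:
                if feat not in selected:
                    selected.append(feat)
                    break
            else:                       # ranking exhausted: drop this class
                del iters[cls]
    return selected

ranked = {  # features ranked best-first for the sub-task "class c vs. all other classes"
    "accessibility": ["phone", "queue", "opening", "busy"],
    "support":       ["comfort", "listen", "calm", "queue"],
    "other":         ["survey", "box", "phone", "tick"],
}
print(round_robin(ranked, 6))  # ['phone', 'comfort', 'survey', 'queue', 'listen', 'box']
```

The point of the round-robin schedule is that small classes keep their most predictive features in the final feature set instead of being drowned out by the larger classes.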

2.2.3 Text clustering with K-means

The k-means algorithm has been the subject of evaluation in several different text classification tasks (Macqueen, 1967). By using a vector representation that takes semantic attributes into consideration, the k-means algorithm can give a better F-value compared to non-semantic-based representations (Liu et. al., 2010); (Ma, 2014). This research also marks the importance of text representation and feature selection for accurately finding similarities between documents and clusters.

Chapter 3

Method

3.1 Description of the data set

The data used consisted of a large set of survey answers from multiple nationwide surveys regarding health care. We disregarded the data from closed questions and demographic variables, as we only wanted the raw text from the open-ended questions. Each answer had earlier been read and categorized manually into one of eight pre-defined categories. In the end, we had a corpus consisting of a set of raw texts with one category label attached to each of them. The different categories and their primary meanings are presented in table 3.1.

Table 3.1: Pre-defined categories
Involvement and participation
Emotional Support
General impression
Information and knowledge
Continuity and coordination
Respect and Consideration
Accessibility
Other

3.2 Data Handling

The data set provided was noisy, containing misspelled words and slang. Our target data was split into "Letters" and "Non-Letters", where all non-letter symbols, such as punctuation, were removed. The remaining words were transformed to lowercase and were then checked against a Swedish dictionary. All the unknown words were run through a spelling algorithm to create a list of corrections. We did this by creating a dictionary from approximately 2 million Swedish Wikipedia articles. The frequencies of the words were then used for predicting the most probable spelling. Subsequently, we replaced all the misspelled words with the predicted spelling. If the algorithm did not find a correction for a specific word, it was removed from the answer.
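A minimal sketch of this kind of frequency-based correction is shown below (Python assumed; the word counts are a tiny hypothetical stand-in for frequencies derived from the Wikipedia dump). It generates all strings one edit away from an unknown word and picks the known candidate with the highest corpus frequency, in the spirit of the procedure described above.

```python
# Illustrative sketch of frequency-based spell correction (Norvig-style).
# WORD_FREQ is a hypothetical stand-in for word counts from Swedish Wikipedia.
from collections import Counter

WORD_FREQ = Counter({"läkare": 900, "läkaren": 400, "vänlig": 300})
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]
    swaps    = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts  = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Return the known candidate with the highest corpus frequency, or None."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else None  # None -> drop word

print(correct("läkaer"))  # -> "läkare"
```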

3.2.1 Overview

The workflow of the preprocessing is depicted in figure 3.1. In total we compare 18 different vectorization methods and their filtering methods. The green (middle) sections are the different text representation types and the yellow (second row from the bottom) sections are the different feature selection methods.

Figure 3.1: Workflow of the preprocessing

3.2.2 Text representation

To retain as much information as possible from a text answer when quantifying it, we tried different vectorization methods and evaluated and compared the accuracy of the models. Tf-idf combines two statistics, the term frequency and the inverse document frequency, with the help of a scalar product. The term frequency uses a raw count of the number of times a term t appears in a document d. The inverse document frequency (idf) is a measure of in how many documents of the whole set a term appears. Equation 3.1 shows the formula for calculating the idf value for each term t over the set of documents D (Robertson, 2004).

\[
\mathrm{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|} \tag{3.1}
\]

where N is the total number of documents in the corpus and $|\{d \in D : t \in d\}|$ is the number of documents in which the term t appears. As several terms only appeared once, we used smooth weighting, meaning that we added 1 to the document frequency of every term. The normalized document frequency of a term ranges between 0 and 1, with the highest value for the most common terms in the corpus. By setting a threshold on the maximum document frequency (maxdf), terms that surpass the threshold are ignored. We used this by iterating over different thresholds and summarizing the results. The same was done with thresholds on the minimum document frequency (mindf).
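The following sketch shows how such a weighting with document-frequency thresholds could look using scikit-learn's TfidfVectorizer (an assumed tool; the thesis does not state its implementation). The answers are hypothetical.

```python
# Illustrative sketch (scikit-learn assumed): tf-idf vectorization with
# smoothed idf and document-frequency thresholds.
from sklearn.feature_extraction.text import TfidfVectorizer

answers = [  # hypothetical, already spell-corrected survey answers
    "bra bemötande av personalen",
    "svårt att nå mottagningen per telefon",
    "mycket vänlig personal och bra information",
]

vectorizer = TfidfVectorizer(
    smooth_idf=True,  # add 1 to every document frequency, as described above
    max_df=0.9,       # ignore terms appearing in more than 90% of the answers
    min_df=1,         # keep terms appearing in at least 1 answer (no lower threshold)
)
X = vectorizer.fit_transform(answers)  # sparse (n_answers x n_terms) matrix
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```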

N-gram

We extended the terms in tf-idf with an n-gram model. The n-gram model pairs each word with its preceding words into sequences of n words, in order to give extra context to the vectorized data. For example, the sentence "the cat in the hat" will be paired as follows:

(n = 2): [the cat], [cat in], [in the], [the hat]

(n = 3): [the cat in], [cat in the], [in the hat]

We only used n = 2 (bigram) and n = 3 (trigram) when vectorizing every answer. After the n-gram model had been created, we vectorized every word and n-gram belonging to each answer.
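A sketch of the n-gram extension, again with scikit-learn assumed, is shown below; ngram_range controls which n-grams are added alongside the single words.

```python
# Illustrative sketch (scikit-learn assumed): extending the vectorizer with n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer

bigram_vectorizer  = TfidfVectorizer(ngram_range=(1, 2))  # single words + bigrams
trigram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # single words + bigrams + trigrams

bigram_vectorizer.fit(["the cat in the hat"])
print(bigram_vectorizer.get_feature_names_out())
# ['cat' 'cat in' 'hat' 'in' 'in the' 'the' 'the cat' 'the hat']
```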

3.2.3 Binomial Separation

The goal of Binomial Separation (BNS) was to adjust the score of the more predictive words belonging to a specific category, to see if there existed a better scoring scheme for our text representation. For each category, every word was separated into being either 'Positive', belonging to that category, or 'Negative', belonging to any other category. The variables used are presented in table 3.2; note that features represent words in this context.

The TP rate and FP rate are defined as

\[
\mathrm{TP\ rate} = P(\mathrm{word} \mid \mathrm{positive\ class}) = \frac{TP}{TP + FN} \tag{3.2}
\]

\[
\mathrm{FP\ rate} = P(\mathrm{word} \mid \mathrm{negative\ class}) = \frac{FP}{FP + TN} \tag{3.3}
\]

Table 3.2: Variables used in BNS
True Positive (TP): features in the specified positive label
False Positive (FP): non-features in the specified positive label
False Negative (FN): features in the specified negative label
True Negative (TN): non-features in the specified negative label

The Binomial Separation score is now calculated as in equation 3.4:

\[
\mathrm{BNS\text{-}score} = \left| F^{-1}(\mathrm{TP\ rate}) - F^{-1}(\mathrm{FP\ rate}) \right| \tag{3.4}
\]

where $F^{-1}$ is the inverse normal cumulative distribution function. To avoid problems when the inverse distribution goes to infinity at zero or one, the TP and FP rates were limited to the range (0.00001, 1 − 0.00001). The method resulted in several sets of words with scaled scores, one set for each labeled class. The metric was applied to a term frequency count, so that the more predictive words become more prominent in the vectorization of the document.
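A minimal sketch of the BNS score in equation 3.4 is given below, using SciPy's inverse normal CDF and the clipping described above. The counts are hypothetical.

```python
# Illustrative sketch of the BNS score (equation 3.4), with SciPy assumed.
from scipy.stats import norm

def bns_score(tp, fp, fn, tn, eps=1e-5):
    """|F^-1(TP rate) - F^-1(FP rate)|, with rates clipped away from 0 and 1."""
    tp_rate = min(max(tp / (tp + fn), eps), 1 - eps)
    fp_rate = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(norm.ppf(tp_rate) - norm.ppf(fp_rate))

# A word occurring in 40 of 200 answers of the positive category
# and in 50 of 5000 answers of the other categories (hypothetical counts):
print(round(bns_score(tp=40, fp=50, fn=160, tn=4950), 3))
```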

3.3 K-means clustering

The k-means algorithm takes a given set of points $(x_1, x_2, \ldots, x_n)$, where each point is an N-dimensional vector, and tries to partition the points into k sets (Macqueen, 1967). The problem was solved using Lloyd's algorithm, with an average complexity of $\mathcal{O}(knT)$, where n is the number of samples and T the number of iterations. With the size and features of our data, speed was not an issue. K, the number of centroids, is assumed to be the same as the number of expected groupings, so K was set to 8.

The algorithm could be described as follows (Duda et. al., 2000):

1. Choose k, the number of clusters to divide the data set into.
2. Choose k data points at random and assign them as the initial cluster centers.
3. Repeat:
   (a) Assign each data point to its closest cluster center.
   (b) Compute new cluster centers by calculating the mean points.
   Do this until the centroids stop changing location, no point changes its cluster, or another criterion, such as a maximum number of iterations, is met.
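The sketch below is a minimal NumPy version of these steps (Lloyd's algorithm), included for illustration only; it is not the implementation used in the thesis, and the data is randomly generated as a stand-in for the vectorized answers.

```python
# Minimal NumPy sketch of Lloyd's algorithm, following the steps listed above.
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random initial centers
    for _ in range(max_iter):                                  # step 3: repeat
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # (a) assign to closest center
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])                                                     # (b) recompute cluster means
        if np.allclose(new_centroids, centroids):              # stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(100, 5))             # hypothetical vectorized answers
labels, centroids = kmeans(X, k=8)
```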

12 3.3.1 Evaluation

We aimed to compare the clusters found by k-means with the existing categories. A good method to evaluate several clusters against several pre-labeled categories is the V-measure (Hirschberg et. al, 2007). The V-measure is the harmonic mean of homogeneity and completeness. Homogeneity requires that each cluster only contains answers that have been labeled with a single category. Completeness requires that all answers belonging to a particular category are grouped into the same cluster. The V-measure is defined as

\[
V = 2 \cdot \frac{h \cdot c}{h + c} \tag{3.5}
\]

Homogeneity, h, is defined as follows:

\[
h =
\begin{cases}
1, & \text{if } H(C, K) = 0,\\
1 - \dfrac{H(C \mid K)}{H(C)}, & \text{otherwise,}
\end{cases} \tag{3.6}
\]

where $H(C \mid K)$ is 0 if all answers within a single cluster have the same label, and $H(C)$ is the maximum reduction in entropy. Completeness is the mirror image of homogeneity, where $H(K \mid C)$ is 0 if all the answers belonging to one category are clustered into the same cluster.
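For reference, homogeneity, completeness and the V-measure can be computed as in the following sketch (scikit-learn assumed; the labels are a toy example, not thesis data).

```python
# Illustrative sketch: homogeneity, completeness and V-measure with scikit-learn.
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

categories = [0, 0, 0, 1, 1, 1, 2, 2]  # hypothetical manual labels
clusters   = [0, 0, 1, 1, 1, 1, 2, 2]  # hypothetical k-means output

h = homogeneity_score(categories, clusters)
c = completeness_score(categories, clusters)
v = v_measure_score(categories, clusters)  # harmonic mean 2*h*c/(h+c)
print(f"h={h:.2f}  c={c:.2f}  V={v:.2f}")
```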

3.3.2 Implementation

The k-means algorithm chooses the cluster centroid seeds at random, producing different clusters with every run. This motivated us to run the k-means clustering and calculate the V-measure several times for each specific configuration of the data. We then calculated the resulting mean and standard deviation and show the comparative result. To test the equality of the means we used an independent two-sample t-test on the key results.
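A sketch of this procedure, with scikit-learn and SciPy assumed and hypothetical vectorized data, could look as follows.

```python
# Illustrative sketch: repeated k-means runs per configuration, mean/std of the
# V-measure, and an independent two-sample t-test between two configurations.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def repeated_v_measure(X, labels, runs=50, k=8):
    scores = []
    for run in range(runs):
        km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=run)
        scores.append(v_measure_score(labels, km.fit_predict(X)))
    return np.array(scores)

# X_bigram, X_trigram: two hypothetical vectorizations of the same answers; y: category labels
# scores_a = repeated_v_measure(X_bigram, y)
# scores_b = repeated_v_measure(X_trigram, y)
# print(scores_a.mean(), scores_a.std())
# t, p = ttest_ind(scores_a, scores_b)  # independent two-sample t-test on the means
```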

Chapter 4

Results

4.1 Spell correction and statistics

When running all the answers through our spell correction algorithm we found a total of 85555 misspelled words. Out of those 85555 words, 50252 were corrected and 35303 were not. Some words may have been to

Table 4.1: Statistics of the provided data set
Number of answers: 15461
Number of words after cleaning: 362166
Average number of words per answer / standard deviation: 23.4 / 21.99

In table 4.1, statistics of our data after processing are presented. Notice that the standard deviation of the average number of words per answer is large. This is due to the fact that several answers were quite long and contained many words. In the presentation of our experiments, the categories are represented by a number in the range 0-7. In table 4.2 the different categories and their frequencies of answers are shown. Notice that the first two categories have significantly less labeled data; this leads to a skewed data set, which might have caused prominent words of these two categories to be overlooked or to disappear in the abundant number of features. Notice also that category 7 contains almost one third of all the answers. That category contains general answers that did not fit into one of the more specific categories.

Table 4.2: The size and distribution of labels in the pre-labeled data set. There are 8 categories, labeled 0-7.
Category:            0      1      2       3      4      5       6       7       All
Number of answers:   189    213    3740    1216   1472   1800    1733    5098    15461
Size in percentage:  1.22%  1.38%  24.19%  7.86%  9.52%  11.64%  11.21%  32.97%  100%

4.2 Categories

The BNS algorithm extracted the ten most strongly predictive words for each category, presented in table 4.3. We translated the table into English; the translation is presented in table 4.4.

Table 4.3: Top ten most predictive features for each category in Swedish.
Involvering och delaktighet: inbillar, långtidverkande, tvingats, accepterad, accepterades, fundering, föredragit, konsulterade, kunskapsbrist
Emotionellt stöd: bedövningen, känslomässiga, lugnas, samtalstöd, sövningen, trösta, tvivla, undersökning, samtalskontakt
Helhetsintryck: enkelrum, smutsigt, efterrätt, toaletter, sallad, helhetsintrycket, äckligt, ägg, äcklig
Information och kunskap: gipset, härkomst, infon, förstådd, närvara, okunskap, språkförbistringar, utförligare, anvisningar
Kontinuitet och koordinering: sagts, borttappad, journalsystem, förtroende, hörts, kontaktast, sammarbete, samordnad, systemen
Respekt och bemötande: bemötts, nedlåtande, ovänlig, vänligt, förnedrande, nedvärderande, opassande, välbemött, arroganta
Tillgänglighet: telefontider, telefonkö, telefontiden, uppringning, telefontiderna, telefonsvararen, parkeringsplatser, upptaget, öppetider
Övrigt: grundskola, medverka, rutan, kryss, enkätens, formulerad, föranleddes, menas, svarsalternativ

The words give an indication of the content of each category. The idea behind extracting these words was to increase their respective scores so that the clustering algorithm could more easily differentiate between the answers. Judging subjectively, the most predictive words match their corresponding categories.

Table 4.4: Top ten most predictive features for each category in English, translated with Google Translate.
Involvement and Participation: imagination, long-term, forced, accepted, accepted, reflection, preferred, consulted, lack of knowledge
Emotional Support: anesthesia, emotional, calm, call support, sowing, comfort, doubt, examination, conversation contact
General Impression: single room, dirty, dessert, restrooms, salad, overall impression, disgusting, eggs, disgusting
Information and knowledge: plaster, descent, info, understood, attend, lack of knowledge, language enhancements, more detailed, instructions
Continuity and coordination: said, lost, journal system, trust, heard, contact, collaborate, coordinated, systems
Respect and Treatment: treated, condescending, unfriendly, friendly, degrading, negative, disassemble, well-behaved, arrogant
Accessibility: phone times, phone queue, phone time, call, phone times, answering machine, parking spaces, busy, opening times
Other: elementary school, take part, box, tick, survey, formulated, caused, meant, answer options

4.3 V-measure

The parameters of the k-means algorithm were set to K = 8 clusters, and for a single run the maximum number of iterations was set to 300. K-means was run with centroid seeds placed at 10 different random positions, and the best run in terms of inertia was kept. We did 50 consecutive runs for every type of test and calculated the standard deviation. We show the different V-measures when shifting maxdf and mindf for the different feature selection methods, and compare the best results with the BNS scoring.

4.3.1 TF — TF-IDF — Bigram — Trigram

In this section, comparative results for the V-measure are presented. Every specific configuration with different values of maxdf and mindf is plotted, both for spell-corrected data and for raw data, to see if our spelling algorithm improved the result.


Figure 4.1: Comparative results of V-measure with different thresholds on maxdf. Notice that the rightmost values represent no threshold.

Figure 4.2: Comparative results of V-measure with different thresholds on mindf. Notice that the leftmost values represent no threshold.

Figure 4.1 shows that when lowering the threshold on maxdf the V-measure reaches a maximum peak; beyond that peak the V-measure converges to a very low value. This tells us that some common words make the clustering harder, as the short answers may resemble each other when containing similar words. Compare with figure 4.2, where there is little or no improvement when changing the threshold on mindf, meaning that removing the least common words did not clearly improve the V-measure. Also notice that the bigram representation produced the highest overall mean in both figures when acting on the spell-corrected data.

4.3.2 Final results

Finally, we used the most predictive words calculated by the BNS algorithm to weigh the terms before clustering with k-means. We ran k-means 50 times and calculated the mean and standard deviation of the V-measure. This is plotted in figure 4.3 against the different configurations and their highest obtained V-measures, taken from figure 4.1.

Figure 4.3: The 9 different text representations and their highest obtained V-measure. The V-measure is normalized against the maximum V-measure obtained. Notice that the text representations ending with "raw" refer to processing the data without spell correction.

When testing the equality of the means, there was a significant difference between the trigram and bigram scores (two-sample t-test, p < 0.001). When testing the equality between Bigram and Bigram-Raw there was also a difference, though only at a weaker significance level (two-sample t-test, p < 0.20). The BNS representation performed worst, contradicting our earlier beliefs. With high statistical certainty, our result contradicted the hypothesis stated by Foreman (2014) that BNS scoring would result in a higher V-measure compared to tf-idf. As BNS supposedly works especially well in cases where the data set is skewed, as in our case, BNS should theoretically give a good representation in terms of clustering outcome.

4.4 Distribution of answers

It is interesting to compare the distribution of labels in each cluster and, conversely, how the answers belonging to each category were distributed among the clusters. Figure 4.4 shows the distribution of categories in the 8 clusters. Clusters 0, 1, 4 and 7 have distinct peaks, meaning that these clusters have some correlation to their respective categories. For clusters 2 and 5 it is harder to distinguish any specific relationship to a single category. Figure 4.5 shows the answers in each category and their distribution among the clusters. Notice that categories 0, 1 and 3 resemble each other in their distribution.

Figure 4.4: The figure shows eight plots representing the eight different clusters for a single run. Each bar corresponds to the number of answers belonging to that category. Note that it is only meant to give an idea of how the clusters correspond to the different categories. The method used for this clustering was bigram with maxdf = 0.1, mindf = 0.0006.

Figure 4.5: The figure shows eight plots representing the eight categories. Each bar corresponds to the number of answers that were clustered into one of the eight clusters for a single run. Note that it is only meant to give an idea of how the categories correspond to the clusters found. The method used for this clustering was bigram with maxdf = 0.1, mindf = 0.0006.

Chapter 5

Discussion

5.1 General findings

The primary finding in our results is that the choice of text representation and filtering method is important when clustering short texts. There is a noticeable trend in our results that applying feature selection improves the V-measure. This indicates that some words or features have a negative impact on distinguishing the answers. Removing common words, for instance conjunctions, reduces similarities between the categories. This led to a better clustering in our thesis, which highlights the usefulness of feature selection. Adding features, such as n-grams, that try to extend the text representation by its semantics in a simple way does improve the results. Extracting more relevant information, and thereby increasing the number of features from the text, improves the basis for finding more suitable patterns. Extending the representation of underlying structures in the text, such as semantics, will probably further improve the extraction of information. In the field of NLP there are challenges concerning how to represent the intended meaning of a sentence, a challenge also reflected in our results.

For answers that contain little or no information, it is difficult to determine whether they belong to a specific category or not. In a corpus containing many answers with low information, such answers will likely scatter among several clusters, reducing the V-measure. Answers that lie in between different clusters would have the same reducing effect on the resulting V-measure. It would therefore be interesting to use hierarchical clustering algorithms, where answers with no obvious topic are clustered into an indeterminate category. This would make it possible to single out the answers of most importance, i.e. those answers that strongly belong to a specific category or cluster. These answers would be more valuable in the overall analysis process for the surveys.

Overall, the results showed a poor clustering when compared with the assigned labels. The conclusion can be drawn that the existing categories are either too broad or too narrow. This implies that narrowing the number of clusters to a predetermined number and expecting a connection with human-labeled categories may be too ambitious. On the other hand, the resulting clusters may indicate how well a predefined category represents the intrinsic subjects. Some categories would benefit from being separated and others could be joined, something that could be used to improve the further construction of surveys.

23 5.2 Analysis of methods

Data handling

The information given in each answer was of varying quality and quantity. Occurrences of homonyms and antonyms in a short sentence might have altered the interpretation substantially, which may in turn have led to difficulties in telling answers apart. This can be connected to the challenge of making the computer understand, generate and manage a natural language. Our methods for cleaning the data were elementary and failed to find corrections for every misspelled word. For further improvement of the general result, a more advanced spelling method might have helped, as we could see a general trend that spell correction of the data produced better results. This implies the importance of refinement when handling noisy text, such as survey answers, tweets and messages.

Text representation

We can draw the conclusion that the effects of different text representations extensively alter the outcome of the clustering. Representations that correlate words in a sentence try to extract semantic relations. Our methods did this in a simple way, which might have included correlations that did not exist. This might have resulted in an abundant number of features that did not enhance the information extracted. This is a limitation of simple methods; using more complex methods might have increased the performance, but this is left as an open question. Our approach of using a supervised method to filter the features showed surprisingly bad results. Generally speaking, if information is known beforehand then it should be possible to use it to improve the clustering. From this thesis no conclusion can be drawn from that result. It may have been affected by technical issues in our code or by the data set at hand.

Clustering

The value of k in k-means was pre-determined with no further reflection regarding the data set and its characteristics. The data set's ability to be differentiated into k different clusters poses significant limitations on the results. Imposing such limitations on a corpus can result in a bad clustering. The results imply that there are reasons to consider a different value of k, to better adapt to the data at hand. As no clustering algorithm is in general superior to another, but rather depends on the data set, other clustering algorithms may be more suitable for our problem. This poses limitations on the general conclusions regarding how well predicted clusters correspond to manually labeled documents.

24 Evaluation

Evaluation of the clusters was done under the assumption that the manual labeling of the answers corresponds to the true categorization. The manual labeling can be considered to have a low bias but a high variance. As the entropy is presumably high when handling rather subjective categories, and adding the fact that there are 8 different clusters, the evaluation was a hard task. Not every kind of subjective judgment of the clusters can be validated.

5.3 Future work

When evaluating the results, adding more features improved them. Considering this, it would be interesting to add features such as previously answered questions, information about demographics and the length of the text answers. If the assumption is made that these features have a connection to the open answers, the clustering results may be improved. Further, it would be interesting to investigate whether open text questions could entirely substitute for closed questions, if the same information could be extracted from unstructured as from structured formats.

Chapter 6

Conclusion

This thesis has examined how well an unsupervised approach performs in imitating pre-labeled categories. This approach could not be shown to produce results equivalent to the manual labeling. Our unsupervised method can instead be used to validate the correlation between structures in a data set and imposed structures, such as the predetermined categories of a survey. Hence the construction of surveys and the interpretation of information from open-ended answers may benefit from our findings. Furthermore, we investigated the effects of different text representations on the clustering outcome. The importance of text representation when working with short, noisy text documents has been shown before, and this thesis reinforces that statement. Because of the benefits of a semantic-based text representation, as shown in this thesis, further development should be aligned with this conclusion. The problem of natural language understanding is considered to be among the most difficult problems in the field of artificial intelligence. Nonetheless, representing the text with relevant information that tries to comprehend parts of the natural language is today a sensible approach.

Bibliography

(Jick, 1979) T.D. Jick. Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, 1979.

(Salant et. al., 1994) P. Salant and D.A. Dillman. How to conduct your own survey. New York, Wiley and sons, 1994.

(Hong et. al., 2010) L. Hong and B.D. Davidson. Empirical Study of Topic Modeling in Twitter, 2010.

(Kolcz et. al., 2014) S. Yang, A. Kolcz, A. Schlaikjer and P. Gupta. Large- scale high-precision topic modeling on Twitter. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014.

(Duda et. al., 2000) R.O. Duda, P.E. Hart, D.G. Stork. Pattern Classification. Wiley-Interscience, 2000.

(Russom, 2007) Philip Russom. BI search and text analytics. TDWI Best Practices Report, 2007.

(James et. al., 2013) G. James, D. Witten, T. Hastie and R. Tibshirani. An introduction to Statistical Learning. New York, Springer, 2013.

(Blei, 2012) David Blei. Probabilistic Topic Models. Communications of the ACM, Pages 77-84, 2012.

(Turing, 1950) A.M. Turing. Computing Machinery and Intelligence. Mind 59, Pages 433-460, 1950.

(Johnson, 2009) M. Johnson. How the statistical revolution changes (computational) linguistics. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, 2009.

(Cavnar et. al., 1994) W.B. Cavnar and J.M. Trenkle. N-gram-based text categorization. Environmental Research Institute of Michigan, 1994.

(Alonso et. al, 2011) M. Alonso, J. Malpica and A. Martinez de Agirre. Consequences of the Hughes phenomenon on some classification techniques, 2011.

(Punch et. al., 1993) W.F. Punch, E.D. Goodman, M. Pei, L. Chia-Shun, P. Hovland and R. Enbody. Further Research on Feature Selection and Classification Using Genetic Algorithms, 1993.

(Chandrashekar, 2013) G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers and Electrical Engineering 40, Pages 16–28, 2013.

(Jain, 2009) A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, Volume 31, Issue 8, Pages 651-666, 2009.

(Macqueen, 1967) J. Macqueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Pages 281-297, 1967.

(Huang, 2008) A. Huang. Similarity measures for text document clustering. Proceedings of the New Zealand Computer Science Research Student Conference, Pages 1-8, 2008.

(Halkidi et. al., 2001) M. Halkidi, Y. Batistakis and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 2001.

(Novikov, 2012) E. Sivogolovko, B.A. Novikov. Validating cluster structures in data mining tasks. Proceedings of the 2012 Joint EDBT/ICDT Workshops, 2012.

(Amayri et. al., 2013) O. Amayri and N. Bouguila. Online news TD and tracking via localized feature selection. Neural Networks (IJCNN), The 2013 International Joint Conference Dallas, TX, 2013.

(Desai et. al., 2015) N. Desai, M. Narvekar. Normalization of Noisy Text Data. International Conference on Advanced Computing Technologies and Applica- tions, 2015.

(Robertson, 2004) S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation 60, 2004.

(Zhang et. al., 2011) W. Zhang, T. Yoshida and X. Tang. A Comparative Study of TF-IDF, LSI and multi-words for Text Classification. Expert Systems with Applications 38, Pages 2758-2765, 2011.

(Foreman, 2014) G. Foreman. A Pitfall and Solution in Multi-Class Feature Selection for Text Classification, Proceedings of the twenty-first international conference on Machine learning, Page 38, 2004.

(Zong et. al., 2015) W. Zong, F. Wu, L. Chu and D. Sculli. A discriminative and semantic feature selection method for text categorization. International Journal of Production Economics, Vol 166, Pages 215-222, 2015.

(Liu et. al., 2010) Y. Liu, S. Xiao, X. Lv and S. Shi. Research on K-Means Text Clustering Algorithm Based on Semantic, Proc. 10th International Conference on Computing, Control and Industrial Engineering (CCIE’10), vol.1, Pages 124- 127, 2010.

(Ma, 2014) J. Ma. Improved K-Means Algorithm in Text Semantic Clustering. The Open Cybernetics & Systemics Journal, 8, Pages 530-534, 2014.

(Hirschberg et. al, 2007) J. Hirschberg and A. Rosenberg. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07), Pages 410-420, 2007.
