2012 11th International Conference on Machine Learning and Applications

A Machine Learning based Topic Exploration and Categorization on Surveys

Clint P. George, Daisy Zhe Wang, and Joseph N. Wilson
Dept. of Computer & Information Science & Engg., University of Florida, Gainesville, USA
{cgeorge, daisyw, jnw}@cise.ufl.edu

Liana M. Epstein, Philip Garland, and Annabell Suh
Dept. of Methodology, SurveyMonkey, Palo Alto, USA
{liana, philg, annabell}@surveymonkey.com

Abstract—This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe the different pre-processing steps we considered for removing noise from the multilingual survey text. Second, we explain our strategy to automatically extract survey categories from surveys based on topic models. Third, we describe different methods to cluster questions under survey categories and group them based on relevance. Last, we describe our experimental results on a large group of unique, real-world survey datasets in the German, Spanish, French, and Portuguese languages, and our refining methods to determine meaningful and sensible categories for building question banks. We conclude this document with possible enhancements to the current system and impacts in the business domain.

Keywords-topic modeling; survey clustering; fuzzy clustering; categorization; category

I. INTRODUCTION

As the amount of text data available keeps rising, it becomes challenging for people to locate and track the relevant information they require. We are particularly interested, within the domain of multilingual survey texts, in building language-independent tools for topic discovery, clustering, and ranking of surveys and their questions. Effectively addressing the potentially huge amount of information contained in a large collection of surveys leads us to use tools for automatic text summarization and topic extraction. Topic modeling methods such as Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are designed to assist with these types of problems.

Conventional survey designer systems such as SurveyMonkey provide manually designed, category-specific¹ survey templates (e.g., an Education template, a Customer Feedback template, etc.) to ease the survey building process [1]. During survey building, template questions can be customized or new questions can be added based on user needs. One disadvantage of this type of system is that the manual labor required to build such templates is high. Similarly, the template building process does not consider any survey question usage statistics (from the existing surveys in the system) and is a language-specific task. To address this, we are building tools that let the category-specific template building process use much less manual effort and employ a language-independent system design. Moreover, our proposed system can automatically find commonly occurring categories from multilingual survey data using survey questions' word statistics.

¹We use category and topic interchangeably throughout this paper.

We focus on the task of automatically clustering surveys and questions and ranking them on relevance to a specific topic or survey. This includes challenges such as (a) representing user surveys and questions in a machine-readable form by removing noise terms and stop-words, (b) employing machine learning models that can learn topics from surveys and categorize them with minimal manual intervention, (c) post-processing strategies on the learned model for survey- and question-clustering, and (d) experiments on our unique set of multilingual (Spanish, German, French, and Portuguese) survey datasets from SurveyMonkey. Fig. 1 and Fig. 2 show visualizations of survey clustering and question clustering. The different colors represent different topic content in the survey text.

Figure 1. Survey clustering

The potential impact of this project in the business domain is unparalleled. There is not, to our knowledge, any other "question bank" in languages such as Spanish, German, and French, nor any other automatic survey and question categorization and ranking system in existence. The categories that emerged from our system were qualitatively different, owing to cultural differences both in the way questions are asked in different languages and in the information that people living in different countries most want to find out. Thus, the process we developed supports automating the construction of culturally relevant question banks from existing survey corpora.

Topic models are well suited to a language-independent approach to clustering and ranking surveys because the bag-of-words document model, upon which they are based, is largely independent of semantic structures. The inference is based only upon the word co-occurrence frequencies in each document of a given corpus. We use a topic modeling algorithm based on HDP [3] to discover topics from surveys. The estimated topics are further used to rank relevant surveys in the corpus and group them (survey clustering). Topic models also provide relevant words and their probabilities for a given topic; domain experts can use these words to name the learned categories or topics with minimal manual effort. We also considered the problem of grouping similar questions together (question clustering) to assist survey designers. We used LSI to represent questions due to its computational efficiency compared to more complex models such as LDA and HDP, and we implemented our question clustering system based on fuzzy clustering [4] of the questions represented in LSI space. Our results show that our method can automatically find many manually defined survey categories, and can group topically similar questions as well as surveys whose questions are in a language different from that of the survey group.

Figure 2. Question clustering: q(s1) represents a question from survey 1.

One of the challenges we faced in designing the multilingual survey categorization system was the demographic and cultural variation in language usage by people from different countries. The variation in question structure was quite visible even within the formal environment imposed by the survey format. For example, in the case of Spanish surveys, many specific words and phrases were used to ask questions politely; during topic modeling inference, these caused trouble in forming relevant topics from the survey text. Similarly, for German, many surveys include a large set of common words from colloquial phrases. This caused the topic-modeling-based ranking and clustering algorithms to form overlapping, poorly distinguishable question and survey groups. We try to tackle some of these problems by using language-specific lemmatizers and stop-word lists (Section III).

Our approach

Our proposed system uses topic models to model the corpus (document collection) of surveys. Topic models represent documents as bags of words, without considering word order as being of any importance. These models have the ability to represent large document collections with lower-dimensional topics, which represent clusters of similarly behaving words. In addition, the document words are assumed to be generated from topic-specific multinomials, and the topic for a particular word is chosen from that document's topic mixture. These topics are assumed to be generated over the corpus vocabulary from a Dirichlet distribution. Blei et al. [2] give a detailed description of this language model and its assumptions.

The analysis of topic models depends upon exploring the posterior distribution of model parameters and hidden variables conditioned on the observed words. The model parameters are the corpus-level topics or concepts (sets of words with corresponding probabilities) and the document-level topic mixtures. The original topic model assumes that one knows the number of topics in the corpus beforehand. However, Teh et al. [3] solve this issue with a framework called the Hierarchical Dirichlet Process (HDP), which can learn a variable number of topics automatically from the data.

This paper is organized as follows. Section II describes the state-of-the-art models in the area of document topic modeling, language-independent text processing, and survey clustering. Section III describes our overall system architecture and algorithms. Section IV describes details about our unique multilingual datasets, evaluation metrics, results, and analysis. Section V concludes this paper.

II. RELATED WORK

Topic models are often used to characterize plain text documents and to extract topical content from them. One such model, LSI, can group together words and phrases that exhibit synonymy (similar meanings), e.g., car and automobile. The LSI method typically performs matrix factorization over a term-document matrix (e.g., a TF-IDF matrix), which represents the occurrence of words in documents; using eigenvalue decomposition, it identifies patterns in the relationships between document terms and concepts or topics. However, we used LSI to cluster questions under a given survey topic and to build topical question banks, because probabilistic topic models such as LDA and HDP are less effective at modeling small documents [5].

In the probabilistic topic modeling setting (e.g., LDA and HDP) [2], [3], a topic is represented by a multinomial distribution over the words of a vocabulary. Topic modeling allows us to represent the properties of a large collection of documents containing numerous words with a small collection of topics. Each document is described by a mixture of topics, and words are chosen from the multinomial that results from the mixture of that document's topic multinomials. Topic models are designed to handle both polysemy (single words with multiple meanings, such as model and chip) and synonymy. We use topic modeling algorithms such as HDP [3] and LDA [2] to discover topics from surveys.

Survey questions are usually short, which differs substantially from conventional document information retrieval and text mining problems. Grant et al. [5] tested the applicability of topic-modeling-based approaches on a Twitter dataset and found that the restricted length of tweets prevents such models from exploiting their full potential; aggregating tweets to train the topic model can yield an improved set of topics. The research of Hong et al. [6] reports similar observations on a different Twitter dataset. In this paper, we use a similar strategy to model surveys: we aggregate the questions of each survey and consider the result a single document for topic modeling.

Francis and Flynn [7] described several methods to perform text mining on surveys (the 2008 CAS Quinquennial Membership Survey). They explained methods such as TF-IDF, k-means, and hierarchical clustering on the survey question and answer words, based on the R package tm. Here, however, we present a comprehensive set of experiments on multilingual survey datasets to which we apply advanced statistical models such as LDA, HDP, and LSI.

III. SYSTEM DESIGN

This section explains our methodology and system architecture. Fig. 3 gives a graphical representation of our prototype system. It consists of two main modules: one that is language dependent and another that is language independent. The following subsections explain the individual system components in detail.

A. Data pre-processing

This component is part of the language-dependent system module. We designed the preprocessor in such a way that a change in the input language does not affect the rest of the system components. First, we tokenize the raw survey questions with a tool that depends on the survey's source language. For Latin-character-based languages such as Spanish, German, and French, we build the tokenizers using the Python Natural Language Toolkit (NLTK) [8] and predefined regular expressions. For Asian languages such as Japanese, we use morphology-based segmenters (e.g., MeCab and TinySegmenter for Japanese text) to tokenize the survey text². Second, we standardize tokens by removing noise terms and stop-words, using language-dependent stop-word lists. Third, we represent each survey or question as a document in a sparse bag-of-words format, after building a vocabulary of corpus words (separately for each language). Finally, we use these documents as input to the topic learning model which, in turn, learns clusters from the term co-occurrence frequencies of the corresponding documents. See Fig. 3 for more details.

²We excluded Asian languages from our analysis because the datasets were too small to capitalize on the results.
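To make this pipeline concrete, the following is a minimal sketch assuming NLTK [8] and gensim [10] as the underlying libraries; the tokenizer pattern, the placeholder data in raw_surveys, and the stop-word language are illustrative stand-ins for the language-specific resources described above (the NLTK stop-words corpus must be downloaded beforehand).

```python
# A minimal sketch of the pre-processing pipeline (Section III-A).
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from gensim import corpora

tokenizer = RegexpTokenizer(r"\w+")           # Latin-character languages
stop_words = set(stopwords.words("spanish"))  # language-dependent list

def survey_to_tokens(questions):
    """Aggregate a survey's questions into one token list (Section II)."""
    text = " ".join(questions).lower()
    return [t for t in tokenizer.tokenize(text)
            if t not in stop_words and not t.isdigit()]

raw_surveys = [["¿Cómo califica nuestro servicio?",
                "¿Recomendaría usted el producto?"]]  # placeholder data
docs = [survey_to_tokens(qs) for qs in raw_surveys]

# Build the per-language vocabulary, dropping rare words (the paper's
# minimum overall corpus frequency of 10 is approximated here via
# document frequency), then produce sparse bag-of-words vectors.
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=10, no_above=1.0)
bows = [dictionary.doc2bow(d) for d in docs]
```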
B. Topic learning

As discussed earlier, topic models have the ability to learn semantic relationships among words from an observed collection. In this system, topic modeling is used for three main purposes: i) categorizing and ranking surveys, ii) survey sub-categorization and ranking, and iii) clustering survey questions under an identified survey sub-cluster.

Survey ranking is performed to identify relevant surveys that belong to general (top-level) topics such as market research, education, and sports. To perform ranking, we first compute the topic mixtures of the survey documents, which are formed by combining survey questions. To estimate the topical structure from the survey documents, we use HDP [3], which can learn the number of topics automatically (one of our primary goals) along with the topic model from large document collections. A detailed theoretical review of HDP and its inference methods is presented by Teh et al. [3]. We use a modified version of the HDP implementation by Wang and Blei [9] in our experiments. The major components of a learned HDP model are the corpus-level topic-word association counts and the document-level topic mixtures. Each topic in the estimated model is represented by its topic-word probabilities; these words are used by language experts to name survey categories. The document-level topic mixtures give an idea of the topicality of a particular survey with respect to a given topic, which is also quite useful for finding similar surveys and grouping them together.
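For illustration, the sketch below runs an HDP topic model over the pre-processed surveys. Note that the system described here uses a modified version of Wang and Blei's implementation [9]; gensim's HdpModel is shown only as a readily available stand-in to make the inputs and outputs concrete.

```python
# Illustrative topic learning step (Section III-B); `bows` and `dictionary`
# come from the pre-processing sketch above.
from gensim.models import HdpModel

hdp = HdpModel(corpus=bows, id2word=dictionary)

# Corpus-level topics: top words with probabilities, which language
# experts can use to name the survey categories.
for topic_id, top_words in hdp.show_topics(num_topics=10, num_words=8,
                                           formatted=False):
    print(topic_id, [(w, round(float(p), 3)) for w, p in top_words])

# Document-level topic mixture of one survey; these mixtures feed the
# relevance ranking of Section III-C.
theta_d = hdp[bows[0]]  # sparse list of (topic_id, weight) pairs
```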

Figure 3. The system design

From our observations of the top-level survey categorization explained above, we found that some of the topics discovered by the HDP estimation process can be further divided into subtopics, and the corresponding surveys can be ranked by subtopic relevance. For modeling survey subtopics, we use the original LDA model [2] because it is more accurate and less computationally expensive than HDP. We use the gensim package's [10] online variational inference implementation for the model estimation process.

Conventional topic modeling algorithms are designed to work on larger documents than survey questions (Section II): the chance of a term re-occurring within the same question is quite low compared to the typical documents used in the topic modeling literature. So, to cluster questions for building question banks, we represent questions in a much simpler format such as TF-IDF and perform LSI, which lets us represent the questions in the smaller LSI space rather than the vocabulary space.

C. Survey relevance ranking

We use survey relevance ranking to group together surveys belonging to an estimated topic (Fig. 1). We use individual surveys' estimated document topic mixtures, \hat{\theta}_d, to rank them on relevance given a topic or set of topics. For a given topic set T \subset K, we calculate

m(d) = \sum_{k \in T} \ln \hat{\theta}_{d,k} + \sum_{j \notin T} \ln(1 - \hat{\theta}_{d,j})    (1)

for all surveys d = 1, 2, ..., D in the corpus and sort them to rank their relevance. Here, we assume that the document topic mixtures \hat{\theta}_d satisfy the multinomial property \sum_{j=1}^{K} \hat{\theta}_{d,j} = 1. Intuitively, this equation maximizes the score of a topic set T \subset K given a document: a document with a high value of this score is highly relevant to that topic set.
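A minimal sketch of the scoring in (1), assuming the estimated document topic mixtures have been collected into a dense D x K NumPy array theta (rows summing to one); the epsilon guard against log(0) is our addition.

```python
import numpy as np

def relevance_scores(theta, topic_set, eps=1e-12):
    """Return m(d) from Eq. (1) for every survey d."""
    in_T = np.zeros(theta.shape[1], dtype=bool)
    in_T[list(topic_set)] = True
    return (np.log(theta[:, in_T] + eps).sum(axis=1) +
            np.log(1.0 - theta[:, ~in_T] + eps).sum(axis=1))

theta = np.array([[0.7, 0.2, 0.1],    # toy mixtures for two surveys, K = 3
                  [0.1, 0.1, 0.8]])
ranking = np.argsort(-relevance_scores(theta, topic_set={0}))
# ranking[0] == 0: the first survey is the most relevant to topic 0
```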

D. Question clustering and ranking

One of the goals of this project is to design a system that can recommend useful, relevant survey questions for building question banks, given a selected survey topic (e.g., education). Once we have the surveys that belong to a given topic, we group similar survey questions into question groups and rank them within each group based on several ranking scores. We first apply fuzzy C-means (FCM) clustering [4], [11] to the set of survey questions represented in LSI space (Section III-B). Second, we rank the questions that belong to a given cluster based on measures such as string matching, fuzzy set matching [12], and distance from the cluster centroid. Finally, we remove duplicate questions and present the ranked questions to survey designers (Fig. 2).

IV. EXPERIMENTAL RESULTS AND ANALYSIS

This section describes our datasets, experiments, and observations.

A. Dataset description and experimental setup

We conducted our experiments on both research and real-world datasets from SurveyMonkey, which are in a variety of languages including English, Spanish, German, French, Portuguese, and Japanese. In this paper, we only describe results from the research datasets of the Spanish, German, French, and Portuguese languages. We only consider surveys having at least five questions for topic modeling. Similarly, for vocabulary construction we only consider words that have a minimum overall corpus frequency of 10. A detailed description of the datasets is given in Table I. The reported numbers (#) of surveys and vocabulary sizes are approximate counts, computed after removing noise (stop-words, duplicate surveys, etc.) from the survey text. We also removed English surveys and words from the foreign-language surveys using an English dictionary. We observed that, when we perform topic modeling on language-specific datasets, most foreign-language words were grouped together into foreign-language topics. For example, surveys in the German collection that contained French-language questions had a very high probability for one particular topic that they all shared. We ignored those foreign-language topics and surveys in our analysis.

Table I
RESEARCH DATASETS

Language     # of surveys   Vocabulary size   # of stop-words
Spanish      7.3K           6.2K              350
German       3K             2.4K              400
French       9.4K           5.3K              160
Portuguese   2K             1.5K              240

B. Evaluation metrics

For the English surveys, we had a set of manually identified categories and their associated manually designed survey questions (category templates) [1]. We used them for evaluating the automatically generated survey categories and their relevance-ranked surveys. This manual evaluation was performed by domain experts in survey design. One of the key aims of this project is to automatically identify topics from the multilingual survey datasets and compare them with the manually identified English survey categories.

The ranking score (1) can also be used to evaluate the cohesiveness of the identified sets of categorical surveys. High survey scores represent high relevancy to the given category. So, for each category k in a dataset, we compute the mean of the relevancy scores of all grouped surveys:

\mu_k = \mathrm{mean}_{d \in D_k} \exp(m(d))    (2)

where m(d) is from (1) and D_k is the set of grouped surveys for topic k. High values of \mu_k indicate that the ranked surveys in D_k are highly cohesive and relevant to topic k. The manual evaluation of survey categories from different languages by our domain experts supports this fact (see Section IV-D).

Similarly, to evaluate the cohesiveness of question clusters, we compute, for each question, the distance from the cluster centroids (generated by FCM). Then, we compute the distance means for each question cluster; a small mean value is a good indication of a cohesive cluster. We also use questions' overall appearance frequencies in their associated surveys to compute the questions' importance.
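The following self-contained sketch illustrates the fuzzy C-means step of Section III-D together with the centroid-distance cohesiveness measure above. The system builds on existing FCM implementations [4], [11]; this NumPy version, and the random stand-in for the LSI-space question vectors X (which in the real pipeline come from TF-IDF followed by LSI), are purely illustrative.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Return (centroids, membership matrix U) for c fuzzy clusters."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))          # rows sum to one
    for _ in range(iters):
        Um = U ** m                                     # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                              axis=2) + 1e-12
        U = dist ** (-2.0 / (m - 1.0))                  # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.random.default_rng(1).normal(size=(200, 50))     # toy LSI vectors
centers, U = fuzzy_cmeans(X, c=10)
labels = U.argmax(axis=1)

# Cluster cohesiveness: the mean distance of member questions to their
# centroid; a small mean indicates a tight, coherent question cluster.
for k in range(len(centers)):
    members = X[labels == k]
    if len(members):
        print(k, np.linalg.norm(members - centers[k], axis=1).mean())
```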
C. Results

We conducted our initial experiments on a toy dataset using the HDP algorithm [3], which show that it can learn a considerable number of topics (e.g., ~100 topics from 8K surveys). Based on the estimated topic words (e.g., student, teacher), we give an appropriate name (e.g., education) to each of the topics found by HDP. Most of the manually identified categories [1] are automatically discovered by HDP. Moreover, we notice that HDP can find additional meaningful topics. In addition, we ranked relevant surveys for a given topic based on (1). To determine subtopics under an identified topic (e.g., education), we collected all the relevant surveys belonging to that topic (based on (1) and a predefined threshold) and performed another topic modeling estimation on them.

Our experiments on the Spanish, German, French, and Portuguese research datasets identified several existing topics and new subtopics. Tables II and III show subsets of the topics and subtopics found from the Spanish and French datasets.

Table II
SUBSET OF CATEGORIES FOUND FROM A SPANISH REPRESENTATIVE DATASET

Categories                    Sub-categories
Business                      Partnerships, web security
Parent satisfaction surveys   Football, grade levels, alumni
Customer feedback             Product evaluation
Education                     Alumni, campus selection, student satisfaction survey, course evaluation

Table III
SUBSET OF CATEGORIES FOUND FROM A FRENCH REPRESENTATIVE DATASET

Categories        Sub-categories
Market research   Consumer preferences, product feedback
Human resources   Manager evaluation, workplace evaluation, employee evaluation, facilities and services
Just for fun      Transportation, vacation travel, media usage

D. Discussion

We observed that Spanish surveys behave in a manner similar to the English surveys, i.e., the system found similar, meaningful topics. However, on the German and French datasets, our bag-of-words-based topic modeling did not perform as well as expected: the majority of the topics were polluted with noise, and it was hard to find meaningful topics. We noticed that the survey and question structures were considerably different from those in English and Spanish. For example, the German surveys contain many questions with diverse topical ranges. We believe the differences in the style of question and survey formation affect the performance of the topic-modeling-based survey categorization algorithm.

Our observations are supported quantitatively by the means of the categorical survey ranking scores (2), \mu_k, which represent the quality of the corresponding categories (Fig. 4 and Fig. 5). For the French, Portuguese, and German datasets, we can see a sudden change in \mu_k after the first few topic sets (D_k). We observed that, after the first few topic survey sets, the topic mixtures of the surveys belonging to the remaining sets become almost uniform. This adversely affects the ranking scores m(d) and their means \mu_k, and indicates that the first few topic sets are good candidates for question bank building. We believe that we can further improve the results by using better lemmatizers for the German and French survey text.

Figure 4. The means (\mu_k) of the survey ranking scores of topics (sorted in descending order of \mu_k, (2)) from the Portuguese, Spanish, German, and French research datasets.

Figure 5. The survey counts of the top-ranked topics using \mu_k from the Portuguese, Spanish, German, and French research datasets. We can see that the \mu_k do not depend on the cardinalities.

E. Future work

We have noticed that the majority of survey questions follow the question types (structures) of Yes/No questions (the answers are "Yes" or "No") and question-word questions (e.g., What, When, How, etc.). Topic models may not be able to distinguish these question types, since they produce a global view of the corpus-wide topics; they usually cluster these common words into a single cluster. It may therefore be a good idea to group questions into question-type classes beforehand, which may help us form another sublayer of topics based on question types. We also plan to learn language-specific model hyper-parameters [13] as an alternative to removing language-specific stop-words.

V. CONCLUSION

In this paper, we described the problem of automatic topic discovery and categorization for multilingual surveys and questions. We proposed a system to tackle this problem based on the well-known topic modeling frameworks Latent Semantic Indexing, Latent Dirichlet Allocation, and the Hierarchical Dirichlet Process, together with fuzzy clustering methods. We also discussed our experimental results and refining methods to improve the question clusters and survey clusters, so that they can be used for commercial survey question bank generation.

ACKNOWLEDGMENT

The authors would like to acknowledge the support for this project from SurveyMonkey. We would like to thank Carlos Ibarra and Eric Esteban for helping us evaluate the results.

REFERENCES

[1] (2012) SurveyMonkey. [Online]. Available: http://www.surveymonkey.com
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," JMLR, vol. 3, pp. 993–1022, March 2003.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
[4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 1981.
[5] C. Grant, C. P. George, C. Jenneisch, and J. N. Wilson, "Online topic modeling for real-time Twitter search," TREC 2011 Notebook, 2011.
[6] L. Hong and B. D. Davison, "Empirical study of topic modeling in Twitter," in Proc. of SOMA '10. NY, USA: ACM, 2010, pp. 80–88.
[7] L. Francis and M. Flynn, Text Mining Handbook, Casualty Actuarial Society E-Forum, Spring 2010.
[8] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[9] C. Wang and D. M. Blei, "A split-merge MCMC algorithm for the hierarchical Dirichlet process," CoRR, 2012.
[10] R. Řehůřek and P. Sojka, "Software framework for topic modelling with large corpora," in Proc. of the LREC 2010 Workshop. ELRA, 2010, pp. 45–50.
[11] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel, Misc Functions of the Dept. of Statistics, TU Wien (R package e1071), 2011.
[12] A. Cohen. (2011) Fuzzy string matching in Python. [Online]. Available: https://github.com/seatgeek/fuzzywuzzy
[13] H. Wallach, D. Mimno, and A. McCallum, "Rethinking LDA: Why priors matter," NIPS, vol. 22, pp. 1973–1981, 2009.
