A Machine Learning Based Topic Exploration and Categorization on Surveys

2012 11th International Conference on Machine Learning and Applications A Machine Learning based Topic Exploration and Categorization on Surveys Clint P. George, Daisy Zhe Wang, and Joseph N. Wilson Liana M. Epstein, Philip Garland, and Annabell Suh Dept. of Computer & Information Science & Engg. Dept. of Methodology University of Florida SurveyMonkey Gainesville, USA Palo Alto, USA {cgeorge, daisyw, jnw}@cise.ufl.edu {liana, philg, annabell}@surveymonkey.com Abstract—This paper describes an automatic topic extraction, categorization, and relevance ranking model for multilingual surveys and questions that exploits machine learning algorithms such as topic modeling and fuzzy clustering. Automatically generated question and survey categories are used to build question banks and category-specific survey templates. First, we describe different pre-processing steps we considered for removing noise in the multilingual survey text. Second, we explain our strategy to automatically extract survey categories from surveys based on topic models. Third, we describe different methods to cluster questions under survey categories and group them based on relevance. Last, we describe our experimental results on a large group of unique, real-world survey datasets from the German, Spanish, French, and Portuguese languages and our refining methods to determine meaningful and sensible categories for building question banks. We conclude this document with possible Figure 1. Survey clustering enhancements to the current system and impacts in the business domain. Keywords-topic modeling; survey clustering; fuzzy clustering; categorization; category template building process does not consider any survey question usage statistics (from the existing surveys I. INTRODUCTION in the system) and is a language-specific task. To address this, we are building tools that help the category-specific As the amount of text data available keeps rising, it template building process use much less manual effort and becomes challenging for people to locate and track the employ language-independent system design. Moreover, our relevant information they require. We are particularly in- proposed system can automatically find commonly occur- terested, within the domain of multilingual survey texts, ring categories from multilingual survey data using survey to build language independent tools for topic discovery, questions’ word statistics. clustering, and ranking of surveys and their questions. Effec- tively addressing the potentially huge amount of information We focus on the task of automatically clustering surveys contained in a large collection of surveys leads us to use and questions and ranking them on relevancy to a specific tools for automatic text summarization and topic extraction. topic or survey. This includes challenges such as (a) rep- Topic modeling methods such as Latent Semantic Indexing resenting user surveys and questions in a machine readable (LSI) and Latent Dirichlet Allocation (LDA) are designed form by removing noise terms and stop-words, (b) employ- to assist with these types of problems. ing machine learning models that can learn topics from Conventional survey designer systems such as Survey- surveys and categorize them with minimal manual inter- Monkey provide manually designed, category-specific1 sur- vention, (c) post-processing strategies on the learned model vey templates (e.g., Education template, Customer Feedback for survey- and question-clustering, and (d) experiments on template, etc.) to ease the survey building process [1]. Dur- our unique set of multilingual (Spanish, German, French, ing survey building, template-questions can be customized and Portuguese) survey datasets from SurveyMonkey. Fig. 1 or new questions can be added based on user needs. One of and Fig. 2 show the visualization of survey-clustering and the disadvantages of this type of system is that the manual question-clustering. The different colors represent different labor required to build such templates is high. Similarly, the topic content in the survey text. The potential impact of this project in the business domain 1We use category and topic inter-changeably throughout in this paper. is unparalleled. There is not, to our knowledge, any other 978-0-7695-4913-2/12 $26.00 © 2012 IEEE 7 DOI 10.1109/ICMLA.2012.132 Topic models are well suited to a language independent approach to clustering and ranking for surveys because the bag-of-words document model, upon which they are based, is largely independent of semantic structures. The inference is based only upon the word co-occurrence frequencies in each document in a given corpus. We use a topic modeling algorithm based on HDP [3] to discover topics from surveys. The estimated topics are further used to rank relevant surveys in the corpus and group them (survey clustering). Topic models also provide relevant words and their probabilities for a given topic. Domain experts can use these words to name the learned categories or topics with minimal manual effort. We also considered the problem of grouping similar questions together (question clustering) to assist survey designers. We used LSI to represent questions due to its computational efficiency compared to the more complex Figure 2. Question clustering: q(s1) represents a question from survey 1. models such as LDA and HDP. We implemented our question clustering system based on fuzzy clustering [4] of the ”question bank” in any of the languages such as Spanish, questions represented in LSI space. Our results show that German, French and any other automatic survey and ques- our method can automatically find many manually defined tion categorization and ranking system in existence. The survey-categories, and group topically similar questions as categories that emerged from our system were qualitatively well as surveys with questions in a language different from different due to cultural differences in both the way that the survey group’s language. questions are asked in different languages and the informa- One of the challenges we faced in designing the multi- tion that real people who live in different countries want to lingual survey categorization system was the demographic find out the most. Thus, the process we developed supports and cultural variations in the language usage by people automating the construction of culturally-relevant question from different countries. The variation in question structure banks from existing survey corpora. was quite visible even with the formal environment imposed by the survey format. For example, in the case of Spanish Our approach surveys, many specific words and phrases were used to ask Our proposed system uses topic models to model the questions politely. During topic modeling inference, these corpus (document collection) of surveys. Topic models rep- caused trouble in forming relevant topics from the survey resent documents as bags of words without considering word text. Similarly, for German, many surveys include a large order as being of any importance. These models have the set of common words from colloquial phrases. This caused ability to represent large document collections with lower the topic-modeling-based ranking and clustering algorithms dimensional topics, which represent clusters of similarly be- to form overlapping, distinguishable question and survey having words. In addition, the document words are assumed groups. We try to tackle some of these problems by using to be generated from topic-specific multinomials and the language specific lemmatizers and stop-word list (section topic for a particular word is chosen from that document’s III). topic mixture. These topics are assumed to be generated over This paper is organized as follows. Section II describes the corpus vocabulary from a Dirichlet distribution. Blei et the state-of-the-art models in the area of document topic al. [2] give a detailed description of this language model and modeling, language-independent text mining, and survey its assumptions. clustering. Section III describes our overall system archi- The analysis of topic models is dependent upon exploring tecture and algorithms. Section IV describes details about the posterior distribution of model parameters and hidden our unique multilingual datasets, evaluation metrics, results, variables conditioned on observed words. The model pa- and analysis. Section V concludes this paper. rameters are corpus-level topics or concepts—sets of words with corresponding probabilities—and document-level topic II. RELATED WORK mixtures. The original topic model assumes that one should Topic models are often used to characterize plain text know the number of topics in the corpus beforehand. How- documents and to extract topical content from them. One ever, Teh et al. [3] solve this issue with a new framework such model, LSI, can group together words and phrases called the Hierarchical Dirichlet Process (HDP), which can that exhibit synonymy (or similar meanings), e.g., car and learn a variable number of topics automatically from the automobile. The LSI method typically performs matrix fac- data. torization over a term-document matrix (TF-IDF matrix), 8 which represents the occurrence of words in documents the system components. First, we tokenize the raw survey- using the concepts of eigenvalue decomposition and iden- questions with a tool that is dependent on the survey’s source tifies patterns in the relationships between the document language. For Latin-character based languages such as Span- terms and concepts

Load more