Derived Sentiment Analysis Using JMP Pro

Michael D. Anderson, Ph.D., Christopher Gotwalt, Ph.D.
October 20, 2017

JMP 13 introduced Text Explorer. The new platform gives users the ability to curate freeform text and generate insights into themes and important terms. While incredibly useful, text exploration is really only a first step. Often what we really want to do is identify the key words in a set of documents that are strongly associated with a particular response, for example when evaluating purchasing behavior or customer reviews. Typically this is done using traditional sentiment analysis, which relies on word lists supplied by third-party vendors that do not take specific contexts or audiences into consideration. An alternative approach, sometimes called "Supervised Learning Sentiment Analysis," combines text analysis with predictive modeling to determine which words and phrases are most relevant to the problem at hand. It uses data to determine both the direction and strength of each term's association via a fairly approachable modeling exercise. Using JMP Pro 13 for Supervised Learning Sentiment Analysis is now easier than ever; we aim to demonstrate why with a series of case studies arising from consumer research and social media contexts.

Introduction

"You keep using that word. I do not think it means what you think it means." - Inigo Montoya, The Princess Bride, MGM Studios, 1987

Let's start off with a simple question: "What do you mean when you say something?" The concepts of thought, language and meaning are so intertwined that it is hard to disentangle them. Humans use language as a method for conveying emotion, information and even entertainment.
The study of sentiment, as it is used in the literature, "seeks to determine the general sentiments, opinions, and affective states of people reflected in a corpus of text."[1] Since the early 1980s there have been a number of papers addressing the concept of sentiment, but the topic really started gaining momentum with the introduction of data mining techniques and machine learning algorithms around the turn of the 21st century. Sentiment analysis still faces significant challenges because of the way in which we use language: local dialects, idioms and trends toward hyperbole or sarcasm all pose difficulties for anyone attempting to study sentiment with data mining techniques.

[1] Elder et al., Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Academic Press, 2012.

Why did it take so long for the field of text analytics to get off the ground? Two main factors appear to have contributed. First, the computational power and methods necessary for text analytics have become available only recently. Recall the early part of this century, when desktop PCs with 512 MB of RAM were common and multicore processors were in limited production; most entry-level smartphones today exceed those capabilities. In the past 10 years there has been an explosion in the processing capabilities available to scientists, in both server- and desktop-scale applications. It is therefore only within the past few years that we have had the hardware needed to analyze language effectively.

The second factor is the availability of data. In text analytics these data sets are called corpora (singular: corpus). Each item within a corpus is called a document.[2]
[2] In JMP, as in many other settings, these documents are arranged one document per row in a data table or other database that acts as the corpus.

Because of the extensive variability present in spoken and written language, the corpora for analysis must be quite large, usually several thousand documents. These documents must then be curated into a compendium that can be primed for analysis. Until recently there just haven't been many good data sets to analyze. Now it is possible to search social media sites like Twitter or Facebook and generate a corpus with a million documents without issue. Sites such as Amazon have also provided access to their user-generated product reviews, which include both text and a favorability rating. Moreover, many government agencies in the US and abroad now maintain sites dedicated to publishing survey data, which often includes free-text and comment fields. This all means that we now have the corpora we need to finally start looking at these topics in detail.

The Problem

The real problem with sentiment analysis is how to go about doing it. There are three general schools of thought: using a dictionary, using a person, or using a machine. To better understand these groups, it is first important to review a standard sentiment analysis workflow. In simplified form, all sentiment analysis methods require some element of predictive modeling. First, a corpus is edited to clean up mismatches related to spelling, slang or other typos; this will serve as a training set. The training data is then used to generate a model that determines a score for each sentiment thought to be in the data set. Models created from a training set are then applied to new documents.

The three methods each approach score generation differently. The dictionary method uses keywords and phrases that have been associated with a set of sentiments.
These associations are aggregated to produce the final scores for each sentiment. Using a person (i.e., supervised learning) involves having a panel score each document in the training corpus in isolation, with the scores then being aggregated and checked for agreement. Machine learning algorithms then try to determine the factors that caused scorers to assign a certain sentiment. Using a machine (i.e., unsupervised learning) applies machine learning algorithms directly to evaluate sentiment.

Each of these methods has its drawbacks. The dictionary method presumes that a word embodies a given sentiment regardless of context. Supervised learning is labor intensive and requires care to ensure agreement between scores for the training corpus. Unsupervised learning requires large corpora with rich supplemental data and substantial computational resources. All three methods are susceptible to inaccuracies brought about by grammatical inconsistencies, e.g., sarcasm or regional idioms.

In this paper we will apply a blend of self-reported sentiment, in the form of scoring and contextual data, along with generalized regression as a variable selection technique. We propose that these two components, applied together, provide a faster workflow and a more accurate assessment of the sentiment in a corpus than the more traditional methods detailed above.

Derived Sentiment Analysis

Using a model to determine sentiment requires both a response and variables. In the workflow we propose, responses are self-reported or extracted from the documents in the corpus. Examples of this self-reporting include stars provided with a written review, Likert-scale scores in a survey, or even (as we will demonstrate later) emoji. Leveraging the data that is already present in a corpus resolves a number of problems traditionally associated with sentiment analysis. First, the respondents themselves are providing their sentiment scores.
This saves the time and costs associated with curating the corpora. Second, it removes any ambiguity about respondent sentiment, which should make the results more accurate. The variables in this case come from a curated list of words and phrases from the corpus called a Document Term Matrix (DTM). The DTM takes the form of a collection of indicator columns that show when (and how many times) a given word or phrase is present in a document.

The method for developing a sentiment model from the corpus is broken into two steps. First, the DTM is created using the Text Explorer platform in JMP. Within Text Explorer, terms that should be excluded from consideration are removed using a stop word list. Regular expressions are also used to strip formatting, URLs, unnecessary punctuation, etc. Terms are stemmed to remove the influence of tense and part-of-speech usage. Lastly, a recoding operation is used to clean up spelling errors and change terms when needed. After the curation process is complete, the DTM is exported back to the data table.

Once the DTM has been exported, it is used with the self-reported responses in Generalized Regression with an Elastic Net penalty. Generalized Regression and the Elastic Net penalty were chosen because the Elastic Net functions both as a variable selection tool and as a method for handling correlated predictors. The model report provides insights into the important terms in a given document that indicate a specific sentiment. The model can also be used to predict sentiment for new documents that may not include the scoring information.

Case Study 1: The Toronto Casino Survey

"You've got to know when to hold 'em, know when to fold 'em." - From "The Gambler," Kenny Rogers

Background

In 2012, the City of Toronto conducted an online survey to gauge public reaction to a proposed casino project.
The survey was designed and conducted between November 2012 and January 2013, and approximately 18,000 responses were submitted. The results were posted online for public consumption by the City of Toronto in a set of Excel files.[3] The survey instrument was composed of 11 questions in multiple parts, with most questions having both a rating component and a comment section.

[3] http://www.toronto.ca/casinoconsultation/