
Derived Sentiment Analysis Using JMP Pro

Michael D. Anderson, Ph.D., and Christopher Gotwalt, Ph.D.

October 20, 2017

JMP 13 introduced Text Explorer. The new platform provides users with the ability to curate freeform text and generate insights into themes and important terms. While incredibly useful, text exploration is really only a first step in answering the question at hand. Often what we really want to do is identify the key words in a set of documents that are strongly associated with a particular response when, for example, evaluating purchasing behavior or customer reviews. Typically this is done using traditional sentiment analysis, which relies on word lists supplied by third-party vendors that do not take specific contexts or audiences into consideration. An alternative approach, sometimes called "Supervised Learning Sentiment Analysis," combines text analysis with predictive modeling to determine which words and phrases are most relevant to the problem at hand. It uses data to determine both the direction and strength of each term association via a fairly approachable modeling exercise. Using JMP Pro 13 for Supervised Learning Sentiment Analysis is now easier than ever; we aim to demonstrate why with a series of case studies arising from consumer research and social media contexts.

Introduction

"You Keep Using That Word, I Do Not Think It Means What You Think It Means. . . " - Inigo Montoya, From The Princess Bride, MGM Studios, 1987

Let's start with a simple question: "What do you mean when you say something?" The concepts of thought, language, and meaning are so intertwined that it is hard to disentangle them. Humans use language as a method for conveying emotion, information, and even entertainment. The study of sentiment, as it is used in the literature, "seeks to determine the general sentiments, opinions, and affective states of people reflected in a corpus of text."[1] Since the early 1980s there have been a number of papers addressing the concept of sentiment, but the topic really started gaining momentum with the introduction of data mining techniques and machine learning algorithms around the turn of the 21st century. Sentiment analysis still has significant challenges because of the way in which we use language. Local dialects, idioms, and trends toward hyperbole or sarcasm all pose challenges to someone attempting to study sentiment with data mining techniques.

[1] Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Elder et al., 2012, Academic Press.

Why did it take so long for the field of text analytics to get off the ground? Two main factors appear to have contributed. First, the computational power and methods necessary for text analytics have become available only recently. Recall the early part of this century, when desktop PCs with 512 MB of RAM were common and multicore processors were in limited production; most entry-level smartphones today exceed those capabilities. In the past 10 years there has been an explosion in the processing capabilities available to scientists, in both server- and desktop-scale applications. It is therefore only within the past few years that we have had the hardware needed to analyze language effectively. The second factor is the availability of data. In text analytics these data sets are called corpora (singular: corpus), and each item within a corpus is called a document.[2] Because of the extensive variability present in spoken and written language, the corpora for analysis must be quite large, usually several thousand documents. These documents must then be curated into a compendium that can be primed for analysis. Until recently there just have not been many good data sets to analyze. Now it is possible to search social media sites like Twitter or Facebook to generate a corpus with a million documents without issue. Sites such as Amazon have also provided access to their user-generated product reviews, which provide both text and a favorability rating. Moreover, many government agencies in the US and abroad now maintain sites dedicated to publishing survey data, which often includes free text and comment fields. This all means that we now have the corpora we need to finally start looking at these topics in detail.

[2] In JMP, and many other cases, these documents are arranged one document per row in a data table or other database that acts as the corpus.

The Problem

The real problem with sentiment analysis is how to go about doing it. There are three general schools of thought on how to approach sentiment analysis: using a dictionary, using a person, or using a machine. To better understand these groups, it is first important to review a standard sentiment analysis workflow. In simplified form, all sentiment analysis methods require some element of predictive modeling. First, a corpus is edited to clean up mismatches related to spelling, slang, or other typos; this will serve as a training set. The training data are then used to generate a model that determines a score for each sentiment thought to be in the data set. Models created from a training set are then applied to new documents.

The three methods each approach score generation differently. The dictionary method uses keywords and phrases that have been associated with a set of sentiments. These associations are aggregated to produce the final scores for each sentiment. Using a person (i.e., supervised learning) involves having a panel score each document in the training corpus in isolation, with the scores then being aggregated and checked for agreement. Machine learning algorithms then try to determine the factors that caused scorers to assign a certain sentiment. Using a machine (i.e., unsupervised learning) applies machine learning algorithms directly to evaluate sentiment. Each of these methods has its drawbacks. The dictionary method presumes that a word embodies a given sentiment regardless of context. Supervised learning is labor intensive and requires care in making sure that there is agreement between scores for the training corpus. Unsupervised learning requires large corpora with rich supplemental data and large computational resources. All three methods are susceptible to inaccuracies brought about by grammatical inconsistencies, e.g., sarcasm or regional idioms. In this paper we will apply a blend of self-reported sentiment, in the form of scoring and contextual data, along with generalized regression as a variable selection technique. We propose that these two components, when applied together, provide a faster workflow and a more accurate assessment of the sentiment in a corpus than the more traditional methods detailed earlier.

Derived Sentiment Analysis

Using a model to determine sentiment requires both a response and variables. In the workflow that we propose, responses are self-reported or extracted from the documents in the corpus. Examples of this self-reporting include stars provided with a written review, Likert scale scores in a survey, or even (as we will demonstrate later) emoji. Leveraging the data that is already present in a corpus resolves a number of problems traditionally associated with sentiment analysis. First, the respondents themselves are providing their sentiment scores. This saves on the time and costs associated with curating the corpora. Second, it removes any ambiguity about respondent sentiment, which should make the results more accurate. The variables in this case come from a curated list of words and phrases from the corpus called a Document Term Matrix (DTM). The DTM takes the form of a collection of indicator columns that show when (and how many times) a given word or phrase is present in a document.
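To make the DTM idea concrete, here is a minimal sketch in Python using scikit-learn's CountVectorizer as a stand-in for Text Explorer's DTM export. The toy documents and settings are illustrative only, not the paper's data.

```python
# Minimal sketch of a Document Term Matrix (DTM): one row per document,
# one column per term, entries counting term occurrences.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The casino will bring revenue to Toronto",
    "Gambling brings crime and addiction",
    "More revenue means more jobs downtown",
]

vectorizer = CountVectorizer(stop_words="english")  # stop words removed, as in curation
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the curated term list
print(dtm.toarray())                       # rows = documents, columns = term counts
```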

The method for developing a sentiment model from the corpus is broken into two steps. First, the DTM is created using the Text Explorer platform in JMP. Within the Text Explorer, terms that should be excluded from consideration are removed using a stop word list. Regular expressions are also used to remove formatting, URLs, unnecessary punctuation, etc. Terms are also stemmed to remove the influence of tense and part-of-speech usage. Lastly, a recoding operation is used to clean up spelling errors and change terms when needed. After the curation process is complete, the DTM is exported back to the data table. Once the DTM has been exported, it is used with the self-reported responses in Generalized Regression with an Elastic Net penalty. Generalized Regression and the Elastic Net penalty were chosen for the ability of the Elastic Net to function both as a variable selection tool and as a method for dealing with collinearity among terms. The model report provides insights into the important terms in a given document that indicate a specific sentiment. The model can also be used to predict sentiment for new documents that may not include the scoring information.
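The modeling step can be sketched outside JMP as an elastic-net-penalized logistic regression. The snippet below uses scikit-learn as a rough analogue of Generalized Regression with a binomial distribution; it fixes the penalty strength rather than selecting it by BIC, so it mirrors the idea, not the exact fit. It assumes `dtm` and `vectorizer` from the previous sketch.

```python
# Hedged sketch: elastic-net logistic regression on the DTM, standing in for
# JMP Pro's Generalized Regression (binomial distribution, Elastic Net penalty).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = dtm.toarray()
y = np.array([1, 0, 1])  # self-reported response: 1 = "In Favor", 0 = "Against"

model = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.85,  # mix of L1 and L2 penalties
    C=1.0, max_iter=5000,
)
model.fit(X, y)

# The L1 part of the penalty zeroes out unhelpful terms; the survivors are the
# "active" terms, analogous to the Active Parameter Estimates in Figure 4.
for term, coef in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    if coef != 0:
        print(term, round(coef, 3))
```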

Case Study 1: The Toronto Casino Survey

"You've got to know when to hold 'em, know when to fold 'em..." - From "The Gambler," Kenny Rogers

Background

In 2012, the City of Toronto conducted an online survey to gauge public reaction to a proposed casino project. The survey was designed and conducted between November 2012 and January 2013, and approximately 18,000 responses were submitted. The results were posted online for public consumption by the City of Toronto in a set of Excel files.[3] The survey instrument was composed of 11 questions in multiple parts, with most questions having both a rating component and a comment section. The application of derived sentiment analysis to the first question in the survey, "Please indicate on the scale below how you feel about having a new casino in Toronto," will be used for this case study.

[3] http://www.toronto.ca/casinoconsultation/

Analysis and Results: Proof of Concept

Casinos tend to be polarizing topics, and we found that this survey was no exception. Looking at the distribution of responses (Figure 1), respondents were generally divided between two groups: strongly opposed or strongly in favor of the proposal. The middle responses represented only 11% of the corpus. We chose to exclude the neutral responses from the analysis (3%) and regroup the response data into a simple binary response (Figure 2).

Next, the corpus must be curated for modeling. At the time of the original survey, respondents had the opportunity to use a web form or submit a paper copy, and both of these routes have potential for typographical and spelling errors. The corpus was analyzed using Text Explorer with stemming to help reduce some of the variability in word use. We worked with Recode and stop words to remove words and terms that did not appear significant, using a word cloud created before term removal to assist with the curation process. The word cloud after the curation process, colored using the recoded response variable, is shown in Figure 3. While constructing the word cloud, JMP categorized "In Favor" as "1" and "Against" as "2". Simple inspection of the cloud provides a number of insights. The color scaling is centered on a value of 1.6, indicating that the majority of respondents oppose a casino. The word usage is also interesting: terms like "toronto" and "revenu·" tend to be used in contexts more favorable to a casino, whilst terms like "gambling" and "crime·" are associated with less favorable opinions. This empirical correlation between the terms and the responses suggests that a sentiment model for the corpus is possible. Using Text Explorer, a DTM including all terms that appeared in the corpus more than 500 times was exported. We used the Ternary setting to account for multiple appearances in a document.

[Figure 1: Distribution of survey responses as found in the corpus, ranging from Strongly in Favor through Neutral or Mixed to Strongly Opposed.]

[Figure 2: Distribution of survey responses binned into a binary response, In Favor versus Opposed.]
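The Ternary coding can be illustrated with a short sketch. Judging from the levels in the parameter estimates table (Figure 4), each term count collapses to one of three states, here labeled Missing, Present, and Many; the labels and cutoffs below are our reading of the output, not documented JMP internals.

```python
# Sketch of Ternary DTM coding: 0 occurrences -> 0 ("Missing"),
# exactly 1 -> 1 ("Present"), 2 or more -> 2 ("Many").
import numpy as np

def ternary_code(counts: np.ndarray) -> np.ndarray:
    return np.minimum(counts, 2)

counts = np.array([[0, 1, 3],
                   [2, 0, 1]])
print(ternary_code(counts))  # [[0 1 2]
                             #  [2 0 1]]
```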

[Figure 3: Word cloud for the curated corpus, colored by the binary response variable. While constructing the word cloud, JMP categorized "In Favor" as "1" and "Against" as "2".]

Taking the DTM and the restructured survey responses, we used Fit Model and Generalized Regression to derive a sentiment[4] model for the corpus. Given that we expected a high degree of collinearity in the DTM due to grammar and word-use forms in English, we used an Elastic Net penalty and a BIC model validation statistic. Since we created a binary response variable, we used the binomial distribution. This has the added benefit of providing probabilities for positive ("In Favor") and negative ("Against") sentiment. In this case we also chose to explore terms in the DTM only by themselves, i.e., excluding interactions. A model including quadratic and two-way interaction terms was also generated, but it showed no significant advantage over the main effects model for this corpus. Figure 4 shows the fitting statistics as well as the solution path and active terms for the fit.

[4] In this example we are using "sentiment" in its most primitive form: simply, do respondents feel positively or negatively.
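As a consistency check on the reported fit statistics, recall the textbook definition of BIC (our arithmetic below, not JMP output):

BIC = -2 ln L + k ln n

With the values reported in Figure 4 (-LogLikelihood = 17862.594, k = 27 parameters, n = 41,450 documents), BIC = 2(17862.594) + 27 ln(41450) ≈ 35725.19 + 287.07 ≈ 36012.26, which matches the reported value of 36012.259.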

[Figure 4: Fit statistics and solution path for the survey sentiment model. Model summary: Adaptive Elastic Net (alpha 0.85) with BIC validation; binomial distribution with logit link; sum of frequencies 41,450; -LogLikelihood 17862.594; 27 parameters; BIC 36012.259; AICc 35779.225; Generalized RSquare 0.3175. The Active Parameter Estimates table lists the retained Ternary terms, including "toronto", "gambling", "people", "revenu·", "increas·", "crime·", "addict·", "social·", "business", "creat·", "famili·", "traffic·", "attract·", "benefit·", "downtown·", "already", "problem·", "negat·", "impact·", "communiti·", "entertain·", "tourism", "local·", and "econom·"; all are significant at the 0.05 level except one level of "tourism".]

The model has what most would consider a poor R² (Generalized RSquare in the Model Summary); however, all the terms except one appear to be significant. The low R² in this case has less to do with the quality of the model and more to do with the noise in the data set. Since spoken and written language have significant flexibility in word usage, we expect that real-world corpora will be noisy, and this will depress R². Examination of how the model fit the data using the Solution Path and the Active Parameters indicates that our model is near the optimum for this corpus. We can take confidence in the algorithm removing approximately half of the proposed terms in the DTM from the model. Further, the proposed optimal model sits in the general minimum for the fitting statistic.

Considering the Solution Path specifically, we can make inferences about general features of the model and how given terms impact the sentiment it predicts. During the setup we chose to target the positive (i.e., in favor of the casino) sentiment response. This makes it possible to use the Solution Path as a tool for understanding which terms influence the sentiment and by how much. Each term in the model is represented by a line in the Solution Path. Lines in blue are active in the model, whilst those in black are excluded. The displacement of a line away from zero on the y-axis represents the coefficient estimate for that model term. Qualitatively, it is possible to see that most of the language used in the surveys projects a negative sentiment, while a few key terms show a positive sentiment.

We used tools in JMP to generate findings from the model. We were first able to determine the primary pro and con arguments by optimizing the profile for positive and negative sentiment. Those who view the proposal positively discussed it from an economic standpoint, citing increased revenues to the city and local businesses. Those opposed to the casino cited social issues such as crime, addiction, and the presence of similar facilities nearby. These two viewpoints can be seen by examining the histograms provided in Figures 5 and 6, which were created by filtering for proponents and opponents as predicted by the sentiment model.

[Figure 5: Distribution of terms used by opponents of the casino measure.]

[Figure 6: Distribution of terms used by proponents of the casino measure.]

Case Study 2: The Red Wedding

"A coat of gold, a coat of red, a lion still has claws, and mine are long and sharp, my lord, as long and sharp as yours..." - From "The Rains of Castamere," George R.R. Martin

[Figure 7: Geographic distribution of tweets in the corpus.]

Background

We will do our best to avoid spoilers in this discussion, but if you read on, on your own head be it.

One of the most pivotal scenes in the television adaptation of George R.R. Martin's series is the Red Wedding. The scene occurs during the episode "The Rains of Castamere," which aired on June 2, 2013 and is viewed as a very intense point in the narrative. We obtained a corpus containing documents (tweets) that used hashtags related to the program during and immediately after the first broadcast of the episode, from around the world. The corpus contains approximately 27,000 tweets and includes 35 different languages. The tweets for which location data were available are shown in Figure 7. Approximately 24,000 documents (about 90%) are in English, and 640 of those contain emoji.

Analysis: The Red Wedding

The initial curation process, shown as a word cloud in Figure 8, showed the significant presence of an unexpected term, "ud83d." Initially, we thought this term was part of a URL that was not scrubbed by our regular expression filtering. However, after a little investigation, we discovered that this was the way that Twitter recorded emoji and other Unicode characters. This realization presented an interesting avenue for sentiment analysis, provided that the emoji could be extracted and recoded into a useful form.

[Figure 8: Word cloud for the corpus during the curation process, with arbitrary coloring. Note the large "ud83d" (middle left), indicating a large amount of Unicode in the corpus. Also note that strong language was recoded into less offensive terms.]

During the curation process, we noted a significant volume of profane language. For the figures in this document we have chosen to employ the Recode feature in Text Explorer to change profane terms to less offensive ones. The word cloud for the curated corpus (Figure 9) shows that viewers were taken aback by the episode.

[Figure 9: Word cloud for the curated corpus, with arbitrary coloring. Note that strong language was recoded into less offensive terms.]

Analysis: The Issue with Emoji

Today, emoji[5][6] are commonplace. To digital natives, emoji are becoming a form of pidgin language with which to express sentiment; emoji thus serve as an important response variable for derived sentiment analysis. From a data perspective, emoji are stored in Twitter feeds as Java source encodings of their Unicode values, not as the pictograms we have come to expect. For instance, the emoji for "grinning face with smiling eyes" would appear as "\uE415" in the text. This can be further complicated in cases where non-English-language documents are included in the corpus. In many cases, these documents will also include Unicode escapes for decorated letters, e.g., a lower-case "a" with an umlaut (ä), which would appear as "\u00E4" in a tweet. To deal with these complications we first limited ourselves to English tweets. Furthermore, the data set was processed using a script that parsed the emoji, translated them from Java source notation to text descriptions, and created a column of occurrence frequencies for each. By considering not just whether an emoji was used, but how often it was used, we sought insight into the intensity of the underlying sentiment expressed by the emoji.

[5] Emoji are ideograms and smileys used in electronic messages and Web pages. Emoji are used much like emoticons and exist in various genres, including facial expressions, common objects, places and types of weather, and animals. Originally meaning pictograph, the word emoji comes from Japanese e ("picture") + moji ("character").

[6] en.wikipedia.org/wiki/Emoji
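A sketch of what such a parsing script might look like (our Python illustration, not the authors' actual script): literal "\uXXXX" escapes are decoded, UTF-16 surrogate pairs such as "\ud83d\ude2d" are merged into single code points, and emoji are counted by their Unicode names.

```python
# Hypothetical emoji-parsing sketch. Twitter's "ud83d..." fragments are the
# high halves of UTF-16 surrogate pairs; \ud83d\ude2d, for example, is the
# pair for U+1F62D "LOUDLY CRYING FACE" (the :sob: emoji).
import unicodedata
from collections import Counter

def decode_java_escapes(text: str) -> str:
    # Turn literal \uXXXX sequences into characters, then merge any
    # surrogate pairs into single code points.
    raw = text.encode("ascii", errors="backslashreplace").decode("unicode_escape")
    return raw.encode("utf-16", "surrogatepass").decode("utf-16")

tweet = r"no words \ud83d\ude2d\ud83d\ude2d #redwedding"
decoded = decode_java_escapes(tweet)

# One frequency count per emoji, keyed by Unicode name.
emoji_counts = Counter(
    unicodedata.name(ch, "UNKNOWN")
    for ch in decoded
    if unicodedata.category(ch) == "So"  # "Symbol, other": covers most emoji
)
print(emoji_counts)  # Counter({'LOUDLY CRYING FACE': 2})
```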

[Figure 10: Emoji present in the corpus and their frequency of occurrence: :joy:, :cry:, :sob:, :rage:, :weary:, :fearful:, :flushed:, :scream:, :pensive:, :relieved:, :frowning:, :persevere:, :tired_face:, :anguished:, :dizzy_face:, :astonished:, :cold_sweat:, :see_no_evil:, :open_mouth:, :broken_heart:, and an "Others" bin.]

When we examined the frequency of occurrence for the emoji present in the corpus, we found that a small fraction of the emoji represented the bulk of the frequency data. A histogram, binned to put the bottom 20% of the emoji used into a single category, is shown in Figure 10. This category, "Others," represents approximately 70 emoji that were used a combined 588 times. Examining the other 20 emoji, it is easy to see that there is significant overlap in sentiment; it therefore might be possible to perform factor analysis and extract a small number of latent factors to use in a sentiment analysis in place of the emoji themselves.

The results of the factor analysis (Figure 11) demonstrate that there are probably 7 latent factors described by these emoji. The rotated factor loadings show that factor analysis successfully grouped the emoji into new factors that are mostly orthogonal (Figure 12). Examination of the emoji present in these rotated factors shows that the factors follow general emotional themes, as indicated in the table (Figure 13).
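The factor-analysis step can be sketched with scikit-learn's FactorAnalysis (varimax rotation) standing in for JMP's Factor Analysis platform; the random matrix below is a placeholder for the real 640-tweet by 20-emoji frequency table.

```python
# Hedged sketch: extract 7 rotated latent factors from an emoji frequency
# matrix, mirroring the analysis summarized in Figures 11-13.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
emoji_freq = rng.poisson(1.0, size=(640, 20)).astype(float)  # stand-in data

fa = FactorAnalysis(n_components=7, rotation="varimax")
scores = fa.fit_transform(emoji_freq)  # per-tweet scores on each factor
loadings = fa.components_              # (7 factors) x (20 emoji) loadings

# Emoji that load strongly on the same factor share a sentiment theme
# (e.g., Shock, Distress, Sadness in Figure 13).
print(loadings.shape, scores.shape)    # (7, 20) (640, 7)
```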

[Figure 11: Factor analysis eigenvalue table and scree plot for the emoji frequency data, supporting the choice of seven latent factors.]

[Figure 12: Rotated factor loadings (F1 through F7) for the twenty most frequent emoji; each emoji loads strongly on only one or two factors.]

[Figure 13: Sentiments and their emoji as determined by factor analysis. Factor 1: Shock; Factor 2: Distress; Factor 3: Sadness; Factor 4: Heartbreak; Factor 5: Depressed; Factor 6: Amazement; Factor 7: Amusement.]

Using the DTM for the emoji-containing tweets and the derived sentiment factors, we ran an analysis similar to that of the first case study, with the following exceptions. First, we modeled all of the sentiment factors simultaneously. We also treated the sentiment factors as continuous variables, thus enabling us to predict the intensity of sentiment for each emotional state. The results showed a noise level and fitting statistics similar to those seen in the Toronto corpus.

We then exported the model back to the main corpus and used it to predict sentiments for the tweets that did not contain emoji. The scores for each sentiment factor were rescaled to between 0 and 1 for comparison purposes. The maximum score for any given tweet’s sentiment factor was interpreted as an overall intensity score.
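A small sketch of this rescaling, under the assumption that "rescaled to between 0 and 1" means a per-factor min-max scaling (the paper does not specify the exact transform); `scores` is the tweets-by-factors matrix from the factor-analysis sketch.

```python
# Per-factor min-max rescaling to [0, 1], then the per-tweet maximum across
# factors serves as the overall intensity score used to color Figure 14.
import numpy as np

def intensity(scores: np.ndarray) -> np.ndarray:
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    scaled = (scores - lo) / (hi - lo)  # each sentiment factor now in [0, 1]
    return scaled.max(axis=1)           # one intensity value per tweet

# Example: intensity(scores) returns a length-640 vector for the emoji tweets.
```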

[Figure 14: Red Wedding sentiment profile, colored by intensity of sentiment. Sentiment axes: Shock, Distress, Sadness, Heart Break, Depressed, Amazement, Amusement; intensity legend runs from min (55%) through mean (66%) to max (100%).]

[Figure 15: Red Wedding sentiment profile, focusing on tweets with a high Amusement score.]

Figure 14 shows a parallel plot depicting how each of the tweets in the corpus scored on the different sentiment scales, with the lines colored by overall intensity score. Overall, the trends appear to make sense. Based on the density of lines, we see the overall sentiment to be elevated in Shock, Sadness, and Amazement.

These sentiments are consistent with the intense nature of the episode and the Red Wedding scene in particular. Shock and Distress appear to be inversely correlated. We would expect that, while they may be shocked, viewers will generally not experience Distress to any significant degree from a fictional TV show.

The fact that Amusement shows up in the corpus at all is an extremely interesting result. Because of the nature of the episode, amusement is not a sentiment that would generally be expected to come up. This sentiment factor was built specifically on the "Face With Tears of Joy" emoji. Initially, we theorized that this was malapropism.[7] The emoji for "Face With Tears of Joy" and "sob," or even "tired face," could appear very similar on a smartphone screen. It would not be unreasonable to expect users to appropriate emoji to express emotions other than those they are technically meant to represent, and if the evidence supported such misuse, it would be reasonable to bin the misused emoji in with their correct sentiment. However, looking at the tweets in the corpus more closely, we found this to be an incorrect hypothesis. The documents that scored highly on the Amusement scale were actually retweets[8] of a joke: "Why doesn't George R.R. Martin use Twitter? Because he killed all 140 characters." Users would occasionally use the "Face With Tears of Joy" emoji as part of the retweet, creating a separate sentiment in the corpus. This brings up an interesting application of derived sentiment analysis in JMP. If someone had simply assumed the malapropism and recoded the emoji, they would have introduced further noise into an already noisy analysis. By finding the emoji in question and isolating the tweets containing it in Text Explorer, we were able to understand something that might otherwise have been missed.

[7] Malaprop (also malapropism): the mistaken use of a word in place of a similar-sounding one, often with unintentionally amusing effect, as in, for example, "dance a flamingo" (instead of flamenco). Origin: mid 19th century, from the name of the character Mrs. Malaprop in Sheridan's play The Rivals (1775).

[8] A retweet is the reposting of another user's tweet to share with one's own followers.

Conclusions and Final Thoughts

Using Text Explorer and Generalized Regression, we have demonstrated that it is possible to do exploratory sentiment analysis in JMP. We presented two case studies showing the method of constructing a sentiment model and how it might be applied to gain unique insights from free-text data. In the first case study we showed that a derived sentiment model could be used to extract the underlying themes driving positive and negative feelings toward a proposal. In the second case study we demonstrated that emoji could be used to generate a response for a sentiment model, which could then be used to gain insights from a larger corpus of documents.