
Modelling Valence and Arousal in Facebook posts

Daniel Preoţiuc-Pietro (Positive Psychology Center, University of Pennsylvania) [email protected]
H. Andrew Schwartz (Department of Computer Science, Stony Brook University) [email protected]

Gregory Park and Johannes C. Eichstaedt (Positive Psychology Center, University of Pennsylvania)
Margaret Kern (Centre for Positive Psychology, University of Melbourne)

Lyle Ungar (Computer & Information Science, University of Pennsylvania) [email protected]
Elizabeth P. Shulman (Department of Psychology, Brock University) [email protected]

Abstract

Access to expressions of subjective personal posts increased with the popularity of Social Media. However, most of the work in sentiment analysis focuses on predicting only valence from text, and is usually targeted at a product rather than at affective states. In this paper, we introduce a new data set of 2895 Social Media posts rated by two psychologically-trained annotators on two separate ordinal nine-point scales. These scales represent valence (or sentiment) and arousal (or intensity), which define each post's position on the circumplex model of affect, a well-established system for describing emotional states (Russell, 1980; Posner et al., 2005). The data set is used to train prediction models for each of the two dimensions from text, which achieve high predictive accuracy, correlated at r = .65 with valence and r = .85 with arousal annotations. Our data set offers a building block to a deeper study of personal affect as expressed in social media. This can be used in applications such as mental illness detection or in automated large-scale psychological studies.

1 Introduction

Sentiment analysis is a very active research area that aims to identify, extract and analyze subjective information from text (Pang and Lee, 2008). This generally includes identifying whether a piece of text is subjective or objective, what sentiment it expresses (positive or negative; often referred to as valence), what emotion it conveys (Strapparava and Mihalcea, 2007) and towards which entity or aspect of the text, i.e., aspect-based sentiment analysis (Brody and Elhadad, 2010). Downstream applications are mostly interested in automatically inferring public opinion about products or actions. Besides expressing attitudes towards other objects, texts can also express the emotions of the ones writing them, increasingly so with the rise of Social Media usage (Rosenthal et al., 2015). This study focuses on presenting a gold standard data set, as well as a model trained on this data, in order to drive research about the affective norms of people posting subjective messages. This is of great interest to applications in social science which study text at a large scale and with orders of magnitude more users than traditional studies.

Emotion classification is a widely debated topic in psychology (Gendron and Barrett, 2009). Two main theories about emotions exist: the first posits a discrete and finite set of emotions, while the second suggests that emotions are a combination of different scales. Research in Natural Language Processing (NLP) has focused mostly on Ekman's model of emotion (Ekman, 1992), which posits the existence of six basic emotions: anger, disgust, fear, joy, sadness and surprise (Strapparava and Valitutti, 2004; Strapparava and Mihalcea, 2008; Calvo and D'Mello, 2010). In this study, we focus on the most popular dimensional model of emotion: the circumplex model introduced in (Russell, 1980). This model suggests that all affective states are represented in a two-dimensional space with two independent neurophysiological systems: valence (or sentiment) and arousal. Any affective experience is a linear combination of these two independent systems, which is then interpreted as representing a particular emotion. For example, fear is a state involving the combination of negative valence and high arousal (Posner et al., 2005). Previous research in NLP focused mostly on valence or sentiment, either binary or having a strength component coupled with sentiment (Wilson et al., 2005; Thelwall et al., 2010; Thelwall et al., 2012).

In this paper we build a new data set consisting of 2895 anonymized Facebook posts labeled with both valence and arousal by two annotators with psychology training. The ratings are made on two independent nine-point scales, reaching high inter-annotator agreement correlations of .768 for valence and .827 for arousal. Data set statistics suggest that while the dimensions of valence and arousal are associated, they present distinct information, especially in posts with a clear positive or negative valence.

Further, we train a bag-of-words linear regression model to predict ratings of new messages. This model achieves high correlation with actual mean ratings, reaching Pearson r = .85 correlation on the arousal dimension and r = .65 on the valence dimension, without using any other sentiment analysis resources. Comparing our method to other established lexicons for valence and arousal and to methods from sentiment analysis, we demonstrate that these methods are not able to handle well the type of posts present in our data set. We further illustrate the words most correlated with both dimensions and identify opportunities for improvement. The data set and annotations are freely available online.1

2 Data set

We create a new data set with annotations on two independent scales:

• Valence (or sentiment) represents the polarity of the affective content in a post, rated on a nine-point scale from 1 (very negative) through 5 (neutral/objective) to 9 (very positive);

• Arousal (or intensity) represents the intensity of the affective content, rated on a nine-point scale from 1 (neutral/objective post) to 9 (very high).

Our corpus is comprised of Facebook status updates shared by participants as part of the MyPersonality Facebook application (Kosinski et al., 2013), in which they also took a number of questionnaires. All authors have explicitly given permission to include their information in a corpus for research purposes. We have manually anonymized the entire corpus by removing any references to names of persons, addresses, telephone numbers, e-mails and URLs, and replaced them with placeholders.

In order to reduce biases due to our participant demographics, the data set sample was stratified by gender and age, and we have not rated more than two messages written by the same person. Research is inconclusive about whether females express more emotions in general (Wester et al., 2002). With regards to age, an age positivity bias has been found, where positive emotion expression increases with age (Mather and Carstensen, 2005; Kern et al., 2014).

The data originally consisted of 3120 posts. All of these posts were annotated by the same two independent raters with a training in psychology. The raters performed the coding in a similar environment without any distractions (e.g., no listening to music, no watching TV/videos), as these could have influenced the emotions of the raters, and therefore the coding.

The annotators were instructed to sparingly rate messages as un-ratable when they were written in a language other than English or offered no cues for an accurate rating (e.g., only characters with no meaning). The annotators were instructed to rate a message if they could judge at least a part of it. Then, the raters were asked to rate the two dimensions, valence and arousal, after having explicitly been briefed that these should be independent of each other. The raters were provided with anchors with specified valence and arousal, and were instructed to rate neutral messages at the middle of the scale in terms of valence and at 1 if they lacked arousal.

1 http://mypersonality.org/wiki/doku.php?id=download_databases

Table 2: Individual rater mean and standard deviation and inter-annotator correlation (IA Corr).

Dimension    R1 µ ± σ         R2 µ ± σ         IA Corr.
Valence      5.274 ± 1.041    5.250 ± 1.485    .768
Arousal      3.363 ± 1.958    3.342 ± 2.183    .827

Table 3: Correlation with arousal and mean arousal values for posts grouped by valence.

Valence of posts          1–9     1–3.5    1–4     6–9     6.5–9
Correlation to arousal    .222    -.047    -.201   .226    .085
Mean arousal              3.35    3.85     3.47    4.31    4.68
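The agreement figures in Table 2 can be computed directly from the per-rater scores shipped with the data set. A minimal sketch, assuming the scores have been loaded into parallel per-rater lists; the toy values below are hypothetical stand-ins, not drawn from the corpus:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-rater valence scores on the 1-9 scale; the released
# data set contains both raters' individual scores for every post.
valence_r1 = [7.0, 3.0, 5.0, 8.0, 2.0, 6.5]
valence_r2 = [7.5, 2.5, 5.0, 8.5, 2.0, 6.0]

# Inter-annotator agreement as Pearson correlation (the IA Corr.
# column of Table 2: .768 for valence, .827 for arousal).
ia_corr, _ = pearsonr(valence_r1, valence_r2)

# Agreement on the binary un-ratable decision, measured with Cohen's
# kappa (reported in the paper as kappa = .93).
unratable_r1 = [0, 0, 1, 0, 1, 0]
unratable_r2 = [0, 0, 1, 0, 1, 0]
kappa = cohen_kappa_score(unratable_r1, unratable_r2)
```

Pearson correlation is the natural agreement statistic for the ordinal nine-point scales, while Cohen's kappa suits the binary un-ratable decision.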

In total, 2895 messages were rated by both raters in both dimensions. Table 1 shows examples of posts rated in all quadrants of the circumplex model.

Table 1: Example of posts annotated with average valence (V) and arousal (A) ratings.

Message                                                           V     A
Is the one whoz GOing to Light Up your Day!!!!!!!!!!!!            7     8
Blessed with a baby boy today ...                                 7.5   2
the boring life is back :( ...                                    3     2.5
IS SUPER STRESSED AND ITS JUST THE SECOND MONTH OF SCHOOL ..D:    2.5   7

The correlation between the raters and the mean and standard deviation for each rater are presented in Table 2. The inter-annotator agreement on deciding un-ratable posts is measured by a Cohen's Kappa of κ = .93. The histograms of ratings are presented in Figure 1. The data set is released with the scores of both individual raters.

We study the correlation between the valence and arousal scores for posts in Table 3. We chose to split values based on different valence thresholds in order to remove posts rated as neutral in valence (5) from the analysis, as they are expected to be low in intensity (1). We observed an overall correlation between the valence and arousal ratings, which holds for both positive and negative valence posts when the neutral posts are removed (.222 and .226 correlation). However, when the posts are more strongly positive or negative in valence, arousal is only mildly correlated (.047 and .085). This highlights that the presence of either positive or negative valence is correlated with an arousal score different from 1, but this correlation is weaker when the positive or negative valence passes a certain threshold (i.e., 3.5 and 6.5 respectively). We also note that the high overall correlation is also due to higher mean arousal for positive valence posts compared to negative posts (4.68 cf. 3.85).

Figure 2 displays the relationship between the age of the user at posting time and the valence and arousal of their posts in our data set, further divided by gender. We notice some patterns emerge in our data. Valence increases with age for both genders, especially at the start and end of our age interval (13–16 and 30–35), confirming the aging positivity bias (Mather and Carstensen, 2005). Valence is higher for females across almost the entire age range. Posts written by females are also significantly higher in arousal for all age groups. Age does not have a significant effect on post arousal, although there is a slight increase with age, especially for females. Overall, these figures again illustrate the importance of age and gender as factors to be considered in these types of applications (Volkova et al., 2013; Hovy, 2015).

[Figure 2: Variation in valence and arousal with age in our data set using a LOESS fit. Data is split by gender: Male (coral orange) and Female (mint green).]

3 Predicting Valence and Arousal

To study the linguistic differences of both dimensions, we build a bag-of-words prediction model of valence and arousal from our corpus.2 We train two linear regression models with ℓ2 regularisation on the posts and test their predictive power in a 10-fold cross-validation setup. Results for predicting the two scores are presented in Table 4.

We compare to a number of different existing general-purpose lexicons. First, we use the ANEW (Bradley and Lang, 1999) weighted dictionary to compute a valence and arousal score as the weighted sum of the individual word valence and arousal scores.

2 Available at http://wwbp.org/data.html
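The lexicon baseline just described can be sketched in a few lines. This is a minimal sketch, not the authors' code: the mini-lexicon below is a hypothetical stand-in for the real ANEW norms, and scores are averaged over matched words as a simple length normalisation of the weighted sum.

```python
# Lexicon baseline sketch: score a post from per-word affective norms.
# The mini-lexicon is a hypothetical stand-in for ANEW (Bradley and
# Lang, 1999); the real dictionary assigns valence and arousal norms
# to roughly a thousand words.
ANEW_VALENCE = {"happy": 8.21, "love": 8.72, "bored": 2.95, "stressed": 2.33}

def lexicon_score(post, norms, neutral=5.0):
    """Average the norms of the dictionary words found in the post."""
    hits = [norms[t] for t in post.lower().split() if t in norms]
    # Posts containing no dictionary words fall back to mid-scale.
    return sum(hits) / len(hits) if hits else neutral

print(lexicon_score("so happy and full of love", ANEW_VALENCE))  # → 8.465
```

The same function scores arousal when given an arousal dictionary instead.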

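The bag-of-words model of Section 3 pairs unigram counts with ℓ2-regularised (ridge) linear regression, evaluated by Pearson correlation under 10-fold cross-validation. A runnable sketch with scikit-learn; the toy posts and ratings are hypothetical stand-ins for the annotated corpus, so the pipeline, not the data, is the point:

```python
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the 2,895 annotated posts and their mean valence ratings.
posts = ["Blessed with a baby boy today",
         "the boring life is back :(",
         "SUPER STRESSED second month of school",
         "Thank you all for the birthday wishes!!!",
         "so happy and excited",
         "bored ... again"] * 5
valence = [7.5, 3.0, 2.5, 8.0, 8.5, 2.5] * 5

# Unigram bag-of-words features feeding an L2-regularised linear model.
model = make_pipeline(CountVectorizer(), Ridge(alpha=1.0))

# Out-of-fold predictions from 10-fold cross-validation; performance is
# reported as Pearson r between predictions and the mean ratings.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
preds = cross_val_predict(model, posts, valence, cv=folds)
r, _ = pearsonr(valence, preds)
```

A second model with the same pipeline, fit on the arousal ratings, gives the arousal predictor.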
[Figure 1: Histograms of average rating scores. (a) Valence; (b) Arousal.]

Similarly, we use the affective norms of words obtained by extending ANEW with ratings for ∼14,000 words (Warriner et al., 2013). We also benchmark with standard methods for estimating valence from sentiment analysis. First, we use the MPQA lexicon (Wilson et al., 2005), which contains 7,629 words rated for positive or negative sentiment, to obtain a score based on the difference between positive and negative words in the post. Second, we use the NRC Hashtag Sentiment Lexicon (Mohammad et al., 2013), which obtained the best performance in the SemEval Twitter Sentiment Analysis tasks.3

Table 4: Prediction results for valence and arousal of posts, reported in Pearson correlation on 10-fold cross-validation for the BOW model.

Method       Valence    Arousal
ANEW         .307       .085
Aff Norms    .113       .188
MPQA         .385       –
NRC          .405       –
BOW Model    .650       .850

Our method achieves very high correlations with the target score. Arousal is easier to predict, reaching r = .85 correlation between predicted and rater score. ANEW obtains significant correlations with both of our ratings; however, these are significantly lower than those of our model. The extended list of affective norms obtains, perhaps surprisingly, lower correlation for valence, but stronger correlation with arousal than ANEW. For valence, both sentiment analysis lexicons provide better performance than the affective norms lexicons, albeit lower than our model trained on parts of the same data set.

The performance improvement is most likely driven by the domain of the data set. While our method is trained on held-out data from the same domain in a cross-validation setup, the other methods suffer from a lack of adaptation to this domain. The NRC lexicon, trained for predicting sentiment on Twitter, obtains the highest performance of the established models, due to the fact that it is trained on a more similar domain. The lower performance of the existing models can also be explained by the fact that they predict a score used for classification into positive vs. negative, while our target score represents the strength of the positive or negative expression. Moreover, the affective norms scores are hand-crafted dictionaries where the weights assigned to words are derived in isolation from context, with no adaptation to new words, spellings and the language use on Facebook.

3 https://www.cs.york.ac.uk/semeval-2013/task2/

4 Qualitative Analysis

In this section we highlight the most important unigram features for each dimension, as well as the qualitative difference between the two dimensions of valence and arousal. To this end, we show the words with the highest univariate Pearson correlation with either of the two dimensions in Table 5. Each score is represented by the mean of the two ratings.

Table 5: Words most correlated positively and negatively with the two dimensions.

             Valence      r        Arousal      r
Positive     !            .251     !            .773
             :)           .237     Birthday     .097
             Birthday     .212     Happy        .081
             Happy        .197     Its          .079
             Thank        .196     Wishes       .076
             Great        .195     Soooo        .074
             Love         .195     Thanks       .073
             Thanks       .179     Christmas    .071
             Wishes       .170     Sunday       .069
             Wonderful    .159     Yay          .064
Negative     Hate         -.163    [..]*        -.206
             :(           -.159    .            -.164
             ?            -.117    Status       -.064
             Sick         -.112    Life         -.064
             Why          -.102    People       -.060
             :'(          -.094    Bored        -.059
             Not          -.093    :/           -.056
             Bored        -.092    Of           -.056
             Stupid       -.089    Deal         -.056
             ...          -.087    Every        -.054

The results show that both dimensions have similar top features as well as distinct ones. Tokens such as '!', 'Happy', 'Birthday', 'Thanks' and 'Wishes' are indicative of both positive valence and arousal, while tokens like 'Bored' and '...' are indicative of both negative valence and low arousal. We notice, however, tokens that are only indicative of positive valence ('Wonderful', 'Love'), positive arousal ('Sunday', 'Yay'), negative valence ('Why', 'Stupid') or negative arousal ('Life', 'Every', 'People'). The question mark is correlated with negative valence, together with the word 'Why', showing that questions on Facebook are usually negative in valence. Also in terms of punctuation, positive valence and arousal are expressed through exclamation marks, while negative valence and especially arousal are expressed through repeated periods. This behavior is specific to Social Media and is usually not captured by standard emotion lexicons.

Emoticons also exhibit an interesting pattern across the two dimensions. The smiley :) is the second most correlated feature with valence, but is not in the top 10 for arousal. Similarly, the frown emoticons (:(, :'() are amongst the top 10 features correlated with negative valence, but have no relationship with arousal. The only emoticon correlated highly with low arousal is the undecided emoticon (:/).
5 Conclusion

In this work, we introduced a new corpus of Social Media posts mapped to the circumplex model of affect. Each post is annotated on two independent nine-point scales of valence and arousal by two annotators with a background in psychology, who were calibrated before rating the statuses. We described our annotation process and reviewed the annotation guidelines. In total, we annotated 2895 Facebook posts, discarding the un-ratable ones. The corpus and our valence and arousal bag-of-words prediction models are publicly available.

The results of the annotations have very high agreement. A linear regression model using a bag-of-words representation trained on this data achieves high correlations with the outcome annotations, especially when predicting arousal. Standard sentiment analysis lexicons predicted both dimensions with lower accuracies.

Our system can be further improved by leveraging the vast amount of available data for Twitter sentiment analysis. We consider this model extremely useful for computational social science research that aims to measure individual user valence and arousal, its relationship to demographic traits and its changes over time or in relation to certain life events.

Acknowledgements

The authors acknowledge the support of the Templeton Religion Trust, grant TRT-0048.

References

Margaret Bradley and Peter Lang. 1999. Affective Norms for English Words (ANEW): Stimuli, Instruction Manual, and Affective Ratings. Technical report.

Samuel Brody and Noemie Elhadad. 2010. An Unsupervised Aspect-Sentiment Model for Online Reviews. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, pages 804–812.

Rafael Calvo and Sidney D'Mello. 2010. Affect Detection: An Interdisciplinary Review of Models, Methods, and their Applications. IEEE Transactions on Affective Computing, 1(1):18–37.

Paul Ekman. 1992. An Argument for Basic Emotions. Cognition & Emotion, 6(3-4):169–200.

Maria Gendron and Lisa Feldman Barrett. 2009. Reconstructing the Past: A Century of Ideas about Emotion in Psychology. Emotion Review, 1(4):316–339.

Dirk Hovy. 2015. Demographic Factors Improve Classification Performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL, pages 752–762.

Margaret L Kern, Johannes C Eichstaedt, H Andrew Schwartz, Greg Park, Lyle H Ungar, David J Stillwell, Michal Kosinski, Lukasz Dziurzynski, and Martin EP Seligman. 2014. From "sooo excited!!!" to "so proud": Using language to study development. Developmental Psychology, 50:178–188.

Michal Kosinski, David Stillwell, and Thore Graepel. 2013. Private Traits and Attributes are Predictable from Digital Records of Human Behavior. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 110(15):5802–5805.

Mara Mather and Laura L Carstensen. 2005. Aging and Motivated Cognition: The Positivity Effect in Attention and Memory. Trends in Cognitive Sciences, 9(10):496–502.

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval, pages 321–327.

Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Jonathan Posner, James A Russell, and Bradley S Peterson. 2005. The Circumplex Model of Affect: An Integrative Approach to Affective Neuroscience, Cognitive Development, and Psychopathology. Development and Psychopathology, 17(3):715–734.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval, pages 451–463.

James A. Russell. 1980. A Circumplex Model of Affect. Journal of Personality and Social Psychology, 39(6):1161–1178.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval, pages 70–74.

Carlo Strapparava and Rada Mihalcea. 2008. Learning to Identify Emotions in Text. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC, pages 1556–1560.

Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: an Affective Extension of WordNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, volume 4 of LREC, pages 1083–1086.

Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558.

Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2012. Sentiment Strength Detection for the Social Web. Journal of the American Society for Information Science and Technology, 63(1):163–173.

Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1815–1827.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas. Behavior Research Methods, 45(4):1191–1207.

Stephen R Wester, David L Vogel, Page K Pressly, and Martin Heesacker. 2002. Sex Differences in Emotion: A Critical Review of the Literature and Implications for Counseling Psychology. The Counseling Psychologist, 30(4):630–652.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 347–354.