Psycholinguistic Features for Emotion Intensity Prediction in Tweets

PLN-PUCRS at EmoInt-2017: Psycholinguistic features for emotion intensity prediction in tweets Henrique D. P. dos Santos, Renata Vieira Pontifical Catholic University of Rio Grande do Sul Porto Alegre - Brazil [email protected], [email protected] Abstract valued score between 0 and 1. The minimum possible score 0 stands for feeling the least amount of Linguistic Inquiry and Word Count emotion and the maximum possible score 1 stands (LIWC) is a rich dictionary that map for feeling the maximum amount of emotion. This words into several psychological cat- shared task analyze the emotion: anger, fear, joy egories such as Affective, Social, and sadness. We show an approach that can score Cognitive, Perceptual and Biological emotions based on psycholinguistic features. processes. In this work, we have used The rest of this paper is organized as follows. In LIWC psycholinguistic categories to train Section2 we describe LIWC, the well-known psy- regression models and predict emotion cholinguistic dictionary used in our experiments, intensity in tweets for the EmoInt-2017 Section3 covers some previous work that use psy- task. Results show that LIWC features cholinguistic features to classify text. Section4 may boost emotion intensity prediction on presents the proposed techniques and their evalua- the basis of a low dimension set. tion. In Section5 we discuss the most informative LIWC categories for each emotion set and finally, 1 Introduction we conclude in Section6 with future work. In Natural Language Processing tasks many techniques rely on statistical methods to classify texts 2 LIWC Categories based on word distribution. Sentiment analysis also takes advantage of this kind of approach to Linguistic inquiry and word count (LIWC), be- detect emotion or polarity in sentences (Liu and sides being a software, is a psycholinguistic lexi- Zhang, 2012). Twitter became the main source con created by psychologists with focus on study- of data to extract sentiment information in social ing the various emotional, cognitive, and struc- media because of its data characteristics: huge tural components present in individuals’ verbal amount of small sentences distributed in a time- and written speech samples (Pennebaker et al., line, which are easily gathered. 2015). This resource allows non-specialists to re- In Twitter, sentiment classification intends to trieve psychological statistics in text, and to search extract polarity or emotion with regards to a spe- for patterns that are able to detect differences in cific subject. The polarity defines a positive or group of documents. negative valency and the emotion usually is mod- The first LIWC version was developed as part eled over Ekman’s six basic emotions: joy, anger, of an exploratory study of language and disclosure sadness, happiness, surprise, fear and disgust (Ek- (Pennebaker, 1993). The second (LIWC2001) and man, 1992). third (LIWC2007) versions updated the original This work intends to score tweets for emotion with an expanded dictionary and a modern soft- intensities, by giving a real value for each tweet ware design (Pennebaker et al., 2001, 2007).The (Mohammad and Bravo-Marquez, 2017a), as part most recent evolution, LIWC2015 (Pennebaker of the EmoInt-2017 task. The goal of the task is, et al., 2015), has significantly altered both the dic- given a tweet, to predict the intensity of a specific tionary and software options. LIWC 2007 has emotion expressed in it (Mohammad and Bravo- been available as a open source dictionary. Marquez, 2017b). The intensity score is a real- LIWC dictionary classifies words in a variety 189 Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 189–192 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics Category Examples autism community from others. They analyze the Affective happy, cried, love, hurt frequency distribution differences in psychologi- Social mate, talk, they, dad cal processes between those communities and are Cognitive cause, know, ought, think able to detect them with 79% of accuracy using Perceptual look, heard, feeling, view machine learning. Mohtasseb and Ahmed(2009) Biological eat, blood, pain, hand use psychological features to find online diaries in blogs. Iyyer et al.(2014) classifies political Table 1: LIWC psychological process examples ideology between liberal and conservatives in social media. Santos et al.(2017) took advantage of of psychological categories based on psycholo- LIWC dictionary to analyze and detect personal gists studies and observations (Tausczik and Pen- stories posts in Brazilian blogs with 81% of preci- nebaker, 2010). LIWC assigns words into one sion over thousands of posts. of four high-level categories: linguistic processes, LIWC Psycholinguistic features are also used to psychological processes, personal concerns, and define the writer personality, as Poria et al.(2013) spoken categories. These are further subdivided shown in their work. Besides, it can be used to into a three-level hierarchy. The taxonomy ranges identify mental issues in online forum communi- across topics (e.g., health and money), emotional ties (Cohan et al., 2016). responses (e.g., negative emotion) and processes There is a great potential for psychologically not captured by either, such as cognition (e.g., dis- oriented dictionaries and here we use it to score crepancy and certainty). The words carry rich emotions values in tweets together with Support information about the author’s personality, senti- Vector Machines algorithms. ments, style, topics, and social relationships. The main categories in LIWC dictionary are the 4 Psycholinguistic Features following: For evaluating the prediction property of psy- Linguistic Dimensions and Other Grammar cholinguistic categories, each tweet is converted • to a vector of 64 positions, one for each LIWC Affective, Social, Cognitive, Perceptual and • category, explained previously. Each LIWC cate- Biological processes gory represents the frequency distribution of this category appearance in the specific tweet. Each Drives, Time orientations and Relativity • word could fit multiples categories, e.g. the word Personal concerns and Informal language ”admits” belongs to categories: Common verbs, • Present tense, Social processes, Cognitive pro- These categories are then specialized in other cesses and Insight. sub-categories, as in Affective processes sub- For our experiments we use Python library categorized as Positive and Negative Emotions, Scikit-Learn (Pedregosa et al., 2011) machine Anxiety, Anger and Sadness. learning algorithms. We ran cross-fold validation Some examples of words in such categories with 10 folds. can be found in Table1. These categories were We use Support Vector Regression (SVR) tun- translated to other languages (Balage Filho et al., ning the RBF, Linear, Linear SVR and Sigmoid 2013), and have been used to compare writing kernel parameters C (the penalty parameter) and γ styles between languages and countries (Afroz (the kernel width hyperparameter) performing full et al., 2012). In this paper we use this dictionary grid search over the 800 combinations of expo- for emotion prediction. nentially spaced parameter pairs (C, γ) following (Hsu et al., 2003). For Gradient Boosting Regres- 3 Related Work sion we run a simple grid search. Only the best re- There has been a lot of research seeking text clas- sults of each algorithm, using Spearman rank cor- sification in the scope of social media. Here we relation, are shown in Table2. focus on the works that use LIWC psycholinguis- The best results were obtained using Gradient tic features to solve some of those problems. Boosting Regression, Linear SVR and SVR with Nguyen et al.(2013) use the LIWC psycho- linear kernel, all with default parameters. All logical lexicon to distinguish blog posts of the three algorithms are highlighted in Table2 be- 190 Algorithm joy anger sadness fear Avg Score SVR k=Linear 0.431 0.502 0.557 0.441 0.483 Linear SVR 0.428 0.504 0.556 0.443 0.482 Gradient Boosting 0.420 0.519 0.565 0.420 0.481 SVR k=RBF 0.399 0.445 0.517 0.407 0.442 SVR k=Sigmoid -0.016 -0.085 -0.108 0.069 -0.035 Table 2: Spearman Score running each algorithm over emotions sets Joy Anger Sadness Fear Total function words Auxiliary verbs 1st pers singular Anxiety Negations Present tense Social processes Sadness Cognitive processes Negations Sadness Feel Discrepancy Swear words See Ingestion Tentative Humans Ingestion Space Exclusive Relativity Leisure Death Positive emotion Negative emotion Affective processes Anger Table 3: Top 10 LIWC most informative features cause there is no statistical difference in the Spear- regression task, is expect that they have a good re- man rank correlation. gression information. It is important to state that In Scikit-learn library, SVR with linear ker- Anger is a subcategory of Negative Emotion. nel differs from Linear SVR because the last use Another interesting confirmation is death, sad- liblinear rather than libsvm. The processing time ness and anxiety categories as good predictors for liblinear and prediction score is better using then Fear emotion set. Anger category appears as an the generic SVM library, as we see in Table2. informative feature for Joy emotion set, we will After defining the regression algorithm and the look further in the details of that to see whether it best parameters, we built the model for each emo- is informative due to a low feature value or some- tion dataset, based on the training set. Then we run thing else. Also, we want to look further to ex- each model for the test set and generate the output plain Negations LIWC category as good predictor for evaluation. The LIWC resource, test dataset in Joy emotion set. and scripts can be accessed in author’s Github project page 1. 6 Conclusion and Further Work 5 Most Informative Features Using univariate linear regression tests, we tested Psycholinguistic features have been used to clas- the effect of a single regressor and listed the most sify texts and sentences for a variety of tasks.

Load more