
2020 International Conference on Computational Science and Computational Intelligence (CSCI)

Emotion detection in Twitter posts: a rule-based algorithm for annotated data acquisition

Maria Krommyda, Anastatios Rigos, Kostas Bouklas, Angelos Amditis
Institute of Communication and Computer Systems, Athens, Greece

Abstract: Social media analysis plays a key role in the understanding of the public's opinion regarding recent events and decisions, in the design and management of advertising campaigns, as well as in the planning of next steps and mitigation actions for public relationship initiatives. Significant effort has been dedicated recently to the development of data analysis algorithms that will perform, in an automated way, sentiment analysis over publicly available text. Most of the available work focuses on the binary categorization of text as positive or negative, without further investigating the emotions leading to that categorization. The current needs, however, for in-depth analysis of the available content, combined with the complexity and multi-dimensional aspects of human emotions and opinions, have rendered such solutions obsolete. Due to these needs, research is currently focusing on specifying the emotions, and not only the sentiment, expressed in a given text. This is, however, a very challenging effort due not only to the lack of annotated datasets that can be used for emotion detection in text but also to the subjectivity infused in datasets that have been created based on manual annotations. A hybrid rule-based algorithm is presented in this paper that supports the creation of a fully annotated dataset over Plutchik's eight basic emotions. The presented algorithm takes into consideration the available emoji in the text and utilizes them as objective indicators of the expressed emotion, thus efficiently tackling both identified challenges. This is a full regular paper submitted to the CSCI-ISNA Symposium.

Index Terms: social media analysis, data analytics, emotion detection, sentiment analysis, data acquisition, data annotation, Plutchik's eight basic emotions, social media posts

This work is part of the RESIST project. RESIST has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 769066. Content reflects only the authors' view. The Innovation and Networks Executive Agency (INEA) is not responsible for any use that may be made of the information it contains.

978-1-7281-7624-6/20/$31.00 ©2020 IEEE. DOI 10.1109/CSCI51800.2020.00050

I. INTRODUCTION

Social media monitoring can refer either to measuring opinions about current events, also called sentiment analysis, or to emotion detection in the produced content [1]. The term sentiment analysis [2], [3] refers to the process of identifying and categorizing text, in an automated way, based on the opinions expressed in it. The process focuses exclusively on the analysis of the sentiment of the person that provided the text, which is categorized as positive, negative, or neutral. Such analysis is mainly used for movies, products and persons, when their appeal to the public is being measured [4], and requires extended quantities of text collected over a long period of time from different sources to ensure an unbiased and objective output.

While this technique is very popular and well established, with many important use cases and applications, there are other equally important use cases where such analysis fails to provide the needed information. As an example, such analysis will not be of value in the case of users in the proximity of a stressful or extreme event, such as a natural disaster due to extreme weather phenomena. To begin with, users in the vicinity of an event are expected to be negatively affected by the situation; they may be scared, worried or angry, so such an analysis would have little or no added value for the risk assessment and the end users. Also, the text that will be available, regardless of the sources examined, is expected to be limited, coming from the few users in the area of the event and produced within a short amount of time.

Extending the idea of sentiment analysis, emotion detection [5] does not examine whether the expressed sentiments are positive or negative but focuses on identifying the exact human emotion that is present in an image, video, voice recording or text. The task of identifying in an automated way the emotions expressed by an individual, especially when the size and the features of the input are limited, is neither trivial nor easy to model. Humans have the ability to understand the emotions of the people around them using a series of signs in addition to the actual words exchanged, including body language, voice tone and facial expressions. Even then, there are cases where there are contradictory opinions about what the real expressed emotion is in a given context, as each individual is expected to understand and interpret emotional expressions differently due to personal social experiences.

The most popular theory regarding emotion classification, called the discrete emotion theory [6], is that there are some core human emotions that are the basis upon which all the other emotions can be interpreted and categorized. The emotions that form this basis, however, have been under discussion for many years.

It was as early as 1872, in Charles Darwin's book 'The Expression of the Emotions in Man and Animals' [7], that the idea of the discrete emotion theory was first formed. The facial, physiological as well as behavioural characteristics of an individual were associated with the emotional state of the individual. In this book, however, there is no discussion about which may be the basic emotions that humans express.

The idea around the discrete emotions evolved over time [8], and in 1957 Ekman presented [9] his initial view about basic emotions. His initial work can be summarized in two main assumptions. To begin with, Ekman claimed that a pleasant-unpleasant and active-passive scale is sufficient to capture the differences among emotions. Next, he argued that the association of body language and facial expression with the emotion they correspond to is a skill that people develop through social interaction and that is heavily based on their cultural background.

A few years later, as he proceeded with his research regarding the expression of emotions, he became the first to challenge his own assumptions and proposed categorization, and tried to establish a systematic and unbiased methodology of conducting emotion classification. While Ekman's work has received a lot of criticism regarding the reliability, the data collection and validation process and the trustworthiness of the result, it has provided a significant contribution, the six basic emotion classes, which are anger, disgust, fear, happiness, sadness and surprise.

Further evolving Ekman's work, Plutchik [10] increased the number of primitive emotions to eight. He proposed a psycho-evolutionary classification approach for the emotions based on psychological observations of general emotional responses [11]. He justified the selection of these emotions, as well as their need to belong to the list of primitive ones, by placing them as the triggers of behaviors important for survival in emergency situations, such as the fight-or-flight response triggered by the emotion of fear. Plutchik's eight average-intensity emotional categories are joy, trust, fear, surprise, sadness, disgust, anger and anticipation.

Contributions. A complete solution for the creation of a fully annotated dataset is presented here. The proposed solution can be used as the training basis in many diverse emotion detection tasks. The presented solution provides a series of functionalities that are summarized as follows:
• A natural language processor that handles the unique linguistic characteristics of social media posts in regard to lexical, syntax and annotation preferences and provides a uniform text in the annotated dataset.
• A fully annotated dataset over Plutchik's eight basic emotions: joy, trust, fear, surprise, sadness, disgust, anger and anticipation.
• A hybrid rule-based algorithm that supports the creation of an objectively classified dataset. The algorithm takes into consideration the available emoji in the text and utilizes them as objective indicators of the expressed emotion, thus efficiently tackling the challenge of the subjectivity of emotion detection.
• A manually created list that provides the categorization of specific emoji over Plutchik's eight basic emotions. The list has been designed to include only emoji that can be exclusively mapped to one of the eight emotions examined, excluding all the others.

These functionalities have been developed so that they can be fully parameterized and used in a modular way, providing training datasets fully compliant with the characteristics needed for the modeling task they will be used for.

II. NATURAL LANGUAGE PROCESSING FOR SOCIAL MEDIA POSTS

Natural Language Processing (NLP) [12] is a field of artificial intelligence that focuses on the interpretation of human language by computers as well as on human-computer interaction using natural language¹. The ultimate objective of NLP is to recognize text, identify the meaning of the words used, interpret the meaning in the context used and in the end understand the text the same way that a human would. In the end, the purpose of NLP is to extract knowledge and meaning from the text that can have added value for applications and systems. Given the complexity of the task, the plethora of meanings a word or phrase can have based on its usage and context, and the way humans are able to understand text, NLP uses machine learning techniques to derive meaning from text.

NLP is one of the very challenging fields of computer science due to the characteristics of human language and the multiple indicators that contribute to the understanding of the meaning of a phrase [13]. Grammar and syntax rules used for the formation of sentences vary in the level of detail, may have many exceptions and their applicability can depend on the content of the phrase. One of the most indicative rules that can be used here as an example is the plurality of items. The general rule, which dictates that the character "s" at the end of a noun signifies plurality, has three word groups as exceptions. In the first group there are words that end with the character "s" but are singular, such as 'bus'; in the second group are words that do not end with the character "s" but are plural, such as 'children'; and in the third group are words that do not change between singular and plural form, such as 'fish'. While these rules are complex, they are exhaustive and can be modeled as deterministic rules. Other elements of human language however, such as sarcastic remarks and double negatives, cannot be modeled in a deterministic way. NLP aims to bring together the meaning of the words with the understanding of how the concepts are connected and the message that they aim to deliver.

NLP systems, like any other machine learning methodologies/techniques, are heavily dependent on the characteristics of the text they have been trained on [14].

¹ https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d

In the case of NLP, the origin of the training text [14] and the grammatical and syntactical expectations are key elements that should be taken into consideration. Traditionally, NLP systems are trained over properly formatted text that complies with grammar and syntax rules and uses official vocabulary, such as encyclopedias and journal articles. Solutions that have been trained using such text provide remarkable results when used with similar text but cannot be used for short texts such as posts from social media [15]. The main reason for that is that the characteristics that make the analysis possible are completely different from what the system expects. Social media posts are expected to have loose syntax rules and to use informal vocabulary, unofficial abbreviations, poor spelling and incorrect grammar and punctuation. The main expected differences, which the data harmonizer will tackle, are presented in Table I.

TABLE I
TRADITIONAL TEXT VS. SOCIAL MEDIA POST NLP

# | Traditional NLP | Social posts NLP
1 | The input text is one large passage written by exclusively one person, rarely a couple of people | The input is multiple small phrases of text written by many people
2 | Limited amount of text per topic, written over a significant period of time | Large amounts of text per topic, constantly produced
3 | Content focused on specific topics, rarely changing within the same passage | Fast-paced content change in a short amount of time, posts covering multiple topics
4 | Contains only formal vocabulary | Informal, non-existent words may be present
5 | Properly spellchecked multiple times by professionals | Non-standard or incorrect spelling, including letter repetition for emphasis and common replacement of expressions with similar-sounding ones, such as there, they're, their
6 | Correct use of all the syntax rules | Sporadic use of syntax rules
7 | Proper use of punctuation marks | Incorrect use of punctuation marks, often used in excess
8 | Proper word capitalization | Random use of capital letters, used occasionally to express emotions or to avoid spaces between words
9 | Objective text, containing properly formatted facts and accurate information, with limited emotional expressions and devoid of opinions | Messages focusing on expressing emotions and opinions as well as strong arguments; fake news or misrepresented information may be included
10 | Official abbreviations, used after being explained/expanded in previous places within the passage | Custom abbreviations, used constantly to shorten the messages, without having one unique interpretation; their meaning is dependent on the context and they are not explained within the post
11 | Original text, no repetition | Retweeting, sharing, quotes

III. DATASET COLLECTION & HARMONIZATION

In order to provide a dataset that is able to tackle the above-mentioned linguistic challenges regarding social media posts, cover the needs and requirements of any emotion detector and provide a robust solution, tweets will be annotated using a rule-based [16] approach that takes into consideration all the peculiarities of social media posts. The resulting dataset can be used to train an emotion detector, as shown in Figure 1.

Fig. 1. Dataset usage in model training.

In order to collect the needed social media posts, a developer's account was created for the Twitter platform², and a dedicated application was registered and used for the data collection process. A Python script using the Twitter Streaming API³ and the Tweepy⁴ library was created. The main problem with the streaming API is the requirement to provide a filter that uses specific keywords. The initial idea was to add each emotion and its synonyms to that filter. This however would create a dataset that would lack diversity and fail to support all the use cases. Aiming to collect a plethora of tweets, expressing emotions in multiple ways, including emoji, hashtags and keywords, the most frequently used words in tweets were used to collect posts.

It comes as no surprise that articles and pronouns are at the top of this list⁵. So the streaming filter was set as {"a", "the", "I", "to", "you", "in", "on", "for", "with", "that"}, aiming to capture tweets with versatile content. The tweet production rate is estimated at an average of 6000 tweets per second⁶; the streaming API, though, has a limit of 50 tweets per second⁷, allowing us to capture only approximately 0.83% of the produced tweets. Retweets, quotes and responses to tweets were removed from the dataset to eliminate text repetition and short phrases that would be too difficult to interpret out of context.

² https://twitter.com/home?lang=en
³ https://developer.twitter.com/en/docs
⁴ https://www.tweepy.org/
⁵ https://techland.time.com/2009/06/08/the-500-most-frequently-used-words-on-twitter/
⁶ https://www.internetlivestats.com/twitter-statistics/
⁷ https://developer.twitter.com/en/docs/labs/filtered-stream/faq#:~:text=Youcanstreamupto,than50Tweetspersecond
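The capture rate and the removal of retweets, quotes and replies described above can be illustrated with a short sketch. It is not the authors' collection script. It assumes tweets arrive as dictionaries in the Twitter API v1.1 JSON layout (fields `retweeted_status`, `is_quote_status`, `in_reply_to_status_id`), and the `min_words` threshold for "short phrases" is a hypothetical value, since the paper does not state one.

```python
# Keyword filter reported in the paper for the streaming request.
TRACK_TERMS = ["a", "the", "I", "to", "you", "in", "on", "for", "with", "that"]


def capture_percentage(stream_limit, production_rate):
    """Share of the produced tweets that the stream can deliver, in percent."""
    return 100.0 * stream_limit / production_rate


def keep_tweet(tweet, min_words=3):
    """Return True for original tweets that are long enough to interpret.

    `tweet` is assumed to be a dict in the v1.1 JSON layout;
    `min_words` is a hypothetical threshold, not a value from the paper.
    """
    if "retweeted_status" in tweet:                      # retweet
        return False
    if tweet.get("is_quote_status"):                     # quote tweet
        return False
    if tweet.get("in_reply_to_status_id") is not None:   # reply to another tweet
        return False
    return len(tweet.get("text", "").split()) >= min_words
```

With the figures quoted in the text, `capture_percentage(50, 6000)` gives roughly 0.83, which matches the reported share of captured tweets.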

Fig. 2. Tweet collection process.

The data flow, including this removal of repeated text and hard-to-interpret short phrases, is shown in Figure 2.

After the collection of the dataset, all the tweets were processed in order to be harmonized, eliminating some of the identified challenges and ensuring that only proper text was used for the training dataset. The harmonization process focused on:

Hashtags: Hashtags are very important for tweets. They are used in excess, either to highlight important meanings in the text or to draw attention to specific situations. Hashtags are more often than not abbreviations or compositions of multiple words, making them very difficult to process and analyse. In our case we try to convert hashtags into words by establishing a robust data flow, as shown in Figure 3, that allows us to identify the most probable words for the hashtag. First we examine if the hashtag is simply one word, by looking for it in a dictionary. Then we examine if the hashtag is written using the camel case style, which is the most common style currently used. If this is the case, we utilize the capital letters to split the hashtag into words, validating this by ensuring that the words are in the dictionary. If this is not the case, then we examine each character of the hashtag, trying to use all the characters in words. All the above are based on the usage of the nltk text corpora and lexical resources⁸ to identify valid words. Last but not least, parts of the hashtag that do not correspond to any word in the dictionary are examined against the Abbreviations⁹ open API and replaced with the corresponding words in case of a match. If the hashtag cannot be converted into words, then it is removed from the text, as shown in Figure 3.

Fig. 3. Hashtag processing in tweets.

URLs: More often than not, tweets contain hyperlinks to other sources. These add to the context and meaning of the tweet; they are, however, of no value for the emotion detector. For this reason, the tweet pre-processor Python library¹⁰ is used to eliminate them from the tweets.

Mentions: As with URLs, mentions are used to draw the attention of another user to a post. While their use is extensive and serves the purposes of a social media platform, they have no added value for the emotion detector. To this end, the pre-processor library is used once more to remove them from the text.

Character repetition/misspelled words: Each tweet is split into the words that it is composed of. Each word is checked against the nltk dictionary in order to be validated. If a word is not present in the dictionary, we examine if there is character repetition. Any character that appears more than twice in a row is replaced by only two occurrences. If this is still not a valid word, the repeated character is replaced with a single occurrence. If this still does not create a valid word, as well as for any other invalid word that does not have letter repetition, the word is spellchecked using the relevant Python library¹¹. Only words that are valid and included in the dictionary are left in the tweet.

IV. DATASET ANNOTATION

The collected dataset is next annotated with a rule-based Python script in nine categories, the eight emotion categories of Plutchik's wheel and one for unidentified emotion. The rules were specifically designed to take into consideration the characteristics of social media posts.

⁸ http://www.nltk.org/howto/corpus.html
⁹ https://www.abbreviations.com/
¹⁰ https://pypi.org/project/tweet-preprocessor/
¹¹ https://pypi.org/project/pyspellchecker/
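The text that reaches the annotation step has already been harmonized as described in Section III. A minimal sketch of those steps is given here, under stated assumptions: the small `VOCAB` set is a hypothetical stand-in for the nltk word lists, plain regular expressions stand in for the tweet pre-processor library, and the Abbreviations API lookup and the final spellcheck are left out. All function names are illustrative and are not taken from the authors' code.

```python
import re

# Hypothetical stand-in for the nltk dictionary used in the paper.
VOCAB = {"so", "happy", "good", "morning", "love", "this", "cool", "stay", "safe"}


def strip_urls_and_mentions(text):
    """Remove hyperlinks and @mentions (regex stand-in for tweet-preprocessor)."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return " ".join(text.split())


def segment(chars, vocab):
    """Split a lowercase string into vocabulary words using every character, else None."""
    if not chars:
        return []
    for end in range(len(chars), 0, -1):      # try the longest first word first
        head = chars[:end]
        if head in vocab:
            rest = segment(chars[end:], vocab)
            if rest is not None:
                return [head] + rest
    return None


def hashtag_to_words(tag, vocab=VOCAB):
    """Single word, then camel case, then exhaustive split; [] means drop the hashtag."""
    body = tag.lstrip("#")
    if body.lower() in vocab:
        return [body.lower()]
    parts = [p.lower() for p in re.findall(r"[A-Z][a-z]+|[a-z]+", body)]
    if len(parts) > 1 and all(p in vocab for p in parts):
        return parts
    return segment(body.lower(), vocab) or []


def fix_repetition(word, vocab=VOCAB):
    """Reduce runs of 3+ identical characters to two, then to one; None if still invalid."""
    if word in vocab:
        return word
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if two in vocab:
        return two
    one = re.sub(r"(.)\1+", r"\1", two)       # collapses every repeated run
    return one if one in vocab else None      # the paper spellchecks at this point


def harmonize(tweet, vocab=VOCAB):
    """Apply the steps in order and keep only words found in the vocabulary."""
    words = []
    for token in strip_urls_and_mentions(tweet).split():
        if token.startswith("#"):
            words.extend(hashtag_to_words(token, vocab))
        else:
            fixed = fix_repetition(token.lower().strip(".,!?"), vocab)
            if fixed:
                words.append(fixed)
    return " ".join(words)
```

For instance, `harmonize("@bob sooo happpy #GoodMorning https://t.co/x")` yields `so happy good morning` with this toy vocabulary. The recursive `segment` helper is one possible reading of "trying to use all the characters in words"; the paper does not specify the search strategy.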

The rules that are used by the system to handle these characteristics of social media posts are the following:

Emoji: The list with all the available emoji was examined and split into nine categories. Eight of the categories were mapped one to one to the eight emotion categories of Plutchik's wheel, while the ninth contained emoji that were either not associated with the eight emotions, such as food, vegetables, flags and professionals, or were the result of the combination of two or more emotions as per the chosen classification, such as disapproval.

More than 4000 emoji were examined, based on the list provided by the emoji Python library¹², and only 6% of them were classified to one of the eight categories, with the majority of them falling into the ninth one. Each emotion category was associated with a list of 20 to 30 emoji. As a further validation step, the list of the most commonly used emoji, a live application that tracks all the emoji usage in tweets¹³, was consulted to ensure that the majority of the most popular emoji were included in the lists for the emotions. A few indicative examples of the emoji that were mapped to the eight emotions are presented in Table II.

TABLE II
INDICATIVE EXAMPLES OF EMOJI MAPPED TO EMOTIONS
[One row of example emoji for each of: Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, No emotion]

Based on the above analysis, each tweet is examined for the usage of emoji. The emoji are examined with regard to the category they belong to: if they are in the ninth category they are simply replaced with the text that corresponds to them, but if they belong to one of the eight emotion categories then they are not replaced with the corresponding text but are used so that the tweet is classified accordingly. The text used to replace the emoji is based on the text provided by the emoji library for the Unicode, where ':' are completely removed and '_' are replaced with spaces. For example, Unicode 'U0000263A', which corresponds to the ☺ emoji, is replaced by the library with ':smiling_face:' and then with 'smiling face'.

Lexical relations: WordNet [17] is a large lexical database of English, where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms, so that each set expresses a distinct concept. The sets of synonyms are also interlinked by means of semantic and lexical relations. These relationships are:
• Synonymy. Two terms are characterized as synonymous when they have exactly or nearly the same meaning, such as the terms car and automobile.
• Antonymy. Two terms are characterized as antonymous when they represent the complete opposite of one another, such as hot and cold.
• Hyponymy. It is used to show a relationship of specification, such as the relationship between the word colour and red. In this case, red is a hyponym of colour.
• Hypernymy. It is used to show a relationship of generalization, such as the relationship between the word fork and the general term cutlery. Here, cutlery is a hypernym of fork.
• Meronymy. It is a relationship connecting a part to its whole. As an example, a tree is a meronym of a forest.

In the rules designed for the annotation algorithm, special focus is given to two main groups of relationships: the synonyms/hyponyms/hypernyms, which provide for each emotion a bag of words that can be used to classify a tweet in the category of this emotion, and the antonyms of each emotion, which create a bag of words that can be used to classify a tweet in the polar opposite of the emotion based on Plutchik's wheel [18], [19]. The annotation process for the tweets is shown in Figure 4.

Fig. 4. Rule based tweet annotation.

¹² https://github.com/carpedm20/emoji/
¹³ http://emojitracker.com/
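The two annotation rules above, emoji and lexical relations, can be sketched as follows. Every table in the block is a hypothetical miniature: the paper's emoji lists hold 20 to 30 entries per emotion and its bags of words come from WordNet, neither of which is reproduced here. Returning `None` when several emotions match is also an assumption, since the paper does not say how conflicts are resolved. Only single-code-point emoji are handled.

```python
# All tables below are hypothetical miniatures for illustration only.
EMOJI_EMOTION = {
    "\U0001F621": "anger",     # pouting face
    "\U0001F631": "fear",      # face screaming in fear
    "\U0001F602": "joy",       # face with tears of joy
    "\U0001F622": "sadness",   # crying face
}
NEUTRAL_EMOJI = {"\U0001F355": ":pizza:"}   # ninth category: replaced by its name

# Plutchik's wheel pairs each basic emotion with a polar opposite.
OPPOSITE = {
    "joy": "sadness", "sadness": "joy",
    "trust": "disgust", "disgust": "trust",
    "fear": "anger", "anger": "fear",
    "surprise": "anticipation", "anticipation": "surprise",
}

# Bags of words: synonyms/hyponyms/hypernyms, and antonyms, per emotion.
RELATED_WORDS = {
    "joy": {"happy", "glad", "delighted"},
    "fear": {"afraid", "scared", "terrified"},
}
ANTONYM_WORDS = {"trust": {"distrust", "mistrust"}}


def shortcode_to_text(code):
    """':smiling_face:' -> 'smiling face', as described in the text."""
    return code.replace(":", "").replace("_", " ")


def annotate_by_emoji(text):
    """Return (emotion or None, text with ninth-category emoji replaced by words)."""
    labels, pieces = set(), []
    for ch in text:
        if ch in NEUTRAL_EMOJI:
            pieces.append(" " + shortcode_to_text(NEUTRAL_EMOJI[ch]) + " ")
        else:
            if ch in EMOJI_EMOTION:
                labels.add(EMOJI_EMOTION[ch])   # emotion emoji label the tweet
            pieces.append(ch)
    cleaned = " ".join("".join(pieces).split())
    label = next(iter(labels)) if len(labels) == 1 else None
    return label, cleaned


def annotate_by_words(text):
    """Return the emotion implied by the bags of words, or None if absent or ambiguous."""
    tokens = set(text.lower().split())
    labels = set()
    for emotion, bag in RELATED_WORDS.items():
        if tokens & bag:
            labels.add(emotion)
    for emotion, bag in ANTONYM_WORDS.items():
        if tokens & bag:
            labels.add(OPPOSITE[emotion])       # antonym -> polar opposite emotion
    return next(iter(labels)) if len(labels) == 1 else None
```

Keeping the two rules as separate functions leaves the choice of which annotation source to trust to the caller.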

Due to the fact that the annotation process is based on specific characteristics of the tweet, specifically emoji and words associated with the emotions through lexical relations, this could lead to the creation of a training dataset not applicable in some use cases. For example, tree-like machine learning models such as decision trees tend to be easily affected by the presence of specific keywords and patterns. To ensure the creation of the proper dataset for each case, the algorithm offers four options regarding the methods used for the text annotation in the dataset. The first option includes in the dataset only the text that has been annotated using an emoji, the second option includes only text that has been annotated using a lexical relationship, and the third option is to keep both texts with an appropriate tag specifying the annotation method. Finally, there is also a more restrictive option available that includes in the dataset only text that has been annotated by both methods, where there was an agreement in the annotation.

In addition, the algorithm provides the option to either keep the elements that were used for the classification of the tweet in the text or remove the emoji and the words that specified the lexical relation from the text.

Taking into consideration the fact that collected tweets that do not have any of the discussed specific characteristics are not included in the annotated dataset and that some emotions are more popular and more often expressed than others, causing an imbalance between the availability of tweets for each emotion, the final dataset is expected to have fewer tweets than the initially collected one and a different distribution of tweets per emotion category. Tests performed at different hours during the day and on different days of the week, including late at night and during weekends, showed that there is a constant lack of social media posts expressing fear and anticipation. Tweets expressing anger and trust are also hard to collect, while joy and surprise are the most common emotions expressed.

CONCLUSIONS

A holistic solution for the creation of a fully annotated dataset which maps short text to Plutchik's eight basic emotions was presented in this document. The presented solution ensures that the dataset contains only valid words and phrases, without any of the linguistic characteristics of social media posts, thus allowing its usage in a wide range of use cases. Finally, the categorization algorithm, which is based on the available emoji in the text, provides an objective categorization devoid of opinion subjectivity.

Two datasets with one million annotated social media posts each were created with the proposed solution. The first dataset contains only posts annotated using emoji, while the second contains posts annotated either due to an emoji or a keyword. In both cases two variations of the dataset were created, one with the complete content of the post and one with the part of the social media post used to provide the annotation removed from the training dataset. The datasets are currently being used, in the context of the EU funded RESIST¹⁴ project, aiming to identify the emotions expressed by individuals in the proximity of commute disturbances on highways, such as unplanned closed infrastructure or enforced traffic rerouting.

In order to evaluate the two datasets, they have been used to train different machine learning models. Tree-like models, such as the decision tree [20] and the random forest [21], were able to reach an accuracy between 75% and 85%, depending heavily on the volume of the training dataset, and performing significantly better when the dataset contained the complete social media post. Neural networks [22] were able to reach higher accuracy than the tree-like models with the same data volume, between 82% and 90%, performing better when the dataset had the part of the social media post used to provide the annotation removed.

¹⁴ https://www.resistproject.eu/

REFERENCES

[1] R. K. Bakshi, N. Kaur, R. Kaur, and G. Kaur, "Opinion mining and sentiment analysis," in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, 2016, pp. 452–455.
[2] B. Liu, "Sentiment analysis and opinion mining," Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–167, 2012.
[3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. J. Passonneau, "Sentiment analysis of Twitter data," in Proceedings of the Workshop on Language in Social Media (LSM 2011), 2011, pp. 30–38.
[4] R. Feldman, "Techniques and applications for sentiment analysis," Communications of the ACM, vol. 56, no. 4, pp. 82–89, 2013.
[5] F. A. Acheampong, C. Wenyu, and H. Nunoo-Mensah, "Text-based emotion detection: Advances, challenges, and opportunities," Engineering Reports, p. e12189, 2020.
[6] I. J. Roseman, "Cognitive determinants of emotion: A structural theory," Review of Personality & Social Psychology, 1984.
[7] C. Darwin and P. Prodger, The expression of the emotions in man and animals. Oxford University Press, USA, 1998.
[8] S. S. Tomkins, Affect imagery consciousness: Volume I: The positive affects. Springer Publishing Company, 1962, vol. 1.
[9] P. Ekman, "A methodological discussion of nonverbal behavior," The Journal of Psychology, vol. 43, no. 1, pp. 141–149, 1957.
[10] R. Plutchik, The emotions. University Press of America, 1991.
[11] R. Plutchik, "A general psychoevolutionary theory of emotion," in Theories of emotion. Elsevier, 1980, pp. 3–33.
[12] D. Jurafsky, Speech & language processing. Pearson Education India, 2000.
[13] C. Manning and H. Schutze, Foundations of statistical natural language processing. MIT Press, 1999.
[14] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
[15] A. Farzindar and D. Inkpen, "Natural language processing for social media," Synthesis Lectures on Human Language Technologies, vol. 8, no. 2, pp. 1–166, 2015.
[16] L. Canales and P. Martínez-Barco, "Emotion detection from text: A survey," in Proceedings of the Workshop on Natural Language Processing in the 5th Information Systems Research Working Days (JISIC), 2014, pp. 37–43.
[17] C. Fellbaum, "WordNet," The Encyclopedia of Applied Linguistics, 2012.
[18] M. Krommyda and V. Kantere, "Improving the quality of the conversational datasets through extensive semantic analysis," in 2019 IEEE International Conference on Conversational Data Knowledge Engineering (CDKE), 2019, pp. 1–8.
[19] M. Krommyda and V. Kantere, "Semantic analysis for conversational datasets: improving their quality using semantic relationships," International Journal of Semantic Computing, vol. 14, no. 3, 2020.
[20] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[21] A. Liaw, M. Wiener et al., "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[22] M. H. Hassoun et al., Fundamentals of artificial neural networks. MIT Press, 1995.
