
Multi-label detection in Twitter

Vannesa Rinny Berhitoe

July 2017

Data Science in Action - Master’s Thesis

Data Science: Business and Governance

Faculty of Humanities

Tilburg University, Tilburg

Thesis supervisor: Chris Emmery

Second reader: prof. dr. E.O. Postma

Preface

I would like to dedicate this thesis to my mother and father; I hope I make both of you proud. I would like to thank my dearest friends, Deve, Krista, and Emily; my siblings, Ricardo and Rebecca; my family in GKIN Tilburg and in Indonesia; and especially my partner Nofardo, for the endless support, both spiritual and material. My greatest appreciation goes to Chris Emmery for his understanding, advice, support, and criticism in supervising the entire process of this thesis.

July 2017

(Vannesa Rinny Berhitoe)

Summary

The present study investigated differences in emotional writing on social media, with a specific focus on gender differences and the effect of politics on emotion shift. To achieve this, emotion detection was performed on Twitter's textual content (i.e. tweets) over a year starting from March 2016, under the assumption that a tweet can contain more than one emotion type. To measure emotion shift in a political situation, the United States presidential election of 2016 was set as the context. Two multi-label classifiers were trained for emotion detection, and two binary classifiers were trained for politics/non-politics categorization. The top emotions that emerged from the tweets were either positive-related or negative-related; the combination of both positive and negative emotions was rarely the case. Findings from the gender-emotion stereotype study by Plant et al. (2000) were tested on the Twitter data, yielding a statistically significant result with a minor difference between males and females. In the political context, significant results were found in the emotion shift (i.e. positive-related and negative-related emotions) of each gender group and in the negative-related emotions of politics-labeled tweets, between the periods of two months before and after the election date. However, the effect sizes between groups were small (r < 0.30), demonstrating a trivial change.

Contents

Preface
Summary

1 Introduction
   1.1 Research questions

2 Related work
   2.1 Sentiment analysis
   2.2 Emotion detection
      2.2.1 Types of emotions
      2.2.2 Emotion analysis
   2.3 Multi-label classification
   2.4 Description of present study

3 Experimental setup
   3.1 Project Instruments
   3.2 Data Annotation
      3.2.1 Annotation procedure
      3.2.2 Annotation analysis
   3.3 Data preprocessing
      3.3.1 Data cleaning
      3.3.2 Data reduction and filtering
      3.3.3 Data labeling
   3.4 Task A: Classification of multi-label emotions and political tweets
      3.4.1 Feature extraction
      3.4.2 Implementation: Multi-label emotions
      3.4.3 Implementation: Politics Label
   3.5 Task B: Statistical Analysis of gender and emotion labels

4 Results
   4.1 Classifier performance
   4.2 Dominant emotion labels
      4.2.1 Wilcoxon rank-sum test: gender and the positive-related emotions
   4.3 Emotion theory
      4.3.1 Female-emotion stereotypes
      4.3.2 Male-emotion stereotypes
   4.4 Emotion shift in situational context
      4.4.1 Observation of the non-politics/politics tweets
      4.4.2 Observation of gender-emotion shift

5 Discussion

6 Conclusion

Bibliography

Chapter 1

Introduction

A person may write down their thoughts because they want to record an idea in a thorough and concise manner. It could also be that they have psychological problems (i.e. mental illness) and exercise writing as a healing process. Studies (Kennedy-Moore and Watson, 2001; Di Blasio et al., 2015) have found that writing helps the brain to regulate emotions and to cope with emotional distress. It may also assist in healing trauma and severe depression. The content of these writings was shown to be related to the disclosure of the person's deepest thoughts and feelings (Krpan et al., 2013), as participants were specifically asked to explore and reflect on major traumatic events (Ironson et al., 2013). The format of these conventional writings (e.g. a diary written with pen and paper) usually involved chronologically ordered, detailed descriptions of emotional moments, written in long, nicely-worded sentences.

The advancement of technology is gradually replacing traditional tools, and tweets — the messages posted on the micro-blogging social media platform Twitter (www.twitter.com) — are a good example of how digital forms differ from conventional types of written language. Twitter provides a smaller space (in the form of a text-box) to write, physically and literally, with a limit of 140 characters. This limit leads to increased use of slang and abbreviations, and the frequent occurrence of grammatical errors. Users of social media overcome the space constraint by using ill-formed words to express their thoughts. This opens up wide opportunities for researchers to conduct studies dealing with this noisy user-generated text, trying to automatically interpret the thoughts users express on these media. One such area is the Natural Language Processing sub-field of emotion detection.


Emotion detection deals with inferring certain classes of emotions from written text. From everyday experience, it seems that some emotions are distinct and occur independently. Inherently contradictory emotions, such as love and hate, might need a disjoint set of classes to accommodate the unique aspects of each class. On the other hand, similar emotions commonly fall under the same emotional valence, and they frequently co-occur in a certain situational context. Therefore, these multiple emotions may be grouped together after being processed through some empirical justification. Emotion detection, cast as a multi-label classification problem, could aid in illuminating the complex nature of emotion co-occurrence, thus providing an understanding of the characteristics of each emotion.

Psychology studies have revealed that emotion co-occurrence might originate from bipolar emotional valence (i.e. positive and negative); this is known as emotional ambivalence (Berrios et al., 2015; Heavey et al., 2017). Emotional ambivalence, or mixed emotional states, occurs about as often as homogeneous emotions. Plant et al. (2000) studied gender stereotypes in emotional co-occurrence, and found that females were often associated with happiness, fear, love, sadness, and sympathy, whereas males were associated with anger and pride. Later, Durik et al. (2006) extended Plant et al. (2000)'s study by including four ethnic groups as subjects (African Americans, Hispanic Americans, Asian Americans, and European Americans) and found similar results. The present study will try to verify Plant's results by applying emotion detection to tweets from males and females, respectively.

Gender roles in emotional states are put into context to learn about social attributes in the online environment. Coates (1991) and Hall (1995) have implied that social attributes are performances, since they manifest as styles of writing that can be adapted and can vary amongst their users in different situations (as cited in Bamman et al. (2014)). This has been confirmed by a study by Bailey et al. (2013), which showed that girls who posted online tended to follow the mainstream stereotype of young women in online spaces: (girls) show themselves to be attractive, have a boyfriend, and be part of the party scene. Following this concept, this study also uses a political situation as a context for deeper investigation into the differences in emotive language between genders. There are several possible motivations for specifically choosing political situations: many believe that recent historical moments in the world of politics have caused a paradigm shift in the Overton window (Lehman, 2014). For instance, President Donald Trump was once regarded as an 'abnormal' politician, meaning that he is not the traditional figure of a presidential candidate (Eastman and Gilder, 2017). Moreover, the United Kingdom's vote to leave the European Union, known as Brexit, also came as a surprise. These turbulent times could trigger people to disclose not only their hopes and fears, but also their thoughts and opinions, in various formats (e.g. via demonstrations, video campaigns, blogs, articles, and social networking sites). Twitter's top ten trending topics of 2016 are a perfect reflection of people's reactions around the globe. Among the top ten topics were Election 2016, Brexit, BlackLivesMatter, and Trump. The popularity of Twitter among micro-blogging services is reflected in the diversity of its users.
Twitter connects people from all strata who share independent user-defined expressions, and agreement or disagreement with certain ideas (e.g. a trending topic). At the time of writing, Twitter consists of more than three hundred million active users, with the total number of tweets exceeding a billion. It has become the perfect example of big data, with overwhelmingly large volume and massive user contribution. As such, it poses an ideal platform for assessing emotional language use between genders.

1.1 Research questions

There are three main research questions for this thesis.

Research Question I What are the dominant emotions that emerge from the dataset? It is interesting to learn whether or not the diverse users of Twitter convey similar emotions in their tweets.

Research Question II How well does applying this detection task to Twitter replicate Plant's findings regarding emotion stereotypes? This raises one's curiosity regarding the generalization of a rather old study to today's world. The differences in setting, and in the emotional regulation of people past and present, triggered the formulation of the second research question.

Research Question III To what degree does emotional word-use vary surrounding the American elections, specifically comparing gender, non-political, and political tweets? Are people's reactions to the result of the US presidential election, as published in news media, reflected on Twitter, considering the fact that the president-elect Donald Trump is an active Twitter user?

To achieve this, this study implements multi-label classifiers inspired by a recent study on word-emotion associations conducted by Bravo-Marquez et al. (2016). They implemented MEKA (WEKA for multi-label classification) (Read et al., 2016) to expand the NRC word-emotion association lexicon, EmoLex (Mohammad and Turney, 2013). In the present study, Python's scikit-multilearn (Szymański, 2017) is applied (rather than MEKA) for this task. The role of gender will be assessed using demographic information contained in the user profile. Since Twitter does not provide such information, manual annotation was conducted by three annotators to obtain users' gender, age, and socio-economic status. Finally, binary classifiers are trained to classify tweets into the politics or non-politics category. The specific date ranges focused on here are exactly two months prior to and after the election date of November 8th, 2016 (i.e. 6 September – 6 November, and 10 November – 10 January). This was done to increase confidence in the validity of the outcome.

The rest of this thesis is organized as follows. Chapter 2 discusses related work on sentiment analysis, analysis of emotions in text, detection of these emotions, and multi-label classification. The annotation procedure for the Twitter data, the methodology for training the emotion detection and political content classifiers, and a brief methodology for the statistics task are described in Chapter 3. Then, Chapter 4 provides the classifiers' performance and statistical analyses regarding Plant's emotion stereotypes, gender differences in emotional word-use, and the emotion shift of politics/non-politics tweets in a political context. Chapter 5 discusses major findings, limitations, and suggestions for further work. Finally, Chapter 6 concludes the thesis by summarizing the findings which answer each research question.

Chapter 2

Related work

Sentiment analysis, as the basis of this thesis' emotion analysis/emotion detection study, will be described in section 2.1, followed by prominent findings in emotion detection research (section 2.2) and a detailed explanation of the implementation of emotion analysis studies. Furthermore, a brief introduction to multi-label classification will be given in section 2.3. Finally, a summary of the present study will be provided in section 2.4.

2.1 Sentiment analysis

Sentiment analysis is one of the applications of text mining, aimed at the identification of "opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics and their attributes" in text (Liu, 2012). Sentiment analysis uses machine learning methods to assess sentiment polarity and intensity in text. The dominant techniques to extract features are lexicon-based techniques: the representation of words or texts as feature vectors, traditionally using Bag of Words, n-grams, and Part-Of-Speech tagging (Araque et al., 2017). To increase the performance of this task, prior knowledge about the sentiment of a word (e.g. positive, negative, or neutral) can be included as a feature; such a list is called a sentiment lexicon (Araque et al., 2017; Medhat et al., 2014; Kiritchenko et al., 2014). Additionally, characteristics of certain languages can be added (Taboada et al., 2011) and linguistic content such as negation can be considered (Polanyi and Zaenen, 2006; Kiritchenko et al., 2014). However, this particular approach only uses manually

engineered syntactic features, thus lacking consistency and reliability (Taboada et al., 2011), and being time-consuming. An alternative approach can be found in representation learning, where the main idea is that feature vector weights can be learned automatically using neural networks, replacing manual coding. One commonly used approach is word2vec (Mikolov et al., 2013a), which is based on the CBOW (Continuous Bag of Words) or skip-gram models. Unlike the traditional techniques, word2vec uses the syntactic and semantic regularities present in language (Mikolov et al., 2013c). It is especially helpful for sentiment analysis to represent semantic relationships in the vector representation, so that similar words (i.e. synonyms or words that frequently co-occur) can be regarded as representing the same sentiment polarity. Conclusively, the vectors automatically generated by word2vec (CBOW or skip-gram) incorporate language properties and might considerably improve the overall performance of a sentiment classifier (cf. Astudillo et al., 2015).
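To make the two feature families concrete, the following is a minimal sketch (not taken from any cited study) of combining a Bag of Words representation with counts from a toy sentiment lexicon; the corpus and lexicon entries are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

docs = ["I love this phone", "I hate the battery", "The screen is great"]
positive = {"love", "great"}   # hypothetical lexicon entries
negative = {"hate"}

# Sparse BOW features: unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bow = vectorizer.fit_transform(docs)

# Prior sentiment knowledge appended as two extra count features.
lexicon_counts = np.array(
    [[sum(w in positive for w in d.lower().split()),
      sum(w in negative for w in d.lower().split())] for d in docs])
X = np.hstack([X_bow.toarray(), lexicon_counts])
print(X.shape)  # (3, number of BOW features + 2)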

2.2 Emotion detection

An extended field, rooted in sentiment analysis, is emotion detection. It has been intensively studied for many purposes, one of which is to observe emotions in online environments. Facebook has shown interest in this field; users of Facebook might have noticed significant updates to the "like" button in past years. Instead of only showing a thumbs-up symbol, Facebook engineers recently added more options to react to a Facebook post. These include a heart-shaped symbol, a smiley face, a sad face, and, in the latest update, the 'thankful' reaction symbolized by a purple flower. Unfortunately, there is no established paper regarding Facebook's recent attempts to detect emotion, other than the mood manipulation experiment to study emotional contagion (Kramer et al., 2014). In this experiment, nearly 700,000 Facebook users were exposed to predominantly positive or negative posts, without their knowledge. A week later, their own postings were analyzed to observe any difference in the users' mood compared to the previous week. The result showed that mood was affected, as users posted more negative or more positive posts depending on the manipulation setting they were in. Additionally, the quantity of words in each post decreased.

A serious ethical issue arose as a consequence of the absence of user consent related to information about the research (i.e. its purpose, explanation of the study outcome post-research, and so on). The legality of the study also became problematic, since Facebook modified or renewed its end-user agreement four months after the research had been conducted (Hill, 2014). Facebook's mood manipulation study, however, has been an inspiration for researchers to conduct studies on similar topics while avoiding the ethically problematic consequences. For example, Ferrara and Yang (2015) collected tweets posted by a Twitter user's followees and measured tweet sentiment using SentiStrength. The purpose was to build the situational context the user was exposed to without direct manipulation of the user. Later, the sentiment of the user's tweets was also measured to observe whether it reflected the sentiment contained in the stimuli. The result was comparable to Facebook's study: a linear relationship was found between the emotional valence (i.e. intensity) of the stimuli and the responses produced by users.

Both studies implied that online environments (i.e. the sentiment contained in social media postings) have an effect on one's emotional state, as measured by the sentiment of one's social media posts. In the framework of the present study, the presidential election in the United States is used as the context to describe emotion shift before and after the event.

Communicating emotion via text can be a challenging task due to the absence of (non)verbal cues, a high chance of misinterpretation, ambiguity, and so on. After reducing noise (i.e. focusing on emotion words only), scientists have attempted to classify emotions; emotions with similar characteristics are grouped together, constituting a type of emotion. In this way, the understanding of emotion is facilitated by distinguishing each type of emotion.

2.2.1 Types of emotions

Emotions can be observed from two broad perspectives: neurological and human felt experience (Damasio et al., 2000). The former is related to the human instinct to avoid dangers and to take advantage of opportunities; Ekman's basic emotions are one such example. Human felt experience captures broader types of emotion due to cultural adaptation and some forms of rules; love is one example. Neurological levels are often distinguished in discrete categories of emotion, while human felt experience can be represented on an emotion continuum defined in a dimensional model. Two fundamental dimensions are valence, which determines the strength of positive or negative, and arousal, which represents the amount of perceived energy (Russell and Barrett, 1999). Discrete models, on the other hand, have been frequently used in NLP tasks. There are three popular theories of emotion which have been researched over the past half century: Ekman's list of basic emotions (Ekman and Friesen, 1971), Plutchik's wheel of emotions (Plutchik, 1980), and Parrott's classification of emotions (Parrott, 2001).

Ekman's basic emotions are the result of research on facial expressions. After showing photographs of different facial expressions, Ekman and Friesen analyzed participants' responses. They concluded that humans share six basic emotions, namely anger, disgust, fear, happiness, sadness, and surprise. This work has been applied to emotion classification studies in various domains, and has proven consistent across studies, despite consisting of a rather small number of emotions. Some of these studies are e.g. Mohammad and Turney (2010), Mohammad and Kiritchenko (2015), Ghazi et al. (2014), Li and Xu (2014), and Chen et al. (2017). Chen et al. (2017) adopted Ekman's emotion categories to classify videos on YouTube using supervised and unsupervised learning. Li and Xu (2014) also adopted these emotions and applied emotion cause extraction methods to carefully select related emotions from a textual dataset taken from the Chinese Twitter, Weibo. The second famous emotion categorization is the wheel of emotions proposed by Robert Plutchik (Plutchik, 1980). He suggested eight bipolar emotions: joy versus sadness, anger versus fear, trust versus disgust, and surprise versus anticipation. The last categorization, which is also the most nuanced, is Parrott's classification of emotions. He proposed a tree-structured emotion classification where each sub-level of the tree consists of more specific and more practical emotions. At the root of the tree are the primary emotions, consisting of love, joy, surprise, anger, sadness, and fear.

Emotions can be considered stereotypic, especially in relation to gender. Burgess and Borgida (1999) stated that the emotions of females and males are generally expected to follow certain standards which are already established in society. Specifically, their emotions are expected to reflect descriptive norms, although these also generate prescriptive norms, that is, "which emotions are seen as appropriate or desirable for whom" (Fischer and LaFrance, 2015).

In the context of the present study, gender was taken into account and we investigate the possibility of gender stereotypes of emotion occurring on Twitter. Among all studies about gender stereotypes of emotion, Plant's study (Plant et al., 2000) was selected, as it sampled a similar demographic of participants to the present study: domiciled in the United States, with an age approximation ranging from 17 to 39 years old.
Given the 17 years since its publication, it was interesting to explore the degree of applicability and consistency of Plant's findings in today's society. Plant examined gender stereotypes of emotion using 19 emotions which represent human emotions comprehensively. Twelve emotions (e.g. anger) were believed to have distinct facial expressions. Two emotions (e.g. distress) were added as they specifically distinguish infants' emotions. The remaining five emotions (e.g. jealousy, sympathy, and love) were included due to their prominence in interpersonal relationships. Her work consists of three different studies. Here, the focus is on Study 1, as it omits the use of any facial expression and only measured participants' views via questionnaires. The two questionnaires that were handed out asked about the frequency of emotions experienced/expressed by females/males. The first was a questionnaire about cultural stereotypes; opinions had to be given based on beliefs in United States culture, so participants were required to be as objective as possible. The second questionnaire, about personal beliefs, asked participants to provide subjective opinions about female/male emotions. Consistent with the common belief that females are more emotionally expressive, the findings of Plant's Study 1 showed a similar pattern, as participants believed that females endorse more of the 19 emotions, namely happiness, fear, love, sadness, and sympathy, while males are associated with anger and pride.

From Ekman's emotion categories to Plant's gender-emotion stereotypes, all contain overlapping emotions resulting from different adoptions of experimental methods. These different perceptions of classifications of emotions provide evidence of how rich emotion is. Applying these methods to observe emotion in text is called emotion analysis.

2.2.2 Emotion analysis

As stated before, emotion analysis and sentiment analysis are closely related tasks. To illustrate: "I am scared" can be recognized as having both negative sentiment and the "fear" category of primary emotions. Given a context, a word or a sentence may convey a different meaning. For instance, Twitter hashtags may enable users to define the context of their tweets and provide an indication of their sentiment or emotion, thus drawing the tweet's content in a certain direction (positive/negative). This can be measured by linguistic features, such as the choice of adjectives, and emotion indications, such as the use of emoticons. Mohammad and Kiritchenko (2015) utilize hashtags in tweets as labels to detect emotion in a supervised learning setting. They argue that people use hashtags to declare their current emotional state. Some examples are #sohappy, #excited, #bored, #irritated, and, extracted from a real tweet, "Have received #Xscape album, regular edition. The gift from friend from US. #Happy. Still waiting for Deluxe arrival!". Using the Archivist (http://archivist.visitmix.com), they searched for tweets containing hashtags according to the six Ekman emotions, namely #anger, #disgust, #fear, #happy, #sadness, and #surprise. After several cleaning steps, such as removing "RT", they collected about 21,000 tweets, forming the hashtag emotion corpus. Since each user indirectly labels their own tweets, the hashtag emotion corpus has high reliability. To prove the consistency of "user annotation", Mohammad and Kiritchenko (2015) created an experiment using domain adaptation techniques, where the hashtag emotion corpus is used to classify emotions in news headlines. The results showed that the self-reported corpus can be applied to different domains. Ghazi et al. (2014) studied word meaning in sentential context, arguing that there is a big difference between the emotion of an independent word (i.e. on its own) and the emotion conveyed by a word in a sentence. Considering both options, his team implemented prior word emotion, that is, the emotion contained in a word on its own, and then took the sentence as the context. The experiment was carried out taking Ekman's six emotions as the emotion labels. They demonstrated the insufficiency of prior emotion and stressed the importance of a sentence's context for automatic discovery of the emotion. Other studies emphasized the benefit of considering emotion as one of the features in sentiment analysis in various domains. In a study about document-emotion association in online news, Li et al. (2016) emphasized that a word can have multiple emotional senses at the semantic level by introducing topic models. In developing sentiment analysis systems using emotional signals, Hu et al. (2013a) proposed two categories to detect emotion: emotion indication and emotion correlation. To represent each category, his team utilized emoticons and sentiment co-occurrence between words or posts. The latter is based on consistency theory, where words or posts that frequently co-occur are assumed to contain similar sentiment polarity. Prior to the experiment, statistical tests were conducted to validate the two proposed

theories (emotion indication and emotion correlation). Specifically, a two-sample one-tail t-test was used to validate the existence of both positive emotion indication and negative emotion indication. Moreover, in verifying emotion correlation, sentiment difference scores were first computed with respect to the sentiment polarity obtained from the MPQA sentiment lexicon, before the two-sample one-tail t-test was conducted. In both experiments the null hypothesis was rejected, which provides convincing evidence of the existence of both proposed theories in social media data.

Emotion analysis can therefore be regarded as an extension of sentiment analysis, exploiting a wider range of perspectives in assessing the opinions, attitudes, and expressions conveyed in text. However, the inclusion of emotion components in feature vectors does not guarantee performance improvement. Vo and Collier (2013) analyzed emotions when an earthquake occurred in Japan. His team revisited the emotion categories of Nakamura and adjusted them so that they are associated with emotional experiences in the aforementioned natural disaster. Highly positive emotions, such as happiness and pleasantness, were removed; the calm emotion was kept. Negative emotions, such as anger and unpleasantness, were grouped together as unpleasantness. Some of the features used to extract emotions included Bag of Words, n-grams, and emoticons. Due to Japan's unique emoticon style, namely Kaomoji, regular expressions were implemented to group emoticons with their associated emotions. They found that emoticons did not play a significant role in this case; simple word n-grams, in contrast, did succeed in classifying emotions. The results also showed that fear and anxiety are highly correlated and frequently occur at the time of an earthquake, while calm emotions appeared more frequently after the earthquake. In the area of Medical Natural Language Processing, Luyckx et al. (2012) studied suicide notes and developed an emotion classification system to automatically annotate each message. Suicide notes convey the collective psychological problems that drive people to early termination of life. During the manual labeling process, each note was labeled with more than one type of emotion, e.g. hopelessness and love. Lexical, context, and lexicon-based features — using emotion-related keywords by Drummond (untraceable source) — were used; however, adding word-emotion features did not improve performance, despite their extensive use in emotion detection studies. In authorship analysis, Alsmearat et al. (2015) used a dataset consisting of Arabic articles to identify the gender of the authors. This study specifically aimed to demonstrate the gender stereotype of emotion, that is, females being more emotional in writing compared to males. Some demographic information about the authors was provided; for instance, all writers share a background in journalism: they are well-educated and originate from the same region (Jordan/Palestine). Sentiment/emotion features were built by adapting terms from NRC's EmoLex (Mohammad and Turney, 2010), which were automatically translated to Arabic using Google Translate. To suppress translation errors (i.e. duplicates, inconsistencies), manual inspection was conducted. The study found no conclusive evidence of gender-emotion stereotypes, and in particular no effect of the emotion features.
The inconsistent results of research in emotion analysis can be explained by two main factors: the source of the data and the domain of study. The former includes data from social networking sites, news articles, personal messages, short text messages, online reviews, and many more. Texts used on Twitter might differ from texts found in online news. However, some research utilizes cross-domain data to justify the universality of a corpus (see Mohammad and Kiritchenko (2015)). Nevertheless, results from a study which obtains data from a specific domain might generalize poorly when applied to another study in a different domain. Likewise, corpora consisting of newspapers have an insignificant chance of reproducing results for Twitter emotion analysis. Therefore, the adoption of available clean data must be conducted in a deliberate manner, with deeper investigation of the nature of the adopted source.
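To make the self-labeling idea of Mohammad and Kiritchenko (2015), discussed above, concrete, the sketch below shows distant labeling via emotion hashtags; the hashtag set follows the six Ekman emotions they searched for, while the example tweet is made up.

import re

EMOTION_TAGS = {"anger", "disgust", "fear", "happy", "sadness", "surprise"}

def hashtag_labels(tweet):
    # Collect all hashtags and keep only those naming an Ekman emotion.
    tags = {t.lower() for t in re.findall(r"#(\w+)", tweet)}
    return sorted(tags & EMOTION_TAGS)

print(hashtag_labels("Got the #Xscape album, regular edition! #Happy"))
# ['happy']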

2.3 Multi-label classification

Many studies in sentiment analysis categorize texts into binary classes: positive and negative. An additional category, neutral, is frequently included to gain a higher accuracy score and to better distinguish positive examples from negative examples (Koppel and Schler, 2006). Another representation of sentiment is fine-grained sentiment, represented as a sentiment score. It helps to differentiate whether a word is strongly or weakly correlated with either sentiment polarity. Medhat et al. (2014) compiled a comprehensive list of recent works in sentiment analysis. However, sentiment classification systems were rather simplified by focusing only on valence measures (positive/negative), either in a coarse or fine-grained setting.

An alternative to binary sentiment which recently became popular and has been extensively studied is multi-labelled prediction. It aims to "predict a set of relevant labels for a new data instance" (Herrera et al., 2016). For example, a paragraph in a film script describing a scene where two actors are arguing in the middle of a party could be labeled both temper and feast. Text categorization is one of the fields where multi-label data often emerges naturally (Rivas et al., 2017). Here, some of the applications will be briefly described. In an online news article study, Zhang et al. (2015) concluded that the different emotions of the readers are more representative than the single emotion of the author. In document classification, Galke et al. (2017) used three corpora consisting of news articles and titles of scientific publications and ended up with nearly 10,000 labels to classify documents using only the title. In multi-label emotion detection, Liu and Chen (2015) analyzed collective emotions contained in micro-blogs using eleven distinct multi-label classifiers and three sentiment dictionaries. Similar to Liu and Chen (2015), given Twitter as the data source, we expect multiple emotions to be detectable in a single tweet. Therefore, a multi-label classification based approach was selected for the current study.

The proposed methods to solve multi-label classification, described by Tsoumakas and Katakis (2006), consist of two main approaches: problem transformation and algorithm adaptation. Problem transformation deals with transforming multi-label data into multiple binary classes, while algorithm (or method) adaptation directly handles multi-label data using an existing classification algorithm. In the present study, multi-label classifiers regarded as problem transformation methods were used: Binary Relevance (BR) and Classifier Chain (CC). A description of both multi-label classifiers can be found in chapter 3.
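As a minimal illustration of the data representation that problem transformation methods operate on, the label set of each instance can be encoded as a row of a binary indicator matrix (the label sets below are made up); Binary Relevance then fits one binary classifier per column.

from sklearn.preprocessing import MultiLabelBinarizer

label_sets = [{"joy", "trust"}, {"anger", "fear", "sadness"}, {"joy"}]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)
print(mlb.classes_)  # ['anger' 'fear' 'joy' 'sadness' 'trust']
print(Y)
# [[0 0 1 0 1]
#  [1 1 0 1 0]
#  [0 0 1 0 0]]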

2.4 Description of present study

The current study implements multi-label classification to predict a set of emotion labels for a given tweet. This method is adopted from Bravo-Marquez et al. (2016), who used EmoLex to first label Twitter data from the Edinburgh corpus and then expand EmoLex using MEKA classifiers (e.g. Binary Relevance and Classifier Chain). EmoLex (Mohammad and Turney, 2013) is an available corpus of emotion words labeled with multiple emotions. EmoLex consists of eight basic emotions: anticipation, anger, disgust, fear, joy, sadness, surprise, and trust, and two sentiments: negative and positive. EmoLex was built with the assistance of human annotators on Amazon's online service Mechanical Turk. Lists of words were generated based on the Macquarie Thesaurus, the Google n-gram corpus, the General Inquirer, and the WordNet Affect Lexicon. The comprehensive mixture of these corpora made EmoLex the most complete emotion corpus, comprising idioms and compound phrases (El-Beltagy, 2016).

Besides multi-label classification, the current study also implements binary classifiers to predict the (non-)politics label on the Twitter data set, given the Political Tweet Corpus compiled by Marchetti-Bowick and Chambers (2012) as the training data set. The Political Tweet Corpus consists of two main corpora: the first consists of randomly selected tweets, and the second is also random but contains at least one political keyword. More details about the corpus can be found in chapter 3. Politics was selected based on the motivations described in detail in chapter 1, and to observe any emotion shift before and after the US presidential election date.

The current research has two aims, consisting of a machine learning part and a statistical analysis part. The former adapts the approach of Bravo-Marquez et al. (2016) with some modifications: the choice of feature extraction method, classifier selection, and the utilization of the skmultilearn package of Python. In addition, binary classifiers are implemented to predict (non-)politics-labelled tweets. The second aim, the statistical analysis, is to assess the relation between gender and emotion types, and to aid in providing answers to all three research questions. A brief note on Plant's Study 1 in relation to the experiment conducted to answer the second research question: happiness was replaced by joy due to the restricted set of emotion labels (EmoLex). Although some domains of study regard happiness and joy as two distinct concepts (Cottrell, 2016), happiness and joy are often used interchangeably in most contexts (Kövecses, 1991; Gillham and Seligman, 1999; Ross, 2017).

Chapter 3

Experimental Setup

This chapter describes the technical preparation prior to data mining and data analysis. It is structured as follows: Section 3.1 contains information regarding the software used. Section 3.2 contains the annotation procedure and analysis. Section 3.3 encompasses the major filtering methods supporting both the data mining and data analysis tasks. Finally, Sections 3.4 and 3.5 cover the detailed setup of each task.

3.1 Project Instruments

The current project utilized two open-source programming languages: Python 3.5.2 and R 1.0.44. Each programming language has its own capability to accommodate the two main tasks of this project. Table 3.1 shows the list of R packages, along with their authors, which aided in the exploratory data analysis and the non-parametric tests. For the machine learning part, Python was utilized. The main Python libraries used in this study are skmultilearn, sklearn, gensim, pandas, and numpy. Both programming languages were used side-by-side: for instance, R to run the Wilcoxon rank-sum test and the Wilcoxon signed-rank test, and Python to build word embeddings, train classifiers, and run the best classifier on the test data.

Data for this project was imported from Twitter using the Twitter API. To store the data, mongoDB (version 3.4.4) was used as the database. MongoDB is a NoSQL database, based on documents instead of relations like traditional databases (Chodorow, 2013). It is suitable for storing large numbers of documents or textual data. Regarding accessibility, no


Table 3.1: R packages

No  Package     Author                                             Year
1   dplyr       Hadley Wickham and Romain Francois                 2016
2   ggplot2     Hadley Wickham                                     2009
3   irr         Matthias Gamer, Jim Lemon, Ian Fellows,            2012
                and Puspendra Singh
4   reshape2    Hadley Wickham                                     2007
5   gridExtra   Baptiste Auguie and Anton Antonov                  2016
6   xtable      David B. Dahl                                      2016
7   data.table  Matt Dowle, Arun Srinivasan, Jan Gorecki,          2017
                Tom Short, Steve Lianoglou, and Eduard Antonyan

premade keys or values are necessary; it offers flexibility to connect among documents.
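As an illustration of this document-oriented storage, below is a minimal pymongo sketch, assuming a local mongod instance; the database and collection names are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
tweets = client["thesis"]["tweets"]   # database and collection

# Documents are schema-free, JSON-like dicts, so a raw Twitter API payload
# can be stored as-is and queried later by any field.
tweets.insert_one({"user": "example_user", "text": "example tweet",
                   "created_at": "2016-11-08"})
for doc in tweets.find({"user": "example_user"}):
    print(doc["text"])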

3.2 Data Annotation

3.2.1 Annotation procedure

Three annotators were given the task of annotating each Twitter user with several of their attributes. This was done individually and in the annotators' own time. The annotation period lasted from March to May 2017, and each annotator was required to inspect more than 6,000 user profiles. Each annotator was required to fully understand and grasp the instructions for this task by reading the Guidelines in the menu (see Figure 3.1). On the annotation page (Figure 3.2), there are six columns to be filled: bot (true/false), face (true/false), gender (f: female, m: male, o: trans-gender, -: no gender for bots), age, socioe (l: poor, w: working class, lm: lower middle, um: upper middle), and signal (name, handle, description, image, and link). Bot and face were displayed as check-boxes, where annotators could indicate whether a user profile was run by an individual person or by a bot from e.g. a marketing generator, company, or organization.

Figure 3.1: Homepage of annotation

Figure 3.2: Annotation work page

The other four columns were displayed as text-boxes which could be filled with numbers and characters, as previously described. On the annotation page, several tools could be utilized to check the background information of a user in order to confidently predict the user's gender, age, and socio-economic status. These tools were the user image, a URL linking to the user's profile on the Twitter web page, external links (provided by the user on their Twitter profile), the collection of statuses or tweets, and a collection of images.

3.2.2 Annotation analysis

Fleiss' kappa was implemented to measure annotator agreement. Fleiss' kappa (Fleiss, 1971) is a generalization of Cohen's kappa, designed to calculate the agreement score between more than two annotators. It computes the observed agreement minus the expected agreement, divided by 1 minus the expected agreement, where the expected agreement is the proportion of agreement attributable to chance given the frequency of each category. It is worth noting that similarity of the annotators' predictions would assist in shaping future assumptions regarding the outcome and would provide a substantial basis for interpretation (Bobicev et al., 2012; Miller et al., 2017). The R irr package was used to obtain the agreement scores.

In general, inter-rater agreement was low: k = -0.0674 (z = -12.4; p = 0) for gender, k = -0.0909 (z = -0.662; p = 0.508) for age, and k = -0.198 (z = -41.7; p = 0) for socio-economic status. After closely observing the annotators' data, the annotations from the second annotator revealed inconsistency and high disagreement. After removing the second annotator's column, the agreement score was recalculated: for gender, k = 0.8 (z = 51.2; p = 0). For numerical data, the Single Score Intraclass Correlation function was utilized. The ICC score essentially measures agreement on numerical data or ratings. The ICC reliability score of age identification increased significantly, ICC = 0.701, p < 0.01. Figure 3.3 visualizes the agreement of the two annotators, with a mean age of 25.13 and 25.27 for annotator 1 and annotator 3, respectively. It seems that the majority of Twitter users were people aged from 23 to 29 years old. The second largest group was 19 to 22 years old, and the third largest group was 30 to 65 years old. The Fleiss' kappa score for the three annotators on socio-economic status was very imbalanced. Further inspection showed that only one annotator provided a full annotation. Thus, this attribute was ignored and removed entirely from the data.

Figure 3.4 shows the sources or "signals" that provided information for the demographic identification task. On average, the user image (mean = 3068, sd = 813.17) or profile picture was regarded as the most informative signal, followed by tweets (mean = 1647, sd = 1435.43) and username (mean = 1475, sd = 1296.83), while description (mean = 957, sd = 714.18) and link (mean = 114, sd = 158.39) were the least informative.
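In symbols, Fleiss' kappa is κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean observed agreement over items and P̄e the agreement expected by chance. For reference, the computation described at the start of this subsection can also be reproduced in Python with statsmodels (the thesis itself used the R irr package); the ratings below are hypothetical.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are users, columns are annotators; values are category codes
# (e.g. 0 = female, 1 = male).
ratings = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [0, 0, 0],
                    [1, 0, 1]])
table, _ = aggregate_raters(ratings)   # items x categories count table
print(fleiss_kappa(table))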

3.3 Data preprocessing

Data preprocessing is an integral part of data mining. The purpose of data preprocessing is to produce data in an understandable format and to enhance data reliability (Rokach and Maimon, 2010a). This process often coincides with data cleaning. Data preprocessing requires approximately 60% of the time and effort allocated to the entire data mining project (Pyle, 1999). Essentially, preprocessing raw data should produce a higher-quality input for the next stage (i.e. data transformation).

Figure 3.3: Boxplot of age annotation of two annotators with highly similar mean scores.

Additionally, having high-quality input data puts some confidence in the transformation phase and classifier performance, in that no "garbage" is being processed. Thus, the final result can be interpreted in a concise and coherent manner.

3.3.1 Data cleaning

The purpose of data cleaning is to reduce noisy variables, which was done by lowercasing all words, and removing stop words and URLs. Specific to Twitter, hashtag (#), mention (@), and RT markers were also removed. The words or sentences following those symbols were kept, because it was assumed that a user who retweets a Twitter post might conform to the ideas contained in it. Regarding punctuation marks, characters which indicate the end of a sentence were kept; these include question marks, exclamation marks, commas, and periods. Lin et al. (2016) and Amunategui et al. (2015) found that retaining punctuation marks during preprocessing resulted in an overall improvement in performance. They argued that word transformation tools (e.g. word2vec) can better learn sentence components in the presence of these marks.

Figure 3.4: Barplot of signals.

One of the properties usually removed in text mining tasks is stop words. Depending on the purpose of a task, stop words might be regarded differently (see Saif et al., 2014). Many regard stop words as low-quality features in vector representations (Zhang et al., 2012; Gokulakrishnan et al., 2012), but some authors consider stop words useful, especially in Twitter-like data (Saif et al., 2012; Hu et al., 2013b). In the context of the present study, stop words consisted of prepositions, articles, and pronouns. After experimentation, it was found that stop words decreased the overall classifier result. Therefore, they were removed.
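Below is a minimal sketch of the cleaning rules described above (lowercasing, dropping URLs and the #, @, RT markers while keeping the words that follow them, removing stop words, and retaining sentence-final punctuation); the stop word list is illustrative, not the one used in the thesis.

import re

STOP_WORDS = {"the", "a", "an", "in", "on", "i", "you", "he", "she", "it"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"\brt\b", "", text)          # remove the RT marker
    text = re.sub(r"[#@]", "", text)            # drop symbol, keep the word
    text = re.sub(r"[^\w\s.,!?]", "", text)     # keep only . , ! ? marks
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(clean_tweet("RT @user: I love this! http://t.co/xyz #excited"))
# 'user love this! excited'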

3.3.2 Data reduction and filtering

The purpose of data reduction is to obtain the most informative data while simultaneously decreasing the size of the data. As the data become more specific, the probability of obtaining a reliable answer to the research questions potentially increases. To achieve this, filtering methods were applied on location and language use. Users who listed their location outside of the United States or who used a language other than English were removed. Location and language use can be observed through a user's Twitter profile; some users may not explicitly mention US or United States, but might instead give the name of a state or its abbreviation (e.g. HI, Hawaii; AZ, Arizona) — these were retained. The total of 50 states was put into a states vector, which served as a filter.

The numbers of followers and accounts followed were also taken into account. An extreme number might indicate bots, inactive accounts, celebrities, online news, company accounts, and so on. To avoid this, users with fewer than 50 or more than 2,000 followers, and those following fewer than 50 accounts, were removed. Also, users with fewer than 10 or more than 800,000 total statuses were removed. In this way, celebrities, online newspapers, and probably bots were ruled out of the data.

The data was further reduced by filtering on the creation date of tweets, starting from the beginning of 2016. Further observation revealed that the data was created as early as 2008 (which comprised 16 tweets), and increased by 6,150% in the following year. The same pattern occurred in 2012-2013 with an increase of 122.7%, and in 2015-2016 with an increase of 364.5%. To ensure stability and to rule out the possibility of noisy samples, the data from before the year 2016 was excluded.
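A sketch of the user-level filters described above, assuming the profiles sit in a pandas DataFrame; the column names and the shortened state list are hypothetical.

import pandas as pd

users = pd.DataFrame({
    "followers": [10, 300, 5000, 900],
    "following": [20, 150, 800, 60],
    "statuses":  [5, 4000, 900000, 12000],
    "location":  ["Paris", "Austin, TX", "NYC", "HI"],
})

US_STATES = {"TX", "NY", "HI", "AZ"}   # abbreviated; the thesis used all 50

keep = (users["followers"].between(50, 2000)
        & (users["following"] >= 50)
        & users["statuses"].between(10, 800000)
        & users["location"].str.contains("|".join(US_STATES)))
print(users[keep])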

3.3.3 Data labeling

Data labeling is divided between labeling tweets with emotions (multi-label) and with political content (binary). Both labelings were implemented using a machine learning approach. A thorough explanation of classifier selection, parameter settings, and evaluation metrics is given in section 3.4.

3.4 Task A: Classification of multi-label emotions and political tweets

Supervised classification is one of the methods to train a learning algorithm. It is called supervised due to the provision of an established relation between an instance and a target variable (Rokach and Maimon, 2010b). An instance is a single datum representing a set of attributes. A target variable can take many forms. In a regression problem, the target variable is a continuous numerical representation, while in a classification problem, the target variable might be an ordinal (e.g. lower class, working class, middle class, and upper class for socio-economic status) or nominal (e.g. iris-setosa, iris-virginica, iris-versicolor in the iris data) category.

The goal of supervised classification is to train classifiers by uncovering relations between instances and the corresponding target variables. Prior to the learning process, instances are labeled according to some set of predefined rules. This labeling process might take various forms, such as manual labeling, automatic labeling, or a combination of the two. Some authors hired novice annotators or crowd-sourced annotators using an online environment to manually label tweets with a set of emotion types (Summa et al., 2016; Novak et al., 2015; Hu et al., 2013b). Oliveira et al. (2014) used inherently manually-labeled data, since his study imported data from StockTwits (http://stocktwits.com), where users of the website can classify their own messages into one of two binary classes (i.e. bullish or bearish). Mehrotra et al. (2013) used an automatic approach to label data using pointwise mutual information, and Zhai et al. (2011) used a hybrid approach based on an expectation-maximization algorithm with Naïve Bayes. The sections below describe the experimental setup for the classification of tweets into multi-label emotions and into politics/non-politics classes.

3.4.1 Feature extraction

The purpose of feature extraction is to obtain properties of a text. These properties must represent the original idea of the text in order to obtain reliable results. In this section, two feature extraction methods, namely Bag of Words and word embeddings, will be discussed.

Bag of Words and Word embeddings

Natural Language Processing tasks require a language representation that can be computationally executed. A common way is to convert an array of strings into a numerical representation. The most frequently used implementation of feature extraction is to transform strings of words into sparse or dense vector representations. The former is used in a technique called Bag of Words (BOW), while the latter holds for word embeddings. The term "Bag of Words" originally came from Harris (1954). BOW is applied to "represent words as indices in a vocabulary" (Maas et al., 2011). The context of a word and the order of words are disregarded in this approach; additionally, BOW only counts word frequency. Rich linguistic relations among words need to be retained so that the actual meaning of a word given a context remains intact. Word embeddings are an alternative solution to this particular disadvantage of BOW. In general, word embeddings capture more linguistic features than BOW. Characteristics of language which provide word meaning,

syntactic and semantic properties, are encoded in the embedding space. There are several implementations of word embeddings, one of which is word2vec. word2vec, introduced by Mikolov et al. (2013b), is an encoding mechanism which consists of two different learning algorithms: the Continuous Bag of Words (CBOW) algorithm and the Skip-gram algorithm. The objective of the CBOW algorithm is to predict a word at a certain position i given the context within a window i ± d, where d represents the size of the window. The skip-gram algorithm, on the other hand, is used to predict a set of context words given a single word. As input, word2vec takes a one-hot representation of a word: assigning 1 at the position of the word being processed and 0 everywhere else. Word embeddings are able to "learn complex features extracted from data with minimum external contribution" (Araque et al., 2017). They require little to no manually annotated data. However, in order to work effectively, word embeddings/word2vec should be trained on large textual corpora.

The use of slang and abbreviations is very common in tweets, as a result of external factors, such as cultural influence, and internal factors, such as limited space and a restricted number of characters. These factors make working with Twitter data a very challenging task, especially with respect to text analysis. word2vec eases these difficulties; not only does it omit the process of manual labeling, it also preserves the language properties of the words in the transformed vector model.

In the present study, unigrams were used to represent the tweets, which were then converted to a dense vector representation using word2vec. Specifically, the skip-gram algorithm was selected to predict context words given a target word, similar to the method used in the Bravo-Marquez et al. (2016) study.

3.4.2 Implementation: Multi-label emotions

The NRC-Emotion-Lexicon-Wordlevel version 0.92 was utilized for this study. The same corpus was also used by other authors in their research (El-Beltagy, 2016; Bravo-Marquez et al., 2016). The corpus consists of three columns: the first column is the list of words, the second column is the list of emotion labels, and the third is a one/zero notation indicating whether a word correlates with a certain emotion label. Algorithm 1 was used to label each instance after preprocessing.

Algorithm 1 Emotion labeling
 1: procedure EmotionLabeling(X)
 2:   LabeledData ← empty list
 3:   for each tweet x in X do
 4:     z ← empty list
 5:     for each word w in x do
 6:       if w is in EmoLex then
 7:         add the EmoLex labels of w to z
 8:       else
 9:         continue
10:     z ← set(z)                      ▷ remove duplicate labels
11:     append z, converted to a list, to LabeledData
12:   return LabeledData
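A Python sketch of Algorithm 1, assuming EmoLex has been loaded as a dictionary mapping each word to the set of emotion labels flagged with 1 in the third column; the lexicon entries below are illustrative.

EMOLEX = {
    "happy": {"joy", "anticipation"},   # illustrative entries
    "scared": {"fear"},
    "great": {"joy", "trust"},
}

def emotion_labeling(tweets):
    labeled_data = []
    for tweet in tweets:
        labels = set()                   # set() removes duplicate labels
        for word in tweet.split():
            labels |= EMOLEX.get(word, set())
        labeled_data.append(sorted(labels))
    return labeled_data

print(emotion_labeling(["so happy and scared", "great day"]))
# [['anticipation', 'fear', 'joy'], ['joy', 'trust']]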

The algorithm iterates over each word in each instance and checks whether it matches a word in EmoLex. If so, the correlation of the word with the corresponding emotion label is checked via the zero/one notation in the third column. If the word is not correlated (i.e. 0), the label is ignored; otherwise, the emotion label is saved to a list. Due to the unpredictable and inconsistent nature of tweets, this list might contain duplicate emotion labels. To handle this issue, the set() function was utilized to remove redundant labels. Finally, the labels per tweet were saved and the list of labels was returned.

All instances were used to train and build a vocabulary in word2vec to provide rich semantic and syntactic relations. In an iterative process, the window and size parameters were set to [2, 5, 10] and [100, 400, 1000], respectively, to find the best combination of word2vec representations. The results, with Binary Relevance as the baseline, showed that a window of 10 and a size of 400 provided the best result on the validation set.
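A sketch of the embedding step under the settings reported above (skip-gram, window 10, 400 dimensions) using gensim; the corpus here is illustrative, and averaging word vectors per tweet is one plausible way to obtain tweet-level features, not necessarily the thesis' exact procedure.

import numpy as np
from gensim.models import Word2Vec

tokenized_tweets = [["so", "happy", "today"], ["scared", "of", "storms"]]

# sg=1 selects the skip-gram algorithm (older gensim used size= instead of
# vector_size=).
model = Word2Vec(tokenized_tweets, sg=1, window=10, vector_size=400,
                 min_count=1)

def tweet_vector(tokens):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([tweet_vector(t) for t in tokenized_tweets])
print(X.shape)  # (2, 400)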

Classifiers selection

Two multi-label classifiers from the skmultilearn library were trained. Binary Relevance (BR) was selected as the baseline due to its ability to handle irregularity, and because it is the most commonly implemented method in multi-label classification tasks (Godbole and Sarawagi, 2004; Read et al., 2011). Given a set of labels, BR remodels the multi-label classifier into binary classifiers by assigning a positive value if the label exists, and a negative value otherwise (Tsoumakas et al., 2009). BR

Table 3.3: Example tweets and (non)politics labels from the Political Tweet Corpus

Label  Tweet
POLIT  @mystic23 I also think that most liberals don’t spend a lot of time thinking about tolerating. Tolerance connotes condescension.
NOT    So... was that my invite to whoop ur ass? Sounded like it. RT: @therealPRYSLEZZ: @gylliwilli it made nox & I just go out & buy rockband 2.

assumes label independence, meaning that the existence of a label is not conditional on the existence of another label. Classifier Chain (CC), on the other hand, assumes label interdependencies. As the name implies, Classifier Chain extends the model by including label relevance from previous classifiers, thus passing label information between classifiers (Read et al., 2011). In this study, both classifiers were implemented with a support vector machine with a linear kernel.
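A sketch of both problem transformation classifiers with a linear SVM as the base learner, mirroring the setup described above; X and Y stand for the tweet vectors and the binary label matrix from the previous steps, faked here with random data.

import numpy as np
from sklearn.svm import SVC
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain

X = np.random.rand(100, 400)                  # placeholder tweet vectors
Y = np.random.randint(0, 2, size=(100, 8))    # 8 EmoLex emotion labels

br = BinaryRelevance(classifier=SVC(kernel="linear"),
                     require_dense=[True, True])
cc = ClassifierChain(classifier=SVC(kernel="linear"),
                     require_dense=[True, True])
br.fit(X, Y)
cc.fit(X, Y)
print(br.predict(X).shape)   # sparse (100, 8) indicator matrix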

Evaluation metrics

To evaluate classifier performance on the multi-label emotion task, the F1 score averaged over samples was used. The F1 score is the harmonic mean of precision and recall; the harmonic mean is the appropriate average for ratios or percentages, rather than the arithmetic mean.
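In symbols (the standard definition):

F1 = 2 · (precision · recall) / (precision + recall)

With the samples average, the F1 score is computed per instance over its true and predicted label sets and then averaged over all instances; in sklearn this corresponds to f1_score(Y_true, Y_pred, average="samples").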

3.4.3 Implementation: Politics Label

The situational context used to frame gender-emotion relations is politics, in line with the third research question. Here, the focus is not on political affiliation per user; instead, it is directed towards political content in tweets. To obtain political labels, the corpus of political tweets by Marchetti-Bowick and Chambers (2012), along with the associated labels (i.e. POLIT and NOT), was utilized as the training data. There are two corpora: the first is randomly selected from Twitter's spritzer feed, and the second is randomly selected from subsets of tweets containing at least one political keyword. Each corpus contains 2,000 tweets. Table 3.3 displays example tweets and their corresponding labels.

word2vec was applied for the feature vector representation. A pretest was conducted with each corpus separately as the training data. The result was unsatisfactory, as further observation revealed a class imbalance problem. The first corpus produced 'NOT' for nearly all tweets, meaning most tweets had non-political content; the ratio was ten to one. The second corpus similarly produced a class imbalance problem, but with the 'POLIT' label dominating. Thus, it was decided to combine both corpora as the training-development set. A grid search with 10-fold cross-validation was conducted for hyperparameter tuning.
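A sketch of this tuning step; the parameter grid below is illustrative, while the values eventually reported in chapter 4 are C = 0.1 and tol = 0.0001.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(200, 400)        # placeholder tweet vectors
y = np.random.randint(0, 2, 200)    # 0 = NOT, 1 = POLIT

grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10],
                                "tol": [1e-4, 1e-3]},
                    cv=10, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)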

Classifiers selection

For politics/non-politics classification, a support vector machine (SVM) was implemented due to its proven ability in classification tasks. Specifically, data that are linearly separable and have a small number of classes are well suited to an SVM with a linear kernel (Ben-Hur and Weston, 2010; Zhang and Zhang, 2010). Zhang and Zhang (2010) predicted the author's gender of blog posts, a binary class (i.e. male/female), by implementing a linear-kernel SVM, which achieved the best result of all classifiers tested (i.e. Naïve Bayes, SVM with rbf kernel, SVM with polynomial kernel). As a comparison, Naïve Bayes (GaussianNB from sklearn) was used as a baseline.

Evaluation metrics

Evaluation of the binary politics/non-politics classifiers was done using the accuracy score.

3.5 Task B: Statistical Analysis of gender and emotion labels

Tools for the analysis included R packages such as dplyr, reshape2, ggplot2, and data.table. The Wilcoxon rank-sum test was applied to measure the significance of gender-emotion differences, and the Wilcoxon signed-rank test was conducted to test differences in emotions before and after November 2016 for several groups: (1) a general comparison of emotions, (2) gender (female-female and male-male), and (3) political message (non politics-non politics and politics-politics).
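Although the analysis was carried out in R, both tests are also available in Python's scipy.stats; an illustrative sketch with placeholder data:

```python
from scipy.stats import ranksums, wilcoxon

# daily emotion frequencies; placeholder values for illustration
female = [47, 52, 31, 80, 44]
male = [28, 30, 25, 41, 22]
stat, p = ranksums(female, male)      # rank-sum: independent groups (gender)

before = [9, 12, 7, 10, 8]            # paired daily totals per period
after = [12, 15, 9, 14, 11]
stat2, p2 = wilcoxon(before, after)   # signed-rank: paired before/after
```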

Chapter 4

Results

This chapter commences with a description of the machine learning results. In Section 4.2, the top five emotion label sets that emerged from the dataset are discussed. Section 4.3 discusses results from the application of Plant's findings. In Section 4.4, the discussion turns to the emotion shift before and after a specific political event: the United States presidential election of 2016.

4.1 Classifier performance

In predicting the non-politics and politics labels for the hold-out dataset (i.e. the tweets collected for this study), Naive Bayes (i.e. GaussianNB from sklearn) was implemented as the baseline. Results from the training and development phase with 10-fold cross-validation showed an average accuracy score of 0.65.

Figure 4.1: Confusion matrix of (non)politics tweets

An SVM with a linear kernel (C = 0.1, tol = 0.0001) achieved a higher average accuracy score than the baseline: 0.81 on 10-fold cross-validation. When applying the linear-kernel


Table 4.1: Precision, recall, and F1 score of the linear-kernel SVM for politics and non-politics tweets

Label        Precision  Recall  F1 score  Support
NON          0.85       0.87    0.86      763
POLIT        0.81       0.79    0.80      559
avg / total  0.83       0.84    0.83      1322

SVM to the test set, the accuracy score increased to 0.83. Table 4.1 displays precision and recall for the POLIT and NON labels; the F1 score is higher for the NON label.

In the training and development phase, the samples-averaged F1 score (a special case for multi-label data) of Binary Relevance and Classifier Chain was 0.822 and 0.807, respectively. Table 4.2 displays the performance of both multi-label classifiers. Testing classifier performance on unseen data, the test dataset, yielded slightly lower F1 scores for Binary Relevance (0.809) and Classifier Chain (0.795). The only slightly lower performance scores of both multi-label classifiers (see Table 4.2) suggest that the training-validation dataset and the test dataset were quite similar, meaning any variation contained in the data was scattered evenly between both datasets. An alternative explanation is that the data were dominated by a specific set of emotion labels, which was further supported by findings from the statistical analysis (i.e. the non-linearity of the scatterplot, cf. Figure 4.2).

4.2 Dominant emotion labels

Figure 4.2 depicts the total emotions contained in tweets over time, with gender as a control variable. Total emotions were obtained by summing emotion labels per day per gender. For instance, suppose a user identified as female posted a tweet labeled [joy, positive, surprise] on a particular day, and a different user with the same gender label posted a tweet with the same emotion labels on the same day. Both events would be merged into a single row with a total emotion count of 2. If another female user posted a tweet on the same day but with different emotion labels (e.g. [negative, fear, sadness]), it would be inserted as a new row; likewise, a tweet posted on a different date would start a new row. The most dominant multi-label emotion was [anticipation, joy, positive, surprise, trust] with

Table 4.2: Precision and recall of BR (top) and CC (bottom) for emotion labels

Binary Relevance
label        precision  recall  F1 score  support
0            0.81       0.70    0.75      101250
1            0.84       0.82    0.83      140288
2            0.83       0.63    0.72      79514
3            0.81       0.72    0.76      103594
4            0.88       0.86    0.87      147838
5            0.85       0.85    0.85      148870
6            0.88       0.92    0.90      188794
7            0.82       0.69    0.75      97919
8            0.85       0.68    0.76      87973
9            0.84       0.84    0.84      150566
avg / total  0.85       0.80    0.82      1246606
F1 average samples: 0.809    Hamming loss: 0.16

Classifier Chain
label        precision  recall  F1 score  support
0            0.81       0.70    0.75      101250
1            0.83       0.83    0.83      140288
2            0.78       0.66    0.71      79514
3            0.80       0.69    0.74      103594
4            0.84       0.87    0.86      147838
5            0.89       0.75    0.81      148870
6            0.90       0.87    0.89      188794
7            0.78       0.69    0.73      97919
8            0.81       0.71    0.75      87973
9            0.81       0.85    0.83      150566
avg / total  0.83       0.78    0.81      1246606
F1 average samples: 0.795    Hamming loss: 0.17
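The Hamming loss reported beneath each block of Table 4.2 is the standard multi-label error measure: the fraction of individual label decisions, over N instances and L = 10 labels, that disagree with the gold standard:

```latex
\text{Hamming loss} = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L}
  \mathbb{1}\!\left[ y_{ij} \neq \hat{y}_{ij} \right]
```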

57,707 total emotions. The square brackets indicate that these emotions co-occurred simultaneously. The second most dominant group of emotion labels was similar to the former but without the surprise element, [anticipation, joy, positive, trust], with 50,384 total emotions. The third was [anger, disgust, fear, negative, sadness] with 37,862 total emotions, followed by [joy, positive, trust] (33,760 total emotions) and [anger, disgust, negative] (18,481 total emotions) in fourth and fifth place, respectively. There were 717 levels of multi-label emotions, with frequencies per level ranging from 1 to 57,707.
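The daily counting procedure described at the start of this section amounts to a group-by over date, gender, and label set. A sketch with pandas (column names and data are illustrative, not the original pipeline):

```python
import pandas as pd

# one row per tweet; 'labels' holds the predicted emotion set as a tuple
df = pd.DataFrame({
    'date': ['2016-03-01', '2016-03-01', '2016-03-01'],
    'gender': ['F', 'F', 'F'],
    'labels': [('joy', 'positive', 'surprise'),
               ('joy', 'positive', 'surprise'),
               ('negative', 'fear', 'sadness')],
})

# identical (date, gender, label set) rows collapse into one count
totals = (df.groupby(['date', 'gender', 'labels'])
            .size().reset_index(name='total_emotions'))
print(totals)
```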

Figure 4.2: Scatter plot of the most dominant multi-label emotion over time controlled by gender

4.2.1 Wilcoxon rank-sum test: gender and the positive-related emotions

The positive-related emotions turned out to be the most frequent emotion co-occurrence among all observed combinations of emotion labels. Following the project's second research question, gender difference was tested on this set of emotions. The Wilcoxon rank-sum test was applied because the two groups consisted of different participants (female and male) and the distribution of the data violated the normality assumption. Table 4.3 shows skew and kurtosis scores that deviate far from zero. Figure 4.3 depicts a positively skewed distribution; it also illustrates that emotion frequencies clustered around the low end of the scale. The Wilcoxon rank-sum test showed that tweets from females (Mdn = 47) contained significantly more positive-related emotions than tweets from males (Mdn = 28), W = 120760, p < .001, but with a small effect size, r = −0.254.
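The effect size r reported here and in the remainder of this chapter is presumably the usual conversion for rank-based tests (an assumption, since the formula is not restated in the text): the standardized test statistic Z divided by the square root of the total number of observations:

```latex
r = \frac{Z}{\sqrt{N}}
```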

Table 4.3: Descriptive statistics of gender in positive-related emotions to inspect the normality assumption

gender  n    mean   sd     median  min  max  range  skew  kurtosis  se
F       432  73.84  71.10  47      11   392  381    2.05  4.02      3.42
M       432  59.74  73.72  28      4    393  389    2.20  4.35      3.55

Figure 4.3: Histogram distribution of positive-related emotion

Figure 4.4: Scatter plot of the set of gender-emotion stereotype (left = female-emotion stereotype; right = male-emotion stereotype) over time controlled by gender

4.3 Emotion theory

Testing the relation between gender and emotion stereotypes was done in two ways: firstly, in a collective manner, where the multi-label emotions were regarded as one set of emotions, and secondly, by testing the gender effect for each emotion separately.

Figure 4.5: Scatter plot of the frequencies over time, controlled by gender, of joy (top-left), fear (top-middle), sadness (top-right), anger (bottom-left), positive (bottom-middle), and negative (bottom-right)

4.3.1 Female-emotion stereotypes

Plant et al.'s (2000) study of gender-emotion stereotypes, especially Study 1, found that in the United States females expressed and experienced happiness, fear, love, sadness, and sympathy more than males. For this experiment, happiness and love were replaced by joy, while sympathy was omitted. The reasons behind this substitution are described in Chapter 2.

Table 4.4: Descriptive statistics of gender in the female-emotion stereotype to inspect the normality assumption

gender  n    mean  sd    median  min  max  range  skew  kurtosis  se
F       213  1.93  1.34  1       1    8    7      1.81  3.62      0.09
M       139  1.45  0.81  1       1    6    5      2.37  7.35      0.07

Table 4.4 indicates that more females (n = 213) seem to fit the female-emotion stereotype than males (n = 139), even though the amount of expression (mean and median) was roughly the same. The Wilcoxon rank-sum test was again conducted due to the deviation from normality. Joy, fear, and sadness in females (Mdn = 1) differed significantly from males (Mdn = 1), W = 17622, p < .001, with a very small effect size, r = −0.182.

4.3.2 Male-emotion stereotypes

Regarding males, the results from Plant's study revealed that males were associated more with anger and pride. Here, the set [anger, joy, positive, negative] was tested, replacing pride.

Table 4.5: Descriptive statistics of gender in the male-emotion stereotype to inspect the normality assumption

gender  n    mean  sd    median  min  max  range  skew  kurtosis  se
F       207  2.08  1.45  2       1    9    8      1.68  3.11      0.10
M       157  1.70  1.31  1       1    7    6      2.30  5.23      0.10

The number of female users was on average higher than that of male users for all emotion labels. In Table 4.5, however, the number of male users (n = 157) in the male-emotion stereotype was higher than the number of male users (n = 139) in the female-emotion stereotype (see Table 4.4). The same held for female users, who were more numerous in the female-emotion stereotype and less numerous in the male-emotion stereotype.

The test of gender differences in the male-emotion stereotype using the Wilcoxon rank-sum test showed a significant result. Tweets containing anger, joy, positive, and negative simultaneously were significantly more frequent for females (Mdn = 2) than males (Mdn = 1), W = 19258, p < .001, r = −0.176.

The second gender effect test was applied to each emotion. However, since the data were labeled with a minimum of three emotion labels, it was not possible to test single emotions in isolation. Therefore, the extracted data for the response variable were sets of multi-label emotions that contained at least the emotion of interest.

Figure 4.5 depicts the amount of emotions over time controlled by gender, where each panel illustrates a different emotion label, as written in the caption. The distribution of positive (bottom-middle) and negative (bottom-right) emotions over time was nearly identical, while sadness had the lowest emotion frequency distribution over time. The figures show that the regression slope for joy was steeper than for fear, and the slope for fear was slightly steeper than for sadness.

Table 4.6: Descriptive statistics of gender in emotions containing joy to inspect the normality assumption

gender  n    mean    sd      median  min  max   range  skew  kurtosis  se
F       432  432.65  440.38  263     63   2253  2190   2.13  4.25      21.19
M       432  316.59  375.50  151.5   49   2059  2010   2.18  4.43      18.07

The numbers of female and male users were balanced, but the range of emotion frequencies was very wide, especially for females, as seen in Table 4.6. The Wilcoxon rank-sum test showed that females (Mdn = 263) expressed significantly more joy in their tweets than males (Mdn = 151.5), W = 123010, p < .001, r = −0.27.

Table 4.7: Descriptive statistics of gender in emotions containing fear to inspect the normality assumption

gender  n    mean    sd      median  min  max   range  skew  kurtosis  se
F       432  308.07  351.05  174     38   1836  1798   2.28  4.86      16.89
M       432  236.81  301.65  111.5   27   1683  1656   2.28  4.83      14.51

Table 4.7 again displays a balanced number of female and male users. There were significantly more tweets displaying the fear emotion for females (Mdn = 174) than males (Mdn = 111.5), W = 117200, p < .001, r = −0.221.

Table 4.8: Descriptive statistics of gender in emotions containing sadness to inspect the normality assumption

gender  n    mean    sd      median  min  max   range  skew  kurtosis  se
F       432  297.04  331.25  166.5   34   1735  1701   2.23  4.64      15.94
M       432  221.38  278.87  102     22   1473  1451   2.26  4.69      13.42

Similar to joy and fear, tweets containing sadness had a balanced number of female and male users (Table 4.8). The Wilcoxon rank-sum test showed that significantly more tweets containing sadness came from females (Mdn = 166.5) than males (Mdn = 102), W = 125580, p < .001, r = −0.241.

Table 4.9: Descriptive statistics of gender in emotions containing positive sentiment to inspect the normality assumption

gender  n    mean    sd      median  min  max   range  skew  kurtosis  se
F       432  547.81  585.54  322     78   2946  2868   2.21  4.56      28.17
M       432  417.00  510.42  195     64   2810  2746   2.23  4.62      24.56

Regarding emotions containing positive sentiment, Table 4.9 displays a large deviation from normality, as the skew and kurtosis scores were far from zero. The Wilcoxon rank-sum test showed a significant difference between females (Mdn = 322) and males (Mdn = 195), with females' tweets displaying more positive sentiment than males', W = 120500, p < .001, r = −0.252.

Table 4.10: Descriptive statistics of gender in emotions containing negative sentiment to inspect the normality assumption

gender  n    mean    sd      median  min  max   range  skew  kurtosis  se
F       432  448.67  510.13  252.5   58   2626  2568   2.25  4.70      24.54
M       432  345.65  437.49  159.5   38   2434  2396   2.27  4.78      21.05

Lastly, in Table 4.10, females (Mdn = 252.5) were more expressive in tweets containing negative sentiment than males (Mdn = 159.5), and this difference was significant, W = 116580, p < .001, r = −0.215.

4.4 Emotion shift in situational context

The emotion labels tested here are the positive-related emotions (i.e. anticipation, trust, joy, surprise, positive) and the negative-related emotions (i.e. anger, fear, disgust, sadness, negative), each of which was regarded as one set of multi-label emotions. The data were separated into two conditions: 6 September 2016 until 6 November 2016 for the first period, and 10 November 2016 until 10 January 2017 for the second period.
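A sketch of this windowing and the subsequent tests (pandas and scipy; the frame, column names, and values are illustrative, not the original R analysis):

```python
import pandas as pd
from scipy.stats import shapiro, wilcoxon

# one row per day with the summed emotion frequencies of that day
df = pd.DataFrame({
    'date': pd.to_datetime(['2016-09-10', '2016-09-25', '2016-10-10',
                            '2016-10-25', '2016-11-05', '2016-11-15',
                            '2016-12-01', '2016-12-15', '2017-01-01',
                            '2017-01-08']),
    'total_emotions': [9, 12, 7, 10, 8, 12, 15, 9, 14, 11],
})

before = df[(df['date'] >= '2016-09-06') & (df['date'] <= '2016-11-06')]
after = df[(df['date'] >= '2016-11-10') & (df['date'] <= '2017-01-10')]

# normality check; significant results motivate the non-parametric test
print(shapiro(before['total_emotions']))
print(shapiro(after['total_emotions']))

# paired comparison of the two equally long two-month windows
stat, p = wilcoxon(before['total_emotions'].values,
                   after['total_emotions'].values)
```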

Figure 4.6: Boxplot of positive-related emotions (left) and negative-related emotions (right) each with two points in time

The boxplots in Figure 4.6 show the emotion mean difference at two points in time, that is, during September-November and November-January. Visually, the mean emotion frequency for both positive-related and negative-related emotions increased slightly during the second period. Table 4.11 and Table 4.12 display descriptive statistics to check the normality assumption prior to testing the difference between the two periods. Skew and kurtosis values should be zero for a normal distribution; in this case both values were far from zero. Further inspection with the Shapiro-Wilk test (Table 4.13) yielded significant values, indicating a deviation from normality. Thus, the non-parametric Wilcoxon signed-rank test was conducted. Positive-related emotions during the second period were significantly higher (Mdn = 12) than during the first period (Mdn = 9), p < .001. The effect size (r = −0.24) indicated a low to medium change for this set of emotion labels. Negative-related emotions during the second period were significantly higher (Mdn = 8) than during the first period (Mdn = 7), p < .001. The effect size (r = −0.18) showed that this change was very low.

Table 4.11: Descriptive statistics of positive-related emotions in both periods to inspect the normality assumption

vars  n       mean   sd     median  min   max     range   skew  kurtosis  se
1     506.00  11.96  10.47  9.00    1.00  82.00   81.00   1.82  5.33      0.47
2     506.00  15.55  14.94  12.00   1.00  114.00  113.00  2.36  8.93      0.66

Table 4.12: Descriptive statistics of negative-related emotions in both periods to inspect the normality assumption

vars  n       mean   sd     median  min   max    range  skew  kurtosis  se
1     454.00  9.31   9.19   7.00    1.00  65.00  64.00  2.14  6.49      0.43
2     454.00  11.34  11.09  8.00    1.00  85.00  84.00  2.21  7.66      0.52

Table 4.13: Shapiro-Wilk normality test

Positive-related emotions       W     p
Period of September-November    0.84  p < 2.2e-16
Period of November-January      0.79  p < 2.2e-16

Negative-related emotions       W     p
Period of September-November    0.79  p < 2.2e-16
Period of November-January      0.79  p < 2.2e-16

4.4.1 Observation of the non-politics/politics tweets

For further investigation, the tweets were split by politics label, to detect changes in non-political tweets and political tweets under separate conditions. In other words, non-political tweets in the 'before' period were tested against non-political tweets in the 'after' period, and likewise for political tweets.

Table 4.14: Descriptive statistics of politics-labeled positive-related emotions

vars  n       mean    sd      median  min   max      range    skew  kurtosis  se
1     511.00  138.28  171.94  76.00   1.00  1302.00  1301.00  2.50  8.30      7.61
2     511.00  139.34  175.37  80.00   1.00  1364.00  1363.00  2.70  10.00     7.76

Table 4.15: Descriptive statistics of politics-labeled negative-related emotions

vars  n       mean  sd    median  min   max    range  skew  kurtosis  se
1     369.00  5.83  5.47  4.00    1.00  42.00  41.00  2.36  7.77      0.28
2     369.00  7.24  7.27  5.00    1.00  66.00  65.00  2.98  14.77     0.38

Table 4.16: Descriptive statistics of nonpolitics-labeled positive-related emotions

vars  n       mean   sd     median  min   max     range   skew   kurtosis  se
1     435.00  8.56   7.83   6.00    1.00  52.00   51.00   1.92   4.66      0.38
2     435.00  10.16  19.63  7.00    1.00  368.00  367.00  14.25  251.83    0.94

Table 4.17: Descriptive statistics of nonpolitics-labeled negative-related emotions

vars  n       mean  sd    median  min   max    range  skew  kurtosis  se
1     368.00  5.34  4.49  4.00    1.00  31.00  30.00  1.82  4.61      0.23
2     368.00  5.98  5.25  4.50    1.00  37.00  36.00  1.70  4.33      0.27

Inspection of the emotions exhibited in politics-labeled tweets, as shown in Figure 4.7, exposed interesting findings. In both periods, politics-labeled tweets with positive-related emotions were noticeably linearly separable. The opposite was the case for negative-related emotions, where data points were scattered in the first period. In the second period, the points were slightly more separated and the regression slope was steeper; this also held for the positive-related emotions. The data were not normally distributed, as the skew and kurtosis values in each row deviated from zero (see Table 4.14, Table 4.15, Table 4.16, and Table 4.17). The Shapiro-Wilk test confirmed the non-normal distribution.

Figure 4.7: Positive-related emotions and negative-related emotions during September-November (left) and November-January (right) for politics (top) and non-politics (bottom) tweets

To check whether the emotion differences between the two periods were significant, the Wilcoxon signed-rank test was conducted. The only significant difference was found in politics-labeled negative-related emotions, where emotions in the second period were significantly higher (Mdn = 5) than in the first period (Mdn = 4), p < .001, with a low effect size (r = −0.20).

4.4.2 Observation of gender-emotion shift

Finally, further investigation was conducted by comparing median emotions between genders. The aim was to obtain insight into whether the median of female or male emotions changed significantly between these periods.

Figure 4.8: Positive-related emotions and negative-related emotions during September-November (left) and November-January (right) controlled by gender

Figure 4.8 depicts linearly separable data points in the September-November period. For positive-related emotions, the data points became more scattered towards the end of the second period, with emotion frequencies increasing dramatically. For the negative-related emotions, the genders became more separated by the end of January, with steeper slopes. In all figures, females showed higher emotion frequencies than males. The Wilcoxon signed-rank test showed significant results in all four conditions: female users' positive-related emotions changed significantly in the second period, p < .01, with the lowest effect size of the four conditions, r = −0.18; the change for male users was also significant, p < .001, with a slightly higher effect size, r = −0.22; negative-related emotions of female users changed significantly, p = .03, with a low effect size, r = −0.20; for male users, the change was also significant, p < .001, with the highest effect size of the four conditions, r = −0.27.

Table 4.18: Descriptive statistics of female users with positive-related emotions

vars  n       mean   sd     median  min   max     range   skew  kurtosis  se
1     292.00  12.99  11.06  10.00   1.00  82.00   81.00   1.91  5.85      0.65
2     292.00  15.78  14.94  12.00   1.00  114.00  113.00  2.83  12.96     0.87

Table 4.19: Descriptive statistics of male users with positive-related emotions

vars  n       mean   sd     median  min   max     range   skew  kurtosis  se
1     215.00  10.70  9.70   7.00    1.00  56.00   55.00   1.54  2.80      0.66
2     215.00  16.88  28.33  10.00   1.00  368.00  367.00  9.06  107.27    1.93

Table 4.20: Descriptive statistics of female users with negative-related emotions

vars  n       mean   sd     median  min   max    range  skew  kurtosis  se
1     262.00  10.61  9.97   8.00    1.00  65.00  64.00  2.06  6.01      0.62
2     262.00  12.59  12.39  9.00    1.00  85.00  84.00  2.24  7.19      0.77

Table 4.21: Descriptive statistics of male users with negative-related emotions

vars  n       mean  sd    median  min   max    range  skew  kurtosis  se
1     192.00  7.55  7.67  5.00    1.00  44.00  43.00  2.07  5.11      0.55
2     192.00  9.64  8.76  6.00    1.00  45.00  44.00  1.35  1.53      0.63

Chapter 5

Discussion

Generally, there were two main tasks in this study. The first was a machine learning task aimed at predicting multi-label emotions and binary politics/non-politics labels for tweets. A word-emotion corpus (EmoLex) was used for the multi-label annotation process, and a political tweet corpus was used to label tweets as politics or non-politics. The outcomes of both labeling processes were used in the statistical analysis. The second task was a statistical analysis in which non-parametric tests were conducted.

This study commenced with the introduction of three research questions. The following discussion reflects upon each research question.

Research Question I What are the dominant emotions that emerged from the dataset?

Firstly, two multi-label classifiers, Binary Relevance (BR) and Classifier Chain (CC), were used, each of which implemented a support vector machine with a linear kernel.

Results from the validation set indicated a slightly lower accuracy (samples-averaged F1 score) for CC (0.80) compared to BR (0.82). Moreover, when applied to the test set, the accuracy decreased for both classifiers (CC: 0.79, BR: 0.80). However, given the small differences between the samples-averaged F1 scores on the validation and test sets, it is fair to state that the validation and test performance are satisfactory. The marginally worse performance of CC compared to BR might be

due to low label dependency in the data. There might be an absence of interdependencies between labels, such that the BR model (which assumes label independence) turned out to perform better. This result is in line with the study by Bravo-Marquez et al. (2016), which found BR to outperform CC on word2vec representations.

The outcome of the multi-label classifiers is emotion labels, given the assumption that a tweet might contain more than one type of emotion. The dominant emotion sets were [anticipation, joy, positive, surprise, trust], [anticipation, joy, positive, trust], [anger, disgust, fear, negative, sadness], [joy, positive, trust], and [anger, disgust, negative]. These were the top five multi-label emotions, two of them negative-related and the rest positive-related.

EmoLex consists of eight emotion types which are highly related to either positive or negative sentiment. The EmoLex corpus is inherently multi-label, and the co-occurrences of emotion labels in the corpus revealed a pattern: the dominant emotion labels were related either to positive or to negative sentiment, and it was rarely the case that these mixed. This was clearly confirmed by the result that mixed emotion labels appeared fewer than ten times. Further investigation revealed that a substantial number of tweets contained only one word, e.g. excited, yet received five emotion labels. This becomes problematic when the one-word tweet is part of a retweeted status and was not originally posted by the user.

Research Question II How well does applying this detection task on Twitter replicate Plant’s findings regarding emotion stereotypes?

The theory of emotions was adopted from Plant's study (Study 1) on gender-emotion stereotypes. Her study found that females were associated with certain emotion stereotypes: joy, fear, and sadness, while males were associated with anger and pride. Due to some limitations, positive and negative sentiment were added to both stereotype sets. An interesting observation was made regarding the number of female and male users in each emotion stereotype. The total number of female users in the female-emotion stereotype was higher than in the male-emotion stereotype. Similarly, the total number of male users was higher in the male-emotion stereotype.

In the isolated setting, the largest effect size was found for the gender-emotion relation containing joy, with females posting more joy-related tweets than males. However, the effect sizes are generally considered small (r < 0.30), and therefore this result should be interpreted carefully. The gender-emotion stereotype findings in Plant's Study 1 might not yield identical results on Twitter, especially given our rather uniform sample (i.e. Twitter users possibly share similar characteristics).

Research Question III To what degree does emotional word-use vary surrounding the American elections, specifically comparing gender, non-political, and political tweets?

The present study conducted a statistical analysis of gender and emotions in a general setting and an isolated setting. The former analyzed the total emotions emerging from the dataset and applied findings from Plant's study of gender-emotion stereotypes. The latter focused on comparing the median of emotions in two periods of time, that is, before and after the US presidential election of 2016. For the positive-related emotion labels, the median scores of politics-labeled tweets were found to be lower than those of non-politics-labeled tweets. For the negative-related emotion labels, however, the median emotion score of politics-labeled tweets was slightly higher. Analyses of emotion labels and politics/non-politics-labeled tweets displayed a trend: total emotions increased over time. In other words, the stability of non-politics-labeled tweets was not confirmed.

Given the situational context of the US presidential election, emotions were compared using the Wilcoxon signed-rank test. The data were divided into two parts, before and after November 8th, 2016. Both positive-related and negative-related emotions were set as the response variable to test the different conditions. The difference in emotion frequencies between the first and second period, with gender as the mediator, revealed significant results. Regarding political tweets, the only significant change was found in politics-labeled negative-related emotions; however, the effect size was small.

Most noticeably, users became more positive after November 8th, 2016. Additionally, users who tended to express negative emotions became more negative in the second period, as seen from the statistical results. However, there were no substantial politics-related attributes to support these findings. It was therefore not completely clear whether the emotion differences were caused solely by the US presidential election or by random chance. Nevertheless, the strict amount of data sampled for the Wilcoxon signed-rank test (i.e. exactly two months for each period) boosts confidence in the validity of the results.

The annotated data in the present study showed that more users were identified as female than as male. Nevertheless, male users were still shown to express their emotions as much as female users. The graphs in the previous chapter illustrated the relation between total emotions and time. Male users consistently had a smaller regression slope, while female users' emotions increased drastically. Regarding positive-related emotions, male users had a higher mean score and a larger standard deviation of emotion frequencies compared to female users. However, the figures in Chapter 4 show that the size of the data increased over time. These figures visualize a non-linear relation between emotion frequencies and time, shaping a curve which increases drastically near the end of 2016. This might mean that the emotions captured were not representative of the data in general, but of a specific time range in which tweets were posted more frequently.

Further inspection was conducted on some of the extreme numbers of total emotions. It revealed that several male users dominated this area. The content of the tweets around these extreme numbers was not personal; it consisted of retweeted statuses supporting celebrities in an award show. Keeping such retweeted statuses might raise serious validity issues. In the data cleaning process, RTs were removed but the messages contained in them were kept, on the presumption that people who retweeted a status conformed to the idea contained in it. This could prove problematic when a retweeted status was merely an act of support: helping an acquaintance by promoting his or her business, and so on. In the case of the present study, it became a problem because these tweets were posted to support a user's idol in winning a competition.

Specific to this study, however, these users can be considered noise. Deeper investigation revealed that they set English as the main language of their Twitter account, but their location was outside the United States. Since the current study focused on users located inside the United States, irrelevant results from users outside this circle should not be considered. Twitter may not always reflect the real world, because the sample might be biased compared to the actual population. Moreover, people who use Twitter might share similar characteristics and be distinct from those who do not; for instance, Twitter users might be idealists, activists, non-stereotypical people, and so on. Although the statistical results revealed significance, the effect size of all results was small, demonstrating a minor difference between groups (i.e. male-female, politics-non politics) between periods. Further studies might consider a method to handle retweeted statuses, for instance by separating RTs that contain noise from RTs that conform to the idea of the original tweet. Additionally, politics-related attributes which might increase the effect size, such as certain scandals or rumors related to political actors, could also be included in future studies.

Chapter 6

Conclusion

In the present study, three research questions were specified. The first concerned the most frequent emotion labels that emerged from the data. These emotion labels were anticipation, trust, joy, positive, and surprise, all of which are positive-related emotions. The second research question concerned the applicability of Plant's study of gender-emotion stereotypes to the Twitter dataset. The Wilcoxon rank-sum test was applied to examine the relationship of gender-emotion stereotypes in two settings: in the first, the emotions were considered to co-occur simultaneously, and in the second, each emotion was tested separately. The output showed significant results in both settings. However, the effect size between male and female was small; thus the difference between genders was rather trivial. The last research question aimed to discover whether there was any significant change in people's emotional states before and after the US presidential election of 2016. Both positive-related and negative-related emotions were tested against non-politics/politics-labeled tweets and gender. Statistically significant results were found in the relation between gender and both emotional valences, and in the relation between politics-labeled tweets and negative-related emotions. Similar to the effect sizes for the second research question, the statistical results for the third research question also produced small effect sizes. In other words, the emotion shift of each group (i.e. gender and non-politics/politics-labeled tweets) between periods showed a minor difference.

Bibliography

Alsmearat, K., Shehab, M., Al-Ayyoub, M., Al-Shalabi, R., and Kanaan, G. (2015). Emotion analysis of arabic articles and its impact on identifying the author's gender. In Computer Systems and Applications (AICCSA), 2015 IEEE/ACS 12th International Conference of, pages 1–6. IEEE.

Amunategui, M., Markwell, T., and Rozenfeld, Y. (2015). Prediction using note text: Synthetic feature creation with word2vec. arXiv preprint arXiv:1503.05123.

Araque, O., Corcuera-Platas, I., Sánchez-Rada, J. F., and Iglesias, C. A. (2017). Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications, 77:236–246.

Astudillo, R. F., Amir, S., Ling, W., Martins, B., Silva, M., Trancoso, I., and Redol, R. A. (2015). Inesc-id: A regression model for large scale twitter sentiment lexicon induction. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 613–618.

Bailey, J., Steeves, V., Burkell, J., and Regan, P. (2013). Negotiating with gender stereotypes on social networking sites: From “bicycle face” to facebook. Journal of Communication Inquiry, 37(2):91–112.

Bamman, D., Eisenstein, J., and Schnoebelen, T. (2014). Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

Ben-Hur, A. and Weston, J. (2010). A user’s guide to support vector machines. Data mining techniques for the life sciences, pages 223–239.


Berrios, R., Totterdell, P., and Kellett, S. (2015). Investigating goal conflict as a source of mixed emotions. Cognition and Emotion, 29(4):755–763.

Bobicev, V., Sokolova, M., Jafer, Y., and Schramm, D. (2012). Learning sentiments from tweets with personal health information. In Canadian Conference on Artificial Intelligence, pages 37–48. Springer.

Bravo-Marquez, F., Frank, E., Mohammad, S. M., and Pfahringer, B. (2016). Determining word–emotion associations from tweets by multi-label classification. In WI'16, pages 536–539. IEEE Computer Society.

Burgess, D. and Borgida, E. (1999). Who women are, who women should be: Descriptive and prescriptive gender stereotyping in sex discrimination. Psychology, Public Policy and Law, 5:665–692.

Chen, Y.-L., Chang, C.-L., and Yeh, C.-S. (2017). Emotion classification of youtube videos. Decision Support Systems.

Chodorow, K. (2013). MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O’Reilly Media.

Cottrell, L. (2016). Joy and happiness: a simultaneous and evolutionary concept analysis. Journal of advanced nursing, 72(7):1506–1517.

Damasio, A. R., Grabowski, T. J., Bechara, A., Damasio, H., Ponto, L. L., Parvizi, J., and Hichwa, R. D. (2000). Subcortical and cortical brain activity during the feeling of self-generated emotions. Nature neuroscience, 3(10):1049–1056.

Di Blasio, P., Camisasca, E., Caravita, S. C. S., Ionio, C., Milani, L., and Valtolina, G. G. (2015). The effects of expressive writing on postpartum depression and posttraumatic stress symptoms. Psychological reports, 117(3):856–882.

Durik, A. M., Hyde, J. S., Marks, A. C., Roy, A. L., Anaya, D., and Schultz, G. (2006). Ethnicity and gender stereotypes of emotion. Sex Roles, 54(7-8):429–445.

Eastman, S. and Giilder, E. (2017). As america trump(ets), the world gets tinnitus: Construing the personal/political sphere of donald trump's supporters and its effects upon accurately forecasting the election of 2016. Romanian Review of Political Sciences & International Relations, 14(1).

Ekman, P. and Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of personality and social psychology, 17(2):124.

El-Beltagy, S. R. (2016). Nileulex: A phrase and word level sentiment lexicon for egyptian and modern standard arabic. In LREC.

Ferrara, E. and Yang, Z. (2015). Measuring emotional contagion in social media. PloS one, 10(11):e0142390.

Fischer, A. and LaFrance, M. (2015). What drives the smile and the tear: Why women are more emotionally expressive than men. Emotion Review, 7(1):22–29.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.

Galke, L., Mai, F., Schelten, A., Brunsch, D., and Scherp, A. (2017). Comparing titles vs. full-text for multi-label classification of scientific papers and news articles. arXiv preprint arXiv:1705.05311.

Ghazi, D., Inkpen, D., and Szpakowicz, S. (2014). Prior and contextual emotion of words in sentential context. Computer Speech & Language, 28(1):76–92.

Gillham, J. E. and Seligman, M. E. (1999). Footsteps on the road to a positive psychology. Behaviour Research and Therapy, 37:S163–S173.

Godbole, S. and Sarawagi, S. (2004). Discriminative methods for multi-labeled classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30. Springer.

Gokulakrishnan, B., Priyanthan, P., Ragavan, T., Prasath, N., and Perera, A. (2012). Opinion mining and sentiment analysis on a twitter data stream. In Advances in ICT for emerging regions (ICTer), 2012 International Conference on, pages 182–188. IEEE.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3):146–162.

Heavey, C. L., Lefforge, N. L., Lapping-Carr, L., and Hurlburt, R. T. (2017). Mixed emotions: Toward a phenomenology of blended and multiple feelings. Emotion Review, 9(2):105–110.

Herrera, F., Charte, F., Rivera, A. J., and Del Jesus, M. J. (2016). Multilabel Classification: Problem Analysis, Metrics and Techniques. Springer.

Hill, K. (2014). Facebook added 'research' to user agreement 4 months after emotion manipulation study. Forbes. http://onforb.es/15DKfGt.

Hu, X., Tang, J., Gao, H., and Liu, H. (2013a). Unsupervised sentiment analysis with emotional signals. In Proceedings of the 22nd international conference on World Wide Web, pages 607–618. ACM.

Hu, X., Tang, L., Tang, J., and Liu, H. (2013b). Exploiting social relations for sentiment analysis in microblogging. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 537–546. ACM.

Ironson, G., O'cleirigh, C., Leserman, J., Stuetzle, R., Fordiani, J., Fletcher, M., and Schneiderman, N. (2013). Gender-specific effects of an augmented written emotional disclosure intervention on posttraumatic, depressive, and hiv-disease-related outcomes: a randomized, controlled trial. Journal of consulting and clinical psychology, 81(2):284.

Kennedy-Moore, E. and Watson, J. C. (2001). How and when does emotional expression help? Review of General Psychology, 5(3):187.

Kiritchenko, S., Zhu, X., and Mohammad, S. M. (2014). Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723–762.

Koppel, M. and Schler, J. (2006). The importance of neutral examples for learning sentiment. Computational Intelligence, 22(2):100–109.

Kövecses, Z. (1991). Happiness: A definitional effort. Metaphor and Symbol, 6(1):29–47.

Kramer, A. D., Guillory, J. E., and Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790.

Krpan, K. M., Kross, E., Berman, M. G., Deldin, P. J., Askren, M. K., and Jonides, J. (2013). An everyday activity as a treatment for depression: the benefits of expressive writing for people diagnosed with major depressive disorder. Journal of affective disorders, 150(3):1148–1151.

Lehman, J. (2014). A brief explanation of the overton window. Mackinac Center for Public Policy.

Li, W. and Xu, H. (2014). Text-based emotion classification using emotion cause extraction. Expert Systems with Applications, 41(4):1742–1749.

Li, X., Xie, H., Rao, Y., Chen, Y., Liu, X., Huang, H., and Wang, F. L. (2016). Weighted multi-label classification model for sentiment analysis of online news. In Big Data and Smart Computing (BigComp), 2016 International Conference on, pages 215–222. IEEE.

Lin, Z., Jin, H., Robinson, B., and Lin, X. (2016). Towards an accurate social media disaster event detection system based on deep learning and semantic representation.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167.

Liu, S. M. and Chen, J.-H. (2015). A multi-label classification based approach for sentiment classification. Expert Systems with Applications, 42(3):1083–1093.

Luyckx, K., Vaassen, F., Peersman, C., and Daelemans, W. (2012). Fine-grained emotion detection in suicide notes: A thresholding approach to multi-label classification. Biomedical informatics insights, 5(Suppl 1):61.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.

Marchetti-Bowick, M. and Chambers, N. (2012). Learning for microblogs with distant supervision: Political forecasting with twitter. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 603–612. Association for Computational Linguistics.

Medhat, W., Hassan, A., and Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113.

Mehrotra, R., Sanner, S., Buntine, W., and Xie, L. (2013). Improving lda topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 889–892. ACM.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguistic regularities in continuous space word representations. In Hlt-naacl, volume 13, pages 746–751.

Miller, M., Banerjee, D., Muppalla, R., Romine, D., Sheth, D., et al. (2017). What are people tweeting about zika? an exploratory study concerning symptoms, treatment, transmission, and prevention. arXiv preprint arXiv:1701.07490.

Mohammad, S. M. and Kiritchenko, S. (2015). Using hashtags to capture fine emotion categories from tweets. Computational Intelligence, 31(2):301–326.

Mohammad, S. M. and Turney, P. D. (2010). Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pages 26–34. Association for Computational Linguistics.

Mohammad, S. M. and Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3):436–465.

Novak, P. K., Smailović, J., Sluban, B., and Mozetič, I. (2015). Sentiment of emojis. PloS one, 10(12):e0144296.

Oliveira, N., Cortez, P., and Areal, N. (2014). Automatic creation of stock market lexicons for sentiment analysis using stocktwits data. In Proceedings of the 18th International Database Engineering & Applications Symposium, pages 115–123. ACM.

Parrott, W. G. (2001). Emotions in social psychology: Essential readings. Psychology Press.

Plant, E. A., Hyde, J. S., Keltner, D., and Devine, P. G. (2000). The gender stereotyping of emotions. Psychology of Women Quarterly, 24(1):81–92.

Plutchik, R. (1980). A general psychoevolutionary theory of emotion. Theories of emotion, 1(3- 31):4.

Polanyi, L. and Zaenen, A. (2006). Contextual valence shifters. In Computing attitude and affect in text: Theory and applications, pages 1–10. Springer.

Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition.

Read, J., Pfahringer, B., Holmes, G., and Frank, E. (2011). Classifier chains for multi-label classification. Machine learning, 85(3):333–359.

Read, J., Reutemann, P., Pfahringer, B., and Holmes, G. (2016). Meka: A multi-label/multi-target extension to weka. Journal of Machine Learning Research, 17(21):1–5.

Rivas, A. J. R., Ojeda, F. C., Pulgar, F. J., and del Jesus, M. J. (2017). A transformation approach towards big data multilabel decision trees. In International Work-Conference on Artificial Neural Networks, pages 73–84. Springer.

Rokach, L. and Maimon, O. (2010a). Introduction to knowledge discovery and data mining. Data Mining and Knowledge Discovery Handbook, pages 2–5.

Rokach, L. and Maimon, O. (2010b). Supervised learning. Data Mining and Knowledge Discovery Handbook, pages 133–147.

Ross, C. M. (2017). Happiness is like medicine: Wisdom for activities professionals. BAOJ Pall Medicine, 3:030.

Russell, J. A. and Barrett, L. F. (1999). Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant. Journal of personality and social psychology, 76(5):805.

Saif, H., Fernández, M., He, Y., and Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of twitter.

Saif, H., He, Y., and Alani, H. (2012). Semantic sentiment analysis of twitter. The Semantic Web–ISWC 2012, pages 508–524.

Summa, A., Resch, B., and Strube, M. (2016). Microblog emotion classification by computing similarity in text, time, and space. In Proceedings of the Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 153–162.

Szymański, P. (2017). A scikit-based python environment for performing multi-label classification. arXiv preprint arXiv:1702.01460.

Taboada, M., Brooke, J., Tofiloski, M., Voll, K., and Stede, M. (2011). Lexicon-based methods for sentiment analysis. Computational linguistics, 37(2):267–307.

Tsoumakas, G. and Katakis, I. (2006). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3).

Tsoumakas, G., Katakis, I., and Vlahavas, I. (2009). Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. Springer.

Vo, B.-K. H. and Collier, N. (2013). Twitter emotion analysis in earthquake situations. International Journal of Computational Linguistics and Applications, 4(1):159–173.

Zhai, Z., Liu, B., Xu, H., and Jia, P. (2011). Clustering product features for opinion mining. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 347–354. ACM.

Zhang, C. and Zhang, P. (2010). Predicting gender from blog posts. University of Massachusetts Amherst, USA.

Zhang, L., Jia, Y., Zhou, B., and Han, Y. (2012). Microblogging sentiment analysis using emotional vector. In Cloud and Green Computing (CGC), 2012 Second International Conference on, pages 430–433. IEEE.

Zhang, Y., Su, L., Yang, Z., Zhao, X., and Yuan, X. (2015). Multi-label emotion tagging for online news by supervised topic model. In Asia-Pacific Web Conference, pages 67–79. Springer.