arXiv:2005.00547v2 [cs.CL] 3 Jun 2020

GoEmotions: A Dataset of Fine-Grained Emotions

Dorottya Demszky1∗ Dana Movshovitz-Attias2 Jeongwoo Ko2 Alan Cowen2 Gaurav Nemade2 Sujith Ravi3∗
1Stanford Linguistics 2Google Research 3Amazon Alexa
[email protected] fdanama, jko, acowen, [email protected] [email protected]

∗ Work done while at Google Research.

Abstract

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.1

1 Data and code available at https://github.com/google-research/google-research/tree/master/goemotions.

Sample Text | Label(s)
OMG, yep!!! That is the final answer. Thank you so much! | gratitude, approval
I’m not even sure what it is, why do people hate it | confusion
Guilty of doing this tbph | remorse
This caught me off guard for real. I’m actually off my bed laughing | surprise, amusement
I tried to send this to a friend but [NAME] knocked it away. | disappointment

Table 1: Example annotations from our dataset.

1 Introduction

Emotion expression and detection are central to the human experience and social interaction. With as many as a handful of words we are able to express a wide variety of subtle and complex emotions, and it has thus been a long-term goal to enable machines to understand affect and emotion (Picard, 1997).

In the past decade, NLP researchers made available several datasets for language-based emotion classification for a variety of domains and applications, including for news headlines (Strapparava and Mihalcea, 2007), tweets (CrowdFlower, 2016; Mohammad et al., 2018), and narrative sequences (Liu et al., 2019), to name just a few. However, existing available datasets are (1) mostly small, containing up to several thousand instances, and (2) cover a limited emotion taxonomy, with coarse classification into Ekman (Ekman, 1992b) or Plutchik (Plutchik, 1980) emotions.

Recently, Bostan and Klinger (2018) have aggregated 14 popular emotion classification corpora under a unified framework that allows direct comparison of the existing resources. Importantly, their analysis suggests annotation quality gaps in the largest manually annotated emotion classification dataset, CrowdFlower (2016), containing 40K tweets labeled for one of 13 emotions. While their work enables such comparative evaluations, it highlights the need for a large-scale, consistently labeled emotion dataset over a fine-grained taxonomy, with demonstrated high-quality annotations.

To this end, we compiled GoEmotions, the largest human annotated dataset of 58k carefully selected Reddit comments, labeled for 27 emotion categories or Neutral, with comments extracted from popular English subreddits. Table 1 shows an illustrative sample of our collected data. We design our emotion taxonomy considering related work in psychology and coverage in our data. In contrast to Ekman’s taxonomy, which includes only one positive emotion (joy), our taxonomy includes a large number of positive, negative, and ambiguous emotion categories, making it suitable for downstream conversation understanding tasks that require a subtle understanding of emotion expression, such as the analysis of customer feedback or the enhancement of chatbots.

We include a thorough analysis of the annotated data and the quality of the annotations. Via Principal Preserved Component Analysis (Cowen et al., 2019b), we show a strong support for reliable dissociation among all 27 emotion categories, indicating the suitability of our annotations for building an emotion classification model.

We perform hierarchical clustering on the emotion judgments, finding that emotions related in intensity cluster together closely and that the top-level clusters correspond to sentiment categories. These relations among emotions allow for their potential grouping into higher-level categories, if desired for a downstream task.

We provide a strong baseline for modeling fine-grained emotion classification over GoEmotions. By fine-tuning a BERT-base model (Devlin et al., 2019), we achieve an average F1-score of .46 over our taxonomy, .64 over an Ekman-style grouping into six coarse categories and .69 over a sentiment grouping. These results leave much room for improvement, showcasing this task is not yet fully addressed by current state-of-the-art NLU models.

We conduct transfer learning experiments with existing emotion benchmarks to show that our data can generalize to different taxonomies and domains, such as tweets and personal narratives. Our experiments demonstrate that given limited resources to label additional emotion classification data for specialized domains, our data can provide baseline emotion understanding and contribute to increasing model accuracy for the target domain.

2 Related Work

2.1 Emotion Datasets

Ever since Affective Text (Strapparava and Mihalcea, 2007), the first benchmark for emotion recognition was introduced, the field has seen several emotion datasets that vary in size, domain and taxonomy (cf. Bostan and Klinger, 2018). The majority of emotion datasets are constructed manually, but tend to be relatively small. The largest manually labeled dataset is CrowdFlower (2016), with 39k labeled examples, which were found by Bostan and Klinger (2018) to be noisy in comparison with other emotion datasets. Other datasets are automatically weakly-labeled, based on emotion-related hashtags on Twitter (Wang et al., 2012; Abdul-Mageed and Ungar, 2017). We build our dataset manually, making it the largest human annotated dataset, with multiple annotations per example for quality assurance.

Several existing datasets come from the domain of Twitter, given its informal language and expressive content, such as emojis and hashtags. Other datasets annotate news headlines (Strapparava and Mihalcea, 2007), dialogs (Li et al., 2017), fairy tales (Alm et al., 2005), movie subtitles (Öhman et al., 2018), sentences based on FrameNet (Ghazi et al., 2015), or self-reported experiences (Scherer and Wallbott, 1994) among other domains. We are the first to build on Reddit comments for emotion prediction.

2.2 Emotion Taxonomy

One of the main aspects distinguishing our dataset is its emotion taxonomy. The vast majority of existing datasets contain annotations for minor variations of the 6 basic emotion categories (joy, anger, fear, sadness, disgust, and surprise) proposed by Ekman (1992a) and/or along affective dimensions (valence and arousal) that underpin the circumplex model of affect (Russell, 2003; Buechel and Hahn, 2017).

Recent advances in psychology have offered new conceptual and methodological approaches to capturing the more complex “semantic space” of emotion (Cowen et al., 2019a) by studying the distribution of emotion responses to a diverse array of stimuli via computational techniques. Studies guided by these principles have identified 27 distinct varieties of emotional experience conveyed by short videos (Cowen and Keltner, 2017), 13 by music (Cowen et al., in press), 28 by facial expression (Cowen and Keltner, 2019), 12 by speech prosody (Cowen et al., 2019b), and 24 by nonverbal vocalization (Cowen et al., 2018). In this work, we build on these methods and findings to devise our granular taxonomy for text-based emotion recognition and study the dimensionality of language-based emotion space.

2.3 Emotion Classification Models

Both feature-based and neural models have been used to build automatic emotion classification models. Feature-based models often make use of hand-built lexicons, such as the Valence Arousal Dominance Lexicon (Mohammad, 2018). Using representations from BERT (Devlin et al., 2019), a transformer-based model with language model pre-training, has recently shown to reach state-of-the-art performance on several NLP tasks, also including emotion prediction: the top-performing models in the EmotionX Challenge (Hsu and Ku, 2018) all employed a pre-trained BERT model. We also use the BERT model in our experiments and we find that it outperforms our biLSTM model.

3 GoEmotions

Our dataset is composed of 58K Reddit comments, labeled for one or more of 27 emotion(s) or Neutral.

3.1 Selecting & Curating Reddit comments

We use a Reddit data dump originating in the reddit-data-tools project2, which contains comments from 2005 (the start of Reddit) to January 2019. We select subreddits with at least 10k comments and remove deleted and non-English comments.

Reddit is known for a demographic bias leaning towards young male users (Duggan and Smith, 2013), which is not reflective of a globally diverse population.

[...] negative emotions. The dataset includes the list of filtered tokens.

Manual review. We manually review identity comments and remove those offensive towards a particular ethnicity, gender, sexual orientation, or disability, to the best of our judgment.

Length filtering. We apply NLTK’s word tokenizer and select comments 3-30 tokens long, including punctuation. To create a relatively balanced distribution of comment length, we perform downsampling, capping by the number of comments with the median token count (12).

Sentiment balancing. We reduce sentiment bias by removing subreddits with little representation of positive, negative, ambiguous, or neutral sentiment. To estimate a comment’s sentiment, we run our emotion prediction model, trained on a pilot batch of 2.2k annotated examples. The mapping of emotions into sentiment categories is found in Figure 2. We exclude subreddits consisting of more than 30% neutral comments or less than 20% of
