SpanEmo: Casting Multi-label Emotion Classification as Span-prediction

Hassan Alhuzali, Sophia Ananiadou
National Centre for Text Mining, Department of Computer Science, The University of Manchester, United Kingdom
{hassan.alhuzali@postgrad., sophia.ananiadou@}manchester.ac.uk

Abstract

Emotion recognition (ER) is an important task in Natural Language Processing (NLP), due to its high impact in real-world applications, from health and well-being to author profiling, consumer analysis and security. Current approaches to ER mainly classify emotions independently, without considering that emotions can co-exist. Such approaches overlook potential ambiguities in which multiple emotions overlap. We propose a new model, "SpanEmo", casting multi-label emotion classification as span-prediction, which can aid ER models to learn associations between labels and words in a sentence. Furthermore, we introduce a loss function focused on modelling multiple co-existing emotions in the input sentence. Experiments performed on the SemEval 2018 multi-label emotion data set, over three languages (i.e., English, Arabic and Spanish), demonstrate our method's effectiveness. Finally, we present different analyses that illustrate the benefits of our method in terms of improving model performance and learning meaningful associations between emotion classes and words in the sentence. (Source code is available at https://github.com/hasanhuz/SpanEmo.)

1 Introduction

Emotion is essential to human communication, thus emotion recognition (ER) models have a host of applications from health and well-being (Alhuzali and Ananiadou, 2019; Aragón et al., 2019; Chen et al., 2018) to consumer analysis (Alaluf and Illouz, 2019; Herzig et al., 2016) and user profiling (Volkova and Bachrach, 2016; Mohammad and Kiritchenko, 2013), among others. Interest in this area has given rise to new NLP approaches aimed at emotion classification, including single-label and multi-label emotion classification. Most existing approaches to multi-label emotion classification (Ying et al., 2019; Baziotis et al., 2018; Yu et al., 2018; Badaro et al., 2018; Mulki et al., 2018; Mohammad et al., 2018; Yang et al., 2018) do not effectively capture emotion-specific associations, which can be useful for prediction, as well as for learning associations between emotion labels and words in a sentence. In addition, standard approaches to emotion classification treat individual emotions independently. However, emotions are not independent; a specific emotive expression can be associated with multiple emotions. The existence of association/correlation among emotions has been well studied in psychological theories of emotion, such as Plutchik's wheel of emotions (Plutchik, 1984), which introduces the notion of mixed and contrastive emotions. For example, "joy" is close to "love" and "optimism", rather than to "anger" and "sadness".

#    Sentence                                                        GT
S1   well my day started off great the mocha machine wasn't         anger, disgust,
     working @ mcdonalds.                                           joy, sadness
S2   I'm doing all this to make sure you smiling down on me bro.    joy, love, optimism

Table 1: Example Tweets from SemEval-18 Task 1. GT represents the ground truth labels.

Consider S1 in Table 1, which contains a mix of positive and negative emotions, although it is more negatively oriented. This can be observed clearly via the ground truth labels assigned to this example: the first part of the sentence expresses only a positive emotion (i.e., joy), while the other part expresses negative emotions. For example, clue words like "great" are more likely to be associated with "joy", whereas "wasn't working" is more likely to be associated with negative emotions. Learning such associations between emotion labels and words in a sentence can help ER models to predict the correct labels. S2 further highlights that certain emotions are more likely to be associated with each other. Mohammad and Bravo-Marquez (2017) also observed that negative emotions are highly associated with each other, while being less associated with positive emotions. Based on these observations, we seek to answer the following research questions: i) how to enable ER models to learn emotion-specific associations by taking into account label information, and ii) how to benefit from the multiple co-existing emotions in a multi-label emotion data set with the intention of learning label correlations. Our contributions are summarised as follows:

I. A novel framework casting the task of multi-label emotion classification as a span-prediction problem. We introduce "SpanEmo" to train the model to take into consideration both the input sentence and a label set (i.e., the emotion classes), selecting a span of emotion classes in the label set as the output. The objective of SpanEmo is to predict emotion classes directly from the label set and to capture the associations corresponding to each emotion.

II. A loss function modelling multiple co-existing emotions for each input sentence. We make use of the label-correlation aware (LCA) loss (Yeh et al., 2017), originally introduced by Zhang and Zhou (2006). The objective of this loss function is to maximise the distance between positive and negative labels, which is learned directly from the multi-label emotion data set.

III. A large number of experiments and analyses, both at the word and sentence level, demonstrating the strength of SpanEmo for multi-label emotion classification across three languages (i.e., English, Arabic and Spanish).

The rest of the paper is organised as follows: Section 2 describes our methodology, while Section 3 discusses experimental details. We evaluate the proposed method and compare it to related methods in Section 4. Section 5 reports on the analysis of results, while Section 6 reviews related work. We conclude in Section 7.

2 Methodology

2.1 Framework

Figure 1 presents our framework (SpanEmo). Given an input sentence and a set of classes, a base encoder was employed to learn contextualised word representations. Next, a feed-forward network (FFN) was used to project the learned representations into a single score for each token. We then used the scores for the label tokens as predictions for the corresponding emotion labels. The green boxes at the top of the FFN illustrate the positive label set, while the red ones illustrate the negative label set for multi-label emotion classification. We now turn to describing our framework in detail.

Figure 1: Illustration of our proposed framework (SpanEmo). The input sequence [CLS] C1 C2 ... Cm [SEP] W1 W2 ... Wn, i.e., the classes (C) followed by the input sentence (s_i), is passed through the BERT encoder and a feed-forward network.

2.2 Our Method (SpanEmo)

Let {(s_i, y_i)}_{i=1}^{N} be a set of N examples with corresponding emotion labels drawn from C classes, where s_i denotes the input sentence and y_i ∈ {0, 1}^m represents the label set for s_i. As shown in Figure 1, both the label set and the input sentence were passed into the BERT encoder (Devlin et al., 2019). The encoder received two segments: the first corresponds to the set of emotion classes, while the second refers to the input sentence. The hidden representations H_i ∈ R^(T×D), where T and D denote the input length and dimension size respectively, for each input sentence and the label set were obtained as follows:

    H_i = Encoder([CLS] + |C| + [SEP] + s_i),    (1)

where [CLS] and [SEP] are special tokens and |C| denotes the set of emotion classes. Feeding both segments to the encoder has a few advantages. First, the encoder can interpolate between emotion classes and all words in the input sentence. Second, a hidden representation is generated for both words and emotion classes, which can be further used to understand whether the encoder can learn associations between the emotion classes and the words in the input sentence. Third, SpanEmo is flexible, because its predictions are produced directly from the first segment, corresponding to the emotion classes.

We further introduced a feed-forward network (FFN) consisting of a non-linear hidden layer with a Tanh activation, f_i(H_i), as well as a position vector p_i ∈ R^D, which was used to compute a dot product between the output of f_i and p_i. As our task involved multi-label emotion classification, we added a sigmoid activation to determine whether class i was the correct emotion label or not. It should be mentioned that the use of the position vector is quite similar to how start and end vectors are defined in transformer-based models for question answering. Finally, the span-prediction tokens were obtained from the label segment and then compared with the ground truth labels, since there was a 1-to-1 correspondence between the label tokens and the original emotion labels:

    ŷ = sigmoid(FFN(H_i)).    (2)
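To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of the architecture described above. It is an illustrative reconstruction, not the authors' released implementation (see the repository linked in the abstract): the class name SpanEmoSketch, the realisation of the position vector as a bias-free Linear(hidden, 1) layer, the dropout placement and the label_positions argument are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SpanEmoSketch(nn.Module):
    """Minimal sketch of the SpanEmo architecture in Section 2.2."""

    def __init__(self, model_name="bert-base-uncased", hidden=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        # FFN: non-linear hidden layer with Tanh, followed by a dot product
        # with a learned position vector p (implemented as a bias-free
        # Linear(hidden, 1)), analogous to the start/end vectors in QA models.
        self.ffn = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1, bias=False),
        )

    def forward(self, input_ids, attention_mask, token_type_ids, label_positions):
        # Eq. (1): H_i = Encoder([CLS] + |C| + [SEP] + s_i)
        H = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask,
                         token_type_ids=token_type_ids).last_hidden_state
        token_scores = self.ffn(H).squeeze(-1)       # one score per token
        # Keep only the label-token positions (1-to-1 with emotion labels),
        # then Eq. (2): y_hat = sigmoid(FFN(H_i)).
        label_scores = token_scores.gather(1, label_positions)
        return torch.sigmoid(label_scores)
```

The label set can be supplied as the first segment exactly as in Equation (1), e.g. with the eleven SemEval-2018 Task 1 (English) emotions; label_positions must then index the (first word-piece of the) class tokens in the encoded sequence:

```python
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
classes = ("anger anticipation disgust fear joy love optimism "
           "pessimism sadness surprise trust")
enc = tokenizer(classes, "well my day started off great", return_tensors="pt")
```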
2.3 Label-Correlation Aware (LCA) Loss

Following Yeh et al. (2017), we employed the label-correlation aware loss, which takes a vector of true labels together with the vector of label predictions and maximises the distance between positive and negative labels. This term is combined with the standard binary cross-entropy loss to form the overall training objective, where α ∈ [0, 1] denotes the weight used to control the contribution of each part to the overall loss.
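The extract is cut off before the loss is written out, so the following is only a hedged sketch. The LCA term of Yeh et al. (2017) can be implemented as the mean of exp(score_negative − score_positive) over all (negative, positive) label pairs of an example, which pushes positive-label predictions above negative-label ones; combining it with binary cross-entropy through the weight α matches the description above. The function names and the default α value are illustrative assumptions; the exact formulation should be taken from the cited papers and the released code.

```python
import torch
import torch.nn.functional as F

def lca_loss(y_hat, y_true):
    """Sketch of the label-correlation aware term (Yeh et al., 2017):
    for each example, average exp(s_neg - s_pos) over all pairs of a
    negative label and a positive label."""
    per_example = []
    for scores, labels in zip(y_hat, y_true):
        pos = scores[labels == 1]            # predictions for true labels
        neg = scores[labels == 0]            # predictions for absent labels
        if pos.numel() == 0 or neg.numel() == 0:
            per_example.append(scores.new_zeros(()))  # degenerate row
            continue
        diff = neg.unsqueeze(1) - pos.unsqueeze(0)    # all (neg, pos) pairs
        per_example.append(torch.exp(diff).mean())
    return torch.stack(per_example).mean()

def overall_loss(y_hat, y_true, alpha=0.2):
    """Hypothetical combination: (1 - alpha) * BCE + alpha * LCA, with
    alpha in [0, 1] controlling the contribution of each part."""
    bce = F.binary_cross_entropy(y_hat, y_true.float())
    return (1 - alpha) * bce + alpha * lca_loss(y_hat, y_true)
```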
3 Experiments

3.1 Implementation Details

We used PyTorch (Paszke et al., 2017) for implementation and ran all experiments on an Nvidia GeForce GTX 1080 with 11 GB of memory. We trained BERT-base models utilising the open-source Hugging Face implementation (Wolf et al., 2019). For the Arabic experiments, we chose "bert-base-arabic", developed by Safaya et al. (2020), while selecting "bert-base-spanish-uncased", developed by Cañete et al. (2020), for Spanish. All three models were trained with the same hyper-parameters and a fixed initialisation seed, including a feature dimension of 768, a batch size of 32, a dropout rate of 0.1, an early-stopping patience of 10, and 20 training epochs. Adam was selected for optimisation (Kingma and Ba, 2014), with a learning rate of 2e-5 for the BERT encoder and a learning rate of 1e-3 for the FFN.
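The two learning rates map directly onto two Adam parameter groups. A short sketch, assuming the encoder/ffn attribute layout of the SpanEmoSketch module shown earlier:

```python
from torch.optim import Adam

model = SpanEmoSketch()
# Section 3.1: 2e-5 for the BERT encoder, 1e-3 for the FFN head.
optimizer = Adam([
    {"params": model.encoder.parameters(), "lr": 2e-5},
    {"params": model.ffn.parameters(),     "lr": 1e-3},
])
```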