Seq2Emo: A Sequence to Multi-Label Emotion Classification Model
Chenyang Huang, Amine Trabelsi, Xuebin Qin, Nawshad Farruque, Lili Mou, Osmar Zaïane
Alberta Machine Intelligence Institute
Department of Computing Science, University of Alberta
{chenyangh,atrabels,xuebin,nawshad,zaiane}@ualberta.ca [email protected]

Abstract

Multi-label emotion classification is an important task in NLP and is essential to many applications. In this work, we propose a sequence-to-emotion (Seq2Emo) approach, which implicitly models emotion correlations in a bi-directional decoder. Experiments on SemEval'18 and GoEmotions datasets show that our approach outperforms state-of-the-art methods (without using external data). In particular, Seq2Emo outperforms the binary relevance (BR) and classifier chain (CC) approaches in a fair setting.1

1 Our code is available at https://github.com/chenyangh/Seq2Emo

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4717–4724, June 6–11, 2021. ©2021 Association for Computational Linguistics

1 Introduction

Emotion classification from text (Yadollahi et al., 2017; Sailunaz et al., 2018) plays an important role in affective computing research, and is essential to human-like interactive systems, such as emotional chatbots (Asghar et al., 2018; Zhou et al., 2018; Huang et al., 2018; Ghosal et al., 2019).

Early work treats this task as multi-class classification (Scherer and Wallbott, 1994; Mohammad, 2012), where each data instance (e.g., a sentence) is assumed to be labeled with one and only one emotion. More recently, researchers have relaxed this assumption and treat emotion analysis as multi-label classification (MLC, Mohammad et al., 2018; Demszky et al., 2020). In this case, each data instance may have one or multiple emotion labels. This is a more appropriate setting for emotion analysis, because an utterance may exhibit multiple emotions (e.g., "angry" and "sad", "surprise" and "joy").

The binary relevance approach (BR, Godbole and Sarawagi, 2004) is widely applied to multi-label emotion classification. BR predicts a binary indicator for each emotion individually, assuming that the emotions are independent given the input sentence. However, evidence in psychotherapy suggests strong correlation among different emotions (Plutchik, 1980). For example, "hate" may co-occur more often with "disgust" than with "joy."

An alternative approach to multi-label emotion classification is the classifier chain (CC, Read et al., 2009). CC predicts the label(s) of an input in an autoregressive manner, for example, by a sequence-to-sequence (Seq2Seq) model (Yang et al., 2018). However, Seq2Seq models are known to have the problem of exposure bias (Bengio et al., 2015), i.e., an error at early steps may affect future predictions.

In this work, we propose a sequence-to-emotion (Seq2Emo) approach, where we consider emotion correlations implicitly. Similar to CC, we also build a Seq2Seq-like model, but predict a binary indicator of an emotion at each decoding step of Seq2Seq. We do not feed predicted emotions back to the decoder; thus, our model does not suffer from the exposure bias problem. Compared with BR, our Seq2Emo model implicitly considers the correlation of emotions in the hidden states of the decoder, and with an attention mechanism, our Seq2Emo is able to focus on different words in the input sentence that are relevant to the current emotion.

We evaluate our model for multi-label emotion classification on the SemEval'18 (Mohammad et al., 2018) and GoEmotions (Demszky et al., 2020) benchmark datasets. Experiments show that Seq2Emo achieves state-of-the-art results on both datasets (without using external data). In particular, Seq2Emo outperforms both BR and CC in a fair, controlled comparison.

2 Related work

Emotion classification is an active research area in NLP. It classifies text instances into a set of emotion categories, e.g., angry, sad, happy, and surprise. Well-accepted emotion categorizations include the six basic emotions in Ekman (1984) and the eight primary emotions in Plutchik's wheel of emotions (1980).

Early work uses manually constructed emotion lexicons for the emotion classification task (Tokuhisa et al., 2008; Wen and Wan, 2014; Shahraki and Zaiane, 2017). Such lexicon resources include WordNet-Affect (Strapparava and Valitutti, 2004), EmoSenticNet (Poria et al., 2014), and the NRC Emotion Intensity Lexicon (Mohammad, 2018).

Distant supervision (Mintz et al., 2009) has been applied to emotion classification, as researchers find existing labeled datasets too small for training an emotion classifier. For example, Mohammad (2012) finds that social media users often use hashtags to express emotions, and thus certain hashtags can be directly regarded as the noisy label of an utterance. Likewise, Felbo et al. (2017) use emojis as noisy labels for emotion classification. Such distant supervision can also be applied to pretrain emotion-specific embeddings and language models (Tang et al., 2014; Ghosh et al., 2017).

In addition, Yu et al. (2018) apply multi-task learning to combine polarity sentiment analysis and multi-label emotion classification with dual attention.

Different from the above studies that use extra emotional resources, our work focuses on modeling the correlations among emotions. This improves multi-label emotion classification without using additional data. A similar paper to ours is the Sequence Generation Model (SGM, Yang et al., 2018). SGM accomplishes multi-label classification by an autoregressive Seq2Seq model, and is an adaptation of classifier chains (Read et al., 2009) in the neural network regime. Our paper models emotion correlation implicitly by decoder hidden states and does not suffer from the drawbacks of autoregressive models.

3 Methodology

Consider a multi-label emotion classification problem. Suppose we have $K$ predefined candidate emotions, and an utterance or a sentence $x$ can be assigned one or more emotions. We represent the target labels as $y = (y_1, \cdots, y_K) \in \{0, 1\}^K$, with $y_i = 1$ representing that the $i$th emotion is on.

Our Seq2Emo is a Seq2Seq-like framework, shown in Figure 1. It encodes $x$ with an LSTM, and iteratively performs binary classifications over $y$ with another LSTM as the decoder.

[Figure 1: Overview of the Seq2Emo model, showing the encoder, the attention mechanism, and the decoder predicting the emotions sad, angry, and happy.]

Encoder. We use a two-layer bi-directional LSTM to encode an utterance. Specifically, we use both token-level and contextual pretrained embeddings to represent a word in the sentence.

Formally, let a sentence be $x = (x_1, \cdots, x_M)$. We first encode each word $x_i$ with GloVe embeddings (Pennington et al., 2014), denoted by $\mathrm{GloVe}(x_i)$. We further use the ELMo contextual embeddings (Peters et al., 2018), which process the entire sentence $x$ by a pretrained LSTM. The corresponding hidden state is used as the embedding representation of a word $x_i$ in its context. This is denoted by $\mathrm{ELMo}(x)_i$.

We use a two-layer bi-directional LSTM on the above two embeddings. The forward LSTM, for example, has the form

$$\overrightarrow{h}^E_t = \overrightarrow{\mathrm{LSTM}}^E\big([\mathrm{GloVe}(x_t); \mathrm{ELMo}(x)_t],\ \overrightarrow{h}^E_{t-1}\big)$$

where the superscript $E$ denotes the encoder. Likewise, the backward LSTM yields the representation $\overleftarrow{h}^E_t$. They are concatenated as $h^E_t = [\overrightarrow{h}^E_t; \overleftarrow{h}^E_t]$.

Here, we use BiLSTM for simplicity, following Sanh et al. (2019) and Huang et al. (2019). Other pretrained models, such as the Transformer-based BERT (Devlin et al., 2019), may also be adopted. This, however, falls outside the scope of our paper, as we mainly focus on multi-label emotion classification. Empirical results on the GoEmotions dataset show that, by properly addressing multi-label classification, our model outperforms a Transformer-based model (Table 2).

Decoder. In Seq2Emo, an LSTM-based decoder is used to make sequential predictions on every candidate emotion. Suppose a predefined order of emotions is given, e.g., "angry," "joy," and "sad." The decoder will perform a binary classification over these emotions in sequence. The order, in fact, does not affect our model much, as it is the same for all training samples and can be easily learned. In addition, we feed a learnable emotion embedding as input at each step of the decoder. This enhances the decoder by explicitly indicating which emotion is being predicted at a step.

Different from a traditional Seq2Seq decoder, we do not feed previous predictions back as input, so as to avoid exposure bias. This also allows Seq2Emo to use a bi-directional LSTM as the decoder, which implicitly models the correlation among different emotions.

Without loss of generality, we explain the forward direction of the decoder LSTM, denoted by $\overrightarrow{\mathrm{LSTM}}^D$. The hidden state at step $j$ is given by

$$\overrightarrow{h}^D_j = \overrightarrow{\mathrm{LSTM}}^D\big([e_j; \overrightarrow{\tilde{h}}^D_{j-1}],\ \overrightarrow{h}^D_{j-1}\big) \tag{1}$$

where $e_j$ is the embedding for the $j$th emotion, and $\overrightarrow{\tilde{h}}^D_{j-1}$ is calculated by the attention mechanism in Luong et al. (2015).

Here, the attention mechanism dynamically aligns source words when predicting the specific target emotion at a decoding step. Let $\overrightarrow{\alpha}_{j,i}$ be the attention probability of the $j$th decoder step over the $i$th encoder step, computed by

$$\overrightarrow{s}_{j,i} = (\overrightarrow{h}^D_j)^\top\, \overrightarrow{W}_a\, h^E_i \tag{2}$$

$$\overrightarrow{\alpha}_{j,i} = \frac{\exp(\overrightarrow{s}_{j,i})}{\sum_{i'=1}^{M} \exp(\overrightarrow{s}_{j,i'})} \tag{3}$$

suited to multi-label classification than BR's individual predictions. Our Seq2Emo also differs from the classifier chain approach (CC, Read et al., 2009), which uses softmax to predict the next plausible emotion from all candidates. Thus, CC has to feed the previous predictions as input, and suffers from the exposure bias problem. By contrast, we predict the presence of all the emotions in sequence. Hence, feeding back previous predictions is not necessary, and this prevents the exposure bias. In this sense, our model combines the merits of both BR and CC.