
Improving Multi-label Emotion Classification via Sentiment Classification with Dual Attention Transfer Network

Jianfei Yu♣, Luís Marujo♥, Jing Jiang♣, Pradeep Karuturi♥, William Brendel♥
♣ School of Information Systems, Singapore Management University, Singapore
♥ Snap Inc., Venice, California, USA
♣ [email protected], [email protected]
♥ {luis.marujo, pradeep.karuturi, william.brendel}@snap.com

Abstract

In this paper, we aim to improve the performance of multi-label emotion classification with the help of sentiment classification. Specifically, we propose a new transfer learning architecture that divides the sentence representation into two different feature spaces, which are expected to respectively capture the general sentiment words and the other important emotion-specific words via a dual attention mechanism. Extensive experimental results demonstrate that our transfer learning approach can outperform several strong baselines and achieve the state-of-the-art performance on two benchmark datasets.

Table 1: Example Tweets from SemEval-18 Task 1.

ID   Tweet                                                              Emotion
T1   AI revolution, soon … is possible …, #fearless #good #goodness     …
T2   Shitty … is the worst ever …, #depressed #…                        …
T3   I am back lol. #…                                                  joy, …

1 Introduction

In recent years, the number of user-generated comments on social media platforms has grown exponentially. In particular, social platforms such as Twitter allow users to easily share their personal opinions, attitudes and emotions about any topic through short posts. Understanding people's emotions expressed in these short posts can facilitate many important downstream applications such as emotional chatbots (Zhou et al., 2018b), personalized recommendations, stock market prediction, policy studies, etc. Therefore, it is crucial to develop effective emotion detection models that can automatically identify emotions in these online posts.

In the literature, emotion detection is typically modeled as a supervised multi-label classification problem, because each sentence may contain one or more emotions from a standard emotion set containing anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. Table 1 shows three example sentences along with their emotion labels. Traditional approaches to emotion detection include lexicon-based methods (Wang and Pal, 2015), graphical model-based methods (Li et al., 2015b) and linear classifier-based methods (Quan et al., 2015; Li et al., 2015a). Given the recent success of deep learning models, various neural network models and advanced attention mechanisms have been proposed for this task and have achieved highly competitive results on several benchmark datasets (Wang et al., 2016; Abdul-Mageed and Ungar, 2017; Felbo et al., 2017; Baziotis et al., 2018; He and Xia, 2018; Kim et al., 2018).

However, these deep models rely heavily on large amounts of annotated data in order to learn a robust feature representation for multi-label emotion classification. In practice, large-scale datasets are usually not readily available and are costly to obtain, partly due to the ambiguity of many informal expressions in user-generated comments. Conversely, it is easier to find datasets (especially in English) associated with another closely related task: sentiment classification, which aims to classify the sentiment polarity of a given piece of text (i.e., positive, negative and neutral). We expect that these resources may allow us to improve sentiment-sensitive representations and thus more accurately identify emotions in social media posts. To achieve these goals, we propose an effective transfer learning (TL) approach in this paper.

Most existing TL methods either 1) assume that both the source and the target tasks share the same sentence representation (Mou et al., 2016), or 2) divide the representation of each sentence into a shared feature space and two task-specific feature spaces (Liu et al., 2017; Yu et al., 2018), as demonstrated by Fig. 1a and Fig. 1b.


Figure 1: Overview of Different Transfer Learning Models. a. Fully-Shared (FS); b. Private-Shared-Private (PSP); c. Shared-Private (SP); d. Dual Attention Transfer Network. (Each panel shows an LSTM sentence encoding layer over the source and target inputs x^s and x^t, attention weights, sentence embeddings, and the task outputs y^s and y^t.)

However, when applying these TL approaches to our scenario, the former approach may lead the learnt sentence representation to pay more attention to general sentiment words such as good but less attention to sentiment-ambiguous words like shock, which are also integral to emotion classification. The latter approach can capture both the sentiment and the emotion-specific words. However, some sentiment words only occur in the source sentiment classification task. These words tend to receive more attention in the source-specific feature space but less attention in the shared feature space, so they will be ignored in our emotion classification task. Intuitively, any sentiment word also indicates emotion and should not be ignored by our emotion classification task.

Therefore, we propose a shared-private (SP) model as shown in Fig. 1c, where we employ a shared LSTM layer to extract shared sentiment features for both the sentiment and emotion classification tasks, and a target-specific LSTM layer to extract specific emotion features that are only sensitive to our emotion classification task. However, as pointed out by Liu et al. (2017) and Yu et al. (2018), it is not guaranteed that such a simple model can well differentiate the two feature spaces to extract shared and target-specific features as we expect. Take the sentence T1 in Table 1 as an example. Both the shared and task-specific layers could assign higher attention weights to good and goodness due to their high frequencies in the training data but lower attention weights to fearless due to its rare occurrences. In this case, the SP model can only predict the joy emotion but ignores the optimism emotion. Hence, to enforce the orthogonality of the two feature spaces, we further introduce a dual attention mechanism, which feeds the attention weights in one feature space as extra inputs to compute those in the other feature space, and explicitly minimizes the similarity between the two sets of attention weights. Experimental results show that our dual attention transfer architecture can bring consistent performance gains in comparison with several existing transfer learning approaches, achieving the state-of-the-art performance on two benchmark datasets.

2 Methodology

2.1 Base Model for Emotion Classification

Given an input sentence, the goal of emotion analysis is to identify one or multiple emotions contained in it. Formally, let x = (w_1, w_2, ..., w_n) be the input sentence with n words, where w_j is a d-dimensional word vector for the j-th word in the vocabulary V, retrieved from a lookup table E ∈ R^{d×|V|}. Moreover, let E be a set of pre-defined emotion labels. Accordingly, for each x, our task is to predict whether it contains one or more emotions in E. We denote the output as e ∈ {0, 1}^K, where e_k ∈ {0, 1} denotes whether or not x contains the k-th emotion. We further assume that we have a set of labeled sentences, denoted by D^e = {x^{(i)}, e^{(i)}}_{i=1}^{N}.

Sentence Representation: We use the standard bi-directional Long Short Term Memory (Bi-LSTM) network to sequentially process each word in the input:

    \overrightarrow{h}_j = \mathrm{LSTM}(\overrightarrow{h}_{j-1}, x_j, \Theta_f),
    \overleftarrow{h}_j = \mathrm{LSTM}(\overleftarrow{h}_{j+1}, x_j, \Theta_b),

where \Theta_f and \Theta_b denote all the parameters in the forward and backward LSTM. Then, for each word x_j, its hidden state h_j ∈ R^d is generated by concatenating the two directional states, i.e., h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j].
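To make the sentence encoder concrete, the following is a minimal NumPy sketch of a Bi-LSTM that produces the concatenated hidden states h_j. It is an illustrative re-implementation, not the TensorFlow code used in our experiments; the gate packing and parameter shapes (W, U, b stacking the four gates) are assumptions.

import numpy as np

def lstm_step(h_prev, c_prev, x, W, U, b):
    # One LSTM step; W, U, b stack the input, forget, output and candidate gates.
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                # shape (4*d,)
    i = 1.0 / (1.0 + np.exp(-z[:d]))          # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2*d]))       # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*d:3*d]))     # output gate
    g = np.tanh(z[3*d:])                      # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm_encode(x_seq, params_f, params_b):
    # Run the word vectors forward and backward, then concatenate the
    # two directional states at each position: h_j = [h_fwd_j; h_bwd_j].
    d = params_f[2].shape[0] // 4
    h, c, fwd = np.zeros(d), np.zeros(d), []
    for x in x_seq:
        h, c = lstm_step(h, c, x, *params_f)
        fwd.append(h)
    h, c, bwd = np.zeros(d), np.zeros(d), []
    for x in reversed(x_seq):
        h, c = lstm_step(h, c, x, *params_b)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

For a sentence of n word vectors, bilstm_encode returns n hidden states whose dimensionality is twice the per-direction hidden size; in our experiments the hidden size is 200 (Section 3.1).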

For emotion classification, since emotion words are relatively more important for the final predictions, we adopt the widely used attention mechanism (Bahdanau et al., 2014) to select the key words for the sentence representation. Specifically, we first take the final hidden state h_n as a sentence summary vector z, and then obtain the attention weight α_j for each hidden state h_j as follows:

    u_j = v^{\top} \tanh(W_h h_j + W_z z),                       (1)
    \alpha_j = \frac{\exp(u_j)}{\sum_{l=1}^{n} \exp(u_l)},       (2)

where W_h, W_z ∈ R^{a×d} and v ∈ R^a are learnable parameters. The final sentence representation H is computed as:

    H = \sum_{j=1}^{n} \alpha_j h_j.

Output Layer: We first apply a Multilayer Perceptron (MLP) with one hidden layer on top of H, followed by normalization, to obtain the probability distribution over all of the emotion labels:

    p(e^{(i)} \mid H) = o^{(i)} = \mathrm{softmax}(\mathrm{MLP}(H)).

Then, we propose to minimize the KL divergence between our predicted probability distribution and the normalized ground-truth distribution as our objective function:

    L = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} e_k^{(i)} \big( \log e_k^{(i)} - \log o_k^{(i)} \big).

During the test stage, we select a threshold γ on the development set so that the emotions with scores higher than γ are predicted as 1.
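As an illustration of Eqs. (1)-(2), the output layer and the threshold-based prediction described above, the following NumPy sketch re-implements a forward pass under assumed shapes and random parameters; it is not the official implementation, and the example sizes (a = 64, d = 200, K = 11) are only for illustration.

import numpy as np

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def attention_pool(h_states, z, W_h, W_z, v):
    # Eqs. (1)-(2): score each hidden state against the summary vector z,
    # normalize the scores, and return the weighted sentence representation H.
    u = np.array([v @ np.tanh(W_h @ h + W_z @ z) for h in h_states])
    alpha = softmax(u)
    H = sum(a * h for a, h in zip(alpha, h_states))
    return H, alpha

def mlp(H, W1, b1, W2, b2):
    # Multilayer perceptron with one hidden layer, applied on top of H.
    return W2 @ np.tanh(W1 @ H + b1) + b2

# Toy forward pass with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
a, d, K, n = 64, 200, 11, 12
h_states = [rng.normal(size=d) for _ in range(n)]      # Bi-LSTM states h_1..h_n
z = h_states[-1]                                       # summary vector = final state h_n
W_h, W_z, v = rng.normal(size=(a, d)), rng.normal(size=(a, d)), rng.normal(size=a)
W1, b1 = 0.01 * rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = 0.01 * rng.normal(size=(K, d)), np.zeros(K)

H, alpha = attention_pool(h_states, z, W_h, W_z, v)
o = softmax(mlp(H, W1, b1, W2, b2))            # o^(i): distribution over the K emotions
e_pred = (o > 0.12).astype(int)                # gamma selected on the dev set (0.12 for E1, Section 3.1)

Training would then minimize the objective L above between o^{(i)} and the normalized ground-truth distribution; the sketch only shows inference.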

Figure 2: Dual Attention Transfer Network. (A shared attention-based Bi-LSTM encodes both inputs, e.g., the source input sentence "See Justin Bieber … So happy" and the target input sentence "AI revolution #fearless … #good #goodness", into the shared representation H_c with attention weights α^s; a target-specific Bi-LSTM produces H_t with attention weights α^t. A softmax layer predicts the source sentiment labels (negative, neutral, positive), and an MLP predicts the target emotion labels.)

2.2 Transfer Learning Architecture

Due to the limited amount of annotated data for multi-label emotion classification, we resort to sentiment classification and consider a transfer learning scenario. Let D^s = {x^{(m)}, y^{(m)}}_{m=1}^{M} be another set of labeled sentences for sentiment classification, where y^{(m)} is the ground-truth label indicating whether the m-th sentence is positive, negative or neutral.

2.2.1 Shared-Private (SP) Model

Intuitively, sentiment classification is a coarse-grained emotion analysis task, and can be fully leveraged to learn a more robust sentiment-sensitive representation. Therefore, we first use a shared attention-based Bi-LSTM layer to transform the input sentences in both tasks into a shared hidden representation H_c, and also employ another task-specific Bi-LSTM layer to get the target-specific hidden representation H_t. Next, we employ the following operations to map the hidden representations to the sentiment label y and the emotion label e:

    p(y^{(m)} \mid H_c) = \mathrm{softmax}(W^s H_c + b^s),
    p(e^{(i)} \mid H_c, H_t) = \mathrm{softmax}(\mathrm{MLP}([H_c; H_t])),

where W^s ∈ R^{d×3} and b^s ∈ R^3 are the parameters for the source sentiment classification task.

2.2.2 Proposed Dual Attention Transfer Network (DATN)

As we introduced before, the shared and target-specific feature spaces in the above SP model are expected to respectively capture the general sentiment words and the task-specific emotion words. However, without any constraints, the two feature spaces may both tend to pay more attention to frequently occurring and important sentiment words like great and happy, but less to those rarely occurring but crucial emotion words like anxiety and shock. Therefore, to encourage the two feature spaces to focus on sentiment words and emotion-specific words respectively, we propose using the attention weights computed from the shared layer as extra inputs to compute the attention weights of the target-specific layer. Specifically, as shown in Fig. 2, we first use Eq. 1 and Eq. 2 to compute the attention weights α^s in the shared layer, and then use the following equations to obtain the attention weights α^t in the target-specific layer:

    u_j^t = v^{t\top} \tanh(W_h^t h_j^t + w_{\alpha}^t \alpha_j^s + W_z^t z^t),
    \alpha_j^t = \frac{\exp(u_j^t)}{\sum_{l=1}^{n} \exp(u_l^t)}.

In addition, we introduce a similarity loss to explicitly enforce the difference between the two sets of attention weights by minimizing the cosine similarity between α^s and α^t. Finally, our combined objective function is defined as follows:

    J = -\frac{1}{M} \sum_{m=1}^{M} \log p(y_m \mid H_c) + L + \lambda \sum_{i=1}^{N} \mathrm{cos\_sim}(\alpha_i^s, \alpha_i^t),

where λ is a hyperparameter used to control the effect of the similarity loss.

2.2.3 Model Details

During the training stage, we adopted the widely used alternating optimization strategy, which iteratively samples one mini-batch from D^s to update only the parameters in the left (shared) part of our model, followed by sampling another mini-batch from D^e to update all the parameters in our model. It is also worth noting that in Fig. 2, we first obtain the shared attention weights α^s and feed them as extra inputs to compute α^t. In fact, to differentiate the attention weights in the two feature spaces, we can also first compute α^t, followed by computing α^s based on α^t. We refer to these two variants of our model as DATN-1 and DATN-2 respectively.

3 Experiments

3.1 Experiment Settings

Datasets: We conduct experiments on both English and Chinese.

For English, we employ a widely used Twitter dataset from SemEval 2016 Task 4A (Nakov et al., 2016) as our source sentiment classification task. For our target emotion classification task, we use the Twitter dataset recently released by SemEval 2018 Task 1C (Mohammad et al., 2018), which contains 11 emotions as shown in the top of Fig. 2. To tokenize the tweets in our dataset, we follow Owoputi et al. (2013) by adopting most of their preprocessing rules, except that we split each hashtag into '#' and its subsequent word.

For Chinese, we use a well-known Chinese blog dataset, Ren-CECps, from Quan and Ren (2010), which contains 1,487 documents with each sentence labeled with a sentiment label and 8 emotion labels: anger, expectation, anxiety, joy, love, hate, sorrow and surprise. Given the difficulty of finding a large-scale sentiment classification dataset specific to Chinese blogs, we simply divided the original dataset to form our source and target tasks [1]. The basic statistics of our two datasets are summarized in Table 2.

[1] The first 560/80/160 documents are used as the train/dev/test set for emotion classification, and the remaining 687 documents are used for sentiment classification.

Dataset           Train    Dev    Test    Words
E1 SemEval-18     6,838    886    3,259   32,557
S1 SemEval-16     28,631   -      -       40,439
E2 Ren-CECps-1    13,841   1,972  3,602   40,099
S2 Ren-CECps-2    15,199   -      -

Table 2: The number of sentences in each dataset.

Parameter Settings: The word embedding size d is set to 300 for E1 and 200 for E2, and the lookup table E is initialized with pre-trained GloVe word embeddings [2]. The hidden size and the number of LSTM layers in both datasets are set to 200 and 1. During training, Adam (Kingma and Ba, 2014) is used to schedule the learning rate, where the initial learning rate is set to 0.001. The dropout rate is set to 0.5. After tuning, λ is set to 0.05 for both datasets, and γ is set to 0.12 for E1 and 0.2 for E2. All the models are implemented with TensorFlow.

[2] https://nlp.stanford.edu/projects/glove/

Evaluation Metrics: We take the official code from SemEval-18 Task 1C and use accuracy and macro F1 score as the main metrics. For E2, we follow Zhou et al. (2018a) and use average precision (AP) and one-error (OE) as secondary metrics.
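Before turning to the results, the sketch below recaps the DATN-specific pieces from Sections 2.2.2 and 2.2.3 in the same NumPy style: the target-side attention takes the shared weights α^s as an extra input, and the combined objective adds a cosine-similarity penalty weighted by λ. Parameter shapes and the function interfaces are assumptions; this is not the released implementation.

import numpy as np

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def dual_attention(h_t_states, z_t, alpha_s, W_h_t, w_a_t, W_z_t, v_t):
    # Target-specific attention: the shared weight alpha_s_j enters the score
    # u_j^t for position j as an extra input (Section 2.2.2).
    u_t = np.array([
        v_t @ np.tanh(W_h_t @ h + w_a_t * a_s + W_z_t @ z_t)
        for h, a_s in zip(h_t_states, alpha_s)
    ])
    return softmax(u_t)

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def combined_objective(source_nll, emotion_loss, alpha_s, alpha_t, lam=0.05):
    # J = sentiment negative log-likelihood + emotion loss L
    #     + lambda * cosine similarity between the two attention vectors,
    # so minimizing J pushes alpha_s and alpha_t apart.
    return source_nll + emotion_loss + lam * cosine_sim(alpha_s, alpha_t)

Training alternates mini-batches as in Section 2.2.3: a sentiment batch updates only the shared part of the model, and an emotion batch updates all parameters; swapping which set of attention weights is computed first yields DATN-1 and DATN-2.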

Figure 3: Comparison of attention weights between Base and our DATN-2 model on a test sentence from SemEval-18 ("When you dread going to work early … but you always come back home happy ; smiling # goodday"). Base predicts joy and optimism, while DATN-2 predicts joy, optimism and love. Note that the ground truth emotion labels for this example are joy, optimism and love.
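For reference, the following is a compact sketch of the two headline metrics reported for E1 in Table 3, assuming that "accuracy" refers to the multi-label (Jaccard) accuracy used by the SemEval-2018 Task 1 evaluation and that macro F1 is the unweighted mean of per-label F1 scores; this is an illustrative re-implementation, not the official scoring script.

import numpy as np

def jaccard_accuracy(y_true, y_pred):
    # Per-example |intersection| / |union| of gold and predicted label sets,
    # averaged over examples (examples with an empty union count as 1.0).
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(scores.mean())

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-label F1 over the K emotion labels.
    f1s = []
    for k in range(y_true.shape[1]):
        tp = int(np.sum((y_true[:, k] == 1) & (y_pred[:, k] == 1)))
        fp = int(np.sum((y_true[:, k] == 0) & (y_pred[:, k] == 1)))
        fn = int(np.sum((y_true[:, k] == 1) & (y_pred[:, k] == 0)))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))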

3.2 Results

To better evaluate our proposed method, we employed the following systems for comparison: 1) Base, training our base model in Section 2.1 only on D^e; 2) FT (Fine-Tuning), using D^s to pre-train the whole model, followed by using D^e to fine-tune the model parameters; 3) FS, the Fully-Shared framework of Mou et al. (2016) as shown in Fig. 1a; 4) PSP and APSP, the Private-Shared-Private framework and its extension with adversarial losses by Liu et al. (2017) as shown in Fig. 1b; 5) SP, DATN-1 and DATN-2, the Shared-Private model and the two variants of our Dual Attention Transfer Network as shown in Fig. 1c and Fig. 1d.

Methods      S1 → E1            S2 → E2
             ACC↑    F1↑        ACC↑    F1↑     AP↑     OE↓
Base         0.569   0.521      0.368   0.399   0.648   0.531
FT           0.575   0.519      0.372   0.398   0.655   0.519
FS           0.577   0.526      0.386   0.403   0.662   0.507
PSP          0.579   0.531      0.384   0.405   0.658   0.517
APSP         0.580   0.540      0.389   0.399   0.670   0.499
SP           0.577   0.532      0.389   0.410   0.667   0.507
DATN-1       0.582   0.543      0.393   0.410   0.670   0.501
DATN-2       0.583   0.544      0.400   0.420   0.674   0.498
Rank 2       0.582   0.534      -       0.392   0.641   0.523
Rank 1       0.595   0.542      -       0.416   0.680   0.455
DATN-2*      0.597   0.551      -       -       -       -
Base†        -       -          0.445   0.426   0.725   0.425
DATN-2†      -       -          0.457   0.444   0.732   0.415

Table 3: The results of different transfer learning methods by averaging ten runs (top) and the comparison between our best model and the state-of-the-art systems (bottom). DATN-2* indicates the ensemble results of ten runs. Base† and DATN-2† denote the average results of conducting ten-fold cross validation on the whole dataset for fair comparison; for the source and target tasks in DATN-2†, we use the same training data. For E1, Rank 1 and Rank 2 are the top two systems from the official leaderboard; for E2, Rank 1 and Rank 2 are from (Zhou et al., 2016, 2018a).

In Table 3, we report the comparison results between our method and the baseline systems. It can be easily observed that 1) for transfer learning, although the performance of SP is similar to or even lower than that of some baseline systems, our proposed dual attention models, i.e., DATN-1 and DATN-2, generally boost SP to achieve the best results. To investigate the significance of the improvements, we combine each model's predictions of all emotion labels, treat them as a single label, and then perform McNemar's significance tests (Gillick and Cox, 1989). We verify that for English, DATN-1 is significantly better than Base, FT, FS and SP, while DATN-2 is significantly better than all the methods except APSP; for Chinese, DATN-1 and DATN-2 are significantly better than all the compared methods. 2) Even compared with the state-of-the-art systems on E1, which also employ other external resources, including affective embeddings, emotion lexicons and sentiment classification datasets (Baziotis et al., 2018), the ensemble results of DATN-2 achieve slightly better performance; in addition, it is clear that our model obtains the best performance on E2.

Furthermore, to obtain a better understanding of the advantage of our method, we choose one sentence from the test set of E1 and visualize the attention weights obtained by Base and DATN-2 in Fig. 3. We can see that Base pays more attention to the frequent emotion words while ignoring the less frequent but important emoji, and thus fails to predict the love emotion implied by the emoji. In contrast, with the proposed dual attention mechanism, DATN-2 makes correct predictions since it can respectively capture the general sentiment words and the emotion-specific emojis.

4 Conclusion

In this paper, we proposed a dual attention-based transfer learning approach that leverages sentiment classification to improve the performance of multi-label emotion classification. Using two benchmark datasets, we show the effectiveness of the proposed transfer learning method.

Acknowledgments

We would like to thank Aletta Hiemstra, Andrés Monroy-Hernández and the three anonymous reviewers for their valuable comments.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Christos Baziotis, Nikos Athanasiou, Alexandra Chronopoulou, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Shrikanth Narayanan, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 1: Predicting affective content in tweets with deep attentive RNNs and transfer learning. arXiv preprint arXiv:1804.06658.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Laurence Gillick and Stephen J. Cox. 1989. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech, and Signal Processing.

Huihui He and Rui Xia. 2018. Joint binary neural network for multi-label learning with applications to emotion classification. arXiv preprint arXiv:1802.00891.

Yanghoon Kim, Hwanhee Lee, and Kyomin Jung. 2018. AttnConvNet at SemEval-2018 Task 1: Attention-based convolutional neural networks for multi-label emotion classification. arXiv preprint arXiv:1804.00831.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li Li, Houfeng Wang, Xu Sun, Baobao Chang, Shi Zhao, and Lei Sha. 2015a. Multi-label text categorization with joint learning predictions-as-features method. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Shoushan Li, Lei Huang, Rong Wang, and Guodong Zhou. 2015b. Sentence-level emotion classification with label and context dependence. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In SemEval.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment analysis in Twitter. In SemEval.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL.

Changqin Quan and Fuji Ren. 2010. Sentence emotion analysis and recognition based on emotion words using Ren-CECps. International Journal of Advanced Intelligence.

Xiaojun Quan, Qifan Wang, Ying Zhang, Luo Si, and Liu Wenyin. 2015. Latent discriminative models for social emotion detection with emotional dependency. ACM Transactions on Information Systems (TOIS), 34(1):2.

Yaqi Wang, Shi Feng, Daling Wang, Ge Yu, and Yifei Zhang. 2016. Multi-label Chinese microblog emotion classification via convolutional neural network. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data.

Yichen Wang and Aditya Pal. 2015. Detecting emotions in social media: A constrained optimization approach. In Proceedings of the International Joint Conference on Artificial Intelligence.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in E-commerce. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining.

Deyu Zhou, Yang Yang, and Yulan He. 2018a. Relevant emotion ranking from text constrained with emotion relationships. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Deyu Zhou, Xuan Zhang, Yin Zhou, Quan Zhao, and Xin Geng. 2016. Emotion distribution learning from texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018b. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence.
