
Improving Multi-label Emotion Classification via Sentiment Classification with Dual Attention Transfer Network

Jianfei Yu♣, Luís Marujo♥, Jing Jiang♣, Pradeep Karuturi♥, William Brendel♥
♣ School of Information Systems, Singapore Management University, Singapore
♥ Snap Inc., Venice, California, USA
♣ [email protected], [email protected]
♥ {luis.marujo, pradeep.karuturi, william.brendel}@snap.com

Abstract

In this paper, we aim to improve the performance of multi-label emotion classification with the help of sentiment classification. Specifically, we propose a new transfer learning architecture that divides the sentence representation into two different feature spaces, which are expected to respectively capture the general sentiment words and the other important emotion-specific words via a dual attention mechanism. Extensive experimental results demonstrate that our transfer learning approach can outperform several strong baselines and achieve the state-of-the-art performance on two benchmark datasets.

Table 1: Example Tweets from SemEval-18 Task 1.

ID   Tweet                                                              Emotion
T1   AI revolution, soon … is possible …, #fearless #good #goodness     …
T2   Shitty … is the worst ever …, #depressed #…                        …
T3   I am back lol. #…                                                  joy, …

1 Introduction

In recent years, the number of user-generated comments on social media platforms has grown exponentially. In particular, social platforms such as Twitter allow users to easily share their personal opinions, attitudes and emotions about any topic through short posts. Understanding people's emotions expressed in these short posts can facilitate many important downstream applications such as emotional chatbots (Zhou et al., 2018b), personalized recommendations, stock market prediction, policy studies, etc. Therefore, it is crucial to develop effective emotion detection models that can automatically identify emotions in these online posts.

In the literature, emotion detection is typically modeled as a supervised multi-label classification problem, because each sentence may contain one or more emotions from a standard emotion set containing anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. Table 1 shows three example sentences along with their emotion labels. Traditional approaches to emotion detection include lexicon-based methods (Wang and Pal, 2015), graphical model-based methods (Li et al., 2015b) and linear classifier-based methods (Quan et al., 2015; Li et al., 2015a). Given the recent success of deep learning models, various neural network models and advanced attention mechanisms have been proposed for this task and have achieved highly competitive results on several benchmark datasets (Wang et al., 2016; Abdul-Mageed and Ungar, 2017; Felbo et al., 2017; Baziotis et al., 2018; He and Xia, 2018; Kim et al., 2018).

However, these deep models rely heavily on large amounts of annotated data in order to learn a robust feature representation for multi-label emotion classification. In practice, large-scale datasets are usually not readily available and are costly to obtain, partly due to the ambiguity of many informal expressions in user-generated comments. Conversely, it is easier to find datasets (especially in English) associated with another closely related task: sentiment classification, which aims to classify the sentiment polarity of a given piece of text (i.e., positive, negative and neutral). We expect that these resources may allow us to improve sentiment-sensitive representations and thus more accurately identify emotions in social media posts. To achieve these goals, we propose an effective transfer learning (TL) approach in this paper.

Most existing TL methods either 1) assume that both the source and the target tasks share the same sentence representation (Mou et al., 2016), or 2) divide the representation of each sentence into a shared feature space and two task-specific feature spaces (Liu et al., 2017; Yu et al., 2018), as demonstrated by Fig. 1a and Fig. 1b.


Figure 1: Overview of Different Transfer Learning Models. a. Fully-Shared (FS); b. Private-Shared-Private (PSP); c. Shared-Private (SP); d. Dual Attention Transfer Network. (Each panel shows an LSTM sentence encoding layer over the source and target inputs x^s and x^t, attention weights, sentence embeddings, and the task outputs y^s and y^t.)

However, when applying these TL approaches to our scenario, the former approach may lead the learnt sentence representation to pay more attention to general sentiment words such as good but less attention to sentiment-ambiguous words like shock, which are also integral to emotion classification. The latter approach can capture both the sentiment and the emotion-specific words. However, some sentiment words only occur in the source sentiment classification task. These words tend to receive more attention in the source-specific feature space but less attention in the shared feature space, so they will be ignored in our emotion classification task. Intuitively, any sentiment word also indicates emotion and should not be ignored by our emotion classification task.

Therefore, we propose a shared-private (SP) model as shown in Fig. 1c, where we employ a shared LSTM layer to extract shared sentiment features for both the sentiment and emotion classification tasks, and a target-specific LSTM layer to extract specific emotion features that are only sensitive to our emotion classification task. However, as pointed out by Liu et al. (2017) and Yu et al. (2018), it is not guaranteed that such a simple model can well differentiate the two feature spaces to extract shared and target-specific features as we expect. Take the sentence T1 in Table 1 as an example. Both the shared and task-specific layers could assign higher attention weights to good and goodness due to their high frequencies in the training data but lower attention weights to fearless due to its rare occurrences. In this case, the SP model can only predict the joy emotion but ignores the optimism emotion. Hence, to enforce the orthogonality of the two feature spaces, we further introduce a dual attention mechanism, which feeds the attention weights in one feature space as extra inputs to compute those in the other feature space, and explicitly minimizes the similarity between the two sets of attention weights. Experimental results show that our dual attention transfer architecture can bring consistent performance gains in comparison with several existing transfer learning approaches, achieving the state-of-the-art performance on two benchmark datasets.

2 Methodology

2.1 Base Model for Emotion Classification

Given an input sentence, the goal of emotion analysis is to identify one or multiple emotions contained in it. Formally, let x = (w_1, w_2, ..., w_n) be the input sentence with n words, where w_j is a d-dimensional word vector for the j-th word in the vocabulary V, retrieved from a lookup table E ∈ R^{d×|V|}. Moreover, let E be a set of pre-defined emotion labels. Accordingly, for each x, our task is to predict whether it contains one or more emotions in E. We denote the output as e ∈ {0, 1}^K, where e_k ∈ {0, 1} denotes whether or not x contains the k-th emotion. We further assume that we have a set of labeled sentences, denoted by D^e = {x^{(i)}, e^{(i)}}_{i=1}^{N}.

Sentence Representation: We use the standard bi-directional Long Short Term Memory (Bi-LSTM) network to sequentially process each word in the input:

    \overrightarrow{h}_j = \mathrm{LSTM}(\overrightarrow{h}_{j-1}, x_j, \Theta_f),
    \overleftarrow{h}_j = \mathrm{LSTM}(\overleftarrow{h}_{j+1}, x_j, \Theta_b),

where \Theta_f and \Theta_b denote all the parameters in the forward and backward LSTM. Then, for each word x_j, its hidden state h_j ∈ R^d is generated by concatenating the two directional states, i.e., h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j].
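To make the sentence encoder concrete, the following is a minimal NumPy sketch of a Bi-LSTM that produces the concatenated hidden states h_j. It is an illustrative re-implementation, not the TensorFlow code used in our experiments; the gate packing and parameter shapes (W, U, b stacking the four gates) are assumptions.

import numpy as np

def lstm_step(h_prev, c_prev, x, W, U, b):
    # One LSTM step; W, U, b stack the input, forget, output and candidate gates.
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                # shape (4*d,)
    i = 1.0 / (1.0 + np.exp(-z[:d]))          # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2*d]))       # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*d:3*d]))     # output gate
    g = np.tanh(z[3*d:])                      # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm_encode(x_seq, params_f, params_b):
    # Run the word vectors forward and backward, then concatenate the
    # two directional states at each position: h_j = [h_fwd_j; h_bwd_j].
    d = params_f[2].shape[0] // 4
    h, c, fwd = np.zeros(d), np.zeros(d), []
    for x in x_seq:
        h, c = lstm_step(h, c, x, *params_f)
        fwd.append(h)
    h, c, bwd = np.zeros(d), np.zeros(d), []
    for x in reversed(x_seq):
        h, c = lstm_step(h, c, x, *params_b)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

For a sentence of n word vectors, bilstm_encode returns n hidden states whose dimensionality is twice the per-direction hidden size; in our experiments the hidden size is 200 (Section 3.1).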

For emotion classification, since emotion words are relatively more important for the final predictions, we adopt the widely used attention mechanism (Bahdanau et al., 2014) to select the key words for the sentence representation. Specifically, we first take the final hidden state h_n as a sentence summary vector z, and then obtain the attention weight α_j for each hidden state h_j as follows:

    u_j = v^{\top} \tanh(W_h h_j + W_z z),                       (1)
    \alpha_j = \frac{\exp(u_j)}{\sum_{l=1}^{n} \exp(u_l)},       (2)

where W_h, W_z ∈ R^{a×d} and v ∈ R^a are learnable parameters. The final sentence representation H is computed as:

    H = \sum_{j=1}^{n} \alpha_j h_j.

Output Layer: We first apply a Multilayer Perceptron (MLP) with one hidden layer on top of H, followed by normalization, to obtain the probability distribution over all of the emotion labels:

    p(e^{(i)} \mid H) = o^{(i)} = \mathrm{softmax}(\mathrm{MLP}(H)).

Then, we propose to minimize the KL divergence between our predicted probability distribution and the normalized ground-truth distribution as our objective function:

    L = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} e_k^{(i)} \big( \log e_k^{(i)} - \log o_k^{(i)} \big).

During the test stage, we select a threshold γ on the development set so that the emotions with scores higher than γ are predicted as 1.
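As an illustration of Eqs. (1)-(2), the output layer and the threshold-based prediction described above, the following NumPy sketch re-implements a forward pass under assumed shapes and random parameters; it is not the official implementation, and the example sizes (a = 64, d = 200, K = 11) are only for illustration.

import numpy as np

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def attention_pool(h_states, z, W_h, W_z, v):
    # Eqs. (1)-(2): score each hidden state against the summary vector z,
    # normalize the scores, and return the weighted sentence representation H.
    u = np.array([v @ np.tanh(W_h @ h + W_z @ z) for h in h_states])
    alpha = softmax(u)
    H = sum(a * h for a, h in zip(alpha, h_states))
    return H, alpha

def mlp(H, W1, b1, W2, b2):
    # Multilayer perceptron with one hidden layer, applied on top of H.
    return W2 @ np.tanh(W1 @ H + b1) + b2

# Toy forward pass with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
a, d, K, n = 64, 200, 11, 12
h_states = [rng.normal(size=d) for _ in range(n)]      # Bi-LSTM states h_1..h_n
z = h_states[-1]                                       # summary vector = final state h_n
W_h, W_z, v = rng.normal(size=(a, d)), rng.normal(size=(a, d)), rng.normal(size=a)
W1, b1 = 0.01 * rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = 0.01 * rng.normal(size=(K, d)), np.zeros(K)

H, alpha = attention_pool(h_states, z, W_h, W_z, v)
o = softmax(mlp(H, W1, b1, W2, b2))            # o^(i): distribution over the K emotions
e_pred = (o > 0.12).astype(int)                # gamma selected on the dev set (0.12 for E1, Section 3.1)

Training would then minimize the objective L above between o^{(i)} and the normalized ground-truth distribution; the sketch only shows inference.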

Figure 2: Dual Attention Transfer Network. (A shared attention-based Bi-LSTM encodes both inputs, e.g., the source input sentence "See Justin Bieber … So happy" and the target input sentence "AI revolution #fearless … #good #goodness", into the shared representation H_c with attention weights α^s; a target-specific Bi-LSTM produces H_t with attention weights α^t. A softmax layer predicts the source sentiment labels (negative, neutral, positive), and an MLP predicts the target emotion labels.)

2.2 Transfer Learning Architecture

Due to the limited amount of annotated data for multi-label emotion classification, we resort to sentiment classification and consider a transfer learning scenario. Let D^s = {x^{(m)}, y^{(m)}}_{m=1}^{M} be another set of labeled sentences for sentiment classification, where y^{(m)} is the ground-truth label indicating whether the m-th sentence is positive, negative or neutral.

2.2.1 Shared-Private (SP) Model

Intuitively, sentiment classification is a coarse-grained emotion analysis task, and can be fully leveraged to learn a more robust sentiment-sensitive representation. Therefore, we first use a shared attention-based Bi-LSTM layer to transform the input sentences in both tasks into a shared hidden representation H_c, and also employ another task-specific Bi-LSTM layer to get the target-specific hidden representation H_t. Next, we employ the following operations to map the hidden representations to the sentiment label y and the emotion label e:

    p(y^{(m)} \mid H_c) = \mathrm{softmax}(W^s H_c + b^s),
    p(e^{(i)} \mid H_c, H_t) = \mathrm{softmax}(\mathrm{MLP}([H_c; H_t])),

where W^s ∈ R^{d×3} and b^s ∈ R^3 are the parameters for the source sentiment classification task.

2.2.2 Proposed Dual Attention Transfer Network (DATN)

As we introduced before, the shared and target-specific feature spaces in the above SP model are expected to respectively capture the general sentiment words and the task-specific emotion words. However, without any constraints, the two feature spaces may both tend to pay more attention to frequently occurring and important sentiment words like great and happy, but less to those rarely occurring but crucial emotion words like anxiety and shock. Therefore, to encourage the two feature spaces to focus on sentiment words and emotion-specific words respectively, we propose using the attention weights computed from the shared layer as extra inputs to compute the attention weights of the target-specific layer. Specifically, as shown in Fig. 2, we first use Eq. 1 and Eq. 2 to compute the attention weights α^s in the shared layer, and then use the following equations to obtain the attention weights α^t in the target-specific layer:

    u_j^t = v^{t\top} \tanh(W_h^t h_j^t + w_{\alpha}^t \alpha_j^s + W_z^t z^t),
    \alpha_j^t = \frac{\exp(u_j^t)}{\sum_{l=1}^{n} \exp(u_l^t)}.

In addition, we introduce a similarity loss to explicitly enforce the difference between the two sets of attention weights by minimizing the cosine similarity between α^s and α^t. Finally, our combined objective function is defined as follows:

    J = -\frac{1}{M} \sum_{m=1}^{M} \log p(y_m \mid H_c) + L + \lambda \sum_{i=1}^{N} \mathrm{cos\_sim}(\alpha_i^s, \alpha_i^t),

where λ is a hyperparameter used to control the effect of the similarity loss.

2.2.3 Model Details

During the training stage, we adopted the widely used alternating optimization strategy, which iteratively samples one mini-batch from D^s to update only the parameters in the left (shared) part of our model, followed by sampling another mini-batch from D^e to update all the parameters in our model. It is also worth noting that in Fig. 2, we first obtain the shared attention weights α^s and feed them as extra inputs to compute α^t. In fact, to differentiate the attention weights in the two feature spaces, we can also first compute α^t, followed by computing α^s based on α^t. We refer to these two variants of our model as DATN-1 and DATN-2 respectively.

3 Experiments

3.1 Experiment Settings

Datasets: We conduct experiments on both English and Chinese.

For English, we employ a widely used Twitter dataset from SemEval 2016 Task 4A (Nakov et al., 2016) as our source sentiment classification task. For our target emotion classification task, we use the Twitter dataset recently released by SemEval 2018 Task 1C (Mohammad et al., 2018), which contains 11 emotions as shown in the top of Fig. 2. To tokenize the tweets in our dataset, we follow Owoputi et al. (2013) by adopting most of their preprocessing rules, except that we split each hashtag into '#' and its subsequent word.

For Chinese, we use a well-known Chinese blog dataset, Ren-CECps, from Quan and Ren (2010), which contains 1,487 documents with each sentence labeled with a sentiment label and 8 emotion labels: anger, expectation, anxiety, joy, love, hate, sorrow and surprise. Given the difficulty of finding a large-scale sentiment classification dataset specific to Chinese blogs, we simply divided the original dataset to form our source and target tasks [1]. The basic statistics of our two datasets are summarized in Table 2.

[1] The first 560/80/160 documents are used as the train/dev/test set for emotion classification, and the remaining 687 documents are used for sentiment classification.

Dataset           Train    Dev    Test    Words
E1 SemEval-18     6,838    886    3,259   32,557
S1 SemEval-16     28,631   -      -       40,439
E2 Ren-CECps-1    13,841   1,972  3,602   40,099
S2 Ren-CECps-2    15,199   -      -

Table 2: The number of sentences in each dataset.

Parameter Settings: The word embedding size d is set to 300 for E1 and 200 for E2, and the lookup table E is initialized with pre-trained GloVe word embeddings [2]. The hidden size and the number of LSTM layers in both datasets are set to 200 and 1. During training, Adam (Kingma and Ba, 2014) is used to schedule the learning rate, where the initial learning rate is set to 0.001. The dropout rate is set to 0.5. After tuning, λ is set to 0.05 for both datasets, and γ is set to 0.12 for E1 and 0.2 for E2. All the models are implemented with TensorFlow.

[2] https://nlp.stanford.edu/projects/glove/

Evaluation Metrics: We take the official code from SemEval-18 Task 1C and use accuracy and macro F1 score as the main metrics. For E2, we follow Zhou et al. (2018a) and use average precision (AP) and one-error (OE) as secondary metrics.
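Before turning to the results, the sketch below recaps the DATN-specific pieces from Sections 2.2.2 and 2.2.3 in the same NumPy style: the target-side attention takes the shared weights α^s as an extra input, and the combined objective adds a cosine-similarity penalty weighted by λ. Parameter shapes and the function interfaces are assumptions; this is not the released implementation.

import numpy as np

def softmax(u):
    u = u - u.max()
    e = np.exp(u)
    return e / e.sum()

def dual_attention(h_t_states, z_t, alpha_s, W_h_t, w_a_t, W_z_t, v_t):
    # Target-specific attention: the shared weight alpha_s_j enters the score
    # u_j^t for position j as an extra input (Section 2.2.2).
    u_t = np.array([
        v_t @ np.tanh(W_h_t @ h + w_a_t * a_s + W_z_t @ z_t)
        for h, a_s in zip(h_t_states, alpha_s)
    ])
    return softmax(u_t)

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def combined_objective(source_nll, emotion_loss, alpha_s, alpha_t, lam=0.05):
    # J = sentiment negative log-likelihood + emotion loss L
    #     + lambda * cosine similarity between the two attention vectors,
    # so minimizing J pushes alpha_s and alpha_t apart.
    return source_nll + emotion_loss + lam * cosine_sim(alpha_s, alpha_t)

Training alternates mini-batches as in Section 2.2.3: a sentiment batch updates only the shared part of the model, and an emotion batch updates all parameters; swapping which set of attention weights is computed first yields DATN-1 and DATN-2.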

Figure 3: Comparison of attention weights between Base and our DATN-2 model on a test sentence from SemEval-18 ("When you dread going to work early … but you always come back home happy ; smiling # goodday"). Base predicts joy and optimism, while DATN-2 predicts joy, optimism and love. Note that the ground truth emotion labels for this example are joy, optimism and love.
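For reference, the following is a compact sketch of the two headline metrics reported for E1 in Table 3, assuming that "accuracy" refers to the multi-label (Jaccard) accuracy used by the SemEval-2018 Task 1 evaluation and that macro F1 is the unweighted mean of per-label F1 scores; this is an illustrative re-implementation, not the official scoring script.

import numpy as np

def jaccard_accuracy(y_true, y_pred):
    # Per-example |intersection| / |union| of gold and predicted label sets,
    # averaged over examples (examples with an empty union count as 1.0).
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(scores.mean())

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-label F1 over the K emotion labels.
    f1s = []
    for k in range(y_true.shape[1]):
        tp = int(np.sum((y_true[:, k] == 1) & (y_pred[:, k] == 1)))
        fp = int(np.sum((y_true[:, k] == 0) & (y_pred[:, k] == 1)))
        fn = int(np.sum((y_true[:, k] == 1) & (y_pred[:, k] == 0)))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(f1s))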

3.2 Results

To better evaluate our proposed method, we employed the following systems for comparison: 1) Base, training our base model in Section 2.1 only on D^e; 2) FT (Fine-Tuning), using D^s to pre-train the whole model, followed by using D^e to fine-tune the model parameters; 3) FS, the Fully-Shared framework of Mou et al. (2016) as shown in Fig. 1a; 4) PSP and APSP, the Private-Shared-Private framework and its extension with adversarial losses by Liu et al. (2017) as shown in Fig. 1b; 5) SP, DATN-1 and DATN-2, the Shared-Private model and the two variants of our Dual Attention Transfer Network as shown in Fig. 1c and Fig. 1d.

Methods      S1 → E1            S2 → E2
             ACC↑    F1↑        ACC↑    F1↑     AP↑     OE↓
Base         0.569   0.521      0.368   0.399   0.648   0.531
FT           0.575   0.519      0.372   0.398   0.655   0.519
FS           0.577   0.526      0.386   0.403   0.662   0.507
PSP          0.579   0.531      0.384   0.405   0.658   0.517
APSP         0.580   0.540      0.389   0.399   0.670   0.499
SP           0.577   0.532      0.389   0.410   0.667   0.507
DATN-1       0.582   0.543      0.393   0.410   0.670   0.501
DATN-2       0.583   0.544      0.400   0.420   0.674   0.498
Rank 2       0.582   0.534      -       0.392   0.641   0.523
Rank 1       0.595   0.542      -       0.416   0.680   0.455
DATN-2*      0.597   0.551      -       -       -       -
Base†        -       -          0.445   0.426   0.725   0.425
DATN-2†      -       -          0.457   0.444   0.732   0.415

Table 3: The results of different transfer learning methods by averaging ten runs (top) and the comparison between our best model and the state-of-the-art systems (bottom). DATN-2* indicates the ensemble results of ten runs. Base† and DATN-2† denote the average results of conducting ten-fold cross validation on the whole dataset for fair comparison; for the source and target tasks in DATN-2†, we use the same training data. For E1, Rank 1 and Rank 2 are the top two systems from the official leaderboard; for E2, Rank 1 and Rank 2 are from (Zhou et al., 2016, 2018a).

In Table 3, we report the comparison results between our method and the baseline systems. It can be easily observed that 1) for transfer learning, although the performance of SP is similar to or even lower than that of some baseline systems, our proposed dual attention models, i.e., DATN-1 and DATN-2, generally boost SP to achieve the best results. To investigate the significance of the improvements, we combine each model's predictions of all emotion labels, treat them as a single label, and then perform McNemar's significance tests (Gillick and Cox, 1989). We verify that for English, DATN-1 is significantly better than Base, FT, FS and SP, while DATN-2 is significantly better than all the methods except APSP; for Chinese, DATN-1 and DATN-2 are significantly better than all the compared methods. 2) Even compared with the state-of-the-art systems on E1, which also employ other external resources, including affective embeddings, emotion lexicons and sentiment classification datasets (Baziotis et al., 2018), the ensemble results of DATN-2 achieve slightly better performance; in addition, it is clear that our model obtains the best performance on E2.

Furthermore, to obtain a better understanding of the advantage of our method, we choose one sentence from the test set of E1 and visualize the attention weights obtained by Base and DATN-2 in Fig. 3. We can see that Base pays more attention to the frequent emotion words while ignoring the less frequent but important emoji, and thus fails to predict the love emotion implied by the emoji. In contrast, with the proposed dual attention mechanism, DATN-2 makes correct predictions since it can respectively capture the general sentiment words and the emotion-specific emojis.

4 Conclusion

In this paper, we proposed a dual attention-based transfer learning approach that leverages sentiment classification to improve the performance of multi-label emotion classification. Using two benchmark datasets, we show the effectiveness of the proposed transfer learning method.

Acknowledgments

We would like to thank Aletta Hiemstra, Andrés Monroy-Hernández and the three anonymous reviewers for their valuable comments.

References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Christos Baziotis, Nikos Athanasiou, Alexandra Chronopoulou, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Shrikanth Narayanan, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 Task 1: Predicting affective content in tweets with deep attentive RNNs and transfer learning. arXiv preprint arXiv:1804.06658.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Laurence Gillick and Stephen J. Cox. 1989. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech, and Signal Processing.

Huihui He and Rui Xia. 2018. Joint binary neural network for multi-label learning with applications to emotion classification. arXiv preprint arXiv:1802.00891.

Yanghoon Kim, Hwanhee Lee, and Kyomin Jung. 2018. AttnConvNet at SemEval-2018 Task 1: Attention-based convolutional neural networks for multi-label emotion classification. arXiv preprint arXiv:1804.00831.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li Li, Houfeng Wang, Xu Sun, Baobao Chang, Shi Zhao, and Lei Sha. 2015a. Multi-label text categorization with joint learning predictions-as-features method. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Shoushan Li, Lei Huang, Rong Wang, and Guodong Zhou. 2015b. Sentence-level emotion classification with label and context dependence. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Saif M. Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In SemEval.

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment analysis in Twitter. In SemEval.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL.

Changqin Quan and Fuji Ren. 2010. Sentence emotion analysis and recognition based on emotion words using Ren-CECps. International Journal of Advanced Intelligence.

Xiaojun Quan, Qifan Wang, Ying Zhang, Luo Si, and Liu Wenyin. 2015. Latent discriminative models for social emotion detection with emotional dependency. ACM Transactions on Information Systems (TOIS), 34(1):2.

Yaqi Wang, Shi Feng, Daling Wang, Ge Yu, and Yifei Zhang. 2016. Multi-label Chinese microblog emotion classification via convolutional neural network. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data.

Yichen Wang and Aditya Pal. 2015. Detecting emotions in social media: A constrained optimization approach. In Proceedings of the International Joint Conference on Artificial Intelligence.

Jianfei Yu, Minghui Qiu, Jing Jiang, Jun Huang, Shuangyong Song, Wei Chu, and Haiqing Chen. 2018. Modelling domain relationships for transfer learning on retrieval-based question answering systems in E-commerce. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining.

Deyu Zhou, Yang Yang, and Yulan He. 2018a. Relevant emotion ranking from text constrained with emotion relationships. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Deyu Zhou, Xuan Zhang, Yin Zhou, Quan Zhao, and Xin Geng. 2016. Emotion distribution learning from texts. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018b. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence.
