
EmoGraph: Capturing Emotion Correlations using Graph Networks

Peng Xu∗, Zihan Liu∗, Genta Indra Winata, Zhaojiang Lin, Pascale Fung
Center for Artificial Intelligence Research (CAiRE)
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{pxuab, zliucr, giwinata, zlinao}@connect.ust.hk

∗ Equal contributions.

Abstract

Most methods tackle the emotion understanding task by considering individual emotions independently while ignoring their fuzziness nature and the interconnections among them. In this paper, we explore how emotion correlations can be captured and how they can help different classification tasks. We propose EmoGraph, which captures the dependencies among different emotions through graph networks. These graphs are constructed by leveraging the co-occurrence statistics among different emotion categories. Empirical results on two multi-label classification datasets demonstrate that EmoGraph outperforms strong baselines, especially for macro-F1. An additional experiment illustrates that the captured emotion correlations can also benefit a single-label classification task.

1 Introduction

Understanding human emotions is considered as the key to building engaging dialogue systems (Zhou et al., 2018). However, most works on emotion understanding tasks treat individual emotions independently while ignoring their fuzziness nature and the interconnections among them. A psychoevolutionary theory proposed by Plutchik (1984) shows that different emotions are actually correlated, and all emotions follow a circular structure. For example, "optimism" is close to "joy" and "anticipation" instead of "disgust" and "sadness". Without considering the fundamental inter-correlation between them, the understanding of emotions can be unilateral, leading to sub-optimal performance. Such understanding can be particularly important for low-resource emotions, such as "surprise" and "trust", whose training samples are hard to get. Therefore, the research question we ask is, how can we obtain and incorporate the emotion correlations to improve emotion understanding tasks, such as classification?

To obtain emotion correlations, a possible way is to take advantage of a multi-label emotion dataset. Intuitively, emotions with high correlations will be labeled together, and therefore, emotion correlations can be extracted from the label co-occurrences. Recently, a multi-label emotion classification competition (Mohammad et al., 2018) with 11 emotions has been introduced to promote research into emotional understanding. To tackle this challenge, the best team (Baziotis et al., 2018) first pre-trains on a large amount of external emotion-related datasets and then performs transfer learning on this multi-label task. However, they still ignored the correlations between different emotions.

In this paper, we propose EmoGraph, which leverages graph neural networks to model the dependencies between different emotions. We take each emotion as a node and first construct an emotion graph based on the co-occurrence statistics between every two emotion classes. Graph neural networks are then applied to extract the features from the neighbours of each emotion node. We conduct experiments on two multi-label emotion classification datasets. Empirical results show that our model outperforms strong baselines, especially for the macro-F1 score. The analysis shows that low-resource emotions, such as "surprise", can particularly benefit from the emotion correlations. An additional experiment illustrates that the captured emotion correlations can also help the single-label emotion classification task.

2 Related Work

For emotion classifications, Tang et al. (2016) proposed sentiment embeddings that incorporate sentiment into word vectors.
Felbo et al. (2017) trained a huge LSTM-based emotion representation by predicting emojis. Various methods have also been developed for the automatic construction of sentiment lexicons using both supervised and unsupervised methods (Wang and Xia, 2017). Duppada et al. (2018) combined both pre-trained representations and emotion lexicon features, which significantly improved emotion understanding systems. Park et al. (2018); Fung et al.; Xu et al. (2018) encoded emotion information into word representations and demonstrated improvements over emotion classification tasks. Shin et al. (2019); Lin et al. (2019); Xu et al. (2019) further introduced emotion supervision into generation tasks.

Multi-label classification is an important yet challenging task in natural language processing. Binary relevance (Boutell et al., 2004) transformed the multi-label problem into several independent classifiers. Other methods to model the dependencies have since been proposed by creating new labels (Tsoumakas and Katakis, 2007), using classifier chains (Read et al., 2011), graphs (Li et al., 2015), and RNNs (Chen et al., 2017; Yang et al., 2018). These models are either non-scalable or model the labels as a sequence.

Graph networks have been applied to model relations across different tasks such as image recognition (Chen et al., 2019; Satorras and Estrach, 2018) and text classification (Ghosal et al., 2019; Yao et al., 2019) with different graph networks (Kipf and Welling, 2017; Veličković et al., 2018).

Despite the growing interest in low-resource studies in machine translation (Artetxe et al., 2017; Lample et al., 2017), dialogue systems (Bapna et al., 2017; Liu et al., 2019, 2020), speech recognition (Miao et al., 2013; Thomas et al., 2013; Winata et al., 2020), emotion recognition (Haider et al., 2020), etc., emotion detection for low-resource emotions has been less studied.

3 Methodology

In this section, we first introduce the emotion graphs and then our emotion classification models. We denote the input sentence as x and the emotion classes as e = {e_1, e_2, ..., e_n}. The label for x is y, where y ∈ {0, 1}^n and y_j denotes the label for e_j. The embedding matrix is E. The co-occurrence matrix of these emotions is M.

3.1 Emotion Graphs

We take each emotion class as a node in the emotion graph. To create the connections between emotion nodes, we use the co-occurrence statistics between emotions. The intuition is that if two emotions co-occur frequently, they will have a high correlation. Directly using this co-occurrence matrix as our graph may be problematic because the co-occurrence matrix is symmetric while the emotion relation is not symmetric. For example, in our corpus, "anticipation" co-occurs with "optimism" 197 times, while "anticipation" appears 425 times and "optimism" appears 1143 times. Thus, knowing that "anticipation" and "optimism" co-occur is notably more important for "anticipation" than for "optimism". Therefore, we calculate the co-occurrence matrix M from a given emotion corpus and then normalize M_{i,j} with M_{i,i} so that the graph encodes the asymmetric relation between different emotions:

    G^1_{i,j} = M_{i,j} / M_{i,i}.    (1)

Due to the fuzziness nature of emotions, the graph matrix G^1 may contain some noise. Thus, we adopt the approach in Chen et al. (2019) to binarize G^1 with a threshold µ to reduce noise and tune another hyper-parameter w to mitigate the over-smoothing problem (Li et al., 2018):

    G^2_{i,j} = 1 if G^1_{i,j} ≥ µ, and 0 otherwise.    (2)

    G_{i,j} = G^2_{i,j} / (Σ_{k=1}^{n} G^2_{i,k}) if i ≠ j, and 1 − w otherwise.    (3)
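To make the construction concrete, the following is a minimal NumPy sketch of Eqs. (1)–(3). It assumes a multi-hot label matrix Y with one row per example and one column per emotion class; the function name and the defensive clamping of empty rows are our own additions and are not part of the paper's released code.

```python
import numpy as np

def build_emotion_graph(Y: np.ndarray, mu: float = 0.4, w: float = 0.35) -> np.ndarray:
    """Sketch of Eqs. (1)-(3). Y is a (num_samples, n) multi-hot label matrix."""
    # Co-occurrence counts: M[i, j] = #samples labeled with both emotions i and j,
    # so M[i, i] = #samples labeled with emotion i.
    M = Y.T @ Y                                            # (n, n)
    # Eq. (1): asymmetric statistics G1[i, j] = M[i, j] / M[i, i].
    G1 = M / np.maximum(M.diagonal(), 1)[:, None]
    # Eq. (2): binarize with threshold mu to drop noisy edges.
    G2 = (G1 >= mu).astype(float)
    # Eq. (3): normalize each row's off-diagonal entries by the row sum and
    # set the diagonal to 1 - w to mitigate over-smoothing.
    row_sum = np.maximum(G2.sum(axis=1, keepdims=True), 1)
    G = G2 / row_sum
    np.fill_diagonal(G, 1.0 - w)
    return G
```

The default values above mirror the SemEval-2018 GCN setting reported in Appendix A.2 (µ = 0.4, w = 0.35).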
3.2 Emotion Classification Models

Our emotion classification model consists of an encoder and graph-based emotion classifiers, following the framework of Chen et al. (2019). For the encoder, we choose the Transformer (TRS) (Vaswani et al., 2017) and the pre-trained model BERT (Devlin et al., 2019) for their strong representation capacities on many natural language tasks. We denote the encoded representation of x as s. For the graph-based emotion classifiers, we experiment with two types of graph networks: GCN (Kipf and Welling, 2017) and GAT (Veličković et al., 2018). The GCN takes the features of each emotion node as inputs and applies a convolution over neighboring nodes to generate the classifier for e_i:

    C = ReLU(G̃ E_e W_1),    (4)

where G̃ is the normalized matrix of G following Kipf and Welling (2017), E_e is the embeddings for all emotions e, and W_1 is the trainable parameters. Our emotion classifiers are C = C_1, C_2, ..., C_n, where C_i is the classifier for e_i. Alternatively, GAT takes the features of each node as input and learns a multi-head self-attention over the emotion nodes to generate the classifiers C.

We then simply take the inner product between s and C_i to compute ŷ_i, the logits of the emotion class e_i for classification:

    ŷ_i = s · C_i.    (5)

As our task is a multi-label prediction problem, we add a sigmoid activation to ŷ_i and use a cross-entropy loss function.
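A minimal PyTorch sketch of Eqs. (4)–(5) and the multi-label loss is given below. The class name, the hidden-size argument, and the symmetric normalization used for G̃ are illustrative assumptions on our part; the paper only states that G̃ is the normalized matrix of G following Kipf and Welling (2017).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphEmotionClassifier(nn.Module):
    """Sketch of Eqs. (4)-(5): a GCN over emotion nodes yields one classifier per emotion."""

    def __init__(self, emotion_emb: torch.Tensor, G: torch.Tensor, hidden: int):
        super().__init__()
        # E_e: (n, d_emb) emotion label embeddings (e.g., initialized from GloVe).
        self.emotion_emb = nn.Parameter(emotion_emb.clone())
        # One common reading of the normalization: G~ = D^{-1/2} G D^{-1/2}.
        d_inv_sqrt = G.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        self.register_buffer("G_norm", d_inv_sqrt[:, None] * G * d_inv_sqrt[None, :])
        # W_1: trainable projection from the label-embedding space to the encoder space.
        self.W1 = nn.Linear(emotion_emb.size(1), hidden, bias=False)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # Eq. (4): C = ReLU(G~ E_e W_1); row C_i is the classifier for emotion e_i.
        C = F.relu(self.G_norm @ self.W1(self.emotion_emb))   # (n, hidden)
        # Eq. (5): logits are inner products between the sentence encoding s and each C_i.
        return s @ C.t()                                       # (batch, n)

# Multi-label objective: sigmoid + binary cross-entropy on the logits, e.g.
# loss = F.binary_cross_entropy_with_logits(head(s), y.float())
```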

4 Experimental Setup

4.1 Dataset and Evaluation Metrics

We choose two datasets for our multi-label classification training and evaluation. SemEval-2018 (Mohammad et al., 2018) contains 10,983 tweets with 11 different emotion categories. It is divided into three splits: a training set (6838 samples), a validation set (886 samples), and a testing set (3259 samples). Due to its small number of labels and small dataset size, we crawled another Twitter dataset where we use emojis as emotion labels. We follow the same 64 emoji types as in Felbo et al. (2017) and collect 4 million tweets, where each tweet has at least two emojis. We split them into 2.8 million (70%) for training, 0.4 million (10%) for validation, and 0.8 million (20%) for testing.

Following the metrics in Mohammad et al. (2018), we use Jaccard accuracy, micro-averaged F1-score (micro-F1), and macro-averaged F1-score (macro-F1) as our evaluation metrics.

4.2 EmoGraph and Baselines

Our EmoGraph has four variants, TRS-GCN, TRS-GAT, BERT-GCN, and BERT-GAT, depending on the choice of sentence encoder (TRS/BERT) and graph network (GCN/GAT). We compare our model to several strong baselines. NTUA-SLP (Baziotis et al., 2018) is the top-1 system of the SemEval-2018 competition, which pre-trains the model on large amounts of external emotion-related datasets. DATN (Yu et al., 2018) is the system that transfers sentiment information with a dual attention transfer network. SGM (Yang et al., 2018) models labels as a sequence and generates the labels using beam search. TRS is the system that adds a linear layer on top of the Transformer (Vaswani et al., 2017). BERT is the system that adds a linear layer on top of BERT (Devlin et al., 2019).

5 Results and Analysis

Results on two emotion classification datasets. The results on the SemEval-2018 dataset are shown in Table 1, which illustrates several points. Firstly, BERT-GCN/TRS-GCN consistently improves over the baseline of BERT/TRS in terms of accuracy (+0.5%/+0.8%), micro-F1 (+0.6%/+0.4%), and macro-F1 (+2.5%/+2.5%). It shows the effectiveness of the emotion graph by giving a performance boost on the strong baseline BERT, especially on the macro-F1 score. Secondly, our BERT-GCN achieves a 58.9% accuracy score, 70.7% micro-F1 score, and 56.3% macro-F1 score, which beats the best system NTUA-SLP in terms of all metrics, without using any extra features or pre-training on emotional datasets. Thirdly, EmoGraph is particularly effective in macro-F1. For example, BERT-GAT is 2.5% better than DATN and 3.1% better than BERT.

           Accuracy   Micro-F1   Macro-F1
NTUA-SLP   58.8       70.1       52.8
DATN       58.3       -          54.4
SGM        48.2       57.5       41.1
TRS        51.1       63.4       46.7
TRS-GAT    51.7       64.6       48.3
TRS-GCN    51.9       63.8       49.2
BERT       58.4       70.1       53.8
BERT-GAT   58.3       69.9       56.9
BERT-GCN   58.9       70.7       56.3

Table 1: Comparisons among different systems on the SemEval-2018 dataset.

We notice that for both accuracy and micro-F1, the improvements of the graph networks are very marginal on the SemEval-2018 dataset. This is because the small emotion label space (only 11 emotions) limits the effectiveness of emotion graphs. Thus, we train our models on another Twitter dataset with 64 emoji labels.

           # Parameters   Anger   Fear   Happiness   Surprise   Sadness   Average F1   Accuracy
LSTM       1.11M          72.7    12.2   53.7        31.3       69.3      46.2         66.7
LSTM-GCN   1.16M          75.6    39.1   60.8        38.0       70.5      54.7         69.5
BERT       110M           83.4    39.1   69.9        47.1       80.8      62.0         78.5
BERT-GCN   110M           83.8    51.6   72.9        48.7       82.0      65.4         79.8

Table 3: Comparisons of F1 scores and accuracy between EmoGraph with and without the graph on the IEMOCAP dataset.

Table 2 shows that both TRS-GCN and TRS-GAT consistently improve over TRS by a large margin for all metrics, which shows that graph networks are better at dealing with rich emotion information.¹

¹ Data statistics, training details, and ablation studies for the threshold µ (Eq. 2) and the weight w (Eq. 3) are in the appendix.

           Accuracy   Micro-F1   Macro-F1
TRS        30.4       46.5       35.0
SGM        33.5       42.9       25.2
TRS-GAT    33.1       49.1       38.6
TRS-GCN    34.4       50.5       40.8

Table 2: Comparisons among different systems on the Twitter dataset.

Figure 1: The graph visualization of G. The arrow denotes the conditional dependency from the source node to the target node.

Connection to the wheel of emotions. Following the positioning of the wheel of emotions (Plutchik, 1984), we visualize the graph structure G in Figure 1. It shows that our emotion graph can be divided into three sub-graphs: 1) the nodes connected with "disgust", 2) "fear", and 3) the nodes connected with "joy". Most nodes are locally connected, which means the positioning in the wheel of emotions does contain the correlation information. One exception is that "anticipation" is also close to "anger" in the wheel of emotions, which is not true in our case. Another exception we found is that "surprise" is close to "joy" in our emotion graph, while they are quite far away from each other in the wheel of emotions. We believe it reflects the bias when people post tweets online, and the distances between neighboring emotions are not quantitatively the same as in the wheel of emotions.

           Surprise   Trust
BERT       18.7       6.9
BERT-GAT   31.9       14.8
BERT-GCN   27.8       24.3

Table 4: Comparison of F1 scores between BERT-GAT, BERT-GCN and BERT for "surprise" and "trust".

Analysis on low-resource emotions. Table 4 shows that the significant improvements in terms of the macro-F1 on the SemEval-2018 dataset mainly come from the low-resource emotions, such as "surprise" and "trust", where only 5% of the labels are positive. From Figure 1, we observe that both "surprise" and "trust" are connected to "joy", which has 39.3% of samples labeled as positive. We conjecture that low-resource emotion classifiers learn an effective inductive bias through connections with high-resource emotion classes, such as "joy" and "optimism", and achieve better performances.

Generalization to the single emotion classification task. To verify the generalization ability of EmoGraph, we conduct an extra experiment on the IEMOCAP dataset (Busso et al., 2008), which is a single-label multi-class classification dataset. We only use the six emotions that overlap with the SemEval-2018 task, which are "anger", "sadness", "happiness", "disgust", "fear" and "surprise", with 2931 samples (the graph matrix G is obtained from the SemEval-2018 dataset). We then train our EmoGraph using either a one-layer LSTM with attention or BERT as the encoder and GCN as the graph structure. To adapt the change from 11 emotions to 6 emotions, we only train the classifiers C_i corresponding to those six emotions (as sketched below). The results with 10-fold cross-validation are reported in Table 3. We did not report "disgust" as there are fewer than five positive examples. The results confirm that with the captured emotion correlations, classification performances can be improved by 2.8%/8.5% accuracy/average F1 score using the LSTM encoder and by 1.3%/3.4% accuracy/average F1 score using the BERT encoder. Surprisingly, for both encoders, GCN improves by more than 10% on "fear" (a low-resource emotion in the IEMOCAP dataset), although it doesn't have connections with other emotions in the graph. We conjecture that the improvements come from the shared representation of W_1 in Eq. 4 among all emotion categories, which helps our model optimize to a better local minimum.
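A minimal sketch of this adaptation, assuming a hypothetical index list overlap_idx over the 11 SemEval-2018 classes and reusing the classifier matrix C from Eq. (4); the softmax cross-entropy shown for the single-label setting is also our assumption.

```python
import torch

# Hypothetical index map: positions of the six IEMOCAP emotions within the
# 11 SemEval-2018 classes (the ordering below is illustrative only).
overlap_idx = torch.tensor([0, 2, 3, 4, 8, 9])

def iemocap_logits(s: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Restrict the emotion classifiers to the six overlapping classes (Eq. 5)."""
    # C is the (11, hidden) classifier matrix from Eq. (4); only six rows are kept,
    # so only those classifiers receive gradients during training.
    return s @ C[overlap_idx].t()   # (batch, 6)

# Single-label objective (our assumption for the multi-class IEMOCAP setting):
# loss = torch.nn.functional.cross_entropy(iemocap_logits(s, C), target_class)
```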
6 Conclusion

In this paper, we proposed EmoGraph, which leverages graph neural networks to model the dependencies among different emotions. We consider each emotion as a node and construct an emotion graph based on the co-occurrence statistics. Our model with EmoGraph outperforms the existing strong multi-label classification baselines. Our analysis shows that EmoGraph is especially helpful for low-resource emotions and large emotion spaces. An additional experiment shows that it can also help a single-label emotion classification task.

Acknowledgments

This work is partially funded by ITF/319/16FP, HKUST 16248016 of the Hong Kong Research Grants Council and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Towards zero-shot frame semantic parsing for domain scaling. Proc. Interspeech 2017, pages 2476–2480.

Christos Baziotis, Athanasiou Nikolaos, Alexandra Chronopoulou, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Shrikanth Narayanan, and Alexandros Potamianos. 2018. NTUA-SLP at SemEval-2018 task 1: Predicting affective content in tweets with deep attentive RNNs and transfer learning. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 245–255.

Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. 2004. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.

Guibin Chen, Deheng Ye, Zhenchang Xing, Jieshan Chen, and Erik Cambria. 2017. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2377–2383. IEEE.

Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5177–5186.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Venkatesh Duppada, Royal Jain, and Sushant Hiray. 2018. SeerNet at SemEval-2018 task 1: Domain adaptation for affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 18–23.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625.

Pascale Fung, Dario Bertero, Peng Xu, Ji Ho Park, Chien-Sheng Wu, and Andrea Madotto. Empathetic dialog systems.
Sentiment embed- with label and context dependence. In Proceedings dings with applications to sentiment analysis. IEEE of the 53rd Annual Meeting of the Association for Transactions on Knowledge and Data Engineering, Computational Linguistics and the 7th International 28(2):496–509. Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1045–1053. Samuel Thomas, Michael L Seltzer, Kenneth Church, and Hynek Hermansky. 2013. Deep neural net- Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, work features and semi-supervised training for low and Pascale Fung. 2019. Moel: Mixture of empa- resource speech recognition. In 2013 IEEE inter- thetic listeners. In Proceedings of the 2019 Con- national conference on acoustics, speech and signal ference on Empirical Methods in Natural Language processing, pages 6704–6708. IEEE. Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP- Grigorios Tsoumakas and Ioannis Katakis. 2007. IJCNLP), pages 121–132. Multi-label classification: An overview. Interna- tional Journal of Data Warehousing and Mining Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, (IJDWM), 3(3):1–13. Peng Xu, Andrea Madotto, and Pascale Fung. 2019. Zero-shot cross-lingual dialogue systems with trans- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob ferable latent variables. In Proceedings of the Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz 2019 Conference on Empirical Methods in Natu- Kaiser, and Illia Polosukhin. 2017. Attention is all ral Language Processing and the 9th International you need. In Advances in neural information pro- Joint Conference on Natural Language Processing cessing systems, pages 5998–6008. (EMNLP-IJCNLP), pages 1297–1303. Petar Velikovi, Guillem Cucurull, Arantxa Casanova, Zihan Liu, Genta Indra Winata, Peng Xu, and Pas- Adriana Romero, Pietro Li, and Yoshua Bengio. cale Fung. 2020. Coach: A coarse-to-fine approach 2018. Graph attention networks. In International for cross-domain slot filling. In Proceedings of the Conference on Learning Representations. 58th Annual Meeting of the Association for Compu- tational Linguistics, pages 19–25, Online. Associa- Leyi Wang and Rui Xia. 2017. Sentiment lexicon con- tion for Computational Linguistics. struction with representation learning based on hier- archical sentiment supervision. In Proceedings of Yajie Miao, Florian Metze, and Shourabh Rawat. 2013. the 2017 Conference on Empirical Methods in Natu- Deep maxout networks for low-resource speech ral Language Processing, pages 502–510. recognition. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 398– Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, 403. IEEE. Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung. 2020. Learning fast adaptation on Saif Mohammad, Felipe Bravo-Marquez, Mohammad cross-accented speech recognition. arXiv preprint Salameh, and Svetlana Kiritchenko. 2018. Semeval- arXiv:2003.01901. 2018 task 1: Affect in tweets. In Proceedings of The 12th International Workshop on Semantic Eval- Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho uation, pages 1–17. Park, and Pascale Fung. 2018. Emo2vec: Learn- Ji Ho Park, Peng Xu, and Pascale Fung. 2018. ing generalized emotion representation by multi- Plusemo2vec at semeval-2018 task 1: Exploiting task training. In Proceedings of the 9th Workshop emotion knowledge from emoji and# hashtags. 
Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. 2016. Sentiment embeddings with applications to sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 28(2):496–509.

Samuel Thomas, Michael L Seltzer, Kenneth Church, and Hynek Hermansky. 2013. Deep neural network features and semi-supervised training for low resource speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6704–6708. IEEE.

Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.

Leyi Wang and Rui Xia. 2017. Sentiment lexicon construction with representation learning based on hierarchical sentiment supervision. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 502–510.

Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung. 2020. Learning fast adaptation on cross-accented speech recognition. arXiv preprint arXiv:2003.01901.

Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018. Emo2Vec: Learning generalized emotion representation by multi-task training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 292–298.

Peng Xu, Chien-Sheng Wu, Andrea Madotto, and Pascale Fung. 2019. Clickbait? Sensational headline generation with auto-tuned reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3056–3066.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.

Jianfei Yu, Luís Marujo, Jing Jiang, Pradeep Karuturi, and William Brendel. 2018. Improving multi-label emotion classification via sentiment classification with dual attention transfer network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1097–1102, Brussels, Belgium. Association for Computational Linguistics.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The design and implementation of XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.

A Appendices

Anger   Anticipation   Disgust   Fear   Joy    Love
36.1    13.9           36.6      16.8   39.3   12.3

Optimism   Pessimism   Sadness   Surprise   Trust
31.3       11.6        29.4      5.2        5.0

Table 5: Percentage of tweets that are labeled with a given emotion in the SemEval-2018 Task 1 dataset.

A.1 Preprocessing

To cope with the noise in the Twitter data, we lowercase the tweets, remove the "#" punctuation, and replace URL links, user mentions, and numbers with special placeholder tokens. For example, the original tweet "@JessicaZ00 @ZRlondon ditto!! Such an amazing atmosphere! We have 10 people here. #LondonEvents #cheer" will be cleaned as "ditto! such an amazing atmosphere! we have people here. londonevents cheer", with the two user mentions and the number replaced by their placeholder tokens. By doing so, we can reduce the vocabulary size and make the model easier to learn.
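A minimal sketch of this cleaning step; the placeholder strings <url>, <user>, and <number> and the exact regular expressions are our own assumptions, since the paper does not specify the token strings it uses.

```python
import re

def clean_tweet(text: str) -> str:
    """Lowercase, strip '#', and replace URLs, user mentions, and numbers with placeholders."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)   # URL links -> <url> (assumed token string)
    text = re.sub(r"@\w+", "<user>", text)          # user mentions -> <user>
    text = re.sub(r"\b\d+\b", "<number>", text)     # numbers -> <number>
    text = text.replace("#", "")                    # drop '#' but keep the hashtag word
    return re.sub(r"\s+", " ", text).strip()

# clean_tweet('@JessicaZ00 @ZRlondon ditto!! Such an amazing atmosphere! '
#             'We have 10 people here. #LondonEvents #cheer')
# -> '<user> <user> ditto!! such an amazing atmosphere! we have <number> people here. londonevents cheer'
```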

A.2 Training Details

To train on SemEval-2018, we calculate the matrix G^1 based on the training and development sets. We set the threshold µ as 0.4 and the weight w as 0.35 for GCN. We set the threshold µ as 0.5 for GAT. For both BERT-GCN and BERT-GAT, the emotion label embeddings are initialized with 300-dim GloVe vectors. For the BERT structure, we choose the 12-layer BERT-base model, and the hidden size of the graph is set to 768. An Adam optimizer with a learning rate of 1e-4 is used to train the graph networks. Another Adam optimizer is used to train the BERT model with the learning rate set as 2e-5 and the dropout rate as 0.3. For TRS-GCN and TRS-GAT, the hidden size of the graph network is set to 200. The heads of the Transformer are 4, and the depth is 44. An Adam optimizer is used to train the full model with a 0.001 learning rate. For the Twitter dataset, we calculate the matrix G^1 based on the training set. We set the threshold µ as 0.1, and the weight w is set as 0.35 for both. We use the TRS as the sentence encoder. The hidden sizes of both the graph networks and TRS are 200. The heads of TRS are 6, and the depth is 120. An Adam optimizer is used to train our model with a 0.001 learning rate.

To train on the IEMOCAP dataset, the parameters of both GCN and BERT follow the same setting as in the SemEval-2018 task. For the LSTM-based method, we set the hidden size of the LSTM encoder as 200 and initialize the word embeddings with GloVe. An Adam optimizer with a learning rate of 0.001 is used to optimize all parameters.
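For reference, the SemEval-2018 hyper-parameters above can be gathered into a single configuration; the dictionary layout and key names are our own illustrative choice and cover only the two GCN variants.

```python
# Hyper-parameters reported above for the SemEval-2018 runs (key names are illustrative).
SEMEVAL_CONFIG = {
    "bert_gcn": {
        "mu": 0.4, "w": 0.35,               # graph threshold and re-weighting (Eqs. 2-3)
        "label_emb": "glove-300d",          # emotion label embedding initialization
        "bert": "bert-base (12 layers)",
        "graph_hidden": 768,
        "graph_lr": 1e-4, "bert_lr": 2e-5,  # two separate Adam optimizers
        "dropout": 0.3,
    },
    "trs_gcn": {
        "mu": 0.4, "w": 0.35,
        "graph_hidden": 200,
        "trs_heads": 4,
        "lr": 1e-3,                          # single Adam optimizer
    },
}
```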
33.37   21.86   38.13   16.63   0.41    15.40   3.71    4.35
32.95   0.95    3.78    1.43    5.02    0.21    3.13    10.13
2.59    0.26    9.67    4.40    2.18    2.83    4.86    2.23
1.39    1.58    3.50    3.03    1.98    3.35    1.14    3.73
2.92    3.21    9.91    0.51    2.16    4.00    3.30    2.97
5.14    3.66    1.69    1.89    2.43    6.95    2.10    2.56
0.19    3.83    1.35    3.49    1.55    2.12    2.83    0.78
1.48    2.03    0.94    0.25    1.28    0.53    0.56    0.28

Table 6: Percentage of tweets that are labeled with a given emoji in the Twitter data (one value per emoji, 64 emojis in total). Note that as our dataset is multi-labeled, the sum over all classes is not one.

                   Anger   Fear   Happiness   Surprise   Sadness
Distribution (%)   37.7    1.4    20.1        3.5        37.2

Table 7: Percentage of data samples that are labeled with a given emotion in the IEMOCAP dataset.

A.3 Data Statistics

The data statistics for SemEval-2018, Twitter, and IEMOCAP are shown in Table 5, Table 6, and Table 7, respectively.

From Table 5, we can see that in the SemEval-2018 dataset, only 5.2% and 5.0% of samples are labeled as positive for "surprise" and "trust", respectively. And from Table 7, we can see that in the IEMOCAP dataset, only 1.4% of samples are labeled as positive for "fear".

A.4 Effects of the threshold µ and the weight w

As illustrated in Figure 2 and Figure 3, we plot the accuracy, micro-F1, and macro-F1 on the test set by varying µ and w, respectively, for BERT-GCN. We also include another baseline BERT-GCN model which is trained by setting G = G^1 (w/o binarization), to see the effects of binarization (Eq. 2) and weighting (Eq. 3).
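A compact sketch of this sweep, assuming hypothetical callables build_graph(mu, w) (e.g., the construction sketched in Section 3.1) and train_and_eval(G) that trains BERT-GCN with graph G and returns the three test metrics; the grid values mirror the x-axes of Figures 2 and 3.

```python
from typing import Callable, Dict, Tuple
import numpy as np

Metrics = Tuple[float, float, float]  # (Jaccard accuracy, micro-F1, macro-F1)

def sweep_mu_w(build_graph: Callable[..., np.ndarray],
               train_and_eval: Callable[[np.ndarray], Metrics]) -> Dict[str, Metrics]:
    """Grid over the threshold mu (Figure 2) and the weight w (Figure 3) for BERT-GCN."""
    results: Dict[str, Metrics] = {}
    for mu in np.arange(0.1, 1.0, 0.1):           # mu in {0.1, ..., 0.9}
        results[f"mu={mu:.1f}"] = train_and_eval(build_graph(mu=mu, w=0.35))
    for w in np.arange(0.05, 1.0, 0.1):           # w in {0.05, ..., 0.95}
        results[f"w={w:.2f}"] = train_and_eval(build_graph(mu=0.4, w=w))
    return results
```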

Figure 2: Evaluation scores for different threshold µ on the SemEval-2018 dataset.

Figure 3: Evaluation Scores for different weight w on the SemEval-2018 dataset.

Figure 2 shows that removing a moderate amount of noisy connections by changing µ can improve the performance. For macro-F1, it clearly shows that a small µ achieves much better results than a large µ. If µ is large, useful emotion connections might be cut off, therefore leading to worse results. Figure 3 shows that for accuracy and micro-F1, changing w can achieve better performance than the baseline without binarization and weighting.