TERG: Topic-Aware Emotional Response Generation for

Pei Huo†, Yan Yang*†, Jie Zhou†, Chengcai Chen@, Liang He†
†Department of Computer Science and Technology, East China Normal University
@Xiaoi Research, Xiaoi Robot Technology Co., Ltd
Shanghai, China
*[email protected]

Abstract—A more intelligent chatbot should be able to express emotion in addition to providing informative responses. Despite much work on neural dialogue generation in recent years, few studies consider both the emotion to be expressed and topic relevance during generation. To address this problem, we present a Topic-aware Emotional Response Generation (TERG) model, which not only generates responses with the desired emotion but also performs well in topic relevance. Specifically, TERG equips an encoder-decoder structure with an emotion-aware module to control emotional sentence generation and a topic-aware module to enhance topic relevance. We evaluate our model on a large real-world dataset of conversations from social media. Experimental results show that our model obtains a significant improvement over several strong baseline methods on both automatic and human evaluation.

Index Terms—dialogue generation, emotion, topic-aware commonsense, latent variable, Seq2Seq, CVAE

Fig. 1. The comparison of generated responses and real responses. Emotion-related words are in red, keywords in blue, and the others are ordinary words.

I. INTRODUCTION

With the availability of large-scale dialogue corpora, there is a boom in research on open-domain chit-chat dialogue systems. Emotion expression is an important inherent attribute of a dialogue system, and recent research has investigated how to equip such systems with the ability to express emotion. Studies [1], [2] have shown that an emotional chatbot can significantly improve user satisfaction and enrich human-computer interaction.

Early related studies [3]–[5] are either rule-based, retrieval-based, or limited to small-scale data; they can hardly express complex and varied emotions and are difficult to scale to large datasets. More recently, sequence-to-sequence (Seq2Seq) models with attention [6] have become a standard neural framework for dialogue generation. Zhou et al. [7] proposed an emotional chatting machine (ECM) based on Seq2Seq that is able to generate a response with a specified emotion. Immediately after this work, Asghar et al. [8] constructed an emotional dialogue system by adding affective word embeddings, an affective objective function and affective diverse beam search algorithms.

Although current emotional generative conversation models have achieved promising results, they still suffer from the following issues. First, these models tend to generate trivial or universally relevant responses with little meaning, such as "Haha", "I love you" or "I hate you", due to the addition of the emotional factor. Second, they tend to ignore topic relevance when generating emotional responses. As is widely acknowledged, the conversations between humans are usually limited to a particular topic during a period of time. What's more, using the efficient unsupervised topic model BTM [9], we find that the topic relevance of real-world conversations is much higher than that of generated ones (more details in Section V.D). As shown in Figure 1, we can intuitively see the importance of topic relevance in an emotional dialogue system: given a message about basketball, the natural responses should also be basketball-related, but the responses from existing generative models are rarely related to basketball.

To comprehensively consider emotion and topic factors in response generation, we present a Topic-aware Emotional Response Generation (TERG) model; in the following we refer to the architecture by the abbreviation TERG. Our model equips an encoder-decoder structure with an Emotion-Aware (EA) module to control emotional sentence generation and a Topic Commonsense-Aware (TCA) module to enhance topic relevance. In the EA module, we use a learnable latent variable to capture the semantic information of the specified emotional response and the distribution over three kinds of words: emotion-related words, keywords and ordinary words. In decoding, the latent variable and the emotion label embedding are fed into each decoder unit, and the word-type distribution obtained from the latent variable is used to explicitly modulate the generation distribution over the entire vocabulary. In addition, we introduce the TCA module to enrich topic relevance: it obtains external topic commonsense and integrates it into the generation process through attention fusion.

Automatic and human evaluations demonstrate that our model improves both the topic relevance and the emotion expression precision, compared to strong baselines.

II. RELATED WORK & BACKGROUND

Neural response generation models are built upon the encoder-decoder framework [6]. Generating emotional responses is an important step towards building a more intelligent chatbot. Zhou et al. [7] proposed an emotional chatting machine (ECM) utilizing emotion category embeddings, internal emotion states, and an external emotion vocabulary. ECM only performs well on several specific emotion categories for which there is sufficient training data. Immediately after this work, Asghar et al. [8] used an affective dictionary to add three dimensions for each word when constructing the affective word embeddings. They also proposed an affective objective function and an affective diverse beam search algorithm to generate proper emotional responses. Multi-task learning has also been used to build emotional conversation models [10], but the approach is coarse and its experimental results are limited. Song et al. [11] proposed an emotional dialogue system (EmoDS) that is able to express a specific emotion in responses either explicitly or implicitly. Ekman et al. [12], among the earliest emotion theorists, proposed an emotion classification scheme with six basic emotions: anger, disgust, fear, joy, sadness and surprise. In our work, we made a slight adjustment to this classification according to the existing corpus.

Fig. 2. A simple directed graph is used to illustrate the inference and generation process of CVAE. Dashed lines represent the inference of z. Solid lines represent the generation process.

In our model, we harness a latent variable in the Conditional Variational Autoencoder (CVAE) [13] framework to project different emotional responses into a latent space. The external topic commonsense takes the form of topic words that are closely related to the conversation, such as 'NBA', 'Kobe', 'referee', 'dribble', 'coach', etc. Besides, we fine-tune the BERT [14] model to train the classifiers used to evaluate model performance. BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, so the experimental evaluation results are credible.

III. OUR TERG MODEL

A. Task Definition and Model Overview

Our problem is formulated as follows: given a message X = x_1, x_2, ..., x_n and a specified emotion label el, the goal is to generate a response Y = y_1, y_2, ..., y_m that not only matches the emotion category el but is also topic-related to the message. Essentially, the generation model aims to maximize the generation probability of Y conditioned on X and el:

P(Y \mid z, X, el, v) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, z, X, el, v)   (1)

where z is the latent variable and v denotes the topic commonsense of the message.
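To make the factorization in Eq. (1) concrete, the sketch below computes the conditional log-likelihood of a response from per-step output distributions under teacher forcing; the tensor names and shapes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(step_logits: torch.Tensor,
                            response_ids: torch.Tensor) -> torch.Tensor:
    """Log P(Y | z, X, el, v) as in Eq. (1), assuming the decoder has already
    produced one logit vector per target position (teacher forcing).

    step_logits:  (batch, m, vocab_size) unnormalized scores for each step t,
                  conditioned on y_{<t}, z, X, el and v inside the decoder.
    response_ids: (batch, m) gold token ids y_1..y_m.
    """
    log_probs = F.log_softmax(step_logits, dim=-1)             # per-step log p(y_t | ...)
    token_lp = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=-1)                                # sum over t = product in probability space
```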

Fig. 3. The architecture of TERG, which consists of four parts: Encoder, Emotion-Aware (EA) module, Topic Commonsense-Aware (TCA) module and Topic-aware Emotion (TE) Decoder. The model graph shows the t-th decoding step, where h_X denotes all hidden states of the encoder and h_n is the final hidden state of the Bi-GRU encoder. The response encoder only exists during the training process.

B. Encoder

The message encoder is a bidirectional GRU that maps the message into its hidden representation. Formally, the hidden states of the encoder are computed as follows:

\overrightarrow{h_t} = GRU_f(\tilde{x}_t, \overrightarrow{h}_{t-1}); \quad \overleftarrow{h_t} = GRU_b(\tilde{x}_t, \overleftarrow{h}_{t+1})   (2)

where \tilde{x}_t is the embedding of word x_t; throughout this paper, lower-case letters with a tilde denote word embeddings. \overrightarrow{h_t} and \overleftarrow{h_t} are the t-th hidden states of the forward and backward GRU respectively, and the hidden state h_t is their concatenation, h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]. The response encoder is similar to the message encoder; notably, the response encoder is only used during training.
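A minimal sketch of the bidirectional GRU message encoder of Eq. (2) in PyTorch; the embedding size, hidden size and omission of padding handling are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MessageEncoder(nn.Module):
    """Bi-GRU encoder: h_t = [forward h_t ; backward h_t], as in Eq. (2)."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, message_ids: torch.Tensor):
        # message_ids: (batch, n) token ids of the message X
        emb = self.embedding(message_ids)            # (batch, n, emb_dim) = x~_t
        states, last = self.gru(emb)                 # states: (batch, n, 2*hidden_dim)
        # h_n: concatenation of the final forward and backward hidden states
        h_n = torch.cat([last[0], last[1]], dim=-1)  # (batch, 2*hidden_dim)
        return states, h_n
```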

C. Topic Commonsense-Aware Module

Human conversations are usually held under a particular topic during a period of time, so after receiving a message, topic commonsense is essential for continuing the conversation. In this paper, the topic commonsense is a set of topic words, and we dynamically construct a specific topic-related lexicon for each message. Specifically, we employ the biterm topic model (BTM) [9], an efficient topic model designed for short texts, to obtain topic-related words. BTM models the generation procedure of biterms in a short-text collection and evolves from the LDA model; we omit an exhaustive description of BTM because the topic model is not the main point of this paper. The topic words are selected as follows: first, BTM assigns the most related topic RT to the current message; then we pick the top words under topic RT to build the topic-related lexicon T.

At decoding step t, the module computes a context attention vector C_t over the message hidden states and a topic attention vector TC_t over the topic lexicon:

C_t = \sum_{j=1}^{n} \alpha_{tj} h_j; \quad TC_t = \sum_{j=1}^{l} \beta_{tj} \tilde{t}_j   (3)

where \alpha_{tj} measures the semantic relevance between the decoder state s_{t-1} and the hidden state h_j, and \beta_{tj} denotes the weight between the hidden state s_{t-1} and the j-th topic word in T. They are given by:

\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})}; \quad e_{tj} = \eta(s_{t-1}, h_j)   (4)

\beta_{tj} = \frac{\exp(w_{tj})}{\sum_{k=1}^{N} \exp(w_{tk})}; \quad w_{tj} = g(s_{t-1}, h_n, \tilde{t}_j)   (5)

where \eta and g are deep neural networks such as multilayer perceptrons (MLPs). In order to extract the most relevant features of the external topic-word lexicon, we use the previous decoder hidden state s_{t-1}, the last message encoder hidden state h_n, and the embedding of the j-th topic word as the input of g to obtain the weight score.

The fusion of multiple attentions lets the topic commonsense be integrated into the generation process more naturally. The concatenation [C_t; TC_t] of the context attention and the topic attention is fed into another deep neural network g^* to obtain the final attention M_t:

M_t = g^*([C_t; TC_t])   (6)

where C_t \in R^{1 \times d}, TC_t \in R^{1 \times d}, and the final attention M_t \in R^{1 \times d}. Through such a fusion mechanism, the module can weaken the noise from topic words that are irrelevant to the message and can seamlessly plug proper topic words into the generated text at the right time steps.
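The following sketch shows one way the TCA module's attention fusion (Eqs. (3)–(6)) could be implemented; the layer sizes and the choice of a single hidden layer for η, g and g* are assumptions, since the paper only states they are deep networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicCommonsenseAttention(nn.Module):
    """Context attention C_t, topic attention TC_t and fused attention M_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.eta = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))   # η(s_{t-1}, h_j)
        self.g = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))     # g(s_{t-1}, h_n, t~_j)
        self.g_star = nn.Linear(2 * dim, dim)                                             # g*([C_t; TC_t])

    def forward(self, s_prev, enc_states, h_n, topic_emb):
        # s_prev: (batch, d); enc_states: (batch, n, d); h_n: (batch, d); topic_emb: (batch, l, d)
        n, l = enc_states.size(1), topic_emb.size(1)
        e = self.eta(torch.cat([s_prev.unsqueeze(1).expand(-1, n, -1), enc_states], dim=-1)).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                                   # Eq. (4)
        C_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)     # Eq. (3), context attention

        w = self.g(torch.cat([s_prev.unsqueeze(1).expand(-1, l, -1),
                              h_n.unsqueeze(1).expand(-1, l, -1), topic_emb], dim=-1)).squeeze(-1)
        beta = F.softmax(w, dim=-1)                                    # Eq. (5)
        TC_t = torch.bmm(beta.unsqueeze(1), topic_emb).squeeze(1)      # Eq. (3), topic attention

        M_t = self.g_star(torch.cat([C_t, TC_t], dim=-1))              # Eq. (6), fused attention
        return M_t, alpha, beta
```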
D. Emotion-Aware Module

The emotion-aware module consists of the Q(P) networks and an emotion supervisor. Following [15], we use two networks to draw latent-variable samples during training and testing respectively; the implementation principle is adopted from the CVAE framework [13].

a) Q(P) Networks: According to the theory of CVAE, the latent variable should be sampled from the true posterior distribution P(z|Y, X), but this posterior is intractable. Therefore, during training we feed the message X and the response Y into the Q network to obtain an approximate posterior distribution q_\phi(z|Y, X). Besides, we feed z into the emotion supervisor to predict the emotion label of the response. After training on a large number of samples, the latent variable z can map different kinds of emotional responses into different regions of the latent space.

In practice, we assume that the latent variable z follows a Gaussian distribution with a diagonal covariance matrix. During training, the Q network outputs the parameters \mu and \sigma^2 of the approximate posterior distribution q_\phi(z|Y, X), from which the latent variable z is sampled. The Q network is a multilayer perceptron (MLP):

MLP_{q_\phi}([Y; X]) \Rightarrow [\mu; \sigma^2]; \quad q_\phi(z|Y, X) \sim N(z; \mu, \sigma^2 I)   (7)

However, during prediction there is no encoding of the response Y. Therefore, we adopt another MLP, namely the P network, to approximate the true prior distribution, implemented in the same way:

MLP_{p_\theta}(X) \Rightarrow [\mu'; \sigma'^2]; \quad p_\theta(z|X) \sim N(\mu', \sigma'^2 I)   (8)

Our model is trained to minimize the KL divergence between the prior and posterior distributions so that the posterior can be approximated accurately by the prior. The lower the KL loss, the closer the two distributions, which is defined as:

D_{kl} = KL[q_\phi(z|Y, X) \| p_\theta(z|X)] = \sum_{i=1}^{N_z} q_\phi(z = z_i|X, Y) \log \frac{q_\phi(z = z_i|X, Y)}{p_\theta(z = z_i|X)}

where N_z is the dimension of the latent variable z, and \theta and \phi denote the model parameters. During inference, the model samples a latent variable z based solely on the prior distribution.
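A compact sketch of the Q (recognition) and P (prior) networks of Eqs. (7)–(8) with reparameterized sampling and a Gaussian KL term; using the closed-form Gaussian KL (rather than the per-dimension sum written above) is a common simplification and an assumption here, as are the layer sizes.

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """Maps an input vector to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, in_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL[ q(z|Y,X) || p(z|X) ] for two diagonal Gaussians, summed over dimensions."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

# Q network sees [response; message] features, P network sees only the message features, e.g.:
# q_net = GaussianMLP(in_dim=2 * enc_dim, z_dim=64); p_net = GaussianMLP(in_dim=enc_dim, z_dim=64)
```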

b) Emotion Supervisor: Furthermore, an emotion supervisor guides the latent variable to encode the emotional information of the response associated with its emotion label. Following [15], the supervisor takes z as input and predicts the emotion label:

P(el|z) = softmax(W_e f(z)); \quad f(z) = \tanh(Mz + b)   (9)

where el is the emotion label, the latent variable z is a k-dimensional vector, M \in R^{d \times k} and W_e \in R^{c \times d} are trainable transformation matrices, b \in R^{d \times 1}, and c is the number of emotion categories. The loss of the Emotion Supervisor is the negative log-likelihood of the target emotion label, -\log P(el|z).

E. Topic-aware Emotion Decoder

At each decoding step, the decoder state s_t is updated from the fused attention M_t, the emotion label, the previous word embedding, the latent variable and the previous state:

s_t = GRU(M_t, el, \tilde{y}_{t-1}, z, s_{t-1})   (11)

We divide the words in the vocabulary into three types. Keywords are crucial for expressing the core meaning, emotional words have strong emotional polarity, and ordinary words connect the emotion and content words to make a natural and grammatical sentence. Following [15], we use the latent variable z and the hidden state s_t to estimate a distribution over word types at each decoding step, which is used to explicitly control emotional sentence generation:

\rho_{e,k,o} = softmax(W_{eko} \tanh(W_{sz}[s_t; z] + b_{sz}))   (12)

where \rho_{e,k,o} \in R^3 can be viewed as the weights of choosing the different word types. We define the final generation probability as:

y_t \sim P(y_t) = [\rho_e \cdot P_{et}(y_t = w^e); \ \rho_k \cdot P_{kt}(y_t = w^k); \ \rho_o \cdot P_{ot}(y_t = w^o)]   (13)

where P_{et}, P_{kt} and P_{ot} are the probabilities of selecting emotional words, keywords and ordinary words respectively. The probabilities of choosing words of the different types are defined as:

P(y_{et}) = softmax(W_e [s_t; z; el])   (14)
P(y_{kt}) = softmax(W_k [s_t; M_t])   (15)
P(y_{ot}) = softmax(W_o s_t)   (16)

The probability of emotion words, P(y_{et}), depends on the hidden state s_t, the latent variable z and the emotion label el; after considering these factors comprehensively, the decoder can generate proper emotional words related to the specified emotion label. The probability P(y_{kt}) of selecting a keyword depends on the hidden state s_t and the output M_t of the TCA module. For the probability P(y_{ot}) we consider only the hidden state. The generation loss is based on cross-entropy:

loss_g = -\sum_{t=1}^{m} \log(P(y_t \mid y_{<t}, z, X, el, v))
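A sketch of the type-gated output distribution of Eqs. (12)–(16): three type-specific softmaxes are scaled by the type weights ρ and concatenated into one distribution over a vocabulary partitioned into emotional words, keywords and ordinary words. The partition sizes and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeGatedOutput(nn.Module):
    """Combines emotion-word, keyword and ordinary-word distributions with type weights ρ."""

    def __init__(self, d: int, z_dim: int, el_dim: int, n_emo: int, n_key: int, n_ord: int):
        super().__init__()
        self.type_proj = nn.Sequential(nn.Linear(d + z_dim, d), nn.Tanh(), nn.Linear(d, 3))  # Eq. (12)
        self.W_e = nn.Linear(d + z_dim + el_dim, n_emo)   # Eq. (14): emotional-word scores
        self.W_k = nn.Linear(d + d, n_key)                # Eq. (15): keyword scores (uses M_t)
        self.W_o = nn.Linear(d, n_ord)                    # Eq. (16): ordinary-word scores

    def forward(self, s_t, z, el_emb, M_t):
        rho = F.softmax(self.type_proj(torch.cat([s_t, z], dim=-1)), dim=-1)      # (batch, 3)
        p_e = F.softmax(self.W_e(torch.cat([s_t, z, el_emb], dim=-1)), dim=-1)    # emotional words
        p_k = F.softmax(self.W_k(torch.cat([s_t, M_t], dim=-1)), dim=-1)          # keywords
        p_o = F.softmax(self.W_o(s_t), dim=-1)                                    # ordinary words
        # Eq. (13): the final distribution over the partitioned vocabulary
        p_y = torch.cat([rho[:, 0:1] * p_e, rho[:, 1:2] * p_k, rho[:, 2:3] * p_o], dim=-1)
        return p_y
```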

TABLE III
THE AUTOMATIC EVALUATION RESULTS

Model   | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4  | Perplexity | Emb Average | Extrema | Greedy | Emotion acc | Topic relevance
Seq2Seq | 0.0893 | 0.0206 | 0.0035 | 0.00169 | 83.89      | 0.531       | 0.360   | 0.396  | -           | 0.2713
ECM     | 0.0912 | 0.0196 | 0.0040 | 0.00189 | 75.58      | 0.711       | 0.470   | 0.580  | 0.6511      | 0.3611
ERG     | 0.0937 | 0.0196 | 0.0061 | 0.00246 | 65.84      | 0.733       | 0.485   | 0.598  | 0.7210      | 0.4857
TERG    | 0.0999 | 0.0234 | 0.0067 | 0.00320 | 63.95      | 0.786       | 0.588   | 0.638  | 0.7109      | 0.6083
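Table III reports BLEU-1 through BLEU-4 among its automatic metrics; the snippet below shows one common way to compute corpus-level BLEU-n for generated responses with NLTK. It is an illustrative choice rather than the evaluation script used in the paper.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_n(references, hypotheses, n: int) -> float:
    """Corpus BLEU-n (cumulative up to n-grams) for tokenized responses.

    references: list of reference token lists, one per test message.
    hypotheses: list of generated token lists, aligned with references.
    """
    weights = tuple(1.0 / n for _ in range(n))      # e.g. (0.5, 0.5) for BLEU-2
    refs = [[r] for r in references]                # NLTK expects a list of references per sample
    return corpus_bleu(refs, hypotheses, weights=weights,
                       smoothing_function=SmoothingFunction().method1)

# Example: bleu_n([["i", "love", "basketball"]], [["i", "like", "basketball"]], 2)
```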

where n_p represents the number of positive labels predicted by the topic classifier and n_a is the total number of test samples.

B. Human Evaluation

We recruited six volunteers who are well-educated native speakers of Chinese to score the test results of our TERG and the baselines. We randomly sampled 200 messages and their generated responses from the test set. Following [24], we designed two evaluation strategies. Pointwise evaluation: three volunteers rated the generated responses in terms of fluency, topic relevance and emotion expression accuracy. A graded assessment scale was used to score the generated responses, where 0 = very terrible, 1 = bad, 2 = borderline, 3 = not bad, 4 = good, 5 = surprisingly good. Pairwise evaluation: the remaining three volunteers evaluated whether the responses generated by our model are better than those of the baselines, where 1 = better, 0 = equal, -1 = worse. If they could not understand either reply, they were asked to choose "equal". The source of the generated responses was blind to the volunteers, and the final scores are averages. In this way, we can comprehensively evaluate the results generated by the different models.

TABLE IV
THE POINTWISE HUMAN EVALUATION RESULTS

Model   | Fluency | Topic relevance | Emotion accuracy
Seq2Seq | 2.8432  | 2.0522          | -
ECM     | 2.6757  | 2.0943          | 2.5277
ERG     | 2.9772  | 2.3030          | 3.0025
TERG    | 3.1919  | 3.3863          | 3.0833

TABLE V
THE PAIRWISE HUMAN EVALUATION RESULTS

Model           | Score
TERG vs Seq2Seq | 0.5327
TERG vs ECM     | 0.4826
TERG vs ERG     | 0.2587
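Inter-annotator agreement for the pointwise ratings is summarized with Fleiss's kappa [25] in the next subsection; a sketch of how such a kappa can be computed with statsmodels is shown below, assuming a ratings matrix of shape (items, raters). This is an illustration, not the authors' analysis script.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def rating_kappa(ratings: np.ndarray) -> float:
    """Fleiss's kappa for one criterion (e.g. fluency).

    ratings: integer array of shape (n_items, n_raters), each entry a 0-5 score.
    """
    table, _ = aggregate_raters(ratings)     # (n_items, n_categories) count table
    return fleiss_kappa(table, method='fleiss')

# Example with three annotators rating four responses:
# rating_kappa(np.array([[3, 3, 4], [2, 2, 2], [5, 4, 4], [1, 2, 1]]))
```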

C. Evaluation Results and Analysis

Our model shows substantial improvements over the baseline methods in terms of perplexity, BLEU score and manual evaluation. Table III reports the evaluation results on automatic metrics. The lower perplexity indicates that our model is able to generate more fluent responses, and the BLEU scores of our model are much higher than those of ECM and Seq2Seq, which indicates that the responses generated by our model are closer to the ground truth. Since dialogue generation is an open-ended problem, scores on such tasks are typically much lower than those observed in more constrained generation tasks. In terms of semantic and topic relevance, our model yields a significant performance boost. As we can see, after removing the TCA module (the ERG variant), the topic relevance score decreases significantly; this verifies that the TCA module is particularly useful for generating topic-related responses. In addition, the emotion accuracy of ERG is slightly higher than that of TERG. The reason for the slightly lower emotion accuracy may be the addition of the external topic commonsense, which biases the model slightly towards capturing topic information. From the overall results, the TERG model is better than ERG.

Human evaluation results are shown in Tables IV and V. The pointwise evaluation results show that the TERG model yields the best score on all metrics. Inter-rater consistency among the three annotators was measured with Fleiss's kappa [25]; the Fleiss's kappa for fluency, topic relevance and emotion accuracy is 0.46, 0.41 and 0.49 respectively, showing moderate annotator agreement. In the pairwise annotation protocol, a score larger than 0 indicates that our model outperforms its competitor.

D. Topic Relevance Analysis

In this section, we further analyze the topic relevance of the real dialogue corpus and of the generated dialogues using the unsupervised topic model BTM. Specifically, we randomly selected 5,000 message-response pairs from the test set and took the same messages as inputs to generate responses with the baseline models and our models. We then adopted the statistical procedure shown in Algorithm 1 to calculate the topic relevance score. From the results in Figure 4, we can intuitively see that the topic relevance of the real-life dialogue corpus is much higher than that of the dialogues generated by the baseline models, and the scores of our model are very close to, or even higher than, the scores of the real conversations. The reason the score can be higher than that of the real dialogues is that the corpus is collected from Weibo rather than real-world dialogue: Weibo users do not always use standard grammar or spelling and frequently use colloquial language. Thus, it is important to introduce topic commonsense knowledge into a dialogue system. In this work, the topic commonsense is in the form of a series of topic words that are closely related to the conversation. Some specific examples of topic commonsense are shown in Figure 5.

Algorithm 1: Topic relevance score calculation
Given: topic model T
Input: test pairs; topic range n in [1:15]
count = 0
for each message-response pair do
    T assigns a topic distribution to the message and to the response
    sort and select the top-n topics M for the message
    sort and select the top-n topics R for the response
    if M ∩ R ≠ ∅ then count = count + 1
end for
score = count / (number of test pairs)
return score

Fig. 4. The column chart of dialogues' topic relevance scores.
Fig. 5. Examples of messages and their topic commonsense.
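A runnable sketch of Algorithm 1, assuming the trained topic model exposes a hypothetical infer_topics(text) method that returns a topic-probability vector (BTM implementations differ, so this interface is an assumption).

```python
import numpy as np

def topic_relevance_score(topic_model, pairs, n: int = 5) -> float:
    """Fraction of message-response pairs that share at least one of their top-n topics.

    topic_model: object with a (hypothetical) infer_topics(text) -> np.ndarray method.
    pairs:       iterable of (message, response) strings.
    n:           how many top topics to compare (the paper sweeps n over [1:15]).
    """
    pairs = list(pairs)
    count = 0
    for message, response in pairs:
        msg_topics = np.argsort(topic_model.infer_topics(message))[::-1][:n]
        rsp_topics = np.argsort(topic_model.infer_topics(response))[::-1][:n]
        if set(msg_topics) & set(rsp_topics):   # the top-n topic sets overlap
            count += 1
    return count / len(pairs)
```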

E. Case Study

The test examples presented in Figure 6 show that our model can generate more informative responses that are related to the given messages. In the first example, to generate a response with the 'like' emotion, TERG no longer uses the word 'like'; to our surprise, it uses the word 'darling'. Furthermore, the responses generated by our model are all related to the topic of eggs. In the second example, our dialogue system produces the words "monsters" and "Sailor Moon" in the generated sentences, which are related to the entity word "Ultraman", thanks to the addition of the TCA module.

Fig. 6. Examples of responses generated by different models. The red marked words carry the emotional properties and the blue marked words are keywords which express the core meaning of the sentences.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have constructed a novel response generation model for chatbots by introducing a latent variable and a fusion of multiple attentions. Our model shows substantial improvements over several baseline methods in both automatic and manual evaluation. Our work has important implications for the design of chatbots. An excellent chatbot should be able to perceive, understand, and express different emotions like a human, and generating emotional responses is a crucial step in this process. The purpose of chatbots is to hold many rounds of conversation with their users and to accomplish some loosely defined goals, such as filling the user's idle time, providing emotional companionship, and encouraging users. Introducing external topic commonsense into response generation is a useful step in this direction. Although our dialogue system performs well on many evaluation metrics, there are still some problems to solve in the future. The dialogue system was trained on a corpus from the Weibo platform, which contains a lot of noisy data; this leads to weak logical relationships between the generated responses and the messages. In the future, we will explore how to enhance the logical association among the turns of a conversation, and how to model the emotion and topic attributes so as to capture their transformations in multi-turn dialogue systems.

VII. ACKNOWLEDGMENT

This research is funded by the Science and Technology Commission of Shanghai Municipality (No. 18511105502) and Xiaoi Research.

REFERENCES

[1] H. Prendinger and M. Ishizuka, "The empathic companion: A character-based interface that addresses users' affective states," Applied Artificial Intelligence, vol. 19, no. 3-4, pp. 267–285, 2005.
[2] H. Prendinger, H. Dohi, H. Wang, S. Mayer, and M. Ishizuka, "Empathic embodied interfaces: Addressing users' affective state," in Tutorial and Research Workshop on Affective Dialogue Systems. Springer, 2004, pp. 53–64.
[3] T. S. Polzin and A. Waibel, "Emotion-sensitive human-computer interfaces," in ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
[4] M. Skowron, "Affect listeners: Acquisition of affective states by means of conversational systems," in Development of Multimodal Interfaces: Active Listening and Synchrony. Springer, 2010, pp. 169–181.
[5] H. Wang, Z. Lu, H. Li, and E. Chen, "A dataset for research on short-text conversations," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 935–945.
[6] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[7] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu, "Emotional chatting machine: Emotional conversation generation with internal and external memory," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou, "Affective neural response generation," in European Conference on Information Retrieval. Springer, 2018, pp. 154–166.

[9] X. Yan, J. Guo, Y. Lan, and X. Cheng, "A biterm topic model for short texts," in Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013, pp. 1445–1456.
[10] R. Zhang, Z. Wang, and D. Mai, "Building emotional conversation systems using multi-task seq2seq learning," in National CCF Conference on Natural Language Processing and Chinese Computing. Springer, 2017, pp. 612–621.
[11] Z. Song, X. Zheng, L. Liu, M. Xu, and X.-J. Huang, "Generating responses with a specific emotion in dialog," in Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019, pp. 3685–3695.
[12] P. Ekman, W. V. Friesen, and P. Ellsworth, Emotion in the Human Face: Guidelines for Research and an Integration of Findings. Pergamon, 1972.
[13] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15] P. Ke, J. Guan, M. Huang, and X. Zhu, "Generating informative responses with controlled sentence function," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1499–1508.
[16] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[17] C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W.-Y. Ma, "Topic aware neural response generation," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[18] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," Computer Science, 2015.
[19] A. Graves, S. Fernández, and J. Schmidhuber, "Bidirectional LSTM networks for improved phoneme classification and recognition," in International Conference on Artificial Neural Networks. Springer, 2005, pp. 799–804.
[20] Z. Lin, M. Feng, C. N. D. Santos, Y. Mo, and Y. Bengio, "A structured self-attentive sentence embedding," 2017.
[21] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[22] O. Vinyals and Q. Le, "A neural conversational model," arXiv preprint arXiv:1506.05869, 2015.
[23] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau, "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," arXiv preprint arXiv:1603.08023, 2016.
[24] L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin, "Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation," arXiv preprint arXiv:1607.00970, 2016.
[25] J. L. Fleiss and J. Cohen, "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Educational and Psychological Measurement, vol. 33, no. 3, pp. 613–619, 1973.