Towards Controlled and Diverse Generation of Article Comments

Linhao Zhang1, Houfeng Wang1
1MOE Key Lab of Computational Linguistics, Peking University, Beijing, 100871, China
2Baidu Inc., China
{zhanglinhao, wanghf}@pku.edu.cn

Abstract

Much research in recent years has focused on automatic article commenting. However, few previous studies focus on the controllable generation of comments. Besides, they tend to generate dull and commonplace comments, which further limits their practical application. In this paper, we make the first step towards controllable generation of comments by building a system that can explicitly control the emotion of the generated comments. To achieve this, we associate each emotion category with an embedding and adopt a dynamic fusion mechanism to fuse this embedding into the decoder. A sentence-level emotion classifier is further employed to better guide the model to generate comments expressing the desired emotion. To increase the diversity of the generated comments, we propose a hierarchical copy mechanism that allows our model to directly copy words from the input articles. We also propose a restricted beam search (RBS) algorithm to increase intra-sentence diversity. Experimental results show that our model can generate informative and diverse comments that express the desired emotions with high accuracy.

[Figure 1 contents]
Article (truncated): With the popularity of the TV series Ode to Joy, the actors' salaries were revealed recently. According to the news, Tao's salary is about 400K per episode, while Jiang Xin is paid about 100K to 300K. ..., who plays Qiu Yingying in the play, also earns about 100K to 300K. Compared with them, Wang Ziwen and Guan Muer are relatively not so famous, and their salaries are only about tens of thousands.
Seq2Seq: I love this movie.
Ours - Disgust: Does Wang Ziwen have any acting skills? Like: Liu Tao is so beautiful. Happiness: I am so relieved to see such news. Sadness: I am wordless after watching Ode to Joy. Anger: What's wrong with Jiang Xin?

Figure 1: Comparison of the output comments of the Seq2Seq baseline and our model. We can see that the baseline cannot control the expressed emotion, and the generated comment is dull and irrelevant to the article (red colored). By contrast, our model is emotion-controllable and can generate more diverse and relevant comments, thanks to the hierarchical copy mechanism (blue colored).

1 Introduction

Automatic article commenting is a valuable yet challenging task. It requires the machine first to understand the articles and then to generate coherent comments. The task was formally proposed by (Qin et al., 2018), along with a large-scale dataset. Much research has since focused on this task (Lin et al., 2018b; Ma et al., 2018a; Li et al., 2019).

The ability to generate comments is especially useful for online news platforms, for the comments can encourage user engagement and interactions (Qin et al., 2018). Besides, an automatic commenting system also enables us to build a comment-assistant, which can generate candidate comments for users, who may later select one and refine it (Zheng et al., 2018).

In addition to its practical importance, this task also has significant research value. It can be seen as a natural language generation (NLG) task, yet unlike machine translation or text summarization, the comments can be rather diverse: for the same article, there can be numerous appropriate comments written from different angles. In this sense, the task is similar to dialogue generation, yet because the input article is much longer and more complex than dialogue text, it is more challenging.

Despite the importance of this task, it is still relatively new and not well studied. One of the major limitations of current article commenting systems is that the generation process is not controllable, meaning that the comments are conditioned entirely on the articles, and users cannot further control the features of the comments. In this work, we make the first step towards controllable article commenting and propose a model to control the emotion of the generated comments, which has wide practical application. Take the comment-assistant (Zheng et al., 2018) as an example: it is far more desirable to have candidate comments that each express a different emotion, so that users can choose the one that matches their own emotion towards the article.

Another problem of current commenting systems arises from the limitations of the Seq2Seq framework (Sutskever et al., 2014), which is known to generate dull responses that are irrelevant to the input articles (Li et al., 2015; Wei et al., 2019; Shao et al., 2017). As shown in Figure 1, the Seq2Seq baseline generates "I love this movie" for the input article, despite the fact that Ode to Joy is not a movie but a TV series.

In this work, we propose a controllable article commenting system that can generate diverse and relevant comments. We first create two emotional datasets based on (Qin et al., 2018). The first one is a fine-grained dataset that covers {Anger, Disgust, Like, Happiness, Sadness}, and the second one is a coarse-grained dataset that contains only {Positive, Negative}. To incorporate the emotion information into the decoding process, we first associate each emotion with an embedding and then feed the emotion embedding into the decoder at each time step. A dynamic fusion mechanism is employed to utilize the emotion information selectively at each decoding step. Besides, the decoding process is further guided by a sequence-level emotion loss term to increase the intensity of emotional expression.

To generate diverse comments, we propose a hierarchical copy mechanism to copy words from the input articles. This is based on the observation that we can discourage the generation of dull comments like "I don't know" by enhancing the relevance between comments and articles. In this way, the inter-sentence diversity is increased. Moreover, we observe that the repetition problem of the Seq2Seq framework (See et al., 2017; Lin et al., 2018a) can be seen as a lack of intra-sentence diversity, and we further adopt a restricted beam search (RBS) algorithm to tackle this problem.

To sum up, our contributions are two-fold:

1) We make the first step towards building a controllable article commenting system by injecting emotion into the decoding process. Features other than emotion can be controlled in a similar way.

2) We propose the hierarchical copy mechanism and the RBS algorithm to increase inter- and intra-sentence diversity, respectively.

2 Method

The overall architecture of our proposed model, the Controllable Commenting System (CCS), is shown in Figure 2. We describe the details in the following subsections.

[Figure 2: architecture diagram. Left: the hierarchical encoder, with a word-level LSTM over each sentence and a sentence-level LSTM over the sentence embeddings. Right: the LSTM decoder, whose input is fused with the emotion embedding through the dynamic fusion gates; the top-k step embeddings feed the emotion loss, and the hierarchical copy attention γ is mixed with P_vocab through p_gen to produce P_final.]
Figure 2: The architecture of CCS. On the left is the encoder, which encodes articles at the word and sentence level. On the right is the decoder, with emotion information dynamically fused into the decoding process. We add an emotion loss term to further bias the decoding process. Besides, a hierarchical copy mechanism is proposed to improve the diversity of the generated comments.

2.1 Task Definition

The task can be formulated as follows: given an article D and an emotion category χ, the goal is to generate an appropriate comment Y that expresses the emotion χ.

Specifically, D consists of N_D sentences. Each sentence consists of T words, where T can vary among different sentences. For each word, we use e_i to denote its word embedding.

2.2 Basic Structure: Hierarchical Seq2Seq

We encode an article by first building representations of its sentences and then aggregating them into an article representation.

Word-Level Encoder - We first encode the article D at the word level. For each sentence in D, the word-level LSTM encoder LSTM_w^enc reads the words sequentially and updates its hidden state h_i. Formally,

    h_i = LSTM_w^enc(e_i, h_{i-1})    (1)

The last hidden state, h_T, stores the information of the whole sentence and thus can be used to represent that sentence:

    ê = h_T    (2)

Sentence-Level Encoder - Given the sentence embeddings (ê_1, ê_2, ..., ê_{N_D}), we then encode the article at the sentence level:

    g_i = LSTM_s^enc(ê_i, g_{i-1})    (3)

where LSTM_s^enc is the sentence-level LSTM encoder and g_i is its hidden state. The last hidden state, g_{N_D}, aggregates the information of all the sentences and is later used to initialize the decoder hidden state.

Decoder - Similarly, the decoder LSTM LSTM^dec updates its hidden state s_t at time step t:

    s_t = LSTM^dec(e_t, s_{t-1})    (4)

where e_t is the embedding of the previous word. While training, this is the previous word of the reference comment; at test time it is the previous word emitted by the decoder. The attention mechanism (Luong et al., 2015) is also applied: at each decoding step, we compute attention weights between the current decoder hidden state and all the outputs of the sentence-level encoder. We then compute the context vector c_t as the weighted sum of the sentence-level encoder outputs and obtain our attention vector a_t for the final prediction. Formally,

    m_{t,i} = softmax(s_t^T W_g g_i)
    c_t = Σ_i m_{t,i} g_i
    a_t = tanh(W_a [c_t; s_t] + b_a)    (5)
    P_vocab = softmax(W_p a_t)

where W_a, W_g, W_p and b_a are model parameters and [;] denotes vector concatenation.

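As a concrete illustration of the hierarchical encoder in Equations (1)-(3), the following PyTorch-style sketch encodes a single article. It is a minimal sketch under our own naming (HierEncoder, word_lstm, sent_lstm) and shape assumptions, not the authors' released code; padding, masking, and the two-layer configuration used in the experiments are omitted.

```python
# Minimal sketch of the hierarchical encoder of Eqs. (1)-(3).
# Names and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class HierEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # LSTM_w^enc, Eq. (1)
        self.sent_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)  # LSTM_s^enc, Eq. (3)

    def forward(self, article):
        # article: LongTensor of word ids, shape (n_sents, max_words)
        e = self.embed(article)              # (n_sents, max_words, emb_dim)
        _, (h_T, _) = self.word_lstm(e)      # h_T: (1, n_sents, hid_dim), last state per sentence
        sent_emb = h_T                       # Eq. (2): each sentence represented by its last state
        g, (g_last, _) = self.sent_lstm(sent_emb)  # Eq. (3): sentence-level states g_i
        # g: all sentence-level states (used for attention and the copy mechanism);
        # g_last: aggregate of the article, used to initialize the decoder state.
        return g.squeeze(0), g_last.view(-1)
```

The returned sentence-level states would feed the attention of Equation (5) and the copy attention of Section 2.5, while the final state initializes the decoder.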
2.3 Dynamic Fusion Mechanism

To control the emotions expressed by the comments, we first associate each emotion with an embedding v_χ. For clarification, we note that χ is the user-specified emotion category and v_χ is obtained by a trainable embedding layer. We then add the emotion vector v_χ to the decoder input e_t at each decoding step, changing Equation (4) into:

    s_t = LSTM^dec(v_χ + e_t, s_{t-1})    (6)

In this way, emotion information is injected into the decoding process at each decoding step. However, this method is overly simplistic. It uses the emotion information indiscriminately at each decoding step, yet not all steps require the same amount of emotion information. Simply using the same emotion embedding throughout the whole generation process may sacrifice grammaticality (Ghosh et al., 2017).

To solve this issue, we adopt a dynamic fusion mechanism that can update and selectively utilize the emotion embedding at each decoding step. Besides, given that the same word can convey different emotions, we also alter the word embedding according to the emotion to be expressed. Specifically, we first employ an emotion gate z_t^e and a word gate z_t^w:

    z_t^e = σ(W_e [v_χ; s_{t-1}] + b_e)    (7)
    z_t^w = σ(W_w [v_χ; s_{t-1}] + b_w)    (8)

where σ is the sigmoid function and W_e, W_w, b_e, b_w are model parameters. The resulting vectors z_t^e and z_t^w have the same dimension as v_χ and s_{t-1}. They are used to control the amount of emotion information used at the current decoding step, changing Equation (6) into:

    s_t = LSTM^dec(v_χ ⊙ z_t^e + e_t ⊙ z_t^w, s_{t-1})    (9)

where ⊙ is element-wise multiplication. Experiments show that the dynamic fusion mechanism leads to better generation quality and higher emotion accuracy.
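To make the gating in Equations (7)-(9) concrete, here is a minimal sketch of the dynamic fusion step, assuming the emotion embedding, word embedding, and decoder state share one dimension; the module and attribute names (DynamicFusion, emo_gate, word_gate) are our own illustration.

```python
# Sketch of the dynamic fusion gates of Eqs. (7)-(9); names are illustrative.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.emo_gate = nn.Linear(2 * dim, dim)   # W_e, b_e in Eq. (7)
        self.word_gate = nn.Linear(2 * dim, dim)  # W_w, b_w in Eq. (8)

    def forward(self, v_chi, e_t, s_prev):
        # v_chi: emotion embedding, e_t: previous-word embedding,
        # s_prev: previous decoder state s_{t-1}; all of shape (batch, dim)
        cat = torch.cat([v_chi, s_prev], dim=-1)
        z_e = torch.sigmoid(self.emo_gate(cat))    # Eq. (7): emotion gate
        z_w = torch.sigmoid(self.word_gate(cat))   # Eq. (8): word gate
        return v_chi * z_e + e_t * z_w             # fused decoder input of Eq. (9)
```

The fused vector then replaces v_χ + e_t as the input to the decoder LSTM at step t, as in Equation (9).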
2.4 Emotion Loss

To further guide the model to generate comments expressing the desired emotion, we use a sequence-level emotion classifier (Song et al., 2019) to guide the generation process.

We first approximate the step embedding E_t using the weighted sum of the embeddings of the words with the top-K probabilities, where K is a hyperparameter:

    E_t = Σ_{w_i ∈ TopK} P(w_i) · Emb(w_i)    (10)

Then, we calculate the emotion distribution using the average of the step embeddings over all time steps, based on which we compute the final emotion classification loss:

    G(E|Y) = softmax(W_g · (1/T) Σ_{t=1}^{T} E_t)    (11)
    L_emo = −P(E) log(G(E|Y))    (12)

where W_g is a model parameter and P(E) is the gold emotion distribution. Intuitively, the emotion loss forces a strong dependency between the input emotion category and the generated comment, further biasing the comments towards the desired emotion.

2.5 Hierarchical Copy Mechanism

A key limitation of Seq2Seq models is that they tend to generate dull and commonplace text such as "I don't know" and "I like it". We observe that we can encourage the generation of diverse comments by enhancing the relevance between comments and source articles.

To achieve this, we adopt the copy mechanism (See et al., 2017), which allows for copying words from the source text, and we adapt it to take the hierarchical structure of articles into account. Specifically, we first compute an attention distribution β^t over sentences:

    β_i^t = softmax(s_t^T W_x g_i)    (13)

where W_x is a model parameter, s_t is the decoder hidden state at step t, and g_i is the i-th sentence representation. We then compute the attention over individual words and rescale it with the sentence-level attention β^t:

    α_{ij}^t = softmax(s_t W_y h_i^j)    (14)
    γ_{ij}^t = β_i^t · α_{ij}^t    (15)

where W_y is a model parameter and h_i^j is the j-th word of the i-th sentence. In this way, words from more informative sentences are rewarded, and words from less informative ones are discouraged.

Finally, we obtain the context vector c_t as the weighted sum over all words and compute the attention vector a_t for the final prediction:

    c_t = Σ_{i,j} γ_{ij}^t h_i^j
    a_t = tanh(W_a [c_t; s_t] + b_a)    (16)
    P_vocab = softmax(a_t)

where W_a and b_a are model parameters, and P_vocab is a probability distribution over all words in the vocabulary. The attention distribution γ_{ij}^t and the vocabulary distribution P_vocab are then weighted and summed to obtain the final word distribution:

    p_gen = σ(w_a^T a_t + w_s^T s_t + w_e^T e_t + b_ptr)    (17)
    P(w) = p_gen · P_vocab(w) + (1 − p_gen) Σ_{ij: w_ij = w} γ_{ij}^t    (18)

Intuitively, p_gen can be seen as a soft switch that chooses between generating a word from the vocabulary and copying a word from the source article. Experimental results show that our hierarchical copy mechanism can effectively improve comment diversity.
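The sketch below illustrates one decoding step of the hierarchical copy attention in Equations (13)-(18). The tensor shapes, the explicit renormalization of γ, and the function name are our assumptions rather than details given in the paper; p_gen is assumed to be computed separately as in Equation (17), and padding positions are not masked.

```python
# Sketch of one decoding step of the hierarchical copy mechanism (Eqs. 13-18).
# Shapes and names are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def hierarchical_copy_step(s_t, g, h, word_ids, W_x, W_y, vocab_logits, p_gen):
    # s_t: (dim,) decoder state; g: (n_sents, dim) sentence states;
    # h: (n_sents, n_words, dim) word-level states; word_ids: (n_sents, n_words) LongTensor;
    # vocab_logits: (vocab_size,) unnormalized scores from a_t; p_gen: scalar in [0, 1]
    beta = F.softmax(g @ (W_x @ s_t), dim=0)       # Eq. (13): sentence-level attention
    alpha = F.softmax(h @ (W_y @ s_t), dim=1)      # Eq. (14): word attention within each sentence
    gamma = beta.unsqueeze(1) * alpha              # Eq. (15): rescale by sentence attention
    gamma = gamma / gamma.sum()                    # renormalize over all words (our assumption)

    P_vocab = F.softmax(vocab_logits, dim=0)       # Eq. (16): generation distribution
    # Eq. (18): mix the generation distribution with the copy distribution
    P_final = (p_gen * P_vocab).index_add(
        0, word_ids.reshape(-1), (1 - p_gen) * gamma.reshape(-1))
    return P_final
```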
2.6 Restricted Beam Search

The above method can improve the inter-sentence diversity of comments. However, the Seq2Seq framework has also long been known to generate repeated and redundant words and phrases, as in the example from (Lin et al., 2018a):

    Fatah officially officially elects Abbas as candidate for candidate.

We regard this problem as a lack of intra-sentence diversity. To mitigate this issue, we propose the Restricted Beam Search (RBS) algorithm. Based on the observation that ground-truth comments seldom contain the same n-gram multiple times, we explicitly lower the probability of those words that would create repetitive n-grams at each decoding step. Specifically, we lower the probability of generating w_t by η if outputting w_t creates an n-gram that already exists in the previously decoded sequence of the current beam. Formally, the probability of w_t is modified as follows:

    P(w_t) = P(w_t)                   if w_t does not create a repetitive n-gram
             max(P(w_t) − π·η, 0)     otherwise                                    (19)

where η is a hyperparameter and π is the number of times that the n-gram created by w_t has occurred in the previously decoded sequence of the current beam. Despite its brevity, we found that this procedure can substantially improve intra-sentence diversity.

Note that the RBS algorithm is notably different from (Paulus et al., 2017). They set P(w_t) = 0 during beam search when outputting w_t would create an n-gram that already exists. We, on the other hand, lower the probability of w_t by a certain amount. We believe that forcing the decoder to never output the same n-gram more than once is too strict, for repetitive trigrams do exist in natural language. In this sense, our RBS algorithm can be seen as a soft version of (Paulus et al., 2017) that enables more flexibility.
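The following sketch shows how the penalty of Equation (19) could be applied to the candidate probabilities of a single beam at one decoding step; the function and argument names are our own, and only the penalty rule itself follows the paper (Section 3.2 reports that n = 1 and η = 0.5 already suffice in practice).

```python
# Sketch of the RBS probability adjustment of Eq. (19) for a single beam.
# Names are illustrative; only the penalty rule follows the paper.
from collections import Counter

def rbs_adjust(probs, prefix, candidate_ids, n=1, eta=0.5):
    # probs: dict mapping candidate word id -> probability at this step
    # prefix: list of word ids already decoded in the current beam
    # n: n-gram order; eta: penalty per previous occurrence of the n-gram
    seen = Counter(tuple(prefix[i:i + n]) for i in range(len(prefix) - n + 1))
    adjusted = {}
    for w in candidate_ids:
        ngram = tuple(prefix[len(prefix) - n + 1:] + [w]) if n > 1 else (w,)
        pi = seen[ngram]                       # times this n-gram already occurred
        adjusted[w] = probs[w] if pi == 0 else max(probs[w] - pi * eta, 0.0)
    return adjusted
```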
3 Experiments

3.1 Dataset

Currently there is no off-the-shelf article commenting dataset annotated with emotions, so we created our own datasets based on the Tencent News dataset released by (Qin et al., 2018). For the fine-grained setting, which involves {Anger, Disgust, Like, Happiness, Sadness}, we trained a Bi-LSTM emotion tagger on the NLPCC dataset[1]. For the coarse-grained setting, which involves {Positive, Negative}, we used a well-trained emotion tagger provided by Baidu[2], which classifies a text as positive or negative with high accuracy. We then annotated the Tencent News dataset with the two taggers, creating two emotional commenting datasets, called Tencent-coarse and Tencent-fine respectively[3].

The original Tencent News dataset contains 169,060 articles in the training set, 4,455 in the development set and 1,397 in the test set. There are several comments for the same article, and many of these comments are short and meaningless. To promote the quality of the generated comments, we trained our model with only the one comment that has the most upvotes for each article. We found that this cleaning procedure led to better generation quality and expedited training substantially.

3.2 Training Procedure

We trained our models on a single NVIDIA TITAN RTX GPU. The LSTMs (Hochreiter and Schmidhuber, 1997) have 512-dimensional hidden states. The dimensions of both the word embeddings and the sentiment embeddings are 512. Dropout (Hinton et al., 2012) is used with the dropout rate set to 0.3. The number of layers of the LSTM encoder/decoder is set to 2. The batch size is set to 64.

We use Adam (Kingma and Ba, 2014) with learning rate 10^-3, β1 = 0.9, β2 = 0.98 and ε = 10^-9. We obtain the pretrained word embeddings by training an unsupervised word2vec (Mikolov et al., 2013) model on the training set. We compare a char-based model with a word-based model; for the latter we use jieba[4] for Chinese word segmentation (CWS) to preprocess the text. We find that the char-based model gives better performance, so we adopt this setting. This result is in line with (Meng et al., 2019).

We share the vocabulary between articles and comments, and the vocabulary size is restricted to 5,000. During training, we truncated the articles to 30 sentences and limited the length of each sentence to 80 tokens. At test time our comments were produced using restricted beam search with beam size 5. We found that setting n to 1 and η to 0.5 is enough for effective repetition reduction. ξ, τ, N and K are set to 0.01, 1, 20 and 50, respectively.

3.3 Systems for Comparison

As we have mentioned, this is the first work to consider the emotion factor for article commenting, and we did not find closely related baselines in the literature. Nevertheless, we first chose two baselines that are widely used in NLG tasks:

• Seq2Seq - In this paper we use the Seq2Seq (Sutskever et al., 2014) model enhanced with the attention mechanism (Luong et al., 2015).

• Transformer - The Transformer (Vaswani et al., 2017) also follows the encoder-decoder paradigm but relies on self-attention instead of RNNs.

[1] http://tcci.ccf.org.cn/conference/2014/
[2] https://ai.baidu.com/tech/nlp/sentiment_classify
[3] We will release the two datasets to promote future research.
[4] https://github.com/fxsjy/jieba

Model                               | BLEU | METEOR | ROUGE-L |  D1  |  D2   |  D3   | Acc-C | Acc-F
Seq2Seq (Sutskever et al., 2014)    | 5.15 |  9.43  |  22.67  | 4.51 | 16.48 | 23.56 |   -   |   -
Transformer (Vaswani et al., 2017)  | 4.23 |  9.89  |  24.09  | 3.98 | 15.23 | 21.96 |   -   |   -
Flat-Emo (Huang et al., 2018)       | 4.13 |  7.75  |  23.40  | 3.38 | 10.85 | 15.99 | 76.21 | 43.92
Hier-Emo (Huang et al., 2018)       | 5.02 |  8.97  |  24.14  | 3.54 | 13.01 | 20.25 | 78.35 | 44.26
CCS                                 | 7.54 | 10.50  |  25.29  | 5.36 | 19.74 | 29.36 | 83.10 | 48.34

Table 1: Results for generation quality (BLEU, METEOR, ROUGE-L), diversity (D1-D3) and emotion accuracy (Acc-C, Acc-F), in %.

However, these two baselines cannot control the expressed emotion. (Huang et al., 2018) proposed to generate dialogue with expressed emotions; we adapted their approach and created two new models that are emotion-controllable:

• Flat-Emo - The article is represented as a flat structure, and the emotion embedding serves as an input to every decoding step of a Seq2Seq network, as in (Huang et al., 2018).

• Hier-Emo - Similar to Flat-Emo, except that the article is encoded in a hierarchical manner.

4 Results

4.1 Generation Quality

Automatic metrics such as BLEU, METEOR, and ROUGE are widely used for NLG tasks (Lin et al., 2018a; Rose, 2016; Song et al., 2019). We adopt these metrics to evaluate generation quality, that is, whether the comments are relevant and grammatical.

The results can be seen in Table 1. The first observation is that the first three baselines give similar and unsatisfactory results. This may be due to the general limitations of a non-hierarchical structure. Hier-Emo, on the other hand, makes use of this structural information and hence gives better performance. However, it still suffers from the general problems of the Seq2Seq framework. Besides, its emotion information is indiscriminately fused into the decoding process, which also hurts generation quality (Ghosh et al., 2017).

Compared to the baselines, our model gives substantially better performance in terms of all metrics. One reason is that we adopt the hierarchical copy mechanism, which improves the coherence between comments and articles. Besides, by adopting the dynamic fusion mechanism, our model can use the emotion information selectively, which is also beneficial to generation quality.

For clarification, we note that at test time the input emotion tags are not used as external knowledge. That is, the specified emotion categories are manually designed rather than reflecting the emotions of the gold comments, so the comparison with models that cannot use emotion information (e.g., Seq2Seq and Transformer) would be fair. The reported results are averaged over all emotion categories. Although this setting makes the task harder, we believe it is much closer to practical scenarios.

4.2 Emotion Accuracy

To measure whether the model can generate comments expressing the desired emotion, we adopt emotion accuracy, defined as the agreement between the desired emotion and the emotion predicted by our emotion tagger. Acc-C and Acc-F report the coarse-grained and fine-grained results, respectively. Of the four baselines, only Flat-Emo and Hier-Emo are emotion-controllable. The results are shown in Table 1.

Our first observation is that these two baselines give similar results on this metric, with Hier-Emo performing slightly better. Compared to them, our model performs significantly better under both the coarse-grained and the fine-grained setting, with over 3% absolute improvement.

More detailed analysis validates the effectiveness of the dynamic fusion mechanism and the emotion loss term. As shown in Table 2, using a simple fusing method (CCS-Emo) results in a drastic drop in emotion accuracy, almost to the level of our baselines.

4.3 Diversity

To measure whether the hierarchical copy mechanism can promote the diversity of the generated comments, we report the proportion of novel n-grams in all the generated texts, denoted Dn. This metric has been widely used to evaluate the diversity of generated text in a variety of NLG tasks (Lin et al., 2018a; Chen and Bansal, 2018; Li et al., 2015).

           |      Coarse-grained      |                  Fine-grained
Model      |  Pos.  |  Neg.  |  Ave.  | Disgust | Anger | Sadness | Like  | Happiness | Ave.
CCS-Emo    | 79.89  | 77.09  | 78.04  |  59.24  | 31.38 |  41.02  | 63.54 |   39.58   | 46.95
CCS        | 81.66  | 84.54  | 83.10  |  61.39  | 33.52 |  43.19  | 69.72 |   33.88   | 48.34

Table 2: Effectiveness of our emotion-control approach (%).

Model        |  D1  |  D2   |  D3
CCS w/o RBS  | 3.28 | 14.25 | 22.32
CCS w/o HC   | 4.05 | 17.80 | 29.17
CCS          | 5.36 | 19.74 | 29.36

Table 3: Effectiveness of the RBS and hierarchical copy (HC) mechanisms on the diversity of comments (%).

As shown in Table 1, all three diversity metrics improve drastically. In particular, CCS beats Transformer on D3 by over 7% (absolute), showing that CCS can effectively generate diverse comments compared to strong baselines.

From Table 3, we can see that after removing the hierarchical copy mechanism, the diversity drops significantly (over 2% absolute for D2). Besides, the RBS algorithm also contributes to comment diversity: after removing RBS, the diversity again drops significantly. To examine the effectiveness of RBS more closely, we report the percentage of repetitive n-grams within comments. From Figure 3, we can observe that the repetition problem (lack of intra-sentence diversity) is rather severe when decoding with the normal beam search algorithm (CCS w/o RBS), with repetitive 3-grams and 4-grams almost ten times more frequent than in the references. Despite the brevity of our RBS algorithm, the repetition problem is almost completely eliminated, with repetitive 3-grams and 4-grams only slightly more frequent than in the reference comments.

[Figure 3: bar chart of the percentage of duplicate 1-grams to 4-grams within comments, for CCS w/o RBS, CCS, and the reference comments.]
Figure 3: Restricted beam search (RBS) helps to mitigate the repetition problem. Comments from normal beam search contain many duplicated n-grams, while our RBS algorithm produces a number similar to the reference comments.
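For reference, the sketch below computes the two n-gram statistics used in this section in the way they are commonly defined: Dn as the proportion of distinct n-grams across all generated comments, and the within-comment duplicate rate plotted in Figure 3. The exact implementation used in the paper is not given, so this is our assumption of a standard Distinct-n-style computation; function names are our own.

```python
# Sketch of Dn (distinct n-gram proportion) and the duplicate n-gram rate.
# Definitions follow common practice and may differ in detail from the paper's.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(comments, n):
    # comments: list of token lists for all generated comments
    all_ngrams = [g for c in comments for g in ngrams(c, n)]
    return 100.0 * len(set(all_ngrams)) / max(len(all_ngrams), 1)

def duplicate_rate(comments, n):
    # share of n-grams that repeat an earlier n-gram within the same comment
    dup, total = 0, 0
    for c in comments:
        counts = Counter(ngrams(c, n))
        total += sum(counts.values())
        dup += sum(v - 1 for v in counts.values())
    return 100.0 * dup / max(total, 1)
```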

4.4 Case Study

To gain an insight into how well the emotion is expressed in the generated comments, we provide an example in Figure 4. We can see that the Seq2Seq baseline cannot control the expressed emotion, and its comment is not very relevant to the article. By contrast, our model can generate far more relevant comments. The main characters' names, Zhang Bozhi and Nicholas Tse, are directly copied into the comments, thanks to the hierarchical copy mechanism.

As for the emotion expressed, we can see that many emotional words appear in the generated comments, like "pathetic" and "brute". Besides, there are also comments that do not contain any explicit emotional words, yet still express the specified emotions.

However, we do observe some limitations of our model. The emotion expressed is not always precise. We believe a major reason is that the dataset we use to train our emotion tagger is noisy. Besides, some comments are not very informative. We believe a possible solution might be to introduce external knowledge. We leave these to future work.

[Figure 4 contents - Examples of Generated Comments]

(a) Article (truncated, translated): ... announced his wedding on May 1. His love story with Zhang Bozhi has finally come to an end. In order to make up for their sons, he bought 300 million in teaching funds. Since the 46-year-old ... and Nicholas Tse got back together, their actions have attracted much attention. They will hold a wedding in Dali. The three people's love triangle is finally over! Fortunately for Zhang Bozhi, although Nicholas Tse and Faye Wong are together, she still has two sons to accompany her.
Seq2Seq: What's on your mind?
CCS-C - positive: Come on! Zhang Bozhi! negative: Zhang Bozhi is pathetic.
CCS-F - Disgust: I don't know what's on your mind. Like: Nicholas Tse is so great. Happiness: I am relieved to see Nicholas Tse. Sadness: I don't understand why so many people are cursing Nicholas Tse. Anger: The man is a brute!

(b) Article (truncated, translated): There were four people in TFboys! In the beginning, he refused to join, and now he regrets it! Cai Xukun's parents said that they did not want to delay his studies, so they sent Cai Xukun to study abroad. In fact, they were just not sure about the future of TFboys; studying abroad was just an excuse. It is said that Cai Xukun's ability has been underestimated and he won't be dwarfed by Wang Yuan, Wang Junkai and Xi Qian.
Seq2Seq: I want to know what kind of guy he is.
CCS-C - positive: TFboys is my favourite. negative: Cai Xukun is really ugly.
CCS-F - Disgust: Wang Junkai is rubbish. Like: Actually, I like him. Happiness: I think it's pretty good. Sadness: I wonder why so many people like Wang Junkai? Anger: ? You don't understand why?

Figure 4: Case study.

5 Related Work

Automatic Article Commenting - (Qin et al., 2018) formally proposed the automatic article commenting task, along with a large-scale annotated article commenting dataset, making data-driven seq2seq approaches to this problem viable. Much research has since focused on this area (Lin et al., 2018b; Ma et al., 2018b). However, none of these models are controllable, which limits their practical application. They also suffer from the general limitations of the Seq2Seq framework, like repetition and lack of diversity.

Controllable Generation - (Hu et al., 2017) is one of the earliest works to consider the problem of controllable generation. However, they aimed to generate text conditioned only on representation vectors, which is significantly different from the task of automatic article commenting. Our work is close to (Huang et al., 2018), which first introduced the emotion factor into the dialogue generation process. However, as we have discussed in the Introduction, the problem of article commenting is inherently more challenging. Besides, they simply concatenated the emotion embedding to the decoder input at every time step, which can be seen as a variant of our baseline Flat-Emo. As we have shown in this paper, this simple approach gives unsatisfactory results.

Copy Mechanism - (See et al., 2017) proposed the Pointer-Generator, which copies words from the source text for summarization. However, they regard the input article as a flat structure, ignoring the hierarchical structure of the document altogether. Based on this observation, we propose the hierarchical copy mechanism to better suit this task.

Restricted Beam Search - Our RBS algorithm can be seen as a soft version of (Paulus et al., 2017). They set P(y_t) = 0 during beam search when outputting y_t would create a trigram that already exists. We, on the other hand, lower the probability of y_t by a certain amount rather than setting it to 0. We believe that forcing the decoder to never output the same trigram more than once is overly strict, for repetitive trigrams do exist in natural language.

6 Conclusion

In this paper, we make the first step towards controlled and diverse article commenting. We build two emotional datasets to validate our approach, and propose a dynamic fusion mechanism to effectively control the expressed emotions of comments. Besides, our model can also generate more diverse comments thanks to the hierarchical copy mechanism and RBS. Experimental results show that our model beats strong baselines.

References

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.

Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-LM: A neural language model for customizable affective text generation. arXiv preprint arXiv:1704.06851.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.

Chenyang Huang, Osmar R. Zaïane, Amine Trabelsi, and Nouha Dziri. 2018. Automatic dialogue generation with expressed emotions. In NAACL, pages 49–54.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Wei Li, Jingjing Xu, Yancheng He, Shengli Yan, Yunfang Wu, et al. 2019. Coherent comment generation for Chinese articles with a graph-to-sequence model. arXiv preprint arXiv:1906.01231.

Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018a. Global encoding for abstractive summarization. arXiv preprint arXiv:1805.03989.

Zhaojiang Lin, Genta Indra Winata, and Pascale Fung. 2018b. Learning comment generation by leveraging user-generated data. arXiv preprint arXiv:1810.12264.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Shuming Ma, Lei Cui, Furu Wei, and Xu Sun. 2018a. Unsupervised machine commenting with neural variational topic model. arXiv preprint arXiv:1809.04960.

Shuming Ma, Lei Cui, Furu Wei, and Xu Sun. 2018b. Unsupervised machine commenting with neural variational topic model. arXiv preprint arXiv:1809.04960.

Yuxian Meng, Xiaoya Li, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. 2019. Is word segmentation necessary for deep learning of Chinese representations?

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Lianhui Qin, Lemao Liu, Wei Bi, Yan Wang, Xiaojiang Liu, Zhiting Hu, Hai Zhao, and Shuming Shi. 2018. Automatic article commenting: the task and dataset. arXiv preprint arXiv:1805.03668.

Jean Rose. 2016. Get to the point. Nursing Standard, 25(22):61–61.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. arXiv preprint arXiv:1701.03185.

Zhenqiao Song, Xiaoqing Zheng, Lu Liu, Mu Xu, and Xuan-Jing Huang. 2019. Generating responses with a specific emotion in dialog. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 3685–3695.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Bolin Wei, Shuai Lu, Lili Mou, Hao Zhou, Pascal Poupart, Ge Li, and Zhi Jin. 2019. Why do neural dialog systems generate short and meaningless replies? A comparison between dialog and translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7290–7294. IEEE.

Hai-Tao Zheng, Wei Wang, Wang Chen, and Arun Kumar Sangaiah. 2018. Automatic generation of news comments based on gated attention neural networks. IEEE Access, 6:702–710.