arXiv:1908.07026v1 [cs.LG] 19 Aug 2019 to opnn,ie,tedcdr upt sum- outputs decoder, the i.e., component, ation e tal. et See ( progress of lot a made since human the of minds text the input in readers. the nonetheless in are specified but explicitly not are — that back- knowledge other its and document, story ground the of understanding deeper al. et Liu new 2016 introducing and ( rephrasing concepts/words generate via to summary important aims challeng- a summarization more the the abstractive preserve text, ing original to the is in messages goal primary the al. et Dorr rmteiptdcmn ( sentences) document and input select- phrases, the words, on from (e.g., parts focuses ing summarization Extractive Introduction 1 erlbsdasrciesmaiainhas summarization abstractive Neural-based mrvdRUEsoe hncmae to compared models. state-of-the-art when Wiki- scores strongly ROUGE the attain improved we and Concretely, Mail datasets. CNN/Daily How approach the proposed both the on of advantage the vali- date empirically We level. corpus ad- document at to captured access statistics have co-occurrence word to particular, ditional decoder In the bias enable words. to they generate used to be decoder se- can the global that more information reveals topic mantic LDA, docu- a as by the identified such of topics, model topics latent The latent the ment. and the text both on input conditioning sum- by generated output is the mary where decoder new a we propose paper, this In models. learning to-sequence sequence- abstractive attention-based in with made summarization been has progress Steady ; e tal. et See , , oi umne eeao o btatv Summarization Abstractive for Generator Augmented Topic , 2018 2017 1 2003 alm hn3,feisha}@usc.edu zhan734, {ailem, .Altoeeti rae and broader a entail those All ). .B ag,telnug gener- language the large, By ). ege al. et Zeng ; , alpt tal. et Nallapati Abstract 2017 1 nvriyo otenClfri,LsAgls CA Angeles, Los California, Southern of University eis Ailem Melissa ; , alpt tal. et Nallapati 2016 alse al. et Paulus uice al. et Kupiec , ; alpt tal. et Nallapati 2017 Zeitgeist .While ). , 1 , oe Zhang Bowen , , 2016 2017 1999 — ; ; ; , words. hntetpc r nrdcd oevr the Moreover, introduced. are topics the when ie oesfrlnug generation. language for models vised La- as such modeling, topic from identified Topics ocpspoievlal nutv isfrsuper- for and bias inductive words valuable of provide usages concepts Such coher- groups. semantically conceptual of ent mixtures with documents describe and co-occurrence corpus- words of capture patterns level (LDA), Allocation Dirichlet tent summarization. abstractive for models mod- with topic combine eling to how summaries describe we the paper, in this appear can texts richer that encoder). rep- the its (and through text resentation input the on conditioning by mary aettpc fteiptdcmn ( document input the of with topics supplied latent is decoder popular based the pointer-generator where summarization abstractive for Let Summarization Neural Attention-based 2.1 approach. our of scription 2017 etda euneof sequence a as sented atrpoie oegoa otx ogenerate to context global more a The topics. provides latent latter the on conditioning and copying text, text, the the on conditioning among switch to erlsmaiainmdl.W eiwone review We ( model attention-based such models. popular summarization the on neural builds work Our Approach 2 what about. of is la- text preserving input topic better the the a indicating in space, texts tent original the higher with have decoder coherence our by generated summaries performance improved strongly obtain and Wiki- How, and CNN/DailyMail namely datasets, mark htkn fifraincnw nrdc so introduce we can information of kind What epooeTpcAgetdGnrtr(TAG) Generator Augmented Topic propose We eapytepooe praho w bench- two on approach proposed the apply We x .T eeaeawr,tegnrtrlearns generator the word, a generate To ). 1 ( = n e Sha Fei and x 1 2 e tal. et See x , . . . , [email protected] L ) , 1 , 2017 2 eoeadcmn repre- document a denote L od.Smlry let Similarly, words. ,floe ytede- the by followed ), e tal. et See In ? , y = (y1,...,yT ) denote a T -word summary of x. 2.2 Topic Augmented Generator (TAG) We desire T ≪ L. Main idea As discussed in the previous section, To learn the mapping from x to y from a cor- our main idea is to introduce bias into the decoder pus of paired documents and their summaries, a such that generating words is geared toward re- natural formalism is to model the conditional dis- flecting the broad, albeit latent, semantic informa- tribution of y given x, i.e., p(y|x). Furthermore, tion underlying the input document. generating the summary is Markovian, implying a With such bias, we hope the summary could in- factorized form of the distribution clude words that are “exogenous”, but nonetheless

T semantically cohesive with the input document — using such words is “blessed” by subscribing to p(y|x)= p(y1|x) p(yt|x,y1:t−1) (1) tY=2 explicitly learned word co-occurrence patterns at the document corpus level. where y1:t−1 stands for the generated (t−1) words. LDA To this end, we use Latent Dirichlet Allo- Sequence-to-sequence (seq2seq) models typically cation (LDA) model (Blei et al., 2003) to discover use RNN encoder-decoder architectures to param- semantically coherent latent variables representing eterize the above distribution. the input documents. The encoder reads x word-by-word from the The document is modeled as a sequence of sam- left to the right (and/or reversely with another en- pling words from a mixture of K topics, coder) and produces a sequence of hidden states K L {h1,...,hi,...,hN }, with hi ∈ R and hi = p(x)= p(xi|zxi , β)p(zxi |θ)p(θ)dθ (5) fe(xi, hi−1), where fe is a differentiable nonlin- Z Y=1 Xx ear function. The decoder models the distribution i z i of every word in the summary conditioned on the Given a corpus of text documents, the prior distri- words preceding it and the encoder’s hidden states bution for θ and the topic-word vectors β can be learned with maximum likelihood estimation. Ad-

p(yt|x,y1:t−1)= γt = gyt (yt−1,st, ct) (2) ditionally, for new documents, their (maximum a posterior) topic vector θ∗ can also be inferred. The where gyt (·) is a differentiable function (such as model parameter β captures word co-occurrence softmax over all possible words) that yields prob- patterns in different topics. ability as output. s = f (y −1,s −1, c ) is the de- t d t t t TAG We revise the conditional probability of coder’s current hidden state, with f being a non- d eq. (4) with a new mixture component linear function. ct is the context vector, a weighted PG ⊤ ∗ average of the encoder hidden states: p(yt|x,y1:t−1)=λtpt + (1 − λt)q(µyt θ ) (6)

L where q(·) denotes the softmax probability of generating yt according to the input document’s ct = αtihi. (3) ∗ Xi=1 topic vector θ and the topic-word distribution µyt . Note that we can use without change the cor- The attention weights αti, forming a categorical responding β from the LDA model or use them distribution αt = (αt1,...,αtL), capture the con- as initialization and update them end-to-end. We text dependency between input words xi and gen- adopt the latter option in our experiments. erated word yt, cf. (Bahdanau et al., 2015). The mixture weights (1 − πt)λt,πtλt and (1 − λt) are (re)parameterized with a feedforward neu- Pointer-Generator (PG) See et al. (2017) mod- ral net (NN) of a softmax output and learnt end-to- ifies the generative probability eq. (2) so that out- end too. of-vocabulary words can be copied from the input: Inference and learning We use maximum like- PG p(yt|x,y1:t−1)=pt =(1 − πt)γt + πtαtyt (4) lihood estimation to learn the model parameters. Other alternatives are possible (Paulus et al., 2017) where the learnable switching term πt is adaptive and left for future work. during generation, and denotes the probability of At inference, we use LDA’s parameters to infer using the attention distribution αt to draw a word topic vectors. We do not include test samples in for the input document (See et al., 2017). fitting LDA to avoid information leak. 3 Related Work consists of 168,128 , 6000 and 6000 pairs for train- ing, testing, and validation respectively. Our work builds on the attention-based sequence- to-sequence learning models in the form of Methods in comparison Our main base- a pair of encoder and decoder (Rush et al., line is the seq2seq model with a pointer- 2015; Chopra et al., 2016; Nallapati et al., 2016). generator (See et al., 2017), referred as Pointer- Pointer-generator networks (See et al., 2017; Generator (PG) for short. We include another Vinyals et al., 2015; Nallapati et al., 2016) were variant PG+Cov where the coverage mechanism proposed to address out-of-vocabulary (OOV) is used to avoid repeating words receiving strong words by learning to copy new words from the attentions (See et al., 2017). Our approach has input documents. To avoid repetitions, See et al. corresponding two variants: Topic Augmented (2017) used a coverage mechanism (Tu et al., Generator (TAG) and TAG+Cov. We also report 2016), which discourages frequently attended the results of the Lead-3 baseline where the words to be generated. Some work (Liu et al., summary corresponds to the first three sentences 2018; Wang and Lee, 2018) explored adversarial of the input article. learning to improve the quality of the generated summaries. Wang and Lee (2018) further consider Evaluation metrics We use mainly the ROUGE a fully unsupervised approach, which does not scores to evaluate the quality of generated sum- require access to document-summary training maries (Lin, 2004). ROUGE-1, ROUGE-2 and pairs. ROUGE-L measure respectively the unigram- Wang et al. (2019) leverages topic information overlap, bigram-overlap, and the longest common by injecting topic information into the atten- sub-sequence between the predicted and reference tion mechanism. Specifically, they fix certain summaries. words’s embeddings, derived from the LDA’s topic-word parameters β. However, they discard Model specifics We follow the suggestion in the document-level topic vector θ, which plays a (See et al., 2017). We use K = 100 topics for both significant role in our model. Our early experi- datasets. Details are in Supplementary Material. ments with injecting topics into attention mecha- nism yield only very minor improvement. 4.2 Quantitative results We report our main results in Table 1. Clearly, 4 Experimental Results models augmented with topics perform better than those who are not. More detailed analysis shows 4.1 Setup that our model TAG+Cov performs better that Datasets We use two datasets CNN/Daily Mail PG+Cov on more than half of the test documents (CNN/DM) and WikiHow. (5888 out of 11490 on CNN/DM). We give details CNN/DM has been extensively used in in the Supplementary Material. recent studies on abstractive summariza- A good summary should also preserve well the tion (Hermann et al., 2015; See et al., 2017; main topics in the original document. To assess Nallapati et al., 2017; Wang and Lee, 2018). It this aspect, for each test document x, we compute consists of 312,085 news articles, where each is the Kullback-Leibler divergence between the topic associated with a multi-sentence summary (about distributions θ∗ inferred on the original document 4 sentences). As in (See et al., 2017) we use the and on the reference as well as the generated sum- original version of this corpus, which is split into maries. As shown by the boxplots (i.e., median, 287,227 instances for training, 11,490 and 13,368 minimum, maximum, first and the third quantiles) for testing and validation respectively. in Figure 1, TAG+Cov tends to generate sum- WikiHow has been introduced recently as a maries that are more coherent with those of the more challenging dataset for abstractive summa- original documents. In the case of CNN/DM, what rization (Koupaee and Wang, 2018). It contains is particularly interesting is that the summaries by about 200,000 pairs of articles and summaries. the TAG+Cov turn out to be more semantically co- The summaries are more abstractive and are typ- herent with the input texts than the ground-truth ically not the first few sentences of the documents. summaries. We suspect that the (ground-truth) As in (Koupaee and Wang, 2018), our data splits summaries for news articles are likely too concise Table 1: Comparison of various models on the test sets of CNN/DM and WikiHow. Higher scores are better. Models marked with (*) indicate results published in the original paper.

CNN/Daily Mail WikiHow ROUGE-1 ROUGE-2 Rouge-L ROUGE-1 ROUGE-2 Rouge-L PG* (See et al., 2017) 36.44 15.66 33.42 - - - PG+Cov* (See et al., 2017) 39.53 17.28 36.38 - - - PG (by us) 35.73 15.08 32.69 26.02 7.92 24.59 TAG (this paper) 36.48 15.89 33.68 26.18 8.18 25.25 PG+Cov (by us) 39.12 16.88 35.59 27.08 8.49 26.25 TAG+Cov (this paper) 40.06 17.89 36.52 28.36 9.05 27.48 Lead-3 40.34 17.70 36.57 26.00 7.24 24.25

Ground Truth: british physicist has sung 7 ’s song . song is being released digitally 6 and on vinyl for record store day 2015 . it is a cover of the song 5 from 1983 film monty python ’s meaning of life . professor 4 3.85 3.43 hawking , 73 , appeared on film alongside professor 3 2.64 . 2 KL divergence PG+Cov: british physicist stephen hawking has sung monty 1 python ’s galaxy song -lrb- clip from the video shown -rrb-

0 GT PG+Cov TAG+Cov . one of the world ’s greatest scientists has covered monty python ’s classic galaxy song , taking listeners on a journey 7 out of the . it is a cover of the song from 1983 film 6 monty python ’s meaning of life . 5 TAG+Cov: british physicist stephen hawking has sung monty

4 3.68 python ’s galaxy song -lrb- clip from the video shown -rrb- 3.49

3 2.70 . the song is being released digitally and on vinyl for record

2 store day 2015 . it is a cover of the song from 1983 film monty KL divergence

1 python ’s meaning of life .

0 GT PG+Cov TAG+Cov Figure 2: From CNN/DM. Our TAG+Cov model generates 3 main sentences instead of 2 for PG+Cov. Figure 1: Top (CNN/DM), Bottom (WikiHow). Boxplots θ∗ of pairwise KL divergences in the topic distributions be- Ground Truth: acquire a pot. gather the ingredients needed tween the original documents and: the Ground Truth (GT), generated summaries. Lower implies better semantic coher- to make the curry. walk to the pot on your kitchen counter. ence. choose the ingredients that you need for the recipe. PG+Cov: go to the kitchen counter. go to your kitchen counter. look for the ingredients you want to cook. finished. take care for a topic model to detect reliably co-occurrence of your health. patterns needed for inferring topics. TAG+Cov: acquire a pot. gather the ingredients needed to make the dish. walk to the pot on your kitchen counter. choose 4.3 Qualitative results the ingredients that you need to add to your recipe. confirm your decision. In Figures 2 and 3, we present two examples. In the first one, our model generates 3 main sen- Figure 3: From WikiHow (“How to Make Vegetable Curry in Harvest Moon Animal Parade”). tences, instead of 2 by the other model. Both have missed the last important sentence in the ground- truth summary. In the second example, PG+Cov 5 Conclusion misses a lot important words. More examples are given in the Supplementary Material. Overall, it We have shown that by conditioning on the top- seems like our model is more likely to capture the ics underlying the input documents, the decoder main topics of the initial documents and tend to be generates noticeably improved summaries. This more concise. suggests that the supervised learning of represen- tation from sequence-to-sequence models can ben- Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, efit from unsupervised learning of latent variable Bing Xiang, et al. 2016. Abstractive text summariza- models. We believe this is likely a fruitful di- tion using sequence-to-sequence rnns and beyond. Computational Natural Lan-guage Learning. rection for the task of abstractive summarization where rephrasing and introducing new concepts Romain Paulus, Caiming Xiong, and Richard Socher. that are not observed in the input texts are essen- 2017. A deep reinforced model for abstractive sum- marization. arXiv preprint arXiv:1705.04304. tial. Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sen- References tence summarization. EMNLP. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Abigail See, Peter J Liu, and Christopher D Man- gio. 2015. Neural machine translation by jointly ning. 2017. Get to the point: Summarization learning to align and translate. In 3rd International with pointer-generator networks. arXiv preprint Conference on Learning Representations, ICLR. arXiv:1704.04368.

David M Blei, Andrew Y Ng, and Michael I Jordan. Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua 2003. Latent dirichlet allocation. Journal of ma- Liu, and Hang Li. 2016. Modeling coverage chine Learning research, 3(Jan):993–1022. for neural machine translation. arXiv preprint arXiv:1601.04811. Sumit Chopra, Michael Auli, and Alexander M Rush. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2016. Abstractive sentence summarization with at- 2015. Pointer networks. In Advances in Neural In- tentive recurrent neural networks. In Proceedings of formation Processing Systems, pages 2692–2700. the 2016 Conference of the North American Chap- ter of the Association for Computational Linguistics: Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Human Language Technologies, pages 93–98. Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational Bonnie Dorr, David Zajic, and Richard Schwartz. autoencoders for text generation. arXiv preprint 2003. Hedge trimmer: A parse-and-trim approach arXiv:1903.07137. to headline generation. In Proceedings of the HLT- NAACL 03 on Text summarization workshop-Volume Yau-Shian Wang and Hung-Yi Lee. 2018. Learning 5, pages 1–8. Association for Computational Lin- to encode text as human-readable summaries us- guistics. ing generative adversarial networks. arXiv preprint arXiv:1810.02851. Karl Moritz Hermann, Tomas Kocisky, Edward Grefen- stette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Wenyuan Zeng, Wenjie Luo, Sanja Fidler, and Raquel and Phil Blunsom. 2015. Teaching machines to read Urtasun. 2016. Efficient summarization with and comprehend. In Advances in neural information read-again and copy mechanism. arXiv preprint processing systems, pages 1693–1701. arXiv:1611.03382.

Mahnaz Koupaee and William Yang Wang. 2018. Wik- ihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1999. A trainable document summarizer. Advances in Au- tomatic Summarization, pages 55–60.

Chin-Yew Lin. 2004. Rouge: A package for auto- matic evaluation of summaries. Text Summarization Branches Out.

Linqing Liu, Yao Lu, Min Yang, Qiang Qu, Jia Zhu, and Hongyan Li. 2018. Generative adversarial net- work for abstractive text summarization. In Thirty- Second AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based se- quence model for extractive summarization of docu- ments. In Thirty-First AAAI Conference on Artificial Intelligence.