Arxiv:1908.07026V1 [Cs.LG] 19 Aug 2019 to Opnn,Ie,Tedcdr Upt Sum- Outputs Decoder, the I.E., Component, Ation E Tal
Total Page:16
File Type:pdf, Size:1020Kb
Topic Augmented Generator for Abstractive Summarization Melissa Ailem1, Bowen Zhang1 and Fei Sha1,2 1 University of Southern California, Los Angeles, CA 1 {ailem, zhan734, feisha}@usc.edu 2 [email protected] Abstract mary by conditioning on the input text (and its rep- resentation through the encoder). Steady progress has been made in abstractive What kind of information can we introduce so summarization with attention-based sequence- that richer texts can appear in the summaries? In to-sequence learning models. In this paper, we propose a new decoder where the output sum- this paper, we describe how to combine topic mod- mary is generated by conditioning on both the eling with models for abstractive summarization. input text and the latent topics of the docu- Topics identified from topic modeling, such as La- ment. The latent topics, identified by a topic tent Dirichlet Allocation (LDA), capture corpus- model such as LDA, reveals more global se- level patterns of words co-occurrence and describe mantic information that can be used to bias documents with mixtures of semantically coher- the decoder to generate words. In particular, ent conceptual groups. Such usages of words and they enable the decoder to have access to ad- ditional word co-occurrence statistics captured concepts provide valuable inductive bias for super- at document corpus level. We empirically vali- vised models for language generation. date the advantage of the proposed approach We propose Topic Augmented Generator (TAG) on both the CNN/Daily Mail and the Wiki- for abstractive summarization where the popular How datasets. Concretely, we attain strongly pointer-generator based decoder is supplied with improved ROUGE scores when compared to latent topics of the input document (See et al., state-of-the-art models. 2017). To generate a word, the generator learns to switch among conditioning on the text, copying 1 Introduction the text, and conditioning on the latent topics. The Extractive summarization focuses on select- latter provides a more global context to generate ing parts (e.g., words, phrases, and sentences) words. from the input document (Kupiec et al., 1999; We apply the proposed approach on two bench- Dorr et al., 2003; Nallapati et al., 2017). While mark datasets, namely CNN/DailyMail and Wiki- the primary goal is to preserve the important How, and obtain strongly improved performance messages in the original text, the more challeng- when the topics are introduced. Moreover, the ing abstractive summarization aims to generate summaries generated by our decoder have higher arXiv:1908.07026v1 [cs.LG] 19 Aug 2019 a summary via rephrasing and introducing new coherence with the original texts in the topic la- concepts/words (Zeng et al., 2016; Nallapati et al., tent space, indicating a better preserving of what 2016; See et al., 2017; Paulus et al., 2017; the input text is about. Liu et al., 2018). All those entail a broader and deeper understanding of the document, its back- 2 Approach ground story and other knowledge — Zeitgeist — Our work builds on the popular attention-based that are not explicitly specified in the input text neural summarization models. We review one but are nonetheless in the minds of the human such model (See et al., 2017), followed by the de- readers. scription of our approach. Neural-based abstractive summarization has since made a lot of progress (Nallapati et al., 2016; 2.1 Attention-based Neural Summarization See et al., 2017). By large, the language gener- Let x = (x1,...,xL) denote a document repre- ation component, i.e., the decoder, outputs sum- sented as a sequence of L words. Similarly, let y = (y1,...,yT ) denote a T -word summary of x. 2.2 Topic Augmented Generator (TAG) We desire T ≪ L. Main idea As discussed in the previous section, To learn the mapping from x to y from a cor- our main idea is to introduce bias into the decoder pus of paired documents and their summaries, a such that generating words is geared toward re- natural formalism is to model the conditional dis- flecting the broad, albeit latent, semantic informa- tribution of y given x, i.e., p(y|x). Furthermore, tion underlying the input document. generating the summary is Markovian, implying a With such bias, we hope the summary could in- factorized form of the distribution clude words that are “exogenous”, but nonetheless T semantically cohesive with the input document — using such words is “blessed” by subscribing to p(y|x)= p(y1|x) p(yt|x,y1:t−1) (1) tY=2 explicitly learned word co-occurrence patterns at the document corpus level. where y1:t−1 stands for the generated (t−1) words. LDA To this end, we use Latent Dirichlet Allo- Sequence-to-sequence (seq2seq) models typically cation (LDA) model (Blei et al., 2003) to discover use RNN encoder-decoder architectures to param- semantically coherent latent variables representing eterize the above distribution. the input documents. The encoder reads x word-by-word from the The document is modeled as a sequence of sam- left to the right (and/or reversely with another en- pling words from a mixture of K topics, coder) and produces a sequence of hidden states K L {h1,...,hi,...,hN }, with hi ∈ R and hi = p(x)= p(xi|zxi , β)p(zxi |θ)p(θ)dθ (5) fe(xi, hi−1), where fe is a differentiable nonlin- Z Y=1 Xx ear function. The decoder models the distribution i z i of every word in the summary conditioned on the Given a corpus of text documents, the prior distri- words preceding it and the encoder’s hidden states bution for θ and the topic-word vectors β can be learned with maximum likelihood estimation. Ad- p(yt|x,y1:t−1)= γt = gyt (yt−1,st, ct) (2) ditionally, for new documents, their (maximum a posterior) topic vector θ∗ can also be inferred. The where gyt (·) is a differentiable function (such as model parameter β captures word co-occurrence softmax over all possible words) that yields prob- patterns in different topics. ability as output. s = f (y −1,s −1, c ) is the de- t d t t t TAG We revise the conditional probability of coder’s current hidden state, with f being a non- d eq. (4) with a new mixture component linear function. ct is the context vector, a weighted PG ⊤ ∗ average of the encoder hidden states: p(yt|x,y1:t−1)=λtpt + (1 − λt)q(µyt θ ) (6) L where q(·) denotes the softmax probability of generating yt according to the input document’s ct = αtihi. (3) ∗ Xi=1 topic vector θ and the topic-word distribution µyt . Note that we can use without change the cor- The attention weights αti, forming a categorical responding β from the LDA model or use them distribution αt = (αt1,...,αtL), capture the con- as initialization and update them end-to-end. We text dependency between input words xi and gen- adopt the latter option in our experiments. erated word yt, cf. (Bahdanau et al., 2015). The mixture weights (1 − πt)λt,πtλt and (1 − λt) are (re)parameterized with a feedforward neu- Pointer-Generator (PG) See et al. (2017) mod- ral net (NN) of a softmax output and learnt end-to- ifies the generative probability eq. (2) so that out- end too. of-vocabulary words can be copied from the input: Inference and learning We use maximum like- PG p(yt|x,y1:t−1)=pt =(1 − πt)γt + πtαtyt (4) lihood estimation to learn the model parameters. Other alternatives are possible (Paulus et al., 2017) where the learnable switching term πt is adaptive and left for future work. during generation, and denotes the probability of At inference, we use LDA’s parameters to infer using the attention distribution αt to draw a word topic vectors. We do not include test samples in for the input document (See et al., 2017). fitting LDA to avoid information leak. 3 Related Work consists of 168,128 , 6000 and 6000 pairs for train- ing, testing, and validation respectively. Our work builds on the attention-based sequence- to-sequence learning models in the form of Methods in comparison Our main base- a pair of encoder and decoder (Rush et al., line is the seq2seq model with a pointer- 2015; Chopra et al., 2016; Nallapati et al., 2016). generator (See et al., 2017), referred as Pointer- Pointer-generator networks (See et al., 2017; Generator (PG) for short. We include another Vinyals et al., 2015; Nallapati et al., 2016) were variant PG+Cov where the coverage mechanism proposed to address out-of-vocabulary (OOV) is used to avoid repeating words receiving strong words by learning to copy new words from the attentions (See et al., 2017). Our approach has input documents. To avoid repetitions, See et al. corresponding two variants: Topic Augmented (2017) used a coverage mechanism (Tu et al., Generator (TAG) and TAG+Cov. We also report 2016), which discourages frequently attended the results of the Lead-3 baseline where the words to be generated. Some work (Liu et al., summary corresponds to the first three sentences 2018; Wang and Lee, 2018) explored adversarial of the input article. learning to improve the quality of the generated summaries. Wang and Lee (2018) further consider Evaluation metrics We use mainly the ROUGE a fully unsupervised approach, which does not scores to evaluate the quality of generated sum- require access to document-summary training maries (Lin, 2004). ROUGE-1, ROUGE-2 and pairs. ROUGE-L measure respectively the unigram- Wang et al. (2019) leverages topic information overlap, bigram-overlap, and the longest common by injecting topic information into the atten- sub-sequence between the predicted and reference tion mechanism.