Abstractive Sentence Summarization with Attentive Recurrent Neural Networks

Sumit Chopra, Facebook AI Research, [email protected]
Michael Auli, Facebook AI Research, [email protected]
Alexander M. Rush, Harvard SEAS, [email protected]

Abstract

Abstractive Sentence Summarization generates a shorter version of a given sentence while attempting to preserve its meaning. We introduce a conditional recurrent neural network (RNN) which generates a summary of an input sentence. The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation. Our model relies only on learned features and is easy to train in an end-to-end fashion on large data sets. Our experiments show that the model significantly outperforms the recently proposed state-of-the-art method on the Gigaword corpus while performing competitively on the DUC-2004 shared task.

1 Introduction

Generating a condensed version of a passage while preserving its meaning is known as text summarization. Tackling this task is an important step towards natural language understanding. Summarization systems can be broadly classified into two categories. Extractive models generate summaries by cropping important segments from the original text and putting them together to form a coherent summary. Abstractive models generate summaries from scratch without being constrained to reuse phrases from the original text.

In this paper we propose a novel recurrent neural network for the problem of abstractive sentence summarization. Inspired by the recently proposed architectures for machine translation (Bahdanau et al., 2014), our model consists of a conditional recurrent neural network, which acts as a decoder to generate the summary of an input sentence, much like a standard recurrent language model. In addition, at every time step the decoder also takes a conditioning input which is the output of an encoder module. Depending on the current state of the RNN, the encoder computes scores over the words in the input sentence. These scores can be interpreted as a soft alignment over the input text, informing the decoder which part of the input sentence it should focus on to generate the next word. Both the decoder and encoder are jointly trained on a data set consisting of sentence-summary pairs. Our model can be seen as an extension of the recently proposed model for the same problem by Rush et al. (2015). While they use a feed-forward neural language model for generation, we use a recurrent neural network. Furthermore, our encoder is more sophisticated, in that it explicitly encodes the position information of the input words. Lastly, our encoder uses a convolutional network to encode input words. These extensions result in improved performance.

The main contribution of this paper is a novel convolutional attention-based conditional recurrent neural network model for the problem of abstractive sentence summarization. Empirically we show that our model beats the state-of-the-art systems of Rush et al. (2015) on multiple data sets. Particularly notable is the fact that even with a simple generation module, which does not use any extractive feature tuning, our model manages to significantly outperform their ABS+ system on the Gigaword data set and is comparable on the DUC-2004 task.


2 Previous Work

While there is a large body of work for generating extractive summaries of sentences (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Clarke and Lapata, 2008; Filippova and Altun, 2013; Filippova et al., 2015), there has been much less research on abstractive summarization. A count-based noisy-channel machine translation model was proposed for the problem in Banko et al. (2000). The task of abstractive sentence summarization was later formalized around the DUC-2003 and DUC-2004 competitions (Over et al., 2007), where the TOPIARY system (Zajic et al., 2004) was the state-of-the-art. More recently Cohn and Lapata (2008) and later Woodsend et al. (2010) proposed systems which made heavy use of the syntactic features of the sentence-summary pairs. Later, along the lines of Banko et al. (2000), MOSES was used directly as a method for text simplification by Wubben et al. (2012). Other works which have recently been proposed for the problem of sentence summarization include Galanis and Androutsopoulos (2010), Napoles et al. (2011), and Cohn and Lapata (2013). Very recently Rush et al. (2015) proposed a neural attention model for this problem using a new data set for training and showing state-of-the-art performance on the DUC tasks. Our model can be seen as an extension of their model.

3 Attentive Recurrent Architecture

Let x denote the input sentence consisting of a sequence of M words x = [x_1, ..., x_M], where each word x_i is part of a vocabulary V of size |V| = V. Our task is to generate a target sequence y = [y_1, ..., y_N] of N words, where N < M, such that the meaning of x is preserved: y = argmax_y P(y | x), where y is a random variable denoting a sequence of N words.

Typically the conditional probability is modeled by a parametric function with parameters θ: P(y | x) = P(y | x; θ). Training involves finding the θ which maximizes the conditional probability of sentence-summary pairs in the training corpus. If the model is trained to generate the next word of the summary given the previous words, then the above conditional can be factorized into a product of individual conditional probabilities:

P(y | x; θ) = ∏_{t=1}^{N} p(y_t | {y_1, ..., y_{t−1}}, x; θ).    (1)

In this work we model this conditional probability using an RNN Encoder-Decoder architecture, inspired by Cho et al. (2014) and subsequently extended in Bahdanau et al. (2014). We call our model RAS (Recurrent Attentive Summarizer).

3.1 Recurrent Decoder

The above conditional is modeled using an RNN:

P(y_t | {y_1, ..., y_{t−1}}, x; θ) = P_t = g_{θ1}(h_t, c_t),

where h_t is the hidden state of the RNN:

h_t = g_{θ1}(y_{t−1}, h_{t−1}, c_t).

Here c_t is the output of the encoder module (detailed in Section 3.2). It can be seen as a context vector which is computed as a function of the current state h_{t−1} and the input sequence x.

Our Elman RNN takes the following form (Elman, 1990):

h_t = σ(W_1 y_{t−1} + W_2 h_{t−1} + W_3 c_t)
P_t = ρ(W_4 h_t + W_5 c_t),

where σ is the sigmoid function, ρ is the softmax, defined as ρ(o_t) = e^{o_t} / ∑_j e^{o_j}, and the W_i (i = 1, ..., 5) are matrices of learnable parameters of sizes W_{1,2,3} ∈ R^{d×d} and W_{4,5} ∈ R^{d×V}.

The LSTM decoder is defined as (Hochreiter and Schmidhuber, 1997):

i_t = σ(W_1 y_{t−1} + W_2 h_{t−1} + W_3 c_t)
i'_t = tanh(W_4 y_{t−1} + W_5 h_{t−1} + W_6 c_t)
f_t = σ(W_7 y_{t−1} + W_8 h_{t−1} + W_9 c_t)
o_t = σ(W_10 y_{t−1} + W_11 h_{t−1} + W_12 c_t)
m_t = m_{t−1} ⊙ f_t + i_t ⊙ i'_t
h_t = m_t ⊙ o_t
P_t = ρ(W_13 h_t + W_14 c_t),

where ⊙ denotes component-wise multiplication, and the W_i (i = 1, ..., 14) are matrices of learnable parameters of sizes W_{1,...,12} ∈ R^{d×d} and W_{13,14} ∈ R^{d×V}.
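The decoder equations map directly to a few lines of code. Below is a minimal NumPy sketch of a single Elman decoder step; the paper's actual implementation is in Torch, so the shapes, the random initialization, and all names here are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(o):
    e = np.exp(o - o.max())              # subtract max for numerical stability
    return e / e.sum()

d, V = 512, 10000                        # hidden size and vocabulary size (illustrative)
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
W4, W5 = (rng.normal(0, 0.1, (V, d)) for _ in range(2))

def elman_decoder_step(y_prev, h_prev, c_t):
    """One Elman decoder step: y_prev is the embedding of the previous output word,
    h_prev the previous hidden state, c_t the encoder context vector.
    Implements h_t = sigma(W1 y_{t-1} + W2 h_{t-1} + W3 c_t), P_t = softmax(W4 h_t + W5 c_t)."""
    h_t = sigmoid(W1 @ y_prev + W2 @ h_prev + W3 @ c_t)
    P_t = softmax(W4 @ h_t + W5 @ c_t)   # distribution over the V output words
    return h_t, P_t
```

The LSTM variant replaces the single recurrence with the gated updates listed above, but the interface is unchanged: previous word embedding, previous state, and context vector in; new state and word distribution out.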
3.2 Attentive Encoder

We now give the details of the encoder which computes the context vector c_t for every time step t of the decoder above. With a slight overload of notation, for an input sentence x we denote by x_i the d-dimensional learnable embedding of the i-th word (x_i ∈ R^d). In addition, the position i of the word x_i is also associated with a learnable embedding l_i of size d (l_i ∈ R^d). Then the full embedding for the i-th word in x is given by a_i = x_i + l_i. Let us denote by B^k ∈ R^{q×d} a learnable weight matrix which is used to convolve over the full embeddings of consecutive words, and let there be d such matrices (k ∈ {1, ..., d}). The output of the convolution is given by:

z_{ik} = ∑_{h=−q/2}^{q/2} a_{i+h} · b^k_{q/2+h},    (2)

where b^k_j is the j-th column of the matrix B^k. Thus the d-dimensional aggregate embedding vector z_i is defined as z_i = [z_{i1}, ..., z_{id}]. Note that each word x_i in the input sequence is associated with one aggregate embedding vector z_i. The vectors z_i can be seen as a representation of the word which captures both the position in which it occurs in the sentence and the context in which it appears. In our experiments the width q of the convolution matrix B^k was set to 5. To account for words at the boundaries of x, we first pad the sequence on both sides with dummy words before computing the aggregate vectors z_i.

Given these aggregate vectors of words, we compute the context vector c_t (the encoder output) as:

c_t = ∑_{j=1}^{M} α_{j,t−1} x_j,    (3)

where the weights α_{j,t−1} are computed as

α_{j,t−1} = exp(z_j · h_{t−1}) / ∑_{i=1}^{M} exp(z_i · h_{t−1}).    (4)
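To make Equations 2-4 concrete, here is a small NumPy sketch of the encoder: aggregate vectors z_i are built from word plus position embeddings convolved over a window of width q, and the soft alignment over them produces the context vector c_t. It is a simplified illustration under assumed shapes (in particular, it applies each B^k as an element-wise window product summed over the window); the paper's implementation is in Torch, and none of this is the authors' code.

```python
import numpy as np

def attentive_encoder_context(X, L, B, h_prev, q=5):
    """X: (M, d) word embeddings x_i; L: (M, d) position embeddings l_i;
    B: (d, q, d) bank of d convolution matrices B^k; h_prev: (d,) decoder state h_{t-1}.
    Returns the context vector c_t of Equation 3 and the attention weights alpha."""
    M, d = X.shape
    A = X + L                                   # full embeddings a_i = x_i + l_i
    pad = q // 2
    A_pad = np.vstack([np.zeros((pad, d)), A, np.zeros((pad, d))])  # dummy boundary words

    # Equation 2: z_ik aggregates the window of embeddings around word i with filter B^k
    Z = np.zeros((M, d))
    for i in range(M):
        window = A_pad[i:i + q]                 # (q, d) embeddings around word i
        for k in range(d):
            Z[i, k] = np.sum(window * B[k])     # dot products summed over the window

    # Equations 3-4: soft alignment over input words, then weighted sum of word embeddings
    scores = Z @ h_prev                         # z_j . h_{t-1}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    c_t = alpha @ X                             # c_t = sum_j alpha_{j,t-1} x_j
    return c_t, alpha
```

Note that, following Equation 3, the attention weights are computed from the aggregate vectors z_j, while the context is a weighted sum of the plain word embeddings x_j.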
3.3 Training and Generation

Given a training corpus S = {(x^i, y^i)}_{i=1}^{S} of S sentence-summary pairs, the above model can be trained end-to-end using stochastic gradient descent by minimizing the negative conditional log-likelihood of the training data with respect to θ:

L = − ∑_{i=1}^{S} ∑_{t=1}^{N} log P(y^i_t | {y^i_1, ..., y^i_{t−1}}, x^i; θ),    (5)

where the parameters θ constitute the parameters of the decoder and the encoder.

Once the parametric model is trained, we generate a summary for a new sentence x through a word-based beam search such that P(y | x) is maximized, argmax P(y_t | {y_1, ..., y_{t−1}}, x). The search is parameterized by the number of paths k that are pursued at each time step.
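The word-based beam search is standard; the sketch below illustrates how k partial summaries are extended and re-scored at each time step. The callback next_word_distribution, which stands in for one decoder step conditioned on the encoded input, and the bos/eos markers are assumed placeholders; this is an illustrative outline, not the authors' Torch implementation.

```python
import numpy as np

def beam_search(next_word_distribution, bos, eos, k=10, max_len=30):
    """next_word_distribution(prefix) -> (V,) probabilities of the next word
    given the partial summary `prefix` (and, implicitly, the encoded input x).
    Keeps the k highest-scoring partial summaries at every time step."""
    beams = [([bos], 0.0)]                       # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:                # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            probs = next_word_distribution(prefix)
            for w in np.argsort(probs)[-k:]:     # only the top-k extensions can survive
                candidates.append((prefix + [int(w)], score + np.log(probs[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(p[-1] == eos for p, _ in beams):
            break
    return beams[0][0]                           # highest-scoring summary

```

With k = 1 the loop reduces to greedy generation, which corresponds to the k = 1 rows reported in the result tables below.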

4 Experimental Setup

4.1 Datasets and Evaluation

Our models are trained on the annotated version of the Gigaword corpus (Graff et al., 2003; Napoles et al., 2012); we use only the annotations for tokenization and sentence separation while discarding other annotations such as tags and parses. We pair the first sentence of each article with its headline to form sentence-summary pairs. The data is pre-processed in the same way as Rush et al. (2015) and we use the same splits for training, validation, and testing. For Gigaword we report results on the same randomly held-out test set of 2000 sentence-summary pairs as Rush et al. (2015).[1]

[1] We remove pairs with empty titles, resulting in slightly different accuracy compared to Rush et al. (2015) for their systems.

We also evaluate our models on the DUC-2004 evaluation data set comprising 500 pairs (Over et al., 2007). Our evaluation is based on three variants of ROUGE (Lin, 2004), namely ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common substring).

4.2 Architectural Choices

We implemented our models in the Torch library (http://torch.ch/).[2] To optimize our loss (Equation 5) we used stochastic gradient descent with mini-batches of size 32. During training we measure the perplexity of the summaries in the validation set and adjust our hyper-parameters, such as the learning rate, based on this number.

[2] Our code can be found at https://github.com/facebook/namas

For the decoder we experimented with both the Elman RNN and the Long Short-Term Memory (LSTM) architecture (as discussed in Section 3.1). We chose hyper-parameters based on a grid search and picked the one which gave the best perplexity on the validation set. In particular, we searched over the number of hidden units H of the recurrent layer, the learning rate η, the learning rate annealing schedule γ (the factor by which to decrease η if the validation perplexity increases), and the gradient clipping threshold κ. Our final Elman architecture (RAS-Elman) uses a single layer with H = 512, η = 0.5, γ = 2, and κ = 10. The LSTM model (RAS-LSTM) also has a single layer with H = 512, η = 0.1, γ = 2, and κ = 10.
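The annealing and clipping described above amount to a simple control loop around SGD. The sketch below shows one reading of that schedule (divide η by γ whenever validation perplexity increases; rescale gradients whose norm exceeds κ). The model interface (gradient, update, perplexity) and the exact clipping rule are assumptions for illustration, not the paper's Torch code.

```python
import numpy as np

def clip_gradient(grad, kappa=10.0):
    """Rescale the gradient if its L2 norm exceeds kappa (one common reading
    of a 'gradient clipping threshold'; the paper does not spell out the rule)."""
    norm = np.linalg.norm(grad)
    return grad * (kappa / norm) if norm > kappa else grad

def fit(model, train_set, valid_set, eta=0.5, gamma=2.0, kappa=10.0, epochs=20):
    """Mini-batch SGD with learning-rate annealing driven by validation perplexity.
    `model`, `train_set`, and `valid_set` are illustrative placeholder objects."""
    best_ppl = float("inf")
    for _ in range(epochs):
        for batch in train_set.minibatches(size=32):
            grad = clip_gradient(model.gradient(batch), kappa)
            model.update(-eta * grad)            # plain SGD step
        ppl = model.perplexity(valid_set)        # exp of average per-word NLL
        if ppl > best_ppl:                       # validation perplexity got worse
            eta /= gamma                         # anneal the learning rate
        best_ppl = min(best_ppl, ppl)
    return model
```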
Model                    Perplexity
Bag-of-Words             43.6
Convolutional (TDNN)     35.9
Attention-based (ABS)    27.1
RAS-Elman                18.9
RAS-LSTM                 20.3

Table 1: Perplexity on the Gigaword validation set. Bag-of-Words, Convolutional (TDNN) and ABS are the different encoders of Rush et al. (2015).

Model                RG-1   RG-2   RG-L
ABS                  29.55  11.32  26.42
ABS+                 29.76  11.88  26.96
RAS-Elman (k = 1)    33.10  14.45  30.25
RAS-Elman (k = 10)   33.78  15.97  31.15
RAS-LSTM (k = 1)     31.71  13.63  29.31
RAS-LSTM (k = 10)    32.55  14.70  30.03
Luong-NMT            33.10  14.45  30.71

Table 2: F1 ROUGE scores on the Gigaword test set. ABS and ABS+ are the systems of Rush et al. (2015). k refers to the size of the beam for generation; k = 1 implies greedy generation. RG refers to ROUGE. Rush et al. (2015) previously reported ROUGE recall, whereas we use the more balanced F-measure.

Model                RG-1   RG-2   RG-L
ABS                  26.55   7.06  22.05
ABS+                 28.18   8.49  23.81
RAS-Elman (k = 1)    29.13   7.62  23.92
RAS-Elman (k = 10)   28.97   8.26  24.06
RAS-LSTM (k = 1)     26.90   6.57  22.12
RAS-LSTM (k = 10)    27.41   7.69  23.06
Luong-NMT            28.55   8.79  24.43

Table 3: ROUGE results (recall-only) on the DUC-2004 test set. ABS and ABS+ are the systems of Rush et al. (2015). k refers to the size of the beam for generation; k = 1 implies greedy generation. RG refers to ROUGE.

5 Results

On the Gigaword corpus we evaluate our models in terms of perplexity on a held-out set. We then pick the model with the best perplexity on the held-out set and use it to compute the F1-score of ROUGE-1, ROUGE-2, and ROUGE-L on the test sets, all of which we report. For the DUC corpus however, in line with the standard, we report the recall-only ROUGE. As baseline we use the state-of-the-art attention-based system (ABS) of Rush et al. (2015), which relies on a feed-forward network decoder. Additionally, we compare to an enhanced version of their system (ABS+), which relies on a range of separate extractive summarization features that are added as log-linear features in a secondary learning step with minimum error rate training (Och, 2003).

Table 1 shows that both our RAS-Elman and RAS-LSTM models achieve lower perplexity than ABS as well as other models reported in Rush et al. (2015). The RAS-LSTM performs slightly worse than RAS-Elman, most likely due to over-fitting. We attribute this to the relatively simple nature of this task, which can be framed as English-to-English translation with few long-term dependencies. The ROUGE results (Table 2) show that our models comfortably outperform both ABS and ABS+ by a wide margin on all metrics. This is even the case when we rely only on very fast greedy search (k = 1), while ABS uses a much wider beam of size k = 50; the stronger ABS+ system also uses additional extractive features which our model does not. These features cause ABS+ to copy 92% of words from the input into the summary, whereas our model copies only 74% of the words, leading to more abstractive summaries.

On DUC-2004 we report recall ROUGE, as is customary on this dataset. The results (Table 3) show that our models are better than ABS+. However, the improvements are smaller than for Gigaword, which is likely due to two reasons: First, the tokenization of DUC-2004 differs slightly from that of our training corpus. Second, headlines in Gigaword are much shorter than in DUC-2004.

For the sake of completeness we also compare our models to the recently proposed standard Neural Machine Translation (NMT) systems. In particular, we compare to a smaller re-implementation of the attentive stacked LSTM encoder-decoder of Luong et al. (2015). Our implementation uses two-layer LSTMs for the encoder-decoder with 500 hidden units in each layer. Tables 2 and 3 report ROUGE scores on the two data sets. From the tables we observe that the proposed RAS-Elman model is able to match the performance of the NMT model of Luong et al. (2015). This is noteworthy because RAS-Elman is significantly simpler than the NMT model at multiple levels. First, the encoder used by RAS-Elman is extremely light-weight (attention over the convolutional representation of the input words), compared to Luong's (a 2 hidden layer LSTM). Second, the decoder used by RAS-Elman is a single layer standard (Elman) RNN as opposed to a multi-layer LSTM. In an independent work, Nallapati et al. (2016) also trained a collection of standard NMT models and report numbers in the same ballpark as RAS-Elman on both datasets.

In order to better understand which component of the proposed architecture is responsible for the improvements, we trained the recurrent model with Rush et al. (2015)'s ABS encoder on a subset of the Gigaword dataset. The ABS encoder, which does not have the position features, achieves a final validation perplexity of 38 compared to 29 for the proposed encoder, which uses position features as well as context information. This clearly shows the benefits of using the position feature in the proposed encoder.

Finally, in Figure 1 we highlight anecdotal examples of summaries produced by the RAS-Elman system on the Gigaword dataset. The first two examples highlight typical improvements in the RAS model over ABS+. Generally the model produces more fluent summaries and is better able to capture the main actors of the input. For instance in Sentence 1, RAS-Elman correctly distinguishes the actions of "pepe" from "ferreira", and in Sentence 2 it identifies the correct role of the "think tank". The final two examples highlight typical mistakes of the models. In Sentence 3 both models take literally the figurative use of the idiom "a silence that spoke volumes," and produce fluent but nonsensical summaries. In Sentence 4 the RAS model mistakes the content of a relative clause for the main verb, leading to a summary with the opposite meaning of the input. These difficult cases are somewhat rare in the Gigaword data, but they highlight future challenges for obtaining human-level sentence summarization.

I(1): brazilian defender pepe is out for the rest of the season with a knee injury , his porto coach jesualdo said saturday .
G: football : pepe out for season
A+: ferreira out for rest of season with knee injury
R: brazilian defender pepe out for rest of season with knee injury

I(2): economic growth in toronto will suffer this year because of sars , a think tank said friday as health authorities insisted the illness was under control in canada 's largest city .
G: sars toll on toronto economy estimated at c$ # billion
A+: think tank under control in canada 's largest city
R: think tank says economic growth in toronto will suffer this year

I(3): colin l. powell said nothing – a silence that spoke volumes to many in the white house on thursday morning .
G: in meeting with former officials bush defends iraq policy
A+: colin powell speaks volumes about silence in white house
R: powell speaks volumes on the white house

I(4): an international terror suspect who had been under a controversial loose form of house arrest is on the run , british home secretary john reid said tuesday .
G: international terror suspect slips net in britain
A+: reid under house arrest terror suspect on the run
R: international terror suspect under house arrest

Figure 1: Example sentence summaries produced on Gigaword. I is the input, G is the true headline, A+ is ABS+, and R is RAS-Elman.

6 Conclusion

We extend the state-of-the-art model for abstractive sentence summarization (Rush et al., 2015) to a recurrent neural network architecture. Our model is a simplified version of the encoder-decoder framework for machine translation (Bahdanau et al., 2014). The model is trained on the Gigaword corpus to generate headlines based on the first line of each news article. We comfortably outperform the previous state-of-the-art on both Gigaword data and the DUC-2004 challenge even though our model does not rely on additional extractive features.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 318-325. Association for Computational Linguistics.
Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014, pages 1724-1734.
James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, pages 399-429.
Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 137-144. Association for Computational Linguistics.
Trevor Cohn and Mirella Lapata. 2013. An abstractive approach to sentence compression. ACM Transactions on Intelligent Systems and Technology (TIST'13), 4(3):41.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179-211.
Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP, pages 1481-1491.
Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP.
Dimitrios Galanis and Ion Androutsopoulos. 2010. An extractive supervised two-stage method for sentence compression. In Proceedings of NAACL-HLT 2010.
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.
S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
H. Jing. 2000. Sentence reduction for automatic text summarization. In ANLP-00, pages 703-711.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91-107.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, Lisbon, Portugal, September. Association for Computational Linguistics.
R. McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In EACL-06, pages 297-304.
Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016. Sequence-to-sequence RNNs for text summarization. http://arxiv.org/abs/1602.06023.
Courtney Napoles, Chris Callison-Burch, Juri Ganitkevitch, and Benjamin Van Durme. 2011. Paraphrastic sentence compression with a character-based metric: Tightening without deletion. In Proceedings of the Workshop on Monolingual Text-To-Text Generation (MTTG'11).
Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95-100. Association for Computational Linguistics.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 160-167. Association for Computational Linguistics.
Paul Over, Hoa Dang, and Donna Harman. 2007. DUC in context. Information Processing & Management, 43(6):1506-1520.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
Kristian Woodsend, Yansong Feng, and Mirella Lapata. 2010. Generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 513-523. Association for Computational Linguistics.
Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pages 1015-1024. Association for Computational Linguistics.
David Zajic, Bonnie Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In Proceedings of the HLT-NAACL 2004 Document Understanding Workshop, Boston, pages 112-119.
