Ranking Sentences for Extractive Summarization with Reinforcement Learning

Shashi Narayan, Shay B. Cohen, Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh, EH8 9AB
[email protected], {scohen,mlap}@inf.ed.ac.uk

Proceedings of NAACL-HLT 2018, pages 1747–1759, New Orleans, Louisiana, June 1 - 6, 2018. © 2018 Association for Computational Linguistics

Abstract

Single document summarization is the task of producing a shorter version of a document while preserving its principal information content. In this paper we conceptualize extractive summarization as a sentence ranking task and propose a novel training algorithm which globally optimizes the ROUGE evaluation metric through a reinforcement learning objective. We use our algorithm to train a neural summarization model on the CNN and DailyMail datasets and demonstrate experimentally that it outperforms state-of-the-art extractive and abstractive systems when evaluated automatically and by humans.¹

¹ Our code and data are available here: https://github.com/shashiongithub/Refresh.

1 Introduction

Automatic summarization has enjoyed wide popularity in natural language processing due to its potential for various information access applications. Examples include tools which help users navigate and digest web content (e.g., news, social media, product reviews), question answering, and personalized recommendation engines. Single document summarization, the task of producing a shorter version of a document while preserving its information content, is perhaps the most basic of summarization tasks that have been identified over the years (see Nenkova and McKeown, 2011 for a comprehensive overview).

Modern approaches to single document summarization are data-driven, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to preprocessing tools or linguistic annotations. Abstractive summarization involves various text rewriting operations (e.g., substitution, deletion, reordering) and has been recently framed as a sequence-to-sequence problem (Sutskever et al., 2014). Central in most approaches (Rush et al., 2015; Chen et al., 2016; Nallapati et al., 2016; See et al., 2017; Tan and Wan, 2017; Paulus et al., 2017) is an encoder-decoder architecture modeled by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. An attention mechanism (Bahdanau et al., 2015) is often used to locate the region of focus during decoding.

Extractive systems create a summary by identifying (and subsequently concatenating) the most important sentences in a document. A few recent approaches (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2017; Yasunaga et al., 2017) conceptualize extractive summarization as a sequence labeling task in which each label specifies whether the corresponding document sentence should be included in the summary. Existing models rely on recurrent neural networks to derive a meaning representation of the document which is then used to label each sentence, taking the previously labeled sentences into account. These models are typically trained using cross-entropy loss in order to maximize the likelihood of the ground-truth labels and do not necessarily learn to rank sentences based on their importance due to the absence of a ranking-based objective. Another discrepancy comes from the mismatch between the learning objective and the evaluation criterion, namely ROUGE (Lin and Hovy, 2003), which takes the entire summary into account.

In this paper we argue that cross-entropy training is not optimal for extractive summarization. Models trained this way are prone to generating verbose summaries with unnecessarily long sentences and redundant information. We propose to overcome these difficulties by globally optimizing the ROUGE evaluation metric and learning to rank sentences for summary generation through a reinforcement learning objective. Similar to previous work (Cheng and Lapata, 2016; Narayan et al., 2017; Nallapati et al., 2017), our neural summarization model consists of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combines the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. We show that this global optimization framework renders extractive models better at discriminating among sentences for the final summary; a sentence is ranked high for selection if it often occurs in high scoring summaries.

We report results on the CNN and DailyMail news highlights datasets (Hermann et al., 2015) which have been recently used as testbeds for the evaluation of neural summarization systems. Experimental results show that when evaluated automatically (in terms of ROUGE), our model outperforms state-of-the-art extractive and abstractive systems. We also conduct two human evaluations in order to assess (a) which type of summary participants prefer (we compare extractive and abstractive systems) and (b) how much key information from the document is preserved in the summary (we ask participants to answer questions pertaining to the content in the document by reading system summaries). Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete.

Our contributions in this work are three-fold: a novel application of reinforcement learning to sentence ranking for extractive summarization; analysis and empirical results corroborating that cross-entropy training is not well-suited to the summarization task; and large scale user studies following two evaluation paradigms which demonstrate that state-of-the-art abstractive systems lag behind extractive ones when the latter are globally trained.

2 Summarization as Sentence Ranking

Given a document D consisting of a sequence of sentences $(s_1, s_2, \ldots, s_n)$, an extractive summarizer aims to produce a summary S by selecting m sentences from D (where $m < n$). For each sentence $s_i \in D$, we predict a label $y_i \in \{0, 1\}$ (where 1 means that $s_i$ should be included in the summary) and assign a score $p(y_i \mid s_i, D, \theta)$ quantifying $s_i$'s relevance to the summary. The model learns to assign $p(1 \mid s_i, D, \theta) > p(1 \mid s_j, D, \theta)$ when sentence $s_i$ is more relevant than $s_j$. Model parameters are denoted by $\theta$. We estimate $p(y_i \mid s_i, D, \theta)$ using a neural network model and assemble a summary S by selecting the m sentences with top $p(1 \mid s_i, D, \theta)$ scores.

Our architecture resembles those previously proposed in the literature (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2017). The main components include a sentence encoder, a document encoder, and a sentence extractor (see the left block of Figure 1) which we describe in more detail below.

Sentence Encoder: A core component of our model is a convolutional sentence encoder which encodes sentences into continuous representations. In recent years, CNNs have proven useful for various NLP tasks (Collobert et al., 2011; Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Lei et al., 2015; Kim et al., 2016; Cheng and Lapata, 2016) because of their effectiveness in identifying salient patterns in the input (Xu et al., 2015). In the case of summarization, CNNs can identify named-entities and events that correlate with the gold summary.

We use temporal narrow convolution by applying a kernel filter K of width h to a window of h words in sentence s to produce a new feature. This filter is applied to each possible window of words in s to produce a feature map $f \in \mathbb{R}^{k-h+1}$ where k is the sentence length. We then apply max-pooling over time over the feature map f and take the maximum value as the feature corresponding to this particular filter K. We use multiple kernels of various sizes and each kernel multiple times to construct the representation of a sentence. In Figure 1, kernels of size 2 (red) and 4 (blue) are applied three times each. Max-pooling over time yields two feature lists $f^{K_2}$ and $f^{K_4} \in \mathbb{R}^3$. The final sentence embeddings have six dimensions.

Document Encoder: The document encoder composes a sequence of sentences to obtain a document representation. We use a recurrent neural network with Long Short-Term Memory (LSTM) cells to avoid the vanishing gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997). Given a document D consisting of a sequence of sentences $(s_1, s_2, \ldots, s_n)$, we follow common practice and feed sentences in reverse order (Sutskever et al., 2014; Li et al., 2015; Filippova et al., 2015; Narayan et al., 2017). This way we make sure that the network also considers the top sentences of the document which are particularly important for summarization (Rush et al., 2015; Nallapati et al., 2016).

Sentence Extractor: Our sentence extractor sequentially labels each sentence in a document with 1 (relevant for the summary) or 0 (otherwise).
[Figure 1 diagram: sentence encoder (convolution, max pooling), document encoder, and sentence extractor applied to sentences s1 to s5 of document D with labels y1 to y5; a candidate summary and a gold summary feed a REWARD component, and REINFORCE updates the agent.]

Figure 1: Extractive summarization model with reinforcement learning: a hierarchical encoder-decoder model ranks sentences for their extract-worthiness and a candidate summary is assembled from the top ranked sentences; the REWARD generator compares the candidate against the gold summary to give a reward which is used in the REINFORCE algorithm (Williams, 1992) to update the model.
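
The ranking and reward steps named in the caption can be sketched roughly as follows. This is a hedged illustration only: `select_top_m`, `unigram_f1`, and `summary_reward` are hypothetical names, the scores and sentences are made up, and the unigram F1 used here is a simple stand-in for ROUGE (Lin and Hovy, 2003), not the metric itself. Keeping the selected sentences in document order is also an assumption, and the REINFORCE parameter update is not shown.

```python
from collections import Counter


def select_top_m(scores, m):
    """Indices of the m highest-scoring sentences.

    Assumption: the selected sentences are returned in document order.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:m])


def unigram_f1(candidate_tokens, gold_tokens):
    """Unigram-overlap F1: a simplified stand-in for ROUGE, not ROUGE."""
    overlap = sum((Counter(candidate_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def summary_reward(sentences, scores, gold_summary, m):
    """Assemble a candidate from the top-m sentences and score it
    against the gold summary."""
    chosen = select_top_m(scores, m)
    candidate = [tok for i in chosen for tok in sentences[i].lower().split()]
    return chosen, unigram_f1(candidate, gold_summary.lower().split())


# Made-up document, scores p(1 | s_i, D, theta), and gold summary.
toy_sentences = [
    "police are still hunting for the driver",
    "the weather was mild",
    "the driver fled the scene",
]
toy_scores = [0.8, 0.1, 0.6]
chosen, reward = summary_reward(
    toy_sentences, toy_scores, "police hunt driver who fled the scene", m=2
)
print(chosen, round(reward, 3))
```

In the model described by the caption, a reward of this kind would be fed back through REINFORCE to adjust the scores the extractor assigns; the sketch stops at computing the reward.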
