Ranking Sentences for Extractive Summarization with Reinforcement Learning

Shashi Narayan, Shay B. Cohen, Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh, EH8 9AB
[email protected], {scohen,mlap}@inf.ed.ac.uk

Proceedings of NAACL-HLT 2018, pages 1747–1759, New Orleans, Louisiana, June 1 - 6, 2018. © 2018 Association for Computational Linguistics

Abstract

Single document summarization is the task of producing a shorter version of a document while preserving its principal information content. In this paper we conceptualize extractive summarization as a sentence ranking task and propose a novel training algorithm which globally optimizes the ROUGE evaluation metric through a reinforcement learning objective. We use our algorithm to train a neural summarization model on the CNN and DailyMail datasets and demonstrate experimentally that it outperforms state-of-the-art extractive and abstractive systems when evaluated automatically and by humans.¹

¹ Our code and data are available here: https://github.com/shashiongithub/Refresh.

1 Introduction

Automatic summarization has enjoyed wide popularity in natural language processing due to its potential for various information access applications. Examples include tools which aid users navigate and digest web content (e.g., news, social media, product reviews), question answering, and personalized recommendation engines. Single document summarization, the task of producing a shorter version of a document while preserving its information content, is perhaps the most basic of summarization tasks that have been identified over the years (see Nenkova and McKeown, 2011 for a comprehensive overview).

Modern approaches to single document summarization are data-driven, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to preprocessing tools or linguistic annotations. Abstractive summarization involves various text rewriting operations (e.g., substitution, deletion, reordering) and has been recently framed as a sequence-to-sequence problem (Sutskever et al., 2014). Central in most approaches (Rush et al., 2015; Chen et al., 2016; Nallapati et al., 2016; See et al., 2017; Tan and Wan, 2017; Paulus et al., 2017) is an encoder-decoder architecture modeled by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. An attention mechanism (Bahdanau et al., 2015) is often used to locate the region of focus during decoding.

Extractive systems create a summary by identifying (and subsequently concatenating) the most important sentences in a document. A few recent approaches (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2017; Yasunaga et al., 2017) conceptualize extractive summarization as a sequence labeling task in which each label specifies whether each document sentence should be included in the summary. Existing models rely on recurrent neural networks to derive a meaning representation of the document which is then used to label each sentence, taking the previously labeled sentences into account. These models are typically trained using cross-entropy loss in order to maximize the likelihood of the ground-truth labels and do not necessarily learn to rank sentences based on their importance due to the absence of a ranking-based objective. Another discrepancy comes from the mismatch between the learning objective and the evaluation criterion, namely ROUGE (Lin and Hovy, 2003), which takes the entire summary into account.

In this paper we argue that cross-entropy training is not optimal for extractive summarization. Models trained this way are prone to generating verbose summaries with unnecessarily long sentences and redundant information. We propose to overcome these difficulties by globally optimizing the ROUGE evaluation metric and learning to rank sentences for summary generation through a reinforcement learning objective. Similar to previous work (Cheng and Lapata, 2016; Narayan et al., 2017; Nallapati et al., 2017), our neural summarization model consists of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combines the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. We show that this global optimization framework renders extractive models better at discriminating among sentences for the final summary; a sentence is ranked high for selection if it often occurs in high scoring summaries.

We report results on the CNN and DailyMail news highlights datasets (Hermann et al., 2015) which have been recently used as testbeds for the evaluation of neural summarization systems. Experimental results show that when evaluated automatically (in terms of ROUGE), our model outperforms state-of-the-art extractive and abstractive systems. We also conduct two human evaluations in order to assess (a) which type of summary participants prefer (we compare extractive and abstractive systems) and (b) how much key information from the document is preserved in the summary (we ask participants to answer questions pertaining to the content in the document by reading system summaries). Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete.

Our contributions in this work are three-fold: a novel application of reinforcement learning to sentence ranking for extractive summarization; corroborated by analysis and empirical results showing that cross-entropy training is not well-suited to the summarization task; and large scale user studies following two evaluation paradigms which demonstrate that state-of-the-art abstractive systems lag behind extractive ones when the latter are globally trained.

2 Summarization as Sentence Ranking

Given a document $D$ consisting of a sequence of sentences $(s_1, s_2, \ldots, s_n)$, an extractive summarizer aims to produce a summary $S$ by selecting $m$ sentences from $D$ (where $m < n$). For each sentence $s_i \in D$, we predict a label $y_i \in \{0, 1\}$ (where 1 means that $s_i$ should be included in the summary) and assign a score $p(y_i \mid s_i, D, \theta)$ quantifying $s_i$'s relevance to the summary. The model learns to assign $p(1 \mid s_i, D, \theta) > p(1 \mid s_j, D, \theta)$ when sentence $s_i$ is more relevant than $s_j$. Model parameters are denoted by $\theta$. We estimate $p(y_i \mid s_i, D, \theta)$ using a neural network model and assemble a summary $S$ by selecting $m$ sentences with top $p(1 \mid s_i, D, \theta)$ scores.

Our architecture resembles those previously proposed in the literature (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2017). The main components include a sentence encoder, a document encoder, and a sentence extractor (see the left block of Figure 1) which we describe in more detail below.

Sentence Encoder A core component of our model is a convolutional sentence encoder which encodes sentences into continuous representations. In recent years, CNNs have proven useful for various NLP tasks (Collobert et al., 2011; Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Lei et al., 2015; Kim et al., 2016; Cheng and Lapata, 2016) because of their effectiveness in identifying salient patterns in the input (Xu et al., 2015). In the case of summarization, CNNs can identify named-entities and events that correlate with the gold summary.

We use temporal narrow convolution by applying a kernel filter $K$ of width $h$ to a window of $h$ words in sentence $s$ to produce a new feature. This filter is applied to each possible window of words in $s$ to produce a feature map $f \in \mathbb{R}^{k-h+1}$ where $k$ is the sentence length. We then apply max-pooling over time over the feature map $f$ and take the maximum value as the feature corresponding to this particular filter $K$. We use multiple kernels of various sizes and each kernel multiple times to construct the representation of a sentence. In Figure 1, kernels of size 2 (red) and 4 (blue) are applied three times each. Max-pooling over time yields two feature lists $f^{K_2}$ and $f^{K_4} \in \mathbb{R}^3$. The final sentence embeddings have six dimensions.

Document Encoder The document encoder composes a sequence of sentences to obtain a document representation. We use a recurrent neural network with Long Short-Term Memory (LSTM) cells to avoid the vanishing gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997). Given a document $D$ consisting of a sequence of sentences $(s_1, s_2, \ldots, s_n)$, we follow common practice and feed sentences in reverse order (Sutskever et al., 2014; Li et al., 2015; Filippova et al., 2015; Narayan et al., 2017). This way we make sure that the network also considers the top sentences of the document which are particularly important for summarization (Rush et al., 2015; Nallapati et al., 2016).

Sentence Extractor Our sentence extractor sequentially labels each sentence in a document with 1 (relevant for the summary) or 0 (otherwise).
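The convolutional sentence encoder described above can be illustrated with a small pure-Python sketch. It is not the authors' TensorFlow implementation: the function names `narrow_convolution` and `encode_sentence`, the toy word vectors, and the absence of a bias term or nonlinearity are assumptions made for illustration. Each kernel of width h yields a feature map of length k - h + 1, and max-pooling over time keeps one value per kernel.

```python
import random
from typing import List

Vector = List[float]


def narrow_convolution(words: List[Vector], kernel: List[Vector]) -> List[float]:
    """Apply a width-h kernel to every window of h word vectors (no padding).

    Returns a feature map of length k - h + 1, where k = len(words).
    Assumes the sentence has at least h words.
    """
    h, k = len(kernel), len(words)
    feature_map = []
    for start in range(k - h + 1):
        total = 0.0
        for offset in range(h):
            word, weights = words[start + offset], kernel[offset]
            total += sum(w * x for w, x in zip(weights, word))
        feature_map.append(total)
    return feature_map


def encode_sentence(words: List[Vector], kernels: List[List[Vector]]) -> List[float]:
    """Max-pool each kernel's feature map over time and concatenate the maxima."""
    return [max(narrow_convolution(words, kernel)) for kernel in kernels]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, sentence_length = 4, 7  # hypothetical sizes, chosen only for the demo
    sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(sentence_length)]
    # Three kernels of width 2 and three of width 4, mirroring the Figure 1 example.
    kernels = [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
        for width in (2, 2, 2, 4, 4, 4)
    ]
    embedding = encode_sentence(sentence, kernels)
    print(len(embedding))  # one pooled feature per kernel
```

With three kernels of each width, the pooled features concatenate into a six-dimensional sentence embedding, matching the example in the text.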

[Figure 1 diagram: a sentence encoder (convolution, max pooling) feeds a document encoder and a sentence extractor over sentences s1 to s5, producing labels y1 to y5; the REWARD generator compares the candidate summary with the gold summary, and REINFORCE updates the agent.]

Figure 1: Extractive summarization model with reinforcement learning: a hierarchical encoder-decoder model ranks sentences for their extract-worthiness and a candidate summary is assembled from the top ranked sentences; the REWARD generator compares the candidate against the gold summary to give a reward which is used in the REINFORCE algorithm (Williams, 1992) to update the model.
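The training loop summarized in Figure 1 can be sketched in a few lines of Python. This is a toy illustration under loudly stated assumptions, not the authors' implementation: `unigram_f1` is a simple stand-in for the ROUGE-based reward, the policy is a logistic model over hand-made sentence features rather than the hierarchical encoder-extractor network, and sampling uniformly from the pre-computed candidate set is an assumption. All function names are hypothetical.

```python
import itertools
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple

Candidate = Tuple[Tuple[int, ...], float]  # (selected sentence indices, reward)


def unigram_f1(candidate: Sequence[str], gold: Sequence[str]) -> float:
    """Stand-in reward: unigram-overlap F1 (NOT the ROUGE used in the paper)."""
    overlap = sum((Counter(candidate) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(candidate), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def build_candidates(sentences: List[List[str]], gold: List[str],
                     p: int, m: int, k: int) -> List[Candidate]:
    """Keep the p best individual sentences, score every combination of at
    most m of them against the gold summary, and return the top k extracts."""
    best = sorted(range(len(sentences)),
                  key=lambda i: unigram_f1(sentences[i], gold), reverse=True)[:p]
    scored = []
    for size in range(1, m + 1):
        for combo in itertools.combinations(sorted(best), size):
            tokens = [tok for i in combo for tok in sentences[i]]
            scored.append((combo, unigram_f1(tokens, gold)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def sentence_probs(weights: List[float], features: List[List[float]]) -> List[float]:
    """Toy policy: p(y_i = 1 | s_i) = sigmoid(w . x_i)."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, f))))
            for f in features]


def reinforce_step(weights: List[float], features: List[List[float]],
                   candidates: List[Candidate], rng: random.Random,
                   lr: float = 0.1) -> List[float]:
    """One single-sample update: w <- w + lr * r(y_hat) * grad log p(y_hat | D)."""
    selected, reward = rng.choice(candidates)  # uniform sampling is an assumption
    probs = sentence_probs(weights, features)
    updated = list(weights)
    for i, (prob, feats) in enumerate(zip(probs, features)):
        y_hat = 1.0 if i in selected else 0.0
        for d, x in enumerate(feats):
            # For a logistic policy, d/dw log p(y_hat_i | s_i) = (y_hat_i - p_i) * x.
            updated[d] += lr * reward * (y_hat - prob) * x
    return updated
```

Because the candidate set is computed once per document before training, each update only needs a lookup of the stored reward; sentences that appear in many high-reward candidates receive repeated positive updates and therefore rise in the ranking.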

It is implemented with another RNN with LSTM cells and a softmax layer. At time $t_i$, it reads sentence $s_i$ and makes a binary prediction, conditioned on the document representation (obtained from the document encoder) and the previously labeled sentences. This way, the sentence extractor is able to identify locally and globally important sentences within the document. We rank the sentences in a document $D$ by $p(y_i = 1 \mid s_i, D, \theta)$, the confidence scores assigned by the softmax layer of the sentence extractor.

We learn to rank sentences by training our network in a reinforcement learning framework, directly optimizing the final evaluation metric, namely ROUGE (Lin and Hovy, 2003). Before we describe our training algorithm, we elaborate on why the maximum-likelihood cross-entropy objective could be deficient for ranking sentences for summarization (Section 3). Then, we define our reinforcement learning objective in Section 4 and show that our new way of training allows the model to better discriminate amongst sentences, i.e., a sentence is ranked higher for selection if it often occurs in high scoring summaries.

3 The Pitfalls of Cross-Entropy Loss

Previous work optimizes summarization models by maximizing $p(y \mid D, \theta) = \prod_{i=1}^{n} p(y_i \mid s_i, D, \theta)$, the likelihood of the ground-truth labels $y = (y_1, y_2, \ldots, y_n)$ for sentences $(s_1, s_2, \ldots, s_n)$, given document $D$ and model parameters $\theta$. This objective can be achieved by minimizing the cross-entropy loss at each decoding step:

$$L(\theta) = -\sum_{i=1}^{n} \log p(y_i \mid s_i, D, \theta). \quad (1)$$

Cross-entropy training leads to two kinds of discrepancies in the model. The first discrepancy comes from the disconnect between the task definition and the training objective. While MLE in Equation (1) aims to maximize the likelihood of the ground-truth labels, the model is (a) expected to rank sentences to generate a summary and (b) evaluated using ROUGE at test time. The second discrepancy comes from the reliance on ground-truth labels. Document collections for training summarization systems do not naturally contain labels indicating which sentences should be extracted. Instead, they are typically accompanied by abstractive summaries from which sentence-level labels are extrapolated. Cheng and Lapata (2016) follow Woodsend and Lapata (2010) in adopting a rule-based method which assigns labels to each sentence in the document individually based on their semantic correspondence with the gold summary (see the fourth column in Table 1). An alternative method (Svore et al., 2007; Cao et al., 2016; Nallapati et al., 2017) identifies the set of sentences which collectively gives the highest ROUGE with respect to the gold summary. Sentences in this set are labeled with 1 and 0 otherwise (see the column 5 in Table 1).

Labeling sentences individually often generates too many positive labels causing the model to

overfit the data. For example, the document in Table 1 has 12 positively labeled sentences out of 31 in total (only first 10 are shown). Collective labels present a better alternative since they only pertain to the few sentences deemed most suitable to form the summary. However, a model trained with cross-entropy loss on collective labels will underfit the data as it will only maximize probabilities $p(1 \mid s_i, D, \theta)$ for sentences in this set (e.g., sentences $\{0, 11, 13\}$ in Table 1) and ignore all other sentences. We found that there are many candidate summaries with high ROUGE scores which could be considered during training.

Table 1 (last column) shows candidate summaries ranked according to the mean of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. Interestingly, multiple top ranked summaries have reasonably high ROUGE scores. For example, the average ROUGE for the summaries ranked second (0,13), third (11,13), and fourth (0,1,13) is 57.5%, 57.2%, and 57.1%, and all top 16 summaries have ROUGE scores more or equal to 50%. A few sentences are indicative of important content and appear frequently in the summaries: sentence 13 occurs in all summaries except one, while sentence 0 appears in several summaries too. Also note that summaries (11,13) and (1,13) yield better ROUGE scores compared to longer summaries, and may be as informative, yet more concise, alternatives.

These discrepancies render the model less efficient at ranking sentences for the summarization task. Instead of maximizing the likelihood of the ground-truth labels, we could train the model to predict the individual ROUGE score for each sentence in the document and then select the top $m$ sentences with highest scores. But sentences with individual ROUGE scores do not necessarily lead to a high scoring summary, e.g., they may convey overlapping content and form verbose and redundant summaries. For example, sentence 3, despite having a high individual ROUGE score (35.6%), does not occur in any of the top 5 summaries. We next explain how we address these issues using reinforcement learning.

| Sent. pos. | CNN article sentences | Sent-level ROUGE | Individual Oracle | Collective Oracle |
|---|---|---|---|---|
| 0 | A debilitating, mosquito-borne virus called Chikungunya has made its way to North Carolina, health officials say. | 21.2 | 1 | 1 |
| 1 | It's the state's first reported case of the virus. | 18.1 | 1 | 0 |
| 2 | The patient was likely infected in the Caribbean, according to the Forsyth County Department of Public Health. | 11.2 | 1 | 0 |
| 3 | Chikungunya is primarily found in Africa, East Asia and the Caribbean islands, but the Centers for Disease Control and Prevention has been watching the virus, for fear that it could take hold in the United States – much like West Nile did more than a decade ago. | 35.6 | 1 | 0 |
| 4 | The virus, which can cause joint pain and arthritis-like symptoms, has been on the U.S. public health radar for some time. | 16.7 | 1 | 0 |
| 5 | About 25 to 28 infected travelers bring it to the United States each year, said Roger Nasci, chief of the CDC's Arboviral Disease Branch in the Division of Vector-Borne Diseases. | 9.7 | 0 | 0 |
| 6 | "We haven't had any locally transmitted cases in the U.S. thus far," Nasci said. | 7.4 | 0 | 0 |
| 7 | But a major outbreak in the Caribbean this year – with more than 100,000 cases reported – has health officials concerned. | 16.4 | 1 | 0 |
| 8 | Experts say American tourists are bringing Chikungunya back home, and it's just a matter of time before it starts to spread within the United States. | 10.6 | 0 | 0 |
| 9 | After all, the Caribbean is a popular one with American tourists, and summer is fast approaching. | 13.9 | 1 | 0 |
| 10 | "So far this year we've recorded eight travel-associated cases, and seven of them have come from countries in the Caribbean where we know the virus is being transmitted," Nasci said. | 18.4 | 1 | 0 |
| 11 | Other states have also reported cases of Chikungunya. | 13.4 | 0 | 1 |
| 12 | The Tennessee Department of Health said the state has had multiple cases of the virus in people who have traveled to the Caribbean. | 15.6 | 1 | 0 |
| 13 | The virus is not deadly, but it can be painful, with symptoms lasting for weeks. | 54.5 | 1 | 1 |
| 14 | Those with weak immune systems, such as the elderly, are more likely to suffer from the virus' side effects than those who are healthier. | 5.5 | 0 | 0 |

Multiple Collective Oracle (candidate summaries with mean ROUGE F1): (0,11,13): 59.3; (0,13): 57.5; (11,13): 57.2; (0,1,13): 57.1; (1,13): 56.6; (3,11,13): 55.0; (13): 54.5; (0,3,13): 54.2; (3,13): 53.4; (1,3,13): 52.9; (1,11,13): 52.0; (0,9,13): 51.3; (0,7,13): 51.3; (0,12,13): 51.0; (9,11,13): 50.4; (1,9,13): 50.1; (12,13): 49.3; (7,11,13): 47.8; (0,10,13): 47.8; (11,12,13): 47.7; (7,13): 47.6; (9,13): 47.5; (1,7,13): 46.9; (3,7,13): 46.0; (3,12,13): 46.0; (3,9,13): 45.9; (10,13): 45.5; (4,11,13): 45.3; ...

Story Highlights
• North Carolina reports first case of mosquito-borne virus called Chikungunya
• Chikungunya is primarily found in Africa, East Asia and the Caribbean islands
• Virus is not deadly, but it can be painful, with symptoms lasting for weeks

Table 1: An abridged CNN article (only first 15 out of 31 sentences are shown) and its "story highlights". The latter are typically written by journalists to allow readers to quickly gather information on stories. Highlights are often used as gold standard abstractive summaries in the summarization literature.

4 Sentence Ranking with Reinforcement Learning

Reinforcement learning (Sutton and Barto, 1998) has been proposed as a way of training sequence-to-sequence generation models in order to directly optimize the metric used at test time, e.g., BLEU or ROUGE (Ranzato et al., 2015). We adapt reinforcement learning to our formulation of extractive summarization to rank sentences for summary generation. We propose an objective function that combines the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to globally optimize ROUGE. Our training algorithm allows to explore the space of possible summaries, making our model more robust to unseen data. As a result, reinforcement learning helps extractive summarization in two ways: (a) it directly optimizes the evaluation metric instead of maximizing the likelihood of the ground-truth labels and (b) it makes our model better at discriminating among sentences; a sentence is ranked high for selection if it often occurs in high scoring summaries.

4.1 Policy Learning

We cast the neural summarization model introduced in Figure 1 in the Reinforcement Learning paradigm (Sutton and Barto, 1998). Accordingly, the model can be viewed as an "agent" which interacts with an "environment" consisting of documents. At first, the agent is initialized randomly, it reads document $D$ and predicts a relevance score for each sentence $s_i \in D$ using "policy" $p(y_i \mid s_i, D, \theta)$, where $\theta$ are model parameters. Once the agent is done reading the document, a summary with labels $\hat{y}$ is sampled out of the ranked sentences. The agent is then given a "reward" $r$ commensurate with how well the extract resembles the gold-standard summary. Specifically, as reward function we use mean F1 of ROUGE-1, ROUGE-2, and ROUGE-L. Unigram and bigram overlap (ROUGE-1 and ROUGE-2) are meant to assess informativeness, whereas the longest common subsequence (ROUGE-L) is meant to assess fluency. We update the agent using the REINFORCE algorithm (Williams, 1992) which aims to minimize the negative expected reward:

$$L(\theta) = -\mathbb{E}_{\hat{y} \sim p_\theta}[r(\hat{y})] \quad (2)$$

where $p_\theta$ stands for $p(y \mid D, \theta)$. REINFORCE is based on the observation that the expected gradient of a non-differentiable reward function (ROUGE, in our case) can be computed as follows:

$$\nabla L(\theta) = -\mathbb{E}_{\hat{y} \sim p_\theta}[r(\hat{y}) \nabla \log p(\hat{y} \mid D, \theta)] \quad (3)$$

While MLE in Equation (1) aims to maximize the likelihood of the training data, the objective in Equation (2) learns to discriminate among sentences with respect to how often they occur in high scoring summaries.

4.2 Training with High Probability Samples

Computing the expectation term in Equation (3) is prohibitive, since there is a large number of possible extracts. In practice, we approximate the expected gradient using a single sample $\hat{y}$ from $p_\theta$ for each training example in a batch:

$$\nabla L(\theta) \approx -r(\hat{y}) \nabla \log p(\hat{y} \mid D, \theta) \approx -r(\hat{y}) \sum_{i=1}^{n} \nabla \log p(\hat{y}_i \mid s_i, D, \theta) \quad (4)$$

Presented in its original form, the REINFORCE algorithm starts learning with a random policy which can make model training challenging for complex tasks like ours where a single document can give rise to a very large number of candidate summaries. We therefore limit the search space of $\hat{y}$ in Equation (4) to the set of largest probability samples $\hat{Y}$. We approximate $\hat{Y}$ by the $k$ extracts which receive highest ROUGE scores. More concretely, we assemble candidate summaries efficiently by first selecting $p$ sentences from the document which on their own have high ROUGE scores. We then generate all possible combinations of $p$ sentences subject to maximum length $m$ and evaluate them against the gold summary. Summaries are ranked according to F1 by taking the mean of ROUGE-1, ROUGE-2, and ROUGE-L. $\hat{Y}$ contains these top $k$ candidate summaries. During training, we sample $\hat{y}$ from $\hat{Y}$ instead of $p(\hat{y} \mid D, \theta)$.

Ranzato et al. (2015) proposed an alternative to REINFORCE called MIXER (Mixed Incremental Cross-Entropy Reinforce) which first pretrains the model with the cross-entropy loss using ground truth labels and then follows a curriculum learning strategy (Bengio et al., 2015) to gradually teach the model to produce stable predictions on its own. In our experiments MIXER performed worse than the model of Nallapati et al. (2017) just trained on collective labels. We conjecture that this is due to the unbounded nature of our ranking problem. Recall that our model assigns relevance scores to sentences rather than words. The space of sentential representations is vast and fairly unconstrained compared to other prediction tasks operating with fixed vocabularies (Li et al., 2016; Paulus et al., 2017; Zhang and Lapata, 2017). Moreover, our approximation of the gradient allows the model to

converge much faster to an optimal policy. Advantageously, we do not require an online reward estimator, we pre-compute $\hat{Y}$, which leads to a significant speedup during training compared to MIXER (Ranzato et al., 2015) and related training schemes (Shen et al., 2016).

5 Experimental Setup

In this section we present our experimental setup for assessing the performance of our model which we call REFRESH as a shorthand for REinFoRcement Learning-based Extractive Summarization. We describe our datasets, discuss implementation details, our evaluation protocol, and the systems used for comparison.

Summarization Datasets We evaluated our models on the CNN and DailyMail news highlights datasets (Hermann et al., 2015). We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 documents for CNN and 196,961/12,148/10,397 for DailyMail). We did not anonymize entities or lower case tokens. We followed previous studies (Cheng and Lapata, 2016; Nallapati et al., 2016, 2017; See et al., 2017; Tan and Wan, 2017) in assuming that the "story highlights" associated with each article are gold-standard abstractive summaries. During training we use these to generate high scoring extracts and to estimate rewards for them, but during testing, they are used as reference summaries to evaluate our models.

Implementation Details We generated extracts by selecting three sentences ($m = 3$) for CNN articles and four sentences ($m = 4$) for DailyMail articles. These decisions were informed by the fact that gold highlights in the CNN/DailyMail validation sets are 2.6/4.2 sentences long. For both datasets, we estimated high-scoring extracts using 10 document sentences ($p = 10$) with highest ROUGE scores. We tuned the initialization parameter $k$ for $\hat{Y}$ on the validation set: we found that our model performs best with $k = 5$ for the CNN dataset and $k = 15$ for the DailyMail dataset. We used the One Billion Word Benchmark corpus (Chelba et al., 2013) to train word embeddings with the skip-gram model (Mikolov et al., 2013) using context window size 6, negative sampling size 10, and hierarchical softmax 1. Known words were initialized with pre-trained embeddings of size 200. Embeddings for unknown words were initialized to zero, but estimated during training.

Sentences were padded with zeros to a length of 100. For the sentence encoder, we used a list of kernels of widths 1 to 7, each with output channel size of 50 (Kim et al., 2016). The sentence embedding size in our model was 350.

For the recurrent neural network component in the document encoder and sentence extractor, we used a single-layered LSTM network with size 600. All input documents were padded with zeros to a maximum document length of 120. We performed minibatch cross-entropy training with a batch size of 20 documents for 20 training epochs. It took around 12 hrs on a single GPU to train. After each epoch, we evaluated our model on the validation set and chose the best performing model for the test set. During training we used the Adam optimizer (Kingma and Ba, 2015) with initial learning rate 0.001. Our system is implemented in TensorFlow (Abadi et al., 2015).

LEAD
• A SkyWest Airlines flight made an emergency landing in Buffalo, New York, on Wednesday after a passenger lost consciousness, officials said.
• The passenger received medical attention before being released, according to Marissa Snow, spokeswoman for SkyWest.
• She said the airliner expects to accommodate the 75 passengers on another aircraft to their original destination – Hartford, Connecticut – later Wednesday afternoon.

See et al.
• Skywest Airlines flight made an emergency landing in Buffalo, New York, on Wednesday after a passenger lost consciousness.
• She said the airliner expects to accommodate the 75 passengers on another aircraft to their original destination – Hartford, Connecticut.

REFRESH
• A SkyWest Airlines flight made an emergency landing in Buffalo, New York, on Wednesday after a passenger lost consciousness, officials said.
• The passenger received medical attention before being released, according to Marissa Snow, spokeswoman for SkyWest.
• The Federal Aviation Administration initially reported a pressurization problem and said it would investigate.

GOLD
• FAA backtracks on saying crew reported a pressurization problem
• One passenger lost consciousness
• The plane descended 28,000 feet in three minutes

Q1 Who backtracked on saying crew reported a pressurization problem? (FAA)
Q2 How many passengers lost consciousness in the incident? (One)
Q3 How far did the plane descend in three minutes? (28,000 feet)

Figure 2: Summaries produced by the LEAD baseline, the abstractive system of See et al. (2017) and REFRESH for a CNN (test) article. GOLD presents the human-authored summary; the bottom block shows manually written questions using the gold summary and their answers in parentheses.

Evaluation We evaluated summarization quality using F1 ROUGE (Lin and Hovy, 2003). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness and the longest common subsequence

(ROUGE-L) as a means of assessing fluency.² We compared REFRESH against a baseline which simply selects the first m leading sentences from each document (LEAD) and two neural models similar to ours (see left block in Figure 1), both trained with cross-entropy loss. Cheng and Lapata (2016) train on individual labels, while Nallapati et al. (2017) use collective labels. We also compared our model against the abstractive systems of Chen et al. (2016), Nallapati et al. (2016), See et al. (2017), and Tan and Wan (2017).³

In addition to ROUGE, which can be misleading when used as the only means to assess the informativeness of summaries (Schluter, 2017), we also evaluated system output by eliciting human judgments in two ways. In our first experiment, participants were presented with a news article and summaries generated by three systems: the LEAD baseline, abstracts from See et al. (2017), and extracts from REFRESH. We also included the human-authored highlights.⁴ Participants read the articles and were asked to rank the summaries from best (1) to worst (4) in order of informativeness (does the summary capture important information in the article?) and fluency (is the summary written in well-formed English?). We did not allow any ties. We randomly selected 10 articles from the CNN test set and 10 from the DailyMail test set. The study was completed by five participants, all native or proficient English speakers. Each participant was presented with the 20 articles. The order of summaries to rank was randomized per article and the order of articles per participant. Examples of summaries our subjects ranked are shown in Figure 2.

Our second experiment assessed the degree to which our model retains key information from the document following a question-answering (QA) paradigm which has been previously used to evaluate summary quality and text compression (Morris et al., 1992; Mani et al., 2002; Clarke and Lapata, 2010). We created a set of questions based on the gold summary under the assumption that it highlights the most important document content. We then examined whether participants were able to answer these questions by reading system summaries alone without access to the article. The more questions a system can answer, the better it is at summarizing the document as a whole.

We worked on the same 20 documents used in our first elicitation study. We wrote multiple fact-based question-answer pairs for each gold summary without looking at the document. Questions were formulated so as not to reveal answers to subsequent questions. We created 71 questions in total, varying from two to six questions per gold summary. Example questions are given in Figure 2. Participants read the summary and answered all associated questions as best they could without access to the original document or the gold summary. Subjects were shown summaries from three systems: the LEAD baseline, the abstractive system of See et al. (2017), and REFRESH. Five participants answered questions for each summary. We used the same scoring mechanism as Clarke and Lapata (2010), i.e., a correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores. Answers were elicited using Amazon's Mechanical Turk crowdsourcing platform. We uploaded data in batches (one system at a time) on Mechanical Turk to ensure that the same participant does not evaluate summaries from different systems on the same set of questions.

² We used pyrouge, a Python package, to compute all ROUGE scores with parameters "-a -c 95 -m -n 4 -w 1.2."

³ Cheng and Lapata (2016) report ROUGE recall scores on the DailyMail dataset only. We used their code (https://github.com/cheng6076/NeuralSum) to produce ROUGE F1 scores on both CNN and DailyMail datasets. For other systems, all results are taken from their papers.

⁴ We are grateful to Abigail See for providing us with the output of her system. We did not include output from Nallapati et al. (2017), Chen et al. (2016), Nallapati et al. (2016), or Tan and Wan (2017) in our human evaluation study, as these models are trained on a named-entity anonymized version of the CNN and DailyMail datasets, and as a result produce summaries which are not comparable to ours. We did not include extracts from Cheng and Lapata (2016) either as they were significantly inferior to LEAD (see Table 2).

6 Results

We report results using automatic metrics in Table 2. The top part of the table compares REFRESH against related extractive systems. The bottom part reports the performance of abstractive systems. We present three variants of LEAD: one is computed by ourselves and the other two are reported in Nallapati et al. (2017) and See et al. (2017). Note that they vary slightly due to differences in the preprocessing of the data. We report results on the CNN and DailyMail datasets and their combination (CNN+DailyMail).

Cross-Entropy vs Reinforcement Learning. The results in Table 2 show that REFRESH is superior to our LEAD baseline and extractive systems across datasets and metrics. It outperforms

Models                          CNN               DailyMail         CNN+DailyMail
                                R1    R2    RL    R1    R2    RL    R1    R2    RL
LEAD (ours)                     29.1  11.1  25.9  40.7  18.3  37.2  39.6  17.7  36.2
LEAD∗ (Nallapati et al., 2017)  —     —     —     —     —     —     39.2  15.7  35.5
LEAD (See et al., 2017)         —     —     —     —     —     —     40.3  17.7  36.6
Cheng and Lapata (2016)         28.4  10.0  25.0  36.2  15.2  32.9  35.5  14.7  32.2
Nallapati et al. (2017)∗        —     —     —     —     —     —     39.6  16.2  35.3
REFRESH                         30.4  11.7  26.9  41.0  18.8  37.7  40.0  18.2  36.6

Chen et al. (2016)∗             27.1  8.2   18.7  —     —     —     —     —     —
Nallapati et al. (2016)∗        —     —     —     —     —     —     35.4  13.3  32.6
See et al. (2017)               —     —     —     —     —     —     39.5  17.3  36.4
Tan and Wan (2017)∗             30.3  9.8   20.0  —     —     —     38.1  13.9  34.0

Table 2: Results on the CNN and DailyMail test sets. We report ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F1 scores. Extractive systems are in the first block and abstractive systems in the second. Table cells are filled with — whenever results are not available. Models marked with ∗ are not directly comparable to ours as they are based on an anonymized version of the dataset.

the extractive system of Cheng and Lapata (2016) which is trained on individual labels. REFRESH is not directly comparable with Nallapati et al. (2017) as they generate anonymized summaries. Their system lags behind their LEAD baseline on ROUGE-L on the CNN+DailyMail dataset (35.5% vs 35.3%). Also note that their model is trained on collective labels and has a significant lead over Cheng and Lapata (2016). As discussed in Section 3, cross-entropy training on individual labels tends to overgenerate positive labels, leading to less informative and verbose summaries.

Extractive vs Abstractive Systems. Our automatic evaluation results further demonstrate that REFRESH is superior to abstractive systems (Chen et al., 2016; Nallapati et al., 2016; See et al., 2017; Tan and Wan, 2017) which are all variants of an encoder-decoder architecture (Sutskever et al., 2014). Despite being more faithful to the actual summarization task (hand-written summaries combine several pieces of information from the original document), abstractive systems lag behind the LEAD baseline. Tan and Wan (2017) present a graph-based neural model, which manages to outperform LEAD on ROUGE-1 but falters when higher order ROUGE scores are used. Amongst abstractive systems See et al. (2017) perform best. Interestingly, their system is mostly extractive, exhibiting a small degree of rewriting; it copies more than 35% of the sentences in the source document, 85% of 4-grams, 90% of 3-grams, 95% of bigrams, and 99% of unigrams.

Models              1st   2nd   3rd   4th   QA
LEAD                0.11  0.21  0.34  0.33  36.33
See et al. (2017)   0.14  0.18  0.31  0.36  28.73
REFRESH             0.35  0.42  0.16  0.07  66.34
GOLD                0.39  0.19  0.18  0.24  —

Table 3: System ranking and QA-based evaluations. Rankings (1st, 2nd, 3rd and 4th) are shown as proportions. Rank 1 is the best and Rank 4, the worst. The column QA shows the percentage of questions that participants answered correctly by reading system summaries.

Human Evaluation: System Ranking. Table 3 shows, proportionally, how often participants ranked each system 1st, 2nd, and so on. Perhaps unsurprisingly human-authored summaries are considered best (and ranked 1st 39% of the time). REFRESH is ranked 2nd best followed by LEAD and See et al. (2017) which are mostly ranked in 3rd and 4th places. We carried out pairwise comparisons between all models in Table 3 to assess whether system differences are statistically significant. There is no significant difference between LEAD and See et al. (2017), and REFRESH and GOLD (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). All other differences are statistically significant.

Human Evaluation: Question Answering. The results of our QA evaluation are shown in the last column of Table 3. Based on summaries generated by REFRESH, participants can answer 66.34% of questions correctly. Summaries produced by LEAD and the abstractive system of See et al. (2017) provide answers for 36.33% and 28.73% of the questions, respectively. Differences between systems are all statistically significant (p < 0.01) with the exception of LEAD and See et al. (2017). Although the QA results in Table 3 follow the same pattern as ROUGE in Table 2, differences among systems are now greatly amplified. QA-based evaluation is more focused and a closer reflection of users' information need (i.e., to find out what the article is about), whereas ROUGE simply captures surface similarity (i.e., n-gram overlap)

between output summaries and their references. Interestingly, LEAD is considered better than See et al. (2017) in the QA evaluation, whereas we find the opposite when participants are asked to rank systems. We hypothesize that LEAD is indeed more informative than See et al. (2017) but humans prefer shorter summaries. The average length of LEAD summaries is 105.7 words compared to 61.6 for See et al. (2017).

7 Related Work

Traditional summarization methods manually define features to rank sentences for their salience in order to identify the most important sentences in a document or set of documents (Kupiec et al., 1995; Mani, 2001; Radev et al., 2004; Filatova and Hatzivassiloglou, 2004; Nenkova et al., 2006; Spärck Jones, 2007). A vast majority of these methods learn to score each sentence independently (Barzilay and Elhadad, 1997; Teufel and Moens, 1997; Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Shen et al., 2007; Schilder and Kondadadi, 2008; Wan, 2010) and a summary is generated by selecting top-scored sentences in a way that is not incorporated into the learning process. Summary quality can be improved heuristically (Yih et al., 2007), via max-margin methods (Carbonell and Goldstein, 1998; Li et al., 2009), or integer-linear programming (Woodsend and Lapata, 2010; Berg-Kirkpatrick et al., 2011; Woodsend and Lapata, 2012; Almeida and Martins, 2013; Parveen et al., 2015).

Recent deep learning methods (Kågebäck et al., 2014; Yin and Pei, 2015; Cheng and Lapata, 2016; Nallapati et al., 2017) learn continuous features without any linguistic preprocessing (e.g., named entities). Like traditional methods, these approaches also suffer from the mismatch between the learning objective and the evaluation criterion (e.g., ROUGE) used at test time. In comparison, our neural model globally optimizes the ROUGE evaluation metric through a reinforcement learning objective: sentences are highly ranked if they occur in highly scoring summaries.

Reinforcement learning has been previously used in the context of traditional multi-document summarization as a means of selecting a sentence or a subset of sentences from a document cluster. Ryang and Abekawa (2012) cast the sentence selection task as a search problem. Their agent observes a state (e.g., a candidate summary), executes an action (a transition operation that produces a new state selecting a not-yet-selected sentence), and then receives a delayed reward based on tf∗idf. Follow-on work (Rioux et al., 2014) extends this approach by employing ROUGE as part of the reward function, while Henß et al. (2015) further experiment with Q-learning. Mollá-Aliod (2017) has adapted this approach to query-focused summarization. Our model differs from these approaches both in application and formulation. We focus solely on extractive summarization; in our case states are documents (not summaries) and actions are relevance scores which lead to sentence ranking (not sentence-to-sentence transitions). Rather than employing reinforcement learning for sentence selection, our algorithm performs sentence ranking using ROUGE as the reward function.

The REINFORCE algorithm (Williams, 1992) has been shown to improve encoder-decoder text-rewriting systems by making it possible to directly optimize a non-differentiable objective (Ranzato et al., 2015; Li et al., 2016; Paulus et al., 2017) or to inject task-specific constraints (Zhang and Lapata, 2017; Nogueira and Cho, 2017). However, we are not aware of any attempts to use reinforcement learning for training a sentence ranker in the context of extractive summarization.

8 Conclusions

In this work we developed an extractive summarization model which is globally trained by optimizing the ROUGE evaluation metric. Our training algorithm explores the space of candidate summaries while learning to optimize a reward function which is relevant for the task at hand. Experimental results show that reinforcement learning offers a great means to steer our model towards generating informative, fluent, and concise summaries, outperforming state-of-the-art extractive and abstractive systems on the CNN and DailyMail datasets. In the future we would like to focus on smaller discourse units (Mann and Thompson, 1988) rather than individual sentences, modeling compression and extraction jointly.

Acknowledgments

We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).
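The training idea summarized in Sections 7 and 8 (sample candidate summaries, score them with the evaluation metric, and raise the rank of sentences that occur in high-scoring candidates) can be illustrated with a toy sketch. This is not the REFRESH implementation. It rests on several assumptions made purely for illustration: a unigram-overlap F1 stands in for ROUGE, each sentence gets one free logit instead of a neural encoder, candidates are sampled directly from the current policy, and a running-average baseline is used to reduce variance. The function names `unigram_f1`, `train_ranker`, and `rank_sentences` are hypothetical.

```python
import math
import random
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1; a crude stand-in for ROUGE-1 (assumption)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_ranker(sentences, reference, steps=2000, lr=0.5, seed=0):
    """Learn one logit per sentence with a REINFORCE-style update."""
    rng = random.Random(seed)
    logits = [0.0] * len(sentences)
    baseline = 0.0  # running average of rewards (variance reduction)
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sample a candidate extract: label 1 means "include the sentence".
        labels = [1 if rng.random() < p else 0 for p in probs]
        extract = " ".join(s for s, y in zip(sentences, labels) if y)
        reward = unigram_f1(extract, reference)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # d/dz log Bernoulli(y; sigmoid(z)) = y - sigmoid(z)
        for i, (y, p) in enumerate(zip(labels, probs)):
            logits[i] += lr * advantage * (y - p)
    return logits


def rank_sentences(sentences, logits):
    """Return sentences sorted from highest to lowest learned score."""
    order = sorted(range(len(sentences)), key=lambda i: logits[i], reverse=True)
    return [sentences[i] for i in order]


if __name__ == "__main__":
    # Invented toy document and reference summary.
    doc = [
        "the cat sat on the mat",
        "stocks fell sharply on monday",
        "the cat chased a mouse",
        "rain is expected tomorrow",
    ]
    gold = "the cat sat on the mat and chased a mouse"
    for sentence in rank_sentences(doc, train_ranker(doc, gold)):
        print(sentence)
```

In this sketch the reward is computed on the whole sampled extract rather than on individual sentence labels, which is the contrast with cross-entropy training that the paper draws: a sentence is pushed up the ranking only to the extent that the extracts containing it score well against the reference.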

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Miguel B. Almeida and André F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, pages 196–206.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, California, USA.

Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization. Madrid, Spain, pages 10–17.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28. pages 1171–1179.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA, pages 481–490.

Ziqiang Cao, Chengyao Chen, Wenjie Li, Sujian Li, Furu Wei, and Ming Zhou. 2016. TGSum: Build tweet guided multi-document summarization dataset. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, Arizona, USA, pages 2906–2912.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia, pages 335–336.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. New York, USA, pages 2754–2760.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, pages 484–494.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics 36(3):411–441.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22(1):457–479.

Elena Filatova and Vasileios Hatzivassiloglou. 2004. Event-based extractive summarization. In Proceedings of ACL Workshop on Text Summarization Branches Out. Barcelona, Spain, pages 104–111.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 360–368.

Sebastian Henß, Margot Mieskes, and Iryna Gurevych. 2015. A reinforcement learning approach for adaptive single- and multi-document summarization. In Proceedings of International Conference of the German Society for Computational Linguistics and Language Technology. Duisburg-Essen, Germany, pages 3–12.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28. pages 1693–1701.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, Maryland, pages 655–665.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 1746–1751.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, Arizona, USA, pages 2741–2749.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, California, USA.

Mikael Kågebäck, Olof Mogren, Nina Tahmasebi, and Devdatt Dubhashi. 2014. Extractive summarization using continuous vector space models. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality. Gothenburg, Sweden, pages 31–39.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, Washington, USA, pages 406–407.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding CNNs for text: Non-linear, non-consecutive convolutions. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 1565–1575.

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, pages 1106–1115.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 1192–1202.

Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu. 2009. Enhancing diversity, coverage and balance for summarization through structure learning. In Proceedings of the 18th International Conference on World Wide Web. Madrid, Spain, pages 71–80.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Edmonton, Canada, pages 71–78.

Inderjeet Mani. 2001. Automatic Summarization. Natural language processing. John Benjamins Publishing Company.

Inderjeet Mani, Gary Klein, David House, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 2002. SUMMAC: A text summarization evaluation. Natural Language Engineering 8(1):43–68.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3):243–281.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain, pages 404–411.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26. pages 3111–3119.

Diego Mollá-Aliod. 2017. Towards the use of deep reinforcement learning with global policy for query-based extractive summarisation. In Proceedings of the Australasian Language Technology Association Workshop 2017. Brisbane, Australia, pages 103–107.

Andrew H. Morris, George M. Kasper, and Dennis A. Adams. 1992. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research 3(1):17–35.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, California, USA, pages 3075–3081.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany, pages 280–290.

Shashi Narayan, Nikos Papasarantopoulos, Shay B. Cohen, and Mirella Lapata. 2017. Neural extractive summarization with side information. CoRR abs/1704.04530.

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval 5(2–3):103–233.

Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pages 573–580.

Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 585–594.

Daraksha Parveen, Hans-Martin Ramsl, and Michael Strube. 2015. Topical coherence for graph-based extractive summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 1949–1954.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. CoRR abs/1705.04304.

Dragomir Radev, Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Çelebi, Stanko Dimitrov, Elliott Drabek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004. MEAD - A platform for multidocument multilingual text summarization. In Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal, pages 699–702.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.

Cody Rioux, Sadid A. Hasan, and Yllias Chali. 2014. Fear the REAPER: A system for automatic multi-document summarization with reinforcement learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 681–690.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 379–389.

Seonggi Ryang and Takeshi Abekawa. 2012. Framework of automatic text summarization using reinforcement learning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea, pages 256–265.

Frank Schilder and Ravikumar Kondadadi. 2008. FastSum: Fast and accurate query-based multi-document summarization. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics and HLT: Short Papers. Columbus, Ohio, USA, pages 205–208.

Natalie Schluter. 2017. The limits of automatic summarisation according to ROUGE. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers. Valencia, Spain, pages 41–45.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, pages 1073–1083.

Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. 2007. Document summarization using conditional random fields. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India, pages 2862–2867.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, pages 1683–1692.

Karen Spärck Jones. 2007. Automatic summarising: The state of the art. Information Processing and Management 43(6):1449–1481.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27. pages 3104–3112.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Krysta Marie Svore, Lucy Vanderwende, and Christopher J. C. Burges. 2007. Enhancing single-document summarization by combining ranknet and third-party sources. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Prague, Czech Republic, pages 448–457.

Jiwei Tan and Xiaojun Wan. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, pages 1171–1181.

Simone Teufel and Marc Moens. 1997. Sentence extraction as a classification task. In Proceedings of ACL Workshop on Intelligent and Scalable Text Summarization. Madrid, Spain, pages 58–65.

Xiaojun Wan. 2010. Towards a unified approach to simultaneous single-document and multi-document summarizations. In Proceedings of the 23rd International Conference on Computational Linguistics. Beijing, China, pages 1137–1145.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.

Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 565–574.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea, pages 233–243.

Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning. Lille, France, pages 2048–2057.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning. Vancouver, Canada, pages 452–462.

Wen-tau Yih, Joshua Goodman, Lucy Vanderwende, and Hisami Suzuki. 2007. Multi-document summarization by maximizing informative content-words. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India, pages 1776–1782.

Wenpeng Yin and Yulong Pei. 2015. Optimizing sentence modeling and selection for document summarization. In Proceedings of the 24th International Joint Conference on Artificial Intelligence. Buenos Aires, Argentina, pages 1383–1389.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28. pages 649–657.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark, pages 595–605.
