Abstractive News Summarization via Copying and Transforming

Philippe Laban, John Canny, Marti Hearst
UC Berkeley

Abstract

Automatic abstractive summarization in the news domain often relies on a Pointer-Generator architecture, which favors the use of long extracts from the article. We propose a new architecture, Copy and Transform (CaT), that produces more abstractive summaries. We build a new coverage mechanism, keyword coverage, which encourages the decoder at test-time to use pre-defined keywords. We train our CaT model with keyword coverage on two stylistically different datasets, and obtain state-of-the-art results on both (+0.3 ROUGE-1 on CNNDM, +4.1 ROUGE-1 on Newsroom). We show that human judges prefer our summaries over the previous state of the art, but still prefer human-written summaries. Our system can summarize articles in two styles, bullet-point and long-sentence, and a list of desired keywords can be provided to customize the summary.

Original article: A Chilean miner stole the show at the World Pasty Championship by beating his Cornish competition. Jorge Pereira won the open savoury amateur prize with his empanada Chilena, a traditional Chilean pasty made with beef, onion, hard-boiled egg, olives and sultanas. Mr. Pereira decided to take part in the contest while on a two-month visit to the UK to see his wife’s family. Wife Gail, who spoke on the non-English-speaking cook’s behalf, said: “Jorge feels very excited and happy to be so far from my country to win such a prize”. “It’s all about getting recognition for his country rather than winning.” There were also pasty makers...

Long-sentence style summary
Keywords 0: [Chile, Jorge, amateur, past, Prize]
Chilean miner Jorge Pereira won the open savoury amateur prize with his empanada Chilena, a traditional Chilean pasty made with beef, onion, hard-boiled egg, olives, and sultanas.

Bullet-point style summary
Keywords 1: [Chile, Jorge, amateur, past, Prize]
Keywords 2: [Jorge, Wife, Gail, family, visit]
• Jorge Pereira won the open savoury amateur prize with his empanada Chilena.
• The Chilean pasty made with beef, onion, hard-boiled egg, olives and sultanas.
• He decided to take part in the contest while on a two-month visit to the UK.
• Wife Gail, who spoke on the non-English-speaking cook’s behalf, said: “It’s all about getting recognition for his country rather than winning.”

Figure 1: Our system can generate a news article summary in two styles: long-sentence and bullet-point. When the keyword input is changed from Keywords 1 to Keywords 2, the first two bullet points remain unchanged, but the third bullet point shown is changed into the fourth.

1 Introduction

Summarization is the task of reducing a document to a shorter summary that retains the most important points of the original document. This definition, although simple, leaves out the fact that a summary is intended for a reader, whose perspective and prior knowledge may affect what the important points are. Summarization in the news domain is frequent and occurs at several levels, as news articles are often summarized by a journalist into a headline as well as a short summary.

When a journalist summarizes a news article, they are free to reuse content from the article, as well as to introduce new words and phrases, to achieve brevity and a desired style. Mimicking this behavior, automatic news summarization is often achieved with a Pointer-Generator mechanism (See et al., 2017), a hybrid architecture that can use words from the input article by pointing to them (the pointer), and can freely choose words from a smaller vocabulary (the generator).
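In See et al.’s (2017) formulation, the decoder’s output distribution at step $t$ is a learned mixture of the generator’s vocabulary distribution and the pointer’s attention distribution:

$$
P(w) \;=\; p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) \;+\; \left(1 - p_{\mathrm{gen}}\right) \sum_{i \,:\, w_i = w} a^{t}_{i},
$$

where $a^{t}$ is the attention over source positions, $w_i$ is the source token at position $i$, and $p_{\mathrm{gen}} \in [0,1]$ is a switch predicted from the decoder state. The tension we discuss next stems from this single scalar arbitrating between the two modes.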
We argue this is constraining, as the decoder must decide at every step whether to give more control to the pointer or to the generator. At test-time, Pointer-Generators tend to rely heavily on pointing, making the summaries too extractive in comparison to human-written summaries.

We propose a novel mechanism for text generation in summarization: Copy and Transform. We use a Transformer (Vaswani et al., 2017) architecture and force the encoder and decoder to share vocabulary and embedding matrices. Because each layer of the Transformer is made of a transformation and a residual connection, the network is able to learn to copy words from the input or to transform them, within a single generation mechanism. We demonstrate that the Copy and Transform mechanism outperforms the Pointer-Generator mechanism, and achieves state-of-the-art ROUGE scores on the CNNDM (Nallapati et al., 2016) and Newsroom (Grusky et al., 2018) datasets.
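This copy-or-transform behavior falls out of the Transformer’s residual structure: each sub-layer computes $x + f(x)$, so a token representation for which $f$ outputs (near-)zero passes through the layer unchanged. Below is a minimal PyTorch sketch of that property; the module and its sizes are illustrative stand-ins, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One Transformer-style sub-layer: out = x + f(x).

    If f learns to output ~0 for a token, that token's embedding
    passes through unchanged (a "copy"); a non-zero f(x) rewrites
    it (a "transform"). Here f is a stand-in for the attention or
    feed-forward sub-layers of a real Transformer.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # the residual path carries x intact

block = ResidualBlock(d_model=512)
nn.init.zeros_(block.f[2].weight)   # silence f: the block now copies
nn.init.zeros_(block.f[2].bias)
x = torch.randn(2, 10, 512)         # (batch, tokens, features)
assert torch.allclose(block(x), x)  # output == input: a perfect copy
```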
Our method also lends itself to customizing the summaries in two ways: summaries for an article can be produced in two styles (bullet-point vs. long-sentence), and a list of keywords desired in the summary can optionally be provided. Automatic summarization can help personalize content at scales not otherwise possible. Our contributions are the following:

1. Copy and Transform, a novel mechanism for text generation in summarization.

2. A conceptually simple coverage mechanism that encourages expected keywords to appear in the summary.

3. Two types of optional summary customizability: (i) format, with two different summary styles, and (ii) content, with flexibility in the specification of keywords to include when generating summaries.

2 Related Work

News abstractive summarization started with smaller-scale datasets, such as DUC-2004 (Harman and Over, 2004; Dang, 2006), or with shorter summaries, such as the NYT corpus (Sandhaus, 2008), but more recently the field has focused on the CNNDM dataset, first introduced by Nallapati et al. (2016), who published the first extractive and abstractive results on it. Grusky et al. (2018) introduced the much larger Newsroom dataset, composed of summaries of one to two long sentences, drawn from a diverse set of sources.

The Pointer-Generator mechanism was introduced by See et al. (2017) as a hybrid between a seq2seq model with attention (Sutskever et al., 2014) and a pointer network (Vinyals et al., 2015). The Pointer-Generator has become standard for summarization, and is used by most of the following work. We propose a new mechanism for text generation in the summarization domain: Copy and Transform.

Optimizing ROUGE and other evaluation metrics with reinforcement learning (RL) was introduced to the field of summarization by Paulus et al. (2018), and has since been adapted, for example, to optimize variants of ROUGE that incorporate entailment information between generated summaries and articles (Pasunuru and Bansal, 2018). Our method does not require RL, and does not directly optimize a metric we evaluate on.

Multi-pass summarization is also commonly explored: Nallapati et al. (2017) shrink the original document to its most relevant sentences before summarizing; Chen and Bansal (2018) extract the sentences that should appear in the summary (based on expected ROUGE), then use an abstractive system to modify each sentence; and Gehrmann et al. (2018) first train a content-selector network used to constrain a second system’s Pointer-Generator to point only to words that are likely to be in a summary. Our method consists of a single system that reads the news article unmodified.

Increasing the coverage of a summary is the idea that a good summary should cover the set of important topics the article details. See et al. (2017) first proposed a coverage loss discouraging the Pointer-Generator from pointing to words in the article it has already pointed to, enforcing that it does not repeat itself and indirectly forcing it to cover more of the article. Gehrmann et al. (2018) constrain the attention mechanism to point only to words that are judged likely to be in the summary. Pasunuru and Bansal (2018) develop an RL-based loss that encourages salient keywords to be included in the summary. We propose keyword coverage, a simple, test-time method to encourage the use of important keywords that requires neither an additional model nor fine-tuning.

Transformers were initially introduced for machine translation (Vaswani et al., 2017). Liu et al. (2018) propose to use Transformers for the summarization of Wikipedia pages, but they do not share vocabularies between the encoder and decoder, which limits the ability of the Transformer to copy from the input, as we propose in this work. Gehrmann et al. (2018) also report an experiment in which they modify a Transformer by randomly assigning one of its cross-attention heads to be a pointer head, shaping the Transformer into a Pointer-Generator; they obtain mixed results, and their best-performing model remains an LSTM-based Pointer-Generator. We propose to use the Transformer for what it is best at, transforming, and simply make it easy for it to copy input passages to the output.

Customizability of summaries is an exciting new topic, whose objective is to cater the information synthesized in a summary to each individual, based on their interests.

3 Copy and Transform

We illustrate the Copy and Transform mechanism in Figure 2. Unlike the Pointer-Generator, we do not rely on a separate pointing mechanism based on the input-output attention of the model. Instead, we tie all word embedding matrices to one another, as well as to the final projection layer of the decoder. With tied embeddings, if an input word’s embedding is propagated unchanged through the network to the last layer of the decoder, that word is the one the decoder selects, essentially allowing it to be copied intact.
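As a concrete illustration, here is a minimal PyTorch sketch of this weight tying; the names, dimensions, and module layout are ours for illustration (the paper’s Figure 2 depicts the actual architecture).

```python
import torch
import torch.nn as nn

class TiedEmbeddings(nn.Module):
    """One embedding matrix shared by the encoder input, the decoder
    input, and the decoder's output projection."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight  # tie projection to embeddings

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)  # used for encoder and decoder inputs

    def logits(self, decoder_state: torch.Tensor) -> torch.Tensor:
        # logits[w] = <decoder_state, embed[w]>: a decoder state equal to
        # an input word's embedding, propagated intact, scores that word
        # highest, so the word is copied to the output.
        return self.proj(decoder_state)

tied = TiedEmbeddings(vocab_size=32000, d_model=512)
x = tied.encode(torch.tensor([42]))   # embed source word 42
print(tied.logits(x).argmax(dim=-1))  # word 42 wins (w.h.p. at random
                                      # init; training reinforces this)
```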