Soft Layer-Specific Multi-Task Summarization with Entailment And
Total Page:16
File Type:pdf, Size:1020Kb
Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation Han Guo∗ Ramakanth Pasunuru∗ Mohit Bansal UNC Chapel Hill fhanguo, ram, [email protected] Abstract Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015; Henß et al., 2015), abstractive sum- An accurate abstractive summary of a doc- maries are based on rewriting as opposed to se- ument should contain all its salient infor- lecting. Recent end-to-end, neural sequence-to- mation and should be logically entailed by sequence models and larger datasets have allowed the input document. We improve these substantial progress on the abstractive task, with important aspects of abstractive summa- ideas ranging from copy-pointer mechanism and rization via multi-task learning with the redundancy coverage, to metric reward based re- auxiliary tasks of question generation and inforcement learning (Rush et al., 2015; Chopra entailment generation, where the former et al., 2016; Nallapati et al., 2016; See et al., 2017). teaches the summarization model how to Despite these strong recent advancements, there look for salient questioning-worthy de- is still a lot of scope for improving the summary tails, and the latter teaches the model quality generated by these models. A good rewrit- how to rewrite a summary which is a ten summary is one that contains all the salient directed-logical subset of the input doc- information from the document, is logically fol- ument. We also propose novel multi- lowed (entailed) by it, and avoids redundant infor- task architectures with high-level (seman- mation. The redundancy aspect was addressed by tic) layer-specific sharing across multi- coverage models (Suzuki and Nagata, 2016; Chen ple encoder and decoder layers of the et al., 2016; Nallapati et al., 2016; See et al., 2017), three tasks, as well as soft-sharing mech- but we still need to teach these models about how anisms (and show performance ablations to better detect salient information from the in- and analysis examples of each contribu- put document, as well as about better logically- tion). Overall, we achieve statistically sig- directed natural language inference skills. nificant improvements over the state-of- In this work, we improve abstractive text sum- the-art on both the CNN/DailyMail and marization via soft, high-level (semantic) layer- Gigaword datasets, as well as on the DUC- specific multi-task learning with two relevant aux- 2002 transfer setup. We also present sev- iliary tasks. The first is that of document-to- eral quantitative and qualitative analysis question generation, which teaches the summa- arXiv:1805.11004v1 [cs.CL] 28 May 2018 studies of our model’s learned saliency rization model about what are the right questions and entailment skills. to ask, which in turn is directly related to what the 1 Introduction salient information in the input document is. The second auxiliary task is a premise-to-entailment Abstractive summarization is the challenging generation task to teach it how to rewrite a sum- NLG task of compressing and rewriting a docu- mary which is a directed-logical subset of (i.e., ment into a short, relevant, salient, and coherent logically follows from) the input document, and summary. It has numerous applications such as contains no contradictory or unrelated informa- summarizing storylines, event understanding, etc. tion. For the question generation task, we use the As compared to extractive or compressive sum- SQuAD dataset (Rajpurkar et al., 2016), where marization (Jing and McKeown, 2000; Knight and we learn to generate a question given a sentence ∗ Equal contribution (published at ACL 2018). containing the answer, similar to the recent work by Du et al.(2017). Our entailment generation in hierarchical, distractive, saliency, and graph- task is based on the recent SNLI classification attention modeling (Rush et al., 2015; Chopra dataset and task (Bowman et al., 2015), converted et al., 2016; Nallapati et al., 2016; Chen et al., to a generation task (Pasunuru and Bansal, 2017). 2016; Tan et al., 2017). Paulus et al.(2018) Further, we also present novel multi-task learn- and Henß et al.(2015) incorporated recent ad- ing architectures based on multi-layered encoder vances from reinforcement learning. Also, See and decoder models, where we empirically show et al.(2017) further improved results via pointer- that it is substantially better to share the higher- copy mechanism and addressed the redundancy level semantic layers between the three afore- with coverage mechanism. mentioned tasks, while keeping the lower-level Multi-task learning (MTL) is a useful paradigm (lexico-syntactic) layers unshared. We also ex- to improve the generalization performance of a plore different ways to optimize the shared pa- task with related tasks while sharing some com- rameters and show that ‘soft’ parameter sharing mon parameters/representations (Caruana, 1998; achieves higher performance than hard sharing. Argyriou et al., 2007; Kumar and Daume´ III, Empirically, our soft, layer-specific sharing 2012). Several recent works have adopted MTL model with the question and entailment genera- in neural models (Luong et al., 2016; Misra tion auxiliary tasks achieves statistically signif- et al., 2016; Hashimoto et al., 2017; Pasunuru and icant improvements over the state-of-the-art on Bansal, 2017; Ruder et al., 2017; Kaiser et al., both the CNN/DailyMail and Gigaword datasets. 2017). Moreover, some of the above works have It also performs significantly better on the DUC- investigated the use of shared vs unshared sets of 2002 transfer setup, demonstrating its strong gen- parameters. On the other hand, we investigate the eralizability as well as the importance of auxiliary importance of soft parameter sharing and high- knowledge in low-resource scenarios. We also re- level versus low-level layer-specific sharing. port improvements on our auxiliary question and Our previous workshop paper (Pasunuru et al., entailment generation tasks over their respective 2017) presented some preliminary results for previous state-of-the-art. Moreover, we signif- multi-task learning of textual summarization with icantly decrease the training time of the multi- entailment generation. This current paper has task models by initializing the individual tasks several major differences: (1) We present ques- from their pretrained baseline models. Finally, we tion generation as an additional effective auxil- present human evaluation studies as well as de- iary task to enhance the important complemen- tailed quantitative and qualitative analysis studies tary aspect of saliency detection; (2) Our new of the improved saliency detection and logical in- high-level layer-specific sharing approach is sig- ference skills learned by our multi-task model. nificantly better than alternative layer-sharing ap- proaches (including the decoder-only sharing by 2 Related Work Pasunuru et al.(2017)); (3) Our new soft shar- ing parameter approach gives stat. significant Automatic text summarization has been progres- improvements over hard sharing; (4) We pro- sively improving over the time, initially more fo- pose a useful idea of starting multi-task mod- cused on extractive and compressive models (Jing els from their pretrained baselines, which sig- and McKeown, 2000; Knight and Marcu, 2002; nificantly speeds up our experiment cycle1; (5) Clarke and Lapata, 2008; Filippova et al., 2015; For evaluation, we show diverse improvements Kedzie et al., 2015), and moving more towards of our soft, layer-specific MTL model (over compressive and abstractive summarization based state-of-the-art pointer+coverage baselines) on the on graphs and concept maps (Giannakopoulos, CNN/DailyMail, Gigaword, as well as DUC 2009; Ganesan et al., 2010; Falke and Gurevych, datasets; we also report human evaluation plus 2017) and discourse trees (Gerani et al., 2014), analysis examples of learned saliency and entail- syntactic parse trees (Cheung and Penn, 2014; ment skills; we also report improvements on the Wang et al., 2013), and Abstract Meaning Repre- auxiliary question and entailment generation tasks sentations (AMR) (Liu et al., 2015; Dohare and over their respective previous state-of-the-art. Karnick, 2017). Recent work has also adopted 1About 4-5 days for Pasunuru et al.(2017) approach vs. machine translation inspired neural seq2seq mod- only 10 hours for us. This will allow the community to try els for abstractive summarization with advances many more multi-task training and tuning ideas faster. In our work, we use a question generation task task like summarization. Our pointer mechanism to improve the saliency of abstractive summariza- approach is similar to See et al.(2017), who use tion in a multi-task setting. Using the SQuAD a soft switch based on the generation probability dataset (Rajpurkar et al., 2016), we learn to gen- pg = σ(Wgct+Ugst+Vgewt−1 +bg), where σ(·) is erate a question given the sentence containing the a sigmoid function, Wg, Ug, Vg and bg are param- answer span in the comprehension (similar to Du eters learned during training. ewt−1 is the previous et al.(2017)). For the second auxiliary task of en- time step output word embedding. The final word tailment generation, we use the generation version distribution is Pf (y) = pg ·Pv(y)+(1−pg)·Pc(y), of the RTE classification task (Dagan et al., 2006; where Pv vocabulary distribution is as shown in Lai and Hockenmaier, 2014; Jimenez et al., 2014; Eq.1, and copy distribution Pc is based on the at- Bowman et al., 2015). Some previous work has tention distribution over source document words. explored the use of RTE for redundancy detec- tion in summarization by modeling graph-based Coverage Mechanism Following previous relationships between sentences to select the most work (See et al., 2017), coverage helps alleviate non-redundant sentences (Mehdad et al., 2013; the issue of word repetition while generating Gupta et al., 2014), whereas our approach is based long summaries. We maintain a coverage vector Pt−1 on multi-task learning.