Arxiv:2010.12836V2 [Cs.CL] 11 Apr 2021

Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation Alexander R. Fabbriy Simeng Hanyy Haoyuan Lizz Haoran Li z Marjan Ghazvininejadz Shafiq Jotyyy Dragomir Radevy Yashar Mehdadz y Yale University z Facebook AI yy Nanyang Technological University zz Renmin University of China {alexander.fabbri,dragomir.radev}@yale.edu {aimeeli,ghazvini,mehdad}@fb.com Abstract corpus (Sandhaus, 2008) as well as by the introduction of large pretrained models such as BART Models pretrained with self-supervised objec- tives on large text corpora achieve state-of-the- (Lewis et al., 2020) and Pegasus (Zhang et al., art performance on English text summariza- 2019), in some cases resulting in summaries which tion tasks. However, these models are typi- are even favored over the human-written reference cally fine-tuned on hundreds of thousands of summaries. Creating data for every new domain, data points, an infeasible requirement when however, is infeasible and highly costly. Thus, the applying summarization to new, niche do- ability to transfer large pretrained models to new mains. In this work, we introduce a novel domains with little or no in-domain data is neces- and generalizable method, called WikiTrans- fer, for fine-tuning pretrained models for sum- sary, especially as such models make their way into marization in an unsupervised, dataset-specific real-world applications. manner. WikiTransfer fine-tunes pretrained Unsupervised summarization approaches in- models on pseudo-summaries, produced from clude autoencoders to mirror the information com- generic Wikipedia data, which contain charac- pression inherent in summarization (Baziotis et al., teristics of the target dataset, such as the length 2019; Chu and Liu, 2019; Bražinskas et al., 2020b) and level of abstraction of the desired summaries. WikiTransfer models achieve state-of- as well as large-scale pretraining for domain- the-art, zero-shot abstractive summarization specific adaptation (Yang et al., 2020). However, performance on the CNN-DailyMail dataset little work has focused on domain adaptation in and demonstrate the effectiveness of our ap- summarization. Wang et al.(2019) examine do- proach on three additional diverse datasets. main adaptation for extractive summarization. Hua These models are more robust to noisy data and Wang(2017) showed that summarization mod- and also achieve better or comparable few-shot els have difficulty generating text in the style of performance using 10 and 100 training exam- the target domain, while more recently, Zhang ples when compared to few-shot transfer from other summarization datasets. To further boost et al.(2019) report strong performance of pre- performance, we employ data augmentation trained models when trained in few-shot settings via round-trip translation as well as introduce and (Bražinskas et al., 2020a) fine-tune dataset- a regularization term for improved few-shot specific components of a model for few-shot learn- transfer. To understand the role of dataset as- ing. We aim to build on recent work in pretrained pects in transfer performance and the quality models and improve zero-shot and few-shot sum- arXiv:2010.12836v2 [cs.CL] 11 Apr 2021 of the resulting output summaries, we further marization by encoding characteristics of the target study the effect of the components of our unsupervised fine-tuning data and analyze few-shot summarization dataset in unsupervised, intermedi- performance using both automatic and human ate fine-tuning data. evaluation. Summarization can be seen as a function of sub- functions of the input, called subaspects, which 1 Introduction determine the output form. Jung et al.(2019) de- Automatic text summarization aims to distill the fine three subaspects for summarization: position, most salient content of a given text in a compact importance, and diversity, and study how these form. Recent advances in summarization have been subaspects manifest themselves in summarization driven by the availability of large-scale datasets corpora and model outputs. For example, a com- such as the CNN-DailyMail (CNNDM) corpus mon subaspect for the CNNDM dataset is position; (Nallapati et al., 2016) and the New York Times earlier sentences tend to constitute a good summary. Inspired by this view of summarization as in style in generating the summary. subaspects, we aim to encode subaspects of a target Bražinskas et al.(2020a) introduce plug-in net- dataset into unlabeled data to allow a model fine- works, small finetune-able layers that aim to repro- tuned on this data to learn characteristics of the duce characteristics of the target dataset as seen in target dataset to improve zero-shot and few-shot a small set of labeled examples. In contrast, we aim transfer of the model. In our work, we focus on the to encode the characteristics of our target dataset, subaspects of extractive diversity, as determined such as level of extraction and compression, a priori by how well an extractive model performs on the in the intermediate training phase. In other work, data, compression ratio between the source docu- Lebanoff et al.(2018) adapt a single-document ment and summary, and, in the case of CNNDM, summarization model to multi-document settings, the lead bias. We assume knowledge of the target while Zhu et al.(2019) use Wikipedia reference dataset such as the size of input documents, the data for downstream query-based summarization size of the desired summaries, and the extent to Several approaches for unsupervised summariza- which the summary is abstractive, all of that can tion have made use of variational autoencoders be treated as prior knowledge if the task is to be (Baziotis et al., 2019; Chu and Liu, 2019; Bražin- well-defined (Kryscinski et al., 2020). We encode skas et al., 2020b). Zhou and Rush(2019) makes this knowledge into Wikipedia article data by ex- use of pretrained language models for unsupervised tracting summaries of the desired output length and text summarization by aligning the coverage of the filtering examples based on the desired level of generated summary to the source document. Laban abstraction. et al.(2020) train an unsupervised summarization Our contributions are the following: 1) We intro- model with reinforcement learning rewards. In duce a novel method, called WikiTransfer, which another line of work, extractive models such as creates pseudo-summaries with subaspects of the TextRank, (Mihalcea and Tarau, 2004), LexRank target dataset which can be used as unlabeled data (Erkan and Radev, 2004), and more recently Pac- for intermediate fine-tuning. We show that this Sum (Zheng and Lapata, 2019), make use of graph method improves zero-shot domain transfer over centrality for modeling salience. transfer from other domains, achieving state-of-the- art unsupervised abstractive summarization perfor- The power of pretrained models for few-shot mance on the CNNDM dataset while generalizing transfer was shown for abstractive summarization to other domains, and we perform extensive hyper- in Zhang et al.(2019) and extractive summariza- parameter studies on the factors influencing zero- tion in Desai et al.(2020). Our work focuses on shot performance 2) We demonstrate the benefits of the zero-shot abstractive summarization setting and WikiTransfer in few-shot settings, and show addi- the transferability of models fine-tuned on task- tional improvements when applying WikiTransfer specific data from a generic corpus, rather than just with data augmentation and a regularization term the transferability of a single pretrained model. The for training with potentially noisy augmented data. closest work to ours for zero-shot transfer is Yang We show robustness in these settings and analyze et al.(2020), which uses the lead-bias in news to differences in performance in both automatic and pretrain an unsupervised model on a large dataset human assessments. of news articles. Our approach, however, focuses on fine-tuning an already-pretrained model specifi- 2 Related Work cally for summarization on a downstream dataset by leveraging a generic text corpus (Wikipedia) While advances have been made in neural tech- to create auxiliary fine-tuning data that transfers niques for summarization due in part to large across domains, allowing for more fine-grained datasets, less work has focused on domain adap- control over the transfer process. We show the tation of such methods in the zero and few-shot generalizability of such fine-tuning across domains. settings. Wang et al.(2019) examine domain adap- BART (Lewis et al., 2020) is a pretrained denoising tation, but in extractive summarization. Hua and autoencoder and achieved state-of-the-art perfor- Wang(2017) examine domain adaptation between mance when fine-tuned on summarization tasks at opinion and news summarization, observing that the time. In this work, we use BART as our base models trained on one domain and applied to an- pretrained model but in future work will experi- other domain can capture relevant content but differ ment with other pretrained models. 3 Methods Data Augmentation via Round-Trip Transla- tion: In addition to fine-tuning on WikiTransfer WikiTransfer Intermediate Fine-tuning: We data for zero-shot domain transfer, we test the abil- propose a method for fine-tuning pretrained mod- ity of our model to transfer when we have few els using unsupervised Wikipedia data. We cre- examples and whether data augmentation further ate dataset-specific unsupervised data for this in- improves these results. In few-shot fine-tuning, we termediate fine-tuning, by making use of charac- conduct data augmentation to reduce brute-force teristics of the target dataset such as the average memorization and introduce a regularization ef- length of input documents, the average summary fect. Specifically, we perform round-trip translation length, and the general bin of whether the sum- (Yu et al., 2018) to generate paraphrases of both maries desired are very abstractive or very extrac- the source documents and summaries, as previous tive, as discussed above.

Load more