Query Focused Abstractive Summarization via Incorporating Query Relevance and Transfer Learning with Transformer Models

Md Tahmid Rahman Laskar1,3, Enamul Hoque2, Jimmy Huang2,3

1 Department of Electrical Engineering and Computer Science, 2 School of Information Technology, 3 Information Retrieval & Knowledge Management Research Lab, York University, Toronto, Canada

Introduction

2 Query Focused Abstractive Text Summarization

• Problem Statement: Given a set of documents along with a query, the goal is to generate an abstractive summary from the document(s) based on the given query.
• Abstractive summaries can contain novel words that do not appear in the source document.

Document: Even if reality shows were not enlightening, they generate massive revenues that can be used for funding more sophisticated programs. Take BBC for example, it offers entertaining reality shows such as Total Wipeout as well as brilliant documentaries.
Query: What is the benefit of reality shows?
Summary: Reality show generates revenues.

3 Motivation

• Challenges:
  • Lack of datasets.
    • Available datasets: Debatepedia, DUC.
  • Size of the available datasets is very small.
    • e.g. Debatepedia (only around 10,000 training instances).
  • Few-shot Learning Problem.
    • Training a neural model end-to-end with small training data is challenging.

• Solution: We introduce a transfer learning technique that utilizes the Transformer architecture [Vaswani et al., 2017]:
  • First, we pre-train a Transformer-based model on a large generic abstractive summarization dataset.
  • Then, we fine-tune the pre-trained model on the target query focused abstractive summarization dataset.

4 Contributions

• Our proposed approach:

• is the first work on the Query Focused Abstractive Summarization task to utilize transfer learning with the Transformer architecture.

• sets a new state-of-the-art result on the Debatepedia dataset.

• does not require any in-domain data augmentation for Few-shot Learning.

• The source code of our proposed model is also made publicly available: https://github.com/tahmedge/QR-BERTSUM-TL-for-QFAS

5 Literature Review

6 Related Work

• Generic Abstractive Summarization
  • Pointer Generator Network (PGN) [See et al., 2017].

• A sequence-to-sequence model based on Recurrent Neural Networks (RNNs).

• Uses a copy mechanism to handle out-of-vocabulary words and a coverage mechanism to avoid repeating the same words in the generated summaries.

• BERT for SUMmarization (BERTSUM) [Liu and Lapata, 2019].

• Uses BERT [Devlin et al., 2019] as the encoder and the Transformer decoder as the decoder.

• Outperforms PGN for abstractive text summarization on several datasets.

• Limitations:

• Cannot incorporate query relevance.

7 Related Work (cont’d)

• Query Focused Abstractive Summarization (QFAS)

• Diversity Driven Attention (DDA) Model [Nema et al., 2017].

• A neural encoder-decoder model based on RNNs.

• Introduced a new dataset for the QFAS task from Debatepedia.

• Limitations:

• Only performs well when the Debatepedia dataset is augmented.

8 Related Work (cont’d)

• Query Focused Abstractive Summarization (QFAS)

• Relevance Sensitive Attention for Query Focused Summarization (RSA-QFS) [Baumel et al., 2018].

• First, pre-trained the PGN model on a generic abstractive summarization dataset.

• Then, incorporated query relevance into the pre-trained model to predict query focused summaries in the target datasets.

• Limitations:

• Did not fine-tune their model on the QFAS datasets.

• Obtained a very low Precision score in the Debatepedia dataset.

9 Methodology

10 Proposed Approach

• Our proposed model works in two steps, utilizing transfer learning:

Step 1: Pre-train the BERTSUM model on a generic abstractive summarization corpus (e.g. XSUM).

Transfer Learning

Step 2: Incorporate query relevance into the pre-trained model and fine-tune it for the QFAS task in the target domain (i.e. Debatepedia).
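To make the two-step recipe concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' BERTSUM implementation): a tiny stand-in encoder-decoder and random token tensors replace BERTSUM, XSUM, and Debatepedia, and only the order of the two training calls mirrors the proposed approach.

```python
# Hypothetical sketch of the two-step transfer-learning recipe (not the authors' code).
import torch
import torch.nn as nn

class TinySummarizer(nn.Module):
    """Stand-in encoder-decoder; in the paper, BERTSUM (BERT encoder + Transformer decoder) plays this role."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        hidden = self.seq2seq(self.embed(src_ids), self.embed(tgt_ids))
        return self.out(hidden)

def run_epochs(model, batches, epochs=1, lr=1e-3):
    """Teacher-forced cross-entropy training over (source_ids, summary_ids) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for src, tgt in batches:
            logits = model(src, tgt[:, :-1])                     # predict the next summary token
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),  # (batch * len, vocab)
                           tgt[:, 1:].reshape(-1))               # shifted gold summary tokens
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

model = TinySummarizer()
# Random tensors stand in for tokenized (document, summary) pairs.
generic_batches = [(torch.randint(1, 1000, (2, 20)), torch.randint(1, 1000, (2, 8)))]  # e.g. XSUM
target_batches  = [(torch.randint(1, 1000, (2, 20)), torch.randint(1, 1000, (2, 8)))]  # e.g. Debatepedia
run_epochs(model, generic_batches)  # Step 1: pre-train on the generic corpus
run_epochs(model, target_batches)   # Step 2: fine-tune the same weights on the QFAS data
```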

• We choose the XSUM dataset for pre-training since the summaries in this dataset are more abstractive compared to those in other datasets [Liu and Lapata, 2019].
• To incorporate the query relevance in BERTSUM, we concatenate the query with the document as the input to the encoder [Lewis et al., 2019].

11 Proposed Approach (cont’d)

(Figure) Overview of our approach. In both steps, the model consists of a BERT encoder and a Transformer decoder; transfer learning connects the two steps.

(a) Pre-train the BERTSUM model on a large generic abstractive summarization dataset.
  • Input: Document {Sent1, Sent2, ..., SentN}, encoded as [CLS] Sent1 [SEP] [CLS] Sent2 [SEP] ... [CLS] SentN [SEP]
  • Document: The argument that too evil can be prevented by assassination is highly questionable. The figurehead of an evil government is not necessarily the lynchpin that hold it together. Therefore, if Hitler had been assassinated, it is pure supposition that the Nazi would have acted any differently to how they did act.
  • Summary: The idea that assassinations can prevent injustice is questionable.

(b) Incorporate query relevance into the pre-trained BERTSUM model and fine-tune it for the QFAS task on the target domain.
  • Input: Query {SentQ}, Document {Sent1, ..., SentN}, encoded as [CLS] SentQ [SEP] [CLS] Sent1 [SEP] ... [CLS] SentN [SEP]
  • Query: What is the benefit of reality shows?
  • Document: Even if reality shows were not enlightening, they generate massive revenues that can be used for funding more sophisticated programs. Take BBC for example, it offers entertaining reality shows such as Total Wipeout as well as brilliant documentaries.
  • Summary: Reality show generates revenues.
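The sketch below illustrates, with the Hugging Face BERT tokenizer, how the encoder inputs in panels (a) and (b) can be built. It is a hedged reconstruction of the format shown in the figure, not the authors' preprocessing code; details such as truncation and segment embeddings are omitted.

```python
# Hedged sketch of the encoder input format above: every sentence is wrapped in
# [CLS] ... [SEP], and for the QFAS step the query sentence is prepended to the document.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def bertsum_input(sentences):
    """Return token ids for '[CLS] sent1 [SEP] [CLS] sent2 [SEP] ...'."""
    tokens = []
    for sentence in sentences:
        tokens += ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    return tokenizer.convert_tokens_to_ids(tokens)

document = [
    "Even if reality shows were not enlightening, they generate massive revenues "
    "that can be used for funding more sophisticated programs.",
    "Take BBC for example, it offers entertaining reality shows such as Total "
    "Wipeout as well as brilliant documentaries.",
]
query = "What is the benefit of reality shows?"

generic_input = bertsum_input(document)          # panel (a): document only
qfas_input = bertsum_input([query] + document)   # panel (b): query prepended to the document
print(len(generic_input), len(qfas_input))
```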

12 Datasets

13 Debatepedia Dataset

Original Version:
• The Debatepedia dataset is created from the Debatepedia1 website.
• Previous work on this dataset for the QFAS task used 10-fold cross-validation.

Debatepedia (Original Dataset): Average number of instances in each fold
  Train: 10,859 | Dev: 1,357 | Test: 1,357
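The per-fold sizes above suggest a 10-way split where, in each fold, eight parts are used for training and one part each for validation and testing. The snippet below is one plausible way to realize such a split; it is an assumption for illustration, and the exact split script used in prior work may differ.

```python
# Illustrative 10-fold split consistent with the table above (8 parts train,
# 1 part dev, 1 part test per fold); this is an assumption, not the official split.
import numpy as np

num_instances = 13573                      # ~10,859 + 1,357 + 1,357
indices = np.random.default_rng(0).permutation(num_instances)
parts = np.array_split(indices, 10)

for fold in range(10):
    test_idx = parts[fold]
    dev_idx = parts[(fold + 1) % 10]
    train_idx = np.concatenate([parts[j] for j in range(10) if j not in (fold, (fold + 1) % 10)])
    print(fold, len(train_idx), len(dev_idx), len(test_idx))   # approx. 10859, 1357, 1357
```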

1 http://www.debatepedia.org/en/index.php/Welcome_to_Debatepedia%21

14 Debatepedia Dataset (cont’d)

Augmented Version:
• We find in the official source code of the DDA model that the dataset was augmented by creating more instances in the training set.
• In the augmented dataset:
  • The average number of training instances in each fold was 95,843.
  • However, the validation and test data were the same as in the original dataset.

Debatepedia (Augmented Dataset): Average number of instances in each fold
  Train: 95,843 | Dev: 1,357 | Test: 1,357

15 Data Augmentation Approach: Debatepedia Dataset

• We describe the data augmentation approach based on the source code2 of DDA.

• We find that for each training instance, 8 new training instances were created.

• First, a pre-defined vocabulary of 24,822 words with their synonyms was created.

i. Then, each new training instance was created by randomly replacing:

• M (1 ≤ M ≤ 3) words in each query.

• N (10 ≤ N ≤ 17) words in each document.

ii. Each word was replaced with its synonym found in the pre-defined vocabulary.

iii. When a word was not found in the pre-defined vocabulary, the GloVe vocabulary was used.

iv. Steps i, ii, and iii were repeated 8 times to create 8 new training instances.
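Putting the steps above together, here is a minimal sketch of the augmentation procedure, reconstructed from this description rather than taken from the DDA source. The small SYNONYMS dictionary stands in for the 24,822-word pre-defined vocabulary, and the GloVe fallback is omitted.

```python
# Hedged reconstruction of the DDA augmentation: each (query, document) pair
# yields 8 extra pairs by randomly swapping a few words for synonyms.
import random

SYNONYMS = {  # stand-in for the 24,822-word pre-defined vocabulary
    "benefit": "advantage", "massive": "huge", "revenues": "income",
    "generate": "produce", "sophisticated": "advanced", "funding": "financing",
}

def replace_words(text, how_many, rng):
    """Randomly replace up to `how_many` words that have an entry in SYNONYMS."""
    words = text.split()
    replaceable = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(replaceable, min(how_many, len(replaceable))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)

def augment(query, document, copies=8, seed=0):
    rng = random.Random(seed)
    augmented = []
    for _ in range(copies):
        new_query = replace_words(query, rng.randint(1, 3), rng)          # 1 <= M <= 3
        new_document = replace_words(document, rng.randint(10, 17), rng)  # 10 <= N <= 17
        augmented.append((new_query, new_document))
    return augmented

query = "What is the benefit of reality shows?"
document = ("Even if reality shows were not enlightening, they generate massive revenues "
            "that can be used for funding more sophisticated programs.")
for q, d in augment(query, document):
    print(q, "|", d)
```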

2 Source Code of DDA: https://git.io/JeBZX

16 Experimental Details

17 Experimental Setup

• Dataset:
  • We used the original version of the Debatepedia dataset to evaluate our proposed model.
• Evaluation Metrics:
  • ROUGE scores with Precision, Recall, and F1 in terms of ROUGE-1, ROUGE-2, and ROUGE-L.
• Baselines:

  Baseline Model           | Description
  QR-BERTSUM               | BERTSUM model that only incorporates query relevance (no transfer learning).
  BERTSUMXSUM              | BERTSUM model pre-trained on the XSUM dataset without any fine-tuning.
  RSA-QFS                  | Result of the RSA-QFS model reported in [Baumel et al., 2018].
  DDA                      | Result of the DDA model reported in [Nema et al., 2017].
  DDA (Original dataset)   | We run the DDA model on the original version of Debatepedia.
  DDA (Augmented dataset)  | We run the DDA model on the augmented version of Debatepedia.
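As a pointer to how these metrics can be computed, the snippet below uses Google's rouge-score package (pip install rouge-score). This may differ from the exact ROUGE toolkit used in the paper, but the R/P/F fields correspond to the reported scores.

```python
# Minimal sketch of ROUGE-1/2/L evaluation with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Reality show generates revenues."
generated = "Reality shows generate massive revenues."

for metric, score in scorer.score(reference, generated).items():
    print(f"{metric}: R={score.recall:.4f} P={score.precision:.4f} F={score.fmeasure:.4f}")
```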

18 Results and Analyses

19 Results

Here, ‘Recall’, ‘Precision’, and ‘F1’ are denoted by ‘R’, ‘P’, and ‘F’ respectively. ‘*’ denotes our implementation of the DDA model.

Models                     | ROUGE-1 (R / P / F)    | ROUGE-2 (R / P / F)    | ROUGE-L (R / P / F)
QR-BERTSUM                 | 22.31 / 35.68 / 26.42  | 9.94 / 16.73 / 11.90   | 21.22 / 33.85 / 25.09
BERTSUMXSUM                | 17.36 / 11.48 / 13.32  | 3.03 / 2.47 / 2.75     | 14.96 / 9.88 / 11.46
RSA-QFS [Baumel et al.]    | 53.09 / - / -          | 16.10 / - / -          | 46.18 / - / -
DDA [Nema et al.]          | 41.26 / - / -          | 18.75 / - / -          | 40.43 / - / -
DDA* (Original Dataset)    | 7.52 / 7.67 / 7.35     | 2.83 / 2.88 / 2.84     | 7.13 / 7.54 / 7.24
DDA* (Augmented Dataset)   | 37.80 / 47.38 / 40.49  | 27.55 / 33.74 / 29.37  | 37.27 / 46.68 / 39.90
Our Model: QR-BERTSUM-TL   | 57.96 / 60.44 / 58.50  | 45.20 / 46.11 / 45.47  | 57.05 / 59.33 / 57.73

• An improvement of 9.17% and 23.54% in terms of ROUGE-1 and ROUGE-L, respectively, over RSA-QFS + PGN.
• A huge gain in terms of ROUGE-2 compared to the previous models, with an improvement of 141.67% over DDA and 180.75% over RSA-QFS + PGN.
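These percentages are relative gains, (new - old) / old, computed over the R column of the table above; a quick worked check:

```python
# Relative improvement of QR-BERTSUM-TL over prior models, using the R column above.
def relative_improvement(new, old):
    return 100 * (new - old) / old

print(f"ROUGE-1 vs RSA-QFS: {relative_improvement(57.96, 53.09):.2f}%")  # ~9.17%
print(f"ROUGE-L vs RSA-QFS: {relative_improvement(57.05, 46.18):.2f}%")  # ~23.54%
print(f"ROUGE-2 vs RSA-QFS: {relative_improvement(45.20, 16.10):.2f}%")  # ~180.75%
```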

20 Discussions

• On the original version of the Debatepedia dataset:
  • The Transformer-based QR-BERTSUM outperforms the RNN-based DDA model.
    • This suggests the effectiveness of using the Transformer instead of an RNN.
  • We find that data augmentation significantly improves the performance of DDA.
• Our proposed model significantly outperformed the baselines:
  • The QR-BERTSUM model (which did not leverage transfer learning).
  • The BERTSUMXSUM model (which did not utilize fine-tuning).
• Our proposed model sets a new state-of-the-art result without any in-domain data augmentation.

21 Conclusions and Future Work

• There is a lack of datasets for QFAS, and the available datasets are small in size.
• To address this problem, we presented a transfer learning technique with the BERTSUM model for QFAS.
• Our approach achieves a new state-of-the-art result on the Debatepedia dataset.
• In the future, we will investigate the performance of our proposed approach on more datasets (e.g. DUC).

22 Acknowledgements

This research is supported by the Natural Sciences & Engineering Research Council (NSERC) of Canada and an ORF-RE (Ontario Research Fund - Research Excellence) award in BRAIN Alliance. We also thank Compute Canada for providing us with computing resources.

23 Questions?

24 References

1. Baumel, T., et al.: Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models. arXiv preprint arXiv:1801.07704 (2018)
2. Devlin, J., et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proc. of NAACL-HLT. pp. 4171-4186 (2019)
3. Lewis, M., et al.: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461 (2019)
4. Liu, Y., Lapata, M.: Text Summarization with Pretrained Encoders. In: Proc. of EMNLP-IJCNLP. pp. 3721-3731 (2019)
5. Nema, P., et al.: Diversity Driven Attention Model for Query-Based Abstractive Summarization. In: Proc. of ACL. pp. 1063-1072 (2017)
6. See, A., et al.: Get To The Point: Summarization with Pointer-Generator Networks. In: Proc. of ACL. pp. 1073-1083 (2017)
7. Vaswani, A., et al.: Attention Is All You Need. In: Proc. of NIPS. pp. 5998-6008 (2017)
