On the Summarization and Evaluation of Long Documents
IMPERIAL COLLEGE LONDON
DEPARTMENT OF COMPUTING

On the Summarization and Evaluation of Long Documents

Author: Alexander Gaskell
Internal Supervisors: Dr. Pedro Baiz, Prof. Lucia Specia
External Supervisors: Hugo Barbaroux, Dr. Eric Topham

Submitted in partial fulfillment of the requirements for the MSc degree in Artificial Intelligence of Imperial College London

Date: September 3, 2020

Abstract

Text summarization has been a key language generation task within Natural Language Processing (NLP) for over 60 years. The field has advanced tremendously during the past two years, benefiting from the proliferation of neural networks across NLP and, in particular, the Transformer. These recent advances have focused primarily on short documents; long documents are the focus of this thesis. We investigate two promising areas: 1) identifying a superior automatic evaluation metric, and 2) creating and experimenting with novel Transformer-based long document summarization models. There are several key findings in this thesis. Regarding the evaluation metrics, we test eight metrics on five experiments spanning three datasets. We develop a novel evaluation metric, BARTScore, and find that it correlates twice as well with human judgement as the prevailing ROUGE metrics and was often the strongest performer of all the metrics we considered. We establish a set of four evaluation metrics, BARTScore, BERTScore, Mover-1 and Mover-2, each of which consistently outperforms the ROUGE metrics. Regarding the novel architectures, we experiment with the Longformer Encoder Decoder (LED), a Transformer model designed for long documents. We demonstrate state of the art performance on one dataset, beating the incumbent by over one ROUGE point. We also show that approximate self-attention layers perform comparably to dense self-attention layers for summarization while being much more memory-efficient. These efficiencies allow us to fine-tune summarization models using 2.5x longer sequences on standard hardware. The accompanying code can be found at https://github.com/alexgaskell10/nlp_summarization.

Acknowledgements

I am grateful for the valuable help and insight I have received from a number of people affiliated with this project. I would first like to thank my supervisor, Dr. Pedro Baiz, for his guidance and insightful suggestions throughout. These have been invaluable for steering the direction of the project and helping me to always see the bigger picture. I would also like to thank The Data Analysis Bureau (T-DAB), and in particular Dr. Eric Topham and Hugo Barbaroux, for your contributions to this project. You have each provided several important ideas which have helped shape the project into what it currently is. Equally importantly, our frequent catch-ups helped maintain my sanity throughout the long Covid-19 dominated months without any access to the Imperial campus or regular human contact in general.

Table of Contents

1 Introduction
  1.1 Ethical, Legal and Environmental Considerations
2 Background
  2.1 Overview of Text Summarization
  2.2 A Brief History of Text Summarization
  2.3 Building Blocks
    2.3.1 Sequence to Sequence
    2.3.2 Recurrent Neural Networks
    2.3.3 Attention
    2.3.4 Transformers
  2.4 Evaluation
    2.4.1 ROUGE
    2.4.2 Human Evaluation
    2.4.3 Model-Based Metrics
    2.4.4 BLEURT
    2.4.5 MoverScore
  2.5 Leading Approaches to Text Summarization
    2.5.1 TED Architectures
    2.5.2 Memory-Efficient Transformers
    2.5.3 Using Recurrent Neural Networks
    2.5.4 Beyond Maximum Likelihood
    2.5.5 Unsupervised Learning
3 Datasets
  3.1 Summarization Datasets
  3.2 Evaluation Datasets
  3.3 Discussion
4 Design
  4.1 Evaluation Analysis
    4.1.1 Description of Metrics
  4.2 Architecture Analysis
    4.2.1 BART
    4.2.2 LED
    4.2.3 RED
5 Experimental Methodology
  5.1 Evaluation Analysis
    5.1.1 Preprocessing
    5.1.2 Human-Metric Evaluation Correlations
    5.1.3 Quora Question Pairs
    5.1.4 Adversarial Analysis
    5.1.5 Benchmarking of Existing Studies
  5.2 Architecture Analysis
    5.2.1 Preprocessing
    5.2.2 Hyper-parameter Search
    5.2.3 LED vs BART
    5.2.4 LED Performance
    5.2.5 Random Starts Analysis
    5.2.6 RED vs LED
6 Results
  6.1 Evaluation Experiments
    6.1.1 Metric Configurations
    6.1.2 CNN/DailyMail Correlation Experiments
    6.1.3 Quora Question Pairs Analysis
    6.1.4 Adversarial Analysis
    6.1.5 Benchmarking of Leading Architectures
  6.2 Architecture Experiments
    6.2.1 LED vs BART
    6.2.2 LED Performance
    6.2.3 LED Profiling
    6.2.4 Random Starts Analysis
    6.2.5 RED vs LED
7 Analysis
  7.1 Evaluation Analysis
    7.1.1 Choosing Between the Metrics
    7.1.2 Metrics Profiling
    7.1.3 Limitations
  7.2 Architecture Analysis
    7.2.1 Qualitative Analysis
    7.2.2 Random Starts Discussion
    7.2.3 Deep Dive Into Model Performance by Document Length
8 Concluding Remarks
  8.1 Contributions
  8.2 Future Work
  8.3 Conclusion
Appendices
A CNN/DailyMail Summaries
B Dataset Samples with Summaries
C Correlation Metric Definitions
D Model Configurations
E Supporting Results
F Ethics Checklist

1 Introduction

This thesis seeks to contribute to the existing literature on automatic text summarization. This field has recently undergone rapid and substantial progress, driven by the success of sequence to sequence (seq2seq) modelling in NLP. However, research has focused primarily on developing architectures suitable for short, single documents and has neglected long and multi-document summarization. The ultimate goal for this field is to develop a framework which can produce high quality summaries independent of source document length, whether the task involves a single document or multiple documents, and agnostic to domain and to whether the lexicon is technical or colloquial.

Given the impressive recent progress seen in short document summarization, the next frontier is to replicate these results using long documents. The focus of this thesis is therefore on advancing the literature in long document summarization. We identify the evaluation of summaries and architectures for long document summarization as two particularly important areas, and these are the central themes of this thesis.

Regarding the evaluation metrics, we rigorously test eight evaluation metrics for their correlation with human judgement when scoring summaries, their ability to detect semantic equivalence and their sensitivity to artificial corruption. It will be seen that a set of four novel, model-based evaluation metrics considerably outperforms the prevailing evaluation metrics. It will be argued that these are well-placed to become the new primary evaluation metrics for text summarization research. Considering the architectures angle, we experiment using the LED, a Transformer Encoder Decoder (TED) well-suited to long document summarization. By facilitating the summarization of longer documents, we will demonstrate that the LED performs near or at state of the art levels on long document summarization tasks.

The key findings of this thesis are as follows:

• We achieve state of the art text summarization performance on the arXiv dataset, beating the incumbent, PEGASUS, by over one ROUGE point. This is in spite of our modest computational resources.
• We develop a novel evaluation metric, BARTScore. This correlates approximately 2x better with human judgement than ROUGE and was often the best performer of all the metrics we tested.
• We establish a superior set of model-based evaluation metrics, consisting of BARTScore, BERTScore, Mover-1 and Mover-2, all shown to outperform ROUGE on five tasks spanning three datasets.
• Demonstrate that approximate self-attention layers perform comparably to dense self-attention layers for summarization while being much more memory-efficient, allowing summarization models to be fine-tuned using 2.5x longer sequences on standard hardware.
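To make the first of these themes concrete, the sketch below illustrates how the correlation between an automatic metric and human judgement can be computed for a set of summaries. It is a minimal illustration only: the scores are placeholder values rather than data from this thesis, and the use of Pearson and Spearman coefficients here is an assumption about which correlation statistics are applied, not a statement of the exact procedure used in the experiments.

```python
# Minimal sketch: correlating an automatic metric with human judgements.
# The scores below are illustrative placeholders, not data from this thesis.
from scipy.stats import pearsonr, spearmanr

# One entry per summary: the metric's score and a human rating of the same summary.
metric_scores = [0.42, 0.31, 0.55, 0.48, 0.27]   # e.g. ROUGE F1 or BARTScore values
human_ratings = [3.5, 2.0, 4.5, 4.0, 2.5]        # e.g. mean annotator score on a 1-5 scale

r, _ = pearsonr(metric_scores, human_ratings)     # linear correlation
rho, _ = spearmanr(metric_scores, human_ratings)  # rank correlation
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```

A metric with a higher correlation against human ratings across many such summaries is, by this criterion, the better automatic metric.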
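For the second theme, the following sketch shows how a long document can be summarized with the LED using the Hugging Face transformers library. This is a minimal, assumed usage example rather than the thesis's own training or inference code: the checkpoint name, sequence length and generation settings are illustrative choices.

```python
# Minimal sketch: summarizing a long document with the LED via Hugging Face
# transformers. Checkpoint and generation settings are illustrative, not the
# configuration used in this thesis.
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

checkpoint = "allenai/led-base-16384"  # assumed publicly available LED checkpoint
tokenizer = LEDTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

document = "..."  # a long source document, up to 16,384 tokens after tokenization
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=16384)

# The LED replaces dense self-attention with windowed (approximate) self-attention;
# global attention is typically placed on the first token so it can attend to the
# whole sequence.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Because the windowed attention scales linearly rather than quadratically with sequence length, inputs far longer than those accepted by a dense Transformer such as BART fit in the same GPU memory, which is what enables the longer-sequence fine-tuning described above.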