Salience Estimation and Faithful Generation: Modeling Methods for Text Summarization and Generation

Chris Kedzie

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2021

© 2021 Chris Kedzie
All Rights Reserved

Abstract

This thesis is focused on a particular text-to-text generation problem, automatic summarization, where the goal is to map a large input text to a much shorter summary text. The research presented aims to both understand and tame existing machine learning models, hopefully paving the way for more reliable text-to-text generation algorithms.

Somewhat against the prevailing trends, we eschew end-to-end training of an abstractive summarization model, and instead break down the text summarization problem into its constituent tasks. At a high level, we divide these tasks into two categories: content selection, or “what to say,” and content realization, or “how to say it” (McKeown, 1985). Within these categories we propose models and learning algorithms for the problems of salience estimation and faithful generation.

Salience estimation, that is, determining the importance of a piece of text relative to some context, falls into the former category: determining what should be selected for a summary. In particular, we experiment with a variety of popular or novel deep learning models for salience estimation in a single document summarization setting, and design several ablation experiments to gain some insight into which input signals are most important for making predictions. Understanding these signals is critical for designing reliable summarization models.
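To make the setup concrete, salience estimation in this extractive setting can be viewed as scoring each sentence for inclusion in the summary. The following toy sketch is purely illustrative and not the thesis's actual models: it uses an averaging sentence encoder over made-up word embeddings and a made-up linear scorer, all hypothetical values chosen only to show the shape of the computation.

```python
import math

# Hypothetical 3-dimensional word embeddings (illustrative values only).
EMB = {
    "court": [0.9, 0.1, 0.0],
    "ruled": [0.8, 0.2, 0.1],
    "cat":   [0.0, 0.9, 0.3],
    "sat":   [0.1, 0.8, 0.2],
}
# Hypothetical linear scoring weights and bias.
W, B = [1.2, -0.7, 0.3], 0.0

def encode(sentence):
    """Averaging sentence encoder: mean of in-vocabulary word embeddings."""
    vecs = [EMB[t] for t in sentence.lower().split() if t in EMB]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]

def salience(sentence):
    """Sigmoid of a linear score: estimated probability of summary inclusion."""
    z = sum(x * w for x, w in zip(encode(sentence), W)) + B
    return 1.0 / (1.0 + math.exp(-z))
```

A real model would learn the embeddings and weights from extractive labels; the ablation experiments in Chapter 3 probe which parts of such input representations the predictions actually depend on.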
We then consider a more difficult problem of estimating salience in a large document stream, and propose two alternative approaches using classical machine learning techniques from both unsupervised clustering and structured prediction. These models incorporate salience estimates into larger text extraction algorithms that also consider redundancy and previous extraction decisions.

Overall, we find that when simple, position-based heuristics are available, as in single document news or research summarization, deep learning models of salience often exploit them to make predictions, while ignoring the arguably more important content features of the input. In more demanding environments, like stream summarization, where such heuristics are unreliable, more semantically relevant features become key to identifying salient content.

In part two, content realization, we assume content selection has already been performed and focus on methods for faithful generation (i.e., ensuring that output text utterances respect the semantics of the input content). Since they can generate very fluent and natural text, deep learning-based natural language generation models are a popular approach to this problem. However, they often omit, misconstrue, or otherwise generate text that is not semantically correct given the input content. In this section, we develop a data augmentation and self-training technique to mitigate this problem. Additionally, we propose a training method for making deep learning-based natural language generation models capable of following a content plan, allowing for more control over the output utterances generated by the model. Under a stress test evaluation protocol, we demonstrate some empirical limits on several neural natural language generation models’ ability to encode and properly realize a content plan.

Finally, we conclude with some remarks on future directions for abstractive summarization outside of the end-to-end deep learning paradigm.
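The position-based heuristic referred to above can be made concrete: in news summarization, simply taking the leading sentences of a document (a "lead" baseline) is a strong strategy that learned models can end up imitating. A minimal sketch, with hypothetical example sentences:

```python
def lead_summary(sentences, k=3):
    """Position-based heuristic: return the first k sentences as the summary."""
    return sentences[:k]

# Hypothetical news-style document, earliest sentences first.
doc = [
    "Officials announced the new policy on Monday.",
    "The change takes effect next month.",
    "Critics argue it does not go far enough.",
    "Background: the policy was first proposed years earlier.",
]
summary = lead_summary(doc)
```

A model whose predictions largely reproduce this baseline is relying on position rather than content, which is exactly the failure mode that becomes costly in stream settings where position carries little signal.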
Our aim here is to suggest avenues for constructing abstractive summarization systems with transparent, controllable, and reliable behavior when it comes to text understanding, compression, and generation. Our hope is that this thesis inspires more research in this direction, and, ultimately, real tools that are broadly useful outside of the natural language processing community.

Table of Contents

List of Tables
List of Figures
Acknowledgments
Dedication

Chapter 1: Introduction
  1.1 Problems in Text-to-Text Generation
  1.2 Contributions
  1.3 Organization

Chapter 2: Related Work
  2.1 Natural Language Generation in Antiquity
  2.2 Natural Language Generation from 1980–2000
  2.3 The Emergence of Data Driven Extractive Summarization (2000–2014)
  2.4 Neural Natural Language Generation Models (2014–Present)

Chapter 3: Salience Estimation with Deep Learning Content Selection Models
  3.1 Problem Definition
  3.2 Models
    3.2.1 Word Embedding Layer
    3.2.2 Sentence Encoder Layer
    3.2.3 Sentence Extraction Layer
    3.2.4 Inference and Summary Generation
  3.3 Datasets
    3.3.1 Ground Truth Extract Summaries
  3.4 Experiments
    3.4.1 Training
    3.4.2 Baselines
  3.5 Results
    3.5.1 Ablation Experiments
  3.6 Discussion
  3.7 Conclusion

Chapter 4: Salience Estimation with Structured Content Selection Models
  4.1 Task Definition
  4.2 Dataset
  4.3 The Salience-biased Affinity Propagation (SAP) Summarizer
    4.3.1 Summarization Model
    4.3.2 Salience Estimation
    4.3.3 Affinity Propagation Clustering
    4.3.4 Redundancy Filtering and Update Selection
    4.3.5 TREC 2014 Experiments and Results
    4.3.6 Automatic Experiments
    4.3.7 Results
    4.3.8 Feature Ablation
  4.4 Learning-to-Search (L2S) Summarizer
    4.4.1 Stream Summarization as Sequential Decision Making
    4.4.2 Policy-based Stream Summarization
    4.4.3 The Locally-Optimal Learning-to-Search Algorithm
    4.4.4 Features
    4.4.5 Expanding Relevance Judgments
    4.4.6 TREC 2015 Experiments and Results
    4.4.7 Automatic Experiments
    4.4.8 Results
    4.4.9 Discussion
  4.5 Conclusion

Chapter 5: Faithful and Controllable Generation
  5.1 Meaning Representations for Task-Oriented Dialogue Generation
    5.1.1 Meaning Representation Structure
    5.1.2 Relating Between Meaning Representations and Utterances
  5.2 Modeling Meaning Representation-to-Text Generation with Sequence-to-Sequence Architectures
    5.2.1 Sequence-to-Sequence Modeling
    5.2.2 Learning
    5.2.3 Inference
    5.2.4 Sampling
  5.3 Faithful Generation Through Data-Augmentation: Noise-Injection Sampling and Self-Training
    5.3.1 An Idealized Data-Augmentation Protocol
    5.3.2 Conditional Utterance Sampling for Data-Augmentation
    5.3.3 A Practical Data-Augmentation Protocol
    5.3.4 Datasets
    5.3.5 Text Generation Models
    5.3.6 Meaning Representation Parsing Models
    5.3.7 Experiments
    5.3.8 Results
    5.3.9 Human Evaluation
    5.3.10 Analysis
  5.4 Alignment Training for Controllable Generation
    5.4.1 Alignment Training Linearization
    5.4.2 Phrase-based Data Augmentation
    5.4.3 Datasets
    5.4.4 Generation Models
    5.4.5 Utterance Planner Model
    5.4.6 Experiments
    5.4.7 Results
    5.4.8 Discussion
    5.4.9 Limitations
  5.5 Conclusion

Chapter 6: Conclusion
  6.1 Limitations and Open Problems for Abstractive Summarization Beyond End-to-End Neural Models
  6.2 Why not an end-to-end neural abstractive summarization model?
  6.3 Future Work
  6.4 Final Remarks

References

Appendix A: GRU-based Sequence-to-Sequence Architecture

Appendix B: Transformer-based Sequence-to-Sequence Architecture
  B.1 Transformer Components
  B.2 Transformer Processing Blocks
  B.3 The Transformer Encoder and Decoder Layers

List of Tables

3.1 Sizes of the training, validation, and test splits for each dataset and the average number of test set human reference summaries per document.

3.2 News domain METEOR (M) and ROUGE-2 recall (R-2) results across all extractor/encoder pairs. Results that are statistically indistinguishable from the best system are shown in bold face.

3.3 Non-news domain METEOR (M) and ROUGE-2 recall (R-2) results across all extractor/encoder pairs. Results that are statistically indistinguishable from the best system are shown in bold face.

3.4 ROUGE-2 recall across sentence extractors when using fixed pretrained embeddings or when embeddings are fine-tuned (F.-T.) during training. In both cases embeddings are initialized with pretrained GloVe embeddings. All extractors use the averaging sentence encoder. When both fine-tuned and fixed settings are bolded, there is no significant performance difference. Differences in scores are shown in parentheses.
