Document Modeling with External Attention for Sentence Extraction

Shashi Narayan∗ (University of Edinburgh, [email protected])
Ronald Cardenas∗ (Charles University in Prague, [email protected])
Nikos Papasarantopoulos∗ (University of Edinburgh, [email protected])

Shay B. Cohen, Mirella Lapata (University of Edinburgh, {scohen,mlap}@inf.ed.ac.uk)
Jiangsheng Yu, Yi Chang (Huawei Technologies, {jiangsheng.yu,yi.chang}@huawei.com)

Abstract

Document modeling is essential to a variety of natural language understanding tasks. We propose to use external information to improve document modeling for problems that can be framed as sentence extraction. We develop a framework composed of a hierarchical document encoder and an attention-based extractor with attention over external information. We evaluate our model on extractive document summarization (where the external information is image captions and the title of the document) and answer selection (where the external information is a question). We show that our model consistently outperforms strong baselines, in terms of both informativeness and fluency (for CNN document summarization) and achieves state-of-the-art results for answer selection on WikiQA and NewsQA.¹

∗ The first three authors made equal contributions to this paper. The work was done when the second author was visiting Edinburgh.
¹ Our TensorFlow code and datasets are publicly available at https://github.com/shashiongithub/Document-Models-with-Ext-Information.

1 Introduction

Recurrent neural networks have become one of the most widely used models in natural language processing (NLP). A number of variants of RNNs such as Long Short-Term Memory networks (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit networks (GRU; Cho et al., 2014) have been designed to model text capturing long-term dependencies in problems such as language modeling. However, document modeling, a key to many natural language understanding tasks, is still an open challenge. Recently, some neural network architectures were proposed to capture large context for modeling text (Mikolov and Zweig, 2012; Ghosh et al., 2016; Ji et al., 2015; Wang and Cho, 2016). Lin et al. (2015) and Yang et al. (2016) proposed a hierarchical RNN network for document-level modeling as well as sentence-level modeling, at the cost of increased computational complexity. Tran et al. (2016) further proposed a contextual language model that considers information at inter-document level.

It is challenging to rely only on the document for its understanding, and as such it is not surprising that these models struggle on problems such as document summarization (Cheng and Lapata, 2016; Chen et al., 2016; Nallapati et al., 2017; See et al., 2017; Tan and Wan, 2017) and machine reading comprehension (Trischler et al., 2016; Miller et al., 2016; Weissenborn et al., 2017; Hu et al., 2017; Wang et al., 2017). In this paper, we formalize the use of external information to further guide document modeling for end goals.

We present a simple yet effective document modeling framework for sentence extraction that allows machine reading with "external attention." Our model includes a neural hierarchical document encoder (or a machine reader) and a hierarchical attention-based sentence extractor. Our hierarchical document encoder resembles the architectures proposed by Cheng and Lapata (2016) and Narayan et al. (2018) in that it derives the document meaning representation from its sentences and their constituent words. Our novel sentence extractor combines this document meaning representation with an attention mechanism (Bahdanau et al., 2015) over the external information to label sentences from the input document. Our model explicitly biases the extractor with external cues and implicitly biases the encoder through training.

We demonstrate the effectiveness of our model on two problems that can be naturally framed as sentence extraction with external information. These two problems, extractive document summarization and answer selection for machine reading comprehension, both require local and global contextual reasoning about a given document. Extractive document summarization systems aim at creating a summary by identifying (and subsequently concatenating) the most important sentences in a document, whereas answer selection systems select the candidate sentence in a document most likely to contain the answer to a query. For document summarization, we exploit the title and image captions which often appear with documents (specifically newswire articles) as external information. For answer selection, we use word overlap features, such as the inverse sentence frequency (ISF, Trischler et al., 2016) and the inverse document frequency (IDF), together with the query, all formulated as external cues.

Our main contributions are three-fold: First, our model ensures that sentence extraction is done in a larger (rich) context, i.e., the full document is read first before we start labeling its sentences for extraction, and each sentence labeling is done by implicitly estimating its local and global relevance to the document and by directly attending to some external information for importance cues.

Second, while external information has been shown to be useful for summarization systems using traditional hand-crafted features (Edmundson, 1969; Kupiec et al., 1995; Mani, 2001), our model is the first to exploit such information in deep learning-based summarization. We evaluate our models automatically (in terms of ROUGE scores) on the CNN news highlights dataset (Hermann et al., 2015). Experimental results show that our summarizer, informed with title and image captions, consistently outperforms summarizers that do not use this information. We also conduct a human evaluation to judge which type of summary participants prefer. Our results overwhelmingly show that human subjects find our summaries more informative and complete.

Lastly, with the machine reading capabilities of our model, we confirm that a full document needs to be "read" to produce high quality extracts allowing a rich contextual reasoning, in contrast to previous answer selection approaches that often measure a score between each sentence in the document and the question and return the sentence with highest score in an isolated manner (Yin et al., 2016; dos Santos et al., 2016; Wang et al., 2016). Our model with ISF and IDF scores as external features achieves competitive results for answer selection. Our ensemble model combining scores from our model and word overlap scores using a logistic regression layer achieves state-of-the-art results on the popular datasets WikiQA (Yang et al., 2015) and NewsQA (Trischler et al., 2016), and it obtains comparable results to the state of the art for SQuAD (Rajpurkar et al., 2016). We also evaluate our approach on the MSMarco dataset (Nguyen et al., 2016) and elaborate on the behavior of our machine reader in a scenario where each candidate answer sentence is contextually independent of each other.

2 Document Modeling For Sentence Extraction

Given a document D consisting of a sequence of n sentences (s1, s2, ..., sn), we aim at labeling each sentence si in D with a label yi ∈ {0, 1} where yi = 1 indicates that si is extraction-worthy and 0 otherwise. Our architecture resembles those previously proposed in the literature (Cheng and Lapata, 2016; Nallapati et al., 2017). The main components include a sentence encoder, a document encoder, and a novel sentence extractor (see Figure 1) that we describe in more detail below. The novel characteristics of our model are that each sentence is labeled by implicitly estimating its (local and global) relevance to the document and by directly attending to some external information for importance cues.

Sentence Encoder A core component of our model is a convolutional sentence encoder (Kim, 2014; Kim et al., 2016) which encodes sentences into continuous representations. We use temporal narrow convolution by applying a kernel filter K of width h to a window of h words in sentence s to produce a new feature. This filter is applied to each possible window of words in s to produce a feature map f ∈ R^(k−h+1) where k is the sentence length. We then apply max-pooling over time over the feature map f and take the maximum value as the feature corresponding to this particular filter K. We use multiple kernels of various sizes and each kernel multiple times to construct the representation of a sentence. In Figure 1, kernels of size 2 (red) and 4 (blue) are applied three times each. The max-pooling over time operation yields two feature lists f^K2 and f^K4 ∈ R^3. The final sentence embeddings have six dimensions.
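For illustration, the sketch below is a minimal numpy rendering of temporal narrow convolution followed by max-pooling over time. The kernel widths (2 and 4) and the three filters per width mirror the toy configuration described for Figure 1, but the embedding size, the weights and the nonlinearity are placeholder assumptions, not the trained parameters of our model.

```python
import numpy as np

def encode_sentence(word_vecs, kernels):
    """Temporal narrow convolution + max-over-time pooling (a sketch).

    word_vecs: array of shape (k, d) -- k words, d-dimensional embeddings.
    kernels:   list of (h, W) pairs; W has shape (h * d,) so each kernel
               produces one feature per window of h consecutive words.
    Returns one max-pooled feature per kernel.
    """
    k, d = word_vecs.shape
    features = []
    for h, W in kernels:
        # feature map f in R^(k - h + 1): one value per window of h words
        # (tanh is an assumed nonlinearity; the paper does not specify it here)
        fmap = [np.tanh(W @ word_vecs[i:i + h].reshape(-1))
                for i in range(k - h + 1)]
        features.append(max(fmap))          # max-pooling over time
    return np.array(features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8                                   # toy embedding size (assumption)
    sent = rng.normal(size=(6, d))          # a 6-word sentence
    # three kernels of width 2 and three of width 4, as in the Figure 1 example
    kernels = [(2, rng.normal(size=2 * d)) for _ in range(3)] + \
              [(4, rng.normal(size=4 * d)) for _ in range(3)]
    print(encode_sentence(sent, kernels))   # a 6-dimensional sentence embedding
```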

2021 nels of size 2 (red) and 4 (blue) are applied three Document encoder Sentence Extractor times each. The max-pooling over time operation y1 y2 y3 y4 y5 yields two feature lists f K2 and f K4 ∈ R3. The final sentence embeddings have six dimensions. Document Encoder The document encoder composes a sequence of sentences to obtain a doc- s5 s4 s3 s2 s1 s1 s2 s3 s4 s5 ument representation. We use a recurrent neural network with LSTM cells to avoid the vanishing L gradient problem when training long sequences (Hochreiter and Schmidhuber, 1997). Given a document D consisting of a sequence of sentences (s1, s2, . . . , sn), we follow common practice and feed the sentences in reverse order (Sutskever et al., 2014; Li et al., 2015; Filippova et al., 2015). s5 s4 s3 s2 s1 e1 e2 e3 Sentence Extractor Our sentence extractor se- Document External quentially labels each sentence in a document with 1 or 0 by implicitly estimating its relevance in the document and by directly attending to the North Korea external information for importance cues. It is im- fired plemented with another RNN with LSTM cells a missile with an attention mechanism (Bahdanau et al., over 2015) and a softmax layer. Our attention mech- Japan anism differs from the standard practice of attend- ing intermediate states of the input (encoder). In- [convolution] [max pooling] stead, our extractor attends to a sequence of p Convolutional Sentence encoder pieces of external information E : (e1, e2, ..., ep) relevant for the task (e.g., ei is a title or an im- age caption for summarization) for cues. At time Figure 1: Hierarchical encoder-decoder model ti, it reads sentence si and makes a binary predic- for sentence extraction with external attention. tion, conditioned on the document representation s1, . . . , s5 are sentences in the document and, e1, (obtained from the document encoder), the previ- e2 and e3 represent external information. For the ously labeled sentences and the external informa- extractive summarization task, eis are external in- tion. This way, our labeler is able to identify lo- formation such as title and image captions. For the cally and globally important sentences within the answers selection task, eis are the query and word document which correlate well with the external overlap features. information. Given sentence st at time step t, it returns a ate RNN state at time step t. The dynamic con- probability distribution over labels as: 0 text vector ht is essentially the weighted sum of the external information (e1, e2, . . . , ep). Figure1 0 p(yt|st,D,E) = softmax(g(ht, ht)) (1) summarizes our model. g(h , h0 ) = U (V h + W 0 h0 ) (2) t t o h t h t 3 Sentence Extraction Applications ht = LSTM(st, ht−1) p We validate our model on two sentence extrac- 0 X ht = α(t,i)ei, tion problems: extractive document summariza- i=1 tion and answer selection for machine reading exp(htei) comprehension. Both these tasks require local and where α(t,i) = P j exp(htej) global contextual reasoning about a given docu- ment. As such, they test the ability of our model where g(·) is a single-layer neural network with to facilitate document modeling using external in- 0 parameters Uo, Vh and Wh. ht is an intermedi- formation. 2022 Extractive Summarization An extractive sum- refers to the set of words that appear both in si and marizer aims to produce a summary S by select- in q. Local ISF is calculated in the same manner ing m sentences from D (where m < n). 
3 Sentence Extraction Applications

We validate our model on two sentence extraction problems: extractive document summarization and answer selection for machine reading comprehension. Both these tasks require local and global contextual reasoning about a given document. As such, they test the ability of our model to facilitate document modeling using external information.

Extractive Summarization An extractive summarizer aims to produce a summary S by selecting m sentences from D (where m < n). In this setting, our sentence extractor sequentially predicts label yi ∈ {0, 1} (where 1 means that si should be included in the summary) by assigning score p(yi | si, D, E, θ) quantifying the relevance of si to the summary. We assemble a summary S by selecting m sentences with top p(yi = 1 | si, D, E, θ) scores.

We formulate external information E as the sequence of the title and the image captions associated with the document. We use the convolutional sentence encoder to get their sentence-level representations.
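A minimal sketch of how a summary could be assembled from the extractor's output, keeping the m sentences with the highest p(yi = 1 | si, D, E, θ); restoring document order before concatenation is our own simplifying assumption for readability, not something specified above.

```python
def assemble_summary(sentences, probs, m=3):
    """Select the m sentences with the highest extraction probability."""
    top = sorted(range(len(sentences)), key=lambda i: probs[i], reverse=True)[:m]
    return " ".join(sentences[i] for i in sorted(top))  # keep document order

# toy usage with made-up scores
print(assemble_summary(["s1 ...", "s2 ...", "s3 ...", "s4 ..."],
                       [0.2, 0.9, 0.1, 0.7], m=2))
```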

Answer Selection Given a question q and a document D, the goal of the task is to select one candidate sentence si ∈ D in which the answer exists. In this setting, our sentence extractor sequentially predicts label yi ∈ {0, 1} (where 1 means that si contains the answer) and assigns score p(yi | si, D, E, θ) quantifying si's relevance to the query. We return as answer the sentence si with the highest p(yi = 1 | si, D, E, θ) score.

We treat the question q as external information and use the convolutional sentence encoder to get its sentence-level representation. This simplifies Eq. (1) and (2) as follows:

  p(yt | st, D, q) = softmax(g(ht, q))                                    (3)
  g(ht, q) = Uo (Vh ht + Wq q),

where Vh and Wq are network parameters. We exploit the simplicity of our model to further assimilate external features relevant for answer selection: the inverse sentence frequency (ISF; Trischler et al., 2016), the inverse document frequency (IDF) and a modified version of the ISF score which we call local ISF. Trischler et al. (2016) have shown that a simple ISF baseline (i.e., a sentence with the highest ISF score as an answer) correlates well with the answers. The ISF score α_si for the sentence si is computed as α_si = Σ_{w ∈ si ∩ q} IDF(w), where IDF is the inverse document frequency score of word w, defined as IDF(w) = log(N / Nw), where N is the total number of sentences in the training set and Nw is the number of sentences in which w appears. Note that si ∩ q refers to the set of words that appear both in si and in q. Local ISF is calculated in the same manner as the ISF score, only with setting the total number of sentences (N) to the number of sentences in the article that is being analyzed.
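The word-overlap scores can be computed in a few lines of Python. The sketch below follows the definitions above: IDF(w) = log(N / Nw) over the sentences of the training set, the ISF score of a candidate sums the IDF of words it shares with the question, and local ISF recomputes the same quantity over the current article. Whitespace tokenization and recomputing both N and Nw over the article for local ISF are simplifying assumptions.

```python
import math
from collections import Counter

def idf_table(sentences):
    """IDF(w) = log(N / N_w), with N_w counted over the given sentences."""
    N = len(sentences)
    df = Counter(w for s in sentences for w in set(s.lower().split()))
    return {w: math.log(N / n) for w, n in df.items()}

def isf_score(sentence, question, idf):
    """Sum of IDF(w) over words shared by the candidate sentence and the question."""
    overlap = set(sentence.lower().split()) & set(question.lower().split())
    return sum(idf.get(w, 0.0) for w in overlap)

def local_isf_score(sentence, question, article_sentences):
    """Same as ISF, but with the counts taken over the current article only
    (one reading of "setting N to the number of sentences in the article")."""
    return isf_score(sentence, question, idf_table(article_sentences))

# toy usage with made-up sentences
train = ["north korea fired a missile", "the missile flew over japan", "markets fell"]
idf = idf_table(train)
article = ["north korea fired a missile over japan", "markets fell sharply"]
q = "what did north korea fire"
print(isf_score(article[0], q, idf), local_isf_score(article[0], q, article))
```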

More formally, this modifies Eq. (3) as follows:

  p(yt | st, D, q) = softmax(g(ht, q, αt, βt, γt)),                       (4)

where αt, βt and γt are the ISF, IDF and local ISF scores (real values) of sentence st respectively. The function g is calculated as follows:

  g(ht, q, αt, βt, γt) = Uo (Vh ht + Wq q + Wisf (αt · 1) + Widf (βt · 1) + Wlisf (γt · 1)),

where Wisf, Widf and Wlisf are new parameters added to the network and 1 is a vector of 1s of size equal to the sentence embedding size. In Figure 1, these external feature vectors are represented as 6-dimensional gray vectors accompanied with dashed arrows.
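A sketch of the feature-augmented scorer in Eq. (4): the scalar ISF, IDF and local ISF values of the current sentence are broadcast into vectors of the sentence-embedding size (the "· 1" terms) and passed through their own weight matrices alongside the RNN state and the question. All weights are random placeholders, not learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_with_features(h_t, q, isf, idf, lisf, P):
    """g(h_t, q, alpha_t, beta_t, gamma_t) from Eq. (4), followed by a softmax.

    h_t, q : RNN state and question embedding, shape (d,)
    isf, idf, lisf : scalar word-overlap features of the current sentence
    P : dict with U_o (2, d) and V_h, W_q, W_isf, W_idf, W_lisf (d, d)
    """
    ones = np.ones_like(h_t)                      # the "1" vector of size d
    z = (P["V_h"] @ h_t + P["W_q"] @ q
         + P["W_isf"] @ (isf * ones)
         + P["W_idf"] @ (idf * ones)
         + P["W_lisf"] @ (lisf * ones))
    return softmax(P["U_o"] @ z)                  # p(y_t | s_t, D, q)

rng = np.random.default_rng(2)
d = 6
P = {"U_o": rng.normal(size=(2, d))}
P.update({k: rng.normal(size=(d, d))
          for k in ("V_h", "W_q", "W_isf", "W_idf", "W_lisf")})
print(score_with_features(rng.normal(size=d), rng.normal(size=d), 1.3, 0.7, 0.9, P))
```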

4 Experiments and Results

This section presents our experimental setup and results assessing our model in both the extractive summarization and answer selection setups. In the rest of the paper, we refer to our model as XNET for its ability to exploit eXternal information to improve document representation.

4.1 Extractive Document Summarization

Summarization Dataset We evaluated our models on the CNN news highlights dataset (Hermann et al., 2015).² We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 documents). We followed previous studies (Cheng and Lapata, 2016; Nallapati et al., 2016, 2017; See et al., 2017; Tan and Wan, 2017) in assuming that the "story highlights" associated with each article are gold-standard abstractive summaries. We trained our network on a named-entity-anonymized version of news articles. However, we generated deanonymized summaries and evaluated them against gold summaries to facilitate human evaluation and to make human evaluation comparable to automatic evaluation.

² Hermann et al. (2015) have also released the DailyMail dataset, but we do not report our results on this dataset. We found that the script written by Hermann et al. to crawl DailyMail articles mistakenly extracts image captions as part of the main body of the document. As image captions often do not have sentence boundaries, they blend with the sentences of the document unnoticeably. This leads to the production of erroneous summaries.

To train our model, we need documents annotated with sentence extraction information, i.e., each sentence in a document is labeled with 1 (summary-worthy) or 0 (not summary-worthy). We followed Nallapati et al. (2017) and automatically extracted ground truth labels such that all positively labeled sentences from an article collectively give the highest ROUGE (Lin and Hovy, 2003) score with respect to the gold summary.
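One way to realize this label extraction is a greedy search that adds sentences while the score of the selected set against the gold highlights keeps improving. The sketch below uses unigram recall as a cheap stand-in for ROUGE, and the greedy strategy itself is an assumption about how "collectively give the highest ROUGE" is approximated; it is not our exact preprocessing script.

```python
def unigram_recall(selected, gold):
    """Cheap stand-in for ROUGE: fraction of gold tokens covered."""
    gold_tokens = set(" ".join(gold).lower().split())
    sel_tokens = set(" ".join(selected).lower().split())
    return len(gold_tokens & sel_tokens) / max(len(gold_tokens), 1)

def oracle_labels(sentences, gold):
    """Greedily pick sentences while the (proxy) ROUGE score keeps improving."""
    picked, labels, best = [], [0] * len(sentences), 0.0
    improved = True
    while improved:
        improved = False
        for i, s in enumerate(sentences):
            if labels[i]:
                continue
            score = unigram_recall(picked + [s], gold)
            if score > best:
                best, labels[i], improved = score, 1, True
                picked.append(s)
    return labels

# toy example: the two informative sentences get labeled 1
print(oracle_labels(["north korea fired a missile over japan",
                     "markets reacted calmly",
                     "the missile fell into the sea"],
                    ["north korea fired a missile that fell into the sea"]))
```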
We used a modified script of Hermann et al. (2015) to extract titles and image captions, and we associated them with the corresponding articles. All articles get associated with their titles. The availability of image captions varies from 0 to 414 per article, with an average of 3 image captions. There are 40% CNN articles with at least one image caption.

All sentences, including titles and image captions, were padded with zeros to a sentence length of 100. All input documents were padded with zeros to a maximum document length of 126. For each document, we consider a maximum of 10 image captions. We experimented with various numbers (1, 3, 5, 10 and 20) of image captions on the validation set and found that our model performed best with 10 image captions. We refer the reader to the supplementary material for more implementation details to replicate our results.

MODELS              R1    R2    R3    R4    RL    Avg.
LEAD                49.2  18.9   9.8   6.0  43.8  25.5
POINTERNET          53.3  19.7  10.4   6.4  47.2  27.4
XNET+TITLE          55.0  21.6  11.7   7.5  48.9  28.9
XNET+CAPTION        55.3  21.3  11.4   7.2  49.0  28.8
XNET+FS             54.8  21.1  11.3   7.2  48.6  28.6
Combination Models (XNET+)
TITLE+CAPTION       55.4  21.8  11.8   7.5  49.2  29.2
TITLE+FS            55.1  21.6  11.6   7.4  48.9  28.9
CAPTION+FS          55.3  21.5  11.5   7.3  49.0  28.9
TITLE+CAPTION+FS    55.4  21.5  11.6   7.4  49.1  29.0

Table 1: Ablation results on the validation set. We report R1, R2, R3, R4, RL and their average (Avg.). The first block of the table presents LEAD and POINTERNET which do not use any external information. LEAD is the baseline system selecting first three sentences. POINTERNET is the sentence extraction system of Cheng and Lapata. XNET is our model. The second and third blocks of the table present different variants of XNET. We experimented with three types of external information: title (TITLE), image captions (CAPTION) and the first sentence (FS) of the document. The bottom block of the table presents models with more than one type of external information. The best performing model (highlighted in boldface) is used on the test set.

Comparison Systems We compared the output of our model against the standard baseline of simply selecting the first three sentences from each document as the summary. We refer to this baseline as LEAD in the rest of the paper.

We also compared our system against the sentence extraction system of Cheng and Lapata (2016). We refer to this system as POINTERNET as the neural attention architecture in Cheng and Lapata (2016) resembles the one of Pointer Networks (Vinyals et al., 2015).³ It does not exploit any external information.⁴ Cheng and Lapata (2016) report only on the DailyMail dataset. We used their code (https://github.com/cheng6076/NeuralSum) to produce results on the CNN dataset.⁵

³ The architecture of POINTERNET is closely related to our model without external information.
⁴ Adding external information to POINTERNET is an interesting direction of research but we do not pursue it here. It requires decoding with multiple types of attentions and this is not the focus of this paper.
⁵ We are unable to compare our results to the extractive system of Nallapati et al. (2017) because they report their results on the DailyMail dataset and their code is not available. The abstractive systems of Chen et al. (2016) and Tan and Wan (2017) report their results on the CNN dataset, however, their results are not comparable to ours as they report on the full-length F1 variants of ROUGE to evaluate their abstractive summaries. We report ROUGE recall scores which is more appropriate to evaluate our extractive summaries.

Automatic Evaluation To automatically assess the quality of our summaries, we used ROUGE (Lin and Hovy, 2003), a recall-oriented metric, to compare our model-generated summaries to manually-written highlights.⁶ Previous work has reported ROUGE-1 (R1) and ROUGE-2 (R2) scores to assess informativeness, and ROUGE-L (RL) to assess fluency. In addition to R1, R2 and RL, we also report ROUGE-3 (R3) and ROUGE-4 (R4) capturing higher order n-grams overlap to assess informativeness and fluency simultaneously.

⁶ We used pyrouge, a Python package, to compute all our ROUGE scores with parameters "-a -c 95 -m -n 4 -w 1.2".

We report our results on both full length (three sentences with the top scores as the summary) and fixed length (first 75 bytes and 275 bytes as the summary) summaries. For full length summaries, our decision of selecting three sentences is guided by the fact that there are 3.11 sentences on average in the gold highlights of the training set. We conduct our ablation study on the validation set with full length ROUGE scores, but we report both fixed and full length ROUGE scores for the test set.

We experimented with two types of external information: title (TITLE) and image captions (CAPTION). In addition, we experimented with the first sentence (FS) of the document as external information. Note that the latter is not external information; it is a sentence in the document. However, we wanted to explore the idea that the first sentence of the document plays a crucial part in generating summaries (Rush et al., 2015; Nallapati et al., 2016). XNET with FS acts as a baseline for XNET with title and image captions.

We report the performance of several variants of XNET on the validation set in Table 1. We also compare them against the LEAD baseline and POINTERNET. These two systems do not use any additional information. Interestingly, all the variants of XNET significantly outperform LEAD and POINTERNET. When the title (TITLE), image captions (CAPTION) and the first sentence (FS) are used separately as additional information, XNET performs best with TITLE as its external information. Our result demonstrates the importance of the title of the document in extractive summarization (Edmundson, 1969; Kupiec et al., 1995; Mani, 2001). The performance with TITLE and CAPTION is better than that with FS. We also tried possible combinations of TITLE, CAPTION and FS. All XNET models are superior to the ones without any external information. XNET performs best when TITLE and CAPTION are jointly used as external information (55.4%, 21.8%, 11.8%, 7.5%, and 49.2% for R1, R2, R3, R4, and RL respectively). It is better than the LEAD baseline by 3.7 points on average and than POINTERNET by 1.8 points on average, indicating that external information is useful to identify the gist of the document. We use this model for testing purposes.

MODELS          R1    R2    R3    R4    RL
Fixed length: 75b
LEAD            20.1   7.1   3.5   2.1  14.6
POINTERNET      20.3   7.2   3.5   2.2  14.8
XNET            20.2   7.1   3.4   2.0  14.6
Fixed length: 275b
LEAD            39.1  14.5   7.6   4.7  34.6
POINTERNET      38.6  13.9   7.3   4.4  34.3
XNET            39.7  14.7   7.9   5.0  35.2
Full length summaries
LEAD            49.3  19.5  10.7   6.9  43.8
POINTERNET      51.7  19.7  10.6   6.6  45.7
XNET            54.2  21.6  12.0   7.9  48.1

Table 2: Final results on the test set. POINTERNET is the sentence extraction system of Cheng and Lapata. XNET is our best model from Table 1. Best ROUGE score in each block and each column is highlighted in boldface.

Models        1st   2nd   3rd   4th
LEAD          0.15  0.17  0.47  0.21
POINTERNET    0.16  0.05  0.31  0.48
XNET          0.28  0.53  0.15  0.04
HUMAN         0.41  0.25  0.07  0.27

Table 3: Human evaluations: Ranking of various systems. Rank 1st is best and rank 4th, worst. Numbers show the percentage of times a system gets ranked at a certain position.

Our final results on the test set are shown in Table 2. It turns out that for smaller summaries (75 bytes) LEAD and POINTERNET are superior to XNET. This result could be because LEAD (always) and POINTERNET (often) include the first sentence in their summaries, whereas XNET is better capable at selecting sentences from various document positions. This is not captured by smaller summaries of 75 bytes, but it becomes more evident with longer summaries (275 bytes and full length) where XNET performs best across all ROUGE scores. We note that POINTERNET outperforms LEAD for 75-byte summaries, then its performance drops behind LEAD for 275-byte summaries, but then it outperforms LEAD for full length summaries on the metrics R1, R2 and RL. It shows that POINTERNET with its attention over sentences in the document is capable of exploring more than the first few sentences in the document, but it is still behind XNET which is better at identifying salient sentences in the document. XNET performs significantly better than POINTERNET by 0.8 points for 275-byte summaries and by 1.9 points for full length summaries, on average for all ROUGE scores.

Human Evaluation We complement our automatic evaluation results with human evaluation. We randomly selected 20 articles from the test set. Annotators were presented with a news article and summaries from four different systems. These include the LEAD baseline, POINTERNET, XNET and the human authored highlights. We followed the guidelines in Cheng and Lapata (2016), and asked our participants to rank the summaries from best (1st) to worst (4th) in order of informativeness (does the summary capture important information in the article?) and fluency (is the summary written in well-formed English?). We did not allow any ties and we only sampled articles with non-identical summaries. We assigned this task to five annotators who were proficient English speakers. Each annotator was presented with all 20 articles. The order of summaries to rank was randomized per article. An example of summaries our subjects ranked is provided in the supplementary material.

The results of our human evaluation study are shown in Table 3. As one might imagine, HUMAN gets ranked 1st most of the time (41%). However, it is closely followed by XNET which ranked 1st 28% of the time. In comparison, POINTERNET and LEAD were mostly ranked at 3rd and 4th places. We also carried out pairwise comparisons between all models in Table 3 for their statistical significance using a one-way ANOVA with post-hoc Tukey HSD tests (p < 0.01). It showed that XNET is significantly better than LEAD and POINTERNET, and it does not differ significantly from HUMAN. On the other hand, POINTERNET does not differ significantly from LEAD and it differs significantly from both XNET and HUMAN. The human evaluation results corroborate our empirical results in Table 1 and Table 2: XNET is better than LEAD and POINTERNET in producing informative and fluent summaries.

4.2 Answer Selection

Question Answering Datasets We run experiments on four datasets collected for open domain question-answering tasks: WikiQA (Yang et al., 2015), SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and MSMarco (Nguyen et al., 2016).

NewsQA was especially designed to present lexical and syntactic divergence between questions and answers. It contains 119,633 questions posed by crowdworkers on 12,744 CNN articles previously collected by Hermann et al. (2015). In a similar manner, SQuAD associates 100,000+ questions with a Wikipedia article's first paragraph, for 500+ previously chosen articles. WikiQA was collected by mining web-searching query logs and then associating them with the summary section of the Wikipedia article presumed to be related to the topic of the query. A similar collection procedure was followed to create MSMarco, with the difference that each candidate answer is a whole paragraph from a different browsed website associated with the query.

We follow the widely used setup of leaving out unanswered questions (Trischler et al., 2016; Yang et al., 2015) and adapt the format of each dataset to our task of answer sentence selection by labeling a candidate sentence with 1 if any answer span is contained in that sentence. In the case of MSMarco, each candidate paragraph comes associated with a label, hence we treat each one as a single long sentence. Since SQuAD keeps the official test dataset hidden and MSMarco does not provide labels for its released test set, we report results on their official validation sets. For validation, we set apart 10% of each official training set.

Our dataset splits consist of 92,525, 5,165 and 5,124 samples for NewsQA; 79,032, 8,567, and 10,570 for SQuAD; 873, 122, and 237 for WikiQA; and 79,704, 9,706, and 9,650 for MSMarco, for training, validation, and testing respectively.
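A small sketch of the labeling rule just described: a candidate sentence receives label 1 if any gold answer span occurs inside it. Lowercased string containment is a simplification; the released datasets provide character offsets that a real preprocessor would use instead.

```python
def label_candidates(sentences, answer_spans):
    """Label a sentence 1 if it contains any of the gold answer spans."""
    labels = []
    for s in sentences:
        s_low = s.lower()
        labels.append(int(any(span.lower() in s_low for span in answer_spans)))
    return labels

# toy usage with made-up text
doc = ["The missile was fired on Tuesday morning.",
       "It flew over northern Japan before falling into the sea."]
print(label_candidates(doc, ["over northern japan"]))   # -> [0, 1]
```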
Comparison Systems We compared the output of our model against the ISF (Trischler et al., 2016) and LOCALISF baselines. Given an article, the sentence with the highest ISF score is selected as an answer for the ISF baseline and the sentence with the highest local ISF score for the LOCALISF baseline. We also compare our model against a neural network (PAIRCNN) that encodes (question, candidate) in an isolated manner as in previous work (Yin et al., 2016; dos Santos et al., 2016; Wang et al., 2016). The architecture uses the sentence encoder explained in earlier sections to learn the question and candidate representations. The distribution over labels is given by p(yt | q) = p(yt | st, q) = softmax(g(st, q)) where g(st, q) = ReLU(Wsq · [st; q] + bsq). In addition, we also compare our model against AP-CNN (dos Santos et al., 2016), ABCNN (Yin et al., 2016), L.D.C (Wang and Jiang, 2017), KV-MemNN (Miller et al., 2016), and COMPAGGR, a state-of-the-art system by Wang et al. (2017).
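The PAIRCNN scorer quoted above, g(st, q) = ReLU(Wsq · [st; q] + bsq), can be sketched in a few lines of numpy; each (question, candidate) pair is scored in isolation, which is exactly the contrast with XNET's document-level reading. Reading g as producing the two label logits directly, and all weight shapes, are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def paircnn_score(s_t, q, W_sq, b_sq):
    """p(y_t | s_t, q) = softmax(ReLU(W_sq [s_t; q] + b_sq)), scored in isolation."""
    concat = np.concatenate([s_t, q])                 # [s_t; q]
    hidden = np.maximum(0.0, W_sq @ concat + b_sq)    # ReLU
    return softmax(hidden)

rng = np.random.default_rng(3)
d = 6                                                 # toy embedding size
s_t, q = rng.normal(size=d), rng.normal(size=d)       # candidate and question vectors
W_sq, b_sq = rng.normal(size=(2, 2 * d)), rng.normal(size=2)
print(paircnn_score(s_t, q, W_sq, b_sq))
```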

                   SQuAD               WikiQA              NewsQA              MSMarco
              ACC   MAP   MRR     ACC   MAP   MRR     ACC   MAP   MRR     ACC   MAP   MRR
WRD CNT      77.84 27.50 27.77   51.05 48.91 49.24   44.67 46.48 46.91   20.16 19.37 19.51
WGT WRD CNT  78.43 28.10 28.38   49.79 50.99 51.32   45.24 48.20 48.64   20.50 20.06 20.23
AP-CNN         -     -     -       -   68.86 69.57     -     -     -       -     -     -
ABCNN          -     -     -       -   69.21 71.08     -     -     -       -     -     -
L.D.C          -     -     -       -   70.58 72.26     -     -     -       -     -     -
KV-MemNN       -     -     -       -   70.69 72.65     -     -     -       -     -     -
LOCALISF     79.50 27.78 28.05   49.79 49.57 50.11   44.69 48.40 46.48   20.21 20.22 20.39
ISF          78.85 28.09 28.36   48.52 46.53 46.72   45.61 48.57 48.99   20.52 20.07 20.23
PAIRCNN      32.53 46.34 46.35   32.49 39.87 38.71   25.67 40.16 39.89   14.92 34.62 35.14
COMPAGGR     85.52 91.05 91.05   60.76 73.12 74.06   54.54 67.63 68.21   32.05 52.82 53.43
XNET         35.50 58.46 58.84   54.43 69.12 70.22   26.18 42.28 42.43   15.45 35.42 35.97
XNETTOPK     36.09 59.70 59.32   55.00 68.66 70.24   29.41 46.69 46.97   17.04 37.60 38.16
LRXNET       85.63 91.10 91.85   63.29 76.57 75.10   55.17 68.92 68.43   32.92 31.15 30.41
XNET+        79.39 87.32 88.00   57.08 70.25 71.28   47.23 61.81 61.42   23.07 42.88 43.42

Table 4: Results (in percentage) for answer selection comparing our approaches (bottom part) to baselines (top): AP-CNN (dos Santos et al., 2016), ABCNN (Yin et al., 2016), L.D.C (Wang and Jiang, 2017), KV-MemNN (Miller et al., 2016), and COMPAGGR, a state-of-the-art system by Wang et al. (2017). (WGT) WRD CNT stands for the (weighted) word count baseline. See text for more details.

We experiment with several variants of our model. XNET is the vanilla version of our sentence extractor conditioned only on the query q as external information (Eq. (3)). XNET+ is an extension of XNET which uses ISF, IDF and local ISF scores in addition to the query q as external information (Eq. (4)). We also experimented with a baseline XNETTOPK where we choose the top k sentences with highest ISF score, and then among them choose the one with the highest probability according to XNET. In our experiments, we set k = 5. In the end, we experimented with an ensemble network LRXNET which combines the XNET score, the COMPAGGR score and other word-overlap-based scores (tweaked and optimized for each dataset separately) for each sentence using a logistic regression classifier. It uses ISF and LocalISF scores for NewsQA, IDF and ISF scores for SQuAD, sentence length, IDF and ISF scores for WikiQA, and word overlap and ISF score for MSMarco. We refer the reader to the supplementary material for more implementation and optimization details to replicate our results.
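The LRXNET ensemble can be sketched with scikit-learn's logistic regression: for each candidate sentence the XNET probability, the COMPAGGR score and a couple of word-overlap features are stacked into a feature vector, and a binary classifier is fit on training candidates. Which overlap features enter the vector is dataset-specific as described above; the two used below and all the numbers are made-up placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy training data: one row per candidate sentence
# columns: [xnet_prob, compaggr_score, isf, local_isf]  (feature choice varies per dataset)
X_train = np.array([[0.9, 0.8, 2.1, 1.7],
                    [0.2, 0.3, 0.4, 0.2],
                    [0.7, 0.9, 1.9, 1.5],
                    [0.1, 0.2, 0.3, 0.1]])
y_train = np.array([1, 0, 1, 0])            # 1 = sentence contains the answer

clf = LogisticRegression().fit(X_train, y_train)

# at test time, rank the candidates of one document by the ensemble probability
X_doc = np.array([[0.4, 0.6, 1.2, 0.9],
                  [0.8, 0.7, 2.0, 1.6]])
scores = clf.predict_proba(X_doc)[:, 1]
print("answer sentence index:", int(np.argmax(scores)))
```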

Evaluation Metrics We consider metrics that evaluate systems that return a ranked list of candidate answers: mean average precision (MAP), mean reciprocal rank (MRR), and accuracy (ACC).

Results Table 4 gives the results for the test sets of NewsQA and WikiQA, and the original validation sets of SQuAD and MSMarco. Our first observation is that XNET outperforms PAIRCNN, supporting our claim that it is beneficial to read the whole document in order to make decisions, instead of only observing each candidate in isolation.

Secondly, we can observe that ISF is indeed a strong baseline that outperforms XNET. This means that just "reading" the document using a vanilla version of XNET is not sufficient, and help is required through a coarse filtering. Indeed, we observe that XNET+ outperforms all baselines except for COMPAGGR. Our ensemble model LRXNET can ultimately surpass COMPAGGR on the majority of the datasets.

This consistent behavior validates the machine reading capabilities and the improved document representation with external features of our model for answer selection. Specifically, the combination of document reading and word overlap features is required to be done in a soft manner, using a classification technique. Using it as a hard constraint, with XNETTOPK, does not achieve the best result. We believe that often the ISF score is a better indicator of answer presence in the vicinity of a certain candidate instead of in the candidate itself. As such, XNET+ is capable of using this feature in datasets with richer context.

It is worth noting that the improvement gained by LRXNET over the state of the art follows a pattern. For the SQuAD dataset, the results are comparable (less than 1%). However, the improvement for WikiQA reaches ∼3% and then the gap shrinks again for NewsQA, with an improvement of ∼1%. This could be explained by the fact that each sample of SQuAD is a paragraph, compared to an article summary for WikiQA, and to an entire article for NewsQA. Hence, we further strengthen our hypothesis that a richer context is needed to achieve better results, in this case expressed as document length, but as the length of the context increases the limitation of sequential models to learn from long rich sequences arises.⁷

⁷ See the supplementary material for an example supporting our hypothesis.

Interestingly, our model lags behind COMPAGGR on the MSMarco dataset. It turns out this is due to contextual independence between candidates in the MSMarco dataset, i.e., each candidate is a stand-alone paragraph in this dataset, in contrast to contextually dependent candidate sentences from a document in the NewsQA, SQuAD and WikiQA datasets. As a result, our models (XNET+ and LRXNET) with document reading abilities perform poorly. This can be observed by the fact that XNET and PAIRCNN obtain comparable results. COMPAGGR performs better because comparing each candidate independently is a better strategy.

5 Conclusion

We describe an approach to model documents while incorporating external information that informs the representations learned for the sentences in the document. We implement our approach through an attention mechanism of a neural network architecture for modeling documents.

Our experiments with extractive document summarization and answer selection tasks validate our model in two ways: first, we demonstrate that external information is important to guide document modeling for natural language understanding tasks. Our model uses image captions and the title of the document for document summarization, and the query with word overlap features for answer selection, and outperforms its counterparts that do not use this information. Second, our external attention mechanism successfully guides the learning of the document representation for the relevant end goal. For answer selection, we show that inserting the query with word overlap features using our external attention mechanism outperforms state-of-the-art systems that naturally also have access to this information.

Acknowledgments

We thank Jianpeng Cheng for providing us with the CNN dataset and the implementation of PointerNet. We also thank the members of the Edinburgh NLP group for participating in our human evaluation experiments. This work greatly benefitted from discussions with Jianpeng Cheng, Annie Louis, Pedro Balage, Alfonso Mendes, Sebastião Miranda, and members of the Edinburgh NLP group. We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, California, USA.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. New York, USA, pages 2754–2760.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, pages 484–494.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 1724–1734.

Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. CoRR abs/1602.03609.
Harold P. Edmundson. 1969. New methods in automatic extracting. Journal of the Association for Computing Machinery 16(2):264–285.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 360–368.

Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. 2016. Contextual LSTM (CLSTM) models for large scale NLP tasks. CoRR abs/1602.06291.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28. pages 1693–1701.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2017. Reinforced mnemonic reader for machine comprehension. CoRR abs/1705.02798.

Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. CoRR abs/1511.03962.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar, pages 1746–1751.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, Arizona USA, pages 2741–2749.

Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, Washington, USA, pages 406–407.

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, pages 1106–1115.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Edmonton, Canada, pages 71–78.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 899–907.

Inderjeet Mani. 2001. Automatic Summarization. Natural Language Processing. John Benjamins Publishing Company.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In Proceedings of the Spoken Language Technology Workshop. IEEE, pages 234–239.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 1400–1409.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, California USA, pages 3075–3081.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany, pages 280–290.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, US.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS Marco: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches, co-located with the 30th Annual Conference on Neural Information Processing Systems. Barcelona, Spain.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 2383–2392.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 379–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, pages 1073–1083.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27. pages 3104–3112.

Jiwei Tan and Xiaojun Wan. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, pages 1171–1181.

Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari. 2016. Inter-document contextual language model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, pages 762–766.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. CoRR abs/1611.09830.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28. pages 2692–2700.

Shuohang Wang and Jing Jiang. 2017. A compare-aggregate model for matching text sequences. In Proceedings of the 5th International Conference on Learning Representations. Toulon, France.

Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling with recurrent neural network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, pages 1319–1329.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver, Canada, pages 189–198.

Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016. Sentence similarity learning by lexical decomposition and composition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pages 1340–1349.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning. Vancouver, Canada, pages 271–280.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 2013–2018.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, pages 1480–1489.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4:259–272.
