Abstractive Sentence Summarization with Attentive Recurrent Neural Networks
Sumit Chopra (Facebook AI Research), Michael Auli (Facebook AI Research), Alexander M. Rush (Harvard SEAS)

Abstract

Abstractive Sentence Summarization generates a shorter version of a given sentence while attempting to preserve its meaning. We introduce a conditional recurrent neural network (RNN) which generates a summary of an input sentence. The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation. Our model relies only on learned features and is easy to train in an end-to-end fashion on large data sets. Our experiments show that the model significantly outperforms the recently proposed state-of-the-art method on the Gigaword corpus while performing competitively on the DUC-2004 shared task.

1 Introduction

Generating a condensed version of a passage while preserving its meaning is known as text summarization. Tackling this task is an important step towards natural language understanding. Summarization systems can be broadly classified into two categories. Extractive models generate summaries by cropping important segments from the original text and putting them together to form a coherent summary. Abstractive models generate summaries from scratch without being constrained to reuse phrases from the original text.

In this paper we propose a novel recurrent neural network for the problem of abstractive sentence summarization. Inspired by the recently proposed architectures for machine translation (Bahdanau et al., 2014), our model consists of a conditional recurrent neural network, which acts as a decoder to generate the summary of an input sentence, much like a standard recurrent language model. In addition, at every time step the decoder also takes a conditioning input which is the output of an encoder module. Depending on the current state of the RNN, the encoder computes scores over the words in the input sentence. These scores can be interpreted as a soft alignment over the input text, informing the decoder which part of the input sentence it should focus on to generate the next word. Both the decoder and encoder are jointly trained on a data set consisting of sentence-summary pairs. Our model can be seen as an extension of the recently proposed model for the same problem by Rush et al. (2015). While they use a feed-forward neural language model for generation, we use a recurrent neural network. Furthermore, our encoder is more sophisticated, in that it explicitly encodes the position information of the input words. Lastly, our encoder uses a convolutional network to encode input words. These extensions result in improved performance.

The main contribution of this paper is a novel convolutional attention-based conditional recurrent neural network model for the problem of abstractive sentence summarization. Empirically we show that our model beats the state-of-the-art systems of Rush et al. (2015) on multiple data sets. Particularly notable is the fact that even with a simple generation module, which does not use any extractive feature tuning, our model manages to significantly outperform their ABS+ system on the Gigaword data set and is comparable on the DUC-2004 task.
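As a rough illustration of the soft alignment described above, the following NumPy sketch computes attention weights over the input words and the resulting context vector for one decoding step. It is a schematic example only, not the paper's formulation: the dot-product scoring function, the assumption of one encoder vector per input word, and all names and sizes are invented for the illustration; the actual encoder is the convolutional attention-based module defined in §3.2.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract the max only for numerical stability
    return e / e.sum()

def attention_context(h_prev, enc):
    """Soft alignment over the input words for one decoding step.

    h_prev : previous decoder state, shape (d,)
    enc    : one encoder vector per input word, shape (M, d)
    Returns the attention weights over the M words and the context vector
    that conditions the decoder when it generates the next word.
    """
    scores = enc @ h_prev            # assumed dot-product scoring of every input word
    alpha = softmax(scores)          # soft alignment: where the decoder should focus
    c_t = alpha @ enc                # context vector: attention-weighted encoder output
    return alpha, c_t

# Toy usage: a 7-word input sentence encoded into random 4-dimensional vectors.
rng = np.random.default_rng(0)
M, d = 7, 4
enc = rng.normal(size=(M, d))
h_prev = rng.normal(size=d)
alpha, c_t = attention_context(h_prev, enc)
print(alpha.round(2), c_t.shape)     # the weights sum to 1; c_t has shape (4,)
```

At every time step the decoder receives such a context vector, together with its previous state and the previously generated word, as described formally in §3.1.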
2 Previous Work

While there is a large body of work for generating extractive summaries of sentences (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Clarke and Lapata, 2008; Filippova and Altun, 2013; Filippova et al., 2015), there has been much less research on abstractive summarization. A count-based noisy-channel machine translation model was proposed for the problem in Banko et al. (2000). The task of abstractive sentence summarization was later formalized around the DUC-2003 and DUC-2004 competitions (Over et al., 2007), where the TOPIARY system (Zajic et al., 2004) was the state-of-the-art. More recently Cohn and Lapata (2008) and later Woodsend et al. (2010) proposed systems which made heavy use of the syntactic features of the sentence-summary pairs. Later, along the lines of Banko et al. (2000), MOSES was used directly as a method for text simplification by Wubben et al. (2012). Other works which have recently been proposed for the problem of sentence summarization include (Galanis and Androutsopoulos, 2010; Napoles et al., 2011; Cohn and Lapata, 2013). Very recently Rush et al. (2015) proposed a neural attention model for this problem using a new data set for training and showing state-of-the-art performance on the DUC tasks. Our model can be seen as an extension of their model.

3 Attentive Recurrent Architecture

Let x denote the input sentence consisting of a sequence of M words x = [x_1, ..., x_M], where each word x_i is part of a vocabulary V of size |V| = V. Our task is to generate a target sequence y = [y_1, ..., y_N] of N words, where N < M, such that the meaning of x is preserved: y = argmax_y P(y | x), where y is a random variable denoting a sequence of N words.

Typically the conditional probability is modeled by a parametric function with parameters θ: P(y | x) = P(y | x; θ). Training involves finding the θ which maximizes the conditional probability of sentence-summary pairs in the training corpus. If the model is trained to generate the next word of the summary, given the previous words, then the above conditional can be factorized into a product of individual conditional probabilities:

  P(y | x; θ) = ∏_{t=1}^{N} p(y_t | {y_1, ..., y_{t-1}}, x; θ).    (1)

In this work we model this conditional probability using an RNN Encoder-Decoder architecture, inspired by Cho et al. (2014) and subsequently extended in Bahdanau et al. (2014). We call our model RAS (Recurrent Attentive Summarizer).

3.1 Recurrent Decoder

The above conditional is modeled using an RNN:

  P(y_t | {y_1, ..., y_{t-1}}, x; θ) = P_t = g_{θ1}(h_t, c_t),

where h_t is the hidden state of the RNN:

  h_t = g_{θ1}(y_{t-1}, h_{t-1}, c_t).

Here c_t is the output of the encoder module (detailed in §3.2). It can be seen as a context vector which is computed as a function of the current state h_{t-1} and the input sequence x.

Our Elman RNN takes the following form (Elman, 1990):

  h_t = σ(W_1 y_{t-1} + W_2 h_{t-1} + W_3 c_t)
  P_t = ρ(W_4 h_t + W_5 c_t),

where σ is the sigmoid function, ρ is the softmax, defined as ρ(o_t) = e^{o_t} / Σ_j e^{o_j}, and W_i (i = 1, ..., 5) are matrices of learnable parameters of sizes W_{1,2,3} ∈ R^{d×d} and W_{4,5} ∈ R^{d×V}.

The LSTM decoder is defined as (Hochreiter and Schmidhuber, 1997):

  i_t = σ(W_1 y_{t-1} + W_2 h_{t-1} + W_3 c_t)
  i′_t = tanh(W_4 y_{t-1} + W_5 h_{t-1} + W_6 c_t)
  f_t = σ(W_7 y_{t-1} + W_8 h_{t-1} + W_9 c_t)
  o_t = σ(W_{10} y_{t-1} + W_{11} h_{t-1} + W_{12} c_t)
  m_t = m_{t-1} ⊙ f_t + i_t ⊙ i′_t
  h_t = m_t ⊙ o_t
  P_t = ρ(W_{13} h_t + W_{14} c_t),

where ⊙ refers to component-wise multiplication, and W_i (i = 1, ..., 14) are matrices of learnable parameters of sizes W_{1,...,12} ∈ R^{d×d} and W_{13,14} ∈ R^{d×V}.
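To make these decoder equations concrete, here is a minimal NumPy sketch of a single step of the Elman and LSTM decoders defined above. It is an illustrative sketch rather than the authors' implementation: the toy sizes, the random initialisation, and all function and variable names are assumptions; y_{t-1} is treated as the d-dimensional embedding of the previously generated word, and the output matrices are stored with shape (V, d) (the transpose of the d×V sizes quoted above) so that a plain matrix-vector product yields vocabulary logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 10   # toy hidden/embedding size and vocabulary size (illustrative values)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(o):
    e = np.exp(o - o.max())      # subtract the max only for numerical stability
    return e / e.sum()

# Elman decoder parameters: W1..W3 map d -> d, W4..W5 produce vocabulary logits.
We = {i: rng.normal(scale=0.1, size=(d, d)) for i in (1, 2, 3)}
We.update({i: rng.normal(scale=0.1, size=(V, d)) for i in (4, 5)})

def elman_step(y_prev, h_prev, c_t):
    """One Elman decoder step: returns the new state h_t and the distribution P_t."""
    h_t = sigmoid(We[1] @ y_prev + We[2] @ h_prev + We[3] @ c_t)
    P_t = softmax(We[4] @ h_t + We[5] @ c_t)
    return h_t, P_t

# LSTM decoder parameters: W1..W12 map d -> d, W13..W14 produce vocabulary logits.
Wl = {i: rng.normal(scale=0.1, size=(d, d)) for i in range(1, 13)}
Wl.update({i: rng.normal(scale=0.1, size=(V, d)) for i in (13, 14)})

def lstm_step(y_prev, h_prev, m_prev, c_t):
    """One LSTM decoder step; m_t is the memory cell and * is component-wise (⊙)."""
    i_t = sigmoid(Wl[1] @ y_prev + Wl[2] @ h_prev + Wl[3] @ c_t)       # input gate
    i_c = np.tanh(Wl[4] @ y_prev + Wl[5] @ h_prev + Wl[6] @ c_t)       # candidate i'_t
    f_t = sigmoid(Wl[7] @ y_prev + Wl[8] @ h_prev + Wl[9] @ c_t)       # forget gate
    o_t = sigmoid(Wl[10] @ y_prev + Wl[11] @ h_prev + Wl[12] @ c_t)    # output gate
    m_t = m_prev * f_t + i_t * i_c
    h_t = m_t * o_t
    P_t = softmax(Wl[13] @ h_t + Wl[14] @ c_t)
    return h_t, m_t, P_t

# Toy usage: random vectors stand in for the embedded previous word y_{t-1},
# the previous decoder states, and the encoder context c_t.
y_prev, h_prev, m_prev, c_t = (rng.normal(size=d) for _ in range(4))
h_t, P_t = elman_step(y_prev, h_prev, c_t)
h_t, m_t, P_t = lstm_step(y_prev, h_prev, m_prev, c_t)
print(P_t.shape, round(float(P_t.sum()), 6))   # (10,) 1.0
```

In the full model the context vector c_t comes from the attentive encoder of §3.2 and is recomputed at every step from the current decoder state and the input sentence.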
3.2 Attentive Encoder

We now give the details of the encoder which computes the context vector c_t for every time step t of the decoder above. With a slight overload of notation, for an input sentence x we denote by x_i the d-dimensional learnable embedding of the i-th word (x_i ∈ R^d). In addition, the position i of the word x_i is also associated with a learnable embedding l_i of size d (l_i ∈ R^d). Then the full embedding for the i-th word in x is given by a_i = x_i + l_i. Let us denote by B^k ∈ R^{q×d} a learnable weight matrix which is used to convolve over the full embeddings of consecutive words. Let there be d such matrices (k ∈ {1, ..., d}). The output of convolution is given by:

  z_{ik} = Σ_{h=−q/2}^{q/2} a_{i+h} · b^k_{q/2+h},    (2)

where b^k_j is the j-th column of the matrix B^k. Thus the d-dimensional aggregate embedding vector z_i is defined as z_i = [z_{i1}, ..., z_{id}]. Note that each word x_i in the input sequence is associated with one aggregate embedding vector z_i.

The model is trained by minimizing the negative conditional log likelihood of the training data with respect to θ:

  L = − Σ_{i=1}^{S} Σ_{t=1}^{N} log P(y^i_t | {y^i_1, ..., y^i_{t-1}}, x^i; θ),    (5)

where the parameters θ constitute the parameters of the decoder and the encoder.

Once the parametric model is trained we generate a summary for a new sentence x through a word-based beam search such that P(y | x) is maximized, argmax P(y_t | {y_1, ..., y_{t-1}}, x). The search is parameterized by the number of paths k that are pursued at each time step.

4 Experimental Setup

4.1 Datasets and Evaluation

Our models are trained on the annotated version of the Gigaword corpus (Graff et al., 2003; Napoles et al., 2012) and we use only the annotations for tokenization and sentence separation while discarding other annotations such as tags and parses. We pair the first sentence of each article with its head-