COMP 786 (Fall 2020)
Natural Language Processing
Week 9: Summarization; Guest Talk; Machine Translation 1
Mohit Bansal
(various slides adapted/borrowed from courses by Dan Klein, Jurafsky & Martin (SLP3), Manning/Socher, others)

Automatic Document Summarization

Statistical NLP Spring 2011, Lecture 25: Summarization
Dan Klein – UC Berkeley
Single-Document Summarization
Full document to a salient, non-redundant summary of ~100 words
Multi-Document Summarization

Several news sources with articles on the same topic (can use overlapping info across articles as a good feature for summarization)
… 27,000+ more
Extractive Summarization
Directly selecting existing sentences from input document instead of rewriting them
Selection
mid-'90s to present: Maximum Marginal Relevance • Graph algorithms

Graph-based Extractive Summarization
Nodes are sentences; edges are similarities; the stationary distribution of a random walk over this graph represents node (sentence) centrality.
[Mihalcea et al., 2004, 2005; inter alia]
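A minimal sketch of this graph-based idea (TextRank/LexRank-style; the sentence list, TF-IDF cosine similarity, and top-k selection are illustrative assumptions, not the exact model in the cited work):

```python
# Hedged sketch: graph-based extractive selection (TextRank/LexRank style).
# Assumes `sentences` is a list of strings; similarity = TF-IDF cosine.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, top_k=3):
    tfidf = TfidfVectorizer().fit_transform(sentences)  # sentence x term matrix
    sim = cosine_similarity(tfidf)                       # pairwise sentence similarities
    np.fill_diagonal(sim, 0.0)                           # drop self-similarity edges
    graph = nx.from_numpy_array(sim)                     # nodes = sentences, edges = similarities
    scores = nx.pagerank(graph)                          # stationary distribution = centrality
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:top_k]]
```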
Selection
mid-'90s to present: Maximum Marginal Relevance • Graph algorithms • Word distribution models

Word distribution models: the summary's word distribution should approximate the input document's word distribution.

  w         P_D(w)          w         P_A(w)
  Obama     0.017     ~     Obama     ?
  speech    0.024           speech    ?
  health    0.009           health    ?
  Montana   0.002           Montana   ?

  Input document distribution         Summary distribution
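The word-distribution idea can be made concrete with a small greedy sketch (an assumption in the spirit of KL-based selection, not necessarily the model on the slide): pick sentences so that the summary's unigram distribution stays close to the document's, measured by KL divergence.

```python
# Hedged sketch: word-distribution-based selection (KL-style greedy).
# Assumes `doc_sentences` is a list of token lists; the summary tries to
# minimize KL(P_D || P_A) between document and summary unigram distributions.
from collections import Counter
import math

def unigram_dist(token_lists):
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, eps=1e-6):
    # KL(P || Q), smoothing words that are missing from the summary
    return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

def kl_greedy_summary(doc_sentences, max_words=100):
    p_doc = unigram_dist(doc_sentences)
    summary, remaining = [], list(doc_sentences)
    while remaining and sum(len(s) for s in summary) < max_words:
        best = min(remaining,
                   key=lambda s: kl_divergence(p_doc, unigram_dist(summary + [s])))
        summary.append(best)
        remaining.remove(best)
    return summary
```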
Selection: Maximize Concept Coverage [Gillick and Favre, 2008]

  s1: The health care bill is a major test for the Obama administration.
  s2: Universal health care is a divisive issue.
  s3: President Obama remained calm.
  s4: Obama addressed the House on Tuesday.

  Concept values: obama = 3, health = 2, house = 1

  Length limit: 18 words
    greedy:   {s1, s3}       value 5
    optimal:  {s2, s3, s4}   value 6
[Gillick and Favre, 2009]
Maximize Concept Coverage: a set-coverage optimization problem

  maximize over extractive summaries s ∈ S(D):   Σ_{c ∈ C(s)} v_c

  where v_c = value of concept c, C(s) = set of concepts present in summary s, and S(D) = set of extractive summaries of document set D.
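A toy sketch of this objective on the example above (an illustration only: sentence texts and concept values come from the slide, but word-counting details may differ slightly, and exhaustive search stands in for the exact optimizer):

```python
# Hedged sketch: the concept-coverage objective on the toy example above.
# A summary's value is the summed value of the *distinct* concepts it covers,
# subject to a word-length limit.
from itertools import combinations

sentences = {
    "s1": "The health care bill is a major test for the Obama administration.",
    "s2": "Universal health care is a divisive issue.",
    "s3": "President Obama remained calm.",
    "s4": "Obama addressed the House on Tuesday.",
}
concept_values = {"obama": 3, "health": 2, "house": 1}

def length(subset):
    return sum(len(sentences[s].split()) for s in subset)

def value(subset):
    covered = {c for s in subset for c in concept_values if c in sentences[s].lower()}
    return sum(concept_values[c] for c in covered)

# "Optimal" extractive summary under an 18-word limit, by exhaustive search.
feasible = [sub for r in range(1, 5) for sub in combinations(sentences, r)
            if length(sub) <= 18]
best = max(feasible, key=value)
print(best, value(best))  # covers obama + health + house = 6; the slide's greedy pick {s1, s3} only reaches 5
```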
Results:

  System                     Bigram Recall    Pyramid
  Baseline                   4.00             23.5
  Gillick & Favre (2009)     6.85             35.0

[Gillick and Favre, 2009]
Maximize Concept Coverage: an Integer Linear Program for the maximum coverage model [Gillick, Riedhammer, Favre, Hakkani-Tur, 2008]

The model can be solved with an integer linear program that maximizes total concept value, subject to:
• a summary length limit
• consistency constraints between selected sentences and concepts

Here c_i is an indicator for the presence of concept i in the summary, and s_j an indicator for the presence of sentence j in the summary; Occ_ij indicates the occurrence of concept i in sentence j. The consistency constraints ensure the logical consistency of the solution: selecting a sentence necessitates selecting all the concepts it contains, and selecting a concept is only possible if it is present in at least one selected sentence.
[Gillick et al., 2008] [Gillick and Favre, 2009]
This ILP is tractable for problems of reasonable size. [Gillick and Favre, 2009]
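A hedged sketch of this ILP using the PuLP solver (an illustration of the formulation described above with generic inputs, not the authors' implementation):

```python
# Hedged sketch: the concept-coverage ILP with PuLP.
# w[i] = value of concept i, l[j] = length of sentence j,
# occ[i][j] = 1 if concept i occurs in sentence j, L = length limit.
import pulp

def concept_coverage_ilp(w, l, occ, L):
    n_concepts, n_sents = len(w), len(l)
    prob = pulp.LpProblem("max_concept_coverage", pulp.LpMaximize)
    c = [pulp.LpVariable(f"c_{i}", cat="Binary") for i in range(n_concepts)]
    s = [pulp.LpVariable(f"s_{j}", cat="Binary") for j in range(n_sents)]

    # Objective: total value of covered concepts
    prob += pulp.lpSum(w[i] * c[i] for i in range(n_concepts))
    # Summary length limit
    prob += pulp.lpSum(l[j] * s[j] for j in range(n_sents)) <= L
    # Consistency: selecting a sentence selects all its concepts ...
    for i in range(n_concepts):
        for j in range(n_sents):
            if occ[i][j]:
                prob += s[j] <= c[i]
        # ... and a selected concept must appear in some selected sentence
        prob += pulp.lpSum(occ[i][j] * s[j] for j in range(n_sents)) >= c[i]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [j for j in range(n_sents) if s[j].value() == 1]
```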
Problems with Extraction
What would a human do? It is therefore unsurprising that Lindsay pleaded not guilty yesterday afternoon to the charges filed against her, according to her publicist.
If you had to write a concise summary, making effective use of the 100-word limit, you would remove some information from the lengthy sentences in the original article.

Beyond Extraction: Compression
Sentence Rewriting
The model should learn the subtree deletions/cuts that allow compression.
[Berg-Kirkpatrick, Gillick, and Klein, 2011]
New optimization problem: Safe Deletions

  maximize over summaries s:   Σ_{c ∈ C(s)} v_c  +  Σ_{d ∈ D(s)} v_d

  where v_d = value of deletion d and D(s) = set of branch-cut deletions made in creating summary s.

How do we know how much a given deletion costs?
[Berg-Kirkpatrick, Gillick, and Klein 11]
The new optimization problem maximizes the concept values as well as the safe-deletion values in the candidate summary. To decide the value/cost of a deletion, we define relevant deletion features, and the model learns their weights.
[Berg-Kirkpatrick, Gillick, and Klein, 2011]
Some example features for concept bigrams and cuts/deletions:

Bigram Features f(b):
• COUNT: Bucketed document counts
• STOP: Stop word indicators
• POSITION: First document position indicators
• CONJ: All two- and three-way conjunctions of the above
• BIAS: Always one

Cut Features f(c):
• COORD: Coordinated phrase, four versions: NP, VP, S, SBAR
• S-ADJUNCT: Adjunct to matrix verb, four versions: CC, PP, ADVP, SBAR
• REL-C: Relative clause indicator
• ATTR-C: Attribution clause indicator
• ATTR-PP: PP attribution indicator
• TEMP-PP: Temporal PP indicator
• TEMP-NP: Temporal NP indicator
• BIAS: Always one
[Berg-Kirkpatrick et al., 2011]

Neural Abstractive Summarization
Mostly based on sequence-to-sequence RNN models
Later added attention, coverage, pointer/copy mechanisms, hierarchical encoders/attention, metric-reward RL, etc.
Examples: Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017

Baseline: Encoder-Decoder RNN with Attention and Large Vocabulary Trick [Nallapati et al., 2016]

The baseline corresponds to the neural machine translation model of Bahdanau et al. (2014): a bidirectional GRU-RNN encoder, a uni-directional GRU-RNN decoder with the same hidden-state size, an attention mechanism over the source hidden states, and a softmax layer over the target vocabulary to generate words. The large vocabulary trick (LVT) of Jean et al. (2014) is adapted to summarization: the decoder vocabulary of each mini-batch is restricted to words in the source documents of that batch, and the most frequent words in the target dictionary are added until the vocabulary reaches a fixed size. This reduces the size of the decoder's softmax layer (the main computational bottleneck) and speeds up convergence by focusing the modeling effort on the words essential to a given example; it suits summarization well, since a large proportion of summary words come from the source document anyway.

Feature-Augmented Encoder-Decoder

A key challenge in summarization is identifying the key concepts and key entities in the document, around which the story revolves. To do so, the input representation goes beyond word embeddings and captures additional linguistic features such as part-of-speech tags, named-entity tags, and TF and IDF statistics of the words. Additional look-up embedding matrices are created for the vocabulary of each tag type; continuous features such as TF and IDF are discretized into a fixed number of bins and represented as one-hot bin indicators, so they can be embedded like any other tag type. For each word in the source document, the embeddings of all its associated tags are looked up and concatenated into a single long vector (Figure 1). On the target side, only word-based embeddings are used.

Figure 1 [Nallapati et al., 2016]: Feature-rich encoder — one embedding vector each for POS and NER tags and discretized TF and IDF values, concatenated with word-based embeddings as input to the encoder.
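A minimal PyTorch-style sketch of this feature-rich input representation (an illustration under assumed dimensions and module names, not the authors' code):

```python
# Hedged sketch: feature-augmented encoder input (word + POS + NER + TF/IDF-bin
# embeddings concatenated per token), in the spirit of Nallapati et al. (2016).
import torch
import torch.nn as nn

class FeatureRichEmbedder(nn.Module):
    def __init__(self, vocab, pos_tags, ner_tags, n_bins=10, d_word=128, d_tag=16):
        super().__init__()
        self.word = nn.Embedding(vocab, d_word)
        self.pos = nn.Embedding(pos_tags, d_tag)
        self.ner = nn.Embedding(ner_tags, d_tag)
        self.tf = nn.Embedding(n_bins, d_tag)    # discretized TF bucket
        self.idf = nn.Embedding(n_bins, d_tag)   # discretized IDF bucket

    def forward(self, words, pos, ner, tf_bin, idf_bin):
        # Each input: LongTensor of shape (batch, src_len); output is the
        # concatenated per-token vector fed to the bidirectional GRU encoder.
        return torch.cat([self.word(words), self.pos(pos), self.ner(ner),
                          self.tf(tf_bin), self.idf(idf_bin)], dim=-1)

# The concatenated embedding would then feed a bidirectional GRU encoder, e.g.:
# encoder = nn.GRU(128 + 4 * 16, hidden_size=256, bidirectional=True, batch_first=True)
```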
Generation + Copying: Switching Generator-Pointer

Keywords or named entities that are central to a test document's summary may be rare or unseen in the training data. Since the decoder vocabulary is fixed at training time, it cannot emit such words; the common fix of emitting an 'UNK' placeholder does not yield legible summaries. In summarization, an intuitive alternative is simply to point to the word's location in the source document. The switching decoder/pointer architecture (Figure 2) equips the decoder with a 'switch' that decides between the generator and a pointer at every time step: if the switch is on, the decoder produces a word from its target vocabulary as usual; if it is off, the decoder generates a pointer to a word position in the source, and the word at that position is copied into the summary. The switch is modeled as a sigmoid activation over a linear layer on the entire available context at each time step:
  P(s_i = 1) = σ( v^s · (W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s) )

where P(s_i = 1) is the probability of the switch turning on at the i-th decoder time step, h_i is the decoder hidden state, E[o_{i-1}] is the embedding of the emission from the previous time step, c_i is the attention-weighted context vector, and W_h^s, W_e^s, W_c^s, b^s and v^s are the switch parameters. The attention distribution over source word positions is used as the distribution from which the pointer is sampled:
  P_i^a(j) ∝ exp( v^a · (W_h^a h_{i-1} + W_e^a E[o_{i-1}] + W_c^a h_j^d + b^a) ),
  p_i = argmax_j P_i^a(j),  j ∈ {1, ..., N_d},

where p_i is the pointer value at the i-th word position of the summary, sampled from the attention distribution P_i^a over the document word positions, P_i^a(j) is the probability of the i-th decoder time step pointing to the j-th document position, and h_j^d is the encoder hidden state at position j.

At training time, the model is given explicit pointer information whenever the summary word does not exist in the target vocabulary; when the OOV word occurs at multiple document positions, the tie is broken in favor of its first occurrence. Training optimizes the conditional log-likelihood (with additional regularization penalties):

  log P(y | x) = Σ_i [ g_i log( P(y_i | y_{-i}, x) P(s_i) ) + (1 − g_i) log( P(p(i) | y_{-i}, x) (1 − P(s_i)) ) ],

where y and x are the summary and document words respectively, and g_i is an indicator that is set to 0 whenever the word at position i of the summary is OOV with respect to the decoder vocabulary. At test time, the model decides automatically at each time step whether to generate or to point, based on the estimated switch probability P(s_i), taking the argmax of the posterior probability of generating vs. pointing. The pointer mechanism can be more robust in handling rare words because it uses the encoder's contextual hidden-state representation of the rare word to decide which document word to point to; since the hidden state depends on the entire context, the model can point accurately to unseen words even though they are not in the target vocabulary (even when the word is also missing from the source vocabulary, the contextual encoding of the corresponding 'UNK' token can identify the correct source position, and the source token can then be displayed in the summary).

Figure 2 [Nallapati et al., 2016]: Switching generator/pointer model — when the switch shows 'G', the traditional generator (softmax layer) produces a word; when it shows 'P', the pointer network is activated to copy the word from one of the source document positions; when the pointer is activated, the source-side embedding is used as input for the next time step.
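A hedged PyTorch-style sketch of one decode step with the generate-vs-copy switch (module names, shapes, and the simplified attention form are assumptions for illustration; it is not the paper's exact parameterization):

```python
# Hedged sketch: one decode step of a switching generator/pointer decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchingDecoderStep(nn.Module):
    def __init__(self, d_hid, d_emb, vocab_size):
        super().__init__()
        self.gen = nn.Linear(d_hid + d_hid, vocab_size)     # generator softmax layer
        self.switch = nn.Linear(d_hid + d_emb + d_hid, 1)   # switch on [h_i; E[o_{i-1}]; c_i]
        self.attn = nn.Linear(d_hid, d_hid)                 # simple bilinear-style attention

    def forward(self, h_i, prev_emb, enc_states):
        # enc_states: (src_len, d_hid) encoder hidden states h_j^d
        attn_scores = enc_states @ self.attn(h_i)            # (src_len,)
        attn_dist = F.softmax(attn_scores, dim=0)            # pointer / attention distribution
        context = attn_dist @ enc_states                      # attention-weighted context c_i

        p_gen = torch.sigmoid(self.switch(torch.cat([h_i, prev_emb, context])))  # P(s_i = 1)
        vocab_dist = F.softmax(self.gen(torch.cat([h_i, context])), dim=0)

        if p_gen > 0.5:          # generate from the target vocabulary
            return "gen", vocab_dist.argmax().item()
        else:                    # point: copy the source word at the argmax attention position
            return "copy", attn_dist.argmax().item()
```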
Hierarchical Attention

In datasets where the source document is very long, it is important to identify not only the keywords but also the key sentences from which the summary can be drawn. This model captures these two levels of importance with two bidirectional RNNs on the source side, one at the word level and the other at the sentence level, with the attention mechanism operating at both levels simultaneously. The word-level attention is re-weighted by the corresponding sentence-level attention and re-normalized:

  P^a(j) = P_w^a(j) P_s^a(s(j)) / Σ_{k=1..N_d} P_w^a(k) P_s^a(s(k)),

where P_w^a(j) is the word-level attention weight at the j-th position of the source document, s(j) is the ID of the sentence at word position j, P_s^a(l) is the sentence-level attention weight for the l-th sentence in the source, N_d is the number of words in the source document, and P^a(j) is the re-scaled attention at the j-th word position. The re-scaled attention is used to compute the attention-weighted context vector that feeds the decoder hidden state. Additional positional embeddings are concatenated to the sentence-level RNN hidden states to model the positional importance of sentences. The architecture therefore jointly models key sentences as well as keywords within those sentences (a graphical representation is shown in Figure 3 of the paper).

Related work [Nallapati et al., 2016]: earlier summarization work was largely extractive (Erkan and Radev, 2004; Litvak and Last, 2008; Wong et al., 2008; Riedhammer and Hakkani-Tur, 2010; Ribeiro, 2013; Filippova and Altun, 2013; Colmenares et al., 2015; inter alia), whereas humans tend to paraphrase the original story in their own words, so human summaries are abstractive in nature and seldom reproduce original sentences. The abstractive task was standardized through the DUC-2003 and DUC-2004 competitions, whose data consists of news stories with multiple human reference summaries per story. The best DUC-2004 system, TOPIARY (Zajic et al., 2004), combined linguistically motivated compression with an unsupervised topic-detection algorithm that appends keywords extracted from the article onto the compressed output. Other notable abstractive work includes phrase-table-based machine translation approaches (Banko et al., 2000), compression using weighted tree-transformation rules (Cohn and Lapata, 2008), and quasi-synchronous grammar approaches (Woodsend et al., 2010). With the emergence of deep learning as a viable alternative for many NLP tasks (Collobert et al., 2011), fully data-driven approaches became attractive: Rush et al. (2015) use convolutional models to encode the source and a context-sensitive attentional feed-forward network to generate the summary, producing state-of-the-art results.
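A small numerical sketch of this hierarchical re-scaling (toy attention values, assumed purely for illustration):

```python
# Hedged sketch: re-scale word-level attention by sentence-level attention, then re-normalize.
import numpy as np

word_attn = np.array([0.10, 0.30, 0.20, 0.25, 0.15])   # P_w^a(j), toy values
sent_of_word = np.array([0, 0, 1, 1, 1])                # s(j): sentence ID of each word position
sent_attn = np.array([0.70, 0.30])                      # P_s^a(l), toy values

rescaled = word_attn * sent_attn[sent_of_word]          # P_w^a(j) * P_s^a(s(j))
rescaled /= rescaled.sum()                              # re-normalize over all N_d word positions
print(rescaled)  # words in the higher-attention sentence get boosted
```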