COMP 786 (Fall 2020) Natural Language Processing
Week 9: Summarization; Guest Talk; Machine Translation
Mohit Bansal (various slides adapted/borrowed from courses by Dan Klein, Jurafsky & Martin's SLP3, Manning/Socher, and others)

Automatic Document Summarization
(Several of the following slides are from Statistical NLP, Spring 2011, Lecture 25: Summarization, Dan Klein, UC Berkeley.)

Single-Document Summarization
Condense a full document into a salient, non-redundant summary of roughly 100 words.

Multi-Document Summarization
Summarize several news sources carrying articles on the same topic (e.g., a story covered by 27,000+ articles); the information that overlaps across articles is itself a good feature for deciding what belongs in the summary.

Extractive Summarization
Directly select existing sentences from the input document(s) instead of rewriting them.

Selection methods (mid-1990s to present)
• Maximum Marginal Relevance (MMR)
• Graph-based extractive summarization: nodes are sentences, edges are pairwise similarities, and the stationary distribution of a random walk over the graph gives each node's centrality [Mihalcea et al., 2004, 2005; inter alia]
• Word distribution models: compare the word distribution P_D(w) of the input documents with the word distribution P_A(w) of the candidate summary, e.g.

      word      P_D(w)   P_A(w)
      Obama     0.017    ?
      speech    0.024    ?
      health    0.009    ?
      Montana   0.002    ?

  (input document distribution vs. summary distribution)

Selection: Maximize Concept Coverage [Gillick and Favre, 2008, 2009]
Toy example, with a summary length limit of 18 words:
  s1: The health care bill is a major test for the Obama administration.
  s2: Universal health care is a divisive issue.
  s3: President Obama remained calm.
  s4: Obama addressed the House on Tuesday.
Concept values: obama = 3, health = 2, house = 1.
Greedy selection yields {s1, s3} (concept value 5), while the optimal summary under the length limit is {s2, s3, s4} (concept value 6).

This is a set-coverage optimization problem:

  maximize over s ∈ S(D):  Σ_{c ∈ C(s)} v_c

where v_c is the value of concept c, C(s) is the set of concepts present in summary s, and S(D) is the set of extractive summaries of document set D.

Results [Gillick and Favre, 2009]:
  Metric          Baseline   2009 system
  Bigram Recall   4.00       6.85
  Pyramid         23.5       35.0

Integer Linear Program for the maximum-coverage model [Gillick, Riedhammer, Favre, and Hakkani-Tur, 2008; Gillick and Favre, 2009]
The problem can be solved with an integer linear program whose objective is the total concept value and whose constraints enforce the summary length limit and the consistency between selected sentences and concepts. Let c_i indicate the presence of concept i in the summary, s_j the presence of sentence j, and Occ_ij the occurrence of concept i in sentence j; w_i is the value of concept i, l_j the length of sentence j, and L the length limit:

  maximize    Σ_i w_i c_i                             (total concept value)
  subject to  Σ_j l_j s_j ≤ L                         (summary length limit)
              s_j Occ_ij ≤ c_i        for all i, j     (1)
              Σ_j s_j Occ_ij ≥ c_i    for all i        (2)
              c_i ∈ {0, 1},  s_j ∈ {0, 1}

Equations (1) and (2) ensure the logical consistency of the solution: selecting a sentence necessitates selecting all the concepts it contains, and selecting a concept is only possible if it is present in at least one selected sentence.
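A concrete way to see this ILP is to run the toy example through an off-the-shelf solver. The sketch below is not from the original papers: it assumes the PuLP package is installed and approximates Occ_ij by lowercase substring matching.

import pulp

sentences = {
    "s1": "The health care bill is a major test for the Obama administration.",
    "s2": "Universal health care is a divisive issue.",
    "s3": "President Obama remained calm.",
    "s4": "Obama addressed the House on Tuesday.",
}
concept_value = {"obama": 3, "health": 2, "house": 1}
length_limit = 18  # words

# Occ_ij: which concepts occur in which sentences (substring approximation).
occ = {j: {i for i in concept_value if i in text.lower()}
       for j, text in sentences.items()}

prob = pulp.LpProblem("max_concept_coverage", pulp.LpMaximize)
s = {j: pulp.LpVariable(f"s_{j}", cat="Binary") for j in sentences}      # sentence selected?
c = {i: pulp.LpVariable(f"c_{i}", cat="Binary") for i in concept_value}  # concept covered?

# Objective: total value of the concepts present in the summary.
prob += pulp.lpSum(concept_value[i] * c[i] for i in concept_value)
# Summary length limit (word counts play the role of the l_j terms).
prob += pulp.lpSum(len(t.split()) * s[j] for j, t in sentences.items()) <= length_limit
# (1) Selecting a sentence selects every concept it contains.
for j in sentences:
    for i in occ[j]:
        prob += s[j] <= c[i]
# (2) A concept may only be selected if some selected sentence contains it.
for i in concept_value:
    prob += c[i] <= pulp.lpSum(s[j] for j in sentences if i in occ[j])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = sorted(j for j in sentences if s[j].varValue == 1)
print(chosen, "concept value:", pulp.value(prob.objective))
# An optimal solution has concept value 6 (the slide's optimum is {s2, s3, s4}).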
Selection in practice [Gillick et al., 2008; Gillick and Favre, 2009]
This ILP is tractable for problems of reasonable size.

Problems with Extraction
What would a human do with a sentence like the following?
  "It is therefore unsurprising that Lindsay pleaded not guilty yesterday afternoon to the charges filed against her, according to her publicist."
If you had to write a concise summary, making effective use of the 100-word limit, you would remove some of the information in the lengthy sentences of the original article.

Beyond Extraction: Compression (Sentence Rewriting) [Berg-Kirkpatrick, Gillick, and Klein, 2011]
The model should learn which subtree deletions ("cuts") of a sentence's parse tree allow safe compression.

New optimization problem: safe deletions. The objective now maximizes the values of the concepts covered by the candidate summary plus the values of the safe deletions made in creating it:

  maximize over s ∈ S(D):  Σ_{c ∈ C(s)} v_c + Σ_{d ∈ D(s)} v_d

where v_d is the value of deletion d and D(s) is the set of branch-cut deletions made in creating summary s. How do we know how much a given deletion costs? We define deletion-relevant features, and the model learns their weights, so that a cut's value is the weighted sum of its features (a small sketch follows the feature list below).

Some example features for concept bigrams and for cuts/deletions [Berg-Kirkpatrick et al., 2011]:

  Bigram features f(b):
  • COUNT: bucketed document counts
  • STOP: stop-word indicators
  • POSITION: first-document-position indicators
  • CONJ: all two- and three-way conjunctions of the above
  • BIAS: always one

  Cut features f(c):
  • COORD: coordinated phrase (four versions: NP, VP, S, SBAR)
  • S-ADJUNCT: adjunct to the matrix verb (four versions: CC, PP, ADVP, SBAR)
  • REL-C: relative-clause indicator
  • ATTR-C: attribution-clause indicator
  • ATTR-PP: PP attribution indicator
  • TEMP-PP: temporal PP indicator
  • TEMP-NP: temporal NP indicator
  • BIAS: always one
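To make the joint objective concrete, the small sketch below scores a candidate summary as covered-concept value plus weighted cut-feature value. It is illustrative only: the function names, feature weights, and toy candidate are invented here, not taken from the paper.

def cut_value(cut_features, weights):
    """Value of one deletion: dot product of its indicator features f(c)
    with the learned weight vector w (features are binary indicators)."""
    return sum(weights.get(f, 0.0) for f in cut_features)

def candidate_score(covered_concepts, cuts, concept_values, weights):
    """Objective: covered-concept values plus safe-deletion values."""
    coverage = sum(concept_values[b] for b in covered_concepts)
    deletions = sum(cut_value(f, weights) for f in cuts)
    return coverage + deletions

# Toy candidate: covers two bigram concepts and compresses one sentence by
# cutting a temporal PP and an attribution clause (feature names from the
# table above; the weights are made up).
concept_values = {"health care": 2, "obama administration": 3}
weights = {"TEMP-PP": 0.4, "ATTR-C": 0.7, "REL-C": -0.3, "BIAS": 0.1}
cuts = [["TEMP-PP", "BIAS"], ["ATTR-C", "BIAS"]]
print(candidate_score({"health care", "obama administration"},
                      cuts, concept_values, weights))  # ≈ 6.3 (5 coverage + 1.3 cuts)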
Neural Abstractive Summarization
Mostly based on sequence-to-sequence RNN models; later work added attention, coverage, pointer/copy mechanisms, hierarchical encoders/attention, metric-reward reinforcement learning, etc. Examples: Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017.

Excerpt from Nallapati et al. (2016):

… Section 3 contextualizes our models with respect to closely related work on the topic of abstractive text summarization. We present the results of our experiments on three different data sets in Section 4. We also present some qualitative analysis of the output from our models in Section 5 before concluding the paper with remarks on our future direction in Section 6.

2 Models
In this section, we first describe the basic encoder-decoder RNN that serves as our baseline and then propose several novel models for summarization, each addressing a specific weakness in the baseline.

2.1 Encoder-Decoder RNN with Attention and Large Vocabulary Trick
Our baseline model corresponds to the neural machine translation model used in Bahdanau et al. (2014). The encoder consists of a bidirectional GRU-RNN (Chung et al., 2014), while the decoder consists of a uni-directional GRU-RNN with the same hidden-state size as that of the encoder, an attention mechanism over the source hidden states, and a soft-max layer over the target vocabulary to generate words. In the interest of space, we refer the reader to the original paper for a detailed treatment of this model. In addition to the basic model, we also adapted to the summarization problem the large vocabulary 'trick' (LVT) described in Jean et al. (2014). In our approach, the decoder vocabulary of each mini-batch is restricted to words in the source documents of that batch. In addition, the most frequent words in the target dictionary are added until the vocabulary reaches a fixed size. The aim of this technique is to reduce the size of the soft-max layer of the decoder, which is the main computational bottleneck. In addition, this technique also speeds up convergence by focusing the modeling effort only on the words that are essential to a given example.

Feature-Augmented Encoder-Decoder
… document, around which the story revolves. In order to accomplish this goal, we may need to go beyond the word-embeddings-based representation of the input document and capture additional linguistic features such as parts-of-speech tags, named-entity tags, and TF and IDF statistics of the words. We therefore create additional look-up based embedding matrices for the vocabulary of each tag-type, similar to the embeddings for words. For continuous features such as TF and IDF, we convert them into categorical values by discretizing them into a fixed number of bins, and use one-hot representations to indicate the bin number they fall into. This allows us to map them into an embeddings matrix like any other tag-type. Finally, for each word in the source document, we simply look up its embeddings from all of its associated tags and concatenate them into a single long vector, as shown in Fig. 1. On the target side, we continue to use only word-based embeddings as the representation.

Figure 1 (feature-rich encoder): one embedding vector each for the POS and NER tags and the discretized TF and IDF values, concatenated together with the word-based embeddings as input to the encoder. [Nallapati et al., 2016]

2.3 Modeling Rare/Unseen Words using Switching Generator-Pointer
Often-times in summarization, the keywords or named-entities in a test document that are central …
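Returning to the feature-rich encoder of Fig. 1: the sketch below, written in PyTorch with placeholder vocabulary sizes and dimensions (not the paper's settings), shows the per-token concatenation of word, POS, NER, and discretized TF/IDF embeddings feeding a bidirectional GRU encoder.

import torch
import torch.nn as nn

class FeatureRichEncoder(nn.Module):
    """Sketch of the feature-rich encoder of Nallapati et al. (2016): word,
    POS, NER, and discretized TF/IDF embeddings are looked up separately,
    concatenated per token, and fed to a bidirectional GRU. All sizes below
    are placeholders."""

    def __init__(self, n_words=50000, n_pos=45, n_ner=20, n_bins=10,
                 d_word=128, d_tag=16, d_hidden=256):
        super().__init__()
        self.word = nn.Embedding(n_words, d_word)
        self.pos = nn.Embedding(n_pos, d_tag)
        self.ner = nn.Embedding(n_ner, d_tag)
        self.tf = nn.Embedding(n_bins, d_tag)   # TF discretized into bins
        self.idf = nn.Embedding(n_bins, d_tag)  # IDF discretized into bins
        self.rnn = nn.GRU(d_word + 4 * d_tag, d_hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, words, pos, ner, tf_bins, idf_bins):
        # Each argument is a LongTensor of shape (batch, source_length).
        x = torch.cat([self.word(words), self.pos(pos), self.ner(ner),
                       self.tf(tf_bins), self.idf(idf_bins)], dim=-1)
        states, _ = self.rnn(x)  # source hidden states for the attention layer
        return states

# Toy usage with random indices for a batch of 2 documents of 7 tokens each.
enc = FeatureRichEncoder()
rand = lambda high: torch.randint(0, high, (2, 7))
out = enc(rand(50000), rand(45), rand(20), rand(10), rand(10))
print(out.shape)  # torch.Size([2, 7, 512]): forward/backward GRU states concatenated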