Summarization with a Joint Model for Sentence Extraction and Compression
André F. T. Martins∗† and Noah A. Smith∗
∗School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
†Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
{afm,nasmith}@cs.cmu.edu

Abstract

Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence. This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program. We report favorable experimental results on newswire data.

1 Introduction

Automatic text summarization dates back to the 1950s and 1960s (Luhn, 1958; Baxendale, 1958; Edmundson, 1969). Today, the proliferation of digital information makes research on summarization technologies more important than ever before. In the last two decades, machine learning techniques have been employed in extractive summarization of single documents (Kupiec et al., 1995; Aone et al., 1999; Osborne, 2002) and multiple documents (Radev and McKeown, 1998; Carbonell and Goldstein, 1998; Radev et al., 2000). Most of this work aims only to extract relevant sentences from the original documents and present them as the summary; this simplification of the problem yields scalable solutions.

Some attention has been devoted by the NLP community to the related problem of sentence compression (Knight and Marcu, 2000): given a long sentence, how to maximally compress it into a grammatical sentence that still preserves all the relevant information? While sentence compression is a promising framework with applications, for example, in headline generation (Dorr et al., 2003; Jin, 2003), little work has been done to include it as a module in document summarization systems. Most existing approaches (with some exceptions, like the vine-growth model of Daumé, 2006) use a two-stage architecture, either by first extracting a certain number of salient sentences and then feeding them into a sentence compressor, or by first compressing all sentences and extracting later. However, regardless of which operation is performed first—compression or extraction—two-step "pipeline" approaches may fail to find overall-optimal solutions; often the summaries are not better than the ones produced by extractive summarization. On the other hand, a pilot study carried out by Lin (2003) suggests that summarization systems that perform sentence compression have the potential to beat pure extractive systems if they model cross-sentence effects.

In this work, we address this issue by merging the tasks of sentence extraction and sentence compression into a global optimization problem. A careful design of the objective function encourages "sparse solutions," i.e., solutions that involve only a small number of sentences whose compressions are to be included in the summary. Our contributions are:

• We cast joint sentence extraction and compression as an integer linear program (ILP);

• We provide a new formulation of sentence compression using dependency parsing information that only requires a linear number of variables, and combine it with a bigram model;

• We show how the full model can be trained in a max-margin framework. Since a dataset of summaries comprised of extracted, compressed sentences is unavailable, we present a procedure that trains the compression and extraction models separately and tunes a parameter to interpolate the two models.

The compression model and the full system are compared with state-of-the-art baselines in standard newswire datasets. This paper is organized as follows: §2–3 provide an overview of our two building blocks, sentence extraction and sentence compression. §4 describes our method to perform one-step sentence compression and extraction. §5 shows experiments in newswire data. Finally, §6 concludes the paper and suggests future work.

2 Extractive summarization

Extractive summarization builds a summary by extracting a few informative sentences from the documents. Let D ≜ {t_1, . . . , t_M} be a set of sentences, contained in a single or in multiple related documents.¹ The goal is to extract the best sequence of sentences ⟨t_i1, . . . , t_iK⟩ that summarizes D whose total length does not exceed a fixed budget of J words. We describe some well-known approaches that will serve as our experimental baselines.

¹For simplicity, we describe a unified framework for single and multi-document summarization, although they may require specialized strategies. Here we experiment only with single-document summarization and assume t_1, . . . , t_M are ordered.

Extract the leading sentences (Lead). For single-document summarization, the simplest method consists of greedily extracting the leading sentences while they fit into the summary. A sentence is skipped if its inclusion exceeds the budget, and the next is examined. This performs extremely well in newswire articles, due to the journalistic convention of summarizing the article first.

Rank by relevance (Rel). This method ranks sentences by a relevance score, and then extracts the top ones that can fit into the summary. The score is typically a linear function of feature values:

score_rel(t_i) ≜ θ^⊤ f(t_i) = Σ_{d=1}^{D} θ_d f_d(t_i).   (1)

Here, each f_d(t_i) is a feature extracted from sentence t_i, and θ_d is the corresponding weight. In our experiments, relevance features include (i) the reciprocal position in the document, (ii) a binary feature indicating whether the sentence is the first one, and (iii) the 1-gram and 2-gram cosine similarity with the headline and with the full document.

Maximal Marginal Relevance (MMR). For long documents or large collections, it becomes important to penalize the redundancy among the extracted sentences. Carbonell and Goldstein (1998) proposed greedily adding sentences to the summary S to maximize, at each step, a score of the form

λ · score_rel(t_i) − (1 − λ) · score_red(t_i, S),   (2)

where score_rel(t_i) is as in Eq. 1 and score_red(t_i, S) accounts for the redundancy between t_i and the current summary S. In our experiments, redundancy is the 1-gram cosine similarity between the sentence t_i and the current summary S. The trade-off between relevance and redundancy is controlled by λ ∈ [0, 1], which is tuned on development data.

McDonald (2007) proposed a non-greedy variant of MMR that takes into account the redundancy between each pair of candidate sentences. This is cast as a global optimization problem:

Ŝ = arg max_S  λ · Σ_{t_i ∈ S} score_rel(t_i) − (1 − λ) · Σ_{t_i, t_j ∈ S} score_red(t_i, t_j),   (3)

where score_rel(t_i) ≜ θ_rel^⊤ f_rel(t_i), score_red(t_i, t_j) ≜ θ_red^⊤ f_red(t_i, t_j), and f_rel(t_i) and f_red(t_i, t_j) are feature vectors with corresponding learned weight vectors θ_rel and θ_red. He has shown how the relevance-based method and the MMR framework (in the non-greedy form of Eq. 3) can be cast as an ILP. By introducing indicator variables ⟨µ_i⟩_{i=1,...,M} and ⟨µ_ij⟩_{i,j=1,...,M} with the meanings

µ_i = 1 if t_i is to be extracted, and 0 otherwise;
µ_ij = 1 if t_i and t_j are both to be extracted, and 0 otherwise,   (4)

one can reformulate Eq. 3 as an ILP with O(M²) variables and constraints:

max_{⟨µ_i⟩,⟨µ_ij⟩}  λ · Σ_{i=1}^{M} µ_i score_rel(t_i) − (1 − λ) · Σ_{i=1}^{M} Σ_{j=1}^{M} µ_ij score_red(t_i, t_j),   (5)

subject to binary constraints µ_i, µ_ij ∈ {0, 1}, the length constraint Σ_{i=1}^{M} µ_i N_i ≤ J (where N_i is the number of words of the ith sentence), and the following "agreement constraints" for i, j = 1, . . . , M (that impose the logical relation µ_ij = µ_i ∧ µ_j):

µ_ij ≤ µ_i,   µ_ij ≤ µ_j,   µ_ij ≥ µ_i + µ_j − 1.   (6)

Let us provide a compact representation of the program in Eq. 5 that will be used later. Define our vector of parameters as θ ≜ [λθ_rel, −(1 − λ)θ_red]. Packing all the feature vectors (one for each sentence, and one for each pair of sentences) into a matrix F,

F ≜ [ F_rel  0 ; 0  F_red ],   (7)

with F_rel ≜ [f_rel(t_i)]_{1≤i≤M} and F_red ≜ [f_red(t_i, t_j)]_{1≤i,j≤M}, and stacking the variables µ_i and µ_ij into a vector µ, the program in Eq. 5 can be compactly written as

max_µ θ^⊤ F µ,   (8)

subject to binary and linear constraints on µ. This formulation requires O(M²) variables and constraints. If we do not penalize sentence redundancy, the redundancy term may be dropped; in this simpler case, F = F_rel, the vector µ only contains the variables ⟨µ_i⟩, and the program in Eq. 8 only requires O(M) variables and constraints. Our method (to be presented in §4) will build on this latter formulation.
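To make the ILP of Eqs. 5–6 concrete, the following is a minimal sketch using the PuLP modeling library (an assumption made here for illustration; the experiments in §5.1 use ILOG CPLEX through a Java interface). The relevance scores, redundancy scores, and sentence lengths are placeholder inputs.

```python
# Minimal sketch of the MMR-style extraction ILP (Eqs. 5-6), assuming PuLP.
# scores_rel[i], scores_red[i][j], and lengths[i] are placeholder inputs.
import pulp

def extract_ilp(scores_rel, scores_red, lengths, J, lam):
    M = len(scores_rel)
    prob = pulp.LpProblem("extraction", pulp.LpMaximize)
    mu = [pulp.LpVariable("mu_%d" % i, cat="Binary") for i in range(M)]
    mu_pair = {(i, j): pulp.LpVariable("mu_%d_%d" % (i, j), cat="Binary")
               for i in range(M) for j in range(M)}
    # Objective (Eq. 5): relevance minus pairwise redundancy.
    prob += (lam * pulp.lpSum(mu[i] * scores_rel[i] for i in range(M))
             - (1 - lam) * pulp.lpSum(mu_pair[i, j] * scores_red[i][j]
                                      for i in range(M) for j in range(M)))
    # Length budget: sum_i mu_i * N_i <= J.
    prob += pulp.lpSum(mu[i] * lengths[i] for i in range(M)) <= J
    # Agreement constraints (Eq. 6): mu_ij = mu_i AND mu_j.
    for i in range(M):
        for j in range(M):
            prob += mu_pair[i, j] <= mu[i]
            prob += mu_pair[i, j] <= mu[j]
            prob += mu_pair[i, j] >= mu[i] + mu[j] - 1
    prob.solve()
    return [i for i in range(M) if mu[i].value() > 0.5]
```

Dropping the redundancy term and the µ_ij variables reduces this to the O(M)-variable program of Eq. 8, which is the form reused in §4.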
3 Sentence Compression

Despite its simplicity, extractive summarization has a few shortcomings: for example, if the original sentences are too long or embed several clauses, there is no way of preventing lengthy sentences from appearing in the final summary. The sentence compression framework (Knight and Marcu, 2000) aims to select the best subsequence of words that still yields a short, informative and grammatical sentence. Such a sentence compressor is given a sentence t ≜ ⟨w_1, . . . , w_N⟩ as input and outputs a subsequence of length L, c ≜ ⟨w_j1, . . . , w_jL⟩, with 1 ≤ j_1 < . . . < j_L ≤ N. We may represent this output as a binary vector s of length N, where s_j = 1 iff word w_j is included in the compression. Note that there are O(2^N) possible subsequences.

3.1 Related Work

Past approaches to sentence compression include a noisy channel formulation (Knight and Marcu, 2000; Daumé and Marcu, 2002), heuristic methods that parse the sentence and then trim constituents according to linguistic criteria (Dorr et al., 2003; Zajic et al., 2006), a pure discriminative model (McDonald, 2006), and an ILP formulation (Clarke and Lapata, 2008). We next give an overview of the two latter approaches.

McDonald (2006) uses the outputs of two parsers (a phrase-based and a dependency parser) as features in a discriminative model that decomposes over pairs of consecutive words. Formally, given a sentence t = ⟨w_1, . . . , w_N⟩, the score of a compression c = ⟨w_j1, . . . , w_jL⟩ decomposes as:

score(c; t) = Σ_{l=2}^{L} θ^⊤ f(t, j_{l−1}, j_l),   (9)

where f(t, j_{l−1}, j_l) are feature vectors that depend on the original sentence t and consecutive positions j_{l−1} and j_l, and θ is a learned weight vector. The factorization in Eq. 9 allows exact decoding with dynamic programming.

Clarke and Lapata (2008) cast the problem as an ILP. In their formulation, Eq. 9 may be expressed as:

score(c; t) = Σ_{i=1}^{N} α_i θ^⊤ f(t, 0, i) + Σ_{i=1}^{N} β_i θ^⊤ f(t, i, N + 1) + Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} γ_ij θ^⊤ f(t, i, j),   (10)

where α_i, β_i, and γ_ij are additional binary variables with the following meanings:

• α_i = 1 iff word w_i starts the compression;

• β_i = 1 iff word w_i ends the compression;

• γ_ij = 1 iff words w_i and w_j appear consecutively in the compression;

and subject to the following agreement constraints:

Σ_{i=1}^{N} α_i = 1,
Σ_{i=1}^{N} β_i = 1,
s_j = α_j + Σ_{i=1}^{j−1} γ_ij,
s_i = β_i + Σ_{j=i+1}^{N} γ_ij.   (11)

This framework also allows the inclusion of constraints to enforce grammaticality.

To compress a sentence, one needs to maximize the score in Eq. 10 subject to the constraints in Eq. 11. Representing the variables through

ν ≜ ⟨α_1, . . . , α_N, β_1, . . . , β_N, γ_11, . . . , γ_NN⟩   (12)

and packing the feature vectors into a matrix F, we obtain the ILP

max_{s,ν} θ^⊤ F ν,   (13)

subject to linear and integer constraints on the variables s and ν. This particular formulation requires O(N²) variables and constraints.
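The sketch below spells out the Clarke and Lapata-style ILP of Eqs. 10–11, again assuming PuLP; edge_score(i, j) is a placeholder for θ^⊤ f(t, i, j), with positions 0 and N + 1 acting as start and end markers.

```python
# Sketch of the bigram compression ILP of Eqs. 10-11, assuming PuLP.
# edge_score(i, j) stands for theta' f(t, i, j); 0 and N+1 mark start/end.
import pulp

def compress_ilp(N, edge_score):
    prob = pulp.LpProblem("compression", pulp.LpMaximize)
    s = [pulp.LpVariable("s_%d" % i, cat="Binary") for i in range(1, N + 1)]
    alpha = [pulp.LpVariable("a_%d" % i, cat="Binary") for i in range(1, N + 1)]
    beta = [pulp.LpVariable("b_%d" % i, cat="Binary") for i in range(1, N + 1)]
    gamma = {(i, j): pulp.LpVariable("g_%d_%d" % (i, j), cat="Binary")
             for i in range(1, N) for j in range(i + 1, N + 1)}
    # Objective (Eq. 10): start, end, and consecutive-pair scores.
    prob += (pulp.lpSum(alpha[i - 1] * edge_score(0, i) for i in range(1, N + 1))
             + pulp.lpSum(beta[i - 1] * edge_score(i, N + 1) for i in range(1, N + 1))
             + pulp.lpSum(gamma[i, j] * edge_score(i, j) for (i, j) in gamma))
    # Agreement constraints (Eq. 11).
    prob += pulp.lpSum(alpha) == 1
    prob += pulp.lpSum(beta) == 1
    for j in range(1, N + 1):
        prob += s[j - 1] == alpha[j - 1] + pulp.lpSum(gamma[i, j] for i in range(1, j))
    for i in range(1, N + 1):
        prob += s[i - 1] == beta[i - 1] + pulp.lpSum(gamma[i, j] for j in range(i + 1, N + 1))
    prob.solve()
    return [i for i in range(1, N + 1) if s[i - 1].value() > 0.5]
```

The quadratic number of γ variables is what the dependency-based formulation of §3.2 avoids.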

3.2 Proposed Method

We propose an alternative model for sentence compression that may be formulated as an ILP, as in Eq. 13, but with only O(N) variables and constraints. This formulation is based on the output of a dependency parser.

Directed arcs in a dependency tree link pairs of words, namely a head to its modifier. A dependency parse tree is characterized by a set of labeled arcs of the form (head, modifier, label); see Fig. 1 for an example. Given a sentence t = ⟨w_1, . . . , w_N⟩, we write i = π(j) to denote that the ith word is the head (the "parent") of the jth word; if j is the root, we write π(j) = 0. Let s be the binary vector describing a possible compression c for the sentence t. For each word j, we consider four possible cases, accounting for the inclusion or not of j and π(j) in the compression. We introduce (mutually exclusive) binary variables ν_j11, ν_j10, ν_j01, and ν_j00 to indicate each of these cases, i.e., for a, b ∈ {0, 1},

ν_jab ≜ (s_j = a ∧ s_π(j) = b).   (14)

[Figure 1: A dependency parse for an English sentence; example from McDonald and Satta (2007).]

Consider feature vectors f_11(t, j), f_10(t, j), f_01(t, j), and f_00(t, j), that look at the surface sentence and at the status of the word j and its head π(j); these features have corresponding weight vectors θ_11, θ_10, θ_01, and θ_00. The score of c is written as:

score(c; t) = Σ_{j=1}^{N} Σ_{a,b∈{0,1}} ν_jab θ_ab^⊤ f_ab(t, j) = Σ_{a,b∈{0,1}} θ_ab^⊤ F_ab ν_ab = θ^⊤ F ν,   (15)

where F_ab ≜ [f_ab(t, 1), . . . , f_ab(t, N)], ν_ab ≜ (ν_jab)_{j=1,...,N}, θ ≜ (θ_11, θ_10, θ_01, θ_00), and F ≜ Diag(F_11, F_10, F_01, F_00) (a block-diagonal matrix). We have reached in Eq. 15 an ILP isomorphic to the one in Eq. 13, but only with O(N) variables. There are some agreement constraints between the variables ν and s that reflect the logical relations in Eq. 14; these may be written as linear inequalities (cf. Eq. 6), yielding O(N) constraints.

Given this proposal and §3.1, it is also straightforward to extend this model to include bigram features as in Eq. 10; the combination of dependency relation features and bigram features yields a model that is more powerful than both models in Eq. 15 and Eq. 10. Such a model is expressible as an ILP with O(N²) variables and constraints, making use of the variables s, ν, α, β and γ. In §5, we compare the performance of this model (called "Bigram") and the model in Eq. 15 (called "NoBigram").²

²It should be noted that more efficient decoders are possible that do not require solving an ILP. In particular, inference in the NoBigram variant can be performed in polynomial time with dynamic programming algorithms that propagate messages along the dependency parse tree; for the Bigram variant, dynamic programming can still be employed with some additional storage. Our ILP formulation, however, is more suited to the final goal of performing document summarization (of which our sentence compression model will be a component); furthermore, it also allows the straightforward inclusion of global linguistic constraints, which, as shown by Clarke and Lapata (2008), can greatly improve the grammaticality of the compressions.
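A minimal sketch of the NoBigram ILP of Eqs. 14–15 follows, again assuming PuLP. The head array and per-case scores are placeholder inputs, and the equalities linking the ν and s variables are one plausible encoding of the relations in Eq. 14 (the text above only states that they can be written as linear inequalities, cf. Eq. 6); treating the dummy root as always "included" is likewise an assumption made for the sketch.

```python
# Sketch of the dependency-based NoBigram ILP (Eqs. 14-15), assuming PuLP.
# head[j] gives pi(j) (0 for the root); case_score[j][(a, b)] stands for
# theta_ab' f_ab(t, j). The nu/s agreement equalities below are one plausible
# encoding of Eq. 14; the paper leaves the exact form open.
import pulp

def compress_nobigram(N, head, case_score, max_words=None):
    prob = pulp.LpProblem("nobigram", pulp.LpMaximize)
    s = {j: pulp.LpVariable("s_%d" % j, cat="Binary") for j in range(1, N + 1)}
    s[0] = 1  # assumed convention: the dummy root counts as "included"
    nu = {(j, a, b): pulp.LpVariable("nu_%d_%d_%d" % (j, a, b), cat="Binary")
          for j in range(1, N + 1) for a in (0, 1) for b in (0, 1)}
    # Objective (Eq. 15): sum of per-word, per-case scores.
    prob += pulp.lpSum(nu[j, a, b] * case_score[j][(a, b)] for (j, a, b) in nu)
    for j in range(1, N + 1):
        # Exactly one of the four cases holds for each word.
        prob += pulp.lpSum(nu[j, a, b] for a in (0, 1) for b in (0, 1)) == 1
        # Link the cases to the inclusion of word j and of its head pi(j).
        prob += s[j] == nu[j, 1, 1] + nu[j, 1, 0]
        prob += s[head[j]] == nu[j, 1, 1] + nu[j, 0, 1]
    if max_words is not None:  # optional cap on the compression ratio (see Sec. 5.2)
        prob += pulp.lpSum(s[j] for j in range(1, N + 1)) <= max_words
    prob.solve()
    return [j for j in range(1, N + 1) if s[j].value() > 0.5]
```

The program has 4N ν variables plus N s variables, matching the O(N) claim above.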
4 Joint Compression and Extraction

We next describe our joint model for sentence compression and extraction. Let D ≜ {t_1, . . . , t_M} be a set of sentences as in §2, each expressed as a sequence of words, t_i ≜ ⟨w_i1, . . . , w_iNi⟩. Following §3, we represent a compression of t_i as a binary vector s_i = ⟨s_i1, . . . , s_iNi⟩, where s_ij = 1 iff word w_ij is included in the compression. Now, define a summary of D as a set of sentences obtained by extracting and compressing sentences from D. More precisely, let µ_1, . . . , µ_M be binary variables, one for each sentence t_i in D; define µ_i = 1 iff a compression of sentence t_i is used in the summary. A summary of D is then represented by the binary variables ⟨µ_1, . . . , µ_M, s_1, . . . , s_M⟩. Notice that these variables are redundant:

µ_i = 0 ⇔ ∀j ∈ {1, . . . , N_i}  s_ij = 0,   (16)

i.e., an empty compression means that the sentence is not to be extracted. In the sequel, it will become clear why this redundancy is convenient.

Most approaches up to now are concerned with either extraction or compression, not both at the same time. We will combine the extraction scores in Eq. 8 and the compression scores in Eq. 15 to obtain a single, global optimization problem;³ we rename the extraction features and parameters to F_e and θ_e and the compression features and parameters to F_c and θ_c:

max_{µ,ν,s}  θ_e^⊤ F_e µ + Σ_{i=1}^{M} θ_c^⊤ F_ci ν_i,   (17)

subject to agreement constraints on the variables ν_i and s_i (see Eqs. 11 and 14), and new agreement constraints on the variables µ and s_1, . . . , s_M to enforce the relation in Eq. 16:

s_ij ≤ µ_i,  ∀i = 1, . . . , M, ∀j = 1, . . . , N_i,
µ_i ≤ Σ_{j=1}^{Ni} s_ij,  ∀i = 1, . . . , M.   (18)

The constraint that the length of the summary cannot exceed J words is encoded as:

Σ_{i=1}^{M} Σ_{j=1}^{Ni} s_ij ≤ J.   (19)

All variables are further restricted to be binary. We also want to avoid picking just a few words from many sentences, which typically leads to ungrammatical summaries. Hence it is desirable to obtain "sparse" solutions with only a few sentences extracted and compressed (and most components of µ are zero). To do so, we add the constraint

Σ_{j=1}^{Ni} s_ij ≥ µ_i ρ N_i,  i = 1, . . . , M,   (20)

which states, for each sentence t_i, that t_i should be ignored or have at least ρN_i words extracted. We fix ρ = 0.8, enforcing compression rates below 80%.⁴

To learn the model parameters θ = ⟨θ_e, θ_c⟩, we can use a max-margin discriminative learning algorithm like MIRA (Crammer and Singer, 2003), which is quite effective and scalable. However, there is not (to our knowledge) a single dataset of extracted and compressed sentences. Instead, as will be described in Sec. 5.1, there are separate datasets of extracted sentences, and datasets of compressed sentences. Therefore, instead of globally learning the model parameters, θ = ⟨θ_e, θ_c⟩, we propose the following strategy to learn them separately:

• Learn θ_e′ using a corpus of extracted sentences,

• Learn θ_c′ using a corpus of compressed sentences,

• Tune η so that θ = ⟨θ_e′, ηθ_c′⟩ has good performance on development data. (This is necessary since each set of weights is learned up to scaling.)

³In what follows, we use the formulation in Eq. 8 without the redundancy terms; however these can be included in a straightforward way, naturally increasing the number of variables/constraints.

⁴There are alternative ways to achieve "sparseness," either in a soft way, by adding a term −λ Σ_i µ_i to the objective, or using a different hard constraint, like Σ_i µ_i ≤ K, to limit the number of sentences from which to pick words.
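The following sketch wires together the pieces of Eqs. 17–20, once more assuming PuLP. For brevity the compression part is scored with simple per-word placeholder scores rather than the full ν-variable machinery of Eq. 15, so this is an illustrative simplification, not the exact model.

```python
# Sketch of the joint extraction-and-compression ILP (Eqs. 17-20), assuming PuLP.
# rel_score[i] stands for the extraction score of sentence i; word_score[i][j]
# is a simplified stand-in for the compression scores (the full model would use
# the nu variables of Eq. 15 here).
import pulp

def joint_summarize(rel_score, word_score, J, rho=0.8):
    M = len(rel_score)
    N = [len(ws) for ws in word_score]
    prob = pulp.LpProblem("joint", pulp.LpMaximize)
    mu = [pulp.LpVariable("mu_%d" % i, cat="Binary") for i in range(M)]
    s = [[pulp.LpVariable("s_%d_%d" % (i, j), cat="Binary") for j in range(N[i])]
         for i in range(M)]
    # Objective (Eq. 17, simplified): extraction scores plus compression scores.
    prob += (pulp.lpSum(mu[i] * rel_score[i] for i in range(M))
             + pulp.lpSum(s[i][j] * word_score[i][j]
                          for i in range(M) for j in range(N[i])))
    # Budget of J words for the whole summary (Eq. 19).
    prob += pulp.lpSum(s[i][j] for i in range(M) for j in range(N[i])) <= J
    for i in range(M):
        # Agreement between mu_i and s_i (Eq. 18).
        for j in range(N[i]):
            prob += s[i][j] <= mu[i]
        prob += mu[i] <= pulp.lpSum(s[i])
        # Sparseness constraint (Eq. 20): keep at least rho*N_i words or none.
        prob += pulp.lpSum(s[i]) >= rho * N[i] * mu[i]
    prob.solve()
    return [[j for j in range(N[i]) if s[i][j].value() > 0.5]
            for i in range(M) if mu[i].value() > 0.5]
```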

5 Experiments

5.1 Datasets, Evaluation and Environment

For our experiments, two datasets were used:

The DUC 2002 dataset. This is a collection of newswire articles, comprised of 59 document clusters. Each document within the collections (out of a total of 567 documents) has one or two manually created abstracts with approximately 100 words.⁵

Clarke's dataset for sentence compression. This is the dataset used by Clarke and Lapata (2008). It contains manually created compressions of 82 newspaper articles (1,433 sentences) from the British National Corpus and the American News Text corpus.⁶

To evaluate the sentence compressor alone, we measured the compression rate and the precision, recall, and F1-measure (both macro- and micro-averaged) with respect to the "gold" compressed sentences, calculated on unigrams.⁷

To evaluate the full system, we used Rouge-N (Lin and Hovy, 2002), a popular n-gram recall-based automatic evaluation measure. This score compares the summary produced by a system with one or more valid reference summaries.

All our experiments were conducted on a PC with an Intel dual-core processor at 2.66 GHz and 2 GB of RAM. We used ILOG CPLEX, a commercial integer programming solver. The interface with CPLEX was coded in Java.

⁵http://duc.nist.gov

⁶http://homepages.inf.ed.ac.uk/s0460084/data

⁷Notice that this evaluation score is not able to properly capture the grammaticality of the compression; this is a known issue that typically is addressed by requiring human judgments.
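As a small illustration of the compression metrics above, the helper below computes unigram precision, recall, and F1 of a predicted compression against the gold compression, plus the compression rate. Matching on token multisets is an assumption made here; the paper does not spell out the exact matching protocol.

```python
# Illustrative computation of unigram precision/recall/F1 for a predicted
# compression against a gold compression. Matching on token multisets is an
# assumption; the exact protocol is not specified in the paper.
from collections import Counter

def unigram_prf(pred_tokens, gold_tokens):
    pred, gold = Counter(pred_tokens), Counter(gold_tokens)
    overlap = sum((pred & gold).values())  # multiset intersection size
    p = overlap / len(pred_tokens) if pred_tokens else 0.0
    r = overlap / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def compression_rate(pred_tokens, original_tokens):
    # Fraction of the original words kept in the compression.
    return len(pred_tokens) / len(original_tokens)
```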

5.2 Sentence Compression

We split Clarke's dataset into two partitions, one used for training (1,188 sentences) and the other for testing (441 sentences). This dataset includes one manual compression for each sentence, which we use as reference for evaluation purposes. The compression ratio, i.e., the fraction of words included in the compressed sentences, is 69.32% (micro-averaged over the training partition).

For comparison, two baselines were implemented: a simple compressor based on Hedge Trimmer, the headline generation system of Dorr et al. (2003) and Zajic et al. (2006),⁸ and the discriminative model described by McDonald (2006), which captures "soft syntactic evidence" (we reproduced the same set of features). Both systems require a phrase-structure parser; we used Collins' parser (Collins, 1999);⁹ the latter system also derives features from a dependency parser; we used the MSTParser (McDonald et al., 2005).¹⁰

We implemented the two variants of our compressor described in §3.2.

NoBigram. This variant factors the compression score as a sum over individual scores, each depending on the inclusion or not of each word and its head in the compression (see Eq. 15). An upper bound of 70% was placed on the compression ratio. As stated in §3.2, inference amounts to solving an ILP with O(N) variables and constraints, N being the sentence length. We also used MSTParser to obtain the dependency parse trees.

Bigram. This variant includes an extra term standing for a bigram score, which factors as a sum over pairs of consecutive words. As in McDonald (2006), we include features that depend on the "in-between" words in the original sentence that are to be omitted in the compression.¹¹ As stated in §3.2, inference through this model can be done by solving an ILP with O(N²) variables and constraints.

For both variants, we used MSTParser to obtain the dependency parse trees. The model parameters are learned in a pure discriminative way through a max-margin approach. We used the 1-best MIRA algorithm (Crammer and Singer, 2003; McDonald et al., 2005) for training; this is a fast online algorithm that requires solving the inference problem at each step. Although inference amounts to solving an ILP, which in the worst case scales exponentially with the size of the sentence, training the model is in practice very fast for the NoBigram model (a few minutes in the environment described in §5.1) and fast enough for the Bigram model (a couple of hours using the same equipment). This is explained by the fact that sentences don't usually exceed a few tens of words, and because of the structure of the ILPs, whose constraint matrices are very sparse.

Table 1 depicts the micro- and macro-averaged precision, recall and F1-measure. We can see that both variants outperform the Hedge Trimmer baseline by a great margin, and are in line with the system of McDonald (2006); however, none of our variants employ a phrase-structure parser. We also observe that our simpler NoBigram variant, which uses a linear-sized ILP, achieves results similar to these two systems.

                  Compression   Micro-Av.                 Macro-Av.
                  Ratio         P       R       F1        P       R       F1
Hedge Trimmer     57.64%        0.7099  0.5925  0.6459    0.7195  0.6547  0.6367
McDonald (2006)   71.40%        0.7444  0.7697  0.7568    0.7711  0.7852  0.7696
NoBigram          71.20%        0.7399  0.7626  0.7510    0.7645  0.7730  0.7604
Bigram            71.35%        0.7472  0.7720  0.7594    0.7737  0.7848  0.7710

Table 1: Results for sentence compression in Clarke's test dataset (441 sentences) for our implementation of the baseline systems (Hedge Trimmer and the system described in McDonald, 2006), and the two variants of our model, NoBigram and Bigram. The compression ratio associated with the reference compressed sentences in this dataset is 69.06%. In the rightmost column, the statistically indistinguishable best results are emboldened, based on a paired t-test applied to the sequence of F1 measures (p < 0.01).

⁸Hedge Trimmer applies a deterministic compression procedure whose first step is to identify the lowest leftmost S node in the parse tree that contains a NP and a VP; this node is taken as the root of the compressed sentence (i.e., all words that are not spanned by this node are discarded). Further steps described by Dorr et al. (2003) include removal of low content units, and an "iterative shortening" loop that keeps removing constituents until a desired compression ratio is achieved. The best results were obtained without iterative shortening, which is explained by the fact that the selection of the lowest leftmost S node (first step of the algorithm) already provides significant compression, as illustrated in Table 1.

⁹http://people.csail.mit.edu/mcollins/code.html

¹⁰http://sourceforge.net/projects/mstparser

¹¹The major difference between this variant and the model of McDonald (2006) is that the latter employs "soft syntactic evidence" as input features, while we make the dependency relations part of the output features. All the non-syntactic features are the same. Apart from this, notice that our variant does not employ a phrase-structure parser.
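For concreteness, the following is a minimal sketch of a 1-best MIRA training loop of the kind mentioned above, assuming NumPy; decode(x, w), feats(x, y), and loss(y, y_gold) are placeholder functions standing for the inference ILP, the feature map, and a task cost (e.g., a Hamming loss over word-inclusion decisions).

```python
# Sketch of a 1-best MIRA training loop for the compressor, assuming NumPy.
# decode(x, w) solves the inference ILP under weights w; feats(x, y) and
# loss(y, y_gold) are placeholder feature and cost functions.
import numpy as np

def mira_train(data, feats, decode, loss, dim, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)                     # solve the inference ILP
            delta = feats(x, y_gold) - feats(x, y_hat)
            margin = float(np.dot(w, delta))
            cost = loss(y_hat, y_gold)
            denom = float(np.dot(delta, delta))
            # Smallest update that scores the gold output above the prediction
            # by a margin of at least its loss (standard 1-best MIRA step).
            if denom > 0:
                tau = max(0.0, (cost - margin) / denom)
                w = w + tau * delta
    return w
```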
5.3 Joint Compression and Extraction

For the summarization task, we split the DUC 2002 dataset into a training partition (427 documents) and a testing partition (140 documents). The training partition was further split into a training and a development set. We evaluated the performance of Lead, Rel, and MMR as baselines (all are described in §2). Weights for Rel were learned via the SVMRank algorithm;¹² to create a gold-standard ranking, we sorted the sentences by Rouge-2 score¹³ (with respect to the human created summaries). We include a Pipeline baseline as well, which ranks all sentences by relevance, then includes their compressions (using the Bigram variant) while they fit into the summary.

We tested two variants of our joint model, combining the Rel extraction model with (i) the NoBigram compression model (§3.2) and (ii) the Bigram variant. Each variant was trained with the procedure described in §4. To keep tractability, the inference ILP problem was relaxed (the binary constraints were relaxed to unit interval constraints) and non-integer solution values were rounded to produce a valid summary, both for training and testing.¹⁴ Whenever this procedure yielded a summary longer than 100 words, we truncated it to fit the word limit.

Table 2 depicts the results of each of the above systems in terms of Rouge-1 and Rouge-2 scores. We can see that both variants of our system are able to achieve the best results in terms of Rouge-1 and Rouge-2 scores. The suboptimality of extracting and compressing in separate stages is clear from the table, as Pipeline performs worse than the pure extractive systems.

                          Rouge-1            Rouge-2
Lead                      0.384 ± 0.080      0.177 ± 0.083
Rel                       0.389 ± 0.074      0.178 ± 0.080
MMR, λ = 0.25             0.392 ± 0.071      0.178 ± 0.077
Pipeline                  0.380 ± 0.073      0.173 ± 0.073
Rel + NoBigr, η = 1.5     0.403 ± 0.080      0.180 ± 0.082
Rel + Bigr, η = 4.0       0.403 ± 0.076      0.180 ± 0.076

Table 2: Results for sentence extraction in the DUC 2002 dataset (140 documents). Bold indicates the best results with statistical significance, according to a paired t-test (p < 0.01); Rouge-2 scores of all systems except Pipeline are indistinguishable according to the same test, with p > 0.05.

¹²SVMRank is implemented in the SVMlight toolkit (Joachims, 1999), http://svmlight.joachims.org.

¹³A similar system was implemented that optimizes the Rouge-1 score instead, but it led to inferior performance.

¹⁴See Martins et al. (2009) for a study concerning the impact of LP relaxations in the learning problem.
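The rounding-and-truncation post-processing described above is simple enough to sketch directly; the names below are illustrative, with mu_vals and s_vals standing for the (possibly fractional) solution values of the relaxed ILP.

```python
# Sketch of the post-processing described above: round fractional solution
# values to 0/1 and truncate the resulting summary to the 100-word limit.
# mu_vals, s_vals, and sentences are illustrative names for the relaxed
# solution values and the tokenized input sentences.

def round_and_truncate(mu_vals, s_vals, sentences, budget=100):
    words = []
    for i, sent in enumerate(sentences):
        if mu_vals[i] < 0.5:          # sentence not selected after rounding
            continue
        kept = [w for j, w in enumerate(sent) if s_vals[i][j] >= 0.5]
        words.extend(kept)
    return words[:budget]             # enforce the word limit
```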
We also note that the configuration Rel + Bigram is not able to outperform Rel + NoBigram, despite being computationally more expensive (about 25 minutes to process the whole test set, against the 7 minutes taken by the Rel + NoBigram variant). Fig. 2 exemplifies the summaries produced by our system. We see that both variants were able to include new pieces of information in the summary without sacrificing grammaticality.

These results suggest that our system, being capable of performing joint sentence extraction and compression to summarize a document, offers a powerful alternative to pure extractive systems. Finally, we note that no labeled datasets currently exist on which our full model could have been trained with supervision; therefore, although inference is performed jointly, our training procedure had to learn separately the extraction and the compression models, and to tune a scalar parameter to trade off the two models. We conjecture that a better model could have been learned if a labeled dataset with extracted, compressed sentences existed.

MMR baseline:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
Carey, who lives in Sydney with his wife and son, said in a brief speech that like the other five finalists he had been asked to attend with a short speech in his pocket in case he won.

Rel + NoBigram:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
The judges made their selection from 102 books published in Britain in the past 12 months and which they read in their homes.
Carey, who lives in Sydney with his wife and son, said in a brief speech that like the other five finalists he had been asked to attend with a short speech in his pocket in case he won.

Rel + Bigram:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
He was unsuccessful in the prize competition in 1985 when his novel, "Illywhacker," was among the final six.
Carey called the award a "great honor" and he thanked the prize sponsors for "provoking so much passionate discussion about literature perhaps there will be more tomorrow".
Carey was the only non-Briton in the final six.

Figure 2: Summaries produced by the strongest baseline (MMR) and the two variants of our system. Deleted words are marked as such.

6 Conclusion and Future Work

We have presented a summarization system that performs sentence extraction and compression in a single step, by casting the problem as an ILP. The summary optimizes an objective function that includes both extraction and compression scores. Our model encourages "sparse" summaries that involve only a few sentences. Experiments in newswire data suggest that our system is a valid alternative to existing extraction-based systems. However, it is worth noting that further evaluation (e.g., human judgments) needs to be carried out to assert the quality of our summaries, e.g., their grammaticality, something that the Rouge scores cannot fully capture.

Future work will address the possibility of including linguistic features and constraints to further improve the grammaticality of the produced summaries.

Another straightforward extension is the inclusion of a redundancy term and a query relevance term in the objective function. For redundancy, a similar idea to that of McDonald (2007) can be applied, yielding an ILP with O(M² + N) variables and constraints (M being the number of sentences and N the total number of words). However, such a model will take into account the redundancy among the original sentences and not their compressions; to model the redundancy across compressions, a possibility is to consider a linear redundancy score (similar to cosine similarity, but without the normalization), which would result in an ILP with O(N + Σ_i P_i²) variables and constraints, where P_i ≤ M is the number of sentences in which word w_i occurs; this is no worse than O(M²N).

We also intend to model discourse, which, as shown by Daumé and Marcu (2002), plays an important role in document summarization. Another future direction is to extend our ILP formulations to more sophisticated models that go beyond word deletion, like the ones proposed by Cohn and Lapata (2008).

Acknowledgments

The authors thank the anonymous reviewers for helpful comments, Yiming Yang for interesting discussions, and Dipanjan Das and Sourish Chaudhuri for providing their code. This research was supported by a grant from FCT through the CMU-Portugal Program and the Information and Communications Technologies Institute (ICTI) at CMU, and also by Priberam Informática.
References

C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen. 1999. A trainable summarizer with knowledge acquired from robust NLP techniques. In Advances in Automatic Text Summarization. MIT Press.

P. B. Baxendale. 1958. Machine-made index for technical literature—an experiment. IBM Journal of Research Development, 2(4):354–361.

J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR.

J. Clarke and M. Lapata. 2008. Global inference for sentence compression: an integer linear programming approach. JAIR, 31:399–429.

T. Cohn and M. Lapata. 2008. Sentence compression beyond word deletion. In Proc. of COLING.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

K. Crammer and Y. Singer. 2003. Ultraconservative online algorithms for multiclass problems. J. Mach. Learn. Res., 3:951–991.

H. Daumé and D. Marcu. 2002. A noisy-channel model for document compression. In Proc. of ACL.

H. Daumé. 2006. Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, University of Southern California.

B. Dorr, D. Zajic, and R. Schwartz. 2003. Hedge Trimmer: A parse-and-trim approach to headline generation. In Proc. of the HLT-NAACL Text Summarization Workshop and DUC.

H. P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM, 16(2):264–285.

R. Jin. 2003. Statistical Approaches Toward Title Generation. Ph.D. thesis, Carnegie Mellon University.

T. Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning. MIT Press.

K. Knight and D. Marcu. 2000. Statistics-based summarization—step one: Sentence compression. In Proc. of AAAI/IAAI.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A trainable document summarizer. In Proc. of SIGIR.

C.-Y. Lin and E. Hovy. 2002. Manual and automatic evaluation of summaries. In Proc. of the ACL Workshop on Automatic Summarization.

C.-Y. Lin. 2003. Improving summarization performance by sentence compression—a pilot study. In Proc. of the Int. Workshop on Inf. Ret. with Asian Languages.

H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research Development, 2(2):159–165.

A. F. T. Martins, N. A. Smith, and E. P. Xing. 2009. Polyhedral outer approximations with application to natural language parsing. In Proc. of ICML.

R. T. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.

R. McDonald. 2006. Discriminative sentence compression with soft syntactic constraints. In Proc. of EACL.

R. McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proc. of ECIR.

M. Osborne. 2002. Using maximum entropy for sentence extraction. In Proc. of the ACL Workshop on Automatic Summarization.

D. R. Radev and K. McKeown. 1998. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469–500.

D. R. Radev, H. Jing, and M. Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proc. of the NAACL-ANLP Workshop on Automatic Summarization.

D. Zajic, B. Dorr, J. Lin, and R. Schwartz. 2006. Sentence compression as a component of a multi-document summarization system. In Proc. of the ACL DUC Workshop.