Summarization with a Joint Model for Sentence Extraction and Compression
André F. T. Martins∗† and Noah A. Smith∗
∗School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
†Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
{afm,nasmith}@cs.cmu.edu

Abstract

Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence. This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program. We report favorable experimental results on newswire data.

1 Introduction

Automatic text summarization dates back to the 1950s and 1960s (Luhn, 1958; Baxendale, 1958; Edmundson, 1969). Today, the proliferation of digital information makes research on summarization technologies more important than ever before. In the last two decades, machine learning techniques have been employed in extractive summarization of single documents (Kupiec et al., 1995; Aone et al., 1999; Osborne, 2002) and multiple documents (Radev and McKeown, 1998; Carbonell and Goldstein, 1998; Radev et al., 2000). Most of this work aims only to extract relevant sentences from the original documents and present them as the summary; this simplification of the problem yields scalable solutions.

Some attention has been devoted by the NLP community to the related problem of sentence compression (Knight and Marcu, 2000): given a long sentence, how to maximally compress it into a grammatical sentence that still preserves all the relevant information? While sentence compression is a promising framework with applications, for example, in headline generation (Dorr et al., 2003; Jin, 2003), little work has been done to include it as a module in document summarization systems. Most existing approaches (with some exceptions, like the vine-growth model of Daumé, 2006) use a two-stage architecture, either by first extracting a certain number of salient sentences and then feeding them into a sentence compressor, or by first compressing all sentences and extracting later. However, regardless of which operation is performed first—compression or extraction—two-step "pipeline" approaches may fail to find overall-optimal solutions; often the summaries are not better than the ones produced by extractive summarization. On the other hand, a pilot study carried out by Lin (2003) suggests that summarization systems that perform sentence compression have the potential to beat pure extractive systems if they model cross-sentence effects.

In this work, we address this issue by merging the tasks of sentence extraction and sentence compression into a global optimization problem. A careful design of the objective function encourages "sparse solutions," i.e., solutions that involve only a small number of sentences whose compressions are to be included in the summary. Our contributions are:

• We cast joint sentence extraction and compression as an integer linear program (ILP);

• We provide a new formulation of sentence compression using dependency parsing information that only requires a linear number of variables, and combine it with a bigram model;

• We show how the full model can be trained in a max-margin framework. Since a dataset of summaries comprised of extracted, compressed sentences is unavailable, we present a procedure that trains the compression and extraction models separately and tunes a parameter to interpolate the two models.

The compression model and the full system are compared with state-of-the-art baselines in standard newswire datasets. This paper is organized as follows: §2–3 provide an overview of our two building blocks, sentence extraction and sentence compression. §4 describes our method to perform one-step sentence compression and extraction. §5 shows experiments in newswire data. Finally, §6 concludes the paper and suggests future work.

2 Extractive summarization

Extractive summarization builds a summary by extracting a few informative sentences from the documents. Let D ≜ {t_1, . . . , t_M} be a set of sentences, contained in a single or in multiple related documents.¹ The goal is to extract the best sequence of sentences ⟨t_i1, . . . , t_iK⟩ that summarizes D whose total length does not exceed a fixed budget of J words. We describe some well-known approaches that will serve as our experimental baselines.

¹For simplicity, we describe a unified framework for single and multi-document summarization, although they may require specialized strategies. Here we experiment only with single-document summarization and assume t_1, . . . , t_M are ordered.

Extract the leading sentences (Lead). For single-document summarization, the simplest method consists of greedily extracting the leading sentences while they fit into the summary. A sentence is skipped if its inclusion exceeds the budget, and the next is examined. This performs extremely well in newswire articles, due to the journalistic convention of summarizing the article first.

Rank by relevance (Rel). This method ranks sentences by a relevance score, and then extracts the top ones that can fit into the summary. The score is typically a linear function of feature values:

score_rel(t_i) ≜ θ^⊤ f(t_i) = Σ_{d=1}^{D} θ_d f_d(t_i).   (1)

Here, each f_d(t_i) is a feature extracted from sentence t_i, and θ_d is the corresponding weight. In our experiments, relevance features include (i) the reciprocal position in the document, (ii) a binary feature indicating whether the sentence is the first one, and (iii) the 1-gram and 2-gram cosine similarity with the headline and with the full document.

Maximal Marginal Relevance (MMR). For long documents or large collections, it becomes important to penalize the redundancy among the extracted sentences. Carbonell and Goldstein (1998) proposed greedily adding sentences to the summary S to maximize, at each step, a score of the form

λ · score_rel(t_i) − (1 − λ) · score_red(t_i, S),   (2)

where score_rel(t_i) is as in Eq. 1 and score_red(t_i, S) accounts for the redundancy between t_i and the current summary S. In our experiments, redundancy is the 1-gram cosine similarity between the sentence t_i and the current summary S. The trade-off between relevance and redundancy is controlled by λ ∈ [0, 1], which is tuned on development data.

McDonald (2007) proposed a non-greedy variant of MMR that takes into account the redundancy between each pair of candidate sentences. This is cast as a global optimization problem:

Ŝ = arg max_S  λ · Σ_{t_i ∈ S} score_rel(t_i) − (1 − λ) · Σ_{t_i, t_j ∈ S} score_red(t_i, t_j),   (3)

where score_rel(t_i) ≜ θ_rel^⊤ f_rel(t_i), score_red(t_i, t_j) ≜ θ_red^⊤ f_red(t_i, t_j), and f_rel(t_i) and f_red(t_i, t_j) are feature vectors with corresponding learned weight vectors θ_rel and θ_red. He has shown how the relevance-based method and the MMR framework (in the non-greedy form of Eq. 3) can be cast as an ILP. By introducing indicator variables ⟨µ_i⟩_{i=1,...,M} and ⟨µ_ij⟩_{i,j=1,...,M} with the meanings

µ_i = 1 if t_i is to be extracted, and 0 otherwise;
µ_ij = 1 if t_i and t_j are both to be extracted, and 0 otherwise,   (4)

one can reformulate Eq. 3 as an ILP with O(M²) variables and constraints:

max_{⟨µ_i⟩,⟨µ_ij⟩}  λ · Σ_{i=1}^{M} µ_i score_rel(t_i) − (1 − λ) · Σ_{i=1}^{M} Σ_{j=1}^{M} µ_ij score_red(t_i, t_j),   (5)

subject to binary constraints µ_i, µ_ij ∈ {0, 1}, the length constraint Σ_{i=1}^{M} µ_i N_i ≤ J (where N_i is the number of words of the ith sentence), and the following "agreement constraints" for i, j = 1, . . . , M (that impose the logical relation µ_ij = µ_i ∧ µ_j):

µ_ij ≤ µ_i,   µ_ij ≤ µ_j,   µ_ij ≥ µ_i + µ_j − 1.   (6)

Let us provide a compact representation of the program in Eq. 5 that will be used later. Define our vector of parameters as θ ≜ [λθ_rel, −(1 − λ)θ_red]. Packing all the feature vectors (one for each sentence, and one for each pair of sentences) into a matrix F,

F ≜ [ F_rel  0 ; 0  F_red ],   (7)

with F_rel ≜ [f_rel(t_i)]_{1≤i≤M} and F_red ≜ [f_red(t_i, t_j)]_{1≤i,j≤M}, and stacking the variables µ_i and µ_ij into a vector µ, the program in Eq. 5 can be compactly written as

max_µ θ^⊤ F µ,   (8)

subject to binary and linear constraints on µ. This formulation requires O(M²) variables and constraints. If we do not penalize sentence redundancy, the redundancy term may be dropped; in this simpler case, F = F_rel, the vector µ only contains the variables ⟨µ_i⟩, and the program in Eq. 8 only requires O(M) variables and constraints. Our method (to be presented in §4) will build on this latter formulation.
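To make the ILP of Eqs. 5–6 concrete, the following is a minimal sketch using the PuLP modeling library (an assumption made here for illustration; the experiments in §5.1 use ILOG CPLEX through a Java interface). The relevance scores, redundancy scores, and sentence lengths are placeholder inputs.

```python
# Minimal sketch of the MMR-style extraction ILP (Eqs. 5-6), assuming PuLP.
# scores_rel[i], scores_red[i][j], and lengths[i] are placeholder inputs.
import pulp

def extract_ilp(scores_rel, scores_red, lengths, J, lam):
    M = len(scores_rel)
    prob = pulp.LpProblem("extraction", pulp.LpMaximize)
    mu = [pulp.LpVariable("mu_%d" % i, cat="Binary") for i in range(M)]
    mu_pair = {(i, j): pulp.LpVariable("mu_%d_%d" % (i, j), cat="Binary")
               for i in range(M) for j in range(M)}
    # Objective (Eq. 5): relevance minus pairwise redundancy.
    prob += (lam * pulp.lpSum(mu[i] * scores_rel[i] for i in range(M))
             - (1 - lam) * pulp.lpSum(mu_pair[i, j] * scores_red[i][j]
                                      for i in range(M) for j in range(M)))
    # Length budget: sum_i mu_i * N_i <= J.
    prob += pulp.lpSum(mu[i] * lengths[i] for i in range(M)) <= J
    # Agreement constraints (Eq. 6): mu_ij = mu_i AND mu_j.
    for i in range(M):
        for j in range(M):
            prob += mu_pair[i, j] <= mu[i]
            prob += mu_pair[i, j] <= mu[j]
            prob += mu_pair[i, j] >= mu[i] + mu[j] - 1
    prob.solve()
    return [i for i in range(M) if mu[i].value() > 0.5]
```

Dropping the redundancy term and the µ_ij variables reduces this to the O(M)-variable program of Eq. 8, which is the form reused in §4.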
3 Sentence Compression

Despite its simplicity, extractive summarization has a few shortcomings: for example, if the original sentences are too long or embed several clauses, there is no way of preventing lengthy sentences from appearing in the final summary. The sentence compression framework (Knight and Marcu, 2000) aims to select the best subsequence of words that still yields a short, informative and grammatical sentence. Such a sentence compressor is given a sentence t ≜ ⟨w_1, . . . , w_N⟩ as input and outputs a subsequence of length L, c ≜ ⟨w_j1, . . . , w_jL⟩, with 1 ≤ j_1 < . . . < j_L ≤ N. We may represent this output as a binary vector s of length N, where s_j = 1 iff word w_j is included in the compression. Note that there are O(2^N) possible subsequences.

3.1 Related Work

Past approaches to sentence compression include a noisy channel formulation (Knight and Marcu, 2000; Daumé and Marcu, 2002), heuristic methods that parse the sentence and then trim constituents according to linguistic criteria (Dorr et al., 2003; Zajic et al., 2006), a pure discriminative model (McDonald, 2006), and an ILP formulation (Clarke and Lapata, 2008). We next give an overview of the two latter approaches.

McDonald (2006) uses the outputs of two parsers (a phrase-based and a dependency parser) as features in a discriminative model that decomposes over pairs of consecutive words. Formally, given a sentence t = ⟨w_1, . . . , w_N⟩, the score of a compression c = ⟨w_j1, . . . , w_jL⟩ decomposes as:

score(c; t) = Σ_{l=2}^{L} θ^⊤ f(t, j_{l−1}, j_l),   (9)

where f(t, j_{l−1}, j_l) are feature vectors that depend on the original sentence t and consecutive positions j_{l−1} and j_l, and θ is a learned weight vector. The factorization in Eq. 9 allows exact decoding with dynamic programming.

Clarke and Lapata (2008) cast the problem as an ILP. In their formulation, Eq. 9 may be expressed as:

score(c; t) = Σ_{i=1}^{N} α_i θ^⊤ f(t, 0, i) + Σ_{i=1}^{N} β_i θ^⊤ f(t, i, N + 1) + Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} γ_ij θ^⊤ f(t, i, j),   (10)

where α_i, β_i, and γ_ij are additional binary variables with the following meanings:

• α_i = 1 iff word w_i starts the compression;

• β_i = 1 iff word w_i ends the compression;

• γ_ij = 1 iff words w_i and w_j appear consecutively in the compression;

and subject to the following agreement constraints:

Σ_{i=1}^{N} α_i = 1,
Σ_{i=1}^{N} β_i = 1,
s_j = α_j + Σ_{i=1}^{j−1} γ_ij,
s_i = β_i + Σ_{j=i+1}^{N} γ_ij.   (11)

This framework also allows the inclusion of constraints to enforce grammaticality.

To compress a sentence, one needs to maximize the score in Eq. 10 subject to the constraints in Eq. 11. Representing the variables through

ν ≜ ⟨α_1, . . . , α_N, β_1, . . . , β_N, γ_11, . . . , γ_NN⟩   (12)

and packing the feature vectors into a matrix F, we obtain the ILP

max_{s,ν} θ^⊤ F ν,   (13)

subject to linear and integer constraints on the variables s and ν. This particular formulation requires O(N²) variables and constraints.
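The sketch below spells out the Clarke and Lapata-style ILP of Eqs. 10–11, again assuming PuLP; edge_score(i, j) is a placeholder for θ^⊤ f(t, i, j), with positions 0 and N + 1 acting as start and end markers.

```python
# Sketch of the bigram compression ILP of Eqs. 10-11, assuming PuLP.
# edge_score(i, j) stands for theta' f(t, i, j); 0 and N+1 mark start/end.
import pulp

def compress_ilp(N, edge_score):
    prob = pulp.LpProblem("compression", pulp.LpMaximize)
    s = [pulp.LpVariable("s_%d" % i, cat="Binary") for i in range(1, N + 1)]
    alpha = [pulp.LpVariable("a_%d" % i, cat="Binary") for i in range(1, N + 1)]
    beta = [pulp.LpVariable("b_%d" % i, cat="Binary") for i in range(1, N + 1)]
    gamma = {(i, j): pulp.LpVariable("g_%d_%d" % (i, j), cat="Binary")
             for i in range(1, N) for j in range(i + 1, N + 1)}
    # Objective (Eq. 10): start, end, and consecutive-pair scores.
    prob += (pulp.lpSum(alpha[i - 1] * edge_score(0, i) for i in range(1, N + 1))
             + pulp.lpSum(beta[i - 1] * edge_score(i, N + 1) for i in range(1, N + 1))
             + pulp.lpSum(gamma[i, j] * edge_score(i, j) for (i, j) in gamma))
    # Agreement constraints (Eq. 11).
    prob += pulp.lpSum(alpha) == 1
    prob += pulp.lpSum(beta) == 1
    for j in range(1, N + 1):
        prob += s[j - 1] == alpha[j - 1] + pulp.lpSum(gamma[i, j] for i in range(1, j))
    for i in range(1, N + 1):
        prob += s[i - 1] == beta[i - 1] + pulp.lpSum(gamma[i, j] for j in range(i + 1, N + 1))
    prob.solve()
    return [i for i in range(1, N + 1) if s[i - 1].value() > 0.5]
```

The quadratic number of γ variables is what the dependency-based formulation of §3.2 avoids.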

3.2 Proposed Method

We propose an alternative model for sentence compression that may be formulated as an ILP, as in Eq. 13, but with only O(N) variables and constraints. This formulation is based on the output of a dependency parser.

Directed arcs in a dependency tree link pairs of words, namely a head to its modifier. A dependency parse tree is characterized by a set of labeled arcs of the form (head, modifier, label); see Fig. 1 for an example. Given a sentence t = ⟨w_1, . . . , w_N⟩, we write i = π(j) to denote that the ith word is the head (the "parent") of the jth word; if j is the root, we write π(j) = 0. Let s be the binary vector describing a possible compression c for the sentence t. For each word j, we consider four possible cases, accounting for the inclusion or not of j and π(j) in the compression. We introduce (mutually exclusive) binary variables ν_j11, ν_j10, ν_j01, and ν_j00 to indicate each of these cases, i.e., for a, b ∈ {0, 1},

ν_jab ≜ (s_j = a ∧ s_π(j) = b).   (14)

[Figure 1: A dependency parse for an English sentence; example from McDonald and Satta (2007).]

Consider feature vectors f_11(t, j), f_10(t, j), f_01(t, j), and f_00(t, j), that look at the surface sentence and at the status of the word j and its head π(j); these features have corresponding weight vectors θ_11, θ_10, θ_01, and θ_00. The score of c is written as:

score(c; t) = Σ_{j=1}^{N} Σ_{a,b∈{0,1}} ν_jab θ_ab^⊤ f_ab(t, j) = Σ_{a,b∈{0,1}} θ_ab^⊤ F_ab ν_ab = θ^⊤ F ν,   (15)

where F_ab ≜ [f_ab(t, 1), . . . , f_ab(t, N)], ν_ab ≜ (ν_jab)_{j=1,...,N}, θ ≜ (θ_11, θ_10, θ_01, θ_00), and F ≜ Diag(F_11, F_10, F_01, F_00) (a block-diagonal matrix). We have reached in Eq. 15 an ILP isomorphic to the one in Eq. 13, but only with O(N) variables. There are some agreement constraints between the variables ν and s that reflect the logical relations in Eq. 14; these may be written as linear inequalities (cf. Eq. 6), yielding O(N) constraints.

Given this proposal and §3.1, it is also straightforward to extend this model to include bigram features as in Eq. 10; the combination of dependency relation features and bigram features yields a model that is more powerful than both models in Eq. 15 and Eq. 10. Such a model is expressible as an ILP with O(N²) variables and constraints, making use of the variables s, ν, α, β and γ. In §5, we compare the performance of this model (called "Bigram") and the model in Eq. 15 (called "NoBigram").²

²It should be noted that more efficient decoders are possible that do not require solving an ILP. In particular, inference in the NoBigram variant can be performed in polynomial time with dynamic programming algorithms that propagate messages along the dependency parse tree; for the Bigram variant, dynamic programming can still be employed with some additional storage. Our ILP formulation, however, is more suited to the final goal of performing document summarization (of which our sentence compression model will be a component); furthermore, it also allows the straightforward inclusion of global linguistic constraints, which, as shown by Clarke and Lapata (2008), can greatly improve the grammaticality of the compressions.
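A minimal sketch of the NoBigram ILP of Eqs. 14–15 follows, again assuming PuLP. The head array and per-case scores are placeholder inputs, and the equalities linking the ν and s variables are one plausible encoding of the relations in Eq. 14 (the text above only states that they can be written as linear inequalities, cf. Eq. 6); treating the dummy root as always "included" is likewise an assumption made for the sketch.

```python
# Sketch of the dependency-based NoBigram ILP (Eqs. 14-15), assuming PuLP.
# head[j] gives pi(j) (0 for the root); case_score[j][(a, b)] stands for
# theta_ab' f_ab(t, j). The nu/s agreement equalities below are one plausible
# encoding of Eq. 14; the paper leaves the exact form open.
import pulp

def compress_nobigram(N, head, case_score, max_words=None):
    prob = pulp.LpProblem("nobigram", pulp.LpMaximize)
    s = {j: pulp.LpVariable("s_%d" % j, cat="Binary") for j in range(1, N + 1)}
    s[0] = 1  # assumed convention: the dummy root counts as "included"
    nu = {(j, a, b): pulp.LpVariable("nu_%d_%d_%d" % (j, a, b), cat="Binary")
          for j in range(1, N + 1) for a in (0, 1) for b in (0, 1)}
    # Objective (Eq. 15): sum of per-word, per-case scores.
    prob += pulp.lpSum(nu[j, a, b] * case_score[j][(a, b)] for (j, a, b) in nu)
    for j in range(1, N + 1):
        # Exactly one of the four cases holds for each word.
        prob += pulp.lpSum(nu[j, a, b] for a in (0, 1) for b in (0, 1)) == 1
        # Link the cases to the inclusion of word j and of its head pi(j).
        prob += s[j] == nu[j, 1, 1] + nu[j, 1, 0]
        prob += s[head[j]] == nu[j, 1, 1] + nu[j, 0, 1]
    if max_words is not None:  # optional cap on the compression ratio (see Sec. 5.2)
        prob += pulp.lpSum(s[j] for j in range(1, N + 1)) <= max_words
    prob.solve()
    return [j for j in range(1, N + 1) if s[j].value() > 0.5]
```

The program has 4N ν variables plus N s variables, matching the O(N) claim above.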
4 Joint Compression and Extraction

We next describe our joint model for sentence compression and extraction. Let D ≜ {t_1, . . . , t_M} be a set of sentences as in §2, each expressed as a sequence of words, t_i ≜ ⟨w_i1, . . . , w_iNi⟩. Following §3, we represent a compression of t_i as a binary vector s_i = ⟨s_i1, . . . , s_iNi⟩, where s_ij = 1 iff word w_ij is included in the compression. Now, define a summary of D as a set of sentences obtained by extracting and compressing sentences from D. More precisely, let µ_1, . . . , µ_M be binary variables, one for each sentence t_i in D; define µ_i = 1 iff a compression of sentence t_i is used in the summary. A summary of D is then represented by the binary variables ⟨µ_1, . . . , µ_M, s_1, . . . , s_M⟩. Notice that these variables are redundant:

µ_i = 0 ⇔ ∀j ∈ {1, . . . , N_i}  s_ij = 0,   (16)

i.e., an empty compression means that the sentence is not to be extracted. In the sequel, it will become clear why this redundancy is convenient.

Most approaches up to now are concerned with either extraction or compression, not both at the same time. We will combine the extraction scores in Eq. 8 and the compression scores in Eq. 15 to obtain a single, global optimization problem;³ we rename the extraction features and parameters to F_e and θ_e and the compression features and parameters to F_c and θ_c:

max_{µ,ν,s}  θ_e^⊤ F_e µ + Σ_{i=1}^{M} θ_c^⊤ F_ci ν_i,   (17)

subject to agreement constraints on the variables ν_i and s_i (see Eqs. 11 and 14), and new agreement constraints on the variables µ and s_1, . . . , s_M to enforce the relation in Eq. 16:

s_ij ≤ µ_i,  ∀i = 1, . . . , M, ∀j = 1, . . . , N_i,
µ_i ≤ Σ_{j=1}^{Ni} s_ij,  ∀i = 1, . . . , M.   (18)

The constraint that the length of the summary cannot exceed J words is encoded as:

Σ_{i=1}^{M} Σ_{j=1}^{Ni} s_ij ≤ J.   (19)

All variables are further restricted to be binary. We also want to avoid picking just a few words from many sentences, which typically leads to ungrammatical summaries. Hence it is desirable to obtain "sparse" solutions with only a few sentences extracted and compressed (and most components of µ are zero). To do so, we add the constraint

Σ_{j=1}^{Ni} s_ij ≥ µ_i ρ N_i,  i = 1, . . . , M,   (20)

which states, for each sentence t_i, that t_i should be ignored or have at least ρN_i words extracted. We fix ρ = 0.8, enforcing compression rates below 80%.⁴

To learn the model parameters θ = ⟨θ_e, θ_c⟩, we can use a max-margin discriminative learning algorithm like MIRA (Crammer and Singer, 2003), which is quite effective and scalable. However, there is not (to our knowledge) a single dataset of extracted and compressed sentences. Instead, as will be described in Sec. 5.1, there are separate datasets of extracted sentences, and datasets of compressed sentences. Therefore, instead of globally learning the model parameters, θ = ⟨θ_e, θ_c⟩, we propose the following strategy to learn them separately:

• Learn θ_e′ using a corpus of extracted sentences,

• Learn θ_c′ using a corpus of compressed sentences,

• Tune η so that θ = ⟨θ_e′, ηθ_c′⟩ has good performance on development data. (This is necessary since each set of weights is learned up to scaling.)

³In what follows, we use the formulation in Eq. 8 without the redundancy terms; however these can be included in a straightforward way, naturally increasing the number of variables/constraints.

⁴There are alternative ways to achieve "sparseness," either in a soft way, by adding a term −λ Σ_i µ_i to the objective, or using a different hard constraint, like Σ_i µ_i ≤ K, to limit the number of sentences from which to pick words.
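The following sketch wires together the pieces of Eqs. 17–20, once more assuming PuLP. For brevity the compression part is scored with simple per-word placeholder scores rather than the full ν-variable machinery of Eq. 15, so this is an illustrative simplification, not the exact model.

```python
# Sketch of the joint extraction-and-compression ILP (Eqs. 17-20), assuming PuLP.
# rel_score[i] stands for the extraction score of sentence i; word_score[i][j]
# is a simplified stand-in for the compression scores (the full model would use
# the nu variables of Eq. 15 here).
import pulp

def joint_summarize(rel_score, word_score, J, rho=0.8):
    M = len(rel_score)
    N = [len(ws) for ws in word_score]
    prob = pulp.LpProblem("joint", pulp.LpMaximize)
    mu = [pulp.LpVariable("mu_%d" % i, cat="Binary") for i in range(M)]
    s = [[pulp.LpVariable("s_%d_%d" % (i, j), cat="Binary") for j in range(N[i])]
         for i in range(M)]
    # Objective (Eq. 17, simplified): extraction scores plus compression scores.
    prob += (pulp.lpSum(mu[i] * rel_score[i] for i in range(M))
             + pulp.lpSum(s[i][j] * word_score[i][j]
                          for i in range(M) for j in range(N[i])))
    # Budget of J words for the whole summary (Eq. 19).
    prob += pulp.lpSum(s[i][j] for i in range(M) for j in range(N[i])) <= J
    for i in range(M):
        # Agreement between mu_i and s_i (Eq. 18).
        for j in range(N[i]):
            prob += s[i][j] <= mu[i]
        prob += mu[i] <= pulp.lpSum(s[i])
        # Sparseness constraint (Eq. 20): keep at least rho*N_i words or none.
        prob += pulp.lpSum(s[i]) >= rho * N[i] * mu[i]
    prob.solve()
    return [[j for j in range(N[i]) if s[i][j].value() > 0.5]
            for i in range(M) if mu[i].value() > 0.5]
```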

5 Experiments

5.1 Datasets, Evaluation and Environment

For our experiments, two datasets were used:

The DUC 2002 dataset. This is a collection of newswire articles, comprised of 59 document clusters. Each document within the collections (out of a total of 567 documents) has one or two manually created abstracts with approximately 100 words.⁵

Clarke's dataset for sentence compression. This is the dataset used by Clarke and Lapata (2008). It contains manually created compressions of 82 newspaper articles (1,433 sentences) from the British National Corpus and the American News Text corpus.⁶

To evaluate the sentence compressor alone, we measured the compression rate and the precision, recall, and F1-measure (both macro- and micro-averaged) with respect to the "gold" compressed sentences, calculated on unigrams.⁷

To evaluate the full system, we used Rouge-N (Lin and Hovy, 2002), a popular n-gram recall-based automatic evaluation measure. This score compares the summary produced by a system with one or more valid reference summaries.

All our experiments were conducted on a PC with an Intel dual-core processor at 2.66 GHz and 2 GB of RAM. We used ILOG CPLEX, a commercial integer programming solver. The interface with CPLEX was coded in Java.

⁵http://duc.nist.gov

⁶http://homepages.inf.ed.ac.uk/s0460084/data

⁷Notice that this evaluation score is not able to properly capture the grammaticality of the compression; this is a known issue that typically is addressed by requiring human judgments.
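As a small illustration of the compression metrics above, the helper below computes unigram precision, recall, and F1 of a predicted compression against the gold compression, plus the compression rate. Matching on token multisets is an assumption made here; the paper does not spell out the exact matching protocol.

```python
# Illustrative computation of unigram precision/recall/F1 for a predicted
# compression against a gold compression. Matching on token multisets is an
# assumption; the exact protocol is not specified in the paper.
from collections import Counter

def unigram_prf(pred_tokens, gold_tokens):
    pred, gold = Counter(pred_tokens), Counter(gold_tokens)
    overlap = sum((pred & gold).values())  # multiset intersection size
    p = overlap / len(pred_tokens) if pred_tokens else 0.0
    r = overlap / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def compression_rate(pred_tokens, original_tokens):
    # Fraction of the original words kept in the compression.
    return len(pred_tokens) / len(original_tokens)
```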

5.2 Sentence Compression

We split Clarke's dataset into two partitions, one used for training (1,188 sentences) and the other for testing (441 sentences). This dataset includes one manual compression for each sentence, which we use as reference for evaluation purposes. The compression ratio, i.e., the fraction of words included in the compressed sentences, is 69.32% (micro-averaged over the training partition).

For comparison, two baselines were implemented: a simple compressor based on Hedge Trimmer, the headline generation system of Dorr et al. (2003) and Zajic et al. (2006),⁸ and the discriminative model described by McDonald (2006), which captures "soft syntactic evidence" (we reproduced the same set of features). Both systems require a phrase-structure parser; we used Collins' parser (Collins, 1999);⁹ the latter system also derives features from a dependency parser; we used the MSTParser (McDonald et al., 2005).¹⁰

We implemented the two variants of our compressor described in §3.2.

NoBigram. This variant factors the compression score as a sum over individual scores, each depending on the inclusion or not of each word and its head in the compression (see Eq. 15). An upper bound of 70% was placed on the compression ratio. As stated in §3.2, inference amounts to solving an ILP with O(N) variables and constraints, N being the sentence length. We also used MSTParser to obtain the dependency parse trees.

Bigram. This variant includes an extra term standing for a bigram score, which factors as a sum over pairs of consecutive words. As in McDonald (2006), we include features that depend on the "in-between" words in the original sentence that are to be omitted in the compression.¹¹ As stated in §3.2, inference through this model can be done by solving an ILP with O(N²) variables and constraints.

For both variants, we used MSTParser to obtain the dependency parse trees. The model parameters are learned in a pure discriminative way through a max-margin approach. We used the 1-best MIRA algorithm (Crammer and Singer, 2003; McDonald et al., 2005) for training; this is a fast online algorithm that requires solving the inference problem at each step. Although inference amounts to solving an ILP, which in the worst case scales exponentially with the size of the sentence, training the model is in practice very fast for the NoBigram model (a few minutes in the environment described in §5.1) and fast enough for the Bigram model (a couple of hours using the same equipment). This is explained by the fact that sentences don't usually exceed a few tens of words, and because of the structure of the ILPs, whose constraint matrices are very sparse.

Table 1 depicts the micro- and macro-averaged precision, recall and F1-measure. We can see that both variants outperform the Hedge Trimmer baseline by a great margin, and are in line with the system of McDonald (2006); however, none of our variants employ a phrase-structure parser. We also observe that our simpler NoBigram variant, which uses a linear-sized ILP, achieves results similar to these two systems.

                  Compression   Micro-Av.                 Macro-Av.
                  Ratio         P       R       F1        P       R       F1
Hedge Trimmer     57.64%        0.7099  0.5925  0.6459    0.7195  0.6547  0.6367
McDonald (2006)   71.40%        0.7444  0.7697  0.7568    0.7711  0.7852  0.7696
NoBigram          71.20%        0.7399  0.7626  0.7510    0.7645  0.7730  0.7604
Bigram            71.35%        0.7472  0.7720  0.7594    0.7737  0.7848  0.7710

Table 1: Results for sentence compression in Clarke's test dataset (441 sentences) for our implementation of the baseline systems (Hedge Trimmer and the system described in McDonald, 2006), and the two variants of our model, NoBigram and Bigram. The compression ratio associated with the reference compressed sentences in this dataset is 69.06%. In the rightmost column, the statistically indistinguishable best results are emboldened, based on a paired t-test applied to the sequence of F1 measures (p < 0.01).

⁸Hedge Trimmer applies a deterministic compression procedure whose first step is to identify the lowest leftmost S node in the parse tree that contains a NP and a VP; this node is taken as the root of the compressed sentence (i.e., all words that are not spanned by this node are discarded). Further steps described by Dorr et al. (2003) include removal of low content units, and an "iterative shortening" loop that keeps removing constituents until a desired compression ratio is achieved. The best results were obtained without iterative shortening, which is explained by the fact that the selection of the lowest leftmost S node (first step of the algorithm) already provides significant compression, as illustrated in Table 1.

⁹http://people.csail.mit.edu/mcollins/code.html

¹⁰http://sourceforge.net/projects/mstparser

¹¹The major difference between this variant and the model of McDonald (2006) is that the latter employs "soft syntactic evidence" as input features, while we make the dependency relations part of the output features. All the non-syntactic features are the same. Apart from this, notice that our variant does not employ a phrase-structure parser.
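For concreteness, the following is a minimal sketch of a 1-best MIRA training loop of the kind mentioned above, assuming NumPy; decode(x, w), feats(x, y), and loss(y, y_gold) are placeholder functions standing for the inference ILP, the feature map, and a task cost (e.g., a Hamming loss over word-inclusion decisions).

```python
# Sketch of a 1-best MIRA training loop for the compressor, assuming NumPy.
# decode(x, w) solves the inference ILP under weights w; feats(x, y) and
# loss(y, y_gold) are placeholder feature and cost functions.
import numpy as np

def mira_train(data, feats, decode, loss, dim, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(x, w)                     # solve the inference ILP
            delta = feats(x, y_gold) - feats(x, y_hat)
            margin = float(np.dot(w, delta))
            cost = loss(y_hat, y_gold)
            denom = float(np.dot(delta, delta))
            # Smallest update that scores the gold output above the prediction
            # by a margin of at least its loss (standard 1-best MIRA step).
            if denom > 0:
                tau = max(0.0, (cost - margin) / denom)
                w = w + tau * delta
    return w
```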
5.3 Joint Compression and Extraction

For the summarization task, we split the DUC 2002 dataset into a training partition (427 documents) and a testing partition (140 documents). The training partition was further split into a training and a development set. We evaluated the performance of Lead, Rel, and MMR as baselines (all are described in §2). Weights for Rel were learned via the SVMRank algorithm;¹² to create a gold-standard ranking, we sorted the sentences by Rouge-2 score¹³ (with respect to the human created summaries). We include a Pipeline baseline as well, which ranks all sentences by relevance, then includes their compressions (using the Bigram variant) while they fit into the summary.

We tested two variants of our joint model, combining the Rel extraction model with (i) the NoBigram compression model (§3.2) and (ii) the Bigram variant. Each variant was trained with the procedure described in §4. To keep tractability, the inference ILP problem was relaxed (the binary constraints were relaxed to unit interval constraints) and non-integer solution values were rounded to produce a valid summary, both for training and testing.¹⁴ Whenever this procedure yielded a summary longer than 100 words, we truncated it to fit the word limit.

Table 2 depicts the results of each of the above systems in terms of Rouge-1 and Rouge-2 scores. We can see that both variants of our system are able to achieve the best results in terms of Rouge-1 and Rouge-2 scores. The suboptimality of extracting and compressing in separate stages is clear from the table, as Pipeline performs worse than the pure extractive systems.

                          Rouge-1            Rouge-2
Lead                      0.384 ± 0.080      0.177 ± 0.083
Rel                       0.389 ± 0.074      0.178 ± 0.080
MMR, λ = 0.25             0.392 ± 0.071      0.178 ± 0.077
Pipeline                  0.380 ± 0.073      0.173 ± 0.073
Rel + NoBigr, η = 1.5     0.403 ± 0.080      0.180 ± 0.082
Rel + Bigr, η = 4.0       0.403 ± 0.076      0.180 ± 0.076

Table 2: Results for sentence extraction in the DUC 2002 dataset (140 documents). Bold indicates the best results with statistical significance, according to a paired t-test (p < 0.01); Rouge-2 scores of all systems except Pipeline are indistinguishable according to the same test, with p > 0.05.

¹²SVMRank is implemented in the SVMlight toolkit (Joachims, 1999), http://svmlight.joachims.org.

¹³A similar system was implemented that optimizes the Rouge-1 score instead, but it led to inferior performance.

¹⁴See Martins et al. (2009) for a study concerning the impact of LP relaxations in the learning problem.
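The rounding-and-truncation post-processing described above is simple enough to sketch directly; the names below are illustrative, with mu_vals and s_vals standing for the (possibly fractional) solution values of the relaxed ILP.

```python
# Sketch of the post-processing described above: round fractional solution
# values to 0/1 and truncate the resulting summary to the 100-word limit.
# mu_vals, s_vals, and sentences are illustrative names for the relaxed
# solution values and the tokenized input sentences.

def round_and_truncate(mu_vals, s_vals, sentences, budget=100):
    words = []
    for i, sent in enumerate(sentences):
        if mu_vals[i] < 0.5:          # sentence not selected after rounding
            continue
        kept = [w for j, w in enumerate(sent) if s_vals[i][j] >= 0.5]
        words.extend(kept)
    return words[:budget]             # enforce the word limit
```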
We also note that the configuration Rel + Bigram is not able to outperform Rel + NoBigram, despite being computationally more expensive (about 25 minutes to process the whole test set, against the 7 minutes taken by the Rel + NoBigram variant). Fig. 2 exemplifies the summaries produced by our system. We see that both variants were able to include new pieces of information in the summary without sacrificing grammaticality.

These results suggest that our system, being capable of performing joint sentence extraction and compression to summarize a document, offers a powerful alternative to pure extractive systems. Finally, we note that no labeled datasets currently exist on which our full model could have been trained with supervision; therefore, although inference is performed jointly, our training procedure had to learn separately the extraction and the compression models, and to tune a scalar parameter to trade off the two models. We conjecture that a better model could have been learned if a labeled dataset with extracted, compressed sentences existed.

MMR baseline:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
Carey, who lives in Sydney with his wife and son, said in a brief speech that like the other five finalists he had been asked to attend with a short speech in his pocket in case he won.

Rel + NoBigram:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
The judges made their selection from 102 books published in Britain in the past 12 months and which they read in their homes.
Carey, who lives in Sydney with his wife and son, said in a brief speech that like the other five finalists he had been asked to attend with a short speech in his pocket in case he won.

Rel + Bigram:
Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda".
A panel of five judges unanimously announced the award of the $26,250 prize after an 80-minute deliberation during a banquet at London's ancient Guildhall.
He was unsuccessful in the prize competition in 1985 when his novel, "Illywhacker," was among the final six.
Carey called the award a "great honor" and he thanked the prize sponsors for "provoking so much passionate discussion about literature perhaps there will be more tomorrow".
Carey was the only non-Briton in the final six.

Figure 2: Summaries produced by the strongest baseline (MMR) and the two variants of our system. Deleted words are marked as such.

6 Conclusion and Future Work

We have presented a summarization system that performs sentence extraction and compression in a single step, by casting the problem as an ILP. The summary optimizes an objective function that includes both extraction and compression scores. Our model encourages "sparse" summaries that involve only a few sentences. Experiments in newswire data suggest that our system is a valid alternative to existing extraction-based systems. However, it is worth noting that further evaluation (e.g., human judgments) needs to be carried out to assert the quality of our summaries, e.g., their grammaticality, something that the Rouge scores cannot fully capture.

Future work will address the possibility of including linguistic features and constraints to further improve the grammaticality of the produced summaries.

Another straightforward extension is the inclusion of a redundancy term and a query relevance term in the objective function. For redundancy, a similar idea to that of McDonald (2007) can be applied, yielding an ILP with O(M² + N) variables and constraints (M being the number of sentences and N the total number of words). However, such a model will take into account the redundancy among the original sentences and not their compressions; to model the redundancy across compressions, a possibility is to consider a linear redundancy score (similar to cosine similarity, but without the normalization), which would result in an ILP with O(N + Σ_i P_i²) variables and constraints, where P_i ≤ M is the number of sentences in which word w_i occurs; this is no worse than O(M²N).

We also intend to model discourse, which, as shown by Daumé and Marcu (2002), plays an important role in document summarization. Another future direction is to extend our ILP formulations to more sophisticated models that go beyond word deletion, like the ones proposed by Cohn and Lapata (2008).

Acknowledgments

The authors thank the anonymous reviewers for helpful comments, Yiming Yang for interesting discussions, and Dipanjan Das and Sourish Chaudhuri for providing their code. This research was supported by a grant from FCT through the CMU-Portugal Program and the Information and Communications Technologies Institute (ICTI) at CMU, and also by Priberam Informática.
References

C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen. 1999. A trainable summarizer with knowledge acquired from robust NLP techniques. In Advances in Automatic Text Summarization. MIT Press.

P. B. Baxendale. 1958. Machine-made index for technical literature—an experiment. IBM Journal of Research Development, 2(4):354–361.

J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR.

J. Clarke and M. Lapata. 2008. Global inference for sentence compression: an integer linear programming approach. JAIR, 31:399–429.

T. Cohn and M. Lapata. 2008. Sentence compression beyond word deletion. In Proc. of COLING.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

K. Crammer and Y. Singer. 2003. Ultraconservative online algorithms for multiclass problems. J. Mach. Learn. Res., 3:951–991.

H. Daumé and D. Marcu. 2002. A noisy-channel model for document compression. In Proc. of ACL.

H. Daumé. 2006. Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, University of Southern California.

B. Dorr, D. Zajic, and R. Schwartz. 2003. Hedge Trimmer: A parse-and-trim approach to headline generation. In Proc. of the HLT-NAACL Text Summarization Workshop and DUC.

H. P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM, 16(2):264–285.

R. Jin. 2003. Statistical Approaches Toward Title Generation. Ph.D. thesis, Carnegie Mellon University.

T. Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods – Support Vector Learning. MIT Press.

K. Knight and D. Marcu. 2000. Statistics-based summarization—step one: Sentence compression. In Proc. of AAAI/IAAI.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A trainable document summarizer. In Proc. of SIGIR.

C.-Y. Lin and E. Hovy. 2002. Manual and automatic evaluation of summaries. In Proc. of the ACL Workshop on Automatic Summarization.

C.-Y. Lin. 2003. Improving summarization performance by sentence compression—a pilot study. In Proc. of the Int. Workshop on Inf. Ret. with Asian Languages.

H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research Development, 2(2):159–165.

A. F. T. Martins, N. A. Smith, and E. P. Xing. 2009. Polyhedral outer approximations with application to natural language parsing. In Proc. of ICML.

R. T. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.

R. McDonald. 2006. Discriminative sentence compression with soft syntactic constraints. In Proc. of EACL.

R. McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proc. of ECIR.

M. Osborne. 2002. Using maximum entropy for sentence extraction. In Proc. of the ACL Workshop on Automatic Summarization.

D. R. Radev and K. McKeown. 1998. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469–500.

D. R. Radev, H. Jing, and M. Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proc. of the NAACL-ANLP Workshop on Automatic Summarization.

D. Zajic, B. Dorr, J. Lin, and R. Schwartz. 2006. Sentence compression as a component of a multi-document summarization system. In Proc. of the ACL DUC Workshop.