Latent Semantic Analysis and the Construction of Coherent Extracts

Tristan Miller
German Research Center for Artificial Intelligence*
Erwin-Schrödinger-Straße 57, D-67663 Kaiserslautern
[email protected]

*The research described in this paper was carried out while the author was at the University of Toronto.

Keywords: automatic summarization, latent semantic analysis, LSA, coherence, extracts

Abstract

We describe a language-neutral automatic summarization system which aims to produce coherent extracts. It builds an initial extract composed solely of topic sentences, and then recursively fills in the topical lacunae by providing linking material between semantically dissimilar sentences. While experiments with human judges did not prove a statistically significant increase in textual coherence with the use of a latent semantic analysis module, we found a strong positive correlation between coherence and overall summary quality.

1 Introduction

A major problem with automatically-produced summaries in general, and extracts in particular, is that the output text often lacks fluency and organization. Sentences often leap incoherently from topic to topic, confusing the reader and hampering his ability to identify information of interest. Interest in producing textually coherent summaries has consequently increased in recent years, leading to a wide variety of approaches, including IR-influenced techniques (Salton et al. [1997]; Carbonell and Goldstein [1998]), variations on lexical chaining (Brunn et al. [2002]; Karamuftuoglu [2002]), and discourse structure analysis (Marcu [1997, 1999]; Chan et al. [2000]). Unfortunately, many of these techniques are tied to a particular language or require resources such as a list of discourse keywords and a manually marked-up corpus; others are constrained in the type of summary they can generate (e.g., general-purpose vs. query-focussed).

In this paper, we present a new, recursive method for automatic text summarization which aims to preserve both the topic coverage and the coherence of the source document, yet has minimal reliance on language-specific NLP tools. Only word- and sentence-boundary detection routines are required. The system produces general-purpose extracts of single documents, though it should not be difficult to adapt the technique to query-focussed summarization, and may also be of use in improving the coherence of multi-document summaries.

2 Latent semantic analysis

Our system fits within the general category of IR-based systems, but rather than comparing text with the standard vector-space model, we employ latent semantic analysis (LSA) [Deerwester et al., 1990], a technique originally developed to circumvent the problems of synonymy and polysemy in IR. LSA extends the traditional vector-space document model with singular value decomposition, a process by which the term–sentence co-occurrence matrix representing the source document is factored into three smaller matrices of a particular form. One such matrix is a diagonal matrix of singular values; when one or more of the smallest singular values are deleted and the three matrices multiplied together, the product is a least-squares best fit to the original matrix. The apparent result of this smearing of values is that the approximated matrix has captured the latent transitivity relations among terms, allowing for identification of semantically similar sentences which share few or no common terms withal. We believe that the deep semantic relations discovered by LSA may assist in the identification and correction of abrupt topic shifts between sentences.
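As an illustration only (this is our own minimal sketch in Python with numpy, not the original system's code), a reduced-rank sentence-similarity matrix of the kind described above can be computed as follows; the function name and the choice to represent each sentence by a row of (Σ_k V_kᵀ)ᵀ are our own.

    import numpy as np

    def lsa_sentence_similarity(term_sentence_matrix, k):
        # Factor the term-sentence matrix A into U diag(s) Vt,
        # with singular values in descending order.
        U, s, Vt = np.linalg.svd(term_sentence_matrix, full_matrices=False)
        # Delete all but the k largest singular values (the "smearing"
        # step) and take one reduced-dimensionality vector per sentence.
        sentence_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T
        norms = np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
        unit = sentence_vecs / np.clip(norms, 1e-12, None)
        return unit @ unit.T   # entry (i, j) is the LSA cosine similarity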
3 Algorithm

The input to our summarizer is a plain text document, which is converted into a list of tokenized sentences. A tokenizer and sentence-boundary disambiguation algorithm may be used for these first steps.

The list of m sentences (indexed from 1 to m) is then segmented into linearly discrete topics. This can be done manually if the original document is structured (e.g., a book with chapters, or an article with sections), or a linear text segmentation algorithm, such as C99 [Choi, 2000], can be used. The output of this step is a list of sentence indices ⟨t_1, ..., t_{n+1}⟩, where, for the ith of the n topics,

$$t_i = \begin{cases} m + 1 & \text{if } i = n + 1;\\ \text{index of the first sentence of the $i$th topic} & \text{otherwise.} \end{cases}$$

As mentioned previously, we use LSA to measure semantic similarity, so before we can begin constructing the extract, we need to construct a reduced-dimensionality term–sentence co-occurrence matrix. Once this is done, a preliminary extract is produced by choosing a representative "topic sentence" from each segment—that is, that sentence which has the highest semantic similarity to all other sentences in its topic segment. These topic sentences correspond to a list of sentence indices ⟨r_1, ..., r_n⟩ such that

$$r_i = \operatorname*{arg\,max}_{t_i \le j < t_{i+1}} \sum_{k=t_i}^{t_{i+1}-1} \mathrm{sim}(j, k),$$

where sim(x, y) ∈ [−1, 1] is the LSA cosine similarity score for the sentences with indices x and y. In order to preserve important information which may be found at the beginning of the document, and also to account for the possibility that the document contains only one topic segment, we always consider the first sentence of the document to be a topic sentence—i.e., r_0 = 1—and include it in our initial extract.¹ Let us refer to this initial extract as E_0 = ⟨e_{0,1}, ..., e_{0,n+1}⟩ where e_{0,i} = r_{i−1}.

As we might imagine, this basic extract will have very poor coherence, since every sentence addresses a completely different topic. However, we can improve its coherence by selecting from the set ⟨1, ..., m⟩ \ E_0 a number of indices for "glue" sentences between adjacent pairs of sentences represented in E_0. We consider an appropriate glue sentence between two others to be one which occurs between them in the source document, and which is semantically similar to both. Thus we look for sentence indices G_1 = ⟨g_{1,1}, ..., g_{1,n}⟩ such that

$$g_{1,i} = \operatorname*{arg\,max}_{e_{0,i} < j < e_{0,i+1}} f\left(\mathrm{sim}'(j, e_{0,i}),\, \mathrm{sim}'(j, e_{0,i+1})\right),$$

where

$$\mathrm{sim}'(x, y) = \begin{cases} 0 & \text{if } \mathrm{sim}(x, y) > \alpha;\\ 0 & \text{if } \mathrm{sim}(x, y) < 0;\\ \mathrm{sim}(x, y) & \text{otherwise,} \end{cases}$$

for α ∈ [0, 1]. The purpose of f() is to reward glue sentences which are similar to their boundary sentences, but to penalize if the similarity is too biased in favour of only one of the boundaries. The revised similarity measure sim′() ensures that we do not select a glue sentence which is nearly equivalent to any one boundary—such a sentence is redundant. (Of course, useful values of α will be 1 or close thereto.)

Once we have G_1, we can construct a revised extract E_1 = ⟨e_{1,1}, ..., e_{1,2n+1}⟩ = ⟨E_0 ∪ G_1⟩.² The gluing process can then be applied recursively to produce successively more coherent extracts E_2, E_3, and so on. Recursion cannot continue indefinitely, however: we may reach a point at which some e_{i,j} = e_{i,j+1} − 1, thus precluding the possibility of finding any further glue sentences between them.

¹In practice, it may be the case that r_1 = 1, in which case inclusion of r_0 is not necessary. In this paper we assume, without loss of generality, that r_1 ≠ 1.

²For notational convenience, we take it as understood that the sentence indices in the extracts E_i are sorted in ascending order.
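A compact Python rendering of the selection rules above may make them easier to follow. This is a sketch under our own naming and 0-based indexing, not the original implementation; in particular, since the exact form of f() is left open above, it is passed in as a parameter (a plausible choice satisfying the stated desiderata would be f(x, y) = x·y).

    def topic_sentences(sim, topic_bounds):
        # One representative sentence per segment: the sentence with the
        # highest total similarity to the others in its segment.  sim is
        # a precomputed similarity matrix (e.g., from LSA); topic_bounds
        # is the list <t_1, ..., t_{n+1}> of segment start indices.
        reps = []
        for a, b in zip(topic_bounds, topic_bounds[1:]):   # segment is [a, b)
            reps.append(max(range(a, b),
                            key=lambda j: sum(sim[j][k] for k in range(a, b))))
        return reps

    def sim_prime(sim, x, y, alpha):
        # Clipped similarity: zero out negative scores and scores above
        # alpha, so near-duplicates of a boundary are never rewarded.
        s = sim[x][y]
        return 0.0 if (s < 0 or s > alpha) else s

    def best_glue(sim, left, right, alpha, f):
        # Best glue sentence strictly between two extract sentences,
        # scored by f over the clipped similarities to both boundaries.
        return max(range(left + 1, right),
                   key=lambda j: f(sim_prime(sim, j, left, alpha),
                                   sim_prime(sim, j, right, alpha)))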
We may also encounter the case where, for all k between e_{i,j} and e_{i,j+1}, f(sim′(k, e_{i,j}), sim′(k, e_{i,j+1})) is so low that the extract's coherence would not be significantly improved by the addition of an intermediary sentence. Or, we may find that the sentences with indices e_{i,j} and e_{i,j+1} are themselves so similar that no glue is necessary. Finally, it is possible that the user wishes to constrain the size of the extract to a certain number of sentences, or to a fixed percentage of the original document's length. The first of these stopping conditions is straightforward to account for; the next two can be easily handled by introducing two fixed thresholds β and γ: when the similarity between adjacent sentences from E_i exceeds β, or when the value of f() falls below γ, no glue sentence is suggested for the pair in question.

The case of maximum summary length is a bit trickier. If we are not concerned about undershooting the target length ℓ, then we can simply halt the algorithm once |E_i| ≥ ℓ, and then take E_{i−1} (or E_i, if |E_i| = ℓ) as the final extract. Most real-world applications, however, demand that we maximize the extract size. Given E_{i−1} of length ℓ − p, the optimal extract E of length ℓ is the one which glues together the p largest gaps in E_{i−1}. A version of the gluing algorithm which takes into account all four stopping conditions is shown in Algorithm 1.

Algorithm 1: glue()
    input:        initial extract E, maximum extract length ℓ
    output:       largest coherent extract of length ≤ ℓ
    precondition: |E| < ℓ
    assumption:   lists are kept sorted in ascending order; where list
                  elements are coordinate pairs, the sorting key is the
                  first coordinate
    G ← ⟨⟩
    for i ← 1 to |E| − 1 do
        s ← sim(E[i], E[i+1])
        if E[i] = E[i+1] − 1 or s > β then continue
        g ← arg max_{E[i] < j < E[i+1]} f(sim′(j, E[i]), sim′(j, E[i+1]))
        if f(sim′(g, E[i]), sim′(g, E[i+1])) ≥ γ then
            G ← G ∪ ⟨(f(sim′(g, E[i]), sim′(g, E[i+1])), g)⟩
    end
    if G = ⟨⟩ then return E
    else if |E| + |G| ≥ ℓ then
        return E ∪ ⋃_{i=|E|+|G|−ℓ+1}^{|G|} ⟨x | (y, x) ∈ G[i]⟩
    else return glue(E ∪ ⟨x | (y, x) ∈ G⟩, ℓ)
    end

Once the final set of sentences for the extract has been selected, we send the sentences, in their original order of occurrence, to the topic segmenter. The discovered topic segments are then used by a simple text formatter to partition the summary into sections or paragraphs for easy reading.
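Continuing the illustrative Python sketch from above (again, our own rendering rather than the original code), one recursion of Algorithm 1, with all four stopping conditions, might look as follows; it relies on the sim_prime() and best_glue() helpers defined earlier.

    def glue(sim, extract, max_len, alpha, beta, gamma, f):
        # One recursion of the gluing algorithm; extract is a sorted
        # list of sentence indices.
        glue_pairs = []   # (score, sentence index)
        for left, right in zip(extract, extract[1:]):
            if right - left == 1 or sim[left][right] > beta:
                continue   # no room between the pair, or no glue needed
            g = best_glue(sim, left, right, alpha, f)
            score = f(sim_prime(sim, g, left, alpha),
                      sim_prime(sim, g, right, alpha))
            if score >= gamma:   # ignore gaps that glue cannot improve
                glue_pairs.append((score, g))
        if not glue_pairs:
            return extract
        if len(extract) + len(glue_pairs) >= max_len:
            # Length cap reached: keep only the best-scoring glue sentences.
            kept = sorted(glue_pairs)[len(extract) + len(glue_pairs) - max_len:]
            return sorted(extract + [g for _, g in kept])
        return glue(sim, sorted(extract + [g for _, g in glue_pairs]),
                    max_len, alpha, beta, gamma, f)

With the parameter values reported in Section 4.1 (α = 0.9, β = 1.0, γ = 0.1), a call might look like glue(sim, E0, max_len, 0.9, 1.0, 0.1, lambda x, y: x * y), where the multiplicative f is, again, only our assumed stand-in.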
3.1 Complexity analysis

Given an initial extract of length n, the first recursion of Algorithm 1 will add at most n − 1 sentences to the extract, yielding a new extract of length at most 2n − 1. To produce an extract of length ℓ > n, the algorithm needs to recurse at least $\left\lceil \log_2 \frac{\ell + 1}{n} \right\rceil$ times. The worst case occurs when n = 2 and the algorithm always selects a glue sentence which is adjacent to one of the boundary sentences (with indices e_1 and e_2). In this case, the algorithm must recurse min(ℓ, e_2 − e_1) times, which is limited by the source document length, m.

On each recursion i of the algorithm, the main loop considers at most m − 2^i n − 1 candidate glue sentences, comparing each one with two of the 2^i n − 1 sentences already in the extract. To simplify matters, we note that 2^i n − 1 can never exceed m, so the number of comparisons must be, at worst, proportional to m. The comparison function, sim(), runs in time proportional to the number of word types, w, in the original document. Thus an upper bound on the time complexity of a naïve implementation of Algorithm 1 is O(wm²).

Running time can be cut down considerably in the general case, however. Since sim(i, j) remains constant, we can save time by precomputing a triangular similarity matrix of all pairs of sentences in the document, or better yet, by using memoization (i.e., caching intersentential similarity values as they are computed). The algorithm could be further improved by having the loop skip over adjacent extract sentences for which no glue was found on a previous recursion. At any rate, the running time of the summarizer as a whole will likely be dominated by the singular value decomposition step of the LSA stage (at least O(wm²)) and possibly too by the topic segmenter (for C99, also O(wm²)).
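As a sketch of the memoization idea (our illustration; the original implementation is not shown here), a cached similarity function over the LSA sentence vectors of Section 2 might be written as follows; in the earlier sketches, the matrix lookups sim[i][j] would simply be replaced by calls to this function.

    from functools import lru_cache

    import numpy as np

    def make_cached_sim(sentence_vecs):
        # Returns a sim(i, j) function whose cosine scores are computed
        # on first use and cached thereafter (memoization).
        @lru_cache(maxsize=None)
        def sim(i, j):
            if i > j:
                i, j = j, i   # exploit symmetry: sim(i, j) = sim(j, i)
            u, v = sentence_vecs[i], sentence_vecs[j]
            denom = float(np.linalg.norm(u) * np.linalg.norm(v))
            return float(u @ v) / denom if denom > 0 else 0.0
        return sim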

4 Evaluation

In general there are two approaches to evaluating summaries: intrinsic evaluations, which rate the summary in and of itself, and extrinsic evaluations, which test the summary in relation to some other task [Spärck Jones and Galliers, 1996]. Popular intrinsic approaches include quality evaluation, where human graders grade the summary in isolation on the basis of relevance, grammaticality, readability, etc.; and gold-standard comparison, where the summary is compared (by humans or automatically) with an "ideal" summary. Extrinsic methods are usually domain- or query-dependent, but two popular methods which are relatively generic are relevance assessment, where the summarizer acts as the back-end to an information retrieval system, and reading comprehension, where the summaries are used as input to a question-answering task. In both cases the idea is to compare performance on the task given the summaries versus the whole documents.

Though it could be argued that reading comprehension is somewhat dependent on coherence, almost all evaluation methods are designed primarily to assess topic coverage and information relevance. This may be because, to date, researchers have concentrated on evaluation of highly-compressed summaries, where coherence necessarily takes a back seat to topic coverage. Another reason why coherence is not measured directly is the dearth of good, automatable evaluation metrics for the trait. One approach commonly used in essay assessment [Miller, 2003a] is to average the semantic similarity (using the cosine coefficient, with or without LSA) of all adjacent sentence pairs. This technique is not appropriate for our algorithm because, by definition, its summaries are guaranteed to have good intersentential cosine scores. This approach has the additional disadvantage of rewarding redundancy.
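For concreteness, the adjacent-pair metric just described can be stated as follows (our formulation of the cited approach): for a summary consisting of sentences s_1, ..., s_k,

$$\mathrm{coherence} = \frac{1}{k-1} \sum_{i=1}^{k-1} \mathrm{sim}(s_i, s_{i+1}),$$

a score on which our glue-based extracts would do well by construction, independent of their actual quality.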
A more recent approach to automated coherence assessment is to check for the presence or absence of discourse relations [Marcu, 2000]. The problem with this approach is that the vast majority of discourse relations are not signalled by an obvious discourse marker [Marcu and Echihabi, 2002].

Since we also could not come up with a new task-based evaluation which would measure coherence in isolation, we felt we were left with no choice but to use the intrinsic method of quality evaluation. We therefore recruited human judges to provide ratings for our summaries' coherence, and, for the sake of convenience and simplicity, we also used them to assess other aspects of summary quality.

4.1 Experiment

Source data  We had hoped to use the TIPSTER documents commonly used in summary evaluations at the annual Document Understanding Conference (DUC). However, most of them were very short and focussed on single, narrow topics, making them unsuitable for an evaluation of summary coherence. We therefore randomly selected one 1000-word and one 2000-word article from a current encyclopedia, plus one of the five longest newspaper articles from the DUC 2001 trial data.

Comparison systems  On the basis of our own informal observations, we determined that our system (hereinafter lsa) performed best with a retention of 20–30% of the singular values and thresholds of α = 0.9, β = 1.0, and γ = 0.1. More parsimonious cutoffs tended to result in summaries greatly in deficit of the allowed length.

We selected four third-party comparison systems based on their availability and similarity to our own technique and/or goals: Microsoft Word, commonly available and therefore an oft-used benchmark; Lal and Rüger [2002], a Bayesian classifier summarizer intended to assist students with reading comprehension; Copernic, a commercial summarizer based partly on the work of Turney [2000]; and Sinope (formerly Sumatra), which, like lsa, employs a technique for identifying latent semantic relations [Lie, 1998]. In our results tables we refer to these systems as word, plal, copernic, and sinope, respectively.

In order to measure the contribution of LSA to our system's performance, we also employed a version of our summarizer (nolsa) which does not use the singular value decomposition module.

Baselines  There are two popular methods for constructing baseline extracts of a given length, both of which are used in our study. The first (random) is to randomly select n sentences from the document and present them in their original order of appearance. The second (init), based on the observation that important sentences are usually located at the beginning of paragraphs, is to select the initial sentence of each of the first n paragraphs.

Test procedure  We ran the eight summarizers on the three source documents twice each—once to produce a "short" summary (around 100 words) and once to produce a "long" summary (around 300 words). We then recruited human judges who self-identified as fluent in English, the language of the source documents. The judges were provided with these documents and the 48 summaries, grouped according to source document and summary length. Within each document–summary length group, the summaries were labelled only with a random number and were presented in random order. We asked the judges to read each source document and then assign to each of its summaries an integer score ranging from 1 (very poor) to 5 (very good) on each of three dimensions: comprehensiveness (i.e., topic coverage), coherence, and overall quality. The judges were given the compression ratio for each summary and told to take it under consideration when assigning their ratings.

4.2 Results

4.2.1 Interjudge agreement

To compare interjudge agreement, we computed correlation matrices for each of the coherence, comprehensiveness, and overall quality ratings. Interjudge agreement on coherence was generally low, with the mean Pearson correlation coefficient r ranging from 0.0672 to 0.3719. Agreement on comprehensiveness and quality was better, but still only moderate, with r in the ranges [0.2545, 0.4660] and [0.2250, 0.4726], respectively. Why the correlation is only moderate is difficult to explain, though given the similarly low agreement in the DUC 2001 evaluations [Lin and Hovy, 2002], it was not entirely unexpected. Though we had made an effort to narrowly define coherence in the written instructions to the judges, it is possible that some of them nevertheless conflated the term with its more conventional meaning of intelligibility, or with cohesion. As discussed in Miller [2003b], this last possibility seems to be supported by the judges' written comments.
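A sketch of this agreement computation (our illustration; the paper specifies no tooling for it beyond SAS) using numpy:

    import numpy as np

    def interjudge_agreement(ratings):
        # ratings: array of shape (num_judges, num_summaries) holding one
        # dimension's scores (coherence, comprehensiveness, or quality).
        corr = np.corrcoef(ratings)             # judge-by-judge Pearson matrix
        off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
        return corr, off_diagonal.mean()        # matrix plus mean pairwise r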
4.2.2 Comparative performance of summarizers

We used SAS to perform a three-way repeated-measures analysis of variance (ANOVA) for each of the three dimensions: coherence, comprehensiveness, and overall quality. Quite unexpectedly, the (document, summary length, summarizer) three-way interaction effect was significant at the 0.05 confidence level for all three dimensions (p = 0.0151, p < 0.0001, and p = 0.0002, respectively). This means it would have been very difficult, if not impossible, to make any generalizations about the performance of the individual summarizers. On the assumption that the type of document was irrelevant to summarizer performance, we added the document scores for each (summarizer, summary length, rater) triplet to get new coherence, comprehensiveness, and overall quality measurements in the range [3, 15]. We then performed two-way repeated-measures ANOVAs for each dimension. The two-way interaction effect was still significant for comprehensiveness (p = 0.0025) and overall quality (p = 0.0347), but not for coherence (p = 0.6886).

Coherence  In our coherence ANOVA, the only significant effect was the summarizer (p < 0.0001). That summary length was not found to be significant (p = 0.0806) is somewhat surprising, since we expected a strong positive correlation between the coherence score and the compression ratio. Though we did ask our judges to account for the summary length when assigning their scores, we did not think that very short extracts could maintain the same level of coherence as their longer counterparts. It may be that summary length's effect on coherence is significant only for summaries with much higher compression ratios than those used in our study.

With respect to the comparative performance of the summarizers, only 7 of the 28 pairwise comparisons from our ANOVA were significant at the 0.05 confidence level. The initial-sentences baseline was found to perform significantly better than every other summarizer (p ≤ 0.0008; all p values from here on are Tukey-adjusted) except copernic and plal. The only other significant result we obtained for coherence was that the sinope summarizer performed worse than copernic (p = 0.0050) and plal (p = 0.0005). Using these pairwise comparisons, we can partition the summarizers into three overlapping ranks, as shown in Table 1.

Rank(s)  Summarizer  Mean rating
A        init        11.1111
AB       plal         9.9722
AB       copern       9.6667
CB       word         8.9444
CB       lsa          8.7222
CB       nolsa        8.6667
CB       random       8.4722
C        sinope       7.7500

Table 1: Summarizer coherence rankings

Comprehensiveness and overall quality  The mean comprehensiveness score for long summaries was higher than that for short summaries by a statistically significant 1.9792 (p < 0.0001, α = 0.05). In fact, in no case did any summarizer produce a short summary whose mean score exceeded that of the long summary for the same document. This could be because none of the short summaries covered as many topics as our judges thought they could have, or because the judges did not or could not completely account for the compression level. In order to resolve this question, we would probably need to repeat the experiment with abstracts produced by human experts, which presumably have optimal comprehensiveness at any compression ratio.

Likewise, the overall quality scores were dependent not only on the summarizer but also on the summary length, but it is not clear whether this is because our judges did not factor in the compression ratio, or because they genuinely believed that the shorter summaries were not as useful as they could have been for their size.

As with coherence, we can partition the summarizers into overlapping ranks based on their statistically significant scores. Because the (summary length, summarizer) interaction was significant, we produce separate rankings for short and long summaries. (See Tables 2 and 3.)

4.2.3 Relationship among dimensions

Intuition tells us that the overall quality of a summary depends in part on both its topic flow and its topic coverage. To see if this assumption is borne out in our data, we calculated the Pearson correlation coefficient for our 864 pairs of coherence–overall quality ratings and comprehensiveness–overall quality ratings. The correlation between coherence and overall quality was strong at r = 0.6842, and statistically significant (t = 27.55) below the 0.001 confidence level. The comprehensiveness–overall quality correlation was also quite strong (r = 0.7515, t = 33.44, α < 0.001).

4.3 Analysis

Unfortunately, moderate to low interjudge agreement for all three dimensions, coupled with an unexpected three-way interaction between the summarizers, the source documents, and the compression ratio, stymied our attempts to make high-level, clear-cut comparisons of summarizer performance. The statistically significant results we did obtain have confirmed what researchers in automatic summarization have known for years: that it is very hard to beat the initial-sentences baseline. This baseline consistently ranked in the top category for every one of the three summary dimensions we studied. While the copern and plal systems sometimes had higher mean ratings than init, the difference was never statistically significant.

The performance of our own systems was unremarkable; they consistently placed in the second of the two or three ranks, and only once in the first as well. Though one of the main foci of our work was to measure the contribution of the LSA metric to our summarizer's performance, we were unable to prove any significant difference between the mean scores for our summarizer and its non-LSA counterpart. The two systems consistently placed in the same rank for every dimension we measured, with mean ratings differing by no more than 6%. As a case study in Miller [2003b] suggests, this nebulous result may be due more to the LSA summarizer's unfortunate choice of topic sentences than to its gluing process, which actually seemed to perform well with the material it was given.

5 Conclusion

Our goal in this work has been to investigate how we can improve the coherence of automatically-produced extracts. We developed and implemented an algorithm which builds an initial extract composed solely of topic sentences, and then fills in the lacunae by providing linking material between semantically dissimilar sentences. In contrast with much of the previous work we reviewed, our system was designed to minimize reliance on language-specific features.

Our study revealed few clearly-defined distinctions among the summarization systems we reviewed, and no significant benefit to using LSA with our algorithm. Though our evaluation method for coherence was intended to circumvent the limitations of automated approaches, the use of human judges introduced its own set of problems, foremost of which was the low interjudge agreement on what constitutes a fluent summary. Despite this lack of consensus, we found a strong positive correlation between the judges' scores for coherence and overall summary quality. We would like to take this as good evidence that the production of coherent summaries is an important research area within automatic summarization. However, it may be that humans simply find it too difficult to evaluate coherence in isolation, and end up using other aspects of summary quality as a proxy measure.

Acknowledgments

This research was supported in part by NSERC of Canada. Thanks to Graeme Hirst and Gerald Penn for their advice.
Short summaries                          Long summaries
Rank(s)  Summarizer  Mean rating         Rank(s)  Summarizer  Mean rating
A        copern      10.0556             A        plal        11.9444
A        plal         9.6667             AB       copern      10.5556
AB       init         8.5556             AB       init        10.2222
AB       nolsa        8.1111             B        sinope       9.6667
B        lsa          7.5556             B        word         9.6111
CB       sinope       7.0000             B        random       9.2222
CB       word         6.9444             B        lsa          8.9444
C        random       5.3889             B        nolsa        8.9444

Table 2: Summarizer comprehensiveness rankings

Short summaries                          Long summaries
Rank(s)  Summarizer  Mean rating         Rank(s)  Summarizer  Mean rating
A        copern       9.7222             A        plal        11.1667
AB       init         9.4444             AB       init        10.2778
AB       plal         9.0556             AB       copern       9.9444
AB       nolsa        7.5000             AB       word         9.2222
CB       lsa          7.3333             AB       lsa          9.0556
C        word         6.9444             B        random       8.5000
C        sinope       6.7778             B        nolsa        8.3333
C        random       5.5556             B        sinope       8.1667

Table 3: Summarizer overall quality rankings

References

Meru Brunn, Yllias Chali, and Barbara Dufour. UofL summarizer at DUC 2002. In Workshop on Automatic Summarization, ACL 2002, volume 2, pages 39–44, July 2002.

J. G. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR 98, pages 335–336, August 1998.

W. K. Chan, T. B. Y. Lai, W. J. Gao, and B. K. T'sou. Mining discourse markers for Chinese textual summarization. In Workshop on Automatic Summarization, ACL 2000, pages 11–20, 2000.

Freddy Choi. Advances in domain-independent linear text segmentation. In NAACL 2000 and the 6th ACL Conference on Applied NLP, pages 26–33, April 2000.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

Murat Karamuftuoglu. An approach to summarisation based on lexical bonds. In Workshop on Automatic Summarization, ACL 2002, volume 2, pages 86–89, July 2002.

Partha Lal and Stefan Rüger. Extract-based summarization with simplification. In Workshop on Automatic Summarization, ACL 2002, volume 2, pages 90–96, July 2002.

D. H. Lie. Sumatra: a system for automatic summary generation. In 14th Twente Workshop on Language Technology, December 1998.

Chin-Yew Lin and Eduard Hovy. Manual and automatic evaluation of summaries. In Workshop on Automatic Summarization, ACL 2002, volume 1, pages 45–51, July 2002.

Daniel Marcu. From discourse structures to text summaries. In Workshop on Intelligent Scalable Text Summarization, ACL/EACL 1997, pages 82–88, July 1997.

Daniel Marcu. Discourse trees are good indicators of importance in text. In Inderjeet Mani and Mark T. Maybury, editors, Advances in Automatic Text Summarization, pages 123–136. MIT Press, Cambridge, 1999.

Daniel Marcu. The Theory and Practice of Discourse Parsing and Summarization. MIT Press, November 2000.

Daniel Marcu and Abdessamad Echihabi. An unsupervised approach to recognizing discourse relations. In ACL 2002, pages 368–375, July 2002.

Tristan Miller. Essay assessment with latent semantic analysis. Journal of Educational Computing Research, 28(3), 2003a.

Tristan Miller. Generating coherent extracts of single documents using latent semantic analysis. Master's thesis, University of Toronto, March 2003b.

G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic text structuring and summarization. Information Processing and Management, 33(2):193–207, 1997.

Karen Spärck Jones and Julia Rose Galliers. Evaluating Natural Language Processing Systems. Lecture Notes in Artificial Intelligence 1083. Springer, Berlin, 1996.

P. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336, 2000.