Latent Semantic Analysis and the Construction of Coherent Extracts
Tristan Miller
German Research Center for Artificial Intelligence[0]
Erwin-Schrödinger-Straße 57, D-67663 Kaiserslautern
[email protected]

[0] The research described in this paper was carried out while the author was at the University of Toronto.

Keywords: automatic summarization, latent semantic analysis, LSA, coherence, extracts

Abstract

We describe a language-neutral automatic summarization system which aims to produce coherent extracts. It builds an initial extract composed solely of topic sentences, and then recursively fills in the topical lacunae by providing linking material between semantically dissimilar sentences. While experiments with human judges did not prove a statistically significant increase in textual coherence with the use of a latent semantic analysis module, we found a strong positive correlation between coherence and overall summary quality.

1 Introduction

A major problem with automatically-produced summaries in general, and extracts in particular, is that the output text often lacks fluency and organization. Sentences often leap incoherently from topic to topic, confusing the reader and hampering his ability to identify information of interest. Interest in producing textually coherent summaries has consequently increased in recent years, leading to a wide variety of approaches, including IR-influenced techniques (Salton et al. [1997]; Carbonell and Goldstein [1998]), variations on lexical chaining (Brunn et al. [2002]; Karamuftuoglu [2002]), and discourse structure analysis (Marcu [1997, 1999]; Chan et al. [2000]). Unfortunately, many of these techniques are tied to a particular language or require resources such as a list of discourse keywords and a manually marked-up corpus; others are constrained in the type of summary they can generate (e.g., general-purpose vs. query-focussed).

In this paper, we present a new, recursive method for automatic text summarization which aims to preserve both the topic coverage and the coherence of the source document, yet has minimal reliance on language-specific NLP tools. Only word- and sentence-boundary detection routines are required. The system produces general-purpose extracts of single documents, though it should not be difficult to adapt the technique to query-focussed summarization, and it may also be of use in improving the coherence of multi-document summaries.

2 Latent semantic analysis

Our system fits within the general category of IR-based systems, but rather than comparing text with the standard vector-space model, we employ latent semantic analysis (LSA) [Deerwester et al., 1990], a technique originally developed to circumvent the problems of synonymy and polysemy in IR. LSA extends the traditional vector-space document model with singular value decomposition, a process by which the term–sentence co-occurrence matrix representing the source document is factored into three smaller matrices of a particular form. One such matrix is a diagonal matrix of singular values; when one or more of the smallest singular values are deleted and the three matrices multiplied together, the product is a least-squares best fit to the original matrix. The apparent result of this smearing of values is that the approximated matrix has captured the latent transitivity relations among terms, allowing for identification of semantically similar sentences which share few or no common terms withal.
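To make the decomposition step concrete, the following is a minimal sketch, not the paper's implementation: the function name lsa_similarity_matrix, the NumPy-based setup, and the rank parameter k are illustrative choices, not details from the source.

```python
# Illustrative sketch only: rank-k LSA similarity between sentences,
# built from a plain term-sentence count matrix and a truncated SVD.
import numpy as np

def lsa_similarity_matrix(sentences, k=100):
    """sentences: list of token lists. Returns an s x s matrix of cosine
    similarities computed in the rank-k LSA space."""
    vocab = sorted({t for sent in sentences for t in sent})
    index = {t: i for i, t in enumerate(vocab)}
    # Term-sentence co-occurrence matrix A (terms x sentences).
    A = np.zeros((len(vocab), len(sentences)))
    for j, sent in enumerate(sentences):
        for t in sent:
            A[index[t], j] += 1
    # SVD; keep only the k largest singular values (the "smearing" step).
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(S))
    # Sentence j is represented by column j of diag(S_k) @ Vt_k.
    X = S[:k, None] * Vt[:k, :]                # shape: k x s
    X = X / (np.linalg.norm(X, axis=0) + 1e-12)
    return X.T @ X                             # cosine similarities in [-1, 1]
```

Because sentences are compared in the reduced latent space rather than in the raw term space, two sentences can score as similar even when they share few or no terms, which is the property the method relies on.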
We believe that the deep semantic relations discovered by LSA may assist in the identification and correction of abrupt topic shifts between sentences.

3 Algorithm

The input to our summarizer is a plain text document, which is converted into a list of tokenized sentences. A tokenizer and sentence-boundary disambiguation algorithm may be used for these first steps.

The list of $m$ sentences (indexed from 1 to $m$) is then segmented into linearly discrete topics. This can be done manually if the original document is structured (e.g., a book with chapters, or an article with sections), or a linear text segmentation algorithm, such as C99 [Choi, 2000], can be used. The output of this step is a list of sentence indices $\langle t_1, \ldots, t_{n+1} \rangle$ where, for the $i$th of the $n$ topics, $t_i$ is the index of the first sentence of the topic segment and $t_{i+1} - 1$ is the index of the last sentence of the topic segment. We stipulate that there are no sentences which do not belong to a topic segment, so for all $t_i$ we have $t_i < t_{i+1}$, and

$$
t_i =
\begin{cases}
1 & \text{if } i = 1; \\
m + 1 & \text{if } i = n + 1; \\
\text{index of the first sentence of the } i\text{th topic} & \text{otherwise.}
\end{cases}
$$

As mentioned previously, we use LSA to measure semantic similarity, so before we can begin constructing the extract, we need to construct a reduced-dimensionality term–sentence co-occurrence matrix. Once this is done, a preliminary extract is produced by choosing a representative "topic sentence" from each segment—that is, that sentence which has the highest semantic similarity to all other sentences in its topic segment. These topic sentences correspond to a list of sentence indices $\langle r_1, \ldots, r_n \rangle$ such that

$$
r_i = \operatorname*{arg\,max}_{t_i \le j < t_{i+1}} \sum_{k = t_i}^{t_{i+1} - 1} \operatorname{sim}(j, k),
$$

where $\operatorname{sim}(x, y) \in [-1, 1]$ is the LSA cosine similarity score for the sentences with indices $x$ and $y$.

In order to preserve important information which may be found at the beginning of the document, and also to account for the possibility that the document contains only one topic segment, we always consider the first sentence of the document to be a topic sentence—i.e., $r_0 = 1$—and include it in our initial extract.[1] Let us refer to this initial extract as $E_0 = \langle e_{0,1}, \ldots, e_{0,n+1} \rangle$, where $e_{0,i} = r_{i-1}$.

As we might imagine, this basic extract will have very poor coherence, since every sentence addresses a completely different topic. However, we can improve its coherence by selecting from the set $\langle 1, \ldots, m \rangle \setminus E_0$ a number of indices for "glue" sentences between adjacent pairs of sentences represented in $E_0$. We consider an appropriate glue sentence between two others to be one which occurs between them in the source document, and which is semantically similar to both. Thus we look for sentence indices $G_1 = \langle g_{1,1}, \ldots, g_{1,n} \rangle$ such that

$$
g_{1,i} = \operatorname*{arg\,max}_{e_{0,i} < j < e_{0,i+1}} f\bigl(\operatorname{sim}'(j, e_{0,i}),\, \operatorname{sim}'(j, e_{0,i+1})\bigr),
$$

where

$$
f(x, y) = xy \cdot (1 - |x - y|)
$$

and

$$
\operatorname{sim}'(x, y) =
\begin{cases}
0 & \text{if } \operatorname{sim}(x, y) > \alpha; \\
0 & \text{if } \operatorname{sim}(x, y) < 0; \\
\operatorname{sim}(x, y) & \text{otherwise}
\end{cases}
$$

for $\alpha \in [0, 1]$. The purpose of $f()$ is to reward glue sentences which are similar to their boundary sentences, but to penalize if the similarity is too biased in favour of only one of the boundaries. The revised similarity measure $\operatorname{sim}'()$ ensures that we do not select a glue sentence which is nearly equivalent to any one boundary—such a sentence is redundant. (Of course, useful values of $\alpha$ will be 1 or close thereto.)

Once we have $G_1$, we can construct a revised extract $E_1 = \langle e_{1,1}, \ldots, e_{1,2n+1} \rangle = \langle E_0 \cup G_1 \rangle$.[2]

[1] In practice, it may be the case that $r_1 = 1$, in which case inclusion of $r_0$ is not necessary. In this paper we assume, without loss of generality, that $r_1 \neq 1$.
[2] For notational convenience, we take it as understood that the sentence indices in the extracts $E_i$ are sorted in ascending order—that is, $e_{i,j} < e_{i,j+1}$ for $1 \le j < |E_i|$.
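The two selection steps above can be sketched as follows, reusing the similarity matrix from the earlier example. This is not the paper's code: the helper names topic_sentences and glue, the 1-based indexing convention, and the default value of alpha are illustrative assumptions.

```python
# Illustrative sketch only: choosing topic sentences r_i and the best glue
# sentence between two extract sentences, given the LSA similarity matrix
# `sim` (a NumPy array as returned by lsa_similarity_matrix above).
import numpy as np

def topic_sentences(sim, t):
    """t: topic boundaries [t_1, ..., t_{n+1}] (1-based, with t_{n+1} = m + 1).
    Returns [r_1, ..., r_n], the most representative sentence of each topic."""
    r = []
    for i in range(len(t) - 1):
        lo, hi = t[i], t[i + 1]                      # topic covers sentences lo .. hi-1
        scores = [sim[j - 1, lo - 1:hi - 1].sum() for j in range(lo, hi)]
        r.append(lo + int(np.argmax(scores)))
    return r

def glue(sim, left, right, alpha=0.95):
    """Best glue sentence strictly between extract sentences `left` and `right`
    (1-based indices), returned as (index, f_score), or None if none exists."""
    def sim_prime(x, y):
        s = sim[x - 1, y - 1]
        return 0.0 if (s > alpha or s < 0.0) else s  # discard redundant/negative
    def f(x, y):
        return x * y * (1.0 - abs(x - y))            # reward balanced similarity
    best = None
    for j in range(left + 1, right):
        score = f(sim_prime(j, left), sim_prime(j, right))
        if best is None or score > best[1]:
            best = (j, score)
    return best
```

Having glue return both the chosen index and its f score lets a caller compare the score against a threshold before deciding whether the glue sentence is worth adding.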
More generally, however, we can repeat the gluing process recursively, using $E_i$ to generate $G_{i+1}$, and hence $E_{i+1}$. The question that arises, then, is when to stop. Clearly there will come a point at which some $e_{i,j} = e_{i,j+1} - 1$, thus precluding the possibility of finding any further glue sentences between them. We may also encounter the case where, for all $k$ between $e_{i,j}$ and $e_{i,j+1}$, $f(\operatorname{sim}'(k, e_{i,j}), \operatorname{sim}'(k, e_{i,j+1}))$ is so low that the extract's coherence would not be significantly improved by the addition of an intermediary sentence. Or, we may find that the sentences with indices $e_{i,j}$ and $e_{i,j+1}$ are themselves so similar that no glue is necessary. Finally, it is possible that the user wishes to constrain the size of the extract to a certain number of sentences, or to a fixed percentage of the original document's length. The first of these stopping conditions is straightforward to account for; the next two can be easily handled by introducing two fixed thresholds $\beta$ and $\gamma$: when the similarity between adjacent sentences from $E_i$ [...]

[...] length $2n - 1$. In general, at most $2^{i-1} n$ sentences will be added on the $i$th recursion, bringing the extract length to $2^i n - 1$ sentences. Therefore, to achieve an extract of length $\ell > n$, the algorithm needs to recurse at least

$$
\log_2 \frac{\ell + 1}{n}
$$

times. The worst case occurs when $n = 2$ and the algorithm always selects a glue sentence which is adjacent to one of the boundary sentences (with indices $e_1$ and $e_2$). In this case, the algorithm must recurse $\min(\ell, e_2 - e_1)$ times, which is limited by the source document length, $m$.

On each recursion $i$ of the algorithm, the main loop considers at most $m - 2^i n - 1$ candidate glue sentences, comparing each one with two of the $2^i n - 1$ sentences already in the extract. To simplify matters, we note that $2^i n - 1$ can never exceed $m$, so the number of comparisons must be, at worst, proportional to $m$.
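A rough sketch of the recursive gluing loop follows, building on the helpers from the previous sketch. It is not the paper's code, and the excerpt above is cut off before fully specifying how $\beta$ and $\gamma$ are applied; this version assumes that $\beta$ is the boundary-pair similarity above which no glue is sought, that $\gamma$ is the minimum acceptable f score for a glue sentence, and that the length constraint is checked only between passes.

```python
# Illustrative sketch only: recursive gluing with hedged stopping conditions.
# Assumes topic_sentences() and glue() from the previous sketch, and `sim`
# from lsa_similarity_matrix(). beta/gamma semantics are assumptions.
def build_extract(sim, t, max_len=None, alpha=0.95, beta=0.9, gamma=0.1):
    """Build E_0 (with r_0 = 1) and repeatedly insert glue sentences until no
    pair of adjacent extract sentences admits a worthwhile glue sentence."""
    extract = sorted(set([1] + topic_sentences(sim, t)))      # E_0
    changed = True
    # Note: a full pass may overshoot max_len slightly; a real implementation
    # would trim the extract to the requested length exactly.
    while changed and (max_len is None or len(extract) < max_len):
        changed = False
        new_glue = []
        for left, right in zip(extract, extract[1:]):
            if right - left <= 1:                  # adjacent: no glue possible
                continue
            if sim[left - 1, right - 1] >= beta:   # boundaries already similar
                continue
            best = glue(sim, left, right, alpha)
            if best is None or best[1] < gamma:    # no sufficiently good glue
                continue
            new_glue.append(best[0])
            changed = True
        extract = sorted(set(extract + new_glue))  # E_{i+1} = E_i U G_{i+1}
    return extract
```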