
Using Sentence-Selection Heuristics to Rank Text Segments in TXTRACTOR

Daniel McDonald and Hsinchun Chen
Artificial Intelligence Lab, Management Information Systems Department
University of Arizona, Tucson, AZ 85721, USA
520-621-2748
{dmm, hchen}@eller.arizona.edu

ABSTRACT

TXTRACTOR is a tool that uses established sentence-selection heuristics to rank text segments, producing summaries that contain a user-defined number of sentences. The purpose of identifying text segments is to maximize topic diversity, which is an adaptation of the Maximal Marginal Relevance criterion used by Carbonell and Goldstein [5]. Sentence-selection heuristics are then used to rank the segments. We hypothesize that ranking text segments via traditional sentence-selection heuristics produces a balanced summary with more useful information than one produced by using segmentation alone. The proposed summary is created in a three-step process, which includes 1) sentence evaluation, 2) segment identification, and 3) segment ranking. As the required length of the summary changes, low-ranking segments can then be dropped from (or higher-ranking segments added to) the summary. We compare the output of TXTRACTOR to the output of a segmentation tool based on the TextTiling algorithm to validate the approach.

Categories and Subject Descriptors
I.2.7 Natural Language Processing - Language parsing and understanding, Text analysis

General Terms: Algorithms

Keywords
Text summarization, text segmentation, Information Retrieval, text extraction

1. INTRODUCTION

1.1 Digital Libraries

Automatic text summarization offers potential benefits to the operation and design of digital libraries. As digital libraries grow in size, so does the user's need for information filtering tools. Indicative text summarization systems support the user in deciding which documents to view in their totality and which to ignore. Some summarization techniques use measures of query relevance to tailor the summary to a specific query [22] [5]. Providing tools for users to sift through query results can potentially ease the burden of information overload.

Using document summaries can also potentially improve the results of queries on digital libraries. Relevance feedback methods usually select terms from entire documents in order to expand queries. Lam-Adesina and Jones found query expansion using document summaries to be considerably more effective than query expansion using full documents [13]. Other summarization research explores the processing of summaries instead of full documents in information retrieval tasks [18, 21]. Using summaries instead of full documents in a digital library has the potential to speed query processing and facilitate greater post-retrieval analysis, again potentially easing the burden of information overload.

1.2 Background

Approaches to text summarization vary greatly. A distinction is frequently made between summaries generated by text extraction and those generated by text abstraction. Text extraction is widely used [10], utilizing sentences from a document to create a summary. Early examples of summarization techniques utilized text extraction [16]. Text abstraction programs, on the other hand, produce grammatical sentences that summarize a document's concepts. The concepts in an abstract are often thought of as having been compressed. While the formation of an abstract may better fit the idea of a summary, its creation involves greater complexity and difficulty [10]. Producing abstracts usually involves several stages, such as topic fusion and text generation, that are not required for text extracts. Recent summarization research has largely focused on text extraction, with renewed interest in sentence-selection summarization methods in particular [17].
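The three-step process described in the abstract — sentence evaluation, segment identification, and segment ranking — can be illustrated with a minimal sketch. The code below scores sentences, ranks pre-identified segments by their best sentence, and emits one sentence per segment in rank order, so that shrinking the sentence budget simply drops the lowest-ranking segments. The scoring weights and function names here are invented for illustration; this is not TXTRACTOR's actual implementation.

```python
# Illustrative sketch of the three-step extraction process:
# 1) evaluate sentences, 2) use pre-identified segments, 3) rank segments.
# All heuristic weights below are hypothetical, not TXTRACTOR's own.

def score_sentence(sentence, position, total):
    """Toy heuristic score: reward earlier and longer sentences."""
    position_weight = 1.0 - position / total          # earlier = higher
    length_weight = min(len(sentence.split()) / 20.0, 1.0)
    return position_weight + length_weight

def summarize(sentences, segments, n_sentences):
    """segments: list of (start, end) sentence-index ranges (end exclusive).
    Returns at most one sentence per segment, in segment-rank order."""
    total = len(sentences)
    scores = [score_sentence(s, i, total) for i, s in enumerate(sentences)]
    # Step 3: rank each segment by the score of its best sentence.
    ranked = sorted(segments,
                    key=lambda seg: max(scores[seg[0]:seg[1]]),
                    reverse=True)
    # As the summary length changes, low-ranking segments drop out.
    summary = []
    for start, end in ranked[:n_sentences]:
        best = max(range(start, end), key=lambda i: scores[i])
        summary.append(sentences[best])
    return summary
```

Because segments are ranked before any sentence is emitted, requesting a shorter summary removes whole low-ranking segments rather than arbitrary sentences, matching the behavior described in the abstract.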
An extracted summary remains closer to the original document, by using sentences from the text, thus limiting the bias that might otherwise appear in a summary [16]. TXTRACTOR continues this trend by utilizing text extraction methods to produce summaries.

The goals of text summarizers can be categorized by their intent, focus, and coverage [7]. Intent refers to the potential use of the summary. Firmin and Chrzanowski divide a summary's intent into three main categories: indicative, informative, and evaluative. Indicative summaries give an indication of the central topic of the original text, or enough information to judge the text's relevancy. Informative summaries can serve as substitutes for the full documents, and evaluative summaries express the point of view of the author on a given topic. Focus refers to the summary's scope, whether generic or query-relevant. A generic summary is based on the original text, while a query-relevant summary is based on a topic selected by the user. Finally, coverage refers to the number of documents that contribute to the summary, whether the summary is based on a single document or multiple documents. TXTRACTOR uses a text extraction approach to produce summaries that are categorized as indicative, generic, and based only on single documents.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL'02, July 13-17, 2002, Portland, Oregon, USA.
Copyright 2002 ACM 1-58113-513-0/02/0007…$5.00.

2. RELATED RESEARCH

TXTRACTOR is most strongly related to the research by Carbonell and Goldstein [5] that strives to reduce the redundancy of information in a query-focused summary. Carbonell and Goldstein introduce the concept of Maximal Marginal Relevance (MMR), where each sentence is ranked based on a combination of a relevance and a diversity measure. The consideration of diversity in TXTRACTOR is achieved by segmenting a document using the TextTiling algorithm [9]. Sentences coming from different text segments are considered adequately diverse. All text segments must be represented in a summary before additional sentences from an already represented segment can be added. Nomoto [18] and Radev [19] also present different ways to implement diversity calculations for summary creation. Unlike the summarization work done by Carbonell and Goldstein, however, TXTRACTOR is not query-focused; rather, it uses sentence-selection heuristics, instead of query relevance, to rank a document's sentences.

2.1 Sentence Selection

Much research has been done on techniques to identify sentences that effectively summarize a document. Luhn in 1958 first utilized word-frequency-based rules to identify sentences for summaries [16]. Edmundson (1969) added three rules in addition to word frequencies for selecting sentences to extract: cue phrases (e.g., "significant," "impossible," "hardly"), title and heading words, and sentence location (words starting a paragraph were more heavily weighted) [6]. The ideas behind these older approaches are still referenced in modern text extraction research.

These heuristics can be domain-dependent, however. Cue phrases such as "in conclusion" are more likely to appear in scientific literature than in newspaper articles [10]. Position-based methods are also domain-dependent: the first sentence in a paragraph contains the topic sentence in some domains, whereas it is the last sentence elsewhere. Combined with other techniques, however, these extraction methods can still contribute to the quality of a summary.

2.2 Document Segmentation

Document segmentation is an Information Retrieval (IR) approach to summarization. Narrowing the scope from a collection of documents, the IR approach views a single document as a collection of words and phrases from which topic boundaries must be identified [10]. Recent research in this field, particularly the TextTiling algorithm [9], seems to show that a document's topic boundaries can be identified with a fair amount of success. Once a document's segments have been identified, sentences from within the segments are typically extracted using word-based rules in order to turn a document's segments into a summary. Breaking a document into segments identifies the document's topic boundaries, and segmentation helps ensure that a document's topics are adequately represented in a summary.

The IR approach to extraction does have some weaknesses. Having a word-level focus "prevents researchers from employing reasoning at the non-word level" [10]. While the IR technique successfully segments single documents into topic areas [9], the selection of sentences to extract from within those topic areas could be improved by using many different heuristics, both word-based and those that utilize language knowledge. In addition, once a document is segmented, there is no way to know which of the segments is the most salient to the overall document. Some mechanism is required to rank segments so that the most pertinent topic information either gets extra coverage in the summary or is covered first in the summary. A practical problem is also addressed by ranking segments: when the required number of sentences in a summary is less than the number of identified segments, there must be an intelligent way to decide which segments will not be covered. A possible solution is to force a document to have a certain number of segments that matches the number of sentences allowed in the summary. Presetting the number of acceptable topic areas, however, seems to
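The Edmundson-style heuristics of section 2.1 and the segment-diversity constraint of section 2 — every segment must be represented before any segment contributes a second sentence — can be combined in a small sketch. This is a hedged reconstruction: the cue-word list, weights, and function names are invented for illustration, and a simple round-robin over ranked segments stands in for whatever selection order TXTRACTOR actually uses.

```python
# Hypothetical sketch: Edmundson-style sentence scoring (cue phrases,
# title words, location) plus a diversity constraint that covers every
# segment before repeating one. Cue words and weights are invented.

CUE_WORDS = {"significant", "impossible", "hardly", "conclusion"}

def edmundson_score(sentence, title_words, is_paragraph_start):
    """Score a sentence with the three Edmundson-style heuristics."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    score = 0.0
    score += 1.0 * len(words & CUE_WORDS)       # cue-phrase heuristic
    score += 0.5 * len(words & title_words)     # title/heading-word heuristic
    score += 1.0 if is_paragraph_start else 0.0 # location heuristic
    return score

def select_diverse(scored_segments, n_sentences):
    """scored_segments: one list of (score, sentence) pairs per segment.
    Round-robin over segments so each is represented before any repeats."""
    for seg in scored_segments:
        seg.sort(key=lambda pair: pair[0], reverse=True)
    summary, round_idx = [], 0
    while len(summary) < n_sentences:
        added = False
        for seg in scored_segments:
            if round_idx < len(seg) and len(summary) < n_sentences:
                summary.append(seg[round_idx][1])
                added = True
        if not added:        # every segment exhausted
            break
        round_idx += 1
    return summary
```

Note that the first pass of the round-robin takes exactly one (top-scoring) sentence from each segment, which is the MMR-flavored diversity guarantee described above; only later passes add second sentences from already-represented segments.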