Using Sentence-Selection Heuristics to Rank Text Segments in TXTRACTOR

Daniel McDonald and Hsinchun Chen
Artificial Intelligence Lab, Management Information Systems Department
University of Arizona, Tucson, AZ 85721, USA
520-621-2748
{dmm, hchen}@eller.arizona.edu

ABSTRACT

TXTRACTOR is a tool that uses established sentence-selection heuristics to rank text segments, producing summaries that contain a user-defined number of sentences. The purpose of identifying text segments is to maximize topic diversity, which is an adaptation of the Maximal Marginal Relevance criterion used by Carbonell and Goldstein [5]. Sentence-selection heuristics are then used to rank the segments. We hypothesize that ranking text segments via traditional sentence-selection heuristics produces a balanced summary with more useful information than one produced by using segmentation alone. The proposed summary is created in a three-step process: 1) sentence evaluation, 2) segment identification, and 3) segment ranking. As the required length of the summary changes, low-ranking segments can then be dropped from (or higher-ranking segments added to) the summary. To validate the approach, we compare the output of TXTRACTOR to the output of a segmentation tool based on the TextTiling algorithm.

Categories and Subject Descriptors
I.2.7 Natural Language Processing - Language parsing and understanding, Text analysis

General Terms
Algorithms

Keywords
Text summarization, Information Retrieval, text extraction

1. INTRODUCTION

1.1 Digital Libraries
Automatic text summarization offers potential benefits to the operation and design of digital libraries. As digital libraries grow in size, so does the user's need for information filtering tools. Indicative text summarization systems support the user in deciding which documents to view in their totality and which to ignore. Some summarization techniques use measures of query relevance to tailor the summary to a specific query [22, 5]. Providing tools for users to sift through query results can potentially ease the burden of information overload.

Using document summaries can also potentially improve the results of queries on digital libraries. Relevance feedback methods usually select terms from entire documents in order to expand queries. Lam-Adesina and Jones found query expansion using document summaries to be considerably more effective than query expansion using full documents [13]. Other summarization research explores the processing of summaries instead of full documents in various tasks [18, 21]. Using summaries instead of full documents in a digital library has the potential to speed query processing and facilitate greater post-retrieval analysis, again potentially easing the burden of information overload.

1.2 Background
Approaches to text summarization vary greatly. A distinction is frequently made between summaries generated by text extraction and those generated by text abstraction. Text extraction is widely used [10], utilizing sentences from a document to create a summary. Early examples of summarization techniques utilized text extraction [16]. Text abstraction programs, on the other hand, produce grammatical sentences that summarize a document's concepts; the concepts in an abstract are often thought of as having been compressed. While the formation of an abstract may better fit the idea of a summary, its creation involves greater complexity and difficulty [10]. Producing abstracts usually involves several stages, such as topic fusion and text generation, that are not required for text extracts. Recent summarization research has largely focused on text extraction, with renewed interest in sentence-selection summarization methods in particular [17]. An extracted summary remains closer to the original document by using sentences from the text, thus limiting the bias that might otherwise appear in a summary [16]. TXTRACTOR continues this trend by utilizing text extraction methods to produce summaries.

The goals of text summarizers can be categorized by their intent, focus, and coverage [7]. Intent refers to the potential use of the summary. Firmin and Chrzanowski divide a summary's intent into three main categories: indicative, informative, and evaluative. Indicative summaries give an indication of the central topic of the original text or enough information to judge the text's relevancy.


Informative summaries can serve as substitutes for the full documents, and evaluative summaries express the point of view of the author on a given topic. Focus refers to the summary's scope, whether generic or query-relevant. A generic summary is based on the original text, while a query-relevant summary is based on a topic selected by the user. Finally, coverage refers to the number of documents that contribute to the summary, whether the summary is based on a single document or multiple documents. TXTRACTOR uses a text extraction approach to produce summaries that are indicative, generic, and based only on single documents.

2. RELATED RESEARCH

TXTRACTOR is most strongly related to the research by Carbonell and Goldstein [5], which strives to reduce the redundancy of information in a query-focused summary. Carbonell and Goldstein introduce the concept of Maximal Marginal Relevance (MMR), where each sentence is ranked based on a combination of a relevance and a diversity measure. The consideration of diversity in TXTRACTOR is achieved by segmenting a document using the TextTiling algorithm [9]. Sentences coming from different text segments are considered adequately diverse. All text segments must be represented in a summary before additional sentences from an already represented segment can be added. Nomoto [18] and Radev [19] also present different ways to implement diversity calculations for summary creation. Different from the summarization work done by Carbonell and Goldstein, however, TXTRACTOR is not query-focused; it uses sentence-selection heuristics, instead of query relevance, to rank a document's sentences.

2.1 Sentence Selection

Much research has been done on techniques to identify sentences that effectively summarize a document. Luhn in 1958 first utilized word-frequency-based rules to identify sentences for summaries [16]. Edmundson (1969) added three rules in addition to word frequencies for selecting sentences to extract: cue phrases (e.g., "significant," "impossible," "hardly"), title and heading words, and sentence location (words starting a paragraph were more heavily weighted) [6]. The ideas behind these older approaches are still referenced in modern text extraction research. Sentence-selection methods assist in finding the salient sentences in a document; by salient, we mean sentences a user would include in a summary. Sentence-selection methods have been reviewed extensively. Teufel and Moens found the use of cue phrases to be the best individual method [24]. Kupiec et al., on the other hand, found the position-based method to be the best [12]. Regarding the combination of sentence-selection heuristics, Kupiec, Pedersen, and Chen found that the best mix of extraction methods included position, cue phrase, and sentence length [12]. Aone et al. tested several variations of tf*idf and the use or suppression of proper names in their system DimSum [3]. Goldstein et al. found that summary sentences had 90 percent more proper nouns per sentence [8]. When deciding which combination of extraction methods to use in TXTRACTOR, we assume each method is independent and that its impact can be aggregated into the total sentence score. As we conduct additional summarization experiments, we will further refine our use of sentence-selection methods and add additional promising methods.

Despite the usefulness of sentence extraction methods in finding salient sentences, they cannot alone produce the highest-quality extracts. Sentence-selection techniques are often domain dependent. For example, the words "Abstract" and "in conclusion" are more likely to appear in scientific literature than in newspaper articles [10]. Position-based methods are also domain-dependent: the first sentence in a paragraph contains the topic sentence in some domains, whereas it is the last sentence elsewhere. Combined with other techniques, however, these extraction methods can still contribute to the quality of a summary.

2.2 Document Segmentation

Document segmentation is an Information Retrieval (IR) approach to summarization. Narrowing the scope from a collection of documents, the IR approach views a single document as a collection of words and phrases from which topic boundaries must be identified [10]. Recent research in this field, particularly the TextTiling algorithm [9], seems to show that a document's topic boundaries can be identified with a fair amount of success. Once a document's segments have been identified, sentences from within the segments are typically extracted using word-based rules in order to turn the segments into a summary. Breaking a document into segments identifies the document's topic boundaries; segmentation is thus an effective way to make sure that a document's topics are adequately represented in a summary.

The IR approach to extraction does have some weaknesses. Having a word-level focus "prevents researchers from employing reasoning at the non-word level" [10]. While the IR technique successfully segments single documents into topic areas [9], the selection of sentences to extract from within those topic areas could be improved by using many different heuristics, both word-based and those that utilize language knowledge. In addition, once a document is segmented, there is no way to know which of the segments is the most salient to the overall document. Some mechanism is required to rank segments so that the most pertinent topic information either gets extra coverage in the summary or is covered first in the summary. A practical problem is also addressed by ranking segments: when the required number of sentences in a summary is less than the number of identified segments, there must be an intelligent way to decide which segments will not be covered. A possible solution would be to force a document to have a certain number of segments matching the number of sentences allowed in the summary. Presetting the number of acceptable topic areas, however, seems to defeat the process of true segment identification; segment boundaries would seem arbitrary if there were a limit on their number. The process of finding a document's topic areas is separate from that of selecting representative sentences to appear in the summary. A ranking of the segments, however, allows a summary to grow and shrink while extracting sentences from as many of the highest-ranked topic areas as possible. While this approach would not be suited to an informative summary, ranking segments and controlling the number of sentences in a summary are acceptable for an indicative summary.

2.3 Combination Proposal

TXTRACTOR attempts to capture the benefits of sentence-selection summarization and document segmentation while overcoming many of their deficiencies. The document segmentation algorithm identifies the document's main topic areas, while sentence-selection heuristics identify the salient sentences of a summary. The topic areas are used as the foundation for the summary, and the salient sentences are used as a compass guiding the inclusion of certain topic areas. Document segmentation provides a thorough, domain-independent analysis of the entire document, created in a bottom-up manner. Sentence-selection heuristics provide saliency information in a structured, top-down manner.


In addition, we have included many sentence-selection techniques in order to reduce the domain-dependency effect of any one heuristic. We hypothesize that ranking a document's segments on the basis of their containing one or many of the document's salient sentences will produce summaries that are more information rich than those produced by the segmentation-only approach.

3. TXTRACTOR IMPLEMENTATION

TXTRACTOR is a summarizer based on text extraction written in Java. Its major components include sentence-selection rules and a segmentation algorithm. The summarization process takes place in three main steps: 1) sentence evaluation, 2) segmentation, or topic-boundary identification, and 3) segment ranking and extraction.

3.1 Sentence Evaluation:

The summarization process begins by parsing the sentences of the original text using a program that recognizes 60 abbreviations and various punctuation anomalies. The original order of the sentences is preserved so they can be added to the summary in the intended order. Once the sentences are identified, TXTRACTOR begins ranking each sentence. We use five sentence-selection heuristics to evaluate the document's sentences. Each of the following ranking methods contributes to the score of each sentence.

3.1.1 Presence of cue phrases
Currently, each sentence is checked for the existence of ten different cue phrases (e.g., "in summary," "in conclusion," "in short," "therefore"). Cue phrases are words that signal to a reader that the author is going to summarize his or her idea or subject. The cue phrases are loaded from a text file so that additional words can be easily added as more experimentation is done in this area.

3.1.2 Proper nouns in the sentence
A TXTRACTOR-generated summary is meant to provide enough information for a user to be able to decide whether she or he wants to read the original document in its entirety. Important to this decision is the presence of certain proper names and places. Currently, TXTRACTOR simply reads each sentence and counts the capitalized words, not including the opening word of the sentence. This is meant as a temporary implementation until a full entity-extraction algorithm can be implemented. The total number of capitalized words in each sentence is then averaged over the number of words in the sentence. Shorter sentences are thus not penalized for having fewer proper nouns than longer sentences. The average number of proper nouns is then normalized and added to the sentence's score.

3.1.3 TF*IDF
Tf*idf measures how frequently the words in a sentence occur relative to their occurrence in the entire document. Sentences that have document words in common are scored higher. To calculate tf*idf, the occurrence of every word in a sentence and the word's total occurrences in the document are totaled. Before the terms are totaled, however, each word is made lower-case and stemmed using the Porter stemmer. The Porter stemmer is one of the most widely used stemming algorithms [11] and can be thought of as a lexicon-free stemmer because it uses cascaded rewrite rules that can be run very quickly and do not require the use of a lexicon. Stemming is performed so that words with the same stem but different affixes may be treated as the same word when calculating the frequency of a particular term. The tf*idf calculation is computed and then averaged over the length of the sentence. Thus, a high tf*idf score for a sentence is normalized for sentence length. The resulting score is then added to the sentence's score.

3.1.4 Sentence position in a paragraph
As the sentences are extracted from the original document, new lines and carriage returns signal the beginning of new paragraphs. The beginning sentence of a document and the beginning sentence of a paragraph are given additional weight due to their greater summarizing potential.

3.1.5 Sentence length
The length of a sentence can provide clues as to its usefulness in a summary [12] [3]. Before adding the sentence-length heuristic, we tried to achieve the same effect by simply not averaging tf*idf scores over the number of words in the sentence. Longer sentences would naturally be scored higher because they would contain more non-stop-list terms. This approach overly weighted long sentences, to the point where scores from the tf*idf equation would overpower the scores in other areas. Normalizing that score, however, would mute the value of a concentrated sentence with many document-wide terms. To solve this problem, we made sentence length its own rule. The length of a sentence is calculated and its impact added to the sentence's weight.

Because each of the five sentence-selection rules is calculated differently, each score had to be normalized so the impact of each rule would be comparable. For example, the impact of an extra proper noun in a sentence is not the same as that of a sentence occurring first in a paragraph or of a sentence being very long. The current normalization factor for each heuristic was determined through experimentation.

TXTRACTOR has a configuration option that allows the user to adjust the impact of each sentence-selection heuristic without having to recompile the program. Because each extraction heuristic was normalized, a user can change the weighting of a particular heuristic and immediately judge its impact on the summary. Including the configuration capability facilitates experimentation with different heuristic weights. In addition, while the summary-generation logic of TXTRACTOR was designed to be reasonably domain-independent, a user can still change the weighting given to different sentence-selection rules through the configuration option, thus customizing the summarizer to different domains and uses. For example, a user may want to see as many proper nouns as possible in the summary. Increasing the weight of the proper-nouns rule will cause sentences with proper nouns to move to the top of the sentence ranking and thus appear more often in the summary.

Once each sentence is scored on the five heuristics above, the sentences are ranked according to their summarizing value. Unlike segmentation-only approaches, sentences are ranked against other sentences from the entire document, not only those sentences within the same topic area. An example of sentence scoring is shown in Figure 1. The three sentences listed are the top three sentences extracted from a document entitled "May the Source be With You" from Wired Magazine [15]. A nearly complete copy of the article is found in Figure 4. Each sentence in Figure 1 comes from a different topic segment. The highest-scoring sentence greatly benefits from being the first sentence in the article (+30). This position heuristic has been shown to be the most effective of all sentence-selection heuristics [12]. All three sentences begin a paragraph (+20). The second sentence contains the cue phrase "thus," adding 10 points. The cue phrase allows it to outrank the third sentence despite the third sentence having the highest values of the three for tf*idf and sentence length.


(topic: 0) (sentence 0) (score: 95) "The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust." (First document sentence +30, first sentence of paragraph +20, +0 for proper nouns, +34 for tf*idf, +11 for sentence length = score of 95)

(topic: 5) (sentence 52) (score: 85) "Thus, I would dramatically reduce the safeguards for software - from the ordinary term of 95 years to an initial term of 5 years, renewable once." (First sentence of paragraph +20, +10 for cue phrase "thus," +0 for proper nouns, +41 for tf*idf, +14 for sentence length = score of 85)

(topic: 1) (sentence 23) (score: 85) "Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; they should be permitted only when justified by facts." (First sentence of paragraph +20, +1 for proper nouns, +45 for tf*idf, +18 for sentence length = score of 85)

Figure 1 - The weighting of individual sentences
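To make the combination of heuristics concrete, the following minimal Java sketch shows how five normalized heuristic scores might be summed into a single sentence weight. It is an illustration only: the class name, the stem() placeholder, and the normalization factors are our own assumptions, while the position and cue-phrase weights echo the values shown in Figure 1.

    import java.util.*;

    // Illustrative sketch of TXTRACTOR-style sentence scoring (assumed names and weights).
    public class SentenceScorer {

        // 3.1.1: cue phrases; TXTRACTOR loads these from a text file.
        private static final List<String> CUE_PHRASES =
            Arrays.asList("in summary", "in conclusion", "in short", "therefore", "thus");

        // Position and cue weights as in Figure 1; normalization factors are invented.
        private static final double FIRST_IN_DOCUMENT = 30.0;
        private static final double FIRST_IN_PARAGRAPH = 20.0;
        private static final double CUE_PHRASE_WEIGHT = 10.0;

        public static double score(String sentence, Map<String, Integer> documentTermFreq,
                                   boolean firstInDocument, boolean firstInParagraph) {
            String[] words = sentence.trim().split("\\s+");
            if (words.length == 0 || words[0].isEmpty()) return 0.0;
            double total = 0.0;

            // 3.1.1 Presence of a cue phrase.
            String lower = sentence.toLowerCase();
            for (String cue : CUE_PHRASES) {
                if (lower.contains(cue)) { total += CUE_PHRASE_WEIGHT; break; }
            }

            // 3.1.2 Capitalized words (excluding the opening word), averaged over length.
            int capitalized = 0;
            for (int i = 1; i < words.length; i++) {
                if (Character.isUpperCase(words[i].charAt(0))) capitalized++;
            }
            total += normalize((double) capitalized / words.length, 25.0);

            // 3.1.3 tf*idf-style term weight, stemmed and averaged over sentence length.
            double termWeight = 0.0;
            for (String w : words) {
                termWeight += documentTermFreq.getOrDefault(stem(w.toLowerCase()), 0);
            }
            total += normalize(termWeight / words.length, 2.0);

            // 3.1.4 Position of the sentence in the document and paragraph.
            if (firstInDocument) total += FIRST_IN_DOCUMENT;
            if (firstInParagraph) total += FIRST_IN_PARAGRAPH;

            // 3.1.5 Sentence length as its own rule.
            total += normalize(words.length, 0.5);

            return total;
        }

        // Placeholder: a real implementation would apply the Porter rewrite rules.
        private static String stem(String word) { return word; }

        // Scale each raw heuristic so its impact is comparable (factors are assumptions).
        private static double normalize(double raw, double factor) { return raw * factor; }
    }

In such a sketch, the factors passed to normalize() play the role of the configurable weights described above; if they are read from a configuration file, re-balancing a heuristic requires no recompilation.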

3.2 Segmentation:

The segmentation algorithm used is based on the TextTiling algorithm developed by Marti Hearst [9]. The TextTiling algorithm analyzes a document and determines where the topic boundaries are located. A topic boundary can be thought of as the point at which the author of the document changes subjects or themes. The first step in the TextTiling algorithm is to divide the text into token-sequences, removing any words that appear on the stop list. We have used a token-sequence length of 20 and the same stop-word list used by Marti Hearst in TextTiling. Token-sequences are then combined to form blocks, and blocks are compared using a similarity algorithm. The comparison between blocks functions like a sliding window: the first block contains the first token-sequence plus the k token-sequences before it; the second block contains the second token-sequence and the k token-sequences after it. The value for k used in our summarizer is 10, also the same one used by Marti Hearst in TextTiling. The blocks are then compared using an algorithm that returns the similarity as a percentage, which is derived from the number of times the same terms appear in the two blocks being compared. The Jaccard coefficient is used for the similarity equation, which differs slightly from the normalized inner product equation used by Hearst; we did not consider the impact of using different similarity equations to be significant. The Jaccard coefficient is as follows:

S_{i,j} = \frac{\sum_{k=1}^{L} w_{ik} w_{jk}}{\sum_{k=1}^{L} w_{ik}^2 + \sum_{k=1}^{L} w_{jk}^2 - \sum_{k=1}^{L} w_{ik} w_{jk}}

where S_{i,j} is the similarity between the two blocks of grouped token-sequences i and j, w_{ik} is the number of occurrences of term k in block i, and L is the total number of terms.

Once the topic boundaries have been identified, TXTRACTOR then assigns each sentence to a document segment. After all sentences have been given weights and assigned to segments, segment ranking and sentence extraction can operate.
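As a concrete illustration of this block comparison, here is a minimal Java sketch of the Jaccard coefficient over two blocks represented as term-frequency maps. The class and method names are our own illustrative assumptions, not TXTRACTOR's actual source.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: Jaccard similarity between two blocks of token-sequences.
    public class BlockSimilarity {

        // Each block maps a (lower-cased, stop-filtered) term k to its count w_ik.
        public static double jaccard(Map<String, Integer> blockI, Map<String, Integer> blockJ) {
            Set<String> vocabulary = new HashSet<>(blockI.keySet());
            vocabulary.addAll(blockJ.keySet());

            double dot = 0.0, sumI = 0.0, sumJ = 0.0;
            for (String term : vocabulary) {
                double wi = blockI.getOrDefault(term, 0);
                double wj = blockJ.getOrDefault(term, 0);
                dot  += wi * wj;   // sum of w_ik * w_jk
                sumI += wi * wi;   // sum of w_ik^2
                sumJ += wj * wj;   // sum of w_jk^2
            }
            double denominator = sumI + sumJ - dot;
            return denominator == 0.0 ? 0.0 : dot / denominator;
        }
    }

Sliding this comparison across adjacent block pairs yields a similarity curve whose valleys mark candidate topic boundaries, which is the role the equation plays in TextTiling.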


3.3 Segment Ranking:

Once a document is segmented into its main topic areas, TXTRACTOR ranks the document segments based on the scores given to sentences by the sentence-selection heuristics. An example of the segment ranking is shown in Figure 2. High-ranking sentences are added first to the summary. Two sentences from the same segment are not included in the summary (regardless of their ranking) until a sentence from each segment has been included. Once all segments are represented in the summary, the process starts over, again adding at most one sentence from each segment. Remaining ranked sentences are added by segment until the summary-length requirement is met. Once the length requirement is met, the sentences are sorted by the order in which they appeared in the original document and displayed on the screen. Figure 3 shows pseudo code for the segment-ranking routine. Document segmentation, therefore, provides the topic structure for a document within which sentence selection can be utilized to identify the salient topic areas. It is of practical advantage to rank segments so that a user can easily change the desired length of the summary while the ranking routine identifies which segments, represented by their sentences, to add to or drop from the summary.

Figure 2 - Text segmentation and sentence-selection combined. (Diagram: ranked sentences are mapped onto text segments; a two-sentence summary would include the sentences ranked 1 and 3, while a five-sentence summary would include the sentences ranked 1, 2, 15, 3, and 4.)

    Rank segments(array of ranked sentences)
        while (summary length not achieved)
            for (each ranked sentence in array)
                if (sentence segment not already used)
                    if (summary length achieved)
                        break
                    add sentence to summary
                else
                    add to temp array for recursive call
            end for
            Rank segments(temp array)
        end while
    end Rank segments

    for (all summary sentences)
        rank sentences by original document order
    end for

Figure 3 – Pseudo code for segment ranking
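The pseudo code in Figure 3 can be rendered as runnable Java roughly as follows. This is a sketch under assumed types (the Sentence class and its fields are ours, not TXTRACTOR's source), and it replaces the recursive call with an equivalent iterative pass over the leftover sentences.

    import java.util.*;

    // Illustrative Java rendering of the Figure 3 segment-ranking routine.
    public class SegmentRanker {

        public static class Sentence {
            final int documentOrder;  // position in the original document
            final int segment;        // topic segment assigned during segmentation
            final double score;       // weight from the sentence-selection heuristics
            final String text;

            Sentence(int documentOrder, int segment, double score, String text) {
                this.documentOrder = documentOrder;
                this.segment = segment;
                this.score = score;
                this.text = text;
            }
        }

        public static List<Sentence> summarize(List<Sentence> sentences, int summaryLength) {
            // Work from the highest-scoring sentence downward.
            List<Sentence> ranked = new ArrayList<>(sentences);
            ranked.sort(Comparator.comparingDouble((Sentence s) -> s.score).reversed());

            List<Sentence> summary = new ArrayList<>();
            // Each pass admits at most one sentence per segment, so every segment is
            // represented before any segment contributes a second sentence.
            while (summary.size() < summaryLength && !ranked.isEmpty()) {
                Set<Integer> segmentsUsed = new HashSet<>();
                List<Sentence> leftover = new ArrayList<>();
                for (Sentence s : ranked) {
                    if (summary.size() >= summaryLength) break;
                    if (segmentsUsed.add(s.segment)) {
                        summary.add(s);
                    } else {
                        leftover.add(s); // deferred, as in the recursive call of Figure 3
                    }
                }
                ranked = leftover;
            }

            // Finally, restore original document order for display.
            summary.sort(Comparator.comparingInt(s -> s.documentOrder));
            return summary;
        }
    }

Because the loop stops as soon as the length requirement is met, shortening or lengthening the requested summary simply drops or admits whole segments, which is the practical advantage of ranking segments noted above.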
4. PRELIMINARY TESTING

As a preliminary test of the performance of the TXTRACTOR summarizer, subjects compared summaries produced by segmentation alone with summaries produced by TXTRACTOR. A length limit of five sentences was imposed on all the summaries.

4.1 Segmented Summaries
The summaries produced by the segmentation-only approach used the same segmenting code as that used in TXTRACTOR. After segments were identified, every word in each segment, except for those on the stop list, was scored based on the tf*idf equation. The two highest-ranking terms from each segment were identified, along with the first occurrence of each of the terms in the segment. The sentence(s) where the first occurrences were identified were then added to the summary. Each segment produced one or two sentences, depending on whether one sentence contained the top two keywords or not. The same procedure was carried out for every segment until there was at least one sentence from every segment in the summary. In cases where there were more than five sentences, only the first five sentences were included in the summary for comparison purposes.

4.2 TXTRACTOR Configuration
The configuration settings of the document-segmenting algorithm were kept constant between the TXTRACTOR system and the segmentation-only system. Token-sequences of length 20 were used, and 10 token-sequences were added together to form a block for the similarity comparisons. Blocks were allowed to cross sentence boundaries, and no stemming or noun phrasing was applied in identifying the document segments. Stemming, however, was used in calculating tf*idf in the sentence-selection portion of TXTRACTOR. The segmenting code was allowed to determine how many topic areas the document had instead of being forced to generate the boundaries for a predetermined number of topics.

4.3 An Example Document
While not included in the summaries evaluated in the user studies, the article in Figure 4 is a good example of the differences between the TXTRACTOR summaries (referenced by "TXT#" in the figure) and the summaries generated by the segmentation-only approach (referenced by "SEG#" in the figure) [15]. Large asterisks and segment numbers highlight the breaks in the document segments. The first sentence selected by TXTRACTOR is the first sentence in the document, despite the fact that it had a 10-point lower tf*idf score than the first segmentation sentence. The first sentence, however, is a very good summarizing sentence. The segmentation approach then selects a second sentence from the first topic area. The two summaries then select the same sentence from segment two to add to their summaries. Later, TXTRACTOR skips over the third topic area, while the segmentation algorithm adds its final two sentences from that topic area; a segmentation summary tries to include sentences from every segment. TXTRACTOR had ranked the two sentences added in the segmentation summary 50th and 62nd, respectively. The sentences had low scores for sentence length and somewhat low scores for tf*idf. Sentences three and four of the TXTRACTOR summary are scored highly due to the included cue phrase "thus." The final sentence selected by TXTRACTOR is not rated in the top five best sentences (it is sixth), but because two sentences in the top five come from the first topic area, room in the summary is preserved for a segment not already represented. Thus, the ranking routine includes a sentence from the seventh segment in the summary instead of duplicating sentences in a segment.

4.4 Document Selection
Five documents were deliberately selected from various subject domains. Document subjects ranged from psychology and sports to arts and science. The subjects of the documents were varied in order to see whether the TXTRACTOR approach had limitations in certain subject domains. Effort was also made to vary the length of the documents: the numbers of words in the documents ranged from 537 up to 13,293. Different document lengths were selected so that varied numbers of segments would be created. By including long articles, we hoped to get preliminary clues as to which summarizer prioritized a document's segments best and whether prioritizing segments led to improved summaries. In this experiment, we did not ask the subjects to judge the cohesiveness of the summaries; we tried to focus the user on the information content of the summary, not its cohesiveness.

We selected the following five documents to be summarized:
1. Turning Snooping Into Art, by Noah Shachtman, 773 words [23]
2. No. 4 Virginia suffers first loss of season, Game Day Recap, 537 words [20]
3. Nanotech Fine Tuning, by Mark K. Anderson, 654 words [2]
4. Ann Landers, by Ann Landers, 650 words [14]
5. A Primer on Narcissism, by Sam Vaknin, Ph.D., 13,293 words [25]

4.5 Experiment Participants
Five subjects were chosen to compare the TXTRACTOR summary with the summary generated by the segmentation-only approach. All subjects were above the age of 20 and were either completing or had already obtained a bachelor's degree. The subjects were emailed the ten summaries, grouped by original document. Participants were directed to choose the summary that provided the most pertinent information and seemed to be the most useful in general.
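For contrast, the segmentation-only baseline described in Section 4.1 can be sketched in Java as follows. The names are illustrative assumptions, and raw within-segment term frequency stands in here for the paper's tf*idf score.

    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative sketch of the Section 4.1 segmentation-only baseline.
    public class SegmentationBaseline {

        // segments: each inner list holds the sentences of one topic segment, in order.
        public static List<String> summarize(List<List<String>> segments,
                                             Set<String> stopList, int maxSentences) {
            List<String> summary = new ArrayList<>();
            for (List<String> segmentSentences : segments) {
                // Score every non-stop word in the segment (frequency as a tf*idf stand-in).
                Map<String, Integer> frequency = new HashMap<>();
                for (String sentence : segmentSentences) {
                    for (String w : sentence.toLowerCase().split("\\W+")) {
                        if (!w.isEmpty() && !stopList.contains(w)) {
                            frequency.merge(w, 1, Integer::sum);
                        }
                    }
                }
                // Identify the two highest-ranking terms in the segment.
                List<String> topTerms = frequency.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(2)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
                // Add the sentence holding each term's first occurrence; one sentence
                // suffices when it contains both top terms.
                for (String term : topTerms) {
                    for (String sentence : segmentSentences) {
                        if (sentence.toLowerCase().contains(term)) {
                            if (!summary.contains(sentence)) summary.add(sentence);
                            break;
                        }
                    }
                }
            }
            // Keep only the first five sentences for comparison purposes.
            return summary.size() > maxSentences
                ? new ArrayList<>(summary.subList(0, maxSentences))
                : summary;
        }
    }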


1**[The laws protecting software code are stifling creativity, destroying knowledge, and betraying the public trust. TXT1 Legal heavy Lawrence Lessig argues it's time to bust the copyright monopoly. In the early 1970s, RCA was experimenting with a new technology for distributing film on magnetic tape - what we would come to call video. SEG1 Researchers were keen not only to find a means for reproducing celluloid with high fidelity but also to discover a way to control the use of the technology. Their aim was a method that could restrict the use of a film distributed on video, allowing the studio to maximize the film's return from distribution. The technology eventually chosen was relatively simple. A video would play once, and when finished, the cassette would lock into place. If a customer wanted to play the tape again, she would have to return it to the video store and have it unlocked.…. They were horrified. They would "never," Feely reported, permit their content to be distributed in that form, because the content - however clever the self-locking tape was - was still insufficiently controlled. How could they know, one of the Disney execs asked Feely, "how many people were going to be sitting there watching" a film? What's to stop someone else from coming in and watching for free? We live in a world with "free" content, and this freedom is not an imperfection. We listen to the radio without paying for the songs we hear; SEG2 we hear friends humming tunes that they have not licensed. We tell jokes that reference movie plots without the permission of the directors. We read our children books, borrowed from a library, without paying the original copyright holder for the performance rights. The fact that content at a particular time may be free tells us nothing about whether using that content is theft. Similarly, in arguing for increasing content owners' control over content users, it's not sufficient to say "They didn't pay for this use." Second, the reason perfect control has not been our tradition's aim is that creation always involves building upon something else. There is no art that doesn't reuse. And there will be less art if every reuse is taxed by the appropriator.]**2**[Monopoly controls have been the exception in free societies; they have been the rule in closed societies. Finally, while control is needed, and perfectly warranted, our bias should be clear up front: Monopolies are not justified by theory; TXT2 SEG3 they should be permitted only when justified by facts. If there is no solid basis for extending a certain monopoly protection, then we should not extend that protection. This does not mean that every copyright must prove its value initially. That would be a far too cumbersome system of control. But it does mean that every system or category of copyright or patent should prove its worth. Before the monopoly should be permitted, there must be reason to believe it will do some good - for society, and not just for monopoly holders. One example of this expansion of control is in the realm of software.]**3**[Like authors and publishers, coders (or more likely, the SEG4 companies they work for) enjoy decades of copyright protection. Yet the public gets very little in return. The current term of protection for software is the life of an author plus 70 years, or, if it's work-for-hire, a total of 95 years. This is a bastardization of the Constitution's requirement that copyright be for "limited times." 
By the time Apple's Macintosh operating system finally falls into the public domain, there will be no machine that could possibly run it. The term of copyright for software is effectively unlimited. Worse, the copyright system safeguards software without creating any new knowledge in return. When the system protects Hemingway, we at SEG5 least get to see how Hemingway writes.]**4**[We get to learn about his style and the tricks he uses to make his work succeed. We can see this because it is the nature of creative writing that the writing is public. There is no such thing as language that conveys meaning while not simultaneously transmitting its words. Software is different: Software gets compiled, and the compiled code is essentially unreadable; but in order to copyright software, the author need not reveal the source code. Thus, while the English department gets to analyze Virginia Woolf's novels to train TXT3 its students in better writing, the computer science department doesn't get to examine Apple's operating system to train its students in better coding. The harm that comes from this system of protecting creativity is greater than the loss experienced by computer science education.]**5**[While the creative works from the 16th century can still be accessed and used by others, the data in some software programs from the 1990s is already inaccessible. Once a company that produces a certain product goes out of business, it has no simple way to uncover how its product encoded data. The code is thus lost, and the software is inaccessible. Knowledge has been destroyed. Copyright law doesn't require the release of source code because it is believed that software would become unprotectable. The open source movement might throw that view into doubt, but even if one believes it, the remedy (no source code) is worse than the disease. There are plenty of ways for software to be secured without the safeguards of law. Copy-protection systems, for example, give the copyright holder plenty of control over how and when the software is copied. If society is to give software producers more protection than they would otherwise take, then we should get something in return.]**6**[And one thing we could get would be access to the source code after the copyright expires. Thus, I would dramatically reduce the safeguards for software - from the ordinary term of 95 years to an initial term of 5 years, renewable TXT4 once. And I would extend that government-backed protection only if the author submitted a duplicate of the source code to be held in escrow while the work was protected. Once the copyright expired, that escrowed version would be publicly available from the copyright office. Most programmers should like this change. No code lives for 10 years, and getting access to the source code of even orphaned software projects would benefit all. More important, it would unlock the knowledge built into this protected code for others to build upon as they see fit.]**7**[Software would thus be like every other creative work - open for others to see and to learn from. There are other ways that the government could help free up resources for innovation. … One context in particular where this could do some good is in orphaned software. Companies often decide that the costs of developing or maintaining software outweigh the benefits. They therefore "orphan" the software by neither selling it nor supporting it. They have little reason, however, to make the software's source code available to others. 
The code simply disappears, and the products become useless. Software gets 95 years of copyright protection. By the time the Mac OS finally falls into the public domain, no machine will be able to run it. But if Congress created an incentive for these companies to donate their code to a conservancy, then others could build on the earlier work TXT5 and produce updated or altered versions. This in turn could improve the software available by preserving the knowledge that was built into the original code. Orphans could be adopted by others who saw their special benefit. The problems with software are just examples of the problems found generally with creativity.]**8**[Our trend in copyright law has been to enclose as much as we can; the consequence of this enclosure is a stifling of creativity and innovation. If the Internet teaches us anything, it is that great value comes from leaving core resources in a commons, where they're free for people to build upon as they see fit. An Innovation Commons was the essence - the core - of the Internet. We are now corrupting this core, and this corruption will in turn destroy the opportunity for creativity that the Internet built.]**

Figure 4 - Original text showing topics and sentences extracted via TXTRACTOR and a segmentation-only approach

4.6 Results
Despite the small size of the experiment, we were able to observe some encouraging responses. Of the 25 comparisons that were made between summaries, the TXTRACTOR summary was preferred 14 times, the segmentation-only summary was preferred 8 times, and 3 times the summaries were judged to be more or less the same. The subjects therefore preferred the TXTRACTOR summaries 7:4 over the summaries generated by segmentation only. After submitting their responses, the subjects were told which summarizer produced which summary. The participants then volunteered explanations for why they chose the summaries they did. A common sentiment was that the TXTRACTOR summary contained more information, but the sentences sometimes did not flow well. When the sentences flowed well, as judged by the participants, the TXTRACTOR-produced summary was usually preferred. An interesting note is that even though we had not instructed the subjects to assess the readability of the summary, users did not ignore the summary's cohesiveness. It seems, even with indicative summaries, that poor readability can distract a subject from information content.

5. CONCLUSION & FUTURE DIRECTION

Based on our tests, the TXTRACTOR summarizer outperformed the summarizer based solely on segmentation. The hypothesis that ranking segments through the use of established sentence-selection heuristics leads to better text-extracted summaries appears to be promising. There is much that can be done, however, to improve the performance of the summarizer. Future improvements to TXTRACTOR include implementing the local salience method of cohesion analysis [4]. The local salience method is based on the assumption that relevant words and phrases are revealed by a "combination of grammatical, syntactic, and contextual parameters." The original document is parsed to identify each sentence's subjects and predicates, and different weights are then given to sentences based on the part of speech containing the term being analyzed. Experimentation will be conducted on how many parts of speech to parse out of each sentence. Additional research is needed to tune the weights of the sentence-selection methods being used; much research that has been done in this area could be incorporated into our work. In addition, analyzing the discourse context of the sentences should help improve the cohesiveness of the summaries.

We are currently planning more complete experiments and user studies on our combined segmentation and sentence-selection approach to summarization. We are looking to test our summarization approach on a larger scale, similar to that done at the May 1998 SUMMAC conference [1]. Finally, we would like to implement and test TXTRACTOR in different digital library domains such as medical libraries and web pages.

6. ACKNOWLEDGMENTS

We would like to express our gratitude to the NSF Digital Library Initiative-2, "High-performance Digital Library Systems: From Information Retrieval to Knowledge Management," IIS-9817473, April 1999 – March 2002. We also would like to thank William Oliver for his implementation of the TextTiling algorithm and Karina McDonald for her feedback on the summaries.

7. REFERENCES

[1] in TIPSTER Text Phase III 18-Month Workshop, (Fairfax, VA, 1998).
[2] Anderson, M.K. Nanotech Fine Tuning. http://www.wired.com/news/technology/0,1282,49447-2,00.html.
[3] Aone, C., Okurowski, M.E., Gorlinsky, J. and Larsen, B. A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 71-80.
[4] Boguraev, B. and Kennedy, C., Salience-based Content Characterization of Text Documents. in Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, (Madrid, Spain, 1997), 2-9.
[5] Carbonell, J. and Goldstein, J., The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. in SIGIR, (Melbourne, Australia, 1998), 335-336.
[6] Edmundson, H.P. New Methods in Automatic Extracting. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1969, 23-42.
[7] Firmin, T. and Chrzanowski, M.J. An Evaluation of Automatic Text Summarization Systems. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999.
[8] Goldstein, J., Kantrowitz, M., Mittal, V. and Carbonell, J., Summarizing Text Documents: Sentence Selection and Evaluation Metrics. in 22nd International Conference on Research and Development in Information Retrieval, (1999).
[9] Hearst, M.A. Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1). 33-64.
[10] Hovy, E. and Lin, C.-Y. Automated Text Summarization in SUMMARIST. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1999, 81-94.
[11] Jurafsky, D. and Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, 2000.
[12] Kupiec, J., Pedersen, J. and Chen, F., A Trainable Document Summarizer. in Proceedings of the 18th ACM-SIGIR Conference, (1995), 68-73.
[13] Lam-Adesina, A.M. and Jones, G.J.F., Applying Summarization Techniques for Term Selection in Relevance Feedback. in SIGIR, (New Orleans, Louisiana, USA, 2001), 1-9.
[14] Landers, A. Ann Landers. http://www.washingtonpost.com/wp-dyn/articles/A62823-2002Jan4.html.
[15] Lessig, L. May the Source Be With You. Wired Magazine, 9.12 (December). http://www.wired.com/wired/archive/9.12/lessig.html.
[16] Luhn, H.P. The Automatic Creation of Literature Abstracts. in Maybury, M.T. ed. Advances in Automatic Text Summarization, The MIT Press, Cambridge, 1958, 15-22.


[17] Mani, I. and Maybury, M.T. (eds.). Advances in Automatic Text Summarization. The MIT Press, Cambridge, 1999.
[18] Nomoto, T. and Matsumoto, Y., A New Approach to Unsupervised Text Summarization. in SIGIR, (New Orleans, LA, USA, 2001), 26-34.
[19] Radev, D.R., Jing, H. and Budzikowska, M., Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. in ANLP/NAACL Workshop on Automatic Summarization, (Seattle, WA, 2000).
[20] Recap. No. 4 Virginia suffers first loss of season. http://sports.espn.go.com/ncaa/mbasketball/recap?gameId=220050189, 2002.
[21] Sakai, T. and Jones, K.S., Generic Summaries for Indexing in Information Retrieval. in SIGIR, (New Orleans, Louisiana, USA, 2001), 190-198.
[22] Sanderson, M., Accurate user directed summarization from existing tools. in Conference on Information and Knowledge Management, (Bethesda, MD, USA, 1998), 45-51.
[23] Shachtman, N. Turning Snooping Into Art. http://www.wired.com/news/culture/0,1284,49439,00.html.
[24] Teufel, S. and Moens, M., Sentence Extraction as a Classification Task. in Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, (Madrid, Spain, 1997), 58-65.
[25] Vaknin, S. A Primer on Narcissism. http://www.mentalhelp.net/poc/view_doc.php/type/doc/id/419.
