Evaluation of Features for Sentence Extraction on Different Types of Corpora

Chikashi Nobata†, Satoshi Sekine‡ and Hitoshi Isahara† † Communications Research Laboratory 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan {nova, isahara}@crl.go.jp ‡ Computer Science Department, New York University 715 Broadway, 7th floor, New York, NY 10003, USA [email protected]

Abstract

We report evaluation results for our summarization system and analyze the resulting summarization data for three different types of corpora. To develop a robust summarization system, we have created a system based on sentence extraction, applied it to summarize Japanese and English newspaper articles, and obtained some of the top results at two evaluation workshops. We have also created sentence extraction data from Japanese lectures and evaluated our system with these data. In addition to the evaluation results, we analyze the relationships between key sentences and the features used in sentence extraction. We find that discrete combinations of features match distributions of key sentences better than sequential combinations.

1 Introduction

Our ultimate goal is to create a robust summarization system that can handle different types of documents in a uniform way. To achieve this goal, we have developed a summarization system based on sentence extraction. We have participated in evaluation workshops on automatic summarization for both Japanese and English written corpora. We have also evaluated the performance of the sentence extraction system for Japanese lectures. At both workshops we obtained some of the top results, and for the Japanese lectures we obtained results comparable with those for the written corpora. This means that the features we use are worth analyzing.

Sentence extraction is one of the main methods required for a summarization system to reduce the size of a document. Edmundson (1969) proposed a method of integrating several features, such as the positions of sentences and the frequencies of words in an article, in order to extract sentences. He manually assigned parameter values to integrate features for estimating the significance scores of sentences. On the other hand, machine learning methods can also be applied to integrate features. For sentence extraction from training data, Kupiec et al. (1995) and Aone et al. (1998) used Bayes' rule, Lin (1999) and Nomoto and Matsumoto (1997) generated a decision tree, and Hirao et al. (2002) used an SVM.

In this paper, we not only show evaluation results for our sentence extraction system using combinations of features but also analyze the features for different types of corpora. The analysis gives us some indication about how to use these features and how to combine them.

2 Summarization data

The summarization data we used for this research were prepared from Japanese newspaper articles, Japanese lectures, and English newspaper articles. By using these three types of data, we could compare two languages and also two different types of corpora, a written corpus and a speech corpus.

2.1 Summarization data from Japanese newspaper articles

Text Summarization Challenge (TSC) is an evaluation workshop for automatic summarization, which is run by the National Institute of Informatics in Japan (TSC, 2001). Three tasks were presented at TSC-2001: extracting important sentences, creating summaries to be compared with summaries prepared by humans, and creating summaries for information retrieval. We focus on the first task here, i.e., the sentence extraction task. At TSC-2001, a dry run and a formal run were performed. The dry run data consisted of 30 newspaper articles and manually created summaries of each. The formal run data consisted of another 30 pairs of articles and summaries. The average number of sentences per article was 28.5 (1709 sentences / 60 articles). The newspaper articles included 15 editorials and 15 news reports in both data sets. The summaries were created from extracted sentences with three compression ratios (10%, 30%, and 50%). In our analysis, we used the extraction data for the 10% compression ratio.

In the following sections, we call these summarization data the "TSC data". We use the TSC data as an example of a Japanese written corpus to evaluate the performance of sentence extraction.
2.2 Summarization data from Japanese lectures

The speech corpus we used for this experiment is part of the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000), which is being created by NIJLA, TITech, and CRL as an ongoing joint project. The CSJ is a large collection of monologues, such as lectures, and it includes transcriptions of each speech as well as the voice data. We selected 60 transcriptions from the CSJ for both sentence segmentation and sentence extraction. Since these transcription data do not have sentence boundaries, sentence segmentation is necessary before sentence extraction. Three annotators manually generated sentence segmentation and summarization results. The target compression ratio was set to 10%. The results of sentence segmentation were unified to form the key data, and the average number of sentences was 68.7 (4123 sentences / 60 speeches). The results of sentence extraction, however, were not unified, but were used separately for evaluation.

In the following sections, we call these summarization data the "CSJ data". We use the CSJ data as an example of a Japanese speech corpus to evaluate the performance of sentence extraction.

2.3 Summarization data from English newspaper articles

Document Understanding Conference (DUC) is an evaluation workshop in the U.S. for automatic summarization, which is sponsored by TIDES of the DARPA program and run by NIST (DUC, 2001). At DUC-2001, there were two types of tasks: single-document summarization (SDS) and multi-document summarization (MDS). The organizers of DUC-2001 provided 30 sets of documents for a dry run and another 30 sets for a formal run. These data were shared by both the SDS and MDS tasks, and the average number of sentences was 42.5 (25779 sentences / 607 articles). Each document set had a topic, such as "Hurricane Andrew" or "Police Misconduct", and contained around 10 documents relevant to the topic. We focus on the SDS task here, for which the size of each summary output was set to 100 words. Model summaries for the articles were also created by hand and provided. Since these summaries were abstracts, we created sentence extraction data from the abstracts by word-based comparison.

In the following sections, we call these summarization data the "DUC data". We use the DUC data as an example of an English written corpus to evaluate the performance of sentence extraction.

3 Overview of our sentence extraction system

In this section, we give an overview of our sentence extraction system, which uses multiple components. For each sentence, each component outputs a score. The system then combines these independent scores by interpolation. Some components have more than one scoring function, using various features. The weights and function types used are decided by optimizing the performance of the system on training data.

Our system includes parts that are either common to the TSC, CSJ, and DUC data or specific to one of these data sets. We stipulate which parts are specific.
3.1 Features for sentence extraction

3.1.1 Sentence position

We implemented three functions for sentence position. The first function returns 1 if the position of the sentence is within a given threshold N from the beginning, and returns 0 otherwise:

P1. Score_pst(S_i) = 1 (if i < N); 0 (otherwise)    (1 ≤ i ≤ n)

The threshold N is determined by the number of words in the summary.

The second function is the reciprocal of the position of the sentence, i.e., the score is highest for the first sentence, gradually decreases, and goes to a minimum at the final sentence:

P2. Score_pst(S_i) = 1 / i

These first two functions are based on the hypothesis that the sentences at the beginning of an article are more important than those in the remaining part. The third function is the maximum of the reciprocal of the position from either the beginning or the end of the document:

P3. Score_pst(S_i) = max(1 / i, 1 / (n − i + 1))

This method is based on the hypothesis that the sentences at both the beginning and the end of an article are more important than those in the middle.

3.1.2 Sentence length

The second type of scoring function uses sentence length to determine the significance of sentences. We implemented three scoring functions for sentence length. The first function only returns the length of each sentence (L_i):

L1. Score_len(S_i) = L_i

The second function sets the score to a negative value as a penalty when the sentence is shorter than a certain length (C):

L2. Score_len(S_i) = 0 (if L_i ≥ C); L_i − C (otherwise)

The third function combines the above two approaches, i.e., it returns the length of a sentence that has at least a certain length, and otherwise returns a negative value as a penalty:

L3. Score_len(S_i) = L_i (if L_i ≥ C); L_i − C (otherwise)

The length of a sentence means the number of letters, and based on the results of an experiment with the training data, we set C to 20 for the TSC and CSJ data. For the DUC data, the length of a sentence means the number of words, and we set C to 10 during the training stage.

3.1.3 Tf*idf

The third type of scoring function is based on term frequency (tf) and document frequency (df). We applied three scoring functions for tf*idf, in which the term frequencies are calculated differently. The first function uses the raw term frequencies, while the other two are two different ways of normalizing the frequencies, as follows, where DN is the number of documents given:

T1. tf*idf(w) = tf(w) · log(DN / df(w))
T2. tf*idf(w) = ((tf(w) − 1) / tf(w)) · log(DN / df(w))
T3. tf*idf(w) = (tf(w) / (tf(w) + 1)) · log(DN / df(w))

For the TSC and CSJ data, we only used the third method (T3), which was reported to be effective for the task of information retrieval (Robertson and Walker, 1994). The target words for these functions are nouns (excluding temporal or adverbial nouns). For each of the nouns in a sentence, the system calculates a tf*idf score. The total score is the significance of the sentence. The word segmentation was generated by Juman3.61 (Kurohashi and Nagao, 1999). We used articles from the Mainichi newspaper in 1994 and 1995 to count document frequencies.

For the DUC data, the raw term frequency (T1) was selected during the training stage from among the three tf*idf definitions. A list of stop words was used to exclude functional words, and articles from the Wall Street Journal in 1994 and 1995 were used to count document frequencies.
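For concreteness, the scoring functions above can be written down directly. The following Python fragment is a minimal sketch of P1–P3, L1–L3, and T1–T3 as defined in this section; the function and argument names are our own illustration, not those of the actual system.

```python
import math

# --- Sentence position (3.1.1): i is the 1-based sentence index,
# n the number of sentences, threshold_n the positional threshold N. ---
def score_pst_p1(i, n, threshold_n):
    return 1.0 if i < threshold_n else 0.0

def score_pst_p2(i, n):
    return 1.0 / i

def score_pst_p3(i, n):
    return max(1.0 / i, 1.0 / (n - i + 1))

# --- Sentence length (3.1.2): length is letters (TSC/CSJ, C = 20)
# or words (DUC, C = 10). ---
def score_len_l1(length, c):
    return float(length)

def score_len_l2(length, c):
    return 0.0 if length >= c else float(length - c)

def score_len_l3(length, c):
    return float(length) if length >= c else float(length - c)

# --- Tf*idf (3.1.3): tf is the term frequency, df the document
# frequency, dn the number of documents (DN). ---
def tfidf_t1(tf, df, dn):
    return tf * math.log(dn / df)

def tfidf_t2(tf, df, dn):
    return (tf - 1.0) / tf * math.log(dn / df)

def tfidf_t3(tf, df, dn):
    return tf / (tf + 1.0) * math.log(dn / df)

# The tf*idf score of a sentence is the sum over its target nouns,
# e.g. with the T3 variant:
def score_tfidf(nouns, tf, df, dn):
    return sum(tfidf_t3(tf[w], df[w], dn) for w in nouns)
```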
3.1.4 Headline

We used a similarity measure of the sentence to the headline as another type of scoring function. The basic idea is that the more words in the sentence overlap with the words in the headline, the more important the sentence is. The function estimates the relevance between a headline (H) and a sentence (S_i) by using the tf*idf values of the words (w) in the headline:

Score_hl(S_i) = [ Σ_{w ∈ H ∩ S_i} (tf(w) / (tf(w) + 1)) · log(DN / df(w)) ] / [ Σ_{w ∈ H} (tf(w) / (tf(w) + 1)) · log(DN / df(w)) ]

We also evaluated another method based on this scoring function by using only named entities (NEs) instead of words for the TSC data and DUC data. Only the term frequency was used for NEs, because we judged that the document frequency for an entity was usually quite small, thereby making the differences between entities negligible.

3.1.5 Patterns

For the DUC data, we used dependency patterns as a type of scoring function. These patterns were extracted by pattern discovery during information extraction (Sudo et al., 2001). The details of this approach are not explained here, because this feature is not among the features we analyze in Section 5. The definition of the function appears in (Nobata et al., 2002).

3.2 Optimal weight

Our system sets weights for each scoring function in order to calculate the total score of a sentence. The total score of a sentence (S_i) is defined from the scoring functions (Score_j()) and weights (α_j) as follows:

TotalScore(S_i) = Σ_j α_j · Score_j(S_i)    (1)

We estimated the optimal values of these weights from the training data. After the range of each weight was set manually, the system changed the values of the weights within the range and summarized the training data for each set of weights. Each score was recorded after the weights were changed, and the weights with the best scores were stored. A particular scoring method was also selected in the case of features with more than one defined scoring method.

We used the dry run data from each workshop as TSC and DUC training data. For the TSC data, since the 30 articles contained 15 editorials and 15 news reports, we estimated optimal values separately for editorials and news reports. For the CSJ data, we used 50 transcriptions for training and 10 for testing, as mentioned in Section 2.2.
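Equation 1 and the weight search described above can be sketched as follows. This is a minimal illustration under our own assumptions: an exhaustive scan over a manually chosen grid of weight values, scored by the F-measure (2 · REC · PRE / (REC + PRE)) on training data with a fixed number of extracted sentences per document. The actual system's search procedure and data structures may differ.

```python
import itertools

def total_score(feature_scores, weights):
    # Equation 1: TotalScore(S_i) = sum_j alpha_j * Score_j(S_i)
    return sum(a * s for a, s in zip(weights, feature_scores))

def f_measure(system_ids, gold_ids):
    # REC = COR / GLD, PRE = COR / SYS, F = 2*REC*PRE / (REC + PRE)
    correct = len(system_ids & gold_ids)
    if correct == 0:
        return 0.0
    rec = correct / len(gold_ids)
    pre = correct / len(system_ids)
    return 2 * rec * pre / (rec + pre)

def grid_search(docs, weight_grid, n_extract):
    """docs: list of (sentence_feature_scores, gold_ids) pairs, where
    sentence_feature_scores[i] is the list of feature scores of sentence i.
    weight_grid: one list of candidate weight values per feature."""
    best_weights, best_f = None, -1.0
    for weights in itertools.product(*weight_grid):
        f_sum = 0.0
        for sent_scores, gold_ids in docs:
            # Rank sentences by total score and extract the top n_extract.
            ranked = sorted(range(len(sent_scores)),
                            key=lambda i: total_score(sent_scores[i], weights),
                            reverse=True)
            f_sum += f_measure(set(ranked[:n_extract]), gold_ids)
        if f_sum / len(docs) > best_f:
            best_weights, best_f = weights, f_sum / len(docs)
    return best_weights, best_f
```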
4 Evaluation results

In this section, we show our evaluation results on the three sets of data for the sentence extraction system described in the previous section.

4.1 Evaluation results for the TSC data

Table 1 shows the evaluation results for our system and some baseline systems on the task of sentence extraction at TSC-2001. The figures in Table 1 are values of the F-measure.¹ The 'System' column shows the performance of our system and its rank among the nine systems that were applied to the task, and the 'Lead' column shows the performance of a baseline system which extracts as many sentences as the threshold from the beginning of a document. Since all participants could output as many sentences as the allowed upper limit, the values of the recall, precision, and F-measure were the same. Our system obtained better results than the baseline systems, especially when the compression ratio was 10%. The average performance was second among the nine systems.

Table 1: Evaluation results for the TSC data.

  Ratio    10%        30%        50%        Avg.
  System   0.363 (1)  0.435 (5)  0.589 (2)  0.463 (2)
  Lead     0.284      0.432      0.586      0.434

¹ The definitions of each measurement are as follows: Recall (REC) = COR / GLD; Precision (PRE) = COR / SYS; F-measure = 2 * REC * PRE / (REC + PRE), where COR is the number of correct sentences marked by the system, GLD is the total number of correct sentences marked by humans, and SYS is the total number of sentences marked by the system. After calculating these scores for each transcription, the average is calculated as the final score.

4.2 Evaluation results for the DUC data

Table 2 shows the results of a subjective evaluation in the SDS task at DUC-2001. In this subjective evaluation, assessors gave a score to each system's outputs, on a zero-to-four scale (where four is the best), as compared with summaries made by humans. The figures shown are the average scores over all documents. The 'System' column shows the performance of our system and its rank among the 12 systems that were applied to this task. The 'Lead' column shows the performance of a baseline system that always outputs the first 100 words of a given document, while the 'Avg.' column shows the average for all systems. Our system ranked 5th in grammaticality and was ranked at the top for the other measurements, including the total value.

Table 2: Evaluation results for the DUC data (subjective evaluation).

                  System     Lead    Avg.
  Grammaticality  3.711 (5)  3.236   3.580
  Cohesion        3.054 (1)  2.926   2.676
  Organization    3.215 (1)  3.081   2.870
  Total           9.980 (1)  9.243   9.126

4.3 Evaluation results for the CSJ data

The evaluation results for sentence extraction with the CSJ data are shown in Table 3. We compared the system's results with each annotator's key data. As mentioned previously, we used 50 transcriptions for training and 10 for testing.

Table 3: Evaluation results for the CSJ data.

  Annotators   A      B      C      Avg.
  REC          0.407  0.331  0.354  0.364
  PRE          0.416  0.397  0.322  0.378
  F            0.411  0.359  0.334  0.368

These results are comparable with the performance on sentence extraction for written documents, because the system's performance for the TSC data was 0.363 when the compression ratio was set to 10%. The results of our experiments thus show that for transcriptions, sentence extraction achieves results comparable to those for written documents, if the sentence boundaries are well defined.

4.4 Contributions of features

Table 4 shows the contribution vectors for each set of training data. The contribution here means the product of the optimized weight and the standard deviation of the score for the test data. The vectors were normalized so that the sum of the components is equal to 1, and the selected function types for the features are also shown in the table. Our system used the NE-based headline function (HL (N)) for the DUC data and the word-based function (HL (W)) for the CSJ data, and both functions for the TSC data. The columns for the TSC data show the contributions when the compression ratio was 10%.

Table 4: Contribution (weight × s.d.) of each feature for each set of summarization data.

  Features   TSC Editorial   TSC Report   DUC         CSJ
  Pst.       P3. 0.446       P1. 0.254    P1. 0.691   P3. 0.055
  Len.       L3. 0.000       L3. 0.000    L2. 0.020   L2. 0.881
  Tf*idf     T3. 0.169       T3. 0.185    T1. 0.239   T3. 0.057
  HL (W)     0.171           0.292        -           0.007
  HL (N)     0.214           0.269        0.045       -
  Pattern    -               -            0.005       -

We can see that the feature with the biggest contribution varies among the data sets. While the position feature was the most effective for the TSC and DUC data, the length feature was dominant for the CSJ data. Most of the short sentences in the lectures were specific expressions, such as "This is the result of the experiment." or "Let me summarize my presentation.". Since these sentences were not extracted as key sentences by the annotators, it is believed that the function giving short sentences a penalty score matched the manual extraction results.
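The contribution values in Table 4 are described as the optimized weight multiplied by the standard deviation of the feature's scores, normalized so that the vector sums to 1. A minimal sketch of that computation, with hypothetical variable names, is:

```python
import statistics

def contribution_vector(weights, feature_scores):
    """weights[j]: optimized weight of feature j.
    feature_scores[j]: scores of feature j over all test sentences."""
    raw = [w * statistics.pstdev(scores)
           for w, scores in zip(weights, feature_scores)]
    total = sum(raw)
    # Normalize so that the components of the vector sum to 1.
    return [r / total for r in raw] if total else raw
```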
5 Analysis of the summarization data

In Section 4, we showed how our system, which combines major features, has performed well as compared with current summarization systems. However, the evaluation results alone do not sufficiently explain how such a combination of features is effective. In this section, we investigate the correlations between each pair of features. We also match feature pairs with distributions of extracted key sentences as answer summaries to find effective combinations of features for sentence extraction.

5.1 Correlation between features

Table 5 shows Spearman's rank correlation coefficients among the four features. Significantly correlated feature pairs are indicated by '∗' (α = 0.001). Here, the word-based feature is used as the headline feature.

Table 5: Rank correlation coefficients between features.

  TSC Report
  Features   Length   Tf*idf   Headline
  Position   0.019    -0.095   -0.139
  Length     –        0.546∗   0.338∗
  Tf*idf     –        –        0.696∗

  TSC Editorial
  Features   Length   Tf*idf   Headline
  Position   -0.047   -0.099   0.046
  Length     –        0.532∗   0.289∗
  Tf*idf     –        –        0.658∗

  DUC Data
  Features   Length   Tf*idf   Headline
  Position   -0.130∗  -0.108∗  -0.134∗
  Length     –        0.471∗   0.293∗
  Tf*idf     –        –        0.526∗

  CSJ Data
  Features   Length   Tf*idf   Headline
  Position   -0.092∗  -0.069∗  -0.106∗
  Length     –        0.460∗   0.224∗
  Tf*idf     –        –        0.533∗

We see the following tendencies for any of the data sets:

• "Position" is relatively independent of the other features.
• "Length" and "Tf*idf" have high correlation.²
• "Tf*idf" and "Headline" also have high correlation.

² Here we used equation T1 for the tf*idf feature, and the score of each sentence was normalized with the sentence length. Hence, the high correlation between "Length" and "Tf*idf" is not trivial.

These results show that while combinations of these four features enabled us to obtain good evaluation results, as shown in Section 4, the features are not necessarily independent of one another.
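As an illustration of how the figures in Table 5 can be obtained, the sketch below computes Spearman's rank correlation coefficient for one pair of features from their per-sentence scores and checks significance at α = 0.001. It assumes scipy is available and uses our own variable names; it is not the paper's implementation.

```python
from scipy.stats import spearmanr

def feature_correlation(scores_a, scores_b, alpha=0.001):
    """scores_a, scores_b: per-sentence scores of two features over the
    same set of sentences. Returns Spearman's rho and whether the
    correlation is significant at the given alpha level."""
    rho, p_value = spearmanr(scores_a, scores_b)
    return rho, p_value < alpha
```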

5.2 Combination of features

Tables 6 and 7 show the distributions of extracted key sentences as answer summaries with two pairs of features: sentence position and the tf*idf value, and sentence position and the headline information. In these tables, each sentence is ranked by each of the two feature values, and the rankings are split every 10 percent. For example, if a sentence is ranked in the first 10 percent by sentence position and the last 10 percent by the tf*idf feature, the sentence belongs to the cell with a position rank of 0.1 and a tf*idf rank of 1.0 in Table 6.

Each cell thus has two letters. The left letter is the number of key sentences, and the right letter is the ratio of key sentences to all sentences in the cell. The left letter shows how the number of sentences differs from the average when all the key sentences appear equally, regardless of the feature values. Let T be the total number of key sentences, M (= T/100) be the average number of key sentences in each range, and S be the standard deviation of the number of key sentences among all cells. The number of key sentences for cell T_{i,j} is then categorized according to one of the following letters:

A: T_{i,j} ≥ M + 2S
B: M + S ≤ T_{i,j} < M + 2S
C: M − S ≤ T_{i,j} < M + S
D: M − 2S ≤ T_{i,j} < M − S
E: T_{i,j} < M − 2S
O: T_{i,j} = 0
-: No sentences exist in the cell.

Similarly, the right letter in a cell shows how the ratio of key sentences differs from the average ratio when all the key sentences appear equally, regardless of feature values. Let N be the total number of sentences, m (= T/N) be the average ratio of key sentences, and s be the standard deviation of the ratio among all cells. The ratio of key sentences for cell t_{i,j} is then categorized according to one of the following letters:

a: t_{i,j} ≥ m + 2s
b: m + s ≤ t_{i,j} < m + 2s
c: m − s ≤ t_{i,j} < m + s
d: m − 2s ≤ t_{i,j} < m − s
e: t_{i,j} < m − 2s
o: t_{i,j} = 0
-: No sentences exist in the cell.

When key sentences appear uniformly regardless of feature values, every cell is defined as 'Cc'. We show both the range of the number of key sentences and the ratio of key sentences, because both are necessary to show how effectively a cell has key sentences. If a cell includes many sentences, the number of key sentences can be large even though the ratio is not. On the other hand, when the ratio of key sentences is large and the number is not, the contribution to key sentence extraction is small.
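The following sketch shows one way to implement the procedure just described: sentences are binned into the 10×10 grid by the rank deciles of two features, key sentences are counted per cell, and the two letters are assigned from M, S and m, s. This is our own reading of the definitions above, not code from the paper.

```python
import statistics

def rank_deciles(values):
    # Rank the sentences by one feature and split the ranking every 10 percent.
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, i in enumerate(order):
        deciles[i] = min(9, rank * 10 // len(values))
    return deciles

def letter(value, mean, sd, letters):
    # letters is "ABCDE" for counts (vs. M, S) or "abcde" for ratios (vs. m, s).
    if value >= mean + 2 * sd:
        return letters[0]
    if value >= mean + sd:
        return letters[1]
    if value >= mean - sd:
        return letters[2]
    if value >= mean - 2 * sd:
        return letters[3]
    return letters[4]

def categorize_cells(feat_x, feat_y, is_key):
    dx, dy = rank_deciles(feat_x), rank_deciles(feat_y)
    totals = [[0] * 10 for _ in range(10)]   # all sentences per cell
    keys = [[0] * 10 for _ in range(10)]     # key sentences per cell (T_ij)
    for x, y, k in zip(dx, dy, is_key):
        totals[x][y] += 1
        keys[x][y] += int(k)
    counts = [keys[x][y] for x in range(10) for y in range(10)]
    ratios = [keys[x][y] / totals[x][y]
              for x in range(10) for y in range(10) if totals[x][y]]
    big_m, big_s = sum(counts) / 100.0, statistics.pstdev(counts)  # M = T/100, S
    small_m = sum(map(sum, keys)) / sum(map(sum, totals))          # m = T/N
    small_s = statistics.pstdev(ratios)                            # s
    grid = [["--"] * 10 for _ in range(10)]                        # '-' : empty cell
    for x in range(10):
        for y in range(10):
            if totals[x][y] == 0:
                continue
            if keys[x][y] == 0:
                grid[x][y] = "Oo"                                  # T_ij = t_ij = 0
            else:
                grid[x][y] = (letter(keys[x][y], big_m, big_s, "ABCDE")
                              + letter(keys[x][y] / totals[x][y],
                                       small_m, small_s, "abcde"))
    return grid
```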

Table 6: Distributions of key sentences based on the combination of the sentence position (Pst.) and tf*idf features.

  DUC data                          Tf*idf
  Pst.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  0.1    Cc   Cc   Cc   Cb   Ba   Ba   Aa   Aa   Aa   Aa
  0.2    Cd   Cc   Cc   Cc   Cc   Cc   Bb   Bb   Bb   Bb
  0.3    Cd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.4    Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.5    Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.6    Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.7    Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.8    Cd   Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.9    Dd   Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  1.0    Dd   Dd   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc

  CSJ data                          Tf*idf
  Pst.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  0.1    Cc   Cc   Cc   Cc   Bc   Cc   Ab   Bb   Bb   Bb
  0.2    Oo   Oo   Cc   Cc   Cc   Cc   Cc   Bb   Bc   Cc
  0.3    Oo   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.4    Cc   Cc   Cc   Cc   Oo   Cc   Cc   Cc   Cc   Cc
  0.5    Oo   Cc   Oo   Oo   Cc   Oo   Cc   Cc   Cc   Cc
  0.6    Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.7    Oo   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc
  0.8    Oo   Cc   Cc   Oo   Oo   Cc   Cc   Cc   Cc   Cc
  0.9    Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Cc   Bb
  1.0    Cc   Cc   Ba   Bb   Bb   Aa   Aa   Bb   Aa   Aa

Table 6 shows the distributions of key sentences when the features of sentence position and tf*idf were combined. For the DUC data, both the number and ratio of key sentences were large when the sentence position was ranked within the first 20 percent and the value of the tf*idf feature was ranked in the bottom 50 percent (i.e., Pst. ≤ 0.2, Tf*idf ≥ 0.5). On the other hand, both the number and ratio of key sentences were large for the CSJ data when the sentence position was ranked in the last 10 percent and the value of the tf*idf feature was ranked after the first 30 percent (i.e., Pst. = 1.0, Tf*idf ≥ 0.3). When the tf*idf feature was low, the number and ratio of key sentences were not large, regardless of the sentence position values. These results show that the tf*idf feature is effective when the values are used as a filter after the sentences are ranked by sentence position.
Table 7: Distributions of key sentences based on the combination of the sentence position (Pst.) and headline features.

  DUC data                     Headline
  Pst.   0.1  0.2–0.5  0.6  0.7  0.8  0.9  1.0
  0.1    Ab   --       --   Ca   Ba   Ba   Aa
  0.2    Ac   --       --   Cb   Cc   Ca   Ca
  0.3    Ac   --       --   Cc   Cc   Cb   Cb
  0.4    Ac   --       --   Cc   Cc   Cc   Cb
  0.5    Ac   --       --   Cc   Cc   Cc   Cc
  0.6    Bc   --       --   Cc   Cc   Cc   Cc
  0.7    Bc   --       --   Cc   Cc   Cc   Cc
  0.8    Ac   --       --   Cd   Cc   Cc   Cc
  0.9    Bd   --       --   Cd   Cc   Cc   Cc
  1.0    Bd   --       --   Cd   Cc   Cc   Cc

  CSJ data                     Headline
  Pst.   0.1  0.2–0.5  0.6  0.7  0.8  0.9  1.0
  0.1    Bc   --       Cc   Cc   Bb   Cc   Aa
  0.2    Bc   --       Cc   Cb   Cc   Cc   Bb
  0.3    Cc   --       Cc   Cc   Cc   Cc   Cc
  0.4    Cc   --       Oo   Cc   Cc   Cc   Cc
  0.5    Cc   --       Oo   Cc   Oo   Cc   Cc
  0.6    Cc   --       Cc   Cc   Cc   Cc   Cc
  0.7    Cc   --       Oo   Cc   Cc   Cc   Cc
  0.8    Cc   --       Cc   Cc   Cc   Cc   Cc
  0.9    Ac   --       Ca   Cc   Cc   Cc   Cb
  1.0    Ab   --       Ca   Aa   Ba   Ba   Ba

Table 7 shows the distributions of key sentences with the combination of the sentence position and headline features. About half the sentences did not share words with the headlines and had a value of 0 for the headline feature. As a result, the cells in the middle of the table do not have corresponding sentences. The headline feature cannot be used as a filter, unlike the tf*idf feature, because many key sentences are found when the value of the headline feature is 0. A high value of the headline feature is, however, a good indicator of key sentences when it is combined with the position feature. The ratio of key sentences was large when the headline ranking was high and the sentence was near the beginning (at Pst. ≤ 0.2, Headline ≥ 0.7) for the DUC data. For the CSJ data, the ratio of key sentences was also large when the headline ranking was within the top 10 percent (Pst. = 0.1, Headline = 1.0), as well as for the sentences near the ends of speeches.

These results indicate that the number and ratio of key sentences sometimes vary discretely according to the changes in feature values when features are combined for sentence extraction. That is, the performance of a sentence extraction system can be improved by categorizing feature values into several ranges and then combining the ranges. While most sentence extraction systems use sequential combinations of features, as we do in our system based on Equation 1, the performance of these systems can possibly be improved by introducing the categorization of feature values, without adding any new features. We have shown that discrete combinations match the distributions of key sentences in two different corpora, the DUC data and the CSJ data. This indicates that discrete combinations of features are effective across both different languages and different types of corpora. Hirao et al. (2002) reported the results of a sentence extraction system using an SVM, which categorized sequential feature values into ranges in order to make the features binary. Some effective combinations of the binary features in that report also indicate the effectiveness of discrete combinations of features.
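The discrete combinations discussed above amount to replacing a raw feature value by indicators for value ranges, in the spirit of the binary features of Hirao et al. (2002). A minimal, hypothetical sketch of such a binarization:

```python
def binarize_by_ranges(value, boundaries):
    """Turn one continuous feature value into binary range indicators.
    boundaries: sorted cut points, e.g. rank deciles such as [0.1, 0.2, ..., 0.9]."""
    indicators = []
    lower = float("-inf")
    for upper in list(boundaries) + [float("inf")]:
        indicators.append(1 if lower <= value < upper else 0)
        lower = upper
    return indicators

# Example: a sentence whose position rank is 0.15 falls in the second range.
# binarize_by_ranges(0.15, [0.1, 0.2, 0.3]) -> [0, 1, 0, 0]
```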

6 Conclusion

We have shown evaluation results for our sentence extraction system and analyzed its features for different types of corpora, which included corpora differing in both language (Japanese and English) and type (newspaper articles and lectures). The system is based on four major features, and it achieved some of the top results at evaluation workshops in 2001 for summarizing Japanese newspaper articles (TSC) and English newspaper articles (DUC). For Japanese lectures, the sentence extraction system also obtained comparable results when the sentence boundaries were given.

Our analysis of the features used in this sentence extraction system has shown that they are not necessarily independent of one another, based on the results of their rank correlation coefficients. The analysis also indicated that the categorization of feature values matches the distribution of key sentences better than sequential feature values.

There are several features that were not described here but are also used in sentence extraction systems, such as some specific lexical expressions and syntactic information. In our future work, we will analyze and use these features to improve the performance of our sentence extraction system.

References

C. Aone, M. E. Okurowski, and J. Gorlinsky. 1998. Trainable, Scalable Summarization Using Robust NLP and Machine Learning. In Proc. of COLING-ACL'98, pages 62–66.

DUC. 2001. Document Understanding Conference. http://duc.nist.gov.

H. Edmundson. 1969. New methods in automatic abstracting. Journal of the ACM, 16(2):264–285.

T. Hirao, H. Isozaki, E. Maeda, and Y. Matsumoto. 2002. Extracting Important Sentences with Support Vector Machines. In Proc. of COLING-2002.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. In Proc. of SIGIR'95, pages 68–73.

S. Kurohashi and M. Nagao. 1999. Japanese Morphological Analyzing System: JUMAN version 3.61. Kyoto University.

Chin-Yew Lin. 1999. Training a selection function for extraction. In Proc. of CIKM'99.

K. Maekawa, H. Koiso, S. Furui, and H. Isahara. 2000. Spontaneous Speech Corpus of Japanese. In Proc. of LREC2000, pages 947–952.

C. Nobata, S. Sekine, H. Isahara, and R. Grishman. 2002. Summarization System Integrated with Named Entity Tagging and IE pattern Discovery. In Proc. of LREC-2002, pages 1742–1745, May.

T. Nomoto and Y. Matsumoto. 1997. The Reliability of Human Coding and Effects on Automatic Abstracting (in Japanese). In IPSJ-NL 120-11, pages 71–76, July.

S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proc. of SIGIR'94.

K. Sudo, S. Sekine, and R. Grishman. 2001. Automatic pattern acquisition for Japanese information extraction. In Proc. of HLT-2001.

TSC. 2001. Proceedings of the Second NTCIR Workshop on Research in Chinese & Japanese Text Retrieval and Text Summarization (NTCIR2). National Institute of Informatics.