Automatic Text Summarization with Lexical Clustering

Topic Keyword Identification for Text Summarization using Lexical Clustering

Youngjoong Ko, Kono Kim, and Jungyun Seo
Department of Computer Science, Sogang University
Sinsu-dong 1, Mapo-gu, Seoul, Korea, 121-742
{kyj, kono}@nlpzodiac.sogang.ac.kr, [email protected]

Corresponding author: Youngjoong Ko
Correspondence address: NLP Laboratory, Dept. of Computer Science, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul, Korea, 121-742
Correspondence telephone number: 82-2-706-8954
Correspondence fax number: 82-2-704-8273
Correspondence email: [email protected]

Abstract

Automatic text summarization aims to reduce the size of a document while preserving its content. A summary is generally produced as an extract by including only the sentences that are most related to the topic. DOCUSUM is our summarization system, based on a new topic keyword identification method. DOCUSUM proceeds as follows. First, it converts the content words of a document into elements of a context vector space. It then constructs lexical clusters from the context vector space and identifies core clusters. Next, it selects topic keywords from the core clusters. Finally, it generates a summary of the document using the topic keywords. In experiments at various compression ratios (30% compression, 10% compression, and extraction of a fixed number of sentences: 4 or 8), DOCUSUM showed better performance than other methods.

Keywords: text summarization, lexical clustering, k-means algorithm, topic keyword identification

1 Introduction

The goal of automatic summarization is to take an information source, extract content from it, and present the most important content to a user in a condensed form and in a manner sensitive to the user's or the application's needs [1]. To achieve this goal, the topic of an information source should be identifiable by a machine as well as by a human. Using the identified topics (themes), we can extract sentences from a source and generate abstracts. Therefore, one of the main goals of our method is to identify the main topics of texts. These topics can be represented as topic keywords.

We developed DOCUSUM, which produces extracts using topic keyword identification. DOCUSUM is a text summarization system based on IR techniques using semantic and statistical methods. For example, DOCUSUM uses not only word counting but also topic keyword identification through a context vector space. The context vector space is constructed automatically from co-occurrence statistics, and statistical methods are applied to this contextual knowledge. DOCUSUM identifies topic keywords without additional linguistic resources such as WordNet, and it uses neither cue phrases nor a discourse parser for topic keyword identification. These characteristics make DOCUSUM robust and low-cost.

The rest of this paper is organized as follows. Section 2 describes related work on topic identification. Section 3 gives an overview of the structure of DOCUSUM. Section 4 presents each stage in detail, from the construction of the context vector space to the generation of the summary. Section 5 is devoted to the evaluation of experimental results. In the last section, we draw conclusions and present future work.
2 Related Work

Several techniques for topic identification have been reported in the literature. The pioneering work observed that the most frequent words represent the most important concepts of a text [2]; the topic keywords are therefore the most frequent words. This representation abstracts the source text into a frequency table, ignoring the semantic content of words and their potential membership in multi-word phrases.

In another early summarization system, Edmundson observed that the first paragraph, or the first sentences of each paragraph, contains topic information [3]. He also found that the presence of words such as "significant", "hardly", and "impossible" signals topic sentences. Although all the techniques presented above are easily computed, they depend very much on the particular format and style of writing.

To overcome the limitations of the frequency-based method, Aone et al. aggregated synonym occurrences as occurrences of the same concept using WordNet [4]. Barzilay and Elhadad constructed lexical chains by calculating semantic distances between words using WordNet [5]; strong lexical chains are selected, and the sentences related to these strong chains are chosen as the summary. These methods, which rely on semantic relations between words, depend heavily on manually constructed resources such as WordNet [6]. WordNet is not available for several languages, such as Korean, and this kind of linguistic resource is hard to maintain.

Hovy and Lin used topic identification aimed at extracting the salient concepts in a document [7]. By training on a corpus of documents with their associated topics, their method yields a ranked list of sentence positions that tend to contain the most topic-related keywords. They also used a topic signature method for topic identification. To construct the topic signatures, they used a set of 30,000 texts in which each article is labeled with one of 32 possible topic labels. For each topic, the top 300 terms, scored by a term-weighting metric, were treated as its topic signature.

Topic keywords in DOCUSUM are conceptually close to topic signatures, and both methods use very large corpora. However, DOCUSUM differs in that no genre-related or supervised training is necessary, and topic keywords are identified with a context vector space.
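The frequency-table baseline described at the start of this section can be sketched in a few lines. The tokenizer and stopword list below are illustrative assumptions, not details from [2]:

```python
from collections import Counter
import re

# Hypothetical stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "that", "it"}

def frequency_keywords(text, n=5):
    """Frequency-table baseline: the most frequent content words
    of the text are taken as its topic keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]
```

As the section notes, this ignores word meaning entirely: synonyms are counted as unrelated terms, which is the gap the WordNet-based methods try to close.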
3 Overall Architecture

In order to produce a high-quality summary, the topic of a document should be recognized, and it can be represented as a few topic keywords; a few topic keywords from a document are a good representative form of its topic. The problem is how to identify these topic keywords. To solve this problem, DOCUSUM proceeds in four stages. First, DOCUSUM converts the content words of a document into vectors of a context vector space. Second, it constructs lexical clusters using the context vector space and identifies core clusters. Third, it selects topic keywords from the core clusters. Finally, it generates the summary of the document using the topic keywords. Figure 1 shows the architecture of DOCUSUM.

[Figure 1. Illustration of DOCUSUM: a training process (news articles → co-occurrence analysis → context vector space) and a generation process (document → content word extraction → lexical clustering → topic keyword identification → topic keywords, combined with a title query and a document vector space → summary)]

The first stage represents each word as a vector. The meaning of a word can be represented by a vector that places the word in a multidimensional semantic space [8]. The main requirement of such spaces is that words which are similar in meaning should be represented by similar vectors. DOCUSUM uses a context vector space based on a co-occurrence analysis of large corpora. The assumption is that similar words occur in similar contexts. For example, a textbook with a paragraph about 'cats' might also mention 'dogs', 'fur', 'pets', and so on. This knowledge can be used to infer that 'cats' and 'dogs' are related in meaning. For about 60,000 words, the co-occurrence statistics were calculated in a sentence window slid over a corpus of about 20 million words (newspaper articles from 2 years). Each word is represented by its co-occurrence values with the other words. If any two words in the context vector space have a similar co-occurrence pattern, the meanings of these words are likely to be the same. As a result, the similarity of meaning between two words increases in proportion to the inner product of their vectors.
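A minimal sketch of such a co-occurrence-based context vector space follows. The toy corpus, the sentence-window handling, and the `cosine` helper are illustrative assumptions; the actual system was trained on a 20-million-word newspaper corpus:

```python
from collections import defaultdict
from math import sqrt

def build_context_vectors(sentences):
    """Count, for each word, how often every other word co-occurs
    with it inside the same sentence window."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for j, c in enumerate(words):
                if i != j:
                    vectors[w][c] += 1
    return vectors

def cosine(u, v):
    """Meaning similarity: inner product of two sparse co-occurrence
    vectors, normalized by their lengths (cosine metric)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus (illustrative): 'cats' and 'dogs' share context words,
# while 'stocks' appears in an unrelated context.
corpus = ["cats chase mice", "dogs chase mice", "stocks fell sharply"]
vecs = build_context_vectors(corpus)
sim = cosine(vecs["cats"], vecs["dogs"])
```

Because 'cats' and 'dogs' share the context words 'chase' and 'mice', their vectors point in similar directions, while 'cats' and 'stocks' share nothing, which is exactly the similar-words-in-similar-contexts assumption stated above.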
The content words of a document are converted into vectors of the context vector space. In the second stage, the converted vectors are clustered by the k-means algorithm, and DOCUSUM then identifies the core clusters. To identify them, we developed a scoring measure for clusters: the score of a cluster is determined by the normalized sum of the term frequencies within the cluster. The core clusters are identified by this cluster score. In the third stage, DOCUSUM selects topic keywords with strong relations to the topic of the document. These topic keywords are selected from the core clusters by term frequency.
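The second and third stages can be sketched as follows, assuming dense word vectors and term frequencies are already available. The plain k-means routine and the reading of "normalized sum of term frequency" as the average term frequency per cluster are our illustrative assumptions, not the paper's exact implementation:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over word vectors: assign each word to the
    nearest centroid, then recompute each centroid as the mean."""
    rng = random.Random(seed)
    words = list(vectors)
    centroids = [vectors[w][:] for w in rng.sample(words, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w in words:
            d = [sum((a - b) ** 2 for a, b in zip(vectors[w], c)) for c in centroids]
            clusters[d.index(min(d))].append(w)
        for i, members in enumerate(clusters):
            if members:
                dim = len(centroids[i])
                centroids[i] = [sum(vectors[w][j] for w in members) / len(members)
                                for j in range(dim)]
    return clusters

def core_cluster(clusters, tf):
    """Score each cluster by its normalized term-frequency sum
    (here: average tf of its members) and return the best one."""
    return max(clusters,
               key=lambda ws: sum(tf.get(w, 0) for w in ws) / len(ws) if ws else 0.0)

def topic_keywords(cluster, tf, n=2):
    """Within the core cluster, pick the n most frequent terms."""
    return sorted(cluster, key=lambda w: tf.get(w, 0), reverse=True)[:n]

# Toy word vectors and term frequencies (illustrative).
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "stock": [0.0, 1.0], "bond": [0.1, 0.9]}
tf = {"cat": 5, "dog": 3, "stock": 1, "bond": 1}
core = core_cluster(kmeans(vectors, 2), tf)
```

Here the frequent animal words cluster together and their cluster wins the frequency-based score, so the topic keywords come from that cluster, mirroring the stage-two and stage-three description above.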
Each word in the context vector space is represented by its co-occurrence values with the other element words, as shown in Figure 2. If any two words have a similar co-occurrence pattern, the meanings of those words are similar. In the context vector space, the meaning similarity between words can be calculated by the inner product or the cosine metric; thus the meaning similarity between two words increases in proportion to the inner product of the two vectors. In order to build the context vector space, co-occurrence values of words are required. To calculate these co-occurrence values, the articles
