Automated Text Summarization Based on Lexical Chains and Graph Using WordNet and Wikipedia Knowledge Base
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 3, January 2012, ISSN (Online): 1694-0814, www.IJCSI.org
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

Mohsen Pourvali and Mohammad Saniee Abadeh
Department of Electrical & Computer Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran
Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Abstract

The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is the process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. The lexical cohesion structure of a text can be exploited to determine the importance of a sentence or phrase, and lexical chains are useful tools for analyzing that structure. In this paper we consider the effect of lexical cohesion features on summarization and present an algorithm based on lexical chains and a knowledge base. Our algorithm first finds the correct sense of each word, then constructs the lexical chains, removes chains that score lower than the others, detects topics roughly from the remaining chains, segments the text with respect to those topics, and selects the most important sentences. Experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach improves performance compared to state-of-the-art summarization approaches.

Keywords: Text Summarization, Data Mining, Text Mining, Word Sense Disambiguation

1. Introduction

The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval (IR). With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic [14]. The authors of [10] provide the following definition of a summary: "A summary can be loosely defined as a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that. Text here is used rather loosely and can refer to speech, multimedia documents, hypertext, etc. The main goal of a summary is to present the main ideas in a document in less space. If all sentences in a text document were of equal importance, producing a summary would not be very effective, as any reduction in the size of a document would carry a proportional decrease in its informativeness. Luckily, information content in a document appears in bursts, and one can therefore distinguish between more and less informative segments. Identifying the informative segments at the expense of the rest is the main challenge in summarization." The same work assumes a tripartite processing model distinguishing three stages: interpretation of the source text to obtain a source representation, transformation of the source representation into a summary representation, and generation of the summary text from the summary representation.

A variety of document summarization methods have been developed recently. The paper [4] reviews research on automatic summarizing over the last decade; it surveys salient notions and developments and seeks to assess the state of the art for this challenging natural language processing (NLP) task. The review shows that some useful summarizing for various purposes can already be done, but also, not surprisingly, that there is a huge amount more to do. Sentence-based extractive techniques are commonly used in automatic summarization to produce extractive summaries. Systems for extractive summarization are typically built around a sentence-extraction step and attempt to identify the set of sentences that are most important for the overall understanding of a given document. The paper [11] proposed paragraph extraction from a document based on intra-document links between paragraphs. It derives a text relationship map (TRM) from the intra-links, which indicate that the linked texts are semantically related, and proposes four extraction strategies over the TRM: the bushy path, the depth-first path, the segmented bushy path, and the augmented segmented bushy path. In our study we focus on sentence-based extractive summarization. We show that the lexical cohesion structure of the text can be exploited to determine the importance of a sentence, and that eliminating the ambiguity of words has a significant impact on inferring sentence importance. In this article we also show that segmenting the text into its internal topics, using the correct sense of each word, has a noticeable effect on the resulting summary. Experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach improves performance compared to state-of-the-art summarization approaches.

The rest of this paper is organized as follows: Section 2 introduces related work; word sense disambiguation is presented in Section 3; clustering of the lexical chains is presented in Section 4; text segmentation based on the inner topics is presented in Section 5; the experiments and results are given in Section 6; finally, conclusions are drawn in Section 7.

2. Related work

Generally speaking, summarization methods can be either extractive or abstractive. Extractive summarization involves assigning salience scores to some units (e.g., sentences or paragraphs) of the document and extracting those with the highest scores, while abstractive summarization (e.g., http://www1.cs.columbia.edu/nlp/newsblaster/) usually needs information fusion, sentence compression, and reformulation [14].

Sentence-extraction summarization systems take as input a collection of sentences (from one or more documents) and select some subset for output into a summary. This is best treated as a sentence ranking problem, which allows varying thresholds to meet varying summary length requirements. Most commonly, such ranking combines a centroid score, position, and overlap with the first sentence (which may happen to be the title of the document). For single documents or (given) clusters, such a system computes centroid topic characterizations using tf–idf-type data; it ranks candidate summary sentences by combining sentence scores against the centroid, a text position value, and tf–idf title/lead overlap. Sentence selection is constrained by a summary length threshold, and redundant new sentences are avoided by checking their cosine similarity against previously selected ones.

In the past, extractive summarizers have mostly been based on scoring the sentences of the source document. In paper [12], each document is considered as a sequence of sentences, and the objective of extractive summarization is to label the sentences in the sequence with 1 and 0, where a label of 1 indicates a summary sentence and 0 denotes a non-summary sentence; to accomplish this task, a conditional random field, a state-of-the-art sequence labeling method, is applied. Paper [15] proposed a novel extractive approach based on manifold ranking of sentences for query-based multi-document summarization. The proposed approach first employs the manifold-ranking process to compute a manifold-ranking score for each sentence, denoting the biased information richness of that sentence, and then uses a greedy algorithm that penalizes redundancy to select the sentences with the highest overall scores, which are deemed both informative and novel, and highly biased towards the given query.

Summarization techniques can also be classified into two groups: supervised techniques, which rely on pre-existing document–summary pairs, and unsupervised techniques, which are based on properties and heuristics derived from the text. Supervised extractive summarization techniques treat the summarization task as a two-class classification problem at the sentence level, where summary sentences are positive samples and non-summary sentences are negative samples. After representing each sentence by a vector of features, the classification function can be trained in two different manners [7]; one is discriminative, with well-known algorithms such as the support vector machine (SVM) [16]. Many unsupervised methods have been developed for document summarization by exploiting different features and relationships of the sentences; see, for example, [3] and the references therein. On the other hand, the summarization task can also be categorized as either generic or query-based. A query-based summary presents
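The centroid-style ranking described in the related work (score each sentence against a tf–idf centroid of the document, add a position bonus, and reject redundant picks by cosine similarity against already-selected sentences) can be sketched as follows. This is a minimal illustration, not any cited system's implementation: the position-bonus constant and the 0.7 redundancy threshold are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """One smoothed tf-idf vector per sentence."""
    token_lists = [s.lower().split() for s in sentences]
    n = len(token_lists)
    df = Counter(w for toks in token_lists for w in set(toks))
    vecs = []
    for toks in token_lists:
        tf = Counter(toks)
        vecs.append({w: c * math.log((n + 1) / df[w]) for w, c in tf.items()})
    return vecs

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, k=2, redundancy=0.7):
    vecs = tfidf_vectors(sentences)
    # Centroid = average tf-idf vector of the whole document.
    centroid = {}
    for v in vecs:
        for w, x in v.items():
            centroid[w] = centroid.get(w, 0.0) + x / len(vecs)
    # Score = similarity to centroid plus a small bonus for early position.
    scores = [cosine(v, centroid) + 0.1 / (i + 1) for i, v in enumerate(vecs)]
    chosen = []
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        # Skip candidates too similar to anything already selected.
        if all(cosine(vecs[i], vecs[j]) < redundancy for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [sentences[i] for i in sorted(chosen)]
```

Note how the redundancy check discards a near-duplicate sentence even when its centroid score is high, which is the behavior the length-constrained selection step above relies on.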
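The text relationship map of [11] can be approximated in a few lines: link paragraphs whose pairwise similarity exceeds a threshold, then follow the bushy path by preferring the nodes with the most links. This toy version uses raw term-count cosine similarity and an assumed threshold of 0.3; the paper's actual link criterion and path strategies are richer.

```python
import math
from collections import Counter

def vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bushy_path(paragraphs, threshold=0.3, k=2):
    """Link paragraphs whose similarity exceeds the threshold, then return
    the k 'bushiest' nodes (highest degree) in document order."""
    vecs = [vector(p) for p in paragraphs]
    degree = [0] * len(paragraphs)
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                degree[i] += 1
                degree[j] += 1
    top = sorted(range(len(paragraphs)), key=lambda i: -degree[i])[:k]
    return [paragraphs[i] for i in sorted(top)]
```

An off-topic paragraph ends up with degree zero and is never selected, which matches the intuition that intra-document links mark semantically central text.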
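The supervised two-class framing described above reduces to computing a feature vector per sentence and applying a learned decision function. The sketch below uses three features commonly cited for this task (position, length, title overlap) with hand-set stand-in weights where an SVM or CRF would supply trained ones; the weights and threshold are illustrative assumptions only.

```python
def sentence_features(sentence, index, total, title_words):
    """Feature vector for one sentence of a document with `total` sentences."""
    toks = sentence.lower().split()
    overlap = len(set(toks) & title_words) / (len(title_words) or 1)
    return [
        1.0 - index / total,         # earlier sentences score higher
        min(len(toks) / 20.0, 1.0),  # length, capped at 20 tokens
        overlap,                     # fraction of title words covered
    ]

# Illustrative, untrained weights standing in for a learned model.
WEIGHTS = [0.5, 0.2, 0.3]

def label(sentence, index, total, title_words, threshold=0.5):
    """Return 1 for a summary sentence, 0 for a non-summary sentence."""
    f = sentence_features(sentence, index, total, title_words)
    score = sum(w * x for w, x in zip(WEIGHTS, f))
    return 1 if score >= threshold else 0
```

In the supervised setting the weights would come from training on document–summary pairs, with summary sentences as positive samples and the rest as negative samples, exactly as the two-class formulation above describes.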
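The chain-based pipeline the abstract outlines (disambiguate each word, build lexical chains, prune weak chains, then segment and select) can be illustrated with a minimal sketch. Here a tiny hand-coded concept table stands in for the WordNet/Wikipedia lookups the paper relies on, and the length-times-homogeneity chain score is a common heuristic from the lexical-chain literature, not necessarily the authors' exact scoring.

```python
from collections import defaultdict

# Toy stand-in for a knowledge base: words mapped to the same concept are
# treated as members of one lexical chain. The vocabulary and concepts here
# are illustrative assumptions.
RELATED_CONCEPT = {
    "car": "vehicle", "truck": "vehicle", "engine": "vehicle",
    "dog": "animal", "cat": "animal", "pet": "animal",
}

def build_lexical_chains(words):
    """Group document words into chains of semantically related terms."""
    chains = defaultdict(list)
    for w in words:
        concept = RELATED_CONCEPT.get(w)
        if concept is not None:
            chains[concept].append(w)
    return dict(chains)

def chain_score(chain):
    """Length times homogeneity (share of distinct members): a simple
    strength heuristic for pruning weak chains."""
    return len(chain) * (len(set(chain)) / len(chain))

words = "the car and the truck share an engine while the dog chased the cat".split()
chains = build_lexical_chains(words)
scores = {concept: members for concept, members in chains.items()}
strongest = max(chains, key=lambda c: chain_score(chains[c]))
```

Pruning would then keep only the strongest chains, whose member words roughly mark the document's topics and drive the segmentation and sentence-selection steps.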