An Automatic Text Summarization for Malayalam Using Sentence Extraction
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-3, Issue-8, Aug.-2015 AN AUTOMATIC TEXT SUMMARIZATION FOR MALAYALAM USING SENTENCE EXTRACTION 1RENJITH S R, 2SONY P 1M.Tech Computer and Information Science, Dept.of Computer Science, College of Engineering Cherthala Kerala, India-688541 2Assistant Professor, Dept. of Computer Science, College of Engineering Cherthala, Kerala, India-688541 Abstract—Text Summarization is the process of generating a short summary for the document that contains the significant portion of information. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter less redundant extract of the original text. The proposed method is a sentence extraction based single document text summarization which produces a generic summary for a Malayalam document. Sentences are ranked based on feature scores and Googles PageRank formula. Top k ranked sentences will be included in summary where k depends on the compression ratio between original text and summary. Performance evaluation will be done by comparing the summarization outputs with manual summaries generated by human evaluators. Keywords—Text summarization, Sentence Extraction, Stemming, TF-ISF score, Sentence similarity, PageRank formula, Summary generation. I. INTRODUCTION a summary, which represents the subject matter of an article by understanding the whole meaning, which With enormous growth of information on cyberspace, are generated by reformulating the salient unit conventional Information Retrieval techniques have selected from an input sentences. It may contain some become inefficient for finding relevant information text units which are not present in the input text. An effectively. When we give a keyword to be searched extract is a summary consisting of a number of on the internet, it returns thousands of documents sentences selected from the input text.Sentence overwhelming the user. It becomes a time consuming extraction methods have been studied extensively and difficult task to recall the precise documents. over the past decade. Sole concentration on the Text summarization approaches are used as a solution structural information in the text like position, length, to this problem which reduces time required to find term frequency, relevance features, etc. does not the web document having relevant and useful data. capture the true importance of sentences while Text summarization is the process of automatically dealing with different kinds of writing styles. This creating a compressed version of the text containing accounts for a renewed approach to text significant information. summarization which combines the best of both The summaries can help the reader to get a quick worlds - a structure based approach, which gives overview of an entire document. Another important some degree of importance to sentences based on issue related to the information retrieval from the their structural features alone, and a graph based internet is the existence of many documents with the approach, which gives sufficient importance to the same or similar topics, known as duplication. This semantic relationship between sentences. Based on kind of data duplication problem increases the information content of the summary, it can be necessity for effective document summarization. The categorized as informative and indicative summary. advantages of automatic text summarization are The indicative summary represents an indication saving in reading time, facilitating document about an articles purpose and it prompt the user for selection and literature searches, improvement of selecting the article for in-depth reading for detailed document indexing efficiency, free from bias, and understanding; on the other hand, informative they are useful in question-answering systems where summary covers all significant information in the they provide personalized information. Input to a document at an abstract level, that is, it will contain summarization process can be one or more text information about all the different aspects such as documents. When only one document is the input, it articles purpose, scope, approach, content, domain, is called single document text summarization and results and conclusions. For example, an abstract of a when the input is group of related text documents, it research article is more informative than its headline. is called multi document summarization. We can also categorize the text summarization based on the type II. RELATED WORK of users the summary is intended for: User focused summaries are intended to satisfy the requirements of Text summarization has been an area of interest since a particular user or group of users and generic many years. The need for an automatic text summaries are aimed at a broad community. summarizer has increased much due to the abundance Depending on the nature of summary, it can be of documents in the internet. I. Mani et al. [6] defines categorized as an abstract or an extract. An abstract is text summarization as the process of distilling the An Automatic Text Summarization For Malayalam Using Sentence Extraction 46 International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-3, Issue-8, Aug.-2015 most important information from single or multiple extraction based text summarization technique which documents to produce an abridged version for uses a structural charcteristics based sentence scoring particular user(s) and task(s). D. Shen et al. [4] along with a PageRank based sentence ranking. The differentiates the two approaches to text effectiveness of the proposed approach had been summarization as abstraction based and extraction confirmed for English and Tamil documents by based. Abstraction based approach understands the applying the ROUGE evaluation. The method was overall meaning of the document and generate a new carried out in four different phases namely i)Pre text whereas the extraction based approach simply processing, where stop word removal and stemming selects a subset of existing sentences in the original are performed in order to prepare the source data for text to form the summary. P. Baxendale [7] presented summary generation,ii)Scoring,where the sentences experimental data on how the leading sentences of a were given scores based on their position,length, document are more important than the ones at the end topic similarity and TF-IDF feature such that longer in terms of its informative content or significance. sentences similar to the title of the document and Hence the postion of a sentence in a document forms appearing at the beginning of the document are an important selection criterion. H. P. Luhn [5] getting high scores, iii)Ranking, where the sentences presented the idea that frequently occuring terms are ranked according to Google’s PageRank formula signify the overall content of the document. S. Brin et and finally, iv)summary generation, where the final al. [8] used the Pagerank based score to rank the summary comprises of the top ranked sentences sentences which gives more importance to sentences displayed in the same order as they appear in the that refer to others as well as are referred by others. source document text. The number of top ranked Dhanya P. M et al. [1] performed a comparative study sentences selected for the summary may be user- of text summarization in Indian languages. Two defined in terms of the number sentences or summarization techniques each from compression ratio with respect to the length of the Tamil[9][2],Kannada [10][11]and one each from source document text. Odia [12], Bengali[13], Punjabi[14] and The proposed algorithm, on evaluation using ROUGE Gujarathi[15] were taken for the purpose of metrics for English and Tamil, yields better results. comparison. Text consisting of three sentences was Since this technique only requires a stop word list and taken as an example and they tried to find out the stemmer for summary generation in any language, it summary sentences using all the eight methods.It was is expected to work well irrespective of language. concluded that most of the methods have selected a Stemmers are usually considered as the initial phase set of features based on which they rank the of a summarization procedure.Stemming is the sentences. Punjabi method uses the maximum process of removing the affixes from inflections and number of features which is ten and odia uses the to return the root form. Malayalam is highly least number of features which is one. The accuracy agglutinative in nature and hundreds of inflections are of the method depends on the number of features and possible for each word. An effective stemmer in the contribution of that feature towards summary. The Malayalam is not yet implemented. Prajitha U et al. methods show a recall scores of 0.45, 0.48, 0.43, [16] proposed an algorithm namely LALITHA:A 0.66, 0.42, 0.412, 0.42, 0.82 respectively.In almost all light weight Malayalam stemmer using suffix methods testing is done by comparing the results with stripping method.In Malayalam inflections are mainly results of human summarizers. formed by adding suffixes to the root form. So the Anita R Kulkarni et al. [3] illustrates three different proposed stemmer considers only the suffix part and techniques namely statistical,knowledge based and strip it to get the stem.Stemming will reduce a word linguistic techniques that can be applied in text to a stem which need not be a meaningful one. summarization. Summarization tools like SweSum,(a The suffix stripping can be done mainly on two basis: summarization tool from Royal Institute of Iteration and Longest match .Iteration is a recursive Technology, Sweden) that works on news text using procedure and in each iteration we can remove a HTML tags, MEAD- a public domain multilingual single suffix from the right end of the word. In multi-document summarization system developed by Malayalam since it is possible to attach many the research group of Dragomir Radev,which uses suffixes, this iteration process will be computationally three features namely centroid score, position and expensive.