Abstractive Text Summarization Using Rich Semantic Graph for Marathi Sentence

JASC: Journal of Applied Science and Computations, ISSN NO: 1076-5131

Sheetal Shimpikar, Sharvari Govilkar
Computer Technology Department, Mumbai University

Abstract: Text summarization finds and extracts the important sentences of an original document and links them together to construct a short, clear summary. Large text documents are difficult to summarize manually. A computer program can reduce a text to its important points; this process is termed text summarization. The objective of this work is the representation and summarization of text documents in Marathi, an Indian language, using abstractive text summarization techniques. The proposed approach takes Marathi documents as input. The first step is pre-processing of the input text, followed by the Rich Semantic Graph method. Abstractive text summarization of Marathi documents is challenging because the task is complex and cannot be formulated mathematically or logically. Moreover, very little work has been done on the Marathi language. The Rich Semantic Graph based method aims to give a correct, error-free summary.

Keywords: Text Summarization, Rich Semantic Graph, Part of Speech Tagging, Named Entity Recognition, Ontology

I. INTRODUCTION

In the current era of digital culture and technology, text summarization is an important way to make sense of huge amounts of information. Its purpose is to condense the text of a document into a meaningful form while preserving the important content. Abstractive text summarization methods use language understanding tools to generate a summary. They extract phrases and lexical chains from the documents. The important steps in abstractive text summarization are extracting the basic features, selecting the relevant information, and clarifying and condensing that information. Abstractive text summarization is divided into two approaches: structure-based and semantic-based.
In the structure-based approach, the important information in the document is encoded through patterns such as frames, scripts and templates. In the semantic-based approach, a natural language generation (NLG) system is used, whose input is a semantic representation of the document. Abstractive text summarization techniques have some limitations. For example, sentence combination is not well developed, so machine-generated summaries can be unclear. Abstractive summarization systems are also hard to reproduce because they depend on internal tools for information extraction and language generation. Abstraction requires a meaningful understanding of the text.

II. LITERATURE SURVEY

This section cites different techniques for abstractive text summarization in Indian languages. Most researchers concentrate on the abstract data to be extracted from the source document. Extractive summarization is the weaker approach, whereas abstractive summarization is more powerful. Abstractive text summarization techniques have been applied to various languages to check whether valid sentences are chosen. The work below is arranged by the regional language to which the techniques were applied.

Jagadish S Kallimani [1] suggests a solution for abstractive summarization in Kannada that makes use of an extractive methodology. In this method, the abstract data is extracted from the source document. This information-dense data is then post-processed to gather the key concepts of the original text. The main idea is to generate an abstractive summary by gathering key concepts from the source document using an extractive summarization technique.

Rajina Kabeer [2] used a semantic graph based method for summarizing documents in Malayalam. In this method the input document undergoes a series of linguistic processing steps to obtain triples from each sentence of the source document.
With the help of these semantic triples, a semantic graph is generated; the graph is then reduced to remove redundancy and obtain a concise abstractive summary. This work encourages further research on abstractive summarization using methods such as domain-based ontologies, semantic graph representation and WordNet.

Manjula Subramanian and Vipul Dalal [3] discussed generating an abstractive summary for Hindi using a semantic graph reduction technique. The approach has three phases: building a Rich Semantic Graph from the source document, reducing it to an abstracted Rich Semantic Graph, and generating the abstractive summary from the abstracted Rich Semantic Graph.

Volume V, Issue XII, December/2018

Sabina, Priyanka, Adiba, Palash [4] surveyed abstractive summarization work on various Indian regional languages. After analysing and discussing the articles, they concluded that the semantic graph based method is best suited to Bengali. They also compared rule-based and ontology-based methods, as well as single-document and multi-document summarization techniques. Before a Bengali summarizer can be developed, a domain ontology and a WordNet should be built.

Ibrahim and Mostafa [5] worked on abstractive summarization of a single document using a rich semantic graph reduction technique. The original document is represented as a rich semantic graph, the graph is reduced, and the abstractive summary is generated from the reduced graph. A case study shows the original text being reduced to fifty percent of its length.

Nikita Munot and Dr. Sharvari Govilkar [6] worked on English. They proposed a system that takes a single document as input and produces an output summary using the Rich Semantic Graph technique.
They give an example in which an abstractive text summary is obtained using the Rich Semantic Graph method.

Khan and N. Salim [7] surveyed methods proposed by different researchers for abstractive summarization. What these methods have in common is that they find the key sentences of the document.

J. Mohan, Sunitha, A. Ganesha, Jaya [8] worked on ontology-based abstractive summarization. An ontology is a formal, explicit specification of a shared conceptualization. Various frequently used ontology-based methods related to abstractive summarization are discussed.

N. Moratanch [9] reviewed various abstractive summarization methods. The paper covers the main types of method, their improvements and shortcomings, the findings, and future directions in text summarization.

I. Fathy, D. Fadl and M. Aref [10] worked on a rich semantic graph based approach. In this paper the RSG technique is used to generate English text, and the WordNet ontology is used to generate multiple texts. The approach has five parts: Text Planning, Sentence Planning, Surface Realization, Writing Styles Selected Essay Generation, and Text Evaluation.

D. Bartakke, S. D. Sawarkar, and A. Gulati [11] worked on multi-document abstractive text summarization, in which more than one document is used to create the abstractive summary. Semantic graphs are generated for each sentence, the generated graphs are reduced, and a set of rules is created and applied to obtain the abstractive summary.

It can be inferred from the literature review that a negligible amount of work has been done on Marathi abstractive text summarization. We therefore propose the following system for abstractive text summarization in the Marathi language.

III. METHODOLOGY

A set of Marathi text documents is given as input to the system. The proposed approach consists of the following phases: Marathi text document input, pre-processing, and the Rich Semantic Graph phase (graph creation, graph reduction, and summary generation).

Fig.
1 The proposed system architecture

A. Pre-processing

The first step, pre-processing, converts the text documents into a clean word format.

Example: !”@£}[]$ मुरलीधर देविदास आमटे उर्फ बाबा आमटे हे एक थोर मराठी समाजसेवक होते. बाबा आमटेंचा जन्म डिसेंबर २३ १९१४ रोजी महाराष्ट्रातील वर्धा जिल्ह्यात झाला. बाबा आमटेंना भारत सरकार कडून १९७१ मध्ये पद्मश्री पुरस्कार प्राप्त झाला आहे. तसेच बाबा आमटेंना भारत सरकार कडून १९८६ मध्ये पद्मविभूषण पुरस्कार प्राप्त झाला आहे. Baba Amte was a very good human being.

(English gloss of the Marathi text: Muralidhar Devidas Amte, alias Baba Amte, was a great Marathi social worker. Baba Amte was born on 23 December 1914 in Wardha district, Maharashtra. He received the Padma Shri award from the Government of India in 1971 and the Padma Vibhushan award in 1986.)

3.1 Input Validation

Input validation checks whether the given document is in valid Devanagari script. Invalid words are removed before the text is processed further.

Algorithm:
Input: A single Marathi document.
Output: Marathi (Devanagari) script.
Steps:
1. Use UTF-8 as the character set.
2. Scan the input document.
3. Compare each character of the scanned document with the Devanagari range of the character set (Unicode block U+0900 to U+097F).
4. If the character falls within that range, it is valid Devanagari script; otherwise it is not.
5. Ignore all invalid (non-Devanagari) characters.
6. Repeat from step 3 until every character of the input document has been checked.
7. Store all valid Devanagari characters and words in a file for further processing.

3.2 Tokenization

Separating tokens from the input document is referred to as tokenization. The individual words or tokens separated from the sentences are called lexicons. With the help of a lexical analyzer, the input document can be tokenized as one token per line. Marathi is a segmented language in which word boundaries are fixed, so tokenization can be done by searching for the spaces between words. The importance of this phase is that each word can then be handled separately.

Algorithm:
Input: Marathi (Devanagari) document.
Output: List of tokens.
Steps:
1. Start.
2. Initialize all pointers (an input pointer for characters and an output pointer for tokens) and set tokens = NULL.
3. Scan the input document.
4. Check that the end of file has not been reached.
   i. Read a character from the input file.
   ii. If the character is not a special character, do the following until a space character is found:
      • Treat the characters read as one token.
      • Add the token to the token list file.
      • Increment the respective pointers.
5. Repeat step 4 until end of file.
6. Stop.

3.3 Stemming

Stemming removes suffixes from words. The stem is not necessarily the linguistic root of the word.
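The pre-processing steps of Sections 3.1 to 3.3 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `validate`, `tokenize` and `stem` are hypothetical, validity is approximated by membership in the Devanagari Unicode block (U+0900 to U+097F), and the suffix list is a tiny made-up sample rather than a linguistically complete one.

```python
import re

# Anything outside the Devanagari Unicode block (U+0900 to U+097F) and
# whitespace counts as invalid here. This range check is an assumption
# standing in for the paper's character-set comparison step.
_INVALID = re.compile(r"[^\u0900-\u097F\s]")

# Hypothetical suffix list, for illustration only; listed longest first.
SUFFIXES = ("ातील", "ंना", "ात")


def validate(text: str) -> str:
    """Replace every invalid character with a space (Section 3.1)."""
    return _INVALID.sub(" ", text)


def tokenize(text: str) -> list[str]:
    """Split validated text on whitespace, one token per word (Section 3.2)."""
    return text.split()


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least two code points."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


if __name__ == "__main__":
    # Noisy input: symbols and English text should be discarded.
    raw = "!@# बाबा आमटेंना पुरस्कार प्राप्त झाला. Baba Amte"
    tokens = tokenize(validate(raw))
    print(tokens)
    print([stem(t) for t in tokens])
```

Invalid characters are replaced with a space rather than deleted, so that two words separated only by an ASCII full stop (as in the example input above) do not merge into one token. A real system would also need to decide how to treat sentence boundaries and joiner characters, which this sketch ignores.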
