JASC: Journal of Applied Science and Computations ISSN NO: 1076-5131

Abstractive Text Summarization using Rich Semantic Graph for Marathi Sentence

Sheetal Shimpikar, Sharvari Govilkar, Computer Technology Department, Mumbai University

Abstract: Text summarization identifies and extracts the important sentences of an original document and links them together to construct a short and clear summary. Large text documents are difficult to summarize manually. A computer program can reduce a text to its important points; this process is termed text summarization. The objective of this work is the representation and summarization of text documents in the Indian language Marathi using abstractive text summarization techniques. The proposed approach takes Marathi documents as input text. The first step is pre-processing of the input text. The pre-processed text is then summarized using the Rich Semantic Graph method. Abstractive text summarization of Marathi documents is challenging because of its complexity: it cannot be formulated mathematically or logically. Moreover, very little work has been done on abstractive summarization of Marathi text. The Rich Semantic Graph based method is expected to give a correct, error-free result.

Keywords: Text Summarization, Rich Semantic Graph, Part of Speech Tagging, Named Entity Recognition, Ontology

I. INTRODUCTION
In the current era of digital culture and technology, text summarization is an important method for making sense of the huge amount of available information. The purpose of text summarization is to condense the text of a document into a meaningful form while preserving its important content. Abstractive text summarization methods use language understanding tools to generate a summary. They extract phrases and lexical chains from the documents. The important steps in abstractive text summarization are extracting the basic features, selecting the relevant information, and clarifying and condensing that information. Abstractive text summarization is divided into two categories. In the structure-based approach, the important information in the document is encoded through patterns such as frames, scripts and templates. In the semantic-based approach, a natural language generation (NLG) system is used, and its input is a semantic representation of the document. Abstractive text summarization techniques have some limitations. For example, the combination of sentences is not well developed, so machine-generated summaries may be unclear. Abstractive text summarization systems are also hard to reproduce because they depend on internal tools that are required for information extraction and language generation. Abstraction requires a meaningful understanding of the text.

II. LITERATURE SURVEY
This section cites different techniques for abstractive text summarization in Indian languages. Most researchers concentrate on the abstract data to be extracted from the source document. Extractive summarization is weaker, whereas abstractive summarization is more powerful. Abstractive text summarization techniques have been applied to various languages to check whether valid sentences are chosen. The following survey is arranged by the regional languages to which abstractive text summarization techniques have been applied.
Jagadish S Kallimani [1] suggests a solution for abstractive summarization that makes use of extractive methodology in an Indian regional language. In this method, the abstract data is extracted from the source document. This information-dense data is then post-processed to gather the key concepts of the original text. The main idea is to generate an abstractive summary by gathering key concepts from the source document using an extractive summary technique.
Rajina Kabeer [2] used a semantic graph based method that concentrates on summarizing documents in Malayalam. In this method the input document undergoes a series of linguistic processing steps to obtain triples from each sentence of the source document. With the help of these semantic triples, the semantic graph is generated, and the graph is then reduced in order to remove redundancy and to obtain a concise abstract summary. This work gives opportunities and confidence to move forward with abstractive summarization techniques using methods such as domain-based ontology, semantic graph representation and WordNet.
Manjula Subramanian and Vipul Dalal [3] discussed the Hindi language, using a semantic graph reduction technique to generate an abstractive summary. This approach is divided into three phases: developing the Rich Semantic Graph from the source document, reducing it to form the abstracted Rich Semantic Graph, and generating the abstractive summary from the abstracted Rich Semantic Graph.

Volume V, Issue XII, December/2018 Page No:2381 JASC: Journal of Applied Science and Computations ISSN NO: 1076-5131

Sabina, Priyanka, Adiba and Palash [4] surveyed abstractive summarization work on various Indian regional languages. After analysing and discussing the articles, they concluded that the semantic graph based method is best suited to Bengali. They also compared rule-based and ontology-based methods. Before working on a Bengali summarizer, a domain ontology and WordNet should be built. A comparison between single-document and multi-document summarization techniques was also made.
Ibrahim and Mostafa [5] worked on abstractive summarization of a single document using the rich semantic graph reduction technique. In this approach, the original document is represented as a rich semantic graph, the graph is reduced, and the abstractive summary is generated from the reduced graph. A case study shows how the original text was reduced to fifty percent.
Nikita Munot and Dr. Sharvari Govilkar [6] worked on the English language. They proposed a system that takes a single document as input and produces an output summary using the Rich Semantic Graph technique. An example is shown in which an abstractive text summary is obtained using the Rich Semantic Graph method.
Khan and N. Salim [7] surveyed the methods proposed by different researchers for abstractive summarization. What these methods have in common is that they find the key sentences of the document.
J. Mohan, Sunitha, A. Ganesha and Jaya [8] worked on ontology-based abstractive summarization. An ontology is a formal and explicit specification of a shared conceptualization. Various frequently used ontology-based methods related to abstractive summarization are discussed.
N. Moratanch [9] reviewed the various abstractive summarization methods. The paper covers the important methods, their improvements, their drawbacks, the findings and the future directions in text summarization.
I. Fathy, D. Fadl and M. Aref [10] worked on a rich semantic graph based approach. In this paper the RSG technique is used to generate English text. The WordNet ontology is used to generate multiple texts. The approach has five parts: Text Planning, Sentence Planning, Surface Realization, Writing Styles Selected Essay Generation, and Text Evaluation.
D. Bartakke, S. D. Sawarkar and A. Gulati [11] worked on multi-document summarization using abstractive text summarization, so more than one document is used to create the abstractive summary. Semantic graphs are generated for each sentence, the generated graphs are reduced, and a set of rules is created and used to obtain the abstractive summary.
It can be inferred from the literature review that very little work has been done on Marathi abstractive text summarization, and we propose the following system to perform abstractive text summarization in the Marathi language.

III. METHODOLOGY
A set of Marathi text documents is given as input to the system. The proposed approach consists of the following phases: Marathi text document as input, pre-processing, and the Rich Semantic Graph phase (graph creation, graph reduction, summary generation).

Fig. 1 The proposed system architecture


A. Pre-processing
The first step of pre-processing converts the text document into a clean word format.
Example: !”@£}[]$ मुरलीधर देविदास आमटे उर्फ बाबा आमटे हे एक थोर मराठी समाजसेवक होते. बाबा आमटेंचा जन्म डिसेंबर २३ १९१४ रोजी महाराष्ट्रातील वर्धा जिल्ह्यात झाला. बाबा आमटेंना भारत सरकार कडून १९७१ मध्ये पद्मश्री पुरस्कार प्राप्त झाला आहे. तसेच बाबा आमटेंना भारत सरकार कडून १९८६ मध्ये पद्मविभूषण पुरस्कार प्राप्त झाला आहे. Baba Amte was a very good human being.
3.1 Input Validation
Input validation determines whether the given document is valid Devanagari script. Invalid words are removed before further processing.
Algorithm:
Input: A single Marathi document.
Output: Marathi (Devanagari) script.
Steps:
1. Use UTF-8 as the character encoding.
2. Scan the input document.
3. Compare each character of the scanned input document with the Devanagari character range of Unicode (U+0900 to U+097F).
4. If the character lies in this range, it is a valid Devanagari character; otherwise it is not.
5. Ignore all characters that are not valid Devanagari script.
6. Repeat from step 3 until all characters of the input document have been verified.
7. Store all valid Devanagari characters and words in a file for further processing.
3.2 Tokenization
Separating tokens from the input document is referred to as tokenization. The separated words or tokens of a sentence are called lexicons. With the help of a lexical analyzer, the input document can be tokenized as one token per line. Marathi is a segmented language in which word boundaries are fixed, so tokenization can be done by searching for the spaces between words. The importance of this phase is that each word can then be dealt with separately.
Algorithm:
Input: Marathi (Devanagari) document.
Output: List of tokens.
Steps:
1. Start.
2. Initialize all pointers (input pointer for characters, output pointer for tokens) and set tokens = NULL.
3. Scan the input document.
4. While the end of file has not been reached:
i. Read a character from the input file.
ii. If the character is not a special character, do the following until a space character is found:
• Treat the characters read so far as one token.
• Add the token to the token list file.
• Increment the respective pointers.
5. Repeat step 4 until end of file.
6. Stop.
3.3 Stemming
Stemming is used to remove suffixes from words. The stem is not necessarily the linguistic root of the word. An example is given with the next step.
Algorithm:
Input: List of words.
Output: Stems of the words.
Steps:
1. Use a corpus of the possible suffixes used in Devanagari script for the Marathi language.
2. Read the valid Devanagari tokens one by one, skipping stop words.
3. While token_file_pointer != EOF and Suffix_file_pointer != EOF, do:
• Find the ending of the token such that the longest possible token ending (maximum number of ending characters) matches an entry in the suffix corpus.


• Replace the maximum ending characters of the token with the null character.
• Store the resulting token as stem_of_token.

4. Repeat step 3 for all tokens.
5. End.
3.4 Morphological Analysis
Morphological analysis is used to find the internal structure of a word. The root word is obtained by using a morphological analyzer. After stemming, the words are tested to check that they are not inflected. This is done by creating rules and comparing them with the words. If a stem is inflected, the root word is created by adding substitution characters to the stem.
Algorithm:
Input: Stemmed tokens (with inflection), list of inflection stripping rules.
Output: Marathi root words.
Steps:
1. Start.
2. Initialize stem_ptr and root_ptr.
3. Read the stemmed token.
4. Read the last character of the token together with its modifier (inflection).
5. Compare the inflected character with the inflection stripping rules.
6. If a match is found, replace the inflected character with the new character given in the rule.
7. Write the token to the root word file.
8. Increment the respective pointers.
9. Repeat steps 3 to 8 for all stemmed tokens.
10. Stop.
3.5 Part of Speech Tagging
POS tagging parses the whole sentence to describe the syntactic function of each word and generates a POS tag for each word. A POS tagger is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word. The assignment of descriptors to the given tokens is called tagging, and each descriptor is called a tag. A parser tool is used to implement this task.
3.6 Named Entity Recognition
Named Entity Recognition classifies atomic elements of the text into predefined categories such as place, person name, organization, time or quantity. For the Marathi language, one approach is machine learning, where a Hidden Markov Model (HMM) is used to predict the correct NER tag for a given word. The second approach is a rule-based system, where handcrafted rules are written that help in identifying the correct NER tag of the word.
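The pre-processing steps of Sections 3.1 to 3.4 above (input validation, tokenization, stemming and morphological analysis) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the suffix list, the single inflection rule and the stop-word set are hypothetical samples, not the resources used in the proposed system, and non-Devanagari characters are replaced by spaces so that neighbouring words are not glued together.

```python
from typing import List

# Hypothetical sample resources, for illustration only (not from the paper).
SUFFIXES = ["ातील", "ला", "ना", "चा"]                # tiny sample suffix corpus
INFLECTION_RULES = {"\u094d\u092f\u093e": "\u093e"}  # sample rule: ending "्या" -> "ा"
STOP_WORDS = {"आहे", "हे"}                           # sample stop words, left unstemmed


def keep_devanagari(text: str) -> str:
    """3.1: keep characters in the Devanagari block (U+0900 to U+097F) and
    whitespace; replace every other character with a space."""
    return "".join(
        ch if ("\u0900" <= ch <= "\u097F" or ch.isspace()) else " " for ch in text
    )


def tokenize(text: str) -> List[str]:
    """3.2: split the validated text on whitespace, one token per word."""
    return text.split()


def stem(token: str) -> str:
    """3.3: strip the longest matching suffix from the token."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def to_root(stemmed: str) -> str:
    """3.4: replace an inflected ending with the ending given by the matching rule."""
    for ending, replacement in INFLECTION_RULES.items():
        if stemmed.endswith(ending) and len(stemmed) > len(ending):
            return stemmed[: -len(ending)] + replacement
    return stemmed


def preprocess(text: str) -> List[str]:
    """Run validation, tokenization, stemming and root-word recovery in order."""
    roots = []
    for token in tokenize(keep_devanagari(text)):
        roots.append(token if token in STOP_WORDS else to_root(stem(token)))
    return roots


print(preprocess("!@# घोड्याला abc"))
```

Here the stray symbols and the Latin word are discarded by validation, the suffix "ला" is stripped from the remaining token, and the sample inflection rule restores the root form. A real system would load a much larger suffix corpus and rule set from files, as the algorithms above describe.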
Any entity that has a name, such as a person, organization, geographical location, government agency or event, is treated as a named entity. Expressions such as money, percentages, phone numbers, dates, times, URLs and addresses are also treated as named entities. To perform this task we are going to use a freely available NER tool. A few rules are given below.
1. Rules for creating Person’s Name
a. Check for proper nouns.
b. Formal or specialized words such as {men, books, author of, co-author, read, worked, state, city, country, university, college, school, island of, hero, hospital, born, establish, started, saints, founded, chairman of, director} indicate a proper noun.
c. The complete set is treated as a name in two cases: first, when the set of capitalized letters contains a period (.); second, when it consists of one or two capitalized words.
d. If the word immediately preceding a possible name belongs to {Mr., Shri, Prof, Dr, Mrs., Sister, Brother, Swami, Rev, Sir, Lord, Master, Army, Air force, Navy rank, Justice, H.H, Ratna, Padmashri, His Excellence}, it is definitely a name.
e. If either of the capitalized words appears again later, the chance of it being a name rises.
f. If the set of words, or one of the capitalized words, appears at the start of a sentence, it is likely to be a name.
g. If the preposition belongs to {by, of, friend, colleagues, to, co-author, with, men, persons, emperor, men like, sage, as}, it is likely to be a name.
h. If the word immediately after the capitalized word(s) (i.e. the post-position) belongs to the set {said, told}, it is likely to be a name.
2. Rules for generating Place/Institute/Organization Name Index
a. Look for proper nouns.


b. If a preposition belonging to {of} comes immediately after a name, it is surely a Place, Organization or Institute.
c. Possible set of prepositions for a potential Place or Organization: {from, in, at, to, for, of}.
3. Rules for generating Date Index
a. Look for proper nouns.
b. If words such as {Century, decade, AD, BC, during, before, after, until} occur, it is probably a year.
Similar tools are used to find the NER tags.
3.7 Rich Semantic Graph Generation Module
The rich semantic graph generation module produces the complete rich semantic graph of the input document. It accepts the pre-processed sentences as input and produces a graph for every sentence; at the end, all the sub-graphs are merged together to represent the complete document. The semantic graphs are obtained with the help of the generated POS tags and tokens. Summarization by semantic graph generation is a procedure that uses subject-predicate-object (SPO) triples from individual sentences to generate a semantic graph of the input document.
B. Rich Semantic Graph Reduction Phase
The rich semantic graph reduction phase reduces the generated rich semantic graph of the input document to a smaller graph. A set of heuristic rules is applied to the generated rich semantic graph to reduce it by combining, removing or integrating graph nodes. Consider two simple sentences, and then apply some of the rules to the graph nodes:
Sentence1 = [SN1, MV1, ON1]
Sentence2 = [SN2, MV2, ON2]
Every sentence is formed by three nodes: a Subject Noun (SN) node, a Main Verb (MV) node and an Object Noun (ON) node.
Input: Rich Semantic Graph (RSG) of the complete document.
Output: Reduced Rich Semantic Graph (RSG).
C. Summarized Text Generation Phase
In this step, the abstractive summary is generated from the reduced Rich Semantic Graph (RSG) [18]. This phase accesses a domain ontology that contains the information needed in the same domain as the RSG to generate the final texts.
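The graph generation and reduction phases described above can be illustrated with a small Python sketch. It assumes that each sentence has already been reduced to one [SN, MV, ON] triple, and it uses a single hypothetical reduction rule (sentences that share the same subject noun and main verb are combined into one node set). The heuristic rules of the actual system are not listed in this paper, so the rule and the English placeholder words are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject noun SN, main verb MV, object noun ON)


def build_rsg(sentences: List[Triple]) -> List[Triple]:
    """Merge the per-sentence sub-graphs into one document graph,
    dropping exact duplicate triples."""
    graph: List[Triple] = []
    for triple in sentences:
        if triple not in graph:
            graph.append(triple)
    return graph


def reduce_rsg(graph: List[Triple]) -> List[Tuple[str, str, List[str]]]:
    """Hypothetical rule: triples sharing SN and MV are combined,
    collecting their object nouns in order of appearance."""
    merged: Dict[Tuple[str, str], List[str]] = {}
    for sn, mv, on in graph:
        merged.setdefault((sn, mv), []).append(on)
    return [(sn, mv, objects) for (sn, mv), objects in merged.items()]


# English placeholders standing in for Marathi tokens.
sentences = [
    ("Baba Amte", "received", "Padma Shri"),
    ("Baba Amte", "received", "Padma Vibhushan"),
    ("Baba Amte", "received", "Padma Shri"),  # repeated sentence
]
reduced = reduce_rsg(build_rsg(sentences))
print(reduced)
```

With this rule, the two award sentences collapse into one subject and verb with two objects, which mirrors how two sentences about the same person can be fused into one in a summary.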
Also, the WordNet ontology is used to generate multiple texts according to word synonyms. The generated texts are evaluated and ranked, and the top-ranked text is selected.
Example: Expected output: मुरलीधर देविदास आमटे उर्फ बाबा आमटे हे एक थोर मराठी समाजसेवक होते. त्यांचा जन्म डिसेंबर २३ १९१४ रोजी महाराष्ट्रातील वर्धा जिल्ह्यात झाला. त्यांना भारत सरकार कडून १९७१ मध्ये पद्मश्री पुरस्कार आणि १९८६ मध्ये पद्मविभूषण पुरस्कार प्राप्त झाला आहे.
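The idea of generating several candidate texts from synonyms and keeping the top-ranked one can be sketched as follows. The synonym table is a hypothetical stand-in for a WordNet lookup, and the scoring function (shorter text ranks higher) is an assumption chosen only to make the example concrete, since the ranking criterion is not specified here. English placeholder words are used instead of Marathi tokens.

```python
from itertools import product
from typing import Dict, List

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS: Dict[str, List[str]] = {
    "received": ["received", "got"],
    "award": ["award", "prize"],
}


def candidate_texts(words: List[str]) -> List[str]:
    """Generate one text per combination of synonyms of the input words."""
    options = [SYNONYMS.get(word, [word]) for word in words]
    return [" ".join(choice) for choice in product(*options)]


def score(text: str) -> int:
    """Hypothetical ranking criterion: a shorter text scores higher."""
    return -len(text)


def best_text(words: List[str]) -> str:
    """Rank all candidate texts and return the top-ranked one."""
    return max(candidate_texts(words), key=score)


words = ["Amte", "received", "award"]
print(candidate_texts(words))
print(best_text(words))
```

In a fuller system the scoring function would be replaced by the text evaluation step, and the candidates would be generated from the nodes of the reduced RSG together with the domain ontology.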

IV. CONCLUSIONS
Computers will become more and more capable of receiving and giving instructions. The aim is to produce a summary that contains the essential data and connects the sentences semantically. One possible solution is to summarize a document using either extractive or abstractive text summarization methods. Compared to extractive methods, abstractive methods are stronger because they produce a semantically coherent summary, although such a summary is difficult to produce. Abstractive text summarization of the Marathi language needs further work, as not much has been done so far. We are using the Rich Semantic Graph approach to generate an abstractive summary of a Marathi document. Many abstractive text summarization techniques have been used in other languages, but according to the study done, the RSG technique is more suitable for the Marathi language.

REFERENCES
[1] Sunitha C, Dr. A Jaya and Amal Ganesh, “Abstractive Summarization Techniques in Indian Languages”, ICRTCSE 2016, doi: 10.1016/j.procs.2016.05.121, 2016.
[2] Manjula Subramanian, “Semantic graph reducing technique to generate abstractive summary with input text in Hindi”, 2015.
[3] Rajina Kabeer, “Semantic Graph based method on summarising documents in Malayalam”, IEEE International Conference on Data Science and Engineering, 2014.
[4] Atif Khan and Naomie Salim, “Abstractive Summarization Methods”, Journal of Theoretical and Applied Information Technology, vol. 55, ISSN 1992-8645, E-ISSN 1817-3195, September 2013.


[5] Barzilay and McKeown, “Sentence fusion for multi document news summarization”, Computational Linguistics, vol. 31, pp. 297-338, 2005.
[6] Harabagiu and Lacatusu, “Generating single and multi-document summaries with GISTexter”, in Document Understanding Conference, 2002.
[7] Greenbacker, “Multimodal semantic model”, ACL HLT 2011, pp. 75, 2011.
[8] Genest and Lapalme, “Framework for abstractive summarization using text to text generation”, in Proceedings of the Workshop on Monolingual Text-To-Text Generation, 2011.
[9] Hadia Abhas Mohammed Elsied and Naomie Salim, “Automatic Abstraction Summarization a systematic Literature review”, Journal of Theoretical and Applied Information Technology, 31st August 2013.
[10] Rosna P Harun, “Text Summarization methods in Dravidian Language”, International Journal of Innovations in Engineering and Technology (IJIET), Volume 7, Issue 1, June 2016.
[11] Dhanya P M and Jathavedam M, “Comparative study of text summarization in Indian Languages”, International Journal of Computer Applications (0975-8887), Volume 75, No. 6, August 2013.
[12] R.C. Balabantaray, B Sahoo and Mswan, “Odia text summarization using stemmer”, International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 1, No. 3, February 2012.
[13] N. R. Kasture, Neha Yargal, Neha Nityanand Singh, Neha Kulkarni and Vijay Mathur, “A Survey on Methods of Abstractive Text Summarization”, International Journal for Research in Emerging Science and Technology, volume 1, issue 6, November 2014.
[14] A. Khan and N. Salim, “A Review on Abstractive Summarization Methods”, Journal of Theoretical and Applied Information Technology, 59(1), 64-71, Malaysia, 2014.
[15] J. Mohan, M. Sunitha, A. Ganesha and Jaya, “A Study on Ontology based Abstractive Summarization”, Fourth International Conference on Recent Trends in Computer Science & Engineering, 32-37, 2016.
[16] I. F. Moawad and M. Aref, “Semantic Graph Reduction Approach for Abstractive Text Summarization”, IEEE International Conference On Natural, 132-138, Egypt, 2012.
[11] N. Moratanch and S. Chitrakala, “A Survey of Abstractive Text Summarization”, IEEE International Conference on Circuit, Power and Computing Technologies, India, 2016.
[17] P. Genest and G. Lapalme, “Fully Abstractive Approach to Guided Summarization”, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 354-358, Canada, 2012.
[18] I. Fathy, D. Fadl and M. Aref, “Rich Semantic Representation based Approach for Text Generation”, The 8th International Conference on INFOrmatics and Systems (INFOS), 19-27, Egypt, 2012.
[19] D. Bartakke, S. D. Sawarkar and A. Gulati, “A Semantic Based Approach for Abstractive multi-Document Text Summarization”, International Journal of Innovative Research in Computer and Communication Engineering, 4(7), India, 2016.
[20] M. Subramaniam and V. Dalal, “Test Model for Rich Semantic Graph Representation for Hindi Text using Abstractive Method”, International Research Journal of Engineering and Technology (IRJET), 2(2), India, 2015.
