Exploring Resources in Word Sense Disambiguation for Marathi Language Amit Patil1, Chhaya Patil2, Dr
Total Page:16
File Type:pdf, Size:1020Kb
www.rspsciencehub.com Volume 02 Issue 10S October 2020 Special Issue of First International Conference on Advancements in Management, Engineering and Technology (ICAMET 2020) Exploring Resources in Word Sense Disambiguation for Marathi Language Amit Patil1, Chhaya Patil2, Dr. Rakesh Ramteke3, Dr. R. P. Bhavsar4, Dr. Hemant Darbari5 1,2 Assistant Professor, Department of Computer Application, RCPET’s IMRD, Maharashtra, India 3,4Professor, School of Computer Sciences, KBC North Maharashtra University, Maharashtra, India 5Director General, Centre for Development of Advanced Computing (C-DAC), Maharashtra, India Abstract Word Sense Disambiguation (WSD) is one of the most challenging problems in the research area of natural language processing. To find the correct sense of the word in a particular context is called Word Sense Disambiguation. As a human, we can get a correct sense of the word given in the sentence because of word knowledge of that particular natural language, but it is not an easy task for the machine to disambiguate the word. Developing any WSD system, it required sense repository and sense dictionary. It is very costly and time-consuming to build these resources. Many foreign languages have available these resources, that is why most of the foreign languages like English, German, Spanish etc lot of work is done in these Natural languages. When we look for Indian languages like Hindi, Marathi, Bengali etc. very less work is done. The reason behind this is resource-scarcity. In this paper, we majorly focus on Marathi Language Word Sense Disambiguation because of very less work is done in the Marathi Language as compared to Hindi and other Indian Languages. Our main objective is to provide information about various resources available for the Marathi language which will be helpful for researchers who wants to do work for Marathi WSD. This paper also gives a review on work done for Marathi Language WSD and its challenges and problems. Keywords: WSD, WordNet, Indo WorldNet, Part of Speech(POS), NLTK, iNLTK has been a long-standing challenge to improve the 1. Introduction ability of computers to do natural language processing. In the supervised machine learning Word sense disambiguation is a process which is approach, the classifier is trained for every different automatically recognized which multiple meaning word on manually senses annotated. These methods of ambiguous words is being used in a specific assume that the context can provide sufficient proof sentence. In other Word, WSD is identifying which on its own to disambiguate the sense.It uses meaning of a word (i.e. sense) is used in a sentence annotated corpus and ambiguity is resolved by when the word has multiple senses text. In natural finding the nearest or closest word having language processing, word sense disambiguation similarities. (WSD) is an open problem of computation 2. WSD Application and Techniques linguistic. It is a word sense disambiguation is a WSD is required in various areas like Information massive challenge in natural language processing. Retrieval (IR), Sentiment Analysis, Knowledge The human mind is quite talented at word-sense Graph Construction, Text Mining and Information disambiguation. A human language developed in a Extraction (IE), Lexicography and Machine way that human easily understands the meaning Translation. Solutions to WSD are mostly which reflects in the sentence. In the computer, it categorized into knowledge-based, supervised and International Research Journal on Advanced Science Hub (IRJASH) 108 www.rspsciencehub.com Volume 02 Issue 10S October 2020 unsupervised approaches. Language WSD is a lack of resources. Still, some peoples started research work for Marathi Every machine learning approach has many Language and developed some Marathi Language algorithms. In the supervised approach follows WSD. In this section, we try to give information many algorithms are Decision List, Decision Tree, about resources available for Marathi Language Naïve Bayes, Support Vector Machine (SVM), which will useful for researchers, who want to Logistic Regression, Logistic Regression, Random work on this problem. Forest, KNN (k- Nearest Neighbours) Ensemble Methods, Neural Networks, etc. In the Unsupervised approach follows many algorithms 3.1 Marathi WorldNet are Word Clustering, Context Clustering, Co- occurrence Graph, K-means, Apriori algorithm, etc. This is a machine readable dictionary based English The Knowledge Base approach follows many WordNet. It is not just a traditional dictionary, but algorithms are Genetic algorithm, Decision more than this. This dictionary gives different Support, Lesk algorithm, Semantic Similarity, relations between synsets or synonym sets Selection Preferences. represented as unique concepts. It is developed by Dr. Pushpak Bhattacharya with his team at IIT, The authors [1] explored the work status of WSD in Bombay. Marathi WorldNet is organized as a the Marathi Language. Many researchers used semantic network of large electronic databases. different algorithms for disambiguating the Marathi sentence like the graph-based algorithm to resolve Paradigmatic relations such as synonymy, ambiguity based on word sense and context hyponymy, antonymy and entailment etc. are used domain. The researcher used Genetic Algorithm to construct it. It is widely used lexical database technique through which they resolve the today for research in NLP for Marathi language, the ambiguity of the words based on their context different senses called synonym sets or synsets for domain and their senses. Other authors used each open-class word like nouns, verbs, adjectives, approach consists of a modified Lesk algorithm and adverbs are listed by Marathi WordNet. It has with Support Vector Machine. etc. The accuracy of the index_txt file to Provides information about all every algorithm is depends on text corpus and words present in Wordnet, the data_txt file for different techniques applied to the data set. Providing the details of every word in the index file and the onto_txt file which Provides ontology details of the words in data file [2]. 2.1 Marathi Language and its Word Categories Example: Marathi is the Indo-Aryan language. This language of Sanskrit origin. The Marathi language is the data_txt File- official language of Maharashtra state, a state in Structure of Data_txt: Example 00054554 03 02 India. The language is most speakers in word wide. फसवणे:चकवणे 0001 0400 00000183 | फसेल असे Approximately 90 million people in India speak करणे:"नकली मालाची ववक्री क셂न दुकानदार लोका車ना this language. Maharashtra is a Southern state in फसवतात." India, the dialects of Marathi include Varhadii, Gawdi of Goa, Nagpuri Marathi, Dangii, Malwani, Synset id=00054554 Kudali, Kasargod, Kosti, Ahirani of Khandeshi, etc. POS=03 number of words present in synset=02 The Marathi language follows the Subject, Object, Synsets= फसवणे:चकवणे Verb, Nouns inflect for gender, number, etc. The Number of relations lexical as well as Marathi language is eight main POS (Part of semantic=0001 Speech). These are Noun, Verb, Adjective, Adverb, Four-digit code relation id=0400 synset_id for Pronoun, Postposition, Conjunction and Interjection [3]. which that relation exists=00000183 gloss= फसेल 3. Marathi Language Resources असे करणे One of the challenges in researching Marathi Example sentence=:"नकली मालाची ववक्री क셂न International Research Journal on Advanced Science Hub (IRJASH) 109 www.rspsciencehub.com Volume 02 Issue 10S October 2020 दुकानदार लोका車ना फसवतात." Most of the NLP applications required pre- Here part of speech (pos): 1(noun), 2(adjective), processing of the text. The libraries that are more 3(verb), 4(adverb) first four-digit relation type useful while using python for text processing are represented and the second four-digit represents the the Indic NLP Library and Natural Language order of words from two synsets for which relation Toolkit for Indic Languages (iNLTK). These two holds. libraries support many Indian languages including index_txt File- Marathi. Consider the example from index_txt file: गीत 01 01 0400 01 00004897 3.3.1 Indic NLP Library In the example: word= गीत pos=01 {*pos: 1(noun), This Library is intended to build Python-based 2(adjective), 3(verb), 4(adverb)} number of libraries for common text processing and Natural relations exists for word in all its senses=01 Language Processing in Indian languages. Most of number of senses=01 and sense id is=00004897 in the Indian languages have similarities in terms of the data_txt.txt file. script, phonology, language syntax, etc. and this onto_txt File- library gives a general solution commonly required Structure of Onto_txt : 00000051 0001 00000044 toolsets for Indian language text. |वेळ (Time) {TIME उदाहरणे :- सकाळ, वदवस इत्यादी} Using this library you can do Text Normalization, onto id= 00000051 0001 indicates that parent Script Information, Word Tokenization and De- exists parent onto id= 00000044 onto tokenization, Sentence Splitting, Word description=(Time). Segmentation, Syllabification, Script Conversion, 3.2 Indo WordNet Romanization, Translation, etc. for the Marathi Language. Based on EuroWordNet dictionary, IndoWordNet is developed. Eighteen scheduled languages of 3.3.2 Natural Language Toolkit for Indic India, namely Marathi, Hindi, Malayalam, Telugu, Languages (iNLTK) Kashmiri, Bodo, Bangla, Gujarati, Kannada, Odia, Konkani, Manipuri, Assamese, Punjabi, Nepali, The iNLTK library is equivalent to the NLTK Tamil, Sanskrit and Urdu represent the lexical Python package.