Wordnet in Malaylam Language for Cricket Domain

Total Page:16

File Type:pdf, Size:1020Kb

Wordnet in Malaylam Language for Cricket Domain International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 10 (2018) pp. 7829-7834 © Research India Publications. http://www.ripublication.com WordNet in Malaylam language for Cricket domain Sreedhi Deleep Kumar Reshma E U PG Scholar PG Scholar Department of Computer Science and Engineering Department of Computer Science and Engineering Vidya Academy of Science and Technology, Thrissur, India Vidya Academy of Science and Technology, Thrissur, India Sunitha C Amal Ganesh Associate Professor Assistant Professor Department of Computer Science and Engineering, Department of Computer Science and Engineering Vidya Academy of Science and Technology, Thrissur, India Vidya Academy of Science and Technology, Thrissur, India Abstract There are numerous languages in India which belong to different language families. These language families are Indo- WordNet is an information base which is arranged Aryan, Dravidian, SinoTibetan, Tibeto-Burman and Austro- hierarchically in any language. Usually, WordNet is Asiatic. The major ones are the Indo-Aryan, spoken by the implemented using indexed file system. Good WordNets northern to western part of India and Dravidian, spoken by available in many languages. However, Malayalam is not southern part of India. The Eighth Schedule of the Indian having an efficient WordNet. WordNet differs from the Constitution lists 22 languages, which have been referred to as dictionaries in their organization. WordNet does not give scheduled languages and given recognition, status and official pronunciation, derivation morphology, etymology, usage notes, encouragement. or pictorial illustrations. WordNet depicts the semantic relation between word senses more transparently and elegantly. In this Malayalam has official language status in the state of Kerala work, the words or the terms are stored in the XML format. The and in the union territories of Lakshadweep and Puducherry. relationships of the various word are also mentioned. Hence It belongs to the Dravidian family of languages and is spoken when the user needs to know about a particular term or a word, by approximately 33 million people, according to the 2001 they can directly search for that word so that all the details census. Malayalam most likely originated from Middle Tamil regarding the word will be gathered and dispalyed from the (Sen-Tamil) in the 6th century. WordNets are available in WordNet. The informations includes its Synonyms, Meaning of many Languages. But in Malayalam there is not a good the word, An example statement for the word and also its WordNet available yet. As a beginning, this project aims to available relations such as hypernymy and holonymy. The create a WordNet in Malayalam language concentrating on a ultimate objective of this project is to create a WordNet in single domain. This project deals with the Cricket domain. Malayalam language pertaining to cricket domain. WordNet is a semantic dictionary that was designed as a Keywords: WordNet, Indexed file system, Malayalam, network following the idea that representing words and Cricket, Synonym, Hypernym, XML concepts as an interrelated system. WordNet groups words into sets of synonyms and provides short definitions and usage examples, also records a number of relations among these INTRODUCTION synonym sets or their members. In the area of Natural Language Processing, WordNet plays WordNet can be seen as a combination of dictionary and an important role. WordNet is a semantic dictionary that was thesaurus. WordNet superficially resembles a thesaurus, in designed as a network following the idea that representing that it groups words together based on their meanings. words and concepts as an interrelated system is consistent However, there are some important distinctions. First, with evidence for the way speakers organize their own mental WordNet interlinks not just word forms,but specific senses of lexicons[1]. Nowadays , WordNets are available in many words. As a result, words that are found in close proximity to Languages. But in Malayalam there is not a good wordnet one another in the network are semantically disambiguated. available yet. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow India is a country with diverse culture, language and varied any explicit pattern other than meaning similarity. WordNet's heritage. Due to this, it is very rich in languages and their structure makes it a useful tool for computational linguistics dialects. Being a multilingual society, a dictionary in multiple and natural language processing. languages becomes its need and one of the major resources to support a language. There are dictionaries for many Indian Properties of WordNet A WordNet can provide the following languages, but very few are available in multiple languages. information: WordNet is one of the most prominent lexical resources in the field of Natural Language Processing. 7829 International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 10 (2018) pp. 7829-7834 © Research India Publications. http://www.ripublication.com • Synonymy: include: Nouns. The figure of Princeton WordNet is as shown below. This one is easy and links words that have similar meanings, e.g. happy and glad. • Antonymy: The opposite of synonymy, e.g. happy and sad • Hypernymy: Hypnernymy refers to a hierarchical relationship between words. For example, furniture is a hypernym of chair since every chair is a piece of furniture (but not vice-versa). • Hyponymy: Hyponymy is the opposite of hypernymy. Dog is a hyponym of canine since every dog is a canine. Figure 1: Princeton WordNet • Meronymy: Meronymy refers to a part/whole relationship. For example, II. EuroWordNet paper is a meronym of book, since paper is a part of a book. EuroWordNet is a multilingual database with WordNets for • Troponymy: several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). The WordNets are structured in Troponymy is the semantic relationship of doing something in the same way as the American WordNet for English ( the manner of something else. For example, walk is a Princeton WordNet, Miller et al 1990) in terms of synsets troponym of move and limp is a troponym of walk. (sets of synonymous words) with basic semantic relations between them. Each WordNet represents a unique language- internal system of lexicalizations. In addition, the WordNets RELATED WORKS are linked to an Inter-Lingual-Index, based on the Princeton This section enumerate the available WordNets and their WordNet. Via this index, the languages are interconnected so properties that it is possible to go from the words in one language to similar words in any other language. The index also gives I. Princeton WordNet access to a shared top-ontology of 63 semantic distinctions. This is the first WordNet to be developed. WordNet was This top-ontology provides a common semantic framework created in the Cognitive Science Laboratory of Princeton for all the languages, while language specific properties are University under the direction of psychology professor maintained in the individual WordNets. The database can be George Armitage Miller starting in 1985 and has been used, among others, for monolingual and cross-lingual directed in recent years by Christiane Fellbaum. WordNet is a information retrieval, which was demonstrated by the users in lexical database for the English language. It groups English the project. words in to sets of synonyms called synsets, provides short The EuroWordNet project was completed in the summer of definitions and usage examples, and records a number of 1999. The design of the database, the defined relations, the relations among these synonym sets or their members. As of top-ontology and the Inter-Lingual-Index are now frozen. November 2012 WordNet’s latest Online-version is 3.1. Nevertheless, many other institutes and research groups are The database contains 155,287 words organized in 117,659 developing similar WordNets in other languages (European synsets for a total of 206,941 wordsense pairs; in compressed and non-European) using the EuroWordNet specification. If form, it is about 12 megabytes in size. WordNet includes the compatible, these WordNets can be added to the above lexical categories nouns, verbs, adjectives and adverbs but database and, via the index, connected to any other WordNet. ignores prepositions, determiners and other function words. The EuroWordNet format is defined by the EuroWordNet Words from the same lexical category that are roughly Database Editor Polaris. synonymous are grouped into synsets. Synsets include simplex words as well as collocations like ”eat out” and ”car pool.” The different senses of a polysemous word form are III. IndoWordNet assigned to different synsets. The meaning of a synset is IndoWordNet is an integrated multilingual WordNet for further clarified with a short defining gloss and one or more Indian languages. These WordNet resources are used by usage examples. An example adjective synset is: good, right, researchers to experiment and resolve the issues in ripe (most suitable or right for a particular purpose; ”a good multilinguality through computation. However, there are few time to plant tomatoes”; ”the right time to act”; ”the time is cases where WordNet is used by the non-researchers or ripe for great sociological changes”) All synsets are connected general public. IndoWordNet is a linked lexical knowledge to other synsets by means
Recommended publications
  • An Efficient Database Design for Indowordnet Development Using
    An Efficient Database Design for IndoWordNet Development Using Hybrid Approach Venkatesh Prabhu2 Shilpa Desai1 Hanumant Redkar1 N eha Prabhugaonkar1 Apurva N agvenkar1 Ramdas Karmali1 (1) GOA UNIVERSITY, Taleigao - Goa (2) THYWAY CREATIONS, Mapusa - Goa [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] ABSTRACT WordNet is a crucial resource that aids in Natural Language Processing (NLP) tasks such as Machine Translation, Information Retrieval, Word Sense Disambiguation, Multi-lingual Dictionary creation, etc. The IndoWordNet is a multilingual WordNet which links WordNets of different Indian languages on a common identification number given to each concept. WordNet is designed to capture the vocabulary of a language and can be considered as a dictionary cum thesaurus and much more. WordNets for some Indian Languages are being developed using expansion approach. In this paper we have discussed the details and our experiences during the evolution of this database design while working on the Indradhanush WordNet Project. The Indradhanush WordNet Project is working on the development of WordNets for seven Indian languages. Our database design gives an efficient plan for storage of WordNet data for all languages. In addition it extends the design to hold specific concepts for a language. KEYWORDS: WordNet, IndoWordNet, synset, database design, expansion approach, semantic relation, lexical relation. Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), pages 229–236, COLING 2012, Mumbai, December 2012. 229 1 Introduction 1.1 WordNet and its storage methods WordNet (Miller, 1993) maintains the concepts in a language, relations between concepts and their ontological details.
    [Show full text]
  • Exploring Resources in Word Sense Disambiguation for Marathi Language Amit Patil1, Chhaya Patil2, Dr
    www.rspsciencehub.com Volume 02 Issue 10S October 2020 Special Issue of First International Conference on Advancements in Management, Engineering and Technology (ICAMET 2020) Exploring Resources in Word Sense Disambiguation for Marathi Language Amit Patil1, Chhaya Patil2, Dr. Rakesh Ramteke3, Dr. R. P. Bhavsar4, Dr. Hemant Darbari5 1,2 Assistant Professor, Department of Computer Application, RCPET’s IMRD, Maharashtra, India 3,4Professor, School of Computer Sciences, KBC North Maharashtra University, Maharashtra, India 5Director General, Centre for Development of Advanced Computing (C-DAC), Maharashtra, India Abstract Word Sense Disambiguation (WSD) is one of the most challenging problems in the research area of natural language processing. To find the correct sense of the word in a particular context is called Word Sense Disambiguation. As a human, we can get a correct sense of the word given in the sentence because of word knowledge of that particular natural language, but it is not an easy task for the machine to disambiguate the word. Developing any WSD system, it required sense repository and sense dictionary. It is very costly and time-consuming to build these resources. Many foreign languages have available these resources, that is why most of the foreign languages like English, German, Spanish etc lot of work is done in these Natural languages. When we look for Indian languages like Hindi, Marathi, Bengali etc. very less work is done. The reason behind this is resource-scarcity. In this paper, we majorly focus on Marathi Language Word Sense Disambiguation because of very less work is done in the Marathi Language as compared to Hindi and other Indian Languages.
    [Show full text]
  • WWDS Apis: Application Programming Interfaces for Efficient Manipulation of World Wordnet Database Structure
    Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) WWDS APIs: Application Programming Interfaces for Efficient Manipulation of World WordNet Database Structure Hanumant Redkar1, Sudha Bhingardive1, Kevin Patel1, Pushpak Bhattacharyya1 Neha Prabhugaonkar2, Apurva Nagvenkar2, Ramdas Karmali2 1Indian Institute of Technology Bombay, Mumbai, India 2Goa University, Goa, India {hanumantredkar, bhingardivesudha, kevin.svnit, pushpakbh}@gmail.com {nehapgaonkar.1920, apurv.nagvenkar, ramdas.karmali}@gmail.com Abstract example, developers can potentially extract information WordNets are useful resources for natural language from other WordNets through WWDS and its APIs that is processing. Various WordNets for different languages have missing in their source WordNet. The WWDS and WWDS been developed by different groups. Recently, World APIs are explained in the following sections. WordNet Database Structure (WWDS) was proposed by Redkar et. al (2015) as a common platformm to store these different WordNets. However, it is underutilized due to lack World WordNet Database Structure of programming interface. In this paper, we present WWDS APIs, which are designed to address this shortcoming. These WWDS is an efficient storage mechanism which uses WWDS APIs, in conjunction with WWDS, act as a wrapper that enables developers to utilize WordNets without multiple databases to accommodate different WordNets. Its worrying about the underlying storage structure. The APIs design is based on IndoWordNet database structure are developed in PHP, Java, and Python, as they are the (Prabhu et al., 2012). The language independent preferred programming languages of most developers and information such as semantic relations, ontology details, researchers working in language technologies. These APIs etc. is stored in a single master database named can help in various applications like machine translation, word sense disambiguation, multilingual information wordnet_master.
    [Show full text]
  • Comparative Study on Currently Available Wordnets
    International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 10 (2018) pp. 8140-8145 © Research India Publications. http://www.ripublication.com Comparative Study on Currently Available WordNets Sreedhi Deleep Kumar Reshma E U PG Scholar PG Scholar Department of Computer Science and Engineering Department of Computer Science and Engineering Vidya Academy of Science and Technology Vidya Academy of Science and Technology Thrissur, India. Thrissur, India. Sunitha C Amal Ganesh Associate Professor Assistant Professor Department of Computer Science and Engineering Department of Computer Science and Engineering Vidya Academy of Science and Technology Vidya Academy of Science and Technology Thrissur, India. Thrissur, India. Abstract SinoTibetan, Tibeto-Burman and Austro-Asiatic. The major ones are the Indo-Aryan, spoken by the northern to western WordNet is an information base which is arranged part of India and Dravidian, spoken by southern part of India. hierarchically in any language. Usually, WordNet is The Eighth Schedule of the Indian Constitution lists 22 implemented using indexed file system. Good WordNets languages, which have been referred to as scheduled available in many languages. However, Malayalam is not languages and given recognition, status and official having an efficient WordNet. WordNet differs from the encouragement. dictionaries in their organization. WordNet does not give pronunciation, derivation morphology, etymology, usage notes, A Dictionary can be called as are source dealing with the or pictorial illustrations. WordNet depicts the semantic relation individual words of a language along with its orthography, between word senses more transparently and elegantly. In this pronunciation, usage, synonyms, derivation, history, work, a general comparison of currently browsable WordNets etymology, etc.
    [Show full text]
  • Improving Semantic Similarity with Cross-Lingual Resources: a Study in Bangla—A Low Resourced Language
    informatics Article Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language Rajat Pandit 1,* , Saptarshi Sengupta 2, Sudip Kumar Naskar 3, Niladri Sekhar Dash 4 and Mohini Mohan Sardar 5 1 Department of Computer Science, West Bengal State University, Kolkata 700126, India 2 Department of Computer Science, University of Minnesota Duluth, Duluth, MN 55812, USA; [email protected] 3 Department of Computer Science & Engineering , Jadavpur University, Kolkata 700032, India; [email protected] 4 Linguistic Research Unit, Indian Statistical Institute, Kolkata 700108, India; [email protected] 5 Department of Bengali, West Bengal State University, Kolkata 700126, India; [email protected] * Correspondence: [email protected] Received: 17 February 2019; Accepted: 20 April 2019; Published: 5 May 2019 Abstract: Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest as its understanding can provide a look into how human beings comprehend meaning and make associations between words. However, when this problem is looked at from the viewpoint of machine understanding, particularly for under resourced languages, it poses a different problem altogether. In this paper, semantic similarity is explored in Bangla, a less resourced language. For ameliorating the situation in such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented using cross-lingual resources in English and the results obtained are truly astonishing. In the presented paper, two semantic similarity approaches have been explored in Bangla, namely the path-based and distributional model and their cross-lingual counterparts were synthesized in light of the English WordNet and Corpora.
    [Show full text]
  • A Sentiment Analysis of Gujarati Text Using Gujarati Senti Word Net
    International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8 Issue-9, July 2019 A Sentiment Analysis of Gujarati Text using Gujarati Senti word Net Lata Gohil, Dharmendra Patel Abstract: Sentiment Analysis plays vital role in decision II. RELATED WORK making. For English language intensive research work is done in this area. Very less work is reported in this domain for Indian Sentiment analysis is useful method to understand opinion languages compared to English language. Gujarati language is expressed in text. “It is one of the most active research areas almost unexplored for this task. More data in form of movie in natural language processing and is also widely studied in reviews, product reviews, social media posts etc are available in data mining, web mining, and information retrieval” [7]. regional languages as people like to use their native language on Beginning research work on sentiment analysis was mostly Internet which leads to need of mining these data in order to understand their opinion. Various tools and resources are focused on English language. However increasing developed for English language and few for Indian languages. non-English language content on Internet led demand to work Gujarati is resource poor language for this task. Motive of this towards other languages. There are two approaches namely paper is to develop sentiment lexical resource for Gujarati Lexicon and Machine Learning are widely explored for language which can be used for sentiment analysis of Gujarati sentiment analysis. Large amount of annotated data is the key text. Hindi SentiWordNet (H-SWN) [1] and synonym relations of requirement of Machine learning approach.
    [Show full text]
  • S.ARULMOZI Assistant Professor Mobile: +91-9441330510 Dept
    S.ARULMOZI Assistant Professor Mobile: +91-9441330510 Dept. of Dravidian & Computational Linguistics Residence: +91-8570-278214 Dravidian University, Kuppam 517426, India Email: [email protected] EDUCATION Ph.D., Applied Linguistics, University of Hyderabad, 1999. PGDTS, Translation Studies, University of Hyderabad, 1995 (66%) M.Phil., Applied Linguistics, University of Hyderabad, 1992 (70.8%) M.A., Linguistics, Bharathiar University, 1990 (64.4%) B.Sc., Chemistry, Bharathiar University, 1988. (54.6%) H.Sc., Board of Higher Secondary Education, Tamil Nadu, 1985 (61.75%) S.S.L.C., Board of Secondary Education, Tamil Nadu, 1983 (64.5%) EMPLOYMENT Designation Department University/Institution Period Assistant Professor Department of Dravidian Dravidian University 29 Dec 2005 and Computational onwards Linguistics Guest Faculty Centre for ALTS University of Hyderabad 19 Jan 2005 to 28 Dec 2005 Member Research AU-KBC Research Centre Anna University 29 Dec 2000 to Staff 15 Jan 2005 Project Fellow Department of Linguistics Tamil University 21 Jan 1999 to 2 Jul 2000 Language Assistant- DoE Project Central Institute of Indian 8 Jun 1998 to Tamil Languages 31 Oct 1998 RESEARCH EXPERIENCE Course Topic Research Guide Year University Ph.D. Aspects of Inflectional Prof. Probal Dasgupta 1998 University of Morphophonology: A & Dr. N. Krupanandam Hyderabad Computational Approach M.Phil. Dynamics of Translation in Prof. Probal Dasgupta 1992 University of Reconstructing Sci-Tech Hyderabad Terminologies M.A. Advertisement Headlines Prof. C.Shanmugom 1990 Bharathiar University 1 ACADEMIC OUTREACH S.No. Role Topic Event Place Date 1. Resource Person BIS Tagset for Workshop on Tamil Madurai 03-03-2013 Tamil POS POS Tagging Kamaraj Tagging University 2.
    [Show full text]
  • An Online Interface for Synset Creation with Special Reference to Sanskrit
    Introduction to Synskarta: An Online Interface for Synset Creation with Special Reference to Sanskrit Hanumant Redkar, Jai Paranjape, Nilesh Joshi, Irawati Kulkarni, Malhar Kulkarni, Pushpak Bhattacharyya Center for Indian Language Technology, Indian Institute of Technology Bombay, India. [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] In this paper, we have taken reference of San- Abstract skrit WordNet 3. Sanskrit is an Indo-Aryan lan- guage and is one of the ancient languages. It has WordNet is a large lexical resource express- vast literature and a rich tradition of creating ing distinct concepts in a language. Synset is léxica. The roots of all languages in the Indo Euro- a basic building block of the WordNet. In this pean family in India can be traced to Sanskrit paper, we introduce a web based lexicogra- (Kulkarni et al., 2010). Sanskrit WordNet is con- pher's interface ‘Synskarta’ which is devel- structed using expansion approach where Hindi oped to create synsets from source language to target language with special reference to WordNet is used as a source (Kulkarni et al., Sanskrit WordNet. We focus on introduction 2010). and implementation of Synskarta and how it While developing Sanskrit WordNet, lexicogra- can help to overcome the limitations of the phers create Sanskrit synsets by referring to Hindi existing system. Further, we highlight the fea- synsets and by following the three principles of tures, advantages, limitations and user evalua- synset creation (Bhattacharyya, 2010). Since San- tions of the same. Finally, we mention the skrit came into existence much before Hindi, it has scope and enhancements to the Synskarta and many words which are not present in Hindi its usefulness in the entire IndoWordNet WordNet.
    [Show full text]
  • Gujarati Language: Research Issues, Resources and Proposed Method on Word Sense Disambiguation
    International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-8, Issue-2S11, September 2019 Gujarati Language: Research Issues, Resources and Proposed Method on Word Sense Disambiguation Tarjni Vyas, Amit Ganatra B. GUJARATI GRAMMAR Abstract: Gujarati Word Sense Disambiguation (WSD) is an There are 32 consonants and 8 vowels in Gujarati language. exceptionally complex when it comes to Natural language handling because it needs to manage complexities found in a No Components Number Details language. In this paper, the discussion has put forward about 1 Consonants 32 Fig2 Guajarati language, Gujarati Wordnet and Gujarati word sense disambiguation. Accordingly, the deep learning approach is 2 Vowels 8 Fig2 found to perform better in Gujarati WSD yet one of its weakness is 3 Tenses 3 Past,Present, Future the prerequisite of enormous information sources without which 4 Vachans 2 Singular,Plural preparing is close to impossible. On the other hand, utilizes 5 Sentence Structure 3 Subject,object,Verb information sources to choose the meanings of words in a specific setting. Provided with that, deep learning approaches appear to be Sentence structure is made in the order of Subject, Object and more suitable to manage word sense disambiguation; however, the process will always be challenging given the ambiguity of Verb in Gujarati language. natural languages. There are three Tenses in Gujarati Past Tense, Present Tense Keywords: Word Sense Disambiguation, Gujarati Language, and Future Tense. There are two types of Vachans in Deep learning, Natural language processing, Lesk Gujarati.Singular and Plural. Gujarati Language includes Algorithm, Wordnet. following cases.(Table 1) I.
    [Show full text]
  • Adding a Sense-Annotated Telugu Lexicon
    Enrichment of OntoSenseNet: Adding a Sense-annotated Telugu lexicon Sreekavitha Parupalli and Navjyoti Singh Center for Exact Humanities (CEH) International Institute of Information Technology, Hyderabad, India [email protected] [email protected] Abstract The paper describes the enrichment of OntoSenseNet- a verb- centric lexical resource for Indian Languages. This resource contains a newly developed Telugu-Telugu dictionary. It is important because na- tive speakers can better annotate the senses when both the word and its meaning are in Telugu. Hence efforts are made to develop a soft copy of Telugu dictionary. Our resource also has manually annotated gold standard corpus consisting 8483 verbs, 253 adverbs and 1673 adjectives. Annotations are done by native speakers according to defined annotation guidelines. In this paper, we provide an overview of the annotation proce- dure and present the validation of our resource through inter-annotator agreement. Concepts of sense-class and sense-type are discussed. Addi- tionally, we discuss the potential of lexical sense-annotated corpora in improving word sense disambiguation (WSD) tasks. Telugu WordNet is crowd-sourced for annotation of individual words in synsets and is compared with the developed sense-annotated lexicon (OntoSenseNet) to examine the improvement. Also, we present a special categorization (spatio-temporal classification) of adjectives. 1 Introduction Lexically rich resources form the foundation of all natural language processing (NLP) tasks. Maintaining the quality of resources is thus a high priority issue [5]. Hence, it is important to enhance and maintain the lexical resources of any language. This is of significantly more importance in case of resource poor languages like Telugu [19].
    [Show full text]
  • Indowordnet and Its Linking with Ontology
    IndoWordNet and its Linking with Ontology Brijesh Bhatt Pushpak Bhattacharyya Center For Indian Language Technology, Center For Indian Language Technology, Department of Computer Science and Engineering, Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai Indian Institute of Technology, Mumbai [email protected] [email protected] Abstract hierarchical structure of concepts related by sub- sumption relation, which can be shared between Reasoning about natural language requires applications. Though WordNet is often considered combining semantically rich lexical re- as an ontology (the synset corresponds to concept sources with world knowledge, provided and ‘hypernymy-hyponymy’ relation is similar to by ontologies. In this paper, we describe subsumption), it is a language specific resource linking of WordNets of Indian languages which may vary from language to language. The with an upper ontology SUMO (Sug- ontological issues in WordNet discussed in (pease gested Upper Merged Ontology). This and Fellbaum, 2010; Gangemi et al., 2003; Niles creates multilingual resource for Indian and Pease, 2003) are as follows, languages which can be used in various natural language processing applications. This paper presents the architecture of 1. Confusion between concept and individual: IndoWordNet- Linking of WordNets of WordNet synsets do not distinguish between seventeen different Indian languages and concept and individual. For example, both provides a method to link it with upper on- ‘Martial Art’ and ‘Karate’ are considered as tology SUMO. Two different systems: In- concept. doWordNet navigator and SIGMAKEE in- terface for Indian languages are developed 2. Confusion between object level and meta to access this resource. level concept: WordNet covers both object level and meta level concepts as hyponymy 1 Introduction of same concept.
    [Show full text]
  • Development of Assamese Wordnet
    Development of Assamese WordNet Iftikar Hussain, Navanath Saharia and Utpal Sharma Department of Computer Science & Engineering Tezpur University, Napaam, India-784028 [email protected], [email protected], [email protected] Abstract. An important challenge in the task of semantic analysis is the fact that in natural languages several words may denote the same concept, and a single word form may denote different concepts. WordNet is a repository of information about such characteristics of words, in a readily accessible format. A WordNet may be developed for a language for which computational processing is attempted. It calls for efforts in the domain of linguistics as well as computer science. In this paper we describe our works towards the development of WordNet for Assamese language, an official language of India. We cover proposed formats for the WordNet database text files and also an Application Programming Interface (API) of the Assamese WordNet. 1 Introduction One of the main challenges in Natural Language Processing (NLP) is determining the appropriate sense of each word that occurs in input expressions. Words in natural languages often have multiple senses, and often several distinct words denote the same sense. WordNet helps to overcome such challenges. WordNet is a database that consists of words or collocations. Words having similar senses are grouped together and the groups are interconnected through some lexical and semantic relations. Users or their applications can make queries on it and can find out appropriate senses of words distinctly. Assamese is one of the 22 official languages in India, spoken by nearly 30 million people.
    [Show full text]