Development of Assamese WordNet
Iftikar Hussain, Navanath Saharia and Utpal Sharma
Department of Computer Science & Engineering
Tezpur University, Napaam, India-784028

Abstract. An important challenge in semantic analysis is that, in natural languages, several words may denote the same concept and a single word form may denote different concepts. WordNet is a repository of information about such characteristics of words, in a readily accessible format. A WordNet may be developed for any language for which computational processing is attempted. Doing so calls for effort in both linguistics and computer science. In this paper we describe our work towards the development of a WordNet for Assamese, an official language of India. We cover the proposed formats for the WordNet database text files and an Application Programming Interface (API) for the Assamese WordNet.

1 Introduction

One of the main challenges in Natural Language Processing (NLP) is determining the appropriate sense of each word that occurs in an input expression. Words in natural languages often have multiple senses, and several distinct words often denote the same sense. WordNet helps to overcome such challenges. WordNet is a database that consists of words or collocations. Words having similar senses are grouped together, and the groups are interconnected through lexical and semantic relations. Users and applications can query it to identify the appropriate sense of a word.

Assamese is one of the 22 official languages of India, spoken by nearly 30 million people. WordNets are being built for about thirteen of these official languages at different institutions. Hindi WordNet, developed at IIT Bombay, is the first WordNet developed for an Indian language. Assamese is a morphologically rich, free word order Indic language for which very little computational work has been reported, viz.
[1, 2, ?, ?, ?]. Our work reported in this paper is one of the first efforts towards building an Assamese WordNet. Although work on building an Assamese WordNet has recently been taken up at Gauhati University [3], its results are still awaited.

The proposed architecture of our Assamese WordNet comprises a database and a graphical user interface (GUI). The WordNet database consists of text files which include the synonym set (synset) of a word and the sense of the word. Synsets are interconnected with other synsets via a number of lexical and semantic relations. The database consists of three text files: an index file, a data file and an ontology file. Assamese linguistic data are maintained in the database files in a pre-defined format. An Application Programming Interface (API) has been developed through which other applications can access the WordNet database. In the course of our work, we had to deal with several issues related to the support of Assamese script on computers, such as encoding and text entry.

In the next section we describe the basics of WordNet. Section 3 explores some existing work on WordNet and section 4 discusses issues faced during implementation. In section 5 we discuss our proposed architecture for Assamese WordNet and its current status. Section 6 concludes the paper and indicates future directions.

2 WordNet

WordNet is a repository of the words of a language. The words are grouped together according to the similarity of their meanings. Each word has a synonym set, called a synset, which represents one lexical relation. Each synset has another element, called a gloss, that describes the concept. Synsets in the WordNet are connected to other synsets via a number of lexical and semantic relations. Each entry of the WordNet¹ consists of the following elements:

1. Synset: Words in a synset are arranged according to frequency of usage.
   Example- Synset: {মৌ, মধু, মকৰন্দ}; TF²: {mou, modhu, makaranda}; EM³: Honey.
2. Gloss: It describes the concept. It consists of two parts:
   (a) Text definition: It explains the concept denoted by the synset.
       Example- ফুলৰ মিঠা ৰস; TF: phulor mithA ras; EM: Flower's honey.
   (b) Example sentence: It illustrates the usage of the words in a sentence.
       Example- মৌ মাখিয়ে ফুলৰ পৰা মৌ গোটাই; TF: mou mAkhiye phulor porA mou gotAi; EM: Honey bee collects honey from flowers.
3. Position in ontology: An ontology is a hierarchical organization of concepts, more specifically, a categorization of entities and actions. For each syntactic category, namely noun, verb, adjective and adverb, a separate ontological hierarchy is present. Each synset is mapped to some place in the ontology. A synset may have multiple parents. The ontology for the synset representing the concept school is shown in figure 1.
   Example- Synset = {স্কুল, বিদ্যালয়, পাঠশালা}; TF: {skul, bidyAlay, pAthchAlA}; EM: School

Fig. 1. Ontology for the synset for "School"

¹ Hindi WordNet Data and Associated Software License Agreement. The IIT-Bombay represented for the purpose of the signature of this agreement by the Dean of Research and Development, IIT Bombay or by his authorized representative Dr. Pushpak Bhattacharyya, Professor of the Department of Computer Science and Engineering, IIT Bombay.
² TF: Transliterated Assamese Form.
³ EM: English Meaning (Concept).

3 Related Work

The Princeton English WordNet⁴ was developed by Professor G. A. Miller at the Cognitive Science Laboratory at Princeton University. A variety of lexical and semantic relations are used to represent the organization of lexical knowledge. Inputs provided through text files written by lexicographers are converted to database files. Two kinds of building blocks are distinguished in the source files: word forms and word meanings. There are separate files for each syntactic category: noun, verb, adjective and adverb. All of the synsets in a lexicographer file are in the same syntactic category.
For each syntactic category, two files are needed to represent the contents of the WordNet database: index.pos and data.pos, where pos is noun, verb, adj or adv. Each index file is an alphabetized list of all the words found in WordNet in the corresponding part of speech. A data file for a syntactic category contains information corresponding to the synsets that were specified in the lexicographer files.

⁴ http://wordnet.princeton.edu

The creation of a WordNet is a time-consuming and manpower-intensive exercise. The effort can be reduced to some extent by using text repositories such as the web and certain corpora, and also by translating an existing WordNet into another language. But the results of such attempts are often far from ideal, in the sense that the WordNet so produced contains synsets that have outlier words and/or missing words. Additionally, semantic relations may be inappropriately set up or may be missing altogether.

The first automatic method of WordNet evaluation was reported in [4]. The authors focused on verifying synonymy within non-singleton synsets and hypernymy between synsets, and devised rule-based algorithms to validate synonyms and hypernyms. The synonym validation was tested on the noun synsets of the Princeton WordNet (v2.1). Out of the 81426 noun synsets, 39840 contain more than one word, and only these were given as input to the validator. The result gave 70% accuracy where all words in synsets were validated, approximately 90% where half were validated and about 9% where no words were validated. The hypernym validation algorithm was able to validate 56203 out of 79297 noun hypernymy relation pairs in the Princeton WordNet, giving a validation percentage of 70.88%. The validation algorithm is available only for the Princeton WordNet. However, the approach should be broadly applicable to WordNets of other languages as well.

Hindi WordNet is the first WordNet developed for an Indian language. It is developed at CFILT, IIT Bombay.
Among other Indian languages, WordNets for Marathi, Bengali, Nepali, Oriya, Telugu, Malayalam, Konkani, Kashmiri, Manipuri etc. are being developed at different Indian institutions. The NE WordNet [3], covering the Assamese and Bodo languages, is being developed at Gauhati University.

4 Assamese WordNet

Assamese WordNet is a database of Assamese word forms (words and collocations) which are grouped together in the form of synsets. The synsets are interconnected with other synsets via a number of lexical and semantic relations such as hypernymy and hyponymy (the is-a relation), meronymy and holonymy (the part-of relation), antonymy etc. The lexical relationships hold between semantically related forms of words and the semantic relationships hold between related word definitions. A subgraph of a WordNet holding different relationships is shown in figure 2. Relations between the synsets in the Assamese WordNet are described below.

Fig. 2. Subgraph of a WordNet

- Hyponym and hypernym (is a kind of): Hypernymy is a semantic relation between two synsets that captures superset-hood. Similarly, hyponymy is a semantic relation between two synsets that captures subset-hood.
  Example: {গেন্ধাই ফুল, নাৰ্জি ফুল, গেন্দা ফুল} =>hr {ফুল, পুষ্প, কুসুম}; TF: {gendhAi phul, nArji phul, gendA phul} =>hr {phul, puspa, kusum}; EM: (name of a flower) =>hr (flower).
  ফুল (phul) is a hypernym of গেন্ধাই ফুল (gendhai phul). গেন্ধাই ফুল (gendhai phul) is a hyponym of ফুল (phul).
- Meronym and holonym (part-whole relation): This is a semantic relation between two synsets. If the concepts A and B are related in such a manner that A is one of the constituents of B, then A is the meronym of B and B is the holonym of A. The meronymy relation is transitive and asymmetrical. Holonymy is the reverse of meronymy. It is used to construct a part-of hierarchy.
  Example: {কোঠালি, কোঠা, কক্ষ} =>mh {ঘৰ, বাস-ভৱন, গৃহ}; TF: {kothAli, kothA, kakkhya} =>mh {ghor, bas-bhawan, griha}; EM: (room) =>mh (house).
  Here কোঠালি (kothali) is a part of ঘৰ (ghor). Therefore, কোঠালি (kothali) is a meronym of ঘৰ (ghor) and ঘৰ (ghor) is a holonym of কোঠালি (kothali).
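
The paired relations described above can be illustrated with a small in-memory sketch. It is only an illustration under our own assumptions: the names Synset, relate, related_words and INVERSE are hypothetical, and the sketch reflects neither the database file formats nor the API of the Assamese WordNet. It uses the transliterated forms from the two examples and records each relation together with its inverse.

```python
from dataclasses import dataclass, field

# Hypothetical relation names; each one maps to its inverse relation.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
}

@dataclass(eq=False)
class Synset:
    words: list   # transliterated word forms
    gloss: str    # English meaning of the concept
    # relation name -> list of related synsets (kept out of repr to avoid cycles)
    relations: dict = field(default_factory=dict, repr=False)

def relate(source, relation, target):
    """Record that `target` is the `relation` of `source`, plus the inverse link."""
    source.relations.setdefault(relation, []).append(target)
    target.relations.setdefault(INVERSE[relation], []).append(source)

def related_words(synset, relation):
    """Return the first word of every synset linked to `synset` by `relation`."""
    return [s.words[0] for s in synset.relations.get(relation, [])]

flower = Synset(["phul", "puspa", "kusum"], "flower")
marigold = Synset(["gendhAi phul", "nArji phul", "gendA phul"], "name of a flower")
room = Synset(["kothAli", "kothA", "kakkhya"], "room")
house = Synset(["ghor", "bas-bhawan", "griha"], "house")

relate(marigold, "hypernym", flower)  # phul is a hypernym of gendhAi phul
relate(room, "holonym", house)        # ghor is a holonym of kothAli

print(related_words(flower, "hyponym"))
print(related_words(house, "meronym"))
```

Storing the inverse link at insertion time means that a query in either direction (hypernym or hyponym, meronym or holonym) is a single dictionary lookup, at the cost of keeping the two directions consistent whenever a relation is added.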