Experiences in Building the Konkani WordNet Using the Expansion Approach

Shantaram Walawalikar, ILCI - Konkani Team, Goa University
Shilpa Desai, Dept. of Computer Science & Tech., Goa University
Ramdas Karmali, Dept. of Computer Science & Tech., Goa University
Sushant Naik, ILCI - Konkani Team, Goa University
Damodar Ghanekar, ILCI - Konkani Team, Goa University
Chandralekha D'Souza, Dept. of Konkani, Goa University
Jyoti Pawar, Dept. of Computer Science & Tech., Goa University

Abstract

WordNet can be described as an electronic lexical database available on-line as a powerful resource to researchers in computational linguistics, text processing and many other related areas. Currently, the necessity of building a WordNet has been felt for all the Indian languages to aid multilingual machine translation and cross-lingual information retrieval, and thereby to promote tourism, farming, education and other related areas for the overall growth and development of the nation. IIT Bombay has developed a number of tools, resources and facilities by which the WordNet of any language can be constructed through what is known as the expansion approach. Projects to create WordNets in most of the Indian languages using this approach, with the Hindi WordNet as the base, are currently in progress.

In this paper we report our experiences of creating a WordNet for Konkani using the expansion approach, with Hindi as the source language and Konkani as the target language. The Konkani WordNet is in the initial stage of development. The 1969 Hindi core synsets have been incorporated in the Konkani WordNet. The Offline Synset Linking Tool developed by IIT Bombay is being used for this task.

1. Introduction

WordNet can be described in short as a massive structure of words in a graph-like form. It is an electronic lexical database available as a powerful resource to researchers in computational linguistics, text processing and many other related areas. Since 1987, when WordNet first appeared globally, it has come a long way, getting itself moulded as per the ongoing requirements of the users and making use of the advancement of technology, viz. Computer Science and Communications. Indo WordNet is India's contribution to this global effort, and the steps towards the development of the Konkani WordNet शब्दमाळें shabdamAleM are a part of this initiative.

The layout of this paper is as follows: Section 2 discusses the characteristics of the Konkani language. A brief description of the Hindi WordNet, the expansion approach used to create the Konkani WordNet, the observations made during the WordNet creation process and the challenges faced are given in Section 3. Section 4 concludes the paper with a discussion of the future work plan.

2. Characteristics of Konkani Language

Konkani is one of the twenty-two languages included in the Eighth Schedule of the Constitution of India. It is also the official language of the State of Goa. Konkani is an Indo-European (Indo-Aryan) language derived from [...] through [...], and is influenced and enriched by various other languages like Marathi, [...], [...], Hindi, Portuguese and English. Though the Devanagari script is recognised as the official script of Konkani, it is also written in Roman and Kannada scripts. Old [...] is also found written in Malayalam and [...] scripts.

The first edition of the Konkani grammar titled 'Arte da Lingua Canarin' was written somewhere in 1617 A.D. by Fr. Thomas Stephens (Asmitai, 2008; Cunha, 1958). It was enlarged by Fr. Diogo Ribeiro, revised by four priests of the [...], and printed in 1640. This is considered to be the first published grammar not only of Konkani but of any Indian language. Monsignor Sebastiao Rudolpho Dalgado was the first known Indian lexicographer of Konkani, as those who preceded him were all European missionaries. He contributed to the development of Konkani with his three important works: 'Konkani - Portuguese Dictionary' (1893), 'Portuguese - Konkani Dictionary' (1905), and 'Bouquet of Konkani Proverbs', consisting of 2177 proverbs.

2.1 Pronunciation

Shennoi Goembab (1949), in his book 'Konkanichi Vyakarani Bandavol', discusses pronunciation in detail.

The Konkani vowels अ, ए, ओ, औ have additional pronunciations besides the original Sanskrit pronunciations. The अ in पणस paNasa 'jackfruit' is known as स्वरित svarita in Vedic Sanskrit. ए and ओ also have open pronunciations. These open pronunciations must have been influenced by [...] language. They are found in other Indian languages like Bengali, Bihari, Gujarati, Kannada, Telugu, Tamil, Malayalam, etc., but not in Marathi.

In Konkani the meaning of a word changes according to the pronunciation of a vowel in it, e.g. पेर pera 'guava fruit or guava tree', मोर mora 'peacock (sl.) or peacocks (pl.)', वोंवळ voMvaLa 'kind of flower - mimusops elengi flower or its tree'.

2.2 Number

Konkani has two numbers, singular and plural (Sardesai, 1986). The derivation of the plural form from the singular form depends on the gender and the phonetic characteristics of the singular form.

In some cases a change in the pronunciation of the vowel denotes the change in number, e.g. दोतोर dotora 'doctor or doctors', फातर phAtara 'stone or stones', देर dera 'brother-in-law or brothers-in-law', ओंठ oMTha 'lip or lips'.

2.3 Gender

Konkani has three genders: masculine, feminine and neuter. However, in some cases feminine nouns are also addressed as neuter, e.g. कमला आंगणांत खेळतालें kamalA AMgaNAMta kheLatAleM 'Kamala was playing in the courtyard'. Here खेळतालें refers to the neuter gender, whereas Kamala is otherwise a feminine noun.

It is also interesting to note that two synonymous nouns may have two different genders, e.g. रूख rUkha 'tree, masculine' and झाड jhADa 'tree, neuter'.

2.4 Word Structure

Konkani is a highly inflected language (Almeida, 1989). Nouns and [...] are inflected for number and case. Verbs are inflected for person, number, gender, tense and aspect. [...] are inflected for gender and number.

The structure of the Konkani word (Goembab, 1949; Borkar, 1986) can be depicted as under:

Nominal Base (N.B.) + Nominal Inflection (N.I.)
पुस्तकाचें pustakAcheM 'of the book'

(N.B.) + postposition
रामाकडल्यान rAmAkaDalyAna 'from Ram'

(N.B.) + (N.I.) + (N.I.)
पुस्तकांतलें pustakAMtaleM 'from the book'

(N.B.) + (N.I.) + postposition
पुस्तकांतल्यान pustakAMtalyAna 'from inside the book'

(N.B.) + (N.I.) + postposition + (N.I.)
पुस्तकापेल्यानचें pustakApelyAnacheM 'from beyond the book'

(N.B.) + (N.I.) + clitic
पुस्तकाचेंच pustakAcheMcha 'of the book itself'

(N.B.) + postposition + clitic
रामाकडल्यानय rAmAkaDalyAnaya 'also from Ram'

(N.B.) + (N.I.) + (N.I.) + clitic
पुस्तकांतलेंच pustakAMtaleMcha 'from the book itself'

(N.B.) + (N.I.) + postposition + clitic
जेवचेपासतच jevachepAsatacha 'only for meals'

(N.B.) + (N.I.) + postposition + (N.I.) + clitic
पुस्तकापेल्यानचेंय pustakApelyAnacheMya 'also from beyond the book'

2.5 Verb Base

The verbal base of Konkani has three sources (Goembab, 1949): present active base, present passive base and past passive [...]. The roots are either active or passive in sense, the passive being intransitive and the active being transitive. The following is a sample of these forms, separated from the base form of the verb.

Non Perfective

Intransitive verb: धांवप dhAMvapa 'to run'

Present
              Singular               Plural
1st person    धांवतां                 धांवतात
2nd           धांवता                  धांवतात
3rd           धांवता                  धांवतात

In the present tense, gender has no effect. The verb endings change in all other cases; they are differentiated below with the respective affixes in the sequence masculine, feminine, neuter.

Imperfect
              Singular               Plural
1st           धांवतालों लीं लें          धांवताले ल्यों लीं
2nd           धांवतालो ली लें          धांवताले ल्यो लीं
3rd           धांवतालो ली लें          धांवताले ल्यो लीं

Future
              Singular               Plural
1st           धांवतलों लीं लें          धांवतले ल्यों लीं
2nd           धांवतलो ली लें          धांवतले ल्यो लीं
3rd           धांवतलो ली लें          धांवतले ल्यो लीं

Transitive verb: खावप khAvapa 'to eat'

Present
              Singular               Plural
1st person    खातां                  खातात
2nd           खाता                   खातात
3rd           खाता                   खातात

Imperfect
              Singular               Plural
1st           खातालों लीं लें           खाताले ल्यों लीं
2nd           खातालो ली लें           खाताले ल्यो लीं
3rd           खातालो ली लें           खाताले ल्यो लीं

Future
              Singular               Plural
1st           खातलों लीं लें           खातले ल्यों लीं
2nd           खातलो ली लें           खातले ल्यो लीं
3rd           खातलो ली लें           खातले ल्यो लीं

Perfective

Intransitive

Present Perfect
              Singular               Plural
1st           धांवलां ल्यां लां          धांवल्यात ल्यांत ल्यांत
2nd           धांवला ल्या ला          धांवल्यात ल्यांत ल्यांत
3rd           धांवला ल्या ला          धांवल्यात ल्यांत ल्यांत

Past
              Singular               Plural
1st           धांवलों लीं लें           धांवले ल्यों लीं
2nd           धांवलो ली लें           धांवले ल्यो लीं
3rd           धांवलो ली लें           धांवले ल्यो लीं

Past Perfect
              Singular               Plural
1st           धांविल्लों ल्लीं ल्लें       धांविल्ले ल्ल्यों ल्लीं
2nd           धांविल्लो ल्ली ल्लें       धांविल्ले ल्ल्यो ल्लीं
3rd           धांविल्लो ल्ली ल्लें       धांविल्ले ल्ल्यो ल्लीं

Transitive

Present Perfect
              Singular               Plural
1st person    खाला                   खाल्यात
2nd           खाला                   खाल्यात
3rd           खाला                   खाल्यात

Past
              Singular               Plural
1st person    खालो                   खाले
2nd           खालो                   खाले
3rd           खालो                   खाले

Past Perfect
              Singular               Plural
1st person    खाल्लो                  खाल्ले
2nd           खाल्लो                  खाल्ले
3rd           खाल्लो                  खाल्ले

2.6 Contextual Word Usage

There are different Konkani words for a similar sense, denoting a variety of shades.

Example 2.6.1: An example of this is the noun 'stink', which in Konkani is गुठ्ठाण guTh.hThANa 'stink'. It is used in the following variations:

पोंवसाण poMvasANa 'smell of spoilt fish'.
हिंवसाण hiMvasANa 'natural smell of fish'.
खातसाण khAtasANa 'smell of urine'.
घामसाण ghAmasANa 'smell of sweat'.
कानुट्टाण kAnuT.hTANa 'smell of utensil in which food preparation of onion is made'.
भातसाण bhAtasANa 'smell of paddy crop'.
दर्बटाण darbaTANa 'smell of burning of dry chillies'.
धुंवट्टाण dhuMvaT.hTANa 'smell of smoke'.

Example 2.6.2: There are verbs which depict many shades of the word 'beating'.

मारप, बडोवप mArapa, baDovapa 'to beat'.
थापटावप thApaTAvapa 'to beat by slapping more than once'.
धुमकावप, कुमकावप dhumakAvapa, kumakAvapa 'to beat with blows'.
चिमटावप chimaTAvapa 'to beat by series of pinching'.
खोंटावप, गुड्डावप khoMTAvapa, guD.hDAvapa 'to beat by kicking continuously'.
बुड्डप buD.hDapa 'to wound with claws or nails'.
चेंचप cheMchapa 'to smash someone with stone etc.'.
धोंगसप dhoMgasapa 'to forcibly push someone with the edge of a stick'.
माड्डप, चिड्डप mAD.hDapa, chiD.hDapa 'to beat someone by putting under one's feet'.

2.7 Homographic Words

In Konkani we also come across homographic words, i.e. two words that are written alike but have different meanings.

Example 2.7.1: पेर pera, pronounced as 'pair' in English. The meanings are 1. guava fruit 2. joint of finger 3. part between two nodes of a stem.

Example 2.7.2: The word for mango tree is आंबो AMbo, and the mango fruit (sl.) is also आंबो with the same pronunciation; both are masculine. Further, the same word is used to denote the fruit as a group, e.g. अंदूं बाजारांत चड आंबो आयलोना aMduM bAjArAMta chaDa AMbo AyalonA 'This year there was not much mango fruit in the market'.

3. Expansion from Hindi to Konkani

3.1 Hindi WordNet (HWN)

The Hindi WordNet (Narayan et al., 2002; Miller, 1995), on which our expanded model is based, currently has 32950 synsets covering 77800 unique words. Out of these, 13830 synsets are linked with the synsets of the Princeton WordNet. The synsets are constructed abiding by the following three principles:

(i) Minimality - use the minimal set of words needed to make the concept unique.
(ii) Coverage - use the maximal set of words, ordered by frequency in the corpus, so as to include all possible words standing for the sense.
(iii) Replaceability - the example sentence should be such that the most frequent words in the synset can replace one another in the sentence without altering the sense.

3.2 Konkani WordNet (KWN) Creation Process

The Konkani WordNet is created using the expansion approach. In this approach, instead of reinventing the wheel, the readily available Hindi WordNet synsets developed by IIT Bombay are referred to. They are understood by the lexicographer one by one, and the corresponding synsets in Konkani expressing the same sense are created. Thus the HWN and KWN have identical glosses and examples as far as possible. This is being followed by many other Indian languages so that the resultant WordNet will take the shape of an IndoWordNet.

According to Vossen (1996), the MultiWordNet model seems less complex and guarantees the highest degree of compatibility across different WordNets. The development of any WordNet involves a large number of subjective and sometimes far from accurate decisions. Hence two WordNets built independently for two different languages will display differences. The expand model tends to reduce these subjective choices and the resultant discrepancies. To some extent it also helps in highlighting potential inconsistencies existing in the WordNet of the source language.

But this does not mean that the expansion model is without any drawbacks. As Vossen (1996) points out, it forces "an excessive dependency on the lexical and conceptual structure of one of the languages involved".

In Konkani at present (at the time of writing of this paper) all the core synsets are linked, covering around 3500 unique words. These synsets are classified according to parts of speech (nouns, verbs, and adjectives).

3.3 Observations

Hindi and Konkani being close languages, and the sentence structure of both being 'Subject Object Verb' (SOV), there was not much problem in maintaining identical concepts.

Our observations during the WordNet creation process can be subdivided under the following 8 broad categories.

Hindi English incorrect linkage

Some of the details are presented below.

Example 3.3.1: Id. 2897 - छिपकली ChipakalI - एक रेंगनेवाला जंतु जो प्रायः दीवारों पर दिखाई देता है eka reMganevAlA jaMtu jo prAyaH dIvAroM para dikhA_I detA hai 'a crawling creature mostly seen on walls'. The Hindi and English synsets have a wrong linkage. The English should have been House Lizard instead of Gecko.

Example 3.3.2: Id. 3016 - व्यवहार की वह प्रकृति जो लगातार दोहराव से प्राप्त होती है vyavahAra kI vaha prakRRiti jo lagAtAra doharAva se prApta hotI hai 'behavioral characteristic acquired due to constant repetition'. The Hindi concept is understood as "habit" by us, while it has been linked to "custom" in English.

Example 3.3.3: Id. 3464 - जिसे ख्याति मिली हो jise khyAti milI ho 'one who is famous'. The Hindi concept suggests that the concept is "famous", while the English synset is "popular".

Hindi concept/gloss definition not clear

Synset details of two examples falling in this category are given below.

Example 3.3.4: In Id. 231 the concept in Hindi reads as किसी देशका वह विभाग जिसके निवासियोंकी शासन पद्धती, भाषा, रहन सहन, व्यवहार आदी औरोंसे भिन्न और स्वतंत्र हो kisI deshakA vaha vibhAga jisake nivAsiyoMkI shAsana padhdatI, bhAshA, rahana sahana, vyavahAra AdI auroMse bhinna aura svataMtra ho 'that territory of any nation, of which the residents have administrative system, language, customs, tradition, etc. different and independent from others'.
For this the synsets are प्रदेश pradesha, राज्य rAjya, प्रांत prAMta. Here the mention of 'administrative system, language, customs, traditions etc. different and independent from others' is superfluous. Since the linkage to English synsets was available, it was referred to. The English concept reads as 'the territory occupied by one of the constituent administrative district of a nation', with synsets state, province. This is a more appropriate concept.

Example 3.3.5: In Id. 882 the Hindi concept reads as संध्या का वह समय जब चरकर लौटनेवाली गौओंके खुरोंसे धूल उडती है saMdhyA vaha samaya jaba charakara lauTanevAlI gau_oMke khuroMse dhUla uDatI hai 'that time of the evening when the dust from the legs of cattle returning after grazing spreads in the air'.
The synset for the same is गोधूलि बेला godhUli belA. The synset literally translates as 'the time of dust from cattle'. Etymologically the word 'godhuli' (go = cow; dhuli = dust) may have originated from this time coinciding with the return of grazing cattle, who come running through the dusty lands so that the red dust gets mingled in the whole atmosphere around. But this description need not be a part of the concept. A simple definition like 'a short span of time before and after the sunset' will meet the true sense of the concept and also abide by the principle of minimality. An English synset was not available for this.

English concept/gloss definition not clear

Although our source language is Hindi, we had referred to the English synset to get a better idea of the concept, during which we made these observations. Following is an example.

Example 3.3.6: Id. 3052 - कोई वस्तु खरीदने या बेचने पर उसके बदले में दिया जानेवाला धन ko_I vastu kharIdane bechane para usake badale meM diyA jAnevAlA dhana 'Money paid when any goods are bought or sold'. The English concept given as 'cost of bribing someone' is not appropriate to convey the meaning of price.

English synset missing

As stated earlier in Section 3.1, only 13830 synsets are linked with the synsets of the Princeton WordNet. Hence we found many synsets falling in this category.

English example missing

In some of the cases where the English synset was linked, the examples were found missing.

Hindi example could have been better

We felt that more examples could be used; these would enrich the WordNet overall and improve the accuracy of the applications using it.

English example could have been better

The same observation as above can be made with respect to the English examples.

Recursive definition of concepts

It was also observed that in certain concepts the definition was recursive, i.e. the synset itself was referred to in the concept.

3.4 Challenges Faced

Linking culture specific concepts

The customs and culture played a challenging role. We have experienced in this exercise that very culture specific concepts do not have their parallels in other languages. The linking of such synsets to other languages remains a question.
In the Hindi region a chhapati vendor possibly comes door to door selling his chhapatis (a thinly made bread-like eatable prepared from wheat flour). The Konkani speaking populace is not familiar with this scene, but they have met a vendor popularly known as पदेर padera 'bread seller' visiting residential areas.
There was another such example of a type of saree of length nine yards, popularly known as णववारी NavavArI (or नउवारी na_uvArI in Marathi), which women from Goa, [...] and other parts of India wear. A major part of the population may not be aware of this concept.

Linking of contextual words

Using the expansion approach, certain synsets may get omitted altogether because of the variety of shades of meaning of different words, as mentioned in Section 2.6 above.

Coverage of synsets

The question also arises with respect to the coverage of some of the synsets. Many words, though their meaning is known to the people, are not in parlance or common in literature; one may find them possibly in poetry. A glaring example is सूर्य sUrya 'the sun'. Many people know that रवि ravi and आदित्य Aditya are other names of the sun. Likewise there are many words used for the sun in the Puranas (ancient literature). Whether we have to cover these is a question.
There is also the role of metaphorical usage of words: should they be included in the synset? E.g. सुंगट suMgaTa 'prawn' is a commonly used metaphor in Konkani for a slim girl.

Linking a concept not present in the source language

Consider the concept of a nine-yard saree. A synset for this concept is not available in HWN. Hence the Marathi WordNet (MWN) has already created a synset for this concept, and the Id number has been assigned by MWN on its own. Since the other member languages would not know of the existence of this synset, they would duplicate it under different Ids. Hence a centrally controlled system for issuing Ids will have to be established.

Coining of new words

Another issue that remains to be resolved is how far the lexicographer can be given liberty to coin new words. This issue comes up if a language does not have a word for a concept (which typically happens for culture specific situations). This question will arise only after the other alternatives, like transliteration and multiword expressions (short phrases), are explored.

Computational concerns: Interface, efficiency of access and storage

The interface of the Offline Synset Linking Tool could also show the relations like hypernymy, hyponymy and antonymy already defined for the source language synsets, so that a relation could be changed if it does not correspond in the target language.

4. Conclusion and Future Work

WordNet has been a very essential constituent of any linguistic study. Hence the creation of a WordNet for the Konkani language has been started. The expansion approach has been found most convenient to speed up the exercise. The software tool provided was also found adequate for the purpose. Though only the core synsets have been linked for the time being, the project is gaining momentum and the rest of the synsets will be linked with greater speed than earlier. The Konkani WordNet शब्दमाळें shabdamAleM is at the initial stage of creation. Currently only the concepts, synsets and examples have been dealt with. However, it is required to check all the semantic relations like synonymy, hypernymy, hyponymy, meronymy, holonymy, troponymy, entailment etc. Even the concept of gradations will have to be introduced as in the Hindi WordNet. It is felt that the existing examples from the HWN should also be strengthened with our own additional examples. These examples could be taken from any of the existing Konkani corpora. When the project is completed it will be a useful tool for the computational study of the Indian languages and a valuable asset for the Konkani language in particular.

Acknowledgement

We wish to express our gratitude to the Indian Institute of Technology, Bombay (IITB) Hindi WordNet Team for providing the tools and guiding us in our process of creating the Konkani WordNet. We thank the Indian Language Corpora Initiative (ILCI) 11(12)/2008 - HCC (TDIL) project team members for giving inputs and support to this Konkani WordNet creation process. We also acknowledge that we were able to carry out the work using some of the equipment that was purchased from the AICTE funding under RPS scheme 8023/BOR/RPS/091/06/07.

5. References

Almeida Matthew 1989. A description of Konkani. Thomas Stephens Konkani Kendr.

Asmitai Pratishthan 2008. Dalgado on Konkani.

Borkar S. [...] 1986. Konkani Vyakaran, Konkani Bhasha Mandal, [...].

Cunha Rivara J. H. 1958. An Historical Essay on the Konkani Language.

Dipak Narayan, Debasri Chakrabarty, Prabhakar Pande and P. Bhattacharyya 2002. An Experience in Building the Indo WordNet - a WordNet for Hindi. First International Conference on Global WordNet, Mysore, India, January 2002.

Goembab Shennoi 1949. Konkanichi Vyakarani Bandavol, Gomantak Chhapkhano, Girgaum [...].

Miller, G. A. 1995. "WordNet: a Lexical Database for English". Communications of the ACM 38 (November 1995): 39 - 41.

Miller, G. A., Fellbaum, C., and Miller K. J. 1993. Five Papers on WordNet [Computer file] [2006, November 2].

Sardesai Madhavi 1986. Some aspects of Konkani grammar. Department of Linguistics, Deccan College.

Vossen P. 1996. Right or wrong: combining lexical resources in the EuroWordNet project. Proceedings of Euralex-96 International Congress.
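As an illustrative aside to Sections 3.2 and 3.4, the following is a minimal sketch of two ideas discussed there: a target-language synset that reuses the Id of its source synset (the expansion approach), and a central registry that issues one Id per new concept so that member languages do not duplicate it. It is not the Offline Synset Linking Tool and does not reflect its data format; the names `Synset`, `expand` and `IdRegistry`, the starting Id 100000, and the Konkani placeholder strings are all assumptions made for illustration. Only the Hindi Id 231 and its three synset words come from Example 3.3.4.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A concept in one language: shared Id, gloss, and member words."""
    synset_id: int
    gloss: str
    words: List[str] = field(default_factory=list)


def expand(source: Dict[int, Synset], synset_id: int,
           target_gloss: str, target_words: List[str]) -> Synset:
    """Create the target-language synset for an existing source synset.

    The target synset reuses the source Id; sharing the Id is what keeps
    the two WordNets linked in this sketch of the expansion approach.
    """
    if synset_id not in source:
        raise KeyError(f"no source synset with Id {synset_id}")
    return Synset(synset_id, target_gloss, list(target_words))


class IdRegistry:
    """Hypothetical central registry that issues Ids for new concepts."""

    def __init__(self, first_free_id: int) -> None:
        self._next_id = first_free_id
        self._issued: Dict[str, int] = {}

    def issue(self, concept_key: str) -> int:
        """Return the Id for a concept, allocating one only on first request."""
        if concept_key not in self._issued:
            self._issued[concept_key] = self._next_id
            self._next_id += 1
        return self._issued[concept_key]


# Source synset from Example 3.3.4 (Hindi Id 231); the gloss is abbreviated here.
hindi = {231: Synset(231, "Hindi gloss of Id 231", ["प्रदेश", "राज्य", "प्रांत"])}

# The Konkani gloss and words below are placeholders, not real WordNet entries.
konkani_231 = expand(hindi, 231, "KONKANI_GLOSS_PLACEHOLDER",
                     ["KONKANI_WORD_PLACEHOLDER"])

# Two member languages asking for the same new concept receive the same Id.
registry = IdRegistry(first_free_id=100000)  # starting Id is an assumption
marathi_id = registry.issue("nine-yard saree")
konkani_id = registry.issue("nine-yard saree")
```

Keying the registry on a plain string is the simplest possible choice; a real system would need an agreed, language-independent way to identify a concept before an Id can be requested for it.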