Introduction to Gujarati Wordnet

Abstract

Gujarati is one of the 22 official languages of India. It is an Indo-Aryan language descended from Sanskrit. Gujarati wordnet is being built using the expansion approach with Hindi as the source language. This paper describes experiences of building Gujarati wordnet. It discusses basic features of Gujarati and evaluates the suitability of Hindi for the expansion approach. Various issues related to synset linking using the expansion approach and challenges related to language-specific concepts are also discussed.

1 Introduction

Wordnets have emerged as a very useful resource for computational linguistics and many natural language processing applications. Since the development of Princeton WordNet (Fellbaum C., 1998), wordnets are being built in many other languages. Hindi Wordnet (Narayan D. et al., 2002) was the first wordnet for the Indian languages. Based on Hindi wordnet, wordnets for 17 different Indian languages are being built using the expansion approach. One such effort is Gujarati wordnet. This paper describes experiences of building Gujarati wordnet.

The paper is organized as follows. Section 2 gives an introduction to the Gujarati language, Section 3 discusses the basic features of Gujarati, and Section 4 describes the influence of other languages on Gujarati and justifies the use of Hindi as the base language for Gujarati wordnet development. The synset development approach and synset categorization are discussed in Sections 5 and 6 respectively. Section 7 gives the current status of Gujarati wordnet. Issues related to synset linking are discussed in Section 8.

2 Gujarati Language

Gujarati, the native language of the Indian state of Gujarat, is a member of the Indo-Aryan family of languages. There are over 50 million speakers of Gujarati. Initially, the Gujarati script was restricted to business writing, while literature was written in Devanāgarī script. The poetic form of the language is much older, enriched by the work of poets like Narsinh Mehta. Gujarati prose writing and journalism started in the 19th century. Protest writing against colonialism led to a string of powerful essays that laid the foundation of modern Gujarati prose.

3 Features

Some features of the Gujarati language are as follows:

3.1 Writing system

Gujarati script is a variant of Devanāgarī script, differentiated by the loss of the characteristic horizontal line running above the letters and by a small number of modifications in the remaining characters. For example:

Hindi: कमल (kamal), Gujarati: કમળ

3.2 Vocabulary

As Gujarati is an Indo-Aryan language descended from Sanskrit, its vocabulary contains four categories of words: tatsama, tadbhava, deshi and videshi words.

• tatsama: Words accepted from Sanskrit unchanged.

• tadbhava: Words adopted from Sanskrit with a change in phonological form.

• deshi: Words which are specific to the Gujarati language.

• videshi: Words accepted from other languages, such as Persian, English and Portuguese.

It is also noteworthy that in some cases the tatsama and tadbhava words for a Sanskrit word co-exist, with the same or with different meanings. For example, (1) ધર્મ (Dharma) and ધરમ (Dharam) both mean 'Religion', while (2) કર્મ (karma) means work with a religious connotation and કરમ (karam) means work in the general sense.

3.3 Grammar

Gujarati follows Subject-Object-Verb word order. There are three genders and two numbers. There are no articles. Some significant features are as follows:

3.3.1 Gender

Gujarati distinguishes between three genders: masculine, feminine and neuter. For example:

છોકરો (chhokaro, Boy)
છોકરી (chhokarI, Girl)
છોકરૂ (chhokarU, Small kid)

However, gender markers do not always represent biological gender:

મંકોડો (mankodo, Big Ant)
મંકોડી (mankodI, Small Ant)

3.3.2 Adjectives

Adjectives agree with nouns in number and gender. A feminine adjective does not take a plural marker when agreeing with a plural noun of feminine gender. For example:

Masculine singular: સારો છોકરો ('saro chhokaro', Good boy)
Masculine plural: સારા છોકરાઓ ('sara chhokarao', Good boys)
Feminine singular: સારી છોકરી ('sari chhokari', Good girl)
Feminine plural: સારી છોકરીઓ ('sari chhokario', Good girls)

3.3.3 Structure of verbs

Gujarati verbs have a root+infinitive structure. Gujarati extends the root to make a causative sentence. For example:

ઝાડ પડયુ. ('Zaad padyu', A tree fell)
રામે ઝાડ પાડયુ. ('raame Zaad paadyu', Ram caused the tree to fall)
કાને રામ પાસે ઝાડ પડાવયુ. ('kaane raam paase Zaad padaavyu', Kan made Ram cause the tree to fall)

4 Influence of other languages on Gujarati

4.1 Comparison with Hindi

As an Indo-Aryan language, Gujarati is very similar to Hindi. A brief comparison of Gujarati with Hindi is as follows:

• Gender: Gujarati defines three genders while Hindi has only two.

• Writing system: Gujarati does not have the horizontal line running above the letters, and a few characters are modified.

• Causative verbs: Both Hindi and Gujarati handle causative verbs in the same fashion.

• 'Want' and 'should': Both Hindi and Gujarati handle "I should ..." and "I want ..." in similar ways. Gujarati uses 'jo', which is similar to 'chah' of Hindi. For example, 'I should go home now.' is written as:
Hindi: 'मुझे घर जाना चाहिए।'
Gujarati: 'મારે ઘરે જવુ જોઇએ.' (mare ghare javu joiAe)

4.2 Influence of other languages

Other languages have also influenced Gujarati. As India was ruled at various times by foreign powers, among them the English and the Portuguese, these languages have left their mark on Gujarati.

• Urdu influence: The following words demonstrate Urdu influence on Gujarati:
દાવો (Urdu: dava, English: Claim)
ફાયદો (Urdu: fayda, English: Benefit)
કાયદો (Urdu: kayda, English: Law)
ખરાબ (Urdu: kharab, English: Bad)

• English influence: Most Indian languages have adopted many English words, and Gujarati is no exception. For example:
બેંક : Bank
ફોન : Phone
ટેબલ : Table

• Portuguese influence: Some Portuguese words adopted in Gujarati are as follows:
સાબુ : 'saabu', soap
બટાટા : 'bataataa', potato
પાદરી : 'paadarI', father (Christian priest)

Thus, Gujarati has a rich set of words derived from Indian languages as well as foreign languages. This insight helps in selecting an approach for building the wordnet.

5 Synset Development Approach

Gujarati wordnet is being built using the expansion approach (Vossen P., 1998). In this approach, synsets are created by referring to the existing wordnet of a related language. Hindi is used as the source language to create the synsets of Gujarati. The benefits of this approach are: (1) Wordnet development becomes faster, as the gloss and synset of the source language are already available as reference. (2) It provides linking between the synsets of different languages, which can be used for machine translation applications.

The task of synset development for Gujarati is further simplified by the availability of online lexical resources like 'Bhagavad Go Mandal' (Patel C. B. (ed), 1958) and 'Gujarati Lexicon' (Chandaria R., 2006). 'Bhagavad Go Mandal' contains around 8.2 lakh (820,000) words spread across 9 volumes. 'Gujarati Lexicon' is a more recent effort. The online interface of Gujarati Lexicon provides easy access to meanings, synonyms, antonyms, idioms, proverbs and phrases. These two resources are of great help in building synsets.

As Gujarati is closely related to Hindi, most Gujarati synsets are created by translating Hindi synsets. However, emphasis was given to understanding the concept independently of any language and only then creating the synset. Though the notion of a concept is defined independently of language, it was often observed that a concept present in Hindi was not present in Gujarati, or that the concept was present but there was no indigenous lexeme for it.

6 Synset Categorization

As described in the previous section, there is sometimes disagreement on concepts across languages. Many concepts of Hindi are not present in other languages, or there is no indigenous lexeme for the concept in the other language. So, to facilitate synset linking across languages, Hindi synsets are divided into the following categories:

• Universal: This set of concepts is present in all languages, is essential and is most frequently used. For example, 'सूर्य' (sun). Most of these concepts belong to the top level of the wordnet and are directly linked with English WordNet and SUMO.

• Pan-Indian: This set of concepts is common to all Indian languages and linkable across all of them, but has no parallel concept in English. For example, 'तबला' (tabala) (an Indian rhythm instrument).

• In-Family: These are concepts common to specific subsets of Indian languages and linkable across all languages of the family. For example, 'चाचा' (chacha) (paternal uncle) and 'भतीजा' (bhatija) (brother's son).

• Language Specific: These concepts are specific to a language and to its culture. They include local food, festivals, etc. For example, 'बीहु' (Bihu) (name of a festival celebrated in the Assam state of India) is very specific to the state and its culture and does not appear in any other language. These concepts appear very low in the hierarchy of the wordnet and normally represent instances or individuals.

• Rare: This includes very specific words adopted in most languages, such as specific technical or scientific terms like 'ngram'.

• Synthesized: These are synsets created in a language due to the influence of other languages. These synsets are not natural to the language but are needed to link synsets of two different languages.

Such classification of synsets helps in linking concepts of different languages. For example, if a synset belongs to the Universal category then it is present in both Hindi and English. If a synset belongs to the Pan-Indian category then it belongs to both Hindi and Gujarati. Thus, wordnet development using the expansion approach becomes faster with this method.

Till date, 7163 universal synsets and 1356 Pan-Indian synsets have been manually identified and are now linked across all languages. Out of 7163 universal synsets, 7012 are directly linked with an English wordnet synset and 24 are linked through hypernymy. Out of 1347 Pan-Indian synsets, 287 are directly linked with an English wordnet synset and 125 are linked through hypernymy. The 24 Universal synsets represent concepts which are not present in a specific Indian language. The 287 directly linked synsets represent concepts which have been adopted in English. Language specific synsets are being developed and will then be linked by translating them into Hindi and English.

7 Synset Development status

Till date, 15595 synsets, covering 42537 words, have been built in the Gujarati wordnet. The category-wise count of synsets is as follows:

Universal: 7169
Pan-Indian: 1348
Language specific: 108
Verb: 1799
Adverb: 210
Adjective: 3606

8 Issues related to synset development

During the development of synsets, some disagreements were observed between Hindi concepts and Gujarati concepts.

8.1 Hindi synsets not linked with Gujarati

The following are some examples of Hindi synsets not linked with Gujarati:

• Difference in concept description
Concept: तुरही की तरह का एक बड़ा बाजा (a large trumpet-like instrument)
Example: "नरसिंहा की आवाज़ दूर-दूर तक सुनाई देती है" (the sound of the narsinha is heard far and wide)
Synset: नरसिंहा, नरसिंगा, गोमुख
No such concept is identified in Gujarati. However, Gujarati has a concept for a similar instrument which is used at the war-front to announce the beginning of a war.

• No indigenous lexeme in Gujarati
Concept: इत्र का व्यापार करनेवाला व्यक्ती (a person who trades in perfume)
Example: "आजकल, इत्र व्यापारी नक़ली इत्र का व्यापार भी करने लगे हैं" (nowadays perfume traders have also started dealing in fake perfume)
Synset: इत्र व्यापारी, इत्र फरोश, इत्र फ़रोश, अत्तार, गंधी, गन्धी
There is no indigenous lexeme for this concept in Gujarati.

• Confusing gloss
Concept: एक छोटा पक्षी जो प्रायः अपना घोसला मकानो में बनाता है (a small bird that usually builds its nest in houses)
Example: "गौरैया अपने बच्चो को दाना चुगा रही है" (the sparrow is feeding grain to its chicks)
Synset: गौरैया, गौरेया, वृषायण, आकली
The concept is general and exists in Gujarati, but it is difficult to identify the Gujarati name of the bird from the synset.

• Difficult to adopt
Concept: जो प्रवीष्ट न हुआ हो (one who has not entered)
Example: "अप्रवीष्ट महेमानो को शीघ्र ही भीतर प्रवेश करने दे" (let the guests who have not yet entered come inside soon)
Synset: अप्रवीष्ट
Though this word can be translated into Gujarati, it is not a native concept in Gujarati.

• No such concept in Gujarati
Concept: जो अकेला चरता या वीचरण करता हो (one that grazes or roams alone)
Example: "जंगली सूअर एक पृथकचर पशु है" (the wild boar is a solitary animal)
Synset: पृथकचर
There is no such concept in Gujarati.

The concepts described above are not part of general vocabulary and represent very specific nouns. There was no difficulty in linking verbs, adjectives or causative verbs. This is due to the similarity between Hindi and Gujarati. Out of around 7800 Hindi concepts referred to so far, around 7500 were linked to Gujarati.

8.2 Language specific synset

While the major part of the day-to-day vocabulary of Gujarati is similar to that of Hindi, there are some concepts which are very specific to Gujarati and to Gujarati culture. These concepts refer to food items, places, traditions, religion, etc. Some examples are as follows:

• Culture specific concept
Concept : કોઇ ખાસ પ્રસંગે કસુંબો પીવા માટે ભેગા થવું (to gather on a special occasion to drink kasumbo)
Example : "ગુજરાત ના કોઇ ગામો માં આજે પણ ડાયરા થાય છે" (even today such gatherings take place in some villages of Gujarat)
Synset : ડાયરો (डायरो, Daayaro)

• Tradition specific concept
Concept : એક ફળ કે જે લગ્ન પ્રસંગે વર કન્યા ના હાથે બાંધે છે (a fruit that is tied to the hands of the bride and groom at a wedding)
Example : "લગ્ન પછી વર કન્યા મીંઢળ છોડે છે" (after the wedding the bride and groom untie the mindhal)
Synset : મીંઢળ (मीढळ, mIMdhaL)

• Religion specific concept
Concept : મોક્ષ માટે ભગવાન નું નામ લેતા લેતા ગીરનાર પર થી પડતું મુકવું (to throw oneself from Girnar while chanting the name of God in order to attain salvation)
Example : "ગીરનાર શીખર પર થી ભક્તો ભૈરવજપ કરતા હતા" (devotees used to perform bhairavajapa from the Girnar peak)
Synset : ભૈરવજપ (भैरवजप, bheiravajapa)

9 Conclusion

The existence of Hindi wordnet and the similarity between Hindi and Gujarati helped the development of Gujarati wordnet. Resources like 'Bhagavad-Go-Mandal' and 'Gujarati Lexicon' were also found to be very useful in the synset development process. Synset categorization further simplified the synset linking process. It is observed that most of the top level concepts are common and easily linked. The concepts that vary across languages are specific to the culture and tradition of the people. Mostly these are noun concepts and do not have hyponymy. Many of these are singleton synsets that appear very low in the wordnet concept hierarchy. The future work is to identify and link language specific and in-family concepts. It is also required to develop lexical relations and to evaluate the suitability of the semantic relations of Hindi wordnet for Gujarati.

References

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Piek Vossen. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers.

D. Narayan, D. Chakrabarty, P. Pande and P. Bhattacharyya. 2002. An experience in building the Indo WordNet - a WordNet for Hindi. International Conference on Global WordNet (GWC02), Mysore, India.

C. B. Patel (ed). 1958. Bhagavad-Go-Mandal. http://www.bhagavadgomandalonline.com

Ratilal Chandaria. 2006. Gujarati Lexicon. http://www.gujaratilexicon.com
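
As an illustration of how the synset categories of Section 6 can guide linking under the expansion approach of Section 5, the following is a minimal sketch and not the tooling actually used for Gujarati wordnet. The names `Category`, `Synset`, `is_directly_linkable` and `expand`, the shared `synset_id` field and the contents of `INDIAN_LANGUAGES` are all assumptions made for this example. The rule encoded is only the one stated in the text: a Universal synset is taken to exist in every language, a Pan-Indian synset in every Indian language, and everything else is left for manual treatment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Category(Enum):
    """Synset categories named in Section 6."""
    UNIVERSAL = "universal"
    PAN_INDIAN = "pan-indian"
    IN_FAMILY = "in-family"
    LANGUAGE_SPECIFIC = "language-specific"
    RARE = "rare"
    SYNTHESIZED = "synthesized"


# Hypothetical and deliberately incomplete set of Indian languages.
INDIAN_LANGUAGES = {"hindi", "gujarati"}


@dataclass
class Synset:
    synset_id: int        # hypothetical shared ID used to link languages
    language: str
    words: List[str]
    category: Category


def is_directly_linkable(category: Category, target_language: str) -> bool:
    """Apply only the rule stated in the text; other categories need manual work."""
    if category is Category.UNIVERSAL:
        return True
    if category is Category.PAN_INDIAN:
        return target_language in INDIAN_LANGUAGES
    return False


def expand(source: Synset, target_language: str,
           target_words: List[str]) -> Optional[Synset]:
    """Create a target synset that reuses the source ID, or None if not linkable."""
    if not target_words:  # no indigenous lexeme was found for the concept
        return None
    if not is_directly_linkable(source.category, target_language):
        return None
    return Synset(source.synset_id, target_language,
                  list(target_words), source.category)


sun = Synset(1, "hindi", ["सूर्य"], Category.UNIVERSAL)
tabla = Synset(2, "hindi", ["तबला"], Category.PAN_INDIAN)

gujarati_sun = expand(sun, "gujarati", ["સૂર્ય"])
print(gujarati_sun.synset_id == sun.synset_id)   # the link is the shared ID
print(expand(tabla, "english", ["tabla"]))       # Pan-Indian: no English link
```

Reusing the source identifier is what makes the resulting synsets linkable across languages in this sketch; the cases of Section 8.1, where no Gujarati lexeme exists, correspond to the empty `target_words` branch.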
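
The close relationship between the two scripts noted in Section 3.1 is also visible in their Unicode encoding, where the Gujarati block (starting at U+0A80) largely parallels the Devanāgarī block (starting at U+0900). The sketch below is an illustration under that assumption; the function name `devanagari_to_gujarati` and the conservative code point range are choices made for this example. It converts script only, not language, so it is not a substitute for translating a synset.

```python
import unicodedata

# Offset between the Devanagari (U+0900) and Gujarati (U+0A80) blocks.
GUJARATI_OFFSET = 0x0A80 - 0x0900

# Conservative range: beyond U+0970 the two blocks no longer line up.
FIRST_MAPPED = 0x0901
LAST_MAPPED = 0x0970


def devanagari_to_gujarati(text: str) -> str:
    """Shift Devanagari characters to the parallel Gujarati code points."""
    out = []
    for ch in text:
        cp = ord(ch)
        if FIRST_MAPPED <= cp <= LAST_MAPPED:
            candidate = chr(cp + GUJARATI_OFFSET)
            # Keep the original character when Gujarati has no counterpart
            # (the shifted code point is unassigned, e.g. for the danda).
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)


print(devanagari_to_gujarati("कमल"))  # કમલ
```

For the example of Section 3.1, the conversion of कमल gives કમલ, whereas the Gujarati word is કમળ. The difference in the last letter shows why Gujarati synsets have to be created by lexicographers and cannot simply be obtained by transliterating Hindi synsets.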