
Introduction to Gujarati Wordnet

Prof. C. K. Bhensdadia
[email protected]
Department of Computer Engg., Dharmsinh Desai University

Brijesh Bhatt
[email protected]
Department of Computer Science and Engineering, Indian Institute of Technology

Prof. Pushpak Bhattacharyya
[email protected]
Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai

Abstract

Gujarati Wordnet is the youngest member of the IndoWordnet[1]. As a part of the IndoWordnet project, a Wordnet for the Gujarati language is being developed from the Hindi Wordnet using the expansion approach. This paper reviews the Gujarati Wordnet development process. It describes the basic features of the Gujarati language and evaluates the suitability of Hindi as a source language. The current status of the work and the issues met in development are also described.

1. Introduction

WordNet[2] is a machine-readable lexical database for English developed at Princeton University. It has evolved into a highly valuable resource for natural language processing applications. Following the Princeton WordNet, wordnets for many other languages were developed across the globe. The first wordnet for an Indian language is the Hindi wordnet[3], developed at the Indian Institute of Technology, Bombay. Recently, efforts have been under way to develop wordnets for many Indian languages. One such effort is to build the Gujarati wordnet from the Hindi wordnet using the expansion approach.

The layout of the paper is as follows: Section 2 gives an introduction to the Gujarati language. Section 3 describes the historic influence of other languages on Gujarati and justifies the use of Hindi as the base language for Gujarati Wordnet development. Section 4 describes the expansion approach selected for the Wordnet development. Section 5 describes the status of Gujarati Wordnet development and some issues related to synset linking.

2. Gujarati Language

Gujarati, a native language of the Indian state of Gujarat, is a member of the Indo-Aryan family of languages. There are over 50 million speakers of Gujarati, and it is one of the 22 official languages of India. Incidentally, Gujarati was the mother tongue of Gandhiji (Mohandas K. Gandhi, father of India) and Mohammed Ali Jinnah (father of Pakistan).

2.1 History

Initially, the Gujarati script was restricted to business writing, while literature was written in the Devanāgarī script. The poetic form of the language is much older, enriched by the poetry of poets like Narsinh Mehta. Gujarati prose writing and journalism started in the 19th century. Protest writing against colonialism led to a string of powerful essays, leading to the foundation of modern Gujarati literature.

2.2 Features

Some features of the Gujarati language are as follows:

2.2.1 Writing system: The Gujarati script is a variant of the Devanāgarī script, differentiated by the loss of the characteristic horizontal line running above the letters and by a small number of modifications in the remaining characters.
For example:
Hindi: कमल (kamal)
Gujarati: કમળ

2.2.2 Vocabulary: As Gujarati is an Indo-Aryan language descended from Sanskrit, its vocabulary contains four general categories of words: Tatsam, Tadbhav, Native and Loan words.
Tatsam: Words accepted unchanged from Sanskrit.
Tadbhav: Words adopted from Sanskrit with a change in phonological form.
Native: Words which are specific to the Gujarati language.
Loan words: Words accepted from other languages, such as Persian, English and Portuguese. Section 3 describes such words in more detail.

It is also noteworthy that in some cases the tatsam and tadbhav words for the same Sanskrit word co-exist, with the same or different meanings.
For example:
(1) ધર્મ (Dharma) and ધરમ (Dharam) both mean 'religion'.
(2) કર્મ (Karma): work, with a religious connotation
    કરમ (karam): work

2.2.3 Grammar: Gujarati follows Subject-Object-Verb order. There are three genders and two numbers. There are no articles. Some significant features are as follows:

2.2.3.1 Gender: Gujarati distinguishes between three genders: masculine, feminine and neuter. However, the gender marker does not always represent biological gender.
For example:
છોકરો (chhokaro)   Boy
છોકરી (chhokari)   Girl
મંકોડો (mankodo)   Big ant
મંકોડી (mankodi)   Small ant

2.2.3.2 Adjectives: An adjective agrees with the noun in number and gender. A feminine adjective does not take a plural marker when agreeing with a plural noun of feminine gender.
For example:
(1) Masculine singular
    સારો છોકરો (sar-o chhokar-o)
    Good boy
(2) Masculine plural
    સારા છોકરાઓ (sar-a chhokara-o)
    Good boys
(3) Feminine singular
    સારી છોકરી (sar-i chhokar-i)
    Good girl
(4) Feminine plural
    સારી છોકરીઓ (sar-i chhokari-o)
    Good girls

2.2.3.3 Structure of verbs: Gujarati verbs have a root+infinitive structure. Gujarati extends the root to form a sentence; the examples below show the same root extended into causative forms.
For example:
(1) ઝાડ પડ્યું.
    (Zaad paDyu)
    A tree fell.
(2) રામે ઝાડ પાડ્યું.
    (Rame Zaad paaDyu)
    Ram caused the tree to fall.
(3) કાને રામ પાસે ઝાડ પડાવ્યું.
    (Kane Ram paase Zaad padaVyu)
    Kan caused Ram to make the tree fall.

3. Influence of other languages on Gujarati

As an Indo-Aryan language, Gujarati is very similar to Hindi, Marathi and Punjabi. The grammar and vocabulary of Gujarati are very similar to those of Hindi, with a few exceptions. A brief comparison is as follows:

(1) Gender: As described in Section 2, Gujarati defines three genders while Hindi has only two.

(2) Writing system: Gujarati dropped the horizontal line running above the letters, and a few characters are modified, as shown in the previous section.

(3) Causative verbs: Both Hindi and Gujarati handle causative verbs in the same fashion.
For example,
Hindi:    रोना (rona)    रुलाना (rulana)     रुलवाना (rulavana)
is similar to
Gujarati: રડવું (radvu)  રડાવવું (radavavu)  રડાવરાવવું (radavravavu)

(4) 'Want' and 'should': Both Hindi and Gujarati handle "I should ..." and "I want ..." in similar ways. Gujarati uses 'jo', which is similar to 'chah' in Hindi.
For example,
I should go home now.
in Hindi,
मुझे घर जाना चाहिए।
in Gujarati,
મારે ઘરે જવું જોઈએ.
(mare ghare javu joiAe)

However, other languages have also influenced Gujarati. As India was ruled by Muslims, the English and the Portuguese, their languages have left their mark on Gujarati.

Urdu influence: The following words demonstrate Urdu influence on Gujarati:
Gujarati   Urdu     English
દાવો       dava     Claim
ફાયદો      fayda    Benefit
કાયદો      kayda    Law
ખરાબ       kharab   Bad

English influence: Most Indian languages have adopted many English words, and Gujarati is no exception.
For example,
બેંક : Bank
ફોન : Phone
ટેબલ : Table

Portuguese influence: The following are some of the words of Portuguese origin adopted in Gujarati:
સાબુ     soap
બટાટા   potato
પાદરી   father (Christian priest)

Thus the Gujarati language has a rich set of words derived from Indian languages as well as foreign languages. This insight helps in selecting the approach for building the wordnet.

4. Gujarati Wordnet development using expansion approach

The Gujarati wordnet is being built using the expansion approach[4]. In this approach, instead of creating synsets from scratch, synsets are created by referring to the existing wordnet of a related language. Hindi is used as the source language to create the synsets of the Gujarati language. The benefits of this approach are:
(1) The Wordnet development process becomes faster, as the gloss and synset of the source language are already available as a reference.
(2) It provides linking between the synsets of different languages, which can be used for machine translation applications.

The synset linkage tool provided by I.I.T. Bombay is used to create the synsets of the Gujarati language. This tool provides a graphical user interface which shows the Hindi synset on the left side and provides an interface to enter the Gujarati synset on the right side.

As Gujarati is closely related to Hindi, most of the Gujarati synsets are created by translating the Hindi synset into a Gujarati synset. However, emphasis was placed on first understanding the concept independently of language and only then creating the synset.

The task of synset development for Gujarati is further simplified by the online availability of landmark lexical resources like 'Bhagavad Go Mandal'[5] and 'Gujarati Lexicon'[6]. 'Bhagavad Go Mandal' was created in the early twentieth century in the princely state of Gondal in Kathiawad. It contains around 8.2 lakh (820,000) words spread across 9 volumes. It is accepted as a standard reference for the Gujarati language by 'Gujarat Sahitya Parishad' under the leadership of [...]. 'Gujarati Lexicon' is another, more recent effort, by Ratilal Chandaria. The online interface of Gujarati Lexicon provides easy access to meanings, synonyms, antonyms, idioms, proverbs and phrases. These two resources provide great help in building synsets.

5. Observations

5.1 Synset linkage status

The synsets are divided into two categories, Core and Common. The status of the synsets developed under each category is as follows:

                 Core synsets   Common synsets
No. of synsets   1866           5632
Total words      7985           17245
Unique words     7078           13800

5.2 Issues related to synset development

Some Hindi synsets were not linked with Gujarati synsets for the following reasons:
(1) The concept does not exist in the Gujarati language.
(2) Difficulty in interpreting the gloss of the Hindi synset.

Some examples are as follows:

Core synset

(1) ID: 408
Concept: तुरही की तरह का एक बड़ा बाजा
Example: "नरसिंहा की आवाज़ दूर-दूर तक सुनाई देती है"
Synset: नरसिंहा, नरसिंगा, बाँकिया, गोमुख, सिंगा
No such concept is identified in the Gujarati language. However, there is a concept in Gujarati for a similar instrument which is used at the war-front to announce the beginning of a war.

(2) ID: 2636
Concept: इत्र का व्यापार करनेवाला व्यक्ति
Example: "आजकल, इत्र व्यापारी नकली इत्र का व्यापार भी करने लगे हैं"
Synset: इत्र व्यापारी, इत्र फरोश, इत्र फ़रोश, अत्तार, गंधी, गन्धी, इत्रफरोश, इत्रफ़रोश, इत्रफिरोश, इत्रफ़िरोश
There is no such concept in the Gujarati language.

(3) ID: 4436
Concept: एक छोटा पक्षी जो प्रायः अपना घोंसला मकानों में बनाता है
Example: "गौरैया अपने बच्चों को दाना चुगा रही है"
Synset: गौरैया, गौरेया, स्वल्पघटक, वृषायण, बहुश्रुत, आकली
The concept is general and exists in the Gujarati language, but it is difficult to identify the Gujarati name of the bird from the synset.

Common synset

(4) ID: 3
Concept: जो प्रविष्ट न हुआ हो
Example: "अप्रविष्ट अतिथियों को शीघ्र ही भीतर प्रवेश करने दिया जाय"
Synset: अप्रविष्ट
Though this word can be translated into Gujarati, it is not a concept used in the Gujarati language.

(5) ID: 613
Concept: जो अकेला चरता या विचरण करता हो
Example: "जंगली सूअर एक पृथकचर पशु है"
Synset: पृथकचर
There is no such concept in the Gujarati language.

The above examples describe some of the synsets for which Gujarati synsets could not be created. However, these synsets are not part of the general vocabulary.

There was no difficulty in linking verbs, [...] or causative verbs. This is due to the similarity between the Hindi and Gujarati languages. Out of around 7800 Hindi concepts referred to so far, around 7500 were linked to Gujarati, which means that over 95% of the concepts are common to both languages.

5.3 Gujarati language specific concepts

While most of the day-to-day vocabulary of Gujarati is similar to that of Hindi, there are some concepts which are very specific to the Gujarati language. These concepts are mostly related to unique features of the Gujarati language and Gujarati literature. Some examples are as follows:
(1) ગરબો (Garabo): A sacred light used to worship the Goddess during Navratri. Also, a form of dance performed by women to worship the Goddess during Navratri.
(2) ભવાઇ (BhavaI): A specific form of theatre, with special characters like 'Ranglo' and 'Rangli', used in ancient days to convey social issues. Though a rare form of art today, the concept is still very common in the Gujarati language.
(3) છપ્પા (Chhappa): A specific form of poetry, similar to 'Dohe'. However, it is different from 'Dohe', as it exists separately in the Gujarati language.

6. Conclusion

The existence of the Hindi wordnet and the similarity between the Hindi and Gujarati languages helped the development of the Gujarati wordnet. Resources like 'Bhagavad Go Mandal' and 'Gujarati Lexicon' were also found to be very useful in the synset development process. The effort of developing wordnets using the expansion approach for various Indian languages is going to produce a huge lexical resource, which will prove invaluable for machine translation and natural language processing applications.

Acknowledgement

This work is done under the project 'Indradhanush - Wordnet development project for seven Indian languages', sponsored by the Ministry of Communication & I. T., India. We sincerely acknowledge DIT for providing support for the project.

References

[1] Bhattacharyya P. (2010) "IndoWordnet", Lexical Resources Engineering Conference 2010 (LREC 2010), Malta, May 2010.
[2] Fellbaum C. (1998) "WordNet: An Electronic Lexical Database." MIT Press.
[3] Narayan D. (2002) "An Experience in Building the Indo WordNet - a WordNet for Hindi", 1st International Conference on Global WordNet (GWC 02), Mysore, India.
[4] Vossen P. (ed.) (1998) "EuroWordNet: A Multilingual Database with Lexical Semantic Networks." Kluwer Academic Publishers, Dordrecht.
[5] 'Bhagavad Go Mandal', http://www.bhagvadgomandalonline.com
[6] 'Gujarati Lexicon', http://www.gujaratilexicon.com
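
Illustrative note on Sections 4 and 5.2: the linking step of the expansion approach and the reported coverage figure can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the I.I.T. Bombay synset linkage tool. The names `Synset`, `link_synsets` and `coverage_percent` are hypothetical, and synset ID 1 for कमल / કમળ is invented for illustration. IDs 408 and 613 are the unlinked examples from Section 5.2.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A synset reduced to its ID and member words (hypothetical structure)."""
    synset_id: int
    words: list = field(default_factory=list)


def link_synsets(source_synsets, target_words_by_id):
    """Create target synsets that reuse the source synset IDs.

    Returns (linked, unlinked): a dict of target synsets keyed by ID, and
    the list of source IDs for which no target words were supplied.
    """
    linked, unlinked = {}, []
    for source in source_synsets:
        target_words = target_words_by_id.get(source.synset_id)
        if target_words:
            linked[source.synset_id] = Synset(source.synset_id, list(target_words))
        else:
            unlinked.append(source.synset_id)
    return linked, unlinked


def coverage_percent(linked_count, referred_count):
    """Share of referred source concepts that were linked, in percent."""
    return 100.0 * linked_count / referred_count


hindi = [
    Synset(1, ["कमल"]),        # hypothetical ID, word taken from Section 2.2.1
    Synset(408, ["नरसिंहा"]),   # no Gujarati concept identified (Section 5.2)
    Synset(613, ["पृथकचर"]),    # no Gujarati concept identified (Section 5.2)
]
gujarati_words = {1: ["કમળ"]}

linked, unlinked = link_synsets(hindi, gujarati_words)
print(sorted(linked), unlinked)

# Approximate figures from Section 5.2: about 7500 of about 7800 concepts linked.
print(round(coverage_percent(7500, 7800), 2))
```

With the approximate figures from Section 5.2 the coverage comes to about 96%, consistent with the "over 95%" stated there. The core and common synset counts in Section 5.1 (1866 + 5632 = 7498) likewise agree with the figure of around 7500 linked concepts.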