Introduction to Gujarati Wordnet
Abstract

Gujarati is one of the 22 official languages of India. It is an Indo-Aryan language descended from Sanskrit. Gujarati wordnet is being built using the expansion approach with Hindi as the source language. This paper describes the experience of building Gujarati wordnet. It discusses the basic features of the Gujarati language and evaluates the suitability of Hindi for the expansion approach. Various issues related to synset linking under the expansion approach and challenges related to language-specific concepts are also discussed.

1 Introduction

Wordnets have emerged as a very useful resource for computational linguistics and many natural language processing applications. Since the development of Princeton WordNet (Fellbaum C., 1998), wordnets have been built in many other languages. Hindi Wordnet (Narayan D. et al., 2002) was the first wordnet for an Indian language. Based on Hindi wordnet, wordnets for 17 different Indian languages are being built using the expansion approach. One such effort is Gujarati wordnet. This paper describes the experience of building Gujarati wordnet.

The paper is organized as follows. Section 2 gives an introduction to the Gujarati language, Section 3 discusses the basic features of the Gujarati language, and Section 4 describes the influence of other languages on Gujarati and justifies the use of Hindi as the base language for Gujarati wordnet development. The synset development approach and synset categorization are discussed in Sections 5 and 6 respectively. Section 7 gives the current status of Gujarati wordnet. Issues related to synset linking are discussed in Section 8.

2 Gujarati Language

Gujarati, a native language of the Indian state of Gujarat, is a member of the Indo-Aryan family of languages. There are over 50 million speakers of Gujarati.

Initially, the Gujarati writing system was restricted to business writing, while literature was written in the Devanāgarī script. The poetry of the language is much older, enriched by poets like Narsinh Mehta. Gujarati prose writing and journalism started in the 19th century. Protest writing against colonialism led to a string of powerful essays that laid the foundation of modern Gujarati literature.

3 Features

Some features of the Gujarati language are as follows:

3.1 Writing system

Gujarati script is a variant of the Devanāgarī script, differentiated by the loss of the characteristic horizontal line running above the letters and by a small number of modifications in the remaining characters. For example:
Hindi: कमल (kamal), Gujarati: કમળ

3.2 Vocabulary

As Gujarati is an Indo-Aryan language descended from Sanskrit, its vocabulary contains four general categories of words: tatsama, tadbhava, deshi and videshi.

• tatsama: Words accepted from Sanskrit.
• tadbhava: Words adopted from Sanskrit with a change in phonological form.
• deshi: Words that are specific to Gujarati.
• videshi: Words accepted from other languages, such as Persian, English and Portuguese.

It is also noteworthy that in some cases the tatsama and tadbhava forms of a Sanskrit word co-exist, with the same or different meanings. For example, (1) ધર્મ (dharma) and ધરમ (dharam) both mean ‘religion’, while (2) કર્મ (karma) means work with a religious connotation and કરમ (karam) means work in the general sense.

3.3 Grammar

Gujarati follows Subject-Object-Verb word order. There are three genders and two numbers. There is no article. Some significant features are as follows:

3.3.1 Gender

Gujarati distinguishes three genders: masculine, feminine and neuter. For example:
છોકરો (chhokaro, boy)
છોકરી (chhokarI, girl)
છોકરૂ (chhokarU, small kid)
However, gender markers do not always represent biological gender:
મંકોડો (mankodo, big ant)
મંકોડી (mankodI, small ant)

3.3.2 Adjective

Adjectives agree with the noun in gender and number. A feminine adjective does not take a plural marker when agreeing with a plural feminine noun. For example:
Masculine singular: સારો છોકરો (‘saro chhokaro’, good boy)
Masculine plural: સારા છોકરાઓ (‘sara chhokarao’, good boys)
Feminine singular: સારી છોકરી (‘sari chhokari’, good girl)
Feminine plural: સારી છોકરીઓ (‘sari chhokario’, good girls)

3.3.3 Structure of verbs

Gujarati verbs have a root+infinitive structure. Gujarati extends the root verb to form causative sentences. For example:
ઝાડ પડયુ. (‘Zaad padyu’, A tree fell.)
રામે ઝાડ પાડયુ. (‘raame Zaad paadyu’, Ram caused the tree to fall.)
કાને રામ પાસે ઝાડ પડાવયુ. (‘kaane raam paase Zaad padaavyu’, Kan had Ram cause the tree to fall.)

4 Influence of other languages on Gujarati

4.1 Comparison with Hindi

As an Indo-Aryan language, Gujarati is very similar to Hindi. A brief comparison of Gujarati with Hindi is as follows:

• Gender: Gujarati has three genders while Hindi has only two.
• Writing system: Gujarati does not have the horizontal line running above the letters, and a few characters are modified.
• Causative verbs: Both Hindi and Gujarati handle causative verbs in the same fashion.
• ‘Want’ and ‘should’: Both Hindi and Gujarati handle “I should ...” and “I want ...” in a similar way. Gujarati uses ‘jo’, which is similar to ‘chah’ of Hindi. For example, ‘I should go home now.’ is written as:
Hindi: ‘मुझे घर जाना चाहिए।’
Gujarati: ‘મારે ઘરે જવુ જોઇએ.’ (mare ghare javu joiAe)

4.2 Influence of other languages

Other languages have also influenced Gujarati. As India was ruled by Muslims, the English and the Portuguese, their languages have left a mark on Gujarati.

• Urdu influence: The following words demonstrate Urdu influence on Gujarati:
દાવો (Urdu: dava, English: claim)
ફાયદો (Urdu: fayda, English: benefit)
કાયદો (Urdu: kayda, English: law)
ખરાબ (Urdu: kharab, English: bad)
• English influence: Most Indian languages have adopted many English words, and Gujarati is no exception. For example:
બેંક: Bank
ફોન: Phone
ટેબલ: Table
• Portuguese influence: Some Portuguese words adopted in Gujarati are as follows:
સાબુ: ‘saabu’, soap
બટાટા: ‘bataataa’, potato
પાદરી: ‘paadarI’, father (Christian priest)

Thus, Gujarati has a rich set of words derived from Indian languages as well as foreign languages. This insight helps in selecting an approach for building the wordnet.

5 Synset Development Approach

Gujarati wordnet is being built using the expansion approach (Vossen P., 1998). In this approach, synsets are created by referring to the existing wordnet of a related language. Hindi is used as the source language to create the synsets of Gujarati. The benefits of this approach are: (1) Wordnet development becomes faster, as the gloss and synset of the source language are already available as reference. (2) It provides linking between the synsets of different languages, which can be used for machine translation applications.

The task of synset development for Gujarati is further simplified by the availability of online lexical resources like ‘Bhagavad Go Mandal’ (Patel C. B. (ed), 1958) and ‘Gujarati Lexicon’ (Chandaria R., 2006). ‘Bhagavad Go Mandal’ contains around 8.2 lakh (820,000) words spread across 9 volumes. ‘Gujarati Lexicon’ is another, more recent effort. The online interface of Gujarati Lexicon provides easy access to meanings, synonyms, antonyms, idioms, proverbs and phrases. These two resources are a great help in building synsets.

As Gujarati is closely related to Hindi, most Gujarati synsets are created by translating Hindi synsets. However, emphasis was given to understanding the concept independently of the language and then creating the synset. Though the notion of a concept is defined independently of language, it was often observed that a concept present in Hindi was not present in Gujarati, or that the concept was present but there was no indigenous lexeme for it.

6 Synset Categorization

As described in the previous section, there is sometimes disagreement on concepts across languages. Many Hindi concepts are not present in other languages, or there is no indigenous lexeme for the concept in the other language. So, to facilitate synset linking across languages, Hindi synsets are divided into the following categories:

• Universal: This set of concepts is present in all languages, is essential and is most frequently used. For example, ‘सूर्य’ (sun). Most of these concepts belong to the top level of the wordnet and are directly linked with English WordNet and SUMO.
• Pan-Indian: This set of concepts is common to all Indian languages and linkable across them, but has no parallel concept in English. For example, ‘तबला’ (tabala) (an Indian rhythm instrument).
• In-Family: These concepts are common to a specific subset of Indian languages and linkable across all languages of the family. For example, ‘चाचा’ (chacha) (paternal uncle) and ‘भतीजा’ (bhatija) (brother’s son).
• Language Specific: These concepts are specific to a language and its culture. They include local food, festivals, etc. For example, ‘बीहु’ (Bihu) (name of a festival celebrated in the Assam state of India) is very specific to the state and its culture and does not appear in any other language. These concepts appear very low in the hierarchy of the wordnet and normally represent instances or individuals.
• Rare: This includes very specific words

Universal: 7169
Pan-Indian: 1348
Language specific: 108
Verb: 1799
Adverb: 210
Adjective: 3606

8 Issues related to synset development

During the development of synsets, some disagreements were observed between Hindi concepts and Gujarati concepts.
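
The expansion approach of Section 5 and the categories of Section 6 can be illustrated with a small sketch. It is not the project's actual tooling: the synset IDs, the class and function names, and the Gujarati entry સૂર્ય for ‘sun’ are assumptions made for illustration. The idea shown is that a Gujarati synset reuses the ID of the Hindi synset it was translated from, so linking is a lookup by ID, and a Hindi concept with no Gujarati counterpart shows up as an unlinked ID within its category.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    UNIVERSAL = "Universal"
    PAN_INDIAN = "Pan-Indian"
    IN_FAMILY = "In-Family"
    LANGUAGE_SPECIFIC = "Language Specific"
    RARE = "Rare"


@dataclass(frozen=True)
class SourceSynset:
    synset_id: int      # hypothetical ID, shared by every linked wordnet
    words: tuple        # Hindi members of the synset
    gloss: str          # English paraphrase of the concept
    category: Category


# Toy Hindi source wordnet; the IDs are invented for illustration.
HINDI = {
    1: SourceSynset(1, ("सूर्य",), "sun", Category.UNIVERSAL),
    2: SourceSynset(2, ("तबला",), "an Indian rhythm instrument", Category.PAN_INDIAN),
    3: SourceSynset(3, ("चाचा",), "paternal uncle", Category.IN_FAMILY),
    4: SourceSynset(4, ("बीहु",), "festival celebrated in Assam", Category.LANGUAGE_SPECIFIC),
}

# Gujarati synsets created so far, keyed by the ID of their Hindi source.
GUJARATI = {
    1: ("સૂર્ય",),  # assumed Gujarati entry for 'sun'
}


def linking_report(source, target):
    """Group source synset IDs by category into linked and unlinked."""
    report = {}
    for sid, synset in source.items():
        bucket = report.setdefault(synset.category, {"linked": [], "unlinked": []})
        key = "linked" if target.get(sid) else "unlinked"
        bucket[key].append(sid)
    return report


if __name__ == "__main__":
    for category, ids in linking_report(HINDI, GUJARATI).items():
        print(category.value, ids)
```

In such a layout, an unlinked ID in the Language Specific category is expected rather than an error, whereas an unlinked Universal ID would simply mark translation work that is still pending.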