Introduction to Gujarati Wordnet
Prof. C. K. Bhensdadia
Department of Computer Engg., Dharmsinh Desai University, Nadiad

Brijesh Bhatt
Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai

Prof. Pushpak Bhattacharyya
Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai

Abstract
Gujarati is the youngest member of IndoWordnet[1]. As part of the IndoWordnet project, a Wordnet for the Gujarati language is being developed from the Hindi Wordnet using the expansion approach. This paper reviews the Gujarati Wordnet development process. It describes the basic features of the Gujarati language and evaluates the suitability of Hindi as a source language. The current status of the work and the issues encountered in development are also described.

1. Introduction
WordNet[2] is a machine-readable lexical database for the English language developed at Princeton University. It has become one of the most valuable resources for natural language processing applications. Following the Princeton WordNet, wordnets for many other languages were developed across the globe. The first wordnet for an Indian language is the Hindi wordnet[3], developed at the Indian Institute of Technology, Bombay. Efforts are now under way to develop wordnets for many Indian languages. One such effort is to build the Gujarati wordnet from the Hindi wordnet using the expansion approach.
The layout of the paper is as follows: section 2 gives an introduction to the Gujarati language, section 3 describes the historic influence of other languages on Gujarati and justifies the use of Hindi as the base language for Gujarati Wordnet development. Section 4 describes the expansion approach selected for the Wordnet development. Section 5 describes the status of Gujarati Wordnet development and some issues related to synset linking.

2. Gujarati Language
Gujarati, the native language of the Indian state of Gujarat, is a member of the Indo-Aryan family of languages. There are over 50 million speakers of Gujarati, and it is one of the 22 official languages of India. Incidentally, Gujarati was the first language of Gandhiji (Mohandas K. Gandhi, father of India) and Mohammed Ali Jinnah (father of Pakistan).

2.1 History
Initially, the Gujarati script was used only for business writing, while literature was written in Devanāgarī script. The poetic form of the language is much older, enriched by the work of poets like Narsinh Mehta. Gujarati prose writing and journalism started in the 19th century. Protest writing against colonialism led to a string of powerful essays that laid the foundation of modern Gujarati literature.

2.2 Features
Some features of the Gujarati language are as follows:

2.2.1 Writing system: Gujarati script is a variant of Devanāgarī script, differentiated by the loss of the characteristic horizontal line running above the letters and by a small number of modifications in the remaining characters.
For example:
Hindi: कमल (kamal)
Gujarati: કમળ

2.2.2 Vocabulary: As Gujarati is an Indo-Aryan language descended from Sanskrit, its vocabulary contains four general categories of words: Tatsam, Tadbhav, Native and Loan words.
Tatsam: Words accepted from Sanskrit unchanged.
Tadbhav: Words from Sanskrit adopted with a change in phonological form.
Native: Words which are specific to the Gujarati language.
Loan words: Words accepted from other languages, such as Persian, English and Portuguese. The next section describes such words in more detail.
It is also noteworthy that in some cases the tatsam and tadbhav words for the same Sanskrit word co-exist, with the same or different meanings.
For example:
(1) ધર્મ (dharma) and ધરમ (dharam) both mean 'Religion'.
(2) કર્મ (karma): Work, with religious connotation
    કરમ (karam): Work

2.2.3 Grammar: Gujarati follows Subject-Object-Verb word order. There are three genders and two numbers. There are no articles. Some significant features are as follows:

2.2.3.1 Gender: Gujarati distinguishes between three genders: masculine, feminine and neuter. However, the gender marker does not always represent biological gender; in the second pair below it distinguishes size.
For example:
છોકરો (chhokaro) Boy        છોકરી (chhokari) Girl
મંકોડો (mankodo) Big ant    મંકોડી (mankodi) Small ant

2.2.3.2 Adjective: An adjective agrees with its noun in gender and number. A feminine adjective does not take a plural marker when it agrees with a plural feminine noun.
For example:
(1) Masculine singular
સારો છોકરો (sar-o chhokar-o) Good boy
(2) Masculine plural
સારા છોકરાઓ (sar-a chhokara-o) Good boys
(3) Feminine singular
સારી છોકરી (sar-i chhokar-i) Good girl
(4) Feminine plural
સારી છોકરીઓ (sar-i chhokari-o) Good girls

2.2.3.3 Structure of verbs: Gujarati verbs have a root+infinitive structure. Gujarati extends the root verb to form causative sentences.
For example:
(1) ઝાડ પડ્યુ. (Zaad paDyu)
A tree fell.
(2) રામે ઝાડ પાડ્યુ. (Rame Zaad paaDyu)
Ram caused the tree to fall.
(3) કાને રામ પાસે ઝાડ પડાવ્યુ. (Kane Ram paase Zaad padaVyu)
Kan had Ram make the tree fall.

3. Influence of other languages on Gujarati
As an Indo-Aryan language, Gujarati is very similar to Hindi, Marathi and Punjabi. The grammar and vocabulary of Gujarati are very similar to those of Hindi, with a few exceptions. A brief comparison is as follows:
(1) Gender: As described in section 2, Gujarati has three genders while Hindi has only two.
(2) Writing system: Gujarati dropped the horizontal line running above the letters, and a few characters are modified, as shown in the previous section.
(3) Causative verbs: Both Hindi and Gujarati handle causative verbs in the same fashion.
For example,
Hindi: रोना (rona), रुलाना (rulana), रुलवाना (rulavana)
is similar to
Gujarati: રડવું (radvu), રડાવવું (radavavu), રડાવરાવવું (radavravavu)
(4) 'Want' and 'should': Both Hindi and Gujarati handle "I should ..." and "I want ..." in similar ways. Gujarati uses 'jo', which is similar to 'chah' in Hindi.
For example,
I should go home now.
in Hindi,
मुझे घर जाना चाहिए।
in Gujarati,
મારે ઘરે જવુ જોઇએ. (mare ghare javu joiAe)

However, other languages have also influenced Gujarati. As India was ruled by Muslim rulers, the English and the Portuguese, their languages have left a mark on Gujarati.

Urdu influence: The following words demonstrate Urdu influence on Gujarati:
Gujarati    Urdu      English
દાવો        dava      Claim
ફાયદો       fayda     Benefit
કાયદો       kayda     Law
ખરાબ        kharab    Bad

English influence: Most Indian languages have adopted many English words, and Gujarati is no exception.
For example,
બેંક : Bank
ફોન : Phone
ટેબલ : Table

Portuguese influence: The following are some of the Portuguese words adopted in Gujarati:
સાબુ : soap
બટાટા : potato
પાદરી : father (Christian priest)

Thus the Gujarati language has a rich set of words derived from Indian languages as well as foreign languages. This insight helps in selecting the approach for building the wordnet.

4. Gujarati Wordnet development using expansion approach
The Gujarati wordnet is being built using the expansion approach[4]. In this approach, instead of creating synsets from scratch, synsets are created by referring to the existing wordnet of a related language. Hindi is used as the source language to create the synsets of the Gujarati language. The benefits of this approach are:
(1) The wordnet development process becomes faster, as the gloss and synset of the source language are already available as reference.
(2) It provides linking between the synsets of different languages, which can be used for machine translation applications.
The synset linkage tool provided by I.I.T. Bombay is used to create the synsets of the Gujarati language. This tool provides a graphical user interface which shows the Hindi synset on the left side and provides an interface to enter the Gujarati synset on the right side.
As Gujarati is closely related to Hindi, most of the Gujarati synsets are created by translating the Hindi synset into a Gujarati synset. However, emphasis was placed on understanding the concept independently of the language before creating the synset.
The task of synset development for Gujarati is further simplified by the online availability of milestone lexicon resources like 'Bhagavad Go Mandal'[5] and 'Gujarati Lexicon'[6]. 'Bhagavad Go Mandal' was created in the early twentieth century in the princely state of Gondal in Kathiawad. It contains around 8.2 lakh (820,000) words spread across 9 volumes. It was accepted as the standard reference for the Gujarati language by 'Gujarat Sahitya Parishad' under the leadership of Mahatma Gandhi. 'Gujarati Lexicon' is another, more recent effort, by Ratilal Chandaria. The online interface of Gujarati Lexicon provides easy access to meanings, synonyms, antonyms, idioms, proverbs and phrases. These two resources provide great help in building synsets.

5. Observations
5.1 Synset linkage status
The synsets are divided into two categories: Core and Common. The status of the synsets developed under each category is as follows.

Category         No. of synsets    Total words    Unique words
Core synset      1866              7985           7078
Common synset    5632              17245          13800

5.2 Issues related to synset development
Some Hindi synsets were not linked with Gujarati synsets for the following reasons:
(1) The concept does not exist in the Gujarati language.
(2) Difficulty in interpreting the gloss of the Hindi synset.
Some examples are as follows:
Core synset
(1) ID: 408
Concept: तुरही की तरह का एक बड़ा बाजा
Example: "नरसिंहा की आवाज दूर-दूर तक सुनाई देती है"
Synset: नरसिंहा, नरसिंगा, बाँकिया, गोमुख, सिंगा
No such concept is identified in the Gujarati language. However, Gujarati does have a concept for a similar instrument, which is used at the war-front to announce the beginning of a war.

There was no difficulty in linking verbs, adjectives or causative verbs. This is due to the similarity between Hindi and Gujarati