Telugu Wordnet
Total Page:16
File Type:pdf, Size:1020Kb
Telugu WordNet S. Arulmozi Department of Dravidian & Computational Linguistics Dravidian University, Kuppam 517425, India [email protected] Abstract Section 4 gives a statistical account on the synsets developed. The last section This paper describes an attempt to develop Telugu WordNet, particularly construction of summarizes the work. synsets in Telugu language along the lines of Hindi synsets using the expansion approach. 2 The Telugu Language Based on the Hindi WordNet synsets, we assign Telugu synsets manually using the Offline Tool Telugu belongs to the South Central Dravidian Interface. We share the challenges faced in the subgroup of the Dravidian family of languages. construction of core synsets from Hindi into It has recorded history from 6th Century A.D. Telugu language. A brief account on Telugu th language and its notable features are also and literary history dating back to 11 Century provided. A.D. It has been recently awarded the Classical Status. It is the second most spoken language 1 Introduction after Hindi in India. Telugu has been the language of choice for lyrical compositions for WordNet building activities in Dravidian its vowel ending words, rightly called the languages started with the work of Tamil “Italian of the East”. WordNet1 at AU-KBC Research Centre using The vocabulary of Telugu is highly Rajendran’s (2001) ontological classification Sanskritized in addition to the Persian-Arabic of Tamil vocabulary. Work on Dravidian borrowings / kaburu/ `story ’, WordNet (comprising WordNets in four major కబురు జవాబు /javaabu/ `answer ’; Urdu /taraaju/ Dravidian languages, viz. Kannada, తరా灁 Malayalam, Tamil and Telugu) started during a 2 `balance’. It does have cognates in other Workshop held at Chennai in which synsets Dravidan languages such as puli/ `tiger ’, were built for Construction Domain. Currently పు젿 / 3 uuru/ `village’; /tala/ `head’. Dravidian WordNet activity is being carried ఊరు / తల out for Kannada at University of Hyderabad, Malayalam at Amrita Vishwa Vidyapeetham, Words in Dravidian languages, especially in Tamil at Tamil University and Telugu at Telugu are long and complex. i.e. because it Dravidian University. also suffixation, words are build up from many affixes that combine with one another. Telugu, In this paper, we describe the construction of like other Dravidian languages is highly rich in synsets for Telugu language. Based on the morphology and hence agglutinative in nature; Hindi WordNet synsets, we aim to develop it also allows polyagglutination. synsets in Telugu using the expand model. The Telugu has the state-of-the-art Morphological paper is organized as follows: Section 2 gives a Analyser 4 and Telugu-Hindi, Hindi-Telugu, brief account on the morphological features of Telugu-Tamil, Tamil-Telugu MT systems5. In Telugu language. This section also provides addition to this, the Resource Centre for Indian the language technology activities pertaining to Language Technology Solutions-Telugu Telugu language. Section 3 details the Telugu (RCILTS) established by the Ministry of synset building activity, challenges/problems Communications and IT, Govt.of India during faced during the construction of synsets. 2000-2003 has developed several products, 1 4 Project partially funded by Tamil Virtual University. Developed by G.Uma Maheswar Rao at the Centre for 2 Workshop on WordNet for Dravidian Languages ALTS, University of Hyderabad. 5 organized from 2-3 June 2003 Consortia on Indian-Language-Indian Language MT 3 Project funded by the Ministry of HRD, Govt. of India. systems established by MCIT, GoI. services and knowledge bases pertaining to 3.1.1 Computational Concerns Telugu language. They include Drishti, the The interface7 has the following problems: first comprehensive OCR system in Telugu, Tel-Spell, Spell checker for Telugu, Telugu 1. In the target language box, complete option Corpus (9.2 m words)6, etc. does not show how many synsets are completed. It was working in the earlier This effort on building a WordNet for Telugu version. is the first of its kind along the lines of Hindi WordNet. 2. By mistake, if the user opens the tool for 3 Construction of Telugu WordNet the second time, there shall be a message displayed stating, that the tool is already open. Various approaches are followed in the construction of WordNets across the languages 3. The tool does not work when logged-in as a of the world. For the Indian languages, Guest user (in Windows Vista). Whenever WordNets are constructed using the expand guest user is logged in and opens the tool, it model. does not allow to SAVE the data. For the construction of Telugu WordNet, we also follow the expand model. i.e. Hindi 4. When X-Link (Synset Linker) is invoked, WordNet synsets are taken as a starting point the popup window does not display any of departure. The concepts provided along with options. the Hindi synsets are first conceived and appropriate concepts in Telugu are manually 5. Everytime the user has to use the exit provided by language experts. The Telugu button to close the tool, instead the close synsets are then built based on the concepts option X shall be enabled. created keeping in view the three principles, 3.1.2 Lexicographical Concerns viz. Minimality, Coverage and Replaceability. 1. Hindi synsets ID 5499: For the building of synsets, we use the OfflineTool provided along with Hindi Synset IDs are the same in both versions (2.0 WordNet synsets. This standalone interface and 2.1) of Offline Tool. Subsequently, the allows users to view the Hindi synsets, concept, example and sysnets are the same. concepts, example sentence on the one side and simultaneously keying the target language But in version 2.0 of the OfflineTool, EWN ID (Telugu in our case) synsets, concepts and is 1214525 and it displays the Concept as: “the example sentence. The tool also has the social act of assembling”; Example as: "they Princeton WordNet English synsets interlinked. demanded the right of assembly"; Synset: This helps the language experts to cross check “assembly, assemblage, gathering” and in with English WordNet synsets. version 2.1, PWD ID is 8069519 displaying Below we present the challenges/problems the Concept as: “a large number of things or faced in the construction of 2000 core synsets people considered together”; Example: "a from Hindi into Telugu. crowd of insects assembled around the flowers"; Synset: “crowd”. 3.1 Expansion from Hindi/English into Telugu Now, this is a problem when one tries to For the construction of Telugu synsets, we use manually assign synsets in Telugu based on the the OfflineTool provided by Hindi WordNet Hindi/English synsets. group. We have encountered certain problems with regard to the tool. We have also faced 2. Hindi ID Number 7379: specific problems while rendering corresponding synsets, concepts in Telugu. The concept भाई का ऱ蔼का /bhai ka ladka/ which Below we present few of the computational means `brother’s son’ is an interesting problem. and lexicographical concerns. The synsets given for this concept in Hindi are 6 Achievements of RCILTS-Telugu, TDIL 7 Offline Tool Version 2.1 “भतीजा, भ्रातजृ , भ्रातापुत्र, भ्रातपृ ुत्र, भतीज, अवतसं , executive) and the synsets are “businessman, man of affairs”. ” BatIjA, BrAtRuja, BrAtAputra, अवतꅍस / BrAtRuputra, BatIja, avataMsa, avatansa/ 5. Hindi Synset ID: 1680 which all means `brother’s son’ in Hindi. When it comes to Telugu, providing concept is Concept: a challenge. Straightforward, one can assign बहत ब蔼ा या क्तवशेष ऊँ चाई का या जजसका क्तवतार /sOdaruni kumAruDu/. But ु సోదరుꀿ కుమారుడు ऊपर की ओर अधिक हो when assigning synsets, one has to cope up bahuta baDxA yA viSeSha UMcAyI kA yA with the ambiguity in the concept. i.e. whose jisakA vistAra Upara kI ora adhika ho brother;s son. In any case, we have provided equivalent as /anna koDuku/. For the above mentioned Hindi concept, the అన్న కొడుకు corresponding English synsets in both the Similar challenge faced in ID 1804. versions are completely different. 3. Hindi synset ID: 7531 In the 2.0 version, for the EWN ID 1250892, the concept given is “the action of establishing Concept: on a socialist basis” ; example given is “the चाऱीस सेर की एक तौऱ socialization of medical services” and the synsets are “socialization, socialisation”. cAlIsa sera kI eka taula In 2.1 version, for the PWN ID 1250892, the which means `a measure of 40 kg’. concept given is “(literal meaning) being at or having a relatively great or specific elevation For this concept, there is no corresponding or upward extension (sometimes used in equivalent in Telugu. But there are different combinations like “knee-high””, example is “a usages in different dialects of Andhra Pradesh. high mountain”, and so on and the synset is For example, in Kuppam, the measure is equal “high”. to 10 kg whereas in Kadapa it is 14 kg. These kinds of synsets pose a real problem for When it comes to providing equivalent synsets, the language experts because whenever they Hindi and Telugu uses the same, i.e. manu. assign a concept/synset, it creates a confusion. Further, it 4. Hindi Synset ID: 1602 We have presented only few challenges and Concept: problems faced in the construction of Telugu 핍यापार करनेवाऱा 핍यक्ति WordNet using Hindi synsets. The real vyApAra karanevAlA vyakti challenge will come up once we start working horizontally on synsets. which means `bookbinder’ 4 Statistics The English IDs given for the above mentioned Hindi synset ID is different from The status of Telugu WordNet after Version 2.0 to Version 2.1. completion of core (2k) synsets is given below: In Version 2.0 of OfflineTool, EWN ID is No. of Core Synsets: 2000 10560080 and the concept given is “someone No.