IndoWordNet Dictionary: An Online Multilingual Dictionary using IndoWordNet

Hanumant Redkar, Center for Indian Language Technology, Indian Institute of Technology Bombay, India, [email protected]
Sandhya Singh, Center for Indian Language Technology, Indian Institute of Technology Bombay, India, [email protected]
Nilesh Joshi, Center for Indian Language Technology, Indian Institute of Technology Bombay, India, [email protected]
Anupam Ghosh, Center for Indian Language Technology, Indian Institute of Technology Bombay, India, [email protected]
Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, India, [email protected]

D S Sharma, R Sangal and E Sherly. Proc. of the 12th Intl. Conference on Natural Language Processing, pages 71–78, Trivandrum, India. December 2015. © 2015 NLP Association of India (NLPAI)

Abstract

India is a country of diverse cultures, languages and heritage, and is consequently very rich in languages and their dialects. In such a multilingual society, a dictionary covering multiple languages is a necessity and one of the major resources supporting a language. There are dictionaries for many Indian languages, but very few are available in multiple languages. WordNet is one of the most prominent lexical resources in the field of Natural Language Processing. IndoWordNet is an integrated multilingual WordNet for Indian languages. These WordNet resources are used by researchers to experiment with and resolve issues in multilinguality through computation. However, there are few cases where WordNet is used by non-researchers or the general public. This paper focuses on providing an online interface, the IndoWordNet Dictionary, to non-researchers as well as researchers. It is developed to render multilingual WordNet information of 19 Indian languages in a dictionary format. The WordNet information is rendered in multiple views: sense based, thesaurus based, word usage based and language based. English WordNet information is also rendered using this interface. The IndoWordNet Dictionary will help users to know the meanings of a word in multiple Indian languages.

1 Introduction

Language is a constituent element of civilization. In a country like India, diversity is a primary aspect, and it leads to varied languages and dialects. The numerous languages of India belong to different language families: Indo-Aryan, Dravidian, Sino-Tibetan, Tibeto-Burman and Austro-Asiatic. The major ones are Indo-Aryan, spoken across the northern to western parts of India, and Dravidian, spoken in the southern part of India. The Eighth Schedule of the Indian Constitution lists 22 languages, which are referred to as scheduled languages and are given recognition, status and official encouragement.

A dictionary is a resource that deals with the individual words of a language along with their orthography, pronunciation, usage, synonyms, derivation, history, etymology, etc., arranged in an order that makes the words convenient to reference. The criteria used for classifying dictionaries include density of entries, number of languages involved, nature of entries, degree of concentration on strictly lexical data, axis of time, arrangement of entries, purpose, prospective user, etc. Some of the common types of dictionaries are [1]:

• Encyclopedia: A single or multi-volume publication that contains accumulated and authoritative knowledge on a subject, arranged alphabetically, e.g. the Britannica encyclopedia.
• Thesaurus: A dictionary that lists words in groups of synonyms and related concepts.
• Etymological Dictionary: A dictionary that discusses the etymology, or origin, of the words listed. It is the product of research in historical linguistics.
• Dialect Dictionary: These dictionaries deal with the non-standard words of a particular geographical region or social group.
• Specialized Dictionary: These dictionaries cover a relatively restricted set of phenomena.
• Bilingual or Multilingual Dictionary: These are linguistic dictionaries in two or more languages.
• Reverse Dictionary: These dictionaries map from a concept, idea or definition to words.
• Learner's Dictionary: These dictionaries are meant for foreign students or tourists learning the usage of words in a language.
• Phonetic Dictionary: These dictionaries help in searching for words by the way they sound.
• Visual Dictionary: These dictionaries use pictures to illustrate the meaning of words.

WordNet is a lexical resource composed of synsets and semantic relations. Synsets are sets of synonyms. They are linked by semantic relations like hypernymy, meronymy, etc. and lexical relations like antonymy, gradation, etc. (Miller et al., 1990; Fellbaum, 1998). IndoWordNet [2] is a linked structure of WordNets of 19 different Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families (Bhattacharyya, 2010). Other popular multilingual WordNets are EuroWordNet, a linked WordNet for European languages (Vossen, 1999), and BalkaNet, a linked WordNet for Balkan languages (Christodoulakis, 2002). The most innovative aspect of WordNets is that lexical information is organized in terms of meaning; i.e., a synset contains words of the same part-of-speech which have approximately the same meaning. Thus, it is synonymy that functions as the essential principle in the construction of WordNets (Vincze et al., 2008). This feature of WordNet is the most important one for dictionary construction.

IndoWordNet is used in Natural Language Processing tasks such as Machine Translation, Information Retrieval, Information Extraction, etc. However, little has been done to use this resource beyond research labs. In this paper, we present an interface, the IndoWordNet Dictionary (IWN Dictionary), in the form of a multilingual online dictionary that uses IndoWordNet as its resource. The primary focus of this interface is to provide synset information in a systematic and classified manner, rendered in multiple views.

The rest of the paper is organized as follows: Section 2 justifies the need for the IndoWordNet Dictionary. Section 3 details the IndoWordNet Dictionary and its components, followed by its design and layout. Section 4 gives the features of the dictionary. Section 5 lists its limitations. Finally, the conclusion, scope and enhancements to the IWN Dictionary are presented.

2 Need for IndoWordNet Dictionary

Our work on developing the IWN Dictionary interface is motivated by various available online resources. These include langtolang.com [3], which includes cross-lingual references across 47 non-Indian languages; wordreference.com [4], which includes 17 non-Indian languages; and logosdictionary.org [5] and xobdo.org [6], which have multiple languages including some Indian languages. However, none of these resources renders more than two languages at a given instance.

A further survey reveals that Mohanty et al. (2008) developed a tool for the multilingual dictionary development process, to create and link the synset based lexical resource for machine translation purposes. The aim was to simplify the process of synset creation and to link synsets with different Indian language WordNets. The tool was mainly used by lexicographers involved in creating various Indian language WordNets. Sinha et al. (2006) designed a browsable bilingual interface for viewing WordNet information in two languages, Hindi and Marathi. The input to this browser is a search word in either of the two languages and the output is the search result for both languages. The primary usage of this interface is to help users get the semantic information of the search string in both Hindi and Marathi. Sarma et al. (2012) built a multilingual dictionary covering three languages, viz., Assamese, Bodo and Hindi. Its interface allows searching within one language pair at a time, either Hindi-Assamese or Hindi-Bodo.

All the interfaces mentioned above can display meanings, with the lexical information available in the WordNet, in at most two languages at a time. Hence, we have developed a web based interface to render multilingual WordNet information in a dictionary format. This interface is

• To improve on one's own language vocabulary.
• To address social and educational needs.

3 IndoWordNet Multilingual Dictionary

3.1 What is IndoWordNet Dictionary?

IndoWordNet Dictionary [7], or IWN Dictionary, is an online interface that renders multilingual IndoWordNet information in a dictionary format. It allows the user to view the results in multiple formats as needed, and in multiple languages simultaneously. The look and feel of the IWN Dictionary is kept the same as that of a traditional dictionary, keeping user adaptability in mind. So far, it renders WordNet information of 19 Indian languages. These languages are: Assamese, Bodo, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu. The WordNet information is also rendered in English. The English WordNet information is taken from the Princeton University [8] website. All

[1] http://www.ciil-ebooks.net/html/lexico/link5.htm
[2] http://www.cfilt.iitb.ac.in/indowordnet/
[3] http://www.langtolang.com/
[4] http://www.wordreference.com
[5] http://www.logosdictionary.org/
[6] http://www.xobdo.org/