An Online Multilingual Dictionary Using Indowordnet
Total Page:16
File Type:pdf, Size:1020Kb
IndoWordNet Dictionary: An Online Multilingual Dictionary using IndoWordNet Hanumant Redkar Sandhya Singh Center for Indian Language Technology, Center for Indian Language Technology, Indian Institute of Technology Bombay, India Indian Institute of Technology Bombay, India [email protected] [email protected] Nilesh Joshi Anupam Ghosh Center for Indian Language Technology, Center for Indian Language Technology, Indian Institute of Technology Bombay, India Indian Institute of Technology Bombay, India [email protected] [email protected] Pushpak Bhattacharyya Center for Indian Language Technology, Indian Institute of Technology Bombay, India [email protected] views such as: sense based, thesaurus based, Abstract word usage based and language based. Eng- lish WordNet information is also rendered us- India is a country with diverse culture, lan- ing this interface. The IndoWordNet guage and varied heritage. Due to this, it is dictionary will help language researchers as very rich in languages and their dialects. Be- well as general public to know meanings of ing a multilingual society, a multilingual dic- a word in multiple Indian languages. tionary becomes its need and one of the major resources to support a language. There are dictionaries for many Indian languages, but 1 Introduction very few are available in multiple languages. WordNet is one of the most prominent lexical Language is a constituent element of civilization. resources in the field of Natural Language In a country like India, diversity is its primary as- Processing and Information Retrieval, while pect. This leads to varied languages and their dia- IndoWordNet is an integrated multilingual lects. There are numerous languages in India which WordNet for Indian languages. These re- belong to different language families. These lan- sources are used by researchers to experiment guage families are Indo-Aryan, Dravidian, Sino- and resolve the issues in multilinguality Tibetan, Tibeto-Burman and Austro-Asiatic. The through computation. However, there are few major ones are the Indo-Aryan, spoken by most of cases where WordNet is used by the non- researchers or general public. This paper fo- Indians while Dravidian spoken by southern Indi- cuses on providing an online interface – ans. The Eighth Schedule of the Indian Constitu- IndoWordNet Dictionary to non-researchers tion lists 22 languages, which have been referred to as well as researchers. It is developed to ren- as scheduled languages and given recognition, sta- der multilingual WordNet information of 19 tus and official encouragement. Indian languages in a dictionary format. The A Dictionary can be called as a resource dealing WordNet information is rendered in multiple with the individual words of a language along with its orthography, pronunciation, usage, synonyms, which is a linked WordNet for European languages derivation, history, etymology, etc . arranged in an (Vossen, 1999) and BalkaNet, which is a linked order for convenience of referencing the words. WordNet for Balkan Languages (Christodoulakis, Various criterions used for classifying this resource 2002). The most innovative aspect of WordNets is are - density of entries, number of languages in- that lexical information is organized in terms of volved, nature of entries, degree of concentration meaning; that is, a synset contains words of the on strictly lexical data, axis of time, arrangement same part-of-speech which have approximately the of entries, purpose, prospective user, etc . Some of same meaning. Thus, it is synonymy that functions the common types of dictionaries are 1- as the essential principle in the construction of • Encyclopedia: Single or multi-volume publi- WordNets (Vincze et al., 2008). This feature of cation that contains accumulated and authorita- WordNet is most important for the dictionary con- tive knowledge on a subject arranged struction. alphabetically. E.g. Britannica encyclopedia. IndoWordNet is used in the field of Natural • Thesaurus: Thesaurus is a dictionary that lists Language Processing tasks like Machine Transla- words in groups of synonyms and related con- tion, Information Retrieval, Information Extrac- cepts. tion, etc . But, not much has been explored to use • Etymological Dictionary: An etymological this resource beyond research labs. In this paper, dictionary discusses the etymology/origin of we present an interface – IndoWordNet Dictionary the words listed. It is the product of research in (IWN Dictionary ) in the form of multilingual historical linguistics. online dictionary which uses IndoWordNet as a • Dialect Dictionary: These dictionaries deal resource. The primary focus of this interface is to with the words of a particular geographical re- provide synset information in a systematic and gion or social group which are non standard. classified manner which is rendered in multiple • Specialized Dictionary: These dictionaries views. covers relatively restricted set of phenomena. The rest of the paper is organized as follows: • Bilingual or Multilingual Dictionary: These Section 2 justifies the need of IndoWordNet Dic- are linguistic dictionaries in two or more lan- tionary. Section 3 details the IndoWordNet Dic- guages. tionary, its components, followed by its design and layout. Section 4 gives the features of the diction- • Reverse Dictionary: These dictionaries are ary. Section 5 lists its limitations. Finally, the con- based on the concept/idea/definition to words. clusion, scope and enhancements to the interface • Learner’s Dictionary: These dictionaries are are presented. meant for foreign students/tourists to learn the usage of the word in language. 2 Need for IndoWordNet Dictionary • Phonetic Dictionary : These dictionaries help in searching the words by the way they sound. Our work on developing IWN Dictionary interface • Visual Dictionary : These dictionaries use pic- is motivated from various available online re- tures to illustrate the meaning of words. sources. To name some: langtolang.com 3 which includes cross-lingual references across 47 non- WordNet is a lexical resource composed of Indian languages, wordreference.com 4 which synsets and semantic relations. Synsets are sets of includes 17 non-Indian languages, and others being synonyms. They are linked by semantic relations logosdictionary.org 5 and xobodo.org 6 like hypernymy, meronymy, etc. and lexical rela- which has multiple languages including some Indi- tions like antonymy, gradation, etc . (Miller et al., an languages. But, all these resources render not 2 1990; Fellbaum, 1998). IndoWordNet is a linked more than two languages at a given instance. structure of WordNets of 19 different Indian lan- Further survey is done, which reveals that guages from Indo-Aryan, Dravidian and Sino- Mohanty et al. (2008) had developed a tool for Tibetan families (Bhattacharyya, 2010). Other popular multilingual WordNets are: EuroWordNet, 3 http://www.langtolang.com/ 4 http://www.wordreference.com 1 http://www.ciil-ebooks.net/html/lexico/link5.htm 5 http://www.logosdictionary.org/ 2 http://www.cfilt.iitb.ac.in/indowordnet/ 6 http://www.xobdo.org/ multilingual dictionary development process to located to that area. create and link the synset based lexical resource for • To improve on one’s own language voca- machine translation purpose. The aim was to sim- bulary. plify the process of synset creation and to link it • To address social and educational needs. with different Indian language WordNets. The tool was mainly used by lexicographers involved in the 3 IndoWordNet Multilingual Dictionary process of creating various Indian language WordNets. Also, Sinha et al. (2006) who have de- signed a browsable bilingual interface for viewing 3.1 What is IndoWordNet Dictionary? WordNet information in two languages Hindi and Marathi. The input to this browser is a search 7 IndoWordNet Dictionary or IWN Dictionary is an string in any of the two languages and the output is online interface to render multilingual IndoWord- the search result for both the languages. The pri- Net information in the dictionary format. It allows mary usage of this interface is to help users get the user to view the results in multiple formats as per semantic information of the search string in both the need. Also, user can view the result in multiple Hindi and Marathi. However, Sarma et al. (2012) languages simultaneously. The look and feel of the built a multilingual dictionary considering three IWN Dictionary is kept same as a traditional dic- languages, viz. , Assamese, Bodo and Hindi. The tionary keeping in mind the user adaptability. So dictionary interface allows searching between Hin- far, it renders WordNet information of 19 Indian di-Assamese and Hindi-Bodo language pairs at a languages. These languages are: Assamese, Bodo, time. Bengali, Gujarati, Hindi, Kannada, Kashmiri, All these interfaces mentioned above could dis- Konkani, Maithili, Malayalam, Manipuri, Marathi, play the meanings in at most two languages with Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and all the linguistic information available in the Urdu. The WordNet information is also rendered in WordNet at a time. Hence, we have developed a English. The English WordNet information is web based interface to render multilingual 8 taken from Princeton University website and im- WordNet information in a dictionary format. This ported into IndoWordNet database structure interface is developed keeping in mind the general (Prabhu, et al., 2012). The data is imported to this public, apart from