
BanglaNet: Towards a WordNet for Bengali Language

K.M. Tahsin Hassan Rahit, American International University-Bangladesh
Khandaker Tabin Hasan, American International University-Bangladesh
Md. Al-Amin, American International University-Bangladesh
Zahiduddin Ahmed, American International University-Bangladesh

Abstract

Despite being one of the world's popular languages, Bengali lacks a good wordnet. This restricts NLP-related research work in Bengali. Most of today's wordnets are developed by following the expand approach. One of the key challenges of this approach is cross-lingual word-sense disambiguation. In our research work, we establish semantic relations between the Bengali wordnet and the Princeton WordNet based on well-established research work in other languages. The algorithm derives relations between concepts as well. One of our key objectives is to provide a panel for lexicographers so that they can validate and contribute to the wordnet.

1 Introduction

The Princeton WordNet (PWN) (Miller, 1995; Fellbaum, 1998) is one of the most semantically rich English lexical databases and is widely used as a resource in research and development. It is an important resource not only for NLP applications in each language, but also for inter-linking WordNets of different languages to develop multilingual applications that overcome the language barrier. At present, there are roughly 6,500 languages¹. Among those, Bengali is the 7th most popular language² in the world. Yet there is a lack of work on a Bengali wordnet. The Global WordNet Association (GWA) lists almost all wordnets at several levels, depending on their availability and richness. At the first level, there are 34 Open Multilingual WordNets³ that are merged into the Global WordNet Grid. Despite being a popular language, Bengali is not among them. GWA also lists other available wordnets. Among those 80 wordnets, there are two Bengali wordnets, both developed in India.

In this research work, a baseline for BanglaNet, a wordnet for the Bengali language, has been developed. To link the wordnet with the Princeton WordNet, a semi-automatic cross-lingual sense mapping approach is used. We align Princeton WordNet synsets with a bilingual dictionary through the English equivalent and its part of speech (POS). Manual translation and link-up can also be employed after the alignment. This paper covers previous work on other wordnets, including previous attempts at developing a Bengali WordNet, and describes the initiative taken for BanglaNet and our design and execution process in depth. Lastly, an analysis of the resultant lexical database is presented. We aim to include BanglaNet in the Global WordNet in the future. To that end, the relation with the Princeton WordNet is maintained as far as possible, following convention. Additionally, a web-based collaborative tool called Oikotan, the BanglaNet Lexicography Development Panel (LDP), has been developed for revising the results of synset assignment and for providing a framework to create BanglaNet via linkage with synsets.

¹ How many spoken languages are there in the world, http://www.infoplease.com/askeds/many-spoken-languages.html (Accessed 2016-10-22)
² Most widely spoken languages in the world, http://www.infoplease.com/ipa/A0775272.html (Accessed 2016-10-22)
³ Open Multilingual WordNet, http://compling.hss.ntu.edu.sg/omw/ (Accessed 23-10-2016)

2 Background Study

2.1 WordNet Development Techniques

To date, there are two ways to develop a wordnet for a particular language.

Merge Approach is used to build the wordnet from scratch. The Princeton WordNet was built with this approach. The taxonomies of the language, the synsets, and the relations among synsets are developed first. Experienced manpower, lexicographers, and time are needed for this approach (Taghizadeh and Faili, 2016). Mapping the resultant wordnet to the Princeton WordNet also requires extensive work and cross-language expertise.

Expand Approach is used to map or translate local words directly to the Princeton WordNet's synsets by using existing bilingual dictionaries. Most of the WordNets currently available were developed by following this approach. The process can be eased by performing many tasks semi-automatically and then refining the result through further proofing.

2.2 Related Works

2.2.1 International Languages

The first attempt at developing a WordNet in a language other than English started in 1996. EuroWordNet (Vossen, 2002) began as an EU project, with the goal of developing wordnets for Dutch, Spanish and Italian and linking these wordnets to the English WordNet in a multilingual database. Later, in 1997, it was extended to include German, French, Czech and Estonian. Balkan WordNet (Tufis et al., 2004), developed in the BalkaNet project, aimed to build a multilingual semantic network for Balkan languages such as Greek, Turkish, Romanian, Bulgarian, Czech and Serbian. In developing BalkaNet, semantic relations are classified in the independent WordNets according to a shared ontology. BalkaNet was integrated with EuroWordNet through a WordNet Management System. Relations among synsets were built mostly automatically (Pala and Smrz, 2004), and these relations are based on the Princeton WordNet. However, to achieve a high accuracy rate, the developer needs to pay special attention to the problem of translation equivalents.

Automating the development of semantic resources remains an open challenge in NLP research. In the development of WOLF (Wordnet Libre du Français, Free French Wordnet) (Apidianaki and Sagot, 2012), a free wordnet for the French language, multiple NLP algorithms, including cross-lingual word sense disambiguation, were used.

In the Asian region, the Japanese WordNet (Isahara et al., 2008) was developed using the expand approach. The Korean WordNet (Lee et al., 2002) was developed with the expand approach by extracting a semantic hierarchy from a monolingual MRD and an existing thesaurus. The Thai WordNet (Sathapornrungkij and Pluempitiwiriyawej, 2005) was also developed by following the same approach. Another large effort in the Asian region is IndoWordNet (Prabhu et al., 2012), developed in India to cover the languages used in the Indian subcontinent. IndoWordNet was also developed using existing WordNets.

Word-sense disambiguation (WSD) techniques have played a major role in most wordnet development. Lefever and Hoste have presented reviews of cross-lingual disambiguation (Lefever and Hoste, 2010; Lefever and Hoste, 2013). They found that for languages where the ratio of words to senses is low, it becomes hard to extract translations, since the number of translations for a particular word in another language becomes greater. Hence, a particular word has multiple translations in the counterpart language. French encountered a problem similar to ours: it had no corpus with predicate-argument annotations, which help to express semantic relation build-up. Van der Plas et al. researched predicate labeling in French (van der Plas and Apidianaki, 2014) to overcome this issue using word sense disambiguation.

There are two evaluation modes in cross-lingual WSD: best match and out-of-five. In best mode, the word or sense with the best probability score is tagged with its counterpart word or sense. In the out-of-five approach, if multiple senses or words belong to the candidate conceptualization, the five candidates with the best probability scores are considered for further analysis. Further analysis can be done manually, automatically, or semi-automatically.

WSD performance can be improved by using the Direct Semantic Transfer (DST) technique developed by Van der Plas et al. (Van der Plas et al., 2011). It states that senses can be directly transferred to another language if and only if both share the same semantic property. Surtani et al. developed a system that can predict paraphrases based on a corpus (Surtani et al., 2013). Their system includes a semantic relation prediction model.

Recently, BabelNet⁴ (Navigli and Ponzetto, 2012a) has become a good example of a multilingual language resource. BabelNet simplified the WSD process by incorporating a coding API (Navigli and Ponzetto, 2012b). It primarily uses open-source resources such as Wikipedia. However, BabelNet does not create a WordNet for any particular language. It is a huge standalone network of multilingual resources which utilizes the Princeton WordNet along with other resources to build relations.

2.2.2 Bengali

Of the two Bengali wordnets listed by GWA, one is developed by the Indian Institute of Statistics under the Indradhanush Project⁵. It has an online browser which does not provide the semantic relations between synsets and only shows the different concepts available for a word. The other Bengali wordnet is developed as part of IndoWordNet by the Center for Indian Language Technology (CILT) and the Indian Institute of Technology (IIT-Bombay) (Prabhu et al., 2012). Notably, this WordNet is built by following the expand approach. It does have semantic relations between synsets to some extent. It is the most mature and contextually rich Bengali WordNet to date. Both WordNets are browsable but closed source. They are not publicly available for development, use or extension, nor do they provide any API for general use.

There was also an effort to develop a Bengali WordNet at BRAC University's Center for Research on Bengali Language Processing. Their development process followed the merge approach (Faruqe and Khan, 2010).

⁴ BabelNet can be found on http://babelnet.org (Accessed 2016-12-07)
⁵ Indradhanush Project, http://indradhanush.unigoa.ac.in (Accessed 2016-10-22)

vi) Lexicographer validation for resultant mapping.

3.1 Similarity Matrices

In step iv, a similarity algorithm is used.
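
The best and out-of-five modes described in Section 2.2.1 both operate on per-candidate scores, such as those a similarity step might produce. The following is a minimal sketch of the two selection modes only. It is not the authors' implementation: the function names, the synset identifiers, and the scores are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate senses by score, highest first (ties broken by name)."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def best_match(scores: Dict[str, float]) -> str:
    """Best mode: keep only the single highest-scoring candidate."""
    return rank_candidates(scores)[0][0]


def out_of_five(scores: Dict[str, float]) -> List[str]:
    """Out-of-five mode: keep up to five top candidates for further analysis."""
    return [sense for sense, _ in rank_candidates(scores)[:5]]


# Hypothetical scores for candidate synsets of a single source word.
# The identifiers and values are made up for this example.
candidate_scores = {
    "bank.n.01": 0.41,
    "bank.n.02": 0.27,
    "bank.n.03": 0.12,
    "bank.n.04": 0.09,
    "bank.n.05": 0.06,
    "bank.n.06": 0.05,
}

print(best_match(candidate_scores))
print(out_of_five(candidate_scores))
```

In a semi-automatic workflow, the single best candidate could be accepted directly, while the out-of-five list could be handed to a lexicographer for manual validation.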
