Building a WordNet for Sinhala
Indeewari Wijesiri, University of Moratuwa, Moratuwa, Sri Lanka
Malaka Gallage, University of Moratuwa, Moratuwa, Sri Lanka
Buddhika Gunathilaka, University of Moratuwa, Moratuwa, Sri Lanka
Madhuranga Lakjeewa, University of Moratuwa, Moratuwa, Sri Lanka
Daya C. Wimalasuriya, University of Moratuwa, Moratuwa, Sri Lanka
Gihan Dias, University of Moratuwa, Moratuwa, Sri Lanka
Rohini Paranavithana, University of Colombo, Colombo, Sri Lanka
Nisansa de Silva, University of Moratuwa, Moratuwa, Sri Lanka

Abstract

Sinhala is one of the official languages of Sri Lanka and is used by over 19 million people. It belongs to the Indo-Aryan branch of the Indo-European languages and its origins date back at least 2000 years. It has developed into its current form over a long period of time with influences from a wide variety of languages including Tamil, Portuguese and English. As for any other language, a WordNet is extremely important for Sinhala to take it into the digital era. This paper is based on the project to develop a WordNet for Sinhala based on the English (Princeton) WordNet. It describes how we overcame the challenges in adding Sinhala-specific characteristics, which were deemed important by Sinhala language experts, to the WordNet while keeping the structure of the original English WordNet. It also presents the details of the crowdsourcing system we developed as a part of the project, consisting of a NoSQL database in the backend and a web-based frontend. We conclude by discussing the possibility of adapting this architecture for other languages and the road ahead for the Sinhala WordNet and Sinhala NLP.

1 Introduction

Despite being used by over 19 million people and being one of the official languages of Sri Lanka, there has not been much progress in developing natural language processing (NLP) applications for the Sinhala language. This is partly due to the lack of commercial interest in developing Sinhala NLP applications on a global scale. For instance, as of now, neither Google Translate¹ nor Google News² is available for Sinhala while both are available in Hindi and Tamil, two other regional languages spoken by a much larger population and thus with a higher business value.

¹ http://translate.google.com/
² https://support.google.com/news/answer/40237

Within this backdrop, we believe that developing a fully functional WordNet for Sinhala would provide a much needed boost for Sinhala NLP work. This is because it is well recognized that a WordNet is a very important tool in performing natural language processing tasks for any language. A WordNet will be helpful to Sinhala NLP application developers in tasks ranging from word sense disambiguation and information retrieval to translation. Moreover, a Sinhala WordNet will be a valuable resource to linguists studying the Sinhala language. We paid special attention to the interests and concerns of the latter group as described later in the paper.

The project team, mainly consisting of personnel from the Knowledge and Language Engineering Lab of University of Moratuwa, started the task of developing a WordNet for Sinhala with several brainstorming sessions which involved Sinhala language experts, computer science specialists and people who had previously made some contributions in digitizing the Sinhala language (for example in developing Sinhala Unicode characters). Although we were biased towards using the expansion approach, which develops a WordNet based on an existing WordNet for another language, we discussed the possibility of adopting the merge approach, which develops a WordNet from first principles by leveraging existing dictionaries and other resources (Bhattacharyya, 2010). We settled on the expansion approach because it was evident that we do not have the resources to successfully pursue the merge approach.

We came up with a basic design for the WordNet through the above-mentioned brainstorming sessions and then proceeded to develop the technical infrastructure needed. This consists of developing Sinhala WordNet APIs and a web interface as well as a crowdsourcing system to add synsets and relationships. The latter is needed because coming up with Sinhala synsets and relationships based on the synsets of another language requires a lot of manual work. Initially we were planning to use the Hindi WordNet as the source WordNet but switched to the English WordNet a couple of months into the project. The reasons for this change are discussed in Section 2.2. Apart from this, the development effort proceeded fairly smoothly and we have completed the implementation of the WordNet API and the crowdsourcing system. Currently we are in the process of adding synsets using this system.

The rest of the paper is organized as follows. In Section 2, we present the details of the discussions we had with Sinhala language experts and the effects these discussions had on the structure of the Sinhala WordNet. In Section 3 we discuss the technical details of the project. Here, we describe the use of a NoSQL database to facilitate modification to a WordNet, which has not been done before to the best of our knowledge. In Section 4, we describe how the crowdsourcing system works, including how it gives suggestions to the contributors, simplifying their task. We reflect on some important aspects of the project, including the possibility of adapting the entire system to other languages, in Section 5. We present the details of some related work in Section 6 and provide concluding remarks in Section 7.

2 Developing the Linguistic Infrastructure

Development of linguistic infrastructure was carried out as the first phase of the project. Several discussions with Sinhala language experts were conducted to better understand the key features of the Sinhala language.

2.1 Discussions with Sinhala Linguists

From the beginning of the project the development team collaborated with some prominent experts on the Sinhala language. The basic idea of this collaboration was to acquire the knowledge of the Sinhala language needed to understand the linguistic requirements of a Sinhala WordNet, and to form an expert evaluator panel to help with the crowdsourcing effort in developing the WordNet.

One important topic discussed with the experts was that Sinhala has a significant difference in written and spoken usage. These differences include differences in word usage and differences in grammar. We were particularly interested in differences in word usage in spoken and written forms as grammar rules fall outside the scope of a WordNet. It was observed that words with subtle but important differences are used in the written and spoken forms of Sinhala. For instance, for the sense “man”, මිනිසා (minisa) is the most frequent word used in written Sinhalese while මිනිහා (miniha) is the most frequent word used in spoken Sinhalese. While the difference is subtle (a single phoneme in this case) its implications are significant for a native speaker of Sinhala. In this case, using මිනිසා in normal conversations appears extremely odd. Moreover, such differences are very common and combining words used in spoken and written Sinhala results in very odd phrases.

The problem we faced was whether to include this difference in the Sinhala WordNet. Doing so would go against the main objective of a WordNet, which is to organize words by their meanings; clearly there is no difference in the meanings of මිනිසා and මිනිහා as it is simply a matter of language usage. Despite this concern, we decided to include this difference as a flag for each word due to the following reasons.

1. Not including these in the WordNet would result in the loss of a valuable opportunity to encode these differences in a machine readable manner; the contributors of the crowdsourcing system can do this with little extra effort but doing it as a separate project would require a lot more effort. The importance of this factor is magnified by the lack of commercial interest in Sinhala NLP.

2. Since one of the primary reasons for developing a Sinhala WordNet was to serve the needs of Sinhala linguists we wanted to accommodate their requirements. We suspected that eliminating this type of information would make the WordNet less useful to them. Janssen (2002) has made a similar argument with regard to eliminating gender information from WordNets. Hence, adding this information to the WordNet was seen as a pragmatic move.

pear odd. This is despite the fact that all four words are acceptable in written Sinhala. Thus details of the origin of a word are also included in the Sinhala WordNet. Both the source language and the derivation type (tatsama/tatbawa) are recorded in this regard.

Each noun in Sinhala can be in 9 morphological forms called ‘vibhakthi’ (විභක්ති). Furthermore, there are fairly complicated rules in forming compound words called ‘sandi’ (සන්ධි) and ‘samasa’ (සමාස). The formation of these forms and rules as well as the inflectional forms of a verb are based on the root of the word, which may not be the most commonly used form of the word. Therefore, it was decided to keep the word root as well as the most common morphological form in storing a word in the WordNet.

In summary, we decided to include the following features for each word.

Written/Spoken usage