Practical Approach on Implementation of Wordnets for South African Languages

Practical Approach on Implementation of WordNets for South African Languages Tshephisho Joseph Sefara Tumisho Billson Mokgonyane Council for Scientific and Industrial Research Department of Computer Science Pretoria, South Africa University of Limpopo, South Africa [email protected] [email protected] Vukosi Marivate Department of Computer Science University of Pretoria, South Africa [email protected] Abstract (b) to artificially expand their dataset by making use of data augmentation in natural language pro- This paper proposes the implementation cessing (NLP) tasks (Marivate and Sefara, 2020), of WordNets for five South African lan- (c) to improve cybercrime investigation in social guages, namely, Sepedi, Setswana, Tshiv- network mining tasks (Iqbal et al., 2019). enda, isiZulu and isiXhosa to be added WordNets have been applied in many domains to open multilingual WordNets (OMW) such as machine learning classification to improve on natural language toolkit (NLTK). The the performance of classification algorithms. An African WordNets are converted from example is the role of WordNets is to increase the Princeton WordNet (PWN) 2.0 to 3.0 to amount of NLP task data via data augmentation match the synsets in PWN 3.0. Af- (Marivate and Sefara, 2020). This has been done ter conversion, there were 7157, 11972, for many well-resourced (European) languages. 1288, 6380, and 9460 lemmas for Sepedi, For low-resource languages such as South African Setswana, Tshivenda, isiZulu and isiX- languages, few studies have been done to build hosa respectively. Setswana, isiXhosa, Se- WordNets under low resources. African WordNets pedi contains more lemmas compared to (Bosch and Griesel, 2017; Griesel et al., 2019) 8 languages in OMW and isiZulu contains is a project that develop aligned WordNets for more lemmas compared to 7 languages in African languages spoken in South Africa. Ini- OMW. A library has been published for tially, the project included five languages Sepedi, continuous development of African Word- Setswana, Tshivenda, isiXhosa and isiZulu. Dur- Nets in OMW using NLTK. ing the development, DEBVisDic (WordNet edi- tor) was used to build semantic networks. Due to 1 Introduction limited resources, the expand model was followed WordNet consists of information about adverbs, during the development of the African WordNets. adjectives, verbs and nouns in English and it or- The expand model in WordNets creation is when ganizes the words according to the notion of a the structure of Princeton WordNet is used to cre- synset. A synset can be defined as a set of words ate other WordNets in other languages. that are interchangeable in certain context. For The goal of this paper is to build a multilingual example, the set fhouse, home, buildingg form a lexical database with WordNets for South African synset since the words can be used interchange- languages based on the Princeton WordNet 3 to be 1 ably referring to the same concept. Synsets can utilized using the Python NLTK library via open 2 be linked to each other by means of semantic re- multilingual WordNets (OMW) . We utilize the lations such as meronymy (leaf-tree), hypernymy WordNets previously built by Bosch and Griesel versus hyponymy relation (flower-rose). The in- (2017) for Sepedi, Setswana, Tshivenda, isiXhosa terlinked synsets create a strong semantic network and isiZulu, to be compatible with OMW standard. that allows researchers (a) to automatically expand The main contributions of this paper are as fol- their search queries in information retrieval tasks 1http://www.nltk.org/ (Azad and Deepak, 2019; Abbache et al., 2016), 2http://compling.hss.ntu.edu.sg/omw/ lows: that improved the performance of NLP applications such as question answering. • A Python library has been released to allow Bosch and Griesel (2017) discussed methods inspection and improvement of the resource. to build WordNets for low-resourced languages 3 The library can be found on Github and when the development of WordNets for South 4 Python repository . African languages was initiated using expand model based on PWN version 2. Authors cre- • We released and published the data set (Se- ated a total of 53982 synsets, 9279 definitions and fara et al., 2020). 28853 usage examples. Due to low-resource en- The outline of this paper is as follows: In Sec- vironment, identification and translation of appro- tion 2 we discuss literature review of WordNets priate synsets was done by a human expert. One and their applications. Section 3 describes the of the method is that authors used bilingual dictio- methodology taken to create the resource. Sec- naries to transfer information from dictionary to tion 4 concludes the paper with future work. WordNet then a linguists make final approval for inclusion in the WordNets. 2 Literature Review The Finnish WordNet is a lexical database for Finnish based on PWN structure (Linden´ and Carl- This section discusses the current WordNets and son, 2010). All word senses in PWN were trans- their applications in various domains. lated into Finnish to make FinnWordNet. The 2.1 WordNets PWN word senses were translated by a human translator to validate the quality of the content. Bond and Foster (2013) created OMW that sup- The translation process is explained by (Linden´ port more than 150 languages. OMW is made by and Carlson, 2010). FinnWordNet has 117659 combining WordNets published with open source synsets and freely available under Creative Com- licenses, Wiktionary data, and Unicode Common mons 3.0 license. Locale Data Repository. The aim of OMW is to provide access to WordNets in multiple languages. 2.2 Applications of WordNets All the WordNets in OMW are linked to PWN Baccianella et al. (2010) annotated all the synsets (Miller, 1995). The OMW and PWN can be ac- of WordNet (Miller, 1995) with respect to the no- cessed through NLTK. tions of positivity, negativity, and neutrality to cre- EuroWordNet is a project created by Vossen ate new dataset called SentiWordNet, an improved (1998) to build multilingual WordNets for Eu- lexical resource that is designed to support opinion ropean languages based on PWN. The goal of mining applications and sentiment classification. EuroWordNet is to create multilingual database, Authors published the dataset on Github6. build WordNets independently, obtain compatibil- Siddharthan et al. (2018) uses WordNet to cre- ity across languages, and to maintain language- ate WordNet-feelings which is a new dataset that specific relations. categorises word senses as feelings. Authors cre- Postma et al. (2016) created an open WordNet ated ten categories and manually annotated the for Dutch that contains a total of 117,914 synsets dataset by adding new categories and definitions. using data from Cornetto database, open source WordNets are used as data sources in search and resources, and the PWN. Authors also created a information retrieval tasks when building a query Python module5 that can be applied to NLP appli- (Azad and Deepak, 2019). Abbache et al. (2016) cations. improved the performance of information retrieval ElKateb et al. (2006) proposed the development system for Arabic language by using WordNet and of WordNet for Arabic language using PWN for association rules to expand the search query. In English as basis. Authors constructed the Ara- their methodology, authors removed stop words bic WordNet by using methods used to develop (functional words) from the query before extract- the EuroWordNet (Vossen, 1998). Regragui et ing and selecting synonyms using Arabic WordNet al. (2016) added new content to Arabic WordNet as the main source for word selection. 3https://github.com/JosephSefara/AfricanWordNet Marivate and Sefara (2020) used WordNet to 4https://pypi.org/project/africanwordnet 5https://github.com/cltl/OpenDutchWordnet 6https://github.com/aesuli/SentiWordNet create data augmentation technique for NLP clas- Algorithm 1: Sense map conversion sification applications. Authors compared the Input: s: sense map file technique with semantic similarity augmentation Output: s^ list containing pair of source and round-trip augmentation. The WordNet- and target offset ID based augmentation improved the performance of 1 def mono(s): the classification models when using Wikipedia 2 Let F Open(s) be a to file reader; dataset. The same WordNet-based augmentation 3 for line in F : was used by Zhang et al. (2015) to train a temporal 4 SourceOffset use regular convolutional network that learns text understand- expression to match source offset ing from character level input up to an abstract text ID from line; concepts. Hasan et al. (2020) applied semantic 5 T argetOffset use regular similarity of WordNet to manage the ambiguity in expression to match target offset social media text by selecting informative features ID from line; to enhance semantic representation. 6 s^ [Sourceoffset, T argetoffset]; 3 Methodology 7 return(s^); This section discusses the design and implementation of the WordNets for South African lan- Table 1: Sample of the sense mapping guages. It first discusses sense map preparation, 2.0 2.1 3.0 then WordNets conversion, and finally implemen- 12976279-n 13571065-n 13752172-n tation. 12976532-n 13571318-n 13752443-n 3.1 Sense map preparation 3.2 WordNets conversion In this section, we explain the sense map preparation process. This section discusses conversion of African We used the sensemap(5WN) published on WordNets to PWN 3.0 and explain OMW format. PWN website 7 for versions 2.0, 2.1, and 3.0. We collected the WordNets created by Bosch Sense map simply list each 2.0 noun sense (en- and Griesel (2017) from South African Centre for 8 coded as a sense key) paired with its mapping to Digital Language Resources (SADiLaR) . SADi- one or more 2.1 noun senses. We converted all the LaR is a national center supported and funded by polysemous (nouns and verbs) and monosemous the South African Department of Science and In- (nouns and verbs) to 2.1.

Load more