
Practical Approach on Implementation of WordNets for South African Languages

Tshephisho Joseph Sefara
Council for Scientific and Industrial Research
Pretoria, South Africa
[email protected]

Tumisho Billson Mokgonyane
Department of Computer Science
University of Limpopo, South Africa
[email protected]

Vukosi Marivate
Department of Computer Science
University of Pretoria, South Africa
[email protected]

Abstract

This paper proposes the implementation of WordNets for five South African languages, namely Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa, to be added to open multilingual WordNets (OMW) on the natural language toolkit (NLTK). The African WordNets are converted from Princeton WordNet (PWN) 2.0 to 3.0 to match the synsets in PWN 3.0. After conversion, there were 7157, 11972, 1288, 6380, and 9460 lemmas for Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa respectively. Setswana, isiXhosa and Sepedi each contain more lemmas than 8 languages in OMW, and isiZulu contains more lemmas than 7 languages in OMW. A library has been published for continuous development of African WordNets in OMW using NLTK.

1 Introduction

WordNet consists of information about adverbs, adjectives, verbs and nouns in English, and it organizes the words according to the notion of a synset. A synset can be defined as a set of words that are interchangeable in a certain context. For example, the set {house, home, building} forms a synset since the words can be used interchangeably to refer to the same concept. Synsets can be linked to each other by means of semantic relations such as meronymy (leaf-tree) and hypernymy and hyponymy (flower-rose). The interlinked synsets create a strong resource that allows researchers (a) to automatically expand their search queries in information retrieval tasks (Azad and Deepak, 2019; Abbache et al., 2016), (b) to artificially expand their dataset by making use of data augmentation in natural language processing (NLP) tasks (Marivate and Sefara, 2020), and (c) to improve cybercrime investigation in social network mining tasks (Iqbal et al., 2019).

WordNets have been applied in many domains, such as text classification, to improve the performance of classification algorithms. For example, WordNets can be used to increase the amount of data for an NLP task via data augmentation (Marivate and Sefara, 2020). This has been done for many well-resourced (European) languages. For low-resource languages such as South African languages, few studies have built WordNets under low-resource conditions. African WordNets (Bosch and Griesel, 2017; Griesel et al., 2019) is a project that develops aligned WordNets for African languages spoken in South Africa. Initially, the project included five languages: Sepedi, Setswana, Tshivenda, isiXhosa and isiZulu. During the development, DEBVisDic (a WordNet editor) was used to build semantic networks. Due to limited resources, the expand model was followed during the development of the African WordNets. In the expand model of WordNet creation, the structure of the Princeton WordNet is used to create WordNets in other languages.

The goal of this paper is to build a multilingual lexical database with WordNets for South African languages, based on the Princeton WordNet 3.0, to be utilized using the Python NLTK library [1] via open multilingual WordNets (OMW) [2]. We convert the WordNets previously built by Bosch and Griesel (2017) for Sepedi, Setswana, Tshivenda, isiXhosa and isiZulu to be compatible with the OMW standard.

[1] http://www.nltk.org/
[2] http://compling.hss.ntu.edu.sg/omw/

The main contributions of this paper are as follows:

• A Python library has been released to allow inspection and improvement of the resource. The library can be found on Github [3] and the Python repository [4].

• We released and published the data set (Sefara et al., 2020).

[3] https://github.com/JosephSefara/AfricanWordNet
[4] https://pypi.org/project/africanwordnet

The outline of this paper is as follows: Section 2 reviews the literature on WordNets and their applications. Section 3 describes the methodology taken to create the resource. Section 4 concludes the paper with future work.

2 Literature Review

This section discusses the current WordNets and their applications in various domains.

2.1 WordNets

Bond and Foster (2013) created OMW, which supports more than 150 languages. OMW is made by combining WordNets published with open source licenses, Wiktionary data, and the Unicode Common Locale Data Repository. The aim of OMW is to provide access to WordNets in multiple languages. All the WordNets in OMW are linked to PWN (Miller, 1995). The OMW and PWN can be accessed through NLTK.

EuroWordNet is a project created by Vossen (1998) to build multilingual WordNets for European languages based on PWN. The goals of EuroWordNet are to create a multilingual database, build WordNets independently, obtain compatibility across languages, and maintain language-specific relations.

Postma et al. (2016) created an open WordNet for Dutch that contains a total of 117,914 synsets using data from the Cornetto database, open source resources, and the PWN. The authors also created a Python module [5] that can be applied to NLP applications.

[5] https://github.com/cltl/OpenDutchWordnet

ElKateb et al. (2006) proposed the development of a WordNet for the Arabic language using PWN for English as a basis. The authors constructed the Arabic WordNet by using the methods used to develop EuroWordNet (Vossen, 1998). Regragui et al. (2016) added new content to the Arabic WordNet that improved the performance of NLP applications.

Bosch and Griesel (2017) discussed methods to build WordNets for low-resourced languages, describing how the development of WordNets for South African languages was initiated using the expand model based on PWN version 2. The authors created a total of 53982 synsets, 9279 definitions and 28853 usage examples. Due to the low-resource environment, identification and translation of appropriate synsets was done by a human expert. In one of the methods, the authors used bilingual dictionaries to transfer information from dictionary to WordNet, after which a linguist gave final approval for inclusion in the WordNets.

The Finnish WordNet is a lexical database for Finnish based on the PWN structure (Lindén and Carlson, 2010). All word senses in PWN were translated into Finnish to make FinnWordNet. The PWN word senses were translated by a human translator to validate the quality of the content. The translation process is explained by Lindén and Carlson (2010). FinnWordNet has 117659 synsets and is freely available under a Creative Commons 3.0 license.

2.2 Applications of WordNets

Baccianella et al. (2010) annotated all the synsets of WordNet (Miller, 1995) with respect to the notions of positivity, negativity, and neutrality to create a new dataset called SentiWordNet, an improved lexical resource that is designed to support opinion mining applications and sentiment classification. The authors published the dataset on Github [6].

[6] https://github.com/aesuli/SentiWordNet

Siddharthan et al. (2018) used WordNet to create WordNet-feelings, a new dataset that categorises word senses as feelings. The authors created ten categories and manually annotated the dataset by adding new categories and definitions.

WordNets are used as data sources in search and information retrieval tasks when building a query (Azad and Deepak, 2019). Abbache et al. (2016) improved the performance of an information retrieval system for the Arabic language by using WordNet and association rules to expand the search query. In their methodology, the authors removed stop words (functional words) from the query before extracting and selecting synonyms, using the Arabic WordNet as the main source for word selection.

Marivate and Sefara (2020) used WordNet to create a data augmentation technique for NLP classification applications. The authors compared the technique with word embedding-based augmentation and round-trip translation augmentation. The WordNet-based augmentation improved the performance of the classification models on the Wikipedia dataset. WordNet-based augmentation was also used by Zhang et al. (2015) to train a temporal convolutional network that learns text understanding from character-level input up to abstract text concepts. Hasan et al. (2020) applied the semantic similarity of WordNet to manage the ambiguity in social media text by selecting informative features to enhance semantic representation.

3 Methodology

This section discusses the design and implementation of the WordNets for South African languages. It first discusses sense map preparation, then WordNets conversion, and finally implementation.

Algorithm 1: Sense map conversion
  Input: s, a sense map file
  Output: ŝ, a list containing pairs of source and target offset IDs

  def mono(s):
      ŝ ← empty list
      F ← Open(s), a file reader
      for line in F:
          SourceOffset ← use a regular expression to match the source offset ID in line
          TargetOffset ← use a regular expression to match the target offset ID in line
          append [SourceOffset, TargetOffset] to ŝ
      return ŝ

Table 1: Sample of the sense mapping

2.0          2.1          3.0
12976279-n   13571065-n   13752172-n
12976532-n   13571318-n   13752443-n
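As an illustration, Algorithm 1 could be written in Python roughly as follows. This is a minimal sketch, not the released implementation. It assumes that each sense map line contains fields of the form `sense_key;offset` (optionally followed by `;sense_number`) with 8-digit offsets, and that the first field on a line is the source and the rest are targets. The function names `read_sense_map` and `compose` and the sense key `placeholder%1:23:00::` are hypothetical; only the offsets come from Table 1.

```python
import re

# Assumed field layout: "<sense_key>;<8-digit offset>[;<sense number>]".
OFFSET_FIELD = re.compile(r";(\d{8})(?:;\d+)?(?=\s|$)")


def read_sense_map(lines, pos):
    """Return de-duplicated (source, target) offset-ID pairs such as
    ('12976279-n', '13571065-n'). `pos` is the part-of-speech letter."""
    pairs, seen = [], set()
    for line in lines:
        offsets = OFFSET_FIELD.findall(line)
        if len(offsets) < 2:
            continue  # the line carries no source/target mapping
        source, targets = offsets[0], offsets[1:]
        for target in targets:
            pair = (f"{source}-{pos}", f"{target}-{pos}")
            if pair not in seen:  # drop duplicate offset targets
                seen.add(pair)
                pairs.append(pair)
    return pairs


def compose(first, second):
    """Chain 2.0->2.1 pairs with 2.1->3.0 pairs into 2.0->3.0 pairs."""
    by_source = {}
    for source, target in second:
        by_source.setdefault(source, []).append(target)
    chained = []
    for source, middle in first:
        # An offset missing from the second map is carried over unchanged.
        for target in by_source.get(middle, [middle]):
            if (source, target) not in chained:
                chained.append((source, target))
    return chained


if __name__ == "__main__":
    # Hypothetical lines; the sense key is a placeholder, offsets are from Table 1.
    map_20_21 = ["placeholder%1:23:00::;12976279 placeholder%1:23:00::;13571065\n"]
    map_21_30 = ["placeholder%1:23:00::;13571065 placeholder%1:23:00::;13752172\n"]
    step1 = read_sense_map(map_20_21, "n")
    step2 = read_sense_map(map_21_30, "n")
    print(compose(step1, step2))
```

Running the reader once per version step and then composing the two results mirrors the two passes (2.0→2.1, then 2.1→3.0) described for Algorithm 1.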

3.1 Sense map preparation

In this section, we explain the sense map preparation process.

We used the sensemap(5WN) files published on the PWN website [7] for versions 2.0, 2.1, and 3.0. A sense map simply lists each 2.0 noun sense (encoded as a sense key) paired with its mapping to one or more 2.1 noun senses. We first converted all the polysemous and monosemous nouns and verbs from 2.0 to 2.1. We then used the 3.0 sense maps to convert all the synsets from 2.1 to 3.0. The synsets that do not appear in any of the sense maps are used as is.

Algorithm 1 illustrates the steps taken during conversion of the sense maps. The algorithm was run twice: the first time to convert 2.0→2.1, after which duplicate offset targets in 2.1 were removed, and the second time to convert 2.1→3.0, after which duplicate offset targets in 3.0 were also removed. Table 1 shows a sample of the converted offsets, which are later used to match every synset in the African WordNets from 2.0 to 3.0.

[7] https://wordnet.princeton.edu/documentation/senseidx5wn

3.2 WordNets conversion

This section discusses the conversion of the African WordNets to PWN 3.0 and explains the OMW format.

We collected the WordNets created by Bosch and Griesel (2017) from the South African Centre for Digital Language Resources (SADiLaR) [8]. SADiLaR is a national center supported and funded by the South African Department of Science and Innovation. The WordNets are in XML format and based on PWN version 2.

Since the WordNets are in XML format, we used a library called BeautifulSoup [9] to extract every synset offset ID, part-of-speech tag, lemma, and word form. Table 2 shows the number of synsets before and after conversion to 3.0, excluding the synsets that do not exist in PWN. The number of synsets increased: isiZulu by 90, isiXhosa by 150, Sepedi by 101, Setswana by 240, and Tshivenda by 16. The increase is caused by synsets that have multiple mappings in PWN 3.0. We saved the new synsets in a format that is supported by OMW. The OMW format is as follows:

offset-pos langcode:lemma wordform

where offset is the unique ID (linking to the PWN), langcode is the universal language code [10], wordform is the written word, and pos is the part-of-speech.

[8] https://www.sadilar.org/
[9] https://pypi.org/project/beautifulsoup4/
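The conversion step above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code. It assumes the three fields of an OMW line are separated by tabs and that the middle field is the language code followed by the literal string `lemma`. The function name `to_omw_lines`, the word forms `form_a`, `form_b`, `form_c`, and the offsets `99999999-n` and `00000001-n` are hypothetical placeholders; the other offsets come from Table 1.

```python
def to_omw_lines(synsets, mapping, lang):
    """Convert (offset_id, word_form) pairs to OMW-style lines.

    synsets: iterable of (PWN 2.0 offset ID, word form) pairs
    mapping: dict from a PWN 2.0 offset ID to a list of PWN 3.0 offset IDs
    lang:    ISO language code such as 'nso'
    """
    lines = []
    for offset_id, word_form in synsets:
        # An offset missing from the mapping is kept as is. An offset with
        # several targets yields one line per target, which is how the number
        # of synsets can grow during conversion.
        for target in mapping.get(offset_id, [offset_id]):
            lines.append(f"{target}\t{lang}:lemma\t{word_form}")
    return lines


if __name__ == "__main__":
    # '99999999-n' is a made-up second target used only to show a multiple mapping.
    mapping = {
        "12976279-n": ["13752172-n"],
        "12976532-n": ["13752443-n", "99999999-n"],
    }
    synsets = [("12976279-n", "form_a"), ("12976532-n", "form_b")]
    for line in to_omw_lines(synsets, mapping, "nso"):
        print(line)
```

In this sketch two input synsets produce three output lines, because the second offset maps to two targets.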

Table 2: Conversion of synsets

Language    Original Synsets   New Synsets
isiZulu     9026               9116
isiXhosa    13731              13881
Sepedi      10647              10748
Setswana    22234              22474
Tshivenda   1581               1597

Figure 1: An extract of the converted WordNet for Sepedi using OMW format

An example of the formatted synsets, compatible with OMW in NLTK, is depicted in Figure 1. OMW consists of 29 languages in NLTK; Table 3 lists their lemma counts together with those of the five African WordNets. There are 8 languages in OMW that contain fewer lemmas than each of Setswana, isiXhosa, and Sepedi. IsiZulu contains more lemmas than 7 languages in OMW, while Tshivenda has the fewest lemmas of all the languages.

Table 3: OMW in NLTK

Language   Lemmas     Language   Lemmas
eng        147306     glg        23124
fin        129839     ell        18225
jpn        89637      arb        17785
tha        80508      fas        17560
cmn        61532      tsn        11972
fra        55350      xho        9460
por        54069      nso        7157
cat        46531      bul        6720
pol        45387      zul        6380
nld        43077      als        5988
ita        41855      swe        5824
slv        41032      heb        5325
ind        36954      dan        4468
spa        36681      nob        4186
zsm        33932      nno        3387
hrv        29010      qcn        3206
eus        26240      ven        1288

3.3 Implementation

This section discusses the implementation of the African WordNets in NLTK.

A total of 5 files (sample shown in Figure 1) have been created, one for each of the 5 languages to be added to NLTK. The files have been named according to the following format:

• Sepedi: wn-data-nso.tab
• Setswana: wn-data-tsn.tab
• isiXhosa: wn-data-xho.tab
• isiZulu: wn-data-zul.tab
• Tshivenda: wn-data-ven.tab

where each file resides in a directory inside the OMW corpus in NLTK, and the directory is named according to the ISO language code. The ISO language code for Sepedi is nso, Setswana is tsn, isiXhosa is xho, isiZulu is zul, and Tshivenda is ven.

A Python helper library [11] has been created to install these African WordNets to OMW in NLTK. The African WordNets can be used like other WordNets on OMW. For example, once the library has been imported into the environment, the following statements show the lemma names of the word 'entity' in Setswana:

    >>> from nltk.corpus import wordnet
    >>> import africanwordnet
    >>> wordnet.synset('entity.n.01').lemmas('tsn')
    [Lemma('entity.n.01.selô'),
     Lemma('entity.n.01.sengwe')]

Listing 1: Lemma example

The following statement is used to view the synsets of the isiZulu word 'iqoqo' (meaning collection).

    >>> from nltk.corpus import wordnet
    >>> import africanwordnet
    >>> wordnet.synsets('iqoqo', lang='zul')
    [Synset('whole.n.02'), Synset('conspectus.n.01'),
     Synset('overview.n.01'),
     Synset('sketch.n.03'), Synset('compilation.n.01'),
     Synset('collection.n.01'),
     Synset('team.n.02'), Synset('set.n.01')]

Listing 2: Synonym example

The following statements are used to view the hyponyms of the Sepedi word 'taelo' (meaning edict).

    >>> from nltk.corpus import wordnet as wn
    >>> import africanwordnet
    >>> synsets = wn.synsets('taelo', lang='nso')
    >>> for synset in synsets:
    ...     for hypo in synset.hyponyms():
    ...         for lemma in hypo.lemmas('nso'):
    ...             print(lemma)
    Lemma('behest.n.01.tlhalošo')
    Lemma('commandment.n.01.molao')
    Lemma('commandment.n.01.taelo')
    Lemma('commission.n.06.taelo')
    Lemma('injunction.n.01.taelo')
    Lemma('order.n.01.taelo')
    Lemma('summons.n.02.tagafalo')

Listing 3: Hyponym example

The following statements are used to view the hypernyms of the isiXhosa word 'omisa' (meaning dry).

    >>> from nltk.corpus import wordnet as wn
    >>> import africanwordnet
    >>> synsets = wn.synsets('omisa', lang='xho')
    >>> for synset in synsets:
    ...     for hyper in synset.hypernyms():
    ...         for lemma in hyper.lemmas('xho'):
    ...             print(lemma)
    Lemma('dry.v.01.omisa')
    Lemma('change.v.01.guqula')
    Lemma('change.v.01.tshintsha')
    Lemma('change_integrity.v.01.guqula_imfezeko')

Listing 4: Hypernyms example

4 Conclusion and Future Work

This paper presented the implementation of African WordNets to be used in NLTK via OMW. We discussed the conversion of PWN sense maps from 2.0 to 2.1 to 3.0. The number of synsets increased during conversion. We proposed an algorithm that helps to convert synsets from PWN 2.0 to 3.0. A Python library has been made available [12] to utilize the WordNets.

Future work will focus on

• improving the conversion of PWN sense maps from 2.0 to 3.0 so that all synsets are available in 3.0. Kim et al. (2018) proposed automatic mapping of synsets using bilingual dictionaries. Due to the limited availability of bilingual dictionaries, this method could not be utilized.

• evaluating the African WordNets using various evaluation methods. Ramanand and Bhattacharyya (2007) proposed a method to evaluate synsets using dictionary definitions. Since there are currently not enough dictionaries for these languages, this method could not be utilized.

[10] https://www.loc.gov/standards/iso639-2/php/code_list.php
[11] https://pypi.org/project/africanwordnet
[12] https://pypi.org/project/africanwordnet

Acknowledgments

We would like to thank SADiLaR for making the datasets available for research. We would also like to thank Github for hosting the developed library.

References

Ahmed Abbache, Farid Meziane, Ghalem Belalem, Fatma Zohra Belkredim, et al. 2016. Arabic query expansion using WordNet and association rules. International Journal of Intelligent Information Technologies (IJIIT), 12(3):51–64.

Hiteshwar Kumar Azad and Akshay Deepak. 2019. A new approach for query expansion using Wikipedia and WordNet. Information Sciences, 492:147–163.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1352–1362.

Sonja E Bosch and Marissa Griesel. 2017. Strategies for building wordnets for under-resourced languages: The case of African languages. Literator (Potchefstroom. Online), 38(1):1–12.

Sabry ElKateb, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Building a wordnet for Arabic. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 29–34.

Marissa Griesel, Sonja Bosch, and Mampaka L Mojapelo. 2019. Thinking globally, acting locally–progress in the African wordnet project. In Proceedings of the Tenth Global Wordnet Conference, pages 191–196.

Ali Muttaleb Hasan, Noorhuzaimi Mohd Noor, Taha Hussein Rassem, Shahrul Azman Mohd Noah, and Ahmed Muttaleb Hasan. 2020. A proposed method using the semantic similarity of wordnet 3.1 to handle the ambiguity to apply in social media text. In Information Science and Applications, pages 471–483. Springer.

Farkhund Iqbal, Benjamin CM Fung, Mourad Debbabi, Rabia Batool, and Andrew Marrington. 2019. Wordnet-based criminal networks mining for cybercrime investigation. IEEE Access, 7:22740–22755.

Jiseong Kim, Younggyun Hahm, Sunggoo Kwon, and Key-Sun Choi. 2018. Automatic wordnet mapping: from corenet to princeton wordnet. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Krister Lindén and Lauri Carlson. 2010. Finnwordnet-wordnet på finska via översättning. LexicoNordica – Nordic Journal of Lexicography, 17:119–140.

Vukosi Marivate and Tshephisho Sefara. 2020. Improving short text classification through global augmentation methods. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 385–399. Springer.

George A Miller. 1995. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41.

MC Postma, E Miltenburg, R Segers, A Schoen, and PTJM Vossen. 2016. Open Dutch wordnet. In Proceedings of the Eighth Global Wordnet Conference.

J Ramanand and Pushpak Bhattacharyya. 2007. Towards automatic evaluation of wordnet synsets. GWC 2008, page 360.

Yasser Regragui, Lahsen Abouenour, Fettoum Krieche, Karim Bouzoubaa, and Paolo Rosso. 2016. Arabic wordnet: New content and new applications. In Proceedings of the Eighth Global WordNet Conference, pages 330–338.

Tshephisho Sefara, Tumisho Mokgonyane, and Vukosi Marivate. 2020. Wordnets for South African languages. Zenodo, December.

Advaith Siddharthan, Nicolas Cherbuin, Paul J Eslinger, Kasia Kozlowska, Nora A Murphy, and Leroy Lowe. 2018. WordNet-feelings: a linguistic categorisation of human feelings. arXiv preprint arXiv:1811.02435.

Piek Vossen. 1998. Introduction to EuroWordNet, pages 1–17. Springer Netherlands, Dordrecht.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.