KLPT – Kurdish Language Processing Toolkit
KLPT – Kurdish Language Processing Toolkit
Sina Ahmadi Insight Centre for Data Analytics National University of Ireland Galway [email protected]
Abstract Therefore, the development of underlying tasks in NLP for a specific language will potentially pave Despite the recent advances in applying the way for further contributions to the field, by ei- language-independent approaches to various ther improving the current tools or further progress natural language processing tasks thanks to in new tasks. For instance, tokenization as a fun- artificial intelligence, some language-specific tools are still essential to process a language in damental task is widely required in many other a viable manner. Kurdish language is a less- applications such as part-of-speech tagging, ma- resourced language with a notable diversity in chine translation and syntactic analysis. Once ad- dialects and scripts and lacks basic language dressed, future researchers can build upon it for processing tools. To address this issue, we in- more advanced tasks or eventually improve it. troduce a language processing toolkit to han- Despite a plethora of performant tools and spe- dle such a diversity in an efficient way. Our toolkit is composed of fundamental compo- cific frameworks for NLP, such as NLTK (Loper nents such as text preprocessing, stemming, and Bird, 2002), Stanza (Qi et al., 2020), Teanga tokenization, lemmatization and transliteration (Ziad et al., 2018) and spaCy2, the progress with and is able to get further extended by future respect to less-resourced languages is often hin- developers. This project is publicly available1. dered by not only the lack of basic tools and re- sources but also the accessibility of the previous 1 Introduction studies under an open-source licence. This is Language technology is an increasingly impor- particularly the case of Kurdish, a less-resourced tant field in our information era which is depen- Indo-European language that is the focus of the dent on our knowledge of the human language current paper. As an example, although the task and computational methods to process it. Un- of spell-checking and stemming for Kurdish have like the latter which undergoes constant progress been addressed by many previous studies, (Jaf with new methods and more efficient techniques and Ramsay, 2014; Salavati and Ahmadi, 2018; being invented, the processability of human lan- Mustafa and Rashid, 2018; Saeed et al., 2018a; guages does not evolve with the same pace. This Hawezi et al., 2019) to mention but a few, none is particularly the case of languages with scarce re- of them provides an implementation of their tool sources and limited grammars, also known as less- under any licence. resourced languages. On the other hand, some previous studies use Various natural language processing (NLP) specific frameworks that are hardly integrable and tasks are of pipeline architecture; that is, to ad- inter-operable. For instance, (Walther and Sagot, dress a specific task, a few other language pro- 2010) and (Walther et al., 2010) describe their cessing tasks may be initially required (Manning efforts in developing a large-scale morphological et al., 2014). With the current advances in the lexicon and a part-of-speech tagger for Kurdish open-source movements, more researchers and in- within the Alexina framework under the LGPL-LR dustrial developers are encouraged to share their licence. Despite the valuable impact of this study knowledge in an open-source manner, accessi- in the field, for example in (Cotterell et al., 2017) ble under certain conditions (Ljungberg, 2000). and (Gökırmak and Tyers, 2017), the tool does not
1https://github.com/sinaahmadi/klpt 2https://github.com/explosion/spaCy
72 Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 72–84 Virtual Conference, November 19, 2020. c 2020 Association for Computational Linguistics > > S Z Z ì R S Q è G P (a) Consonants