KLPT – Kurdish Language Processing Toolkit
Total Page:16
File Type:pdf, Size:1020Kb
KLPT – Kurdish Language Processing Toolkit Sina Ahmadi Insight Centre for Data Analytics National University of Ireland Galway [email protected] Abstract Therefore, the development of underlying tasks in NLP for a specific language will potentially pave Despite the recent advances in applying the way for further contributions to the field, by ei- language-independent approaches to various ther improving the current tools or further progress natural language processing tasks thanks to in new tasks. For instance, tokenization as a fun- artificial intelligence, some language-specific tools are still essential to process a language in damental task is widely required in many other a viable manner. Kurdish language is a less- applications such as part-of-speech tagging, ma- resourced language with a notable diversity in chine translation and syntactic analysis. Once ad- dialects and scripts and lacks basic language dressed, future researchers can build upon it for processing tools. To address this issue, we in- more advanced tasks or eventually improve it. troduce a language processing toolkit to han- Despite a plethora of performant tools and spe- dle such a diversity in an efficient way. Our toolkit is composed of fundamental compo- cific frameworks for NLP, such as NLTK (Loper nents such as text preprocessing, stemming, and Bird, 2002), Stanza (Qi et al., 2020), Teanga tokenization, lemmatization and transliteration (Ziad et al., 2018) and spaCy2, the progress with and is able to get further extended by future respect to less-resourced languages is often hin- developers. This project is publicly available1. dered by not only the lack of basic tools and re- sources but also the accessibility of the previous 1 Introduction studies under an open-source licence. This is Language technology is an increasingly impor- particularly the case of Kurdish, a less-resourced tant field in our information era which is depen- Indo-European language that is the focus of the dent on our knowledge of the human language current paper. As an example, although the task and computational methods to process it. Un- of spell-checking and stemming for Kurdish have like the latter which undergoes constant progress been addressed by many previous studies, (Jaf with new methods and more efficient techniques and Ramsay, 2014; Salavati and Ahmadi, 2018; being invented, the processability of human lan- Mustafa and Rashid, 2018; Saeed et al., 2018a; guages does not evolve with the same pace. This Hawezi et al., 2019) to mention but a few, none is particularly the case of languages with scarce re- of them provides an implementation of their tool sources and limited grammars, also known as less- under any licence. resourced languages. On the other hand, some previous studies use Various natural language processing (NLP) specific frameworks that are hardly integrable and tasks are of pipeline architecture; that is, to ad- inter-operable. For instance, (Walther and Sagot, dress a specific task, a few other language pro- 2010) and (Walther et al., 2010) describe their cessing tasks may be initially required (Manning efforts in developing a large-scale morphological et al., 2014). With the current advances in the lexicon and a part-of-speech tagger for Kurdish open-source movements, more researchers and in- within the Alexina framework under the LGPL-LR dustrial developers are encouraged to share their licence. Despite the valuable impact of this study knowledge in an open-source manner, accessi- in the field, for example in (Cotterell et al., 2017) ble under certain conditions (Ljungberg, 2000). and (Gökırmak and Tyers, 2017), the tool does not 1https://github.com/sinaahmadi/klpt 2https://github.com/explosion/spaCy 72 Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 72–84 Virtual Conference, November 19, 2020. c 2020 Association for Computational Linguistics > > S Z Z ì R S Q è G P (a) Consonants a: e: I i: o: u: U 0 (b) Vowels Table 1: A comparison of the Kurdish alphabets. Variations are specified with "/" seem to be widely used in the subsequent projects. Kurdish people that those are in fact different di- As such, projects such as (Jaf and Ramsay, 2014) alects of the Kurdish language (Haig and Matras, and (Ahmadi and Hassani, 2020a) tackle the very 2002; Matras, 2017). In this study, we remain with same topic from scratch. this theory and refer to them as Kurdish dialects. Language-specific toolkits have been previ- It is worth mentioning that despite the linguistic ously designed for various languages, such as similarities of Zazaki, also known as Dimlî, and IceNLP for Icelandic (Loftsson and Rögnvalds- Gorani languages and the popular belief that they son, 2007), VnCoreNLP for Vietnamese (Vu et al., are dialects of Kurdish, studies show that they be- 2018), FudanNLP for Chinese (Qiu et al., 2013), long to the Zaza-Gorani language family which PSI-Toolkit for Polish (Gralinski´ et al., 2013) and is independent from the Kurdish language (Paul, ParsiPardaz for Persian (Sarabi et al., 2013). In the 1998; Jugel, 2014; Ahmadi, 2020c). same vein, in order to facilitate the basic language Kurdish has been historically written in various processing tasks for Kurdish in an organized and scripts, namely Cyrillic, Armenian, Latin and Ara- methodical way and aware of the increasing im- bic among which the latter two are still widely portance of open-source and inter-operable tools in use. Efforts in standardization of the Kurdish for building more efficient systems and get fur- alphabets and orthographies have not succeeded ther advanced in the field, we present KLPT–the to be globally followed by all Kurdish speakers Kurdish language processing toolkit. This toolkit in all regions (Tavadze, 2019; Haig and Matras, is developed in Python and is composed of core 2002; Aydogan˘ , 2012). As such, the Kurmanji di- modules and is extendable by future developers. alect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are 2 Kurdish Language mostly written in the Arabic-based script. That, Kurdish belongs to the Northwestern branch of the not only scatters readers and speakers to commu- Iranian languages within the Indo-European lan- nicate together, but also creates further challenges guage family which is spoken by 20-30 million in processing the language (Esmaili, 2012; Ah- speakers in the Kurdish regions of Turkey, Iraq, madi, 2019). Table1 provides the Latin-based and Iran and Syria and also, among the Kurdish di- Arabic-based Kurdish alphabets used for all the di- aspora around the world (Ahmadi et al., 2019). alects. The division of Kurdish into Northern Kurdish (or Kurdish language is a highly inflectional lan- Kurmanji), Central Kurdish (or Sorani), Southern guage, particularly due to a high number of affixes Kurdish and Laki, respectively with kmr, ckb, and clitics (Ahmadi and Hassani, 2020b). Regard- sdh and lki ISO 639-3 language codes, has ing nouns, although Sorani does not have gender been widely studied previously (Edmonds, 2013). or grammatical cases, it has a full article mark- Based on the structural differences between these, ing system for definite, indefinite and demonstra- some scholars believe that they are distinct lan- tive in singular and plural forms (Jugel, 2014). guages and therefore, refer to them as Kurdish lan- On the other hand, Kurmanji has a fewer num- guages (Kreyenbroek, 2005). On the other hand, ber of article markers for feminine and masculine it is also commonly believed by both scholars and genders (Thackston, 2006). With respect to the 73 Information retrieval and Text mining 3:63% Lexical resources 22.22 % 10:90% Dialectology 18.51 % 7:27% 9.25 % 3.70 % Sign language recognition 3.70 % 78:18% Machine Translation 11.11 % 12.96 % 1.85 % Optical character recognition 18.8 % Speech recognition Sorani Sorani and Kurmanji Other Kurmanji Sorani, Kurmanji and Gorani Morphological and syntactic analysis (a) per sub-field in NLP (b) per dialect Figure 1: Proportion of publications related to Kurdish language processing verbs, Kurdish has a few number of around 300 1998; Karimi, 2014). Unlike Kurmanji which single-word verbs (Walther and Sagot, 2010), e.g. uses oblique cases for this purpose, Sorani only kirdin/kirin “to do”, which are inflected based on uses different pronominal markers to specify erga- person (1,2,3, SG, PL), tense (past, present, future), tivity, therefore it is called split-ergative (Esmaili aspect (indefinite, perfect, progressive, imperfec- and Salavati, 2013). Except the past tenses, a tive) and mood (indicative, subjunctive, condi- nominative-accusative alignment is observed in tional). Unlike Kurmanji, Sorani Kurdish does other tenses. not have future tense and uses adverbs for this Not being equally documented and used, Kur- purpose. However, Kurdish extensively takes use dish dialects have different levels of linguistic of compound constructions for creating new verb resourcefulness. In comparison to Sorani and forms, particularly with (Noun + Verb), (Adjective Kurmanji which are widely used by the media + Verb) and (Preposition + Verb) forms (Traida, and press, Southern Kurdish and Laki are under- 2007). For instance, siław ‘hi (n)’, pîroz ‘holy’ documented and lack basic language resources (adj) and heł (verbal particle denoting ‘up’) with such as electronic dictionaries and corpora (Fat- the single-word verb kirdin can respectively form tah, 2000; Ahmadi et al., 2019; Ahmadi, 2020c). compound verbs siław kirdin “to greet”, pîroz kirdin “to congratulate” and heł kirdin “to turn 3 Current State of Kurdish Language on”. The stringing characteristic of the Arabic- Processing based script of Kurdish further adds to this mor- phological complexity in such a way that several The earliest works in the field of Kurdish language word forms may be concatenated together (Ah- processing date back to 2009. Our literature re- madi, 2020b). view indicates that some of these contributions fail to provide open-source solutions.