Building a Corpus for the Zaza–Gorani Language Family

Sina Ahmadi

Insight Centre for Data Analytics, National University of Ireland Galway, Ireland

Objectives

• Provide a description of two endangered languages in the Zaza-Gorani language family: Zazaki and Gorani
• Describe some of the linguistic features of these two languages in comparison to Kurdish
• Create a language corpus for Zazaki and Gorani
• Analyze the curated corpus and compare it with a Kurdish corpus

Introduction

Zazaki and Gorani are two of the main and best-known languages belonging to the Zaza-Gorani language family. Zaza-Gorani languages are not only less-resourced but also deemed endangered. Zazaki, also known as Dimlî, is spoken by an estimated 2 million speakers. Gorani, also written as Gurani, is the language of ∼300,000 speakers.

In this study, we present a corpus for Zazaki and Gorani. Shabaki, the last language in this language family, could not be included because it is extremely under-documented and the least known. The corpus is built on news articles from various sources covering several topics such as science, politics, culture and art. We believe that this resource can pave the way for further developments in the processing of Zaza-Gorani languages in various NLP tasks such as automatic language and dialect identification and spelling and grammatical error correction.

Kurdish vs. Zaza-Gorani

Despite the common belief that Zazaki and Gorani are two dialects of Kurdish, studies indicate a consensus among linguists that they are two distinct languages in their own right [1].

• Kurdish, Zazaki and Gorani are all in the Northwestern branch of the Iranian languages within the Indo-European language family
• These languages and dialects have linguistically influenced each other in various ways, including phonetics, vocabulary and morphology
• Mutual influence is particularly observed between Kurmanji Kurdish and Zazaki, and between Kurdish and Gorani
• There is generally a close feeling among all three ethnic groups, Kurds, Goranis and Zazas, with respect to the Kurdish identity and culture, after many centuries of living together

Figure 1. Zazaki and Gorani are spoken in the red encircled areas (Map based on [3])

Approach

We used the material published on news websites in Zazaki and Gorani to build the first corpus for those two languages. In comparison to the Sorani and Kurmanji dialects of Kurdish, for which many websites are available, there is a very limited number of websites for Zaza-Gorani languages.

Our approach is as follows:

1 We selected websites based on the number of available articles, the availability of metadata in the pages' source and the diversity of the covered topics
2 We extracted the content of the HTML pages and further cleaned it by removing non-relevant information such as URLs, hashtags, contact details and cited sentences in languages other than our target ones, e.g. Koranic verses in Arabic
3 We identified the language or the dialect in which each article is written using a simple classifier that takes the most frequent and unique words in each language as features, e.g. ziwan/zan/zon 'language' for Zazaki and ziman for Kurdish
4 We manually verified the selected articles

Results

Basic statistics

Among the 20 most frequent words in our corpus and the Kurdish corpus of [2], the conjunctions 'and', 'that', the demonstratives 'this', 'that' and the prepositions 'in', 'from', 'until' appear in all the languages.

Statistic              Zazaki        Gorani
articles               4,855         428
word tokens            1,633,770     194,563
word types             102,665       41,454
characters             10,802,266    2,246,425
average word length    4.84          5.50

Zipf's law

Zipf's law, also known as the rank-size distribution, states that in a reasonably large data set, including a language corpus, there is a correlation between word frequencies and word ranks that follows a power law function. It is beneficial for understanding the significance of words in a language.

The following figure illustrates the rank-size distribution in the Zaza-Gorani and Kurdish corpora, where a three-segment pattern is observed equally across the languages.

[Figure: rank-size distribution in the Zaza-Gorani and Kurdish corpora]

References

[1] David Neil MacKenzie. The Dialect of Awroman (Hawraman-i Luhon): Grammatical Sketch, Texts, and Vocabulary. E. Munksgaard, 1966.
[2] Kyumars Sheykh Esmaili and Shahin Salavati. Sorani Kurdish versus Kurmanji Kurdish: an empirical comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013.
[3] Philippe Rekacewicz. The languages of Kurdistan (Les langues du Kurdistan) [in French]. (Date accessed: 23.06.2020), https://www.monde-diplomatique.fr/cartes/langueskurdes edition, 2008.
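The language identification step of the approach (a simple classifier over frequent and unique words) could be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the function names, the tokenizer and the scoring rule are assumptions, and the marker lists contain only the example words given on the poster (ziwan/zan/zon for Zazaki, ziman for Kurdish). A usable classifier would need much larger per-language word lists.

```python
import re
from collections import Counter

# Marker words per language. Only these four words come from the poster;
# real feature lists would be far longer (assumption).
MARKERS = {
    "zazaki": {"ziwan", "zan", "zon"},
    "kurdish": {"ziman"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"\w+", text.lower())


def identify_language(text, markers=MARKERS):
    """Return the language whose marker words occur most often in the text.

    Returns None when no marker word occurs or when the top score is tied,
    so that the article can be left for manual verification.
    """
    counts = Counter(tokenize(text))
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in markers.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)
```

Returning None on a tie or an empty match keeps uncertain articles out of the automatic decision, which fits the manual verification step that follows in the approach.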

Download the corpus

This corpus is publicly available under a CC BY-SA 4.0 license at https://github.com/sinaahmadi/ZazaGoraniCorpus.
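For readers who want to compute comparable figures on the downloaded corpus, the basic statistics and the rank-size analysis could be approximated as below. This is a sketch under stated assumptions: the tokenization and lowercasing are guesses, since the poster does not describe how tokens were counted, and the function names are hypothetical. Zipf's law says the frequency of the word at rank r is roughly proportional to r^(-s), so the slope of log(frequency) against log(rank) estimates -s. A single fitted slope is a simplification, because the poster reports a three-segment pattern.

```python
import math
import re
from collections import Counter


def corpus_statistics(text):
    """Return basic counts and the word-frequency table for a text.

    Tokenization by regex and lowercasing are assumptions, not the
    poster's documented procedure.
    """
    tokens = re.findall(r"\w+", text.lower())
    freqs = Counter(tokens)
    stats = {
        "word tokens": len(tokens),
        "word types": len(freqs),
        "average word length": (
            sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
        ),
    }
    return stats, freqs


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    A value near -1 corresponds to the classic form of Zipf's law.
    """
    ranked = sorted(freqs.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two word types to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Fitting the slope separately over low, middle and high rank ranges would be one way to examine the three-segment pattern mentioned in the Zipf's law section.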