AUTOMATIC KURDISH DIALECTS IDENTIFICATION Hossein Hassani 1,2 and Dzejla Medjedovic 3 1Department of Computer Science and Engineering, University of Kurdistan Hewlêr, Erbil, Kurdistan Region, Iraq 2Department of Computer Science, Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
[email protected] [email protected] 3Department of Computer Science, Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
[email protected] ABSTRACT Automatic dialect identification is a necessary Language Technology for processing multi- dialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually unintelligible. Therefore, to perform computational activities on these languages, the system needs to identify the dialect that is the subject of the process. Kurdish language encompasses various dialects. It is written using several different scripts. The language lacks of a standard orthography. This situation makes the Kurdish dialectal identification more interesting and required, both form the research and from the application perspectives. In this research, we have applied a classification method, based on supervised machine learning, to identify the dialects of the Kurdish texts. The research has focused on two widely spoken and most dominant Kurdish dialects, namely, Kurmanji and Sorani. The approach could be applied to the other Kurdish dialects as well. The method is also applicable to the languages which are similar to Kurdish in their dialectal diversity and differences. KEYWORDS Dialect identification, NLP, Kurdish language, Kurmanji, Sorani 1. I NTRODUCTION Dialectology has not received a considerable attention in Computational Linguistics (CL) and Natural Language Processing (NLP).