Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts Areej Alshutayri , Eric Atwell , AbdulRahman AlOsaimy James Dickins, Michael Ingleby and Janet Watson University of Leeds, LS2 9JT, UK
[email protected],
[email protected] ,
[email protected],
[email protected],
[email protected],
[email protected] Abstract This paper describes an Arabic dialect identification system which we developed for the Dis- criminating Similar Languages (DSL) 2016 shared task. We classified Arabic dialects by using Waikato Environment for Knowledge Analysis (WEKA) data analytic tool which contains many alternative filters and classifiers for machine learning. We experimented with several classifiers and the best accuracy was achieved using the Sequential Minimal Optimization (SMO) algorithm for training and testing process set to three different feature-sets for each testing process. Our approach achieved an accuracy equal to 42.85% which is considerably worse in comparison to the evaluation scores on the training set of 80-90% and with training set 60:40 percentage split which achieved accuracy around 50%. We observed that Buckwalter transcripts from the Saar- land Automatic Speech Recognition (ASR) system are given without short vowels, though the Buckwalter system has notation for these. We elaborate such observations, describe our methods and analyse the training dataset. 1 Introduction Language Identification or Dialect Identification is the task of identifying the language or dialect of a written text. The task of Arabic dialect identification may require both computer scientists and Arabic linguistics experts. The Arabic language is one of the worlds major languages, and there is a common standard written form used worldwide, Modern Standard Arabic (MSA).