Can Modern Standard Arabic Approaches be used for Arabic Dialects? Sentiment Analysis as a Case Study Kathrein Abu kwaik Stergios Chatzikyriakidis Simon Dobnik CLASP and FLOV, University of Gothenburg, Sweden fkathrein.abu.kwaik,stergios.chatzikyriakidis,
[email protected] Abstract where MSA is the official language used for ed- ucation, news, politics, religion and, in general, in We present the Shami-Senti corpus, the first any type of formal setting, but dialects are used in Levantine corpus for Sentiment Analysis (SA), and investigate the usage of off-the-shelf mod- everyday communication, as well as in informal els that have been built for Modern Standard writing (Versteegh, 2014). Arabic (MSA) on this corpus of Dialectal Ara- In this paper, we examine whether it is possi- bic (DA). We apply the models on DA data, ble to adapt classification models that have been showing that their accuracy does not exceed trained and built on MSA data for DA from the 60%. We then proceed to build our own mod- Levantine region, or whether we should build and els involving different feature combinations train specific models for the individual dialects, and machine learning methods for both MSA and DA and achieve an accuracy of 83% and therefore considering them as stand-alone lan- 75% respectively. guages. To answer this question we use Sentiment Analysis as a case study. Our contributions are the 1 Introduction following: There is a growing need for text mining and an- • We systematically evaluate how well the ML alytical tools for Social Media data, for exam- models on MSA for SA perform on DA of ple Sentiment Analysis (SA) tools which aim to Levantine; distinguish people’s views into positive and neg- • We construct and present a new sentiment ative, objective and subjective responses, or even corpus of Levantine DA; into neutral opinions.