A Rule-Based Stemmer for Arabic Gulf Dialect

JKSUCI 152 No. of Pages 9 27 March 2015 Journal of King Saud University – Computer and Information Sciences (2015) xxx, xxx–xxx 1 King Saud University Journal of King Saud University – Computer and Information Sciences www.ksu.edu.sa www.sciencedirect.com 3 A rule-based stemmer for Arabic Gulf dialect * 1 4 Belal Abuata , Asma Al-Omari 5 Faculty of IT & Computer Sciences, Yarmouk University, Irbid 21163, Jordan 6 Received 22 November 2013; revised 12 March 2014; accepted 15 April 2014 7 9 KEYWORDS 10 Abstract Arabic dialects arewidely used from many years ago instead of Modern Standard Arabic 11 Arabic dialect stemmer; language in many fields. The presence of dialects in any language is a big challenge. Dialects add a 12 Gulf dialect; new set of variational dimensions in some fields like natural language processing, information 13 Rule base stemming; retrieval and even in Arabic chatting between different Arab nationals. Spoken dialects have no 14 Arabic NLP standard morphological, phonological and lexical like Modern Standard Arabic. Hence, the objec- tive of this paper is to describe a procedure or algorithm by which a stem for the Arabian Gulf dialect can be defined. The algorithm is rule based. Special rules are created to remove the suffixes and prefixes of the dialect words. Also, the algorithm applies rules related to the word size and the rela- tion between adjacent letters. The algorithm was tested for a number of words and given a good correct stem ratio. The algorithm is also compared with two Modern Standard Arabic algorithms. The results showed that Modern Standard Arabic stemmers performed poorly with Arabic Gulf dialect and our algorithm performed poorly when applied for Modern Standard Arabic words. 15 Ó 2015 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). 16 1. Introduction such as online communication (chat rooms, SMS, Facebook, 22 Twitters and others). Most of the research on Arabic is focused 23 17 Nowadays, Arabs prefer to use their local dialect in daily con- on MSA (Duwairi et al., 2007; Harrag et al., 2011; AI-Shalabi 24 18 versation whenever it is not required for them to use the et al., 2003; Goweder et al., 2008). Currently, there are 12 25 19 Modern Standard Arabic (MSA). Recently, dialect began to different Arabic dialects spoken in 28 countries around the 26 20 be used in both television and radio (Almeman and Lee, world. While most of these dialects are specific to a particular 27 ” ” 21 2013). Arabs also use dialect instead of MSA in important fields region (for example, ‘‘Sudanese Arabic or ‘‘Iraqi Arabic ), 28 the most commonly spoken Arabic dialect is Egyptian 29 Arabic. This variety of Arabic dialects is largely due to the fact 30 * Corresponding author. Tel.: +962 2 7211111; fax: +962 2 that, as Arabic spread and took hold in new regions, it often 31 7211128. adopted traces of the language it replaced. 32 E-mail addresses: [email protected] (B. Abuata), zadst@ A limited number of Arabic dialect software have been 33 yahoo.com (A. Al-Omari). 1 Tel.: +962 2 7211111; fax: +962 2 7211128. developed and a limited number of research papers published 34 (Al-Gaphari and Al-Yadoumi, 2010). Dialectal varieties have 35 Peer review under responsibility of King Saud University. not received much attention due to the lack of dialectal tools 36 and annotated texts. Hence working on dialect is difficult for 37 many reasons (Al-Shareef and Hain, 2011). Firstly, dialect is 38 Production and hosting by Elsevier not considered a written language, it is normally used in 39 http://dx.doi.org/10.1016/j.jksuci.2014.04.003 1319-1578 Ó 2015 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 2 B. Abuata, A. Al-Omari 40 spoken form and limited textual data exist in dialect form com- words having different meanings. Arabic language orientation 99 41 pared with MSA. Another reason is that dialect still inherits is right-to-left which is the opposite in English. There are 30 100 42 the complex morphological form of MSA. Moreover, addi- letters used in the Arabic language. The difficulty in dealing 101 43 tional affixes are introduced informally for each dialect, with Arabic language is due not only to its orientation, but 102 44 thereby increasing cross-dialectical differences. Finally no also to the language diacritization of scripts, vowels which 103 45 standard convention is agreed on how various dialects should may or may not be included in, and its complex morphological 104 46 be transcribed. analysis. All of these and other factors such as its sensitivity to 105 47 Modern Standard Arabic (MSA) is a form of Arabic lan- gender, number, case, degree, and tense, make it very difficult 106 48 guage that is usually used in news media and formal speeches to deal with the Arabic language (Abu-Salem et al., 1999). 107 49 (Diab et al., 2004). There are no native speakers of MSA as Arabic is classified into three variants: Classical Arabic, 108 50 stated in Mutahhar and Watson (2002). The importance of Modern Standard Arabic (MSA) and Colloquial or Dialectal 109 51 processing the dialects comes from here: ‘‘Almost no native Arabic. The Classical Arabic is the language of the Holly 110 52 speakers of Arabic sustain continuous spontaneous production Qur’an and used from the Pre-Islamic Arabia to that of the 111 53 of MSA. Dialects are the primary form of Arabic used in all Abbasid Caliphate. However modern authors never used 112 54 unscripted spoken genres: ‘‘conversations, talk shows, inter- Classical Arabic. They instead use a literary Language known 113 55 views, etc.” (Habash and Rambow, 2005). Dialects are becom- as MSA. MSA uses the classical vocabulary that is not pre- 114 56 ing widely in use in new written media (newsgroups, weblogs, sented in the spoken varieties. Dialectal Arabic refers to many 115 57 online chat etc). ‘‘Substantial Dialect-MSA differences impede national varieties which formalize the daily spoken language in 116 58 direct application of MSA NLP tools” (Diab and Habash, the Arab world. It is unwritten and often used in informal spo- 117 59 2006). Fields of researches such as Arabic NLP, Arabic ken media. It differs from MSA on all levels of linguistic repre- 118 60 Translations, Arabic and Cross language Retrieval and other sentation; the most extreme differences are on phonological 119 61 related Arabic research fields suffer from lack of resources and morphological levels. 120 62 due to lack of standards for the dialects, as well as lack of writ- In the following two subsections, an overview of MSA and 121 63 ten resources of dialects themselves as shown in Maamouri and Arabic dialect is presented. Also, some of the related works to 122 64 Bies (2004). It is right to say that the dialect is very popular Arabic dialect are mentioned. 123 65 where the majority of people use it during their daily life, they 66 use it for conversation and online chatting as well (Alghamdi 2.1. Modern Standard Arabic 124 67 et al., 2008). Unfortunately, not only is the dialect rarely used 68 in writing, but it also has no written standard. It is realized as a Modern Standard Arabic (MSA) is a form of Arabic language 125 69 language of heart and feeling where MSA is considered as a that is used widely in news media and formal speeches (Diab 126 70 language of mind. It is a formal language that has a very good et al., 2004). There are no native speakers of MSA as stated 127 71 written standard (Al-Gaphari and Al-Yadoumi, 2010). in Mutahhar and Watson (2002). The grammatical system of 128 72 This paper is organized as follows: In Section 2, we give an the Arabic language is based on a root-and-pattern structure 129 73 overview of related work. In Section 3 we present our method/ and considered as a root-based language with not more than 130 74 algorithm used to find the stem of dialect words used in 10,000 roots (Ali, 1988). A Root in Arabic is the bare verb 131 75 Arabian Gulf Countries. In Section 4 we discuss the evaluation form which can be trilateral, which is the overwhelming major- 132 76 results and their analysis. In Section 5 we talk about conclu- ity of Arabic words, and to a lesser extent, quadrilateral, pen- 133 77 sion and future work. taliteral, or hexaliteral, each of which generates increased verb 134 forms and noun forms by the addition of derivational affixes 135 78 2. Background and related works (Saliba and Al-Dannan, 1990). 136 A stem is a combination of a root and derivational mor- 137 phemes to which an affix (or more) can be added (Gleason, 138 79 Arabic language belongs to Semitic group of languages unlike 1970). However, when applying this definition to Arabic, the 139 80 the English language which belongs to the Indo-European lan- verb roots and their verb and noun derivatives are considered 140 81 guage group.

A Rule-Based Stemmer for Arabic Gulf Dialect

Different Dialects of Arabic Language

A STUDY of WRITING Oi.Uchicago.Edu Oi.Uchicago.Edu /MAAM^MA

Rhode Island College

The Provenance of Arabic Loan-Words in a Phonological and Semantic Study. Thesis Submitted for the Degree of Doctor of Philosoph

Camsemud 2007

Initial Overview of the Linguistic Diversity of Refugee Communities in Cairo

Arabic Pidgins and Creoles Andrei Avram University of Bucharest

12/11/08 Appendix #1 013:*** 01:013:145 Introduction to Aramaic

Dialect Mixing and Dialect Levelling in Kordofanian Baggara Arabic (Western Sudan) Stefano Manfredi

Foreign Language Self-Guided Study Resources

Automatic Arabic Dialect Identification Systems for Written Texts: a Survey

Arabic Dialects (General Article)