JKSUCI 152 No. of Pages 9 27 March 2015 Journal of King Saud University – Computer and Information Sciences (2015) xxx, xxx–xxx 1 King Saud University Journal of King Saud University – Computer and Information Sciences www.ksu.edu.sa www.sciencedirect.com

3 A rule-based stemmer for Gulf dialect

* 1 4 Belal Abuata , Asma Al-Omari

5 Faculty of IT & Computer Sciences, Yarmouk University, Irbid 21163, Jordan

6 Received 22 November 2013; revised 12 March 2014; accepted 15 April 2014 7

9 KEYWORDS 10 Abstract Arabic dialects arewidely used from many years ago instead of 11 Arabic dialect stemmer; language in many fields. The presence of dialects in any language is a big challenge. Dialects add a 12 Gulf dialect; new set of variational dimensions in some fields like natural language processing, information 13 Rule base stemming; retrieval and even in Arabic chatting between different Arab nationals. Spoken dialects have no 14 Arabic NLP standard morphological, phonological and lexical like Modern Standard Arabic. Hence, the objec- tive of this paper is to describe a procedure or algorithm by which a stem for the Arabian Gulf dia- lect can be defined. The algorithm is rule based. Special rules are created to remove the suffixes and prefixes of the dialect words. Also, the algorithm applies rules related to the word size and the rela- tion between adjacent letters. The algorithm was tested for a number of words and given a good correct stem ratio. The algorithm is also compared with two Modern Standard Arabic algorithms. The results showed that Modern Standard Arabic stemmers performed poorly with Arabic Gulf dialect and our algorithm performed poorly when applied for Modern Standard Arabic words. 15 Ó 2015 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

16 1. Introduction such as online communication (chat rooms, SMS, Facebook, 22 Twitters and others). Most of the research on Arabic is focused 23

17 Nowadays, Arabs prefer to use their local dialect in daily con- on MSA (Duwairi et al., 2007; Harrag et al., 2011; AI-Shalabi 24 18 versation whenever it is not required for them to use the et al., 2003; Goweder et al., 2008). Currently, there are 12 25 19 Modern Standard Arabic (MSA). Recently, dialect began to different Arabic dialects spoken in 28 countries around the 26 20 be used in both television and radio (Almeman and Lee, world. While most of these dialects are specific to a particular 27 ” ” 21 2013). Arabs also use dialect instead of MSA in important fields region (for example, ‘‘Sudanese Arabic or ‘‘Iraqi Arabic ), 28 the most commonly spoken Arabic dialect is Egyptian 29 Arabic. This variety of Arabic dialects is largely due to the fact 30 * Corresponding author. Tel.: +962 2 7211111; fax: +962 2 that, as Arabic spread and took hold in new regions, it often 31 7211128. adopted traces of the language it replaced. 32 E-mail addresses: [email protected] (B. Abuata), zadst@ A limited number of Arabic dialect software have been 33 yahoo.com (A. Al-Omari). 1 Tel.: +962 2 7211111; fax: +962 2 7211128. developed and a limited number of research papers published 34 (Al-Gaphari and Al-Yadoumi, 2010). Dialectal varieties have 35 Peer review under responsibility of King Saud University. not received much attention due to the lack of dialectal tools 36 and annotated texts. Hence working on dialect is difficult for 37 many reasons (Al-Shareef and Hain, 2011). Firstly, dialect is 38 Production and hosting by Elsevier not considered a written language, it is normally used in 39

http://dx.doi.org/10.1016/j.jksuci.2014.04.003 1319-1578 Ó 2015 Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 2 B. Abuata, A. Al-Omari

40 spoken form and limited textual data exist in dialect form com- words having different meanings. Arabic language orientation 99 41 pared with MSA. Another reason is that dialect still inherits is right-to-left which is the opposite in English. There are 30 100 42 the complex morphological form of MSA. Moreover, addi- letters used in the Arabic language. The difficulty in dealing 101 43 tional affixes are introduced informally for each dialect, with Arabic language is due not only to its orientation, but 102 44 thereby increasing cross-dialectical differences. Finally no also to the language diacritization of scripts, vowels which 103 45 standard convention is agreed on how various dialects should may or may not be included in, and its complex morphological 104 46 be transcribed. analysis. All of these and other factors such as its sensitivity to 105 47 Modern Standard Arabic (MSA) is a form of Arabic lan- gender, number, case, degree, and tense, make it very difficult 106 48 guage that is usually used in news media and formal speeches to deal with the Arabic language (Abu-Salem et al., 1999). 107 49 (Diab et al., 2004). There are no native speakers of MSA as Arabic is classified into three variants: , 108 50 stated in Mutahhar and Watson (2002). The importance of Modern Standard Arabic (MSA) and Colloquial or Dialectal 109 51 processing the dialects comes from here: ‘‘Almost no native Arabic. The Classical Arabic is the language of the Holly 110 52 speakers of Arabic sustain continuous spontaneous production Qur’an and used from the Pre-Islamic Arabia to that of the 111 53 of MSA. Dialects are the primary form of Arabic used in all Abbasid Caliphate. However modern authors never used 112 54 unscripted spoken genres: ‘‘conversations, talk shows, inter- Classical Arabic. They instead use a literary Language known 113 55 views, etc.” (Habash and Rambow, 2005). Dialects are becom- as MSA. MSA uses the classical vocabulary that is not pre- 114 56 ing widely in use in new written media (newsgroups, weblogs, sented in the spoken varieties. Dialectal Arabic refers to many 115 57 online chat etc). ‘‘Substantial Dialect-MSA differences impede national varieties which formalize the daily spoken language in 116 58 direct application of MSA NLP tools” (Diab and Habash, the Arab world. It is unwritten and often used in informal spo- 117 59 2006). Fields of researches such as Arabic NLP, Arabic ken media. It differs from MSA on all levels of linguistic repre- 118 60 Translations, Arabic and Cross language Retrieval and other sentation; the most extreme differences are on phonological 119 61 related Arabic research fields suffer from lack of resources and morphological levels. 120 62 due to lack of standards for the dialects, as well as lack of writ- In the following two subsections, an overview of MSA and 121 63 ten resources of dialects themselves as shown in Maamouri and Arabic dialect is presented. Also, some of the related works to 122 64 Bies (2004). It is right to say that the dialect is very popular Arabic dialect are mentioned. 123 65 where the majority of people use it during their daily life, they 66 use it for conversation and online chatting as well (Alghamdi 2.1. Modern Standard Arabic 124 67 et al., 2008). Unfortunately, not only is the dialect rarely used 68 in writing, but it also has no written standard. It is realized as a Modern Standard Arabic (MSA) is a form of Arabic language 125 69 language of heart and feeling where MSA is considered as a that is used widely in news media and formal speeches (Diab 126 70 language of mind. It is a formal language that has a very good et al., 2004). There are no native speakers of MSA as stated 127 71 written standard (Al-Gaphari and Al-Yadoumi, 2010). in Mutahhar and Watson (2002). The grammatical system of 128 72 This paper is organized as follows: In Section 2, we give an the Arabic language is based on a root-and-pattern structure 129 73 overview of related work. In Section 3 we present our method/ and considered as a root-based language with not more than 130 74 algorithm used to find the stem of dialect words used in 10,000 roots (Ali, 1988). A Root in Arabic is the bare verb 131 75 Arabian Gulf Countries. In Section 4 we discuss the evaluation form which can be trilateral, which is the overwhelming major- 132 76 results and their analysis. In Section 5 we talk about conclu- ity of Arabic words, and to a lesser extent, quadrilateral, pen- 133 77 sion and future work. taliteral, or hexaliteral, each of which generates increased verb 134 forms and noun forms by the addition of derivational affixes 135 78 2. Background and related works (Saliba and Al-Dannan, 1990). 136 A stem is a combination of a root and derivational mor- 137 phemes to which an affix (or more) can be added (Gleason, 138 79 Arabic language belongs to Semitic group of languages unlike 1970). However, when applying this definition to Arabic, the 139 80 the English language which belongs to the Indo-European lan- verb roots and their verb and noun derivatives are considered 140 81 guage group. Arabic is the fifth widely spoken language in the as stems. Affixes in Arabic are: prefixes, suffixes (or postfixes) 141 82 world used by 5% of people around the world (Kadri and Nie, and infixes (morphemes). Prefixes are attached at beginning of 142 83 2006). It is the official language in 26 countries, located in the the words, where suffixes are attached at the end, and infixes 143 84 Arab world within the west Asia to North Africa. Arabic is the are found in the middle of the words. For example, the 144 85 language of Islam, where hundreds of millions of Muslims use altalibat) which means ‘‘women 145) ﺍﻟﻄﺎﻟﺒﺎﺕ Arabic word 86 it for their religious daily uses. students”, consists of the elements shown in Table 1: 146 87 The is used in several languages such as El-Sadany and Hashish (1988), described the formation of 147 88 Persian, Malay, and Urdu (Al-Fedaghi and Al-Sadoun, Arabic affixes in the Arabic verbs as a derivation of the follow- 148 89 1990). Arabic Alphabets consist of letters, numbers, punctua- ing rule: 149 90 tion marks, space and special symbols (e.g. mathematical nota- 91 tions). It is different from English in its vowels and diacritic Prefixe1 + Prefix2 + Stem + Suffix1 + Suffixe2 150 92 marks. Diacritics are used in the form of over and underscores + Suffix3 151 93 with Arabic letter. However, most recent written Arabic texts 94 are unvocalized. Table 1 Example of Arabic Affixes. 95 The Arabic language is considered a member of a highly 96 sophisticated category of natural languages which has a very Word Prefix Suffix Infix Root ﻃ ﻠﺐﺍﺕﺍﻝ ﺍﻟﻄﺎﻟﺒﺎﺍﺕ rich morphology. Generally speaking, its richness is attributed 97 98 to the fact that one root can generate several hundreds of

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 A rule-based stemmer for Arabic Gulf dialect 3

152 where the prefixes and suffixes in the above rule are lists of 153 finite length. The properties of the above rule are given as 154 follows: 155

157 159 Prefix1: The elements of which serve like conjunctions, ,’faa‘‘) ﻭ and ,ﺏ,ﻑ examples for such an element are baa’ and ”) 160 Prefix2: The attributes associated with its elements help in the partial determination of the tense and the features of the subject pronoun. An example of an yaa’). For this case the‘ ﻱ) element from Prefix2 is tense is present (complete determination of tense), subject pronoun is for the third person (the number and gender of the subject are not Figure 1 Pre-Islamic or Pre-Arabic Expansion. determined). In case the Prefix2 is not present (nil) the tense can’t be determined (may be past or imperative) 161 Suffixe1: Its elements partially determine the tense and an accent, which is a version of a standard language with dif- 191 completely determine the features from the subject ferent pronunciations. Arabic dialects can be classified either 192 pronoun (person, gender, number). An example of historically or popularly (Zina Saadi, 2013). 193 woon”) as in the Arabic‘‘ ﻭﻥ) the Suffixe1 list is For Historical Classification; dialects can be either: 194 yadrusoon) which means ‘‘they study) ﻳﺪﺭﺳﻮﻥ word now”  Pre-Islamic or Pre-Arabic Expansion (6th century BC to 195 162 Suffixe2 and The attributes associated with their elements which 196 163 Suffixe3: determine the features of the first and second 6th century AD) as in Fig. 1.  objects pronouns respectively. An example from Post-Islamic or Post-Arabic Expansion (since 6th century 197 hom”) AD) as in Fig. 2. 198‘‘ ﻫﻢ) these lists is 164 Stem: It is formed by substituting the characters of the 199 root in certain forms called measures or templates As for Popular Classification, the Arabic dialects can be 200 awzan”). An example for the trilateral divided into various as shown in Fig. 3. Some of these groups 201‘‘ ﺃﻭﺯﺍﻥ) measure is: are: 202 measure ﺗﻔﺎﻋﻞ 165 root ﻉ ﻥ ﻕ 166 Sudanese Arabic – Mostly spoken in . 203 stem ﺗﻌﺎﻧﻖ 167 – This dialect is often heard in Syria, 204 168 Lebanon, Palestine, and western Jordan. 205 – Mostly heard throughout the Gulf Coast 206 from Kuwait to Oman. 207 169 2.2. Arabic dialect – This dialect is most often heard in the desert 208 and oasis areas of central . 209 170 Arabic dialect is a collective term of the daily spoken language – This dialect is most common in Yemen. 210 171 through the Arab world. It is radically different from the lit- Iraqi Arabic – The dialect most commonly spoken in Iraq. 211 172 erary language (MSA). Arabic dialect is spoken by more than Hijazi Arabic – This dialect is spoken in the west area of 212 173 400 million persons in nearly two dozen countries and holds present-day Saudi Arabia, which is referred to as the 213 174 the dual distinction of being the fifth most widely spoken as Hejaz region. 214 175 well as one of the fastest growing languages in the world – This is considered the most widely spo- 215 176 (Cote, 2009). Arabic dialects are primarily oral languages; ken and understood ‘‘second dialect.” It’s mostly heard in 216 177 written material is almost invariably in MSA. As a result, there . 217 178 is a serious lack of Language Model (LM) training material for – Spoken mostly in Morocco. 218 179 dialectal speech (Alotaibi et al., 2009). Each Country has its - Spoken mostly in Tunisia. 219 180 main dialect and this main dialect can be divided into a group Hassaniiya Arabic – Most often spoken in Mauritania. 220 181 of sub-dialect e.g. the Saudi dialect includes Najdi (Central) Andalusi Arabic – This dialect of the Arabic language is 221 182 dialect, Hejazi (Western) dialect, Southern dialect (Almeman now extinct, but it still holds an important place in literary 222 183 and Lee, 2013). One factor that differentiates dialects is the history. 223 184 influence from the languages previously spoken in the areas. Maltese Arabic – This form of Arabic dialect is most often 224 185 This influence have typically provided a significant number found in Malta. 225 186 of new words, and also influenced pronunciation and word 226 187 order. There are few works carried out for stemming of Arabic 227 188 All the dialects derive ultimately from the same ancient dialect. Most of these works studied a specific dialect used in 228 189 Arabic source, but have undergone changes in grammatical one country. Examples of these are the works by Al-Gaphari 229 190 structure, vocabulary and so on. A dialect is different from and Al-Yadoumi (2010) and Alamlahi and Ahmed (2007). 230

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 4 B. Abuata, A. Al-Omari

whereas the dialects mostly lack the dual form. Also, MSA 252 has two plural forms, one masculine and one feminine, 253 whereas many (though not all) dialects often make no such 254 gendered distinction.5 On the other hand, dialects have a 255 more complex cliticization system than MSA, allowing for 256 circumfix negation, and for attached pronouns to act as 257 indirect objects. 258  Dialects lack grammatical case, while MSA has a complex 259 case system. In MSA, most cases are expressed with diacrit- 260 ics that are rarely explicitly written, with the accusative case 261 Figure 2 Post-Islamic Post-Arabic Expansion. being a notable exception, as it is expressed using a suffix 262 (+A) in addition to a diacritic (e.g. on objects and adverbs). 263  There are lexical choice differences in the vocabulary itself. 264 231 Their work is based on morphological rules related to Sana’ani Table 2 gives several examples. Note that these differences 265 232 dialect as well as MSA. These rules assist in the conversion of go beyond a lack of orthography standardization. 266 233 dialect to its corresponding MSA. Their method tokenizes the  Differences in verb conjugation, even when the trilateral 267 234 input dialect text and divides each token into two parts: stem root is preserved. See the lower part of Table 2 for some 268 235 and its affixes. Such affixes can be categorized into two cate- conjugations of the root sˇ -r-b (to drink). 269 236 gories: dialect affixes and/or MSA affixes. At the same time, 270 237 the stem could be dialect stem or MSA stem. Therefore, their Table 2 shows a few examples illustrating similarities and 271 238 method implemented by using a simple MSA stemmer; must differences across MSA and two Arabic dialects: Levantine 272 239 pay attention to such situations. Then their dialect stemmer and Gulf. Even when a word is spelled the same across two 273 240 is applied to strip the resulting token and extract dialect affixes or more varieties, the pronunciation might differ due to differ- 274 241 (Al-Gaphari and Al-Yadoumi, 2010). Their work uses the dia- ences in short vowels (which are not spelled out). Also, due to 275 242 lect stemming as a system that translates the Sana’ani dialect the lack of orthography standardization, and variance in pro- 276 243 to MSA. nunciation even within a single dialect, some dialectal words 277 244 There are a large number of linguistic differences between could have more than one spelling (e.g. Levantine ‘‘ drinks” 278 245 MSA and the regional dialects. Some of those differences are could be bysˇ rb). (Table 2 uses the Habash–Soudi–Buckwalter 279 246 not found in written form if they are on the level of short transliteration scheme to represent Arabic orthography, which 280 247 vowels, which are deleted in Arabic text anyway.Some of the maps each Arabic letter to a single, distinct character (Habash 281 248 main differences are (Zaidan and Callison-Burch, 2013): et al., 2007). 282 Our paper focuses on Gulf Arabic dialect group which 283 249  MSA’s morphology is richer than dialects’ along some includes Kuwait, Bahrain, Qatar, United Arab of Emirates 284 250 dimensions such as case and mood. For instance, MSA and some parts of East of Saudi Arabia and some parts of 285 251 has a dual form in addition to the singular and plural forms, South Iraq. 286

Figure 3 Arabic dialect groups (http://www.importanceoflanguages.com/LearnArabic/tag/arabic-dialects-map).

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 A rule-based stemmer for Arabic Gulf dialect 5

Non-Arabic word list since these words must not be analyzed. 313 Table 2 Examples of similarities and differences across MSA Table 3 shows example of these words. 314 and two Arabic dialects. Also a Gulf dialect contains many stop words and these 315 English MSA LEV GLF words must not be analyzed so the proposed algorithm has 316 Book ktAb ktAb ktAb to compare the word with a stop word list. Some examples 317 Shloon 318 ,ﺷﻨﻮ ’Sheno What ,ﻣﻨﻮ ’Year sn⁄ sn⁄ sn⁄ of stop words are Meno ‘Who Sometimes stop words include affixes; these 319 .ﺷﻠﻮﻥ ’Money nqwd mSAry flws How Come on! hyA! ylA! ylA! affixes must be deleted before comparing it to a stop word list, 320 ! This stop word contains a 321 .ﺷﻠﻮﻧﻜﻢ ’I want Aryd bdy Ab y´ e.g. Shloonkum ‘How are you Now  hlq AlHyn which has to be deleted first so it became Shloon 322 ﻛﻢ AlAn suffix kum When? mty´ ? Aymty´ ? mty´ ? 323 .ﺷﻠﻮﻥ ’How‘ What? mAa¨ A? Aysˇ ?wsˇ ? The Proposed algorithm for the Gulf dialect Arabic in 324 I drink Aˆ sˇ rb bsˇ rb Asˇ rb He drinks ysˇ rb bsˇ rb ysˇ rb order to analyze the words to extract the stem will remove 325 We drink nsˇ rb bnsˇ rb nsˇ rb all the Affixes usually appeared in MSA form, e.g. Alef-Lam 326 The 327 .(ﻭ Waw ,ﻱ Yaa ,ﺍ Vowels (Alef ,ﻭﻥ Waw-Noon ,ﺍﻟـ proposed algorithm supposes that the smallest length of a stem 328 is three letters since more than Three-quarters of the MSA 329 words have a root or stem of size three. The main steps of 330 Table 3 Examples of non-Arabic dialect words. the proposed algorithm are as follows: 331 Dialect words Origen Meaning Affixes Result removed  Read the word 332  Check the size of the word (<=3). This control is per- 333 ﺑﺸﺖ Besht ﻙ Fares Cloak Kaf ﺑﺸﺘﻚ Beshtek formed every time when a letter is removed from the 334 ﺗﺠﻮﺭﻱ Tegoree ﺍﻟـ India Storage Alef-Lam ﺍﻟﺘﺠﻮﺭﻱ Altegoree word. 335 ﺩﻻﻍ Dlag ﺍﺕ Turkey Socks Alef taa ﺩﻻﻏﺎﺕ Dlagaat Remove suffixes and prefixes found in suffixes and pre- 336  ﺍﺑﺠﻮﺭ Abajor ﺍﺕ France Rolling Alef taa ﺍﺑﺠﻮﺭﺍﺕ Abajorat shutters fixes set 337 Delete the following prefixes if found 338 ➢ ﺩﺭﻳﻮﻝ England Driver – Drawel ﺩﺭﻳﻮﻝ Drawel 339 .(ﺍﻝ ، ﻟﻞ ، ﻭﺍﻝ، ﺑﺎﻝ، ﻭﻟﻞ) 340 .(ﺍﺵ، ﻭﺵ) Delete the following prefixes if found ➢ 341 ﺍﺕ، ﻭﻥ، ﻙ، ﻛﻢ، ﻭﻛﻢ، ﺗﻨﻲ، ﻭﻧﻪ،) Delete the following suffixes ➢ 342 .(ﻳﻨﻪ، ﻫـ، ﺗﻪ، ﻫﻢ،ﻫﺎ، ﻭﻧﻬﻢ، ﻳﻦ، ﺕ، ﺗﻲ، ﻧﻲ،ﺗﻚ، ﺗﻜﻢ، ﺍﻟﻜﻢ،ﺍﻟﻚ 287 The important thing that must be acknowledged is that 343 288 Arabic language influenced and gets influenced by other lan-  Check if the word belongs to Non-Arabic word or to 344 289 guages. It lent some words to other language like Persian, Stop word: 345 290 Turkish, Hindi and Malay. Arabic literary affects Europeans ➢ If true then stop. 346 291 culture especially, in mathematics, science and philosophy. ➢ If false, delete the following prefix 347 292 Also it has borrowed words from other languages such as 348 .(ﺑﺎ ، ﻭﺍ، ﺍ، ﺕ، ﻱ، ﻡ، ﻣﺖ، ﻣﺎ ، ﺍﻥ، ﻣﻦ، ﻳﺖ) 293 Hebrew, Greek and Persian in early centuries. In modern times 349 294 it borrowed from English, French and Turkish. 350 (ﺕ) and the third letter is (ﺍ، ﻥ، ﺕ، ﻡ) If the first letter is  then delete both. 351 then delete it 352 ﺍ، ﺕ، ﻱ، ﻡ Method  If the first letter is .3 295  Check each letter in the word if it is a vowel then delete 353 296 The primary goal of this study is to derive an efficient algo- it: 354 297 rithm to extract the stem of dialect words used in Arabian ➢ If we have a vowel with non-vowels neighbored then 355 298 Gulf countries (Kuwait, Bahrain, Qatar, UAE, Saudi deletes it. 356 299 Eastern Area, and South of Iraq). These words are collected ➢ If we have two consecutive vowel letters then we have 357 300 from the chatting and the Internet forums. The list does not to delete one of them according to the following 358 359 .(ﻱ) and then (ﻭ) then (ﺍ) include the old words which are not used these days (e.g. sahed order 301 and includes some new words ➢ If we find three consecutive vowels keep the one in 360 (ﻛﺮﻓﺎﻳﻪ ’karfaia ‘‘bed ,ﺻﺎﻫﺪ ’fever‘ 302 ja’as the middle and delete the two neighbored letters. 361 ,ﻣﺠﺴﻲ ’added to the Gulf dialect (e.g. majase ‘stubbed 303 362 .(ﺟﻌﺺ ’mean‘ 304 305 Gulf dialect words include some Non-Arabic words which  After deleting the vowel letters, if we get two letters 363 306 come from different countries. Some come from India and similar to the neighbored then delete one of them. E.g. 364 we get 365 ﺝ delete the ﺭﺟﺠﻞ we get (ﺍ ، ﻱ) delete the ﺭﺟﺎﺟﻴﻞ Iran (Fares) because before the discovery of oil the Arabian 307 366 .ﺭﺟﻞ Gulf traders used to go to these countries. Other words come 308 309 from England, France and Turkey because after the discovery 367 310 of oil European traders from these countries came to Arabian In this pseudo code a file of Non-Arabic word and a file of 368 311 Gulf countries. The proposed algorithm has to remove the the stop words have to be created first and before starting the 369 312 affixes added to these words first before comparing it to a program: 370

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 6 B. Abuata, A. Al-Omari

THEN 437 372 374 Begin Delete L 438 375 Define WLF a file of the word list to be stemmed, FNAW a file of ENDIF 439 376 Non-Arabic words and another file FSW for stop-words IF one of L neighbored is vowel THEN 440 377 WHILE WLF is NOT Empty Delete the vowel neighbored with less priority 441 378 W = Read Single Word () ENDIF 442 379 SET wordLength to Length of W IF two neighbored of L are vowels THEN 443 380 IF wordLength <= 3 THEN Delete both neighbors with less priority 444 381 Stop and Exit While ENDIF 445 382 ENDIF //Scan if there are two consecutive letters that are the same then 446 383 WHILE wordLength > 3 delete one of them 447 THEN IF neighbor of L = L THEN 448 (ﺍﻝ ، ﻟﻞ ، ﻭﺍﻝ، ﺑﺎﻝ، ﻭﻟﻞ) IF W contains Prefix 384 385 Delete Prefix from W Delete neighbor of L OR L 449 386 IF wordLength <= 3 THEN IF wordLength <=3 THEN 450 387 Print W as the stem AND Exit While Print W as the stem AND Exit While 451 388 ENDIF ENDIF 452 389 ENDIF ENDIF 453 390 ENDWHILE ENDWHILE 454 391 WHILE wordLength > 3 Print W as the stem AND Exit While 455 THEN End While, 456 (ﺍﺵ، ﻭﺵ) IF W contains Prefix 392 393 Delete Prefix from W End 457 394 If wordLength <= 3 THEN 460 395 Print W as the stem AND Exit While 396 ENDIF 397 ENDIF 398 ENDWHILE 399 WHILE wordLength > 3 ﺍﺕ،ﻭﻥ،ﻙ، ﻛﻢ،ﻭﻙ، ﺗﻨﻲ،ﻭﻧﻪ،ﻩ، ﺗﻪ،ﻫﻢ،ﻭﻧﻬﻢ،ﻱ) IF W contains Suffix 400 THEN (ﺕ،ﺗﻲ،ﻧﻲ،ﺍﻟﻜﻢ،ﺍﻟﻚ 401 402 Delete Suffix from W 403 IF wordLength <= 3 THEN 404 Print W as the stem AND Exit While 405 ENDIF 406 ENDIF 407 ENDWHILE 408 IF W found in FNAW Then 409 Print W as the stem AND Exit While 410 ENDIF 411 IF W is found in FSW THEN 412 Print W as the stem AND Exit While 413 ENDIF 414 WHILE wordLength > 3 do THEN (ﺑﺎ، ﻭﺍ، ﺍ، ﺕ، ﻱ، ﻡ، ﻣﺖ، ﺍﻥ، ﻣﻦ) IF W contains Prefix 415 416 Delete Prefix 417 IF wordLength <= 3 THEN 418 Print W as the stem AND Exit While 419 ENDIF 420 ENDIF 421 ENDWHILE 422 WHILE wordLength > 3 (ﺕ) AND 3rd letter of W is (ﺍ، ﻥ، ﺕ، ﻡ) IF 1st Letter of W is 423 424 THEN 425 Delete Both Letters from W 426 IF wordLength <= 3 THEN 427 Print W as the stem AND Exit While 428 ENDIF .(I urge you) ﺑﺎﻧﺸﺪﻙ ENDIF Figure 4 Stemming the word 429 430 ENDWHILE 431 //Check each letter in the word if it is a vowel or not 432 WHILE not all letters scanned and wordLength >3 433 Set L to Letter of W Examples of applying this algorithm on some words are 461 434 //Consecutive vowels must be deleted according to the priority (alef shown in Figs. 4–6. 462 Table 4 contains some examples that show how the algo- 463 (ﻱ Yaa , ﻭwaw ,ﺍ 435 436 IF neighbored letters of L not vowels AND L is Vowel rithm handles some of the special rules found in some words 464 458459 (these rules are found in the pseudo code). 465

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 A rule-based stemmer for Arabic Gulf dialect 7

Table 4 Examples of some special rules of the algorithm. No Rule Examples ﺍﻟﻌﻴﺎﻝ = ﻋﻴﺎﻝ (ﺍﻝ، ﻟﻞ،ﻭﺍﻝ،...) Delete the prefix (1) The example here also good for Then the size is 4 so we take (ﻱ) the last rule in the pseudo code the letter at position 2 about taking scan letters and vowel which is followed by (ﻱ) so keep the (ﺍ) removing vowels other vowel so the root will (ﺍ) and remove This example shows .(ﻉ ﻱ ﻝ) be also the rule if we have two vowels in the pseudo code ﺍﺑﺘﻠﺶ ، ﻧﺒﺘﻠﺶ، ﻳﺒﺘﻠﺶ، ﺗﺒﺘﻠﺶ، ﺍ، ﻥ، ﺕ، ﻱ، ﻡ) If the first letter is (2) ﻣﺒﺘﻠﺶ = ﺑﻠﺶ we(ﺕ) and the third is (ﻭﻟﻴﺲ ﻭ have to delete This is the same as MSA e.g.: ﺍﺭﺗﺤﻞ، ﻧﺮﺗﺤﻞ، ﻳﺮﺗﺤﻞ، ﺗﺮﺗﺤﻞ، ﻣﺮﺗﺤﻞ = ﺭﺣﻞ ﻭﺷﻌﻠﻮﻣﻚ = ﻋﻠﻮﻣﻚ :For some prefixes (3) Then the size is five so we have (ﻭﺵ) If the word starts with  we have to delete to take the character at (ﺍﺵ) or (ﻭ) The example here is also good position 3 which is a vowel for the last rule in the pseudo so delete and shift to the right with the two (ﻝ) code about taking scan letters to get so the root is (ﻉ ، ﻡ) and removing vowels neighbors (ﻉ ﻝ ﻡ) ﺍﻧﺤﺎﺵ = ﺣﺎﺵ we have (ﺍﻥ) If it starts with  ﺍﻧﺨَﺶ = ﺧَﺶ to remove ﺍﻧﺘﺮﺱ = ﺗﺮﺵ ﻣﺸﺨﺎﻝ = ﺷﺨﺎﻝ (ﻣﻦ) or (ﻣﺖ) ,(ﻡ) If we have  which is a (ﺍ) Then remove the ﺵ ﺥ ﻝ ( ) You will see). vowel to get the root) ﺗﺸﻮﻓﻮﻥ Figure 5 Stemming of (ﻣﺮﺯﺍﻡ = ﺭﺯﻡ) and the same for ﻣﺘﺤﺪﺭ = ﺣﺪﺭ ، ﻣﺘﺒﺮﺯ = ﺑﺮﺯ ﻣﻨﺤﺎﺵ = ﺣﺎﺵ ﻳﺮﻃﻦ = ﺭﻃﻦ ، ﻳﺠﻨﺪﺱ = ﺟﻨﺪﺱ (ﻳﺖ)or(ﻱ) If we have  ﻳﺘﻐﻠﻲ = ﻏﻠﻲ ، ﻳﺘﻌﻠﺚ = ﻋﻠﺚ This can be the same as in MSA. e.g.: ﻳﻜﺘﺐ = ﻛﺘﺐ ، ﻳﺮﺳﻢ = ﺭﺳﻢ ﻳﺘﻔﻜﺮ = ﻓﻜﺮ ، ﻳﺘﺄﻣﻞ = ﺃﻣﻞ ﻳﻐﺮﺑﻠﻚ = ﻏﺮﺑﻞ :For some suffix (4) ﻭﻫﻘﻜﻢ = ﻭﻫﻖ (ﻛﻢ)or(ﻙ) If it ends with  This also the same as in MSA e.g.: ﻛﺘﺎﺑﻚ = ﻛﺘﺐ ، ﻛﺘﺎﺑﻜﻢ = ﻛﺘﺒﻜﺘﺎﺑﻚ = ﻛﺘﺐ ، ﻛﺘﺎﺑﻜﻢ = ﻛﺘﺐ ﻳﺨَﺸﻮﻧﻪ = ﺧَﺶ (ﻭﻧﻬﻢ) or (ﻭﻧﻪ) ,(ﻭﻥ) If we have  ﻳﺨَﺸﻮﻧﻬﻢ = ﺧَﺶ ﻳﺨَﺸﻮﻥ = ﺧَﺶ And this is the same as in MSA .ﺍﺑﺠﻮﺭﺍﺕ Figure 6 Stemming of e.g.: ﻳﻈﻬﺮﻭﻥ ﻇﻬﺮ 466 4. Results and analysis = ﻳﻈﻬﺮﻭﻧﻪ = ﻇﻬﺮ ﻳﻈﻬﺮﻭﻧﻬﻢ = ﻇﻬﺮ ﻣﺎﺻﺨﺎﺕ = ﻣﺼﺦ (ﺍﺕ) The data test corpus used to test the algorithm is obtained  If we have 467 2345 ﻗﻔﺸﺎﺕ = ﻗﻔﺶ from many places related to Arabic Gulf dialects We 468 469 browsed different sites in order to obtain as many as possible And this is the same as in MSA 470 Gulf word varieties. The initial data corpus is 15486. The cor- e.g.: ﺟﻠﺴﺎﺕ = ﺟﻠﺲ pus was then analyzed and preprocessed to remove duplicates 471 ﻭﻣﻀﺎﺕ ﻭﻣﺾ 472 and MSA words. The resulted corpus consists of 5436 distinct =

2 http://www.alamuae.com/uaedic/index.html. Gulf dialect words. Table 5 and Table 6 show some character- 473 3 http://ar.mo3jam.com/. istics of the test corpus. 474 4 http://www.7bna.com/vb/showthread.php?t=93153. After applying this algorithm on the Arabic dialect test cor- 475 5 http://www.majma.org.jo/majma/index.php/2009-02-10-09-36-00/ pus we get the results reported in Table 7. 476 648-mag80-5.html.

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 8 B. Abuata, A. Al-Omari

Table 5 Arabic dialect corpus word frequencies based on Table 8 Results of the three stemmers for MSA corpus. word length. Stemmer Total number Accuracy Right Not Wrong Word length Word frequency Word ratio name of words (%) root stem root 7 85 1.56% Khoja’s stemmer 5436 92 5016 65 355 6 305 5.61% Darwish’s stemmer 5436 76 4140 314 982 5 913 16.80% New stemmer 5436 52 2825 982 1629 4 1872 34.44% 3 2213 40.71 2 48 0.88 Totals 5436 100% are: Khoja’s stemmer (Khoja and Garside, 1999), and 495 Darwish’s stemmer (Kareem Darwish, 2002). We selected these 496 two stemmers as they are considered among the Arabic stem- 497 mers in terms of accuracy levels when applied on MSA words. 498 Table 6 Examples of Arabic dialect corpus set of different Khoja’s stemmer is a heavy stemmer whereas Darwish’s stem- 499 word lengths. mer is a light stemmer. The test corpus is the same as the one 500 Word Arabic meaning English meaning Derivations Stem used for testing the new algorithm which consists of 5436 501 words. Table 7 shows the results of applying the three stem- 502 ﺣﻠﻖ ﺗﺰﺣﻠﻖ, ﻳﺘﺰﺣﻠﻘﻮﺯﻥ, Sliding game ﻌﺒﺔ ﺍﻟﺘﺰﺣﻠﻖ ﺯﺣﻼﻗﻴﻟﺔ mers on the Dialect list: 503 ﻭﻥ ﺍﻟﺪﻳﻮﺍﻧﻴﻪ, ﺍﻟﺪﻭﺍﻭﻳﺩﻦ, Men hall’ ﻜﺎﻥ ﻟﺘﺠﻤﻊ ﺍﻟﺮﺟﺎﻝ ﺩﻳﻮﺍﻧﻴﻣﺔ Another test is carried out for the same three stemmers 504 ﻤﻜﻢ ﻳﺘﻜﻤﻜﻤﻮﻥ - ﺗﻜﻤﻜﻤﻮﻛﺍ Covered herself ﻐﻄﺖ ﺗﻜﻤﻜﻤﺗﺖ where the corpus is changed to MSA list of words collected 505 ﻠﺦ ﻳﺸﻠﺨﻮﻥ, ﺗﺸﻠﻴﺷﺦ Lie ﺎﺫﺏ ﺑﻮﺷﻼﻛﺥ form various MSA Arabic text. Table 8 shows the results of 506 ﻠﻮﻥ ﺍﺷﻠﻮﻧﻜﻢ - ﺍﺷﻠﻮﻧﺷﻚ - How ﻴﻒ ﺍﺷﻠﻮﻛﻥ ﺮﻳﺞ ﺍﺑﺎﺭﻳﺞ ﺍﺑﺮﻳﺠﻜﺑﻢ ﻧﺎﺀ ﺍﻟﻤﺎﺀ ﺇﺑﺮﻳﻖ ﺍﺑﺮﻳﺍﺞ - Water jug -– this test. 507 ﺶ ﺍﺗﺨﺸﻮﻥ - ﺧﺸﻴﺧﺖ Hide ﺧﺘﺒﺄ ﺍﻧﺨﺍﺶ Depending on the previous results we found that: 508 ﻛﺪ ﺭﻛﺪﻭﺍ - ﺗﺮﻛﺪﻭﺭﻥ Calm ﻫﺪﺃ ﺍﺭﻛﺍﺪ ﺍﺑﻲ .ﺗﺒﻮﻥ. ﺗﺒﻴﻦ He want ﺮﻳﺪ ﻳﺒﻳﻲ  MSA stemming is different from Dialect stemming. This is 509 ﺑﺎﻕ - ﺗﺒﻮﻗﻴﻦ - ﻣﺒﻴﻮﻕ He steal ﺮﻕ ﻳﺒﻮﺳﻕ shown from the results obtained by the MSA stemmers 510 ﺺ ﺣﺼﺎﻳﺺ, ﻳﺤﺼﺣﺺ White pearl ﻟﺆﻟﺆ ﺍﻻﺑﻴﺾ ﺣﺼﺍﻪ Khoja’s and Darwish’s stemmers) and the new stemmer 511) ﺞ ﺻﺠﻚ - ﺍﻟﺼﺻﺞ Truly ﺪﻕ ﺻﺻﺞ on Gulf dialect corpus and the MSA corpus as shown in 512 Tables 7 and 8. Hence, dialect Arabic requires special 513 stemmers. 514 477 The reasons which cause the wrong stems are:  The proposed stemmer has to stem all words even the Non- 515 Arabic words because sometimes these words have some 516 ﻱ ﺝ  478 has 517 ﺍﺑﺠﻮﺭﺍﺕ ’Some Arabian Gulf country convert Jeem to Yaa , e.g. affixes e.g. the word Abajorat ‘Rolling shutters ﻣﺠﻬﻮﺩ ﻣﻴﻬﻮﺩ 479 which has to be removed first 518 ﺍﺕ Mayhood ‘Effort’ instead of Majhood . In this case a suffix Alef and Taa ﻱ 480 the algorithm has to remove Yaa because it is a vowel before comparing the word with the Non-Arabic word list. 519 481  Try to stem nouns – name of things – e.g. Kamee ‘Dry milk’ These types of words are not stemmed both of Khoja and 520 ﻛﺮﻭﻇﺎﺕ ﻛﺎﻣﻲ 482 , Krothat ‘Nuts’ . Darwish stemmers. 521 483  Some plurals for Non-Arabic words are not in a standard  Wrong roots came from stemming Non-Arabic words e.g. 522 ﻭﻥ ﺍﺕ” 484 and Asansoor ‘Elevator’ 523 ﺍﺳﺘﻜﺎﻧﻪ ’form, i.e. ending with Alef Taa ‘‘, Waw noon ‘‘ ‘‘ or Estekanah ‘cup of tea ﻳﻦ 485 Also it try to stem a stop words since they are not 524 .ﺍﺳﻨﺴﻮﺭ Yaa noon ‘‘ ‘‘. E.g. The plural of the Non-Arabic word ﺍﺑﺎﺭﻳﺞ ﺍﺑﺮﻳﺞ 486 Sheno 525 ,ﻣﻨﻮ ’Persian- Ebreej ‘Jug’ is Abareej . The problem the same as in MSA language e.g. Meno ‘Who- 487 526 ﺷﻨﻮ ’here is that the algorithm searches for suffix and prefix in ‘What 488 the word only before it checks whether the word is Non-  The right root came from that for dialect words they have 527 489 Arabic or not. the same affixes used in MSA language which Khoja stem- 528 490 has a 529 ﺭﻛﺪﻭﺍ ’mer can recognize and delete e.g. Rekdo ‘be quite 491 which has to be removed, other 530 ﻭﺍ Since there is no stemmer for the Gulf dialect words we suffix Waw and Alef 492 which has a prefix 531 ﺍﻻﺛﻮﻝ ’used two Arabic stemmers to compare our stemmer with them. example is Alathwal ‘Stubbed 493 532 . ﺍﻻ Here we want to prove that MSA stemmers are not applicable Alef Lam Alef 494 to Arabic dialect and their performance is low. These stemmers 533

5. Conclusion and future work 534 Table 7 Results of the three stemmers for Arabic gulf dialect corpus. MSA stemming algorithms are not applicable to Arabic 535 Stemmer Total number of Accuracy Right Not Wrong Dialect. In this paper we showed that MSA stemmers perform 536 name words (%) root stem root poorly when applied to Arabic Gulf dialect and the Arabic 537 Khoja’s 5436 39 2121 1684 1631 Gulf dialect cannot be applied for MSA words. There is no 538 stemmer stemming algorithm that handles Arabic Gulf dialect words. 539 Darwish’s 5436 28 1524 2132 1780 Only few algorithms are available that handle only a singular 540 stemmer Arabic dialect. In this paper we presented a new rule based 541 New 5436 88 4784 0 652 Dialect stemmer built for the Gulf dialects. The algorithm is 542 stemmer built of a set of predefined rules for Gulf dialects. This new 543

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003 JKSUCI 152 No. of Pages 9 27 March 2015 A rule-based stemmer for Arabic Gulf dialect 9

544 stemmer accuracy is acceptable and it gave superior results the North American Chapter of the Association for Computational 595 545 compared to other stemming algorithms. The algorithm can Linguistics/Human Language Technologies Conference 596 597 546 handle many dialects. (HLTNAACL04), Boston, MA. Duwairi, R., Al-Refai, M., Khasawneh, N., 2007. Stemming versus 598 547 This new algorithm can handle all known Arabic dialects light stemming as feature selection techniques for Arabic text 599 548 by defining new rules and integrating these rules with the cur- categorization. Innovations in Information Technology, IIT ’07. 600 549 rent used rules. Improvement to our stemmer can also be 4th International Conference on, 18–20 Nov, pp. 446–450. 601 550 added to handle all the non-Arabic words used in Arabic dia- El-Sadany, T.A., Hashish, M.A., 1988. Semi-automatic vowelization 602 551 lects. More tests are also required for the use of the algorithm of Arabic verbs. Proceedings of 10th National Computer 603 552 in Arabic translation and Arabic Sentiment analysis. Conference, pp. 45–56. 604 Gleason, H.A., 1970. An Introduction to Descriptive Linguistics, third 605 553 6. Uncited reference ed. Holt. Rinehart and Winston, New York. 606 Goweder, A., Alhami, H., Rashed, T., Al-Musrati, A., 2008. A hybrid 607 method for stemming Arabic text. In Proceedings of the 9th 608 554 Habash et al. (2005). International Arab Conference on Information Technology (ACIT 609 2008) (Tunis, 2008). 610 555 References Habash, N., Rambow, O., 2005. Tokenization, morphological analy- 611 sis, and part-of-speech tagging for Arabic in one fell swoop. In 612 556 Abu-Salem, H., Al-Omari, M., Evens, M.W., 1999. Stemming Proceeding of the Association for Computational Linguistic 613 557 methodologies over individual query words for an Arabic (ACL). 614 558 information retrieval system. J. Am. Soc. Inf. Sci. 50 (6), 524–529. Nizar Habash, Owen Rambow, George Kiraz, 2005. Morphological 615 559 Yahya Alamlahi, Fateh Ahmed, 2007. Sana’ani dialect to modern analysis and generation for Arabic dialects. Proceeding Semitic ‘05 616 560 standard Arabic: rule-based direct machine translation, Computer Proceedings of the ACL Workshop on Computational Approaches 617 561 Science Dep., Sana’a University, Sana’a, Yemen. to . University of Michigan, Ann Arbor, 618 562 Al-Fedaghi, S.S., Al-Sadoun, H.B., 1990. Morphological compression Michigan, USA, pp: 17–24s. 619 563 of Arabic text. Inf. Process. Manage. 26 (2), 303–316. Habash, N., Soudi, A., Buckwalter, T., 2007. On Arabic translitera- 620 564 Al-Gaphari, G.H., Al-Yadoumi, M., 2010. A method to convert tion. In: van den Bosch, A., Soudi, A. (Eds.), Arabic 621 565 Sana’ani accent to modern standard Arabic, 2010. Int. J. Inf. Sci. Computational Morphology: Knowledge-Based and Empirical 622 566 Manage. 8 (1), 39–49. Methods. Springer. 623 567 Alghamdi, M., Alhargan, F., Alkanhal, M., Alkhairy, A., Eldesouki, Harrag, F., El-Qawasmah, E., Al-Salman, A.M.S., 2011. Stemming as 624 568 M., Alenazi, A., 2008. Saudi accented Arabic voice bank. J. King a feature reduction technique for Arabic text categorization. 625 569 Saud Univ. Comp. Inf. Sci. 20 (1), 45–62 (Riyadh). Programming and Systems (ISPS), 10th International Symposium 626 570 Ali, N., 1988. Computers and Arabic Language. Al-Khat Publishing on, 25–27 April, pp. 128–133. 627 571 Press, Ta’reep, Egypt. Kadri, Y., Nie, J.Y., 2006. Effective stemming for Arabic information 628 572 Almeman, K., Lee, M., 2013. Automatic building of Arabic multi retrieval. In Proceedings of the Challenge of Arabic for NLP/MT 629 573 dialect text corpora by bootstrapping dialect words. Conference. The British Computer Society. London, UK. 630 574 Communications, Signal Processing and their Applications Kareem Darwish, 2002. Al-stem: a light Arabic stemmer. [Online]. 631 575 (ICCSPA), 1st International Conference on 12–14 Feb, pp. 1–6. Available: http://www.glue.umd.edu/~kareem/research. 632 576 Alotaibi, Y.A., Selouani, S., Cichocki, W., 2009. Investigating Khoja, S., Garside, R., 1999. Stemming Arabic text. Computing 633 577 emphatic consonants in foreign accented Arabic. J. King Saud Department, Lancaster University, Lancaster, 15 (April 2012). Doi: 634 578 Univ. Comp. Inf. Sci. 21 (1), 13–25 (Riyadh). http://zeus.cs.pacificu.edu/shereen/research.htm. 635 579 AL-Shalabi, R., Kannan, G., AI-Serhan, H., 2003. New approach for Maamouri, M., Bies, A., 2004. Developing an Arabic treebank: 636 580 extracting Arabic roots. Proc. of 2003 International Arab confer- Methods, guidelines, procedures, and tools. Linguistic Data 637 581 ence on Information Technology (ACIT’2003), Alexandria, pp. 42– Consortium (LDC). 638 582 59. Mutahhar, A.R., Watson, J., 2002. Social issues in popular Yemeni 639 583 Al-Shareef, Sarah, Hain, Thomas, 2011. An investigation in speech culture. Yemeni–British project supported by the British Embassy, 640 584 recognition for colloquial Arabic. In proceeding of: Social Fund for Development and Leigh Douglas Memorial Fund, 641 585 INTERSPEECH 2011, 12th Annual Conference of the Sana’a, Yemen. 642 586 International Speech Communication Association, Florence, Saliba, B., Al-Dannan, A., 1990. Automatic morphological analysis of 643 587 Italy, August 27–31, pp. 2869–2872. Arabic: a study of content word analysis. Proc. First Kuwait 644 588 Cote, Robert A., 2009, Choosing one dialect for the Arabic speaking Comput. Conf., 231–243 645 589 world: a status planning dilemma, 75 Arizona Working Papers in Zaidan, O.F., Callison-Burch, C., 2013. Arabic dialect identification. 646 590 SLA & Teaching 16, 75–97. Comput. Linguist. 1 (1), 1–36. 647 591 Diab, M., Habash, N., 2006. Arabic Dialect Processing. AMTA, Zina Saadi, One language, many dialects: an analysis of Arabic 648 592 Bostan. dialects. Computational Linguist. Middle Eastern Languages 649 593 Diab, M., Hacioglu, K., Jurafsky, D. 2004. Automatic tagging of Specialist. Basis Technology corp. Found online (accessed 03.2013). 650 594 Arabic text: from row text to base phrase chunks. In 5th Meeting of 651

Please cite this article in press as: Abuata, B., Al-Omari, A. A rule-based stemmer for Arabic Gulf dialect. Journal of King Saud University – Computer and Informa- tion Sciences (2015), http://dx.doi.org/10.1016/j.jksuci.2014.04.003