
Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan

Faisal Alshargi,⋆ Shahd Dibas,‡ Sakhar Alkhereyf,† Reem Faraj,† Basmah Abdulkareem,† Sane Yagi,‡ Ouafaa Kacha,‡ Nizar Habash,∗ Owen Rambow§
⋆Universität Leipzig, Germany ‡University of Jordan, Jordan †Columbia University, USA ∗New York University Abu Dhabi, UAE §Elemental Cognition, USA

Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 137–147, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Abstract

We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.

1 Introduction

As Arabic dialects (DA) become more widely written in social media, there is increased interest in the Arabic NLP community in annotated corpora that allow us both to study the dialects linguistically and to create systems that can automatically process dialectal text. There have been important efforts to create relatively large corpora for Egyptian (Maamouri et al., 2014), Palestinian (Jarrar et al., 2014), and Emirati Arabic (Khalifa et al., 2018). While these resources are very helpful for single dialects, the problem is that there are many dialects, and in fact it is often unclear what to count as separate dialects (for example, the subdialects of Levantine). Therefore, we present a different approach in this paper: we annotate seven dialects, but with relatively smaller corpora (most around 30,000 words). Some of the dialects are closely related (Jordanian and Syrian), others are more distant (Moroccan). We use the same annotation methodology for all dialects: same guidelines, same processing steps, and same annotation file format. This makes our effort an ideal starting point for experimenting with using multidialectal resources to create and train NLP tools. The dialects we consider are Taizi Yemeni (YE.TZ)¹, Sanaani Yemeni (YE.SN), Saudi Najdi (SA.NJ), Jordanian (JOR), Syrian Damascene (SY.DM), Iraqi Baghdadi (IR.BG), and Moroccan Rabati (MA.RB) Arabic.

The paper is structured as follows. We start with a review of relevant literature (Section 2). We then summarize some linguistic facts about DA in general (Section 3) and subsequently present each of our seven dialects in Section 4, summarizing the corpora used and some interesting facts specific to each dialect. Section 5 then presents our annotation methodology. We then briefly discuss morphological analyzers, and conclude.

¹ The abbreviations we use intend to capture the country name and the city or region name when applicable.

2 Related Work

Data Collections. There have been several data collections centered on Arabic dialects, specifically spoken Arabic. A very useful resource is the Semitisches Tonarchiv at the University of Heidelberg in Germany.² We have included two Yemeni transcriptions from this resource in our YE.TZ and YE.SN corpora. Khalifa et al. (2016) is a large collection of over 100M words of a number of Arabic dialects, although the majority is from the Gulf. Bouamor et al. (2018) created a large corpus with parallel text from 25 Arab cities. Further data collections include Al-Amri (2000), which has not yet been digitized for use in NLP research.

² http://www.semarch.uni-hd.de

Annotated Corpora. There are few annotated corpora for dialectal Arabic: the Levantine Arabic Treebank (specifically Jordanian) (Maamouri et al., 2006), the Egyptian Arabic Treebank (Maamouri et al., 2014), Curras, the Palestinian Arabic annotated corpus (Jarrar et al., 2014), the Emirati Arabic annotated corpus (Khalifa et al., 2018), Syrian and Jordanian dialectal corpora (Bouamor et al., 2014; Harrat et al., 2014), a small effort on Sanaani and Moroccan (AlShargi et al., 2016) (which this paper builds on), and SUAR (Al-Twairesh et al., 2018), a morphologically annotated corpus for Najdi and Hijazi which is semi-automatically annotated using the MADAMIRA tool (Pasha et al., 2014) and subsequently manually checked. Additionally, Voss et al. (2014) present a corpus of Moroccan dialect which has been annotated for language variety (code switching). Several of these efforts have followed the approach of Curras (Jarrar et al., 2014), which consists of around 70,000 words of a balanced genre corpus. The corpus was manually annotated using the DIWAN tool (Alshargi and Rambow, 2015), which we also use. The annotation in Curras is done by first using a morphological tagger for another Arabic dialect, namely MADAMIRA Egyptian (Pasha et al., 2014), to produce a base that was then corrected or accepted by a trained annotator.

Other NLP Resources for Dialectal Arabic. The effort to annotate corpora in context is a central step in developing morphological analyzers and taggers (Eskander et al., 2013; Habash et al., 2013). However, other notable approaches and efforts that do not use annotated corpora have focused on developing specific resources manually or semi-automatically, e.g., the Egyptian Arabic morphological analyzer (Habash et al., 2012b), which is built upon the Egyptian Colloquial Arabic Lexicon (Kilany et al., 2002), the multi-dialectal dictionary Tharwa (Diab et al., 2014), or extending MSA analyzers and resources (Salloum and Habash, 2014; Harrat et al., 2014; Boujelbane et al., 2013).

Linguistic Studies. There are many theoretical and descriptive linguistic studies for the dialects we work on: Yemeni dialects (Watson, 1993, 2002), Najdi (Ingham, 1994), Gulf Arabic dialect (Holes, 1990), Jordanian (Bani-Yasin and Owens, 1987), Moroccan (Harrell, 1962), Syrian (Cowell, 1964), and Iraqi (Erwin, 1963); not to mention comparative studies across dialects and MSA (Holes, 2004; Brustad, 2000). We make extensive use of such studies as part of the design of our annotation guidelines.

3 Dialects: Linguistic Facts

In this section we present some general facts and phenomena shared across different dialects. In subsequent subsections, we present our dialects in more detail and comment on the corpus sources.

Dialects and MSA. Arabic dialects share many commonalities with Classical Arabic and Modern Standard Arabic (MSA). All variants of Arabic are morphologically complex as they include rich inflectional and derivational morphology that is expressed in two ways: namely, via templates and affixes. Furthermore, they contain several classes of attachable clitics. However, the dialects as a class differ in consistent ways from MSA, and they differ amongst each other. In fact, the differences between MSA and Dialectal Arabic (DA) have often been compared to those between Latin and the Romance languages (Chiang et al., 2006). The principal morpho-syntactic difference between DA and MSA is the loss of productive case marking and nunation (tanween) on nouns, and mood on imperfective verbs.

Dialectal Variations. Differences among the dialects are found on all levels of linguistic description, i.e., phonology, morphology, syntax, and the lexicon. We summarize three phonological and three morphological salient examples in Table 1 for our dialects: the pronunciation of MSA /q/ written ق q,³ MSA /dʒ/ written ج j and MSA /k/ written ك k; and the various forms of the future, progressive and possessive particles.

³ We represent the Arabic words in Arabic script and in the Buckwalter transliteration (in italics) (Habash et al., 2007). When needed, we present the IPA (in /.../). The English gloss is added in single quotes.

From a lexical point of view, there are many words that have different meanings across dialects. For example, the word ماشي mA$y /maːʃi/ is ‘no’ in YE.SN and MA.RB, ‘yes/ok’ in SY.DM and JOR, and ‘walking’ in SA.NJ. Another example is the word صافي SAfy /sˤaːfi/, which means ‘enough’ in MA.RB, but ‘pure’ in the other dialects and MSA. Some cases show subtle differences in meaning, e.g., خدام xdAm /xaddaːm/ means ‘employee’ generically in MA.RB, but it has a more specific and negative connotation in YE.TZ and YE.SN, namely ‘enslaved servant’. While the above cases are all homonyms (homophones and homographs), there are instances of

homophones that have different meanings in different dialects. For example, the utterance /fagr/ can mean ‘morning’ in YE.TZ (written as فجر fjr), or ‘poverty’ in YE.SN (written as فقر fqr). The YE.SN pronunciation of فجر fjr is /faʒr/; and the YE.TZ pronunciation of فقر fqr is /faqr/.

There are also cases of the same meaning being expressed in different ways, e.g., ‘spoon’ is ملعقة mlEqp in MSA, metathesized معلقة mElqp in JOR and SY.DM, and خاشوقة xA$wqp in IR.BG.

Dialectal Orthography. Since Arabic dialects do not have spelling standards, several previous efforts on Arabic dialect annotations (Maamouri et al., 2014; Jarrar et al., 2014; Khalifa et al., 2018) contributed to a movement that led to the creation of a common Conventional Orthography for Dialectal Arabic (CODA) (Habash et al., 2012a; Zribi et al., 2014; Habash et al., 2018). We also follow this approach to map from any spontaneous orthography in our data to CODA. The spirit of CODA is to define a common and consistent approach to spelling DA words that acknowledges their etymological and historical relationship with MSA and Classical Arabic (CA), but also maintains their uniqueness and independence. For example, if a DA word has an MSA cognate containing ق q, then its CODA spelling will use ق q even if the dialectal pronunciation is different. In contrast, DA morphemes are spelled in a way that reflects their DA uniqueness. For example, the SY.DM word حنفيق Hnfyq /ħanfiːʔ/ ‘we will wake up’ is a cognate of MSA سنفيق snfyq /sanafiːqu/: the future marker reflects the dialectal morphology and is not spelled as in MSA, but the stem is spelled as in MSA and thus the ق q does not reflect the dialectal pronunciation.

Phenomenon | MSA | YE.TZ | YE.SN | SA.NJ | JOR | SY.DM | IR.BG | MA.RB
Pronunciation of ق q | /q/ | /q/ | /g/ | /g/ or /dz/ | /g/ or /ʔ/ | /ʔ/ | /g/ | /q/ or /g/
Pronunciation of ج j | /dʒ/ | /g/ | /dʒ/ | /dʒ/ | /ʒ/ | /ʒ/ | /dʒ/ | /dʒ/
Pronunciation of ك k | /k/ | /k/ | /k/ | /k/ or /ts/ | /k/ or /tʃ/ | /k/ | /k/ or /tʃ/ | /k/
Future Particle | س+ s+, سوف swf | ش+ $+, اش A$ | ع+ E+, عد Ed, ش+ $+, ي+ y+ | ب+ b+ | ح+ H+, رح rH | ح+ H+, رح rH | ح+ H+, رح rH, راح rAH | غ+ g+, غادي gAdy
Progressive Particle | φ | ب+ b+ | ب+ b+ | قاعد qAEd, جالس jAls | ب+ b+ | ب+ b+, عم Em | د+ d+, قاعد qAEd | ك+ k+, ت+ t+
Possessive Particle | φ | تبع tbE, حق Hq | تبع tbE, حق Hq | حق Hq | تبع tbE | تبع tbE | مال mAl | د+ d+, تاع tAE, ديال dyAl

Table 1: Cross-dialectal and MSA variants in some phonological and morphological phenomena

4 Dialect-Specific Corpora

Until recently, Arabic was mostly written in Modern Standard Arabic (MSA) and Classical Arabic, while written DA was rare. One early source of written dialectal Arabic is textbooks for learning an Arabic dialect intended for non-Arabic speakers. Furthermore, sometimes spoken language has been recorded and transcribed. However, owing to the advent of the internet and its rapid growth among Arabic speaking populations, written materials in DA are now more accessible and easier to obtain than they were in the past. These written materials are typically informal written conversations among participants or traditional folk literature like short stories, poems, prose, thoughts and songs. These texts can be found in online forums, blogs, and postings on social media networks. All of our dialectal corpora consist of sources of various genres, collected from both online and print materials in order to cover many of the aspects of these dialects. Each of the YE.TZ, SA.NJ, IR.BG and JOR corpora has 30K words, while YE.SN has 32K words, SY.DM has 35K words and MA.RB has 20K words. It should be noted that the data collected from the internet was written in Arabic characters, using “spontaneous” orthography since there are no orthographic standards for DA. The Roman alphabet sentences were transcribed from the textbooks into the Arabic alphabet using CODA. All examples presented in the rest of this section are in CODA except where specified otherwise.

4.1 Taizi Corpus (YE.TZ)

Sources. The YE.TZ written data was collected manually from different resources such as forums, blogs, and social media networks. With reference to spoken data, half of the oral interviews were recorded and transcribed manually by the annotators; the remaining oral interview transcripts are taken from the Semitisches Tonarchiv (Section 2). The data includes wise anecdotes, proverbs, stories, poems, songs and dialogues.

Phonology and Orthography. A distinguishing feature of YE.TZ is that MSA ج j /dʒ/ is pronounced as /g/, e.g., جمل jml ‘camel’ /gamal/, and that MSA ق q /q/ retains its pronunciation. In that regard, CODA spellings were straightforward.

Morphology. Similar to a number of other dialects but unlike MSA, negation is expressed as an enclitic ش $ ‘not’, e.g., يدخلش ydxl+$ ‘he does not enter’. The vocative particle is expressed as the proclitics يا yA ‘Oh’ and وا wA ‘Oh’, or as the enclitic اه Ah as in اماه AmAh ‘my mother’. The verbal proclitic قا qA ‘already’, which corresponds to MSA قد qd, frequently appears with past verbs, e.g., قا عملنا qA EmlnA ‘we have already done that’.

Lexicon. There are many open-class words that make YE.TZ different from MSA and other dialects, e.g., زقوة zqwp ‘shrewd’, زكن zkn ‘order’, and قراع qrAE ‘breakfast’. Some words have MSA meanings that differ from YE.TZ, e.g., شل $l ‘take’ and بز bz ‘take’. YE.TZ has a number of loanwords from English that underwent Arabization, e.g., سجارة sjArp ‘cigarette’, and كتلي ktly ‘kettle’.

4.2 Sanaani Corpus (YE.SN)

Sources. The social texts were taken from a Sanaani radio station program called مسعد ومسعدة msEd wmsEdp, which addressed social issues and problems of the community. The oral interview transcripts were taken from the Semitisches Tonarchiv (Section 2). The interviews describe daily life, history and lifestyle. Folktales describing traditional stories handed down in Sanaa are taken from internet forums. Collections of wisdom sayings and tales of the famous wise man “Ali walad Zaid” are taken from internet websites. Other texts were taken from social media, and include political events in Yemen, Sanaani jokes, religious sermons and transcripts that discuss the Sanaani dialect in MSA.

Phonology and Orthography. MSA ق q /q/ is pronounced /g/ in YE.SN, including in religious contexts. For example, the word قمر qmr ‘moon’ is pronounced /gamar/. This variation is not unique to YE.SN; other dialects such as IR.BG and JOR have it as well. This /g/ is often spontaneously spelled as ق q, which is consistent with CODA guidelines. A particularly marked phenomenon in YE.SN is the devoicing and emphasis of some instances of word-medial /d/, e.g., غدوة gdwp ‘tomorrow’ is pronounced /ɣutˤwa/ and as a result may be written spontaneously as غطوة gTwp.

Morphology. As shown in Table 1, there are four future particles in YE.SN: ع+ E+, عد Ed, ش+ $+, ي+ y+. While ع+ E+ may be used with 1st, 2nd, or 3rd person conjugated verbs, the rest are only used with 1st person singular conjugated verbs.

Lexicon. YE.SN has some distinguishing closed-class words, such as the prepositions قفى qfY ‘behind’ and شق $q ‘next’, and numbers like ستات stAt ‘six’ and هطعش hTE$ ‘eleven’. There are also some Turkish loanwords, e.g., ساني sAny ‘direct’ and كريك kryk ‘shovel’.

4.3 Najdi Corpus (SA.NJ)

Sources. The SA.NJ corpus was collected from different sources that represent different genres: forums, poetry, jokes and tweets. We collected different posts from the Saudi web forum eqla3.com, including personal narratives (mainly sarcastic) and discussions. We also collected Najdi poems from the late twentieth century, mainly written by the contemporary Najdi poets Khalid AlFaisal, Mohammed bin Ahmed AlSudairy and Saad Bin Jadlan. We manually collected Najdi jokes from various online resources. Finally, on Twitter, we searched for distinctive Najdi keywords such as حنا HnA ‘we’, قروشة qrw$p ‘inconvenience’, and منيب mnyb ‘I’m not’.

Phonology and Orthography. As Table 1 shows, there are a number of phonological alternations in SA.NJ. The /dz/ variant of ق q /q/ and the /ts/ variant of ك k /k/ are rather restricted in their usage. Unlike MSA, SA.NJ shows no distinction between the pronunciation of MSA etymological ض D /dˤ/ and ظ Z /ðˤ/. These phenomena affect spontaneous orthography and had to be addressed in the CODA annotations.

Morphology. One marked morphological feature of SA.NJ (and other Gulf Arabic dialects) is

the use of the negation circumfix ما+ .. +ب mA+ .. +b, as in مانيب mAnyb ‘I am not’ (spontaneously, often written as منيب mnyb). Similar constructions exist in other dialects but are more productive, e.g., Egyptian ما+ .. +ش mA+ .. +$ negates verbs in addition to pronouns. Unlike most DA and like MSA, SA.NJ retains some tanween (nunation). For example: أنا قايلٍ لك >nA qAylK lk /ʔana gaːylin lak/ ‘I said (active participle) to you’. However, as in MSA, the nunation is rarely written. Some morphological phenomena are becoming very rare, e.g., the use of تس ts for the 2nd person singular feminine pronominal enclitic is dying out among younger people and merging with the masculine form ك k.

Lexicon. SA.NJ has some distinguishing words such as أبخص >bxS ‘more expert’, كفو kfw ‘good’, and دافور dAfwr ‘nerd’. There are many borrowed words from English compared to borrowings from Turkish or Persian. For instance, the verb يفلم yflm is borrowed from English ‘film’ and means ‘to act dramatically’.

4.4 Jordanian Corpus (JOR)

Sources. The corpus includes written as well as spoken data. The written materials were drawn from internet sources such as forums, blogs, and social media. They include informal conversations among participants or traditional folk literature like short stories, poems, prose, memoirs, and songs. As for spoken data, oral interviews and observations were recorded and transcribed by the annotators. Nearly 20 informants were interviewed by the researchers. Older as well as uneducated people are included in order to ensure the authenticity of the data. The JOR data included a mix of sub-dialects that reflect the multiplicity of DA forms, including markedly Palestinian as well as Jordanian variants. For this reason, we refer to this corpus simply as JOR.

Phonology and Orthography. In some JOR sub-dialects, as with IR.BG, MSA ك k is affricated to /tʃ/, e.g., كلب klb /tʃalb/ ‘dog’. ق q is also realized in two forms, as /g/ and /ʔ/. Some of these phenomena result in different spontaneous spellings that are then normalized during annotation.

Morphology. JOR’s 2nd person feminine singular pronominal clitic has two alternations depending on the sub-dialect: كي ky /ki/ and ك k /ik/. Examples include شفتكي $ftky or شفتك $ftk ‘I saw you’; however, when following a vowel, both become كي ky /ki/, e.g., شافوكي $Afwky ‘they saw you’. Negation is marked with the enclitic ش $, as in باسويش bAswy$ ‘I do not do’.

Lexicon. Some JOR words are from Syriac, e.g., شوب $wb ‘hot’, and بكير bkyr ‘early in the morning’. Other words are borrowed from Turkish, e.g., دغري dgry ‘straightforward’ and درابزين drAbzyn ‘ladder’. Some words that were borrowed from English underwent some morpho-phonological changes, for example, كوريدور kwrydwr ‘corridor’, فرمت frmt ‘format’, and بلك blk ‘to block somebody’.

4.5 Syrian Corpus (SY.DM)

Sources. The written data was collected manually from different online written resources such as forums, blogs, and social media networks. Among the data, there were anecdotes, proverbs, stories, some poems, songs and dialogues.

Phonology and Orthography. SY.DM has a phoneme /ʔ/ that is a cognate with either MSA Hamza (ؤ &, ئ }, أ >, إ <, ء ') or MSA Qaf ق q. In most spontaneous SY.DM orthography, the two forms are distinguished in a manner similar to CODA guidelines. A few exceptions include the word هلأ hl> ‘now’, which in CODA is written as هلق hlq, highlighting its etymological link to هالوقت hAlwqt ‘this time’. Less common spelling variations include the devoicing of ج j /ʒ/ to /ʃ/, which may be reflected in spontaneous orthography, e.g., نجتمع njtmE /niʒtmiʕ/ ‘we meet’ may appear as نشتمع n$tmE /niʃtmiʕ/.

Morphology. A distinction of SY.DM (and North Levantine) compared to South Levantine and a number of other dialects is the absence of the negation enclitic ش $. SY.DM makes use of a number of future particles in free distribution (see Table 1). The progressive particle عم Em can only be used to indicate active progression at the moment, while the progressive proclitic ب+ b+ has a wider range from habitual to progressive.

Lexicon. As with JOR, some SY.DM words were originally Syriac, e.g., شوب $wb ‘hot’, or براني brAny ‘outer’. Other words are borrowed from Turkish, e.g., دغري dgry ‘straightforward’. Some words encountered major semantic shifts, e.g., طز Tz comes from Turkish tuz ‘salt’, then shifted to mean ‘something unimportant’, and eventually ‘good riddance’. Other words were found to be borrowed from French, e.g., ديكور dykwr ‘decor’ and جاتو gAtw ‘gateaux’, and from Persian, like سرسري srsry ‘bad man’. Markedly SY.DM expressions include حربوق Hrbwq /ħarbuːʔ/ ‘shrewd’.

4.6 Iraqi Corpus (IR.BG)

Sources. The materials of the IR.BG corpus were obtained from social media websites, blogs and other online sources. The sources contain posts on political, social, and religious issues that touch upon the daily life of the Iraqi people. The sources include blogs, e.g., different sarcastic posts with a witty sense of humor gathered from the Iraqi blog شلش العراقي $l$ AlErAqy, and short essays with commentary and views that sharply criticize the loss of traditional values and morals in Iraqi society after 2003. Proverbs, common sayings, and famous expressions were also collected from online blogs and forums.

Phonology and Orthography. Some instances of MSA ك k appear as /tʃ/ in IR.BG, e.g., كانت kAnt ‘she was’ /tʃaːnat/. Some of these cases appear in spontaneous orthography as تش t$ or even چ J / ج j (mostly due to Persian spelling influences). Some instances of MSA ق /q/ are pronounced as /g/, e.g., فوق fwq ‘above’ /foːg/. Some of these cases appear in spontaneous orthography as گ G or ك k, also due to Persian influences.

Morphology. A strong marker of IR.BG is the progressive proclitic د+ d+, e.g., شدتسوق؟ $dtswq? ‘what are you driving?’. IR.BG also has three future particles: راح rAH, رح rH, and ح+ H+, which seem to be in free variation.

Lexicon. The IR.BG lexicon has some distinguishing words such as أطوخ >Twx ‘little darker’, and آني |ny ‘I’. IR.BG has many loanwords from Kurdish, Persian, and Russian, e.g., Kurdish كاكه kAkh ‘mister’ and Persian قنداغ qndAg ‘very weak tea or hot water and sugar’.

4.7 Moroccan Corpus (MA.RB)

Sources. The folktales come from a Moroccan website that reprinted stories originally published in an encyclopedia of traditional Moroccan folktales. The textbook examples include many basic greetings and expressions, as well as sample dialogues. The blog posts range in topic, but include relationship advice, recipes, and philosophical musings. The humor includes both short and long jokes from a few Facebook pages and one other website.

Phonology and Orthography. Most MA.RB consonants are pronounced like their MSA equivalents; however, there are exceptions: dental consonants in MSA have become alveolar, so MSA ث v /θ/, ذ * /ð/, and ظ Z /ðˤ/ are pronounced /t/, /d/, and /dˤ/, respectively, in MA.RB. Such issues naturally interact with spontaneous orthography and are annotated as per CODA guidelines.

Morphology. Among the set of dialects discussed here, MA.RB has the most distinct set of morphological features, such as its future, progressive and possessive particles (see Table 1). Like other North African dialects, and unlike MSA, MA.RB uses the prefix ن+ n+ for the imperfect first person singular, and distinguishes the first person plural by adding the plural suffix +وا +wA. Interestingly, the imperfect first person singular in MA.RB looks like the imperfect first person plural in MSA and numerous other dialects. Finally, the perfect second person singular masculine and feminine both use the suffix تي ty, which corresponds to the feminine suffix in other DA.

Lexicon. MA.RB has a number of loanwords from Berber, French and Spanish; and many speakers code-switch between Moroccan and French or Spanish. Examples include French فورماج fwrmAj ‘cheese’ and بورتابل bwrtAbl ‘mobile phone’; and Spanish سمانة smAnp ‘week’ and بابور bAbwr ‘ship’.

5 Annotation Process

Gloss: to him | and I will not go / and not going | this letter | I will write

[The Arabic-script Raw and CODA rows of the dialect blocks, and the Lemma and Morph rows of the MSA block, are not legible in this copy and are omitted.]

MSA
Ortho | إليه | أذهب | ولن | الرسالة | هذه | سأكتب
Prefix | - | IV1S | CONJ | DET | - | FUT_PART+IV1S
Stem | PREP | IV | NEG_PART | NOUN | DEM_PRON_FS | IV
Suffix | PRON_3MS | IVSUFF_MOOD:S | - | NSUFF_FEM_SG+CASE_DEF_ACC | - | IVSUFF_MOOD:I

YE.TZ
Lemma | li | saraH | mA | jawAb | Aa*ah | katab
Morph | l +h | $+A+ srH +φ+$ | w+ mA | Al+ jwAb | A*h | $+A+ ktb +φ
Prefix | - | FUT_PART+IV1S | CONJ | DET | - | FUT_PART+IV1S
Stem | PREP | IV | NEG_PART | NOUN | DEM_PRON_MS | IV
Suffix | PRON_3MS | IVSUFF_SUBJ:1S+NEG_PART | - | - | - | IVSUFF_SUBJ:1S

YE.SN
Lemma | li | sAr | mA | risAlap | tayh | katab
Morph | l +h | $+A+ syr +φ+$ | w+ mA | Al+ rsAl +p | tyh | Ed#+A+ ktb +φ
Prefix | - | FUT_PART+IV1S | CONJ | DET | - | FUT_PART#+IV1S
Stem | PREP | IV | NEG_PART | NOUN | DEM_PRON_FS | IV
Suffix | PRON_3MS | IVSUFF_SUBJ:1S+NEG_PART | - | NSUFF_FEM_SG | - | IVSUFF_SUBJ:1S

SA.NJ
Lemma | li | rAyH | AnA | risAlap | katab
Morph | l +h | rAyH | w+m+ Any +b | h+Al+ rsAl +p | b+A+ ktb +φ
Prefix | - | - | CONJ+NEG_PART | DEM_PART+DET | FUT_PART+IV1S
Stem | PREP | ADJ | PRON_1S | NOUN | IV
Suffix | PRON_3MS | - | NEG_PART | NSUFF_FEM_SG | IVSUFF_SUBJ:1S

JOR
Lemma | li | rAH | mnA | risAlap | hA*iy | katab | raH
Morph | l +h | rAyH | w+ mnA | Al+ rsAlp | hA*y | A+ ktb +φ | rH
Prefix | - | - | CONJ | DET | - | IV1S | -
Stem | PREP | ADJ | NEG_PART | NOUN | DEM_PRON_FS | IV | FUT_PART
Suffix | PRON_3MS | - | - | NSUFF_FEM_SG | - | IVSUFF_SUBJ:1S | -

SY.DM
Lemma | Eind | rAH | raH | mA | risAlap | katab | raH
Morph | l+ End +h | A+ rwH +φ | rH | w+ mA | h+Al+ rsAl +p | A+ ktb +φ | rH
Prefix | PREP | IV1S | - | CONJ | DEM_PART+DET | IV1S | -
Stem | NOUN | IV | FUT_PART | NEG_PART | NOUN | IV | FUT_PART
Suffix | POSS_PRON_3MS | IVSUFF_SUBJ:1S | - | - | NSUFF_FEM_SG | IVSUFF_SUBJ:1S | -

IR.BG
Lemma | li | rAH | mA | risAlap | hAy | katab | raH
Morph | l +h | A+ rwH +φ | w+ mA | Al+ rsAlp | hAy | A+ ktb +φ | rH
Prefix | - | IV1S | CONJ | DET | - | IV1S | -
Stem | PREP | IV | NEG_PART | NOUN | DEM_PRON_FS | IV | FUT_PART
Suffix | PRON_3MS | IVSUFF_SUBJ:1S | - | NSUFF_FEM_SG | - | IVSUFF_SUBJ:1S | -

MA.RB
Lemma | li | m$aY | gAdy | mA | risAlap | hAd | ktab | gAdy
Morph | l +h | n+ m$y +φ | gAdy +$ | w+ mA | Al+ rsAlp | hAd | n+ ktb +φ | gAdy
Prefix | - | IV1S | - | CONJ | DET | - | IV1S | -
Stem | PREP | IV | FUT_PART | NEG_PART | NOUN | DEM_PRON_FS | IV | FUT_PART
Suffix | PRON_3MS | IVSUFF_SUBJ:1S | NEG_PART | - | NSUFF_FEM_SG | - | IVSUFF_SUBJ:1S | -

Table 2: An annotation example from DIWAN for MSA and the Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan dialects. All the sentences have the same meaning: ‘I will write this letter and not go to him’. The table is presented in a right-to-left direction. Raw represents a spontaneous word spelling. CODA represents the conventional orthography we use. Lemma shows the diacritized lemma form; this is the only line where we show diacritics. Morph represents the sequence of prefixes, the stem, and the sequence of suffixes. Prefix, Stem, and Suffix show the part-of-speech tags for the components of the word shown in the Morph line.
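To make the Morph line of Table 2 concrete, the sketch below splits one Morph cell into its prefixes, stem and suffixes. It is only an illustration under stated assumptions: it assumes the layout as printed in Table 2, where prefix groups end in `+`, suffix groups begin with `+`, and whitespace separates the groups from the bare stem. The function name `split_morph` is hypothetical, and the actual DIWAN file format may differ. Symbols such as `#` and `φ` are kept exactly as they appear.

```python
from typing import List, Tuple


def split_morph(cell: str) -> Tuple[List[str], str, List[str]]:
    """Split a Morph cell such as '$+A+ srH +φ+$' into (prefixes, stem, suffixes).

    Assumed convention (taken from the printed table, not from a format spec):
      - a whitespace-separated group ending in '+' lists prefixes,
      - a group starting with '+' lists suffixes,
      - the remaining group is the stem.
    """
    prefixes: List[str] = []
    suffixes: List[str] = []
    stems: List[str] = []
    for group in cell.split():
        if group.endswith("+"):
            prefixes.extend(p for p in group.split("+") if p)
        elif group.startswith("+"):
            suffixes.extend(s for s in group.split("+") if s)
        else:
            stems.append(group)
    if len(stems) != 1:
        raise ValueError(f"expected exactly one stem in {cell!r}, found {len(stems)}")
    return prefixes, stems[0], suffixes


if __name__ == "__main__":
    # Cells copied from the YE.TZ, YE.SN and SY.DM blocks of Table 2.
    for cell in ["$+A+ srH +φ+$", "Ed#+A+ ktb +φ", "l+ End +h", "rH"]:
        print(cell, "->", split_morph(cell))
```

The resulting three lists line up with the Prefix, Stem and Suffix tag rows of the same column, so a tag such as FUT_PART+IV1S can be paired with the two prefixes it describes by splitting the tag on `+` as well.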

Error Type | Dialect | Word | Gloss | Error | Correction
Null Subject | SA.NJ | آمر |mr | order | +|mr/CV+ | +|mr/CV+(null)/CVSUFF_SUBJ:2MS
Null Subject | YE.TZ | أصابحك >SAbHk | fight | >/IV1S+SAbH/IV+k/IVSUFF_DO:2MS | >/IV1S+SAbH/IV+(null)/IVSUFF_SUBJ:1S+k/IVSUFF_DO:2MS
Ta-Marbuta | SY.DM | جعبتي jEbty | pouch | +jEb/NOUN+p/NSUFF_FEM_SG+y/POSS_PRON_1S | +jEb/NOUN+t/NSUFF_FEM_SG+y/POSS_PRON_1S
Case | SY.DM | بالسقف bAlsqf | roof | b/PREP+Al/DET+sqf/NOUN+(null)/CASE_DEF_GEN | b/PREP+Al/DET+sqf/NOUN+

Table 3: Examples of annotation errors found during error analysis: null morphemes should be added; ta-marbuta is a common source of errors; case should never be annotated for the dialects
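The three error types in Table 3 lend themselves to simple rule-based checks of the kind a consistency script might run. The following Python sketch is hypothetical and is not the authors' script: it assumes annotations are written as `form/TAG` segments joined by `+`, with underscores inside tag names, as in Table 3. The rule names and the set of verb stem tags (`IV`, `CV`) are assumptions drawn only from the rows of that table.

```python
from typing import List, Tuple

# Verb stem tags seen in Table 3; other stem tags are deliberately not assumed.
VERB_STEM_TAGS = ("IV", "CV")


def parse_segments(annotation: str) -> List[Tuple[str, str]]:
    """Split 'b/PREP+Al/DET+sqf/NOUN' into [(form, tag), ...], skipping empty pieces."""
    segments = []
    for piece in annotation.split("+"):
        if not piece:
            continue  # leading or trailing '+' marks an empty prefix/suffix slot
        form, _, tag = piece.rpartition("/")
        segments.append((form, tag))
    return segments


def check_annotation(annotation: str) -> List[str]:
    """Return the names of the Table 3 rules that the annotation violates."""
    segments = parse_segments(annotation)
    tags = [tag for _, tag in segments]
    problems: List[str] = []

    # Rule 1: a verb stem must carry a subject suffix, even if it is (null).
    for stem_tag in VERB_STEM_TAGS:
        has_subject = any(t.startswith(stem_tag + "SUFF_SUBJ") for t in tags)
        if stem_tag in tags and not has_subject:
            problems.append("null_subject_missing")

    # Rule 2: ta-marbuta 'p' cannot be followed by another suffix; it should be 't'.
    for index, (form, tag) in enumerate(segments):
        is_last = index == len(segments) - 1
        if form == "p" and tag.startswith("NSUFF_FEM") and not is_last:
            problems.append("ta_marbuta_before_suffix")

    # Rule 3: case is never annotated for the dialects.
    if any(t.startswith("CASE") for t in tags):
        problems.append("case_annotated")

    return problems


if __name__ == "__main__":
    print(check_annotation("+|mr/CV+"))
    print(check_annotation("+jEb/NOUN+t/NSUFF_FEM_SG+y/POSS_PRON_1S"))
```

Checks like these only catch annotations that are impossible on their face; they cannot detect a contextually wrong but well-formed analysis, which still needs a human reviewer.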

[Figure 1 diagram; box labels: Collect Dialect Text, MADAMIRA, DIWAN Annotation, Error Correction, MADIWAN file, MAgold file]

Figure 1: Steps to creating a new annotated corpus for a dialect

3. annotators. tating each token with morphological and seman- tic information, including the following fields: The dialect leads verify the annotators’ work, and the project manager organizes and monitors • The CODA spelling of the raw token. the flow of the progress of everyone using the tool in the project. • The lemma, or the citation form, of the token.

Annotation Steps First, the dialect leads collect • The morphemes of the word (prefixes, stem, the corpus text from different resources like so- suffixes) and their part-of-speech (POS). The cial media, forms, websites, etc. The next step is stem is marked by the symbol # on either to develop dialect-specific annotation guidelines, side. including the CODA specification for normalized orthography. The dialect leads then train the an- • The English gloss of the word. notators before annotation starts. The leads follow • Features indicating proclitics and enclitics. the annotator’s work. The annotations are not ap- proved until the dialect leads check them. Wrong • Features indicating word POS, functional annotations are sent back to the annotator for cor- number and gender (Alkuhlani and Habash, rection. After the first round of annotation is done, 2011), and aspect. we perform a second round of error checking, us- ing both manual inspection and scripts that check The annotation for one sentence in different di- for coherent annotations. The result is a DIWAN alects is shown in Table2. This is not actually file which includes the correct annotation for the a sentence from our corpora, of course; we have entire corpus. In the last step, we automatically re- chosen it to illustrate the annotation. format the annotations into a format which is best Error Correction Linguistic annotation is car- suited for computational purposes; we perform a ried out manually. In order to guarantee high lev- third round of error checking for format errors, els of accuracy and precision, we performed ex- which we fix automatically. Figure1 shows these tensive error checking and correction. After an- steps. notating the seven different corpora, the anno- Morphological Features Annotated The DI- tated words were compiled in the form of linguis- WAN interface assists human annotators in anno- tic codes in either one file or separate files to be

checked and corrected by a second reviewer. This form of error checking cannot, of course, identify annotation errors in context (for example, a noun misidentified as a verb); instead, this approach is efficient at finding impossible annotations. Examining the data showed that the most challenging part for the annotators was the suffixes, especially in long and complicated words. Some examples of the errors are listed in Table 3.

Distribution of Resources All created resources will be freely available for research purposes from Columbia (http://innovation.columbia.edu).

6 Conclusion and Future Work

We presented a collection of morphologically annotated corpora for seven Arabic dialects, collectively covering over 200,000 words. All corpora were manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.

In future work, we will use these resources to train morphological taggers as described by Eskander et al. (2016). We also plan to extend the collection of dialects to include additional less studied varieties, following the lead of efforts such as Bouamor et al. (2018), and to expand towards different historical and literature-based varieties of Arabic.

7 Acknowledgments

This work is supported by the Air Force Research Laboratory (AFRL) under a grant administered by Ball Aerospace. Alkhereyf is supported by the KACST Graduate Studies program. The views expressed here are those of the authors and do not reflect the official policy or position of the U.S. Department of Defense or the U.S. Government. We would also like to thank all the anonymous reviewers for their insightful and valuable comments and suggestions.

References

Abd Al-Salam Al-Amri, editor. 2000. Texts in Sanani Arabic. O. Harrassowitz, Wiesbaden, Germany.

Nora Al-Twairesh, Rawan Al-Matham, Nora Madi, Nada Almugren, Al-Hanouf Al-Aljmi, Shahad Alshalan, Raghad Alshalan, Nafla Alrumayyan, Shams Al-Manea, Sumayah Bawazeer, et al. 2018. Suar: Towards building a corpus for the Saudi dialect. Procedia Computer Science, 142:72–82.

Sarah Alkuhlani and Nizar Habash. 2011. A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), Portland, Oregon, USA.

Faisal AlShargi, Aidan Kaplan, Ramy Eskander, Nizar Habash, and Owen Rambow. 2016. Morphologically annotated corpora and morphological analyzers for Moroccan and Sanaani Yemeni Arabic. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Faisal Alshargi and Owen Rambow. 2015. Diwan: a dialectal word annotation tool for Arabic. In Proceedings of WANLP 2015 - ACL-IJCNLP.

Raslan Bani-Yasin and Jonathan Owens. 1987. The phonology of a northern Jordanian Arabic dialect. Zeitschrift der Deutschen Morgenländischen Gesellschaft, 137(2):297–331.

Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).

Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, and Kemal Oflazer. 2018. The MADAR Arabic dialect corpus and lexicon. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.

Rahma Boujelbane, Mariem Ellouze Khemekhem, and Lamia Hadrich Belguith. 2013. Mapping rules for building a Tunisian dialect lexicon and generating corpora. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 419–428.

Kristen Brustad. 2000. The Syntax of Spoken Arabic: A Comparative Study of Moroccan, Egyptian, Syrian, and Kuwaiti Dialects. Georgetown University Press.

David Chiang, Mona Diab, Nizar Habash, Owen Rambow, and Safiullah Shareef. 2006. Parsing Arabic dialects. In 11th Conference of the European Chapter of the Association for Computational Linguistics.

Mark Cowell. 1964. A Reference Grammar of Syrian Arabic. Georgetown University Press, Washington, D.C.

Mona Diab, Mohamed Al-Badrashiny, Maryam Aminian, Mohammed Attia, Heba Elfardy, Nizar Habash, Abdelati Hawwari, Wael Salloum, Pradeep Dasigi, and Ramy Eskander. 2014. Tharwa: A Large Scale Dialectal Arabic-Standard Arabic-English Lexicon. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 3782–3789, Reykjavik, Iceland.

Wallace Erwin. 1963. A Short Reference Grammar of Iraqi Arabic. Georgetown University Press, Washington, D.C.

Ramy Eskander, Nizar Habash, and Owen Rambow. 2013. Automatic extraction of morphological lexicons from morphologically annotated corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1032–1043, Seattle, Washington, USA. Association for Computational Linguistics.

Ramy Eskander, Nizar Habash, Owen Rambow, and Arfath Pasha. 2016. Creating resources for dialectal Arabic from a single annotation: A case study on Egyptian and Levantine. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3455–3465, Osaka, Japan. The COLING 2016 Organizing Committee.

Nizar Habash, Mona T. Diab, and Owen Rambow. 2012a. Conventional orthography for dialectal Arabic. In LREC.

Nizar Habash, Fadhl Eryani, Salam Khalifa, Owen Rambow, Dana Abdulrahim, Alexander Erdmann, Reem Faraj, Wajdi Zaghouani, Houda Bouamor, Nasser Zalmout, Sara Hassan, Faisal Al shargi, Sakhar Alkhereyf, Basma Abdulkareem, Ramy Eskander, Mohammad Salameh, and Hind Saddiki. 2018. Unified guidelines and resources for Arabic dialect orthography. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.

Nizar Habash, Ramy Eskander, and Abdelati Hawwari. 2012b. A morphological analyzer for Egyptian Arabic. In Proceedings of the twelfth meeting of the special interest group on computational morphology and phonology, pages 1–9. Association for Computational Linguistics.

Nizar Habash, Ryan Roth, Owen Rambow, Ramy Eskander, and Nadi Tomeh. 2013. Morphological analysis and disambiguation for dialectal Arabic. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 426–432.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, pages 15–22. Springer, Netherlands.

Salima Harrat, Karima Meftouh, Mourad Abbas, and Kamel Smaili. 2014. Building resources for Algerian Arabic dialects. In 15th Annual Conference of the International Communication Association Interspeech.

Richard Harrell. 1962. A Short Reference Grammar of Moroccan Arabic: With Audio CD. Georgetown Classics in Arabic Language and Linguistics. Georgetown University Press.

Clive Holes. 1990. Gulf Arabic. Croom Helm Descriptive Grammars. Routledge, London / New York.

Clive Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown Classics in Arabic Language and Linguistics. Georgetown University Press.

Bruce Ingham. 1994. Najdi Arabic: Central Arabian. John Benjamins.

Mustafa Jarrar, Nizar Habash, Diyam Akra, and Nasser Zalmout. 2014. Building a corpus for Palestinian Arabic: a preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 18–27, Doha, Qatar. Association for Computational Linguistics.

Salam Khalifa, Nizar Habash, Dana Abdulrahim, and Sara Hassan. 2016. A large scale corpus of Gulf Arabic. CoRR, abs/1609.02960.

Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. 2018. A morphologically annotated corpus of Emirati Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.

Hanaa Kilany, Hassan Gadalla, Howaida Arram, Ashraf Yacoub, Alaa El-Habashi, and Cynthia McLemore. 2002. Egyptian Colloquial Arabic Lexicon. LDC catalog number LDC99L22.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, Mona Diab, Nizar Habash, Owen Rambow, and Dalila Tabessi. 2006. Developing and using a pilot dialectal Arabic treebank. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash, and Ramy Eskander. 2014. Developing an Egyptian Arabic treebank: Impact of dialectal morphology on annotation and tool development. In LREC, pages 2348–2354.

Arfath Pasha, Mohamed Al-Badrashiny, Mona T Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan Roth. 2014. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In LREC, volume 14, pages 1094–1101.

Wael Salloum and Nizar Habash. 2014. Adam: Analyzer for dialectal Arabic morphology. Journal of King Saud University-Computer and Information Sciences, 26(4):372–378.

Clare Voss, Stephen Tratz, Jamal Laoudi, and Douglas Briesch. 2014. Finding romanized Arabic dialect in code-mixed tweets. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2249–2253, Reykjavik, Iceland. European Language Resources Association (ELRA). ACL Anthology Identifier: L14-1086.

Janet Watson, editor. 1993. A syntax of Sanani Arabic. O. Harrassowitz, Wiesbaden, Germany.

Janet Watson. 2002. The Phonology and Morphology of Arabic. Oxford University Press.

Inès Zribi, Rahma Boujelbane, Abir Masmoudi, Mariem Ellouze, Lamia Hadrich Belguith, and Nizar Habash. 2014. A conventional orthography for Tunisian Arabic. In LREC, pages 2355–2361.
