Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan
Faisal Alshargi,⋆ Shahd Dibas,‡ Sakhar Alkhereyf,† Reem Faraj,† Basmah Abdulkareem,† Sane Yagi,‡ Ouafaa Kacha,‡ Nizar Habash,∗ Owen Rambow§
⋆Universität Leipzig, Germany
‡University of Jordan, Jordan
†Columbia University, USA
∗New York University Abu Dhabi, UAE
§Elemental Cognition, USA

Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 137–147, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics

Abstract

We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.

1 Introduction

As Arabic dialects (DA) become more widely written in social media, there is increased interest in the Arabic NLP community in annotated corpora that will allow us both to study the dialects linguistically and to create systems that can automatically process dialectal text. There have been important efforts to create relatively large corpora for Egyptian (Maamouri et al., 2014), Palestinian (Jarrar et al., 2014), and Emirati Arabic (Khalifa et al., 2018). While these resources are very helpful for single dialects, the problem is that there are many dialects, and in fact it is often unclear what to count as separate dialects (for example, the subdialects of Levantine). Therefore, we present a different approach in this paper: we annotate seven dialects, but with relatively smaller corpora (most around 30,000 words). Some of the dialects are closely related (Jordanian and Syrian), others are more distant (Moroccan). We use the same annotation methodology for all dialects: same guidelines, same processing steps, and same annotation file format. This makes our effort an ideal starting point for experimenting with using multidialectal resources to create and train NLP tools. The dialects we consider are Taizi Yemeni (YE.TZ)¹, Sanaani Yemeni (YE.SN), Saudi Najdi (SA.NJ), Jordanian (JOR), Syrian Damascene (SY.DM), Iraqi Baghdadi (IR.BG), and Moroccan Rabati (MA.RB) Arabic.

The paper is structured as follows. We start with a review of relevant literature (Section 2). We then summarize some linguistic facts about DA in general (Section 3) and subsequently present each of our seven dialects in Section 4, summarizing the corpora used and some interesting facts specific to each dialect. Section 5 then presents our annotation methodology. We then briefly discuss morphological analyzers, and conclude.

¹ The abbreviations we use are intended to capture the country name and the city or region name when applicable.

2 Related Work

Data Collections. There have been several data collections centered on Arabic dialects, specifically spoken Arabic. A very useful resource is the Semitisches Tonarchiv at the University of Heidelberg in Germany.² We have included two Yemeni transcriptions from this resource in our YE.TZ and YE.SN corpora. Khalifa et al. (2016) is a large collection of over 100M words from a number of Arabic dialects, although the majority is from the Gulf. Bouamor et al. (2018) created a large corpus of parallel text from 25 Arab cities. Further data collections include Al-Amri (2000), which has not yet been digitized for use in NLP research.

² http://www.semarch.uni-hd.de

Annotated Corpora. There are few annotated corpora for dialectal Arabic: the Levantine Arabic Treebank (specifically Jordanian) (Maamouri et al., 2006), the Egyptian Arabic Treebank (Maamouri et al., 2014), Curras, the Palestinian Arabic annotated corpus (Jarrar et al., 2014), the Gulf Arabic Annotated corpus (Khalifa et al., 2018), Syrian and Jordanian dialectal corpora (Bouamor et al., 2014; Harrat et al., 2014), a small effort on Sanaani and Moroccan (AlShargi et al., 2016) (which this paper builds on), and SUAR (Al-Twairesh et al., 2018), a morphologically annotated corpus for Najdi and Hijazi which is semi-automatically annotated using the MADAMIRA tool (Pasha et al., 2014) and subsequently manually checked. Additionally, Voss et al. (2014) present a corpus of Moroccan dialect which has been annotated for language variety (code switching). Several of these efforts have followed the approach of Curras (Jarrar et al., 2014), which consists of around 70,000 words of a balanced genre corpus. The corpus was manually annotated using the DIWAN tool (Alshargi and Rambow, 2015), which we also use. The annotation in Curras is done by first using a morphological tagger for another Arabic dialect, namely MADAMIRA Egyptian (Pasha et al., 2014), to produce a base that was then corrected or accepted by a trained annotator.

Other NLP Resources for Dialectal Arabic. The effort to annotate corpora in context is a central step in developing morphological analyzers and taggers (Eskander et al., 2013; Habash et al., 2013). However, other notable approaches and efforts that do not use annotated corpora have focused on developing specific resources manually or semi-automatically, e.g., the Egyptian Arabic morphological analyzer (Habash et al., 2012b), which is built upon the Egyptian Colloquial Arabic Lexicon (Kilany et al., 2002), the multi-dialectal dictionary Tharwa (Diab et al., 2014), or extending MSA analyzers and resources (Salloum and Habash, 2014; Harrat et al., 2014; Boujelbane et al., 2013).

Linguistic Studies. There are many theoretical and descriptive linguistic studies for the dialects we work on: Yemeni dialects (Watson, 1993, 2002), Najdi (Ingham, 1994), Gulf Arabic dialect (Holes, 1990), Jordanian (Bani-Yasin and Owens, 1987), Moroccan (Harrell, 1962), Syrian (Cowell, 1964), and Iraqi (Erwin, 1963); not to mention comparative studies across dialects and MSA (Holes, 2004; Brustad, 2000). We make extensive use of such studies as part of the design of our annotation guidelines.

3 Dialects: Linguistic Facts

In this section we present some general facts and phenomena shared across different dialects. In subsequent subsections, we present our dialects in more detail and comment on the corpus sources.

Dialects and MSA. Arabic dialects share many commonalities with Classical Arabic and Modern Standard Arabic (MSA). All variants of Arabic are morphologically complex as they include rich inflectional and derivational morphology that is expressed in two ways: namely, via templates and affixes. Furthermore, they contain several classes of attachable clitics. However, the dialects as a class differ in consistent ways from MSA, and they differ amongst each other. In fact, the differences between MSA and Dialectal Arabic (DA) have often been compared to those between Latin and the Romance languages (Chiang et al., 2006). The principal morpho-syntactic difference between DA and MSA is the loss of productive case marking and nunation (tanween) on nouns, and of mood on imperfective verbs.

Dialectal Variations. Differences among the dialects are found on all levels of linguistic description, i.e., phonology, morphology, syntax, and the lexicon. We summarize three phonological and three morphological salient examples in Table 1 for our dialects: the pronunciation of MSA /q/ written ق q,³ MSA /dʒ/ written ج j and MSA /k/ written ك k; and the various forms of the future, progressive and possessive particles.

³ We represent the Arabic words in Arabic script and in the Buckwalter transliteration (in italics) (Habash et al., 2007). When needed, we present the IPA (in /.../). The English gloss is added in single quotes.

| Phenomenon | MSA | YE.TZ | YE.SN | SA.NJ | JOR | SY.DM | IR.BG | MA.RB |
|---|---|---|---|---|---|---|---|---|
| Pronunciation of ق q | /q/ | /q/ | /g/ | /g/ or /dz/ | /g/ or /ʔ/ | /ʔ/ | /g/ | /q/ or /g/ |
| Pronunciation of ج j | /dʒ/ | /g/ | /dʒ/ | /dʒ/ | /ʒ/ | /ʒ/ | /dʒ/ | /dʒ/ |
| Pronunciation of ك k | /k/ | /k/ | /k/ | /k/ or /ts/ | /k/ or /tʃ/ | /k/ | /k/ or /tʃ/ | /k/ |
| Future Particle | s+ | $+ | E+ | b+ | H+ | H+ | H+ | g+ |
| Progressive Particle | φ | b+ | b+ | qAEd | b+ | b+ | d+ | k+ |
| Possessive Particle | φ | tbE | tbE | Hq | tbE | tbE | mAl | d+ |

Table 1: Cross-dialectal and MSA variants in some phonological and morphological phenomena

Several cells of Table 1 list further variants whose dialect columns cannot be recovered from the flattened layout. In source order, these are: future swf, A$, Ed, rH, rH, rH, gAdy, $+, rAH, y+; progressive jAls, Em, qAEd, t+; possessive Hq, Hq, tAE, dyAl.

From a lexical point of view, there are many words that have different meanings across dialects. For example, the word ماشي mA$y /ma:ʃi/ is ‘no’ in YE.SN and MA.RB, ‘yes/ok’ in SY.DM and JOR, and ‘walking’ in SA.NJ. Another example is the word صافي SAfy /sˤa:fi/, which means ‘enough’ in MA.RB, but ‘pure’ in the other dialects and MSA. Some cases show subtle differences in meaning, e.g., خدام xdAm /xadda:m/ means ‘employee’ generically in MA.RB, but it has a more specific and negative connotation in YE.TZ and YE.SN, namely ‘enslaved servant’. While the above cases are all homonyms (homophones and homographs), there are instances of homophones that have different meanings in different dialects. For example, the utterance /fagr/ can mean ‘morning’ in YE.TZ (written as فجر fjr), or .

4 Dialect-Specific Corpora

Until recently, Arabic was mostly written in Mod-