A Multidialectal Parallel Corpus of Arabic

Houda Bouamor1, Nizar Habash2 and Kemal Oﬂazer1

1Carnegie Mellon University in Qatar [email protected],[email protected] 2Center for Computational Learning Systems, Columbia University [email protected]

Abstract The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However they are the primary vehicles of communication (face-to-face and recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

Keywords: Arabic, Dialects, Parallel Corpus

1. Introduction tion of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in The Arabic language today is a collection of variants addition to English. Since such parallel data does not (i.e., dialects) that are historically related to Classi- exist naturally (unlike parallel news, e.g.) this is a cal Arabic, but are the product of intense interaction very valuable resource that has many potential appli- among the various historical dialects of Classical Ara- cations such as Arabic dialect identification and ma- bic, the pre-Islamic local languages in the Arab World chine translation. (such as Coptic, Berber and Syriac), neighboring lan- The remainder of this paper is organized as follows. guages (such as Persian, Turkish and Spanish) and Section 2 discusses the differences between MSA and colonial era languages (such as Italian, French and DA and within DAs . In Section 3, we review the main English). Dialects of Arabic (DA) vary among them- previous efforts for building dialectal resources. Our selves and the Modern Standard Arabic (MSA) va- approach for building the multidialectal parallel cor- riety, which is the language of media and education pus is explained in Section 4. Section 5 presents some across the Arab World. The differences are not only preliminary corpus analysis and statistics. Finally, we lexical, but also phonological, morphological and to conclude and describe our future work in Section 6. lesser degree syntactic. There are numerous linguistic studies in Arabic dialects, with the comparative studies being limited to small scale and old laborious field 2. Arabic Dialect Variation method techniques of information collection (Brustad, While MSA is the shared official language of culture, 2000). media and education from Morocco to the Gulf coun- In the context of natural language processing (NLP), tries, it is not the native language of any speakers of some Arabic dialects have started receiving increasing Arabic. Most native speakers of Arabic are unable attention, particularly in the context of machine trans- to produce sustained spontaneous discourse in MSA; lation (Zbib et al., 2012; Salloum and Habash, 2013) in unscripted situations where spoken MSA would and in terms of basic enabling technologies (Habash normally be required (such as talk shows on TV), et al., 2012b; Habash et al., 2013; Pasha et al., 2014). speakers usually resort to repeated code-switching be- However, the focus is on a small number of iconic di- tween their dialect and MSA (Abu-Melhim, 1991; alects, (e.g., Cairene Arabic). In this paper, we present Bassiouney, 2009). Arabic dialects are often classified the first multidialectal Arabic parallel corpus, a collec- regionally (as Egyptian, North African, Levantine, Gulf, Yemeni) or sub-regionally (e.g., Tunisian, Al- et al., 2012a): Egyptian bas ‘only’, . èQ K.Q£ gerian, Lebanese, Syrian, Jordanian, Kuwaiti, Qatari, tarabayza¯h ‘table’, H@QÓ mirAt ‘wife [of]’ and ÈðX etc.). There are a number of additional ways to clas- dawl ‘these’, correspond to MSA faqaT, sify the dialects (as per social class or city level). ¡®¯ éËðA£ TAwila¯h, ék ðP zawja¯h and ZBñë hawlA’ˆ , respectively. Arabic is a morphologically complex language that For comparison,. the LEV forms of the above words combines a rich inflectional morphology with a highly are bas (like EGY), TAwli¯h (closer to 1 . éËðA£ ambiguous orthography, which poses many chal- MSA), mart and hadawl. lenges for NLP (Habash, 2010). These features are èQÓ ÈðYë shared by DA and MSA. The differences between Syntax Comparative studies of several Arabic di- MSA and DAs have often been compared to Latin alects suggest that the syntactic differences between and the Romance languages (Habash, 2006). Arabic the dialects are minor. For example, negation may be dialects differ phonologically, lexically, and morpho- realized differently (AÓ mA, Ó mish, ñÓ muw, B lA, logically from one another and from MSA (Watson, ÕË lam, etc.) but its syntactic distribution is to a large 2007). extent uniform across varieties (Benmamoun, 2012). Phonology An example of phonological differences is in the pronunciation of dialectal words whose MSA 3. Related work cognate has the letter Qaf ( q).2 It is often observed that in Tunisian Arabic, this consonant appears as /q/ Much work has been done in the context of standard (similar to MSA), while in Egyptian and Levantine Arabic NLP (Habash, 2010). There are lots of par- Arabic it is /’/ (glottal stop) and in Gulf Arabic it is allel and monolingual data collections, richly anno- /g/ (Haeri, 1991; Habash, 2010). tated collections (e.g., treebanks), sophisticated tools for morphological analysis and disambiguation, syn- Orthography While MSA has an established stan- tactic parsing, etc. (Habash, 2010). Efforts to create dard orthography, the dialects do not. Often people resources for Dialectal Arabic (DA) have been limited write words reflecting the phonology or the history to a small number of major dialects (Diab and Habash, (etymology) of these words. DAs are sometimes writ- 2007; Habash et al., 2013; Pasha et al., 2014). ten in Roman script (Darwish, 2013). In the context Several researchers have explored the idea of exploit- of NLP, a conventional orthography for DA (CODA) ing existing MSA rich resources to build tools for DA has been proposed and instantiated for Egyptian Ara- NLP. Al-Sabbagh and Girju (2010) described an ap- bic by Habash et al. (2012a) and was later extended to proach of mining the web to build a DA-to-MSA lex- Tunisian Arabic (Zribi et al., 2014). icon. Chiang et al. (2006) built syntactic parsers Morphology Morphological differences are quite for DA trained on MSA treebanks. Similarly Sawaf common. One example is the future marker particle (2010), Sajjad et al. (2013) and Salloum and Habash which appears as + sa+ or ¬ñ sawfa in MSA, (2013) translated dialectal Arabic to MSA as a bridge + Ha+ or raH in Levantine dialects, + ha+ in to translate to English. Boujelbane et al. (2013) built h hP ë a bilingual dictionary using explicit knowledge about Egyptian and AK bAš in Tunisian. This together with the relation between Tunisian Arabic and MSA. variation in the templatic. morphology make the forms Crowdsourcing to build specific resources (e.g., paral- of some verbs rather different: e.g. ’I will write’ is lel data for translation) for a specific dialect has also saÂaktubu (MSA), HaÂaktub (Pales- I. J»A I. J»Ag been successful (Zbib et al., 2012). Some efforts on tinian), haktib (Egyptian) and bAš I. Jºë I. JºK AK . dialect identification at the regional level have been niktib (Tunisian). done (Habash et al., 2008; Elfardy and Diab, 2013; Lexicon The number of lexical differences is quite Zaidan and Callison-Burch, 2013). significant. The following are a few examples (Habash In the context of DA-to-English SMT, Riesa and Yarowsky (2006) presented a supervised algorithm for 1Short vowels and consonantal doubling are represented online morpheme segmentation on DA that cut the with optionally written diacritics in Arabic orthography. OOVs by half. Zaidan and Callison-Burch (2011) This leads to a high degree of ambiguity and also hides crawled the websites of three Arabic Newspapers and some of the differences among dialects and MSA. extracted reader commentary on their articles to build 2 Arabic transliteration is presented in the Habash- the Arabic Online Commentary dataset. They also Soudi-Buckwalter scheme (Habash et al., 2007): (in alpha- collected crowd-driven dialectal annotations on Ara- betical order) AbtθjHxdðrzsšSDTDˇ ςγfqklmnhwy and the bic sentences using Mechanical Turk. More recently, additional symbols: ’ ,Â , Aˇ , A¯ , wˆ , yˆ , ¯h , ý . Z @ @ @ ð' ø è ø Dialect/Language Example English Because you are a personality that I can not describe. Modern Standard Arabic .Aê®ð ©J ¢@B éJ m ½KB lÂnk šxSy¯h lA ÂstTyς wSfhA. Egyptian Arabic .Aê®ð@ ¬Qªë Ó Ym.'ð éJ m ½KB lÂnk šxSy¯h wbjd mš. hςrf ÂwSfhA. Syrian Arabic .Aê®ð@ ¬Q«@ hP AÓ Yj. J«ð éJ m ½KB lÂnk šxSy¯h wςnjd mA rH Âςrf ÂwSfhA. Jordanian Arabic é®ð@ PY¯@ ÉJ jÓ éJ m Yg. IK@ Ant jd šxSy¯h mstHyl Aqdr AwSfhA. Palestinian Arabic .ñJK. AÓ ½J m ½J Ê« é<Ë@ ZAAÓ Yg. á« ςn jd mA šA’ Allh ςlyk šxSytk mA btnwSf. Tunisian Arabic .Aê®ñK Òj. JÓ jÊK. éJ m ¼Q£Ag úÎ« ςlý xATrk šxSy¯h blHq mnjmš nwSfhA.

Table 1: A table comparing the translations for one sentence in the Multidialectal Arabic Corpus. Egyptian Arabic is the original sentence which was translated to MSA, English and the other dialects.

Zbib et al. (2012) demonstrated a crowd-sourcing so- professional translators hired on MTurk. Since our lution to translating sentences from Egyptian and Lev- translators saw the sentences out of context, we pro- antine into English, and thus built two bilingual cor- vided them with the equivalent ones in English to help pora. The dialectal sentences were selected from a disambiguate some readings if necessary. large corpus of Arabic web text. They argued that dif- Every translator was asked to: (a) read the sentences ferences in genre between MSA and DA make bridg- carefully and simply translate them without adding ing through MSA of limited value. any new information; (b) avoid word by word translation; and (c) be consistent in their orthographic 4. Approach choices and avoid Roman script writing. We asked In addition to Standard Arabic (MSA) and English the translators to be internally consistent in spelling (EN), our corpus covers ﬁve dialects: Egyptian (EG), words since there is no standard orthography available Tunisian (TN), Syrian (SY), Jordanian(JO) and Pales- for Arabic dialects at this time and we wanted to min- tinian (PA) . The last three dialects represent the Lev- imize unnecessary sparsity. We did not provide them antine group of Arabic dialects. In the future, we plan with any orthographic guidelines (other than the re- to expand this effort to other dialects. quest for internal consistency). A different approach would have been to collect the dialectal sentences in In order to build our corpus, four translators (na- Arabic script following a general conventional orthog- tive speakers of Palestinian, Syrian, Jordanian and raphy for DA such as CODA. However CODA guide- Tunisian) were asked to translate 2,000 sentences writ- lines at this time only cover Egyptian and Tunisian ten in Egyptian into their dialects. Egyptian was cho- (Habash et al., 2012a; Zribi et al., 2014). sen as a starting point because it is the most widely un- derstood and used dialect throughout the Arab world. 5. Preliminary Data Analysis The Egyptian media industry has traditionally played a dominant role in the Arab world. A large number of Table 1 illustrates the translations for one sentence in cinema productions, television dramas and comedies the multidialectal Arabic corpus. This example high- have since long familiarized Arab audiences with the lights the many lexical and morphological differences Egyptian dialect. A ﬁfth translators (who happened to among the different dialects. For example, the Egyp- be Egyptian) was asked to translate the same text to tian expression Ym.'.ð wbjd ‘and seriously’ was trans- MSA. lated into YjJ«ð wςnjd in Syrian, and jÊK blHq in The sentences are selected from the Egyptian part Tunisian. The. example shows, as well, that. there are of the Egyptian-English corpus built by Zbib et al. many shared words that, on their own, cannot disam- (2012). This corpus was translated to English by non- biguate among the different dialects. EG SY JO PA TN MSA # tokens 11,131 11,586 9,866 11,131 10,896 11,048 # unique tokens 4,588 4,167 4,055 3,675 4,483 4,436 # tokens per sentence 9.22 9.61 8.17 8.52 8.98 9.70

Table 2: Statistics on our corpus Arabic dialect corpus(a sample of 1,000 sentences for each dialect and MSA)

Raw Sentences MSA TN PA JO SY EG 39.12 37.30 43.42 42.33 44.66 SY 28.23 31.61 38.87 47.42 JO 26.97 32.74 45.29 PA 26.97 30.20 TN 26.32

Table 3: Sentence pair average similarities using the Overlap Coefﬁcient

Orthographically Normalized Sentences MSA TN PA JO SY EG 44.64 41.32 50.29 49.33 53.81 SY 34.47 35.12 47.56 54.05 JO 33.30 34.26 49.81 PA 33.40 34.22 TN 31.41

Table 4: Orthographically normalized sentence pair average similarities using the Overlap Coefﬁcient

Table 2 provides various statistics for a sample of other dialects. This is not surprising since Tunisian is a 1,000 sentences extracted from our multidialectal cor- Western dialect, whereas Levantine (PA, JO, SY) and pus. Egyptian are all Eastern dialects. The highest degree To compare the similarity of the sentence pairs, we of similarity across these dialects seems to be within compute the Overlap Coefficient (OC), representing the Levantine family (Syrian and Jordanian). When the percentage of lexical overlap between the vocab- we compare Levantine against Egyptian, we observe that the highest degree of similarity is between Syr- ularies for each dialect pair D1 and D2. The OC is computed as follows: ian and Egyptian. This could be explained by the fact that the Syrian translator is currently living in Egypt, which might introduce a bias in the translation pro- |D ∩ D | OC = 1 2 cess. In the future, we plan to consider carefully such min(|D |, |D |) 1 2 biasing factors when creating translations. We conducted a preliminary lexical analysis restricted In order to study the impact of an orthographic nor- to simple matches. The results are given in Table 3. It malization on the similarity degree between dialects, is important to note the high lexical overlap of Egyp- we compute the overlap coefficient on pairs of sen- tian Arabic with the rest of dialects studied. This could tences in which all the Hamzated Alif forms (@ A¯, @ Â, be explained by the fact that the translations were orig- Aˇ) are replaced by a bare Alif A, the Alif-Maqsura inally obtained from Egyptian. We first observe that @ @ ý by Ya y and the Ta-Marbuta ¯h by Ha h. Sim- the MSA and Egyptian closeness is particularly high. ø ø è è The fact that the MSA translator is Egyptian possibly ilarity results are reported in Table 4. It is important introduced a bias in the translation process which ex- to notice that the normalization does not change any- plains this higher degree of similarity (39.12). The thing in the similarity ranking, which suggests that the other dialects are all less similar to MSA than Egyp- kind of orthography errors made by the translators are tian. Their overlap degree with MSA ranges from 26 naturally distributed across the different corpora. to 28. If we focus on different dialects (without MSA), Figure 1 presents some additional examples of sen- we notice that Tunisian has the least overlap with all tences from this corpus. EG ðYJ« ÐAJK ñÓQ« Éð AÖÏ AJ ®K Q¯@ úÍ@ é®K Y YJªË Q¯A Yg@ð MSA èYJ« I J.ÒÊË é¯A J@ Éð AÓYJ« ð AJ ®K Q¯@ ú¯ éË K Y úÍ@ Q¯A m SY Example 1 YJ« ÐAJK ñÓQ« Éð AÖÏ AJ ®K Q¯@ úÍ@ ñ®J ¯P YJªË Q¯A Yg@ð JO YJ« ÐAJK ñÓQ« Éð AÖÏ , AJ®K Q¯AK ñJkA YJ« Q¯A Yg@ð . . PA èYJ« ÐAJK @ñÓQ« éÊð ÐñK ð AJ ®K Q¯@ ú¯ éË@ K Y úÎ«Q¯A Yg@ð TN ðYJ« HAJ . K èA«YJ@ Éð J » AJ ®K Q¯@ ú¯ ñJ.kA YJ« AÓ ð Q¯A Yg@ð EN One left to his friend to Africa, when he got there he invited him to sleep over. EG á ªK. ñkQj. JÓ ªK. @ñJ.j K. JK@ ø @ à@ ðQ®ÖÏ@ MSA AÒîDªK. @ñkQm.' BI. k é¯C« ú¯ á m ø@ à@ Q®ÖÏ@ áÓ SY ' á Example 2 ªK. ñkQm. AÓ ªK. ñJ.j K. JK@ ø @ ñK@ ðQ®ÖÏ@ JO ' ' ªK. ñkQm. AÓ ªK. ñJ.m . ú Í@ ðQ®ÖÏ@ PA ªK . @ñkQm.' AÓ ªK . ñJ.m' á J K@ ø@ ñK@Ð PB . TN ªK . ñkQm .' AÓ ÑîDªK . ñJ.m' Pð P ø@ ðQ ®ÖÏ@ EN Its supposed to be that any two that loves each other not hurt each other EG éK @ AîDK. úÎÒªK é¯PA« Óð AëAg. AJm× Ó èQîD á KA ®K. àAJ ÊÓ ½K.BðX MSA AîE. á Êª®K@ XAÓ ú¯QªK B ð AîDJ k. AJm' B èQîD á KA ®K. úÎ JÜ Ø ½K.BðX SY Example 3 úæ AîD ¯ úÎÒªK éKA¯Q« ñÓð AJk. Am'. ñÓ úæK@ ð èQîD á KA¯ éKAJ ÊÓ ½JK@Qk JO éJ ¯ úÎÒªK hP ñ èQîD á KA ¯ éJK CÓ ½JK@ Q k PA AîD¯ øñ ¼YK. ñ é¯PA« Óð AîDk . Am' Ó I K@ èQîD á KA ¯ éKAJ ÊÓ ½JK@ Q k . TN ÑîDK. ÉÒªK . @ ¯QªK AÓ ð ÑîDK. ½JAg XA«AÓ H@QîDË@ HðQK. éJ J.ªÓ ½JK@ Q k EN Your closet is full of dresses that you don’t need or don’t know what. to do with

Figure 1: Translations for three sentences in the Multidialectal Arabic Corpus

6. Conclusion and Future Work Acknowledgements

The first and third authors were supported by grant We presented the first multidialectal Arabic parallel NPRP-09-1140-1-177 from the Qatar National Re- corpus, a collection of 2,000 sentences covering in ad- search Fund (QNRF), a member of the Qatar Foun- dition to English and MSA, the Egyptian, Tunisian, dation. The second author was supported by the De- Jordanian, Palestinian and Syrian Arabic dialects. The fense Advanced Research Projects Agency (DARPA) methodology we used relied on the familiarity of under Contract No. HR0011-12-C-0014. Any opin- Egyptian Arabic to most other dialect speakers. A pre- ions, findings and conclusions or recommendations liminary analysis of the data confirms known expecta- expressed in this paper are those of the authors and do tions about degrees of similarity between some of the not necessarily reflect the views of QNRF or DARPA. dialects, but also points out to possible bias created by the choice of a specific dialect (in our case Egyptian) as the starting point. 7. References We plan on extending the corpus in terms of size and Abu-Melhim, Abdel-Rahman. (1991). Code- number of dialects. We also plan to enrich it with addi- switching and Linguistic Accommodation in tional annotations such as a CODA version, morpho- Arabic. In Perspectives on Arabic Linguistics logical tokenization, POS tagging and manual word III: Papers from the Third Annual Symposium on alignments. Having all these rich annotations can be Arabic Linguistics, volume 80, pages 231–250. very helpful to supporting research in Arabic dialect John Benjamins Publishing. NLP. Al-Sabbagh, Rania and Girju, Roxana. (2010). Min- The corpus will be made freely available for research ing the Web for the Induction of a Dialectical Arabic purposes. Lexicon. In LREC, Valetta, Malta. Bassiouney, Reem. (2009). Arabic Sociolinguistics. cal analysis and disambiguation for dialectal arabic. Edinburgh University Press. In Proceedings of NAACL-HLT, pages 426–432, At- Benmamoun, Elabbas. (2012). Agreement and Cliti- lanta, Georgia, June. cization in Arabic Varieties from Diachronic and Habash, Nizar. (2006). On Arabic and its Dialects. Synchronic Perspectives. In Bassiouney, Reem, ed- Multilingual Magazine, 17(81). itor, Al’Arabiyya: Journal of American Association Habash, Nizar Y. (2010). Introduction to Arabic nat- of Teachers of Arabic, volume 44-45, pages 137– ural language processing, volume 3. Morgan & 150. Georgetown University Press. Claypool Publishers. Boujelbane, Rahma, Ellouze Khemekhem, Mariem, Haeri, Niloofar. (1991). Sociolinguistic Variation in and Belguith, Lamia Hadrich. (2013). Mapping Cairene Arabic: Palatalization and the qaf in the Rules for Building a Tunisian Dialect Lexicon and Speech of Men and Women. Generating Corpora. In Proceedings of the Sixth In- Pasha, Arfath, Al-Badrashiny, Mohamed, Kholy, ternational Joint Conference on Natural Language Ahmed El, Eskander, Ramy, Diab, Mona, Habash, Processing, pages 419–428, Nagoya, Japan. Nizar, Pooleery, Manoj, Rambow, Owen, and Roth, Brustad, Kristen. (2000). The Syntax of Spoken Ara- Ryan. (2014). Madamira: A fast, comprehensive bic: A Comparative Study of Moroccan, Egyptian, tool for morphological analysis and disambiguation Syrian, and Kuwaiti Dialects. Georgetown Univer- of arabic. In In Proceedings of LREC, Reykjavik, sity Press. Iceland. Chiang, David, Diab, Mona, Habash, Nizar, Rambow, Riesa, Jason and Yarowsky, David. (2006). Min- Owen, and Shareef, Safiullah. (2006). Parsing Ara- imally Supervised Morphological Segmentation bic Dialects. In Proceedings of EACL, Trento, Italy. with Applications to Machine Translation. In Pro- Darwish, Kareem. (2013). Arabizi Detection and ceedings of AMTA, Cambridge, MA. Conversion to Arabic. CoRR. Sajjad, Hassan, Darwish, Kareem, and Belinkov, Yonatan. (2013). Translating dialectal arabic to en- Diab, Mona and Habash, Nizar. (2007). Arabic Di- glish. In Proceedings of ACL, Sofia, Bulgaria. alect Processing Tutorial. In NAACL. Salloum, Wael and Habash, Nizar. (2013). Dialec- Elfardy, Heba and Diab, Mona. (2013). Sentence tal arabic to english machine translation: Pivoting Level Dialect Identification in Arabic. In Proceed- through modern standard arabic. In Proceedings of ings of the Association for Computational Linguis- NAACL-HLT, Atlanta, Georgia. tics, pages 456–461, Sofia, Bulgaria. Sawaf, Hassan. (2010). Arabic dialect handling in hy- Habash, Nizar, Soudi, Abdelhadi, and Buckwalter, brid machine translation. In Proceedings of AMTA, Timothy. (2007). On Arabic transliteration. In Denver, Colorado. Soudi, Abdelhadi, Neumann, Guenter, and van den Watson, Janet CE. (2007). The Phonology and Mor- Bosch, Antal, editors, Arabic Computational Mor- phology of Arabic. Oxford University Press. phology, volume 38 of Text, Speech and Language Zaidan, Omar F. and Callison-Burch, Chris. (2011). Technology, chapter 2, pages 15–22. Springer. Crowdsourcing Translation: Professional Quality Habash, N., Rambow, O., Diab, M., and Kanjawi- from Non-Professionals. In Proceedings of ACL, Faraj, R. (2008). Guidelines for Annotation of Ara- Portland, Oregon, USA. bic Dialectness. In Proceedings of the LREC Work- Zaidan, Omar and Callison-Burch, Chris. (2013). shop on HLT & NLP within the Arabic world. Arabic dialect identification. Computational Lin- Habash, Nizar, Diab, Mona, and Rambow, Owen. guistics. (2012a). Conventional Orthography for Dialectal Zbib, Rabih, Malchiodi, Erika, Devlin, Jacob, Stal- Arabic. In Proceedings of the Eighth International lard, David, Matsoukas, Spyros, Schwartz, Richard, Conference on Language Resources and Evaluation Makhoul, John, Zaidan, Omar F., and Callison- (LREC-2012), pages 711–718, Istanbul, Turkey. Burch, Chris. (2012). Machine translation of arabic Habash, Nizar, Eskander, Ramy, and Hawwari, Abde- dialects. In Proceedings of NAACL-HLT, Montréal, lati. (2012b). A Morphological Analyzer for Egyp- Canada. tian Arabic. In Proceedings of the Twelfth Meet- Zribi, I., Boujelbane, R., Masmoudi, A., El- ing of the Special Interest Group on Computational louze Khmekhem, M., Hadrich Belguith, L., and Morphology and Phonology, pages 1–9, Montréal, Habash, N. (2014). A Conventional Orthography Canada. for Tunisian Arabic. In In Proceedings of LREC, Habash, Nizar, Roth, Ryan, Rambow, Owen, Eskan- Reykjavik, Iceland. der, Ramy, and Tomeh, Nadi. (2013). Morphologi-