An Arabic-Moroccan Darija Code-Switched Corpus
Younes Samih and Wolfgang Maier
Institute for Language and Information
University of Düsseldorf, Düsseldorf, Germany
{samih,maierwo}@phil.hhu.de

Abstract

In multilingual communities, speakers often switch between languages or dialects within the same context. This phenomenon is called code-switching. It can be observed, e.g., in the Arab world, where Modern Standard Arabic and Dialectal Arabic coexist. Recently, the computational treatment of code-switching has received attention. Like other natural language processing tasks, this task requires annotated linguistic resources. In our work, we turn to a particular under-resourced Arabic dialect, Moroccan Darija. While other dialects such as Egyptian Arabic have received their share of attention, very limited effort has been devoted to the development of basic linguistic resources that would support a computational treatment of Darija. Motivated by these considerations, we describe our effort in the development and annotation of a large-scale corpus collected from Moroccan social media sources, namely blogs and internet discussion forums. It has been annotated at the token level by three Darija native speakers. Crowd-sourcing has not been used. The final corpus has a size of 223k tokens. It is, to our knowledge, currently the largest resource of its kind.

Keywords: code-switching, language identification, Moroccan Arabic

1. Introduction

Modern Standard Arabic (MSA) is the official language of most Arab countries. It is spoken by more than 360 million people around the world and exists in a state of diglossia (Ferguson, 1959). Arabic speakers tend to use Dialectal Arabic (DA) and MSA, two substantially different but historically related language varieties, for different purposes as the situation demands in their day-to-day lives.

Figure 1: The Arab world and Arabic Dialects (map labels: Moroccan, Algerian, Tunisian, Libyan, Egyptian, Levantine, Iraqi, Gulf, and Yemeni Arabic; Other)

Fig. 1 shows a schematic map of dialects. Note that Moroccan, Algerian, Tunisian, and Libyan Arabic are often grouped together as Maghrebi Arabic, even though they are not necessarily mutually intelligible. While MSA is an established standard among educated Arabic speakers, DA is only used in everyday informal communication. Until recently, DA was considered a partially under-resourced language, as its written production remains low in comparison to MSA. Increasingly, however, DA is emerging as the language of informal communication on the web, in emails, micro-blogs, blogs, forums, chat rooms, etc. This new situation amplifies the need for consistent language resources and language identification systems for Arabic and its dialects. While certain dialects, particularly Egyptian, have already received attention in NLP research, Moroccan Arabic (Darija) (Ennaji et al., 2004; Benmamoun, 2001), a dialect with over 21 million native speakers (Lewis et al., 2014), remains a particularly under-resourced variant of Arabic. It is strongly embedded in a multilingual context that entails frequent code-switching, i.e., switching between languages within the same context (Bullock and Toribio, 2009). Building linguistic resources and creating the necessary tools for Darija is a priority, not least because its vocabulary is particularly distant from MSA (Diab et al., 2010).

In this paper, we therefore contribute a corpus of Moroccan Darija with code-switching annotation at the token level. The corpus has been collected from internet discussion forums and blogs, and is currently the largest manually annotated Arabic-Moroccan Darija code-switched corpus known to the authors. It will be of use for supporting research in the linguistic and sociolinguistic aspects of code-switching in Arabic, and it will constitute an ideal data source for multilingual processing in general and for research in code-switching detection in particular, an area which has recently attracted attention (see Sec. 5).

The remainder of the paper is organized as follows. In the following section, we outline the properties of code-switching. In Sec. 3, we describe the corpus creation. Sec. 4 presents the annotation. Sec. 5 reviews related work and Sec. 6 concludes the article.

2. Code-Switching

2.1. Linguistic Analysis of Code-Switching

Code-switching¹ is a common phenomenon in multilingual communities wherein speakers switch from one language or dialect to another within the same context (Bullock and Toribio, 2009). Communities in which more than one language or dialect is commonly spoken can be found around the world. Examples include India, where speakers switch between English and Hindi (among other local languages) (Dey and Fung, 2014); the United States, where migrants from Spanish-speaking countries continue to use their native language alongside English (Poplack, 1980); Spain, where people switch between regional languages such as Basque and Spanish (Muñoa Barredo, 2003); Paraguay, where Spanish co-exists with Guarani (Estigarribia, 2015); and finally the Arab world, where speakers alternate between MSA and Dialectal Arabic.

¹ Note that for the purpose of this paper, we do not distinguish between code-switching and similar concepts such as code-mixing.

In the literature, three types of code-switching are distinguished. In inter-sentential code-switching, languages are switched between sentences. An instance of this type of switching is (1) (from Muñoa Barredo (2003)), where the speaker switches from Basque to Spanish.

(1) egia ez dala erreala? eso es otra cosa!
    you say that the truth is not real? that's a different thing!

Intra-sentential code-switching consists of a language switch within a sentence. An example is (2) (borrowed from Dey and Fung (2014)). Here, the speaker switches from Hindi to English within the same sentence.

(2) Tume nahi pata, she is the daughter of the CEO, yaha do char din ke liye ayi hai.
    Don't you know, she is the daughter of the CEO, she's here for a couple of days.

A third type of code-switching is intra-word switching, where a language switch occurs within a single word. For instance, the morphology of one of the languages involved can be applied to a stem of the other language. A corresponding example can be found in Sec. 2.2.

Since the mid-1960s, there has been a large body of linguistic studies on code-switching, the bulk of them concentrating on social and linguistic factors that constrain its occurrence (Berk-Seligson, 1986). Various models of constraints have been proposed. Poplack (1980) formulates a model in terms of the equivalence constraint of the languages involved at the switch point. Namely, code-switching tends to occur at points in the sentence where the surface structure of the respective languages is the same. Myers-Scotton (1993) focuses on structural constraints in code-switching. She proposes the matrix language frame (MLF) model. It is based on the assumption that one language is the matrix language (ML) and the other language is the embedded language (EL). While the ML provides the grammatical and functional elements as well as the structural frame of the sentence, the EL can only provide content elements (Myers-Scotton, 1997). For a further, detailed linguistic overview, consult Muysken (2000).

2.2. Code-Switching in Morocco

The linguistic situation in Morocco is complex due to the country's diverse ethnic and linguistic make-up and its colonial history. Following Benmamoun (2001), one can distinguish different languages and dialects that occupy the linguistic space:

Darija is the native language for the majority of the population and is the language of popular culture.

Berber is the language of the original people of Morocco. It is the native language of about 40% of the Moroccan population.

Modern Standard Arabic is a written language used mainly in formal education, media, administration, and religion.

French is not an official language, but dominant in higher education, in the media, and some industries.

In recent years, the Moroccan linguistic landscape has changed dramatically due to social, political, and technological factors. Darija, the colloquial, traditionally unwritten variety of Arabic, is increasingly dominating the linguistic scene. It is being written in a variety of ways in print media, advertising, music, fictional writing, translation, the scripts for dubbed foreign TV series, and a weekly news magazine (Elinson, 2013). It is also increasingly appearing on the web in blogs, emails, and social media platforms and is often code-switched with other languages and dialects, including MSA, English, French, Spanish, and Berber (Tratz et al., 2014).

As an example of intra-sentential and intra-word code-switching in Morocco, consider (3). It is taken from our own data.

(3) [Arabic-script text lost in extraction]
    In France, the only things that remain are pretension, empty pockets, and, to add more, also eating with knife and fork.

The speaker switches between MSA and Darija, and uses a word in which French is mixed with Arabic morphology. MSA words include, e.g., [Arabic script] (France) and [Arabic script] (eating with knife); Darija words include [Arabic script] (pretension, empty pockets, and, to add more); finally, [Arabic script] is the French word for fork ("fourchette"), written in Arabic script. It is prefixed by the Arabic definite article and suffixed with an Arabic case marker.

3. Corpus Creation

We acquire our data from internet discussion forums and blogs which are hosted in Morocco or extensively used by Moroccans. The crawled output is stripped of HTML tags and other metadata. Since sentence splitting is not a trivial task in Arabic and no such tool is available for Darija, we leave the downloaded text units ("posts") intact. Then we tokenize the text with a simple heuristic, delete