An Algerian Arabic-French Code-Switched Corpus

Ryan Cotterell1, Adithya Renduchintala1, Naomi Saphra1, Chris Callison-Burch2

1 Center for Language and Speech Processing, Johns Hopkins University 2 Computer and Information Science Department, University of Pennsylvania

Abstract Arabic is not just one language, but rather a collection of dialects in addition to Modern Standard Arabic (MSA). While MSA is used in formal situations, dialects are the language of everyday life. Until recently, there was very little dialectal Arabic in written form. With the advent of social media, however, the landscape has changed. We provide the first romanized code-switched Algerian Arabic-French corpus annotated for word-level language id. We review the history and sociological factors that make the linguistic situation in Algeria unique and highlight the value of this corpus to the natural language processing and linguistics communities. To build this corpus, we crawled an Algerian newspaper and extracted the comments from the news stories. We discuss the informal nature of the language in the corpus and the challenges it will present. Additionally, we provide a preliminary analysis of the corpus. We then discuss some potential uses of our corpus of interest to the computational linguistics community.

Keywords: code-switching, Algerian Arabic, romanized Arabic, French

1. Introduction

Language identification systems have long operated under the assumption that text is written in a single language. As social media becomes a more prominent mode of communication, such systems are confronting text that increasingly challenges the monolingual assumption. More than half of the world’s population is bilingual and informal communication is often code-switched, reflecting the need for a deeper understanding of code-switching in relation to NLP tasks. Recent work has proposed both supervised and unsupervised methods for word-level language id. Current methods, however, rely on the assumption that external resources exist, such as large non-code-switched corpora and dictionaries. These resources are not available for some languages and dialects, including Algerian vernacular Arabic, a dialect often code-switched with French. We are releasing a corpus of romanized Algerian Arabic and French scraped from the comments sections of Echorouk, an Algerian daily newspaper with the second-largest reader base of any Arabic paper. This is the only significant corpus of romanized Arabic known to the authors; to our knowledge, it is also the largest corpus of code-switched data. Such a resource is necessary because romanized Arabic is becoming increasingly popular on the internet. The corpus will also be of use for the linguistic study of code-switching. Much previous code-switching research has focused on data collected from field work, and a found dataset like ours could provide an interesting perspective on the use of code-switching in conversation.

2. Code-switching

Code-switching is a linguistic phenomenon wherein speakers switch between two or more languages in conversation, often within a single utterance (Bullock and Toribio, 2009). It can be viewed through a sociolinguistic lens where situation and topic influence the choice of language (Kachru, 1977). To define code-switching as a phenomenon, it is important to make the distinction between code-switching and borrowing. Borrowing is the act of using a foreign word without recourse to syntactic or morphological properties of that language and often occurs with phonological assimilation. Code-switching, on the other hand, involves switching between languages in which the speakers are fluent, and can in effect be viewed as changing the grammar in use. Some linguists have even proposed a scale of code-switching, positing the existence of a continuum between borrowing and code-switching (Auer, 1999). Code-switching points (the times at which speakers change language) and the context around which switching occurs are also of interest to linguists. These points often lie within a sentence and their position is influenced by the syntax of the respective languages. Poplack (1988) posited that code-switching points cannot occur within a constituent. Recent work, however, has found that many speakers relax this constraint.

The Matrix Language-Frame (MLF) model is one theory that has gained traction (Myers-Scotton, 1993) to explain code-switching patterns. MLF proposes that there is a Matrix Language (ML) and an Embedded Language (EL). The ML is the more dominant language and is often the language which the speaker identifies as their native tongue. The EL is then inserted into the ML at certain grammatical frames. Within this framework, further work has gone into the exact syntactic and morphological contexts that allow for code-switching points (Myers-Scotton and Bolonyai, 2001).

3. Code-Switching in North Africa

Code-switching in North African Arabic is an established phenomenon that has been studied by the linguistics community (Bentahila and Davies, 1983). It dates back to the initial French colonization of North Africa. North Africa is also home to many cultures, a fact which potentially affects language use and code-switching in particular.

Until recently, mixed language communication has been observed mainly as a spoken phenomenon. With the widespread use of computer-mediated communication, code-switching is becoming common in written North African Arabic (Salia, 2011). Comments on news feeds and social media outlets like Twitter and Facebook often contain code-switching. North African Arabic is not the only language to appear in code-switched writing. A body of recent sociolinguistics work has considered the phenomenon in various settings. Swiss-German and German code-switching in chat rooms was analyzed in Siebenhaar (2006), and Callahan (2004) considered Spanish and English code-switching in a written corpus.

4. Related Work

From a computational perspective, code-switching has received relatively little attention. Joshi (1982) provides a tool for parsing mixed sentences. More recently, Rosner and Farrugia (2007) focused on processing code-switched SMS. Solorio and Liu (2008) trained classifiers to predict code-switching points in Spanish and English. Nguyen and Dogruoz (2013) also focused on word-level language identification in Dutch-Turkish news commentary. To our knowledge, Elfardy and Diab (2012) is the only computational work on Arabic code-switching done to date. That work does not include romanized Arabic. Many languages written in a non-Roman script are romanized on the internet. This practice presents a problem for standard NLP tools that are trained on the language with its standard orthography (Irvine et al., 2012). We believe that this type of romanized data will become more pervasive as more users employ computer-mediated communication globally. More information will be generated in such settings, and it is critical for future NLP systems to be able to process the data produced. The corpus we are presenting is a step in this direction.

5. Data Collection

We used a corpus crawled from an Algerian newspaper website. We scraped 598,047 pages in September 2012. These fora are rich in both dialectal Arabic and French content. The corpus contains discussion on a wide-ranging set of issues including domestic politics, international relations, religion, and sporting events. We extracted 6,949 comments, containing 150,000 words in total. We separated the comments section from the main article on each page and stripped HTML tags and other non-user-generated content. The metadata was stripped in an attempt to preserve anonymity. The Arabic portion of the corpus was annotated for sentence-level dialect on Mechanical Turk (Cotterell and Callison-Burch, 2014).

We separated out all comments in which more than half of the non-whitespace characters were in the Roman alphabet, determining these to be romanized. We did no further processing, e.g. tokenization. The final data set contains 339,504 comments with an average length of 19 tokens, as determined by separating on white space and punctuation. 1,000 of the comments are annotated using the guidelines described below. Our corpus has 493,038 types and 6,718,502 tokens, and is formatted in JSON.

6. Romanization of Arabic

This corpus is unique in that it is, to the authors’ knowledge, the first large corpus that is composed of an Arabic dialect written in romanized form. Romanized Arabic is particularly difficult because there is no standard form of romanization used across the Arab world. In order to use standard NLP tools on such corpora, it is often necessary to deromanize the corpus. In the case of Urdu, this task has been successfully completed using standard machine translation software (Irvine et al., 2012).

Arabic written in the Latin alphabet, often dubbed arabizi, is extremely common on the internet and SMS. The exact mapping from the Arabic alphabet onto the Latin alphabet varies significantly between regions. The specific case of romanization by young speakers of Gulf Arabic in the United Arab Emirates is discussed thoroughly in Palfreyman and Khalil (2003).

Figure 1 expresses the most common mappings across the Arab world. Algerians, and North Africans in general, tend to use romanizations that reflect French orthography: for instance و ↦ ou, ش ↦ ch, ج ↦ dj and ا ↦ è or é. To illustrate this difference, consider the frequency of the common transcriptions of ان شاء الله (God willing); we see ش ↦ ch about an order of magnitude more often than ش ↦ sh.

Arabic   Arabizi         Arabic   Arabizi
ا        a               ب        b, p
ت        t               ث        th, s
ج        j, g            ح        7, h
خ        7’, 5           د        d
ذ        th, z           ر        r
ز        z               س        s, c
ش        sh, ch          ص        9
ض        9’, d           ط        t
ظ        th              ع        3
غ        gh, 3’          ف        f
ق        8, 2, k, q      ك        k
ل        l               م        m
ن        n               ه        h
و        w, o, ou        ي        y, i, e

Figure 1: Correspondence Between Arabic Letters and Romanized Arabic (Yaghan, 2008)

This transcription variation makes it unlikely that a single, general-purpose Arabic deromanization tool will be enough, and such romanized corpora will need to be developed for other dialects as well in order to analyze the users’ romanization preferences on a dialectal basis.

7. Text Analysis

Because our Algerian corpus is from the length-constrained informal domain of online forum comments, it would be difficult to process meaningfully without normalizing beforehand. It exhibits extreme variation in spelling and grammar. Many forms of the same word may appear throughout our corpus. For example, we identified 69 variants of the common word ان شاء الله alone. 70% of all token types in our corpus appeared only once, so the OOV rate in this forum corpus can confound language processing systems without text normalization. Several typical sources of variation for Arabic identified by Darwish et al. (2012) were found in our corpus.

• The use of elongations, especially in the form of vowel repetition.
  we ki ta3arfou wach rah testfadou ? hhhhh cha3ab kar3adjiiiiiiiiii

• Spelling mistakes, such as dropped or transposed characters.

• Abbreviations.
  rah thablona bel BAC had al3am !!!!!

• Emotional tokens and ejaculative abbreviations, such as the abbreviation “lol” borrowed from English web speech, or emoticons.

In addition to these irregularities, our corpus contains variations particular to romanized Arabic text. Because there is no standardized way to transcribe Arabic orthography in this informal domain, Arabic words can be represented by multiple spellings.

8. Example Posts

We present below a few example sentences that we collected using the data collection methodology described above.

• bezaf m3a saifi oalah mnkalifoha mairbahch
  Had enough with Saifi.

• la howla wa la kowata il bi lah el3alier l3adim wa la yassa3oni an akoul anaho kllo chaye momkin ma3a ljazairyine
  For God’s sake! I can just say that anything is possible with Algerians.

• we ki ta3arfou wach rah testfadou ? hhhhh cha3ab kar3adjiiiiiiiiii
  Don’t try to know everything because it does not matter to you.

• 7ade said nchalah li lmontakhabina el3assekari nchalah yjibo natija mli7a bitawfiiiiiiiiiiiiiiiiik . . . onchoriya chorouk stp
  Good luck to our military team, I hope they get a good score. Good luck! Say it chorouk!

• ya khawti tt simplement c bajiou l3arab o makanech fi tzayer kamla joueur kima lhadji et on vai ras le 27 03 2011 chkoun houma rjal liyestahlouha
  My brothers he’s simply the Baggio of Arabs, and there is none like Lhadji in Algeria, and on 27th of March 2011 we will see who wins.

• mais les filles ta3na ysedkou n’import quoi ana hada face book jamais cheftou khlah kalbi
  Our girls believe anything, I have never seen this Facebook before.

9. Annotation Guidelines

Annotating for word-level language identification in code-switched text is a difficult task because whether a word is code-switched is often more of a continuum than a binary decision. Place names form a simple example: باريس (Paris) is an MSA word in that it is found in most Arabic dictionaries, but it is clearly of French origin. On the other hand, فيديو (video) is an example of a recent borrowing from European languages that should be considered an Arabic word. To simplify the decision, we made use of the guidelines for dialectal Arabic annotation provided by Elfardy and Diab (2012). Their guidelines were created for annotating word-level language id in a corpus composed of mixed dialectal Arabic and MSA, both written in the Arabic script. As we are annotating two linguistically dissimilar languages, the same level of ambiguity does not arise.

Further research in the area of code-switching should focus on richer annotation schemata that are both linguistically motivated, i.e. taking into account the continuum of code-switching, and serve practical NLP needs. Another interesting area to focus on could be to annotate broad categories describing the type of code-switching. Kecskes (2006) describes three prominent patterns in code-switching (Insertion, Alternation and Congruent Lexicalization) based on Gibraltar data. Insertion involves adding lexical items from one language into the structure of the other. Alternation is similar to insertion except that larger chunks are inserted, rather than single tokens. Congruent Lexicalization is adding lexical items from different lexical inventories into a common grammar structure. The dataset could annotate each point of code-switching with these patterns as well. This data set, however, is focused on only the points of code-switching.

The annotators were presented with posts and asked to label each word, split on white space and punctuation. They were given the choice of Arabic (A), French (F) and Other (O). We excluded punctuation from the annotation. Figure 4 shows the distribution of these tags in our dataset. The annotation was conducted with an interactive Python script.

10. Potential Uses

The corpus provided is the first of its kind in that it is the first large corpus of romanized Arabic that is code-switched with another language. This has numerous potential uses in both the NLP community and the linguistics community. In the NLP community, the processing of informal text is becoming an increasingly popular task among researchers (Yang and Eisenstein, 2013). This corpus adds another complication to informal text processing with the addition of code-switching. In the linguistics community, a corpus-based analysis of a code-switched corpus offers the possibility to test various hypotheses on a large number of documents. The MLF hypothesis has already been studied in bilingual speech corpora from Miami, Patagonia, and Wales (Carter, 2010), but it will be necessary to study such theories in the context of many distinct languages and cultures to gain deeper insight into the code-switching phenomenon.

11. Conclusion and Future Work

The primary contribution of this paper is the release of an Algerian Arabic-French code-switched corpus. We have made use of a previously proposed annotation scheme for word-level language identification and highlighted the unusual qualities of this corpus that make it a significant contribution to the field. Future work in this line should largely focus on experiments using the corpus both in NLP and linguistics. It would also be of interest to construct and annotate similar corpora for other informal code-switched Arabic dialects.

Acknowledgements

This material is based on research sponsored by a DARPA Computer Science Study Panel phase 3 award entitled “Crowdsourcing Translation” (contract D12PC00368). The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements by DARPA or the U.S. Government. This research was supported by the Johns Hopkins University Human Language Technology Center of Excellence.

12. References

Peter Auer. 1999. From codeswitching via language mixing to fused lects: Toward a dynamic typology of bilingual speech. International Journal of Bilingualism, 3(4):309–332.

Abdelali Bentahila and Eirlys E Davies. 1983. The syntax of Arabic-French code-switching. Lingua, 59(4):301–330.

Barbara E Bullock and Almeida Jacqueline Toribio. 2009. The Cambridge handbook of linguistic code-switching. Cambridge University Press, Cambridge.

Laura Callahan. 2004. Spanish/English codeswitching in a written corpus, volume 27. John Benjamins Publishing.

Ryan Cotterell and Chris Callison-Burch. 2014. The extended Arabic online commentary dataset. In LREC.

Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 2427–2430, New York, NY, USA. ACM.

Heba Elfardy and Mona T Diab. 2012. Token level identification of linguistic code switching. In COLING (Posters), pages 287–296.

Ann Irvine, Jonathan Weese, and Chris Callison-Burch. 2012. Processing informal, Romanized Pakistani text messages. In Proceedings of the Second Workshop on Language in Social Media, pages 75–78. Association for Computational Linguistics.

Aravind K Joshi. 1982. Processing of sentences with intra-sentential code-switching. In Proceedings of the 9th conference on Computational linguistics-Volume 1, pages 145–150. Academia Praha.

Braj B Kachru. 1977. Linguistic schizophrenia and language census: A note on the Indian situation. Linguistics, 15(186):17–32.

I Kecskes. 2006. A dual language model to explain code-switching: A cognitive-pragmatic approach. Intercultural Pragmatics, 3:257–283.

Carol Myers-Scotton and Agnes Bolonyai. 2001. Calculating speakers: Codeswitching in a rational choice model. Language in Society, 30(1):1–28.

Carol Myers-Scotton. 1993. Common and uncommon ground: Social and structural factors in codeswitching. Language in Society, 22:475–475.

Dong-Phuong Nguyen and AS Dogruoz. 2013. Word level language identification in online multilingual communication. Association for Computational Linguistics.

David Palfreyman and Muhamed al Khalil. 2003. A funky language for teenzz to use: Representing Gulf Arabic in instant messaging. Journal of Computer-Mediated Communication, 9(1):0–0.

Shana Poplack. 1988. Contrasting patterns of code-switching in two communities. Codeswitching: Anthropological and Sociolinguistic Perspectives. New York: Mouton de Gruyter, pages 215–244.

Mike Rosner and Paulseph-John Farrugia. 2007. A tagging algorithm for mixed language identification in a noisy domain. In INTERSPEECH, pages 190–193.

R. Salia. 2011. Between Arabic and French Lies the Dialect: Moroccan Code-Weaving on Facebook. Undergraduate thesis, Columbia University.

Beat Siebenhaar. 2006. Code choice and code-switching in Swiss-German Internet Relay Chat rooms. Journal of Sociolinguistics, 10(4):481–506.

Thamar Solorio and Yang Liu. 2008. Learning to predict code-switching points. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 973–981. Association for Computational Linguistics.

Mohammad Ali Yaghan. 2008. Arabizi: A contemporary style of Arabic slang. Design Issues, 24(2):39–52.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proc. of EMNLP.