Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus

Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash†, Ramy Eskander‡, Owen Rambow‡

Linguistic Data Consortium, University of Pennsylvania
{bies,zhiyi,maamouri,sgrimes,haejoong,jdwright,strassel}@ldc.upenn.edu
†Computer Science Department, New York University Abu Dhabi
†[email protected]
‡Center for Computational Learning Systems, Columbia University
‡{reskander,rambow}@ccls.columbia.edu

Abstract

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, Arabic speakers in social media as well as discussion forums, SMS messaging and online chat often use a non-standard romanization called Arabizi. In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing expect Arabic script input. The corpus described in this paper is expected to support Arabic NLP by providing this resource.

1 Introduction

The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons.

This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic (MSA) phonologically, morphologically and lexically, and most importantly, they lack standard orthographies (Maamouri et al., 2014). Second, Arabic speakers in social media as well as discussion forums, Short Messaging System (SMS) text messaging and online chat often use a non-standard romanization called "Arabizi" (Darwish, 2013). Social media communication in Arabic takes place using a variety of orthographies and writing systems, including Arabic script, Arabizi, and a mixture of the two. Although not all social media communication uses Arabizi, the use of Arabizi is prevalent enough to pose a challenge for Arabic NLP research.

In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing and annotation expect Arabic script input (e.g., Salloum and Habash, 2011; Habash et al., 2012c; et al., 2014). To our knowledge, there are no naturally occurring parallel texts of Arabizi and Arabic script. In this paper, we describe the process of creating such a novel resource at the Linguistic Data Consortium (LDC). We believe this corpus will be essential for developing robust tools for converting Arabizi into Arabic script.

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 93–103, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

The rest of this paper describes the collection of Egyptian SMS and Chat data and the creation of a parallel text corpus of Arabizi and Arabic script for the DARPA BOLT program.1 After reviewing the history and features of Arabizi (Section 2) and related work on Arabizi (Section 3), in Section 4 we describe our approach to collecting the Egyptian SMS and Chat data and the annotation and transliteration methodology of the Arabizi SMS and Chat into Arabic script, while in Section 5 we discuss the annotation results, along with issues and challenges we encountered in annotation.

1 http://www.darpa.mil/Our_Work/I2O/Programs/Broad_Operational_Language_Translation_%28BOLT%29.aspx

2 Arabizi and Dialect

2.1 What is Arabizi?

Arabizi is a non-standard romanization of the Arabic script that is widely adopted for communication over the Internet (World Wide Web, email) or for sending messages (instant messaging and mobile phone text messaging) when the actual Arabic script alphabet is either unavailable for technical reasons or otherwise more difficult to use. The use of Arabizi is attributed to different reasons, from lack of good input methods on some mobile devices to writers' unfamiliarity with typing in Arabic script. In some cases, writing in Arabizi makes it easier to code switch to English or French, which is something educated Arabic speakers often do. Arabizi is used by speakers of a variety of Arabic dialects.

Because of the informal nature of this system, there is no single "correct" encoding, so some character usage overlaps. Most of the encoding in the system makes use of the Latin character (as used in English and French) that best approximates phonetically the Arabic letter that one wants to express (for example, either b or p corresponds to ب). This may sometimes vary due to regional variations in the pronunciation of the Arabic letter (e.g., j is used to represent ج in Levantine dialect, while in Egyptian dialect g is used) or due to differences in the most common non-Arabic second language (e.g., sh corresponds to ش in the previously English-dominated Middle East Arab countries, while ch shows a predominantly French influence as found in North Africa). Those letters that do not have a close phonetic approximate in the Latin script are often expressed using numerals or other characters, so that the numeral graphically approximates the Arabic letter that one wants to express (e.g., the numeral 3 represents ع because it looks like a mirror reflection of the letter).

Due to the use of Latin characters and also frequent code switching in social media Arabizi, it can be difficult to distinguish between Arabic words written in Arabizi and entirely unrelated foreign language words (Darwish, 2013). For example, mesh can be the English word, or Arabizi for مش "not". However, in context these cases can be clearly labeled as either Arabic or a foreign word. An additional complication is that many words of foreign origin have become Arabic words ("borrowings"). Examples include banadoora بندورة "tomato" and mobile موبايل "mobile phone". It is a well-known practical and theoretical problem to distinguish borrowings (foreign words that have become part of a language and are incorporated fully into the morphological and syntactic system of the host language) from actual code switching (a bilingual writer switches entirely to a different language, even if for only a single word). Code switching is easy to identify if we find an extended passage in the foreign language which respects that language's syntax and morphology, such as Bas ra2yak I have the mask. The problem arises when single foreign words appear without Arabic morphological marking: it is unclear if the writer switched to the foreign language for one word or whether he or she simply is using an Arabic word of foreign origin. In the case of banadoora بندورة "tomato", there is little doubt that this has become a fully Arabic word and the writer is not code switching into Italian; this is also signaled by the fact that the likely Arabizi spelling (such as banadoora) is not in fact the Italian orthography (pomodoro). However, the case is less clear cut with mobile موبايل "mobile phone": even if it is a borrowing (clearly much more recent than banadoora بندورة "tomato"), a writer will likely spell the word with the English orthography as mobile rather than write, say, mubail. More research is needed on this issue. However, because of the difficulty of establishing the difference between code switching and borrowing, we do not attempt to make this distinction in this annotation scheme.

2.2 Egyptian Arabic Dialect

Arabizi is used to write in multiple dialects of Arabic, and differences between the dialects themselves have an effect on the spellings chosen by individual writers using Arabizi. Because Egyptian Arabic is the dialect of the corpus created for this project, we will briefly discuss some of the most relevant features of Egyptian Arabic with respect to Arabizi transliteration. For a more extended discussion of the differences between MSA and Egyptian Arabic, see Habash et al. (2012a) and Maamouri et al. (2014).

Phonologically, Egyptian Arabic is characterized by the following features, compared with MSA:

(a) The loss of the interdentals /ð/ and /θ/, which are replaced by /d/ or /z/ and /t/ or /s/ respectively, thus giving those two original consonants a heavier load. Examples include ذكر /zakar/ "to mention", ذبح /dabaħ/ "to slaughter", ثلج /talg/ "ice", ثمن /taman/ "price", and ثبت /sibit/ "to stay in place, become immobile".

(b) The exclusion of /q/ and /ǰ/ from the consonantal system, being replaced by /ʔ/ and /g/, e.g., قطن /ʔuṭn/ "cotton", and جمل /gamal/ "camel".

At the level of morphology and syntax, the structures of Egyptian Arabic closely resemble the overall structures of MSA, with relatively minor differences to speak of. Finally, the Egyptian Arabic lexicon shows some significant elements of semantic differentiation.

The most important morphological difference between Egyptian Arabic and MSA is in the use of some Egyptian clitics and affixes that do not exist in MSA. For instance, Egyptian Arabic has the future proclitics h+ and ħ+ as opposed to the standard equivalent s+.

Lexically, there are lexical differences between Egyptian Arabic and MSA where no etymological connection or no cognate spelling is available. For example, the Egyptian Arabic بص /buṣṣ/ "look" is أنظر /ʔunZur/ in MSA.

3 Related Work

Arabizi-Arabic Script Transliteration. Previous efforts on automatic transliteration from Arabizi to Arabic script include work by Chalabi and Gerges (2012), Darwish (2013) and Al-Badrashiny et al. (2014). All of these approaches rely on a model for character-to-character mapping that is used to generate a lattice of multiple alternative words, which are then selected among using a language model. The training data used by Darwish (2013) is publicly available but it is quite limited (2,200 word pairs). The work we are describing here can help substantially improve the quality of such systems. We use the system of Al-Badrashiny et al. (2014) in this paper as part of the automatic transliteration step because they target the same conventional orthography of dialectal Arabic (CODA) (Habash et al., 2012a, 2012b), which we also target.

There are several commercial products that convert Arabizi to Arabic script, namely: Microsoft Maren,2 Google Ta3reeb,3 the Basis Arabic chat translator4 and Yamli.5 Since these products are for commercial purposes, there is little information available about their approaches, and whatever resources they use are not publicly available for research purposes. Furthermore, as Al-Badrashiny et al. (2014) point out, Maren, Ta3reeb and Yamli are primarily intended as input method support, not full text transliteration. As a result, their users' goal is to produce Arabic script text, not Arabizi text, which affects the form of the romanization they utilize as an intermediate step. The differences between such "functional romanization" and real Arabizi include that the users of these systems will use less or no code switching to English, and may employ character sequences that help them arrive at the target Arabic script form faster, which otherwise they would not write if they were targeting Arabizi (Al-Badrashiny et al., 2014).

2 http://www.getmaren.com
3 http://www.google.com/ta3reeb
4 http://www.basistech.com/arabic-chat-translator-transforms-social-media-analysis/
5 http://www.yamli.com/

Name Transliteration. There has been some work on machine transliteration by Knight and Graehl (1997). Al-Onaizan and Knight (2002) introduced an approach for machine transliteration of Arabic names. Freeman et al. (2006) also introduced a system for name matching between English and Arabic. Although the general goal of transliterating from one script to another is shared between these efforts and ours, we are considering a more general form of the problem in that we do not restrict ourselves to names.

Code Switching. There is some work on code switching between Modern Standard Arabic (MSA) and dialectal Arabic (DA). Zaidan and Callison-Burch (2011) were interested in this problem at the inter-sentence level. They crawled a large dataset of MSA-DA news commentaries, and used Amazon Mechanical Turk to annotate the dataset at the sentence level. Elfardy et al. (2013) presented a system, AIDA, that tags each word in a sentence as either DA or MSA based on the context. Lui et al. (2014) proposed a system for language identification in
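As a concrete illustration of this shared design, the character-mapping-plus-language-model pipeline can be sketched as follows. The mapping table and word scores below are toy assumptions for illustration only; they are not the actual resources or models of Chalabi and Gerges (2012), Darwish (2013), or Al-Badrashiny et al. (2014).

```python
from itertools import product

# Toy Arabizi-to-Arabic character table (one-to-many), including the
# numeral convention described in Section 2.1 (3 stands for ع).
# Illustrative assumption only -- not the mapping of any cited system.
CHAR_MAP = {
    "3": ["ع"],
    "7": ["ح"],
    "m": ["م"],
    "s": ["س", "ص"],   # Arabizi s is ambiguous between س and ص
    "h": ["ه", "ح"],
}

# Toy unigram "language model" used to rank candidate words.
UNIGRAM = {"عم": 0.02, "صح": 0.01, "سح": 0.0001}

def candidates(arabizi: str):
    """Expand the character lattice into all alternative Arabic spellings."""
    options = [CHAR_MAP.get(ch, [ch]) for ch in arabizi.lower()]
    return {"".join(combo) for combo in product(*options)}

def transliterate(arabizi: str) -> str:
    """Select the candidate spelling the language model scores highest."""
    return max(candidates(arabizi), key=lambda w: UNIGRAM.get(w, 1e-9))
```

Here transliterate("s7") resolves the s-ambiguity to صح rather than سح only because the toy language model prefers it; the cited systems replace both tables with learned character mappings and a full word lattice scored by a real language model.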

multilingual documents using a generative mixture model that is based on supervised topic modeling algorithms. Darwish (2013) and Voss et al. (2014) deal with exactly the problem of classifying tokens in Arabizi as Arabic or not. More specifically, Voss et al. (2014) deal with both French and English as well, meaning they do a three-way classification. Darwish (2013)'s data is more focused on Egyptian Arabic and code switching with English.

Processing Social Media Text. Finally, while English NLP for social media has attracted considerable attention recently (Clark and Araki, 2011; Gimpel et al., 2011; Gouws et al., 2011; Ritter et al., 2011; Derczynski et al., 2013), there has not been much work on Arabic yet. Darwish et al. (2012) discuss NLP problems in retrieving Arabic microblogs (tweets). They discuss many of the same issues we do, notably the problems arising from the use of dialectal Arabic such as the lack of a standard orthography. Eskander et al. (2013) described a method for normalizing spontaneous orthography into CODA.

4 Corpus Creation

This work was prepared as part of the DARPA Broad Operational Language Translation (BOLT) program, which aims at developing technology that enables English speakers to retrieve and understand information from informal foreign language sources including chat, text messaging and spoken conversations. LDC collects and annotates informal linguistic data in English, Chinese and Arabic, with Egyptian Arabic being the representative of Arabic. Egyptian Arabic has the advantage over all other dialects of Arabic of being the language of the largest linguistic community in the Arab region, and also of having a rich level of internet communication.

4.1 SMS and Chat Collection

In BOLT Phase 2, LDC collected large volumes of naturally occurring informal text (SMS) and chat messages from individual users in English, Chinese and Egyptian Arabic (Song et al., 2014). Altogether we recruited 46 Egyptian Arabic participants, and of those 26 contributed data. To protect privacy, participation was completely anonymous, and demographic information was not collected. Participants completed a brief language test to verify that they were native Egyptian Arabic speakers. On average, each participant contributed 48K words. The Egyptian Arabic SMS and Chat collection consisted of 2,140 conversations in a total of 475K words after manual auditing by native speakers of Egyptian Arabic to exclude inappropriate messages and messages that were not Egyptian Arabic. 96% of the collection came from the personal SMS or Chat archives of participants, while 4% was collected through LDC's platform, which paired participants and captured their live text messaging (Song et al., 2014). A subset of the collection was then partitioned into training and eval datasets.

Table 1 shows the distribution of Arabic script vs. Arabizi in the training dataset. The conversations that contain Arabizi were then further annotated and transliterated to create the Arabizi-Arabic script parallel corpus, which consists of

                 Total     Arabic script only   Arabizi only   Mix of Arabizi and Arabic script
                                                               Arabizi    Arabic script
Conversations    1,503     233                  987            283
Messages         101,292   18,757               74,820         3,237      4,478
Sentence units   94,010    17,448               69,639         3,017      3,906
Words            408,485   80,785               293,900        10,244     23,556

Table 1. Arabic SMS and Chat Training Dataset

1,270 conversations.6 All conversations in the training dataset were also translated into English to provide Arabic-English parallel training data.

Not surprisingly, most Egyptian conversations in our collection contain at least some Arabizi;

6 In order to form single, coherent units (sentence units) of an appropriate size for downstream annotation tasks using this data, messages that were split mid-sentence (often mid-word) due to SMS messaging character limits were rejoined, and very long messages (especially common in chat) were split into two or more units, usually no longer than 3-4 sentences.

only 15% of conversations are entirely written in Arabic script, while 66% are entirely Arabizi. The remaining 19% contain a mixture of the two at the conversation level. Most of the mixed conversations were mixed in the sense that one side of the conversation was in Arabizi and the other side was in Arabic script, or in the sense that at least one of the sides switched between the two forms in mid-conversation. Only rarely are individual messages in mixed scripts. The annotation for this project was performed on the Arabizi tokens only. Arabic script tokens were not touched and were kept in their original forms.

The use of Arabizi is predominant in the SMS and Chat Egyptian collection, in addition to the presence of other typical cross-linguistic text effects in social media data. For example, the use of emoticons and emoji is frequent. We also observed the frequent use of written out representations of speech effects, including representations of laughter (e.g., hahaha), filled pauses (e.g., um), and other sounds (e.g., hmmm). When these representations are written in Arabizi, many of them are indistinguishable from the same representations in English SMS data. Neologisms are also frequently part of SMS/Chat in Egyptian Arabic, as they are in other languages. English words use Arabic morphology or determiners, as in el anniversary "the anniversary". Sometimes English words are spelled in a way that is closer phonetically to the way an Egyptian speaker would pronounce them, for example lozar for "loser", or beace for "peace".

The adoption of Arabizi for SMS and online chat may also go some way to explaining the high frequency of code mixing in the Egyptian Arabic collection. While the auditing process eliminated messages that were entirely in a non-target language, many of the acceptable messages contain a mixture of Egyptian Arabic and English.

4.2 Annotation Methodology

All of the Arabizi conversations, including the conversations containing mixtures of Arabizi and Arabic script, were then annotated and transliterated:
1. Annotation on the Arabizi source text to flag certain features
2. Correction and normalization of the transliteration according to CODA conventions

Figure 1. Arabizi Annotation and Transliteration Tool

The annotators were presented with the source conversations in their original Arabizi form as well as the transliteration output from an automatic Arabization system, and used a web-based tool developed by LDC (see Figure 1) to perform the two annotation tasks, which allowed annotators to perform both annotation and transliteration token by token, sentence by sentence, and review the corrected transliteration in full context. The GUI shows the full conversation in both the original Arabizi and the resulting Arabic script transliteration for each sentence. Annotators must

annotate each sentence in order, and the annotation is displayed in three columns. The first column shows the annotation of flag features on the source tokens, the second column is the working panel where annotators correct the automatic transliteration and retokenize, and the third column displays the final corrected and retokenized result.

Annotation was performed according to annotation guidelines developed at the Linguistic Data Consortium specifically for this task (LDC, 2014).

4.3 Automatic Transliteration

To speed up the annotation process, we utilized an automatic Arabizi-to-Arabic script transliteration system (Al-Badrashiny et al., 2014), which was developed using a small vocabulary of 2,200 words from Darwish (2013) and an additional 6,300 Arabic-English proper name pairs (Buckwalter, 2004). The system has an accuracy of 69.4%. We estimate that using this system still allowed us to cut down the amount of time needed to type in the Arabic script version of the Arabizi by two-thirds. This system did not identify Foreign words or Names and transliterated all of the words. In one quarter of the errors, the provided answer was plausible but not CODA-compliant (Al-Badrashiny et al., 2014).

4.4 Annotation on Arabizi Source Text to Flag Features

This annotation was performed only on sentences containing Arabizi words, with the goal of tagging any words in the source Arabizi sentences that would be kept the same in the output of an English translation with the following flags:

• Punctuation (not including emoticons)
  o Eh ?!//Punct
  o Ma32ula ?!//Punct
  o Ebsty ?//Punct

• Sound effects, such as laughs ('haha' or variations), filled pauses, and other sounds ('mmmm' or 'shh' or 'um' etc.)
  o hahhhahhah//Sound akeed 3arfa :p da enty t3rafy ablia :pp
  o Hahahahaahha//Sound Tb ana ta7t fel ahwaa
  o Wala Ana haha//Sound
  o Mmmm//Sound okay

• Foreign language words and numbers. All cases of code switching and all cases of borrowings which are rendered in Arabizi using standard English orthography are marked as "Foreign".
  o ana kont mt25er fe t2demm l projects//Foreign
  o oltilik okay//Foreign ya Babyy//Foreign balashhabal!!!!
  o zakrty ll sat//Foreign
  o Bat3at el whatsapp//Foreign
  o La la la merci//Foreign gedan bs la2
  o We 9//Foreign galaeeb dandash lel banat

• Names, mainly person names
  o Youmna//Name 7atigi??

4.5 Correction and Normalization of the Transliteration According to CODA Conventions

The goal of this task was to correct all spelling in the Arabic script transliteration to CODA standards (Habash et al., 2012a, 2012b). This meant that annotators were required to confirm both (1) that the word was transliterated into Arabic script correctly and also (2) that the transliterated word conformed to CODA standards. The automatic transliteration was provided to the annotators, and manually corrected by annotators as needed.

Correcting spelling to a single standard (CODA), however, necessarily included some degree of normalization of the orthography, as the annotators had to correct from a variety of dialect spellings to a single CODA-compliant spelling for each word. Because the goal was to reach a consistent representation of each word, orthographic normalization was almost the inevitable effect of correcting the automatic transliteration. This consistent representation will allow downstream annotation tasks to take better advantage of the SMS/Chat data. For example, more consistent spelling of Egyptian Arabic words will lead to better coverage from the CALIMA morphological analyzer and therefore improve the manual annotation task for morphological annotation, as in Maamouri et al. (2014).

Modern Standard Arabic (MSA) cognates and Egyptian Arabic sound changes

Annotators were instructed to use MSA orthography if the word was a cognate of an MSA

root, including for those consonants that have undergone sound changes in Egyptian Arabic.7 For example:

• use mqfwl مقفول and not ma>fwl مأفول for "locked"
• use HAfZ حافظ and not HAfz حافز for the name (a proper noun)

Long vowels

Annotators were instructed to reinstate missing long vowels, even when they were written as short vowels in the Arabizi source, and to correct long vowels if they were included incorrectly.

• use sAEap ساعة and not saEap سعة for "hour"
• use qAlt قالت and not qlt قلت for "(she) said"

Consonantal ambiguities

Many consonants are ambiguous when written in Arabizi, and many of the same consonants are also difficult for the automatic transliteration script. Annotators were instructed to correct any errors of this type.

• S vs. s / ص vs. س
  o use SAyg صايغ and not sAyg سايغ for "jeweler"
• D vs. Z / ض vs. ظ
  o use DAbT ضابط and not ZAbT ظابط for "officer"
  o use Zlmp ظلمة and not Dlmp ضلمة for "darkness"
• Dotted ya vs. Alif Maqsura / ي vs. ى. Although the dotted ya ي and the Alif Maqsura ى are often used interchangeably in Egyptian Arabic writing conventions, it was necessary to make the distinction between the two for this task.
  o use Ely علي and not ElY على for "Ali" (the proper name)
• Taa marbouta. In Arabizi, and so also in the Arabic script transliteration, the taa marbouta ة may be written for both nominal final -t ت and verbal final -h ه, but for different reasons.
  o mdrsp Ely مدرسة علي "Ali's school"
  o mdrsth مدرسته "his school"

Morphological ambiguities

Spelling variation and informal usage can combine to create morphological ambiguities as well. For example, the third person masculine singular pronoun and the third person plural verbal suffix can be ambiguous in informal texts.

• use byHbwA bED بيحبوا بعض and not byHbh bED بيحبه بعض for "(They) loved each other"
• use byEmlwA بيعملوا and not byEmlh بيعمله for "(They) did" or "(They) worked"

In addition, because final -h is sometimes replaced in speech by final /-uw/, it was occasionally necessary to correct cases of overuse of the third person plural verbal suffix (-wA) to the pronoun -h as well.

Merging and splitting tokens written with incorrect word boundaries

Annotators were instructed to correct any word that was incorrectly segmented. The annotation tool allowed both the merging and splitting of tokens.

Clitics were corrected to be attached when necessary according to (MSA) standard writing conventions. These include single letter proclitics (both verbal and nominal) and the negation suffix -$, as well as pronominal clitics such as possessive pronouns and direct object pronouns. For example,

• use fAlbyt فالبيت and not fAl byt فال بيت or flbyt فلبيت for "in the house"
• use EAlsTH عالسطح and not EAl sTH عال سطح or ElsTH علسطح for "on the roof"

The conjunction w- و is always attached to its following word.

• use wkAn وكان and not w kAn و كان for "and was"
• use wrAHt وراحت and not w rAHt و راحت for "and (she) left"

Words that were incorrectly segmented in the Arabizi source were also merged. For example,

• use msHwrp مسحورة and not ms Hwrp مس حورة for "bewitched (fem.sing.)"
• use $ErhA شعرها and not $Er hA شعر ها for "her hair"

Particles that are not attached in standard MSA written forms were corrected as necessary by the splitting function of the tool. For example,

• use yA Emry يا عمري and not yAEmry ياعمري for "Hey, dear!"
• use lA trwH لا تروح and not lAtrwH لاتروح for "Do not go"

7 Both Arabic script and the Buckwalter transliteration (http://www.qamus.org/transliteration.htm) are shown for the transliterated examples in this paper.
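Footnote 7 notes that each example is given in both Buckwalter transliteration and Arabic script. Since the Buckwalter scheme maps characters one-to-one, the Arabic-script forms can be recovered by a simple table lookup; the sketch below covers only the letters this section's examples need (short-vowel diacritics omitted), not the full scheme.

```python
# Partial Buckwalter-to-Arabic table: just the letters used by the CODA
# examples in this section (short-vowel diacritics omitted for brevity).
BW2AR = {
    "'": "ء", ">": "أ", "<": "إ", "A": "ا", "b": "ب", "t": "ت",
    "v": "ث", "j": "ج", "H": "ح", "d": "د", "r": "ر", "s": "س",
    "$": "ش", "S": "ص", "D": "ض", "T": "ط", "Z": "ظ", "E": "ع",
    "f": "ف", "q": "ق", "k": "ك", "l": "ل", "m": "م", "n": "ن",
    "h": "ه", "p": "ة", "w": "و", "y": "ي", "Y": "ى", "~": "\u0651",
}

def bw2ar(buckwalter: str) -> str:
    """Map a Buckwalter string to Arabic script, character by character."""
    return "".join(BW2AR.get(ch, ch) for ch in buckwalter)
```

For example, bw2ar("qAlt") gives قالت and bw2ar("$ErhA") gives شعرها, matching the examples above.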

Abbreviations in Arabizi

Three abbreviations in Arabizi received special treatment: msa, isa, 7ma. These three abbreviations only were expanded out to their full form using Arabic words in the corrected Arabic script transliteration.

• msa: use mA $A' All~h ما شاء الله for "As God wills"
• isa: use <n $A' All~h إن شاء الله for "God willing"

Correcting Arabic typos

Annotators were instructed to correct typos in the transliterated Arabic words, including typos in proper names. However, typos and non-standard spellings in the transliteration of foreign words were kept as is and not corrected.

• Ramafan رمفان should be corrected to rmDAn رمضان for "Ramadan"
• babyy ببيي should not be corrected, since it is the English word "baby"

Flagged tokens in the correction task

Tokens flagged during task 1 as Sound and Foreign were transliterated into Arabic script but were not corrected during task 2. Note that even when a whole phrase or sentence appeared in English, the transliteration was not corrected.

• ks كس for "kiss"
• Dd yA hAf fAn ضد يا هاف فان for "did you have fun"

The transliteration of proper names was corrected in the same way as all other words.

Emoticons and emoji were replaced in the transliteration with #. Emoticons refer to a set of numbers or letters or punctuation marks used to express feelings or mood. Emoji refers to a special set of images used in messages. Both emoticons and emoji are frequent in SMS/Chat data.

5 Discussion

Annotation and transliteration were performed on all sentence units that contain Arabizi. Sentence units that contain only Arabic script were ignored and untouched during annotation. In total, we reviewed 1,270 conversations, among which over 42.6K sentence units (more than 300K words) were deemed to contain Arabizi and hence were annotated and transliterated.

As we noted earlier, code switching is frequent in the SMS and Chat Arabizi data. There were about 23K words flagged as foreign words. Written out speech effects in this type of data are also prevalent, and 6,610 tokens were flagged as Sounds (laughter, filled pauses, etc.). Annotators most often agreed with each other in the detection and flagging of tokens as Foreign, Name, Sound or Punctuation, with over 98% agreement for all flags.

The transliteration annotation was more difficult than the flagging annotation, because applying CODA requires linguistic knowledge of Arabic. Annotators went through several rounds of training and practice, and only those who passed a test were allowed to work on the task. In an analysis of inter-annotator agreement in the dually annotated files, the overall agreement between the two annotators was 86.4%. We analyzed all the disagreements and classified them in four high level categories:

• CODA: 60% of the disagreements were related to CODA decisions that did not carefully follow the guidelines. Two-fifths of these cases were related to Alif/Ya spelling (mostly Alif normalization and rules of hamza support) and about one-fifth involved the spelling of common dialectal words. An additional one-third were due to non-CODA root, pattern or affix spelling. Only one-tenth of the cases were because of split or merge decisions. These issues suggest that additional training may be needed. Additionally, since some of

the CODA errors may be easy to detect and correct using available tools for morphological analysis of Egyptian Arabic (such as the CALIMA-ARZ analyzer), we will consider integrating such support in the annotation interface in the future.

• Task: In 23% of the overall disagreements, the annotators did not follow the task guidelines for handling punctuation, sounds, emoticons, names or foreign words. Examples include disagreement on whether a question mark should be split or kept attached, or whether a non-Arabic word should be corrected or not. Many of these cases can also be caught as part of the interface; we will consider the necessary extensions in the future.

• Ambiguity: In 12% of the cases, the annotators' disagreement reflected a different reading of the Arabizi resulting in a different lemma or inflectional feature. These differences are unavoidable and reflect the natural ambiguity in the task.

• Typos: Finally, in less than 5% of the cases, the disagreement was a result of a typographical error unrelated to any of the above issues.

Among the cases that were easy to adjudicate, one of the two annotators was correct 60% more often than the other. This is consistent with the observation that more training may be needed to fill in some of the knowledge gaps or increase the annotators' attention to detail.

6 Conclusion

This is the first Arabizi-Arabic script parallel corpus that supports research on transliteration from Arabizi to Arabic script. We expect to make this corpus available through the Linguistic Data Consortium in the near future.

This work focuses on the novel challenges of developing a corpus like this, and points out the close interaction between the orthographic form of written informal genres of Arabic and the specific features of individual Arabic dialects. The use of Arabizi and the use of Egyptian Arabic in this corpus come together to present a host of spelling ambiguities and multiplied forms that were resolved in this corpus by the use of CODA for Egyptian Arabic. Developing a similar corpus and transliteration for other Arabic dialects would be a rich area for future work.

We believe this corpus will be essential for NLP work on Arabic dialects and informal genres. In fact, this corpus has recently been used in development by Eskander et al. (2014).

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Nizar Habash performed most of his contribution to this paper while he was at the Center for Computational Learning Systems at Columbia University.

References

Mohamed Al-Badrashiny, Ramy Eskander, Nizar Habash, and Owen Rambow. 2014. Automatic Transliteration of Romanized Dialectal Arabic. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), Baltimore, Maryland.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0.

Achraf Chalabi and Hany Gerges. 2012. Romanized Arabic Transliteration. In Proceedings of the Second Workshop on Advances in Text Input Methods (WTIM 2012).

Eleanor Clark and Kenji Araki. 2011. Text normalization in social media: Progress, problems and applications for a pre-processing system of casual English. Procedia - Social and Behavioral Sciences, 27(0):2–11.

Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12), pages 2427–2430, New York, NY, USA. ACM.

Kareem Darwish. 2013. Arabizi Detection and Conversion to Arabic. CoRR, arXiv:1306.6755 [cs.CL].

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2013), pages 198–206, Hissar, Bulgaria, September. INCOMA Ltd., Shoumen, Bulgaria.

Heba Elfardy, Mohamed Al-Badrashiny, and Mona Diab. 2013. Code Switch Point Detection in Arabic. In Proceedings of the 18th International Conference on Application of Natural Language to Information Systems (NLDB 2013), MediaCity, UK, June.

Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash and Owen Rambow. 2014. Foreign Words

101 and the Automatic Processing of Arabic Social the Language Resources and Evaluation Confer- Media Text Written in Roman Script. In Arabic ence (LREC), Reykjavik, Iceland. Natural Language Processing Workshop, EMNLP, Mohamed Maamouri, Ann Bies, Seth Kulick, Doha, Qatar. Ciul, Nizar Habash and Ramy Eskander. 2014. De- Ramy Eskander, Nizar Habash, Owen Rambow, and veloping a dialectal Egyptian Arabic Treebank: Nadi Tomeh. 2013. Processing Spontaneous Or- Impact of Morphology and Syntax on Annotation thography. In Proceedings of the 2013 Conference and Tool Development. In Proceedings of the Lan- of the North American Chapter of the Association guage Resources and Evaluation Conference for Computational Linguistics: Human Language (LREC), Reykjavik, Iceland. Technologies (NAACL-HLT), Atlanta, GA. Yaser Al-Onaizan and Kevin Knight. 2002. Machine Andrew T. Freeman, Sherri L. Condon and Christo- Transliteration of Names in Arabic Text. In Pro- pher M. Ackerman. 2006. Cross Linguistic Name ceedings of ACL Workshop on Computational Ap- Matching in English and Arabic: A “One to Many proaches to . Mapping” Extension of the Levenshtein Edit Dis- Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, tance Algorithm. In Proceedings of HLT-NAACL, Ahmed El Kholy, Ramy Eskander, Nizar Habash, New York, NY. Manoj Pooleery, Owen Rambow, and Ryan M. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Roth. 2014. MADAMIRA: A Fast, Comprehensive Dipanjan Das, Daniel Mills, Jacob Eisenstein, Mi- Tool for Morphological Analysis and Disambigua- chael Heilman, Dani Yogatama, Jeffrey Flanigan, tion of Arabic. In Proceedings of the Language Re- and Noah A. Smith. 2011. Part-of-speech tagging sources and Evaluation Conference (LREC), Rey- for twitter: Annotation, features, and experiments. kjavik, Iceland. In Proceedings of ACL-HLT ’11. Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Stephan Gouws, Donald Metzler, Congxing Cai, and 2011. Named entity recognition in tweets: An ex- Eduard Hovy. 
2011. Contextual bearing on linguis- perimental study. In Proceedings of the Conference tic variation in social media. In Proceedings of the on Empirical Methods in Natural Language Pro- Workshop on Languages in Social Media, LSM cessing, EMNLP ’11. ’11, pages 20–29, Stroudsburg, PA, USA. Associa- Wael Salloum and Nizar Habash. 2011. Dialectal to tion for Computational Linguistics. Standard Arabic Paraphrasing to Improve Arabic- Nizar Habash, Mona Diab, and Owen Rambow English Statistical Machine Translation. In Pro- (2012a).Conventional Orthography for Dialectal ceedings of the First Workshop on Algorithms and Arabic: Principles and Guidelines – Egyptian Ara- Resources for Modelling of Dialects and Language bic. Technical Report CCLS-12-02, Columbia Varieties, pages 10–21, Edinburgh, Scotland. University Center for Computational Learning Sys- Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin tems. Walker, Jonathan Wright, Jennifer Garland, Dana Nizar Habash, Mona Diab, and Owen Rabmow. Fore, Brian Gainor, Preston Cabe, Thomas Thom- 2012b. Conventional Orthography for Dialectal as, Brendan Callahan, Ann Sawyer. Collecting Arabic. In Proceedings of the Language Resources Natural SMS and Chat Conversations in Multiple and Evaluation Conference (LREC), Istanbul. Languages: The BOLT Phase 2 Corpus. In Pro- ceedings of the Language Resources and Evalua- Nizar Habash, Ramy Eskander, and Abdelati Haw- tion Conference (LREC) 2014, Reykjavik, Iceland. wari. 2012c. A Morphological Analyzer for Egyp- tian Arabic. In Proceedings of the Twelfth Meeting Clare Voss, Stephen Tratz, Jamal Laoudi, and Dou- of the Special Interest Group on Computational glas Briesch. 2014. Finding romanized Arabic dia- Morphology and Phonology, pages 1–9, Montréal, lect in code-mixed tweets. In Proceedings of the Canada. Ninth International Conference on Language Re- sources and Evaluation (LREC’14), Reykjavik, Kevin Knight and Jonathan Graehl. 1997. Machine Iceland. Transliteration. 
In Proceedings of the Conference of the Association for Computational Linguistics Omar F Zaidan and Chris Callison-Burch. 2011. The (ACL). arabic online commentary dataset: an annotated da- taset of informal arabic with high dialectal content. Linguistic Data Consortium. 2014. BOLT Program: In Proceedings of ACL, pages 37–41. Romanized Arabic (Arabizi) to Arabic Translitera- tion and Normalization Guidelines, Version 3.1. Linguistic Data Consortium, April 21, 2014. Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identifica- tion of multilingual documents. In Proceedings of

Appendix A: File Format Examples

Example 1:

marwan ? ana walahi knt gaya today :/
marwan ? ana walahi knt gaya today :/
مروان ؟ انا والله كنت جاية تودي :/
مروان ؟ انا والله كنت جاية تودي #
مروان ؟ انا والله كنت جاية تودي #
Marwan? I swear I was coming today :/
marwan ? ana walahi knt gaya today :/

Example 2:

W sha3rak ma2sersh:D haha
W sha3rak ma2sersh:D haha
و[+] شعرك مقصرش[-] # هه
و[+] شعرك ما[-]قصرش[-] # هه
وشعرك ما قصرش # هه
And your hair did not become short? :D Haha
W sha3rak ma2sersh:D haha
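Each example above consists of the same sequence of tiers for one message. As a sketch only, such a record could be loaded as follows; the tier names and the one-tier-per-line layout are hypothetical illustrations, not the corpus's documented file format:

```python
# Hypothetical tier names for illustration only; the released corpus
# may use different labels, ordering, and delimiters.
TIER_NAMES = [
    "source_arabizi",       # raw Arabizi message as typed
    "corrected_arabizi",    # Arabizi after correction/tokenization
    "raw_transliteration",  # direct Arabic-script transliteration
    "coda_marked",          # CODA form with clitic markers such as [+]/[-]
    "coda",                 # clean CODA Arabic-script form
    "english",              # English translation
    "original",             # original message, repeated
]

def parse_record(block: str) -> dict:
    """Map each non-empty line of one example record to a tier name."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if len(lines) != len(TIER_NAMES):
        raise ValueError(f"expected {len(TIER_NAMES)} tiers, got {len(lines)}")
    return dict(zip(TIER_NAMES, lines))

record = parse_record("""
marwan ? ana walahi knt gaya today :/
marwan ? ana walahi knt gaya today :/
مروان ؟ انا والله كنت جاية تودي :/
مروان ؟ انا والله كنت جاية تودي #
مروان ؟ انا والله كنت جاية تودي #
Marwan? I swear I was coming today :/
marwan ? ana walahi knt gaya today :/
""")
print(record["english"])  # prints: Marwan? I swear I was coming today :/
```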

103
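The spelling ambiguity that motivates the use of CODA can be made concrete with a toy example. The character table below is a deliberately simplified illustration of common Arabizi conventions (it is not the mapping used in the corpus annotation); it shows how a single Arabizi token can correspond to several Arabic-script candidates.

```python
from itertools import product

# Toy Arabizi-to-Arabic character table (illustrative only; real usage
# varies by writer and dialect, and this is not the corpus's scheme).
ARABIZI_MAP = {
    "2": ["ء"], "3": ["ع"], "7": ["ح"], "5": ["خ"],
    "a": ["ا", ""],   # a short vowel often has no Arabic-script letter
    "b": ["ب"],
    "d": ["د", "ض"],  # one Latin letter, several Arabic candidates
    "k": ["ك", "ق"],
    "r": ["ر"],
    "s": ["س", "ص"],
    "t": ["ت", "ط"],
}

def candidate_spellings(token: str) -> list:
    """Enumerate Arabic-script candidates for an Arabizi token."""
    choices = [ARABIZI_MAP.get(ch.lower(), [ch]) for ch in token]
    return ["".join(combo) for combo in product(*choices)]

# Each 'a' may or may not surface as Alif, so "3arab" yields 4 candidates.
print(len(candidate_spellings("3arab")))  # prints: 4
```

Disambiguating among such candidates is exactly where context, dialect knowledge, and the CODA conventions come into play.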