Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash†, Ramy Eskander‡, Owen Rambow‡ Linguistic Data Consortium, University of Pennsylvania {bies,zhiyi,maamouri,sgrimes,haejoong, jdwright,strassel}@ldc.upenn.edu †Computer Science Department, New York University Abu Dhabi †
[email protected] ‡Center for Computational Learning Systems, Columbia University ‡{reskander,rambow}@ccls.columbia.edu letters for emphasis; typos and non-standard ab- Abstract breviations are common; and non-linguistic con- tent is written out, such as laughter, sound repre- This paper describes the process of creating a sentations, and emoticons. novel resource, a parallel Arabizi-Arabic This situation is exacerbated in the case of Ar- script corpus of SMS/Chat data. The lan- abic social media for two reasons. First, Arabic guage used in social media expresses many dialects, commonly used in social media, are differences from other written genres: its vo- quite different from Modern Standard Arabic cabulary is informal with intentional devia- tions from standard orthography such as re- (MSA) phonologically, morphologically and lex- peated letters for emphasis; typos and non- ically, and most importantly, they lack standard standard abbreviations are common; and non- orthographies (Maamouri et.al. 2014). Second, linguistic content is written out, such as Arabic speakers in social media as well as dis- laughter, sound representations, and emoti- cussion forums, Short Messaging System (SMS) cons. This situation is exacerbated in the text messaging and online chat often use a non- case of Arabic social media for two reasons.