Transliteration of Arabizi Into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash†, Ramy Eskander‡, Owen Rambow‡
Linguistic Data Consortium, University of Pennsylvania
{bies,zhiyi,maamouri,sgrimes,haejoong,jdwright,strassel}@ldc.upenn.edu
†Computer Science Department, New York University Abu Dhabi
†[email protected]
‡Center for Computational Learning Systems, Columbia University
‡{reskander,rambow}@ccls.columbia.edu

Abstract

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, Arabic speakers in social media as well as discussion forums, SMS messaging and online chat often use a non-standard romanization called Arabizi. In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing expect Arabic script input. The corpus described in this paper is expected to support Arabic NLP by providing this resource.

1 Introduction

The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons.

This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic (MSA) phonologically, morphologically and lexically, and most importantly, they lack standard orthographies (Maamouri et al. 2014). Second, Arabic speakers in social media as well as discussion forums, Short Messaging System (SMS) text messaging and online chat often use a non-standard romanization called “Arabizi” (Darwish, 2013). Social media communication in Arabic takes place using a variety of orthographies and writing systems, including Arabic script, Arabizi, and a mixture of the two. Although not all social media communication uses Arabizi, the use of Arabizi is prevalent enough to pose a challenge for Arabic NLP research.

In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing and annotation expect Arabic script input (e.g., Salloum and Habash, 2011; Habash et al. 2012c; Pasha et al., 2014).

To our knowledge, there are no naturally occurring parallel texts of Arabizi and Arabic script. In this paper, we describe the process of creating such a novel resource at the Linguistic Data Consortium (LDC). We believe this corpus will be essential for developing robust tools for converting Arabizi into Arabic script.
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 93–103, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

The rest of this paper describes the collection of Egyptian SMS and Chat data and the creation of a parallel text corpus of Arabizi and Arabic script for the DARPA BOLT program.1 After reviewing the history and features in Arabizi (Section 2) and related work on Arabizi (Section 3), in Section 4, we describe our approach to collecting the Egyptian SMS and Chat data and the annotation and transliteration methodology of the Arabizi SMS and Chat into Arabic script, while in Section 5, we discuss the annotation results, along with issues and challenges we encountered in annotation.

1 http://www.darpa.mil/Our_Work/I2O/Programs/Broad_Operational_Language_Translation_%28BOLT%29.aspx

2 Arabizi and Egyptian Arabic Dialect

2.1 What is Arabizi?

Arabizi is a non-standard romanization of Arabic script that is widely adopted for communication over the Internet (World Wide Web, email) or for sending messages (instant messaging and mobile phone text messaging) when the actual Arabic script alphabet is either unavailable for technical reasons or otherwise more difficult to use. The use of Arabizi is attributed to different reasons, from lack of good input methods on some mobile devices to writers’ unfamiliarity with Arabic keyboard. In some cases, writing in Arabizi makes it easier to code switch to English or French, which is something educated Arabic speakers often do. Arabizi is used by speakers of a variety of Arabic dialects.

Because of the informal nature of this system, there is no single “correct” encoding, so some character usage overlaps. Most of the encoding system makes use of the Latin character (as used in English and French) that best approximates phonetically the Arabic letter that one wants to express (for example, either b or p corresponds to ب). This may sometimes vary due to regional variations in the pronunciation of the Arabic letter (e.g., j is used to represent ج in the Levantine dialect, while in Egyptian dialect g is used) or due to differences in the most common non-Arabic second language (e.g., sh corresponds to ش in the previously English dominated Middle East Arab countries, while ch shows a predominantly French influence as found in North Africa and Lebanon). Those letters that do not have a close phonetic approximate in the Latin script are often expressed using numerals or other characters, so that the numeral graphically approximates the Arabic letter that one wants to express (e.g., the numeral 3 represents ع because it looks like a mirror reflection of the letter).

Due to the use of Latin characters and also frequent code switching in social media Arabizi, it can be difficult to distinguish between Arabic words written in Arabizi and entirely unrelated foreign language words (Darwish 2013). For example, mesh can be the English word, or Arabizi for مش “not”. However, in context these cases can be clearly labeled as either Arabic or a foreign word. An additional complication is that many words of foreign origin have become Arabic words (“borrowings”). Examples include banadoora بندورة “tomato” and mobile موبايل “mobile phone”. It is a well-known practical and theoretical problem to distinguish borrowings (foreign words that have become part of a language and are incorporated fully into the morphological and syntactic system of the host language) from actual code switching (a bilingual writer switches entirely to a different language, even if for only a single word). Code switching is easy to identify if we find an extended passage in the foreign language which respects that language’s syntax and morphology, such as Bas eh ra2yak I have the mask. The problem arises when single foreign words appear without Arabic morphological marking: it is unclear if the writer switched to the foreign language for one word or whether he or she simply is using an Arabic word of foreign origin. In the case of banadoora بندورة “tomato”, there is little doubt that this has become a fully Arabic word and the writer is not code switching into Italian; this is also signaled by the fact that a likely Arabizi spelling (such as banadoora) is not in fact the Italian orthography (pomodoro). However, the case is less clear cut with mobile موبايل “mobile phone”: even if it is a borrowing (clearly much more recent than banadoora بندورة “tomato”), a writer will likely spell the word with the English orthography as mobile rather than write, say, mubail. More research is needed on this issue. However, because of the difficulty of establishing the difference between code switching and borrowing, we do not attempt to make this distinction in this annotation scheme.

2.2 Egyptian Arabic Dialect

Arabizi is used to write in multiple dialects of Arabic, and differences between the dialects themselves have an effect on the spellings chosen by individual writers using Arabizi. Because Egyptian Arabic is the dialect of the corpus created for this project, we will briefly discuss some of the most relevant features of Egyptian Arabic with respect to Arabizi transliteration. For a more extended discussion of the differences between MSA and Egyptian Arabic, see Habash et al. (2012a) and Maamouri et al. (2014).

Phonologically, Egyptian Arabic is characterized by the following features, compared with MSA:

(a) The loss of the interdentals /ð/ and /θ/ which are replaced by /d/ or /z/ and /t/ or /s/ respectively, thus giving those two original consonants a heavier load. Examples in-

per as part of the automatic transliteration step because they target the same conventional orthography of dialectal Arabic (CODA) (Habash et al., 2012a, 2012b), which we also target.

There are several commercial products that convert Arabizi to Arabic script, namely: Microsoft Maren,2 Google Ta3reeb,3 Basis Arabic chat translator4 and Yamli.5 Since these products are for commercial purposes, there is little information available about their approaches, and whatever resources they use are not publicly available for research purposes. Furthermore, as Al-Badrashiny et al.