Foreign Words and the Automatic Processing of Arabic Social Media Text Written in Roman Script
Ramy Eskander, Mohamed Al-Badrashiny†, Nizar Habash‡ and Owen Rambow
Center for Computational Learning Systems, Columbia University
{reskander,rambow}@ccls.columbia.edu
†Department of Computer Science, The George Washington University
†[email protected]
‡Computer Science Department, New York University Abu Dhabi
‡[email protected]

Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 1–12, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Abstract

Arabic on social media has all the properties of any language on social media that make it tough for natural language processing, plus some specific problems. These include diglossia, the use of an alternative alphabet (Roman), and code switching with foreign languages. In this paper, we present a system which can process Arabic written in Roman alphabet (“Arabizi”). It identifies whether each word is a foreign word or one of another four categories (Arabic, name, punctuation, sound), and transliterates Arabic words and names into the Arabic alphabet. We obtain an overall system performance of 83.8% on an unseen test set.

1 Introduction

Written language used in social media shows differences from that in other written genres: the vocabulary is informal (and sometimes the syntax is as well); there are intentional deviations from standard orthography (such as repeated letters for emphasis); there are typos; writers use non-standard abbreviations; non-linguistic sounds are written (haha); punctuation is used creatively; non-linguistic signs such as emoticons often compensate for the absence of a broader communication channel in written communication (which excludes, for example, prosody or visual feedback); and, most importantly for this paper, there frequently is code switching. These facts pose a well-known problem for natural language processing of social media texts, which has become an area of interest as applications such as sentiment analysis, information extraction, and machine translation turn to this genre.

This situation is exacerbated in the case of Arabic social media. There are three principal reasons. First, the Arabic language is a collection of varieties: Modern Standard Arabic (MSA), which is used in formal settings, and different forms of Dialectal Arabic (DA), which are commonly used informally. This situation is referred to as “diglossia”. MSA has a standard orthography, while the dialects do not. What is used in Arabic social media is typically DA. This means that there is no standard orthography to begin with, resulting in an even broader variation in orthographic forms found. Diglossia is seen in other linguistic communities as well, including German-speaking Switzerland, in the Czech Republic, or to a somewhat lesser extent among French speakers. Second, while both MSA and DA are commonly written in the Arabic script, DA is sometimes written in the Roman script. Arabic written in Roman is often called “Arabizi”. It is common in other linguistic communities as well to write informal communication in the Roman alphabet rather than in the native writing system, for example, among South Asians. And third, educated speakers of Arabic are often bilingual or near-bilingual speakers of another language as well (such as English or French), and will code switch between DA and the foreign language in the same utterance (and sometimes MSA as well). As is well known, code switching is common in many linguistic communities, for example among South Asians.

In this paper, we investigate the issue of processing Arabizi input with code switching. There are two tasks: identification of tokens that are not DA or MSA (and should not be transliterated into Arabic script for downstream processing), and then the transliteration into Arabic script of the parts identified as DA or MSA. In this paper, we use as a black box an existing component that we developed to transliterate from Arabizi to Arabic script (Al-Badrashiny et al., 2014). This paper concentrates on the task of identifying which tokens should be transliterated. A recent release of annotated data by the Linguistic Data Consortium (LDC, 2014c; Bies et al., 2014) has enabled novel research on this topic. The corpus provides each token with a tag, as well as a transliteration if appropriate. The tags identify foreign words, as well as Arabic words, names, punctuation, and sounds. Only Arabic words and names are transliterated. (Note that code switching is not distinguished from borrowing.) Emoticons, which may be isolated or part of an input token, are also identified, and converted into a conventional symbol (#). This paper presents taggers for the tags, and an end-to-end system which takes Arabizi input and produces a complex output which consists of a tag for each input token and a transliteration of Arabic words and names into the Arabic script. To our knowledge, this is the first system that handles the complete task as defined by the LDC data. This paper focuses on the task of identifying foreign words (as well as the other tags), on creating a single system, and on evaluating the system as a whole.

This paper makes three main contributions. First, we clearly define the computational problem of dealing with social media Arabizi, and propose a new formulation of the evaluation metric for the LDC corpus. Second, we present novel modules for the detection of foreign words as well as of emoticons, sounds, punctuation marks, and names in Arabizi. Third, we compose a single system from the various components, and evaluate the complete system.

This paper is structured as follows. We start by presenting related work (Section 2), and then we present relevant linguistic facts and explain how the data is annotated (Section 3). After summarizing our system architecture (Section 4) and experimental setup (Section 5), we present our systems for tagging in Sections 6, 7 and 8. The evaluation results are presented in Section 9.

2 Related Work

While natural language processing for English in social media has attracted considerable attention recently (Clark and Araki, 2011; Gimpel et al., 2011; Gouws et al., 2011; Ritter et al., 2011; Derczynski et al., 2013), there has not been much work on Arabic yet. We give a brief summary of relevant work on Arabic.

Darwish et al. (2012) discuss NLP problems in retrieving Arabic microblogs (tweets). They discuss many of the same issues we do, notably the problems arising from the use of DA such as the lack of a standard orthography. However, they do not deal with DA written in the Roman alphabet (though they do discuss non-Arabic characters).

There is some work on code switching between Modern Standard Arabic (MSA) and dialectal Arabic (DA). Zaidan and Callison-Burch (2011) are interested in this problem at the inter-sentence level. They crawl a large dataset of MSA-DA news commentaries. They use Amazon Mechanical Turk to annotate the dataset at the sentence level. Then they use a language modeling approach to predict the class (MSA or DA) for an unseen sentence. There is other work on dialect identification, such as AIDA (Elfardy et al., 2013; Elfardy et al., 2014). In AIDA, some statistical and morphological analyses are applied to capture code switching between MSA and DA within the same sentence. Each word in the sentence is tagged to be either DA or MSA based on the context. The tagging process mainly depends on the language modeling (LM) approach, but if a word is unknown in the LM, then its tag is assigned through MADAMIRA, a morphological disambiguator (Pasha et al., 2014).

Lui et al. (2014) proposed a system that does language identification in multilingual documents, using a generative mixture model that is based on supervised topic modeling algorithms. This is similar to our work in terms of identifying code switching. However, our system deals with Arabizi, a non-standard orthography with high variability, making the identification task much harder.

Concerning specifically NLP for Arabizi, Darwish (2013) (published in an updated version as (Darwish, 2014)) is similar to our work in that he identifies English in Arabizi text and he also transliterates Arabic text from Arabizi to Arabic script. We compare our transliteration method to his in Al-Badrashiny et al. (2014). For identification of non-Arabic words in Arabizi, Darwish (2013) uses word and sequence-level features with CRF modeling; while we use SVMs and decision trees. Darwish (2013) identifies three tags: Arabic, foreign and others (such as email addresses and URLs). In contrast, we identify a bigger set: Arabic, foreign, names, sounds, punctuation and emoticons. Furthermore, Darwish (2013) uses around 5K words for training his taggers and 3.5K words for testing; this is considerably smaller than our training and test sets of 113K and 32K words, respectively.

Chalabi and Gerges (2012) presented a hybrid approach for Arabizi transliteration. Their work does not address the detection of English words, punctuation, emoticons, and so on. They also do not handle English when mixed with Arabizi.

Voss et al.

T (which itself can also be represented using the digit “6”).

Text written in Arabizi also tends to have a large number of foreign words, that are either borrowings such as telephone, or code switching, such as love you!. Note that Arabizi often uses the source language orthography for borrowings (especially recent borrowings), even if the Arabic pronunciation is somewhat modified. As a result, distinguishing borrowings from code switch-