COLABA: Arabic Dialect Annotation and Processing
Mona Diab, Nizar Habash, Owen Rambow, Mohamed Altantawy, Yassine Benajiba
Center for Computational Learning Systems, Columbia University
475 Riverside Drive, Suite 850, New York, NY 10115
{mdiab,habash,rambow,mtantawy,ybenajiba}@ccls.columbia.edu

Abstract

In this paper, we describe COLABA, a large effort to create resources and processing tools for Dialectal Arabic blogs. We describe the objectives of the project, the process flow, and the interaction between the different components. We briefly describe the manual annotation effort and the resources created. Finally, we sketch how these resources and tools are put together to create DIRA, a term-expansion tool for information retrieval over dialectal Arabic collections using Modern Standard Arabic queries.

1. Introduction

The Arabic language is a collection of historically related variants. Arabic dialects, collectively henceforth Dialectal Arabic (DA), are the day-to-day vernaculars spoken in the Arab world. They live side by side with Modern Standard Arabic (MSA). As spoken varieties of Arabic, they differ from MSA on all levels of linguistic representation, from phonology, morphology and lexicon to syntax, semantics, and pragmatic language use. The most extreme differences are on the phonological and morphological levels.

The language of education in the Arab world is MSA. DA is perceived as a lower form of expression in the Arab world and is therefore not granted the status of MSA, which has implications on the way DA is used in daily written venues. On the other hand, being the spoken language and native tongue of millions, DA has earned the status of a living language in linguistic studies; thus we see the emergence of serious efforts to study the patterns and regularities in these linguistic varieties of Arabic (Brustad, 2000; Holes, 2004; Bateson, 1967; Erwin, 1963; Cowell, 1964; Rice and Sa’id, 1979; Abdel-Massih et al., 1979). To date, most of these studies have been field studies or theoretical in nature, with limited annotated data. In current statistical Natural Language Processing (NLP) there is an inherent need for large-scale annotated resources for a language. For DA, there have been some limited, focused efforts (Kilany et al., 2002; Maamouri et al., 2004; Maamouri et al., 2006); however, overall, the absence of large annotated resources continues to create a pronounced bottleneck for processing and for building robust tools and applications.

DA is a pervasive form of the Arabic language, especially given the ubiquity of the web. DA is emerging as the language of informal communication online, in emails, blogs, discussion forums, chats, SMS, etc., as these are media that are closer to the spoken form of language. These genres pose significant challenges to NLP in general for any language, including English. The challenge arises from the fact that the language is less controlled and more speech-like, while many of the textually oriented NLP techniques are tailored to processing edited text. The problem is compounded for Arabic precisely because of the use of DA in these genres. In fact, applying NLP tools designed for MSA directly to DA yields significantly lower performance, making it imperative to direct the research to building resources and dedicated tools for DA processing.

DA lacks large amounts of consistent data due to two factors: a lack of orthographic standards for the dialects, and a lack of overall Arabic content on the web, let alone DA content. These lead to a severe deficiency in the availability of computational annotations for DA data. The project presented here – Cross Lingual Arabic Blog Alerts (COLABA) – aims at addressing some of these gaps by building large-scale annotated DA resources as well as DA processing tools.[1]

[1] We do not address the issue of augmenting Arabic web content in this work.

This paper is organized as follows. Section 2 gives a high-level description of the COLABA project and reviews the project objectives. Section 3 discusses the annotated resources being created. Section 4 reviews the tools created for the annotation process as well as for the processing of the content of the DA data. Finally, Section 5 showcases how we are synthesizing the resources and tools created for DA for one targeted application.

2. The COLABA Project

COLABA is a multi-site partnership project. This paper, however, focuses only on the Columbia University contributions to the overall project.

COLABA is an initiative to process Arabic social media data such as blogs, discussion forums, chats, etc. Given that the language of such social media is typically DA, one of the main objectives of COLABA is to illustrate the significant impact of the use of dedicated resources for the processing of DA on NLP applications. Accordingly, together with our partners on COLABA, we chose Information Retrieval (IR) as the main testbed application for our ability to process DA.

Given a query in MSA, using the resources and processes created under the COLABA project, the IR system is able to retrieve relevant DA blog data in addition to MSA data/blogs, thus allowing the user access to as much Arabic content (in the inclusive sense of MSA and DA) as possible. The IR system may be viewed as a cross-lingual/cross-dialectal IR system due to the significant linguistic differences between the dialects and MSA. We do not describe the details of the IR system or evaluate it here, although we allude to it throughout the paper.

There are several crucial components needed in order for this objective to be realized. The COLABA IR system should be able either to take an MSA query and translate it, or its component words, into DA, or alternatively to convert all DA documents in the search collection to MSA before searching on them with the MSA query. In COLABA, we resort to the first solution. Namely, given MSA query terms, we process them and convert them to DA. This is performed using our DIRA system described in Section 5. DIRA takes in one or more MSA query terms and translates them to their corresponding equivalent DA terms.

In order for DIRA to perform such an operation, it requires two resources: a lexicon of MSA-DA term correspondences, and a robust morphological analyzer/generator that can handle the different varieties of Arabic. The process of creating the needed lexicon of term correspondences is described in detail in Section 3. The morphological analyzer/generator, MAGEAD, is described in detail in Section 4.3.
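To make DIRA's data flow concrete, the following minimal sketch shows how an MSA query term could be expanded through these two resources. This is an illustration, not the actual DIRA implementation: the toy lexicon entries and the identity stand-ins analyze_msa and generate_da are hypothetical, and in the real system the analysis and generation would be performed by MAGEAD.

    # Hypothetical sketch of DIRA-style MSA-to-DA query-term expansion.
    # The lexicon maps an MSA lemma to equivalent DA lemmas per dialect;
    # the two stand-in functions take the place of a MAGEAD-like
    # analyzer/generator.
    from typing import Dict, List

    # Toy MSA-lemma -> per-dialect DA-lemma lexicon (Buckwalter-style
    # transliteration; entries are illustrative only).
    MSA_DA_LEXICON: Dict[str, Dict[str, List[str]]] = {
        "r>Y": {"EGY": ["$Af"], "LEV": ["$Af"]},  # MSA 'to see'
    }

    def analyze_msa(word: str) -> List[str]:
        # Stand-in for morphological analysis: surface word -> lemma(s).
        # Here we pretend the query term is already a lemma.
        return [word]

    def generate_da(lemma: str, dialect: str) -> List[str]:
        # Stand-in for morphological generation: DA lemma -> surface forms.
        return [lemma]

    def expand_query_term(msa_word: str, dialects: List[str]) -> List[str]:
        """Expand one MSA query term into equivalent DA terms."""
        expansions: List[str] = []
        for lemma in analyze_msa(msa_word):              # 1. analyze to lemma(s)
            per_dialect = MSA_DA_LEXICON.get(lemma, {})  # 2. look up DA lemmas
            for dialect in dialects:
                for da_lemma in per_dialect.get(dialect, []):
                    expansions.extend(generate_da(da_lemma, dialect))  # 3. generate
        return sorted(set(expansions))

    print(expand_query_term("r>Y", ["EGY", "LEV"]))  # -> ['$Af']

Keying the lexicon by lemma mirrors the project's lexical database, which uses lemmas as its entry forms; the expensive morphology is thereby confined to the analysis and generation steps.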
For evaluation, we need to harvest large amounts of data from the web. We create sets of queries in the domains and dialects of interest to COLABA. The URLs generally serve as good indicators of the dialect of a website; however, given the fluidity of the content and the variety in dialectal usage in different social media, we decided to perform dialect identification on the lexical level. Moreover, knowing the dialect of the lexical items in a document helps narrow down the search space in the underlying lexica for the morphological analyzer/generator. Accordingly, we will also describe the process of dialect annotation for the data.
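The paper does not spell out the mechanics of lexical-level dialect identification at this point; one plausible reading is that each lexicon entry records the varieties in which the word is attested, so a per-token lookup can both tag a word and restrict which sub-lexica the analyzer/generator must search. In the sketch below, the lexicon contents and function names are hypothetical.

    # Hypothetical sketch: lexical-level dialect identification by lookup.
    # Each known word carries the set of varieties (MSA or dialects) it is
    # attested in; tagging a token narrows the analyzer's search space.
    from typing import Dict, Set

    # Toy word -> variety-label lexicon (illustrative entries only).
    WORD_DIALECTS: Dict[str, Set[str]] = {
        "ktb":    {"MSA", "EGY", "LEV"},  # shared (cognate) stem 'write'
        "dlwqty": {"EGY"},                # Egyptian 'now'
        "hlq":    {"LEV"},                # Levantine 'now'
    }

    def tag_token(token: str) -> Set[str]:
        """Varieties a token is compatible with; unknown words default to DA."""
        return WORD_DIALECTS.get(token, {"DA-unknown"})

    def restrict_lexica(token: str, loaded: Set[str]) -> Set[str]:
        """Intersect a token's tags with the loaded analyzer sub-lexica."""
        hits = tag_token(token) & loaded
        return hits if hits else loaded  # fall back to all loaded lexica

    print(tag_token("dlwqty"))                            # {'EGY'}
    print(restrict_lexica("hlq", {"MSA", "EGY", "LEV"}))  # {'LEV'}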
The current focus of the project is on blogs spanning four different dialects: Egyptian (EGY), Iraqi (IRQ), Levantine (LEV), and (a much smaller effort on) Moroccan (MOR). Our focus has been on harvesting blogs covering three domains: social issues, religion, and politics.

Once the web blog data is harvested as described in Section [...], it goes through the following steps:

1. [...]

2. [...] Our intuition is that the more words in the blogs that are not analyzed or recognized by an MSA morphological analyzer, the more dialectal the blog. It is worth noting that at this point we only identify that words are not MSA, and we make the simplifying assumption that they are DA. This process results in an initial ranking of the blog data in terms of dialectness (a sketch of this scoring appears after this list).

3. Content Clean-Up. The content of the highly ranked dialectal blogs is sent for an initial round of manual clean-up handling speech effects and typographical errors (typos) (see Section 3.2). Additionally, one of the challenging aspects of processing blog data is the severe lack of punctuation. Hence, we add a step for sentence boundary insertion as part of the clean-up process (see Section 3.3). The full guidelines will be presented in a future publication.

4. Second Ranking of Blogs and Dialectalness Detection. The resulting cleaned-up blogs are passed through the DI pipeline again. However, this time, we need to identify the actual lexical items and add them to our lexical resources with their relevant information. In this stage, in addition to identifying the dialectal unigrams using the DI pipeline as described in step 2, we identify out-of-vocabulary bigrams and trigrams, allowing us to add entries to our created resources for words that look like MSA words (i.e., cognates and faux amis that already exist in our lexica, yet are specified only as MSA). This process renders a second ranking for the blog documents and allows us to home in on the most dialectal words in an efficient manner. This process is further elaborated in Section 4.2.

5. Content Annotation. The content of the most dialectal blogs is sent for further content annotation. The highest-ranking blogs undergo full word-by-word dialect annotation as described in Section 3.5. Based on step 4, the most frequent surface words that are deemed dialectal are added to our underlying lexical resources. Adding an entry to our resources entails rendering it in its lemma form, since our lexical database uses lemmas as its entry forms.
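Steps 2 and 4 both rest on the intuition stated in step 2: the larger the share of a blog's words that an MSA morphological analyzer cannot analyze, the more dialectal the blog is assumed to be. A minimal sketch of such a ranking follows; the analyzable_in_msa predicate is a hypothetical stand-in for the real analyzer in the DI pipeline.

    # Minimal sketch of the step-2 dialectness ranking (hypothetical code;
    # analyzable_in_msa stands in for a real MSA morphological analyzer).
    from typing import Callable, Dict, List

    def dialectness(tokens: List[str],
                    analyzable_in_msa: Callable[[str], bool]) -> float:
        """Fraction of tokens with no MSA analysis; higher = more dialectal."""
        if not tokens:
            return 0.0
        unanalyzed = sum(1 for t in tokens if not analyzable_in_msa(t))
        return unanalyzed / len(tokens)

    def rank_blogs(blogs: Dict[str, List[str]],
                   analyzable_in_msa: Callable[[str], bool]) -> List[str]:
        """Order blog ids from most to least dialectal."""
        return sorted(blogs,
                      key=lambda b: dialectness(blogs[b], analyzable_in_msa),
                      reverse=True)

    # Toy run with a fake "analyzer" that knows only two MSA words.
    msa_vocab = {"ktb", "qr>"}
    blogs = {"blog1": ["ktb", "dlwqty", "hlq"], "blog2": ["ktb", "qr>"]}
    print(rank_blogs(blogs, lambda w: w in msa_vocab))  # ['blog1', 'blog2']

Step 4 extends the same test from unigrams to bigrams and trigrams, which is what lets words that do receive an MSA analysis (cognates and faux amis) surface as candidates for dialect-specific lexicon entries.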