Building a Corpus for Palestinian Arabic: a Preliminary Study Mustafa Jarrar, *Nizar Habash, Diyam Akra, Nasser Zalmout Birzeit University, West Bank, Palestine {mjarrar,nzalmout}@birzeit.edu,
[email protected] *New York University Abu Dhabi, United Arab Emirates
[email protected] the Arab World. Using such tools to understand Abstract and process Arabic dialects (DAs) is a challenging task because of the phonological and This paper presents preliminary results in morphological differences between DAs and building an annotated corpus of the MSA. In addition, there is no standard Palestinian Arabic dialect. The corpus orthography for DAs. Moreover, DAs have consists of about 43K words, stemming limited standardized written resources, since from diverse resources. The paper most of the written dialectal content is the result discusses some linguistic facts about the of ad hoc and unstructured social conversations Palestinian dialect, compared with the or commentary, in comparison to MSA’s vast Modern Standard Arabic, especially in body of literary works. terms of morphological, orthographic, and lexical variations, and suggests some The rest of this paper is structured as follows: directions to resolve the challenges these We present important linguistic background in differences pose to the annotation goal. Section 2, followed by a survey of related work Furthermore, we present two pilot in Section 3. We then present the process of studies that investigate whether existing collecting the Curras Corpus (Section 4) and the tools for processing Modern Standard challenges of annotating it (Section 5). Arabic and Egyptian Arabic can be used to speed up the annotation process of our 2.