A Large Scale Corpus of Gulf Arabic
Salam Khalifa, Nizar Habash, Dana Abdulrahim†, Sara Hassan

To cite this version: Salam Khalifa, Nizar Habash, Dana Abdulrahim, Sara Hassan. A Large Scale Corpus of Gulf Arabic. Language Resources and Evaluation Conference, 2016, Portoroz, Slovenia. hal-01349204
HAL Id: hal-01349204
https://hal.archives-ouvertes.fr/hal-01349204
Submitted on 3 Aug 2016

Computational Approaches to Modeling Language Lab, New York University Abu Dhabi, UAE
†University of Bahrain, Bahrain
{salamkhalifa,nizar.habash,sah650}@nyu.edu,
[email protected]

Abstract

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels.