Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Total Page:16
File Type:pdf, Size:1020Kb
Guidelines and Framework for a Large Scale Arabic Diacritized Corpus Wajdi Zaghouani1, Houda Bouamor1, Abdelati Hawwari2, Mona Diab 2, Ossama Obeid1, Mahmoud Ghoneim2, Sawsan Alqahtani2 and Kemal Oflazer1 1Carnegie Mellon University in Qatar fwajdiz,hbouamor,[email protected], [email protected] 2George Washington University fmdiab,abhawwari,ghoneim,[email protected] Abstract This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations. Keywords: Arabic Diacritization, Guidelines, Annotation 1. Introduction Undiacritized Diacritized Buckwalter English Written Modern Standard Arabic (MSA) poses many chal- Y«ð Y«ð /waEad/ he promised lenges for natural language processing (NLP). Most written Y«ð Y« ð /waEodN/ it/a promise Arabic text lacks short vowels and diacritics rendering a Y«ð Y« ð /wuEida/ he was promised mostly consonantal orthography (Schulz, 2004). Arabic di- Y«ð Y« ð /waEad ˜a/ and he counted acritization is an orthographic way to describe Arabic word /waEud ˜a/ and he was counted pronunciation, and to avoid word reading ambiguity. Ara- Y«ð Y«ð bic writing system has two classes of symbols: letters and diacritics. diacritics are the marks that reflect the phonolog- Table 1: Possible pronunciations and meanings of the undi- ical, morphological and grammatical rules. Diacritization acritized Arabic word /wEd/ may be classified according to the linguistic rules they are Y«ð representing, into two types: a) word form diacritization, which shows how a word is pronounced, except the last This paper presents the work carried out within the opti- letter diacritization, and b) case and mood diacritization, mal diacritization scheme for Arabic orthographic repre- which exists above or below the last letter in each word, sentation (OptDiac) project. The focus of this work is to indicating its grammatical function in the sentence. There manually create a large-scale annotated corpus with the di- are three types of diacritics: vowel, nunation, and shadda acritics for a variety of Arabic texts covering more than 10 (gemination). genres. The target size of the annotated corpus is 2 mil- The lack of diacritics leads usually to considerable lexical lion words. The present work was mainly motivated by the and morphological ambiguity as shown in the example in lack of equivalent multi-genres large scale annotated cor- Table 1.1 Full diacritization has been shown to improve pus. The creation of manually annotated corpus usually state-of-the-art Arabic automatic systems such as speech presents many challenges and issues. In order to address recognition (ASR) systems (Kirchhoff and Vergyri, 2005) those challenges, we created comprehensive and simplified and statistical machine translation (SMT) (Diab et al., annotation guidelines that were used by a team consisting 2007). Hence, diacritization has been receiving increased of five annotators and an annotation manager. The guide- attention in several Arabic NLP applications (Zitouni et al., lines were defined after an initial pilot annotation experi- 2006; Shahrour et al., 2015; Abandah et al., 2015; Belinkov ment described in Bouamor et al. (2015). In order to ensure and Glass, 2015). a high annotation agreement between the annotators, multi- Building models to assign diacritics to each letter in a word ple training sessions were held and a regular inter-annotator requires a large amount of annotated training corpora cover- agreement (IAA) measures were performed to check anno- ing different topics and domains to overcome the sparseness tation quality. To the best of our knowledge, this is the first problem. The currently available MSA diacritized corpora Arabic diacritized multi-genres corpus. are generally limited to newswire stories (those distributed by the LDC), religious texts such as the Holy Quran, or ed- The remainder of this paper is organized as follows. We re- ucational texts. view related work in section 2. Afterwards, we discuss the challenges posed by the complexity of the Arabic diacriti- 1We use the Buckwalter transliteration encoding system to rep- zation process in section 3. Then, we describe our corpus resent Arabic in Romanized script (Buckwalter, 2002) and the development of the guidelines in sections 4 and 5. 3637 We present the diacritization annotation procedure in sec- sence of any vowel. The following are the three vowel dia- tion 6 and analyze the quality of the resulting annotations critics exemplified in conjunction with the letter Ð/m: Ð /ma in 7. Finally we conclude and present our future work in (fatha), /mu (damma), /mi (kasra), and /mo (no vowel section 8. Ð Ð Ð aka sukuun). Nunation diacritics can only occur word fi- 2. Related Work nally in nominals (nouns, adjectives) and adverbs. They indicate a short vowel followed by an unwritten n sound: Since, our paper is mainly about the creation and evaluation /mAF,4 /mN and /mK. Nunation is an indicator of nom- of a large annotated corpus, we will focus mostly on this AÓ Ð Ð aspect in the previous works. There have been numerous inal indefiniteness. The shadda is a consonant doubling di- approaches to build an automatic diacritization system for acritic: Ð/m˜(/mm/). The shadda can combine with vowel Arabic using rule-based, statistical and hybrid methods. We or nunation diacritics: /m˜u or /m˜uN. refer to the recent literature review in Abandah et al. (2015) Ð Ð for a general overview of these methods and tools. Functionally, diacritics can be split into two different kinds: The most relevant resource to our work is the Penn Ara- lexical diacritics and inflectional diacritics (Diab et al., bic Treebank (PATB), a large corpus annotated by the Lin- 2007) . guistic Data Consortium (Maamouri et al., 2010). Most of Lexical diacritics : distinguish between two lexemes.5 the LDC Treebank corpora are also manually diacritized, We refer to a lexeme with its citation form as the lemma. but they cover mainly news and weblog text genres. The Arabic lemma forms are third masculine singular perfec- PATB served later to build the first Arabic Proposition Bank tive for verbs and masculine singular (or feminine singular (APB) using the fully specified diacritized lemmas (Diab et if no masculine is possible) for nouns and adjectives. For al., 2008; Zaghouani et al., 2010). example, the diacritization difference between the lemmas The Tashkeela classical Arabic vocalized corpus (Zerrouki, /kAtib/ ’writer’ and /kAtab/ ’to correspond’ dis- I. K A¿ I. KA¿ 2011) is another notable dataset covering six million words. tinguishes between the meanings of the word (lexical dis- Tashkeela was compiled from various web sources covering ambiguation) rather than their inflections. Any of diacritics Islamic religious heritage (mainly classical Arabic books). may be used to mark lexical variation. A common exam- Moreover, Dukes and Habash (2010), created the Quranic ple with the shadda (gemination) diacritic is the distinction Arabic Corpus, a fully diacritized annotated linguistic re- between Form I and Form II of Arabic verb derivations. source which we used later on to build the first Quranic Form II, indicates, in most cases, added causativity to the Arabic Proposition Bank Zaghouani et al. (2012). Form I meaning. Form II is marked by doubling the second The Qatar Arabic Language Bank (Zaghouani et al., 2014; Zaghouani et al., 2015; Zaghouani et al., 2016) is another radical of the root used in Form I: É¿@ /Akal/’ate’ vs. É¿@ relevant work that aims to build a large corpus of manu- /Ak˜al/ ’fed’. Generally speaking, however, deriving word ally corrected Arabic text for building automatic correction meaning through lexical diacritic placement is largely un- tools for three Arabic text genres: native, non-native and predictable and they are not specifically associated with any machine translation post-edited text. particular part of speech. Recently, in Bouamor et al. (2015), we conducted vari- Inflectional diacritics : distinguish different inflected ous annotation experiments to find the most suitable and forms of the same lexeme. For instance, the final diacrit- efficient annotation procedure in creating a large scale dia- ics in /kitAbu/ ’book [nominative]’ and /kitAba/ H . AJ» H . AJ» critized corpus. ’book [accusative]’ distinguish the syntactic case of ’book’ 3. Arabic Diacritics (e.g., whether the word is subject or object of a verb). Addi- tional inflectional features marked through diacritic change, Arabic script consists of two classes of symbols: letters and in addition to syntactic case, include voice, mood, and def- diacritics. Letters comprise long vowels such as A, y, w initeness. Inflectional diacritics are predictable in their po- as well as consonants. Diacritics on the other hand com- sitional placement in a word. Moreover, they are associated prise short vowels, gemination markers, nunation markers, with certain parts of speech. as well as other markers (such as hamza, the glottal stop which appears in conjunction with a small number of let- 4. Corpus Description ters, e.g., , , , etc., dots on letters, elongation and emphatic @ @ @ We use the corpus of contemporary Arabic