Arabic Diacritization in the Context of Statistical Machine Translation
Total Page:16
File Type:pdf, Size:1020Kb
Arabic Diacritization in the Context of Statistical Machine Translation Mona Diab, Mahmoud Ghoneim, Nizar Habash Center for Computational Learning Systems Columbia University 475 Riverside Drive, New York, NY 10115 mdiab,mghoneim,[email protected] Abstract Diacritics in Arabic are optional orthographic symbols typically representing short vowels. Most Arabic text is underspecified for diacritics. However, we do observe partial diacritization depending on genre and domain. In this paper, we investigate the impact of Arabic diacritization on statistical machine translation (SMT). We define several diacritization schemes ranging from full to partial diacritization. We explore the impact of the defined schemes on SMT in two different modes which tease apart the effect of diacritization on the alignment and its consequences on decoding. Our results show that none of the partial diacritization schemes significantly varies in performance from the no-diacritization baseline despite the increase in the number of types in the data. However, a full diacritization scheme performs significantly worse than no diacritization. Crucially, our research suggests that the SMT performance is positively correlated with the increase in the number of tokens correctly affected by a diacritization scheme and the high F-score of the automatic assignment of the particular diacritic. 1 Introduction ent modes that tease apart the effect diacritization has on alignment and its consequences on decoding. This inves- Modern standard Arabic (MSA) is written with an tigation falls within approaches to preprocessing source orthography that includes optional diacritical marks language text, which typically attempt to reduce word (henceforth, diacritics). Diacritics are extremely use- sparsity through morphological preprocessing and ortho- ful for readability and understanding. Their absence in graphic normalization as a means of improving transla- Arabic text adds another layer of lexical and morpho- tion quality. However, our work differs from previous ap- logical ambiguity. Naturally occurring Arabic text has proaches in that it maintains the same preprocessing tok- some percentage of these diacritics present depending enization throughout all the experiments and only varies on genre and domain. They are there to aid the reader some of the word forms, effectively increasing the word disambiguate the text or simply to articulate it correctly. types in our vocabulary adding to the complexity of our For instance, religious text such as the Quran is fully text rather than reducing it. Our results show that none of diacritized to minimize the chances of reciting it incor- the partial diacritization schemes significantly varies in rectly. So are childrens’ educational texts. Classical performance from the no-diacritization baseline despite poetry tends to be diacritized as well. However, news the increase in the number of types in the data. How- text and other genre are sparsely diacritized (e.g., around ever, a full diacritization scheme performs significantly 1.5% of tokens in the United Nations Arabic cor- worse than no diacritization. Crucially, our research sug- pus bear at least one diacritic). gests that the SMT performance is positively correlated In speech technology, full diacritization has im- with the number of tokens affected by a diacritization proved state-of-the-art Arabic automatic speech recogni- scheme coupled with the high F-score of the specific au- tion (ASR) systems especially in cross dialectal speech tomatic diacritic assignment. Hence, the larger the size modeling (Kirchhoff and Vergyri, 2005). However, no of the accurately affected tokens by the application of a studies have investigated the optimal level of diacritiza- diacritization scheme, the higher the yielded SMT score. tion sufficient to yield the best ASR results. To date, This paper is organized as follows: Section 2 reviews no systematic study of the impact of diacritization on related work; Section 3 presents Arabic diacritic linguis- other NLP applications has been reported. Guided by tic facts; Section 4 describes the experimental setup; Sec- the utility of diacritization for readability and its pres- tion 5 presents the experimental results and discussion. ence in naturally occurring text, we introduce the notion Finally, we conclude with Section 6 with a possible fu- of partial diacritization for natural language processing ture directions. (NLP). We define several diacritization schemes rang- ing from full diacritization to partial diacritization. The 2 Related Work schemes vary in representation from the inflectional to the lexical. We also investigate the impact of both par- Arabic Diacritization Much work has been done on tial and full diacritization on statistical machine transla- Arabic diacritization (aka vowelization, diacritic/vowel tion (SMT). In the SMT pipeline, we consider two differ- restoration). We refer to the literature review in (Zi- touni et al., 2006) for an excellent general review. Zi- diacritics are optional: written Arabic may be fully di- touni et al. (2006) use a maximum entropy classifier acritized, may have some diacritics, or may be entirely to assign diacritics to the letters of each word. Vergyri undiacritized. In this section, Arabic diacritics are de- and Kirchhoff (2004) and Ananthakrishnan et al. (2005) scribed in terms of their form and function before dis- present work on diacritization targeted toward improv- cussing how they are actually used in practice. ing ASR. Both systems exploit the Buckwalter Arabic Morphological Analysis (BAMA) system. Vergyri and 3.1 Arabic Diacritic Forms Kirchhoff (2004) use a single tagger to select amongst There are three types of diacritics: vowel, nunation, and the diacritized analyses; whereas Ananthakrishnan et al. shadda (gemination). Vowel diacritics represent Ara- (2005) use a language-modeling approach. Habash and bic’s three short vowels and a diacritic indicating the ab- Rambow (2007) introduce a system MADA that also uses sence of any vowel. The following are the four vowel- ¡ BAMA, but they use 14 taggers and a lexeme-based lan- diacritics exemplified in conjunction with the letter £ ¤ ¢ ¡ ¡ ¡ ¡ guage model. MADA is the best performing system to b2: ba (fatha), bu (damma), bi (kasra), and date achieving a word error rate of 14.9% and a diacritic ¢ bo (no vowel aka sukuun). Nunation diacritics can only error rate of 4.8%. We use MADA in implementing our occur in word final positions in nominals (nouns, adjec- diacritization schemes.1 tives and adverbs). They indicate a short vowel followed ¥ ¦ § Diacritization as Preprocessing Relevant research on ¡ ©¡ 3 4 ©¡ by an unwritten n sound: bAF , ¨ bN and bK. the effect of morphological preprocessing on SMT qual- ¥ ity focuses on morphologically rich languages such as Nunation is an indicator of nominal indefiniteness. The ©¡ ∼ German (Nießen and Ney, 2004); Spanish, Catalan, and shadda is a consonant doubling diacritic: b (/bb/). The shadda can combine with vowel or nunation diacrit- Serbian (Popovic´ and Ney, 2004); and Czech (Gold- £ ¡ ¡ ∼ ¨ ∼ water and McClosky, 2005). They all studied the ef- ics: b u or b N. fects of various kinds of tokenization, lemmatization and Additional diacritical marks in Arabic include the POS tagging and show a positive impact on SMT qual- hamza, which appears in conjunction with a small num- ity. Specifically for Arabic, Lee (2004) investigated the ber of letters (e.g., , , , , ). Since most Arabic use of automatic alignment of POS tagged English and encodings do not count the hamza a diacritic, but rather affix-stem segmented Arabic to determine appropriate a part of the letter (like the dot on the lower-case Roman tokenizations. Habash and Sadat (2006) investigated a i or under the Arabic b: ¡ ), we do not consider it here wide set of possible tokenization schemes for Arabic. as part of the diacritic set. Results from both research investigations show that mor- phological preprocessing helps, but only for the smaller 3.2 Arabic Diacritic Functions corpora. As token size increases, the benefits dimin- ish. Habash and Sadat (2006) showed that a more lin- Functionally, diacritics can be split into two different guistically informed approach to tokenization yields bet- kinds: lexical diacritics and inflectional diacritics. ter results than simple heuristics. The work presented 3.2.1 Lexical Diacritics here differs from previous approaches to preprocessing Lexical diacritics distinguish between two lexemes.5 We in that we maintain the same preprocessing tokenization refer to a lexeme with its citation form; Arabic lexeme throughout all our experiments and only vary some of the citation forms are third masculine singular perfective for word forms, effectively increasing the word types in our verbs and masculine singular (or feminine singular if no vocabulary. masculine is possible) for nouns and adjectives. For ex- ample, the diacritization difference between the lexemes ¦ §¦ Diacritization in Natural Language Preprocessing § ¢ ¢ ¢ ¡ ¡ Regardless of the level of diacritization, to date, there ¢ kAtib ‘writer’ and kAtab ‘to correspond’ dis- have not been any systematic investigations of the im- tinguishes between the meanings of the word (lexical dis- pact of different types of Arabic diacritization on SMT ambiguation) rather than their inflections. Any of the di- (or any other NLP application, for that matter). One ex- acritics may be used to mark lexical variation. A com- ception is the work of Kirchhoff and Vergyri (2005)