Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules
Total Page:16
File Type:pdf, Size:1020Kb
Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules Fadi Biadsy∗ and Nizar Habash† and Julia Hirschberg∗ ∗Department of Computer Science, Columbia University, New York, USA {fadi,julia}@cs.columbia.edu †Center for Computational Learning Systems, Columbia University, New York, USA [email protected] Abstract dictionaries are typically written by hand. However, for morphologically rich languages, such as MSA,1 In this paper, we show that linguistically mo- pronunciation dictionaries are difficult to create by tivated pronunciation rules can improve phone hand, because of the large number of word forms, and word recognition results for Modern Stan- each of which has a large number of possible pro- dard Arabic (MSA). Using these rules and the MADA morphological analysis and dis- nunciations. Fortunately, the relationship between ambiguation tool, multiple pronunciations per orthography and pronunciation is relatively regu- word are automatically generated to build two lar and well understood for MSA. Moreover, re- pronunciation dictionaries; one for training cent automatic techniques for morphological anal- and another for decoding. We demonstrate ysis and disambiguation (MADA) can also be useful that the use of these rules can significantly in automating part of the dictionary creation process improve both MSA phone recognition and (Habash and Rambow, 2005; Habash and Rambow, MSA word recognition accuracies over a base- 2007) Nonetheless, most documented Arabic ASR line system using pronunciation rules typi- cally employed in previous work on MSA Au- systems appear to handle only a subset of Arabic tomatic Speech Recognition (ASR). We ob- phonetic phenomena; very few use morphological tain a significant improvement in absolute ac- disambiguation tools. curacy in phone recognition of 3.77%–7.29% In Section 2, we briefly describe related work, in- and a significant improvement of 4.1% in ab- cluding the baseline system we use. In Section 3, we solute accuracy in ASR. outline the linguistic phenomena we believe are crit- ical to improving MSA pronunciation dictionaries. 1 Introduction In Section 4, we describe the pronunciation rules we have developed based upon these linguistic phenom- The correspondence between orthography and pro- ena. In Section 5, we describe how these rules are nunciation in Modern Standard Arabic (MSA) falls used, together with MADA, to build our pronuncia- somewhere between that of languages such as Span- tion dictionaries for training and decoding automat- ish and Finnish, which have an almost one-to-one ically. In Section 6, we present results of our eval- mapping between letters and sounds, and languages uations of our phone- and word-recognition systems such as English and French, which exhibit a more (XPR and XWR) on MSA comparing these systems complex letter-to-sound mapping (El-Imam, 2004). to two baseline systems, BASEPR and BASEWR. The more complex this mapping is, the more diffi- cult the language is for Automatic Speech Recogni- 1MSA words have fourteen features: part-of-speech, person, tion (ASR). number, gender, voice, aspect, determiner proclitic, conjunctive An essential component of an ASR system is its proclitic, particle proclitic, pronominal enclitic, nominal case, pronunciation dictionary (lexicon), which maps the nunation, idafa (possessed), and mood. MSA features are real- ized using both concatenative (affixes and stems) and templatic orthographic representation of words to their pho- (root and patterns) morphology with a variety of morphological netic or phonemic pronunciation variants. For lan- and phonological adjustments that appear in word orthography guages with complex letter-to-sound mappings, such and interact with orthographic variations. We conclude in Section 7 and identify directions for small number of words; (c) three nunation diacrit- future research. ics (F /an/, N /un/, K /in/) representing a combina- tion of a short vowel and the nominal indefiniteness 2 Related Work marker /n/ in MSA; (d) one consonant lengthening diacritic (called Shadda ∼) which repeats/elongates Most recent work on ASR for MSA uses a sin- the previous consonant (e.g., kat∼ab is pronounced gle pronunciation dictionary constructed by map- /kattab/); and (e) one diacritic for marking when ping every undiacritized word in the training cor- there is no diacritic (called Sukun o). pus to all of the diacritized Buckwalter analyses and Arabic diacritics can only appear after a let- the diacritized versions of this word in the Arabic ter. Word-initial diacritics (in practice, only short Treebank (Maamouri et al., 2003; Afify et al., 2005; vowels) are handled by adding an extra Alif A Messaoudi et al., 2006; Soltau et al., 2007). In these (also called Hamzat-Wasl) at the beginning of the papers, each diacritized word is converted to a sin- word. Sentence/utterance initial Hamzat-Wasl is gle pronunciation with a one-to-one mapping using pronounced like a glottal stop preceding the short “very few” unspecified rules. None of these systems vowel; however, the sentence medial Hamzat-Wasl use morphological disambiguation to determine the is silent except for the short vowel. For exam- most likely pronunciation of the word given its con- ple, Ainkataba kitAbN is /Ginkataba kitAbun/ but text. Vergyri et al. (2008) do use morphological in- kitAbN Ainkataba is /kitAbun inkataba/. A ‘real’ formation to predict word pronunciation. They se- Hamza (glottal stop) is always pronounced as a glot- lect the top choice from the MADA (Morphological tal stop. The Hamzat-Wasl appears most commonly Analysis and Disambiguation for Arabic) system for as the Alif of the definite article Al. It also appears each word to train their acoustic models. For the test in specific words and word classes such as relative lexicon they used the undiacritized orthography, as pronouns (e.g., Aly ‘who’ and verbs in pattern VII well as all diacritizations found for each word in the (Ain1a2a3). training data as possible pronunciation variants. We Arabic short vowel diacritics are used together use this system as our baseline for comparison. with the glide consonant letters w and y to denote the long vowels /U/ (as uw) and /I/ (iy). This makes 3 Arabic Orthography and Pronunciation these two letters ambiguous. MSA is written in a morpho-phonemic orthographic Diacritics are largely restricted to religious texts representation using the Arabic script, an alphabet and Arabic language school textbooks. In other accented with optional diacritical marks.2 MSA has texts, fewer than 1.5% of words contain a diacritic. 34 phonemes (28 consonants, 3 long vowels and 3 Some diacritics are lexical (where word meaning short vowels). The Arabic script has 36 basic let- varies) and others are inflectional (where nominal ters (ignoring ligatures) and 9 diacritics. Most Ara- case or verbal mood varies). Inflectional diacritics bic letters have a one-to-one mapping to an MSA are typically word final. Since nominal case, verbal phoneme; however, there are a small number of mood and nunation have all disappeared in spoken common exceptions (Habash et al., 2007; El-Imam, dialectal Arabic, Arabic speakers do not always pro- 2004) which we summarize next. duce these inflections correctly or at all. Much work has been done on automatic Arabic 3.1 Optional Diacritics diacritization (Vergyri and Kirchhoff, 2004; Anan- Arabic script commonly uses nine optional diacrit- thakrishnan et al., 2005; Zitouni et al., 2006; Habash ics: (a) three short-vowel diacritics representing the and Rambow, 2007). In this paper, we use the vowels /a/, /u/ and /i/; (b) one long-vowel diacritic MADA (Morphological Analysis and Disambigua- (Dagger Alif ‘) representing the long vowel /A/ in a tion for Arabic) system to diacritize Arabic (Habash and Rambow, 2005; Habash and Rambow, 2007). 2We provide Arabic script orthographic transliteration in MADA, which uses the Buckwalter Arabic mor- the Buckwalter transliteration scheme (Buckwalter, 2004). For phological Analyzer databases (Buckwalter, 2004), Modern Standard Arabic phonological transcription, we use a provides the necessary information to determine variant of the Buckwalter transliteration with the following ex- ceptions: glottal stops are represented as /G/ and long vowels as Hamzat-Wasl through morphologically tagging the /A/, /U/ and /I/. All Arabic script diacritics are phonologically definite article; in most other cases it outputs the spe- spelled out. cial symbol “{” for Hamzat-Wasl. 3.2 Hamza Spelling appears after some nunated nouns, e.g., ki- The consonant Hamza (glottal stop /G/) has multi- taAbAF /kitAban/. In some poetic readings, ′ this Alif can be produced as the long vowel ple forms in Arabic script: , >, <, &, /A/: /kitAbA/. Finally, a common odd spelling }, |. There are complex rules for Hamza spelling is that of the proper name Eamrw /Eamr/ that primarily depend on its vocalic context. For ex- ‘Amr’where the final w is silent. ample, } is used word medially and finally when preceded or followed by an /i/ vowel. Similarly, the 4 Pronunciation Rules Hamza form | is used when the Hamza is followed by the long vowel /A/. As noted in Section 3, diacritization alone does not Hamza spelling is further complicated by the fact predict actual pronunciation in MSA. In this section that Arabic writers often replace hamzated letters we describe a set of rules based on MSA phonol- ogy which will extend a diacritized word to a set with the un-hamzated form ( > → A)