A Transcription Scheme for Languages Employing the Arabic Script Motivated by Speech Processing Application

Shadi GANJAVI
*Department of Linguistics
University of Southern California
[email protected]

Panayiotis G. GEORGIOU, Shrikanth NARAYANAN*
Department of Electrical Engineering
Speech Analysis & Interpretation Laboratory (sail.usc.edu)
[georgiou, shri]@sipi.usc.edu

Abstract

This paper offers a transcription system for Persian, the target language in the Transonics project, a speech-to-speech translation system developed as a part of the DARPA Babylon program (The DARPA Babylon Program; Narayanan, 2003). In this paper, we discuss transcription systems needed for automated spoken language processing applications in Persian, which uses the Arabic script for writing. This system can easily be modified for Arabic, Dari, Urdu and any other language that uses the Arabic script. The proposed system has two components. One is a phonemic transcription of sounds for acoustic modelling in Automatic Speech Recognizers and for Text to Speech synthesizers, using ASCII based symbols rather than International Phonetic Alphabet symbols. The other is a hybrid system that provides a minimally ambiguous lexical representation that explicitly includes vocalic information; such a representation is needed for language modelling, text to speech synthesis and machine translation.

1 Introduction

Speech-to-speech (S2S) translation systems present many challenges, not only due to the complex nature of the individual technologies involved, but also due to the intricate interaction that these technologies have to achieve. A great challenge for the specific S2S translation system involving Persian and English arises not only from the linguistic differences between the two languages but also from the limited amount of data available for Persian. The other major hurdle in achieving an S2S system involving these languages is the Persian writing system, which is based on the Arabic script and hence lacks the explicit inclusion of vowel sounds, resulting in a very large number of one-to-many mappings from transcription to acoustic and semantic representations.

In order to achieve our goal, the system that was designed comprises the following components: (1) a visual and control Graphical User Interface (GUI); (2) an Automatic Speech Recognition (ASR) subsystem, which works using both Finite State Grammars (FSG) and Language Models (LM), producing n-best lists/lattices along with the decoding confidence scores; (3) a Dialog Manager (DM), which receives the output of the speech recognition and machine translation units and subsequently "re-scores" the data according to the history of the conversation; (4) a Machine Translation (MT) unit, which works in two modes: Classifier based MT and a fully Stochastic MT; and finally (5) a unit selection based Text To Speech synthesizer (TTS), which provides the spoken output. A functional block diagram is shown in Figure 1.

Fig 1. Block diagram of the system. Note that the communication server allows interaction between all subsystems and the broadcast of messages. Our vision is that only the doctor will have access to the GUI and the patient will only be given a phone handset.

1.1 The Language Under Investigation: Persian

Persian is an Indo-European language with a writing system based on the Arabic script. Languages that use this script have posed a problem for automated language processing such as speech recognition and translation systems. For instance, the CSLU Labeling Guide (Lander, http://cslu.cse.ogi.edu/corpora/corpPublications.html) offers orthographic and phonetic transcription systems for a wide variety of languages, from German and Spanish, with a Latin-based writing system, to languages like Mandarin and Cantonese, which use Chinese characters for writing. However, there seems to be no standard transcription system for languages like Arabic, Persian, Dari, Urdu and many others, which use the Arabic script (ibid; Kaye, 1876; Kachru, 1987, among others).

Because Persian and Arabic are different, Persian has modified the writing system and augmented it in order to accommodate the differences. For instance, four letters were added to the original system in order to capture the sounds available in Persian that Arabic does not have. Also, there are a number of homophonic letters in the Persian writing system, i.e., the same sound corresponding to different orthographic representations. This problem is unique to Persian, since in Arabic different orthographic representations represent different sounds. The other problem, which is common to all languages using the Arabic script, is the existence of a large number of homographic words, i.e., orthographic representations that have the same form but different pronunciations. This problem arises due to the limited representation of vowels in this writing system.

Examples of the homophones and homographs are presented in Table 1. The words "six" and "lung" are examples of homographs, where the identical (transliterated Arabic) orthographic representations (Column 3) correspond to different pronunciations, [SeS] and [SoS] respectively (Column 4). The words "hundred" and "dam" are examples of homophones, where the two words have the same pronunciation [sad] (Column 4), despite their different spellings (Column 3).

Gloss     Persian   USCPers   USCPron   USCPers+
‘six’     شش        SS        SeS       SeS
‘lung’    شش        SS        SoS       SoS
‘100’     صد        $d        sad       $ad
‘dam’     سد        sd        sad       sad

Table 1: Examples of the transcription methods and their limitations. Purely orthographic transcription schemes (such as USCPers) fail to distinctly represent homographs, while purely phonetic ones (such as USCPron) fail to distinctly represent homophones.

The former (the homographs) exemplify the cases in which one orthographic form maps to several pronunciations, a direct result of the basic characteristic of the Arabic script, viz., little to no representation of the vowels. As is evident from the data presented in this table, there are two major sources of problems for any speech-to-speech machine translation. In other words, employing a system with a direct 1-1 mapping between Arabic orthography and a Latin based transcription system (what we refer to as USCPers in our paper) would be highly ambiguous and insufficient to capture distinct words as required by our speech-to-speech translation system, thus resulting in ambiguity at the text-to-speech output level and internal confusion in the language modelling and machine translation units. The latter (the homophones), on the other hand, represent the cases in which the same sequence of sounds corresponds to more than one orthographic representation. Therefore, using a purely phonetic transcription, e.g., USCPron, would be acceptable for the Automatic Speech Recognizer (ASR), but not for the Dialog Manager (DM) or the Machine Translator (MT).

The goal of this paper is twofold: (i) to provide an ASCII based phonemic transcription system similar to the one used in the International Phonetic Alphabet (IPA), in line with Worldbet (Hieronymus, http://cslu.cse.ogi.edu/corpora/corpPublications.html), and (ii) to argue for an ASCII based hybrid transcription scheme, which provides an easy way to transcribe data in languages that use the Arabic script.

In Section 2 we provide the USCPron ASCII based phonemic transcription system, which is similar to the one used by the International Phonetic Alphabet (IPA), in line with Worldbet (ibid). In Section 3, we present the USCPers orthographic scheme, which has a one-to-one mapping to the Arabic script. In Section 4 we present and analyze USCPers+, a hybrid system that keeps the orthographic information while providing the vowels. Section 5 discusses some further issues regarding the lack of data.

2 Phonetic Labels (USCPron)

One of the requirements of an ASR system is a phonetic transcription scheme to represent the pronunciation patterns for the acoustic models. Persian has a total of 29 sounds in its inventory: six vowels (Section 2.1) and 23 consonants (Section 2.2). The system that we created to capture these sounds is a modified version of the International Phonetic Alphabet (IPA), called USCPron(unciation). In USCPron, just as in the IPA, there is a one-to-one correspondence between the sounds and the symbols representing them. However, unlike the IPA, this system does not require special fonts and makes use of ASCII characters only. The advantage of our system over systems that use two characters to represent a single sound is that, following the IPA, it uses a single character per sound and thereby avoids the resulting ambiguities.

2.1 Vowels

Persian has a six-vowel system, with vowels ranging from high to low and from front to back. These vowels are [i, e, a, u, o, A], as exemplified by the vowels in the following English examples: ‘beat’, ‘bet’, ‘bat’, ‘pull’, ‘poll’ and ‘pot’, respectively.

2.2 Consonants

In addition to the six vowels, there are 23 distinct consonantal sounds in Persian. Voicing is phonemic in Persian, giving rise to a quite symmetric system. These consonants are presented in Table 3 according to their place of articulation (bilabial (BL), labio-dental (LD), dental (DE), alveopalatal (AP), velar (VL), uvular (UV) and glottal (GT)), their manner of articulation (stops (ST), fricatives (FR), affricates (AF), liquids (LQ), nasals (NS) and glides (GL)) and their voicing ([-v(oice)] and [+v(oice)]).

            BL    LD    DE    AP    VL    UV    GT
ST  [-v]    p           t           k           ?
    [+v]    b           d           g     q
FR  [-v]          f     s     S     x           h
    [+v]          v     z     Z
AF  [-v]                      C
    [+v]                      J
LQ                      l, r
NS          m           n
GL                            y

Table 3: Consonants

Many of these sounds are similar to English sounds. For instance, the stops [p, b, t, d, k, g] are similar to the corresponding consonants in the following English words: ‘potato’, ‘ball’, ‘tree’, ‘doll’, ‘key’ and ‘dog’, respectively. The glottal stop [?] can be found in some pronunciations of ‘button’, and as the sound between the two syllables of ‘uh oh’. The uvular stop [q] does not have a correspondent in English. Nor does the velar fricative [x]. But the rest of the fricatives [f, v, s, z, S, Z, h] have a
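
As a rough illustration of the one-character-per-sound property of USCPron, the following Python sketch checks a transcription against the 29 symbols listed in Sections 2.1 and 2.2 and labels each symbol as a vowel or a consonant. The function names split_uscpron and consonant_skeleton are hypothetical and not part of the original system. The vowel-less "skeleton" is only a simplification meant to show why omitting vowels merges [SeS] and [SoS]; it is not the USCPers mapping, which follows the Arabic-script spelling.

```python
# Symbol inventory copied from Sections 2.1 and 2.2 (six vowels, 23 consonants).
USCPRON_VOWELS = set("ieauoA")
USCPRON_CONSONANTS = set("ptk?bdgqfsSxhvzZCJlrmny")


def split_uscpron(word):
    """Label each character of a USCPron string as 'vowel' or 'consonant'.

    Because every sound is written with exactly one ASCII character,
    segmentation is a plain character-by-character scan.
    Raises ValueError for characters outside the inventory.
    """
    labelled = []
    for symbol in word:
        if symbol in USCPRON_VOWELS:
            labelled.append((symbol, "vowel"))
        elif symbol in USCPRON_CONSONANTS:
            labelled.append((symbol, "consonant"))
        else:
            raise ValueError(f"{symbol!r} is not a USCPron symbol")
    return labelled


def consonant_skeleton(word):
    """Drop all vowels (illustrative simplification, not USCPers)."""
    return "".join(s for s, kind in split_uscpron(word) if kind == "consonant")


if __name__ == "__main__":
    # Pronunciations taken from Table 1.
    for pron in ("SeS", "SoS", "sad"):
        print(pron, split_uscpron(pron), consonant_skeleton(pron))
```

Under these assumptions, [SeS] ‘six’ and [SoS] ‘lung’ reduce to the same vowel-less string, which mirrors the homograph problem described in Section 1.1, while a string such as "$ad" is rejected because "$" belongs to the orthographic schemes rather than to USCPron.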
