
INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, ICC Jeju, Jeju Island, Korea, October 4-8, 2004. ISCA Archive, http://www.isca-speech.org/archive

A first experience on multilingual acoustic modeling of the languages spoken in Morocco

José B. Mariño, A. Moreno, A. Nogueiras

TALP Research Center on “Technologies and Applications of Language and Speech”, Technical University of Catalonia, Spain
{canton,asuncion,albino}@talp.upc.es

DOI: 10.21437/Interspeech.2004-308

Abstract

The goal of this paper is to explore and describe the potential of multilingual acoustic models for automatic speech recognition of the languages spoken in Morocco. The basic experimental framework comes from the OrienTel project, mainly the sound inventory of the languages and the speech databases. Monolingual and multilingual automatic speech recognition systems for Modern Colloquial and Standard Arabic (MCA and MSA, respectively) and French are developed and evaluated in order to assess the phonetic exchange and similarity among the three languages. As a main result, it can be stated that a combined modeling of MSA and MCA, or even a trilingual design, does not harm the performance of the recognition system.

1. Introduction

The aim of the IST project “Multilingual Access to Interactive Communication Services for the Mediterranean and the Middle East” (OrienTel) is to enable the project's participants to design and develop multilingual interactive communication services for the Mediterranean and the Middle East, ranging from Morocco in the West to the Gulf States in the East, including Turkey, Israel and Cyprus. To achieve this aim, the consortium has been compiling a set of 23 linguistic databases and conducting research into ASR-related problems of the OrienTel region.

This paper explores and describes the potential of multilingual acoustic models and lexica for the languages spoken in Morocco. Morocco belongs to the Maghreb area. Three languages are spoken in Morocco: Modern Colloquial Arabic (MCA), Modern Standard Arabic (MSA) and French. Since MSA and MCA are both spoken across the country, Morocco is a fully bilingual country. French is mainly used for commercial transactions.

MSA and MCA have important similarities while maintaining specific phonetic traits and lexica. For instance, even though they share the same phonetic inventory, pronunciation differs slightly between the two languages.

On the other hand, French shows a completely different phonetic inventory and comes from a very different language root. In this work we shall try to take advantage of the fact that, for Moroccan people, French is a very commonly used third language, and their pronunciation is strongly influenced by Arabic phonemes.

Thus, an alternative to using a specific phonetic description for each of MSA, MCA and French can be devised. In this paper, following previous work (for instance, see [1]), a common inventory of sounds for the languages is designed and evaluated against the monolingual counterparts.

The paper is organized as follows. The next three sections describe the experimental framework, including the speech databases used to train and test the system, the inventory of sounds and the main features of the recognition system used for this experimental work. Section 5 describes the experimental work carried out to validate and evaluate the multilingual system. The paper ends with a discussion section.

2. Speech databases

In the OrienTel project three databases [2] have been produced in Morocco: for MCA, MSA and French. Calls were recorded from fixed and mobile phones. The utterances were recorded through an ISDN access to the fixed public telephone network, sampled at 8 kHz and quantized with A-law at 8 bits per sample. These databases have been used for training and testing the ASR system.

2.1. MCA database

The Modern Colloquial Arabic (MCA) database contains utterances collected from 772 speakers: 600 of them supply the training material and the remaining 172 speakers make up the testing set.

As training material the total number of utterances is 44 x 600 = 26400 utterances (spellings and yes/no questions are not used), including more than 605850 phones. As test we chose three different tasks, extensively described in [2]:
• Digit strings: prompt sheet number, telephone number, spontaneous telephone number, credit card number (14-16 digits), PIN.
• Application words.
• Dates: relative and general expressions.

2.2. MSA database

The Modern Standard Arabic (MSA) database contains utterances collected from 530 speakers: 400 of them supply the training material and the remaining 130 speakers make up the testing set.

As training material the total number of utterances is 46 x 400 = 18400 utterances (spellings and yes/no questions are not used), including more than 548395 phones. As test we chose three different tasks:
• Digit strings: prompt sheet number, strings of 4 digits.
• Application words.
• Dates: relative and general expressions.

2.3. French database

The French database is formed by utterances collected from 530 speakers: 400 of them supply the training material and the remaining 130 speakers make up the testing set.

As training material the total number of utterances is 43 x 400 = 17200 utterances (spellings and yes/no questions are not used), including more than 344245 phones. As test we chose three different tasks:
• Digit strings: prompt sheet number, telephone number, spontaneous telephone number, credit card number (14-16 digits), PIN.
• Application words.
• Dates: prompted date phrases, relative and general expressions.

After discarding the utterances with mispronounced or incomplete words, the final number of utterances for every language is given in Table 1. Furthermore, to be more specific, Table 2 shows, for each test set, the size of the vocabulary, number of sentences and number of words for the final set.

            Training      Test
Language    Utterances    Digits    A. words    Dates
MCA         26328         356       816         156
MSA         18322         911       698         114
French      17039         267       727         202

Table 1: Training and test material. Number of utterances.

            Digits           A. words         Dates
Language    Size    Words    Size    Words    Size    Words
MCA         10      2195     23      848      38      351
MSA         10      3999     25      786      32      247
French      10      1728     33      727      151     894

Table 2: Vocabulary size and number of words in the test set.

3. Inventory of sounds

The standard SAMPA [3] phoneme sets for French and Arabic were used for French and for MCA and MSA as spoken in Morocco. Table 3 summarizes the MSA and MCA inventories of allophones as they were defined to design the OrienTel databases. The same table includes, for every sound, the attributes considered later in the clustering algorithms. It can be observed that MSA and MCA share the same inventory, where rare sounds coming from foreign languages are included.

French shares only a small part of the phonemes, which are marked with a bold character in Table 3. French shows a greater variability in the vowel set, and MCA and MSA show a higher variability in the fricative set. The specific French sounds used in our experimentation can be found in Table 4, where “indeterminacy symbol” means that it replaces the corresponding symbols in the list in case of indeterminacy between both symbols.

SAMPA     Definition
Vowels
a         open front unrounded vowel
i         close front unrounded vowel
u         close back rounded vowel
a:        long open front unrounded vowel
i:        long close front unrounded vowel
u:        long close back rounded vowel
Approximants
j         voiced palatal approximant
w         voiced labial-velar approximant
Fricatives
?`(?\)    voiced pharyngeal fricative
D         voiced dental fricative
D`        voiced dental emphatic fricative
f         voiceless labiodental fricative
h         voiceless glottal fricative
s         voiceless alveolar fricative
S         voiceless postalveolar fricative
s`        voiceless alveolar emphatic fricative
T         voiceless dental fricative
v         voiced labiodental fricative (MCA, MSA rare)
x         voiceless velar fricative
X\        voiceless pharyngeal fricative
z         voiced alveolar fricative
Z         voiced postalveolar fricative
Lateral
l         voiced dental/alveolar lateral approximant
l`        voiced dental/alveolar lateral approximant emphatic (MCA, MSA rare)
Trill
r         voiced dental or alveolar trill
Nasals
n         voiced dental or alveolar nasal
Plosives
?         stød
b         voiced bilabial plosive
d         voiced dental/alveolar plosive
d`        voiced dento-alveolar emphatic plosive
g         voiced velar plosive
p         voiceless bilabial plosive (MCA, MSA rare)
q         voiceless uvular plosive
t         voiceless dental/alveolar plosive
t`        voiceless dento-alveolar emphatic plosive

Table 3: Inventory of sounds for MCA and MSA.

SAMPA     Definition
Vowels
2         close-mid front rounded vowel
9         open-mid front rounded vowel
@         mid central unrounded vowel
A         open back unrounded vowel
e         close-mid front unrounded vowel
E         open-mid front unrounded vowel
o         close-mid back rounded vowel
O         open-mid back rounded vowel
Y         close front rounded vowel
&/        2, 9 (indeterminacy symbol)
A/        a, A (indeterminacy symbol)
E/        e, E (indeterminacy symbol)
O/        o, O (indeterminacy symbol)
U~/       e~, 9~ (indeterminacy symbol)
9~        open-mid front rounded nasal
a~        open front unrounded nasal
e~        close-mid front unrounded nasal
o~        open-mid back rounded nasal
Approximant
H         labial-palatal approximant
Trill
R         uvular trill/fricative
Nasals
J         palatal nasal
N         velar nasal

Table 4: Inventory of specific sounds for French.

4. The speech recognition system

The experimental work was carried out with the speech recognition system developed at UPC. The speech is parameterized with mel-cepstrum coefficients. First and second order differential parameters plus the differential energy are employed. The phonetic unit used is the demiphone [4]. Clustering of models based on decision trees is used to smooth parameters [5]. Sounds are classified according to their type, point, manner and variant of articulation. The recognition system models the phonetic units by Gaussian semicontinuous HMMs (SCHMM) with quantization to the 6 (2 for the energy) closest codewords. The size of the codebooks is 512 (64 for the differential energy).

Every item of the vocabulary was represented by a string of demiphones. A word was provided with multiple transcriptions drawn from the lexicon [6]. An important issue for Arabic languages concerns the transcription of vowels. Usually, vowels are not written, and a high variability is expected in the vowels uttered from a given prompt text. In our experimentation the criterion to build the phonetic transcriptions of every word in the lexicon was to take into account just those vowels pronounced for that word by the speakers included in the training speech material.

Some standard pronunciation issues in Maghreb dialects have not been taken into account when generating phonetic transcriptions because of their dependence on the speaker and their non-systematic nature:
a) Substitution of /T/ by /t/, and of /q/ by /g/ or /?/.
b) Voiced dental fricatives and plosives (/D/, /D`/, /d/, /d`/) usually merge into just /d/ and /d`/.
c) Relaxation of shedda (gemination) and emphasis.
d) Deletion of hamza (/?/).
Furthermore, the distribution of these peculiarities is dialect dependent, being remarkably more important in MCA.

The recognition search is sped up by using beam search and phonetic look-ahead.

5. Evaluation

The following recognition systems were trained and evaluated:
a) Three monolingual systems, one for each language, with 750 models each.
b) Two bilingual systems for modeling MCA and MSA. Both systems use 900 models. The difference between them is the presence or absence of language-dedicated models.
c) Two multilingual systems trained with FR, MSA and MCA, either allowing language-dedicated clusters or not, with 1200 models each.

Tables 5, 6 and 7 show the performance (in terms of word accuracy) of these systems when applied to the previously described test tasks. Table 5 gives the results obtained with the French material, Table 6 the performance reached in the MSA test, and Table 7 the results for the MCA test. Multilingual systems are denoted by m when the phones with the same SAMPA symbol are merged into one phone; the mark ld is used when the generalized demiphone trees include language-dependent nodes. Due to the limited amount of material, the figures given for this test should be considered with caution.

System             Digits    App. words    Dates
French             89.4      92.2          92.0
MSA+MCA+FR (m)     88.6      93.1          91.1
MSA+MCA+FR (ld)    88.9      92.7          90.8

Table 5: Word accuracy reached with the French tests.

System             Digits    App. words    Dates
MSA                93.8      95.3          92.1
MSA+MCA (m)        93.2      94.4          93.9
MSA+MCA (ld)       93.3      94.1          92.1
MSA+MCA+FR (m)     93.2      94.3          95.6
MSA+MCA+FR (ld)    93.5      94.3          92.1

Table 6: Word accuracy reached with the MSA tests.

System             Digits    App. words    Dates
MCA                91.3      93.0          91.7
MSA+MCA (m)        90.1      94.5          94.2
MSA+MCA (ld)       90.6      93.3          94.9
MSA+MCA+FR (m)     88.8      93.9          94.9
MSA+MCA+FR (ld)    90.1      93.2          94.2

Table 7: Word accuracy reached with the MCA tests.

6. Discussion

The experimental results summarized in Tables 5-7 allow us to draw the following provisional conclusions:
a) When MCA and MSA material are shared to train the recognition system, a contradictory result arises. On one hand, the digit string recognition performance decreases. On the other hand, the scores of application words and dates recognition improve. Furthermore, the former behavior is confirmed by the better performance obtained when the phonetic unit clusters are split according to the language (MCA or MSA). These facts need further consideration (see below) in order to gain more insight into the advantages of shared training.
b) Both the application word and date tasks are carried out with a similar performance for MCA and MSA. On the contrary, digit string recognition scores lower for colloquial Arabic than for the standard language, despite the larger MCA training corpus. A plausible explanation is that MCA lacks a written form, so its prompt sheets expressed digits as figures, while the MSA prompts provided digits in alphabetical form. Thus, the MCA digits exhibited both a greater phonetic variability and a higher articulation rate than the read MSA digits.
c) MCA seems to benefit more from the multilingual systems than MSA.

In addition to the experiments reported above, and in order to assess the phonetic similarity between MCA and MSA, two new experiments were carried out. Firstly, a classification tree was learned to cluster the context-independent phonetic units of both MCA and MSA together. The phonetic traits and the corresponding language of every sound were used as attributes to learn the classification tree. The result was unambiguous: the sounds with the same SAMPA symbol joined the same node of the tree, the language being the least significant attribute. Thus, we can draw the conclusion that MCA and MSA share a great deal of phonetic characteristics.

As a second experiment, the MCA test was recognized by the monolingual MSA system, and vice versa. Table 8 shows the cross-language performance. It is obvious that the monolingual systems are not interchangeable, which provides evidence of phonetic diversity between both languages. The phonetic difference between MSA and MCA can also be assessed by the classification trees estimated to cluster the demiphones of a phone according to their context. As the example in Figure 1 shows, the tree corresponding to the right demiphones of the phone /s/ provides language-dedicated clusters from the early stages. This result is typical for the bilingual system with language-dependent phonetic units. However, bilingual designs either recover most of the monolingual performance or even outperform the monolingual systems. Therefore, no simple conclusion can be drawn. On the contrary, the phonetic relation between both languages seems a rather complex one. The balance between sharing training material and language-dedicated modeling must be considered carefully. However, we can draw the provisional conclusion that a combined modeling of MSA and MCA does not harm the performance of the recognition system, while being a practical design.

As far as the trilingual system is concerned, from Tables 5-7 we can also conclude that MSA, MCA and French can be modeled together without a significant loss in performance.

            Digits          A. words        Dates
System      MCA     MSA     MCA     MSA     MCA     MSA
MSA         81.3    93.8    84.6    95.3    85.9    92.1
MCA         91.3    89.4    93.0    90.7    91.7    90.4
MSA+MCA     90.1    93.2    94.5    94.4    94.2    93.9

Table 8: Word accuracy achieved by the monolingual baseline systems in cross-language tests. The performance of the bilingual system is included as a reference.

[Figure 1 node questions: “voiceless?”, “open, fricative, trill?”, “(post, dental)-alveolar?”, “front, palatal?”, “alveolar?”, and several “MSA?” language questions.]

Figure 1: Partial classification tree for right demiphones of sound /s/.

7. Acknowledgements

This work has been partially sponsored by the European Union under grant IST-2001-28373 (OrienTel project, http://www.orientel.org) and the Spanish Government under grant TIC2002-04447-C02 (ALIADO project, http://gps-tsc.upc.es/veu/aliado/main.html).

8. References

[1] T. Schultz and A. Waibel, “Experiments on cross-language acoustic modelling”, Proc. EUROSPEECH’01, pp. 2721-2724, Aalborg, Denmark.
[2] Oren Gedge et al., “Speech Database Design”, Deliverable D2.1 of the OrienTel Project, July 2002.
[3] http://www.phon.ucl.ac.uk/home/sampa/
[4] J.B. Mariño, A. Nogueiras, P. Pachès and A. Bonafonte, “The demiphone: an efficient contextual subword unit for continuous speech recognition”, Speech Communication, Vol. 32, No. 3, pp. 187-197.
[5] J.B. Mariño and A. Nogueiras, “Top-down bottom-up hybrid clustering algorithm for acoustic-phonetic modeling of speech”, Proc. EUROSPEECH’99, pp. 1343-1346, Budapest, Hungary.
[6] A. Moreno, J.B. Mariño, A. Nogueiras, “Multilingual acoustic models and lexica, focus on previous experiences and Morocco”, Deliverable D4.1 of the OrienTel Project, February 2004.
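The results in Tables 5-8 are reported as word accuracy, but the paper does not spell out how the figure is computed. The sketch below assumes the common definition, 100 * (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions of a minimum edit-distance alignment. The function names `edit_distance` and `word_accuracy` are illustrative and are not taken from the UPC recognition system.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word sequence into the hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level word accuracy (percent) over (reference, hypothesis)
    pairs, assuming accuracy = 100 * (N - S - D - I) / N."""
    total_words = 0
    total_errors = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        total_words += len(ref_words)
        total_errors += edit_distance(ref_words, hyp.split())
    if total_words == 0:
        raise ValueError("no reference words")
    return 100.0 * (total_words - total_errors) / total_words


# Toy example (made-up strings, not OrienTel data):
# one deleted word and one inserted word over six reference words.
example = [("one two three four", "one two four"),
           ("five six", "five six seven")]
score = word_accuracy(example)
```

Because insertions count as errors, this measure can fall below the plain percentage of correctly recognized words, which matters for long digit strings such as the credit card numbers in the test sets.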
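The training-set sizes in Sections 2.1 to 2.3 and Table 1 can be cross-checked with a few lines of arithmetic. The sketch below multiplies the per-speaker utterance counts by the number of training speakers and subtracts the retained counts of Table 1 to obtain how many utterances were discarded. All numbers are copied from the paper; the dictionary and function names are hypothetical.

```python
# (utterances per speaker, training speakers), from Sections 2.1-2.3.
NOMINAL = {"MCA": (44, 600), "MSA": (46, 400), "French": (43, 400)}

# Training utterances kept after discarding mispronounced or
# incomplete items, from Table 1.
RETAINED = {"MCA": 26328, "MSA": 18322, "French": 17039}


def discarded_utterances():
    """Return {language: (nominal, discarded, discarded_percent)}."""
    rows = {}
    for language, (per_speaker, speakers) in NOMINAL.items():
        nominal = per_speaker * speakers
        discarded = nominal - RETAINED[language]
        rows[language] = (nominal, discarded, 100.0 * discarded / nominal)
    return rows


rows = discarded_utterances()
for language, (nominal, discarded, percent) in rows.items():
    print(f"{language}: {nominal} nominal, {discarded} discarded ({percent:.2f}%)")
```

Under these figures fewer than 1% of the training utterances are discarded for any of the three languages, so the relative sizes of the training sets stay essentially as designed.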