
ICPhS XVII Regular Session Hong Kong, 17-21 August 2011

STUDYING LUXEMBOURGISH PHONETICS VIA MULTILINGUAL FORCED ALIGNMENTS

M. Adda-Decker (a,b), L. Lamel (b) & N.D. Snoeren (b)

(a) LPP CNRS / Univ. Paris 3, F-75005 Paris, France; (b) LIMSI CNRS, F-91403 Orsay, France

ABSTRACT

Luxembourgish, a Germanic language, is embedded in a multilingual context on the divide between Romance and Germanic cultures and remains an under-described language. This paper investigates the similarity of Luxembourgish segments to German, French and English via forced speech alignment techniques. Making use of monolingual acoustic seed models from these three languages, as well as “multilingual” models trained on pooled speech data, we investigated whether Luxembourgish was globally better represented by one of the individual languages or by the multilingual model. Although French words are often interspersed in spoken Luxembourgish, forced alignments show a clear preference for Germanic acoustic models, with only a limited usage of the French ones. While the German models globally provide the best match, a phone-based analysis shows language-specific preferences: French is preferred for rounded front vowels, nasal vowels and / /, whereas English is more frequently used for diphthongs. The proposed method enables the acoustic match between phonemes in different languages to be quantified and opens new perspectives in language processing studies for low-resourced languages and for L2 learning.

Keywords: forced alignment, acoustic modeling, multilingual models, Luxembourgish, Germanic languages

1. INTRODUCTION

Luxembourg, a small country of less than 500,000 inhabitants in the center of Western Europe, is composed of about 65% of native inhabitants and 35% of immigrants. The national language, Luxembourgish ("Lëtzebuergesch"), has only been considered as an official language since 1984 and is spoken by natives [4]. The immigrant population generally speaks one of Luxembourg’s other official languages: French or German. Recently, English has joined the set of languages of communication, mainly in professional environments.

As pointed out by [2] and [3], Luxembourgish should be considered as a partially under-resourced language, due to the fact that the written production is low, and linguistic knowledge and resources, such as lexica, are sparse. Written Luxembourgish is not systematically taught to children in primary school: German is usually the first written language learned, followed by French.

This paper aims to gain more insight into the acoustic properties that define the Luxembourgish language in the light of its Germanic and Romance influences. In particular, the goal is to determine the influence of German, French and the less practiced English on the acoustic realization of Luxembourgish. The focus is on acoustic modeling, making use of phonemically aligned audio data. The following questions are addressed. First, if multiple monolingual acoustic models are available for alignment of Luxembourgish audio data, is there a clear preference for one of the languages? Second, is a language-specific preference observed for specific phonemes or for phoneme classes? If so, how do they correspond to IPA symbol matches between the preferred language and Luxembourgish phonemes? Third, how do language-specific preferences, if any, fare when compared with multilingual phone models? These issues have important implications for acoustic modeling for automatic speech recognition and for handling pronunciation variants.

The next section introduces the phonemic inventory of Luxembourgish and its correspondence with the three source languages (German, French, and English). Results are then presented with both monolingual and multilingual seed models. Finally, section 6 summarizes the results and discusses challenges for speech technology and linguistic studies of Luxembourgish.

2. PHONEMIC INVENTORY OF LUXEMBOURGISH

The adopted Luxembourgish phonemic inventory includes a total of 60 phonemic symbols, including 3 extra-phonemic symbols (for silence, breath and hesitations). Table 1 presents a selection of the phonemic inventory together with illustrative examples (see [4] for more information on the phonemic inventory of Luxembourgish).

Luxembourgish is characterized by a particularly high number of diphthongs.


ICPhS XVII Regular Session Hong Kong, 17-21 August 2011

Table 1: Sample cross-lingual phoneme mappings: Lux. targets mapped to same or similar phonemes in 3 source lang. (Fre, Ger, Eng).

Table 2: Phoneme and training data information (in hours) for native and pseudo-Lux. acoustic models from English, French, German and for the multilingual superset (E, F, G).
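Since the body of Table 1 is not reproduced here, the mapping step it summarizes can be pictured with a minimal sketch. The phone symbols, inventories and function names below are hypothetical illustrations (ASCII stand-ins for IPA symbols), not the actual contents of Table 1; the sketch only shows the rule stated in the text, namely that a Luxembourgish phoneme is mapped to the same symbol when the source language has it, and that a missing diphthong falls back to its nucleus vowel.

```python
# Hypothetical toy inventories; "aI" stands for a diphthong, "a" for its nucleus.
SOURCE_INVENTORIES = {
    "eng": {"p", "a", "aI"},
    "fre": {"p", "a"},          # assumed here to lack the diphthong
    "ger": {"p", "a", "aI"},
}
# Assumed nucleus for each diphthong (illustrative).
DIPHTHONG_NUCLEUS = {"aI": "a"}


def map_phoneme(lux_phoneme, inventory):
    """Return the source phone used to seed the model of a Luxembourgish phoneme."""
    if lux_phoneme in inventory:
        return lux_phoneme                      # shared symbol
    nucleus = DIPHTHONG_NUCLEUS.get(lux_phoneme)
    if nucleus in inventory:
        return nucleus                          # approximate match: nucleus vowel
    raise KeyError(f"no mapping for {lux_phoneme!r}")


def build_seed_sets(lux_inventory):
    """One pseudo-Luxembourgish model set per source language."""
    return {
        lang: {p: map_phoneme(p, inv) for p in lux_inventory}
        for lang, inv in SOURCE_INVENTORIES.items()
    }


def build_superset(seeds, lux_inventory):
    """Concatenate the sets: each phoneme gets one candidate model per language."""
    return {p: [(lang, seeds[lang][p]) for lang in sorted(seeds)] for p in lux_inventory}


lux_inventory = ["p", "a", "aI"]
seeds = build_seed_sets(lux_inventory)
superset = build_superset(seeds, lux_inventory)
print(superset["aI"])
```

With these toy inventories, the French candidate for the diphthong is the nucleus model, while English and German contribute a diphthong model of their own.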

To minimize the phonemic inventory size, we could have chosen to code diphthongs using two consecutive symbols, one for the nucleus and one for the offglide (e.g. the sequence /a/ and // for ). We preferred, however, the option of coding diphthongs using specific unique symbols. Given the importance of French imports, nasal vowels were included in the inventory, although they are not required for typical Luxembourgish words. Furthermore, native Luxembourgish makes use of a rather complex set of voiced/unvoiced fricatives. The final Luxembourgish inventory includes 60 symbols.

3. ACOUSTIC SEED MODELS

The need to develop acoustic seed models for under-resourced languages has already been addressed in previous research [5]. In the current study, three sets of context-independent acoustic models were built, one for each well-resourced seed language (i.e., English, French, German). The models were trained on manually transcribed audio data (between 40 and 150 hours) from a variety of sources, using language-specific phone sets. The amount of data used to train the native acoustic models and the number of phonemes per language are given in Table 2. Each phone model is a tied-state, left-to-right, 3-state CDHMM with Gaussian mixture observation densities (typically containing 64 components).

Figure 1 (top) illustrates the development of three sets of pseudo-Luxembourgish acoustic models, each including 60 phones, starting from the English, French and German seed models and mapping the Luxembourgish phonemes to a close equivalent in each of the source model sets (IPA(i,n) in Fig. 1). Table 1 shows a sample of the adopted cross-lingual mappings that were used to initialize seed models for Luxembourgish. Some symbols are used several times for different Luxembourgish phonemes. For the diphthongs that are missing in French, phonemes corresponding to the nucleus were chosen. A fourth model set was then formed by concatenating the first three model sets, allowing the decoder to choose among the three models (see Table 2). Finally, a set of multilingual acoustic models was trained (see Fig. 1, lower part) using the pooled E, F, G audio data, labeled using their respective IPA(i,n) correspondences.

4. LUXEMBOURGISH SPEECH DATA

Forced alignments were carried out using 80 minutes of manually transcribed speech from the House of Parliament (Chamber debates, 70’) and from news (10’) broadcast by RTL, the Luxembourgish radio and TV broadcast company [2]. The detailed manual transcripts include all audible speech events, including disfluencies and speech errors. These verbatim transcripts were checked against the resulting word lists for errors and orthographic inconsistencies. The average Luxembourgish phone segment duration remains relatively stable across the different seed alignments (70-80 ms). The corpus includes a total of 56,000 phone segments.

5. RESULTS

In our earlier investigations [1], a change in acoustic models was permitted only on word boundaries. In that case, a clear preference for the “


German” acoustic models could be noticed, with more than 50% of phone segments aligned with the acoustic models arising from the German audio data. In the present study, the language identity of the acoustic models may change at any phone segment boundary. As a result, a large number of language switches between model sets are observed, and the German models are used less. However, in all explored alignment conditions (three sets of models trained with monolingual English, French and German data, or adding also a fourth acoustic model set trained with the pooled audio data), the German models are always at rank one. In the alignment condition including the pooled model set, German models are aligned with 36% of segments, followed by the English set (27%) and the pooled model set (26.5%). The French models are used for only 10.5% of the segments on average. In all conditions, a preference for Germanic models, and in particular for the German ones, can be observed for Luxembourgish speech. This result is in line with the postulated influence of typological distance.

When defining the IPA correspondences between the Luxembourgish and the English, French and German phoneme repertoires, there were many shared IPA symbols. For some of the symbols (e.g. diphthongs), however, a correspondence was fixed approximately (i.e., if the diphthong does not exist in the language of the audio training data, the corresponding nucleus vowel was assigned to the Luxembourgish diphthong). In the following, we investigate the alignment results as a function of the closeness of the IPA match.

For each phoneme symbol, all observed segments were aligned with one of the provided acoustic model sets (En, Fr, Ge, Pooled). Rates were computed per language origin of the acoustic model set. When not considering the pooled models, the English, French and German rates sum up to 100% for each symbol (results on the top in Figure 2). When the pooled models are added (results on the bottom), the 4 model set rates sum up to 100%. Here, the rate of the pooled model and the related decrease in the monolingual model rates give an indication of the matching accuracy between target speech and the monolingual source models.
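The two switching regimes contrasted in this section (model changes allowed only at word boundaries, as in [1], versus at any phone segment boundary) can be illustrated with a deliberately simplified sketch. It assumes fixed segment boundaries and precomputed per-segment log-likelihoods for each model set, which is not how a real aligner works (the decoder searches boundaries and model choices jointly); all names and scores below are invented toy values, not data from the study.

```python
def choose_per_phone(segments):
    """Pick the best-scoring model set independently for every phone segment."""
    return [max(seg["scores"], key=seg["scores"].get) for seg in segments]


def choose_per_word(segments):
    """Pick one model set per word by summing segment log-likelihoods."""
    totals = {}
    for seg in segments:
        word_totals = totals.setdefault(seg["word"], {})
        for name, score in seg["scores"].items():
            word_totals[name] = word_totals.get(name, 0.0) + score
    best = {word: max(t, key=t.get) for word, t in totals.items()}
    return [best[seg["word"]] for seg in segments]


def count_switches(choices):
    """Number of language switches between consecutive segments."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)


# Toy log-likelihoods (hypothetical): two words, five phone segments.
segments = [
    {"word": 0, "scores": {"ger": -10.0, "eng": -12.0, "fre": -15.0}},
    {"word": 0, "scores": {"ger": -11.0, "eng": -9.0, "fre": -14.0}},
    {"word": 0, "scores": {"ger": -8.0, "eng": -13.0, "fre": -12.0}},
    {"word": 1, "scores": {"ger": -10.0, "eng": -11.0, "fre": -7.0}},
    {"word": 1, "scores": {"ger": -9.0, "eng": -12.0, "fre": -13.0}},
]

per_phone = choose_per_phone(segments)
per_word = choose_per_word(segments)
print(per_phone, count_switches(per_phone))
print(per_word, count_switches(per_word))
```

In this toy case the word-level constraint selects the German set throughout, whereas the phone-level choice switches sets at almost every boundary, which mirrors the qualitative contrast described above.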

(i) symbols shared among languages: plosives

The plosives /p/, /t/ and /k/ with their voiced counterparts exist in the four languages. As Lëtzebuergesch is considered a Germanic language, we may hypothesize that plosives are realized similarly to German and English. We may thus expect a stronger burst than in French plosives and positive VOTs. If major acoustic correlates are shared with German and English rather than with French, then segments should be aligned with German or English models rather than with French ones. Detailed results are shown in the left panel of Figure 2. As expected, segments are distributed almost evenly among German and English models, and only 10-20% of the segments used the French models. The Luxembourgish plosives /k/ and /b/ tend to prefer English models. When adding the pooled models (bottom, left), about 30% of the data switch to these models during alignment.
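The per-phoneme rates used throughout this section, and the share of segments that move to the pooled models, can be computed along the following lines. This is a minimal sketch under stated assumptions: each aligned segment is represented as a (phoneme, model set) pair, the two alignment conditions list the same segments in the same order, and the counts below are invented toy data rather than the figures behind Figure 2.

```python
from collections import Counter, defaultdict


def usage_rates(aligned):
    """Percentage of segments of each phoneme aligned with each model set."""
    counts = defaultdict(Counter)
    for phoneme, model_set in aligned:
        counts[phoneme][model_set] += 1
    return {
        phoneme: {m: 100.0 * n / sum(c.values()) for m, n in c.items()}
        for phoneme, c in counts.items()
    }


def pooled_shift(with_pooled, pooled_name="pooled"):
    """Percentage of segments per phoneme aligned with the pooled model set."""
    total, moved = Counter(), Counter()
    for phoneme, model_set in with_pooled:
        total[phoneme] += 1
        if model_set == pooled_name:
            moved[phoneme] += 1
    return {phoneme: 100.0 * moved[phoneme] / total[phoneme] for phoneme in total}


# Toy alignment of ten /k/ segments (hypothetical counts).
mono = [("k", "eng")] * 5 + [("k", "ger")] * 4 + [("k", "fre")]
with_pooled = (
    [("k", "eng")] * 3 + [("k", "pooled")] * 2
    + [("k", "ger")] * 3 + [("k", "pooled")]
    + [("k", "fre")]
)

rates = usage_rates(mono)
shift = pooled_shift(with_pooled)
print(rates["k"], shift["k"])
```

By construction the rates of one phoneme sum to 100% in each condition, so a high pooled rate necessarily comes with a drop in the monolingual rates, which is the quantity the text reads as an indicator of mismatch.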


(ii) approximate symbol association: diphthongs

The Luxembourgish phonemic repertoire includes a large number of diphthongs. Diphthongs are broadly represented in the various regional variants, but their use in “standard” Luxembourgish (as broadcast on radio and TV) needs to be further studied. Figure 2 (mid) shows a less homogeneous picture than for the plosives. There is a tendency to use the English models: 90% of the Luxembourgish // segments were aligned with English models, whereas the // segments show a clear preference for German (all occurrences stem from euro words and compounds). When adding the pooled models, // segments tend to move to the corresponding pooled model, whereas the rate of German models remains almost unchanged for the // segments. For these, the German model provides a good match, whereas the English model of // is superseded by the pooled model.

(iii) approximate symbol association: nasal vowels

Nasal vowels may be used in Luxembourgish in French import words. Figure 2 (right) shows results for nasal and corresponding oral vowels. The rate of French model usage is very high for the // and // segments. When introducing pooled models, this rate tends to drop, indicating that there is a relatively weak match between Luxembourgish and French nasal vowels.

It is interesting to note the predominance of French models for the rounded front vowels (// and /ø/) (not shown in Figure 2 for lack of space). Similarly, German model rates are generally higher for tense vowels, whereas English models achieve higher rates for the lax vowels.

It can be seen that the pooled model (Figure 2 bottom) is used to align a large proportion of the observed phone segments for some phonemes (e.g. //), whereas for other phonemes, the pooled model only accounts for a relatively small percentage of the data. These results thus give an indicative measure of match/mismatch between the acoustic realizations of a given pair of phonemes between source and target languages.

In future work, we propose to use this measure to provide phonetic instructions for foreign language learning, such as defining lists of the difficult phonemes to learn, given an L1/L2 pair.

6. SUMMARY AND PROSPECTS

The present work focused on the issue of producing acoustic seed models for Luxembourgish, a language with strong Germanic and Romance influences, and exploring their use in phonemic alignments. A phonemic inventory was defined and linked to inventories from major neighboring languages (German, French and English), using the IPA symbol set. For each of these languages, acoustic seed models were composed using either monolingual German, French or English acoustic model sets. In Luxembourgish speech alignments, a superset of multilingual acoustic seeds was used, putting together the three language-dependent sets. The language identity of the aligned acoustic models provides information about the overall acoustic adequacy of both the cross-language phonemic correspondences and the acoustic models. Furthermore, some information can be gleaned on inter-language distances.

It was shown that the Germanic acoustic seed models provided the best match, with 36% of the segments aligned using German seeds and 27% using the English ones. Only 10.5% used the French acoustic models. Since Luxembourgish is considered a Western Germanic language close to German, this result is in line with its linguistic typology. The remaining 26.5% of segments were aligned with the models estimated from the pooled multilingual audio data. The pooled model rates give an indicative measure of match/mismatch between the acoustic realizations of a given pair of phonemes between source and target languages. In particular, phoneme-based results revealed a particularly high match between Luxembourgish and English //, and between Luxembourgish and French /ø/. As a perspective, we propose to enhance this measure to provide help for foreign language learning, elaborating lists of potentially difficult phonemes given an L1/L2 pair.

Computational ASR investigations and corpus-based analyses will not only enhance the development of a more full-fledged ASR system for Luxembourgish, but can also be used to generate more specific predictions about lexical processing and phonetic learning in listeners.

7. REFERENCES

[1] Adda-Decker, M., Lamel, L., Snoeren, N.D. 2010. Comparing mono- & multilingual acoustic seed models for a low e-resourced language: a case-study of Luxembourgish. Interspeech 2010.
[2] Adda-Decker, M., Pellegrini, T., Bilinski, E., Adda, G. 2008. Developments of Lëtzebuergesch resources for automatic speech processing and linguistic studies. LREC.
[3] Krummes, C. 2006. Sinn si or si si? Mobile-n deletion in Luxembourgish. Papers in Linguistics from the University of Manchester: Proceedings of the 15th Postgraduate Conference in Linguistics. Manchester.
[4] Schanen, F. 2004. Parlons Luxembourgeois. L’Harmattan.
[5] Schultz, T., Waibel, A. 2001. Experiments on cross-language acoustic modeling. Proceedings of Eurospeech, Aalborg.
