
Proc. of the 4th Language & Technology Conference, November 6-8, 2009, Poznan, Poland

Pronunciation and Writing Variants in Luxembourgish: The Case of Mobile N-Deletion in Large Corpora

Natalie D. Snoeren, Martine Adda-Decker
LIMSI/CNRS (UPR351), BP 133, F-91403 Orsay cedex, France
{natalie.snoeren, madda}@limsi.fr

Abstract

The national language of the Grand-Duchy of Luxembourg, Luxembourgish, has often been characterized as one of Europe's underdescribed and under-resourced languages. Because of a limited written production of Luxembourgish, poorly observed writing standardization (as compared to other languages such as English and French) and a large diversity of spoken varieties, the study of Luxembourgish poses many interesting challenges to automatic speech processing studies as well as to linguistic enquiries. In the present paper, we make use of large corpora to focus on typical pronunciation and writing variants in Luxembourgish, elicited by mobile -n deletion (hereafter shortened to MND). Using over 10 million words from transcribed Parliament debates and 10k words from news reports, we examine the reality of MND variants in written transcripts of speech. The goal of this study is threefold: to quantify the potential of variation due to MND in written Luxembourgish, to check the mandatory status of the MND rule, and to discuss the problems that arise for automatic processing of spoken Luxembourgish.

1. Introduction

Luxembourg is a small, landlocked country in Western Europe, bordered by Belgium, France and Germany. The official language Luxembourgish ("Lëtzebuergesch") is the language spoken by native Luxembourgers. From a linguistic typological point of view, Luxembourgish belongs to the West Central dialects of High German and is therefore part of the Germanic Franconian languages. Just like the English language, Luxembourgish can be considered as a mixed language with strong Romance and Germanic influences. It is estimated that about 300,000 people worldwide speak Luxembourgish. Although Luxembourgish has been the national language of Luxembourg since 1984, French and German remain the other administrative languages. Because Luxembourgish is embedded in this multilingual context, it may entail frequent code-switching. Indeed, there are virtually no pure monolinguals in Luxembourg and people switch from one language to another fairly easily. Therefore, the linguistic situation in Luxembourg poses a real challenge for researchers concerned with both automatic and human language processing.

As was previously pointed out (Adda-Decker et al., 2008; Krummes, 2006), Luxembourgish should be considered as a partially under-resourced language, mainly because written production remains relatively low. Rather surprisingly, written Luxembourgish is not systematically taught to children in primary school: German is usually the first written language learned, followed by French (Berg and Weis, 2005). As was reported by Adda-Decker et al. (2008), a relatively important production of Lëtzebuergesch literature was observed throughout the 19th century, and a number of proposals for standardizing the orthography of Luxembourgish can be traced back to the middle of the 19th century. Since that time, the question of appropriate spelling rules has remained open. There was no officially recognized spelling system until the adoption of the "OLO" (ofizjel lezebuurjer ortografi) in 1946, which aimed at producing written forms that clearly diverge from German orthography. The success of these rules remained very limited. A more successful standardization eventually emerged from the work of a number of specialists charged with the task of creating a dictionary that was published between 1950 and 1977 (Linden, 1950). This dictionary underwent some modifications and was officially adopted in the spelling reform of 1999. Nonetheless, to this day, German and French are the most widely used languages for written administrative purposes and communication in Luxembourg, guaranteeing a larger dissemination, whereas Lëtzebuergesch is mainly used for oral communication.

The strong influence of both German and French, among other factors, can explain the fact that Luxembourgish exhibits a large number of pronunciation variants and of potential writing variants derived from them. It is common for pronunciations to change from one place to another within an area of only several kilometers, especially for function words (e.g., the English personal determiner "our" can be written and pronounced as eis [ajs], ons [ ns] or is [i:s], even though the standard form is considered to be ei). These pronunciation variants may give rise to variations in written Luxembourgish, as Luxembourgish orthography strives for phonetic accuracy (Schanen, 2004). The question then arises, in particular for oral transcripts, whether the written form reflects the perceived pronunciation or whether some sort of normalization process is at work that eliminates part of the variation. Thus, Luxembourgish is predominantly a spoken language that tends to reproduce the observed variations when written.

With respect to automatic speech recognition, text normalization is an important issue in order to achieve reliable estimates for n-gram based language models. In the current paper we will address some important issues related to writing and pronunciation variants, in particular those that are elicited by the rule of mobile n-deletion. Before turning to this variant, we will first discuss the relevance of pronunciation and writing variants for automatic speech processing.

1.1. ASR and the study of pronunciation and writing variants

Speech is known to be highly variable, and major factors that contribute to the variation concern phonemic context, speaker identity (gender, age, health, emotion), speaking style and rate, communication contexts as well as environmental and recording conditions. Variation must be addressed globally for Automatic Speech Recognition (ASR) systems to produce faithful orthographic transcriptions. ASR has made tremendous progress in the study of pronunciation variants over the last decade, with a significant decrease in error rates for word recognition. Present challenges concern the improvement of language and pronunciation modeling. The problem of modeling pronunciation variants appears to be crucial, especially for spontaneous speech, and the ASR community has devoted considerable effort to pronunciation variants in recent years (Strik and Cucchiarini, 1999). Reductions typically produce pronunciation variants, with either different (centralized) phonemes (van Son and Pols, 2003), fewer phonemes, or even fewer syllables (Adda-Decker et al., 2005). In terms of lexical effects, reductions seem to affect the speech that contains the least communicative value, such as function words that are highly predictable from the surrounding context, idioms (e.g., c'est-à-dire, "that is"), morphological items in particular end- [...]

[...] et al., 2008). First, the percentage of Luxembourgish words shared with French was particularly high. This is hardly surprising, since French is widely used in administrative and official speech in Luxembourg. Second, the curves for Luxembourgish showed that the contribution of shared frequent words is higher, whereas the part corresponding to proper names remains relatively small in these data. Similar comparisons between French and English from broadcast news data showed an important part of shared proper name forms but almost no function word forms. Furthermore, the authors showed that a significant number of French and German imports is a major characteristic of the Luxembourgish language. These imports may give rise to variation in written and spoken forms. French imports may be pronounced according to their native system or adapted to Luxembourgish (e.g., the au sequence giving rise to two pronunciations, [aw] or [o]). Similar cases can be observed for German imports that are being adapted to Luxembourgish (e.g., the German suffix -ung may be pronounced and written either with u or with o: Stëmmung or Stëmmong). Multilingual entries (e.g. ville meaning "city" [vil]