MEDIAPARL: BILINGUAL MIXED LANGUAGE ACCENTED SPEECH DATABASE

David Imseng(1,2), Hervé Bourlard(1,2), Holger Caesar(1,2), Philip N. Garner(1), Gwénolé Lecorvé(1), Alexandre Nanchen(1)

(1) Idiap Research Institute, Martigny, Switzerland
(2) École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{dimseng, bourlard, hcaesar, pgarner, glecorve, …}@idiap.ch

ABSTRACT

MediaParl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. Therefore, the database contains data with high variability and is suitable to study multilingual, accented and non-native speech recognition as well as language identification and language switch detection. We also define monolingual and mixed language automatic speech recognition and language identification tasks and evaluate baseline systems. The database is publicly available for download.

Index Terms— Multilingual corpora, Non-native speech, Mixed language speech recognition, Language identification

1. INTRODUCTION

In this paper, we present a database that addresses multilingual, accented and non-native speech, which are still challenging tasks for current ASR systems. At least two bilingual databases already exist [1, 2]. The MediaParl speech corpus was recorded in Valais, a bilingual canton of Switzerland. Valais is surrounded by mountains and is internationally better known for ski resorts such as Verbier and Zermatt with the Matterhorn. Valais is an ideal place to record bilingual data because it has two official languages (French and German). Furthermore, even within Valais there are many local accents and dialects (especially in the German speaking part). This language mix leads to obvious difficulties, with many people working and even living in a non-native language, and it leads to high variability in the speech recordings. On the other hand, it yields valuable data that allow the study of multilingual, accented and non-native speech as well as language identification and language switch detection.

MediaParl was recorded at the cantonal parliament of Valais. About two thirds of the population speak French, and one third speaks German. However, the German that is spoken in Valais is a group of dialects (also known as Walliser Deutsch) without written form. The dialects differ a lot from standard (high) German (Hochdeutsch, spoken in Germany) and are sometimes even difficult to understand for other Swiss Germans. Close to the language border (Italy and French-speaking Valais), people also use foreign words (loan words) in their dialect. In the parliament (and other formal situations), people speak in accented standard German. In the remainder of the paper, we refer simply to German and French but take this to mean Swiss German and Swiss French.

The political debates at the parliament are recorded and broadcast. The recordings mostly contain prepared speeches in both languages. Some of the speakers even switch between the two languages during the speech. Therefore, the database may also be used to a certain extent to study code-switched ASR. However, in contrast to, for example, [1], the code switches always occur on sentence boundaries. While some similar databases only contain one hour of speech per language [2], MediaParl contains 20 hours of German and 20 hours of French data.

In the remainder of the paper, we will give more details about the recording (Section 2) and transcription (Section 3) process. We will also present the dictionary creation process in Section 4 and define tasks and training, development and test sets in Section 5. Finally, we present and evaluate baseline systems on most of the tasks in Section 6.

(This research was supported by the Swiss NSF through the project Interactive Cognitive Systems (ICS) under contract number 200021 132619/1 and the National Centre of Competence in Research (NCCR) in Interactive Multimodal Information Management (IM2), http://www.im2.ch.)

2. RECORDINGS

The MediaParl speech corpus was recorded at the cantonal parliament of Valais, Switzerland. We used the recordings of Swiss Valaisan parliament debates of the years 2006 and 2009. The parliament debates always take place in the same closed room. Each speaker intervention can last from about 10 seconds up to 15 minutes. Speakers are sitting or standing when talking and their voice is recorded through a distant microphone. The recordings from 2009 that were processed at Idiap Research Institute are also available as video streams online¹.

The audio recordings of the year 2006 were formatted as "mp3", more specifically MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, monaural with 16 bits per sample. The video recordings of the year 2009 were formatted as "avi" with uncompressed PCM (stereo, 48000 Hz, 16 bits per sample) audio data. All the audio data (2006 and 2009) was converted to WAVE audio, Microsoft PCM, 16 bit, mono, 16000 Hz prior to any processing.

3. TRANSCRIPTIONS

Each recorded political debate (session) lasts about 3 hours and human-generated transcriptions are available. However, manual work was required to obtain annotated speech data of reasonable quality:

• Speaker diarization was performed manually. Each speaker cluster is referred to as an intervention; an intervention consists of multiple sentences consecutively spoken by the same speaker.

• Each intervention is then associated with the corresponding transcription and manually split into individual sentences.

• The transcription of each sentence is then manually verified by two annotators and noisy utterances are discarded.

Finally, all the transcriptions were tokenized and normalized using in-house scripts. The transcription process described above resulted in a corpus of 7,042 annotated sentences (about 20 hours of speech) for the French language and 8,526 sentences (also about 20 hours of speech) for the German language.

4. DICTIONARIES

The phonemes in the dictionaries are represented using the Speech Assessment Methods Phonetic Alphabet (SAMPA)². SAMPA is based on the International Phonetic Alphabet (IPA), but features only ASCII characters. It supports multiple languages including German and French.

Manual creation of a dictionary can be quite time consuming because it requires a language expert to expand each word into its pronunciation. Therefore, we bootstrap our dictionaries with publicly available sources that are designed for a general domain of speech, such as conversations. However, the speech corpus that we use includes large numbers of words that are specific to the domain (politics) and region (Switzerland). Hence, the dictionaries need to be completed. In the remainder of this section, we first describe in general terms how we complete the dictionaries and then give more details about German and French in Sections 4.2 and 4.3 respectively.

4.1. Phonetisaurus

We used Phonetisaurus [3], a grapheme-to-phoneme (g2p) tool that uses existing dictionaries to derive a finite state transducer based mapping of sequences of letters (graphemes) to their acoustic representation (phonemes). The transducer was then applied to unseen words.

For languages with highly transparent orthographies such as Spanish or German [4], g2p approaches typically work quite well [5]. However, for languages with less transparent orthographies, such as English or French [4], it is relatively difficult to derive simple mappings from the grapheme representation of a syllable to its phoneme representation. Therefore, g2p approaches tend to work less well [5].

Furthermore, due to the prevalence of English in many fields, domain-specific words such as "highspeed", "interview" or "controlling" are often borrowed from English. Since MediaParl was recorded in a bilingual region, this effect is even more pronounced than in other, more homogeneous speaker populations. As a result, the dictionaries contain relatively large numbers of foreign words. However, the g2p mappings of one language do not necessarily generalize to a foreign language. French word suffixes, for example, are often not pronounced if they form an extension to the word stem, such as plurals and conjugations. On the other hand, German word suffixes are usually pronounced, except for some cases where terminal devoicing (voiced consonants become unvoiced before vowels or breaks) applies.

Owing to the above problems with g2p, all entries generated by Phonetisaurus were manually verified by native speakers according to the SAMPA rules for the respective language. Table 2 shows the number of unique words in each dictionary.

4.2. German Dictionary

To bootstrap the German dictionary, we used Phonolex³. Phonolex was developed in a cooperation between DFKI Saarbrücken, the Computational Linguistics Lab, the Universität Leipzig (UL) and the Bavarian Archive for Speech Signals (BAS) in Munich.

82% of the German MediaParl words were found in Phonolex. Phonetisaurus was then trained on Phonolex to generate the missing pronunciations. All g2p-based dictionary entries were manually verified in accordance with the German SAMPA rules [6].
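The bootstrapping workflow described above — look each corpus word up in a seed lexicon, send the out-of-vocabulary remainder to g2p, then verify manually — can be sketched as follows. This is a minimal illustration, not the authors' actual tooling; the sample words and pronunciations are toy entries (only "achtzig" /Q a x t s I C/ appears in the paper), and the real seed lexicon for German was Phonolex.

```python
# Sketch of the dictionary bootstrapping described in Section 4:
# words found in the seed lexicon are copied directly; missing words
# become candidates for g2p (Phonetisaurus), whose output would then
# be verified manually against the SAMPA rules for the language.

def bootstrap_dictionary(corpus_words, seed_lexicon):
    """Split the corpus vocabulary into covered and out-of-vocabulary words.

    corpus_words: iterable of word strings from the transcriptions.
    seed_lexicon: dict mapping word -> SAMPA pronunciation string.
    Returns (dictionary, oov_words, coverage_ratio).
    """
    vocabulary = sorted(set(corpus_words))
    dictionary = {w: seed_lexicon[w] for w in vocabulary if w in seed_lexicon}
    oov_words = [w for w in vocabulary if w not in seed_lexicon]
    coverage = len(dictionary) / len(vocabulary) if vocabulary else 0.0
    return dictionary, oov_words, coverage

# Hypothetical toy seed lexicon (illustrative entries only):
seed = {"achtzig": "Q a x t s I C", "und": "Q U n t"}
words = ["achtzig", "und", "achtzig", "highspeed"]
dico, oov, cov = bootstrap_dictionary(words, seed)
# "highspeed" is not in the seed lexicon, so it would be sent to g2p.
```

For German, the paper reports a coverage of 82% at this stage, with the remaining 18% of words generated by Phonetisaurus and then checked by native speakers.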
¹ http://www.canal9.ch/television-valaisanne/emissions/grand-conseil.html
² http://www.phon.ucl.ac.uk/home/sampa/index.html
³ http://www.phonetik.uni-muenchen.de/Bas/BasPHONOLEXeng.html

Since Phonolex is a standard German dictionary and we only use one pronunciation for each word, the actual Swiss German pronunciation of some words may differ significantly. Analyzing, for instance, various samples of the German word "achtzig" reveals that speakers in MediaParl pronounce it in three different ways:

1. /Q a x t s I C/

2. /Q a x t s I k/

As already described in Section 3, an intervention contains multiple sentences of the same speaker. The bilingual speakers change language within one intervention, hence the database can also be used to study the detection of language switches. Note that the language switches always occur at sentence boundaries.

Mixed language ASR: Mixed language ASR is defined as ASR without knowing the language of a sentence a priori.
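Because the code switches are guaranteed to occur at sentence boundaries, detecting a switch within an intervention reduces to comparing consecutive per-sentence language labels. A minimal sketch under that assumption (the label values "FR"/"DE" and the sample sentences are hypothetical, not taken from the corpus):

```python
# Sketch of sentence-boundary language switch detection: an intervention
# is an ordered list of (sentence, language) pairs, and a switch is any
# boundary where the language label changes. MediaParl guarantees that
# switches never happen mid-sentence, so this boundary scan suffices.

def find_language_switches(intervention):
    """Return the indices i at which sentence i starts a new language."""
    switches = []
    for i in range(1, len(intervention)):
        if intervention[i][1] != intervention[i - 1][1]:
            switches.append(i)
    return switches

# Hypothetical intervention of a bilingual speaker:
intervention = [
    ("Merci beaucoup.", "FR"),
    ("Je commence en français.", "FR"),
    ("Ich fahre auf Deutsch weiter.", "DE"),
]
# One switch, at the boundary before sentence index 2.
```

In a real system, the per-sentence labels would themselves come from a language identification component rather than being given, which is exactly what the language identification task defined on this database evaluates.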