Open Source German Distant Speech Recognition: Corpus and Acoustic Model
Stephan Radeck-Arneth(1,2), Benjamin Milde(1), Arvid Lange(1,2), Evandro Gouvêa, Stefan Radomski(1), Max Mühlhäuser(1), and Chris Biemann(1)

(1) Language Technology Group / (2) Telecooperation Group
Computer Science Department, Technische Universität Darmstadt
{stephan.radeck-arneth,milde,biem}@cs.tu-darmstadt.de
{radomski,max}@tk.informatik.tu-darmstadt.de
{egouvea,arvidjl}@gmail.com

Abstract. We present a new freely available corpus for German distant speech recognition and report speaker-independent word error rate (WER) results for two open source speech recognizers trained on this corpus. The corpus was recorded in a controlled environment with three different microphones at a distance of one meter. It comprises 180 different speakers with a total of 36 hours of audio recordings. We show recognition results with the open source toolkits Kaldi (20.5% WER) and PocketSphinx (39.6% WER), making a complete open source solution for German distant speech recognition possible.

Keywords: German speech recognition, open source, speech corpus, distant speech recognition, speaker-independent

1 Introduction

In this paper, we present a new open source corpus of distant-microphone recordings of broadcast-like speech with sentence-level transcriptions. We evaluate the corpus with the standard word error rate (WER) metric for different acoustic models, trained with both Kaldi [1] and PocketSphinx [2]. While similar corpora already exist for the German language (see Table 1), we placed a particular focus on open access by using a permissive CC-BY license, and we ensured a high quality of (1) the audio recordings, by conducting them in a controlled environment with different types of microphones, and (2) the accompanying transcriptions, which were verified by hand.

Each utterance in our corpus was recorded simultaneously over different microphones. We recorded audio from a sizable number of speakers, targeting speaker-independent acoustic modeling. With a dictionary of 44.8k words and the best Kaldi model, we achieve a WER of 20.5%.

1.1 Related Work

Table 1 gives an overview of the major current German speech corpora. Between 1993 and 2000, the Verbmobil [4] project collected around 180 hours of speech for speech translation. The PhonDat [5] corpus was generated during the Verbmobil project and includes [...]

Table 1. Major available corpora containing spoken utterances for the German language.

Name | Type | Size | Recorded
SmartKom [3] | spontaneous, orthographic & prosodic transcription | 12h | synchronized directional & beamformed
Verbmobil [4] | spontaneous, orthographic transcription | approx. 180h | synchronized close-range & far-field & telephone
PhonDat [5] | spontaneous, orthographic transcription | 9.5h | close-range
German Today [6] | read and spontaneous, orthographic transcription | 1000h | single microphone, headset
FAU IISAH [7] | spontaneous | 3.5h | synchronized close-range to far-field
Voxforge | read, orthographic transcription | approx. 55h | varying microphones, usually headsets
Alcohol Language Corpus [8] | read, spontaneous, command & control, orthographic transcription | 300k words | close-range, headset
GER-TV1000h [9] | read, orthographic transcription | approx. 1000h | various microphones, broadcast data
Our speech corpus | read and semi-spontaneous, orthographic transcription | approx. 36h × 3 | synchronized multiple microphones, far-field & beamformed

[...] described in [15].
We employed the KisRecord software (http://kisrecord.sourceforge.net), which supports concurrent recording with multiple microphones. Speakers were presented with text on a screen, one sentence at a time, and were asked to read the text aloud. While the setup is somewhat artificial in that reading differs from speaking freely, it avoids the need to transcribe the audio and thus follows a more cost-effective approach.

For the training part of the corpus, sentences were drawn randomly from three text sources: 175 sentences from the German Wikipedia, 567 utterances from the German section of the European Parliament transcriptions [16], and 177 short commands for command-and-control settings. These text sources were chosen for their free licenses, which allow redistribution of derivatives without restrictions. The test and development sets were recorded at a later date, with new sentences and new speakers. They contain 1028 and 1085 unique sentences, respectively, drawn from Wikipedia, the European Parliament transcriptions, and German sentences crawled from the Internet, distributed equally per speaker. The crawled sentences were collected randomly with a focused crawler [17] and were only selected from sentences found between quotation marks, which exhibit textual content more typical of direct speech. Unlike in the training set, where multiple speakers read the same sentences, every sentence recorded in the test and development sets is unique, for evaluation purposes.

The distance between speaker and microphones was set to one meter, which is realistic for a business meeting setup where microphones could, e.g., be located on a meeting table. There are many more use cases where a distance of one meter is a sensible choice, e.g. in-car speech recognition systems, smart living rooms, or plenary sessions. The entire corpus is available for the following microphones: Microsoft Kinect, Yamaha PSG-01S, and Samson C01U. The most promising and interesting microphone is the Microsoft Kinect, which is in fact a small microphone array and supports speaker direction detection and beamforming, cf. [15]. We recorded the beamformed and the raw signal simultaneously; however, due to driver restrictions, both are single-channel recordings (i.e., it is not possible to apply our own beamforming to the raw signals).

We split the overall data set into training, development, and test partitions such that neither speakers nor sentences overlap across the different sets. The audio data are available in MS Wave format, one file per sentence per microphone. In addition, for each recorded sentence, an XML file is provided that contains the sentence in original and normalized form (cf. Section 3.2) as well as speaker metadata such as gender, age, and region of birth (see the parsing sketch below). Table 2 lists the gender and age distribution: female speakers make up about 30% in all sets, and most speakers are aged between 18 and 30 years. For analysis of dialect influence, the sentence metadata also includes the federal state (Bundesland) where the speaker spent the majority of their life. Although our corpus is too small for training and testing dialect-specific models, this metadata supports a possible future training of regional acoustic models. Most speakers are students who grew up and live in Hesse. Statistics on the number of sentences per speaker for each of the three text sources in the training corpus are given in Table 3; most sentences were drawn from Wikipedia.
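To make the corpus layout more concrete, the following is a minimal sketch of how one per-sentence XML metadata file could be read. The element names (sentence, normalized_sentence, gender, ageclass, region) and the file name are assumptions for illustration only; the actual schema of the released corpus may differ.

```python
# Minimal sketch: reading one per-sentence metadata file from the corpus.
# Element and file names below are assumptions, not the published schema.
import xml.etree.ElementTree as ET

def read_utterance_metadata(xml_path):
    """Return the transcription pair and speaker metadata for one recording."""
    root = ET.parse(xml_path).getroot()

    def text(tag, default=""):
        node = root.find(tag)
        return node.text.strip() if node is not None and node.text else default

    return {
        "sentence": text("sentence"),               # original form
        "normalized": text("normalized_sentence"),  # normalized form (cf. Section 3.2)
        "gender": text("gender"),
        "age": text("ageclass"),
        "region": text("region"),                   # federal state (Bundesland)
    }

if __name__ == "__main__":
    meta = read_utterance_metadata("0001_Kinect-Beam.xml")  # hypothetical file name
    print(meta["normalized"], "|", meta["gender"], meta["age"], meta["region"])
```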
On average, utterances from Europarl are longer than sentences from Wikipedia, and speakers encountered more difficulties in reading Europarl utterances because of their length and domain specificity. Recordings were done in sessions of 20 minutes. Most speakers participated in only one session; some speakers (in the training set) took part in two. Table 3 also compares the audio recording lengths of the three main microphones for each data set. Occasional outages of audio streams occurred during the recording procedure, causing deviations in recording length between the microphones. By ensuring that the training and the dev/test portions have disjoint sets of speakers and textual material, this corpus is well suited for examining approaches to speaker-independent distant speech recognition for German.

Table 2. Speaker gender and age distribution.

Gender | Train | Dev | Test
male | 105 | 12 | 13
female | 42 | 4 | 4

Age | Train | Dev | Test
41–50 | 2 | 0 | 1
31–40 | 17 | 17 | 2
21–30 | 108 | 13 | 10
18–20 | 20 | 1 | 4

Table 3. Mean and standard deviation of training-set sentences read per person per text source, and total audio recording times for each microphone.

Text source | Mean ± StD
Wikipedia | 35 ± 12
Europarl | 8 ± 3
Command | 18 ± 23

Microphone | Duration Train/Dev/Test (h)
Microsoft Kinect | 31 / 2 / 2
Yamaha PSG-01S | 33 / 2 / 2
Samson C01U | 29 / 2 / 2

3 Experiments

In this section, we show how our corpus can be used to build a speaker-independent model in PocketSphinx and Kaldi. We then compare the two speech recognition toolkits, using the same language model and pronunciation dictionary, in terms of word error rate (WER) on held-out data from unseen speakers (the development and test utterances). Our Kaldi recipe has been released on GitHub (https://github.com/tudarmstadt-lt/kaldi-tuda-de) as a package of scripts that supports fully automatic generation of all acoustic models, including automatic downloading and preparation of all needed resources such as the phoneme dictionary and the language model. This makes the Kaldi results easily reproducible.

3.1 Phoneme dictionary

As our goal is to train a German speech recognizer using only freely available sources, we did not rely on, e.g., PHONOLEX (https://www.phonetik.uni-muenchen.de/Bas/BasPHONOLEXeng.html), a very large German pronunciation dictionary with strict licensing. Instead, we compiled our own German pronunciation dictionary using all publicly available phoneme lexicons at the Bavarian Archive for Speech Signals (BAS, ftp://ftp.bas.uni-muenchen.de/pub/BAS/) together with the LGPL-licensed MaryTTS [18] pronunciation dictionary, which has 26k entries. Some of the publicly available BAS dictionaries, like the one for the Verbmobil [4] corpus, also contain pronunciation variants, which were included. The final dictionary covers 44.8k unique German words with 70k total entries, including alternate pronunciations for some of the more common words. Stress markers in the phoneme set were grouped with their unstressed equivalents in Kaldi using the extra_questions.txt file and were removed entirely for training with CMU Sphinx (see the sketch below).
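As a rough illustration of the stress-marker removal for Sphinx training, here is a minimal sketch. It assumes a plain-text lexicon with one word and its space-separated phones per line, and SAMPA-style stress symbols (" for primary, % for secondary stress) prefixed to phones; the actual dictionary files and phone set may differ.

```python
# Minimal sketch of the stress-marker handling described above: for CMU Sphinx
# training, stress markers are stripped so stressed and unstressed phones
# collapse to one unit (in Kaldi, the grouping is done via extra_questions.txt
# instead). Lexicon format and stress symbols are assumptions for illustration.
STRESS_MARKERS = ('"', '%')  # assumed SAMPA-style primary/secondary stress

def strip_stress(phones):
    """Map stressed phone symbols to their unstressed equivalents."""
    return [p.lstrip("".join(STRESS_MARKERS)) for p in phones]

def convert_lexicon(in_path, out_path):
    """Write a stress-free copy of a 'word phone phone ...' lexicon file."""
    seen = set()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            word, phones = parts[0], strip_stress(parts[1:])
            entry = (word, tuple(phones))
            if entry in seen:  # duplicates can arise once stress is removed
                continue
            seen.add(entry)
            fout.write(word + " " + " ".join(phones) + "\n")
```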