Automatic Speech Recognition Using Probabilistic Transcriptions in Swahili, Amharic, and Dinka
Amit Das⋆1,2, Preethi Jyothi⋆2, Mark Hasegawa-Johnson1,2
1Department of ECE, University of Illinois at Urbana-Champaign, USA
2Beckman Institute, University of Illinois at Urbana-Champaign, USA
{amitdas, pjyothi, jhasegaw}@illinois.edu

Abstract

In this study, we develop automatic speech recognition systems for three sub-Saharan African languages using probabilistic transcriptions collected from crowd workers who neither speak nor have any familiarity with the African languages. The three African languages in consideration are Swahili, Amharic, and Dinka. There is a language mismatch in this scenario: utterances spoken in the African languages were transcribed by crowd workers who were mostly native speakers of English. Due to this, such transcriptions are highly prone to label inaccuracies. First, we use a recently introduced technique called mismatched crowdsourcing, which processes the raw crowd transcriptions into confusion networks. Next, we adapt both multilingual hidden Markov models (HMM) and deep neural network (DNN) models using the probabilistic transcriptions of the African languages. Finally, we report results using both deterministic and probabilistic phone error rates (PER). Automatic speech recognition systems developed using this recipe are particularly useful for low-resource languages where there is limited access to linguistic resources and/or transcribers in the native language.

Index Terms: mismatched crowdsourcing, cross-lingual speech recognition, deep neural networks, African languages

1. Introduction

This work is focused on knowledge transfer from multilingual crowd data collected from a set of source (train) languages to a target (test) language that is mutually exclusive to the source set. More specifically, we assume that we have easy access to native transcripts in the source languages but that we do not have native transcripts in the target language. However, mismatched transcripts for the target language (i.e., transcriptions from non-native speakers) can be easily obtained from crowd workers on platforms such as Amazon's Mechanical Turk1 and Upwork.2 An automatic speech recognition (ASR) system trained using transcripts collected from non-native speakers can be particularly useful for low-resource African languages as it circumvents the difficult task of finding native speakers.

We explain some terminology used in this paper. Deterministic transcripts (DT) refer to transcripts collected from native speakers of a language. We assume no ambiguity in these ground truth labels, and hence they are deterministic in nature. As an example, the DT for the word "cat", after converting the labels to IPA phone symbols, can be represented as shown in Fig. 1, with each arc representing a symbol and a probability value. Here, each symbol occurs with probability 1.0. On the other hand, the term probabilistic transcripts (PT) means that the transcripts are probabilistic or ambiguous in nature. Such transcripts frequently occur, for example, when collected from crowd workers who do not speak the language they are transcribing [1]. Usually a training audio clip (in some target language L) is presented to a set of crowd workers who neither speak L nor have any familiarity with it. Due to their lack of knowledge about L, the labels provided by such workers are inconsistent, i.e., a given segment of speech can be transcribed using a variety of labels. This inconsistency can be modeled as a probability mass function (PMF) over the set of labels transcribed by crowd workers. Such a PMF can be graphically represented by a confusion network as shown in Fig. 2. Unlike the DT in Fig. 1, which has a single sequence of symbols, the PT has 3×4×3×4 = 144 possible sequences, one of which could be the right sequence. In this case, it is "k æ ∅ t".

Figure 1: A deterministic transcription (DT) for the word cat. (Three arcs: [k]/1.0 [æ]/1.0 [t]/1.0.)

Figure 2: A probabilistic transcription (PT) for the word cat. (Confusion network with per-slot alternatives: {[k]/0.5, [g]/0.4, ∅/0.1}, {[a]/0.45, [5]/0.35, [æ]/0.10, [E]/0.10}, {∅/0.5, [p]/0.3, [k]/0.2}, {[p]/0.3, [k]/0.3, [a]/0.2, [t]/0.2}.)

Collecting and processing PTs for audio data in the target language L from crowd workers who do not understand L is called mismatched crowdsourcing [1]. The objective of this study is to present a complete ASR training procedure to recognize African languages for which we have PTs but no DTs.3 The following low-resource conditions outline the nature of the data used in this study:
• PTs in Target Language: PTs in the target language L are collected from crowd workers who do not speak L.
• PTs are limited: The amount of PTs available from the crowd workers is limited to only 40 minutes of audio.
• Zero DT in Target Language: There are no DTs in L.
• DTs only in Source Languages: There are DTs from other source languages (≠ L).
• DTs are limited: There are about 40 minutes of audio per language accompanied by their DTs. Hence, the total amount of multilingual DTs available for training is 3.3 hours (≈ 40 minutes/language × # languages).

⋆ first authors
1 http://www.mturk.com
2 http://www.upwork.com
3 In this work, we report phone error rates. Our methodology could be extended to the word level by using our proposed G2P mappings to build a lexicon, in conjunction with word-based language models.

2. Sub-Saharan African Languages

2.1. Swahili

Swahili is a widely spoken language in Southeast Africa with over 15 million speakers. Swahili's written system uses a variant of the Latin alphabet; it consists of digraphs (other than the standard ones like ch, sh, etc.) corresponding to prenasalized consonants that appear in many African languages. Swahili has only five vowel sounds with no diphthongs. More details about Swahili are in [2].

2.2. Amharic

Amharic is the primary language spoken in Ethiopia, with over 22 million speakers. The Amharic script has more than 280 distinct characters (or fidels) representing various consonant+vowel sounds. Ejective consonants and labialized sounds are special characteristics of Amharic's phonology. There are seven vowels and thirty-one consonant sounds in Amharic, and no diphthongs. More details of Amharic phonology are in [3].

2.3. Dinka

We elaborate more on Dinka since it has rarely been covered in the ASR literature. Dinka is a Western Nilotic language, which is a member of the family of Nilo-Saharan languages. It is spoken by over 2 million people living in South Sudan. The four major dialects are Padang, Rek, Agar, and Bor, of which the Rek dialect is considered the standard dialect of Dinka. This study is based on the Rek dialect. The orthography of Dinka closely follows its pronunciation. There are 33 alphabet symbols in the Dinka orthography, which are borrowed from a mixture of Latin and IPA alphabets [4]. Furthermore, 4 out of the 33 symbols are digraphs. The Dinka phonology [5] consists of 7 vowels and 20 consonants, described in more detail below. The set of vowels comprises {/a/, /e/, /E/, /i/, /o/, /O/, /u/}. The vowels often have a creaky quality. With the exception of /u/, these vowels could also have a breathy quality. For example, the breathy version of /a/ is /a̤/, orthographically repre-

Cantonese, Mandarin, Arabic, Urdu. Out of these, the sub-Saharan languages - Swahili, Amharic, Dinka - were considered as the target languages, as the focus of this study is on African languages. The remaining languages represent the set of source languages. DTs for the target languages were used only during the evaluation stage; they were never used in the training stage.

The SBS podcasts were not entirely homogeneous in the target language and contained utterances interspersed with segments of music and English. An HMM-based language identification system was used to isolate regions that correspond mostly to the target language. These long segments were then split into smaller 5-second chunks. The short segments make it easier for crowd workers to annotate since they are unfamiliar with the utterance language. More than 2500 Turkers participated in these tasks, with roughly 30% of them claiming to know only English. The remaining Turkers claimed to know other languages such as Spanish, French, German, Japanese, and Chinese. Since English was the most common language among crowd workers, they were asked to annotate the sounds in the 5-second utterances using English letters that most closely matched the audio. The sequence of letters was not meant to correspond to meaningful English words or sentences, as this was found to be detrimental to the final performance [8]. PTs and DTs, worth about 1 hour of audio, were collected from crowd workers and native transcribers respectively. Thus, the training set consists of a) about 40 minutes of PTs in the target language and b) about 40 minutes of DTs in other source languages excluding the target language. The development and test sets were worth roughly 10 minutes each. To accumulate the PTs, each utterance was transcribed by 10 distinct Turkers.

First, the letters in the transcripts were mapped to IPA symbols using a misperception G2P model learned from the source languages. More specifically, the misperceptions of the crowd workers were approximated by letter-to-phone mappings learned from mismatched transcripts and their corresponding DTs in the source languages. No target language data are used while estimating the misperception G2P model since we assume there are no DTs in the target language. Multiple mismatched transcripts, collected for the same utterance, were then merged into a compact structure by aligning the sequences (after defining equivalence classes corresponding to similar sounds; e.g.
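To make the confusion-network view of a PT concrete: a network like the one in Fig. 2 is simply a sequence of per-slot PMFs. The sketch below is illustrative only (the variable names and helper function are not from the paper); it stores the Fig. 2 network, counts the candidate phone sequences it encodes, and scores the reference path "k æ ∅ t".

```python
import math

# A PT as a confusion network: one PMF over phone labels per slot.
# Slot values follow Fig. 2; "∅" marks a deletion (no phone emitted).
pt = [
    {"k": 0.5, "g": 0.4, "∅": 0.1},
    {"a": 0.45, "5": 0.35, "æ": 0.10, "E": 0.10},
    {"∅": 0.5, "p": 0.3, "k": 0.2},
    {"p": 0.3, "k": 0.3, "a": 0.2, "t": 0.2},
]

# Number of distinct label sequences the network encodes: 3 * 4 * 3 * 4.
n_paths = math.prod(len(slot) for slot in pt)

def path_prob(path):
    """Probability of one label sequence, treating slots as independent."""
    return math.prod(slot[sym] for slot, sym in zip(pt, path))

# The correct transcription is one of the 144 paths, though not the
# most likely one: 0.5 * 0.10 * 0.5 * 0.2 = 0.005.
p_ref = path_prob(["k", "æ", "∅", "t"])
```

Note that the single most likely path is just the per-slot argmax, which here would prefer [a] over [æ] in the second slot; the point of keeping the whole network rather than the one-best path is exactly that the truth may sit on a lower-probability arc.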
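The misperception G2P model described above can be thought of as a noisy channel P(letter | phone) estimated from source-language data only. The count-based sketch below is a simplification under stated assumptions: it presumes per-symbol (phone, letter) alignments between source-language DTs and their mismatched transcripts are already available, whereas the paper learns such mappings within a transducer framework; the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_misperception(aligned_pairs):
    """Estimate P(letter | phone) by relative frequency from (phone, letter)
    pairs drawn from source-language DTs aligned with mismatched transcripts."""
    counts = defaultdict(Counter)
    for phone, letter in aligned_pairs:
        counts[phone][letter] += 1
    return {
        phone: {letter: n / sum(c.values()) for letter, n in c.items()}
        for phone, c in counts.items()
    }

# Toy aligned data (hypothetical): crowd workers often write /k/ as "c".
pairs = [("k", "c"), ("k", "c"), ("k", "k"), ("æ", "a"), ("t", "t")]
model = estimate_misperception(pairs)
# model["k"] is a PMF over the English letters used for phone /k/
```

Because the counts come only from source languages with DTs, this respects the zero-DT condition on the target language: inverting the channel then maps crowd letters back to plausible target phones.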
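Once the Turkers' letter strings have been mapped to phone sequences and aligned, merging them into a single confusion network reduces to normalizing per-slot counts. The sketch below assumes the hard part, aligning the sequences under equivalence classes of similar sounds, has already produced equal-length sequences with ∅ filling deletions; the function and example data are illustrative, not the paper's implementation.

```python
from collections import Counter

def merge_aligned(transcripts):
    """Merge already-aligned phone sequences (equal length, ∅ = deletion)
    into a confusion network: one relative-frequency PMF per slot."""
    slots = []
    for column in zip(*transcripts):
        counts = Counter(column)
        total = sum(counts.values())
        slots.append({sym: n / total for sym, n in counts.items()})
    return slots

# Three (hypothetical) aligned transcriptions of the same utterance.
network = merge_aligned([
    ["k", "a", "∅", "t"],
    ["k", "æ", "∅", "t"],
    ["g", "a", "p", "t"],
])
# network[0] == {"k": 2/3, "g": 1/3}: worker disagreement becomes a PMF
```

Each slot's PMF is exactly the kind of distribution drawn as a column of arcs in Fig. 2; with 10 transcribers per utterance, as in this study, the relative frequencies become a usable estimate of the label distribution.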