arXiv:2002.11800v1 [cs.CL] 26 Feb 2020
UNIVERSAL PHONE RECOGNITION WITH A MULTILINGUAL ALLOPHONE SYSTEM

Xinjian Li†, Siddharth Dalmia†, Juncheng Li†, Matthew Lee◦, Patrick Littell△, Jiali Yao, Antonios Anastasopoulos†, David R. Mortensen†, Graham Neubig†, Alan W Black†, Florian Metze†

†Carnegie Mellon University; ◦SIL International; △National Research Council Canada; ByteDance AI Lab

ABSTRACT

Multilingual models can improve language processing, particularly for low-resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This can lead to performance degradation when combining a variety of training languages, as identically annotated phonemes can actually correspond to several different underlying phonetic realizations. In this work, we propose a joint model of both language-independent phone and language-dependent phoneme distributions. In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute in low-resource conditions. Additionally, because we are explicitly modeling language-independent phones, we can build a (nearly-)universal phone recognizer that, when combined with the PHOIBLE [1] large, manually curated database of phone inventories, can be customized into 2,000 language-dependent recognizers. Experiments on two low-resourced indigenous languages, Inuktitut and Tusom, show that our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.¹

¹ A web demo is available at https://www.dictate.app; the pretrained model will be released at https://github.com/xinjli/allosaurus

Index Terms: multilingual speech recognition, universal phone recognition, phonology

1. INTRODUCTION

There is an increasing interest in building speech tools benefiting low-resource languages, specifically multilingual models that can improve low-resource recognition using rich resources available in other languages like English and Mandarin. One standard tool for recognition in low-resource languages is multilingual acoustic modeling [2]. Acoustic models are generally trained on parallel data of speech waveforms and phoneme transcriptions. Importantly, phonemes are perceptual units of sound that closely correlate with, but do not exactly correspond to, the actual sounds that are spoken: phones. An example of this is shown in Figure 1, which shows two English words that share the same phoneme /p/ but differ in their actual phonetic realizations, [p] and [pʰ]. Allophones, the sets of phones that correspond to a particular phoneme, are language specific; distinctions that are important in some languages are not important in others.

[Figure 1]
ENGLISH: peak /pik/ [pʰik]; speak /spik/ [spik]
MANDARIN CHINESE: ping 'level' /pʰiŋ/ [pʰiŋ]; bing 'ice' /piŋ/ [piŋ]
Fig. 1: Words, phonemes (slashes), and phones (square brackets).

Most multilingual acoustic models simply use existing phoneme transcriptions as-is, taking the union of the phoneme sets to be shared by all training languages [3, 4, 5, 6, 7]. The assumption is reasonable under some circumstances, as phoneme names are typically associated with their most common or least marked allophone. However, this is obviously an over-simplistic view: in Figure 1, for example, this would mean that all training in English would assign the phones [p] and [pʰ] to phoneme /p/. This is detrimental if we want to recognize Mandarin Chinese, for instance, where the two phones correspond to two distinct phonemes, /p/ and /pʰ/.

In this paper, we propose a novel method for multilingual recognition based on phonetic annotation to tackle this problem: Allosaurus (allophone system of automatic recognition for universal speech). Our method incorporates knowledge of phonology into the multilingual model through an allophone layer, which associates a universal narrow phone set with the phonemes that appear in the transcription of each language. Our model first computes the phone distribution using a standard ASR encoder, then the allophone layer maps the phone distribution into the phoneme distribution for each language. This model can be trained end-to-end using only standard phonemic transcriptions and an allophone list created by phoneticians. The allophone layer is first initialized with the allophone list, then is further optimized during the training process. We demonstrate that accounting for the phoneme-phone mismatch in this way improves the accuracy of multilingual acoustic modeling by 2.0% phoneme error rate in low-resource conditions.

Furthermore, the architecture simultaneously makes it possible to perform universal phone recognition. Previous approaches cannot perform phone recognition in a universal fashion as they depend on language-specific phonemes, as illustrated with the previous example of English not distinguishing /p/ and /pʰ/ as required in Mandarin. In contrast, because our approach allows recognition of phones directly, it has already learned to make these fine-grained distinctions. Taking advantage of this fact, we incorporate a large phone inventory database collected by linguists, PHOIBLE [1], and demonstrate that our phone recognizer can be customized to recognize over 2000 languages without any training data in the languages themselves. By evaluating the recognizer with completely unseen testing languages, we found that our recognizer achieves 17% better performance absolute compared with the traditional approach.

[Figure 2 panels: Shared Phoneme Model (left), Private Phoneme Model (middle), Allosaurus (right)]
Fig. 2: Traditional approaches predict phonemes directly, either for all languages (left) or separately for each language (middle). On the contrary, our approach (right) predicts over a shared phone inventory, then maps into language-specific phonemes with an allophone layer.

2. RELATED WORK

While some recent work in multilingual ASR focuses on end-to-end models to directly predict graphemes [8, 9], most systems still depend on phonetically inspired acoustic models. Multilingual acoustic models fall into two groups. The first group, shared phoneme models, creates a shared phoneme inventory of all phonemes from all training languages [3, 4, 5, 6, 7, 10]. The second group, private phoneme models, treats phonemes from each language as completely different classes and performs phoneme classification separately for each language [11, 2, 12]. However, these two groups have their own respective drawbacks: the first group fails to consider the disconnect between the phonemes across languages while the second group completely

3. APPROACH

3.1. Phone-Phoneme Annotation

Suppose there are $|L|$ training languages, and each language $L_i$ has its own phoneme inventory $Q_i$, which can be easily obtained by enumerating the phonemes appearing in its annotated training data. Most traditional multilingual approaches handle inventories at the phoneme level, and create a shared phoneme inventory $Q_{sha}$ by taking the union of the phoneme sets:

\[ Q_{sha} = \bigcup_{1 \le i \le |L|} Q_i \tag{1} \]

In contrast, our method distinguishes phonemes from their phone realizations. We have linguists annotate each phoneme $q \in Q_i$ with its corresponding allophone set $P_q^i$, where each phone $p \in P_q^i$ is a realization of $q$ in language $L_i$. Merging these sets for all languages, we obtain the universal phone inventory $P_{uni}$:

\[ P_{uni} = \bigcup_{1 \le i \le |L|} \bigcup_{q \in Q_i} P_q^i \tag{2} \]

Additionally, we obtain a signature matrix $S^i \in \{0, 1\}^{|Q_i| \times |P_{uni}|}$ describing the association of phones and phonemes in each language $L_i$. Suppose the phoneme $q \in Q_i$ has the row index $j$, where $1 \le j \le |Q_i|$, and the phone $p \in P_{uni}$ has the column index $k$, where $1 \le k \le |P_{uni}|$. If $p$ is a realization of $q$, then cell $(j, k)$ of $S^i$ has a value of 1; otherwise it is assigned 0.

3.2. Allophone Layer

As mentioned in Section 2, traditional multilingual models can be divided into two groups. The first group, shared phoneme models (Figure 2 left), predicts phoneme distributions over the shared phoneme inventory $Q_{sha}$. The second group, private phoneme models (Figure 2 middle), on the other hand, shares a common encoder but computes a distribution over the private phoneme inventory $Q_i$ for each language $L_i$. These approaches handle phonemes directly with no concept of underlying phones.

In contrast, our proposed approach, Allosaurus (Figure 2 right), comprises a language-independent encoder and phone predictor, and a language-dependent allophone layer and a loss function associated with each language. The encoder first produces the distribution $h \in \mathbb{R}^{|P_{uni}|}$ over the universal phone inventory $P_{uni}$, then the allophone layer transforms $h$ into the phoneme distribution $g^i \in \mathbb{R}^{|Q_i|}$ of each language. The allophone layer uses a trainable allophone matrix $W^i \in \mathbb{R}^{|Q_i| \times |P_{uni}|}$ to describe allophones in a similar way to $S^i$. The allophone matrix $W^i$ is first initialized with $S^i$, and is allowed to be optimized during the training process, but we add an L2 penalty to penalize divergence from the original signature matrix $S^i$.
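The constructions in Sections 3.1 and 3.2 can be illustrated with a minimal sketch in plain Python. The toy allophone lists, the function names, and the ASCII labels `ph` and `ng` (standing in for [pʰ] and [ŋ]) are hypothetical and chosen only for illustration. The max-pooling used to map phone scores to phoneme scores is an assumption, since the text up to this point says only that the allophone layer transforms h into g^i; the penalty is written as a plain squared L2 distance without a weighting coefficient.

```python
# Toy allophone lists (hypothetical), loosely following the Figure 1 example.
# Each language maps a phoneme to the set of phones that realize it.
ALLOPHONES = {
    "eng": {"p": {"p", "ph"}, "i": {"i"}, "k": {"k"}},
    "cmn": {"p": {"p"}, "ph": {"ph"}, "i": {"i"}, "ng": {"ng"}},
}


def universal_inventory(allophones):
    """Eq. (2): union of all allophone sets over all languages (sorted for stable indices)."""
    phones = set()
    for phoneme_map in allophones.values():
        for phone_set in phoneme_map.values():
            phones |= phone_set
    return sorted(phones)


def signature_matrix(phoneme_map, p_uni):
    """Build S^i: cell (j, k) is 1.0 iff phone k is a realization of phoneme j."""
    phonemes = sorted(phoneme_map)  # row order of Q_i
    matrix = [[1.0 if p in phoneme_map[q] else 0.0 for p in p_uni] for q in phonemes]
    return phonemes, matrix


def allophone_layer(w, h):
    """Map phone scores h to phoneme scores g.

    ASSUMPTION: g_j = max_k w[j][k] * h[k] (max-pooling over allophones);
    the operation is not specified in the section above.
    """
    return [max(w_jk * h_k for w_jk, h_k in zip(row, h)) for row in w]


def l2_penalty(w, s):
    """Squared L2 distance between the trainable matrix W^i and the signature S^i."""
    return sum(
        (w_jk - s_jk) ** 2
        for w_row, s_row in zip(w, s)
        for w_jk, s_jk in zip(w_row, s_row)
    )


if __name__ == "__main__":
    p_uni = universal_inventory(ALLOPHONES)
    h = [0.1, 0.05, 0.1, 0.15, 0.6]  # made-up phone scores, one per entry of p_uni
    for lang, phoneme_map in ALLOPHONES.items():
        phonemes, s = signature_matrix(phoneme_map, p_uni)
        w = [row[:] for row in s]  # W^i starts as a copy of S^i
        print(lang, dict(zip(phonemes, allophone_layer(w, h))), l2_penalty(w, s))
```

With the same phone scores, the English matrix pools [p] and [pʰ] into the single phoneme /p/, while the Mandarin matrix keeps them as separate phonemes, which is the distinction the shared phone inventory is meant to preserve.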