Towards Zero-Shot Learning for Automatic Phonemic Transcription
The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W Black, Florian Metze
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
{xinjianl, sdalmia, dmortens, junchenl, awb, fmetze}@cs.cmu.edu

Abstract

Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data. The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes. In this work, we address this problem by adopting the idea of zero-shot learning. Our model is able to recognize unseen phonemes in the target language without any training data. In our model, we decompose phonemes into corresponding articulatory attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over articulatory attributes, and then compute phoneme distributions with a customized acoustic model. We evaluate our model by training it on 13 languages and testing it on 7 unseen languages. We find that it achieves a 7.7% better phoneme error rate on average than a standard multilingual model.

Introduction

Over the last decade, automatic speech recognition (ASR) has achieved great success in many rich-resourced languages such as English, French and Mandarin. On the other hand, speech resources are still sparse for the majority of other languages, which therefore cannot benefit directly from recent technologies. As a result, there is increasing interest in building speech processing systems for low-resource languages. In particular, phoneme transcription tools are useful for low-resource language documentation, as they improve the workflow for linguists analyzing those languages (Adams et al. 2018; Michaud et al. 2018).

A more challenging task is to transcribe phonemes in a language with zero training data. This task has significant implications for documenting endangered languages and preserving the associated cultures (Gippert et al. 2006). This data setup has mainly been studied in the unsupervised speech processing field (Glass 2012; Versteegh et al. 2015; Hermann and Goldwater 2018), which typically uses unsupervised techniques to learn representations that can then be applied to speech processing tasks.

However, those unsupervised approaches cannot generate phonemes directly, and there has been little work on zero-shot learning for transcribing unseen phonemes, which consists of learning an acoustic model without any audio or text data for the given target language and its unseen phonemes. In this work, we aim to solve this problem: transcribing unseen phonemes of unseen languages without any target data, audio or text.

The prediction of unseen objects has been studied for a long time in the computer vision field. For specific object classes such as faces, vehicles and cats, a significant amount of manually labeled data is usually available, but collecting sufficient data for every object a human can recognize is impossible. Zero-shot learning attempts to solve this problem by classifying unseen objects using mid-level side information. For example, a zebra can be recognized by detecting attributes such as striped, black and white. Inspired by approaches in computer vision research, we propose the Universal Phonemic Model (UPM), which applies zero-shot learning to acoustic modeling. In this model, we decompose each phoneme into its attributes and learn to predict a distribution over various articulatory attributes. For example, the phoneme /a/ can be decomposed into its attributes: vowel, open, front and unrounded. These attributes can then be used to infer the unseen phonemes of the test language, since unseen phonemes decompose into common attributes covered by the training phonemes.

Our approach is summarized in Figure 1. First, frames are extracted and a standard acoustic model is applied to map each frame into the acoustic space (or hidden space) H. Next, we transform it into the attribute space P, which reflects the articulatory distribution of each frame (such as whether it indicates a vowel or a consonant). Then, we compute the distribution of phonemes for that frame using a predefined signature matrix S, which describes the relationships between articulatory attributes and phonemes in each language.

Figure 1: Illustration of the proposed zero-shot learning framework. Each utterance is first mapped into the acoustic space (or hidden space) H. Then we transform each point in the acoustic space into the attribute space P with a linear transformation V. Finally, phoneme distributions are obtained by applying a signature matrix S.
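To make this pipeline concrete, the sketch below shows only the final step: pooling per-frame attribute scores into per-frame phoneme distributions through a language-specific signature matrix. It is a minimal illustration, not the paper's implementation; the function name, the use of NumPy, and the choice of summing attribute scores followed by a softmax are assumptions.

```python
import numpy as np

def phoneme_distribution(attribute_scores: np.ndarray,
                         signature: np.ndarray) -> np.ndarray:
    """Turn per-frame attribute scores into per-frame phoneme distributions.

    attribute_scores: (T, A) scores over A articulatory attributes for T frames
                      (the output of the attribute-space transformation).
    signature:        (A, P) binary matrix; signature[a, p] = 1 if phoneme p of
                      the target language carries attribute a.
    Returns a (T, P) matrix whose rows are phoneme distributions.
    """
    # Score each phoneme by accumulating the scores of the attributes it carries.
    phoneme_scores = attribute_scores @ signature            # (T, P)
    # Softmax over the phoneme axis so each frame yields a proper distribution.
    phoneme_scores -= phoneme_scores.max(axis=1, keepdims=True)
    probs = np.exp(phoneme_scores)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because the signature matrix is defined per language, the same trained attribute predictor can be reused for an unseen language simply by swapping in that language's signature matrix.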
To evaluate our UPM approach, we trained the model on 13 languages and tested it on another 7 languages. We also trained a multilingual acoustic model as a baseline for comparison. The results indicate that we consistently outperform the baseline multilingual model, achieving a 7.7% improvement in phoneme error rate on average.

The main contributions of this paper are the following:
1. We propose the Universal Phonemic Model (UPM), which can recognize phonemes unseen during training by incorporating knowledge from the phonetics/phonology domain.
2. We introduce a sequence prediction model that integrates the zero-shot learning framework into a sequence prediction problem.
3. We show that our model is effective for 7 different languages, obtaining a 7.7% better phoneme error rate than the baseline on average.

Approach

This section explains the details of our Universal Phonemic Model (UPM). First, we describe how we constructed a proper set of articulatory attributes for acoustic modeling. Next, we demonstrate how to assign attributes to each phoneme by giving an algorithm that parses the X-SAMPA format. Finally, we show how we integrate this phonetic information into a sequence model with a CTC loss (Graves et al. 2006).

Articulatory Attributes

Unlike attributes in the computer vision field, the attributes of phonemes are independent of any corpus or dataset; they are well investigated and defined in the domain of articulatory phonetics (Ladefoged and Johnson 2014). Articulatory phonetics describes the mechanisms of speech production, such as the manner and place of articulation, and it tends to describe phones using discrete features such as voiced, bilabial (made with the two lips) and fricative. These articulatory features have been shown to be useful in speech recognition (Kirchhoff 1998; Stüker et al. 2003b; Müller et al. 2017), and are a good choice of attributes for our purpose. We describe some categories of articulatory attributes below.

Consonants. Consonants are formed by obstructing the airstream through the vocal tract. They can be categorized in terms of the placement and the manner of this obstruction. The placements can be largely divided into three classes: labial, coronal and dorsal, each of which has more fine-grained subclasses. The manners of articulation can be grouped into stop, fricative, approximant, etc.

Vowels. In the production of vowels, the airstream is relatively unobstructed. Each vowel sound can be specified by the positions of the lips and tongue (Ladefoged and Johnson 2014). For instance, the tongue is at its highest point in the front of the mouth for front vowels. Additionally, vowels can be characterized by properties such as whether the lips are rounded or not (rounded, unrounded).

Diacritics. Diacritics are small marks attached to vowels and consonants to modify them. For instance, nasalization marks a sound for which the velopharyngeal port is open and air can pass through the nose. To keep the articulatory attribute set manageable, we assign diacritics to existing consonant attributes when they share a similar articulatory property. For example, nasalization is treated as the nasal attribute of consonants.

In addition to the articulatory attributes mentioned above, we also need to allocate a special attribute for blank in order to predict blank labels in the CTC model and backpropagate their gradients into the acoustic model. Thus, our articulatory attribute set A_phone is defined as the union of these three domains of attributes together with the blank label:

A_phone = A_consonants ∪ A_vowels ∪ A_diacritics ∪ {blank}
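As a small illustration of this union, the sketch below builds such an attribute inventory in code; the concrete attribute names are examples drawn from the categories above, not the paper's full inventory.

```python
# Illustrative attribute inventory; the members listed are examples only.
A_consonants = {"consonant", "labial", "coronal", "dorsal",
                "stop", "fricative", "approximant", "nasal", "voiced"}
A_vowels = {"vowel", "front", "back", "open", "close", "rounded", "unrounded"}
A_diacritics = {"long", "aspirated"}   # most diacritics fold into the sets above
BLANK = "<blank>"                      # dedicated attribute for the CTC blank label

# A_phone = A_consonants ∪ A_vowels ∪ A_diacritics ∪ {blank}
A_phone = A_consonants | A_vowels | A_diacritics | {BLANK}

# Example decomposition from the paper: /a/ -> {vowel, open, front, unrounded}.
assert {"vowel", "open", "front", "unrounded"} <= A_phone
```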
Figure 2: Illustration of the sequence model for zero-shot learning. The input is first processed with a Bidirectional LSTM acoustic model, which produces a distribution over articulatory attributes. This is then transformed into a phoneme distribution by a language-dependent signature matrix S.

Attribute Assignment

Next, we need to assign appropriate attributes to each phoneme. There are multiple approaches to retrieving articulatory attributes. The simplest is to use existing tools to collect articulatory features for each phoneme (Mortensen et al. 2016). However, those tools only provide coarse-grained phonological features, whereas we need more fine-grained and customized articulatory features. We therefore define an assignment function f: P_xsampa → 2^{A_phone}, where the domain P_xsampa is the set of all valid X-SAMPA phonemes and the range is the subset of articulatory attributes for each phoneme. The assignment function should map each phoneme to its corresponding subset of A_phone. To construct the function over the entire domain P_xsampa, we first manually map a small subset P_base ⊂ P_xsampa, yielding a restricted assignment function f_{P_base}: P_base → 2^{A_phone}. This mapping is customizable and has been verified against the IPA handbook (Decker and others 1999). Then, for every phoneme p ∈ P_xsampa, we repeatedly remove diacritic suffixes from it until it can be mapped by the restricted function f_{P_base}.
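A minimal sketch of this assignment procedure is given below, assuming a hand-built base table and simple character-by-character suffix stripping; the table entries and the stripping rule are illustrative, not the paper's verified P_base mapping or its exact X-SAMPA parsing algorithm.

```python
# Illustrative restricted assignment table over P_base (not the verified mapping).
BASE_ASSIGNMENT = {
    "a": {"vowel", "open", "front", "unrounded"},
    "m": {"consonant", "labial", "nasal", "voiced"},
    "s": {"consonant", "coronal", "fricative"},
    # ... remaining P_base entries
}

def assign_attributes(phoneme: str) -> set:
    """Map an X-SAMPA phoneme to its attribute subset.

    Phonemes outside the base table are reduced by repeatedly stripping one
    trailing character (X-SAMPA diacritics such as "_h" attach as suffixes)
    until the remainder is found in the base table.
    """
    reduced = phoneme
    while reduced and reduced not in BASE_ASSIGNMENT:
        reduced = reduced[:-1]   # drop one trailing diacritic character
    if not reduced:
        raise KeyError(f"cannot reduce {phoneme!r} to a known base phoneme")
    return BASE_ASSIGNMENT[reduced]

# Example: assign_attributes("a~") falls back to the "a" entry after the
# nasalization mark "~" is stripped.
```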