
Building a Universal Phonetic Model for Zero-Resource Languages

Paul Moore

MInf Project (Part 2) Interim Report
Master of Informatics
School of Informatics
University of Edinburgh
2020


Abstract

Being able to predict phones from audio is a challenge in and of itself, but what about unseen phones from different languages? In this project, work was done towards building precisely this kind of universal phonetic model. Using the GlobalPhone corpus, phones' articulatory features, a recurrent neural network, open-source libraries, and an innovative prediction system, a model was created to predict phones based on their features alone. The results show promise, especially for using these models on languages within the same family.

Acknowledgements

Once again, a huge thank you to Steve Renals, my supervisor, for all his assistance. I greatly appreciated his practical advice and reasoning when I got stuck, or things seemed overwhelming, and I'm very thankful that he endorsed this project.

I'm immensely grateful for the support my family and friends have provided in the good times and bad throughout my studies at university. A big shout-out to my flatmates Hamish, Mark, Stephen and Iain for the fun and laughter they contributed this year. I'm especially grateful to Hamish for being around during the isolation from Coronavirus and for helping me out in so many practical ways when I needed time to work on this project.

Lastly, I wish to thank Jesus Christ, my Saviour and my Lord, who keeps all these things in their proper perspective, and gives me strength each day.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Project outline
  1.3 Previous project work

2 Modelling phones
  2.1 Phones vs. phonemes
  2.2 Standard phone modelling
    2.2.1 Feature extraction
    2.2.2 Monophone models
    2.2.3 Basic triphone models
    2.2.4 Advanced triphone models
    2.2.5 Limitations of standard models
  2.3 Deep learning
    2.3.1 Recurrent Neural Networks (RNNs)
    2.3.2 Long Short-Term Memory (LSTM)
    2.3.3 RMSProp optimisation
    2.3.4 Connectionist Temporal Classification (CTC) loss
    2.3.5 Miscellaneous techniques
  2.4 Universal phone models
    2.4.1 General concepts
    2.4.2 Modelling unseen phones
    2.4.3 Universal phone modelling with attributes

3 General setup
  3.1 The GlobalPhone Dataset
    3.1.1 Suitability analysis
  3.2 File preparation
    3.2.1 Kaldi
    3.2.2 Conversion and preliminary cleaning
    3.2.3 Splitting the data
    3.2.4 Standardising phones
    3.2.5 Generating transcriptions
    3.2.6 Generating input features
  3.3 Organising experiments
    3.3.1 Additional filtering
    3.3.2 Converting phones to attributes
    3.3.3 Dealing with diphthongs
  3.4 Using PyTorch-Kaldi
    3.4.1 Adapting input alignments
    3.4.2 Cost function
    3.4.3 Model saving
    3.4.4 Chunk sizes
    3.4.5 Network structure
    3.4.6 Gradient issues
  3.5 Predicting phones from attributes
    3.5.1 Distance metrics
    3.5.2 Initial split
    3.5.3 Decision trees
    3.5.4 Universal scoring
  3.6 Evaluation
    3.6.1 Issues with decoding
    3.6.2 Alternative evaluation metrics

4 Experiments
  4.1 Experiment 1: Shallow models
    4.1.1 Research questions
    4.1.2 Setup
    4.1.3 Results
  4.2 Experiment 2: Baseline network
    4.2.1 Research questions
    4.2.2 Setup
    4.2.3 Results
  4.3 Experiment 3: Attribute network
    4.3.1 Research questions
    4.3.2 Setup
    4.3.3 Results
  4.4 Experiment 4: Cross-lingual investigations
    4.4.1 Research questions
    4.4.2 Setup
    4.4.3 Results

5 Conclusions
  5.1 Future work
    5.1.1 Fixing GlobalPhone
    5.1.2 Phonetic attribute improvements
    5.1.3 Replacing PyTorch-Kaldi
    5.1.4 Network structure improvements
  5.2 Results summary

Bibliography

A Universal Phone Set
  A.1 Base Phones
  A.2 Extensions
  A.3 Phone maps

B Dataset splits
  B.1 Speaker lists
  B.2 Dataset statistics

C Phone errors in baseline network

D Confusion matrices for attribute networks

E Phone distributions for attribute network

Chapter 1

Introduction

““Come, let us go down and confuse their language so they will not understand each other”... That is why [the city] was called Babel—because there the Lord confused the language of the whole world.” ∗

1.1 Motivation

The above-quoted tale of the Tower of Babel, where humanity's single language was split into different ones, has had a profound cultural impact that continues to this day. In Douglas Adams's The Hitchhiker's Guide to the Galaxy, the so-called Babel fish is capable of translating any spoken language. While any organism or computer system with the ability to instantly reverse the “Babel effect” remains firmly in the area of science fiction for the present, there are related problems which may be more solvable.

Worldwide there are nearly 3,000 unwritten languages [Eberhard et al., 2020]. Most of these are likely to have little to no audio data available either. According to Austin and Sallabank [2011], linguists believe that around 50-90% of the 7,000 languages worldwide will go extinct within this century, which doubtless includes the vast majority of unwritten ones.

Some linguists have argued that this is a natural process, and we should do little to interfere with it ([Mufwene, 2004], [Ladefoged, 1992]). However, numerous other linguists believe that it is important to preserve them if possible, since these languages are an integral part of the society and culture they are in, and are a key component of human identity ([Austin and Sallabank, 2011], [Romaine, 2007]).

When trying to save any endangered language, a key factor is to have a writing system for it. This empowers members of these people groups to read and write their own language, not just speak/hear it. Consequently, cultural stories or traditions can be written down in their original languages, and people will be able to communicate in written fashion in their native language, along with a whole host of other benefits.

∗Genesis 11:7,9 (NIV)


In fact, such communication may be an important motivator for speakers of these languages to preserve their language. Otherwise, a more common written language may be very attractive, particularly to younger members as they interact with the modern world. After all, they may reason, why continue using a language which is less convenient for common activities such as text or email? Books or other reading materials are also a powerful impetus for perpetuating the use of such a language.

However, in order to develop a writing system, an alphabet is required. Linguists need to work out the phonetic structure of a language and use this to decide on how to represent the sounds in writing. The task of discovering these phones is challenging, and often requires a great deal of time and effort. The International Phonetic Alphabet (IPA) [Smith, 1999] is frequently used to standardise the transcription of phones.

Building a universal phonetic model, thus providing a way to model all the phones in the IPA, would make this undertaking considerably easier, with a phonetic transcription based on nothing other than the audio. Even if an accurate transcription proved difficult, recurring phonetic features could be highlighted, which would be beneficial.

1.2 Project outline

The existing methods for modelling phones, particularly in a universal model, will be discussed first. Then, the general experimental setup used across most of the experiments will be given. The experiments themselves aim to answer the following questions:

• What is a reasonable baseline, using non-universal phones?
• Which feature types are better for training?
• Does training on languages within the same family improve performance for unseen languages within the same family, or is it better to have as many different languages as possible?

Finally, directions for potential future work will be outlined, and overall findings summarised.

1.3 Previous project work

Certain aspects of work from last year's project [Moore, 2019] were reused. While previously the focus was on language identification, the goal for this year, as stated in this introduction, was quite different. Some of the scripts for working with Kaldi [Povey et al., 2011] and the GlobalPhone dataset [Schultz, 2002] were reused and/or improved. Furthermore, a common focus in both projects has been working on models which could be applicable in areas of the world with little to no transcribed language resources.

Chapter 2

Modelling phones

In this chapter, the basic principles of building models for representing phones will be covered. Based on these simpler models, ways to apply these principles in a multilingual or universal sense will be explored. There will also be a brief section on relevant deep learning techniques which were used in the course of this project.

2.1 Phones vs. phonemes

To begin, one important distinction to make is the difference between phones and phonemes, as these will be referred to throughout the rest of this report. A phoneme is the smallest structural unit distinguishing the meaning between sounds in a language. If one phoneme is swapped with another, the meaning of the word changes. Phonemes are consequently language-specific.

A phone, on the other hand, is the acoustic realisation of a phoneme (how it actually sounds). Phones are not language-specific. Allophones are phones that correspond to the same underlying phoneme.

This can be illustrated by the following examples (taken from [Coxhead, 2008]): the word cat is made of three distinct unit sounds, replacing one of which will change the meaning of the word. There are three phones corresponding to three phonemes.

In English, the /p/ and /ph/ phones are not interpreted as distinct: for instance, the “p” sounds in pin [phIn] and spin [spIn] would be taken by native speakers to refer to the same phoneme. However, in Hindi, /p/ and /ph/ do have a distinction between them, so they are separate phonemes.

It is useful to keep this distinction in mind so that it is clear what is meant by a universal (or monolingual) phone model. The goal with such models is to represent the phone based on input sounds, rather than discovering the phoneme set of a language (which is considerably more challenging and requires much more human linguistic effort).


2.2 Standard phone modelling

This section outlines typical techniques for modelling phones which were used to generate the early models in this project (section 4.1). Since they are well-documented, relatively common, and not the primary focus of the project, their descriptions will be kept at a fairly high level.

2.2.1 Feature extraction

The input feature vectors used in most phone recognition systems are often Mel-Frequency Cepstral Coefficients (MFCCs). The full details of how exactly MFCCs are calculated are outlined in Zheng et al. [2001].

MFCCs provide a vector representation of a speech signal at multiple “frames”, where a single frame is computed across a window of the signal (usually 25ms long), with a step between frames (usually 10ms). An MFCC vector is typically 13- to 23-D. In some cases, the dimensionality is increased by adding the changes between each feature in consecutive vectors (deltas) and the change between those changes (delta-deltas). When adding delta and delta-delta features (as done for the triphone models in this project, section 4.1), the resulting feature vectors are then typically 39- to 69-D.

The acoustic features in MFCCs are decorrelated, which makes them useful for inclusion in machine-learning techniques such as training Gaussian Mixture Models (GMMs). For GMMs, high correlation would require the Gaussians to have full covariance, or there would need to be a very large number of Gaussians with diagonal covariance. In both cases, training would become computationally more expensive and would require a greater amount of data. MFCCs do not cause these detrimental effects and so are widely used even to this day.

Filterbank (FBANK) features are calculated in the same way as MFCCs, except that the decorrelation steps are not applied. They were used in the experiment in section 4.3. While not very useful for the standard models, other models such as neural networks can benefit from them, as observed for instance by Mohamed et al. [2012]. Picone [1993] provides a good overview of all of these methods for processing signals, and some additional methods too.
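As a concrete illustration, the sketch below computes Kaldi-style MFCC and FBANK features with appended deltas and delta-deltas. The project itself used Kaldi's own feature-extraction scripts (section 3.2.6); torchaudio and the file path here are only illustrative stand-ins.

```python
# Sketch of MFCC/FBANK extraction with Kaldi-compatible settings (25 ms window,
# 10 ms shift). The project used Kaldi's scripts; torchaudio and the path
# "utterance.wav" are placeholders used only for this illustration.
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("utterance.wav")   # shape: (channels, samples)

# 13-D MFCCs computed from 23 mel bins, one row per 25 ms frame, 10 ms shift.
mfcc = kaldi.mfcc(waveform, sample_frequency=sample_rate,
                  frame_length=25.0, frame_shift=10.0,
                  num_mel_bins=23, num_ceps=13)             # (num_frames, 13)

# FBANK features: the same pipeline without the decorrelating DCT step.
fbank = kaldi.fbank(waveform, sample_frequency=sample_rate,
                    frame_length=25.0, frame_shift=10.0,
                    num_mel_bins=23)                        # (num_frames, 23)

# Append deltas and delta-deltas, tripling the dimensionality (13 -> 39).
feats = mfcc.t()                                            # compute_deltas expects (..., freq, time)
deltas = F.compute_deltas(feats)
delta_deltas = F.compute_deltas(deltas)
mfcc_39d = torch.cat([feats, deltas, delta_deltas], dim=0).t()   # (num_frames, 39)
```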

2.2.2 Monophone models

Many monophone models are based around a combination of Hidden Markov Models (HMMs) and GMMs. An example of this is shown in Figure 2.1. Here, a single phone is viewed as being made up of three hidden states at the beginning, middle and end, each of which produces a different GMM as output. In even simpler cases a phone may have only one hidden state, or single Gaussians instead of GMMs as outputs.

Multiple phones can be chained together to form models of words or sentences. The likelihood of the observations (summed over hidden state sequences) can be computed using the Forward algorithm, the most likely state sequence can be calculated through the Viterbi algorithm, and the parameters for the model can be learned through the Baum-Welch algorithm [Rabiner, 1989].

Figure 2.1: Example triple-state monophone HMM-GMM for a single phone. I = start state; E = end state. Adapted from Renals and Shimodaira [2019].
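To make the Forward recursion concrete, here is a minimal sketch for a small left-to-right HMM like the one in Figure 2.1. For brevity the GMM output densities are replaced by a precomputed matrix of per-frame observation likelihoods, and all of the numbers are made up.

```python
# Minimal sketch of the Forward recursion for a three-state left-to-right HMM.
# B[t, j] stands in for p(observation at frame t | state j), which would
# normally be computed from each state's GMM.
import numpy as np

A = np.array([[0.6, 0.4, 0.0],     # transitions between the beginning,
              [0.0, 0.7, 0.3],     # middle and end states
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])     # always start in the first state

B = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.3, 0.6],
              [0.1, 0.2, 0.7]])

def forward_likelihood(A, pi, B):
    """Total likelihood of the observations, summed over all state sequences."""
    alpha = pi * B[0]                       # alpha[j] = p(obs_1..t, state_t = j)
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return alpha.sum()

print(forward_likelihood(A, pi, B))
```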

2.2.3 Basic triphone models

However, monophone models are not usually sufficient, because phonetic context is important. The neighbouring phones affect how exactly a particular phone sounds, as they affect the articulation (Figure 2.4) and the exact acoustic context in which the phone is spoken. What this means in practice is that the acoustic signal for individual phones is often highly variable.

A common solution is to use triphones instead, where each phone is given left and right context phones. Each part (left, middle and right) can be modelled by a single Gaussian or GMM. Since the number of theoretical triphones is high (the number of phones cubed), acoustically similar ones are often clustered together using decision trees (the parameters for which are also learned during training) [Renals and Shimodaira, 2019].

One common application of triphones is as a component in automatic speech recognition (ASR). Sequences of triphones are identified, converted from these to phones, and then combined to find words using a lexicon of pronunciations. A language model restricts the search space from spurious or improbable phone or word combinations. However, some modern end-to-end (E2E) models go “directly” from speech to words; for instance the work by Hadian et al. [2018]. Some relevant E2E components are discussed in section 2.3.

2.2.4 Advanced triphone models

Some more advanced techniques for triphone models were also applied in the shallow models for this project. The first technique used included Linear Discriminant Analysis (LDA) [Haeb-Umbach and Ney, 1992] and Maximum Likelihood Linear Transforms (MLLT) [Gopinath, 1998], [Gales, 1999] during training. LDA aims to reduce the feature dimensionality, while reducing the variability within a class and maximising the variability between classes. What exactly a class is can be varied. They may be phones, or sub-phone units; in the Kaldi toolkit [Povey et al., 2011] classes are acoustic states.

During training, feature vectors from the frames before and after are spliced together, and LDA reduces the dimensionality. MLLT then finds transformations from this reduced feature space for each speaker. This makes the resulting triphone model more speaker-independent.

A second technique used in this project was Speaker Adaptive Training (SAT) [Anastasakos et al., 1996], [Anastasakos et al., 1997], which was done on top of the triphone + LDA + MLLT model. The inner workings of SAT are fairly complex, but the key idea is that it involves learning speaker transformations while also learning the acoustic model's parameters during training. By doing so, it is able to improve the accuracy considerably, at the cost of taking longer to train and requiring more space for storage [Matsoukas et al., 1997].

2.2.5 Limitations of standard models

One downside of a triphone-based approach for multilingual phones is that the number of possible triphones is already very large for a single language, which is why clustering is necessary. If this is expanded to multiple languages from different families, then the number of possible triphones required increases, which in turn requires more data and good clustering. For instance, in the GlobalPhone dataset, Croatian has 30 phones (27,000 triphones) and Swahili has 37 phones (50,653 triphones). When combined they have 50 phones and 125,000 possible triphones.

Furthermore, there are even more serious problems with unseen phones. These models can only predict phones that they have seen already in the training data. This may not always have a terrible effect: for instance, a model that can predict the phone /m/ would probably predict /mb/ as /m/, which would not necessarily be a bad thing (if the two are not separate phonemes in the target language). Thus, such multilingual models can only work well if the testing and training languages use the same (or very similar) sets of phones.

2.3 Deep learning

Modern ASR and phone recognition systems often use some form of deep learning, for instance in E2E systems. In this section, the models, principles and techniques relevant to this project will be briefly described, as some of the more recent work in universal phone prediction involves these. A basic knowledge of deep learning and its concepts (such as backpropagation) is assumed, so that will not be covered here (though Nielsen [2015] provides an excellent overview even for those who are already familiar with the key ideas).

2.3.1 Recurrent Neural Networks (RNNs)

In an RNN, each of the input feature vectors x_1, ..., x_T is fed into a hidden state, and the state's output obtained. The exact internal structure of this state can vary, but the crucial aspect is that it uses the previous state as an input as well. Thus, information from previous time steps can be retained when making predictions.

During backpropagation, gradients can be propagated along the hidden states using the unrolled RNN (shown in Figure 2.2). However, this is one of the flaws with vanilla RNNs: gradients tend to vanish to small values the further back in time they are propagated [Pascanu et al., 2013].

Figure 2.2: An example of an unfolded RNN. h = hidden state, x = input, o = output. W, V & U are weight matrices. From Wikipedia, by fdeloche, licensed under CC BY-SA 4.0.

2.3.2 Long Short-Term Memory (LSTM)

An LSTM attempts to address the shortcomings of the vanilla RNNs described previously. They maintain the recurrent aspect, but use different hidden states (Figure 2.3). Each of the hidden states has a “cell” which is effectively used to store and monitor any dependencies in the sequential input features.

Figure 2.3: LSTM cell example. F_t = forget gate, I_t = input gate, O_t = output gate. x = input, o = output, h = hidden state, c = cell state. From Wikipedia, by fdeloche, licensed under CC BY-SA 4.0.

There are three gates in an LSTM:

1. input gate: how much of the input at the current time should be used in the internal cell state

2. forget gate: how much of the previous cell's internal state should be maintained in the current cell's internal state

3. output gate: how much of the cell's internal state should be used when calculating the output for the hidden state

Together, these three gates regulate information flow in a layer, and enable information to be preserved across time steps for a long time, if necessary. Thus the vanishing gradient problem can be resolved [Bayer, 2015].
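To make the gating explicit, below is a minimal single-time-step LSTM cell written out by hand; in practice a library implementation such as torch.nn.LSTM would be used, and the weight shapes and dimensions here are placeholders.

```python
# One time step of an LSTM cell, written out to make the three gates explicit.
# This is only an illustrative sketch; real code would use torch.nn.LSTM.
import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """W_x: (4*H, D) input weights, W_h: (4*H, H) recurrent weights, b: (4*H,)."""
    H = h_prev.shape[-1]
    gates = x_t @ W_x.t() + h_prev @ W_h.t() + b
    i, f, o, g = gates.split(H, dim=-1)
    i = torch.sigmoid(i)           # input gate: how much new input enters the cell
    f = torch.sigmoid(f)           # forget gate: how much of the old cell state is kept
    o = torch.sigmoid(o)           # output gate: how much of the cell state is exposed
    g = torch.tanh(g)              # candidate cell update
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * torch.tanh(c_t)      # new hidden state
    return h_t, c_t

D, H = 39, 64                      # e.g. 39-D input features, 64 hidden units
x_t = torch.randn(1, D)
h, c = torch.zeros(1, H), torch.zeros(1, H)
W_x, W_h, b = torch.randn(4 * H, D), torch.randn(4 * H, H), torch.zeros(4 * H)
h, c = lstm_step(x_t, h, c, W_x, W_h, b)
```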

2.3.3 RMSProp optimisation

There are numerous different weight optimisation methods to choose from in deep learning. In this project, Root Mean Square Propagation (RMSProp) was used; it was proposed by Hinton [2012], and has seen a fairly wide range of applications since. In RMSProp, the gradients are divided by a moving average of the squared gradients (Equation 2.1 and Equation 2.2).

MS(w_t) = γ · MS(w_{t−1}) + (1 − γ) · (δw_t)²     (2.1)

Equation 2.1: Calculation of the moving average squared gradient for a single weight. w_t = weight at time t; γ = moving average parameter; δw_t = weight gradient

w_{t+1} = w_t − η · δw_t / √(MS(w_t) + ε)     (2.2)

Equation 2.2: Updating a single weight parameter using RMSProp. w_t = weight at time t; η = learning rate; δw_t = weight gradient; MS = moving average squared gradient (Equation 2.1); ε = stabilising parameter

RMSProp was based on Robust Propagation (Rprop) [Riedmiller and Braun, 1993], except it allows for learning with mini-batches [Hinton, 2012], rather than requiring full batches. It is more robust to steep changes in gradient than stochastic gradient descent (SGD) as well, since any large gradient would be approximately divided by itself. A small stabilising parameter, ε (typically 1 × 10⁻⁸), is added in Equation 2.2 to ensure that if the value in the denominator is zero (or close to zero), this does not cause numerical problems.
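A minimal sketch of the update rule in Equations 2.1 and 2.2, applied to a single parameter vector; the hyperparameter values are typical defaults rather than the settings used in this project.

```python
# Sketch of the RMSProp update from Equations 2.1 and 2.2.
import numpy as np

def rmsprop_step(w, grad, ms, gamma=0.9, eta=1e-3, eps=1e-8):
    ms = gamma * ms + (1.0 - gamma) * grad ** 2     # Equation 2.1: moving average of squared gradients
    w = w - eta * grad / np.sqrt(ms + eps)          # Equation 2.2: gradient scaled by its running magnitude
    return w, ms

w = np.zeros(3)
ms = np.zeros(3)
for grad in [np.array([0.5, -2.0, 0.1]), np.array([0.4, -1.5, 0.2])]:
    w, ms = rmsprop_step(w, grad, ms)
```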

2.3.4 Connectionist Temporal Classification (CTC) loss

CTC loss works by receiving a distribution across all possible outputs at each individual time step. From this distribution, the probability of different output sequences can be calculated. Each output sequence is then transformed into an alignment by combining characters which are repeated and then deleting any blank symbols. Any alignments which fail to map to the target output are excluded. Then the loss can be computed based on how likely all the valid alignments were.

The main benefit of using CTC is that it does not require frame-level alignments to phones or labels when training, but can work end-to-end instead. One disadvantage though, is that it treats all the outputs as occurring independently of one another, which is not usually the case in practice.
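The collapse rule described above can be illustrated with a small sketch; the blank symbol and example strings are arbitrary, and a real implementation (e.g. torch.nn.CTCLoss) computes the loss with a forward-backward recursion rather than by enumerating alignments.

```python
# Sketch of the CTC collapse rule: merge repeated symbols, then remove blanks.
# Any alignment that does not collapse to the target sequence contributes
# nothing to the CTC loss.
BLANK = "-"

def collapse(alignment):
    out = []
    for symbol in alignment:
        if out and symbol == out[-1]:
            continue                        # merge repeats
        out.append(symbol)
    return [s for s in out if s != BLANK]   # remove blanks

print(collapse(list("--kk-aa-tt")))    # ['k', 'a', 't']
print(collapse(list("kk--aaatttt")))   # ['k', 'a', 't']
```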

2.3.5 Miscellaneous techniques

Finally, three other deep learning techniques used in this project will be outlined.

The first is batch normalisation [Ioffe and Szegedy, 2015]. It is designed to stop activations of layers from becoming too large. Each of the mini-batch activations is normalised to zero mean and unit variance, and the output scaled and shifted accordingly. Furthermore, this effectively adds some noise to the activations which helps somewhat to regularise the training.

The second is dropout [Srivastava et al., 2014]. During training, if dropout is applied to a layer, then a random subset of its outputs is effectively ignored. This has a regularisation effect since it prevents networks from relying too much on the outputs of any individual nodes. Instead, they must learn more robust representations of the input features.

The third is the rectified linear unit (ReLU) activation function (Equation 2.3). This function has been observed to improve neural network training [Glorot et al., 2011]. It is faster to compute than sigmoid or tanh functions as it does not require relatively expensive exponential calculations. Backpropagation is quite straightforward as well, since the gradients are only 0 or 1.

ReLU(x) = max(0,x) (2.3)

Equation 2.3: Rectified Linear Unit activation function
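These three techniques are typically combined within a single feed-forward block, as in the hedged sketch below; the layer sizes and dropout rate are placeholders rather than the configuration used in the project.

```python
# Sketch of one block combining the three techniques above: a linear layer
# followed by batch normalisation, ReLU (Equation 2.3) and dropout.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(in_features=550, out_features=550),
    nn.BatchNorm1d(550),    # normalise activations to zero mean / unit variance per mini-batch
    nn.ReLU(),              # ReLU(x) = max(0, x)
    nn.Dropout(p=0.2),      # randomly zero 20% of outputs during training
)

x = torch.randn(32, 550)    # a mini-batch of 32 feature vectors
y = block(x)                # shape: (32, 550)
```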

2.4 Universal phone models

2.4.1 General concepts

The appeal of having a universal phonetic model is quite intuitive. If there were some robust, non-language-specific phone recogniser, it would be very beneficial for unwritten languages. Not only that, but it would be considerably easier to incorporate a new language into an existing model built using this universal set since there would be no need to work out a new phone set or other models beforehand. These ideas were expressed by Gokcen and Gokcen [1997].

However, there are challenges with building such a model. Firstly, the number of phonemes for which phones must be distinguishable is greatly increased. For an individual language, this could be from 141 phonemes (in !Xũ), down to 11 (in Rotokas and Pirahã) [Crystal, 2010], though 20-50 is a more typical number in most European languages.

To combine these is challenging, since distinct phonemes in one language may not be distinct in another. For instance, as mentioned previously in section 2.1, most English phonetic transcriptions would not distinguish between the phones /p/ and /ph/ (aspirated), but would instead transcribe both as “p”. However, a Hindi transcription would need to show the two separately, since both sounds are important to determine the meaning of different words.

One solution is to map all of the phonemes from different languages to their closest corresponding phone in the IPA to standardise them, and then merge them together. Such combination and standardisation can be useful. Schultz and Waibel [1998] investigated combining phonemes into one multilingual phoneme set, finding that the word error rate (WER) slightly increased, and that it was important to add a language question when clustering triphones. In regard to unseen languages (trying a crosslingual approach), they achieved a WER of 41.5% for German, which was slightly more than double the 20% WER on a monolingual German setup. While such results may be promising, it did require that they go through the German phoneme set by hand and map its phonemes to their “global” phoneme set.

The IPA uses 107 symbols for sounds, 52 diacritics and 4 prosodic marks. Putting these all together results in a huge number of possible phones, so trying to represent every single possible phone adequately in a model is simply infeasible. It would require a huge amount of meticulously transcribed training data, while some phones would still seldom or never occur. Instead, for the purposes of this project, the phoneme set and corresponding phone inventory used in training will be restricted to the ones from the phonetic transcriptions themselves, ignoring any unseen diacritics or prosodic marks. How to go from this smaller phone set to a universal one which can predict unseen phones will be discussed in the following section.

2.4.2 Modelling unseen phones

There is currently no definitive answer on how to model unseen phones, but one promising approach is to use articulatory features. This takes advantage of the fact that each phoneme can be broken down into the presence or absence of distinct attributes. These relate to aspects such as the place and manner of articulation, as shown in Figure 2.4. For instance, the phone /b/ has the attributes “voiced, bilabial, stop & labial”. The attributes themselves are what distinguish between different phones. Thus, in theory, if such attributes could be accurately predicted, one would be able to derive the corresponding phone. NB: throughout the rest of this report, the term “attributes” refers to these articulatory features to distinguish them from input feature vectors such as MFCCs.

Figure 2.4: Illustration of different places of articulation and components of human vocal anatomy. From Encyclopaedia Britannica [2020].

Some progress was made using this technique by Siniscalchi et al. [2008] and Lyu et al. [2008]. Their approach was to decompose phonemes into the presence/absence of these attributes, train “simple” neural networks to recognise the likelihood of each attribute separately, and then combine these probabilities to predict the phoneme. There did appear to be reasonable performance when comparing models trained with Mandarin to those without (which were meant to be universal).

In addition, King et al. [2004] have done work with articulatory feature recognition using dynamic Bayesian networks. They included dependencies between different sets of attributes (e.g. manner and place), rather than viewing them all as independent from one another. As a result, they found that this improved the accuracy of recognising these attributes, since unlikely combinations were removed. In the section on future work, they also believed it would be possible to include such attributes in phone recognisers.

All of this confirms that breaking down phonemes into articulatory features is a reasonable approach to take when trying to identify them, as demonstrated in some more recent studies as well ([Müller et al., 2017] and [Baljekar et al., 2015]). Furthermore, this would mean that slight inaccuracies in the alignment labels could be handled during training. For example, if a phone was labelled as /a/ instead of /A/, the attributes are identical except one is “front” and the other is “back”. Thus, during training, the model could still learn the correct attributes of “vowel”, “unrounded” and “open”, even if one attribute was wrong.
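The example below sketches this attribute representation with a toy inventory; the attribute names and phone set are illustrative only and much smaller than the 51-attribute set actually used in the project (section 3.3.2).

```python
# Toy sketch of representing phones as binary articulatory-attribute vectors.
# The attribute inventory here is a small illustrative subset.
ATTRIBUTES = ["vowel", "consonant", "open", "close", "front", "back",
              "rounded", "unrounded", "voiced", "bilabial", "stop"]

PHONE_ATTRS = {
    "a": {"vowel", "open", "front", "unrounded"},
    "A": {"vowel", "open", "back", "unrounded"},    # X-SAMPA /A/
    "i": {"vowel", "close", "front", "unrounded"},
    "b": {"voiced", "consonant", "bilabial", "stop"},
}

def attribute_vector(phone):
    attrs = PHONE_ATTRS[phone]
    return [1 if a in attrs else 0 for a in ATTRIBUTES]

# /a/ and /A/ differ only in the front/back pair, so a mislabelled frame
# still provides mostly correct attribute targets during training.
va, vA = attribute_vector("a"), attribute_vector("A")
print(sum(x != y for x, y in zip(va, vA)))   # 2 differing positions out of 11
```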

2.4.3 Universal phone modelling with attributes

An unpublished work by Li et al. [2018] provided a great deal of inspiration for this project. As above, the key idea was to model phonemes in terms of their attributes, and to use these to predict phonemes. A bidirectional LSTM modelled the acoustic signals in their temporal context to obtain a distribution over attributes, after which a language-dependent signature matrix transformed that distribution into a phoneme distribution. CTC loss was used to optimise the model (Figure 2.5).

Figure 2.5: Network structure used by Li et al. [2018] (figure taken from the same paper).

However, there was one significant issue with this network structure: it required a signature matrix for the language. The problems with such a matrix will be discussed more fully in section 3.5. The work here also only used English for training, so one advantage of this project is the ability to use a multilingual dataset instead (see section 3.1), rather than English alone. The exact dataset and general model setup will be detailed in the following chapter.

Chapter 3

General setup

In this chapter, the GlobalPhone corpus and its use is discussed first of all, since it affected the way everything else was done. Next, techniques which were required across all of the experiments, such as file conversion and data organisation, are given. Finally, the modifications made to PyTorch-Kaldi, and the methods for predicting phones from attributes are covered.

3.1 The GlobalPhone Dataset

The GlobalPhone (GP) corpus is a multilingual dataset containing speech from people reading news articles, along with transcriptions. Work was begun in 1995 [Schultz, 2002], with more languages being added over time. Currently, there are over 1500 speakers in total, and 22 languages in the corpus, with about 20 minutes of speech per speaker. For additional information about GP, see section 3.1 of the previous year’s project [Moore, 2019]. Summary statistics for the languages used in this project are given in Table 3.1. Note that these statistics are after various filtering operations had been applied to the data files (detailed in the rest of this section). Hence, they may not reflect the statistics in the raw GP corpus.

3.1.1 Suitability analysis

One major benefit of the GP corpus is that it provides phone-level dictionaries for each word in the transcriptions. However, generating phonetic transcriptions based on the words is likely to be relatively inaccurate.

There are two main reasons for this: firstly, the phonetic transcriptions for GP were made by joining the phonetic words in the GP dictionaries together, as described later in section 3.2.5. If a word had multiple pronunciations, the first in the dictionary was selected, since it is impossible to tell from the transcript files alone which pronunciation occurs most frequently.

21 22 Chapter 3. General setup

Language         Length (hrs)   No. speakers   Age (mean)    Gender ratio (m:f)
Bulgarian        20.6           77             32.6±13.9     42:58
Croatian         10.1           93             32.1±14.2**   40:60
Hausa            8.7            103            28.2±10.5     31:69
Polish           24.6           99             *             *
Swahili          11.2           70             24.8±7.5      46:54
Swedish          21.7           98             32.3±13.7     51:49
Turkish          14.8           101            27.5±10.8     28:72
Ukrainian        14.1           119            34.0±13.5     39:61
Total            125.8          760            -             -
Mean (language)  15.7±5.5       95±14.4        -             -

Table 3.1: Summary of GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken. ** indicates that information was missing for just one speaker.

Secondly, the words in the transcriptions were joined together independently. This meant that word boundaries were not reflected well, particularly when words ran into each other. However, once again, there is little that can be done unless one were to go through and painstakingly transcribe every utterance by hand, phonetically.

For these reasons, a fairly high phone error rate (PER) was quite likely, regardless of how good any model happened to be. The best result for the TIMIT dataset of phonetic transcriptions [Garofolo et al., 1992] is around 15% [Michalek and Vanek, 2018]. Furthermore, TIMIT is well-transcribed at the phonetic level, and is monolingual; GP is multilingual and has the aforementioned issues. Consequently, an alternative evaluation metric was to generate a phonetic inventory/distribution for each language, and compare this to the “true” value (based on the transcriptions); more information about this is provided in section 3.6.2.

There is still great value nonetheless in working with datasets like GP, which are word- level transcribed. It is extremely time-consuming to create phone-level transcriptions, and there is often room for debate among linguists as to what exact phone should be used. Word-level transcriptions are far easier to make, and considerably more common in language corpora, so the techniques here would be more widely applicable.

In conclusion, despite its potential flaws, the GP corpus is a suitable source of training/testing data, since some issues can be accounted for through its size, and using data augmentation techniques. Inaccurate phonetic transcriptions are the main downside, but there is little that can be done except to make a note of it.

3.2 File preparation

3.2.1 Kaldi

Kaldi is an open source library designed for speech and phone recognition [Povey et al., 2011]. It provides scripts and functions for various tasks, from extracting MFCCs to training deep networks. Much of the work in file preparation detailed here was done using Kaldi, or adapted versions of its scripts.

3.2.2 Conversion and preliminary cleaning

The first step was to convert the GP audio files to the WAV format to make them usable, with the sox and shorten applications. The vast majority of the files were converted properly, though a few failed for unknown reasons (the error messages provided were not particularly helpful, only warning of magic numbers).

In some cases, either the transcription or the audio was missing for an utterance. For these ones, the utterances were removed. A wiki page [2010] by Partha Lal, who did some work with GP [Lal, 2011], lists a number of issues such as these, which were noted. Any utterances marked as problematic here were also discarded.

Notably, the audio files for Tamil were very different - it seemed that the utterances for each speaker had been combined into one very long audio file (about half an hour long). Furthermore, there were no transcriptions or phone dictionaries. As a result, the Tamil language was removed.

3.2.3 Splitting the data

According to the GP documentation [Schultz, 2002], any train/valid/test split should not contain the same article or speaker in two different sets. That is, the theoretical best split would be to have completely unique speakers and articles in each split. The “higher” priority was to ensure that the speakers were kept separate, since that would provide more similarity between splits than, for example, two speakers with different accents reading the same sentence.

First, each individual sentence spoken by the different speakers was obtained from their transcriptions. These were then compared between all other speakers to find the number of sentences that were the same between the two. This provided a strong measure of the overlap between different pairs of speakers. Pairs of speakers with more than 10 overlapping sentences were marked as “very bad”, pairs with 2-10 as “bad”, and pairs with only one sentence as “minor”. The main goal was to ensure that speaker pairs with high levels of overlap were placed in the same split (to reduce overlap between splits). In most cases, some degree of at least “minor” overlap between the splits was unavoidable. A sketch of this pairwise check is given at the end of this subsection.

The strategy was to allocate as many “unique” speakers (ones with no overlap with other speakers) as possible to a split. This was done first for the test set, to ensure that it was as different as possible from either training or validation sets. If there were no “unique” speakers left, then the next step was to add any pairs/groups of speakers which only overlapped with each other. For example, if speakers 1, 2 and 3 have high levels of overlap between them, but not with any other speakers, then it would be fine to include them all in one split. Finally, if necessary, “minor” (and then “bad”, and lastly “very bad”) speakers were added too.

The utt2len file contains the length of each utterance in seconds. From this, it was possible to calculate the length of each speaker's utterances, and thus what percentage each individual speaker contributed. Thus, this process was repeated until the sets were split for each language into 70% training, 15% validation and 15% testing.

The existing work with GP in Kaldi [Povey et al., 2011] provided some speaker splits; however, these were found to be roughly 80/10/10 splits for training/validation/test. Since there was a fairly large quantity of training data available, it seemed that it might be beneficial to use more data when validating/testing to get a better idea of how well the techniques could generalise. Thus, a 70/15/15 split was used instead.
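The snippet below is a minimal sketch of the pairwise overlap check described above; the speaker IDs and sentences are made up, and the project built the per-speaker sentence sets from the GP transcription files.

```python
# Sketch of the speaker-overlap check: count the sentences shared between every
# pair of speakers and classify the overlap as "very bad", "bad", "minor" or "none".
from itertools import combinations

def classify_overlap(n_shared):
    if n_shared > 10:
        return "very bad"
    if n_shared >= 2:
        return "bad"
    if n_shared == 1:
        return "minor"
    return "none"

def speaker_overlaps(speaker_sentences):
    overlaps = {}
    for s1, s2 in combinations(speaker_sentences, 2):
        shared = len(speaker_sentences[s1] & speaker_sentences[s2])
        overlaps[(s1, s2)] = classify_overlap(shared)
    return overlaps

# Hypothetical speaker IDs and sentences, purely for illustration.
example = {
    "BG001": {"sentence a", "sentence b", "sentence c"},
    "BG002": {"sentence c", "sentence d"},
    "BG003": {"sentence e"},
}
print(speaker_overlaps(example))
# {('BG001', 'BG002'): 'minor', ('BG001', 'BG003'): 'none', ('BG002', 'BG003'): 'none'}
```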

3.2.4 Standardising phones

A challenge with the GP dataset is that it has been collected in different places and built up over time. Consequently, it does not use a fully-standardised set of phones in each language's dictionaries, but usually a customised one. For instance, in Swahili, the IPA character γ is represented by SWA_gh, but in Spanish as M_G. In German, M_ae corresponds to æ, but in Swedish, M_ae corresponds to ε.

Thus, for each language, a phone map was built, mapping from language-specific phones to their IPA and X-SAMPA characters. X-SAMPA is a form of the IPA designed for computers; it uses only simple characters and punctuation marks [Wells, 1995]. In some cases, the mapping was relatively straightforward as documentation was easily available, for instance, for Hausa and Swahili. The GP work in Kaldi [Povey et al., 2011] also provided mappings for some languages. For Korean, there was an existing thesis [Kiecza, 1999], which included descriptions of the phones.

Additionally, the Wikipedia IPA page for each language was checked to ensure that the phone lists were reasonable. Sometimes it was used in combination with the GP language dictionary to infer the IPA symbols where proper documentation was lacking, in particular with Turkish and Japanese. Most languages had similar patterns, too; for example, [x]_m usually corresponded to “m”, which provided another sanity check on the accuracy of the phone maps (although, as stated earlier, this was not an infallible guide). An example list of conversions is provided in the appendix.

If there was no documentation available, and the language's alphabet was unfamiliar (e.g. Russian, Mandarin), then that language was ignored for the purposes of this project, since it was almost certainly not worth taking stabs in the dark to guess the correct conversions.

From these phone maps, IPA and X-SAMPA word dictionaries could be generated using the GP dictionaries. The reason for creating both was that IPA is more human-readable, but X-SAMPA is better for computation, since it does not use any obscure

Unicode characters which may otherwise cause unexpected behaviour.

3.2.5 Generating transcriptions

Having obtained a word-to-phone X-SAMPA dictionary (section 3.2.4), a transcript could be obtained by going through each utterance and getting the corresponding phones for each word. A silence phone was added to the start and end of each utterance.
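A minimal sketch of this step is shown below; the dictionary entries, phone symbols and the "sil" label are illustrative placeholders rather than GP data.

```python
# Sketch of generating a phone-level transcript from a word-level utterance,
# taking the first pronunciation for any word with several, and padding with a
# silence phone at the start and end.
LEXICON = {
    "hello": [["h", "E", "l", "o"]],
    "world": [["w", "e", "l", "d"], ["w", "O", "l", "d"]],   # two pronunciations
}
SIL = "sil"

def utterance_to_phones(words, lexicon):
    phones = [SIL]
    for word in words:
        pronunciations = lexicon.get(word.lower())
        if pronunciations is None:
            raise KeyError(f"'{word}' missing from dictionary (e.g. a stutter)")
        phones.extend(pronunciations[0])    # first pronunciation wins
    phones.append(SIL)
    return phones

print(utterance_to_phones(["hello", "world"], LEXICON))
# ['sil', 'h', 'E', 'l', 'o', 'w', 'e', 'l', 'd', 'sil']
```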

For some languages, stutters/hesitations were included in the transcription. These created a problem since they were not present in the original dictionaries. In some cases, there were relatively few of these and it was easy enough to add additional entries in the dictionary. For example, the pronunciation of a truncated word could quickly be found by looking at the entry for the full word (e.g. “market”).

In other cases, there were far too many of these and it would have taken a very long time to go through and write manual entries for all of them. Automating the process was not very feasible either, since

1. Determining how many of a similar word's phones to use was not always clear-cut.

2. Sometimes there were multiple possibilities, and human input was best to check and compare similar words to work out which was most likely in the context.

Consequently, for languages like Swedish, which had a large number of the issues mentioned above, some entries were added to the dictionary to account for the most frequent stutters, but the majority of stutters were not fixed (and any corresponding utterance ignored). Notably, this did not significantly affect the train/validation/test splits made previously, since the bad utterances tended to occur uniformly at random.

It was also during this stage that a problem was discovered with the Japanese dictionary. It was not sorted alphabetically (unusually for GP dictionaries), and upon sorting it was found that it contained many duplicate entries for the same word; for instance there were five entries for “age” with identical pronunciations. In addition, many words which appeared relatively frequently were missing; for example “niha” occurred 124 times in transcripts, but was missing from the dictionary. As a result, the Japanese language was dropped for this task, since it was not possible to obtain a working dictionary anywhere else.

As mentioned previously, multiple pronunciations were an issue. However, trying to accommodate them can in fact worsen performance [van Bael and King, 2003]. Renals [2012] suggests a few reasons. Firstly, it adds complexity to the model by adding more clusters; secondly, many words tend to occur with one pronunciation the majority of the time. Unfortunately, it was not possible to tell which one occurred most often directly from the GP data.

3.2.6 Generating input features

The MFCC and filterbank features were generated using standard Kaldi scripts. All the configurations were left at their default settings, namely a 25ms window, 10ms shift and 23-D features, with delta and delta-deltas (giving 69-D feature vectors). Furthermore, it was possible to do this process just once, and then select utterances from the .scp files that Kaldi generates (these map each utterance to the location where the features are stored in binary form).

3.3 Organising experiments

The first step was to combine the data from multiple languages, using the speaker lists for training/validation/testing. This was relatively straightforward to accomplish. The scripts for generating language models were adapted from the ones for the TIMIT dataset.

Once the data and language models were built, some initial alignments needed to be generated for use in the neural network. Based on the Kaldi scripts for TIMIT, a monophone model (section 2.2.2) was trained first. Next, a triphone model was built which included delta and delta-delta features (section 2.2.3). Thirdly, an LDA-MLLT triphone model was built on top of the previous triphone model, and then finally an SAT and LDA-MLLT triphone model on top of that one (section 2.2.4). Each successive stage used the previous alignments as a starting point. How the parameters were selected, and how well the models performed, is discussed in section 4.1.

3.3.1 Additional filtering

Some final filtering was performed after the triphone models were trained. It was noticed that a few of the utterances were not aligned correctly to phones when it came to training the network. Upon closer inspection, these proved to be missing from the alignments directory. The solution was to check the alignments, and if any utterances were missing, these were removed from the data. This also seemed to fix some crashes which had been occurring later during training of the neural network.

3.3.2 Converting phones to attributes Li et al. [2018] have a list of phones and their attributes in the appendix; for instance, /a/ = “vowel open front unrounded”. It was decided that these would be used as the universal phone targets. A file mapping each phone in the languages used to a 51- dimensional set of attributes was made, where each attribute vector was 1 if the phone had that attribute, and 0 otherwise. Some attributes were effectively “extensions” to a base phone (e.g. “long” or “aspirated”); that is, they tended to be added on to the end of an existing phone, so all other attributes would be retained. Full tables of all the base phones and extensions are provided in the appendix: Table A.1 and Table A.2. 3.3. Organising experiments 27

It was then possible to take this and the phones.txt file generated by Kaldi scripts (which maps phone numbers to phones) to create a mapping from the phone numbers to their attribute vectors. In turn, this would enable the generation of attribute vectors from the numbers given out in the alignments.
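A sketch of this step is given below. The file formats assumed here (a phones.txt of "symbol id" pairs and an attribute table of "phone b1 ... b51" rows) are illustrative; the project's actual files are only described in outline in the surrounding text.

```python
# Sketch of turning Kaldi's phones.txt (phone symbol <-> integer id) and a
# phone-to-attribute table into a map from alignment phone ids to 51-D
# attribute vectors.
import numpy as np

def load_phone_ids(phones_txt):
    """phones.txt lines look like: '<phone-symbol> <integer-id>'."""
    phone_to_id = {}
    with open(phones_txt) as f:
        for line in f:
            phone, idx = line.split()
            phone_to_id[phone] = int(idx)
    return phone_to_id

def load_attribute_table(attr_file):
    """Assumed format: '<phone-symbol> b1 b2 ... b51' with each b either 0 or 1."""
    table = {}
    with open(attr_file) as f:
        for line in f:
            parts = line.split()
            table[parts[0]] = np.array([float(b) for b in parts[1:]], dtype=np.float32)
    return table

def id_to_attribute_map(phones_txt, attr_file):
    phone_to_id = load_phone_ids(phones_txt)
    attrs = load_attribute_table(attr_file)
    # Map every phone id appearing in the alignments to its attribute vector.
    return {idx: attrs[phone] for phone, idx in phone_to_id.items() if phone in attrs}
```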

3.3.3 Dealing with diphthongs

During the previous stage, it was decided to remove languages that contained diphthongs, such as German and Korean. Diphthongs (a.k.a. gliding vowels) are made by moving the tongue during the pronunciation of a vowel, thus combining two vowels into one sound. Modelling diphthongs in terms of attributes was a complication identified in this project. Since diphthongs are effectively two vowels, what should the attribute vector look like?

As an example, consider the diphthong /ai/, where /a/'s attributes are “vowel open front unrounded” and /i/'s attributes are “vowel close front unrounded”. One option would be to represent the unique attributes as 0.5, and the shared attributes as 1. So, for instance, /ai/ would have 1 for “vowel, front, unrounded” and 0.5 for “close, open”. A related option would be to set the union of the phone attributes as 1.

Another alternative considered was splitting the diphthong into its two phones and then using two separate attribute vectors. This was the method used by Frankel et al. [2007] and Richardson et al. [2003]. A problem with implementing such a method in this particular context was that frame-level alignments from triphone models, where the diphthongs were considered distinct phones, were being used to train the network. Thus, if diphthongs were present, this would require going through all the alignments and trying to split the corresponding consecutive entries in half (so that the first half was labelled with the first phone and the second half with the second).

One solution to that issue would be to split and remove diphthongs in the setup stages when training triphone models. However, there would be issues with this approach as well. The diphthongs were deemed distinct enough from other phones in the language that they each had their own phoneme, so ignoring this would likely have had an adverse effect. In fact, Richardson et al. [2003] did indeed find that this technique “resulted in minor improvement in one iteration, followed by degraded performance in future iterations” for them.

Finally, even if it were possible to represent diphthongs using any one of these methods, there would be one major problem which is not at all easy to overcome - predicting phones from attribute scores. The method for determining the most likely phone based on attributes is explained in more detail in section 3.5. Suffice to say that a score was generated based on how strongly each of a phone's attributes were predicted. If the first method of weighting attributes or setting them all to 1 was used, then diphthongs could be predicted if they scored highly enough on both phones. However, there would be no way to distinguish between /i@/ and /@i/, for instance, in Vietnamese. With the second method of splitting diphthongs, then, for example, /ia/ could not be distinguished from /i/ and /a/ separately. A possible solution was thought of later, but there was not time to implement it, so it will be discussed in section 5.1.2.

Thus eight languages remained after discarding the rest for one reason or another. They are shown in Table 3.2, and the amount of phonetic overlap between them in Figure 3.1.

Language    Language code
Bulgarian   BG
Croatian    CR
Hausa       HA
Polish      PL
Swahili     SA
Swedish     SW
Turkish     TU
Ukrainian   UA

Table 3.2: Languages used in the project and their corresponding language codes


Figure 3.1: Common phones between languages. See Table 3.2 for the key to the language codes

French was accidentally removed, but by the time this mistake was realised, it was too late to fix. However, it is also somewhat notorious as a language for multiple pronunciations of the same words, so it may be for the best that those phonetic transcriptions were excluded. Regardless, the eight remaining languages were across a good range, and with four Slavic ones, an experiment within the same language family was possible.

In addition, Bulgarian and Hausa were mistakenly included; they do in fact have diphthongs: /ya/, /yu/, /ai/ and /au/. This was only noticed later since an earlier version of the code had used the method of setting all attributes in a diphthong to 1. When the decision to remove diphthongs was made afterwards, this part of the code was forgotten. As a result, the pipeline appeared to run as normal, and the issue was found considerably later during more in-depth result analysis.

Unfortunately, by this point it was too late to retrain the entire network for each experiment. A solution instead was to include these specific diphthongs in the list of universal phones during the stage of predicting phones from attributes (more details on the methods for doing so in section 3.5).

However, it was believed that this would not necessarily have resulted in greatly impeded performance. Since none of the diphthongs in question had their reverse present, i.e. /ya/ was present but not /ay/, the issue of confusing the order did not apply. Furthermore, Bulgarian was among the best results in later experiments, so the effect of including it cannot have been disastrous.

3.4 Using PyTorch-Kaldi

PyTorch-Kaldi [Ravanelli et al., 2019] is a toolkit designed to enable the use of PyTorch's neural network functionality [Paszke et al., 2019] with Kaldi's speech tools. Since it had some initial models built for the phonetic TIMIT dataset, it seemed highly appropriate to use. However, a considerable number of modifications were necessary, which are described in this section.

3.4.1 Adapting input alignments

By default, PyTorch-Kaldi reads in the alignments for input features corresponding to phone numbers or context-dependent phone numbers. The phone numbers were converted into attribute vectors using the phone-to-attribute mapping file built in section 3.3.2. Some modification of label indexing was then required since PyTorch-Kaldi concatenates the features and their target labels together into rows when setting up chunks for training, and does not “expect” the target label for a feature to be any longer than 1 (and in this case the target lengths were 51).

3.4.2 Cost function

In vanilla PyTorch-Kaldi, the output layer of the network was of the shape N × P, where N is the number of outputs and P is the number of phones or context-dependent phones. A softmax function (Equation 3.1) was applied, which converted each output to a row of probabilities across all possible phones. Finally, the negative log-likelihood (NLL(x) = −log(x)) was applied as the cost function; this penalised any incorrect, or correct but uncertain, phone predictions, and rewarded confident and correct ones.

σ(x)_i = exp(x_i) / Σ_{j=1}^{K} exp(x_j),   for i = 1, ..., K     (3.1)

Equation 3.1: Softmax function

However, this cost function was not appropriate for the new output vectors, as in this case a sigmoid function (Equation 3.2) was applied to the network outputs, which turned them each into vectors containing values between 0 and 1. These values roughly corresponded to the probability of each attribute being present or absent in the outputs.

S(x) = 1 / (1 + exp(−x))     (3.2)

Equation 3.2: Sigmoid function

To compare these output probabilities to the “ground truth” labels at each frame, binary cross entropy (BCE) was used. BCE is calculated using the formula in Equation 3.3. This way, all of the output values in a vector need to be close to the target, otherwise the BCE will increase. PyTorch provides the function BCELoss, which allows the BCE loss to be backpropagated through the network.

BCE(x, y) = − Σ_{i=1}^{D} [ y_i log(x_i) + (1 − y_i) log(1 − x_i) ]     (3.3)

Equation 3.3: Binary cross entropy formula. y = target labels (0 or 1); x = output probabilities; D = dimensionality of x and y

One issue that arose initially during training was that, because the majority phone class is “silence”, the BCE loss was flooded with 0-vectors, so after several iterations of backpropagation the network only ever predicted 0. Class imbalance is an old machine-learning problem, and undersampling is one way to tackle it [Japkowicz, 2000]. Thus, it was necessary to filter out the majority of silence phones.

Ideally, this would be done when reading each data chunk in, and this was the first method attempted. If the mean frequency of non-silence phone occurrences was, for example, 2000, then the silence phones would be reduced to this number. However, it was found that PyTorch-Kaldi applies some padding to the inputs to make batches the same size, based on some complicated calculations involving the length of the utterances. When the above method for filtering silence was applied, it changed the utterance lengths, and the training pipeline broke. How this issue could be fixed was not obvious, due to the rather opaque nature of the calculations, so it was decided that this approach should be avoided.

Instead, the silence removal was done immediately before calculating the loss on each training batch. Here 95% of the silence labels and their corresponding outputs were removed. This was arguably less efficient, but in practice ran at a similar speed to the baseline model. The effects were seen immediately as a good mix of 0s and 1s were now predicted. Furthermore, keeping some of the silence meant that it would not be completely ignored (which would otherwise have caused problems when testing).
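The sketch below illustrates this pre-loss silence undersampling on a single batch. The 51-D per-frame shape follows the project, but the helper itself and the identification of silence frames as all-zero targets are an illustration rather than the actual PyTorch-Kaldi modification.

```python
# Sketch of removing ~95% of silence frames immediately before computing the
# BCE loss on a batch. `outputs` and `targets` are (num_frames, 51) tensors;
# a frame is treated as silence when its target attribute vector is all zeros.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def loss_with_silence_undersampling(outputs, targets, keep_silence=0.05):
    is_silence = targets.sum(dim=1) == 0
    keep = ~is_silence                                  # keep every non-silence frame
    rand = torch.rand(int(is_silence.sum()))            # keep ~5% of the silence frames
    keep[is_silence] = rand < keep_silence
    return bce(outputs[keep], targets[keep])

outputs = torch.sigmoid(torch.randn(1000, 51))          # fake network outputs
targets = torch.zeros(1000, 51)
targets[:200, :4] = 1.0                                 # 200 non-silence frames
loss = loss_with_silence_undersampling(outputs, targets)
```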

3.4.3 Model saving

One interesting discovery was that, during the testing phase, PyTorch-Kaldi only uses whatever the final model is after training is complete. This would likely result in the model intended for testing being overfit to the training data. Instead, this was changed so that the validation loss was checked at each epoch. If there was an improvement, it was saved as the new best model. The final best model was then used during testing.
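In outline, the change amounts to the usual best-checkpoint pattern sketched below; the function names and checkpoint path are placeholders, not the actual PyTorch-Kaldi code.

```python
# Sketch of keeping the best model according to validation loss, rather than
# whatever model exists after the final epoch.
import torch

def train(model, train_epoch, validate, n_epochs, path="best_model.pt"):
    best_valid_loss = float("inf")
    for epoch in range(n_epochs):
        train_epoch(model)
        valid_loss = validate(model)
        if valid_loss < best_valid_loss:            # improvement: save as the new best model
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), path)
    model.load_state_dict(torch.load(path))         # use the best model for testing
    return model
```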

3.4.4 Chunk sizes

The authors of PyTorch-Kaldi recommend making data chunks of roughly 1-2 hours in length. This ensures that chunks can fit in the GPU when working with a very large dataset. A simple function was constructed to calculate the total length (in hours) of each dataset split (training, etc.). It then rounded this figure up to the nearest 10 so that there would be roughly one hour or less per chunk. These chunk lengths were then written to the configuration file. Typically, there were 90 chunks in training, with 20 each for validation and testing.

3.4.5 Network structure

The PyTorch-Kaldi framework provides some configurations designed for use with TIMIT. Since this project also involves phone prediction, it was decided that these would provide a suitable baseline configuration. Two main options were considered for the baseline structure: an RNN or an LSTM (section 2.3). One alternative to these, a multi-layer perceptron (MLP) network, would have been comparatively faster to train. It was not deemed suitable, however, because it inherently fails to capture the temporal aspects of the input features.

An LSTM did seem like a viable option, and initial reports of relatively low validation losses per epoch seemed encouraging. However, there were two key flaws. Firstly, it was much slower to train than an RNN, by a factor of approximately three. Secondly, and more seriously, there appeared to be a memory problem somewhere in the way it had been set up in the PyTorch-Kaldi framework. With all configurations at their default settings, it would train for four epochs, before unexpectedly running out of memory when allocating the LSTM outputs on the fifth. This issue persisted even when saving and restarting at this point, and it was not feasible to find the source of the problem.

Consequently, an RNN was chosen as a compromise. The baseline structure is shown in Figure 3.2. While not as good as the LSTM in terms of validation loss, it was considerably faster to train (making more experiments possible), and did not experience the same memory problems. The general principles behind the project still apply, and future work could replace the RNN with an LSTM or any other recurrent structure.

A very similar structure was used for the attribute network (Figure 3.3). In this case, the final layer was 51-D (one output per attribute), and a sigmoid function was applied to it, thus forcing the values for the attributes to be between 0 and 1. BCE loss could then be utilised during training (section 3.4.2), and phones could be predicted from these attribute values during testing (section 3.5). The reason for not using CTC loss, as Li et al. [2018] did, was in brief due to the limitations imposed within PyTorch-Kaldi; the problems and potential future solutions are discussed later in section 5.1.4.

[Figure: input features feed into recurrent layers (ReLU), then linear layers with a softmax over context-dependent and raw phones; NLL loss is used during training and predictions are taken during testing.]

Figure 3.2: Baseline RNN structure. Main layers shown on the left; activation functions shown on the right.

[Figure: input features feed into recurrent layers (ReLU), then a linear layer producing 51 articulatory features with a sigmoid activation; BCE loss is used during training and phone prediction during testing.]

Figure 3.3: Attribute network structure

3.4.6 Gradient issues

It was found that, on occasion, around the 7th–10th epochs, the model would suddenly output NaN (not a number). This appeared to be an issue with the gradients. There was a gradient clipping function in the original code for core.py, but it had been commented out by default. Enabling this function clipped gradients to 0.1, which seemed to resolve the problem for the baseline network.

However, the problem persisted when using a network for attributes. One possible reason is related to the fact that the attribute network uses a sigmoid activation for the final layer, whereas the baseline network uses softmax. In addition, by using BCE loss instead of NLL loss, the actual values of the loss were considerably lower (e.g. 0.09 vs. 3.2 on the same epoch). This may result in the validation loss reaching a local minimum, where the sigmoid gradients become saturated and end up passing spurious weights to the rest of the layers during backpropagation, even with the gradient clipping. If that were the case, then it would be fine to use the best current model, since future training would be unlikely to significantly improve performance beyond this point. While there may be another root cause for the problem, there was unfortunately not much else that could be done to fix it. When it did occur, the best model trained so far was the one used for evaluation.
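For reference, the re-enabled clipping corresponds to something like the following standard PyTorch call, with the threshold of 0.1 mentioned above; the exact function used inside PyTorch-Kaldi's core.py may differ (e.g. norm-based rather than value-based clipping), and the tiny model here is only a stand-in.

import torch
import torch.nn as nn

model = nn.Linear(69, 51)                                  # stand-in for the real network
loss = torch.sigmoid(model(torch.randn(4, 69))).sum()
loss.backward()

# clip gradients so that no single value exceeds 0.1 in magnitude
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.1)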

3.5 Predicting phones from attributes

One problem with the work by Li et al. [2018] on a universal phonetic model (on which much of this project was originally based) was that it required a signature matrix for each language. This consisted of a mapping between the feature vectors and the language’s phonemes, used to transform the attribute outputs into phones. The issue here is that this would not work if the language’s phones were unknown to begin with.

3.5.1 Distance metrics

Different methods were considered for how to predict phones from their attributes, as this was not a trivial problem to solve. At first, simply calculating the cosine or pairwise distance between the outputs and each phone was considered. The predicted phone for each output would be whichever one was closest (see Figure 3.4). However, it was realised that this still requires the phones to be known beforehand, as with the signature matrix.

3.5.2 Initial split

Since the phonetic attributes are in general very different between vowels and consonants, the first thing that needed to be done was to split these apart, while also allowing for silence. If this splitting was not done, it was observed that a select few of the vowels and consonants would heavily dominate the predictions, such as /a/, /i/ and /d/.


Figure 3.4: Simple distance scoring between test vector and all possible target phone vectors.

The method chosen was to sum the values of the vowel and consonant attributes. If the sum of the two was greater than 0.5, then there was (approximately) a > 0.5 chance that the output was not silence. Any outputs below this threshold were marked as silence. The rest of the outputs were split based on whether or not the vowel attribute was larger than the consonant one. The end result was three sets: silence, vowels and consonants.
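A sketch of this split on a batch of attribute outputs is given below; the positions of the “vowel” and “consonant” attributes in the 51-D vector are placeholders.

import torch

VOWEL_IDX, CONSONANT_IDX = 0, 1          # hypothetical attribute positions

def split_outputs(outputs):
    """outputs: (N, 51) sigmoid activations. Returns three boolean masks."""
    vowel = outputs[:, VOWEL_IDX]
    consonant = outputs[:, CONSONANT_IDX]
    silence = (vowel + consonant) <= 0.5           # below threshold: silence
    is_vowel = ~silence & (vowel > consonant)      # otherwise, the larger wins
    is_consonant = ~silence & ~is_vowel
    return silence, is_vowel, is_consonant

outputs = torch.rand(6, 51)
sil, vow, con = split_outputs(outputs)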

3.5.3 Decision trees

Having split the outputs into vowels, consonants and silence, one option that was attempted for predicting the specific phones was to create binary decision trees. These would split on whichever attribute happened to divide the remaining phones most evenly. For example, whether or not vowels had the “unrounded” attribute split them almost exactly in half. The idea was that if the output value for an attribute was > 0.5 it would be counted as present (and ≤ 0.5 as absent), and the tree was followed down until a phone was reached. See Figure 3.5 for an example.

The issue with this method was that the binary tree could end up becoming very deep, where in some cases a phone sat at the bottom of nine splits. This meant that the phones towards the bottom were very unlikely to be picked, since there were so many other phones that could be selected first along the way (for instance in Figure 3.6).

The next attempt was to perform splits based on the general attribute categories, such as “place” and “manner” for consonants, and then take the maximum value each time. For an illustration of how this would work for vowels, see Figure 3.7. This somewhat mitigated the problem of unfair bias against certain phones since, for example, no one value for “place” was higher in the tree than another.

[Figure: a small binary tree splitting first on “unrounded?”, then on “front?”/“back?”, then on attributes such as “near-open?”, “near-close?” and “central?”.]

Figure 3.5: Example binary decision tree for classifying vowels from attributes.

[Figure: a deep binary tree in which phones such as /æ/, /ɪ/, /e/ and /ɛ/ are reached after a few splits, while /a/ and /i/ sit at the very bottom.]

Figure 3.6: Bad binary tree where the /i/ and /a/ phones are much too far down and face a severe disadvantage in terms of likelihood of being predicted.

[Figure: splits by attribute category, Height (open, near-open, open-mid, close-mid, near-close, close), then Backness (front, central, back), then Rounding (rounded, unrounded), leading to phones such as /ø/ and /e/.]

Figure 3.7: Part of a decision tree for vowels, where the highest scoring attribute path is chosen at each step

The complication in this case was that consonants in particular often (but not always) had multiple values for some attributes, e.g. for /b/, “place” is both “bilabial” and “labial”. Only taking one maximum would not be sufficient to distinguish between every phone. Furthermore, the fact that some phones did indeed only have one value for a category would make the tree uneven (thus causing the same problems as the binary tree), and very complicated to build. It would also not be possible to implement this very efficiently, since it would require taking the outputs out of GPU memory to put each of them through the decision tree.

3.5.4 Universal scoring

The final method, which turned out to be arguably one of the simplest, was to merge all of these ideas together. First, vectors for each vowel and consonant in the universal phone set were found. These vectors consisted of the indices in the attribute vector at which each of the phone’s attributes occurred. For instance, if “unrounded” occurred at index 30 in the attribute vector, then vowels with this attribute would include the number 30 in their vectors.

Attribute extensions (such as “long” for vowels) were included by creating identical vectors with the extension index added. So, for instance, each vowel vector had a version with “long” (index 29) appended to the end. Only one extension was added to each base phone (i.e. not multiple extensions) to prevent an exponential blow-up in the number of possible vectors. All observed phones only ever had one extension, so while it is theoretically possible for a phone to have more than one, this would at least seem to be sufficiently rare that such a possibility was not worth including. Furthermore, extensions were only added if they existed in the training data (as it would be impossible to predict an unseen attribute).

Then, at first, each phone in the resulting lists of consonant and vowel phones was assigned its particular phone number if it actually occurred in the phones.txt file, and 0 (epsilon) otherwise. This resulted in a list of vowel and consonant phones, each consisting of the phone number followed by a list of indices where its features occurred in the output. From this point, it was then very straightforward to obtain a score for each phone by summing the values of the output at each phone’s feature indices (Figure 3.8). For normalisation, this score was divided by the number of attributes so that if, for example, one phone had six attributes and another only four, the scores were still comparable between them. From these scores, the highest scoring one could be selected and the corresponding phone number given (or 0 if it was unseen).


Figure 3.8: Example of scoring method with universal phones
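As a concrete illustration of the scoring in Figure 3.8, the sketch below handles a single frame with made-up phone numbers and attribute indices; the real implementation performed the same index slicing and summing over whole batches of tensors without leaving the GPU.

import torch

# hypothetical candidate phones: phone number -> indices of its attributes in the
# 51-D attribute vector (index 29 stands in for the "long" extension; all values
# here are made up)
candidates = {
    35: [34, 21, 45, 29],
    110: [34, 21, 45],
}

output = torch.rand(51)          # sigmoid activations for a single frame

scores = {}
for phone, idx in candidates.items():
    # sum the activations at this phone's attribute indices and normalise by the
    # number of attributes, so phones with different attribute counts stay comparable
    scores[phone] = output[idx].sum().item() / len(idx)

predicted_phone = max(scores, key=scores.get)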

The key difference between this method and the simple distance scoring in section 3.5.1 was that, if desired, a universal phones.txt file could be created with all possible phones and then used when generating outputs (although there would now be over 580 possible phones). At first, this was avoided, in an attempt to allow standard Kaldi decoding to work, since it cannot handle unseen phones. However, once Kaldi decoding failed (see section 3.6), the universal prediction method was applied instead. A further advantage of this approach over the decision tree method is that the index slicing and other operations could all be done efficiently on tensors, without needing to detach them from the GPU environment.

Finally, it would be possible to extend this method to predict a probability distribution across phones, using softmax on the phone scores, but this was not done for now, for reasons which will be explained in the following section.

3.6 Evaluation

3.6.1 Issues with decoding

For the shallow monophone/triphone models and the baseline RNN, universal phones were not being predicted. Thus, it was straightforward to use standard Kaldi decoding and scoring scripts to obtain the PER. In the case of articulatory features, this was not possible. Since the network predicted phones rather than context-dependent phones, the Kaldi language models could not accept these inputs. Even monophone language models did not work, since they used probability density functions rather than raw phones. The effect was that full decoding and scoring was not possible for the attribute networks in this project, as the only alternative would be to create specialised decoding and scoring scripts from scratch. Such an approach was considered, but it was decided to avoid it for a couple of reasons.

Firstly, there was a problem with obtaining reliable probabilities for silence phones, since subtracting the sum of the vowel and consonant values (section 3.5.4) was not necessarily a very accurate way of doing it. In hindsight, it would likely have been beneficial to include “silence” as one of the attributes in the feature vectors, as King et al. [2004] did. This alone would not be a completely insurmountable challenge.

The main issue lay with decoding the output probabilities (which could be obtained by using softmax on the scores). To use the Viterbi algorithm, for instance, to find the best path would require a transition or language model for the universal phone set. In a bigram model, this would give the likelihood of seeing phone y given that the preceding phone was x. More preceding phones could be used in the case of a trigram or n-gram model. An assumption that would need to be made is that if the previous phone was unknown, then all other phones are equally likely.

However, the main problem lay with the case where the previous phone was known. What should the probability of an unknown phone be in that scenario? Unlike in word-level language models, mapping it to an unknown (<unk>) token, or equivalent, was not an option, since then it would not be possible to predict unknown phones (which was the whole point). Considering all unknown phones as equally likely would not be ideal either, since such a distribution does not apply to the phones of any real unknown language. In addition, some of the transition probabilities from known to known phones would be available. How should these be weighted, given that there would be many more unseen phones than seen ones? Putting too little weight on known transition probabilities would result in the unknown ones “swamping” the decoding, and being predicted far too often. Any rare (but known) phones would almost never be predicted. On the other hand, putting too much weight on them would mean that unknown phones would become so improbable as to hardly ever be predicted, even if their score was really high.

In summary, building a language model with unknown phones is not a straightforward task, and would require numerous design decisions, some of which would be difficult to justify. In fact, being able to use CTC loss would be of great benefit here, as it does not require a language model to produce phone-level outputs. However, that was not possible within the current experimental framework, so some different ways to evaluate the network output had to be considered. These are discussed in the following section.

3.6.2 Alternative evaluation metrics

Some evaluation was still possible nonetheless. The best compromise was to evaluate two aspects of the model: one based on the best alignments and the other based on the transcriptions themselves.

Firstly, the predicted phone outputs and their target labels (from the best alignments) were recorded in text files. It was then possible to compare the two and create a large confusion matrix of true vs. predicted phones for each frame. This would not be entirely accurate, since there was already some error in the phone-frame alignments. Nonetheless, it provided some sense of how effectively the network had been trained at the frame level, and gave an indication of how well phones could be predicted from attributes alone. It was observed that a few phones tended to occur very frequently, which made it difficult to see the values of other phones in the confusion matrix diagram (Figure 3.9). To fix this issue, normalisation was done along both axes to produce two diagrams which were considerably easier to view.

[Figure: three confusion matrices (unnormalised, y-axis normalised, x-axis normalised) with true phones on the vertical axis and predicted phones on the horizontal axis.]

Figure 3.9: Example confusion matrices. The first only uses raw counts; the second and third are normalised with respect to precision and recall respectively.
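A small sketch of the two normalisations using NumPy, assuming a hypothetical counts matrix cm with true phones along the rows and predicted phones along the columns (matching the axes in Figure 3.9).

import numpy as np

# toy counts: rows = true phones, columns = predicted phones
cm = np.random.randint(0, 50, size=(5, 5)).astype(float)

# normalise each row (sum over predicted phones): roughly recall per true phone
recall_view = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

# normalise each column (sum over true phones): roughly precision per predicted phone
precision_view = cm / np.maximum(cm.sum(axis=0, keepdims=True), 1)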

Normalising along the axis for “predicted phones” was roughly equivalent to showing the recall for each of the true phone classes. For instance, if the true class was /p/, then a high recall would mean that out of the labels which were /p/, most were correctly predicted as /p/. In contrast, normalising along the axis for “true phones” was roughly equivalent to the precision. That is, if the phone /p/ was predicted, how often was this prediction correct? Both metrics were useful for evaluating the model’s performance.

Secondly, for a higher-level comparison, the target phone distribution was taken from the target transcription. This consisted of counting the number of occurrences of each of the phones in the transcript, and normalising to a distribution over [0, 1], or [0, 100] (as a percentage). It is worth bearing in mind that the transcription itself was not entirely accurate (see section 3.2.5), but it would certainly be more accurate than the frame-level alignments. To get the predicted phone distribution, any entries in the predicted phone outputs which were not repeated consecutively were removed. For example, in the sequence of phones [a, a, b, c, c], the “b” would be removed as it was likely a mistake: the frame window is only 25 ms, so an actual phone would be expected to span multiple consecutive predictions. Duplicate entries were then collapsed, leaving a total number of predicted phones similar to the total number in the transcript. These distributions could then be compared to one another to see how similar they were, both in terms of the absolute difference between the “truth” and the predictions, and the relative difference (as a percentage) between the two.

Thirdly and finally, it was possible to find the most common mistakes in the confusion matrix. From this, it could be calculated which phones tended to be mixed up most often, which is roughly equivalent to substitution errors.
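A sketch of the de-duplication step used for building the predicted phone distribution; the input is a hypothetical frame-level sequence of predicted phone labels.

from itertools import groupby
from collections import Counter

def collapse_predictions(frame_phones):
    """Drop phones that appear for only a single frame (likely mistakes, given the
    25 ms window), then collapse consecutive repeats into one entry each."""
    runs = [(phone, len(list(group))) for phone, group in groupby(frame_phones)]
    return [phone for phone, length in runs if length > 1]

predicted = collapse_predictions(["a", "a", "b", "c", "c", "c"])
# -> ["a", "c"]; counts can then be normalised into a distribution
distribution = Counter(predicted)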

Chapter 4

Experiments

4.1 Experiment 1: Shallow models

4.1.1 Research questions

There was one simple question here: what parameter settings were good for building a multilingual triphone model? This was needed to act as a baseline and provide alignments for later experiments.

4.1.2 Setup

Kaldi’s TIMIT scripts were adapted for the project, since both tasks involve phone recognition. The first step was to build a monophone model (section 2.2.2). Next was a simple triphone model using delta and delta-delta features (section 2.2.3), followed by a triphone model incorporating LDA and MLLT. The final triphone model used LDA, MLLT and SAT (section 2.2.4). As the complexity increased, the previous model’s alignments were used as a starting point for the next one.

For monophone modelling, the main hyperparameter was the number of Gaussians to use for the GMMs. For triphone modelling, the two key hyperparameters were similarly the number of Gaussians, and also the number of leaves (when building decision trees for clustering triphones). In the triphones with LDA + MLLT, the three features before and after each frame were spliced together, and the dimensionality reduced from 69 × 7 (i.e. 483) down to 40 by LDA, as recommended in the Kaldi scripts for TIMIT. All other hyperparameters were left at their default settings.

In Kaldi, the recommended settings for TIMIT were 1000 Gaussians for monophone models, and 15000 Gaussians with 2500 leaves for each of the triphone ones. To test whether these were reasonable, a simple search around these parameters was performed. The number of leaves was varied over [2000, 2500, 3000]; the number of Gaussians over [10000, 15000, 20000]. The selected languages were German, Swahili and Ukrainian. These provided a wide range of phones (93 non-silence ones) and were sufficiently different from one another as to create a good multilingual model.

Decoding and scoring was a slow process, even for three languages, and could have caused a major bottleneck preventing progress with the deeper, RNN-based models. Thus, it was decided to perform the parameter search on these three languages only, rather than on all the languages that would be used later (German had not yet been dropped at this point). Once a good set of parameters had been found, however, the final triphone model was trained on all the languages in use, and evaluated on the validation and test sets separately. Doing so allowed for comparisons with the network models.

4.1.3 Results Results for the monophone model are shown in Figure 4.1. Results for the triphone models are summarised in Figure 4.2.

[Figure: PER (%) on the validation set plotted against the number of Gaussians (500, 1000, 2000) for the monophone model.]

Figure 4.1: PER for monophone models on validation set.

From these results, it was clear that the number of Gaussians was considerably more important than the number of leaves, which only made a relatively small difference. A more extensive search could have been done to find at what point the number of Gaussians or leaves became too many. However, this would have been time-intensive, and not hugely relevant to the project. The findings thus far indicated that these parameter settings would be sufficient for building basic shallow systems.

The final pipeline for generating initial alignments is shown in Table 4.1. The number of leaves was not increased for most of the triphone models, because it was not observed to have a significant impact on the PER.

A multilingual triphone model was trained with these settings on the languages that would be used for the network. In addition, monolingual triphone models were built for each language with the same settings. These were used when testing universal phone prediction (section 4.3), since it was necessary to have frame-to-phone alignments that were as accurate as possible (due to the evaluation issues in section 3.6).

[Figure 4.2 shown here as a table of PER (%) values on the validation set]

Leaves   Basic triphone            Basic + LDA + MLLT        Basic + LDA + MLLT + SAT
         10000   15000   20000     10000   15000   20000     10000   15000   20000
2000     36.5    35.5    35.0      34.4    33.3    32.7      30.4    29.6    28.9
2500     36.4    35.2    34.7      34.3    33.0    32.5      30.1    29.1    28.5
3000     36.3    35.2    34.5      34.2    33.0    32.3      30.0    29.0    28.4

Figure 4.2: PER for triphone models on the validation set. “Basic” is the standard triphone model with delta and delta-delta features included. Each successive model was built on top of the best previous one. “Gaussians” is the number of Gaussians used for the GMMs; “Leaves” is the number of leaves in the decision tree for clustering triphones.

Model                         # Gaussians   # Leaves
Monophone                     2000          -
Triphone (basic)              20000         2500
Triphone (LDA + MLLT)         20000         2500
Triphone (LDA + MLLT + SAT)   20000         3000

Table 4.1: Pipeline of models and settings for generating initial frame-to-phone alignments

However, these monolingual models could not be used or tested in the baseline network, due to their context-dependent phone IDs being different from the multilingual ones.

The results for each of the languages in these models are summarised in Table 4.2. Note that the average PER for the multilingual model was greater than in Figure 4.2; this was likely because more languages were being used, and there were slightly more non-silence phones (117 vs. 93).

4.2 Experiment 2: Baseline network

4.2.1 Research questions

There was one main question addressed in this experiment: what kind of baseline network structure would be appropriate for predicting phones (without attempting to be a universal model)? Results from this network could then be compared to the previous triphone model (section 4.1), and the subsequent network designed to predict universal phones (section 4.3).

                  Multilingual PER (%)      Monolingual PER (%)
Language          Validation   Test         Validation   Test
Bulgarian         31.47        31.90        20.87        20.67
Croatian          41.06        41.59        24.82        26.90
Hausa             28.50        29.44        12.67        16.46
Polish            35.74        33.77        24.58        22.17
Swahili           31.29        27.81        20.35        16.88
Swedish           49.98        47.00        37.29        33.54
Turkish           41.87        41.50        24.01        25.32
Ukrainian         41.97        43.27        21.23        22.52
Average           37.74        37.04        23.23        23.06

Table 4.2: Summary of PERs for best multilingual and monolingual triphone models on validation and test sets.

4.2.2 Setup

A diagram of the network structure was shown earlier in Figure 3.2, and its exact parameters are detailed in Table 4.3. The RNN layers are marked as “repeated” in that table since they were the focus of this experiment; their number was varied from 3 to 5. Batch normalisation and a dropout rate of 0.2 were applied to each of the RNN layers (section 2.3.5).

Due to complications with setting up the network (most of which were detailed in section 3.4), there was not sufficient time to run many experiments with the baseline network. The focus of the project, after all, was on a network for articulatory features, so it was decided to prioritise experimental time for that one (section 4.3). The optimisation parameters were left at their default PyTorch-Kaldi configuration values (Table 4.4), as again there was not enough time to perform an extensive search over possible settings. In addition, if they were deemed reasonable for TIMIT [Ravanelli et al., 2019], they were unlikely to be a major stumbling point.

Layer            Dimensionality   Activation function
Input            69               -
RNN (repeated)   550              ReLU
Linear-1         # phones         Softmax
Linear-2         # pdfs           Softmax

Table 4.3: Baseline network structure. NB: both linear layers use the output of the last RNN layer as input

4.2.3 Results

The validation loss during training is shown in Figure 4.3. It can be seen that the losses are fairly comparable across the layer counts, though three layers was the worst performing. The reasons for the sudden spikes are unclear; these would happen occasionally, but afterwards the training would appear to return to a normal path, as seen in the line for the 3-RNN-layer network.

Parameter        Symbol   Value
Learning rate    η        0.00032
Moving average   γ        0.9
Stabilisation    ε        1 × 10^-8

Table 4.4: Parameter settings for optimisation (Equation 2.1 and Equation 2.2)
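For reference, these settings correspond to something like the following optimiser construction in PyTorch, assuming γ maps onto RMSprop's alpha (the coefficient of the squared-gradient moving average); the model here is only a stand-in.

import torch
import torch.nn as nn

model = nn.Linear(69, 550)                 # stand-in for the real network
optimiser = torch.optim.RMSprop(
    model.parameters(),
    lr=0.00032,    # learning rate (eta)
    alpha=0.9,     # moving-average coefficient (gamma)
    eps=1e-8,      # stabilisation term (epsilon)
)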

[Figure: validation loss (roughly 2.6 to 4.0) over 20 epochs for networks with 3, 4 and 5 RNN layers.]

Figure 4.3: Validation loss during training for the baseline network with different numbers of RNN layers. Some of the early stopping seen was due to gradient problems.

The PER for each language in the validation dataset is shown in Figure 4.4. Increasing the number of layers resulted in a slight increase in the PER for Croatian and Turkish, but a slight decrease in the PER for Polish and Swahili. However, there was a notable effect on two of the languages: using 5 layers increased the PER for Bulgarian by about 10%, and decreased the PER for Ukrainian by roughly the same amount. Why exactly there would be such a dramatic difference is quite unclear. As seen before in Figure 3.1, these two languages were the most heavily overlapping in phones, so one would have expected both to perform reasonably well with a lot of relevant data available.

Some analysis of where the errors occurred is also useful. Figure 4.5 shows the most frequent substitution mistakes made for each language. There were a couple of key aspects observed.

Firstly, vowels were by far the most frequent mistakes made, with the top three mistakes for each language including at least two vowels.

[Figure: best PER (%) on the validation set per language (BG, CR, HA, PL, SA, SW, TU, UA, and the average) for the 3-, 4- and 5-layer networks and the triphone model.]

Figure 4.4: Best PER on the validation set for each language with different numbers of RNN layers in the baseline network. The “Triphone” bars show the multilingual triphone model for comparison; “Avg” is the average PER across all languages.

[Figure: per-language bar charts of the five most common substitution errors, as a percentage of all errors; most of the confused phones are vowels such as /a/, /e/, /i/, /o/ and /u/.]

Figure 4.5: Top 5 most common substitution errors in the validation set for each language on the 4-layer RNN network

In some ways, this could be expected. There is a reason why most languages have a relatively small number of vowels: for humans, distinguishing between similar-sounding vowels requires more effort. It is easier to use a relatively small number of vowels (where they each sound very different), so some vowel phones which are technically distinct from one another may become part of the same phoneme.

The reason why all of this is relevant is that only one pronunciation of each word is used in the transcripts. This does not allow for differences between speakers due to accent, country of origin, or gender. Another issue is that, since each word is treated separately, this does not allow for changes in pronunciation due to how some words blend together, which often changes the vowel phone. For instance, in English, the “e” in “the” can sound like “uh” or “ee” depending on whether the next word starts with a vowel or not. Both are distinct phones, but part of the same phoneme in this case.

Secondly, with regard to consonants there were some expected mistakes. For instance, /k’/ (an ejective) was predicted as /k/. This is not particularly surprising given that the difference between the two is not relevant in most languages (Hausa was the only one of the eight to have it). Most of the other languages do have /k/, but even if specific usages of it happened to be ejective, they would not be transcribed as such (and so the ejective would appear to be far more rare than it actually is). Another typical mistake was /m/ vs. /n/, but these are sometimes tricky even for English speakers to tell apart.

Further graphs, displaying insertion and deletion errors as well for all the different numbers of layers, are attached in the appendix.

It was decided that 4 RNN layers would be a reasonable number to use, since:

• The validation loss for the 3-layer network was noticeably higher than for both the 4- and 5-layer ones
• The 5-layer network was slower to train and did not appear to offer a substantial improvement over the 4-layer one.

The test set PER was then compared between the current best multilingual triphone model and the 4-layer RNN. The results are shown in Figure 4.6. While the PER was improved for Turkish and Croatian, it was not for the other languages. However, this may be because RNNs are a relatively basic and limited form of temporal deep learning. As discussed earlier (section 3.4.5), using LSTM layers seemed promising, but there were other issues associated with that approach.

4.3 Experiment 3: Attribute network

4.3.1 Research questions

There were two main questions in this experiment. Firstly, would the proposed network structure and scoring method be able to successfully predict phones? Secondly, would FBANK features outperform MFCCs with this network structure?

4.3.2 Setup

Having decided to use 4 RNN layers in section 4.2, the majority of the structure was kept the same as the baseline network, with a few tweaks (Table 4.5). A diagram was shown previously in Figure 3.3. The optimisation parameters were also the same as before (Table 4.4). Frame-to-feature alignments were obtained from the multilingual triphone model described in section 4.1.

[Figure: best PER (%) on the test set per language (BG, CR, HA, PL, SA, SW, TU, UA, and the average) for the network and triphone models.]

Figure 4.6: Test PERs compared between the 4-layer RNN and the multilingual triphone model.

Layer      Dimensionality   Activation function
Input      69               -
RNN (×4)   550              ReLU
Linear     51               Sigmoid

Table 4.5: Network layer dimensionalities and activations
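A sketch of a network with the shape in Table 4.5 is given below, written directly in PyTorch rather than through PyTorch-Kaldi's configuration files, and omitting batch normalisation and dropout for brevity.

import torch
import torch.nn as nn

class AttributeRNN(nn.Module):
    """Four stacked RNN layers (550 units, ReLU) followed by a 51-D sigmoid layer."""

    def __init__(self, n_inputs=69, hidden=550, n_attributes=51, n_layers=4):
        super().__init__()
        self.rnn = nn.RNN(n_inputs, hidden, num_layers=n_layers,
                          nonlinearity="relu", batch_first=True)
        self.out = nn.Linear(hidden, n_attributes)

    def forward(self, features):
        # features: (batch, frames, 69) acoustic feature vectors
        hidden_states, _ = self.rnn(features)
        return torch.sigmoid(self.out(hidden_states))   # per-frame attribute probabilities

net = AttributeRNN()
probs = net(torch.randn(2, 100, 69))   # shape (2, 100, 51)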

One issue that arose when using FBANK features was a mismatch between the number of feature vectors and the number of target labels. Even though the FBANK features were extracted using the same frame window and time step, there were consistently two more labels than feature vectors for each utterance. Why exactly this was the case was unclear. A solution which addressed the “symptoms” of the problem (rather than whatever the underlying cause happened to be) was to remove the last two labels for each utterance. These labels were consistently observed to be silence anyway, so it did not seem at first that too much information would be lost, if any.

While this fix enabled training of the articulatory network with FBANK features, it oddly did not do so for the baseline network (which was attempted for comparison). Instead, in the baseline case, even with all other settings left at the same values, there was a zero division error when initialising the network model. Why this occurred could not be determined, since everything else appeared to be exactly the same.
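A sketch of the workaround, assuming hypothetical per-utterance arrays of features and labels:

import numpy as np

def trim_labels(features, labels):
    """Drop the trailing labels (observed to be silence) so that the number of
    labels matches the number of FBANK feature vectors for the utterance."""
    extra = len(labels) - len(features)
    if extra > 0:
        labels = labels[:-extra]    # in practice this removed the last two labels
    return features, labels

feats = np.zeros((498, 69))         # hypothetical FBANK frames
labs = np.zeros(500, dtype=int)     # two extra labels
feats, labs = trim_labels(feats, labs)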

4.3.3 Results

The loss during training is shown in Figure 4.7. It can be seen that the loss plateaus from the 5th to the 20th epoch, before suddenly spiking at the end. As with the spike seen in the baseline network loss (Figure 4.3), the exact reason for this spike is unclear. It could be overfitting, though this seems unlikely given that the training loss also increases substantially at the same time. Perhaps instead there was an error in adjusting the weights during backpropagation, due to saturation of the sigmoid function.

[Figure: training and validation loss over 20 epochs; one panel shows the full range of the loss and the other zooms in on the flat portion of the curve.]

Figure 4.7: Loss during training for articulatory feature network. The first graph shows the entire range of the loss; the second is zoomed in on the “flat” portion of the first graph.

The results in terms of the frame-level phone accuracies, when either MFCC or FBANK features were used, are shown in Table 4.6. They were assessed against the best monolingual triphone models, since the PERs for those targets were lower (Table 4.2) and so would provide a more reliable phone assessment. Since “silence” is a fairly common “phone” (and it is not particularly useful to predict it in the context of this project), the accuracy without silence labels was also included.

It was clear from Table 4.6 that the attempted solution for fixing the issues with training on FBANK features had not worked, as the accuracy was simply atrocious for most languages. In particular, for Ukrainian, the performance was little better than random chance. However, the results with MFCC features were encouraging. While the accuracies were not incredibly high, they indicated that prediction using attributes was working to an extent. Considering that there were 117 phones in the training set, the fact that the network could predict the labels correctly 57% of the time for the best languages using only phonetic attributes is striking.

An example pair of confusion matrices for Croatian is shown in Figure 4.8 and Figure 4.9. NB: to make the charts easier to read with the large number of phones, the labels were kept in the X-SAMPA format for the most part.

                  MFCC                      FBANK
Language          All      Non-silence      All      Non-silence
Bulgarian         0.571    0.537            0.081    0.081
Croatian          0.572    0.572            0.120    0.120
Hausa             0.425    0.345            0.126    0.125
Polish            0.479    0.475            0.072    0.072
Swahili           0.469    0.371            0.135    0.123
Swedish           0.359    0.359            0.040    0.040
Turkish           0.529    0.529            0.111    0.111
Ukrainian         0.434    0.304            0.028    0.028

Table 4.6: Frame-level phone prediction accuracies on the test set for the articulatory feature network using MFCCs or FBANK features as input. “All” refers to all phones, including silence.

Accordingly, when referring to a specific phone, the IPA version will also be provided, e.g. dZ (/dʒ/). Confusion matrices for the rest of the languages are attached in the appendix; they are excluded here due to space constraints. From these figures, a couple of conclusions can be drawn.

[Figure: precision-normalised confusion matrix for Croatian, with true phones on the vertical axis and predicted phones (drawn from the full universal phone set) on the horizontal axis.]

Figure 4.8: Confusion matrix for Croatian. Normalised along y-axis to show precision

Firstly, the recall rate was generally quite good for most phones. However, there were a few, such as L (/ʎ/), which tended to be missed almost completely in favour of other phones sharing similar attributes, e.g. /l/. Secondly, the precision of the predictions was quite scattered. In some ways this was to be expected, since there were so many possible phones that could be predicted; recall should fare better in comparison, since it only requires the predictions to be correct within a smaller group of phones.

[Figure: recall-normalised confusion matrix for Croatian, with true phones on the vertical axis and predicted phones on the horizontal axis.]

Figure 4.9: Confusion matrix for Croatian. Normalised along x-axis to show recall

One interesting pattern is present in Figure 4.8: when t_j (/tʲ/) was predicted, the true label was often just /t/. Similarly, z (/z/) and z_j (/zʲ/) usually corresponded to just /z/, and the same pattern can be seen for various phone “extensions”. Again, this was encouraging, as making that kind of mistake would not be massively detrimental when trying to obtain a phone set for a language. Certainly, if two phones were predicted (one with an “extension” and the other without), it would give cause to investigate whether there really was a significant distinction.

Similarly to the experiment in section 4.2, it was possible to see where the errors occurred most frequently. The main difference is that in this case the errors were at the frame level, rather than the word or phone level, so the two experiments are not strictly comparable. There are two main points to note. Firstly, a lot of the errors involved vowels being confused. Unfortunately, this kind of mistake is quite likely to happen between vowels which only differ in one attribute. Secondly, a large number of the errors were cases where a phone should have been silence but was predicted to be non-silence. This suggests that the way silence was split from vowels and consonants in section 3.5.2 was possibly too harsh, and led to not enough silence being predicted. Or, it could be that filtering out 95% of the silence during training (section 3.4.2) was also too much, and slightly more should have been kept. It may even be a combination of the two.

One final area of analysis was the distribution of phones. An example of this is shown for Turkish; full results are again available in the appendix. The phone distribution comparison is shown in Figure 4.11, and the differences between the two in Figure 4.12. Note that only predicted phones which were within the Turkish true phone set were included; any which did not match were left out. From examining the absolute differences, the distributions are in general quite close, save for about four phones.

[Figure: per-language bar charts of the five most frequent frame-classification errors (raw counts); silence being predicted as a non-silence phone features heavily.]

Figure 4.10: Top 5 mistakes in frame classification for each language. x → y means that x (the “true” phone) was predicted as y

Notably, M (/ɯ/) was rarely predicted despite being relatively common in Turkish. This may be because it is unique to Turkish among the training languages, so other vowels common to the languages tend to dominate in attribute prediction.

Looking at the relative differences, there is also clearly an issue with predicting phones such as J\ (/ɟ/) which are infrequent in Turkish. Again, one cause is that these phones occur so rarely that their attributes are not predicted frequently enough. A more specific reason may be that the “plosive” attribute (part of J\) was quite rare, so the network was unable to learn it. How to go about predicting rare attributes will be discussed in the section on future work (section 5.1).

In conclusion, the prediction methods applied here showed promise, though there are numerous improvements and fixes which could be made.

4.4 Experiment 4: Cross-lingual investigations

4.4.1 Research questions

The last experiment aimed to explore two related questions. Firstly, how well could the model predict phones for a language it had not been trained on? This was, after all, the primary motivation behind this project. Secondly, how would the model perform on unseen languages within a particular language family, when it was trained exclusively on that same language family? Or, would it be better to have a wider range of languages during training?

[Figure: bar chart comparing the percentage of occurrences of each Turkish phone in the transcripts (“true”) and in the predictions.]

Figure 4.11: Comparison of phone distributions between the Turkish transcripts and the predicted phones

[Figure: per-phone absolute differences and relative differences (%) between the true and predicted Turkish phone distributions.]

Figure 4.12: Absolute and relative differences between true and predicted phone distributions. NB: the graph with relative differences is cut off at -100; this is because if the original value is very small, then the percentage difference is enormous (and makes it very difficult to see the rest).

4.4.2 Setup

There were two main datasets used. The first included all eight languages. The second included only the four Slavic languages; for this one, a separate multilingual triphone model was trained as a baseline. In each case, one language was left out during training and the final model was tested on the held-out language. Due to time constraints, it was not possible to repeat this for every language. Table 4.7 summarises the two datasets and the languages for which this process was completed.

Since it was believed at the time of this experiment that there was still a way to get decoding with Kaldi to work for universal phones, the best multilingual triphone model had its data split into sub-groups where one language was removed (and another where only one language remained). This was because the phones.txt file would need to be the same during both training and testing, to allow the language to be decoded in Kaldi.

Dataset   Dataset languages                                  Excluded languages
Full      Bulgarian, Croatian, Hausa, Polish, Swahili,       Bulgarian, Hausa, Swahili,
          Swedish, Turkish, Ukrainian                        Turkish, Ukrainian
Slavic    Bulgarian, Croatian, Polish, Ukrainian             Bulgarian, Ukrainian

Table 4.7: Summary of the two datasets used. The excluded languages are the ones which were individually held out during training.

Once the Kaldi decoding issues were finally realised, there was not sufficient time to retrain a triphone model for each possible set of languages. Instead, the networks were trained in the same way as before (section 4.3), minus one language in the existing multilingual data. However, final testing was done using the monolingual triphone models to provide the target labels, since that was still relatively quick to accomplish.

One last point worth mentioning is that the models trained when the Slavic languages were held out of the Slavic dataset had a lower learning rate of 0.0016 (vs. 0.0032). This was due to an earlier attempt to briefly explore changing the learning rate, which was unfortunately left in. However, the validation loss decreased to about 0.072, as it had for other models in the past, so it seems relatively unlikely that this had a great impact on the results.

4.4.3 Results

                  Included                 Excluded                 Decrease
Language          All      Non-silence     All      Non-silence     All      Non-silence
Bulgarian         0.571    0.537           0.562    0.478           0.009    0.059
Hausa             0.425    0.345           0.377    0.319           0.048    0.026
Swahili           0.469    0.371           0.316    0.274           0.153    0.097
Turkish           0.529    0.529           0.413    0.412           0.116    0.117
Ukrainian         0.434    0.304           0.332    0.272           0.102    0.032

Table 4.8: Frame-level phone prediction accuracies on the test set, for languages which were either included during training or excluded. “Decrease” is the decrease in accuracy when a language was excluded. “All” refers to all phones, including silence.

As could be expected, the accuracy for a language when it was removed from the training process (Table 4.8) was not as good as before. However, this decrease was noticeably lower for some languages. In particular, Bulgarian seemed to fare remarkably well. This is quite possibly because it shares so many phones with Ukrainian (Figure 3.1), so provided Ukrainian is in the training mix it can manage adequately.

For other languages, like Turkish, there was a substantial decrease, but again the overall accuracy was not terrible. Achieving 41% accuracy at the frame level, for an unfamiliar language and phone set, is not bad given the relatively simple network structure.

The results for Bulgarian and Ukrainian can be compared between the two datasets as well: the one with all eight languages and the one with only Slavic languages. The phone prediction accuracy on test frames is shown in Table 4.9.

                  All (inc)        All (exc)        Slavic (inc)     Slavic (exc)
Language          A        NS      A        NS      A        NS      A        NS
Bulgarian         0.571    0.537   0.562    0.478   0.593    0.527   0.481    0.461
Ukrainian         0.434    0.304   0.332    0.272   0.472    0.323   0.307    0.268

Table 4.9: Frame-level phone prediction accuracies on the test set. “All” refers to the dataset with all eight languages, “Slavic” to the dataset with just the four Slavic languages. “(inc)” = language was included; “(exc)” = language was excluded; “A” = all phones; “NS” = non-silence phones

It appears that the accuracy was mostly improved for the two languages when they were trained within the same family. Again, this could be expected, since there are fewer phones to confuse, and most will be fairly similar. An interesting point, though, is that the accuracy was always worse in the Slavic dataset than in the full one when the languages are treated as unseen. It would seem that either more training data, or simply a wider variety of languages and phonetic attributes within the data, helps to make the model more robust and adaptable to different languages.

Another area to investigate was the most frequently mistaken phones. Table 4.10 shows the top three most common incorrect predictions. Notably, “silence” features quite prominently, reinforcing the view from Figure 4.10 in section 4.3 that “silence” was being removed too harshly during prediction.

The non-silence phone mistakes were also quite similar across the datasets. This suggests that the reasons for predicting these phones incorrectly were perhaps more due to the similarity between the phones. For example, in the test languages, /A/ and /a/ are identical in attributes except that one is “front” and the other is “back”.

Language          All (inc)    All (exc)    Slavic (inc)   Slavic (exc)
Bulgarian         sil → t      o → u        sil → t        sil → t
                  sil → d      sil → d      o → u          o → u
                  E → i        E → i        sil → d        a → E
Ukrainian         A → a        A → a        A → a          sil → p
                  0 → o        sil → d      sil → t        A → a
                  sil → d      sil → n      I → i          sil → t

Table 4.10: Top 3 most frequent incorrect phone predictions. x → y means that x is the “true” phone and y is the predicted phone. “All” refers to the dataset with all eight languages, “Slavic” to the dataset with just the four Slavic languages. “(inc)” = language was included; “(exc)” = language was excluded.

A final way to assess the performance on unseen languages is to compare phone distributions, as before. Figure 4.13 and Figure 4.14 show the results for Bulgarian on test data, when it was excluded from training, for both datasets. The results when it was included in training are not shown, to prevent the graph from becoming overly crowded.

[Figure: bar chart comparing the percentage of occurrences of each Bulgarian phone in the transcripts (“true”) with the predictions from the full and Slavic datasets.]

Figure 4.13: Phone distribution comparison for Bulgarian test data. Both datasets were trained without Bulgarian.

From Figure 4.14, it is clear that both distributions are quite similar. It is interesting to see which particular phones each dataset is better, or worse, at predicting. The Slavic one struggles with /s/ and /z/ considerably more than the full dataset, whereas the full one is comparatively worse at predicting E (/ɛ/) and /i/. This was somewhat surprising: one would intuitively think that a Slavic model would be better at predicting /s/ and /z/ given their widespread use in Slavic languages, and, on the other hand, that a model trained on more languages would be better at predicting basic vowels. Why exactly this occurred is unclear, but then again it is worth taking these results with a pinch of salt, given the assumptions made in, for instance, choosing to generate distributions from transcripts or frame labels in the first place.

[Figure: per-phone differences between the true Bulgarian phone distribution and the distributions predicted by the full and Slavic datasets.]

Figure 4.14: Difference between the “true” phone distribution in test data and predicted distributions. Both datasets were trained without Bulgarian.

Chapter 5

Conclusions

5.1 Future work

In this section, various ways to continue what was started here are discussed, from data processing to the network structure itself.

5.1.1 Fixing GlobalPhone

The GP corpus was indeed very useful, but there are certain aspects which could be improved if the work were taken further. Firstly, the dictionaries for some languages, like Swedish, need to be updated to account for the transcribed stutters. This would have to be done tediously by hand, unless it were possible to do it automatically; however, it is suspected that a manual method is preferable. It was observed when adding entries for stutters that there were in some cases two plausible possibilities, e.g. two slightly different vowels. In these cases, careful comparison and some human intuition were used to make the decision.

Secondly, the problem with the Japanese dictionary needs to be resolved. It is unclear if this is only an issue with the version of GP available at the university, or if a mistake was made when compiling the corpus. Nonetheless, this particular issue was beyond the scope of this project.

Thirdly, while also a tedious task, it would undoubtedly be beneficial to go through the transcripts that contain words with multiple pronunciations, and listen to the audio. The transcript could then be labelled with the correct pronunciation. Thus, the training data would be of even better quality, and it would make training with languages that have many different pronunciations of the same words (like French) considerably easier.

5.1.2 Phonetic attribute improvements

In hindsight, it would probably have been beneficial to include a silence attribute, as Frankel et al. [2007] did in their work with articulatory features. Hopefully this would assist in making better predictions of whether or not a phone was silence than the considerably more limited method of using the values of the vowel/consonant attributes.

Regarding diphthongs, these remain problematic to predict accurately. Splitting them into two phones may be the only plausible option. If this is done, one potential solution would be to include an extra “diphthong” attribute for an additional version of any vowel which is part of a diphthong, similarly to the other “extension” attributes. Then it would be possible to split the diphthongs, and during training the model would hopefully learn to discriminate between cases where the vowel occurs alone, and where it is part of a diphthong. A potential problem that could then arise is that the pronunciation of a diphthong vowel is not exactly the same depending on the order of the vowels, e.g. /ai/ (“aye”) vs. /ia/ (“ee-ya”). The acoustic context will of course affect how a diphthong sounds as well. Nonetheless, these issues may not actually be so serious in reality.

Lastly, there was also the question of how to deal with rare attributes. One option could be to try a version of “hard negative” mining. That is, each time a particularly rare attribute is predicted incorrectly, the corresponding feature vector (or its utterance) is set aside, and once a suitable collection of these has been built up, they could be used for a special round (or rounds) of training. This would artificially increase the frequency with which rare attributes were seen by the network, and would hopefully help them to be predicted more reliably as a result.

5.1.3 Replacing PyTorch-Kaldi

One practical improvement would be to move away from PyTorch-Kaldi and look to create a network structure more or less from scratch using libraries like PyKaldi [Can et al., 2018] (Python wrappers for Kaldi) and PyKaldi2 [Lu et al., 2019] (more support for PyTorch). PyTorch-Kaldi is undeniably a powerful tool for word-level recognition systems. On the other hand, it required a great deal of tweaking and debugging to do things such as changing the labels from phone numbers to phone attributes. This was often very difficult to do, as there is a lot of code to wade through, and small changes in one part sometimes had unexpected effects elsewhere.

Furthermore, the way the data flow is set up made it very difficult to get the corresponding utterance for a list of labels. Once a chunk had been loaded, only the input features and the target labels were visible to the network during training. This was fine in normal circumstances; however, it meant that when one received a list of labels during the running of the network, it was not possible to tell which parts corresponded to which utterance (without a substantial overhaul of the codebase). This was one reason why the analysis for the attribute networks took place at a language level: it was simply not possible to reliably split the outputs into individual utterances. Splitting based on the frame labels would not work either, as these were not completely accurate (being based on the triphone models).

If starting from scratch, it would be preferable to use, for example, the PyTorch DataLoader class. With some adjustments, it would allow data to be read in chunks while associating the data with, for instance, transcripts and utterance lists. This would also make tasks like converting phone numbers to features considerably easier.
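A rough sketch of the kind of Dataset that would make this easier is given below, returning the utterance identifier alongside the features and labels; the chunk format and utterance names are entirely hypothetical.

import torch
from torch.utils.data import Dataset, DataLoader

class UtteranceChunk(Dataset):
    """One chunk of data, keeping track of which utterance each item came from."""

    def __init__(self, utterances):
        # utterances: list of (utt_id, features (frames x 69), labels (frames,))
        self.utterances = utterances

    def __len__(self):
        return len(self.utterances)

    def __getitem__(self, i):
        utt_id, feats, labels = self.utterances[i]
        return utt_id, torch.as_tensor(feats), torch.as_tensor(labels)

# toy chunk with two utterances
chunk = UtteranceChunk([
    ("BG001_utt1", torch.randn(120, 69), torch.zeros(120, dtype=torch.long)),
    ("BG001_utt2", torch.randn(80, 69), torch.ones(80, dtype=torch.long)),
])
loader = DataLoader(chunk, batch_size=1)
for utt_id, feats, labels in loader:
    pass   # the utterance identity stays attached to its features and labels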

5.1.4 Network structure improvements

While the issues and potential solutions discussed here are somewhat related to section 5.1.3, they are more focused on the network structure itself.

In retrospect, it would probably have been better to implement CTC loss (see section 2.4.3). The benefits are quite clear: there would be no need to train shallow triphone models first (to obtain frame alignments); instead, the transcripts could be used directly. While the transcripts may not have been entirely accurate, this accuracy could have been improved using the methods in section 5.1.1. Transcription accuracy would at any rate almost certainly be better than the typical frame-level PERs of around 32%.

PyTorch-Kaldi does not provide for the use of CTC loss. Trying to add this feature would have required refactoring a large amount of the existing code. In particular, the entire way that data is loaded would need to be changed, since it assumes frame-level alignments (which CTC does not use). From personal experience, it is safe to say that this would be extremely difficult, and future work would almost certainly be better off creating such a network more or less from the ground up.

An additional improvement to the structure would be to use different layer types, e.g. LSTM layers. Again, these were not included due to issues with the PyTorch-Kaldi implementation (discussed in section 3.4.5), but a new network could probably resolve these.

Finally, it could be worth exploring the inclusion of an additional layer that maps from attributes to phones directly. This would remove the need to split or score attributes and, more importantly, could allow the more useful attributes to be weighted heavily (thus enabling better predictions). The main issue that could arise with such a layer, though, is that even if it were given the full set of universal phones to predict from, it would probably only learn to predict the phones in the training set. Consequently, the attribute weights would be biased towards the languages it had learned from.

5.2 Results summary

In conclusion, this project has explored the possibility of building a universal phonetic model, with mixed success. Despite various complications and setbacks, it was possible to build a system which attempted truly universal prediction.

The shallow triphone models performed surprisingly well, and in fact were better for most languages than the baseline network. The baseline results themselves suggested that 4 recurrent layers was a reasonable number to work with, though this could change if, for instance, an LSTM layer were used instead.

The results for the attribute network, while modest in accuracy, do indicate that the methods explored here are a viable direction of travel. In particular, some of the results for completely unseen languages were definitely promising. Furthermore, many of the improvements outlined in section 5.1 simply require additional time to implement, and are not especially complicated.

Once improved, this could be a powerful assistive tool in preserving the zero-resource languages of the world, and all the cultural significance that goes along with them. Perhaps one day in the not-too-distant future, it will be possible to build a real-life, fully-working universal phonetic model, and we can all be brought a little closer together as a result.

Bibliography

T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul. A compact model for speaker-adaptive training. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96, volume 2, pages 1137–1140 vol.2, 1996.

T. Anastasakos, J. McDonough, and J. Makhoul. Speaker adaptive training: a maximum likelihood approach to speaker normalization. In 1997 IEEE ICASSP, pages 1043–1046, 1997.

Peter K. Austin and Julia Sallabank. The Cambridge Handbook of Endangered Languages. Cambridge University Press, 2011.

Pallavi Baljekar, Sunayana Sitaram, Prasanna Kumar Muthukumar, and Alan W. Black. Using articulatory features and inferred phonological segments in zero resource speech processing. In INTERSPEECH, 2015.

Justin Simon Bayer. Learning Sequence Representations. Dissertation, Technische Universität München, 2015.

Encyclopaedia Britannica. Human vocal organs and points of articulation, 2020. URL https://www.britannica.com/science/phonetics#/media/1/457255/3597.

Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, and Shrikanth S. Narayanan. PyKaldi: A Python Wrapper for Kaldi. In 2018 IEEE ICASSP. IEEE, 2018.

Dr. Peter Coxhead. Natural Language Processing & Applications: Phones and Phonemes, 2008. URL https://www.cs.bham.ac.uk/~pxc/nlp/NLPA-Phon1.pdf.

D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue: Languages of the world, twenty-second edition, 2020. URL https://www.ethnologue.com.

Joe Frankel, Mirjam Wester, and Simon King. Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language, 21(4):620–640, 2007. ISSN 0885-2308. doi: https://doi.org/10.1016/j.csl.2007.03.002. URL http://www.sciencedirect.com/science/article/pii/S0885230807000204.


M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3):272–281, 1999.

J. Garofolo, Lori Lamel, W. Fisher, Jonathan Fiscus, D. Pallett, N. Dahlgren, and V. Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 11 1992.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. PMLR, 11–13 Apr 2011.

S. Gokcen and J. M. Gokcen. A multilingual phoneme and model set: toward a universal base for automatic speech recognition. In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 599–605, 12 1997.

R. A. Gopinath. Maximum likelihood modeling with Gaussian distributions for classification. In Proceedings of ICASSP 1998, volume 2, pages 661–664 vol.2, 1998.

Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur. End-to-end speech recognition using lattice-free MMI. pages 12–16, 09 2018. doi: 10.21437/Interspeech.2018-1423.

R. Haeb-Umbach and H. Ney. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In [Proceedings] ICASSP-92, volume 1, pages 13–16 vol.1, 1992.

Geoff Hinton. Neural networks for machine learning - lecture 6a - overview of mini-batch gradient descent, 2012. URL https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

Nathalie Japkowicz. The Class Imbalance Problem: Significance and Strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI), pages 111–117, 2000.

Daniel Kiecza. Datengetriebene Bestimmung von Vokabulareinheiten für koreanische Spracherkennung auf großen Wortschätzen. Master's thesis, Universität Karlsruhe, 10 1999. NB: Could only be found within the GP corpus files.

S. King, J. Frankel, and M. Wester. Articulatory feature recognition using dynamic Bayesian networks. In Proc. ICSLP, September 2004.

Peter Ladefoged. Another view of endangered languages. Language, 68(4):809–811, 1992.

Partha Lal. GlobalPhone issues wiki page, 2010. URL https://wiki.inf.ed.ac.uk/CSTR/GlobalPhone.

Partha Lal. Cross-lingual automatic speech recognition using tandem features. PhD thesis, University of Edinburgh, 2011.

Xinjian Li, Siddharth Dalmia, David R. Mortensen, Florian Metze, and Alan W. Black. Zero-shot learning for speech recognition with universal phonetic model, 2018.

Liang Lu, Xiong Xiao, Zhuo Chen, and Yifan Gong. PyKaldi2: Yet another speech toolkit based on Kaldi and PyTorch. CoRR, abs/1907.05955, 2019. URL http://arxiv.org/abs/1907.05955.

Dau-Cheng Lyu, Marco Siniscalchi, Tae-Yoon Kim, and Chin-Hui Lee. Continuous phone recognition without target language training data. pages 2687–2690, 01 2008.

Spyros Matsoukas, Rich Schwartz, Hubert Jin, and Long Nguyen. Practical implementations of speaker-adaptive training. In DARPA Speech Recognition Workshop, 1997.

Josef Michalek and Jan Vanek. A survey of recent DNN architectures on the TIMIT phone recognition task, 2018.

A. Mohamed, G. Hinton, and G. Penn. Understanding how deep belief networks perform acoustic modelling. In 2012 IEEE ICASSP, pages 4273–4276, 2012.

Paul Moore. Low Resource Language Identification from Speech using X-vectors, 2019.

Salikoko S. Mufwene. Language birth and death. Annual Review of Anthropology, 33(1):201–222, 2004. doi: 10.1146/annurev.anthro.33.070203.143852.

Markus Müller, Jörg Franke, Sebastian Stüker, and Alex Waibel. Improving phoneme set discovery for documenting unwritten languages. In Jürgen Trouvain, Ingmar Steiner, and Bernd Möbius, editors, Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017, pages 202–209. TUDpress, Dresden, 2017.

Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, page III–1310–III–1318. JMLR.org, 2013.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

J. W. Picone. Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9):1215–1247, 1993.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

M. Ravanelli, T. Parcollet, and Y. Bengio. The PyTorch-Kaldi Speech Recognition Toolkit. In Proc. of ICASSP, 2019.

Steve Renals. ASR lecture: Words and pronunciation models, slide 9, 2012. URL https://www.inf.ed.ac.uk/teaching/courses/asr/2011-12/asr-lexlm-nup4.pdf.

Steve Renals and Hiroshi Shimodaira. ASR lecture: Context dependent phone models, 2019. URL http://www.inf.ed.ac.uk/teaching/courses/asr/2018-19/asr04-cdhmm-handout.pdf.

Matthew Richardson, Jeff Bilmes, and Chris Diorio. Hidden-articulator Markov models for speech recognition. Speech Communication, 41(2):511–529, 2003. ISSN 0167-6393. doi: https://doi.org/10.1016/S0167-6393(03)00031-1. URL http://www.sciencedirect.com/science/article/pii/S0167639303000311.

Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586–591, 1993.

Suzanne Romaine. Preserving endangered languages. Language and Linguistics Compass, 1(1-2):115–132, 2007. doi: 10.1111/j.1749-818X.2007.00004.x.

Tanja Schultz. GlobalPhone: A Multilingual Speech and Text Database Developed at Karlsruhe University. Technical report, Interactive Systems Laboratories, Karlsruhe University, Carnegie Mellon University, 2002.

Tanja Schultz and Alex Waibel. Multilingual and crosslingual speech recognition. In Proc. DARPA Workshop on Broadcast News Transcription and Understanding, pages 259–262, 1998.

S. M. Siniscalchi, T. Svendsen, and Chin-Hui Lee. Toward a detector-based universal phone recognizer. In IEEE ICASSP, pages 4261–4264, 03 2008. doi: 10.1109/ICASSP.2008.4518596.

Caroline Smith. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Phonology, 17:291–295, 08 1999. doi: 10.1017/S0952675700003894.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

C.P.J. van Bael and Simon J. King. The keyword lexicon - an accent-independent lexicon for automatic speech recognition. 2003.

J. Kendrick Wells. Computer-coding the IPA: a proposed extension of SAMPA. 1995.

Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6):582–589, Nov 2001. ISSN 1860-4749. doi: 10.1007/BF02943243.

Appendix A

Universal Phone Set

A.1 Base Phones

Table A.1: Universal phones (base).

IPA X-SAMPA v/c Other attributes
a a vowel open front unrounded
b b consonant voiced labial
c c consonant voiceless dorsal
d d consonant voiced alveolar plosive coronal
ã d` consonant voiced retroflex plosive coronal
e e vowel close-mid front unrounded
f f consonant voiceless labiodental labial
p”f pf consonant voiceless labiodental fricative bilabial plosive labial
g g consonant dorsal
h h consonant voiceless glottal fricative
H h\ consonant
i i vowel close front unrounded
j j consonant voiced palatal dorsal
ê J\_< consonant voiced palatal approximant dorsal implosive
J j\ consonant voiced dorsal
k k consonant dorsal
l l consonant voiced alveolar lateral approximant coronal
í l` consonant voiced retroflex lateral approximant coronal
Õ l\ consonant alveolar lateral flap coronal
m m consonant bilabial nasal labial
n n consonant voiced alveolar nasal coronal
ï n` consonant retroflex nasal coronal
o o vowel close-mid back rounded
p p consonant voiceless bilabial plosive labial
F p\ consonant voiceless bilabial fricative labial
q q consonant voiceless uvular plosive


r r consonant alveolar trill coronal
ó r` consonant retroflex flap coronal
ô r\ consonant alveolar approximant coronal
õ r\` consonant retroflex approximant coronal
r r_0 consonant
r˚Z rZ consonant voiced alveolar fricative trill
s s consonant voiceless alveolar fricative coronal
ù s` consonant voiceless retroflex fricative coronal
C s\ consonant voiceless alveolo-palatal fricative coronal
t t consonant voiceless alveolar plosive coronal
ú t` consonant voiceless retroflex plosive coronal
úù t`s` consonant voiceless retroflex plosive fricative coronal
u u vowel close back rounded
v v consonant voiced labiodental fricative labial
V v\ consonant voiced labiodental approximant
w w consonant labial-velar approximant labial
x x consonant voiceless velar fricative dorsal
Ê x\ consonant voiceless palatal-velar fricative dorsal
y y vowel close front rounded
z z consonant voiced alveolar fricative coronal
ü z` consonant voiced retroflex fricative coronal
ý z\ consonant voiced alveolo-palatal fricative
A A vowel open back unrounded
B B consonant voiced bilabial fricative labial
à B\ consonant bilabial trill labial
ç C consonant voiceless palatal fricative
D D consonant voiced coronal
E E vowel open-mid front unrounded
M F consonant labiodental nasal labial
G G consonant dorsal
å G\ consonant voiced uvular plosive dorsal
4 H consonant labial-palatal approximant labial
Ë H\ consonant voiceless epiglottal fricative
I I vowel near-close front unrounded
fl1 I\ vowel near-close central unrounded
ñ J consonant palatal nasal
é J\ consonant voiced palatal plosive
ì K consonant voiceless alveolar lateral fricative coronal
Ð K\ consonant voiced alveolar lateral fricative coronal
L L consonant palatal lateral approximant
Ï L\ consonant velar lateral approximant dorsal
W M vowel close back unrounded
î M\ consonant velar approximant dorsal


N N consonant velar nasal dorsal
ð N\ consonant uvular nasal dorsal
O O vowel open-mid back rounded
ò O\ consonant labial
V P consonant labiodental approximant labial
6 Q vowel open back rounded
K R consonant
ö R\ consonant uvular trill
S S consonant voiceless postalveolar fricative coronal
„S S_a consonant voiceless postalveolar fricative coronal apical
T T consonant voiceless dental fricative coronal
U U vowel near-close back rounded
0fi U\ vowel near-close central rounded
2 V vowel open-mid back unrounded
û W consonant voiceless labial-velar fricative labial
X X consonant voiceless uvular fricative
è X\ consonant voiceless pharyngeal fricative
Y Y vowel near-close front rounded
Z Z consonant voiced postalveolar fricative coronal
@ @ vowel close-mid open-mid central rounded unrounded
æ { vowel near-open front unrounded
0 } vowel close central rounded
1 1 vowel close central unrounded
ø 2 vowel close-mid front rounded
3 3 vowel open-mid central unrounded
Æ 3\ vowel open-mid central rounded
R 4 consonant alveolar flap coronal
ë 5 consonant velar alveolar lateral approximant coronal dorsal
5 6 vowel near-open central
7 7 vowel close-mid back unrounded
8 8 vowel close-mid central rounded
œ 9 vowel open-mid front rounded
Œ & vowel open front rounded
P ? consonant
Q ?\ consonant voiced pharyngeal fricative
| |\ consonant dental click coronal
{ |\|\ consonant alveolar coronal
} =\ consonant
ts ts consonant voiceless alveolar coronal
dz dz consonant voiced alveolar affricate coronal
dý dz\ consonant voiced alveolo-palatal affricate coronal
dZ dZ consonant voiced postalveolar affricate coronal


tS tS consonant voiceless postalveolar affricate coronal
tC t_s\ consonant voiceless alveolo-palatal affricate coronal
C s\ consonant voiceless palatal fricative dorsal
tì tK consonant voiceless alveolar lateral affricate coronal
dý“ d_z\ consonant voiced alveolo-palatal affricate

A.2 Extensions

IPA X-SAMPA Attribute
x: : long
x’ _> ejective
˜x _~ nasal
xh _h aspirated
xw _w labial
xj _j palatal
x _t breathy-voiced
x¨ _^ non-syllabic
x”“ _d dental coronal

Table A.2: Universal phone extensions
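As an illustration of how these extensions might combine with the base phones of Table A.1, the sketch below strips extension suffixes from an X-SAMPA symbol and unions the corresponding attributes. This is a hypothetical example under assumed names; the dictionaries are abbreviated excerpts, not the project's actual conversion tables.

    # Illustrative sketch only: abbreviated excerpts of the phone/extension tables.
    BASE_ATTRIBUTES = {
        "t":  {"consonant", "voiceless", "alveolar", "plosive", "coronal"},
        "d`": {"consonant", "voiced", "retroflex", "plosive", "coronal"},
        "a":  {"vowel", "open", "front", "unrounded"},
    }

    EXTENSION_ATTRIBUTES = {
        ":":  {"long"},
        "_>": {"ejective"},
        "_~": {"nasal"},
        "_h": {"aspirated"},
        "_w": {"labial"},
        "_j": {"palatal"},
        "_d": {"dental", "coronal"},
    }


    def attributes_for(xsampa):
        """Strip known extension suffixes, then look up the remaining base phone."""
        attrs = set()
        symbol = xsampa
        stripped = True
        while stripped:
            stripped = False
            for ext, extra in EXTENSION_ATTRIBUTES.items():
                if len(symbol) > len(ext) and symbol.endswith(ext):
                    attrs |= extra
                    symbol = symbol[: -len(ext)]
                    stripped = True
        return attrs | BASE_ATTRIBUTES.get(symbol, set())


    # attributes_for("t_h") -> {"consonant", "voiceless", "alveolar", "plosive",
    #                           "coronal", "aspirated"}
    # attributes_for("a:")  -> {"vowel", "open", "front", "unrounded", "long"}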

A.3 Phone maps

Table A.3: Bulgarian phone map

GP phone IPA X-SAMPA
SIL sil sil
a a a
b b b
bj bj b_j
d d d
dj dj d_j
dZ dZ dZ
dz dz dz
e E E
f f f
fj fj f_j
g g g
gj gj g_j
i i i
j j j
ja ya ya
ju yu yu
k k k
kj kj k_j
l l l
lj lj l_j
m m m
mj mj m_j
n n n
nj nj n_j
o o o
p p p
pj pj p_j


r r r
rj rj r_j
s s s
S S S
sj sj s_j
t t t
tj tj t_j
tS tS tS
ts ts ts
u u u
v v v
vj vj v_j
x x x
Y 7 7
z z z
Z Z Z
zj zj z_j
- - -

Table A.4: Croatian phone map

GP phone IPA X-SAMPA
SIL sil sil
M_+hGH unk unk
M_+QK unk unk
M_a a a
M_b b b
M_cp cC t_s\
M_d d d
M_dp dý d_z\
M_dZ dZ dZ
M_e e e
M_f f f
M_g g g
M_i i i
M_j j j
M_k k k
M_l l l
M_L L L
M_m m m
M_n n n
M_nj ñ J
M_o o o
M_p p p
M_r r r
M_s s s
M_sj S S
M_t t t
M_ts ts ts
M_tS tS tS
M_u u u
M_v V v\
M_x x x
M_z z z
M_zj Z Z
- - -

Table A.5: Hausa phone map

GP phone IPA X-SAMPA
SIL sil sil
H_a a a
H_aI ai ai
H_aU au au
H_b b b
H_B á b_<
H_c c c
H_d d d
H_D â d_<
H_DZ dZ dZ
H_e e e
H_F F p\
H_g g g
H_h h h
H_i i i
H_j j j
H_k k k
H_K k’ k_>


H_KR kh k_h
H_l l l
H_m m m
H_n n n
H_o o o
H_p p p
H_Q Pj ?_j
H_r r r
H_R ó r`
H_s s s
H_S S S
H_t t t
H_TS ts ts
H_u u u
H_w w w
H_z z z

Table A.6: Polish phone map

GP phone IPA X-SAMPA
SIL sil sil
M_a a a
M_b b b
M_c ts ts
M_d d d
M_dZ dZ dZ
M_dz dz dz
M_dzj dzj dz_j
M_e E E
M_eo5 ˜E E_~
M_f f f
M_g g g
M_h x x
M_i i i
M_i2 1 1
M_j j j
M_k k k
M_l l l
M_m m m
M_n n n
M_n~ ñ J
M_o O O
M_oc5 ˜O O_~
M_p p p
M_p p p
M_r r r
M_s s s
M_S S S
M_sj Sj S_j
M_t t t
M_tS tS tS
M_tsj tsj ts_j
M_u u u
M_v v v
M_w w w
M_z z z
M_Z rZ rZ
M_zj zj z_j

Table A.7: Swahili phone map

GP phone IPA X-SAMPA
SIL sil sil
SWA_a a a
SWA_b á b_<
SWA_ch tS tS
SWA_d â d_<
SWA_dh tD D
SWA_e E E
SWA_f f f
SWA_g ä g_<
SWA_gh G G
SWA_h h h
SWA_i i i
SWA_j ê J\_<
SWA_k k k
SWA_kh x x
SWA_l l l


SWA_m m m
SWA_mb mb b_~
SWA_mv Mv v_~
SWA_n n n
SWA_nd nd d_~
SWA_ng Ng g_~
SWA_ng~ N N
SWA_nj ñ J
SWA_ny nê J\_<_~
SWA_nz nz z_~
SWA_o O O
SWA_p p p
SWA_r R 4
SWA_s s s
SWA_sh S S
SWA_t t t
SWA_th θ T
SWA_u u u
SWA_v v v
SWA_w w w
SWA_y j j
SWA_z z z

NB: the other language phone maps are not included here, but they are available in the source code for this project, under conf > phone_maps.

Appendix B

Dataset splits

B.1 Speaker lists

Each table below contains the speaker IDs used in the split for each language.

Table B.1: Training speaker lists.

Language Speakers
Bulgarian 018 020 021 023 025 026 027 032 035 039 041 045 048 049 050 056 060 064 065 066 067 069 070 071 072 073 075 077 078 079 080 082 083 085 087 088 089 091 092 093 094 096 097 098 099 101 102 103 104 105 107 111 112 113 114
Croatian 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 022 023 024 025 026 027 028 029 030 031 032 034 055 056 060 061 062 063 064 065 066 067 068 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 089 090 092 093 094
Hausa 001 002 003 005 020 021 022 028 029 031 032 038 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103
Polish 002 003 005 006 007 008 010 011 013 014 015 016 017 018 019 020 021 022 024 026 028 031 032 034 035 042 051 052 054 055 056 057 058 059 060 061 062 064 065 066 067 068 069 070 071 073 074 075 076 078 079 080 081 082 083 085 086 087 088 089 091 092 093 094 095 096 098 099 100
Swahili 001 002 006 007 008 010 012 013 014 016 025 026 027 034 035 039 040 042 043 044 045 047 050 051 052 053 054 056 057 062 063 066 067 068 071 072 078 079 080 082 087 088 092 093 094 095 096 097 099 100 101 102


Swedish 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 033 036 037 038 039 051 052 053 054 055 056 057 058 059 063 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100
Turkish 001 002 004 005 007 009 012 013 014 016 020 021 023 025 026 027 030 031 032 034 035 038 039 040 041 043 044 045 046 047 048 049 052 057 058 062 064 065 066 067 068 069 070 072 075 076 077 078 080 081 082 083 084 085 086 087 088 089 091 092 093 094 095 096 097 098 099 100
Ukrainian 011 012 019 021 022 023 026 030 043 046 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 084 085 086 087 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119

Table B.2: Validation speaker lists.

Language Speakers
Bulgarian 051 052 053 054 055 058 084 090 100 106
Croatian 033 034 035 036 046 048 051 053 054 057 058 059 069
Hausa 004 006 007 008 009 010 011 012 013 014 015 016 017 018 019 023 024
Polish 012 030 036 037 038 039 040 041 046 063 072 077 084 090 097
Swahili 004 005 015 017 020 021 023 081 084
Swedish 030 031 032 034 035 045 046 047 048 049 050 066 067 068 069
Turkish 008 024 029 033 053 054 055 056 059 060 061 063 063 071 073 074 079 090
Ukrainian 027 028 029 031 032 033 034 035 036 037 038 039 040 041 042 044 045 047 083 088

Table B.3: Test speaker lists.

Language Speakers
Bulgarian 040 042 043 046 047 059 062 063 068 095 109 110
Croatian 037 038 039 040 041 042 043 044 045 047 049 050 052 088
Hausa 025 026 027 030 033 034 035 036 037 039 040 041 042 043 044 083
Polish 001 004 009 023 025 029 033 043 044 045 047 048 049 050 053
Swahili 029 032 033 036 038 048 055 061 073
Swedish 040 041 042 043 044 060 061 062 064 070 071 072 074 075 076
Turkish 003 006 010 011 015 017 018 019 022 028 036 037 042 050 051
Ukrainian 001 002 003 004 005 006 007 008 009 010 013 014 015 016 017 018 020 024 025

B.2 Dataset statistics

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 14.8 55 32.3±13.8 45:55
Croatian 7.0 66 34.6±15.0** 38:62
Hausa 6.1 70 30.1±11.5 22:78
Polish 17.3 69 * *
Swahili 7.8 52 24.8±7.6 49:51
Swedish 15.1 68 32.4±14.1 44:56
Turkish 10.5 68 27.7±10.7 21:79
Ukrainian 9.8 80 35.4±13.4 35:65
Total 88.4 528 - -
Mean (language) 11.1±3.9 66±8.3 - -

Table B.4: Summary of training GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken. ** indicates that the information was missing for one speaker.

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 2.9 10 39.7±15.0 30:70
Croatian 1.6 13 25.1±8.3 46:54
Hausa 1.3 17 25.2±7.4 25:75
Polish 3.7 15 * *
Swahili 1.7 9 25.3±9.6 29:71
Swedish 3.3 15 31.7±11.6 60:40
Turkish 2.2 18 27.9±10.9 44:56
Ukrainian 2.1 20 27.2±8.7 50:50
Total 18.8 117 - -
Mean (language) 2.35±0.65 14.6±3.6 - -

Table B.5: Summary of validation GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken.

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 2.9 12 28.2±12.4 33:67
Croatian 1.5 14 29.2±10.9 50:50
Hausa 1.3 16 22.9±5.7 81:19
Polish 3.6 15 * *
Swahili 1.7 9 24.1±5.2 50:50
Swedish 3.3 15 32.3±13.6 73:27
Turkish 2.3 15 26.7±10.9 40:60
Ukrainian 2.2 19 35.1±16.1 47:53
Total 18.8 117 - -
Mean (language) 2.35±0.79 14.4±2.7 - -

Table B.6: Summary of testing GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken.

Appendix C

Phone errors in baseline network

[Plot omitted: "Top 5 deletion errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.1: Top 5 most common deletion errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 insertion errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.2: Top 5 most common insertion errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 substitution errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.3: Top 5 most common substitution errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 deletion errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.4: Top 5 most common deletion errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 insertion errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.5: Top 5 most common insertion errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 substitution errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.6: Top 5 most common substitution errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 deletion errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.7: Top 5 most common deletion errors in the validation set for each language when using the 5-layer RNN network

[Plot omitted: "Top 5 insertion errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.8: Top 5 most common insertion errors in the validation set for each language when using the 5-layer RNN network

[Plot omitted: "Top 5 substitution errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.9: Top 5 most common substitution errors in the validation set for each language when using the 5-layer RNN network

Appendix D

Confusion matrices for attribute networks
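The figures that follow are normalised in two complementary ways, as noted in each caption. As a minimal illustrative sketch (the array name and toy values below are assumed, not taken from the project code), both views can be obtained from a raw count matrix with true phones on the rows and predicted phones on the columns:

    import numpy as np

    # Toy counts: counts[i, j] = frames whose true phone is i, predicted phone is j.
    counts = np.array([[50.0, 3.0, 2.0],
                       [4.0, 40.0, 6.0],
                       [1.0, 5.0, 30.0]])

    # Normalised along the y-axis: each predicted-phone column sums to 1, so the
    # diagonal entry for a phone is its precision (the odd-numbered figures).
    precision_view = counts / counts.sum(axis=0, keepdims=True)

    # Normalised along the x-axis: each true-phone row sums to 1, so the
    # diagonal entry for a phone is its recall (the even-numbered figures).
    recall_view = counts / counts.sum(axis=1, keepdims=True)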


Figure D.1: Confusion matrix for test Bulgarian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.2: Confusion matrix for test Bulgarian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.3: Confusion matrix for test Croatian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.4: Confusion matrix for test Croatian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.5: Confusion matrix for test Hausa data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.6: Confusion matrix for test Hausa data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.7: Confusion matrix for test Polish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.8: Confusion matrix for test Polish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.9: Confusion matrix for test Swahili data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.10: Confusion matrix for test Swahili data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.11: Confusion matrix for test Swedish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.12: Confusion matrix for test Swedish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.13: Confusion matrix for test Turkish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.14: Confusion matrix for test Turkish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.15: Confusion matrix for test Ukrainian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.16: Confusion matrix for test Ukrainian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.

Appendix E

Phone distributions for attribute network


Figure E.1: Comparison of phone distributions between the Bulgarian transcripts and the predicted phones from test data


Figure E.2: Comparison of phone distributions between the Croatian transcripts and the predicted phones from test data


Figure E.3: Comparison of phone distributions between the Hausa transcripts and the predicted phones from test data


Figure E.4: Comparison of phone distributions between the Polish transcripts and the predicted phones from test data


Figure E.5: Comparison of phone distributions between the Swahili transcripts and the predicted phones from test data


Figure E.6: Comparison of phone distributions between the Swedish transcripts and the predicted phones from test data


Figure E.7: Comparison of phone distributions between the Turkish transcripts and the predicted phones from test data


Figure E.8: Comparison of phone distributions between the Ukrainian transcripts and the predicted phones from test data