
Building a Universal Phonetic Model for Zero-Resource Languages

Paul Moore

MInf Project (Part 2) Interim Report
Master of Informatics
School of Informatics
University of Edinburgh
2020


Abstract

Being able to predict phones from audio is a challenge in and of itself, but what about unseen phones from different languages? In this project, work was done towards building precisely this kind of universal phonetic model. Using the GlobalPhone corpus, phones' articulatory features, a recurrent neural network, open-source libraries, and an innovative prediction system, a model was created to predict phones based on their features alone. The results show promise, especially for using these models on languages within the same family.

Acknowledgements

Once again, a huge thank you to Steve Renals, my supervisor, for all his assistance. I greatly appreciated his practical advice and reasoning when I got stuck, or things seemed overwhelming, and I'm very thankful that he endorsed this project.

I'm immensely grateful for the support my family and friends have provided in the good times and bad throughout my studies at university. A big shout-out to my flatmates Hamish, Mark, Stephen and Iain for the fun and laughter they contributed this year. I'm especially grateful to Hamish for being around during the isolation from Coronavirus and for helping me out in so many practical ways when I needed time to work on this project.

Lastly, I wish to thank Jesus Christ, my Saviour and my Lord, who keeps all these things in their proper perspective, and gives me strength each day.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Project outline
  1.3 Previous project work

2 Modelling phones
  2.1 Phones vs. phonemes
  2.2 Standard phone modelling
    2.2.1 Feature extraction
    2.2.2 Monophone models
    2.2.3 Basic triphone models
    2.2.4 Advanced triphone models
    2.2.5 Limitations of standard models
  2.3 Deep learning
    2.3.1 Recurrent Neural Networks (RNNs)
    2.3.2 Long Short-Term Memory (LSTM)
    2.3.3 RMSProp optimisation
    2.3.4 Connectionist Temporal Classification (CTC) loss
    2.3.5 Miscellaneous techniques
  2.4 Universal phone models
    2.4.1 General concepts
    2.4.2 Modelling unseen phones
    2.4.3 Universal phone modelling with attributes

3 General setup
  3.1 The GlobalPhone Dataset
    3.1.1 Suitability analysis
  3.2 File preparation
    3.2.1 Kaldi
    3.2.2 Conversion and preliminary cleaning
    3.2.3 Splitting the data
    3.2.4 Standardising phones
    3.2.5 Generating transcriptions
    3.2.6 Generating input features
  3.3 Organising experiments
    3.3.1 Additional filtering
    3.3.2 Converting phones to attributes
    3.3.3 Dealing with diphthongs
  3.4 Using PyTorch-Kaldi
    3.4.1 Adapting input alignments
    3.4.2 Cost function
    3.4.3 Model saving
    3.4.4 Chunk sizes
    3.4.5 Network structure
    3.4.6 Gradient issues
  3.5 Predicting phones from attributes
    3.5.1 Distance metrics
    3.5.2 Initial split
    3.5.3 Decision trees
    3.5.4 Universal scoring
  3.6 Evaluation
    3.6.1 Issues with decoding
    3.6.2 Alternative evaluation metrics

4 Experiments
  4.1 Experiment 1: Shallow models
    4.1.1 Research questions
    4.1.2 Setup
    4.1.3 Results
  4.2 Experiment 2: Baseline network
    4.2.1 Research questions
    4.2.2 Setup
    4.2.3 Results
  4.3 Experiment 3: Attribute network
    4.3.1 Research questions
    4.3.2 Setup
    4.3.3 Results
  4.4 Experiment 4: Cross-lingual investigations
    4.4.1 Research questions
    4.4.2 Setup
    4.4.3 Results

5 Conclusions
  5.1 Future work
    5.1.1 Fixing GlobalPhone
    5.1.2 Phonetic attribute improvements
    5.1.3 Replacing PyTorch-Kaldi
    5.1.4 Network structure improvements
  5.2 Results summary

Bibliography

A Universal Phone Set
  A.1 Base Phones
  A.2 Extensions
  A.3 Phone maps

B Dataset splits
  B.1 Speaker lists
  B.2 Dataset statistics

C Phone errors in baseline network

D Confusion matrices for attribute networks

E Phone distributions for attribute network

Chapter 1

Introduction

““Come, let us go down and confuse their language so they will not understand each other”... That is why [the city] was called Babel—because there the Lord confused the language of the whole world.” ∗

1.1 Motivation

The above-quoted tale of the Tower of Babel, where humanity's single language was split into different ones, has had a profound cultural impact that continues to this day. In Douglas Adams's The Hitchhiker's Guide to the Galaxy, the so-called Babel fish is capable of translating any spoken language. While any organism or computer system with the ability to instantly reverse the “Babel effect” remains firmly in the area of science fiction for the present, there are related problems which may be more solvable.

Worldwide there are nearly 3,000 unwritten languages [Eberhard et al., 2020]. Most of these are likely to have little to no audio data available either. According to Austin and Sallabank [2011], linguists believe that around 50-90% of the 7,000 languages worldwide will go extinct within this century, which doubtless includes the vast majority of unwritten ones.

Some linguists have argued that this is a natural process, and we should do little to interfere with it ([Mufwene, 2004], [Ladefoged, 1992]). However, numerous other linguists believe that it is important to preserve them if possible, since these languages are an integral part of the society and culture they are in, and are a key component of human identity ([Austin and Sallabank, 2011], [Romaine, 2007]).

When trying to save any endangered language, a key factor is to have a writing system for it. This empowers members of these people groups to read and write their own language, not just speak/hear it. Consequently, cultural stories or traditions can be written down in their original languages, and people will be able to communicate in written fashion in their native language, along with a whole host of other benefits.

∗Genesis 11:7,9 (NIV)


In fact, such communication may be an important motivator for speakers of these languages to preserve their language. Otherwise, a more common written language may be very attractive, particularly to younger members as they interact with the modern world. After all, they may reason, why continue using a language which is less convenient for common activities such as text or email? Books or other reading materials are also a powerful impetus for perpetuating the use of such a language.

However, in order to develop a writing system, an alphabet is required. Linguists need to work out the phonetic structure of a language and use this to decide on how to represent the sounds in writing. The task of discovering these phones is challenging, and often requires a great deal of time and effort. The International Phonetic Alphabet (IPA) [Smith, 1999] is frequently used to standardise the transcription of phones.

Building a universal phonetic model, thus providing a way to model all the phones in the IPA, would make this undertaking considerably easier, with a phonetic transcription based on nothing other than the audio. Even if an accurate transcription proved difficult, recurring phonetic features could be highlighted, which would be beneficial.

1.2 Project outline

The existing methods for modelling phones, particularly in a universal model, will be discussed first. Then, the general experimental setup used across most of the experiments will be given. The experiments themselves aim to answer the following questions:

• What is a reasonable baseline, using non-universal phones?
• Which feature types are better for training?
• Does training on languages within the same family improve performance for unseen languages within the same family, or is it better to have as many different languages as possible?

Finally, directions for potential future work will be outlined, and overall findings summarised.

1.3 Previous project work

Certain aspects of work from last year's project [Moore, 2019] were reused. While previously the focus was on language identification, the goal for this year, as stated in this introduction, was quite different. Some of the scripts for working with Kaldi [Povey et al., 2011] and the GlobalPhone dataset [Schultz, 2002] were reused and/or improved. Furthermore, a common focus in both projects has been working on models which could be applicable in areas of the world with little to no transcribed language resources.

Chapter 2

Modelling phones

In this chapter, the basic principles of building models for representing phones will be covered. Based on these simpler models, ways to apply these principles in a multilingual or universal sense will be explored. There will also be a brief section on relevant deep learning techniques which were used in the course of this project.

2.1 Phones vs. phonemes

To begin, one important distinction to make is the difference between phones and phonemes, as these will be referred to throughout the rest of this report. A phoneme is the smallest structural unit distinguishing the meaning between sounds in a language. If one phoneme is swapped with another, the meaning of the word changes. Phonemes are consequently language-specific.

A phone, on the other hand, is the acoustic realisation of a phoneme (how it actually sounds). Phones are not language-specific. Allophones are phones that correspond to the same underlying phoneme.

This can be illustrated by the following examples (taken from [Coxhead, 2008]): the word cat is made of three distinct unit sounds, replacing one of which will change the meaning of the word. There are three phones corresponding to three phonemes.

In English, the /p/ and /ph/ phones are not interpreted as distinct: for instance, the “p” sounds in pin [phIn] and spin [spIn] would be taken by native speakers to refer to the same phoneme. However, in Hindi, /p/ and /ph/ do have a distinction between them, so they are separate phonemes.

It is useful to keep this distinction in mind so that it is clear what is meant by a universal (or monolingual) phone model. The goal with such models is to represent the phone based on input sounds, rather than discovering the phoneme set of a language (which is considerably more challenging and requires much more human linguistic effort).


2.2 Standard phone modelling

This section outlines typical techniques for modelling phones which were used to generate the early models in this project (section 4.1). Since they are well-documented, relatively common, and not the primary focus of the project, their descriptions will be kept at a fairly high level.

2.2.1 Feature extraction

The input feature vectors used in most phone recognition systems are often Mel-Frequency Cepstral Coefficients (MFCCs). The full details of how exactly MFCCs are calculated are outlined in Zheng et al. [2001].

MFCCs provide a vector representation of a speech signal at multiple “frames”, where a single frame is computed across a window of the signal (usually 25ms long), with a step between frames (usually 10ms). An MFCC vector is typically 13- to 23-D. In some cases, the dimensionality is increased by adding the changes between each feature in consecutive vectors (deltas) and the change between those changes (delta-deltas). When adding delta and delta-delta features (as done for the triphone models in this project, section 4.1), the resulting feature vectors are then typically 39- to 69-D.

The acoustic features in MFCCs are decorrelated, which makes them useful for inclusion in machine-learning techniques such as training Gaussian Mixture Models (GMMs). For GMMs, high correlation would require the Gaussians to have full covariance, or there would need to be a very large number of Gaussians with diagonal covariance. In both cases, training would become computationally more expensive and would require a greater amount of data. MFCCs do not cause these detrimental effects and so are widely used even to this day.

Filterbank (FBANK) features are calculated in the same way as MFCCs, except that the decorrelation steps are not applied. They were used in the experiment in section 4.3. While not very useful for the standard models, other models such as neural networks can benefit from them, as observed for instance by Mohamed et al. [2012]. Picone [1993] provides a good overview of all of these methods for processing signals, and some additional methods too.
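As a concrete illustration, the sketch below computes Kaldi-style MFCC and FBANK features with appended deltas and delta-deltas. The project itself used Kaldi's own feature-extraction scripts (section 3.2.6); torchaudio and the file path here are only illustrative stand-ins.

```python
# Sketch of MFCC/FBANK extraction with Kaldi-compatible settings (25 ms window,
# 10 ms shift). The project used Kaldi's scripts; torchaudio and the path
# "utterance.wav" are placeholders used only for this illustration.
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("utterance.wav")   # shape: (channels, samples)

# 13-D MFCCs computed from 23 mel bins, one row per 25 ms frame, 10 ms shift.
mfcc = kaldi.mfcc(waveform, sample_frequency=sample_rate,
                  frame_length=25.0, frame_shift=10.0,
                  num_mel_bins=23, num_ceps=13)             # (num_frames, 13)

# FBANK features: the same pipeline without the decorrelating DCT step.
fbank = kaldi.fbank(waveform, sample_frequency=sample_rate,
                    frame_length=25.0, frame_shift=10.0,
                    num_mel_bins=23)                        # (num_frames, 23)

# Append deltas and delta-deltas, tripling the dimensionality (13 -> 39).
feats = mfcc.t()                                            # compute_deltas expects (..., freq, time)
deltas = F.compute_deltas(feats)
delta_deltas = F.compute_deltas(deltas)
mfcc_39d = torch.cat([feats, deltas, delta_deltas], dim=0).t()   # (num_frames, 39)
```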

2.2.2 Monophone models

Many monophone models are based around a combination of Hidden Markov Models (HMMs) and GMMs. An example of this is shown in Figure 2.1. Here, a single phone is viewed as being made up of three hidden states at the beginning, middle and end, each of which produces a different GMM as output. In even simpler cases a phone may have only one hidden state, or single Gaussians instead of GMMs as outputs.

Multiple phones can be chained together to form models of words or sentences. The likelihood of the observations (summed over hidden state sequences) can be computed using the Forward algorithm, the most likely state sequence can be calculated through the Viterbi algorithm, and the parameters for the model can be learned through the Baum-Welch algorithm [Rabiner, 1989].

Figure 2.1: Example triple-state monophone HMM-GMM for a single phone. I = start state; E = end state. Adapted from Renals and Shimodaira [2019].
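To make the Forward recursion concrete, here is a minimal sketch for a small left-to-right HMM like the one in Figure 2.1. For brevity the GMM output densities are replaced by a precomputed matrix of per-frame observation likelihoods, and all of the numbers are made up.

```python
# Minimal sketch of the Forward recursion for a three-state left-to-right HMM.
# B[t, j] stands in for p(observation at frame t | state j), which would
# normally be computed from each state's GMM.
import numpy as np

A = np.array([[0.6, 0.4, 0.0],     # transitions between the beginning,
              [0.0, 0.7, 0.3],     # middle and end states
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])     # always start in the first state

B = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.1, 0.3, 0.6],
              [0.1, 0.2, 0.7]])

def forward_likelihood(A, pi, B):
    """Total likelihood of the observations, summed over all state sequences."""
    alpha = pi * B[0]                       # alpha[j] = p(obs_1..t, state_t = j)
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return alpha.sum()

print(forward_likelihood(A, pi, B))
```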

2.2.3 Basic triphone models

However, monophone models are not usually sufficient, because phonetic context is important. The neighbouring phones affect how exactly a particular phone sounds, as they affect the articulation (Figure 2.4) and the exact acoustic context in which the phone is spoken. What this means in practice is that the acoustic signal for individual phones is often highly variable.

A common solution is to use triphones instead, where each phone is given left and right context phones. Each part (left, middle and right) can be modelled by a single Gaussian or GMM. Since the number of theoretical triphones is high (the number of phones cubed), acoustically similar ones are often clustered together using decision trees (the parameters for which are also learned during training) [Renals and Shimodaira, 2019].

One common application of triphones is as a component in automatic speech recognition (ASR). Sequences of triphones are identified, converted from these to phones, and then combined to find words using a lexicon of pronunciations. A language model restricts the search space from spurious or improbable phone or word combinations. However, some modern end-to-end (E2E) models go “directly” from speech to words; for instance the work by Hadian et al. [2018]. Some relevant E2E components are discussed in section 2.3.

2.2.4 Advanced triphone models

Some more advanced techniques for triphone models were also applied in the shallow models for this project. The first technique used included Linear Discriminant Analysis (LDA) [Haeb-Umbach and Ney, 1992] and Maximum Likelihood Linear Transforms (MLLT) [Gopinath, 1998], [Gales, 1999] during training. LDA aims to reduce the feature dimensionality, while reducing the variability within a class and maximising the variability between classes. What exactly a class is can be varied. They may be phones, or sub-phone units; in the Kaldi toolkit [Povey et al., 2011] classes are acoustic states.

During training, feature vectors from the frames before and after are spliced together, and LDA reduces the dimensionality. MLLT then finds transformations from this reduced feature space for each speaker. This makes the resulting triphone model more speaker-independent.

A second technique used in this project was Speaker Adaptive Training (SAT) [Anastasakos et al., 1996], [Anastasakos et al., 1997], which was done on top of the triphone + LDA + MLLT model. The inner workings of SAT are fairly complex, but the key idea is that it involves learning speaker transformations while also learning the acoustic model's parameters during training. By doing so, it is able to improve the accuracy considerably, at the cost of taking longer to train and requiring more space for storage [Matsoukas et al., 1997].

2.2.5 Limitations of standard models

One downside of a triphone-based approach for multilingual phones is that the number of possible triphones is already very large for a single language, which is why clustering is necessary. If this is expanded to multiple languages from different families, then the number of possible triphones required increases, which in turn requires more data and good clustering. For instance, in the GlobalPhone dataset, Croatian has 30 phones (27,000 triphones) and Swahili has 37 phones (50,653 triphones). When combined they have 50 phones and 125,000 possible triphones.

Furthermore, there are even more serious problems with unseen phones. These models can only predict phones that they have seen already in the training data. This may not always have a terrible effect: for instance, a model that can predict the phone /m/ would probably predict /mb/ as /m/, which would not necessarily be a bad thing (if the two are not separate phonemes in the target language). Thus, such multilingual models can only work well if the testing and training languages use the same (or very similar) sets of phones.

2.3 Deep learning

Modern ASR and phone recognition systems often use some form of deep learning, for instance in E2E systems. In this section, the models, principles and techniques relevant to this project will be briefly described, as some of the more recent work in universal phone prediction involves these. A basic knowledge of deep learning and its concepts (such as backpropagation) is assumed, so that will not be covered here (though Nielsen [2015] provides an excellent overview even for those who are already familiar with the key ideas).

2.3.1 Recurrent Neural Networks (RNNs)

In an RNN, each of the input feature vectors x_1, ..., x_T is fed into a hidden state, and the state's output obtained. The exact internal structure of this state can vary, but the crucial aspect is that it uses the previous state as an input as well. Thus, information from previous time steps can be retained when making predictions.

During backpropagation, gradients can be propagated along the hidden states using the unrolled RNN (shown in Figure 2.2). However, this is one of the flaws with vanilla RNNs: gradients tend to vanish to small values the further back in time they are propagated [Pascanu et al., 2013].

Figure 2.2: An example of an unfolded RNN. h = hidden state, x = input, o = output. W, V & U are weight matrices. From Wikipedia, by fdeloche, licensed under CC BY-SA 4.0.

2.3.2 Long Short-Term Memory (LSTM)

An LSTM attempts to address the shortcomings of the vanilla RNNs described previously. They maintain the recurrent aspect, but use different hidden states (Figure 2.3). Each of the hidden states has a “cell” which is effectively used to store and monitor any dependencies in the sequential input features.

Figure 2.3: LSTM cell example. F_t = forget gate, I_t = input gate, O_t = output gate. x = input, o = output, h = hidden state, c = cell state. From Wikipedia, by fdeloche, licensed under CC BY-SA 4.0.

There are three gates in an LSTM:

1. input gate: how much of the input at the current time should be used in the internal cell state

2. forget gate: how much of the previous cell's internal state should be maintained in the current cell's internal state

3. output gate: how much of the cell's internal state should be used when calculating the output for the hidden state

Together, these three gates regulate information flow in a layer, and enable information to be preserved across time steps for a long time, if necessary. Thus the vanishing gradient problem can be resolved [Bayer, 2015].
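To make the gating explicit, below is a minimal single-time-step LSTM cell written out by hand; in practice a library implementation such as torch.nn.LSTM would be used, and the weight shapes and dimensions here are placeholders.

```python
# One time step of an LSTM cell, written out to make the three gates explicit.
# This is only an illustrative sketch; real code would use torch.nn.LSTM.
import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """W_x: (4*H, D) input weights, W_h: (4*H, H) recurrent weights, b: (4*H,)."""
    H = h_prev.shape[-1]
    gates = x_t @ W_x.t() + h_prev @ W_h.t() + b
    i, f, o, g = gates.split(H, dim=-1)
    i = torch.sigmoid(i)           # input gate: how much new input enters the cell
    f = torch.sigmoid(f)           # forget gate: how much of the old cell state is kept
    o = torch.sigmoid(o)           # output gate: how much of the cell state is exposed
    g = torch.tanh(g)              # candidate cell update
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * torch.tanh(c_t)      # new hidden state
    return h_t, c_t

D, H = 39, 64                      # e.g. 39-D input features, 64 hidden units
x_t = torch.randn(1, D)
h, c = torch.zeros(1, H), torch.zeros(1, H)
W_x, W_h, b = torch.randn(4 * H, D), torch.randn(4 * H, H), torch.zeros(4 * H)
h, c = lstm_step(x_t, h, c, W_x, W_h, b)
```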

2.3.3 RMSProp optimisation

There are numerous different weight optimisation methods to choose from in deep learning. In this project, Root Mean Square Propagation (RMSProp) was used; it was proposed by Hinton [2012], and has seen a fairly wide range of applications since. In RMSProp, the gradients are divided by a moving average of the squared gradients (Equation 2.1 and Equation 2.2).

MS(w_t) = γ · MS(w_{t−1}) + (1 − γ) · (δw_t)²     (2.1)

Equation 2.1: Calculation of the moving average squared gradient for a single weight. w_t = weight at time t; γ = moving average parameter; δw_t = weight gradient

w_{t+1} = w_t − η · δw_t / √(MS(w_t) + ε)     (2.2)

Equation 2.2: Updating a single weight parameter using RMSProp. w_t = weight at time t; η = learning rate; δw_t = weight gradient; MS = moving average squared gradient (Equation 2.1); ε = stabilising parameter

RMSProp was based on Robust Propagation (Rprop) [Riedmiller and Braun, 1993], except it allows for learning with mini-batches [Hinton, 2012], rather than requiring full batches. It is more robust to steep changes in gradient than stochastic gradient descent (SGD) as well, since any large gradient would be approximately divided by itself. A small stabilising parameter, ε (typically 1 × 10⁻⁸), is added in Equation 2.2 to ensure that if the value in the denominator is zero (or close to zero), this does not cause numerical problems.
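A minimal sketch of the update rule in Equations 2.1 and 2.2, applied to a single parameter vector; the hyperparameter values are typical defaults rather than the settings used in this project.

```python
# Sketch of the RMSProp update from Equations 2.1 and 2.2.
import numpy as np

def rmsprop_step(w, grad, ms, gamma=0.9, eta=1e-3, eps=1e-8):
    ms = gamma * ms + (1.0 - gamma) * grad ** 2     # Equation 2.1: moving average of squared gradients
    w = w - eta * grad / np.sqrt(ms + eps)          # Equation 2.2: gradient scaled by its running magnitude
    return w, ms

w = np.zeros(3)
ms = np.zeros(3)
for grad in [np.array([0.5, -2.0, 0.1]), np.array([0.4, -1.5, 0.2])]:
    w, ms = rmsprop_step(w, grad, ms)
```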

2.3.4 Connectionist Temporal Classification (CTC) loss

CTC loss works by receiving a distribution across all possible outputs at each individual time step. From this distribution, the probability of different output sequences can be calculated. Each output sequence is then transformed into an alignment by combining characters which are repeated and then deleting any blank symbols. Any alignments which fail to map to the target output are excluded. Then the loss can be computed based on how likely all the valid alignments were.

The main benefit of using CTC is that it does not require frame-level alignments to phones or labels when training, but can work end-to-end instead. One disadvantage though, is that it treats all the outputs as occurring independently of one another, which is not usually the case in practice.
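The collapse rule described above can be illustrated with a small sketch; the blank symbol and example strings are arbitrary, and a real implementation (e.g. torch.nn.CTCLoss) computes the loss with a forward-backward recursion rather than by enumerating alignments.

```python
# Sketch of the CTC collapse rule: merge repeated symbols, then remove blanks.
# Any alignment that does not collapse to the target sequence contributes
# nothing to the CTC loss.
BLANK = "-"

def collapse(alignment):
    out = []
    for symbol in alignment:
        if out and symbol == out[-1]:
            continue                        # merge repeats
        out.append(symbol)
    return [s for s in out if s != BLANK]   # remove blanks

print(collapse(list("--kk-aa-tt")))    # ['k', 'a', 't']
print(collapse(list("kk--aaatttt")))   # ['k', 'a', 't']
```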

2.3.5 Miscellaneous techniques

Finally, three other deep learning techniques used in this project will be outlined.

The first is batch normalisation [Ioffe and Szegedy, 2015]. It is designed to stop activations of layers from becoming too large. Each of the mini-batch activations is normalised to zero mean and unit variance, and the output scaled and shifted accordingly. Furthermore, this effectively adds some noise to the activations which helps somewhat to regularise the training.

The second is dropout [Srivastava et al., 2014]. During training, if dropout is applied to a layer, then a random subset of its outputs is effectively ignored. This has a regularisation effect since it prevents networks from relying too much on the outputs of any individual nodes. Instead, they must learn more robust representations of the input features.

The third is the rectified linear unit (ReLU) activation function (Equation 2.3). This function has been observed to improve neural network training [Glorot et al., 2011]. It is faster to compute than sigmoid or tanh functions as it does not require relatively expensive exponential calculations. Backpropagation is quite straightforward as well, since the gradients are only 0 or 1.

ReLU(x) = max(0,x) (2.3)

Equation 2.3: Rectified Linear Unit activation function
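These three techniques are typically combined within a single feed-forward block, as in the hedged sketch below; the layer sizes and dropout rate are placeholders rather than the configuration used in the project.

```python
# Sketch of one block combining the three techniques above: a linear layer
# followed by batch normalisation, ReLU (Equation 2.3) and dropout.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(in_features=550, out_features=550),
    nn.BatchNorm1d(550),    # normalise activations to zero mean / unit variance per mini-batch
    nn.ReLU(),              # ReLU(x) = max(0, x)
    nn.Dropout(p=0.2),      # randomly zero 20% of outputs during training
)

x = torch.randn(32, 550)    # a mini-batch of 32 feature vectors
y = block(x)                # shape: (32, 550)
```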

2.4 Universal phone models

2.4.1 General concepts

The appeal of having a universal phonetic model is quite intuitive. If there were some robust, non-language-specific phone recogniser, it would be very beneficial for unwritten languages. Not only that, but it would be considerably easier to incorporate a new language into an existing model built using this universal set since there would be no need to work out a new phone set or other models beforehand. These ideas were expressed by Gokcen and Gokcen [1997].

However, there are challenges with building such a model. Firstly, the number of phonemes for which phones must be distinguishable is greatly increased. For an individual language, this could be from 141 phonemes (in !Xũ), down to 11 (in Rotokas and Pirahã) [Crystal, 2010], though 20-50 is a more typical number in most European languages.

To combine these is challenging, since distinct phonemes in one language may not be distinct in another. For instance, as mentioned previously in section 2.1, most English phonetic transcriptions would not distinguish between the phones /p/ and /ph/ (aspirated), but would instead transcribe both as “p”. However, a Hindi transcription would need to show the two separately, since both sounds are important to determine the meaning of different words.

One solution is to map all of the phonemes from different languages to their closest corresponding phone in the IPA to standardise them, and then merge them together. Such combination and standardisation can be useful. Schultz and Waibel [1998] investigated combining phonemes into one multilingual phoneme set, finding that the word error rate (WER) slightly increased, and that it was important to add a language question when clustering triphones. In regard to unseen languages (trying a crosslingual approach), they achieved a WER of 41.5% for German, which was slightly more than double the 20% WER on a monolingual German setup. While such results may be promising, it did require that they go through the German phoneme set by hand and map its phonemes to their “global” phoneme set.

The IPA uses 107 symbols for sounds, 52 diacritics and 4 prosodic marks. Putting these all together results in a huge number of possible phones, so trying to represent every single possible phone adequately in a model is simply infeasible. It would require a huge amount of meticulously transcribed training data, while some phones would still seldom or never occur. Instead, for the purposes of this project, the phoneme set and corresponding phone inventory used in training will be restricted to the ones from the phonetic transcriptions themselves, ignoring any unseen diacritics or prosodic marks. How to go from this smaller phone set to a universal one which can predict unseen phones will be discussed in the following section.

2.4.2 Modelling unseen phones

There is currently no definitive answer on how to model unseen phones, but one promising approach is to use articulatory features. This takes advantage of the fact that each phoneme can be broken down into the presence or absence of distinct attributes. These relate to aspects such as the place and manner of articulation, as shown in Figure 2.4. For instance, the phone /b/ has the attributes “voiced, bilabial, stop & labial”. The attributes themselves are what distinguish between different phones. Thus, in theory, if such attributes could be accurately predicted, one would be able to derive the corresponding phone. NB: throughout the rest of this report, the term “attributes” refers to these articulatory features to distinguish them from input feature vectors such as MFCCs.

Figure 2.4: Illustration of different places of articulation and components of human vocal anatomy. From Encyclopaedia Britannica [2020].

Some progress was made using this technique by Siniscalchi et al. [2008] and Lyu et al. [2008]. Their approach was to decompose phonemes into the presence/absence of these attributes, train “simple” neural networks to recognise the likelihood of each attribute separately, and then combine these probabilities to predict the phoneme. There did appear to be reasonable performance when comparing models trained with Mandarin to those without (which were meant to be universal).

In addition, King et al. [2004] have done work with articulatory feature recognition using dynamic Bayesian networks. They included dependencies between different sets of attributes (e.g. manner and place), rather than viewing them all as independent from one another. As a result, they found that this improved the accuracy of recognising these attributes, since unlikely combinations were removed. In the section on future work, they also believed it would be possible to include such attributes in phone recognisers.

All of this confirms that breaking down phonemes into articulatory features is a reasonable approach to take when trying to identify them, as demonstrated in some more recent studies as well ([Müller et al., 2017] and [Baljekar et al., 2015]). Furthermore, this would mean that slight inaccuracies in the alignment labels could be handled during training. For example, if a phone was labelled as /a/ instead of /A/, the attributes are identical except one is “front” and the other is “back”. Thus, during training, the model could still learn the correct attributes of “vowel”, “unrounded” and “open”, even if one attribute was wrong.
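The example below sketches this attribute representation with a toy inventory; the attribute names and phone set are illustrative only and much smaller than the 51-attribute set actually used in the project (section 3.3.2).

```python
# Toy sketch of representing phones as binary articulatory-attribute vectors.
# The attribute inventory here is a small illustrative subset.
ATTRIBUTES = ["vowel", "consonant", "open", "close", "front", "back",
              "rounded", "unrounded", "voiced", "bilabial", "stop"]

PHONE_ATTRS = {
    "a": {"vowel", "open", "front", "unrounded"},
    "A": {"vowel", "open", "back", "unrounded"},    # X-SAMPA /A/
    "i": {"vowel", "close", "front", "unrounded"},
    "b": {"voiced", "consonant", "bilabial", "stop"},
}

def attribute_vector(phone):
    attrs = PHONE_ATTRS[phone]
    return [1 if a in attrs else 0 for a in ATTRIBUTES]

# /a/ and /A/ differ only in the front/back pair, so a mislabelled frame
# still provides mostly correct attribute targets during training.
va, vA = attribute_vector("a"), attribute_vector("A")
print(sum(x != y for x, y in zip(va, vA)))   # 2 differing positions out of 11
```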

2.4.3 Universal phone modelling with attributes

An unpublished work by Li et al. [2018] provided a great deal of inspiration for this project. As above, the key idea was to model phonemes in terms of their attributes, and to use these to predict phonemes. A bidirectional LSTM modelled the acoustic signals in their temporal context to obtain a distribution over attributes, after which a language-dependent signature matrix transformed that distribution into a phoneme distribution. CTC loss was used to optimise the model (Figure 2.5).

Figure 2.5: Network structure used by Li et al. [2018] (figure taken from the same paper).

However, there was one significant issue with this network structure: it required a signature matrix for the language. The problems with such a matrix will be discussed more fully in section 3.5. The work here also only used English for training, so one advantage of this project is the ability to use a multilingual dataset instead (see section 3.1), rather than English alone. The exact dataset and general model setup will be detailed in the following chapter.

Chapter 3

General setup

In this chapter, the GlobalPhone corpus and its use is discussed first of all, since it affected the way everything else was done. Next, techniques which were required across all of the experiments, such as file conversion and data organisation, are given. Finally, the modifications made to PyTorch-Kaldi, and the methods for predicting phones from attributes are covered.

3.1 The GlobalPhone Dataset

The GlobalPhone (GP) corpus is a multilingual dataset containing speech from people reading news articles, along with transcriptions. Work was begun in 1995 [Schultz, 2002], with more languages being added over time. Currently, there are over 1500 speakers in total, and 22 languages in the corpus, with about 20 minutes of speech per speaker. For additional information about GP, see section 3.1 of the previous year’s project [Moore, 2019]. Summary statistics for the languages used in this project are given in Table 3.1. Note that these statistics are after various filtering operations had been applied to the data files (detailed in the rest of this section). Hence, they may not reflect the statistics in the raw GP corpus.

3.1.1 Suitability analysis

One major benefit of the GP corpus is that it provides phone-level dictionaries for each word in the transcriptions. However, generating phonetic transcriptions based on the words is likely to be relatively inaccurate.

There are two main reasons for this: firstly, the phonetic transcriptions for GP were made by joining the phonetic words in the GP dictionaries together, as described later in section 3.2.5. If a word had multiple pronunciations, the first in the dictionary was selected, since it is impossible to tell from the transcript files alone which pronunciation occurs most frequently.

21 22 Chapter 3. General setup

Language         Length (hrs)   No. speakers   Age (mean)    Gender ratio (m:f)
Bulgarian        20.6           77             32.6±13.9     42:58
Croatian         10.1           93             32.1±14.2**   40:60
Hausa            8.7            103            28.2±10.5     31:69
Polish           24.6           99             *             *
Swahili          11.2           70             24.8±7.5      46:54
Swedish          21.7           98             32.3±13.7     51:49
Turkish          14.8           101            27.5±10.8     28:72
Ukrainian        14.1           119            34.0±13.5     39:61
Total            125.8          760            -             -
Mean (language)  15.7±5.5       95±14.4        -             -

Table 3.1: Summary of GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken. ** indicates that information was missing for just one speaker.

Secondly, the words in the transcriptions were joined together independently. This meant that word boundaries were not reflected well, particularly when words ran into each other. However, once again, there is little that can be done unless one were to go through and painstakingly transcribe every utterance by hand, phonetically.

For these reasons, a fairly high phone error rate (PER) was quite likely, regardless of how good any model happened to be. The best result for the TIMIT dataset of phonetic transcriptions [Garofolo et al., 1992] is around 15% [Michalek and Vanek, 2018]. Furthermore, TIMIT is well-transcribed at the phonetic level, and is monolingual; GP is multilingual and has the aforementioned issues. Consequently, an alternative evaluation metric was to generate a phonetic inventory/distribution for each language, and compare this to the “true” value (based on the transcriptions); more information about this is provided in section 3.6.2.

There is still great value nonetheless in working with datasets like GP, which are word- level transcribed. It is extremely time-consuming to create phone-level transcriptions, and there is often room for debate among linguists as to what exact phone should be used. Word-level transcriptions are far easier to make, and considerably more common in language corpora, so the techniques here would be more widely applicable.

In conclusion, despite its potential flaws, the GP corpus is a suitable source of training/testing data, since some issues can be accounted for through its size, and using data augmentation techniques. Inaccurate phonetic transcriptions are the main downside, but there is little that can be done except to make a note of it.

3.2 File preparation

3.2.1 Kaldi

Kaldi is an open source library designed for speech and phone recognition [Povey et al., 2011]. It provides scripts and functions for various tasks, from extracting MFCCs to training deep networks. Much of the work in file preparation detailed here was done using Kaldi, or adapted versions of its scripts.

3.2.2 Conversion and preliminary cleaning

The first step was to convert the GP audio files to the WAV format to make them usable, with the sox and shorten applications. The vast majority of the files were converted properly, though a few failed for unknown reasons (the error messages provided were not particularly helpful, only warning of magic numbers).

In some cases, either the transcription or the audio was missing for an utterance. For these ones, the utterances were removed. A wiki page [2010] by Partha Lal, who did some work with GP [Lal, 2011], lists a number of issues such as these, which were noted. Any utterances marked as problematic here were also discarded.

Notably, the audio files for Tamil were very different - it seemed that the utterances for each speaker had been combined into one very long audio file (about half an hour long). Furthermore, there were no transcriptions or phone dictionaries. As a result, the Tamil language was removed.

3.2.3 Splitting the data

According to the GP documentation [Schultz, 2002], any train/valid/test split should not contain the same article or speaker in two different sets. That is, the theoretical best split would be to have completely unique speakers and articles in each split. The “higher” priority was to ensure that the speakers were kept separate, since that would provide more similarity between splits than, for example, two speakers with different accents reading the same sentence.

First, each individual sentence spoken by the different speakers was obtained from their transcriptions. These were then compared between all other speakers to find the number of sentences that were the same between the two. This provided a strong measure of the overlap between different pairs of speakers. Pairs of speakers with more than 10 overlapping sentences were marked as “very bad”, pairs with 2-10 as “bad”, and pairs with only one sentence as “minor”. The main goal was to ensure that speaker pairs with high levels of overlap were placed in the same split (to reduce overlap between splits). In most cases, some degree of at least “minor” overlap between the splits was unavoidable. A sketch of this pairwise check is given at the end of this subsection.

The strategy was to allocate as many “unique” speakers (ones with no overlap with other speakers) as possible to a split. This was done first for the test set, to ensure that it was as different as possible from either training or validation sets. If there were no “unique” speakers left, then the next step was to add any pairs/groups of speakers which only overlapped with each other. For example, if speakers 1, 2 and 3 have high levels of overlap between them, but not with any other speakers, then it would be fine to include them all in one split. Finally, if necessary, “minor” (and then “bad”, and lastly “very bad”) speakers were added too.

The utt2len file contains the length of each utterance in seconds. From this, it was possible to calculate the length of each speaker's utterances, and thus what percentage each individual speaker contributed. Thus, this process was repeated until the sets were split for each language into 70% training, 15% validation and 15% testing.

The existing work with GP in Kaldi [Povey et al., 2011] provided some speaker splits; however, these were found to be roughly 80/10/10 splits for training/validation/test. Since there was a fairly large quantity of training data available, it seemed that it might be beneficial to use more data when validating/testing to get a better idea of how well the techniques could generalise. Thus, a 70/15/15 split was used instead.
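The snippet below is a minimal sketch of the pairwise overlap check described above; the speaker IDs and sentences are made up, and the project built the per-speaker sentence sets from the GP transcription files.

```python
# Sketch of the speaker-overlap check: count the sentences shared between every
# pair of speakers and classify the overlap as "very bad", "bad", "minor" or "none".
from itertools import combinations

def classify_overlap(n_shared):
    if n_shared > 10:
        return "very bad"
    if n_shared >= 2:
        return "bad"
    if n_shared == 1:
        return "minor"
    return "none"

def speaker_overlaps(speaker_sentences):
    overlaps = {}
    for s1, s2 in combinations(speaker_sentences, 2):
        shared = len(speaker_sentences[s1] & speaker_sentences[s2])
        overlaps[(s1, s2)] = classify_overlap(shared)
    return overlaps

# Hypothetical speaker IDs and sentences, purely for illustration.
example = {
    "BG001": {"sentence a", "sentence b", "sentence c"},
    "BG002": {"sentence c", "sentence d"},
    "BG003": {"sentence e"},
}
print(speaker_overlaps(example))
# {('BG001', 'BG002'): 'minor', ('BG001', 'BG003'): 'none', ('BG002', 'BG003'): 'none'}
```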

3.2.4 Standardising phones

A challenge with the GP dataset is that it has been collected in different places and built up over time. Consequently, it does not use a fully-standardised set of phones in each language's dictionaries, but usually a customised one. For instance, in Swahili, the IPA character γ is represented by SWA_gh, but in Spanish as M_G. In German, M_ae corresponds to æ, but in Swedish, M_ae corresponds to ε.

Thus, for each language, a phone map was built, mapping from language-specific phones to their IPA and X-SAMPA characters. X-SAMPA is a form of the IPA designed for computers; it uses only simple characters and punctuation marks [Wells, 1995]. In some cases, the mapping was relatively straightforward as documentation was easily available, for instance, for Hausa and Swahili. The GP work in Kaldi [Povey et al., 2011] also provided mappings for some languages. For Korean, there was an existing thesis [Kiecza, 1999], which included descriptions of the phones.

Additionally, the Wikipedia IPA page for each language was checked to ensure that the phone lists were reasonable. Sometimes it was used in combination with the GP language dictionary to infer the IPA symbols where proper documentation was lacking, in particular with Turkish and Japanese. Most languages had similar patterns, too; for example, [x]_m usually corresponded to “m”, which provided another sanity check on the accuracy of the phone maps (although, as stated earlier, this was not an infallible guide). An example list of conversions is provided in the appendix.

If there was no documentation available, and the language's alphabet was unfamiliar (e.g. Russian, Mandarin), then that language was ignored for the purposes of this project, since it was almost certainly not worth taking stabs in the dark to guess the correct conversions.

From these phone maps, IPA and X-SAMPA word dictionaries could be generated using the GP dictionaries. The reason for creating both was that IPA is more human-readable, but X-SAMPA is better for computation, since it does not use any obscure

Unicode characters which may otherwise cause unexpected behaviour.

3.2.5 Generating transcriptions

Having obtained a word-to-phone X-SAMPA dictionary (section 3.2.4), a transcript could be obtained by going through each utterance and getting the corresponding phones for each word. A silence phone was added to the start and end of each utterance.
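A minimal sketch of this step is shown below; the dictionary entries, phone symbols and the "sil" label are illustrative placeholders rather than GP data.

```python
# Sketch of generating a phone-level transcript from a word-level utterance,
# taking the first pronunciation for any word with several, and padding with a
# silence phone at the start and end.
LEXICON = {
    "hello": [["h", "E", "l", "o"]],
    "world": [["w", "e", "l", "d"], ["w", "O", "l", "d"]],   # two pronunciations
}
SIL = "sil"

def utterance_to_phones(words, lexicon):
    phones = [SIL]
    for word in words:
        pronunciations = lexicon.get(word.lower())
        if pronunciations is None:
            raise KeyError(f"'{word}' missing from dictionary (e.g. a stutter)")
        phones.extend(pronunciations[0])    # first pronunciation wins
    phones.append(SIL)
    return phones

print(utterance_to_phones(["hello", "world"], LEXICON))
# ['sil', 'h', 'E', 'l', 'o', 'w', 'e', 'l', 'd', 'sil']
```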

For some languages, stutters/hesitations were included in the transcription. These created a problem since they were not present in the original dictionaries. In some cases, there were relatively few of these and it was easy enough to add additional entries in the dictionary. For example, the pronunciation of a truncated word could quickly be found by looking at the entry for the full word (e.g. “market”).

In other cases, there were far too many of these and it would have taken a very long time to go through and write manual entries for all of them. Automating the process was not very feasible either, since

1. Determining how many of a similar word's phones to use was not always clear-cut.

2. Sometimes there were multiple possibilities, and human input was best to check and compare similar words to work out which was most likely in the context.

Consequently, for languages like Swedish, which had a large number of the issues mentioned above, some entries were added to the dictionary to account for the most frequent stutters, but the majority of stutters were not fixed (and any corresponding utterance ignored). Notably, this did not significantly affect the train/validation/test splits made previously, since the bad utterances tended to occur uniformly at random.

It was also during this stage that a problem was discovered with the Japanese dictionary. It was not sorted alphabetically (unusually for GP dictionaries), and upon sorting it was found that it contained many duplicate entries for the same word; for instance there were five entries for “age” with identical pronunciations. In addition, many words which appeared relatively frequently were missing; for example “niha” occurred 124 times in transcripts, but was missing from the dictionary. As a result, the Japanese language was dropped for this task, since it was not possible to obtain a working dictionary anywhere else.

As mentioned previously, multiple pronunciations were an issue. However, trying to accommodate them can in fact worsen performance [van Bael and King, 2003]. Renals [2012] suggests a few reasons. Firstly, it adds complexity to the model by adding more clusters; secondly, many words tend to occur with one pronunciation the majority of the time. Unfortunately, it was not possible to tell which one occurred most often directly from the GP data.

3.2.6 Generating input features

The MFCC and filterbank features were generated using standard Kaldi scripts. All the configurations were left at their default settings, namely a 25ms window, 10ms shift and 23-D features, with delta and delta-deltas (giving 69-D feature vectors). Furthermore, it was possible to do this process just once, and then select utterances from the .scp files that Kaldi generates (these map each utterance to the location where the features are stored in binary form).

3.3 Organising experiments

The first step was to combine the data from multiple languages, using the speaker lists for training/validation/testing. This was relatively straightforward to accomplish. The scripts for generating language models were adapted from the ones for the TIMIT dataset.

Once the data and language models were built, some initial alignments needed to be generated for use in the neural network. Based on the Kaldi scripts for TIMIT, a monophone model (section 2.2.2) was trained first. Next, a triphone model was built which included delta and delta-delta features (section 2.2.3). Thirdly, an LDA-MLLT triphone model was built on top of the previous triphone model, and then finally an SAT and LDA-MLLT triphone model on top of that one (section 2.2.4). Each successive stage used the previous alignments as a starting point. How the parameters were selected, and how well the models performed, is discussed in section 4.1.

3.3.1 Additional filtering

Some final filtering was performed after the triphone models were trained. It was noticed that a few of the utterances were not aligned correctly to phones when it came to training the network. Upon closer inspection, these proved to be missing from the alignments directory. The solution was to check the alignments, and if any utterances were missing, these were removed from the data. This also seemed to fix some crashes which had been occurring later during training of the neural network.

3.3.2 Converting phones to attributes Li et al. [2018] have a list of phones and their attributes in the appendix; for instance, /a/ = “vowel open front unrounded”. It was decided that these would be used as the universal phone targets. A file mapping each phone in the languages used to a 51- dimensional set of attributes was made, where each attribute vector was 1 if the phone had that attribute, and 0 otherwise. Some attributes were effectively “extensions” to a base phone (e.g. “long” or “aspirated”); that is, they tended to be added on to the end of an existing phone, so all other attributes would be retained. Full tables of all the base phones and extensions are provided in the appendix: Table A.1 and Table A.2. 3.3. Organising experiments 27

It was then possible to take this and the phones.txt file generated by Kaldi scripts (which maps phone numbers to phones) to create a mapping from the phone numbers to their attribute vectors. In turn, this would enable the generation of attribute vectors from the numbers given out in the alignments.
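A sketch of this step is given below. The file formats assumed here (a phones.txt of "symbol id" pairs and an attribute table of "phone b1 ... b51" rows) are illustrative; the project's actual files are only described in outline in the surrounding text.

```python
# Sketch of turning Kaldi's phones.txt (phone symbol <-> integer id) and a
# phone-to-attribute table into a map from alignment phone ids to 51-D
# attribute vectors.
import numpy as np

def load_phone_ids(phones_txt):
    """phones.txt lines look like: '<phone-symbol> <integer-id>'."""
    phone_to_id = {}
    with open(phones_txt) as f:
        for line in f:
            phone, idx = line.split()
            phone_to_id[phone] = int(idx)
    return phone_to_id

def load_attribute_table(attr_file):
    """Assumed format: '<phone-symbol> b1 b2 ... b51' with each b either 0 or 1."""
    table = {}
    with open(attr_file) as f:
        for line in f:
            parts = line.split()
            table[parts[0]] = np.array([float(b) for b in parts[1:]], dtype=np.float32)
    return table

def id_to_attribute_map(phones_txt, attr_file):
    phone_to_id = load_phone_ids(phones_txt)
    attrs = load_attribute_table(attr_file)
    # Map every phone id appearing in the alignments to its attribute vector.
    return {idx: attrs[phone] for phone, idx in phone_to_id.items() if phone in attrs}
```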

3.3.3 Dealing with diphthongs

During the previous stage, it was decided to remove languages that contained diphthongs, such as German and Korean. Diphthongs (a.k.a. gliding vowels) are made by moving the tongue during the pronunciation of a vowel, thus combining two vowels into one sound. Modelling diphthongs in terms of attributes was a complication identified in this project. Since diphthongs are effectively two vowels, what should the attribute vector look like?

As an example, consider the diphthong /ai/, where /a/'s attributes are “vowel open front unrounded” and /i/'s attributes are “vowel close front unrounded”. One option would be to represent the unique attributes as 0.5, and the shared attributes as 1. So, for instance, /ai/ would have 1 for “vowel, front, unrounded” and 0.5 for “close, open”. A related option would be to set the union of the phone attributes as 1.

Another alternative considered was splitting the diphthong into its two phones and then using two separate attribute vectors. This was the method used by Frankel et al. [2007] and Richardson et al. [2003]. A problem with implementing such a method in this particular context was that frame-level alignments from triphone models, where the diphthongs were considered distinct phones, were being used to train the network. Thus, if diphthongs were present, this would require going through all the alignments and trying to split the corresponding consecutive entries in half (so that the first half was labelled with the first phone and the second half with the second).

One solution to that issue would be to split and remove diphthongs in the setup stages when training triphone models. However, there would be issues with this approach as well. The diphthongs were deemed distinct enough from other phones in the language that they each had their own phoneme, so ignoring this would likely have had an adverse effect. In fact, Richardson et al. [2003] did indeed find that this technique “resulted in minor improvement in one iteration, followed by degraded performance in future iterations” for them.

Finally, even if it were possible to represent diphthongs using any one of these methods, there would be one major problem which is not at all easy to overcome - predicting phones from attribute scores. The method for determining the most likely phone based on attributes is explained in more detail in section 3.5. Suffice to say that a score was generated based on how strongly each of a phone's attributes were predicted. If the first method of weighting attributes or setting them all to 1 was used, then diphthongs could be predicted if they scored highly enough on both phones. However, there would be no way to distinguish between /i@/ and /@i/, for instance, in Vietnamese. With the second method of splitting diphthongs, then, for example, /ia/ could not be distinguished from /i/ and /a/ separately. A possible solution was thought of later, but there was not time to implement it, so it will be discussed in section 5.1.2.

Thus eight languages remained after discarding the rest for one reason or another. They are shown in Table 3.2, and the amount of phonetic overlap between them in Figure 3.1.

Language    Language code
Bulgarian   BG
Croatian    CR
Hausa       HA
Polish      PL
Swahili     SA
Swedish     SW
Turkish     TU
Ukrainian   UA

Table 3.2: Languages used in the project and their corresponding language codes


Figure 3.1: Common phones between languages. See Table 3.2 for the key to the language codes

French was accidentally removed, but by the time this mistake was realised, it was too late to fix. However, it is also somewhat notorious as a language for multiple pronunciations of the same words, so it may be for the best that those phonetic transcriptions were excluded. Regardless, the eight remaining languages were across a good range, and with four Slavic ones, an experiment within the same language family was possible.

In addition, Bulgarian and Hausa were mistakenly included; they do in fact have diphthongs: /ya/, /yu/, /ai/ and /au/. This was only noticed later since an earlier version of the code had used the method of setting all attributes in a diphthong to 1. When the decision to remove diphthongs was made afterwards, this part of the code was forgotten. As a result, the pipeline appeared to run as normal, and the issue was found considerably later during more in-depth result analysis.

Unfortunately, by this point it was too late to retrain the entire network for each experiment. A solution instead was to include these specific diphthongs in the list of universal phones during the stage of predicting phones from attributes (more details on the methods for doing so in section 3.5).

However, it was believed that this would not necessarily have resulted in greatly impeded performance. Since none of the diphthongs in question had their reverse present, i.e. /ya/ was present but not /ay/, the issue of confusing the order did not apply. Furthermore, Bulgarian was among the best results in later experiments, so the effect of including it cannot have been disastrous.

3.4 Using PyTorch-Kaldi

PyTorch-Kaldi [Ravanelli et al., 2019] is a toolkit designed to enable the use of PyTorch's neural network functionality [Paszke et al., 2019] with Kaldi's speech tools. Since it had some initial models built for the phonetic TIMIT dataset, it seemed highly appropriate to use. However, a considerable number of modifications were necessary, which are described in this section.

3.4.1 Adapting input alignments

By default, PyTorch-Kaldi reads in the alignments for input features corresponding to phone numbers or context-dependent phone numbers. The phone numbers were converted into attribute vectors using the phone-to-attribute mapping file built in section 3.3.2. Some modification of label indexing was then required since PyTorch-Kaldi concatenates the features and their target labels together into rows when setting up chunks for training, and does not “expect” the target label for a feature to be any longer than 1 (and in this case the target lengths were 51).

3.4.2 Cost function

In vanilla PyTorch-Kaldi, the output layer of the network was of the shape N × P, where N is the number of outputs and P is the number of phones or context-dependent phones. A softmax function (Equation 3.1) was applied, which converted each output to a row of probabilities across all possible phones. Finally, the negative log-likelihood (NLL(x) = −log(x)) was applied as the cost function; this penalised any incorrect, or correct but uncertain, phone predictions, and rewarded confident and correct ones.

σ(x)_i = exp(x_i) / Σ_{j=1}^{K} exp(x_j),   for i = 1, ..., K     (3.1)

Equation 3.1: Softmax function

However, this cost function was not appropriate for the new output vectors, as in this case a sigmoid function (Equation 3.2) was applied to the network outputs, which turned them each into vectors containing values between 0 and 1. These values roughly corresponded to the probability of each attribute being present or absent in the outputs.

S(x) = 1 / (1 + exp(−x))     (3.2)

Equation 3.2: Sigmoid function

To compare these output probabilities to the “ground truth” labels at each frame, binary cross entropy (BCE) was used. BCE is calculated using the formula in Equation 3.3. This way, all of the output values in a vector need to be close to the target, otherwise the BCE will increase. PyTorch provides the function BCELoss, which allows the BCE loss to be backpropagated through the network.

BCE(x, y) = − Σ_{i=1}^{D} [ y_i log(x_i) + (1 − y_i) log(1 − x_i) ]     (3.3)

Equation 3.3: Binary cross entropy formula. y = target labels (0 or 1); x = output probabilities; D = dimensionality of x and y

One issue that arose initially during training was that, because the majority phone class is “silence”, the BCE loss was flooded with 0-vectors, so after several iterations of backpropagation the network only ever predicted 0. Class imbalance is an old machine-learning problem, and undersampling is one way to tackle it [Japkowicz, 2000]. Thus, it was necessary to filter out the majority of silence phones.

Ideally, this would be done when reading each data chunk in, and this was the first method attempted. If the mean frequency of non-silence phone occurrences was, for example, 2000, then the silence phones would be reduced to this number. However, it was found that PyTorch-Kaldi applies some padding to the inputs to make batches the same size, based on some complicated calculations involving the length of the utterances. When the above method for filtering silence was applied, it changed the utterance lengths, and the training pipeline broke. How this issue could be fixed was not obvious, due to the rather opaque nature of the calculations, so it was decided that this approach should be avoided.

Instead, the silence removal was done immediately before calculating the loss on each training batch. Here 95% of the silence labels and their corresponding outputs were removed. This was arguably less efficient, but in practice ran at a similar speed to the baseline model. The effects were seen immediately as a good mix of 0s and 1s were now predicted. Furthermore, keeping some of the silence meant that it would not be completely ignored (which would otherwise have caused problems when testing).
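The sketch below illustrates this pre-loss silence undersampling on a single batch. The 51-D per-frame shape follows the project, but the helper itself and the identification of silence frames as all-zero targets are an illustration rather than the actual PyTorch-Kaldi modification.

```python
# Sketch of removing ~95% of silence frames immediately before computing the
# BCE loss on a batch. `outputs` and `targets` are (num_frames, 51) tensors;
# a frame is treated as silence when its target attribute vector is all zeros.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def loss_with_silence_undersampling(outputs, targets, keep_silence=0.05):
    is_silence = targets.sum(dim=1) == 0
    keep = ~is_silence                                  # keep every non-silence frame
    rand = torch.rand(int(is_silence.sum()))            # keep ~5% of the silence frames
    keep[is_silence] = rand < keep_silence
    return bce(outputs[keep], targets[keep])

outputs = torch.sigmoid(torch.randn(1000, 51))          # fake network outputs
targets = torch.zeros(1000, 51)
targets[:200, :4] = 1.0                                 # 200 non-silence frames
loss = loss_with_silence_undersampling(outputs, targets)
```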

3.4.3 Model saving

One interesting discovery was that, during the testing phase, PyTorch-Kaldi only uses whatever the final model is after training is complete. This would likely result in the model intended for testing being overfit to the training data. Instead, this was changed so that the validation loss was checked at each epoch. If there was an improvement, it was saved as the new best model. The final best model was then used during testing.
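In outline, the change amounts to the usual best-checkpoint pattern sketched below; the function names and checkpoint path are placeholders, not the actual PyTorch-Kaldi code.

```python
# Sketch of keeping the best model according to validation loss, rather than
# whatever model exists after the final epoch.
import torch

def train(model, train_epoch, validate, n_epochs, path="best_model.pt"):
    best_valid_loss = float("inf")
    for epoch in range(n_epochs):
        train_epoch(model)
        valid_loss = validate(model)
        if valid_loss < best_valid_loss:            # improvement: save as the new best model
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), path)
    model.load_state_dict(torch.load(path))         # use the best model for testing
    return model
```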

3.4.4 Chunk sizes

The authors of PyTorch-Kaldi recommend making data chunks of roughly 1-2 hours in length. This ensures that chunks can fit in the GPU when working with a very large dataset. A simple function was constructed to calculate the total length (in hours) of each dataset split (training, etc.). It then rounded this figure up to the nearest 10 so that there would be roughly one hour or less per chunk. These chunk lengths were then written to the configuration file. Typically, there were 90 chunks in training, with 20 each for validation and testing.

3.4.5 Network structure

The PyTorch-Kaldi framework provides some configurations designed for use with TIMIT. Since this project also involves phone prediction, it was decided that these would provide a suitable baseline configuration. Two main options were considered for the baseline structure: an RNN or an LSTM (section 2.3). One alternative to these, a multi-layer perceptron (MLP) network, would have been comparatively faster to train. It was not deemed suitable, however, because it inherently fails to capture the temporal aspects of the input features.

An LSTM did seem like a viable option, and initial reports of relatively low validation losses per epoch seemed encouraging. However, there were two key flaws. Firstly, it was much slower to train than an RNN, by a factor of approximately three. Secondly, and more seriously, there appeared to be a memory problem somewhere in the way it had been set up in the PyTorch-Kaldi framework. With all configurations at their default settings, it would train for four epochs, before unexpectedly running out of memory when allocating the LSTM outputs on the fifth. This issue persisted even when saving and restarting at this point, and it was not feasible to find the source of the problem.

Consequently, an RNN was chosen as a compromise. The baseline structure is shown in Figure 3.2. While not as good as the LSTM in terms of validation loss, it was considerably faster to train (making more experiments possible), and did not experience the same memory problems. The general principles behind the project still apply, and future work could replace the RNN with an LSTM or any other recurrent structure.

A very similar structure was used for the attribute network (Figure 3.3). In this case, the final layer was 51-D (one output per attribute), and a sigmoid function was applied to it, thus forcing the values for the attributes to be between 0 and 1. BCE loss could then be utilised during training (section 3.4.2), and phones could be predicted from these attribute values during testing (section 3.5). The reason for not using CTC loss, as Li et al. [2018] did, was in brief due to the limitations imposed within PyTorch-Kaldi; the problems and potential future solutions are discussed later in section 5.1.4.

[Figure: input features feed into recurrent layers (ReLU), then linear layers with a softmax over context-dependent and raw phones; NLL loss is used during training and predictions are taken during testing.]

Figure 3.2: Baseline RNN structure. Main layers shown on the left; activation functions shown on the right.

[Figure: input features feed into recurrent layers (ReLU), then a linear layer producing 51 articulatory features with a sigmoid activation; BCE loss is used during training and phone prediction during testing.]

Figure 3.3: Attribute network structure

3.4.6 Gradient issues

It was found that, on occasion, around the 7th–10th epochs, the model would suddenly output NaN (not a number). This appeared to be an issue with the gradients. There was a gradient clipping function in the original code for core.py, but it had been commented out by default. Enabling this function clipped gradients to 0.1, which seemed to resolve the problem for the baseline network.

However, the problem persisted when using a network for attributes. One possible reason is related to the fact that the attribute network uses a sigmoid activation for the final layer, whereas the baseline network uses softmax. In addition, by using BCE loss instead of NLL loss, the actual values of the loss were considerably lower (e.g. 0.09 vs. 3.2 on the same epoch). This may result in the validation loss reaching a local minimum, where the sigmoid gradients become saturated and end up passing spurious weights to the rest of the layers during backpropagation, even with the gradient clipping. If that were the case, then it would be fine to use the best current model, since future training would be unlikely to significantly improve performance beyond this point. While there may be another root cause for the problem, there was unfortunately not much else that could be done to fix it. When it did occur, the best model trained so far was the one used for evaluation.
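For reference, the re-enabled clipping corresponds to something like the following standard PyTorch call, with the threshold of 0.1 mentioned above; the exact function used inside PyTorch-Kaldi's core.py may differ (e.g. norm-based rather than value-based clipping), and the tiny model here is only a stand-in.

import torch
import torch.nn as nn

model = nn.Linear(69, 51)                                  # stand-in for the real network
loss = torch.sigmoid(model(torch.randn(4, 69))).sum()
loss.backward()

# clip gradients so that no single value exceeds 0.1 in magnitude
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.1)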

3.5 Predicting phones from attributes

One problem with the work by Li et al. [2018] on a universal phonetic model (on which much of this project was originally based) was that it required a signature matrix for each language. This consisted of a mapping between the feature vectors and the language’s phonemes, used to transform the attribute outputs into phones. The issue here is that this would not work if the language’s phones were unknown to begin with.

3.5.1 Distance metrics

Different methods were considered for how to predict phones from their attributes, as this was not a trivial problem to solve. At first, simply calculating the cosine or pairwise distance between the outputs and each phone was considered. The predicted phone for each output would be whichever one was closest (see Figure 3.4). However, it was realised that this still requires the phones to be known beforehand, as with the signature matrix.

3.5.2 Initial split

Since the phonetic attributes are in general very different between vowels and consonants, the first thing that needed to be done was to split these apart, while also allowing for silence. If this splitting was not done, it was observed that a select few of the vowels and consonants would heavily dominate the predictions, such as /a/, /i/ and /d/.


Figure 3.4: Simple distance scoring between test vector and all possible target phone vectors.

The method chosen was to sum the values of the vowel and consonant attributes. If the sum of the two was greater than 0.5, then there was (approximately) a > 0.5 chance that the output was not silence. Any outputs below this threshold were marked as silence. The rest of the outputs were split based on whether or not the vowel attribute was larger than the consonant one. The end result was three sets: silence, vowels and consonants.
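A sketch of this split on a batch of attribute outputs is given below; the positions of the “vowel” and “consonant” attributes in the 51-D vector are placeholders.

import torch

VOWEL_IDX, CONSONANT_IDX = 0, 1          # hypothetical attribute positions

def split_outputs(outputs):
    """outputs: (N, 51) sigmoid activations. Returns three boolean masks."""
    vowel = outputs[:, VOWEL_IDX]
    consonant = outputs[:, CONSONANT_IDX]
    silence = (vowel + consonant) <= 0.5           # below threshold: silence
    is_vowel = ~silence & (vowel > consonant)      # otherwise, the larger wins
    is_consonant = ~silence & ~is_vowel
    return silence, is_vowel, is_consonant

outputs = torch.rand(6, 51)
sil, vow, con = split_outputs(outputs)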

3.5.3 Decision trees

Having split the outputs into vowels, consonants and silence, one option that was attempted for predicting the specific phones was to create binary decision trees. These would split on whichever attribute happened to divide the remaining phones most evenly. For example, whether or not vowels had the “unrounded” attribute split them almost exactly in half. The idea was that if the output value for an attribute was > 0.5 it would be counted as present (and ≤ 0.5 as absent), and the tree was followed down until a phone was reached. See Figure 3.5 for an example.

The issue with this method was that the binary tree could end up becoming very deep, where in some cases a phone sat at the bottom of nine splits. This meant that the phones towards the bottom were very unlikely to be picked, since there were so many other phones that could be selected first along the way (for instance in Figure 3.6).

The next attempt was to perform splits based on the general attribute categories, such as “place” and “manner” for consonants, and then take the maximum value each time. For an illustration of how this would work for vowels, see Figure 3.7. This somewhat mitigated the problem of unfair bias against certain phones since, for example, no one value for “place” was higher in the tree than another.

[Figure: a small binary tree splitting first on “unrounded?”, then on “front?”/“back?”, then on attributes such as “near-open?”, “near-close?” and “central?”.]

Figure 3.5: Example binary decision tree for classifying vowels from attributes.

[Figure: a deep binary tree in which phones such as /æ/, /ɪ/, /e/ and /ɛ/ are reached after a few splits, while /a/ and /i/ sit at the very bottom.]

Figure 3.6: Bad binary tree where the /i/ and /a/ phones are much too far down and face a severe disadvantage in terms of likelihood of being predicted.

[Figure: splits by attribute category, Height (open, near-open, open-mid, close-mid, near-close, close), then Backness (front, central, back), then Rounding (rounded, unrounded), leading to phones such as /ø/ and /e/.]

Figure 3.7: Part of a decision tree for vowels, where the highest scoring attribute path is chosen at each step

The complication in this case was that consonants in particular often (but not always) had multiple values for some attributes, e.g. for /b/, “place” is both “bilabial” and “labial”. Only taking one maximum would not be sufficient to distinguish between every phone. Furthermore, the fact that some phones did indeed only have one value for a category would make the tree uneven (thus causing the same problems as the binary tree), and very complicated to build. It would also not be possible to implement this very efficiently, since it would require taking the outputs out of GPU memory to put each of them through the decision tree.

3.5.4 Universal scoring

The final method, which turned out to be arguably one of the simplest, was to merge all of these ideas together. First, vectors for each vowel and consonant in the universal phone set were found. These vectors consisted of the indices in the attribute vector at which each of the phone’s attributes occurred. For instance, if “unrounded” occurred at index 30 in the attribute vector, then vowels with this attribute would include the number 30 in their vectors.

Attribute extensions (such as “long” for vowels) were included by creating identical vectors with the extension index added. So, for instance, each vowel vector had a version with “long” (index 29) appended to the end. Only one extension was added to each base phone (i.e. not multiple extensions) to prevent an exponential blow-up in the number of possible vectors. All observed phones only ever had one extension, so while it is theoretically possible for a phone to have more than one, this would at least seem to be sufficiently rare that such a possibility was not worth including. Furthermore, extensions were only added if they existed in the training data (as it would be impossible to predict an unseen attribute).

Then, at first, each phone in the resulting lists of consonant and vowel phones was assigned its particular phone number if it actually occurred in the phones.txt file, and 0 (epsilon) otherwise. This resulted in a list of vowel and consonant phones, each consisting of the phone number followed by a list of indices where its features occurred in the output. From this point, it was then very straightforward to obtain a score for each phone by summing the values of the output at each phone’s feature indices (Figure 3.8). For normalisation, this score was divided by the number of attributes so that if, for example, one phone had six attributes and another only four, the scores were still comparable between them. From these scores, the highest scoring one could be selected and the corresponding phone number given (or 0 if it was unseen).


Figure 3.8: Example of scoring method with universal phones
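As a concrete illustration of the scoring in Figure 3.8, the sketch below handles a single frame with made-up phone numbers and attribute indices; the real implementation performed the same index slicing and summing over whole batches of tensors without leaving the GPU.

import torch

# hypothetical candidate phones: phone number -> indices of its attributes in the
# 51-D attribute vector (index 29 stands in for the "long" extension; all values
# here are made up)
candidates = {
    35: [34, 21, 45, 29],
    110: [34, 21, 45],
}

output = torch.rand(51)          # sigmoid activations for a single frame

scores = {}
for phone, idx in candidates.items():
    # sum the activations at this phone's attribute indices and normalise by the
    # number of attributes, so phones with different attribute counts stay comparable
    scores[phone] = output[idx].sum().item() / len(idx)

predicted_phone = max(scores, key=scores.get)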

The key difference between this method and the simple distance scoring in section 3.5.1 was that, if desired, a universal phones.txt file could be created with all possible phones and then used when generating outputs (although there would now be over 580 possible phones). At first, this was avoided, in an attempt to allow standard Kaldi decoding to work, since it cannot handle unseen phones. However, once Kaldi decoding failed (see section 3.6), the universal prediction method was applied instead. A further advantage of this approach over the decision tree method is that the index slicing and other operations could all be done efficiently on tensors, without needing to detach them from the GPU environment.

Finally, it would be possible to extend this method to predict a probability distribution across phones, using softmax on the phone scores, but this was not done for now, for reasons which will be explained in the following section.

3.6 Evaluation

3.6.1 Issues with decoding

For the shallow monophone/triphone models and the baseline RNN, universal phones were not being predicted. Thus, it was straightforward to use standard Kaldi decoding and scoring scripts to obtain the PER. In the case of articulatory features, this was not possible. Since the network predicted phones rather than context-dependent phones, the Kaldi language models could not accept these inputs. Even monophone language models did not work, since they used probability density functions rather than raw phones. The effect was that full decoding and scoring was not possible for the attribute networks in this project, as the only alternative would be to create specialised decoding and scoring scripts from scratch. Such an approach was considered, but it was decided to avoid it for a couple of reasons.

Firstly, there was a problem with obtaining reliable probabilities for silence phones, since subtracting the sum of the vowel and consonant values (section 3.5.4) was not necessarily a very accurate way of doing it. In hindsight, it would likely have been beneficial to include “silence” as one of the attributes in the feature vectors, as King et al. [2004] did. This alone would not be a completely insurmountable challenge.

The main issue lay with decoding the output probabilities (which could be obtained by using softmax on the scores). To use the Viterbi algorithm, for instance, to find the best path would require a transition or language model for the universal phone set. In a bigram model, this would give the likelihood of seeing phone y given that the preceding phone was x. More preceding phones could be used in the case of a trigram or n-gram model. An assumption that would need to be made is that if the previous phone was unknown, then all other phones are equally likely.

However, the main problem lay with the case where the previous phone was known. What should the probability of an unknown phone be in that scenario? Unlike in word-level language models, mapping it to an unknown (<unk>) token, or equivalent, was not an option, since then it would not be possible to predict unknown phones (which was the whole point). Considering all unknown phones as equally likely would not be ideal either, since such a distribution does not apply to the phones of any real unknown language. In addition, some of the transition probabilities from known to known phones would be available. How should these be weighted, given that there would be many more unseen phones than seen ones? Putting too little weight on known transition probabilities would result in the unknown ones “swamping” the decoding, and being predicted far too often. Any rare (but known) phones would almost never be predicted. On the other hand, putting too much weight on them would mean that unknown phones would become so improbable as to hardly ever be predicted, even if their score was really high.

In summary, building a language model with unknown phones is not a straightforward task, and would require numerous design decisions, some of which would be difficult to justify. In fact, being able to use CTC loss would be of great benefit here, as it does not require a language model to produce phone-level outputs. However, that was not possible within the current experimental framework, so some different ways to evaluate the network output had to be considered. These are discussed in the following section.

3.6.2 Alternative evaluation metrics

Some evaluation was still possible nonetheless. The best compromise was to evaluate two aspects of the model: one based on the best alignments and the other based on the transcriptions themselves.

Firstly, the predicted phone outputs and their target labels (from the best alignments) were recorded in text files. It was then possible to compare the two and create a large confusion matrix of true vs. predicted phones for each frame. This would not be entirely accurate, since there was already some error in the phone-frame alignments. Nonetheless, it provided some sense of how effectively the network had been trained at the frame level, and gave an indication of how well phones could be predicted from attributes alone. It was observed that a few phones tended to occur very frequently, which made it difficult to see the values of other phones in the confusion matrix diagram (Figure 3.9). To fix this issue, normalisation was done along both axes to produce two diagrams which were considerably easier to view.

[Figure: three confusion matrices (unnormalised, y-axis normalised, x-axis normalised) with true phones on the vertical axis and predicted phones on the horizontal axis.]

Figure 3.9: Example confusion matrices. The first only uses raw counts; the second and third are normalised with respect to precision and recall respectively.
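A small sketch of the two normalisations using NumPy, assuming a hypothetical counts matrix cm with true phones along the rows and predicted phones along the columns (matching the axes in Figure 3.9).

import numpy as np

# toy counts: rows = true phones, columns = predicted phones
cm = np.random.randint(0, 50, size=(5, 5)).astype(float)

# normalise each row (sum over predicted phones): roughly recall per true phone
recall_view = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

# normalise each column (sum over true phones): roughly precision per predicted phone
precision_view = cm / np.maximum(cm.sum(axis=0, keepdims=True), 1)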

Normalising along the axis for “predicted phones” was roughly equivalent to showing the recall for each of the true phone classes. For instance, if the true class was /p/, then a high recall would mean that out of the labels which were /p/, most were correctly predicted as /p/. In contrast, normalising along the axis for “true phones” was roughly equivalent to the precision. That is, if the phone /p/ was predicted, how often was this prediction correct? Both metrics were useful for evaluating the model’s performance.

Secondly, for a higher-level comparison, the target phone distribution was taken from the target transcription. This consisted of counting the number of occurrences of each of the phones in the transcript, and normalising to a distribution over [0, 1], or [0, 100] (as a percentage). It is worth bearing in mind that the transcription itself was not entirely accurate (see section 3.2.5), but it would certainly be more accurate than the frame-level alignments. To get the predicted phone distribution, any entries in the predicted phone outputs which were not repeated consecutively were removed. For example, in the sequence of phones [a, a, b, c, c], the “b” would be removed as it was likely a mistake: the frame window is only 25 ms, so an actual phone would be expected to span multiple consecutive predictions. Duplicate entries were then collapsed, leaving a total number of predicted phones similar to the total number in the transcript. These distributions could then be compared to one another to see how similar they were, both in terms of the absolute difference between the “truth” and the predictions, and the relative difference (as a percentage) between the two.

Thirdly and finally, it was possible to find the most common mistakes in the confusion matrix. From this, it could be calculated which phones tended to be mixed up most often, which is roughly equivalent to substitution errors.
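A sketch of the de-duplication step used for building the predicted phone distribution; the input is a hypothetical frame-level sequence of predicted phone labels.

from itertools import groupby
from collections import Counter

def collapse_predictions(frame_phones):
    """Drop phones that appear for only a single frame (likely mistakes, given the
    25 ms window), then collapse consecutive repeats into one entry each."""
    runs = [(phone, len(list(group))) for phone, group in groupby(frame_phones)]
    return [phone for phone, length in runs if length > 1]

predicted = collapse_predictions(["a", "a", "b", "c", "c", "c"])
# -> ["a", "c"]; counts can then be normalised into a distribution
distribution = Counter(predicted)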

Chapter 4

Experiments

4.1 Experiment 1: Shallow models

4.1.1 Research questions

There was one simple question here: what parameter settings were good for building a multilingual triphone model? This was needed to act as a baseline and provide alignments for later experiments.

4.1.2 Setup

Kaldi’s TIMIT scripts were adapted for the project, since both tasks involve phone recognition. The first step was to build a monophone model (section 2.2.2). Next was a simple triphone model using delta and delta-delta features (section 2.2.3), followed by a triphone model incorporating LDA and MLLT. The final triphone model used LDA, MLLT and SAT (section 2.2.4). As the complexity increased, the previous model’s alignments were used as a starting point for the next one.

For monophone modelling, the main hyperparameter was the number of Gaussians to use for the GMMs. For triphone modelling, the two key hyperparameters were similarly the number of Gaussians, and also the number of leaves (when building decision trees for clustering triphones). In the triphones with LDA + MLLT, the three features before and after each frame were spliced together, and the dimensionality reduced from 69 × 7 (i.e. 483) down to 40 by LDA, as recommended in the Kaldi scripts for TIMIT. All other hyperparameters were left at their default settings.

In Kaldi, the recommended settings for TIMIT were 1000 Gaussians for monophone models, and 15000 Gaussians with 2500 leaves for each of the triphone ones. To test whether these were reasonable, a simple search around these parameters was performed. The number of leaves was varied over [2000, 2500, 3000]; the number of Gaussians over [10000, 15000, 20000]. The selected languages were German, Swahili and Ukrainian. These provided a wide range of phones (93 non-silence ones) and were sufficiently different from one another as to create a good multilingual model.

Decoding and scoring was a slow process, even for three languages, and could have caused a major bottleneck preventing progress with the deeper, RNN-based models. Thus, it was decided to perform the parameter search on these three languages only, rather than on all the languages that would be used later (German had not yet been dropped at this point). Once a good set of parameters had been found, however, the final triphone model was trained on all the languages in use, and evaluated on the validation and test sets separately. Doing so allowed for comparisons with the network models.

4.1.3 Results Results for the monophone model are shown in Figure 4.1. Results for the triphone models are summarised in Figure 4.2.

[Figure: PER (%) on the validation set plotted against the number of Gaussians (500, 1000, 2000) for the monophone model.]

Figure 4.1: PER for monophone models on validation set.

From these results, it was clear that the number of Gaussians was considerably more important than the number of leaves, which only made a relatively small difference. A more extensive search could have been done to find at what point the number of Gaussians or leaves became too many. However, this would have been time-intensive, and not hugely relevant to the project. The findings thus far indicated that these parameter settings would be sufficient for building basic shallow systems.

The final pipeline for generating initial alignments is shown in Table 4.1. The number of leaves was not increased for most of the triphone models, because it was not observed to have a significant impact on the PER.

A multilingual triphone model was trained with these settings on the languages that would be used for the network. In addition, monolingual triphone models were built for each language with the same settings. These were used when testing universal phone prediction (section 4.3), since it was necessary to have frame-to-phone alignments that were as accurate as possible (due to the evaluation issues in section 3.6).

[Figure 4.2 shown here as a table of PER (%) values on the validation set]

Leaves   Basic triphone            Basic + LDA + MLLT        Basic + LDA + MLLT + SAT
         10000   15000   20000     10000   15000   20000     10000   15000   20000
2000     36.5    35.5    35.0      34.4    33.3    32.7      30.4    29.6    28.9
2500     36.4    35.2    34.7      34.3    33.0    32.5      30.1    29.1    28.5
3000     36.3    35.2    34.5      34.2    33.0    32.3      30.0    29.0    28.4

Figure 4.2: PER for triphone models on the validation set. “Basic” is the standard triphone model with delta and delta-delta features included. Each successive model was built on top of the best previous one. “Gaussians” is the number of Gaussians used for the GMMs; “Leaves” is the number of leaves in the decision tree for clustering triphones.

Model                         # Gaussians   # Leaves
Monophone                     2000          -
Triphone (basic)              20000         2500
Triphone (LDA + MLLT)         20000         2500
Triphone (LDA + MLLT + SAT)   20000         3000

Table 4.1: Pipeline of models and settings for generating initial frame-to-phone alignments

However, these monolingual models could not be used or tested in the baseline network, due to their context-dependent phone IDs being different from the multilingual ones.

The results for each of the languages in these models are summarised in Table 4.2. Note that the average PER for the multilingual model was greater than in Figure 4.2; this was likely because more languages were being used, and there were slightly more non-silence phones (117 vs. 93).

4.2 Experiment 2: Baseline network

4.2.1 Research questions

There was one main question addressed in this experiment: what kind of baseline network structure would be appropriate for predicting phones (without attempting to be a universal model)? Results from this network could then be compared to the previous triphone model (section 4.1), and the subsequent network designed to predict universal phones (section 4.3).

                  Multilingual PER (%)      Monolingual PER (%)
Language          Validation   Test         Validation   Test
Bulgarian         31.47        31.90        20.87        20.67
Croatian          41.06        41.59        24.82        26.90
Hausa             28.50        29.44        12.67        16.46
Polish            35.74        33.77        24.58        22.17
Swahili           31.29        27.81        20.35        16.88
Swedish           49.98        47.00        37.29        33.54
Turkish           41.87        41.50        24.01        25.32
Ukrainian         41.97        43.27        21.23        22.52
Average           37.74        37.04        23.23        23.06

Table 4.2: Summary of PERs for best multilingual and monolingual triphone models on validation and test sets.

4.2.2 Setup

A diagram of the network structure was shown earlier in Figure 3.2, and its exact parameters are detailed in Table 4.3. The RNN layers are marked as “repeated” in that table since they were the focus of this experiment; their number was varied from 3 to 5. Batch normalisation and a dropout rate of 0.2 were applied to each of the RNN layers (section 2.3.5).

Due to complications with setting up the network (most of which were detailed in section 3.4), there was not sufficient time to run many experiments with the baseline network. The focus of the project, after all, was on a network for articulatory features, so it was decided to prioritise experimental time for that one (section 4.3). The optimisation parameters were left at their default PyTorch-Kaldi configuration values (Table 4.4), as again there was not enough time to perform an extensive search over possible settings. In addition, if they were deemed reasonable for TIMIT [Ravanelli et al., 2019], they were unlikely to be a major stumbling point.

Layer            Dimensionality   Activation function
Input            69               -
RNN (repeated)   550              ReLU
Linear-1         # phones         Softmax
Linear-2         # pdfs           Softmax

Table 4.3: Baseline network structure. NB: both linear layers use the output of the last RNN layer as input

4.2.3 Results

The validation loss during training is shown in Figure 4.3. It can be seen that the losses are fairly comparable across the layer counts, though three layers was the worst performing. The reasons for the sudden spikes are unclear; these would happen occasionally, but afterwards the training would appear to return to a normal path, as seen in the line for the 3-RNN-layer network.

Parameter        Symbol   Value
Learning rate    η        0.00032
Moving average   γ        0.9
Stabilisation    ε        1 × 10^-8

Table 4.4: Parameter settings for optimisation (Equation 2.1 and Equation 2.2)
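For reference, these settings correspond to something like the following optimiser construction in PyTorch, assuming γ maps onto RMSprop's alpha (the coefficient of the squared-gradient moving average); the model here is only a stand-in.

import torch
import torch.nn as nn

model = nn.Linear(69, 550)                 # stand-in for the real network
optimiser = torch.optim.RMSprop(
    model.parameters(),
    lr=0.00032,    # learning rate (eta)
    alpha=0.9,     # moving-average coefficient (gamma)
    eps=1e-8,      # stabilisation term (epsilon)
)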

[Figure: validation loss (roughly 2.6 to 4.0) over 20 epochs for networks with 3, 4 and 5 RNN layers.]

Figure 4.3: Validation loss during training for the baseline network with different numbers of RNN layers. Some of the early stopping seen was due to gradient problems.

The PER for each language in the validation dataset is shown in Figure 4.4. Increasing the number of layers resulted in a slight increase in the PER for Croatian and Turkish, but a slight decrease in the PER for Polish and Swahili. However, there was a notable effect on two of the languages: using 5 layers increased the PER for Bulgarian by about 10%, and decreased the PER for Ukrainian by roughly the same amount. Why exactly there would be such a dramatic difference is quite unclear. As seen before in Figure 3.1, these two languages were the most heavily overlapping in phones, so one would have expected both to perform reasonably well with a lot of relevant data available.

Some analysis of where the errors occurred is also useful. Figure 4.5 shows the most frequent substitution mistakes made for each language. There were a couple of key aspects observed.

Firstly, vowels were by far the most frequent mistakes made, with the top three mistakes for each language including at least two vowels.

[Figure: best PER (%) on the validation set per language (BG, CR, HA, PL, SA, SW, TU, UA, and the average) for the 3-, 4- and 5-layer networks and the triphone model.]

Figure 4.4: Best PER on the validation set for each language with different numbers of RNN layers in the baseline network. The “Triphone” bars show the multilingual triphone model for comparison; “Avg” is the average PER across all languages.

[Figure: per-language bar charts of the five most common substitution errors, as a percentage of all errors; most of the confused phones are vowels such as /a/, /e/, /i/, /o/ and /u/.]

Figure 4.5: Top 5 most common substitution errors in the validation set for each language on the 4-layer RNN network

In some ways, this could be expected. There is a reason why most languages have a relatively small number of vowels: for humans, distinguishing between similar-sounding vowels requires more effort. It is easier to use a relatively small number of vowels (where they each sound very different), so some vowel phones which are technically distinct from one another may become part of the same phoneme.

The reason why all of this is relevant is that only one pronunciation of each word is used in the transcripts. This does not allow for differences between speakers due to accent, country of origin, or gender. Another issue is that, since each word is treated separately, this does not allow for changes in pronunciation due to how some words blend together, which often changes the vowel phone. For instance, in English, the “e” in “the” can sound like “uh” or “ee” depending on whether the next word starts with a vowel or not. Both are distinct phones, but part of the same phoneme in this case.

Secondly, with regard to consonants there were some expected mistakes. For instance, /k’/ (an ejective) was predicted as /k/. This is not particularly surprising given that the difference between the two is not relevant in most languages (Hausa was the only one of the eight to have it). Most of the other languages do have /k/, but even if specific usages of it happened to be ejective, they would not be transcribed as such (and so the ejective would appear to be far more rare than it actually is). Another typical mistake was /m/ vs. /n/, but these are sometimes tricky even for English speakers to tell apart.

Further graphs, displaying insertion and deletion errors as well for all the different numbers of layers, are attached in the appendix.

It was decided that 4 RNN layers would be a reasonable number to use, since:

• The validation loss for the 3-layer network was noticeably higher than for both the 4- and 5-layer ones
• The 5-layer network was slower to train and did not appear to offer a substantial improvement over the 4-layer one.

The test set PER was then compared between the current best multilingual triphone model and the 4-layer RNN. The results are shown in Figure 4.6. While the PER was improved for Turkish and Croatian, it was not for the other languages. However, this may be because RNNs are a relatively basic and limited form of temporal deep learning. As discussed earlier (section 3.4.5), using LSTM layers seemed promising, but there were other issues associated with that approach.

4.3 Experiment 3: Attribute network

4.3.1 Research questions

There were two main questions in this experiment. Firstly, would the proposed network structure and scoring method be able to successfully predict phones? Secondly, would FBANK features outperform MFCCs with this network structure?

4.3.2 Setup

Having decided to use 4 RNN layers in section 4.2, the majority of the structure was kept the same as the baseline network, with a few tweaks (Table 4.5). A diagram was shown previously in Figure 3.3. The optimisation parameters were also the same as before (Table 4.4). Frame-to-feature alignments were obtained from the multilingual triphone model described in section 4.1.

[Figure: best PER (%) on the test set per language (BG, CR, HA, PL, SA, SW, TU, UA, and the average) for the network and triphone models.]

Figure 4.6: Test PERs compared between the 4-layer RNN and the multilingual triphone model.

Layer      Dimensionality   Activation function
Input      69               -
RNN (×4)   550              ReLU
Linear     51               Sigmoid

Table 4.5: Network layer dimensionalities and activations
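A sketch of a network with the shape in Table 4.5 is given below, written directly in PyTorch rather than through PyTorch-Kaldi's configuration files, and omitting batch normalisation and dropout for brevity.

import torch
import torch.nn as nn

class AttributeRNN(nn.Module):
    """Four stacked RNN layers (550 units, ReLU) followed by a 51-D sigmoid layer."""

    def __init__(self, n_inputs=69, hidden=550, n_attributes=51, n_layers=4):
        super().__init__()
        self.rnn = nn.RNN(n_inputs, hidden, num_layers=n_layers,
                          nonlinearity="relu", batch_first=True)
        self.out = nn.Linear(hidden, n_attributes)

    def forward(self, features):
        # features: (batch, frames, 69) acoustic feature vectors
        hidden_states, _ = self.rnn(features)
        return torch.sigmoid(self.out(hidden_states))   # per-frame attribute probabilities

net = AttributeRNN()
probs = net(torch.randn(2, 100, 69))   # shape (2, 100, 51)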

One issue that arose when using FBANK features was a mismatch between the number of feature vectors and the number of target labels. Even though the FBANK features were extracted using the same frame window and time step, there were consistently two more labels than feature vectors for each utterance. Why exactly this was the case was unclear. A solution which addressed the “symptoms” of the problem (rather than whatever the underlying cause happened to be) was to remove the last two labels for each utterance. These labels were consistently observed to be silence anyway, so it did not seem at first that too much information would be lost, if any.

While this fix enabled training of the articulatory network with FBANK features, it oddly did not do so for the baseline network (which was attempted for comparison). Instead, in the baseline case, even with all other settings left at the same values, there was a zero division error when initialising the network model. Why this occurred could not be determined, since everything else appeared to be exactly the same.
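A sketch of the workaround, assuming hypothetical per-utterance arrays of features and labels:

import numpy as np

def trim_labels(features, labels):
    """Drop the trailing labels (observed to be silence) so that the number of
    labels matches the number of FBANK feature vectors for the utterance."""
    extra = len(labels) - len(features)
    if extra > 0:
        labels = labels[:-extra]    # in practice this removed the last two labels
    return features, labels

feats = np.zeros((498, 69))         # hypothetical FBANK frames
labs = np.zeros(500, dtype=int)     # two extra labels
feats, labs = trim_labels(feats, labs)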

4.3.3 Results

The loss during training is shown in Figure 4.7. It can be seen that the loss plateaus from the 5th to the 20th epoch, before suddenly spiking at the end. As with the spike seen in the baseline network loss (Figure 4.3), the exact reason for this spike is unclear. It could be overfitting, though this seems unlikely given that the training loss also increases substantially at the same time. Perhaps instead there was an error in adjusting the weights during backpropagation, due to saturation of the sigmoid function.

[Figure: training and validation loss over 20 epochs; one panel shows the full range of the loss and the other zooms in on the flat portion of the curve.]

Figure 4.7: Loss during training for articulatory feature network. The first graph shows the entire range of the loss; the second is zoomed in on the “flat” portion of the first graph.

The results in terms of the frame-level phone accuracies, when either MFCC or FBANK features were used, are shown in Table 4.6. They were assessed against the best monolingual triphone models, since the PERs for those targets were lower (Table 4.2) and so would provide a more reliable phone assessment. Since “silence” is a fairly common “phone” (and it is not particularly useful to predict it in the context of this project), the accuracy without silence labels was also included.

It was clear from Table 4.6 that the attempted solution for fixing the issues with training on FBANK features had not worked, as the accuracy was simply atrocious for most languages. In particular, for Ukrainian, the performance was little better than random chance. However, the results with MFCC features were encouraging. While the accuracies were not incredibly high, they indicated that prediction using attributes was working to an extent. Considering that there were 117 phones in the training set, the fact that the network could predict the labels correctly 57% of the time for the best languages using only phonetic attributes is striking.

An example pair of confusion matrices for Croatian is shown in Figure 4.8 and Figure 4.9. NB: to make the charts easier to read with the large number of phones, the labels were kept in the X-SAMPA format for the most part.

                  MFCC                      FBANK
Language          All      Non-silence      All      Non-silence
Bulgarian         0.571    0.537            0.081    0.081
Croatian          0.572    0.572            0.120    0.120
Hausa             0.425    0.345            0.126    0.125
Polish            0.479    0.475            0.072    0.072
Swahili           0.469    0.371            0.135    0.123
Swedish           0.359    0.359            0.040    0.040
Turkish           0.529    0.529            0.111    0.111
Ukrainian         0.434    0.304            0.028    0.028

Table 4.6: Frame-level phone prediction accuracies on the test set for the articulatory feature network using MFCCs or FBANK features as input. “All” refers to all phones, including silence.

Accordingly, when referring to a specific phone, the IPA version will also be provided, e.g. dZ (/dʒ/). Confusion matrices for the rest of the languages are attached in the appendix; they are excluded here due to space constraints. From these figures, a couple of conclusions can be drawn.

[Figure: precision-normalised confusion matrix for Croatian, with true phones on the vertical axis and predicted phones (drawn from the full universal phone set) on the horizontal axis.]

Figure 4.8: Confusion matrix for Croatian. Normalised along y-axis to show precision

Firstly, the recall rate was generally quite good for most phones. However, there were a few, such as L (/ʎ/), which tended to be missed almost completely in favour of other phones sharing similar attributes, e.g. /l/. Secondly, the precision of the predictions was quite scattered. In some ways this was to be expected, since there were so many possible phones that could be predicted; recall should fare better in comparison, since it only requires the predictions to be correct within a smaller group of phones.

[Figure: recall-normalised confusion matrix for Croatian, with true phones on the vertical axis and predicted phones on the horizontal axis.]

Figure 4.9: Confusion matrix for Croatian. Normalised along x-axis to show recall

One interesting pattern is present in Figure 4.8: when t_j (/tʲ/) was predicted, the true label was often just /t/. Similarly, z (/z/) and z_j (/zʲ/) usually corresponded to just /z/, and the same pattern can be seen for various phone “extensions”. Again, this was encouraging, as making that kind of mistake would not be massively detrimental when trying to obtain a phone set for a language. Certainly, if two phones were predicted (one with an “extension” and the other without), it would give cause to investigate whether there really was a significant distinction.

Similarly to the experiment in section 4.2, it was possible to see where the errors occurred most frequently. The main difference is that in this case the errors were at the frame level, rather than the word or phone level, so the two experiments are not strictly comparable. There are two main points to note. Firstly, a lot of the errors involved vowels being confused. Unfortunately, this kind of mistake is quite likely to happen between vowels which only differ in one attribute. Secondly, a large number of the errors were cases where a phone should have been silence but was predicted to be non-silence. This suggests that the way silence was split from vowels and consonants in section 3.5.2 was possibly too harsh, and led to not enough silence being predicted. Or, it could be that filtering out 95% of the silence during training (section 3.4.2) was also too much, and slightly more should have been kept. It may even be a combination of the two.

One final area of analysis was the distribution of phones. An example of this is shown for Turkish; full results are again available in the appendix. The phone distribution comparison is shown in Figure 4.11, and the differences between the two in Figure 4.12. Note that only predicted phones which were within the Turkish true phone set were included; any which did not match were left out. From examining the absolute differences, the distributions are in general quite close, save for about four phones.

[Figure: per-language bar charts of the five most frequent frame-classification errors (raw counts); silence being predicted as a non-silence phone features heavily.]

Figure 4.10: Top 5 mistakes in frame classification for each language. x → y means that x (the “true” phone) was predicted as y

Notably, M (/ɯ/) was rarely predicted despite being relatively common in Turkish. This may be because it is unique to Turkish among the training languages, so other vowels common to the languages tend to dominate in attribute prediction.

Looking at the relative differences, there is also clearly an issue with predicting phones such as J\ (/ɟ/) which are infrequent in Turkish. Again, one cause is that these phones occur so rarely that their attributes are not predicted frequently enough. A more specific reason may be that the “plosive” attribute (part of J\) was quite rare, so the network was unable to learn it. How to go about predicting rare attributes will be discussed in the section on future work (section 5.1).

In conclusion, the prediction methods applied here showed promise, though there are numerous improvements and fixes which could be made.

4.4 Experiment 4: Cross-lingual investigations

4.4.1 Research questions

The last experiment aimed to explore two related questions. Firstly, how well could the model predict phones for a language it had not been trained on? This was, after all, the primary motivation behind this project. Secondly, how would the model perform on unseen languages within a particular language family, when it was trained exclusively on that same language family? Or, would it be better to have a wider range of languages during training?

[Figure: bar chart comparing the percentage of occurrences of each Turkish phone in the transcripts (“true”) and in the predictions.]

Figure 4.11: Comparison of phone distributions between the Turkish transcripts and the predicted phones

[Figure: per-phone absolute differences and relative differences (%) between the true and predicted Turkish phone distributions.]

Figure 4.12: Absolute and relative differences between true and predicted phone distributions. NB: the graph with relative differences is cut off at -100; this is because if the original value is very small, then the percentage difference is enormous (and makes it very difficult to see the rest).

4.4.2 Setup

There were two main datasets used. The first included all eight languages. The second included only the four Slavic languages; for this one, a separate multilingual triphone model was trained as a baseline. In each case, one language was left out during training and the final model was tested on the held-out language. Due to time constraints, it was not possible to repeat this for every language. Table 4.7 summarises the two datasets and the languages for which this process was completed.

Since it was believed at the time of this experiment that there was still a way to get decoding with Kaldi to work for universal phones, the best multilingual triphone model had its data split into sub-groups where one language was removed (and another where only one language remained). This was because the phones.txt file would need to be the same during both training and testing, to allow the language to be decoded in Kaldi.

Dataset   Dataset languages                                  Excluded languages
Full      Bulgarian, Croatian, Hausa, Polish, Swahili,       Bulgarian, Hausa, Swahili,
          Swedish, Turkish, Ukrainian                        Turkish, Ukrainian
Slavic    Bulgarian, Croatian, Polish, Ukrainian             Bulgarian, Ukrainian

Table 4.7: Summary of the two datasets used. The excluded languages are the ones which were individually held out during training.

Once the Kaldi decoding issues were finally realised, there was not sufficient time to retrain a triphone model for each possible set of languages. Instead, the networks were trained in the same way as before (section 4.3), minus one language in the existing multilingual data. However, final testing was done using the monolingual triphone models to provide the target labels, since that was still relatively quick to accomplish.

One last point worth mentioning is that the models trained when the Slavic languages were held out of the Slavic dataset had a lower learning rate of 0.0016 (vs. 0.0032). This was due to an earlier attempt to briefly explore changing the learning rate, which was unfortunately left in. However, the validation loss decreased to about 0.072, as it had for other models in the past, so it seems relatively unlikely that this had a great impact on the results.

4.4.3 Results

                  Included                 Excluded                 Decrease
Language          All      Non-silence     All      Non-silence     All      Non-silence
Bulgarian         0.571    0.537           0.562    0.478           0.009    0.059
Hausa             0.425    0.345           0.377    0.319           0.048    0.026
Swahili           0.469    0.371           0.316    0.274           0.153    0.097
Turkish           0.529    0.529           0.413    0.412           0.116    0.117
Ukrainian         0.434    0.304           0.332    0.272           0.102    0.032

Table 4.8: Frame-level phone prediction accuracies on the test set, for languages which were either included during training or excluded. “Decrease” is the decrease in accuracy when a language was excluded. “All” refers to all phones, including silence.

As could be expected, the accuracy for a language when it was removed from the training process (Table 4.8) was not as good as before. However, this decrease was noticeably lower for some languages. In particular, Bulgarian seemed to fare remarkably well. This is quite possibly because it shares so many phones with Ukrainian (Figure 3.1), so provided Ukrainian is in the training mix it can manage adequately.

For other languages, like Turkish, there was a substantial decrease, but again the overall accuracy was not terrible. Achieving 41% accuracy at the frame level, for an unfamiliar language and phone set, is not bad given the relatively simple network structure.

The results for Bulgarian and Ukrainian can be compared between the two datasets as well: the one with all eight languages and the one with only Slavic languages. The phone prediction accuracy on test frames is shown in Table 4.9.

                  All (inc)        All (exc)        Slavic (inc)     Slavic (exc)
Language          A        NS      A        NS      A        NS      A        NS
Bulgarian         0.571    0.537   0.562    0.478   0.593    0.527   0.481    0.461
Ukrainian         0.434    0.304   0.332    0.272   0.472    0.323   0.307    0.268

Table 4.9: Frame-level phone prediction accuracies on the test set. “All” refers to the dataset with all eight languages, “Slavic” to the dataset with just the four Slavic languages. “(inc)” = language was included; “(exc)” = language was excluded; “A” = all phones; “NS” = non-silence phones

It appears that the accuracy was mostly improved for the two languages when they were trained within the same family. Again, this could be expected, since there are fewer phones to confuse, and most will be fairly similar. An interesting point, though, is that the accuracy was always worse in the Slavic dataset than in the full one when the languages are treated as unseen. It would seem that either more training data, or simply a wider variety of languages and phonetic attributes within the data, helps to make the model more robust and adaptable to different languages.

Another area to investigate was the most frequently mistaken phones. Table 4.10 shows the top three most common incorrect predictions. Notably, “silence” features quite prominently, reinforcing the view from Figure 4.10 in section 4.3 that “silence” was being removed too harshly during prediction.

The non-silence phone mistakes were also quite similar across the datasets. This suggests that the reasons for predicting these phones incorrectly were perhaps more due to the similarity between the phones. For example, in the test languages, /A/ and /a/ are identical in attributes except that one is “front” and the other is “back”.

Language          All (inc)    All (exc)    Slavic (inc)   Slavic (exc)
Bulgarian         sil → t      o → u        sil → t        sil → t
                  sil → d      sil → d      o → u          o → u
                  E → i        E → i        sil → d        a → E
Ukrainian         A → a        A → a        A → a          sil → p
                  0 → o        sil → d      sil → t        A → a
                  sil → d      sil → n      I → i          sil → t

Table 4.10: Top 3 most frequent incorrect phone predictions. x → y means that x is the “true” phone and y is the predicted phone. “All” refers to the dataset with all eight languages, “Slavic” to the dataset with just the four Slavic languages. “(inc)” = language was included; “(exc)” = language was excluded.

A final way to assess the performance on unseen languages is to compare phone distributions, as before. Figure 4.13 and Figure 4.14 show the results for Bulgarian on test data, when it was excluded from training, for both datasets. The results when it was included in training are not shown, to prevent the graph from becoming overly crowded.

[Figure: bar chart comparing the percentage of occurrences of each Bulgarian phone in the transcripts (“true”) with the predictions from the full and Slavic datasets.]

Figure 4.13: Phone distribution comparison for Bulgarian test data. Both datasets were trained without Bulgarian.

From Figure 4.14, it is clear that both distributions are quite similar. It is interesting to see which particular phones each dataset is better, or worse, at predicting. The Slavic one struggles with /s/ and /z/ considerably more than the full dataset, whereas the full one is comparatively worse at predicting E (/ɛ/) and /i/. This was somewhat surprising: one would intuitively think that a Slavic model would be better at predicting /s/ and /z/ given their widespread use in Slavic languages, and, on the other hand, that a model trained on more languages would be better at predicting basic vowels. Why exactly this occurred is unclear, but then again it is worth taking these results with a pinch of salt, given the assumptions made in, for instance, choosing to generate distributions from transcripts or frame labels in the first place.

[Figure: per-phone differences between the true Bulgarian phone distribution and the distributions predicted by the full and Slavic datasets.]

Figure 4.14: Difference between the “true” phone distribution in test data and predicted distributions. Both datasets were trained without Bulgarian.

Chapter 5

Conclusions

5.1 Future work

In this section, various ways to continue what was started here are discussed, from data processing to the network structure itself.

5.1.1 Fixing GlobalPhone

The GP corpus was indeed very useful, but there are certain aspects which could be improved if the work were taken further. Firstly, the dictionaries for some languages, like Swedish, need to be updated to account for the transcribed stutters. This would have to be done tediously by hand, unless it were possible to do it automatically; however, it is suspected that a manual method is preferable. It was observed when adding entries for stutters that there were in some cases two plausible possibilities, e.g. two slightly different vowels. In these cases, careful comparison and some human intuition were used to make the decision.

Secondly, the problem with the Japanese dictionary needs to be resolved. It is unclear if this is only an issue with the version of GP available at the university, or if a mistake was made when compiling the corpus. Nonetheless, this particular issue was beyond the scope of this project.

Thirdly, while also a tedious task, it would undoubtedly be beneficial to go through the transcripts that contain words with multiple pronunciations, and listen to the audio. The transcript could then be labelled with the correct pronunciation. Thus, the training data would be of even better quality, and it would make training with languages that have many different pronunciations of the same words (like French) considerably easier.

5.1.2 Phonetic attribute improvements

In hindsight, it would probably have been beneficial to include a silence attribute, as Frankel et al. [2007] did in their work with articulatory features. Hopefully this would assist in making better predictions of whether or not a phone was silence than the considerably more limited method of using the values of the vowel/consonant attributes.

Regarding diphthongs, these remain problematic to predict accurately. Splitting them into two phones may be the only plausible option. If this is done, one potential solution would be to include an extra “diphthong” attribute for an additional version of any vowel which is part of a diphthong, similarly to the other “extension” attributes. Then it would be possible to split the diphthongs, and during training the model would hopefully learn to discriminate between cases where the vowel occurs alone, and where it is part of a diphthong. A potential problem that could then arise is that the pronunciation of a diphthong vowel is not exactly the same depending on the order of the vowels, e.g. /ai/ (“aye”) vs. /ia/ (“ee-ya”). The acoustic context will of course affect how a diphthong sounds as well. Nonetheless, these issues may not actually be so serious in reality.

Lastly, there was also the question of how to deal with rare attributes. One option could be to try a version of “hard negative” mining. That is, each time a particularly rare attribute is predicted incorrectly, the corresponding feature vector (or its utterance) is set aside, and once a suitable collection of these has been built up, they could be used for a special round (or rounds) of training. This would artificially increase the frequency with which rare attributes were seen by the network, and would hopefully help them to be predicted more reliably as a result.

5.1.3 Replacing PyTorch-Kaldi

One practical improvement would be to move away from PyTorch-Kaldi and look to create a network structure more or less from scratch using libraries like PyKaldi [Can et al., 2018] (Python wrappers for Kaldi) and PyKaldi2 [Lu et al., 2019] (more support for PyTorch). PyTorch-Kaldi is undeniably a powerful tool for word-level recognition systems. On the other hand, it required a great deal of tweaking and debugging to do things such as changing the labels from phone numbers to phone attributes. This was often very difficult to do, as there is a lot of code to wade through, and small changes in one part sometimes had unexpected effects elsewhere.

Furthermore, the way the data flow is set up made it very difficult to get the corresponding utterance for a list of labels. Once a chunk had been loaded, only the input features and the target labels were visible to the network during training. This was fine in normal circumstances; however, it meant that when one received a list of labels during the running of the network, it was not possible to tell which parts corresponded to which utterance (without a substantial overhaul of the codebase). This was one reason why the analysis for the attribute networks took place at a language level: it was simply not possible to reliably split the outputs into individual utterances. Splitting based on the frame labels would not work either, as these were not completely accurate (being based on the triphone models).

If starting from scratch, it would be preferable to use, for example, the PyTorch DataLoader class. With some adjustments, it would allow data to be read in chunks while associating the data with, for instance, transcripts and utterance lists. This would also make tasks like converting phone numbers to features considerably easier.
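A rough sketch of the kind of Dataset that would make this easier is given below, returning the utterance identifier alongside the features and labels; the chunk format and utterance names are entirely hypothetical.

import torch
from torch.utils.data import Dataset, DataLoader

class UtteranceChunk(Dataset):
    """One chunk of data, keeping track of which utterance each item came from."""

    def __init__(self, utterances):
        # utterances: list of (utt_id, features (frames x 69), labels (frames,))
        self.utterances = utterances

    def __len__(self):
        return len(self.utterances)

    def __getitem__(self, i):
        utt_id, feats, labels = self.utterances[i]
        return utt_id, torch.as_tensor(feats), torch.as_tensor(labels)

# toy chunk with two utterances
chunk = UtteranceChunk([
    ("BG001_utt1", torch.randn(120, 69), torch.zeros(120, dtype=torch.long)),
    ("BG001_utt2", torch.randn(80, 69), torch.ones(80, dtype=torch.long)),
])
loader = DataLoader(chunk, batch_size=1)
for utt_id, feats, labels in loader:
    pass   # the utterance identity stays attached to its features and labels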

5.1.4 Network structure improvements

While the issues and potential solutions discussed here are somewhat related to section 5.1.3, they are more focused on the network structure itself.

In retrospect, it would probably have been better to implement CTC loss (see section 2.4.3). The benefits are quite clear: there would be no need to train shallow triphone models first (to obtain frame alignments); instead, the transcripts could be used directly. While the transcripts may not have been entirely accurate, this accuracy could have been improved using the methods in section 5.1.1. Transcription accuracy would at any rate almost certainly be better than the typical frame-level PERs of around 32%.

PyTorch-Kaldi does not provide for the use of CTC loss. Trying to add this feature would have required refactoring a large amount of the existing code. In particular, the entire way that data is loaded would need to be changed, since it assumes frame-level alignments (which CTC does not use). From personal experience, it is safe to say that this would be extremely difficult, and future work would almost certainly be better off creating such a network more or less from the ground up.

An additional improvement to the structure would be to use different layer types, e.g. LSTM layers. Again, these were not included due to issues with the PyTorch-Kaldi implementation (discussed in section 3.4.5), but a new network could probably resolve these.

Finally, it could be worth exploring the inclusion of an additional layer that maps from attributes to phones directly. This would remove the need to split or score attributes and, more importantly, could allow the more useful attributes to be weighted heavily (thus enabling better predictions). The main issue that could arise with such a layer, though, is that even if it were given the full set of universal phones to predict from, it would probably only learn to predict the phones in the training set. Consequently, the attribute weights would be biased towards the languages it had learned from.

5.2 Results summary

In conclusion, this project has explored the possibility of building a universal phonetic model, with mixed success. Despite various complications and setbacks, it was possible to build a system which attempted truly universal prediction.

The shallow triphone models performed surprisingly well, and in fact were better for most languages than the baseline network. The baseline results themselves suggested that 4 recurrent layers was a reasonable number to work with, though this could change if, for instance, an LSTM layer were used instead.

The results for the attribute network, while modest in accuracy, do indicate that the methods explored here are a viable direction of travel. In particular, some of the results for completely unseen languages were definitely promising. Furthermore, many of the improvements outlined in section 5.1 simply require additional time to implement, and are not especially complicated.

Once improved, this could be a powerful assistive tool in preserving the zero-resource languages of the world, and all the cultural significance that goes along with them. Perhaps one day in the not-too-distant future, it will be possible to build a real-life, fully-working universal phonetic model, and we can all be brought a little closer together as a result.

Bibliography

T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul. A compact model for speaker-adaptive training. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96, volume 2, pages 1137–1140 vol.2, 1996.

T. Anastasakos, J. McDonough, and J. Makhoul. Speaker adaptive training: a maximum likelihood approach to speaker normalization. In 1997 IEEE ICASSP, pages 1043–1046, 1997.

Peter K. Austin and Julia Sallabank. The Cambridge Handbook of Endangered Languages. Cambridge University Press, 2011.

Pallavi Baljekar, Sunayana Sitaram, Prasanna Kumar Muthukumar, and Alan W. Black. Using articulatory features and inferred phonological segments in zero resource speech processing. In INTERSPEECH, 2015.

Justin Simon Bayer. Learning Sequence Representations. Dissertation, Technische Universität München, 2015.

Encyclopaedia Britannica. Human vocal organs and points of articulation, 2020. URL https://www.britannica.com/science/phonetics#/media/1/457255/3597.

Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, and Shrikanth S. Narayanan. PyKaldi: A Python Wrapper for Kaldi. In 2018 IEEE ICASSP. IEEE, 2018.

Dr. Peter Coxhead. Natural Language Processing & Applications: Phones and Phonemes, 2008. URL https://www.cs.bham.ac.uk/~pxc/nlp/NLPA-Phon1.pdf.

D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue: Languages of the world, twenty-second edition, 2020. URL https://www.ethnologue.com.

Joe Frankel, Mirjam Wester, and Simon King. Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language, 21(4):620–640, 2007. ISSN 0885-2308. doi: https://doi.org/10.1016/j.csl.2007.03.002. URL http://www.sciencedirect.com/science/article/pii/S0885230807000204.


M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3):272–281, 1999.

J. Garofolo, Lori Lamel, W. Fisher, Jonathan Fiscus, D. Pallett, N. Dahlgren, and V. Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 11 1992.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. PMLR, 11–13 Apr 2011.

S. Gokcen and J. M. Gokcen. A multilingual phoneme and model set: toward a universal base for automatic speech recognition. In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 599–605, 12 1997.

R. A. Gopinath. Maximum likelihood modeling with Gaussian distributions for classification. In Proceedings of ICASSP 1998, volume 2, pages 661–664 vol.2, 1998.

Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur. End-to-end speech recognition using lattice-free MMI. pages 12–16, 09 2018. doi: 10.21437/Interspeech.2018-1423.

R. Haeb-Umbach and H. Ney. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In [Proceedings] ICASSP-92, volume 1, pages 13–16 vol.1, 1992.

Geoff Hinton. Neural networks for machine learning - lecture 6a - overview of mini-batch gradient descent, 2012. URL https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

Nathalie Japkowicz. The Class Imbalance Problem: Significance and Strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI), pages 111–117, 2000.

Daniel Kiecza. Datengetriebene Bestimmung von Vokabulareinheiten für koreanische Spracherkennung auf großen Wortschätzen. Master's thesis, Universität Karlsruhe, 10 1999. NB: Could only be found within the GP corpus files.

S. King, J. Frankel, and M. Wester. Articulatory feature recognition using dynamic Bayesian networks. In Proc. ICSLP, September 2004.

Peter Ladefoged. Another view of endangered languages. Language, 68(4):809–811, 1992.

Partha Lal. GlobalPhone issues wiki page, 2010. URL https://wiki.inf.ed.ac.uk/CSTR/GlobalPhone.

Partha Lal. Cross-lingual automatic speech recognition using tandem features. PhD thesis, University of Edinburgh, 2011.

Xinjian Li, Siddharth Dalmia, David R. Mortensen, Florian Metze, and Alan W. Black. Zero-shot learning for speech recognition with universal phonetic model, 2018.

Liang Lu, Xiong Xiao, Zhuo Chen, and Yifan Gong. PyKaldi2: Yet another speech toolkit based on Kaldi and PyTorch. CoRR, abs/1907.05955, 2019. URL http://arxiv.org/abs/1907.05955.

Dau-Cheng Lyu, Marco Siniscalchi, Tae-Yoon Kim, and Chin-Hui Lee. Continuous phone recognition without target language training data. pages 2687–2690, 01 2008.

Spyros Matsoukas, Rich Schwartz, Hubert Jin, and Long Nguyen. Practical implementations of speaker-adaptive training. In DARPA Speech Recognition Workshop, 1997.

Josef Michalek and Jan Vanek. A survey of recent DNN architectures on the TIMIT phone recognition task, 2018.

A. Mohamed, G. Hinton, and G. Penn. Understanding how deep belief networks perform acoustic modelling. In 2012 IEEE ICASSP, pages 4273–4276, 2012.

Paul Moore. Low Resource Language Identification from Speech using X-vectors, 2019.

Salikoko S. Mufwene. Language birth and death. Annual Review of Anthropology, 33(1):201–222, 2004. doi: 10.1146/annurev.anthro.33.070203.143852.

Markus Müller, Jörg Franke, Sebastian Stüker, and Alex Waibel. Improving phoneme set discovery for documenting unwritten languages. In Jürgen Trouvain, Ingmar Steiner, and Bernd Möbius, editors, Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2017, pages 202–209. TUDpress, Dresden, 2017.

Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, page III–1310–III–1318. JMLR.org, 2013.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

J. W. Picone. Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9):1215–1247, 1993.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

M. Ravanelli, T. Parcollet, and Y. Bengio. The PyTorch-Kaldi Speech Recognition Toolkit. In Proc. of ICASSP, 2019.

Steve Renals. ASR lecture: Words and pronunciation models, slide 9, 2012. URL https://www.inf.ed.ac.uk/teaching/courses/asr/2011-12/asr-lexlm-nup4.pdf.

Steve Renals and Hiroshi Shimodaira. ASR lecture: Context dependent phone models, 2019. URL http://www.inf.ed.ac.uk/teaching/courses/asr/2018-19/asr04-cdhmm-handout.pdf.

Matthew Richardson, Jeff Bilmes, and Chris Diorio. Hidden-articulator Markov models for speech recognition. Speech Communication, 41(2):511–529, 2003. ISSN 0167-6393. doi: https://doi.org/10.1016/S0167-6393(03)00031-1. URL http://www.sciencedirect.com/science/article/pii/S0167639303000311.

Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586–591, 1993.

Suzanne Romaine. Preserving endangered languages. Language and Linguistics Compass, 1(1-2):115–132, 2007. doi: 10.1111/j.1749-818X.2007.00004.x.

Tanja Schultz. GlobalPhone: A Multilingual Speech and Text Database Developed at Karlsruhe University. Technical report, Interactive Systems Laboratories, Karlsruhe University, Carnegie Mellon University, 2002.

Tanja Schultz and Alex Waibel. Multilingual and crosslingual speech recognition. In Proc. DARPA Workshop on Broadcast News Transcription and Understanding, pages 259–262, 1998.

S. M. Siniscalchi, T. Svendsen, and Chin-Hui Lee. Toward a detector-based universal phone recognizer. In IEEE ICASSP, pages 4261–4264, 03 2008. doi: 10.1109/ICASSP.2008.4518596.

Caroline Smith. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Phonology, 17:291–295, 08 1999. doi: 10.1017/S0952675700003894.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

C.P.J. van Bael and Simon J. King. The keyword lexicon - an accent-independent lexicon for automatic speech recognition. 2003.

J. Kendrick Wells. Computer-coding the IPA: a proposed extension of SAMPA. 1995.

Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6):582–589, Nov 2001. ISSN 1860-4749. doi: 10.1007/BF02943243.

Appendix A

Universal Phone Set

A.1 Base Phones

Table A.1: Universal phones (base).

IPA X-SAMPA v/c Other attributes
a a vowel open front unrounded
b b consonant voiced labial
c c consonant voiceless dorsal
d d consonant voiced alveolar plosive coronal
ã d` consonant voiced retroflex plosive coronal
e e vowel close-mid front unrounded
f f consonant voiceless labiodental labial
p”f pf consonant voiceless labiodental fricative bilabial plosive labial
g g consonant dorsal
h h consonant voiceless glottal fricative
H h\ consonant
i i vowel close front unrounded
j j consonant voiced palatal dorsal
ê J\_< consonant voiced palatal approximant dorsal implosive
J j\ consonant voiced dorsal
k k consonant dorsal
l l consonant voiced alveolar lateral approximant coronal
í l` consonant voiced retroflex lateral approximant coronal
Õ l\ consonant alveolar lateral flap coronal
m m consonant bilabial nasal labial
n n consonant voiced alveolar nasal coronal
ï n` consonant retroflex nasal coronal
o o vowel close-mid back rounded
p p consonant voiceless bilabial plosive labial
F p\ consonant voiceless bilabial fricative labial
q q consonant voiceless uvular plosive


r r consonant alveolar trill coronal
ó r` consonant retroflex flap coronal
ô r\ consonant alveolar approximant coronal
õ r\` consonant retroflex approximant coronal
r r_0 consonant
r˚Z rZ consonant voiced alveolar fricative trill
s s consonant voiceless alveolar fricative coronal
ù s` consonant voiceless retroflex fricative coronal
C s\ consonant voiceless alveolo-palatal fricative coronal
t t consonant voiceless alveolar plosive coronal
ú t` consonant voiceless retroflex plosive coronal
úù t`s` consonant voiceless retroflex plosive fricative coronal
u u vowel close back rounded
v v consonant voiced labiodental fricative labial
V v\ consonant voiced labiodental approximant
w w consonant labial-velar approximant labial
x x consonant voiceless velar fricative dorsal
Ê x\ consonant voiceless palatal-velar fricative dorsal
y y vowel close front rounded
z z consonant voiced alveolar fricative coronal
ü z` consonant voiced retroflex fricative coronal
ý z\ consonant voiced alveolo-palatal fricative
A A vowel open back unrounded
B B consonant voiced bilabial fricative labial
à B\ consonant bilabial trill labial
ç C consonant voiceless palatal fricative
D D consonant voiced coronal
E E vowel open-mid front unrounded
M F consonant labiodental nasal labial
G G consonant dorsal
å G\ consonant voiced uvular plosive dorsal
4 H consonant labial-palatal approximant labial
Ë H\ consonant voiceless epiglottal fricative
I I vowel near-close front unrounded
fl1 I\ vowel near-close central unrounded
ñ J consonant palatal nasal
é J\ consonant voiced palatal plosive
ì K consonant voiceless alveolar lateral fricative coronal
Ð K\ consonant voiced alveolar lateral fricative coronal
L L consonant palatal lateral approximant
Ï L\ consonant velar lateral approximant dorsal
W M vowel close back unrounded
î M\ consonant velar approximant dorsal


N N consonant velar nasal dorsal
ð N\ consonant uvular nasal dorsal
O O vowel open-mid back rounded
ò O\ consonant labial
V P consonant labiodental approximant labial
6 Q vowel open back rounded
K R consonant
ö R\ consonant uvular trill
S S consonant voiceless postalveolar fricative coronal
„S S_a consonant voiceless postalveolar fricative coronal apical
T T consonant voiceless dental fricative coronal
U U vowel near-close back rounded
0fi U\ vowel near-close central rounded
2 V vowel open-mid back unrounded
û W consonant voiceless labial-velar fricative labial
X X consonant voiceless uvular fricative
è X\ consonant voiceless pharyngeal fricative
Y Y vowel near-close front rounded
Z Z consonant voiced postalveolar fricative coronal
@ @ vowel close-mid open-mid central rounded unrounded
æ { vowel near-open front unrounded
0 } vowel close central rounded
1 1 vowel close central unrounded
ø 2 vowel close-mid front rounded
3 3 vowel open-mid central unrounded
Æ 3\ vowel open-mid central rounded
R 4 consonant alveolar flap coronal
ë 5 consonant velar alveolar lateral approximant coronal dorsal
5 6 vowel near-open central
7 7 vowel close-mid back unrounded
8 8 vowel close-mid central rounded
œ 9 vowel open-mid front rounded
Œ & vowel open front rounded
P ? consonant
Q ?\ consonant voiced pharyngeal fricative
| |\ consonant dental click coronal
{ |\|\ consonant alveolar coronal
} =\ consonant
ts ts consonant voiceless alveolar coronal
dz dz consonant voiced alveolar affricate coronal
dý dz\ consonant voiced alveolo-palatal affricate coronal
dZ dZ consonant voiced postalveolar affricate coronal


tS tS consonant voiceless postalveolar affricate coronal
tC t_s\ consonant voiceless alveolo-palatal affricate coronal
C s\ consonant voiceless palatal fricative dorsal
tì tK consonant voiceless alveolar lateral affricate coronal
dý“ d_z\ consonant voiced alveolo-palatal affricate

A.2 Extensions

IPA X-SAMPA Attribute
x: : long
x’ _> ejective
˜x _~ nasal
xh _h aspirated
xw _w labial
xj _j palatal
x _t breathy-voiced
x¨ _^ non-syllabic
x”“ _d dental coronal

Table A.2: Universal phone extensions
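As an illustration of how these extensions might combine with the base phones of Table A.1, the sketch below strips extension suffixes from an X-SAMPA symbol and unions the corresponding attributes. This is a hypothetical example under assumed names; the dictionaries are abbreviated excerpts, not the project's actual conversion tables.

    # Illustrative sketch only: abbreviated excerpts of the phone/extension tables.
    BASE_ATTRIBUTES = {
        "t":  {"consonant", "voiceless", "alveolar", "plosive", "coronal"},
        "d`": {"consonant", "voiced", "retroflex", "plosive", "coronal"},
        "a":  {"vowel", "open", "front", "unrounded"},
    }

    EXTENSION_ATTRIBUTES = {
        ":":  {"long"},
        "_>": {"ejective"},
        "_~": {"nasal"},
        "_h": {"aspirated"},
        "_w": {"labial"},
        "_j": {"palatal"},
        "_d": {"dental", "coronal"},
    }


    def attributes_for(xsampa):
        """Strip known extension suffixes, then look up the remaining base phone."""
        attrs = set()
        symbol = xsampa
        stripped = True
        while stripped:
            stripped = False
            for ext, extra in EXTENSION_ATTRIBUTES.items():
                if len(symbol) > len(ext) and symbol.endswith(ext):
                    attrs |= extra
                    symbol = symbol[: -len(ext)]
                    stripped = True
        return attrs | BASE_ATTRIBUTES.get(symbol, set())


    # attributes_for("t_h") -> {"consonant", "voiceless", "alveolar", "plosive",
    #                           "coronal", "aspirated"}
    # attributes_for("a:")  -> {"vowel", "open", "front", "unrounded", "long"}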

A.3 Phone maps

Table A.3: Bulgarian phone map

GP phone IPA X-SAMPA
SIL sil sil
a a a
b b b
bj bj b_j
d d d
dj dj d_j
dZ dZ dZ
dz dz dz
e E E
f f f
fj fj f_j
g g g
gj gj g_j
i i i
j j j
ja ya ya
ju yu yu
k k k
kj kj k_j
l l l
lj lj l_j
m m m
mj mj m_j
n n n
nj nj n_j
o o o
p p p
pj pj p_j


r r r
rj rj r_j
s s s
S S S
sj sj s_j
t t t
tj tj t_j
tS tS tS
ts ts ts
u u u
v v v
vj vj v_j
x x x
Y 7 7
z z z
Z Z Z
zj zj z_j
- - -

Table A.4: Croatian phone map

GP phone IPA X-SAMPA
SIL sil sil
M_+hGH unk unk
M_+QK unk unk
M_a a a
M_b b b
M_cp cC t_s\
M_d d d
M_dp dý d_z\
M_dZ dZ dZ
M_e e e
M_f f f
M_g g g
M_i i i
M_j j j
M_k k k
M_l l l
M_L L L
M_m m m
M_n n n
M_nj ñ J
M_o o o
M_p p p
M_r r r
M_s s s
M_sj S S
M_t t t
M_ts ts ts
M_tS tS tS
M_u u u
M_v V v\
M_x x x
M_z z z
M_zj Z Z
- - -

Table A.5: Hausa phone map

GP phone IPA X-SAMPA
SIL sil sil
H_a a a
H_aI ai ai
H_aU au au
H_b b b
H_B á b_<
H_c c c
H_d d d
H_D â d_<
H_DZ dZ dZ
H_e e e
H_F F p\
H_g g g
H_h h h
H_i i i
H_j j j
H_k k k
H_K k’ k_>


H_KR kh k_h
H_l l l
H_m m m
H_n n n
H_o o o
H_p p p
H_Q Pj ?_j
H_r r r
H_R ó r`
H_s s s
H_S S S
H_t t t
H_TS ts ts
H_u u u
H_w w w
H_z z z

Table A.6: Polish phone map

GP phone IPA X-SAMPA
SIL sil sil
M_a a a
M_b b b
M_c ts ts
M_d d d
M_dZ dZ dZ
M_dz dz dz
M_dzj dzj dz_j
M_e E E
M_eo5 ˜E E_~
M_f f f
M_g g g
M_h x x
M_i i i
M_i2 1 1
M_j j j
M_k k k
M_l l l
M_m m m
M_n n n
M_n~ ñ J
M_o O O
M_oc5 ˜O O_~
M_p p p
M_p p p
M_r r r
M_s s s
M_S S S
M_sj Sj S_j
M_t t t
M_tS tS tS
M_tsj tsj ts_j
M_u u u
M_v v v
M_w w w
M_z z z
M_Z rZ rZ
M_zj zj z_j

Table A.7: Swahili phone map

GP phone IPA X-SAMPA
SIL sil sil
SWA_a a a
SWA_b á b_<
SWA_ch tS tS
SWA_d â d_<
SWA_dh tD D
SWA_e E E
SWA_f f f
SWA_g ä g_<
SWA_gh G G
SWA_h h h
SWA_i i i
SWA_j ê J\_<
SWA_k k k
SWA_kh x x
SWA_l l l


SWA_m m m
SWA_mb mb b_~
SWA_mv Mv v_~
SWA_n n n
SWA_nd nd d_~
SWA_ng Ng g_~
SWA_ng~ N N
SWA_nj ñ J
SWA_ny nê J\_<_~
SWA_nz nz z_~
SWA_o O O
SWA_p p p
SWA_r R 4
SWA_s s s
SWA_sh S S
SWA_t t t
SWA_th θ T
SWA_u u u
SWA_v v v
SWA_w w w
SWA_y j j
SWA_z z z

NB: the other language phone maps are not included here, but they are available in the source code for this project, under conf > phone_maps.

Appendix B

Dataset splits

B.1 Speaker lists

Each table below contains the speaker IDs used in the split for each language.

Table B.1: Training speaker lists.

Language Speakers
Bulgarian 018 020 021 023 025 026 027 032 035 039 041 045 048 049 050 056 060 064 065 066 067 069 070 071 072 073 075 077 078 079 080 082 083 085 087 088 089 091 092 093 094 096 097 098 099 101 102 103 104 105 107 111 112 113 114
Croatian 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 022 023 024 025 026 027 028 029 030 031 032 034 055 056 060 061 062 063 064 065 066 067 068 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 089 090 092 093 094
Hausa 001 002 003 005 020 021 022 028 029 031 032 038 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103
Polish 002 003 005 006 007 008 010 011 013 014 015 016 017 018 019 020 021 022 024 026 028 031 032 034 035 042 051 052 054 055 056 057 058 059 060 061 062 064 065 066 067 068 069 070 071 073 074 075 076 078 079 080 081 082 083 085 086 087 088 089 091 092 093 094 095 096 098 099 100
Swahili 001 002 006 007 008 010 012 013 014 016 025 026 027 034 035 039 040 042 043 044 045 047 050 051 052 053 054 056 057 062 063 066 067 068 071 072 078 079 080 082 087 088 092 093 094 095 096 097 099 100 101 102


Swedish 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 033 036 037 038 039 051 052 053 054 055 056 057 058 059 063 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 100
Turkish 001 002 004 005 007 009 012 013 014 016 020 021 023 025 026 027 030 031 032 034 035 038 039 040 041 043 044 045 046 047 048 049 052 057 058 062 064 065 066 067 068 069 070 072 075 076 077 078 080 081 082 083 084 085 086 087 088 089 091 092 093 094 095 096 097 098 099 100
Ukrainian 011 012 019 021 022 023 026 030 043 046 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 084 085 086 087 089 090 091 092 093 094 095 096 097 098 099 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119

Table B.2: Validation speaker lists.

Language Speakers
Bulgarian 051 052 053 054 055 058 084 090 100 106
Croatian 033 034 035 036 046 048 051 053 054 057 058 059 069
Hausa 004 006 007 008 009 010 011 012 013 014 015 016 017 018 019 023 024
Polish 012 030 036 037 038 039 040 041 046 063 072 077 084 090 097
Swahili 004 005 015 017 020 021 023 081 084
Swedish 030 031 032 034 035 045 046 047 048 049 050 066 067 068 069
Turkish 008 024 029 033 053 054 055 056 059 060 061 063 063 071 073 074 079 090
Ukrainian 027 028 029 031 032 033 034 035 036 037 038 039 040 041 042 044 045 047 083 088

Table B.3: Test speaker lists.

Language Speakers
Bulgarian 040 042 043 046 047 059 062 063 068 095 109 110
Croatian 037 038 039 040 041 042 043 044 045 047 049 050 052 088
Hausa 025 026 027 030 033 034 035 036 037 039 040 041 042 043 044 083
Polish 001 004 009 023 025 029 033 043 044 045 047 048 049 050 053
Swahili 029 032 033 036 038 048 055 061 073
Swedish 040 041 042 043 044 060 061 062 064 070 071 072 074 075 076
Turkish 003 006 010 011 015 017 018 019 022 028 036 037 042 050 051
Ukrainian 001 002 003 004 005 006 007 008 009 010 013 014 015 016 017 018 020 024 025

B.2 Dataset statistics

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 14.8 55 32.3±13.8 45:55
Croatian 7.0 66 34.6±15.0** 38:62
Hausa 6.1 70 30.1±11.5 22:78
Polish 17.3 69 * *
Swahili 7.8 52 24.8±7.6 49:51
Swedish 15.1 68 32.4±14.1 44:56
Turkish 10.5 68 27.7±10.7 21:79
Ukrainian 9.8 80 35.4±13.4 35:65
Total 88.4 528 - -
Mean (language) 11.1±3.9 66±8.3 - -

Table B.4: Summary of training GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken. ** indicates that the information was missing for one speaker.

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 2.9 10 39.7±15.0 30:70
Croatian 1.6 13 25.1±8.3 46:54
Hausa 1.3 17 25.2±7.4 25:75
Polish 3.7 15 * *
Swahili 1.7 9 25.3±9.6 29:71
Swedish 3.3 15 31.7±11.6 60:40
Turkish 2.2 18 27.9±10.9 44:56
Ukrainian 2.1 20 27.2±8.7 50:50
Total 18.8 117 - -
Mean (language) 2.35±0.65 14.6±3.6 - -

Table B.5: Summary of validation GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken.

Language Length (hrs) No. speakers Age (mean) Gender ratio (m:f)
Bulgarian 2.9 12 28.2±12.4 33:67
Croatian 1.5 14 29.2±10.9 50:50
Hausa 1.3 16 22.9±5.7 81:19
Polish 3.6 15 * *
Swahili 1.7 9 24.1±5.2 50:50
Swedish 3.3 15 32.3±13.6 73:27
Turkish 2.3 15 26.7±10.9 40:60
Ukrainian 2.2 19 35.1±16.1 47:53
Total 18.8 117 - -
Mean (language) 2.35±0.79 14.4±2.7 - -

Table B.6: Summary of testing GlobalPhone data as used in experiments. For mean data, the ± value is the population standard deviation. Gender ratios are rounded to whole numbers. * indicates that many speakers were missing this information, so no reliable statistics could be taken.

Appendix C

Phone errors in baseline network

[Plot omitted: "Top 5 deletion errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.1: Top 5 most common deletion errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 insertion errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.2: Top 5 most common insertion errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 substitution errors with 3 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.3: Top 5 most common substitution errors in the validation set for each language when using the 3-layer RNN network

[Plot omitted: "Top 5 deletion errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.4: Top 5 most common deletion errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 insertion errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.5: Top 5 most common insertion errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 substitution errors with 4 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.6: Top 5 most common substitution errors in the validation set for each language when using the 4-layer RNN network

[Plot omitted: "Top 5 deletion errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.7: Top 5 most common deletion errors in the validation set for each language when using the 5-layer RNN network

[Plot omitted: "Top 5 insertion errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.8: Top 5 most common insertion errors in the validation set for each language when using the 5-layer RNN network

[Plot omitted: "Top 5 substitution errors with 5 RNN layers", one bar chart per language; y-axis: % of all errors, x-axis: phone.]

Figure C.9: Top 5 most common substitution errors in the validation set for each language when using the 5-layer RNN network

Appendix D

Confusion matrices for attribute networks
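The figures that follow are normalised in two complementary ways, as noted in each caption. As a minimal illustrative sketch (the array name and toy values below are assumed, not taken from the project code), both views can be obtained from a raw count matrix with true phones on the rows and predicted phones on the columns:

    import numpy as np

    # Toy counts: counts[i, j] = frames whose true phone is i, predicted phone is j.
    counts = np.array([[50.0, 3.0, 2.0],
                       [4.0, 40.0, 6.0],
                       [1.0, 5.0, 30.0]])

    # Normalised along the y-axis: each predicted-phone column sums to 1, so the
    # diagonal entry for a phone is its precision (the odd-numbered figures).
    precision_view = counts / counts.sum(axis=0, keepdims=True)

    # Normalised along the x-axis: each true-phone row sums to 1, so the
    # diagonal entry for a phone is its recall (the even-numbered figures).
    recall_view = counts / counts.sum(axis=1, keepdims=True)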


Figure D.1: Confusion matrix for test Bulgarian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.2: Confusion matrix for test Bulgarian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.3: Confusion matrix for test Croatian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.4: Confusion matrix for test Croatian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.5: Confusion matrix for test Hausa data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.6: Confusion matrix for test Hausa data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.7: Confusion matrix for test Polish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.8: Confusion matrix for test Polish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.9: Confusion matrix for test Swahili data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.10: Confusion matrix for test Swahili data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.11: Confusion matrix for test Swedish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.12: Confusion matrix for test Swedish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.13: Confusion matrix for test Turkish data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.14: Confusion matrix for test Turkish data between frame labels and predicted network outputs. Normalised along x-axis to show recall.


Figure D.15: Confusion matrix for test Ukrainian data between frame labels and predicted network outputs. Normalised along y-axis to show precision.


Figure D.16: Confusion matrix for test Ukrainian data between frame labels and predicted network outputs. Normalised along x-axis to show recall.

Appendix E

Phone distributions for attribute network


Figure E.1: Comparison of phone distributions between the Bulgarian transcripts and the predicted phones from test data


Figure E.2: Comparison of phone distributions between the Croatian transcripts and the predicted phones from test data


Figure E.3: Comparison of phone distributions between the Hausa transcripts and the predicted phones from test data


Figure E.4: Comparison of phone distributions between the Polish transcripts and the predicted phones from test data


Figure E.5: Comparison of phone distributions between the Swahili transcripts and the predicted phones from test data


Figure E.6: Comparison of phone distributions between the Swedish transcripts and the predicted phones from test data


Figure E.7: Comparison of phone distributions between the Turkish transcripts and the predicted phones from test data


Figure E.8: Comparison of phone distributions between the Ukrainian transcripts and the predicted phones from test data