<<

A Deep Generative Model of Formant Typology

Ryan Cotterell and Jason Eisner

Department of Computer Science Johns Hopkins University, Baltimore MD, 21218 ryan.cotterell,eisner @jhu.edu { }

Abstract and its goal should be the construction of a univer- sal prior over potential . A probabilistic What makes some types of languages more approach does not rule out linguistic systems com- probable than others? For instance, we know pletely (as long as one’s theoretical formalism can that almost all spoken languages contain the vowel /i/; why should that be? The describe them at all), but it can position phenomena field of linguistic typology seeks to answer on a scale from very common to very improbable. these questions and, thereby, divine the mech- Probabilistic modeling also provides a discipline anisms that underlie human . In our for drawing conclusions from sparse data. While work, we tackle the problem of vowel system we know of over 7000 human languages, we have typology, i.e., we propose a generative proba- some sort of linguistic analysis for only 2300 of bility model of which a language con- them (Comrie et al., 2013), and the dataset used in tains. In contrast to previous work, we work di- rectly with the acoustic information—the first this paper (Becker-Kristal, 2010) provides simple two formant values—rather than modeling dis- vowel data for fewer than 250 languages. crete sets of phonemic symbols (IPA). We de- Formants are the resonant frequencies of the hu- velop a novel generative probability model and man vocal tract during the production of speech report results based on a corpus of 233 lan- sounds. We propose a Bayesian generative model guages. of vowel inventories, where each language’s inven- tory is a finite subset of acoustic vowels represented 1 Introduction 2 as points (F1,F2) R . We deploy tools from the ∈ Human languages are far from arbitrary; cross- neural-network and point-process literatures and linguistically, they exhibit surprising similarity in experiment on a dataset with 233 distinct languages. many respects and many properties appear to be We show that our most complicated model outper- universally true. The field of linguistic typology forms simpler models. seeks to investigate, describe and quantify the axes along which languages vary. One facet of language 2 Acoustic and Formants that has been the subject of heavy investigation is Much of human communication takes place the nature of vowel inventories, i.e., which vowels through speech: one conversant emits a sound wave a language contains. It is a cross-linguistic univer- to be comprehended by a second. In this work, we sal that all spoken languages have vowels (Gordon, consider the nature of the portions of such sound 2016), and the underlying principles guiding vowel waves that correspond to vowels. We briefly review selection are understood: vowels must be both the relevant bits of acoustic phonetics so as to give easily recognizable and well-dispersed (Schwartz an overview of the data we are actually modeling et al., 2005). In this work, we offer a more formal and develop our notation. treatment of the subject, deriving a generative prob- ability model of vowel inventory typology. Our The anatomy of a sound wave. The sound wave work builds on (Cotterell and Eisner, 2017) by in- that carries spoken language is a function from vestigating not just discrete IPA inventories but the time to amplitude, describing sound pressure vari- cross-linguistic variation in acoustic formants. ation in the air. To distinguish vowels, it is help- The philosophy behind our approach is that lin- ful to transform this function into a spectrogram guistic typology should be treated probabilistically (Fig.1) by using a short-time Fourier transform

37 Proceedings of NAACL-HLT 2018, pages 37–46 New Orleans, Louisiana, June 1 - 6, 2018. c 2018 Association for Computational Linguistics

5000 Hz principles are and . /i/ /u/ /ɑ/ focalization dispersion 4000 Hz Focalization. The notion of focalization grew 3000 Hz out of quantal vowel theory (Stevens, 1989). Quan-

2000 Hz tal vowels are those that are phonetically “better” than others. They tend to display certain proper- 1000 Hz ties, e.g., the formants tend to be closer together 0 Hz (Stevens, 1987). Cross-linguistically, quantal vow- els are the most frequently attested vowels, e.g., the Figure 1: Example spectrogram of the three English vowels: /i/, /u/ and /A/. The x-axis is time and y-axis is frequency. The cross-linguistically common vowel /i/ is considered first two formants F1 and F2 are marked in with arrows for quantal, but less common /y/ is not. each vowel. The figure was made with Praat (Boersma et al., 2002). Dispersion. The second core principle of vowel system organization is known as dispersion. As the name would imply, the principle states that (Deng and O’Shaughnessy, 2003, Chapter 1) to de- the vowels in “good” vowel systems tend to be compose each short interval of the wave function spread out. The motivation for such a principle into a weighted sum of sinusoidal waves of differ- is clear—a well-dispersed set of vowels reduces a ent frequencies (measured in Hz). At each interval, listener’s potential confusion over which vowel is the variable darkness of the spectrogram indicates being pronounced. See Schwartz et al.(1997) for a the weights of the different frequencies. In pho- review of dispersion in vowel system typology and netic analysis, a common quantity to consider is its interaction with focalization, which has led to a formant—a local maximum of the (smoothed) the joint dispersion-focalization theory. frequency spectrum. The fundamental frequency F0 determines the pitch of the sound. The formants Notation. We will denote the universal set of F1 and F2 determine the quality of the vowel. international phonetic alphabet (IPA) symbols as . The observed vowel inventory for lan- V Two is all you need (and what we left out). In guage ` has size n` and is denoted V ` = terms of vowel recognition, it is widely speculated ` ` ` ` d (v1, v1),..., (v ` , v ` ) R , where for that humans rely almost exclusively on the first { n n } ⊆ V × each k [1, n`], v` is an IPA symbol assigned two formants of the sound wave (Ladefoged, 2001, ∈ k ∈ V by a linguist and v` Rd is a vector of d measur- Chapter 5). The two-formant assumption breaks k ∈ able phonetic quantities. In short, the IPA symbol down in edge cases: e.g., the third formant F 3 v` was assigned as a label for a phoneme with pro- helps to distinguish the roundness of the vowel k nunciation v` . The ordering of the elements within (Ladefoged, 2001, Chapter 5). Other non-formant k V ` is arbitrary. features may also play a role. For example, in tonal languages, the same vowel may be realized Goals. This framework recognizes that the same with different tones (which are signaled using F0): IPA symbol v (such as /u/) may represent a slightly Mandarin Chinese makes a distinction between mˇa different sound v in one language than in another, (horse) and ma´ (hemp) without modifying the qual- although they are transcribed identically. We are ity of the vowel /a/. Other features, such as creaky specifically interested in how the vowels in a lan- , can play a role in distinguishing . guage influence one another’s fine-grained pro- We do not explicitly model any of these aspects of nunciation in Rd. In general, there is no reason vowel space, limiting ourselves to (F1,F2) as in to suspect that speakers of two languages, whose previous work (Liljencrants and Lindblom, 1972). phonological systems contain the same IPA symbol, However, it would be easy to extend all the models should produce that vowel with identical formants. we will propose here to incorporate such informa- Data. For the remainder of the paper, we will tion, given appropriate datasets. 2 take d = 2 so that each v = (F1,F2) R , the ∈ 3 The Phonology of Vowel Systems vector consisting of the first two formant values, as compiled from the field literature by Becker- The vowel inventories of the world’s languages Kristal(2006). This dataset provides inventories display clear structure and appear to obey several V ` in the form above. Thus, we do not consider underlying principles. The most prevalent of these further variation of the vowel pronunciation that

38 may occur within the language (between speakers, features may also be informed by phonological and between tokens of the vowel, or between earlier phonetic processes in the language. We leave mod- and later intervals within a token). eling of this step to future work; so our current likelihood term ignores the evidence contributed 4 Phonemes versus Phones by the IPA labels in the dataset, considering only the pronunciations in Rd. Previous work (Cotterell and Eisner, 2017) has The overall idea is that human languages ` draw placed a distribution over discrete phonemes, ignor- their inventories from some universal prior, which ing the variation across languages in the pronuncia- we are attempting to reconstruct. A caveat is that tion of each phoneme. In this paper, we crack open we will train our method by maximum-likelihood, the phoneme abstraction, moving to a learned set which does not quantify our uncertainty about the of finer-grained phones. reconstructed parameters. An additional caveat is Cotterell and Eisner(2017) proposed (among that some languages in our dataset are related to other options) using a determinantal point process one another, which belies the idea that they were (DPP) over a universal inventory of 53 sym- V drawn independently. Ideally, one ought to capture bolic (IPA) vowels. A draw from such a DPP is these relationships using hierarchical or evolution- a language-specific inventory of vowel phonemes, ary modeling techniques. V . In this paper, we say that a language in- ⊆ V stead draws its inventory from a larger set ¯, again 5 Determinantal Point Processes V using a DPP. In both cases, the reason to use a Before delving into our generative model, we DPP is that it prefers relatively diverse inventories briefly review technical background used by Cot- whose individual elements are relatively quantal. terell and Eisner(2017). A DPP is a probability While we could in principle identify ¯ with Rd, V distribution over the subsets of a fixed ground set of for convenience we still take it to be a (large) dis- size N—in our case, the set of phones ¯. The DPP crete finite set ¯ = v¯ ,..., v¯ , whose elements V V { 1 N } is usually given as an L-ensemble (Borodin and we call phones. ¯ is a learned cross-linguistic pa- V Rains, 2005), meaning that it is parameterized by a rameter of our model; thus, its elements—the “uni- N N positive semi-definite matrix L R × . Given a versal phones”—may or may not correspond to ∈ discrete base set ¯ of phones, the probability of a phonetic categories traditionally used by linguists. V subset V¯ ¯ is given by We presume that language ` draws from the DPP ⊆ V ¯ ` ¯ ` ¯ a subset V , whose size we call n . For each p(V ) det (LV¯ ) , (1) ⊆ V ` ∝ universal phone v¯i that appears in this inventory V¯ , the language then draws an observable language- where LV¯ is the submatrix of L corresponding to ` 2 the rows and columns associated with the subset specific pronunciation vi µi, σ I from a ∼ N V¯ ¯. The entry L , where i = j, has the effect distribution associated cross-linguistically with the ⊆ V ij 6  of describing the similarity between the elements universal phone v¯i. We now have an inventory of v¯ and v¯ (both in ¯)—an ingredient needed to pronunciations. i j V As a final step in generating the vowel inventory, model dispersion. And, the entry Lii describes the ` quality—focalization—of the vowel v¯i, i.e., how we could model IPA labels. For each v¯i V¯ , a ∈ ` much the model wants to have v¯i in a sampled set field linguist presumably draws the IPA label vi ` d independent of the other members. conditioned on all the pronunciations vi R : ` { ∈ v¯i V¯ in the inventory (and perhaps also on ∈ } ` 5.1 Probability Kernel their underlying phones v¯i V¯ ). This labeling ∈ In this work, each phone v¯ ¯ is associated with process may be complex. While each pronuncia- i ∈ V a probability density over the space of possible pro- tion in Rd (or each underlying phone in ¯) may V nunciations 2. Our measure of phone similarity have a preference for certain IPA labels in , the R V will consider the “overlap” between the densities n` labels must be drawn jointly because the lin- associated with two phones. This works as follows: guist will take care not to use the same label for Given two densities f(x, y) and f (x, y) over 2, two phones, and also because the linguist may like 0 R we define the kernel (Jebara et al., 2004) as to describe the inventory using a small number of distinct IPA features, which will tend to favor fac- ρ ρ (f, f 0; ρ) = f(x, y) f 0(x, y) dx dy, (3) torial grids of symbols. The linguist’s use of IPA K Zx Zy 39 M ` p(v`,1,..., v`,n µ ,..., µ ,N) p(µ ,... µ N) p(N) (2) | 1 N 1 N | `Y=1 h i M n` `,k ¯ ` = p(v µa` ) p(V (a ) µ1,..., µN ,N) p(µ1,... µN N) p(N) | k | | `=1 " a` A(n`,N) k=1 ! # Y ∈X Y 4 3 2 1 | {z } | {z } | {z } | {z } Figure 2: Joint likelihood of M vowel systems under our deep generative probability model for continuous-space vowel inventories. Here language ` has an observed inventory of pronunciations v`,k : 1 k n` , and a` [1,N] denotes a { ≤ ≤ } k ∈ phone that might be responsible for the pronunciation v`,k. Thus, a` denotes some way to jointly label all n` pronunciations with distinct phones. We must sum over all N such labelings a` A(n`,N) since the true labeling is not observed. In other n` ∈ words, we sum over all ways a` of completing the data for language `. Within each summand, the product of factors 3 and 4 is  the probability of the completed data, i.e., the joint probability of generating the inventory V¯ (a`) of phones used in the labeling and their associated pronunciations. Factor 3 considers the prior probability of V¯ (a`) under the DPP, and factor 4 is a likelihood term that considers the probability of the associated pronunciations. with inverse temperature parameter ρ. Algorithm 1 Generative Process In our setting, f, f 0 will both be Gaussian dis- 1: N Poisson (λ)( N) 1 ∼ ∈ tributions with means µ and µ0 that share a fixed 2: for i = 1 to N : 2 2 spherical covariance matrix σ I. Then eq. (3) and 3: µi (0,I)( R ) 2 d ∼ N N N∈ indeed its generalization to any R has a closed- 4: define L R × via (6) ∈ form solution (Jebara et al., 2004, §3.1): 5: for ` = 1 to M : ` ` ` 6: V¯ DPP (L)( [1,N]); let n = V¯ 3 ∼ ⊆ | | (f,f 0; ρ) = (4) 7: for i V¯ ` : K ∈ d (1 2ρ)d 2 ` 2 4 2 −2 ρ µ µ0 8: v˜i µi, σ I (2ρ) 2 2πσ exp || − || . ` ∼ N ` − 4σ2 9: v = νθ v˜ 4   i i   Notice that making ρ small (i.e., high temperature)  has an effect on (4) similar to scaling the variance matrix L where 2 σ by the temperature, but it also results in chang- (fi, fj; ρ) if i = j ing the scale of , which affects the balance be- Lij = K 6 (6) K ( (fi, fj; ρ) + F (µi) if i = j tween dispersion and focalization in (6) below. K Since L is the sum of two positive definite ma- 5.2 Focalization Score trices (the first specializes a known kernel and the The probability kernel given in eq. (3) naturally second is diagonal and positive), it is also positive handles the linguistic notion of dispersion. What definite. As a result, it can be used to parameterize a DPP over ¯. Indeed, since L is positive definite about focalization? We say that a phone is focal to V the extent that it has a high score and not merely positive semidefinite, it will assign positive probability to any subset of ¯. V F (µ) = exp (U2 tanh(U1µ + b1) + b2) > 0 As previously noted, this DPP does not define (5) a distribution over an infinite set, e.g., the pow- erset of R2, as does recent work on continuous where µ is the mean of its density. To learn the DPPs (Affandi et al., 2013). Rather, it defines a parameters of this neural network from data is to distribution over the powerset of a set of densities learn which phones are focal. We use a neural net- with finite cardinality. Once we have sampled a work since the focal regions of R2 are distributed subset of densities, a real-valued quantity may be in a complex way. additionally sampled from each sampled density.

5.3 The L Matrix 6 A Deep Generative Model If f = (µ , σ2I) is the density associated with We are now in a position to expound our generative i N i the phone v¯ , we may populate an N N real model of continuous-space vowel typology. We i × 40 generate a set of formant pairs for M languages this vector through a feed-forward neural network in a four step process. Note that throughout this νθ with parameters θ. In short: exposition, language-specific quantities with be ` 2 superscripted with an integral language marker v˜i (µi, σ I) (9) ∼ N `, whereas universal quantities are left unsuper- ` ` vi = νθ(v˜i ), (10) scripted. The generative process is written in al- gorithmic form in Alg.1. Note that each step is where the second step is deterministic. We can numbered and color-coded for ease of comparison fuse these two steps into a single step p(v µ ), i | i with the full joint likelihood in Fig.2. whose closed-form density is given in eq. (12) be- low. In effect, step 4 takes a Gaussian phone as Step 1 : p(N). We sample the size N of the uni- input and produces the observed formant vector versal phone inventory ¯ from a Poisson distribu- V with an underlying formant vector in the middle. λ tion with a rate parameter , i.e., This completes our generative process. We do not observe all the steps, but only the final col- N Poisson (λ) . (7) ` ∼ lection of pronunciations vi for each language, where the subscripts i that indicate phone identity That is, we do not presuppose a certain number of have been lost. The probability of this incomplete phones in the model. dataset involves summing over possible phones for each pronunciation, and is presented in Fig.2. Step 2 : p(µ1,..., µN ). Next, we sample the means µ of the Gaussian phones. In the model i 6.1 A Neural Transformation of a Gaussian presented here, we assume that each phone is A crucial bit of our model is running a sample generated independently, so p(µ1,..., µN ) = N from a Gaussian through a neural network. Under i=1 p(µi). Also, we assume a standard Gaussian prior over the means, µ (0,I). certain restrictions, we can find a closed form for Q i ∼ N The sampled means define our N Gaussian the resulting density; we discuss these below. Let phones µ , σ2I : we are assuming for simplic- νθ be a depth-2 multi-layer perceptron N i ity that all phones share a single spherical covari-  ν (˜v ) = W tanh (W ˜v + b ) + b . ance matrix, defined by the hyperparameter σ2. θ i 2 1 i 1 2 (11) The dispersion and focalization of these phones In order to find a closed-form solution, we require define the matrix L according to equations (4)–(6), that (5) be a diffeomorphism, i.e., an invertible where ρ in (4) and the weights of the focalization 2 2 mapping from R R where both νθ and its neural net (5) are also hyperparameters. 1 → inverse νθ− are differentiable. This will be true as ` 2 2 Step 3 : p(V¯ µ ,..., µ ). Next, for each lan- long as W1,W2 R × are square matrices of full- | 1 N ∈ guage ` [1,...,M], we sample a diverse subset rank and we choose a smooth, invertible activation ∈ of the N phones, via a single draw from a DPP function, such as tanh. Under those conditions, we parameterized by matrix L: may apply the standard theorem for transforming a random variable (see Stark and Woods, 2011): ¯ ` V DPP(L), (8) 1 ∼ 1 p(vi µi) = p(νθ− (vi) µi) det Jν (v ) | | θ− i where V¯ ` [1,N]. Thus, i V¯ ` means that = p(v˜i µi) det Jν 1(v ) (12) ⊆ ∈ | θ− i language ` contains phone v¯i. Note that even the size of the inventory, n` = V¯ ` , was chosen by the where Jν 1(x) is the Jacobian of the inverse of the |` | θ− DPP. In general, we have n N. neural network at the point x. Recall that p(v˜ µ )  i | i ` is Gaussian-distributed. Step 4 : i V¯ ` p(vi µi) The final step in our ∈ | generative process is that the phones v¯i in language Q 7 Modeling Assumptions ` must generate the pronunciations v` R2 (for- i ∈ mant vectors) that are actually observed in lan- Imbued in our generative story are a number of guage `. Each vector takes two steps. For each assumptions about the linguistic processes behind ¯ ` 2 i V , we generate an underlying v˜i R from vowel inventories. We briefly draw connections ∈ ∈ the corresponding Gaussian phone. Then, we run between our theory and the linguistics literature.

41 Why underlying phones? A technical assump- tic changes in some direction might not prevent tion of our model is the existence of a universal two phones from acting as similar or two pronunci- set of underlying phones. Each phone is equipped ations from being attributed to the same phone. In with a probability distribution over reported acous- general, a unit circle of radius σ in latent space may tic measurements (pronunciations), to allow for a be mapped by νθ to an oddly shaped connected re- single phone to account for multiple slightly differ- gion in acoustic space, and a Gaussian in latent ent pronunciations in different languages (though space may be mapped to a multimodal distribution. never in the same language). This distribution can capture both actual interlingual variation and also 8 Inference and Learning random noise in the measurement process. While our universal phones may seem to re- We fit our model via MAP-EM (Dempster et al., semble the universal IPA symbols used in phono- 1977). The E-step involves deciding which phones logical transcription, they lack the rich featural each language has. To achieve this, we fashion a specifications of such phonemes. A phone in our Gibbs sampler (Geman and Geman, 1984), yielding model has no features other than its mean position, a Markov-Chain Monte Carlo E-step (Levine and which wholly determines its behavior. Our univer- Casella, 2001). sal phones are not a substantive linguistic hypothe- 8.1 Inference: MCMC E-Step sis, but are essentially just a way of partitioning R2 into finitely many small regions whose similarity Inference in our model is intractable even when the and focalization can be precomputed. This techni- phones µ1,..., µN are fixed. Given a language cal trick allows us to use a discrete rather than a with n vowels, we have to determine which subset continuous DPP over the R2 space.1 of the N phones best explains those vowels. As discussed above, the alignment a between the n Our phones are Gaus- Why a neural network? vowels and n of the N phones represents a latent sians of spherical variance σ2, presumed to be scat- variable. Marginalizing it out is #P-hard, as we tered with variance 1 about a two-dimensional la- can see that it is equivalent to summing over all tent vowel space. Distances in this latent space bipartite matchings in a weighted graph, which, in are used to compute the dissimilarity of phones turn, is as costly as computing the permanent of a for modeling dispersion, and also to describe the matrix (Valiant, 1979). Our sampler2 is an approxi- phone’s ability to vary across languages. That is, mation algorithm for the task. We are interested in two phones that are distant in the latent space can sampling a, the labeling of observed vowels with appear in the same inventory—presumably they universal phones. Note that this implicitly sam- are easy to discriminate in both perception and ples the language’s phone inventory V¯ (a), which articulation—and it is easy to choose which one is fully determined by a. better explains an acoustic measurement, thereby Specifically, we employ an MCMC method affecting the other measurements that may appear closely related to Gibbs sampling. At each step in the inventory. of the sampler, we update our vowel-phone align- We relate this latent space to measurable acous- ment a` as follows. Choose a language ` and a tic space by a learned diffeomorphism νθ (Cotterell ` ` 1 vowel index k [1, n ], and let i = ak (that is, and Eisner, 2017). ν− can be regarded as warping ∈ θ pronunciation v`,k is currently labeled with univer- the acoustic distances into perceptual/articulatory sal phone v¯ ). We will consider changing a` to j, distances. In some “high-resolution” regions of i k where j is drawn from the (N n`) phones that acoustic space, phones with fairly similar (F ,F ) − 1 2 do not appear in V¯ (a`), heuristically choosing j in values might yet be far apart in the latent space. `,k proportion to the likelihood p(v µj). We then Conversely, in other regions, relatively large acous- | ` stochastically decide whether to keep ak = i or set 1Indeed, we could have simply taken our universal phone ` ak = j in proportion to the resulting values of the set to be a huge set of tiny, regularly spaced overlapping Gaus- product 4 3 in eq. (2). sians that “covered” (say) the unit circle. As a computational · matter, we instead opted to use a smaller set of Gaussians, For a single E-step, the Gibbs sampler “warm- giving the learner the freedom to infer their positions and tune starts” with the labeling from the end of the pre- their variance σ2. Because of this freedom, this set should not be too large, or a MAP learner may overfit the training data vious iteration’s E-step. It sweeps S = 5 times with zero-variance Gaussians and be unable to explain the test languages—similar to overfitting a Gaussian mixture model. 2Taken from Volkovs and Zemel(2012, 3.1).

42 through all vowels for all languages, and returns S perceptual system is known to perform a non-linear sampled labelings, one from the end of each sweep. transformation on acoustic signals, starting with We are also interested in automatically choosing the non-linear cochlear transform that is physically 1 the number of phones N, for which we take the performed in the ear. While νθ− is intended as Poisson’s rate parameter λ = 100. To this end, loosely analogous, we determine its benefit by re- we employ reversible-jump MCMC (Green, 1995), moving eq. (10) from our generative story, i.e., we resampling N at the start of every E-step. take the observed formants vk to arise directly from the Gaussian phones. 8.2 Learning: M-Step Given the set of sampled alignments provided by Baseline #3: Supervised phones and alignments. A final baseline we consider is supervised phones. the E-step, our M-step consists of optimizing the Linguists standardly employ a finite set of phones— log-likelihood of the now-complete training data symbols from the international phonetic alphabet using the inferred latent variables. We achieved (IPA). In phonetic annotation, it is common to map this through SGD training of the diffeomorphism each sound in a language back to this universal dis- parameters θ, the means µ of the Gaussian phones, i crete alphabet. Under such an annotation scheme, it and the parameters of the focalization kernel . F is easy to discern, cross-linguistically, which vow- 9 Experiments els originate from the same phoneme: an /I/ in German may be roughly equated with an /I/ in En- 9.1 Data glish. However, it is not clear how consistent this Our data is taken from the Becker-Kristal corpus annotation truly is. There are several reasons to (Becker-Kristal, 2006), which is a compilation of expect high-variance in the cross-linguistic acous- various phonetic studies and forms the largest multi- tic signal. First, IPA symbols are primarily useful lingual phonetic database. Each entry in the corpus for interlinked phonological distinctions, i.e., one corresponds to a linguist’s phonetic description of applies the symbol /I/ to distinguish it from /i/ in a language’s vowel system: an inventory consist- the given language, rather than to associate it with ing of IPA symbols where each symbol is associ- the sound bearing the same symbol in a second ated with two or more formant values. The corpus language. Second, field linguists often resort to the contains data from 233 distinct languages. When closest common IPA symbol, rather than an exact multiple inventories were available for the same match: if a language makes no distinction between language (due to various studies in the literature), /i/ and /I/, it is more common to denote the sound we selected one at random and discarded the others. with a /i/. Thus, IPA may not be as universal as hoped. Our dataset contains 50 IPA symbols so this 9.2 Baselines baseline is only reported for N = 50. Baseline #1: Removing dispersion. The key technical innovation in our work lies in the incor- 9.3 Evaluation poration of a DPP into a generative model of vowel Evaluation in our setting is tricky. The scientific formants—a continuous-valued quantity. The role goal of our work is to place a bit of linguistic the- of the DPP was to model the linguistic principle ory on a firm probabilistic footing, rather than a of dispersion—we may cripple this portion of our downstream engineering-task, whose performance model, e.g., by forcing to be a diagonal kernel, we could measure. We consider three metrics. K i.e., Kij = 0 for i = j. In this case the DPP 6 Cross-Entropy. Our first evaluation metric is becomes a Bernoulli Point Process (BPP)—a spe- cross-entropy: the average negative log-probability cial case of the DPP. Since dispersion is widely of the vowel systems in held-out test data, given accepted to be an important principle governing the universal inventory of N phones that we trained naturally occurring vowel systems, we expect a through EM. We find this to be the cleanest method system trained without such knowledge to perform for scientific evaluation—it is the metric of opti- worse. mization and has a clear interpretation: how sur- Baseline #2: Removing the neural network νθ. prised was the model to see the vowel systems of Another question we may ask of our formulation is held-out, but attested, languages? whether we actually need a fancy neural mapping The cross-entropy is the negative log of the ν to model our typological data well. The human expression in eq. (2), with ` now rang- θ ··· 43 Q   N metric DPP+νθ BPP+νθ DPP νθ Sup. − 3000 x-ent 540.02 540.05 600.34   15 cloze1 5.76 5.76 6.53 2500 cloze12 4.89 4.89 5.24  x-ent 280.47 275.36 335.36  2000 25 cloze1 5.04 5.25 6.23  cloze12 4.76 4.97 5.43  1500 x-ent 222.85 231.70 320.05 1610.37 50 cloze1 3.38 3.16 4.02 4.96 1000 cloze12 2.73 2.93 3.04 6.95 500 x-ent 212.14 220.42 380.31  57 cloze1 2.21 3.08 3.25  0 cloze12 2.01 3.05 3.41  0 200 400 600 800 1000 1200 x-ent 271.95 301.45 380.02  100 cloze1 2.26 2.42 3.03  Figure 3: A graph of v = (F1,F2) in the union of all the cloze12 1.96 2.01 2.51  training languages’ inventories, color-coded by inferred phone (N = 50). Table 1: Cross-entropy in nats per language (lower is better) and expected Euclidean-distance error of the cloze prediction (lower is better). The overall best value for each task is bold- vowel, given (861, 1420), (571, 1079) , and the { } faced. The case N = 50 is compared against our supervised fact that n` = 3. A “cloze12” evaluation would baseline. The N = 57 row is the case where we allowed N to fluctuate during inference using reversible-jump MCMC; aim to predict two missing vowels. this was the N value selected at the final EM iteration. 9.4 Experimental Details ing over held-out languages.3 Wallach et al.(2009) Here, we report experimental details and the hy- give several methods for estimating the intractable perparameters that we use to achieve the results sum in language `. We use the simple harmonic reported. We consider a neural network νθ with mean estimator, based on 50 samples of a` drawn k [1, 4] layers and find k = 1 the best per- ∈ with our Gibbs sampler (warm-started from the former on development data. Recall that our dif- final E-step of training). feomorphism constraint requires that each layer have exactly two hidden units, the same as the Cloze Evaluation. In addition, following Cot- number of observed formants. We consider N ∈ terell and Eisner(2017), we evaluate our trained 15, 25, 50, 100 phones as well as letting N fluc- { } model’s ability to perform a cloze task (Taylor, tuate with reversible-jump MCMC (see footnote1). ` ` 1953). Given n 1 or n 2 of the vowels in held- S = 5 − − We train for 100 iterations of EM, taking out language `, can we predict the pronunciations samples at each E-step. At each M-step, we run vk of the remaining 1 or 2? We predict vk to be 50 iterations of SGD for the focalization NN and ` νθ(µi) where i = ak is the phone inferred by the also for the diffeomorphism NN. For each N, sampler. Note that the sampler’s inference here is we selected (σ2, ρ) by minimizing cross-entropy based only on the observed vowels (the likelihood) on a held-out development set. We considered and the focalization-dispersion preferences of the 2 k 5 k 5 (σ , ρ) 10 k=1 ρ k=1. DPP (the prior). We report the expected error of ∈ { } × { } such a prediction—where error is quantified by Eu- 9.5 Results and Error Analysis clidean distance in (F1,F2) formant space—over the same 50 samples of a`. We report results in Tab.1. We find that our DPP For instance, consider a previously unseen model improves over the baselines. The results vowel system with formant values (499, 2199), support two claims: (i) dispersion plays an impor- { (861, 1420), (571, 1079) . A “cloze1” evaluation tant role in the structure of vowel systems and (ii) } would aim to predict (499, 2199) as the missing learning a non-linear transformation of a Gaussian { } improves our ability to model sets of formant-pairs. 3Since that expression is the product of both probability Also, we observe that as we increase the number of distributions and probability densities, our “cross-entropy” phones, the role of the DPP becomes more impor- metric is actually the sum of both entropy terms and (poten- tially negative) differential entropy terms. Thus, a value of 0 tant. We visualize a sample of the trained alignment has no special significance. in Fig.3.

44 Frequency Encodes Dispersion. Why does dis- 11 Conclusion persion not always help? The models with fewer Our model combines existing techniques of proba- phones do not reap the benefits that the models bilistic modeling and inference to attempt to fit the with more phones do. The reason lies in the fact actual distribution of the world’s vowel systems. that the most common vowel formants are already We presented a generative probability model of dispersed. This indicates that we still have not sets of measured (F ,F ) pairs. We view this as quite modeled the mechanisms that select for good 1 2 a necessary step in the development of generative vowel formants, despite our work at the phonetic probability models that can explain the distribu- level; further research is needed. We would prefer tion of the world’s languages. Previous work on a model that explains the evolutionary motivation generating vowel inventories has focused on how of sound systems as communication systems. those inventories were transcribed into IPA by field linguists, whereas we focus on the field linguists’ Number of Induced Phones. What is most acoustic measurements of how the vowels are actu- salient in the number of induced phones is that ally pronounced. it is close to the number of IPA phonemes in the data. However, the performance of the phoneme- Acknowledgments supervised system is much worse, indicating that, We would like to acknowledge Tim Vieira, Katha- perhaps, while the linguists have the right idea rina Kann, Sebastian Mielke and Chu-Cheng Lin about the number of universal symbols, they did for reading many early drafts. The first author not specify the correct IPA symbol in all cases. would like to acknowledge an NDSEG grant and Our data analysis indicates that this is often due to a Facebook PhD fellowship. This material is also pragmatic concerns in linguistic field analysis. For based upon work supported by the National Sci- example, even if /I/ is the proper IPA symbol for ence Foundation under Grant No. 1718846 to the the sound, if there is no other sound in the vicinity last author. the annotator may prefer to use more common /i/.

References 10 Related Work Raja Hafiz Affandi, Emily Fox, and Ben Taskar. 2013. Approximate inference in continuous determinantal Most closely related to our work is the classic study processes. In Advances in Neural Information Pro- of Liljencrants and Lindblom(1972), who provide cessing Systems, pages 1430–1438. a simulation-based account of vowel systems. They Roy Becker-Kristal. 2006. Predicting vowel inven- argued that minima of a certain objective that en- tories: The dispersion-focalization theory revisited. codes dispersion should correspond to canonical The Journal of the Acoustical Society of America, vowel systems of a given size n. Our tack is dif- 120(5):3248–3248. ferent in that we construct a generative probability Roy Becker-Kristal. 2010. Acoustic Typology of Vowel model, whose parameters we learn from data. How- Inventories and Dispersion Theory: Insights from ever, the essence of modeling is the same in that a Large Cross-Linguistic Corpus. Ph.D. thesis, we explain formant values, rather than discrete IPA UCLA. symbols. By extension, our work is also closely Paulus Petrus Gerardus Boersma et al. 2002. Praat, a related to extensions of this theory (Schwartz et al., system for doing phonetics by computer. Glot Inter- 1997; Roark, 2001) that focused on incorporating national, 5. the notion of focalization into the experiments. Alexei Borodin and Eric M. Rains. 2005. Eynard- Our present paper can also be regarded as a con- Mehta theorem, Schur process, and their Pfaffian tinuation of Cotterell and Eisner(2017), in which analogs. Journal of Statistical Physics, 121(3- 4):291–317. we used DPPs to model vowel inventories as sets of discrete IPA symbols. That paper pretended Bernard Comrie, Matthew S. Dryer, David Gil, and that each IPA symbol had a single cross-linguistic Martin Haspelmath. 2013. Introduction. In (F ,F ) Matthew S. Dryer and Martin Haspelmath, editors, 1 2 pair, an idealization that we remove in this The World Atlas of Language Structures Online. paper by discarding the IPA symbols and modeling Max Planck Institute for Evolutionary Anthropol- formant values directly. ogy, Leipzig.

45 Ryan Cotterell and Jason Eisner. 2017. Probabilistic Kenneth N. Stevens. 1989. On the quantal nature of typology: Deep generative models of vowel inven- speech. Journal of Phonetics, 17:3–45. tories. In Proceedings of the 55th Annual Meet- ing of the Association for Computational Linguistics Wilson L. Taylor. 1953. Cloze procedure: a new tool (ACL), Vancouver, Canada. for measuring readability. Journalism and Mass Communication Quarterly, 30(4):415. Arthur P. Dempster, Nan M. Laird, and Donald B. Ru- bin. 1977. Maximum likelihood from incomplete Leslie G. Valiant. 1979. The complexity of comput- data via the EM algorithm. Journal of the Royal Rta- ing the permanent. Theoretical Computer Science, tistical Society, Series B (Statistical Methodology), 8(2):189–201. pages 1–38. Maksims Volkovs and Richard S. Zemel. 2012. Effi- Li Deng and Douglas O’Shaughnessy. 2003. Speech cient sampling for bipartite matching problems. In Processing: A Dynamic and Optimization-Oriented Advances in Neural Information Processing Systems, Approach. CRC Press. pages 1313–1321.

Stuart Geman and Donald Geman. 1984. Stochas- Hanna Wallach, Ian Murray, Ruslan Salakhutdinov, tic relaxation, Gibbs distributions, and the Bayesian and David Mimno. 2009. Evaluation methods for restoration of images. IEEE Transactions on Pattern topic models. In International Conference on Ma- Analysis and Machine Intelligence, (6):721–741. chine Learning (ICML), pages 1105–1112. Matthew K. Gordon. 2016. Phonological Typology. Oxford. Peter J. Green. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model de- termination. Biometrika, 82(4):711–732. Tony Jebara, Risi Kondor, and Andrew Howard. 2004. Probability product kernels. Journal of Machine Learning Research, 5:819–844. Peter Ladefoged. 2001. Vowels and : An Introduction to the Sounds of Languages. Wiley- Blackwell. Richard A. Levine and George Casella. 2001. Im- plementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10(3):422–439. Johan Liljencrants and Bjorn¨ Lindblom. 1972. Numer- ical simulation of vowel quality systems: The role of perceptual contrast. Language, pages 839–862. Brian Roark. 2001. Explaining vowel inventory ten- dencies via simulation: Finding a role for quantal locations and formant normalization. In North East Linguistic Society, volume 31, pages 419–434. Jean-Luc Schwartz, Christian Abry, Louis-Jean Boe,¨ Nathalie Vallee,´ and Lucie Menard.´ 2005. The dispersion-focalization theory of sound systems. The Journal of the Acoustical Society of America, 117(4):2422–2422. Jean-Luc Schwartz, Louis-Jean Boe,¨ Nathalie Vallee,´ and Christian Abry. 1997. The dispersion- focalization theory of vowel systems. Journal of Phonetics, 25(3):255–286. Henry Stark and John Woods. 2011. Probability, Statis- tics, and Random Processes for Engineers. Pearson. Kenneth N. Stevens. 1987. Relational properties as per- ceptual correlates of phonetic features. In Interna- tional Conference of Phonetic Sciences, pages 352– 355.

46