A Deep Generative Model of Vowel Formant Typology
Total Page:16
File Type:pdf, Size:1020Kb
A Deep Generative Model of Vowel Formant Typology Ryan Cotterell and Jason Eisner Department of Computer Science Johns Hopkins University, Baltimore MD, 21218 ryan.cotterell,eisner @jhu.edu { } Abstract and its goal should be the construction of a univer- sal prior over potential languages. A probabilistic What makes some types of languages more approach does not rule out linguistic systems com- probable than others? For instance, we know pletely (as long as one’s theoretical formalism can that almost all spoken languages contain the vowel phoneme /i/; why should that be? The describe them at all), but it can position phenomena field of linguistic typology seeks to answer on a scale from very common to very improbable. these questions and, thereby, divine the mech- Probabilistic modeling also provides a discipline anisms that underlie human language. In our for drawing conclusions from sparse data. While work, we tackle the problem of vowel system we know of over 7000 human languages, we have typology, i.e., we propose a generative proba- some sort of linguistic analysis for only 2300 of bility model of which vowels a language con- them (Comrie et al., 2013), and the dataset used in tains. In contrast to previous work, we work di- rectly with the acoustic information—the first this paper (Becker-Kristal, 2010) provides simple two formant values—rather than modeling dis- vowel data for fewer than 250 languages. crete sets of phonemic symbols (IPA). We de- Formants are the resonant frequencies of the hu- velop a novel generative probability model and man vocal tract during the production of speech report results based on a corpus of 233 lan- sounds. We propose a Bayesian generative model guages. of vowel inventories, where each language’s inven- tory is a finite subset of acoustic vowels represented 1 Introduction 2 as points (F1,F2) R . We deploy tools from the ∈ Human languages are far from arbitrary; cross- neural-network and point-process literatures and linguistically, they exhibit surprising similarity in experiment on a dataset with 233 distinct languages. many respects and many properties appear to be We show that our most complicated model outper- universally true. The field of linguistic typology forms simpler models. seeks to investigate, describe and quantify the axes along which languages vary. One facet of language 2 Acoustic Phonetics and Formants that has been the subject of heavy investigation is Much of human communication takes place the nature of vowel inventories, i.e., which vowels through speech: one conversant emits a sound wave a language contains. It is a cross-linguistic univer- to be comprehended by a second. In this work, we sal that all spoken languages have vowels (Gordon, consider the nature of the portions of such sound 2016), and the underlying principles guiding vowel waves that correspond to vowels. We briefly review selection are understood: vowels must be both the relevant bits of acoustic phonetics so as to give easily recognizable and well-dispersed (Schwartz an overview of the data we are actually modeling et al., 2005). In this work, we offer a more formal and develop our notation. treatment of the subject, deriving a generative prob- ability model of vowel inventory typology. Our The anatomy of a sound wave. The sound wave work builds on (Cotterell and Eisner, 2017) by in- that carries spoken language is a function from vestigating not just discrete IPA inventories but the time to amplitude, describing sound pressure vari- cross-linguistic variation in acoustic formants. ation in the air. To distinguish vowels, it is help- The philosophy behind our approach is that lin- ful to transform this function into a spectrogram guistic typology should be treated probabilistically (Fig.1) by using a short-time Fourier transform 37 Proceedings of NAACL-HLT 2018, pages 37–46 New Orleans, Louisiana, June 1 - 6, 2018. c 2018 Association for Computational Linguistics 5000 Hz principles are and . /i/ /u/ /ɑ/ focalization dispersion 4000 Hz Focalization. The notion of focalization grew 3000 Hz out of quantal vowel theory (Stevens, 1989). Quan- 2000 Hz tal vowels are those that are phonetically “better” than others. They tend to display certain proper- 1000 Hz ties, e.g., the formants tend to be closer together 0 Hz (Stevens, 1987). Cross-linguistically, quantal vow- els are the most frequently attested vowels, e.g., the Figure 1: Example spectrogram of the three English vowels: /i/, /u/ and /A/. The x-axis is time and y-axis is frequency. The cross-linguistically common vowel /i/ is considered first two formants F1 and F2 are marked in with arrows for quantal, but less common /y/ is not. each vowel. The figure was made with Praat (Boersma et al., 2002). Dispersion. The second core principle of vowel system organization is known as dispersion. As the name would imply, the principle states that (Deng and O’Shaughnessy, 2003, Chapter 1) to de- the vowels in “good” vowel systems tend to be compose each short interval of the wave function spread out. The motivation for such a principle into a weighted sum of sinusoidal waves of differ- is clear—a well-dispersed set of vowels reduces a ent frequencies (measured in Hz). At each interval, listener’s potential confusion over which vowel is the variable darkness of the spectrogram indicates being pronounced. See Schwartz et al.(1997) for a the weights of the different frequencies. In pho- review of dispersion in vowel system typology and netic analysis, a common quantity to consider is its interaction with focalization, which has led to a formant—a local maximum of the (smoothed) the joint dispersion-focalization theory. frequency spectrum. The fundamental frequency F0 determines the pitch of the sound. The formants Notation. We will denote the universal set of F1 and F2 determine the quality of the vowel. international phonetic alphabet (IPA) symbols as . The observed vowel inventory for lan- V Two is all you need (and what we left out). In guage ` has size n` and is denoted V ` = terms of vowel recognition, it is widely speculated ` ` ` ` d (v1, v1),..., (v ` , v ` ) R , where for that humans rely almost exclusively on the first { n n } ⊆ V × each k [1, n`], v` is an IPA symbol assigned two formants of the sound wave (Ladefoged, 2001, ∈ k ∈ V by a linguist and v` Rd is a vector of d measur- Chapter 5). The two-formant assumption breaks k ∈ able phonetic quantities. In short, the IPA symbol down in edge cases: e.g., the third formant F 3 v` was assigned as a label for a phoneme with pro- helps to distinguish the roundness of the vowel k nunciation v` . The ordering of the elements within (Ladefoged, 2001, Chapter 5). Other non-formant k V ` is arbitrary. features may also play a role. For example, in tonal languages, the same vowel may be realized Goals. This framework recognizes that the same with different tones (which are signaled using F0): IPA symbol v (such as /u/) may represent a slightly Mandarin Chinese makes a distinction between mˇa different sound v in one language than in another, (horse) and ma´ (hemp) without modifying the qual- although they are transcribed identically. We are ity of the vowel /a/. Other features, such as creaky specifically interested in how the vowels in a lan- voice, can play a role in distinguishing phonemes. guage influence one another’s fine-grained pro- We do not explicitly model any of these aspects of nunciation in Rd. In general, there is no reason vowel space, limiting ourselves to (F1,F2) as in to suspect that speakers of two languages, whose previous work (Liljencrants and Lindblom, 1972). phonological systems contain the same IPA symbol, However, it would be easy to extend all the models should produce that vowel with identical formants. we will propose here to incorporate such informa- Data. For the remainder of the paper, we will tion, given appropriate datasets. 2 take d = 2 so that each v = (F1,F2) R , the ∈ 3 The Phonology of Vowel Systems vector consisting of the first two formant values, as compiled from the field literature by Becker- The vowel inventories of the world’s languages Kristal(2006). This dataset provides inventories display clear structure and appear to obey several V ` in the form above. Thus, we do not consider underlying principles. The most prevalent of these further variation of the vowel pronunciation that 38 may occur within the language (between speakers, features may also be informed by phonological and between tokens of the vowel, or between earlier phonetic processes in the language. We leave mod- and later intervals within a token). eling of this step to future work; so our current likelihood term ignores the evidence contributed 4 Phonemes versus Phones by the IPA labels in the dataset, considering only the pronunciations in Rd. Previous work (Cotterell and Eisner, 2017) has The overall idea is that human languages ` draw placed a distribution over discrete phonemes, ignor- their inventories from some universal prior, which ing the variation across languages in the pronuncia- we are attempting to reconstruct. A caveat is that tion of each phoneme. In this paper, we crack open we will train our method by maximum-likelihood, the phoneme abstraction, moving to a learned set which does not quantify our uncertainty about the of finer-grained phones. reconstructed parameters. An additional caveat is Cotterell and Eisner(2017) proposed (among that some languages in our dataset are related to other options) using a determinantal point process one another, which belies the idea that they were (DPP) over a universal inventory of 53 sym- V drawn independently. Ideally, one ought to capture bolic (IPA) vowels.