Phonotactic Complexity and its Trade-offs

Tiago Pimentel, University of Cambridge
Brian Roark, Google
Ryan Cotterell, University of Cambridge
[email protected] [email protected] [email protected] Abstract 3.50 We present methods for calculating a mea- 3.25 sure of phonotactic complexity—bits per phoneme—that permits a straightforward 3.00 cross-linguistic comparison. When given a 2.75 word, represented as a sequence of phone- mic segments such as symbols in the inter- 2.50 national phonetic alphabet, and a statistical 2.25 model trained on a sample of word types 5 6 7 8 from the language, we can approximately Cross Entropy (bits per phoneme) Average Length (# IPA tokens) measure bits per phoneme using the negative log-probability of that word under the model. Figure 1: Bits-per-phoneme vs average word length This simple measure allows us to compare using an LSTM language model. the entropy across languages, giving insight into how complex a language’s phonotactics are. Using a collection of 1016 basic concept communicative capacity (Pellegrino et al., 2011; words across 106 languages, we demonstrate Coupé et al., 2019). a very strong negative correlation of −0:74 Methods for making hypotheses about linguis- between bits per phoneme and the average tic complexity objectively measurable and testable length of words. have long been of interest, though existing mea- sures are typically relatively coarse—see, e.g., 1 Introduction Moran and Blasi(2014) and §2 below for reviews. Briefly, counting-based measures such as inventory One prevailing view on system-wide phonological sizes (e.g., numbers of vowels, consonants, sylla- complexity is that as one aspect increases in com- bles) typically play a key role in assessing phono- plexity (e.g., size of phonemic inventory), another logical complexity.