Phonotactic Complexity and its Trade-offs

Tiago Pimentel, University of Cambridge ([email protected])
Brian Roark, Google ([email protected])
Ryan Cotterell, University of Cambridge ([email protected])

Abstract

We present methods for calculating a measure of phonotactic complexity—bits per phoneme—that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the international phonetic alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language's phonotactics are. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.

[Figure 1: Bits-per-phoneme vs average word length using an LSTM language model. Axes: Average Length (# IPA tokens) vs Cross Entropy (bits per phoneme).]

1 Introduction

One prevailing view on system-wide phonological complexity is that as one aspect increases in complexity (e.g., size of phonemic inventory), another reduces in complexity (e.g., degree of phonotactic interactions). Underlying this claim—the so-called compensation hypothesis (Martinet, 1955; Moran and Blasi, 2014)—is the conjecture that languages are, generally speaking, of roughly equivalent complexity, i.e., no language is overall inherently more complex than another. This conjecture is widely accepted in the literature and dates back, at least, to the work of Hockett (1958). Since along any one axis a language may be more complex than another, this conjecture has a corollary that compensatory relationships between different types of complexity must exist. Such compensation has been hypothesized to be the result of natural processes of historical change, and is sometimes attributed to a potential linguistic universal of equal communicative capacity (Pellegrino et al., 2011; Coupé et al., 2019).

Methods for making hypotheses about linguistic complexity objectively measurable and testable have long been of interest, though existing measures are typically relatively coarse—see, e.g., Moran and Blasi (2014) and §2 below for reviews. Briefly, counting-based measures such as inventory sizes (e.g., numbers of vowels, consonants, syllables) typically play a key role in assessing phonological complexity. Yet, in addition to their categorical nature, such measures generally do not capture longer-distance (e.g., cross-syllabic) phonological dependencies such as vowel harmony. In this paper, we take an information-theoretic view of phonotactic complexity, and advocate for a measure that permits straightforward cross-linguistic comparison: bits per phoneme. For each language, a statistical language model over words (represented as phonemic sequences) is trained on a sample of types from the language, and then used to calculate the bits per phoneme for new samples, thus providing an upper bound of the actual entropy (Brown et al., 1992).

Characterizing natural language via information-theoretic measures goes back to Cherry et al. (1953), who discussed the information content of phonemes in isolation, based on the presence or absence of distinctive features, as well as in groups, e.g., trigrams or possibly syllables. Here we leverage modern recurrent neural language modeling methods to build models over full word forms represented as phoneme strings, thus capturing any dependencies over longer distances (e.g., vowel harmony) in assigning probabilities to phonemes in sequence. By training and evaluating on comparable corpora in each language, consisting of concept-aligned words, we can characterize and compare their phonotactics. While probabilistic characterizations of phonotactics have been employed extensively in psycholinguistics (see §2.4), such methods have generally been used to assess single words within a lexicon (e.g., classifying high versus low probability words during stimulus construction), rather than information-theoretic properties of the lexicon as a whole, which our work explores.

The empirical portion of our paper exploits this information-theoretic take on complexity to examine multiple aspects of phonotactic complexity:

(i) Bits-per-phoneme and Word Length: In §5.1, we show a very high negative correlation of −0.74 between bits-per-phoneme and average word length for the same 1016 basic concepts across 106 languages. This correlation is plotted in Fig. 1. In contrast, conventional phonotactic complexity measures, e.g., number of consonants in an inventory, demonstrate poor correlation with word length. Our results are consistent with Pellegrino et al. (2011), who show a similar correlation in speech.¹ We additionally establish, in §5.2, that the correlation persists when controlling for characteristics of long words, e.g., early versus late positions in the word.

(ii) Constraining Language: Despite often being thought of as adding complexity, processes like vowel harmony and final-obstruent devoicing improve the predictability of subsequent segments by constraining the number of well-formed forms. Thus, they reduce complexity measured in bits per phoneme. We validate our models by systematically removing certain constraints in our corpora in §5.3.

(iii) Intra- versus Inter-Family Correlation: Additionally, we present results in §5.4 showing that our complexity measure not only correlates with word length in a diverse set of languages, but also intra language families. Standard measures of phonotactic complexity do not show such correlations.

(iv) Explicit feature representations: We also find (in §5.5) that methods for including features explicitly in the representation, using methods described in §4.1, yield little benefit except in an extremely low-resource condition.

Our methods² permit a straightforward cross-linguistic comparison of phonotactic complexity, which we use to demonstrate an intriguing trade-off with word length. Before motivating and presenting our methods, we next review related work on measuring complexity and phonotactic modeling.

¹ See also Coupé et al. (2019), where syllable-based bigram models are used to establish a comparable information rate in speech across 17 typologically diverse languages.

² Code to train these models and reproduce results is available at https://github.com/tpimentelms/phonotactic-complexity.

2 Background: Phonological Complexity

2.1 Linguistic complexity

Linguistic complexity is a nuanced topic. For example, one can judge a particular sentence to be syntactically complex relative to other sentences in the language. However, one can also describe a language as a whole as being complex in one aspect or another, e.g., polysynthetic languages are often deemed morphologically complex. In this paper, we look to characterize phonotactics at the language level. However, we use methods more typically applied to specific sentences in a language, for example in the service of psycholinguistic experiments.

In cross-linguistic studies, the term complexity is generally used chiefly in two manners, which Moran and Blasi (2014) follow Miestamo (2006) in calling relative and absolute. Relative complexity metrics are those that capture the difficulty of learning or processing language, which Miestamo (2006) points out may vary depending on the individual (hence, is relative to the individual being considered). For example, vowel harmony, which we will touch upon later in the paper, may make vowels more predictable for a native speaker, hence less difficult to process; for a second language learner, however, vowel harmony may increase difficulty of learning and speaking. Absolute complexity measures, in contrast, assess the number of parts of a linguistic (sub-)system, e.g., number of phonemes or licit syllables.

In the sentence processing literature, surprisal (Hale, 2001; Levy, 2008) is a widely used measure of processing difficulty, defined as the negative log probability of a word given the preceding words. Words that are highly predictable from the preceding context have low surprisal, and those that are not predictable have high surprisal. The phonotactic measure we advocate for in §3 is related to surprisal, though at the phoneme level rather than the word level, and over words rather than sentences. Measures related to phonotactic probability have been used in a range of psycholinguistic studies—see §2.4—though generally to characterize single words within a language (e.g., high versus low probability words) rather than for cross-linguistic comparison as we are here. Returning to the distinction made by Miestamo (2006), we will remain agnostic in this paper as to which class (relative or absolute) such probabilistic complexity measures fall within, as well as whether the trade-offs that we document are bona fide instances of complexity compensation or are due to something else, e.g., related to the communicative capacity as hypothesized by Pellegrino et al. (2011). We bring up this terminological distinction primarily to situate our use of complexity within the diverse usage in the literature.

Additionally, however, we will point out that an important motivation for those advocating for the use of absolute over relative measures to characterize linguistic complexity in cross-linguistic studies is a practical one. Miestamo (2006, 2008) claims that relative complexity measures are infeasible for broadly cross-linguistic studies because they rely on psycholinguistic data, which is neither common enough nor sufficiently easily comparable across languages to support reliable comparison. In this study, we demonstrate that surprisal and related measures are not subject to the practical obstacles raised by Miestamo, independently of whichever class of complexity they fall into.

2.2 Measures of Phonological Complexity

The complexity of phonemes has long been studied in linguistics, including early work on the topic by Zipf (1935), who argued that a phoneme's articulatory effort was related to its frequency. Trubetzkoy (1938) introduced the notion of markedness of phonological features, which bears some indirect relation to both frequency and articulatory complexity. Phonological complexity can be formulated in terms of language production (e.g., complexity of planning or articulation) or in terms of language processing (e.g., acoustic confusability or predictability), a distinction often framed around the ideas of articulatory complexity and perceptual salience—see, e.g., Maddieson (2009). One recent instantiation of this was the inclusion of both focalization and dispersion to model vowel system typology (Cotterell and Eisner, 2017).

It is also natural to ask questions about the phonological complexity of an entire language in addition to that of individual phonemes—whether articulatory or perceptual, phonemic or phonotactic. Measures of such complexity that allow for cross-linguistic comparison are non-trivial to define. We review several previously proposed metrics here.

Size of phoneme inventory. The most basic metric proposed for measuring phonological complexity is the number of distinct phonemes in the language's phonemic inventory (Nettle, 1995). There has been considerable historical interest in counting both the number of vowels and the number of consonants—see, e.g., Hockett (1955); Greenberg et al. (1978); Maddieson and Disner (1984). Phoneme inventory size has its limitations—it ignores the phonotactics of the language. It does, however, have the advantage that it is relatively easy to compute without further linguistic analysis. Correlations between the size of vowel and consonant inventories (measured in number of phonemes) have been extensively studied, with contradictory results presented in the literature—see, e.g., Moran and Blasi (2014) for a review. Increases in phonemic inventory size are also hypothesized to negatively correlate with word length measured in phonemes (Moran and Blasi, 2014). In Nettle (1995), an inverse relationship was demonstrated between the size of the segmental inventory and the mean word length for 10 languages, and similar results (with some qualifications) were found for a much larger collection of languages in Moran and Blasi (2014).³ We will explore phoneme inventory size as a baseline in our studies in §5.

³ Note that by examining negative correlations between word length and inventory size within the context of complexity compensation, word length is also being taken implicitly as a complexity measure, as we shortly make explicit.

Markedness in Phoneme Inventory. A refinement of phoneme inventory size takes into account the markedness of the individual phonemes. McWhorter (2001) argues that one should judge the complexity of an inventory by counting the cross-linguistic frequency of the phonemes in the inventory, channeling the spirit of Greenberg (1966). Thus, a language that has fewer phonemes, but contains cross-linguistically marked ones such as clicks, could be more complex.⁴ McWhorter justifies this definition with the observation that no attested language has a phonemic inventory that consists only of marked segments. Beyond frequency, Lindblom and Maddieson (1988) propose a tripartite markedness rating scheme for various consonants. In this paper, we are principally looking at phonotactic complexity, though we did examine the joint training of models across languages, which can be seen as modeling some degree of typicality and markedness.

⁴ McWhorter (2001) was one of the first to offer a quantitative treatment of linguistic complexity at all levels. Note, however, that he rejects the equal complexity hypothesis, arguing creoles are simpler than other languages. As our data contains no creole, we cannot address this hypothesis; rather, we only compare non-creole languages.

Word length. As stated earlier, word length, measured in the number of phonemes in a word, has been shown to negatively correlate with other complexity measures, such as phoneme inventory size (Nettle, 1995; Moran and Blasi, 2014). To the extent that this is interpreted as being a compensatory relation, this would indicate that word length is being taken as an implicit measure of complexity. Alternatively, word length has a natural interpretation in terms of information rate, so trade-offs could be attributed to communicative capacity (Pellegrino et al., 2011; Coupé et al., 2019).

Number of Licit Syllables. Phonological constraints extend beyond individual units to the structure of entire words themselves, as we discussed above; so why stop at counting phonemes? One step in that direction is to investigate the syllabic structure of a language, and count the number of possible licit syllables in the language (Maddieson and Disner, 1984; Shosted, 2006). Syllabic complexity brings us closer to a more holistic measure of phonological complexity. Take, for instance, the case of Mandarin Chinese. At first blush, one may assume that Mandarin has a complex phonology due to an above-average-sized phonemic inventory (including tones); closer inspection, however, reveals a more constrained system. Mandarin only admits two codas: /n/ and /ŋ/.

While syllable inventories and syllable-based measures of phonotactic complexity—e.g., highest complexity syllable type in Maddieson (2006)—do incorporate more of the constraints at play in a language than segment-based measures, (a) they remain relatively simple counting measures; and (b) phonological constraints do not end at the syllable boundary. Phenomena such as vowel harmony operate at the word level. Further, the combinatorial possibilities captured by a syllabic inventory, as discussed by Maddieson (2009), can be seen as a sort of categorical version of a distribution over forms. Stochastic models of word-level phonotactics permit us to go beyond simple enumeration of a set, and characterize the distribution in more robust information-theoretic terms.

2.3 Phonotactics

Beyond characterizing the complexity of phonemes in isolation or the number of syllables, one can also look at the system determining how phonemes combine to form longer sequences in order to create words. The study of which sequences of phonemes constitute natural-sounding words is called phonotactics. For example, as Chomsky and Halle (1965) point out in their oft-cited example, brick is an actual word in English;⁵ blick is not an actual word in English, but is judged to be a possible word by English speakers; and bnick is neither an actual nor a possible word in English, due to constraints on its phonotactics.

⁵ For convenience, we just use standard orthography to represent actual and possible words, rather than phoneme strings.

Psycholinguistic studies often use phonotactic probability to characterize stimuli within a language—see §2.4 for details. For example, Goldrick and Larson (2008) demonstrate that both articulatory complexity and phonotactic probability influence the speed and accuracy of speech production. Measures of the overall complexity of a phonological system must thus also account for phonotactics.

Cherry et al. (1953) took an explicitly information-theoretic view of phonemic structure, including discussions of both encoding phonemes as feature bundles and the redundancy within groups of phonemes in sequence. This perspective of phonemic coding has led to work on characterizing the explicit rules or constraints that lead to redundancy in phoneme sequences, including morpheme structure rules (Halle, 1959) or conditions (Stanley, 1967). Recently, Futrell et al. (2017) take such approaches as inspiration for a generative model over feature dependency graphs. We, too, examine decomposition of phonemes into features for representation in our model (see §4.1), though in general this only provided modeling improvements over atomic phoneme symbols in a low-resource scenario.

Much of the work in phonotactic modeling is intended to explain the sorts of grammaticality judgments exemplified by the examples of Chomsky and Halle (1965) discussed earlier. Recent work is typically founded on the commonly held perspective that such judgements are gradient,⁶ hence amenable to stochastic modeling, e.g., Hayes and Wilson (2008) and Futrell et al. (2017)—though cf. Gorman (2013). In this paper, however, we are looking at phonotactic modeling as the means for assessing phonotactic complexity and discovering potential evidence of trade-offs cross-linguistically, and are not strictly speaking evaluating the model on its ability to capture such judgments, gradient or otherwise.

⁶ Gradient judgments would account for the fact that bwick is typically judged to be a possible English word like blick, but not as good. In other words, bwick is better than bnick but not as good as blick.

2.4 Phonotactic probability and surprisal

A word's phonotactic probability has been shown to influence both processing and learning of language. Words with high phonotactic probabilities (see brief notes on the operationalization of this below) have been shown to speed speech processing, both recognition (e.g., Vitevitch and Luce, 1999) and production (e.g., Goldrick and Larson, 2008). Phonotactically probable words in a language have also been shown to be easier to learn (Storkel, 2001, 2003; Coady and Aslin, 2004, inter alia), although such an effect is also influenced by neighborhood density (Coady and Aslin, 2003), as are the speech processing effects (Vitevitch and Luce, 1999). Informally, phonological neighborhood density is the number of similar sounding words in the lexicon, which, to the extent that high phonotactic probability implies phonotactic patterns frequent in the lexicon, typically correlates to some degree with phonotactic probability—i.e., dense neighborhoods will typically consist of phonotactically probable words. Some effort has been made to disentangle the effect of these two characteristics (Vitevitch and Luce, 1999; Storkel et al., 2006; Storkel and Lee, 2011, inter alia).

Within the psycholinguistics literature referenced above, phonotactic probability was typically operationalized by summing or averaging the frequency with which single phonemes and phoneme bigrams occur, either overall or in certain word positions (initial, medial, final); and the neighborhood density of a word is typically the number of words in a lexicon that have Levenshtein distance 1 from the word (see, e.g., Storkel and Hoover, 2010). Note that these measures are used to characterize specific words, i.e., given a lexicon, these measures allow for the designation of high versus low phonotactic probability words and high versus low neighborhood density words, which is useful for designing experimental stimuli. Our bits-per-phoneme measure, in contrast, is used to characterize the distribution over a sample of a language rather than specific individual words in that language.
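To make the operationalization above concrete, here is a minimal sketch of the classic positional measure (in the spirit of Vitevitch and Luce, 1999). The function names and toy lexicon are our own illustration, not code from the paper's repository.

```python
from collections import Counter

def positional_counts(lexicon):
    """Count phonemes and phoneme bigrams per word position."""
    uni, bi = Counter(), Counter()
    uni_tot, bi_tot = Counter(), Counter()
    for word in lexicon:                      # word: tuple of phoneme symbols
        for i, seg in enumerate(word):
            uni[(i, seg)] += 1
            uni_tot[i] += 1
        for i in range(len(word) - 1):
            bi[(i, word[i], word[i + 1])] += 1
            bi_tot[i] += 1
    return uni, uni_tot, bi, bi_tot

def phonotactic_probability(word, uni, uni_tot, bi, bi_tot):
    """Sum of positional segment and bigram relative frequencies."""
    p = sum(uni[(i, s)] / uni_tot[i] for i, s in enumerate(word) if uni_tot[i])
    p += sum(bi[(i, word[i], word[i + 1])] / bi_tot[i]
             for i in range(len(word) - 1) if bi_tot[i])
    return p

# Toy usage on a three-word 'lexicon' of phoneme tuples.
lexicon = [("b", "l", "i", "k"), ("b", "r", "i", "k"), ("t", "r", "i", "k")]
counts = positional_counts(lexicon)
print(phonotactic_probability(("b", "l", "i", "k"), *counts))
```

Note that this yields a score for one word given a lexicon, which is exactly the contrast drawn above: it characterizes individual stimuli, not the lexicon-wide distribution.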

Other work has made use of phonotactic probability to examine how such processing and learning considerations may impact the lexicon. Dautriche et al. (2017) take phonotactic probability as one component of ease of processing and learning—the other being perceptual confusability—that might influence how lexicons become organized over time. They operationalize phonotactic probability via generative phonotactic models (phoneme n-gram models and probabilistic context-free grammars with syllable structure), hence closer to the approaches described in this paper than the work cited earlier in this section. Generating artificial lexicons from such models, they find that real lexicons demonstrate higher network density (as indicated by Levenshtein distances, frequency of minimal pairs, and other measures) than the randomly generated lexicons, suggesting that the pressure towards highly clustered lexicons is driven by more than just phonotactic probability.

Evidence of pressure towards communication efficiency in the lexicon has focused on both phonotactic probability and word length. The information content of a word, as measured by the probability of the word in context, is shown to correlate with orthographic length (taken as a proxy for phonological word length) (Piantadosi et al., 2009, 2011). Piantadosi et al. (2012) show that words with lower bits per phoneme have higher rates of homophony and polysemy, in support of their hypothesis that words that are easier to process will have higher levels of ambiguity. Relatedly, Mahowald et al. (2018) demonstrate, in nearly all of the 96 languages investigated, a high correlation between orthographic probability (as a proxy for phonotactic probability) and frequency, i.e., frequently used forms tend to be phonotactically highly probable, at least within the word lengths examined (3–7 symbols). A similar perspective on the role of predictability in phonology holds that words that are high probability in context (i.e., low surprisal) tend to be reduced, and those that are low probability in context are prone to change (Hume and Mailhot, 2013) or to some kind of enhancement (Hall et al., 2018). As Priva and Jaeger (2018) point out, frequency, predictability and information content (what they call informativity and operationalize as expected predictability) are related and easily confounded, hence the perspectives presented by these papers are closely related. Again, for these studies and those cited earlier, such measures are used to characterize individual words within a language rather than the lexicon as a whole.

3 The Probabilistic Lexicon

In this work, we are interested in a hypothetical phonotactic distribution p_lex : Σ* → ℝ₊ over the lexicon. In the context of phonology, we interpret Σ* as all "universally possible phonological surface forms," following Hayes and Wilson (2008).⁷ The distribution p_lex, then, assigns a probability to every possible surface form x ∈ Σ*. In the special case that p_lex is a log-linear model, we arrive at what is known as a maximum entropy grammar (Goldwater and Johnson, 2003; Jäger, 2007). A good distribution p_lex should assign high probability to phonotactically valid words, including non-existent ones, but little probability to phonotactic impossibilities. For instance, the possible English word blick should receive much higher probability than *bnick, which is not a possible English word. The lexicon of a language, then, is considered to be generated as samples without replacement from p_lex.

⁷ Hayes and Wilson (2008) label Σ* as Ω.

If we accept the existence of the distribution p_lex, then a natural manner by which we should measure the phonological complexity of a language is through Shannon's entropy (Cover and Thomas, 2012):

    H(p_lex) = − Σ_{x ∈ Σ*} p_lex(x) log p_lex(x)    (1)

The units of H(p_lex) are bits, as we take log to be base 2. Specifically, we will be interested in bits per phoneme, that is, how much information each phoneme in a word conveys, on average.

3.1 Linguistic Rationale

Here we seek to make a linguistic argument for the adoption of bits per phoneme as a metric for complexity in the phonological literature. Bits are fundamentally units of predictability: if the entropy of a distribution is higher, that is, more bits, then it is less predictable; if the entropy is lower, that is, fewer bits, then it is more predictable, with an entropy of 0 indicating determinism.

Holistic Treatment. When we just count the number of distinctions in individual parts of the phonology, e.g., number of vowels or number of consonants, we do not get a holistic picture of how these pieces interact. A simple probabilistic treatment will inherently capture nuanced interactions. Indeed, it is not clear how to balance the number of consonants, the number of vowels and the number of tones to get a single number for phonological complexity. Probabilistically modeling phonological strings, however, does capture this. We judge the complexity of a phonological system as its entropy.

Longer-Distance Dependencies. To the best of the authors' knowledge, the largest phonological unit that has been considered in the context of cross-linguistic phonological complexity is the syllable, as discussed in §2.2. However, the syllable clearly has limitations. It cannot capture, tautologically, cross-syllabic phonological processes, which abound in the languages of the world. For instance, vowel and consonant harmony are quite common cross-linguistically. Naturally, a desideratum for any measure of phonological complexity is to consider all levels of phonological processes. Examples of vowel harmony in Turkish are presented in Tab. 1.

English   Turkish   |  English  Turkish
ear       kulak     |  throat   boğaz
rain      yağmur    |  foam     köpük
summit    zirve     |  claw     pençe
nail      tırnak    |  herd     sürü
horse     beygir    |  dog      köpek

Table 1: Turkish evinces two types of vowel harmony, front–back and round–unround. Here we focus on just front–back harmony. The examples in the table above are such that all vowels in a word are either back (ı, u, a, o) or front (i, ü, e, ö), which is generally the case.

Frequency Information. None of the previously proposed phonological complexity measures deals with the fact that certain patterns are more frequent than others; probability models inherently handle this as well. Indeed, consider the role of the unvoiced velar fricative /x/ in English; while not part of the canonical consonant inventory, /x/ nevertheless appears in a variety of loanwords. For instance, many native English speakers do pronounce the last name of composer Johann Sebastian Bach as /bax/. Moreover, English phonology acts upon /x/ as one would expect: consider Morris Halle's (1978) example Sandra out-Bached Bach, where the second word is pronounced /out-baxt/ with a final /t/ rather than a /d/. We conclude that /x/ is in the consonant inventory of at least some native English speakers. However, counting it on equal status with the far more common /k/ when determining complexity seems incorrect. Our probabilistic metric covers this corner case elegantly.

Relatively Modest Annotation Requirements. Many of these metrics require a linguist's analysis of the language. This is a tall order for many languages. Our probabilistic approach only requires relatively simple annotations, namely, a Swadesh (1955)-style list in the international phonetic alphabet (IPA), to estimate a distribution. When discussing why he limits himself to counting complexities, Maddieson (2009) writes:

  "[t]he factors considered in these studies only involved the inventories of consonant and vowel contrasts, the tonal system, if any, and the elaboration of the syllable canon. It is relatively easy to find answers for a good many languages to such questions as 'how many consonants does this language distinguish?' or 'how many types of syllable structures does this language allow?'"

The moment one searches for data on more elaborate notions of complexity, e.g., the existence of vowel harmony, one is faced with a paucity of data—a linguist must have analyzed the data.

3.2 Constraints Reduce Entropy

Many phonologies in the world employ hard constraints, e.g., a syllable-final obstruent must be devoiced, or the vowels in a word must be harmonic. Using our definition of phonological complexity as entropy, we can prove a general result: any hard-constraining process will reduce entropy, thus making the phonology less complex. That this holds for any hard constraint, be it vowel harmony or final-obstruent devoicing, follows from the fact that conditioning reduces entropy.

3.3 A Variational Upper Bound

If we want to compute eq. (1), we are immediately faced with two problems. First, we do not know p_lex: we simply assume the existence of such a distribution from which the words of the lexicon were drawn. Second, even if we did know p_lex, computation of H(p_lex) would be woefully intractable, as it involves an infinite sum. Following Brown et al. (1992), we tackle both of these issues together. Note that this line of reasoning follows Cotterell et al. (2018) and Mielke et al. (2019), who use a similar technique for measuring language complexity at the sentence level.

We start with a basic inequality from information theory. For any distribution q_lex with the same support as p_lex, the cross-entropy provides an upper bound on the entropy, i.e.,

    H(p_lex) ≤ H(p_lex, q_lex)    (2)

where cross-entropy is defined as

    H(p_lex, q_lex) = − Σ_{x ∈ Σ*} p_lex(x) log q_lex(x)    (3)

Note that eq. (2) is tight if and only if p_lex = q_lex. We still are not done, as eq. (3) still requires knowledge of p_lex and involves an infinite sum. However, we are now in a position to exploit samples from p_lex. Specifically, given samples x̃⁽ⁱ⁾ ∼ p_lex, we approximate

    H(p_lex, q_lex) ≈ − (1/N) Σ_{i=1}^{N} log q_lex(x̃⁽ⁱ⁾)    (4)

with equality if we let N → ∞. In information theory, this equality in the limit is called the asymptotic equipartition property and follows easily from the weak law of large numbers. Now we have an empirical procedure for estimating an upper bound on H(p_lex). For the rest of the paper, we will use the right-hand side of eq. (4) as a surrogate for the phonotactic complexity of a language.
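Once a model q_lex is in hand, the estimator in eq. (4) reduces to averaging per-word negative log-probabilities over a held-out sample. Below is a minimal sketch; `model_logprob2` is a hypothetical callback (not an API from the paper's code) that returns log₂ q_lex(x) for a word, and whether the end-of-word symbol is counted in the per-phoneme normalization is a detail we gloss over.

```python
import math

def bits_per_phoneme(heldout_words, model_logprob2):
    """Monte Carlo cross-entropy upper bound of eq. (4), normalized per
    phoneme token rather than per word."""
    total_bits = 0.0
    total_segments = 0
    for word in heldout_words:              # word: tuple of phoneme symbols
        total_bits -= model_logprob2(word)  # -log2 q_lex(word)
        total_segments += len(word)
    return total_bits / total_segments

# Toy stand-in model: a uniform distribution over a 20-phoneme alphabet
# gives every segment probability 1/20.
uniform = lambda word: len(word) * math.log2(1 / 20)
print(bits_per_phoneme([("k", "u", "l", "a", "k")], uniform))  # ≈ 4.32 bits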

How do we choose q_lex? Choosing a good q_lex is a two-step process. First, we choose a variational family Q. Then, we choose a specific q_lex ∈ Q by

    q_lex = argsup_{q ∈ Q} (1/N) Σ_{i=1}^{N} log q(x̃⁽ⁱ⁾)    (5)

This procedure corresponds to maximum likelihood estimation. In this work, we consider two families of language models: n-gram models and recurrent neural language models, described next.

4 Language Models

A language model (LM) models a probability distribution over Σ*:

    p(x) = Π_{i=1}^{|x|} p(x_i | x_{<i})    (6)

Trigram LM. n-gram models assume the sequence follows an (n−1)-order Markov model, conditioning the probability of a phoneme on the (n−1) previous ones:

    f_n(x_i | x_{<i}) = count(x_i, x_{i−1}, ..., x_{i+1−n}) / count(x_{i−1}, ..., x_{i+1−n})

In practice such count-based estimates are smoothed, e.g., via interpolated estimation (Jelinek, 1980), which mixes the n-gram estimate with lower-order ones.
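A minimal sketch of the count-based estimates f_n above, with simple interpolation for smoothing. The mixture weights are an illustrative assumption, not the tuned smoothing behind the paper's reported numbers.

```python
from collections import Counter

class TrigramLM:
    """Count-based f_n estimates (uni/bi/trigram) with fixed interpolation."""
    def __init__(self, words, weights=(0.6, 0.3, 0.1)):
        self.weights = weights                       # trigram, bigram, unigram
        self.ngram = [Counter() for _ in range(3)]   # order n+1 counts
        self.hist = [Counter() for _ in range(3)]    # matching history counts
        for word in words:
            seq = ("<s>", "<s>") + tuple(word) + ("</s>",)
            for i in range(2, len(seq)):
                for n in range(3):
                    self.ngram[n][seq[i - n:i + 1]] += 1
                    self.hist[n][seq[i - n:i]] += 1

    def prob(self, h2, h1, x):
        """Interpolated p(x | h2, h1), mixing f_3, f_2 and f_1."""
        full = (h2, h1, x)
        p = 0.0
        for weight, n in zip(self.weights, (2, 1, 0)):
            den = self.hist[n][full[2 - n:2]]        # count of the history
            if den:
                p += weight * self.ngram[n][full[2 - n:]] / den
        return p

# Toy usage: probability of /k/ after /l i/ in a three-word lexicon.
lm = TrigramLM([("l", "i", "k"), ("l", "i", "t"), ("b", "i", "k")])
print(lm.prob("l", "i", "k"))
```

Unseen histories simply drop their interpolation component here; a real implementation would redistribute that mass.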

LSTM LM. We also use a long short-term memory (LSTM) recurrent language model (Hochreiter and Schmidhuber, 1997), which conditions p(x_i | x_{<i}) on the full history of phonemes rather than a fixed window, allowing it to capture word-level dependencies such as harmony.

4.1 Phoneme-Level Language Models

For the LSTM, we consider ways of composing phoneme representations beyond standard lookup embedding representations. A phoneme i will have a set of binary attributes a_i^(k) (e.g., stress, nasal), each with its corresponding embedding representation z^(k). A phoneme embedding will then be composed as the element-wise average of the lookup embeddings of each of its features:

    z_i = Σ_k a_i^(k) z^(k) / Σ_k a_i^(k)    (11)

where a_i^(j) is 1 if phoneme i presents attribute j and z^(j) is the lookup embedding of attribute j. This architecture forces similar phonemes, measured in terms of overlap in distinctive features, to have similar representations.
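A sketch of the composition in eq. (11). The attribute inventory and embedding dimension below are illustrative assumptions; the point is only that phonemes sharing attributes average over shared vectors and therefore end up close.

```python
import numpy as np

rng = np.random.default_rng(0)
ATTRS = ["consonant", "voiced", "nasal", "labial", "coronal", "front"]
# One lookup embedding z^(k) per binary attribute.
Z = {a: rng.normal(size=16) for a in ATTRS}

def phoneme_embedding(bundle):
    """Eq. (11): element-wise average of the lookup embeddings of the
    attributes the phoneme presents."""
    return np.mean([Z[a] for a in bundle], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# /m/ and /n/ share three of four toy attributes, so their vectors are similar.
m = phoneme_embedding(["consonant", "voiced", "nasal", "labial"])
n = phoneme_embedding(["consonant", "voiced", "nasal", "coronal"])
print(cosine(m, n))
```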
4.2 NorthEuraLex Data

We make use of data from the NorthEuraLex corpus (Dellert and Jäger, 2017). The corpus is a concept-aligned multi-lingual lexicon with data from 107 languages. The lexicon contains 1016 "basic" concepts. Importantly, NorthEuraLex is appealing for our study as all the words are written in a unified IPA scheme. A sample of the lexicon is provided in Tab. 2. For the results reported in this paper, we omitted Mandarin, since no tone information was included in its annotations, causing its phonotactics to be greatly underspecified. No other tonal languages were included in the corpus, so all reported results are over 106 languages.

Concept   Language         Word      IPA
black     Northern Saami   čáhppes   /tʃaahppes/
hill      Mari             töpök     /tørøk/

Table 2: Sample of the lexicon in the NorthEuraLex corpus.

Why is Base-Concept Alignment Important? Making use of data that is concept-aligned across the languages provides a certain amount of control (to the extent possible) over the influence of linguistic content on the forms that we are modeling. In other words, these forms should be largely comparable across the languages in terms of how common they are in the active vocabulary of adult speakers. Further, base concepts as defined for the collection are more likely to be lemmas without inflection, thus reducing the influence of morphological processes on the results.⁸

⁸ Most of the concepts in the dataset do not contain function words, and verbs are in the bare infinitive form (e.g., have instead of to have), although there are a few exceptions. For example, the German word hundert is represented as a hundred in English.

To test this latter assertion, we made use of the UniMorph⁹ morphological database (Kirov et al., 2018) to look up words and assess the percentage that correspond to lemmas or base forms. Of the 106 languages in our collection, 48 are also in the UniMorph database, and 46 annotate their lemmas in a way that allowed for simple string matching with our word forms. For these 46 languages, on average we found 313 words in UniMorph of the 1016 concepts (median 328). A mean of 87.2% (median 93.3%; minimum 58.6%) of these matched lemmas for that language in the UniMorph database. This rough string matching approach provides some indication that the items in the corpus are largely composed of such base forms.

⁹ https://unimorph.github.io

Dataset Limitations. Unfortunately, there is less typological diversity in our dataset than we would ordinarily desire. NorthEuraLex draws its languages from 21 distinct language families that are spoken in Europe and Asia. This excludes languages indigenous to the Americas,¹⁰ Australia, Africa and South-East Asia. While lamentable, we know of no other concept-aligned lexicon with broader typological diversity that is written in a unified phonetic alphabet, so we must save studies of a more typologically diverse set of languages for future work.

¹⁰ Inuit languages, which are genetically related to the languages of Siberia, are included in the lexicon.

In addition, we note that the process of base concept selection and identification of corresponding forms from each language (detailed in Dellert, 2015, 2017) was non-trivial, and some of the corpus design decisions may have resulted in somewhat biased samples in some languages. For example, there was an attempt to minimize the frequency of loanwords in the dataset, which may make the lexicons of loanword-heavy languages, such as English with its extensive Latinate vocabulary, somewhat less representative of everyday use than in other languages.
One computa- 5 Results tional notion of complexity might say that the com- plexity of the phonology is equal to the number 5.1 Study 1: Bits Per Phoneme Negatively of states required to encode the transduction from Correlates with Word Length an underlying form to a surface form in a minimal As stated earlier, Pellegrino et al.(2011) investi- finite-state transduction. Note that all SPE-style gated a complexity trade-off with the information rules may be so encoded (Kaplan and Kay, 1994). density of speech. From a 7-language study they Thus, the complexity of the phonotactics could be found a strong correlation (R = −0.94) between said to be related to the number of SPE-style rules the information density and the syllabic complexity that operate. In contrast, under our metric, any of a language. One hypothesis adduced to explain process that constrains the number of possibilities these findings is that, for functional reasons, the will, inherently, reduce complexity. The studies in rate of linguistic information is very similar cross- §5.3 allow us to examine the magnitude of such a linguistically. Inspired by their study, we conduct reduction, and validate our models with respect to a similar study with our phonotactic setup. We this expected behavior. hypothesize that the bits per phoneme for a given We create two artificial datasets without final- concept correlates with the number of phonemes obstruent devoicing based on the German and in the word. Moreover, the bits per word should be Dutch portions of NorthEurLex. We reverse the similar across languages. 3.8 itl Model fra rus eng LSTM 3.6 niv dan por ket sqi lezmrjbreronpolhye pbu ben trigram niv bsknor che hun xal liv oss hin bsktat catkpvglecymkoivepddobul 3.4 heb baknld sweazj ceskoroloukr kat ket deuisl slkbel abkitldar fra sqichv lezsahkazruskmruznarb hrvsmn eng checymkhkudmronyuxpolhye fin danmhrkoiddo porekkadyturpbueuskrloss slvbua sms tel 3.2 hunbul xal spakor benkan gldykglitlbe ckt hebtat kpvenfvep mrjbre slk dar lavkat lat evn chv catgle yuxain mdfbelellces selsjd kaz kcaarbmyv ukr nio pessma nor bakudmsahnldainlivkmrazjuzn hrv yrkita sme avasmj mns 3.0 enfmhr isl eusspaolo ess khk deuekk ellfin hin lbe tam swejpnturmyv smnkan ykg mal mdfkrl abk mncniobuagld tel kca sel lav slv ale ckt 2.8 jpn sjd yrkita lit evn kal latsmsava pes mns tam sme ess 2.6 smj mal Cross Entropy (bits per phoneme) ale mnc sma kal 2.4

5 Results

5.1 Study 1: Bits Per Phoneme Negatively Correlates with Word Length

As stated earlier, Pellegrino et al. (2011) investigated a complexity trade-off with the information density of speech. From a 7-language study they found a strong correlation (R = −0.94) between the information density and the syllabic complexity of a language. One hypothesis adduced to explain these findings is that, for functional reasons, the rate of linguistic information is very similar cross-linguistically. Inspired by their study, we conduct a similar study with our phonotactic setup. We hypothesize that the bits per phoneme for a given concept correlate with the number of phonemes in the word. Moreover, the bits per word should be similar across languages.

[Figure 2: Per-phoneme complexity vs average word length under both a trigram and an LSTM language model. Each point is one of the 106 languages, labeled with its language code; x-axis: Average Length (# IPA tokens); y-axis: Cross Entropy (bits per phoneme).]

[Figure 3: Conventional measures of phonological complexity vs average word length (# IPA tokens). These complexity measures are based on inventory size (legend: consonant, vowel, and # unique inventory; y-axis: Inventory Size (# Phonemes)).]

[Figure 4: Kernel density estimate (KDE) and histogram densities (10 and 100 bins) of the average phonotactic complexity per word (bits per word) across 106 different languages. Different languages tend to present similar complexities (bits per word).]

We consider the relation between the average bits per phoneme of a held-out portion of a language's lexicon, as measured by our best language model, and the average length of the words in that language. We present the results in Figures 2 and 3 and in Tab. 3. We find a strong correlation under the LSTM LM (Spearman's ρ = −0.744 with p < 10⁻¹⁹). At the same time, we only see a weak correlation under conventional measures of phonotactic complexity, such as vowel inventory size (Spearman's ρ = −0.162 with p = 0.098). In Fig. 4, we plot the kernel density estimate and histogram densities (both 10 and 100 bins) of word-level complexity (bits per word).

Measure            Pearson r   Spearman ρ
Number of:
  phonemes           -0.047      -0.054
  vowels             -0.164      -0.162
  consonants          0.030       0.045
Bits/phoneme:
  unigram            -0.217      -0.222
  trigram            -0.682      -0.672
  LSTM               -0.762      -0.744

Table 3: Pearson and Spearman rank correlation coefficients between complexity measures and average word length in phoneme segments.
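The numbers in Tab. 3 are ordinary correlation coefficients over 106 per-language pairs of (average word length, complexity measure). A sketch of the computation with SciPy, assuming the per-language averages are already in hand (the function name is ours):

```python
from scipy.stats import pearsonr, spearmanr

def length_tradeoff(avg_lengths, avg_bits_per_phoneme):
    """Correlate per-language average word length with per-language
    bits per phoneme: one (x, y) pair per language."""
    r, p_r = pearsonr(avg_lengths, avg_bits_per_phoneme)
    rho, p_rho = spearmanr(avg_lengths, avg_bits_per_phoneme)
    return {"pearson_r": r, "spearman_rho": rho,
            "p_pearson": p_r, "p_spearman": p_rho}

# Toy usage with three made-up languages:
print(length_tradeoff([4.8, 6.1, 7.9], [3.4, 3.0, 2.6]))
```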
5.2 Study 2: Possible Confounds for Negative Correlations

One possible confound for these results is that phonemes later in a word may in general have higher probability given the previous phonemes than those earlier in the string. This sort of positional effect was demonstrated in Dutch (van Son and Pols, 2003), where position in the word accounted for much of the variance in segmental information.¹¹ To ensure that we are not simply replicating such a positional effect across many languages, we performed several additional analyses.

¹¹ We briefly note that the van Son and Pols (2003) study did not make use of a train/dev/test split of their data, but rather simply analyzed raw relative frequency over their Dutch corpus. As a result, all positions beyond any word onset that is unique in their corpus would have probability one, leading to a more extreme position effect than we would observe using regularization and validating on unseen forms.

Truncated words. First, we calculated the bits-per-phoneme for just the first three positions in the word, and then looked at the correlation between this word-onset bits-per-phoneme and the average (full) word length in phoneme segments. In other words, for the purpose of calculating bits-per-phoneme, we truncated all words to a maximum of three phonemes, and in such a way explicitly eliminated the contribution of positions later in any word. Using the LSTM model, this yielded a Spearman correlation of ρ = −0.469 (p < 10⁻⁷), in contrast to ρ = −0.744 without such truncation (reported in Tab. 3). This suggests that there is a contribution of later positions to the effect presented in Tab. 3 that we lose by eliding them, but that even in the earlier positions of the word we are seeing a trade-off with full average word length.

Correlation with phoneme position. We next looked to measure a position effect directly, by calculating the correlation between word position and bits for that position across all languages. Here we find a Spearman correlation of ρ = −0.429 (p < 10⁻²⁰⁰), which again supports the contention that later positions in general require fewer bits to encode. Nonetheless, this correlation is still weaker than the per-language word length one (of ρ = −0.744).

Per-word correlations. We also calculated the correlation between word length and bits per phoneme across all languages (without averaging per language here). The Spearman correlation between these factors—at the word level using all languages—is ρ = −0.312 (p < 10⁻¹⁹). Analyzing each language individually, there is an average Spearman's ρ = −0.257 (p < 10⁻¹⁹) between bits per phoneme and word length. The minimum negative (i.e., highest magnitude) correlation of any language in the set is ρ = −0.607. These per-word correlations are reported in the upper half of Tab. 4.

Measure                Pearson r   Spearman ρ
Per Word:
  all languages          -0.269      -0.312
  each language (avg)    -0.220      -0.257
  each language (min)    -0.561      -0.607
Per Language:
  Fake (avg)             -0.270      -0.254
  Fake (min)             -0.586      -0.568
  Real                   -0.762      -0.744

Table 4: Pearson and Spearman rank correlation coefficients between complexity measures and word length in phoneme segments. All correlations are statistically significant with p < 10⁻⁸.

Permuted 'language' correlations. Finally, to determine whether our language effects perhaps arise due to the averaging of word lengths and bits per phoneme for each language, we ran a permutation test on languages. We shuffle words (with their pre-calculated bits-per-phoneme values) into 106 sets with the same sizes as the original languages, thus creating fake 'languages'. We take the average word length and bits per phoneme in each of these fake languages and compare the correlation—returning to the 'language' level this time—with the original correlation. After running this test for 10⁴ permutations, we found no shuffled set with an equal or higher Spearman (or Pearson) correlation than the real set. Thus, with strong confidence (p < 10⁻⁴) we can state that there is a language-level effect. Average and minimum negative correlations for these 'fake' languages (as well as the real set, for ease of comparison) are presented in the lower half of Tab. 4.
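A sketch of that permutation test. Each 'language' is a list of (word length, bits-per-phoneme) pairs; using |ρ| as the extremity criterion is a two-sided simplification of the "equal or higher correlation" comparison described above.

```python
import random
from scipy.stats import spearmanr

def language_level_rho(groups):
    """Spearman correlation between per-group average word length and
    per-group average bits per phoneme (one point per 'language')."""
    avg_len = [sum(l for l, _ in g) / len(g) for g in groups]
    avg_bpp = [sum(b for _, b in g) / len(g) for g in groups]
    rho, _ = spearmanr(avg_len, avg_bpp)
    return rho

def permutation_test(groups, n_perm=10_000, seed=0):
    """Shuffle all word-level pairs into fake 'languages' of the original
    sizes; count shuffles whose language-level |rho| matches or beats the
    real one."""
    rng = random.Random(seed)
    real_rho = language_level_rho(groups)
    sizes = [len(g) for g in groups]
    pool = [w for g in groups for w in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        fake, i = [], 0
        for s in sizes:
            fake.append(pool[i:i + s])
            i += s
        if abs(language_level_rho(fake)) >= abs(real_rho):
            hits += 1
    return (hits + 1) / (n_perm + 1)  # smoothed permutation p-value
```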

5.3 Study 3: Constraining Languages Reduces Phonotactic Complexity

Final-obstruent devoicing and vowel harmony reduce the number of licit syllables in a language, hence reducing the entropy. To determine the magnitude that such effects can have on the measure for our different model types, we conduct two studies. In the first, we remove final-obstruent devoicing from the German and Dutch languages in NorthEuraLex, as discussed in §4.3. In the second study, we remove vowel harmony from 10 languages that have it,¹² as also explained in §4.3.

¹² The languages with vowel harmony are: bua, ckt, evn, fin, hun, khk, mhr, mnc, myv, tel, and tur.

After deriving two artificial languages without obstruent devoicing from both German and Dutch, we used 10-fold cross-validation to train models for each language. The statistical relevance of differences between normal and artificial languages was analyzed using paired permutation tests between the pairs. Results are presented in Tab. 5. We see that the n-gram model can capture this change in complexity for Dutch, but not for German. At the same time, the LSTM shows a statistically significant increase of ≈ 0.034 bits per phoneme when we remove obstruent devoicing from both languages.

               Complexity
Model       Orig     Art      Diff
trigram:
  German    3.703    3.708    0.005 (0.13%)
  Dutch     3.607    3.629    0.022 (0.58%)†
LSTM:
  German    3.230    3.268    0.038 (1.18%)†
  Dutch     3.161    3.191    0.030 (0.95%)†

Table 5: Complexities for original and artificial languages when removing final-obstruent devoicing. † represents a statistically significant difference with p < 0.05.

Fig. 5 presents a similar impact on complexity from vowel harmony removal, as evidenced by the fact that all points fall above the equality line. Average complexity increased by ≈ 0.62 bits per phoneme (an approximate 16% entropy increase), as measured by our LSTM models. In both of these artificial language scenarios, the LSTM models appeared more sensitive to the constraint removal, as expected.

[Figure 5: Complexities for natural and artificial languages when removing vowel harmony, under both a trigram and an LSTM model; x-axis: Original Language; y-axis: Artificial Language (both in bits per phoneme). A paired permutation test showed all differences are statistically significant with p < 0.01.]

5.4 Study 4: Negative Trade-off Persists Within and Across Families

Moran and Blasi (2014) investigated the correlation between the number of phonological units in a language and its average word length across a large and varied set of languages. They found that, while these measures of phonotactic complexity (number of vowels, consonants or phonemes in a language) are correlated with word length when measured across a varied set of languages, such a correlation usually does not hold within language families. We hypothesize that this is due to their measures being rather coarse approximations of phonotactic complexity, so that only large changes in the language would show significant correlation given the noise. We also hypothesize that our complexity measure is less noisy, hence should be able to yield significant correlations both within and across families.

Results in Tab. 3 show a strong correlation for the LSTM measure, while they show a weak one for conventional measures of complexity. As stated before, Moran and Blasi (2014) found that vowel inventory size shows a strong correlation to word length on a diverse set of languages, but, as mentioned in §4.2, our dataset is more limited than desired. To test if we can mitigate this effect, we average the complexity measures and word length per family (instead of per language) and calculate the same correlations again. These results are presented in Tab. 6 and show that when we average these complexity measures per family we indeed find a stronger correlation between vowel inventory size and average word length, although with a higher null hypothesis probability (Spearman's ρ = −0.367 with p = 0.111). We also see that our LSTM-based measure still shows a strong correlation (Spearman's ρ = −0.526 with p = 0.017).

Measure            Pearson r   Spearman ρ
Number of:
  phonemes           -0.214      -0.095
  vowels             -0.383      -0.367
  consonants         -0.147      -0.092
Bits/phoneme:
  unigram            -0.267      -0.232
  trigram            -0.621      -0.520
  LSTM               -0.778      -0.526

Table 6: Pearson and Spearman correlation between complexity measures and word length in phoneme segments averaged across language families.

We now analyze these correlations intra families, for all language families in our dataset with at least 4 languages. These results are presented in Tab. 7. Our LSTM-based phonotactic complexity measure shows strong intra-family correlation with average word length for all five analyzed language families (−0.662 ≥ ρ ≥ −1.0 with p < 0.1). At the same time, vowel inventory size only shows a negative statistically significant correlation within Turkic.

                       Spearman ρ
Family              LSTM      Vowels    # Langs
Dravidian           -1.0*     -0.894       4
Indo-European       -0.662*   -0.218      37
Nakh-Daghestanian   -0.771†   -0.530       6
Turkic              -0.690†   -0.773†      8
Uralic              -0.874*    0.363†     26

* Statistically significant with p < 0.01. † Statistically significant with p < 0.1.

Table 7: Spearman correlation between complexity measures and average word length per language family. Phonotactic complexity in bits per phoneme presents very strong intra-family correlation with word length in three of the five families. Size of vowel inventory presents intra-family correlation in Turkic and Uralic.

5.5 Study 5: Explicit feature representations do not generally improve models

Tab. 3 presents strong correlations when using an LSTM with standard one-hot lookup embeddings. Here we train LSTMs with three different phoneme embedding models: (1) a typical Lookup embedding, in which each phoneme has an associated embedding; (2) a phoneme-feature-based embedding (Phoneme), as explained in §4.1; and (3) the concatenation of the Lookup and the Phoneme embeddings. We also train these models both using independent models for each language, and with independent models that share embedding weights across languages.

We first analyze these model variants under the same lens as used in Study 1. Tab. 8 shows the correlations between the complexity measure resulting from each of these models and the average number of phonemes in a word. We find strong correlations for all of them (−0.740 ≥ ρ ≥ −0.752 with p < 10⁻¹⁸). We also present in Tab. 8 these models' cross-entropy, averaged across all languages. At least for the methods that we are using here, we derived no benefit from either more explicit featural representations of the phonemes or from sharing the embeddings across languages.

Model                    Complexity   Spearman ρ
n-Grams:
  unigram                  4.477        -0.222
  trigram                  3.270        -0.672
Independent Embeddings:
  Lookup                   2.976        -0.744
  Phoneme                  2.992        -0.741
  Lookup + Phoneme         2.975        -0.752
Shared Embeddings:
  Lookup                   2.988        -0.743
  Phoneme                  2.977        -0.744
  Lookup + Phoneme         2.982        -0.740

Table 8: Average cross-entropy across all languages and the correlation between complexity and average word length for different models.

We also investigated scenarios using less training data, and it was only in very sparse scenarios (e.g., using just 10% of the training data used in our standard trials, or 81 example words) where we observed even a small benefit from explicit feature representations and shared embeddings.

6 Conclusion

We have presented methods for calculating a well-motivated measure of phonotactic complexity: bits per phoneme. This measure is derived from information theory and its value is calculated using the probability distribution of a language model. We demonstrate that cross-linguistic comparison is straightforward using such a measure, and find a strong negative correlation with average word length. This trade-off with word length can be seen as an example of complexity compensation, or perhaps related to communicative capacity.

Acknowledgments

We thank Damián E. Blasi for his feedback on previous versions of this paper and the anonymous reviewers, as well as action editor Eric Fosler-Lussier, for their constructive and detailed comments—the paper is much improved as a result.

References

Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

E. Colin Cherry, Morris Halle, and Roman Jakobson. 1953. Toward the logical description of languages in their phonemic aspect. Language, pages 34–46.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1(2):97–138.

Jeffry A. Coady and Richard N. Aslin. 2003. Phonological neighbourhoods in the developing lexicon. Journal of Child Language, 30(2):441–469.

Jeffry A. Coady and Richard N. Aslin. 2004. Young children's sensitivity to probabilistic phonotactics in the developing lexicon. Journal of Experimental Child Psychology, 89(3):183–213.

Ryan Cotterell and Jason Eisner. 2017. Probabilistic typology: Deep generative models of vowel inventories. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1182–1192, Vancouver, Canada. Association for Computational Linguistics.

Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Christophe Coupé, Yoon Oh, Dan Dediu, and François Pellegrino. 2019. Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9):eaaw2594.

Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T. Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145.

Johannes Dellert. 2015. Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of northern Eurasia. In First International Workshop on Computational Linguistics for Uralic Languages.

Johannes Dellert. 2017. Information-Theoretic Causal Inference of Lexical Flow. Ph.D. thesis, University of Tübingen.

Johannes Dellert and Gerhard Jäger. 2017. NorthEuraLex (version 0.9).

Richard Futrell, Adam Albright, Peter Graff, and Timothy J. O'Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.

Matthew Goldrick and Meredith Larson. 2008. Phonotactic probability influences speech production. Cognition, 107(3):1155–1164.

Sharon Goldwater and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Stockholm Workshop on Variation within Optimality Theory.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.

Kyle Gorman. 2013. Generative phonotactics. Ph.D. thesis, University of Pennsylvania.

Acknowledgments

We thank Damián E. Blasi for his feedback on previous versions of this paper, and the anonymous reviewers, as well as action editor Eric Fosler-Lussier, for their constructive and detailed comments; the paper is much improved as a result.

References

Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

E. Colin Cherry, Morris Halle, and Roman Jakobson. 1953. Toward the logical description of languages in their phonemic aspect. Language, pages 34–46.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1(2):97–138.

Jeffry A. Coady and Richard N. Aslin. 2003. Phonological neighbourhoods in the developing lexicon. Journal of Child Language, 30(2):441–469.

Jeffry A. Coady and Richard N. Aslin. 2004. Young children's sensitivity to probabilistic phonotactics in the developing lexicon. Journal of Experimental Child Psychology, 89(3):183–213.

Ryan Cotterell and Jason Eisner. 2017. Probabilistic typology: Deep generative models of vowel inventories. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1182–1192, Vancouver, Canada. Association for Computational Linguistics.

Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Christophe Coupé, Yoon Oh, Dan Dediu, and François Pellegrino. 2019. Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9):eaaw2594.

Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T. Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145.

Johannes Dellert. 2015. Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of northern Eurasia. In First International Workshop on Computational Linguistics for Uralic Languages.

Johannes Dellert. 2017. Information-Theoretic Causal Inference of Lexical Flow. Ph.D. thesis, University of Tübingen.

Johannes Dellert and Gerhard Jäger. 2017. NorthEuraLex (version 0.9).

Richard Futrell, Adam Albright, Peter Graff, and Timothy J. O'Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.

Matthew Goldrick and Meredith Larson. 2008. Phonotactic probability influences speech production. Cognition, 107(3):1155–1164.

Sharon Goldwater and Mark Johnson. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Stockholm Workshop on Variation within Optimality Theory.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.

Kyle Gorman. 2013. Generative phonotactics. Ph.D. thesis, University of Pennsylvania.

Joseph Greenberg. 1966. Language universals, with special reference to feature hierarchies. Mouton, The Hague.

Joseph H. Greenberg, Charles A. Ferguson, and Edith Moravcsik, editors. 1978. Universals of human language. Vol. 2: Phonology. Stanford University Press.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pages 1–8.

Kathleen Currie Hall, Elizabeth Hume, T. Florian Jaeger, and Andrew Wedel. 2018. The role of predictability in shaping phonological patterns. Linguistics Vanguard, 4(s2).

Morris Halle. 1959. The sound pattern of Russian. Mouton, The Hague.

Morris Halle. 1978. Knowledge unlearned and untaught: What speakers know about the sounds of their language. In M. Halle, J. Bresnan, and G. Miller, editors, Linguistic Theory and Psychological Reality, pages 294–303. The MIT Press, Cambridge, MA.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Charles Francis Hockett. 1955. A manual of phonology. Waverly Press, Baltimore, MD.

Charles Francis Hockett. 1958. A course in modern linguistics. Macmillan, New York.

Elizabeth Hume and Frédéric Mailhot. 2013. The role of entropy and surprisal in phonologization and language change. In Alan C.L. Yu, editor, Origins of sound change: Approaches to phonologization, pages 29–47. Oxford University Press, Oxford.

Hemant Ishwaran and Lancelot F. James. 2003. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, pages 1211–1235.

Gerhard Jäger. 2007. Maximum entropy models and stochastic optimality theory. In Architectures, Rules, and Preferences: Variations on Themes by Joan W. Bresnan, pages 467–479. CSLI, Stanford.

Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294, Melbourne, Australia. Association for Computational Linguistics.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian J. Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the 11th Language Resources and Evaluation Conference.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.

Björn Lindblom and Ian Maddieson. 1988. Phonetic universals in consonant systems. Language, Speech, and Mind, pages 62–78.

Ian Maddieson. 2006. Correlating phonological complexity: Data and validation. Linguistic Typology, 10(1):106–123.

Ian Maddieson. 2009. Calculating phonological complexity. In François Pellegrino, Ioana Chitoran, Egidio Marsico, and Christophe Coupé, editors, Approaches to phonological complexity, pages 85–110. Mouton de Gruyter, Berlin, Germany.

Ian Maddieson and Sandra Ferrari Disner. 1984. Patterns of sounds. Cambridge University Press.

Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven T. Piantadosi. 2018. Word forms are structured for efficient use. Cognitive Science, 42(8):3116–3134.

André Martinet. 1955. Économie des changements phonétiques. Éditions A. Francke S. A.

John McWhorter. 2001. The world's simplest grammars are creole grammars. Linguistic Typology, 5(2):125–166.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Sebastian J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Matti Miestamo. 2006. On the feasibility of complexity metrics. In FinEst Linguistics, Proceedings of the Annual Finnish and Estonian Conference of Linguistics, pages 11–26.

Matti Miestamo. 2008. Grammatical complexity in a cross-linguistic perspective. In Matti Miestamo, Kaius Sinnemäki, and Fred Karlsson, editors, Language complexity: Typology, contact, change, pages 23–41. John Benjamins, Amsterdam, The Netherlands.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Steven Moran and Damián Blasi. 2014. Cross-linguistic comparison of complexity measures in phonological systems. In Frederick J. Newmeyer and Laurel B. Preston, editors, Measuring grammatical complexity, pages 217–240. Oxford University Press, Oxford, UK.

Steven Moran, Daniel McCloy, and Richard Wright, editors. 2014. PHOIBLE Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Daniel Nettle. 1995. Segmental inventory size, word length, and communicative efficiency. Linguistics, 33:359–367.

François Pellegrino, Ioana Chitoran, Egidio Marsico, and Christophe Coupé. 2011. A cross-language perspective on speech information rate. Language, 87(3):539–558.

Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2011. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences, 108(9):3526–3529.

Steven T. Piantadosi, Harry Tily, and Edward Gibson. 2012. The communicative function of ambiguity in language. Cognition, 122(3):280–291.

Steven T. Piantadosi, Harry J. Tily, and Edward Gibson. 2009. The communicative lexicon hypothesis. In The 31st Annual Meeting of the Cognitive Science Society (CogSci09), pages 2582–2587.

Uriel Cohen Priva and T. Florian Jaeger. 2018. The interdependence of frequency, predictability, and informativity in the segmental domain. Linguistics Vanguard, 4(s2).

Ryan K. Shosted. 2006. Correlating complexity: A typological approach. Linguistic Typology, 10(1):1–40.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

R.J.J.H. van Son and Louis C.W. Pols. 2003. Information structure and efficiency in speech production. In Eighth European Conference on Speech Communication and Technology (Eurospeech).

Richard Stanley. 1967. Redundancy rules in phonology. Language, pages 393–436.

Holly L. Storkel. 2001. Learning new words: Phonotactic probability in language development. Journal of Speech, Language, and Hearing Research, 44(6):1321–1337.

Holly L. Storkel. 2003. Learning new words II: Phonotactic probability in verb learning. Journal of Speech, Language, and Hearing Research, 46(6):1312–1323.

Holly L. Storkel, Jonna Armbrüster, and Tiffany P. Hogan. 2006. Differentiating phonotactic probability and neighborhood density in adult word learning. Journal of Speech, Language, and Hearing Research, 49(6):1175–1192.

Holly L. Storkel and Jill R. Hoover. 2010. An online calculator to compute phonotactic probability and neighborhood density on the basis of child corpora of spoken American English. Behavior Research Methods, 42(2):497–506.

Holly L. Storkel and Su-Yeon Lee. 2011. The independent effects of phonotactic probability and neighbourhood density on lexical acquisition by preschool children. Language and Cognitive Processes, 26(2):191–211.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21(2):121–137.

Nikolaï Sergeyevich Trubetzkoy. 1938. Grundzüge der Phonologie. Vandenhoeck & Ruprecht, Göttingen, Germany.

Michael S. Vitevitch and Paul A. Luce. 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40(3):374–408.

George Kingsley Zipf. 1935. The psycho-biology of language: An introduction to dynamic philology. MIT Press, Cambridge.