
A Generative Model of Phonotactics

Richard Futrell, Brain and Cognitive Sciences, Massachusetts Institute of Technology ([email protected])
Adam Albright, Department of Linguistics, Massachusetts Institute of Technology ([email protected])

Peter Graff, Intel Corporation ([email protected])
Timothy J. O'Donnell, Department of Linguistics, McGill University ([email protected])

Transactions of the Association for Computational Linguistics, vol. 5, pp. 73–86, 2017. Action Editor: Eric Fosler-Lussier. Submission batch: 8/2016; Revision batch: 11/2016; Published 2/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

We present a probabilistic model of phonotactics, the set of well-formed phoneme sequences in a language. Unlike most computational models of phonotactics (Hayes and Wilson, 2008; Goldsmith and Riggle, 2012), we take a fully generative approach, modeling a process where forms are built up out of subparts by phonologically-informed structure building operations. We learn an inventory of subparts by applying stochastic memoization (Johnson et al., 2007; Goodman et al., 2008) to a generative process for phonemes structured as an and-or graph, based on concepts of feature hierarchy from generative phonology (Clements, 1985; Dresher, 2009). Subparts are combined in a way that allows tier-based feature interactions. We evaluate our models' ability to capture phonotactic distributions in the lexicons of 14 languages drawn from the WOLEX corpus (Graff, 2012). Our full model robustly assigns higher probabilities to held-out forms than a sophisticated N-gram model for all languages. We also present novel analyses that probe model behavior in more detail.

1 Introduction

People have systematic intuitions about which sequences of sounds would constitute likely or unlikely words in their language: Although blick is not an English word, it sounds like it could be, while bnick does not (Chomsky and Halle, 1965). Such intuitions reveal that speakers are aware of the restrictions on sound sequences which can make up possible morphemes in their language—the phonotactics of the language. Phonotactic restrictions mean that each language uses only a subset of the logically, or even articulatorily, possible strings of phonemes. Admissible phoneme combinations, on the other hand, typically recur in multiple morphemes, leading to redundancy.

It is widely accepted that phonotactic judgments may be gradient: the nonsense word blick is better as a hypothetical English word than bwick, which is better than bnick (Hayes and Wilson, 2008; Albright, 2009; Daland et al., 2011). To account for such graded judgements, there have been a variety of probabilistic (or, more generally, weighted) models proposed to handle phonotactic learning and generalization over the last two decades (see Daland et al. (2011) and below for review). However, inspired by optimality-theoretic approaches to phonology, the most linguistically informed and successful such models have been constraint-based—formulating the problem of phonotactic generalization in terms of restrictions that penalize illicit combinations of sounds (e.g., ruling out *bn-).

In this paper, by contrast, we adopt a generative approach to modeling phonotactic structure. Our approach harkens back to early work on the sound structure of lexical items which made use of morpheme structure rules or conditions (Halle, 1959; Stanley, 1967; Booij, 2011; Rasin and Katzir, 2014).


Such approaches explicitly attempted to model the redundancy within the set of allowable lexical forms in a language. We adopt a probabilistic version of this idea, conceiving of the phonotactic system as the component of the linguistic system which generates the phonological form of lexical items such as words and morphemes.[1] Our system learns inventories of reusable phonotactically licit structures from existing lexical items, and assembles new lexical items by combining these learned phonotactic patterns using phonologically plausible structure-building operations. Thus, instead of modeling phonotactic generalizations in terms of constraints, we treat the problem as one of learning language-specific inventories of phonological units and language-specific biases on how these phones are likely to be combined.

[1] Ultimately, we conceive of phonotactics as the module of phonology which generates the underlying forms of lexical items, which are then subject to phonological transformations (i.e., transductions). In this work, however, we do not attempt to model transformations from underlying to surface forms.

Although there have been a number of earlier generative models of phonotactic structure (see Section 4), these models have mostly used relatively simplistic or phonologically implausible representations of phones and phonological structure-building. By contrast, our model is built around three representational assumptions inspired by the generative phonology literature. First, we capture sparsity in the space of feature-specifications of phonemes by using feature dependency graphs—an idea inspired by work on feature geometries and the contrastive hierarchy (Clements, 1985; Dresher, 2009). Second, our system can represent phonotactic generalizations not only at the level of fully specified segments, but also allows the storage and reuse of subsegments, inspired by the autosegments and class nodes of autosegmental phonology. Finally, also inspired by autosegmental phonology, we make use of a structure-building operation which is sensitive to tier-based contextual structure.

To model phonotactic learning, we make use of tools from Bayesian nonparametric statistics. In particular, we make use of the notion of lexical memoization (Michie, 1968; Goodman et al., 2008; Wood et al., 2009; O'Donnell, 2015)—the idea that language-specific generalizations can be captured by the storage and reuse of frequent patterns from a linguistically universal inventory. In our case, this amounts to the idea that an inventory of segments and subsegments can be acquired by a learner that stores and reuses commonly occurring segments in particular, phonologically relevant contexts. In short, we view the problem of learning the phoneme inventory as one of concentrating probability mass on the segments which have been observed before, and the problem of phonotactic generalization as learning which (sub-)segments are likely in particular tier-based phonological contexts.

2 Model Motivations

In this section, we give an overview of how our model works and discuss the phenomena and theoretical ideas that motivate it.

2.1 Feature Dependency Graphs

Most formal models of phonology posit that segments are grouped into sets, known as natural classes, that are characterized by shared articulatory and acoustic properties, or phonological features (Trubetzkoy, 1939; Jakobson et al., 1952; Chomsky and Halle, 1968). For example, the segments /n/ and /m/ are classified with a positive value of a nasality feature (i.e., NASALITY:+). Similarly, /m/ and /p/ can be classified using the labial value of a PLACE feature, PLACE:labial. These features allow compact description of many phonotactic generalizations.[2]

[2] For compatibility with the data sources used in evaluation (Section 5.2), the feature system we use here departs in several ways from standard feature sets: (1) We use multivalent rather than binary-valued features. (2) We represent manner with a single feature, which has values such as vocalic, stop, and fricative. This approach allows us to refer to manners more compactly than in systems that employ combinations of features such as sonorant, continuant, and consonantal. For example, rather than referring to vowels as 'non-syllabic', we refer to them using the feature value vocalic for the feature MANNER.

From a probabilistic structure-building perspective, we need to specify a generative procedure which assembles segments out of parts defined in terms of these features. In this section, we will build up such a procedure starting from the simplest possible procedure and progressing towards one which is more phonologically informed.
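To make this representation concrete, here is a minimal Python sketch of segments as bundles of multivalent feature-value pairs, with natural classes picked out by shared values. It is our own illustration rather than code from the paper, and the particular segments and feature specifications shown are assumptions for exposition only.

```python
# Segments as bundles of multivalent feature-value pairs (cf. footnote 2);
# these specifications are illustrative, not the actual WOLEX feature set.
SEGMENTS = {
    "m": {"NASALITY": "+", "PLACE": "labial"},
    "n": {"NASALITY": "+", "PLACE": "alveolar"},
    "p": {"NASALITY": "-", "PLACE": "labial", "MANNER": "stop"},
    "t": {"NASALITY": "-", "PLACE": "alveolar", "MANNER": "stop"},
    "a": {"MANNER": "vocalic", "HEIGHT": "low"},
}

def natural_class(feature, value):
    """Return the set of segments sharing a given feature value."""
    return {seg for seg, feats in SEGMENTS.items() if feats.get(feature) == value}

print(natural_class("NASALITY", "+"))    # the nasals: {'m', 'n'}
print(natural_class("PLACE", "labial"))  # the labials: {'m', 'p'}
```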


We will clarify the generative process here using an analogy to PCFGs, but this analogy will break down in later sections.

The simplest procedure for generating a segment from features is to specify each feature independently. For example, consider the set of feature-value pairs for /t/: {NASALITY:-, PLACE:alveolar, ...}. In a naive generative procedure, one could generate an instance of /t/ by independently choosing values for each feature in the set {NASALITY, PLACE, ...}. We express this process using the and-or graph notation below. Box-shaped nodes—called or-nodes—represent features such as NASALITY, while circular nodes represent groups of features whose values are chosen independently and are called and-nodes.

[Diagram: an and-node dominating the or-nodes NASALITY, ..., PLACE.]

This generative procedure is equivalent (ignoring order) to a PCFG with rules:

SEGMENT → NASALITY ... PLACE
NASALITY → +
NASALITY → -
PLACE → bilabial
PLACE → alveolar
...

Not all combinations of feature-value pairs correspond to possible phonemes. For example, while /l/ is distinguished from other consonants by the feature LATERAL, it is incoherent to specify vowels as LATERAL. In order to concentrate probability mass on real segments, our process should optimally assign zero probability mass to these incoherent phonemes. We can avoid specifying a LATERAL feature for vowels by structuring the generative process as below, so that the LATERAL or-node is only reached for consonants:

[Diagram: an or-node VOCALIC with arcs labeled consonant and vowel leading to and-nodes A and B; A dominates LATERAL, ...; B dominates HEIGHT, ....]

Beyond generating well-formed phonemes, a basic requirement of a model of phonotactics is that it concentrates mass only on the segments in a particular language's segment inventory. For example, the model of English phonotactics should put zero or nominal mass on any sequence containing the segment /x/, although this is a logically possible phoneme. So our generative procedure for a phoneme must be able to learn to generate only the licit segments of a language, given some probability distributions at the and- and or-nodes. For this task, independently sampling values at and-nodes does not give us a way to rule out particular combinations of features such as those forming /x/.

Our approach to this problem uses the idea of stochastic memoization (or adaptation), in which the results of certain computations are stored and may be probabilistically reused "as wholes," rather than recomputed from scratch (Michie, 1968; Goodman et al., 2008). This technique has been applied to the problem of learning lexical items at various levels of linguistic structure (de Marcken, 1996; Johnson et al., 2007; Goldwater, 2006; O'Donnell, 2015). Given our model so far, applying stochastic memoization is equivalent to specifying an adaptor grammar over the PCFGs described so far.

Let f be a stochastic function which samples feature values using the and-or graph representation described above. We apply stochastic memoization to each node. Following Johnson et al. (2007) and Goodman et al. (2008), we use a distribution for probabilistic memoization known as the Dirichlet Process (DP) (Ferguson, 1973; Sethuraman, 1994). Let mem{f} be a DP-memoized version of f. The behavior of a DP-memoized function can be described as follows. The first time we invoke mem{f}, the feature specification of a new segment will be sampled using f. On subsequent invocations, we either choose a value from among the set of previously sampled values (a memo draw), or we draw a new value from f (a base draw). The probability of sampling the i-th old value in a memo draw is n_i / (N + θ), where N is the number of tokens sampled so far, n_i is the number of times that value i has been used in the past, and θ > 0 is a parameter of the model. A base draw happens with probability θ / (N + θ). This process induces a bias to reuse items from f which have been frequently generated in the past.

We apply mem recursively to the sampling procedure for each node in the feature dependency graph.
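The memo-draw/base-draw dynamic can be sketched directly in code. The following is our own minimal illustration of DP memoization of a stochastic function, not the authors' implementation; it collapses the full Chinese Restaurant Process bookkeeping used for inference into simple per-value counts.

```python
import random
from collections import Counter

class DPMem:
    """A DP-memoized version of a stochastic function f, i.e. mem{f}."""

    def __init__(self, f, theta=1.0):
        self.f = f               # base sampling procedure
        self.theta = theta       # concentration parameter, theta > 0
        self.counts = Counter()  # n_i: how often each value has been reused
        self.total = 0           # N: total number of tokens sampled so far

    def sample(self):
        # Memo draw with probability N / (N + theta): reuse an old value,
        # choosing value i with probability proportional to n_i.
        if self.total > 0 and random.random() < self.total / (self.total + self.theta):
            value = random.choices(list(self.counts),
                                   weights=list(self.counts.values()))[0]
        else:
            # Base draw with probability theta / (N + theta): recompute from f.
            value = self.f()
        self.counts[value] += 1
        self.total += 1
        return value

# A naive base procedure: independently chosen feature values.
def base_segment():
    return (random.choice(["+", "-"]),                       # NASALITY
            random.choice(["labial", "alveolar", "velar"]))  # PLACE

mem_segment = DPMem(base_segment, theta=1.0)
print([mem_segment.sample() for _ in range(10)])  # frequently drawn values recur
```

Wrapping the sampler at every node of the feature dependency graph in this way is what allows frequently generated segments and subsegments to accumulate probability mass.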


The more times that we use some particular set of features under a node to generate words in a language, the more likely we are to reuse that set of features in the future in a memo draw. This dynamic leads our model to rapidly concentrate probability mass on the subset of segments which occur in the inventory of a language.

2.2 Class Node Structure

Our use of and-or graphs and lexical memoization to model inter-feature dependencies is inspired by work in phonology on distinctiveness and markedness hierarchies (Kean, 1975; Berwick, 1985; Dresher, 2009). In addition to using feature hierarchies to delineate possible segments, the literature has used these structures to designate bundles of features that have privileged status in phonological description, i.e. feature geometries (Clements, 1985; Halle, 1995; McCarthy, 1988). For example, many analyses group features concerning laryngeal states (e.g., VOICE, ASPIRATION, etc.) under a laryngeal node, which is distinct from the node containing oral place-of-articulation features (Clements and Hume, 1995). These nodes are known as class nodes. In these analyses, features grouped together under the laryngeal class node may covary while being independent of features grouped under the oral class node.

The lexical memoization technique discussed above captures this notion of class node directly, because the model learns an inventory of subsegments under each node.

Consider the feature dependency graph below.

[Diagram: an and-node A dominating an and-node B (with children NASALITY, VOICE, ...) and an or-node VOCALIC with arcs labeled consonant and vowel; under the vowel arc, a class node C dominates BACKNESS, HEIGHT, ....]

In this graph, the and-node A generates fully specified segments. And-node B can be thought of as generating the non-oral properties of a segment, including voicing and nasality. And-node C is a class node bundling together the oral features of vowel segments.

The features under B are outside of the VOCALIC node, so these features are specified for both consonant and vowel segments. This allows combinations such as voiced nasal consonants, and also rarer combinations such as unvoiced nasal vowels. Because all and-nodes are recursively memoized, our model is able to bind together particular non-oral choices (node B), learning for instance that the combination {NASALITY:+, VOICED:+} commonly recurs for both vowels and consonants in a language. That is, {NASALITY:+, VOICED:+} becomes a high-probability memo draw.

Since the model learns an inventory of fully specified segments at node A, the model could learn one-off exceptions to this generalization as well. For example, it could store at a high level a segment with {NASALITY:+, VOICED:-} along with some other features, while maintaining the generalization that {NASALITY:+, VOICED:+} is highly frequent in base draws. Language-specific phoneme inventories abound with such combinations of class-node-based generalizations and idiosyncrasies. By using lexical memoization at multiple different levels, our model can capture both the broader generalizations described in class node terminology and the exceptions to those generalizations.

2.3 Sequential Structure as Memoization in Context

In Section 2.2, we focused on the role that features play in defining a language's segment inventory. We gave a phonologically-motivated generative process, equivalent to an adaptor grammar, for phonemes in isolation. However, features also play an important role in characterizing licit sequences. We model sequential restrictions as context-dependent segment inventories. Our model learns a distribution over segments and subsegments conditional on each preceding sequence of (sub)segments, using lexical memoization. Introducing context-dependence means that the model can no longer be formulated as an adaptor grammar.
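The idea of context-dependent inventories amounts to keeping a separate memoized distribution for each conditioning context. The sketch below is our own illustration, with invented names and a deliberately simplified, backoff-free scheme; the model's actual formulation, with hierarchical backoff over contexts, is given in Section 3.

```python
import random
from collections import Counter, defaultdict

class ContextualMem:
    """Per-context DP-style memoization: one 'restaurant' per preceding context."""

    def __init__(self, base, theta=1.0):
        self.base = base                    # base sampling procedure
        self.theta = theta
        self.counts = defaultdict(Counter)  # context -> value counts

    def sample(self, context):
        table = self.counts[context]
        total = sum(table.values())
        if total > 0 and random.random() < total / (total + self.theta):
            # memo draw: reuse a value previously generated in this context
            value = random.choices(list(table), weights=list(table.values()))[0]
        else:
            value = self.base()             # simplified: no backoff, straight to base
        table[value] += 1
        return value

# Example: a segment inventory conditioned on the single preceding segment.
base = lambda: random.choice("ptkaiu")
model = ContextualMem(base)
word = ["<s>"]
for _ in range(6):
    word.append(model.sample(context=word[-1]))
print("".join(word[1:]))
```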


2.4 Tier-based Interaction

One salient property of sequential restrictions in phonotactics is that segments are often required to bear the same feature values as nearby segments. For example, a sequence of a nasal and a following stop must agree in place features at the end of a morpheme in English. Such restrictions may even be non-local. For example, many languages prefer combinations of vowels that agree in features such as HEIGHT, BACKNESS, or ROUNDING, even across arbitrary numbers of intervening consonants (i.e., vowel harmony).

One way to describe these sequential feature interactions is to assume that feature values of one segment in a word depend on values for the same or closely related features in other segments. This is accomplished by dividing segments into subsets (such as consonants and vowels), called tiers, and then making a segment's feature values preferentially dependent on the values of other segments on the same tier.

Such phonological tiers are often identified with class nodes in a feature dependency graph. For example, a requirement that one vowel identically match the vowel in the preceding syllable would be stated as a requirement that the vowel's HEIGHT, BACKNESS, and ROUNDING features match the values of the preceding vowel's features. In this case, the vowels themselves need not be adjacent—by assuming that vowel quality features are not present in consonants, it is possible to say that two vowels are adjacent on a tier defined by the nodes HEIGHT, BACKNESS, and ROUNDING.

Our full generative process for a segment following other segments is the following. We follow the example of the generation of a phoneme conditional on a preceding context of /ak/, shown with simplified featural specifications and tiers in Figure 1.

[Figure 1: Tiers defined by class nodes A and B for context sequence /ak/. See text.]

At each node in the feature dependency graph, we can either generate a fully-specified subsegment for that node (memo draw), or assemble a novel subsegment for that node out of parts defined by the feature dependency graph (base draw). Starting at the root node of the feature dependency graph, we decide whether to do a memo draw or base draw conditional on the previous n subsegments at that node.

So in order to generate the next segment following /ak/ in the example, we start at node A in the next draw from the feature geometry; with some probability we do a memo draw conditioned on /ak/, defined by the red tier. If we decide to do a base draw instead, we then repeat the procedure conditional on the previous n − 1 segments, recursively until we are conditioning on the empty context. That is, we do a memo draw conditional on /k/, or conditional on the empty context. This process of conditioning on successively smaller contexts is a standard technique in Bayesian nonparametric language modeling (Teh, 2006; Goldwater et al., 2006).

At the empty context, if we decide to do a base draw, then we generate a novel segment by repeating the whole process at each child node, to generate several subsegments. In the example, we would assemble a phoneme by independently sampling subsegments at the nasal/laryngeal node B and the MANNER node, and then combining them. Crucially, the conditioning context consists only of the values at the current node in the previous phonemes. So when we sample a subsegment from node B, it is conditional on the previous two values at node B, {VOICE:+, NASAL:-} and {VOICE:-, NASAL:-}, defined by the blue tier in the figure. The process continues down the feature dependency graph recursively. At the point where the model decides on vowel place features such as height and backness, these will be conditioned only on the vowel place features of the preceding /a/, with /k/ skipped entirely as it does not have values at vowel place nodes.

This section has provided motivations and a walkthrough of our proposed generative procedure for sequences of segments. In the next section, we give the formalization of the model.

3 Formalization of the Models

Here we give a full formal description of our proposed model in three steps. First, in Section 3.1, we formalize the generative process for a segment in isolation. Second, in Section 3.2, we give a formulation of Bayesian nonparametric N-gram models with backoff. Third, in Section 3.3, we show how to drop the generative process for a phoneme into the N-gram model such that tier-based interactions emerge naturally.
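As a concrete rendering of the /ak/ walkthrough, the sketch below (our own, with invented node names and illustrative feature values) shows the key step: when conditioning at a class node, only the previous subsegments that actually have values at that node count as context, so /k/ is skipped on the vowel-place tier.

```python
def tier_context(context, node, n):
    """Rightmost n-1 subsegments of `context` that are defined at `node`.

    `context` is a list of segments, each a dict from class-node names to
    subsegments. Segments lacking `node` are skipped entirely, which is what
    makes the conditioning tier-based and potentially non-local."""
    values = [seg[node] for seg in context if node in seg]
    return values[-(n - 1):] if n > 1 else []

# Simplified /a/ and /k/ from the walkthrough (features illustrative only).
a = {"nasal_laryngeal": ("VOICE:+", "NASAL:-"),
     "vowel_place": ("HEIGHT:low", "BACKNESS:back")}
k = {"nasal_laryngeal": ("VOICE:-", "NASAL:-")}   # no vowel place subsegment

context = [a, k]
# At the nasal/laryngeal node, both /a/ and /k/ contribute context ...
print(tier_context(context, "nasal_laryngeal", n=3))
# ... but at the vowel place node, /k/ is invisible and only /a/ remains.
print(tier_context(context, "vowel_place", n=3))
```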


3.1 Generative Process for a Segment

A feature dependency graph G is a fully connected, singly rooted, directed, acyclic graph given by the triple ⟨V, A, t, r⟩, where V is a set of vertices or nodes, A is a set of directed arcs, t is a total function t(n): V → {and, or}, and r is a distinguished root node in V. A directed arc is a pair ⟨p, c⟩ where the parent p and child c are both elements in V. The function t(n) identifies whether n is an and- or or-node. Define ch(n) to be the function that returns all children of node n, that is, all n′ ∈ V such that ⟨n, n′⟩ ∈ A.

A subgraph G^s of feature dependency graph G is the graph obtained by starting from node s and retaining only nodes and arcs reachable by traversing arcs starting from s. A subsegment p^s is a subgraph rooted in node s for which each or-node contains exactly one outgoing arc. Subsegments represent sampled phone constituents. A segment is a subsegment rooted in r—that is, a fully specified phoneme.

The distribution associated with a subgraph G^s is given by G^s below. G^s is a distribution over subsegments; the distribution for the full graph G^r is a distribution over fully specified segments. We occasionally overload the notation such that G^s(p^s) will refer to the probability mass function associated with distribution G^s evaluated at the subsegment p^s.

$$H^s \sim \mathrm{DP}(\theta^s, G^s) \qquad (1)$$

$$G^s(p^s) = \begin{cases} \prod_{s' \in \mathrm{ch}(s)} H^{s'}(p^{s'}) & t(s) = \mathrm{AND} \\ \sum_{s' \in \mathrm{ch}(s)} \psi^s_{s'}\, H^{s'}(p^{s'}) & t(s) = \mathrm{OR} \end{cases}$$

The first case of the definition covers and-nodes. We assume that the leaves of our feature dependency graph—which represent atomic feature values such as the laryngeal value of a PLACE feature—are childless and-nodes.

The second case of the definition covers or-nodes in the graph, where ψ^s_{s′} is the probability associated with choosing outgoing arc ⟨s, s′⟩ from parent or-node s to child node s′. Thus, or-nodes define mixture distributions over outgoing arcs. The mixture weights are drawn from a Dirichlet process. In particular, for or-node n in the underlying graph G, the vector of probabilities over outgoing edges is distributed as follows.

$$\vec{\psi}^s \sim \mathrm{DP}(\theta^s, \mathrm{UNIFORM}(|\mathrm{ch}(s)|))$$

Note that in both cases the distribution over child subgraphs is drawn from a Dirichlet process, as in Equation 1, capturing the notion of subsegmental storage discussed above.

3.2 N-Gram Models with DP-Backoff

Let T be a set of discrete objects (e.g., atomic symbols or structured segments as defined in the preceding sections). Let T* be the set of all finite-length strings which can be generated by combining elements of T under concatenation, ·, including the empty string ε. A context u is any finite string beginning with a special distinguished start symbol and ending with some sequence in T*, that is, u ∈ {start · T*}.

For any string α, define hd(α) to be the function that returns the first symbol in the string, tl(α) to be the function that returns the suffix of α minus the first symbol, and |α| to be the length of α, with hd(ε) = tl(ε) = ε and |ε| = 0. Write the concatenation of two strings α and α′ as α · α′.

Let H_u be a distribution on next symbols—that is, objects in T ∪ {stop}—conditioned on a given context u. For an N-gram model of order N, the probability of a string β in T* is given by K^N_start(β · stop), where K^N_u(α) is defined as:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H_{f_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise} \end{cases} \qquad (2)$$

where f_n(·) is a context-management function which determines which parts of the left-context should be used to determine the probability of the current symbol. In the case of the N-gram models used in this paper, f_n(·) takes a sequence u and returns only the rightmost n − 1 elements from the sequence, or the entire sequence if it has length less than n.

Note two aspects of this formulation of N-gram models. First, H_u is a family of distributions over next symbols or more general objects. Later, we will drop in phonological-feature-based generative processes for these distributions. Second, the function f_n is a parameter of the above definitions. In what follows, we will use a variant of this function which is sensitive to tier-based structure, returning the previous n − 1 only on the appropriate tier.
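Equation 2 can be read as a simple left-to-right scoring loop. The sketch below is our own paraphrase under simplifying assumptions: h(context, symbol) stands in for the family of next-symbol distributions H (however it is estimated), and f_n just keeps the rightmost n − 1 symbols.

```python
START, STOP = "<start>", "<stop>"

def f_n(u, n):
    """Context-management function: the rightmost n-1 symbols of u."""
    return tuple(u[-(n - 1):]) if n > 1 else ()

def string_prob(beta, h, N):
    """P(beta) = K^N_start(beta . stop), unrolling the recursion in Equation 2.

    h(context, symbol) should return P(symbol | truncated context)."""
    u = (START,)
    prob = 1.0
    for symbol in list(beta) + [STOP]:
        prob *= h(f_n(u, N), symbol)
        u = u + (symbol,)
    return prob

# Toy example: a uniform next-symbol distribution over a 3-symbol alphabet plus stop.
uniform_h = lambda context, symbol: 1.0 / 4
print(string_prob("ak", uniform_h, N=3))  # (1/4) ** 3: two symbols plus stop
```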


MacKay and Peto (1994) introduced a hierarchical Dirichlet process-based backoff scheme for N-gram models, with generalizations in Teh (2006) and Goldwater et al. (2006). In this setup, the distribution over next symbols given a context u is drawn hierarchically from a Dirichlet process whose base measure is another Dirichlet process associated with context tl(u), and so on, with all draws ultimately backing off into some unconditioned distribution over all possible next symbols. That is, in a hierarchical Dirichlet process N-gram model, H_{f_n(u)} is given as follows.

$$H_{f_n(u)} \sim \begin{cases} \mathrm{DP}(\theta_{f_n(u)}, H_{f_{n-1}(u)}) & n \geq 1 \\ \mathrm{DP}(\theta_{f_n(u)}, \mathrm{UNIFORM}(T \cup \{\text{stop}\})) & n = 0 \end{cases}$$

3.3 Tier-Based Interactions

To make the N-gram model defined in the last section capture tier-based interactions, we make two changes. First, we generalize the generative process H^s from Equation 1 to H^s_u, which generates subsegments conditional on a sequence u. Second, we define a context-truncating function f^s_n(u) which takes a context of segments u and returns the rightmost n − 1 non-empty subsegments whose root node is s. Then we substitute the generative process H^s over f^s_n(u) (which applies the context-management function f^s_n(·) to the context u) for H_{f_n(u)} in Equation 2. The resulting probability distribution is:

$$K^N_u(\alpha) = \begin{cases} 1 & \alpha = \epsilon \\ H^r_{f^r_N(u)}(\mathrm{hd}(\alpha)) \times K^N_{u \cdot \mathrm{hd}(\alpha)}(\mathrm{tl}(\alpha)) & \text{otherwise.} \end{cases}$$

K^N_u(α) is the distribution over continuations given a context of segments. Its definition depends on H^s_{f^s_n(u)}, which is the generalization of the generative process for segments H^s to be conditional on some tier-based N-gram context f^s_n(u). H^s_{f^s_n(u)} is:

$$H^s_{f^s_n(u)} \sim \begin{cases} \mathrm{DP}(\theta^s_{f^s_n(u)}, H^s_{f^s_{n-1}(u)}) & n \geq 1 \\ \mathrm{DP}(\theta^s_{f^s_n(u)}, G^s_{f^s_N(u)}) & n = 0 \end{cases}$$

$$G^s_{f^s_n(u)}(p^s) = \begin{cases} \prod_{s' \in \mathrm{ch}(s)} H^{s'}_{f^{s'}_n(u)}(p^{s'}) & t(s) = \mathrm{AND} \\ \sum_{s' \in \mathrm{ch}(s)} \psi^s_{s'}\, H^{s'}_{f^{s'}_n(u)}(p^{s'}) & t(s) = \mathrm{OR.} \end{cases}$$

H^s_{f^s_n(u)} and G^s_{f^s_n(u)} above are mutually recursive functions. H^s_{f^s_n(u)} implements backoff in the tier-based context of previous subsegments; G^s_{f^s_n(u)} implements backoff by going down into the probability distributions defined by the feature dependency graph.

Note that the function H^s_{f^s_n(u)} recursively backs off to the empty context, but its ultimate base distribution is indexed by f^s_N(u), using the global maximum N-gram order N. So when samples are drawn from the feature dependency graph, they are conditioned on non-empty tier-based contexts. In this way, subsegments are generated based on tier-based context and based on featural backoff in an interleaved fashion.

3.4 Inference

We use the Chinese Restaurant Process representation for sampling. Inference in the model is over seating arrangements for observations of subsegments and over the hyperparameters θ for each restaurant. We perform Gibbs sampling on seating arrangements in the Dirichlet N-gram models by removing and re-adding observations in each restaurant. These Gibbs sweeps had negligible impact on model behavior. For the concentration parameter θ, we set a prior Gamma(10, .1). We draw posterior samples using the slice sampler described in Johnson and Goldwater (2009). We draw one posterior sample of the hyperparameters for each Gibbs sweep. In contrast to the Gibbs sweeps, we found resampling hyperparameters to be crucial for achieving the performance described below (Section 5.3).

4 Related Work

Phonotactics has proven a fruitful problem domain for computational models. Most such work has adopted a constraint-based approach, attempting to design a scoring function based on phonological features to separate acceptable forms from unacceptable ones, typically by formulating restrictions or constraints to rule out less-good structures.

This concept has led naturally to the use of undirected (maximum-entropy, log-linear) models. In this class of models, a form is scored by evaluation against a number of predicates, called factors[3]—for example, whether two adjacent segments have the phonological features VOICE:+ VOICE:-. Each factor is associated with a weight, and the score for a form is the sum of the weights of the factors which are true for the form.

[3] Factors are also commonly called "features"—a term we avoid to prevent confusion with phonological features.


The well-known model of Hayes and Wilson (2008) adopts this framework, pairing it with a heuristic procedure for finding explanatory factors while preventing overfitting. Similarly, Albright (2009) assigns a score to forms based on factors defined over natural classes of adjacent segments. Constraint-based models have the advantage of flexibility: it is possible to score forms using arbitrarily complex and overlapping sets of factors. For example, one can state a constraint against adjacent phonemes having features VOICE:+ and LATERAL:+, or any combination of feature values.

In contrast, we have presented a model where forms are built out of parts by structure-building operations. From this perspective, the goal of a model is not to rule out bad forms, but rather to discover repeating structures in good forms, such that new forms with those structures can be generated.

In this setting there is less flexibility in how phonological features can affect well-formedness. For a structure-building model to assign "scores" to arbitrary pairs of co-occurring features, there must be a point in the generative process where those features are considered in isolation. Coming up with such a process has been challenging. As a result of this limitation, structure-building models of phonotactics have not generally included rich featural interactions. For example, Coleman and Pierrehumbert (1997) give a probabilistic model for phonotactics where words are generated using a grammar over units such as syllables, onsets, and rhymes. This model does not incorporate fine-grained phonological features such as voicing and place.

In fact, it has been argued that a constraint-based approach is required in order to capture rich feature-based interactions. For example, Goldsmith and Riggle (2012) develop a tier-based structure-building model of Finnish phonotactics which captures nonlocal vowel harmony interactions, but argue that this model is inadequate because it does not assign higher probabilities to forms than an N-gram model, a common baseline model for phonotactics (Daland et al., 2011). They argue that this deficiency is because the model cannot simultaneously model nonlocal vowel-vowel interactions and local consonant-vowel interactions. Because of our tier-based conditioning mechanism (Sections 2.4 and 3.3), our model can simultaneously produce local and nonlocal interactions between features using structure-building operations, and does assign higher probabilities to held-out forms than an N-gram model (Section 5.3). From this perspective, our model can be seen as a proof of concept that it is possible to have rich feature-based conditioning without adopting a constraint-based approach.

While our model can capture featural interactions, it is less flexible than a constraint-based model in that the allowable interactions are specified by the feature dependency graph. For example, there is no way to encode a direct constraint against adjacent phonemes having features VOICE:+ and LATERAL:+. We consider this a strength of the approach: A particular feature dependency graph is a parameter of our model, and a specific scientific hypothesis about the space of likely featural interactions between phonemes, similar to feature geometries from classical generative phonology (Clements, 1985; McCarthy, 1988; Halle, 1995).[4]

[4] We do however note that it may be possible to learn feature hierarchies on a language-by-language basis from universal articulatory and acoustic biases, as suggested by Dresher (2009).

While probabilistic approaches have mostly taken a constraint-based approach, recent formal language theoretic approaches to phonology have investigated what basic parts and structure building operations are needed to capture realistic feature-based interactions (Heinz et al., 2011; Jardine and Heinz, 2015). We see probabilistic structure-building approaches such as this work as a way to unify the recent formal language theoretic advances in computational phonology with computational phonotactic modeling.

Our model joins other NLP work attempting to do sequence generation where each symbol is generated based on a rich featural representation of previous symbols (Bilmes and Kirchhoff, 2003; Duh and Kirchhoff, 2004), though we focus more on phonology-specific representations. Our and-or graphs are similar to those used in computer vision to represent possible objects (Jin and Geman, 2006).

5 Model Evaluation and Experiments

Here we evaluate some of the design decisions of our model and compare it to a baseline N-gram model and to a widely-used constraint-based model, BLICK.


In order to probe model behavior, we also present evaluations on artificial data, and a sampling of "representative forms" preferred by one model as compared to another.

Our model consists of structure-building operations over a learned inventory of subsegments. If our model can exploit more repeated structure in phonological forms than the N-gram model or constraint-based models, then it should assign higher probabilities to forms. The log probability of a form under a model corresponds to the description length of that form under the model; if a model assigns a higher log probability to a form, that means the model is capable of compressing the form more than other models. Therefore, we compare models on their ability to assign high probabilities to phonological forms, as in Goldsmith and Riggle (2012).

5.1 Evaluation of Model Components

We are interested in discovering the extent to which each model component described above—feature dependency graphs (Section 2.1), class node structure (Section 2.2), and tier-based conditioning (Section 2.4)—contributes to the ability of the model to explain wordforms.

To evaluate the contribution of feature dependency graphs, we compare our models with a baseline N-gram model, which represents phonemes as atomic units. For this N-gram model, we use a Hierarchical Dirichlet Process with n = 3.

To evaluate feature dependency graphs with and without articulated class node structure, we compare models using the graph shown in Figure 3 (the minimal structure required to produce well-formed phonemes) to models with the graph shown in Figure 2, which includes phonologically motivated "class nodes".[5]

To evaluate tier-based conditioning, we compare models with the conditioning described in Sections 2.4 and 3.3 to models where all decisions are conditioned on the full featural specification of the previous n − 1 phonemes. This allows us to isolate improvements due to tier-based conditioning beyond improvements from the feature hierarchy.

[5] These feature dependency graphs differ from those in the exposition in Section 2 in that they do not include a VOCALIC feature, but rather treat vowel as a possible value of MANNER.

[Figure 2 shows a feature dependency graph whose node labels include duration, manner (with arcs labeled vowel and otherwise), laryngeal, nasal, suprasegmental, C place, 2nd art., lateral, backness, height, and rounding.]
Figure 2: Feature dependency graph with class node structure used in our experiments. Plain text nodes are OR-nodes with no child distributions. The arc marked otherwise represents several arcs, each labelled with a consonant manner such as stop, fricative, etc.

[Figure 3 shows a flat feature dependency graph over the same node labels: duration, laryngeal, nasal, manner (vowel/otherwise), suprasegmental, backness, height, rounding, C place, 2nd art., lateral.]
Figure 3: "Flat" feature dependency graph.

5.2 Lexicon Data

The WOLEX corpus provides transcriptions for words in dictionaries of 60 diverse languages, represented in terms of phonological features (Graff, 2012). In addition to words, the dictionaries include some short set phrases, such as of course. We use the featural representation of WOLEX, and design our feature dependency graphs to generate only well-formed phonemes according to this feature system. For space reasons, we present the evaluation of our model on 14 of these languages, chosen based on the quality of their transcribed lexicons, and the authors' knowledge of their phonological systems.

5.3 Held-Out Evaluation

Here we test whether the different model configurations described above assign high probability to held-out forms. This tests the models' ability to generalize beyond their training data. We train each model on 2500 randomly selected wordforms from a WOLEX dictionary, and compute posterior predictive probabilities for the remaining wordforms from the final state of the model.


Table 1: Average log posterior predictive probability of a held-out form. "ngram" is the DP Backoff 3-gram model. "flat" models use the feature dependency graph in Figure 3. "cl. node" models use the graph in Figure 2. See text for motivations of these graphs. "no tiers" models condition each decision on the previous phoneme, rather than on tiers of previous features. Asterisks indicate statistical significance according to a t-test comparing with the scores under the N-gram model. * = p < .05; ** = p < .001.

Language        ngram    flat      cl. node   flat/no tiers  cl. node/no tiers
English         -22.20   -22.15    -21.73**   -22.15         -22.14
French          -18.30   -18.28    -17.93**   -18.29         -18.28
Georgian        -20.21   -20.17    -19.64*    -20.18         -20.18
German          -24.77   -24.72    -24.07**   -24.73         -24.74
Greek           -22.48   -22.45    -21.65**   -22.45         -22.45
Haitian Creole  -16.09   -16.04    -15.82**   -16.05         -16.04
Lithuanian      -19.03   -18.99    -18.58*    -18.99         -18.99
Mandarin        -13.95   -13.83*   -13.78**   -13.82*        -13.82*
Mor. Arabic     -16.15   -16.10    -16.00*    -16.13         -16.12
Polish          -20.12   -20.08    -19.76**   -20.08         -20.07
Quechua         -14.35   -14.30    -13.87*    -14.30         -14.31
Romanian        -18.71   -18.68    -18.32**   -18.69         -18.68
Tatar           -16.21   -16.18    -15.65**   -16.19         -16.19
Turkish         -18.88   -18.85    -18.55**   -18.85         -18.84

Table 1 shows the average probability of a held-out word under our models and under the N-gram model for one model run.[6] For all languages, we get a statistically significant increase in probabilities by adopting the autosegmental model with class nodes and tier-based conditioning. Model variants without either component do not significantly outperform the N-gram model except in Chinese. The combination of class nodes and tier-based conditioning results in model improvements beyond the contributions of the individual features.

[6] The mean standard deviation per form of log probabilities over 50 runs of the full model ranged from .09 for Amharic to .23 for Dutch.

5.4 Evaluation on Artificial Data

Our model outperforms the N-gram model in predicting held-out forms, but it remains to be shown that this performance is due to capturing the kinds of linguistic intuitions discussed in Section 2. An alternative possibility is that the Autosegmental N-gram model, which has many more parameters than a plain N-gram model, can simply learn a more accurate model of any sequence, even if that sequence has none of the structure discussed above. To evaluate this possibility, we compare the performance of our model in predicting held-out linguistic forms to its performance in predicting held-out forms from artificial lexicons which expressly do not have the linguistic structure we are interested in.

If the autosegmental model outperforms the N-gram model even on artificial data with no phonological structure, then its performance on the real linguistic data in Section 5.3 might be overfitting. On the other hand, if the autosegmental model does better on real data but not artificial data, then we can conclude that it is picking up on some real distinctive structure of that data.

For each real lexicon L_r, we generate an artificial lexicon L_a by training a DP 3-gram model on L_r and forward-sampling |L_r| forms. Additionally, the forms in L_a are constrained to have the same distribution over lengths as the forms in L_r. The resulting lexicons have no tier-based or featural interactions except as they appear by chance from the N-gram model trained on these lexica. For each L_a we then train our models on the first 2500 forms and score the probabilities of the held-out forms, the same procedure as in Section 5.3.

We ran this procedure for all the lexicons shown in Table 1. For all but one lexicon, we find that the autosegmental models do not significantly outperform the N-gram models on artificial data. The exception is Mandarin Chinese, where the average log probability of an artificial form is -13.81 under the N-gram model and -13.71 under the full autosegmental model. The result suggests that the anomalous behavior of Mandarin Chinese in Section 5.3 may be due to overfitting.

When exposed to data that explicitly does not have autosegmental structure, the model is not more accurate than a plain sequence model for almost all languages. But when exposed to real linguistic data, the model is more accurate. This result provides evidence that the generative model developed in Section 2 captures true distributional properties of lexicons that are absent in N-gram distributions, such as featural and tier-based interactions.
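For concreteness, the artificial-lexicon construction can be sketched as follows. This is our own reading of the procedure, not released code; ngram.sample_form is a hypothetical stand-in for forward-sampling from the trained DP 3-gram model, and the length matching is implemented here by simple rejection.

```python
import random
from collections import Counter

def artificial_lexicon(real_lexicon, ngram, seed=0, max_tries=1_000_000):
    """Forward-sample |L_r| forms from an N-gram model trained on L_r,
    keeping only forms whose lengths match the real length distribution."""
    rng = random.Random(seed)
    needed = Counter(len(form) for form in real_lexicon)  # target length counts
    artificial = []
    for _ in range(max_tries):
        if sum(needed.values()) == 0:        # every length quota is filled
            break
        form = ngram.sample_form(rng)        # hypothetical sampling interface
        if needed[len(form)] > 0:            # reject forms of unneeded lengths
            needed[len(form)] -= 1
            artificial.append(form)
    return artificial
```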


5.5 Comparison with a Constraint-Based Model

Here we provide a comparison with Hayes and Wilson (2008)'s Phonotactic Learner, which outputs a phonotactic grammar in the form of a set of weighted constraints on feature co-occurrences. This grammar is optimized to match the constraint violation profile in a training lexicon, and so can be seen as a probabilistic model of that lexicon. The authors have distributed one such grammar, BLICK, as a "reference point for phonotactic probability in experimentation" (Hayes, 2012). Here we compare our model against BLICK on its ability to assign probabilities to forms, as in Section 5.3.

Ideally, we would simply compute the probability of forms like we did in our earlier model comparisons. BLICK returns scores for each form. However, since the probabilistic model underlying BLICK is undirected, these scores are in fact unnormalized log probabilities, so they cannot be compared directly to the normalized probabilities assigned by the other models. Furthermore, because the probabilistic model underlying BLICK does not penalize forms for length, the normalizing constant over all forms is in fact infinite, making straightforward comparison of predictive probabilities impossible. Nevertheless, we can turn BLICK scores into probabilities by conditioning on further constraints, such as the length k of the form. We enumerate all possible forms of length k to compute the normalizing constant for the distribution over forms of that length. The same procedure can also be used to compute the probabilities of each form, conditioned on the length of the form k, under the N-gram and Autosegmental models.

To compare our models against BLICK, we calculate conditional probabilities for forms of length 2 through 5 from the English lexicon.[7] The forms are those in the WOLEX corpus; we include them for this evaluation if they are k symbols long in the WOLEX representation. For our N-gram and Autosegmental models, we use the same models as in Section 5.3. The average probabilities of forms under the three models are shown in Table 2. For length 3-5, the autosegmental model assigns the highest probabilities, followed by the N-gram model and BLICK. For length 2, BLICK outperforms the DP N-gram model but not the autosegmental model.

[7] Enumerating and scoring the 22,164,361,129 possible forms of length 6 was computationally impractical.

Table 2: Average log posterior predictive probability of an English form of fixed length under BLICK and our models.

Length   BLICK   ngram   cl. node
2        -6.50   -6.81   -5.18
3        -9.38   -8.76   -7.95
4        -14.1   -11.7   -11.4
5        -18.1   -14.2   -13.9

Our model assigns higher probabilities to short forms than BLICK. That is, our models have identified more redundant structure in the forms than BLICK, allowing them to compress the data more. However, the comparison is imperfect in several ways. First, BLICK and our models were trained on different data; it is possible that our training data are more representative of our test data than BLICK's training data were. Second, BLICK uses a different underlying featural decomposition than our models; it is possible that our feature system is more accurate. Nevertheless, these results show that our model concentrates more probability mass on (short) forms attested in a language, whereas BLICK likely spreads its probability mass more evenly over the space of all possible (short) strings.

5.6 Representative Forms

In order to get a sense of the differences between models, we investigate what phonological forms are preferred by different kinds of models. These forms might be informative about the phonotactic patterns that our model is capturing which are not well-represented in simpler models. We calculate the representativeness of a form f with respect to model m1 as opposed to m2 as p(f|m1)/p(f|m2) (Good, 1965; Tenenbaum and Griffiths, 2001). The forms that are most "representative" of model m1 are not the forms that m1 assigns the highest probability, but rather the forms that m1 ranks highest relative to m2.

Table 3: Most representative forms for the N-gram model and for our full model ("cl. node" in Table 1) in English. Forms are presented in native orthography, but were scored based on their phonetic form.

English N-gram      English Full Model
collaborationist    mistrustful
a posteriori        inharmoniousness
sacristy            absentmindedness
matter of course    blamelessness
earnest money       phlegmatically

Tables 3 and 4 show forms from the lexicon that are most representative of our full model and of the N-gram model for English and Turkish.
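The representativeness ranking behind Tables 3 and 4 is just a ratio of model probabilities; in log space it can be computed as below (our own sketch, assuming each model exposes some log-probability function).

```python
def most_representative(forms, logprob_m1, logprob_m2, k=5):
    """Rank forms by log p(f | m1) - log p(f | m2), the log of the
    representativeness ratio p(f | m1) / p(f | m2)."""
    scored = sorted(forms, key=lambda f: logprob_m1(f) - logprob_m2(f), reverse=True)
    return scored[:k]
```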


Table 4: Most representative forms for N-gram and Autosegmental models in Turkish.

Turkish N-gram   Turkish Full Model
üstfamilya       büyükkarapınar
dekstrin         kızılcapınar
mnemotekni       altınpınar
ekskavatör       sarımehmetler
foksterye        karaelliler

The most uniquely representative forms for our full model are morphologically complex forms consisting of many productive, frequently reused morphemes such as ness. On the other hand, the representative forms for the N-gram model include foreign forms such as a posteriori (for English) and ekskavatör (for Turkish), which are not built out of parts that frequently repeat in those languages. The representative forms suggest that the full model places more probability mass on words which are built out of highly productive, phonotactically well-formed parts.

6 Discussion

We find that our models succeed in assigning high probabilities to unseen forms, that they do so specifically for linguistic forms and not random sequences, that they tend to favor forms with many productive parts, and that they perform comparably to a state-of-the-art constraint-based model in assigning probabilities to short forms.

The improvement for our models over the N-gram baseline is consistent but not large. We attribute this to the way in which phonological generalizations are used in the present model: in particular, phonological generalizations function primarily as a form of backoff for a sequence model. Our models have lexical memoization at each node in a feature dependency graph; as such, the top node in the graph ends up representing transition probabilities for whole phonemes conditioned on previous phonemes, and the rest of the feature dependency graph functions as a backoff distribution. When a model has been exposed to many training forms, its behavior will be largely dominated by the N-gram-like behavior of the top node. In future work it might be effective to learn an optimal backoff procedure which gives more influence to the base distribution (Duh and Kirchhoff, 2004; Wood and Teh, 2009).

While the tier-based conditioning in our model would seem to be capable of modeling nonlocal interactions such as vowel harmony, we have not found that the models do well at reproducing these nonlocal interactions. We believe this is because the model's behavior is dominated by nodes high in the feature dependency graph. In any case, a simple Markov model defined over tiers, as we have presented here, might not be enough to fully model vowel harmony. Rather, a model of phonological processes, transducing underlying forms to surface forms, seems like a more natural way to capture these phenomena.

We stress that this model is not tied to a particular feature dependency graph. In fact, we believe our model provides a novel way of testing different hypotheses about feature structures, and could form the basis for learning the optimal feature hierarchy for a given data set. The choice of feature dependency graph has a large effect on what featural interactions the model can represent directly. For example, neither feature dependency graph has shared place features for consonants and vowels, so the model has limited ability to represent place-based restrictions on consonant-vowel sequences such as requirements for labialized or palatalized consonants in the context of /u/ or /i/. These interactions can be treated in our framework if vowels and consonants share place features, as in Padgett (2011).

7 Conclusion

We have presented a probabilistic generative model for sequences of phonemes defined in terms of phonological features, based on representational ideas from generative phonology and tools from Bayesian nonparametric modeling. We consider our model as a proof of concept that probabilistic structure-building models can include rich featural interactions. Our model robustly outperforms an N-gram model on simple metrics, and learns to generate forms consisting of highly productive parts. We also view this work as a test of the scientific hypotheses that phonological features can be organized in a hierarchy and that they interact along tiers: in our model evaluation, we found that both concepts were necessary to get an improvement over a baseline N-gram model.


Acknowledgments

We would like to thank Tal Linzen, Leon Bergen, Edward Flemming, Edward Gibson, Bob Berwick, Jim Glass, and the audiences at MIT's Phonology Circle, SIGMORPHON, and the LSA 2016 Annual Meeting for helpful comments. This work was supported in part by NSF DDRIG Grant #1551543 to R.F.

References

Adam Albright. 2009. Feature-based generalization as a source of gradient acceptability. Phonology, 26:9–41.

Robert C. Berwick. 1985. The acquisition of syntactic knowledge. MIT Press, Cambridge, MA.

Jeff A. Bilmes and Katrin Kirchhoff. 2003. Factored language models and generalized parallel backoff. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003–Short Papers–Volume 2, pages 4–6. Association for Computational Linguistics.

Geert Booij. 2011. Morpheme structure constraints. In The Blackwell Companion to Phonology. Blackwell.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1:97–138.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York, NY.

George N. Clements and Elizabeth V. Hume. 1995. The internal organization of speech sounds. In The Handbook of Phonological Theory, pages 24–306. Blackwell, Oxford.

George N. Clements. 1985. The geometry of phonological features. Phonology Yearbook, 2:225–252.

John Coleman and Janet B. Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In John Coleman, editor, Proceedings of the 3rd Meeting of the ACL Special Interest Group in Computational Phonology, pages 49–56, Somerset, NJ. Association for Computational Linguistics.

Robert Daland, Bruce Hayes, James White, Marc Garellek, Andreas Davis, and Ingrid Normann. 2011. Explaining sonority projection effects. Phonology, 28:197–234.

Carl de Marcken. 1996. Linguistic structure as composition and perturbation. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 335–341. Association for Computational Linguistics.

B. Elan Dresher. 2009. The Contrastive Hierarchy in Phonology. Cambridge University Press.

Kevin Duh and Katrin Kirchhoff. 2004. Automatic learning of language model structure. In Proceedings of COLING 2004, Geneva, Switzerland.

Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30(3):859–896.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18, pages 459–466, Cambridge, MA. MIT Press.

Sharon Goldwater. 2006. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.

Irving John Good. 1965. The Estimation of Probabilities. MIT Press, Cambridge, MA.

Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: A language for generative models. In Uncertainty in Artificial Intelligence, Helsinki, Finland. AUAI Press.

Peter Graff. 2012. Communicative Efficiency in the Lexicon. Ph.D. thesis, Massachusetts Institute of Technology.

Morris Halle. 1959. The Sound Pattern of Russian: A linguistic and acoustical investigation. Mouton, The Hague, The Netherlands.

Morris Halle. 1995. Feature geometry and feature spreading. Linguistic Inquiry, 26(1):1–46.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Bruce Hayes. 2012. Blick - a phonotactic probability calculator.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based strictly local constraints for phonology. In The 49th Annual Meeting of the Association for Computational Linguistics.

Roman Jakobson, C. Gunnar M. Fant, and Morris Halle. 1952. Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. The MIT Press, Cambridge, Massachusetts and London, England.

Adam Jardine and Jeffrey Heinz. 2015. A concatenation operation to derive autosegmental graphs. In Proceedings of the 14th Meeting on the Mathematics of Language (MoL 2015), pages 139–151, Chicago, USA, July.


Ya Jin and Stuart Geman. 2006. Context and hierarchy in a probabilistic image model. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 2145–2152.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In NAACL-HLT 2009, pages 317–325. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641–648.

Mary-Louise Kean. 1975. The Theory of Markedness in Generative Grammar. Ph.D. thesis, Massachusetts Institute of Technology.

David J.C. MacKay and Linda C. Bauman Peto. 1994. A hierarchical Dirichlet language model. Natural Language Engineering, 1:1–19.

John McCarthy. 1988. Feature geometry and dependency: A review. Phonetica, 43:84–108.

Donald Michie. 1968. "Memo" functions and machine learning. Nature, 218:19–22.

Timothy J. O'Donnell. 2015. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, Cambridge, Massachusetts and London, England.

Jaye Padgett. 2011. Consonant-vowel place feature interactions. In The Blackwell Companion to Phonology, pages 1761–1786. Blackwell Publishing, Malden, MA.

Ezer Rasin and Roni Katzir. 2014. A learnability argument for constraints on underlying representations. In Proceedings of the 45th Annual Meeting of the North East Linguistic Society (NELS 45), Cambridge, Massachusetts.

Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650.

Richard Stanley. 1967. Redundancy rules in phonology. Language, 43(2):393–436.

Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.

Joshua B. Tenenbaum and Thomas L. Griffiths. 2001. The rational basis of representativeness. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, pages 1036–1041.

Nikolai S. Trubetzkoy. 1939. Grundzüge der Phonologie. Number 7 in Travaux du Cercle Linguistique de Prague. Vandenhoeck & Ruprecht, Göttingen.

Frank Wood and Yee Whye Teh. 2009. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, pages 607–614.

Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. 2009. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada.

