On the Complexity and Typology of Inflectional Morphological Systems

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner
Department of Computer Science, Johns Hopkins University
Department of Linguistics, University of Colorado
{ryan.cotterell,ckirov1,eisner}@jhu.edu, [email protected]

Abstract

We quantify the linguistic complexity of different languages' morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: A language's inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm—how hard it is to jointly predict all the forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.

1 Introduction

What makes an inflectional system ''complex''? Linguists have sometimes considered measuring this by the size of the inflectional paradigms (McWhorter, 2001). The number of distinct inflected forms of each word indicates the number of morphosyntactic distinctions that the language makes on the surface. However, this gives only a partial picture of complexity (Sagot, 2013). Some inflectional systems are more irregular: It is harder to guess how the inflected forms of a word will be spelled or pronounced, given the base form. Ackerman and Malouf (2013) hypothesize that there is a limit to the irregularity of an inflectional system. We refine this hypothesis to propose that systems with many forms per paradigm have an even stricter limit on irregularity per distinct form. That is, the two dimensions interact: A system cannot be complex along both axes at once. In short, if a language demands that its speakers use a lot of distinct forms, those forms must be relatively predictable.

In this work, we develop information-theoretic tools to operationalize this hypothesis about the complexity of inflectional systems. We model each inflectional system using a tree-structured directed graphical model whose factors are neural networks and whose structure (topology) must be learned along with the factors. We explain our approach to quantifying two aspects of inflectional complexity and, in one case, approximate our metric using a simple variational bound. This allows a data-driven approach by which we can measure the morphological complexity of a given language in a clean manner that is more theory-agnostic than previous approaches.

Our study evaluates 36 diverse languages, using collections of paradigms represented orthographically. Thus, we are measuring the complexity of each written language. The corresponding spoken language would have a different complexity, based on the corresponding phonological forms. Importantly, our method does not depend upon a linguistic analysis of words into constituent morphemes (e.g., hoping ↦ hope+ing). We find support for the complexity trade-off hypothesis. Concretely, we show that the more unique forms an inflectional paradigm has, the more predictable the forms must be from one another—for example, forms in a predictable paradigm might all be related by a simple change of suffix. This intuition has a long history in the linguistics community, as field linguists have often noted that languages with extreme morphological richness, for example, agglutinative and polysynthetic languages, have virtually no exceptions or irregular forms. Our contribution lies in mathematically formulating this notion of regularity and providing a means to estimate it by fitting a probability model. Using these tools, we provide a quantitative verification of this conjecture on a large set of typologically diverse languages, which is significant with p < 0.037.

2 Morphological Complexity

2.1 Word-Based Morphology

We adopt the framework of word-based morphology (Aronoff, 1976; Spencer, 1991). An inflected lexicon in this framework is represented as a set of word types. Each word type is a triple of

• a lexeme ℓ (an arbitrary integer or string that indexes the word's core meaning and part of speech)
• a slot σ (an arbitrary integer or object that indicates how the word is inflected)
• a surface form w (a string over a fixed phonological or orthographic alphabet Σ)

A paradigm m is a map from slots to surface forms.1 We use dot notation to access elements of this map. For example, m.past denotes the past-tense surface form in paradigm m.

An inflected lexicon for a language can be regarded as defining a map M from lexemes to their paradigms. Specifically, M(ℓ).σ = w iff the lexicon contains the triple (ℓ, σ, w).2 For example, in the case of the English lexicon, if ℓ is the English lexeme walkVerb, then M(ℓ).past = walked. In linguistic terms, we say that in ℓ's paradigm M(ℓ), the past-tense slot is filled (or realized) by walked.

Nothing in our method requires a Bloomfieldian structuralist analysis that decomposes each word into underlying morphs; rather, this paper is a-morphous in the sense of Anderson (1992).

More specifically, we will work within the UniMorph annotation scheme (Sylak-Glassman, 2016). In the simplest case, each slot σ specifies a morphosyntactic bundle of inflectional features such as tense, mood, person, number, and gender. For example, the Spanish surface form pongas (from the lexeme poner 'to put') fills a slot that indicates that this word has the features [TENSE=PRESENT, MOOD=SUBJUNCTIVE, PERSON=2, NUMBER=SG]. We postpone a discussion of the details of UniMorph until §7.1, but it is mostly compatible with other, similar schemes.

2.2 Defining Complexity

Ackerman and Malouf (2013) distinguish two types of morphological complexity, which we elaborate on below. For a more general overview of morphological complexity, see Baerman et al. (2015).

2.2.1 Enumerative Complexity

The first type, enumerative complexity (e-complexity), measures the number of surface morphosyntactic distinctions that a language makes within a part of speech.

Given a lexicon, our present paper will measure the e-complexity of the verb system as the average of the verb paradigm size |M(ℓ)|, where ℓ ranges over all verb lexemes in domain(M). Importantly, we define the size |m| of a paradigm m to be the number of distinct surface forms in the paradigm, rather than the number of slots. That is, |m| is defined as |range(m)| rather than |domain(m)|.

Under our definition, nearly all English verb paradigms have size 4 or 5, giving the English verb system an e-complexity between 4 and 5. If m = M(walkVerb), then |m| = 4, since range(m) = {walk, walks, walked, walking}. The manually constructed lexicon may define separate slots σ1 = [TENSE=PRESENT, PERSON=1, NUMBER=SG] and σ2 = [TENSE=PRESENT, PERSON=2, NUMBER=SG], but in this paradigm, those slots are not distinguished by any morphological marking: m.σ1 = m.σ2 = walk. Nor is the past tense walked distinguished from the past participle. This phenomenon is known as syncretism.

Why might the creator of a lexicon choose to define two slots for a syncretic form, rather than a single merged slot? Perhaps because the slots are not always syncretic: in the example above, one English verb, be, does distinguish σ1 and σ2.3 But an English lexicon that did choose to merge σ1 and σ2 could handle be by adding extra slots that are used only with be. A second reason is that the merged slot might be inelegant to describe using the feature bundle notation: English verbs (other than be) have a single form shared by the bare infinitive and all present tense forms except 3rd-person singular, but a single slot for this form could not be easily characterized by a single feature bundle, and so the lexicon creator might reasonably split it for convenience. A third reason might be an attempt at consistency across languages: In principle, an English lexicon is free to use the same slots as Sanskrit and thus list dual and plural forms for every English noun, which just happen to be identical in every case (complete syncretism).

The point is that our e-complexity metric is insensitive to these annotation choices. It focuses on observable surface distinctions, and so does not care whether syncretic slots are merged or kept separate. Later, we will construct our i-complexity metric to have the same property.

The notion of e-complexity has a long history in linguistics. The idea was explicitly discussed as early as Sapir (1921). More recently, Sagot (2013) has referred to this concept as counting complexity, referencing the comparison of the complexity of creoles and non-creoles by McWhorter (2001). For a given part of speech, e-complexity appears to vary dramatically over the languages of the world. Whereas the regular English verb paradigm has 4–5 slots in our annotation, the Archi verb will have thousands (Kibrik, 1998). However, does this make the Archi system more complex, in the sense of being more difficult to describe or learn? Despite the plethora of forms, it is often the case that one can regularly predict one form from another, indicating that few forms actually have to be memorized for each lexeme.

1 See Baerman (2015, Part II) for a tour of alternative views of inflectional paradigms.
2 We assume that the lexicon never contains distinct triples of the form (ℓ, σ, w) and (ℓ, σ, w′), so that M(ℓ).σ has a unique value if it is defined at all.
3 This verb has a paradigm of size 8: {be, am, are, is, was, were, been, being}.
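To make the definitions in this subsection concrete, here is a minimal illustration (ours, not code from the paper) of computing paradigm size and e-complexity from a toy inflected lexicon; the slot labels and the dictionary representation are hypothetical conveniences.

```python
# Illustrative sketch (not from the paper): e-complexity is the average
# number of *distinct* surface forms per paradigm, i.e. |range(m)|.
from statistics import mean

# A toy inflected lexicon M: lexeme -> paradigm, where a paradigm maps
# slots (feature bundles) to surface forms.  Slot names are hypothetical.
M = {
    "walk_V": {"NFIN": "walk", "3SG.PRS": "walks", "PST": "walked",
               "PST.PTCP": "walked", "PRS.PTCP": "walking"},
    "eat_V":  {"NFIN": "eat", "3SG.PRS": "eats", "PST": "ate",
               "PST.PTCP": "eaten", "PRS.PTCP": "eating"},
}

def paradigm_size(paradigm):
    """|m| = number of distinct surface forms, not number of slots."""
    return len(set(paradigm.values()))

def e_complexity(lexicon):
    """Average paradigm size over all lexemes in the lexicon."""
    return mean(paradigm_size(m) for m in lexicon.values())

print(paradigm_size(M["walk_V"]))  # 4: 'walked' fills two syncretic slots
print(e_complexity(M))             # 4.5
```

Because only distinct forms are counted, merging or splitting syncretic slots leaves the result unchanged, which is exactly the annotation-insensitivity argued for above.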

2.2.2 Integrative Complexity

The second notion of complexity is integrative complexity (i-complexity), which measures how regular an inflectional system is on the surface. Students of a foreign language will most certainly have encountered the concept of an irregular verb. Pinning down a formal and workable cross-linguistic definition is non-trivial, but the intuition that some inflected forms are regular and others irregular dates back at least to Bloomfield (1933, pp. 273–274), who famously argued that what makes a surface form regular is that it is the output of a deterministic function. For an in-depth dissection of the subject, see Stolz et al. (2012).

Ackerman and Malouf (2013) build their definition of i-complexity on the information-theoretic notion of entropy (Shannon, 1948). Their intuition is that a morphological system should be considered complex to the extent that its forms are unpredictable. They say, for example, that the nominative singular form is unpredictable in a language if many nouns express it with suffix -o while many others use -∅. In §5, we will propose an improvement to their entropy-based measure.

2.3 The Low-Entropy Conjecture

The low-entropy conjecture, as formulated by Ackerman and Malouf (2013, p. 436), ''is the hypothesis that enumerative morphological complexity is effectively unrestricted, as long as the average conditional entropy, a measure of integrative complexity, is low.'' Indeed, Ackerman and Malouf go so far as to say that there need be no upper bound on e-complexity, but the i-complexity must remain sufficiently low (as is the case for Archi, for example). Our hypothesis is subtly different in that we postulate that morphological systems face a trade-off between e-complexity and i-complexity: a system may be complex under either metric, but not under both. The amount of e-complexity permitted is higher when i-complexity is low.

This line of thinking harks back to the equal complexity conjecture of Hockett, who stated: ''objective measurement is difficult, but impressionistically it would seem that the total grammatical complexity of any language, counting both the morphology and syntax, is about the same as any other'' (Hockett, 1958, pp. 180–181). Similar trade-offs have been found in other branches of linguistics (see Oh [2015] for a review). For example, there is a trade-off between rate of speech and syllable complexity (Pellegrino et al., 2011): This means that even though Spanish speakers utter many more syllables per second than Chinese speakers, the overall information rate is quite similar, as each Chinese syllable carries more information.

Hockett's equal complexity conjecture is controversial: some languages (such as Riau Indonesian) do seem low in complexity across morphology and syntax (Gil, 1994). This is why Ackerman and Malouf instead posit that a linguistic system has bounded integrative complexity—it must not be too high, though it can be low, as indeed it is in isolating languages like Chinese and Thai.

3 Paradigm Entropy

3.1 Morphology as a Distribution

Following Dreyer and Eisner (2009) and Cotterell et al. (2015), we identify a language's inflectional system with a probability distribution p(M = m) over possible paradigms.4 Our measure of i-complexity will be related to the entropy of this distribution. For instance, knowing the behavior of the English verb system essentially means knowing a joint distribution over 5-tuples of surface forms such as (run, runs, ran, run, running). More precisely, one knows probabilities such as p(M.pres = run, M.3s = runs, M.past = ran, M.pastp = run, M.presp = running).

We do not observe p directly, but each observed paradigm (5-tuple) can help us estimate it. We assume that the paradigms m in the inflected lexicon were drawn independently and identically distributed (IID) from p. Any novel verb paradigm in the future would be drawn from p as well. The distribution p represents the inflectional system because it describes what regular paradigms and plausible irregular paradigms tend to look like.

The fact that some paradigms are used more frequently than others (more tokens in a corpus) does not mean that they have higher probability under the morphological system p(m). Rather, their higher usage reflects the higher probability of their lexemes. That is due to unrelated factors—the probability of a lexeme may be modeled separately by a stick-breaking process (Dreyer and Eisner, 2011), or may reflect the semantic meaning associated to that lexeme. The role of p(m) in the model is only to serve as the base distribution from which a lexeme type ℓ selects the tuple of strings m = M(ℓ) that will be used thereafter to express ℓ.

We expect the system to place low probability on implausible paradigms: For example, p(run, . . . , run, running) is close to zero for implausible choices of the omitted forms. Moreover, we expect it to assign high conditional probability to the result of applying highly regular processes: For example, for p(M.presp | M.3s) in English, we have p(wugging | wugs) ≈ p(running | runs) ≈ 1, where wug is a novel verb. Nonetheless, our estimate of p(M.presp = w | M.3s = wugs) will have support over all w ∈ Σ∗, due to smoothing. The model is thus capable of evaluating arbitrary wug-formations (Berko, 1958), including irregular ones.

3.2 Paradigm Entropy

The distribution p gives rise to the paradigm entropy H(M), also written as H(p). This is the expected number of bits needed to represent a paradigm drawn from p, under a code that is optimized for this purpose. Thus, it may be related to the cost of learning paradigms or the cost of storing them in memory, and thus relevant to functional pressures that prevent languages from growing too complex. (There is no guarantee, of course, that human learners actually estimate the distribution p, or that its entropy actually represents the cognitive cost of learning or storing paradigms.)

Our definition of i-complexity in §5 will (roughly speaking) divide H(M) by the e-complexity, so that the i-complexity is measured in bits per distinct surface form. This approach is inspired by Ackerman and Malouf (2013); we discuss the differences in §6.

3.3 A Variational Upper Bound on Entropy

We now review how to estimate H(M) by estimating p by a model q. We do not actually know the true distribution p. Furthermore, even if we knew p, the definition of H(M) involves a sum over the infinite set of n-tuples (Σ∗)^n, which is intractable for most distributions p. Thus, following Brown et al. (1992), we will use a probability model to define a good upper bound for H(M) and held-out data to estimate that bound.

For any distribution p, the entropy H(p) is upper-bounded by the cross-entropy H(p, q), where q is any other distribution over the same space:5

  Σ_m p(m)[−log p(m)] ≤ Σ_m p(m)[−log q(m)]    (1)

(Throughout this paper, log denotes log2.) The gap between the two sides is the Kullback-Leibler divergence D(p || q), which is 0 iff p = q.

Maximum-likelihood training of a probability model q ∈ Q is an attempt to minimize this gap by minimizing the right-hand side. More precisely, it minimizes the sampling-based estimate Σ_m p̂_train(m)[−log q(m)], where p̂_train is the empirical distribution of a set of training examples that are assumed to be drawn IID from p.

Because the trained q may be overfit to the training examples, we must make our final estimate of H(p, q) using a separate set of held-out test examples, as Σ_m p̂_test(m)[−log q(m)]. We then use this as our (upwardly biased) estimate of the paradigm entropy H(p). In our setting, both the training and the test examples are paradigms from a given inflected lexicon.
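The held-out estimation recipe of this subsection can be sketched as follows. This is an illustration under our own assumptions: log2_q stands in for whatever routine returns log2 qθ(m) for a full paradigm m under the trained model; it is not the paper's actual interface.

```python
def paradigm_entropy_upper_bound(test_paradigms, log2_q):
    """Estimate of the right-hand side of equation (1): the average number of
    bits -log2 q(m) that the model spends on held-out paradigms m drawn
    (approximately) from p.  By (1), this upper-bounds the true H(M),
    up to sampling noise.  `log2_q` is a hypothetical callable."""
    total_bits = sum(-log2_q(m) for m in test_paradigms)
    return total_bits / len(test_paradigms)
```

A tighter model q (smaller KL divergence from p) yields a smaller value, so improvements to the model can only bring this estimate closer to the true paradigm entropy from above.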

4 Formally speaking, we assume a discrete sample space in which each outcome is a possible lexeme ℓ equipped with a paradigm M(ℓ). Recall that a random variable is technically defined as a function of the outcome. Thus, M is a paradigm-valued random variable that returns the whole paradigm. M.past is a string-valued random expression that returns the past slot, so p(M.past = ran) is a marginal probability that marginalizes over the rest of the paradigm.
5 The same applies for conditional entropies as used in §5.

Figure 1: A specific Spanish verb paradigm as it would be generated by two different tree-structured Bayesian networks. The nodes in each network represent the slots dictated by the paradigm's shape (not labeled). The topology in (a) predicts all forms from the lemma. The topology in (b), on the other hand, makes it easier to predict some of the forms given the others: pongas is predicted from pongo, with which it shares a stem. Qualitatively, the structure selection algorithm in §4.4 finds trees like (b).

4 A Generative Model of the Paradigm

To fit q given the training set, we need a tractable family Q of joint distributions over paradigms, with parameters θ. The structure of the model and the number of parameters θ will be determined automatically from the training set: A language with more slots overall or more paradigm shapes will require more parameters. This means that Q is technically a semi-parametric family.

4.1 Paradigm Shapes

We say that two paradigms m, m′ have the same shape if they define the same slots (that is, domain(m) = domain(m′)) and the same pairs of slots are syncretic in both paradigms (that is, m.σ = m.σ′ iff m′.σ = m′.σ′). Notice that paradigms of the same shape must have the same size (but not conversely). Most English verbs use one of 2 shapes: In 4-form verbs such as regular sprint and irregular stand, the past participle is syncretic with the past tense, whereas in irregular 5-form verbs such as eat, that is not so. There are also a few other English verb paradigm shapes: For example, run has only 4 distinct forms, but in its paradigm, the past participle is syncretic with the present tense. The verb be has a shape of its own, with 8 distinct forms. The extra slots needed for be might be either missing in other shapes, or present but syncretic.

Our model qθ says that the first step in generating a paradigm is to pick its shape s. This uses a distribution qθ(S = s), which we estimate by maximum likelihood from the training set. Thus, s ranges over the set S of shapes that appear in the training set.
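A paradigm shape in the sense of §4.1 can be represented concretely as the pair (set of slots, partition of slots into syncretism classes). The sketch below is our illustration with hypothetical slot names; it extracts such shapes and estimates qθ(S = s) by relative frequency in the training set, as described above.

```python
from collections import Counter

def paradigm_shape(paradigm):
    """A hashable shape: which slots exist, and which slots are forced to
    share a surface form (the syncretism pattern).  The shape ignores the
    forms themselves."""
    groups = {}
    for slot, form in paradigm.items():
        groups.setdefault(form, []).append(slot)
    partition = frozenset(frozenset(slots) for slots in groups.values())
    return (frozenset(paradigm), partition)

# Toy training paradigms: sprint (4-form shape) vs. eat (5-form shape).
train_paradigms = [
    {"NFIN": "sprint", "3SG.PRS": "sprints", "PST": "sprinted",
     "PST.PTCP": "sprinted", "PRS.PTCP": "sprinting"},
    {"NFIN": "eat", "3SG.PRS": "eats", "PST": "ate",
     "PST.PTCP": "eaten", "PRS.PTCP": "eating"},
]
shape_counts = Counter(paradigm_shape(m) for m in train_paradigms)
q_shape = {s: c / len(train_paradigms) for s, c in shape_counts.items()}
```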

4.2 A Tree-Structured Distribution

Next, conditioned on the shape s, we follow Cotterell et al. (2017b) and generate all the forms of the paradigm using a tree-structured Bayesian network—a directed graphical model in which the form at each slot is generated conditionally on the form at a single parent slot. Figure 1 illustrates two possible tree structures for Spanish verbs.

Each paradigm shape s has its own tree structure. If slot σ exists in shape s, we denote its parent in our shape-s model by pa_s(σ). Then our model is6

  qθ(m | s) = Π_{σ∈s} qθ(m.σ | m.pa_s(σ), S=s)    (2)

For the slot σ at the root of the tree, pa_s(σ) is defined to be a special slot empty with an empty feature bundle, whose form is fixed to be the empty string. In the product above, σ does not range over empty.

4.3 Neural Sequence-to-Sequence Model

We model all of the conditional probability factors in equation (2) using a neural sequence-to-sequence model with parameters θ. Specifically, we follow Kann and Schütze (2016) and use a long short-term memory–based sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) with attention (Bahdanau et al., 2015). This is the state of the art in morphological reinflection (i.e., the conversion of one inflected form to another; Cotterell et al., 2016).

For example, in German, qθ(M.nompl = Hände | M.nomsg = Hand, S = 3) is given by the probability that the seq2seq model assigns to the output sequence H ä n d e when given the input sequence

  H a n d S=3 IN=NOM IN=SG OUT=NOM OUT=PL

The input sequence indicates the parent slot (nominative singular) and the child slot (nominative plural), by using special characters to specify their feature bundles. This tells the seq2seq model what kind of transformation to do. The input sequence also indicates the paradigm shape s. Thus, we are able to use only a single seq2seq model, with parameters θ, to handle all of the conditional distributions in the entire model. Sharing parameters across conditional distributions is a form of multi-task learning and may improve generalization to held-out data.

As a special case, if σ and σ′ are syncretic within shape s, then we define qθ(M.σ = w | M.σ′ = w′, S = s) to be 1 if w = w′ and 0 otherwise. The seq2seq model is skipped in such cases: It is only used on non-syncretic parent-child pairs. As a result, if shape s has 5 slots that are all syncretic with one another, 4 of these slots can be derived by deterministic copying. As they are completely predictable, they contribute log 1 = 0 bits to the paradigm entropy. The method in the next section will always favor a tree structure that exploits copying. As a result, the extra 4 slots will not increase the i-complexity, just as they do not increase the e-complexity (recall §2.2.1).

We train the parameters θ on all non-syncretic slot pairs in the training set. Thus, a paradigm with n distinct forms contributes n² training examples: Each form in the paradigm is predicted from each of the n − 1 other forms, and from the empty form. We use maximum-likelihood training (see §7.2).

6 Below, we will define the factors so that the generated m does—usually—have shape s. We will ensure that if two slots are syncretic in shape s, then their forms are in fact equal in m. But non-syncretic slots will also have a (tiny) probability of equal forms, so the model qθ(m | s) is deficient—it sums to slightly < 1 over the paradigms m that have shape s.
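The input encoding just described can be mimicked with a small helper. This is our own sketch of the format implied by the example above; the exact special-symbol inventory used in the paper's experiments may differ.

```python
def reinflection_source(parent_form, parent_slot, child_slot, shape_id):
    """Build the character-level source sequence for the seq2seq model: the
    parent form spelled out character by character, followed by symbols
    naming the shape, the parent ("IN=") features, and the child ("OUT=")
    features."""
    tokens = list(parent_form)
    tokens.append(f"S={shape_id}")
    tokens += [f"IN={feat}" for feat in parent_slot]
    tokens += [f"OUT={feat}" for feat in child_slot]
    return tokens

print(" ".join(reinflection_source("Hand", ("NOM", "SG"), ("NOM", "PL"), 3)))
# H a n d S=3 IN=NOM IN=SG OUT=NOM OUT=PL
```

Because the feature symbols fully specify which conditional distribution is being queried, a single set of parameters θ can serve every edge of every tree.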
4.4 Structure Selection

Given a model qθ, we can decompose its entropy H(qθ) into a weighted sum of conditional entropies

  H(M) = H(S) + Σ_{s∈S} p(S=s) H(M | S=s)    (3)

where

  H(M | S=s) = Σ_{σ∈s} H(M.σ | M.pa_s(σ), S=s)    (4)

The cross-entropy H(p, qθ) has a similar decomposition. The only difference is that all of the (conditional) entropies are replaced by (conditional) cross-entropies, meaning that they are estimated using a held-out sample from p rather than qθ. The log-probabilities are still taken from qθ.

It follows that, given a fixed θ (as trained in the previous section), we can minimize H(p, qθ) by choosing the tree for each shape s that minimizes the cross-entropy version of equation (4). How?

For each shape s, we select the minimum-weight directed spanning tree over the n slots used by that shape, as computed by the Chu-Liu-Edmonds algorithm (Edmonds, 1967).7 The weight of each potential directed edge σ′ → σ is the conditional cross-entropy H(M.σ | M.σ′, S=s) under the seq2seq model trained in the previous section, so equation (4) implies that the weight of a tree is the cross-entropy we would get by selecting that tree.8 In practice, we estimate the conditional cross-entropy for the non-syncretic slot pairs using a held-out development set (not the test set). For syncretic slot pairs, which are handled by copying, the conditional cross-entropy is always 0, so edges between syncretic slots can be selected free of cost.

After selecting the tree, we could retrain the seq2seq parameters θ to focus on the conditional distributions we actually use, training on only the slot pairs in each training paradigm that correspond to a tree edge in the model of that paradigm's shape. Our experiments in §7 omitted this step. But in fact, training on all n² pairs may even have found a better θ: It can be seen as a form of multi-task regularization (available also to human learners).

7 Similarly, Chow and Liu (1968) find the best tree-structured undirected graphical model by computing the max-weighted undirected spanning tree. We need a directed model instead because §4.3 provides conditional distributions.
8 Where the weight of the tree is taken to include the weight of the special edge empty → σ to the root node σ. Thus, for each slot σ, the weight of empty → σ is the cost of selecting σ as the root. It is an estimate of H(M.σ | S=s), the difficulty of predicting the σ form without any parent. In the implementation, we actually decrement the weight of every edge σ′ → σ (including when σ′ = empty) by the weight of empty → σ. This does not change the optimal tree, because it does not change the relative weights of the possible parents of σ. However, it ensures that every σ now has root cost 0, as required by the Chu-Liu-Edmonds algorithm (which does not consider root costs). Notice that because H(X) − H(X | Y) = I(X; Y), the decremented weight is actually an estimate of −I(M.σ; M.σ′). Thus, finding the min-weight tree is equivalent to finding the tree that maximizes the total mutual information on the edges, just like the Chow-Liu algorithm (Chow and Liu, 1968).
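Here is a sketch of the structure-selection step, assuming the networkx library is available for the Chu-Liu-Edmonds computation (an assumption of this sketch, not a statement about the paper's implementation). Edge weights are the held-out conditional cross-entropies described above; adding an explicit empty node with no incoming edges forces it to be the root, so it plays the role of the special empty slot.

```python
import networkx as nx

def select_tree(slots, xent):
    """Pick the parent of each slot for one shape s.

    `xent[(parent, child)]` holds a held-out estimate, in bits, of the
    conditional cross-entropy H(M.child | M.parent, S=s); with parent ==
    "empty" it is the cost H(M.child | S=s) of generating the form from
    scratch.  Returns a dict child -> parent."""
    g = nx.DiGraph()
    for child in slots:
        g.add_edge("empty", child, weight=xent[("empty", child)])
        for parent in slots:
            if parent != child:
                g.add_edge(parent, child, weight=xent[(parent, child)])
    # "empty" has no incoming edges, so every spanning arborescence must be
    # rooted there; Chu-Liu/Edmonds then minimizes the total edge weight,
    # i.e. the cross-entropy version of equation (4).
    tree = nx.minimum_spanning_arborescence(g)
    return {child: parent for parent, child in tree.edges()}
```

Usage is per shape: for German nouns, an edge NOM.PL → DAT.PL would typically be cheap (add -n) and would be kept, while DAT.PL would not be attached directly to the lemma.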

5 From Paradigm Entropy to i-Complexity

Having defined a way to approximate paradigm entropy, H(M), we finally operationalize our measure of i-complexity for a language.

One Paradigm Shape. We start with the simple case where the language has a single paradigm shape: S = {s}. Our initial idea was to define i-complexity as bits per form, H(M) / |s|, where |s| is the enumerative complexity—the number of distinct forms in the paradigm.

However, H(M) reflects not only the language's morphological complexity, but also its ''lexical complexity.'' Some of the bits needed to specify a lexeme's paradigm m are necessary merely to specify the stem. A language whose stems are numerous or highly varied will tend to have higher H(M), but we do not wish to regard it as morphologically complex simply on that basis. We can decompose H(M) into

  H(M) = H(M.σ̌) + H(M | M.σ̌)    (5)

where the first term is the lexical entropy and the second is the morphological entropy, and where σ̌ denotes the single lowest-entropy slot,

  σ̌ := argmin_σ H(M.σ)    (6)

and we estimate H(M.σ) for any σ using the seq2seq distribution qθ(M.σ = w | M.empty = ε), which can be regarded as a model for generating forms of slot σ from scratch.

We will refer to σ̌ as the lemma because it gives in some sense the simplest form of the lexeme, although it is not necessarily the slot that lexicographers use as the citation form for the lexeme.

We now define i-complexity as the entropy per form when predicting the remaining forms of M from the lemma:

  H(M | M.σ̌) / (|s| − 1)    (7)

where the numerator can be obtained by subtraction via equation (5). This is a fairer representation of the morphological irregularity (e.g., the average difficulty in predicting the inflectional ending that is added to a given stem). Notice that if |s| = 1 (an isolating language), the morphological complexity is appropriately undefined, since no inflectional endings are ever added to the stem.

If we had allowed the lexical entropy H(M.σ̌) to remain in the numerator, then a language with larger e-complexity |s| would have amortized that term over more forms—meaning that larger e-complexity would have tended to lead to lower i-complexity, other things equal. By removing that term from the numerator, our definition (7) eliminates this as a possible reason for the observed trade-off between e-complexity and i-complexity.

Multiple Paradigm Shapes. Now, we consider the more general case where multiple paradigm shapes are allowed: |S| ≥ 1. Again we are interested in the entropy per non-lemma form. The i-complexity is

  [ H(S) + Σ_s p(S=s) H(M | M.σ̌(s), S=s) ] / [ Σ_s p(S=s) (|s| − 1) ]    (8)

where

  σ̌(s) := argmin_σ H(M.σ | S=s)    (9)

In the case where |s| and σ̌(s) are constant over all S, this reduces to equation (7). This is because the numerator is essentially an expanded formula for the conditional entropy in (7)—the only wrinkle is that different parts of it condition on different slots.

To estimate equation (8) using a trained model q and a held-out test set, we follow §3.3 by estimating all −log p(···) terms in the entropies with our model surprisals −log q(···), but using the empirical probabilities on the test set for all other p(···) terms including p(S=s). Suppose the test set paradigms are m1, ..., mN with shapes s1, ..., sN respectively. Then taking q = qθ, our final estimate of the i-complexity (8) works out to

  [ Σ_{i=1}^{N} ( −log q(S=s_i) − log q(M=m_i | S=s_i) + log q(M.σ̌(s_i) = m_i.σ̌(s_i) | S=s_i) ) ] / [ Σ_{i=1}^{N} (|s_i| − 1) ]    (10)

where we have multiplied both the numerator and denominator by N. In short, the denominator is the total number of non-lemma forms in the test set, and the numerator is the total number of bits that our model needs to predict these forms (including the paradigm shapes s_i) given the lemmas. The numerator of equation (10) is an upper bound on the numerator of equation (8) since it uses (conditional) cross-entropies rather than (conditional) entropies.
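Equation (10) can be turned into code directly. In this sketch the three log2_q_* callables are placeholders for model quantities (shape probability, probability of the whole paradigm given the shape, and probability of the lemma form given the shape); they are hypothetical hooks, not the paper's API.

```python
def i_complexity(test_paradigms, shape_of, lemma_slot_of,
                 log2_q_shape, log2_q_forms_given_shape,
                 log2_q_lemma_given_shape):
    """Estimate of equation (10): total bits to predict shapes and non-lemma
    forms of held-out paradigms, divided by the number of non-lemma forms."""
    num_bits, num_forms = 0.0, 0
    for m in test_paradigms:
        s = shape_of(m)
        lemma_slot = lemma_slot_of(s)                 # lowest-entropy slot, eq. (9)
        num_bits += -log2_q_shape(s)                  # -log q(S = s_i)
        num_bits += -log2_q_forms_given_shape(m, s)   # -log q(M = m_i | S = s_i)
        num_bits += log2_q_lemma_given_shape(m[lemma_slot], s)  # + log q(lemma | s_i)
        num_forms += len(set(m.values())) - 1         # |s_i| - 1 non-lemma forms
    return num_bits / num_forms
```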

6 A Methodological Comparison to Ackerman and Malouf (2013)

Our formulation of the low-entropy principle differs somewhat from Ackerman and Malouf (2013); the differences are highlighted below.

Heuristic Approximation to p. Ackerman and Malouf (2013) first construct what we regard as a heuristic approximation to the joint distribution p over forms in a paradigm. They provide a heuristically chosen candidate set of potential exponents. Then, they consider a distribution r(m.σ | m.σ′) that selects among those forms. In contrast to our neural sequence-to-sequence approach, this distribution unfortunately does not have support over Σ∗ and, thus, cannot consider changes other than substitution of morphological exponents.

As a concrete example of r, consider Table 1's (simplified) Modern Greek example from Ackerman and Malouf (2013). The conditional distribution r(m.gen;sg | m.acc;pl = ...-i) over genitive singular forms is peaked because there is exactly one possible transformation: Substituting -us for -i. Other conditional distributions for Modern Greek are less peaked: Ackerman and Malouf (2013) estimated that r(m.nom;sg | m.acc;pl = ...-a) swaps -a for -∅ with probability 2/3 and for -o with probability 1/3. We reiterate that no other output has positive probability under their model, for example, swapping -a for -es or ablaut of a stem vowel. In contrast, our p allows arbitrary irregulars (§6.1).

Average Conditional Entropy. The second difference is their use of the pairwise conditional entropies between cells. They measure the complexity of the entire paradigm by the average conditional entropy:

  (1 / (n² − n)) Σ_σ Σ_{σ′≠σ} H(M.σ | M.σ′)    (11)

This differs from our tree-based measure, in which an irregular form only needs to be derived from its parent—possibly a similar or even syncretic irregular form—rather than from all other forms in the paradigm. So it ''only needs to pay once'' and it even ''shops around for the cheapest deal.'' Also, in our measure, the lemma does not ''pay'' at all.

Ackerman and Malouf measure conditional entropies, which are simple to compute because their model is simple. (Again, it only permits a small number of possible outputs for each input, based on the finite set of allowed substitutions that they annotated by hand.) In contrast, our estimate uses conditional cross-entropies, asking whether our q can predict real held-out forms distributed according to p.

                 SINGULAR                        PLURAL
CLASS     NOM    GEN    ACC    VOC        NOM    GEN    ACC    VOC
1         -os    -u     -on    -e         -i     -on    -us    -i
2         -s     -∅     -∅     -∅         -es    -on    -es    -es
3         -∅     -s     -∅     -∅         -es    -on    -es    -es
4         -∅     -s     -∅     -∅         -is    -on    -is    -is
5         -o     -u     -o     -o         -a     -on    -a     -a
6         -∅     -u     -∅     -∅         -a     -on    -a     -a
7         -os    -us    -os    -os        -i     -on    -i     -i
8         -∅     -os    -∅     -∅         -a     -on    -a     -a

Table 1: Structuralist analysis of Modern Greek nominal inflection classes (Ralli, 1994, 2002).
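For concreteness, the r distributions and the average conditional entropy of equation (11) can be computed from an exponent table like Table 1. The sketch below treats the inflection classes as equally frequent, which matches the simplified example in §6 but is an assumption of this illustration rather than a claim about Ackerman and Malouf's full method; only two of the eight classes are spelled out.

```python
import math
from collections import defaultdict
from itertools import permutations

classes = {  # class -> (slot -> exponent), following Table 1
    1: {"NOM.SG": "-os", "GEN.SG": "-u",  "ACC.SG": "-on", "VOC.SG": "-e",
        "NOM.PL": "-i",  "GEN.PL": "-on", "ACC.PL": "-us", "VOC.PL": "-i"},
    2: {"NOM.SG": "-s",  "GEN.SG": "-0",  "ACC.SG": "-0",  "VOC.SG": "-0",
        "NOM.PL": "-es", "GEN.PL": "-on", "ACC.PL": "-es", "VOC.PL": "-es"},
    # ... the remaining six classes would be filled in the same way
}

def conditional_entropy(child, parent):
    """H(exponent at `child` | exponent at `parent`), classes equiprobable."""
    joint = defaultdict(lambda: defaultdict(float))
    for row in classes.values():
        joint[row[parent]][row[child]] += 1.0 / len(classes)
    h = 0.0
    for child_probs in joint.values():
        p_parent = sum(child_probs.values())
        for p in child_probs.values():
            h -= p * math.log2(p / p_parent)
    return h

slots = list(next(iter(classes.values())))
pairs = list(permutations(slots, 2))                     # n^2 - n ordered pairs
avg_cond_entropy = sum(conditional_entropy(c, p) for p, c in pairs) / len(pairs)
```

Note that every pair of cells contributes to this average, whereas the tree-based measure of §4.4 charges each cell only for its single cheapest parent.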

6.1 Critique of Ackerman and Malouf (2013)

Now, we offer a critique of Ackerman and Malouf (2013) on three points: (i) different linguistic theories dictating how words are subdivided into morphemes may offer different results, (ii) certain types of morphological irregularity, particularly suppletion, aren't handled, and (iii) average conditional entropy overestimates the i-complexity in comparison to joint entropy.

Theory-Dependent Complexity. We consider a classic example from English phonology that demonstrates the effect of the specific analysis chosen. In regular English plural formation, the speaker has three choices: [z], [s], and [ɪz]. Here are two potential analyses. One could treat this as a case of pure allomorphy with three potential, unrelated suffixes. Under such an analysis, the entropy will reflect the empirical frequency of the three possibilities found in some data set: roughly, −(1/4 log 1/4 + 3/8 log 3/8 + 3/8 log 3/8) ≈ 1.56127 bits. On the other hand, if we assume a different model with a unique underlying affix /z/, which is attached and then converted to either [z], [s], or [ɪz] by an application of perfectly regular phonology, this part of the morphological system of English has entropy of 0—one choice. See Kenstowicz (1994, p. 72) for a discussion of these alternatives from a theoretical standpoint. Note that our goal is not to advocate for one of these analyses, but merely to suggest that Ackerman and Malouf (2013)'s quantity is analysis-dependent.9 In contrast, our approach is theory-agnostic in that we jointly learn surface-to-surface transformations, reminiscent of a-morphous morphology (Anderson, 1992), and thus our estimate of paradigm entropy does not suffer this drawback. Indeed, our assumptions are limited—recurrent neural networks are universal approximators. It has been shown that any computable function can be computed by some finite recurrent neural network (Siegelmann and Sontag, 1991, 1995). Thus, the only true assumption we make of morphology is mild: We assume it is Turing-computable. That behavior is Turing-computable is a rather fundamental tenet of cognitive science (McCulloch and Pitts, 1943; Sobel and Li, 2013).

In our approach, theory dependence is primarily introduced through the selection of slots in our paradigms, which is a form of bias that would be present in any human-derived set of morphological annotations. A key example of this is the way in which different annotators or annotation standards may choose to limit or expand syncretism—situations where the same string-identical form may fill multiple different paradigm slots. For example, Finnish has two accusative inflections for nouns and adjectives, one always coinciding in form with the nominative and the other coinciding with the genitive. Many grammars therefore omit these two slots in the paradigm entirely, although some include them. Depending on which linguistic choice annotators make, the language could appear to have more or fewer paradigm slots. We have carefully defined our e-complexity and i-complexity metrics so that they are not sensitive to these choices.

9 Bonami and Beniamine (2016) have made a similar point (Rob Malouf, personal communication). Other proposed morphological complexity metrics have relied on a similar assumption (e.g., Bane, 2008).
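The allomorphy entropy figure quoted above can be checked directly (a small verification of the arithmetic, not code from the paper):

```python
import math

probs = [1/4, 3/8, 3/8]                  # empirical frequencies of the three allomorphs
h = -sum(p * math.log2(p) for p in probs)
print(h)                                  # ~1.5613 bits, the value cited in the text
```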

As a second example of annotation dependence, different linguistic theories might disagree about which distinctions constitute productive inflectional morphology, and which are derivational or even fixed lexical properties. For example, our dataset for Turkish treats causative verb forms as derivationally related lexical items. The number of apparent slots in the Turkish inflectional paradigms is reduced because these forms were excluded.

Morphological Irregularity. A second problem with the model in Ackerman and Malouf (2013) is its inability to treat certain kinds of irregularity, particularly cases of suppletion. As far as we can tell, the model is incapable of evaluating cases of morphological suppletion unless they are explicitly encoded in the model. Consider, again, the case of the English suppletive past tense form went—if one's analysis of the English past tense is effectively a distribution over the choices add [d], add [t], and add [ɪd], one will assign probability 0 to went as the past tense of go. We highlight the importance of this point because suppletive forms are certainly very common in academic English: the plural of binyan is binyanim and the plural of lemma is lemmata. It is unlikely that native English speakers possess even a partial model of Hebrew and Greek nominal morphology—a more plausible scenario is simply that these forms are learned by rote. As speakers and hearers are capable of producing and understanding these forms, we should demand the same capacity of our models. Not doing so also ties into the point in the previous section about theory-dependence, since it is ultimately the linguist—supported by some theoretical notion—who decides which forms are deemed irregular and hence left out of the analysis. We note that these restrictive assumptions are relatively common in the literature; for example, Allen and Becker (2015)'s sublexical learner is likewise incapable of placing probability mass on irregulars.10

Average Conditional Entropy versus Joint Entropy. Finally, we take issue with the formulation of paradigm entropy as average conditional entropy, as exhibited in equation (11). For one, it does not correspond to the entropy of any actual joint distribution p(M), and has no obvious mathematical interpretation. Second, it is Priscian (Robins, 2013) in its analysis, in that any form can be generated from any other, which, in practice, will cause it to overestimate the i-complexity of a morphological system. Consider the German dative plural Händen (from the German Hand 'hand'). Predicting this form from the nominative singular Hand is difficult, but predicting it from the nominative plural Hände is trivial: just add the suffix -n. In Ackerman and Malouf (2013)'s formulation, r(Händen | Hand) and r(Händen | Hände) both contribute to the paradigm's entropy, with the former substantially raising the quantity. Our method in §4.4 is able to select the second term and regard Händen as predictable once Hände is in hand.

7 Experiments

Our experimental design is now fairly straightforward: plot e-complexity versus i-complexity over as many languages as possible. We then devise a numerical test of whether the complexity trade-off conjecture (§1) appears to hold.

7.1 Data and UniMorph Annotation

At the moment, the largest source of annotated full paradigms is the UniMorph dataset (Sylak-Glassman et al., 2015; Kirov et al., 2018), which contains data that have been extracted from Wiktionary, as well as other morphological lexica and analyzers, and then converted into a universal format. A partial subset of UniMorph has been used in the running of the SIGMORPHON-CoNLL 2017 and 2018 shared tasks on morphological inflection generation (Cotterell et al., 2017a, 2018b).

We use verbal paradigms from 33 typologically diverse languages, and nominal paradigms from 18 typologically diverse languages. We only considered languages that had at least 700 fully annotated verbal or nominal paradigms: the neural methods we deploy require a large amount of annotated training data to achieve high performance,11 which makes them difficult to use in a lower-resource scenario.

10 In the computer science literature, it is far more common to construct distributions with support over Σ∗ (Paz, 2003; Bouchard-Côté et al., 2007; Dreyer et al., 2008; Cotterell et al., 2014), which do not have this problem.
11 Focusing on data-rich languages should also help mitigate sample bias caused by variable-sized dictionaries in our database. In many languages, irregular words are also very frequent and may be more likely to be included in a dictionary first. If that's the case, smaller dictionaries might have lexical statistics skewed toward irregulars more so than larger dictionaries. In general, larger dictionaries should be more representative samples of a language's broader lexicon.

To estimate a language's e-complexity (§2.2.1), we average over all paradigms in the UniMorph inflected lexicon.

To estimate i-complexity, we first partition those paradigms into training, development, and test sets. We identify the paradigm shapes from the training set (§4.1). We also use the training set to train the parameters θ of our conditional distribution (§4.3), then estimate conditional entropies on the development set and use Edmonds's algorithm to select a global model structure for each shape (§4.4). Now we evaluate i-complexity on the test set (equation (10)). Using held-out test data gives an unbiased estimate of a model's predictive ability, which is why it is standard practice in statistical NLP, though less common in quantitative linguistics.

7.2 Experimental Details

We experiment separately on nominal and verbal paradigms. For i-complexity, we hold out at random 50 full paradigms for the development set, and 50 other full paradigms for the test set.

For comparability across languages, we tried to ensure a ''standard size'' for the training set Dtrain. We sampled it from the remaining data using two different designs, to address the fact that different languages have different-size paradigms.

Equal Number of Paradigms (''purple scheme''). In the first regime, Dtrain (for each language) is derived from 600 randomly chosen non-held-out paradigms m. We trained the reinflection model in §4.4 on all non-syncretic pairs within these paradigms, as described in §4.3. This disadvantages languages with small paradigms, as they train on fewer pairs.

Equal Number of Pairs (''green scheme''). In the second regime, we trained the reinflection model in §4.4 on 60,000 non-syncretic pairs (m.σ′, m.σ) (where σ′ may be empty) sampled without replacement from the non-held-out paradigms.12 This matches the amount of training data, but may disadvantage languages with large paradigms, since the reinflection model will see fewer examples of any individual mapping between paradigm slots. We call this the ''green scheme.''

Model and Training Details. We train the seq2seq-with-attention model using the OpenNMT toolkit (Klein et al., 2017). We largely follow the recipe given in Kann and Schütze (2016), the winning submission on the 2016 SIGMORPHON shared task for inflectional morphology. Accordingly, we use a character embedding size of 300, and 100 hidden units in both the encoder and decoder. Our gradient-based optimization method was AdaDelta (Zeiler, 2012) with a minibatch size of 80. We trained for 20 epochs, which yielded 20 candidate models for early stopping. We selected the model that achieved the highest average log p(m.σ | m.σ′) on (σ′, σ) pairs from the development set.

12 For a few languages, fewer than 60,000 pairs were available, in which case we used all pairs.
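The checkpoint-selection criterion in the last paragraph can be sketched as follows; score_pair is a hypothetical hook into whichever toolkit computes log p(m.σ | m.σ′) for a (source, target) pair, not an OpenNMT API call.

```python
def select_checkpoint(checkpoints, dev_pairs, score_pair):
    """Among the saved per-epoch models, keep the one with the highest
    average log-probability of held-out development (source, target) pairs.
    `score_pair(model, src, tgt)` is assumed to return log p(tgt | src)."""
    def avg_loglik(model):
        return sum(score_pair(model, src, tgt) for src, tgt in dev_pairs) / len(dev_pairs)
    return max(checkpoints, key=avg_loglik)
```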

8 Results and Analysis

Our results are plotted in Figure 2, where each dot represents a language. We see little difference between the green and the purple training sets, though it was not clear a priori that this would be so.

The plots appear to show a clear trade-off between i-complexity and e-complexity. We now provide quantitative support for this impression by constructing a statistical significance test. Visually, our low-entropy trade-off conjecture boils down to the claim that languages cannot exist in the upper right-hand corner of the graph, that is, they cannot have both high e-complexity and high i-complexity. In other words, the upper right-hand corner of the graph is ''emptier'' than it would be by chance.

How can we quantify this? The Pareto curve for a multi-objective optimization problem shows, for each x, the maximum value y of the second objective that can be achieved while keeping the first objective ≥ x (and vice-versa). This is shown in Figure 2 as a step curve, showing the maximum i-complexity y that was actually achieved for each level x of e-complexity. This curve is the tightest non-increasing function that upper-bounds all of the observed points: We have no evidence from our sample of languages that any language can appear above the curve.

We say that the upper right-hand corner is ''empty'' to the extent that the area under the Pareto curve is small. To ask whether it is indeed emptier than would be expected by chance, we perform a nonparametric permutation test that destroys the claimed correlation between the e-complexity and i-complexity values. From our observed points {(x1, y1), ..., (xm, ym)}, we can stochastically construct a new set of points {(x1, yσ(1)), ..., (xm, yσ(m))} where σ is a permutation of 1, 2, ..., m selected uniformly at random. The resulting scatterplot is what we would expect under the null hypothesis of no correlation. Our p-value is the probability that the new scatterplot has an even emptier upper right-hand corner—that is, the probability that the area under the null-hypothesis Pareto curve is less than or equal to the area under the actually observed Pareto curve. We estimate this probability by constructing 10,000 random scatterplots.

In the purple training scheme, we find that the upper right-hand corner is significantly empty, with p < 0.021 and p < 0.037 for the verbal and nominal paradigms, respectively. In the green training scheme, we find that the upper right-hand corner is significantly empty with p < 0.032 and p < 0.024 for the verbal and nominal paradigms, respectively.
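The permutation test described above can be sketched as follows (our own implementation choices, not the paper's code). We integrate the area from x = 0 to the largest observed e-complexity; the exact lower limit does not affect the comparison, because the x values are held fixed across permutations.

```python
import random

def pareto_area(points):
    """Area under the step curve f(x) = max{ y_i : x_i >= x }, the tightest
    non-increasing upper bound on the (e-complexity, i-complexity) points."""
    pts = sorted(points)                               # ascending in x
    suffix_max = [y for _, y in pts]
    for k in range(len(suffix_max) - 2, -1, -1):       # max y over points to the right
        suffix_max[k] = max(suffix_max[k], suffix_max[k + 1])
    area, prev_x = 0.0, 0.0
    for (x, _), height in zip(pts, suffix_max):
        area += (x - prev_x) * height
        prev_x = x
    return area

def permutation_p_value(points, trials=10000, seed=0):
    """P(area under a null-hypothesis Pareto curve <= observed area), where
    the null hypothesis shuffles the i-complexity values across languages."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    observed = pareto_area(points)
    hits = sum(
        pareto_area(list(zip(xs, rng.sample(ys, len(ys))))) <= observed
        for _ in range(trials)
    )
    return hits / trials
```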

Figure 2: The x-axis is our measure of e-complexity, the average number of distinct forms in a paradigm. The y-axis is our estimate of i-complexity, the average bits per distinct non-lemma form. We overlay purple and green graphs (§7.2): to obtain the y coordinate, all the languages are trained on the same number of paradigms (purple scheme) or on the same number of slot pairs (green scheme). The purple curve is the Pareto curve for the purple points, and the area under it is shaded in purple; similarly for green. The languages are labeled with their two-letter ISO 639-1 codes in white text inside the colored circles.

9 Future Directions

Frequency. Ackerman and Malouf hypothesized that i-complexity is bounded, and we have demonstrated that the bounds are stronger when e-complexity is high. This suggests further investigation as to where in the language these bounds apply. Such bounds are motivated by the notion that naturally occurring languages must be learnable. Presumably, languages with large paradigms need to be regular overall, because in such a language, the average word type is observed too rarely for a learner to memorize an irregular surface form for it. Yet even in such a language, some word types are frequent, because some lexemes and some slots are especially useful. Thus, if learnability of the lexicon is indeed the driving force,13 then we should make the finer-grained prediction that irregularity may survive in the more frequently observed word types, regardless of paradigm size. Rarer forms are more likely to be predictable—meaning that they are either regular, or else irregular in a way that is predictable from a related frequent irregular (Cotterell et al., 2018a).

13 Rather than, say, description length of the lexicon (Rissanen and Ristad, 1994).

Dynamical models. We could even investigate directly whether patterns of morphological irregularity can be explained by the evolution of language through time.

Languages may be shaped by natural selection or, more plausibly, by noisy transmission from each generation to the next (Hare and Elman, 1995; Smith et al., 2008), in a natural communication setting where each learner observes some forms more frequently than others. Are naturally occurring inflectional systems more learnable (at least by machine learning algorithms) than would be expected by chance? Do artificial languages with unusual properties (for example, unpredictable rare forms) tend to evolve into languages that are more typologically natural?

We might also want to study whether children's morphological systems increase in i-complexity as they approach the adult system. Interestingly, this definition of i-complexity could also explain certain issues in first language acquisition, where children often overregularize (Pinker and Prince, 1988): They impose the regular pattern on irregular verbs, producing forms like runned instead of ran. Children may initially posit an inflectional system with lower i-complexity, before converging on the true system, which has higher i-complexity.

Phonology Plus Orthography. A human learner of a written language also has access to phonological information that could affect predictability. One could, for example, jointly model all the written and spoken forms within each paradigm, where the Bayesian network may sometimes predict a spoken slot from a written slot or vice-versa.

Moving Beyond the Forms. The complexity of morphological inflection is only a small bit of the larger question of morphological typology. We have left many bits unexplored. In this paper, we have predicted orthographic forms from morphosyntactic feature bundles. Ideally, we would like to also predict which morphosyntactic bundles are realized as words within a language, and which bundles are syncretic. That is, what paradigm shapes are plausible or implausible?

In addition, our current treatment depends upon a paradigmatic treatment of morphology, which is why we have focused on inflectional morphology. In contrast, derivational morphology is often viewed as syntagmatic.14 Can we devise a quantitative formulation of derivational complexity—for example, extending to polysynthetic languages?

10 Conclusions

We have provided clean mathematical formulations of enumerative and integrative complexity of inflectional systems, using tools from generative modeling and deep learning. With an empirical study on noun and verb systems in 36 typologically diverse languages, we have exhibited a Pareto-style trade-off between the e-complexity and i-complexity of morphological systems. In short, a morphological system can mark a large number of morphosyntactic distinctions, as Finnish, Turkish, and other agglutinative and polysynthetic languages do; or it may have a high level of unpredictability (irregularity); or neither.15 But it cannot do both.

The NLP community often focuses on e-complexity and views a language as morphologically complex if it has a profusion of unique forms, even if they are very predictable. The reason is probably our habit of working at the word level, so that all forms not found in the training set are out-of-vocabulary (OOV). Indeed, NLP practitioners often use high OOV rates as a proxy for defining morphological complexity. However, as NLP moves to the character level, we need other definitions of morphological richness. A language like Hungarian, with almost perfectly predictable morphology, may be easier to process than a language like German, with an abundance of irregularity.

Acknowledgments

This material is based upon work supported in part by the National Science Foundation under grant no. 1718846. The first author was supported by a Facebook Fellowship. We want to thank Rob Malouf for providing extensive and very helpful feedback on multiple versions of the paper. However, the opinions in this paper are our own: Our acknowledgment does not constitute an endorsement by Malouf. We would also like to thank the anonymous reviewers along with action editor Chris Dyer and editor-in-chief Lillian Lee.

14 For paradigmatic treatments of derivational morphology, see Cotterell et al. (2017c) for a computational perspective and the references therein for theoretical perspectives.
15 Carstairs-McCarthy (2010) has pointed out that languages need not have morphology at all, though they must have phonology and syntax.

References

Farrell Ackerman and Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language, 89(3):429–464.

Blake Allen and Michael Becker. 2015. Learning alternations from surface forms with sublexical phonology. Unpublished manuscript, University of British Columbia and Stony Brook University. Available as https://ling.auf.net/lingbuzz/002503.

Stephen R. Anderson. 1992. A-Morphous Morphology, volume 62. Cambridge University Press.

Mark Aronoff. 1976. Word Formation in Generative Grammar. Number 1 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

Matthew Baerman. 2015. The Oxford Handbook of Inflection. Oxford Handbooks in Linguistics. Oxford University Press. (Part II: Paradigms and their Variants.)

Matthew Baerman, Dunstan Brown, and Greville G. Corbett. 2015. Understanding and measuring morphological complexity: An introduction. In Matthew Baerman, Dunstan Brown, and Greville G. Corbett, editors, Understanding and Measuring Morphological Complexity. Oxford University Press.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Max Bane. 2008. Quantifying and measuring morphological complexity. In Proceedings of the 26th West Coast Conference on Formal Linguistics, pages 69–76.

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2-3):150–177.

Leonard Bloomfield. 1933. Language. University of Chicago Press. Reprint edition (October 15, 1984).

Olivier Bonami and Sarah Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure, 9(2):156–182.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 887–896, Prague.

Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Andrew Carstairs-McCarthy. 2010. The Evolution of Morphology, volume 14. Oxford University Press.

C. K. Chow and Cong N. Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner. 2018a. On the diachronic stability of irregularity in inflectional morphology. arXiv preprint arXiv:1804.08262v1.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018b. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017a. The CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 625–630, Baltimore, MD.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics (TACL), 3:433–447.

Ryan Cotterell, John Sylak-Glassman, and Christo Kirov. 2017b. Neural graphical models over strings for principal parts morphological paradigm completion. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 759–765, Valencia.

Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, and David Yarowsky. 2017c. Paradigm completion for derivational morphology. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 725–731, Copenhagen.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 101–110, Singapore.

Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 616–627, Edinburgh.

Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1080–1089, Honolulu, HI.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240.

David Gil. 1994. The structure of Riau Indonesian. Nordic Journal of Linguistics, 17(2):179–200.

Mary Hare and Jeffrey L. Elman. 1995. Learning and morphological change. Cognition, 56(1):61–98.

Charles F. Hockett. 1958. A Course in Modern Linguistics. The MacMillan Company.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 555–560, Berlin.

Michael J. Kenstowicz. 1994. Phonology in Generative Grammar. Blackwell, Oxford.

Aleksandr E. Kibrik. 1998. Archi (Caucasian—Daghestanian). In The Handbook of Morphology, pages 455–476. Blackwell, Oxford.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of LREC 2018, Miyazaki, Japan. European Language Resources Association (ELRA).

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72.

Warren S. McCulloch and Walter Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133.

John McWhorter. 2001. The world's simplest grammars are creole grammars. Linguistic Typology, 5(2):125–66.

Yoon Mi Oh. 2015. Linguistic Complexity and Information: Quantitative Approaches. Ph.D. thesis, Université de Lyon, France.

Azaria Paz. 2003. Probabilistic Automata. John Wiley and Sons.

François Pellegrino, Christophe Coupé, and Egidio Marsico. 2011. A cross-language perspective on speech information rate. Language, 87(3):539–558.

Steven Pinker and Alan Prince. 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1):73–193.

Angela Ralli. 1994. Feature representations and feature-passing operations in Greek nominal inflection. In Proceedings of the 8th Symposium on English and Greek Linguistics, pages 19–46.

Angela Ralli. 2002. The role of morphology in gender determination: evidence from Modern Greek. Linguistics, 40(3):519–552.

Jorma Rissanen and Eric S. Ristad. 1994. Language acquisition in the MDL framework. In Eric S. Ristad, editor, Language Computation. American Mathematical Society, Philadelphia, PA.

Robert Henry Robins. 2013. A Short History of Linguistics. Routledge.

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris.

Edward Sapir. 1921. Language: An Introduction to the Study of Speech.

Claude E. Shannon. 1948. A mathematical theory of communication. Bell Systems Technical Journal, 27.

Hava T. Siegelmann and Eduardo D. Sontag. 1991. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80.

Hava T. Siegelmann and Eduardo D. Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150.

Kenny Smith, Michael L. Kalish, Thomas L. Griffiths, and Stephan Lewandowsky. 2008. Cultural transmission and the evolution of human behaviour. Philosophical Transactions B. doi.org/10.1098/rstb.2008.0147.

Carolyn P. Sobel and Paul Li. 2013. The Cognitive Sciences: An Interdisciplinary Approach. Sage Publications.

Andrew Spencer. 1991. Morphological Theory: An Introduction to Word Structure in Generative Grammar. Wiley-Blackwell.

Thomas Stolz, Hitomi Otsuka, Aina Urdze, and Johan van der Auwera. 2012. Irregularity in Morphology (and Beyond), volume 11. Walter de Gruyter.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

John Sylak-Glassman. 2016. The composition and use of the universal morphological feature schema (UniMorph schema). Johns Hopkins University.

John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL), pages 674–680, Beijing.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701v1.
