
On the Complexity and Typology of Inflectional Morphological Systems

Ryan Cotterell and Christo Kirov and Mans Hulden and Jason Eisner
Department of Computer Science, Johns Hopkins University
Department of Linguistics, University of Colorado
fryan.cotterell,ckirov1,[email protected], [email protected]

Abstract

We quantify the linguistic complexity of different languages' morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: A language's inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm: how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.

1 Introduction

What makes an inflectional system "complex"? Linguists have sometimes considered measuring this by the size of the inflectional paradigms (McWhorter, 2001). The number of distinct inflected forms of each word indicates the number of morphosyntactic distinctions that the language makes on the surface. However, this gives only a partial picture of complexity (Sagot, 2013). Some inflectional systems are more irregular: It is harder to guess how the inflected forms of a word will be spelled or pronounced, given the base form. Ackerman and Malouf (2013) hypothesize that there is a limit to the irregularity of an inflectional system. We refine this hypothesis to propose that systems with many forms per paradigm have an even stricter limit on irregularity per distinct form. That is, the two dimensions interact: A system cannot be complex along both axes at once. In short, if a language demands that its speakers use a lot of distinct forms, those forms must be relatively predictable.

In this work, we develop information-theoretic tools to operationalize this hypothesis about the complexity of inflectional systems. We model each inflectional system using a tree-structured directed graphical model whose factors are neural networks and whose structure (topology) must be learned along with the factors. We explain our approach to quantifying two aspects of inflectional complexity and, in one case, approximate our metric using a simple variational bound. This allows a data-driven approach by which we can measure the morphological complexity of a given language in a clean manner that is more theory-agnostic than previous approaches.

Our study evaluates 36 diverse languages, using collections of paradigms represented orthographically. Thus, we are measuring the complexity of each written language. The corresponding spoken language would have different complexity, based on the corresponding phonological forms. Importantly, our method does not depend upon a linguistic analysis of words into constituent morphemes (e.g., hoping ↦ hope+ing).

We find support for the complexity trade-off hypothesis. Concretely, we show that the more unique forms an inflectional paradigm has, the more predictable the forms must be from one another; for example, forms in a predictable paradigm might all be related by a simple change of suffix. This intuition has a long history in the linguistics community, as field linguists have often noted that languages with extreme morphological richness, for example, agglutinative and polysynthetic languages, have virtually no exceptions or irregular forms. Our contribution lies in mathematically formulating this notion of regularity and providing a means to estimate it by fitting a probability model. Using these tools, we provide a quantitative verification of this conjecture on a large set of typologically diverse languages, which is significant with p < 0.037.
Transactions of the Association for Computational Linguistics, vol. 7, pp. 327–342, 2019. Action Editor: Chris Dyer. Submission batch: 12/2017; Published 6/2019. © 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Morphological Complexity

2.1 Word-Based Morphology

We adopt the framework of word-based morphology (Aronoff, 1976; Spencer, 1991). An inflected lexicon in this framework is represented as a set of word types. Each word type is a triple of

• a lexeme ℓ (an arbitrary integer or string that indexes the word's core meaning and part of speech)
• a slot σ (an arbitrary integer or object that indicates how the word is inflected)
• a surface form w (a string over a fixed phonological or orthographic alphabet Σ)

A paradigm m is a map from slots to surface forms.¹ We use dot notation to access elements of this map. For example, m.past denotes the past-tense surface form in paradigm m.

An inflected lexicon for a language can be regarded as defining a map M from lexemes to their paradigms. Specifically, M(ℓ).σ = w iff the lexicon contains the triple (ℓ, σ, w).² For example, in the case of the English lexicon, if ℓ is the English lexeme walk_Verb, then M(ℓ).past = walked. In linguistic terms, we say that in ℓ's paradigm M(ℓ), the past-tense slot is filled (or realized) by walked.

Nothing in our method requires a Bloomfieldian structuralist analysis that decomposes each word into underlying morphs; rather, this paper is a-morphous in the sense of Anderson (1992).

More specifically, we will work within the UniMorph annotation scheme (Sylak-Glassman, 2016). In the simplest case, each slot σ specifies a morphosyntactic bundle of inflectional features such as tense, mood, person, number, and gender. For example, the Spanish surface form pongas (from the lexeme poner 'to put') fills a slot that indicates that this word has the features [TENSE=PRESENT, MOOD=SUBJUNCTIVE, PERSON=2, NUMBER=SG]. We postpone a discussion of the details of UniMorph until §7.1, but it is mostly compatible with other, similar schemes.

¹ See Baerman (2015, Part II) for a tour of alternative views of inflectional paradigms.
² We assume that the lexicon never contains distinct triples of the form (ℓ, σ, w) and (ℓ, σ, w′), so that M(ℓ).σ has a unique value if it is defined at all.

2.2 Defining Complexity

Ackerman and Malouf (2013) distinguish two types of morphological complexity, which we elaborate on below. For a more general overview of morphological complexity, see Baerman et al. (2015).

2.2.1 Enumerative Complexity

The first type, enumerative complexity (e-complexity), measures the number of surface morphosyntactic distinctions that a language makes within a part of speech.

Given a lexicon, our present paper will measure the e-complexity of the verb system as the average of the verb paradigm size |M(ℓ)|, where ℓ ranges over all verb lexemes in domain(M). Importantly, we define the size |m| of a paradigm m to be the number of distinct surface forms in the paradigm, rather than the number of slots. That is, |m| is defined as |range(m)| rather than |domain(m)|.

Under our definition, nearly all English verb paradigms have size 4 or 5, giving the English verb system an e-complexity between 4 and 5. If m = M(walk_Verb), then |m| = 4, since range(m) = {walk, walks, walked, walking}. The manually constructed lexicon may define separate slots σ1 = [TENSE=PRESENT, PERSON=1, NUMBER=SG] and σ2 = [TENSE=PRESENT, PERSON=2, NUMBER=SG], but in this paradigm, those slots are not distinguished by any morphological marking: m.σ1 = m.σ2 = walk. Nor is the past tense walked distinguished from the past participle. This phenomenon is known as syncretism.

Why might the creator of a lexicon choose to define two slots for a syncretic form, rather than a single merged slot? Perhaps because the slots are not always syncretic: in the example above, one English verb, be, does distinguish σ1 and σ2.³ But an English lexicon that did choose to merge σ1 and σ2 could handle be by adding extra slots that are used only with be. A second reason is that the merged slot might be inelegant to describe using the feature bundle notation: English verbs (other than be) have a single form shared by the bare infinitive and all present tense forms except 3rd-person singular, but a single slot for this form could not be easily characterized by a single feature bundle, and so the lexicon creator might reasonably split it for convenience. A third reason might be an attempt at consistency across languages: In principle, an English lexicon is free to use the same slots as Sanskrit and thus list dual and plural forms for every English noun, which just happen to be identical in every case (complete syncretism).

³ This verb has a paradigm of size 8: {be, am, are, is, was, were, been, being}.

The point is that our e-complexity metric is insensitive to these annotation choices. It focuses on observable surface distinctions, and so does not care whether syncretic slots are merged or kept separate. Later, we will construct our i-complexity metric to have the same property.

The notion of e-complexity has a long history in linguistics. The idea was explicitly discussed as [...]

[...] language if many verbs express it with suffix -o while many others use -∅. In §5, we will propose an improvement to their entropy-based measure.

2.3 The Low-Entropy Conjecture

The low-entropy conjecture, as formulated by Ackerman and Malouf (2013, p. 436), "is the hypothesis that enumerative morphological complexity is effectively unrestricted, as long as the average conditional entropy, a measure of integrative complexity, is low." Indeed, Ackerman and Malouf go so far as to say that there need be no upper bound on e-complexity, but the i-complexity must remain sufficiently low (as is the case for Archi, for example).
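
As an illustration of the definitions in §2.1 and §2.2.1, the following minimal Python sketch builds the map M from a set of (lexeme, slot, surface form) triples and computes e-complexity as the average number of distinct surface forms per paradigm. It is not the authors' code: the toy lexicon, the slot labels (plain strings rather than real UniMorph feature bundles), and the function names build_lexicon, paradigm_size, and e_complexity are all assumptions made for this example, and every lexeme in the toy data is taken to be a verb.

```python
from collections import defaultdict

# Hypothetical toy lexicon of (lexeme, slot, surface form) triples.
# Slot labels are illustrative strings, not actual UniMorph tags.
TRIPLES = [
    ("walk", "present-1sg", "walk"),
    ("walk", "present-2sg", "walk"),       # syncretic with present-1sg
    ("walk", "present-3sg", "walks"),
    ("walk", "past", "walked"),
    ("walk", "past-participle", "walked"),  # syncretic with past
    ("walk", "present-participle", "walking"),
    ("sing", "present-1sg", "sing"),
    ("sing", "present-3sg", "sings"),
    ("sing", "past", "sang"),
    ("sing", "past-participle", "sung"),
    ("sing", "present-participle", "singing"),
]


def build_lexicon(triples):
    """Return M as a dict: lexeme -> paradigm, where a paradigm maps slot -> form."""
    lexicon = defaultdict(dict)
    for lexeme, slot, form in triples:
        paradigm = lexicon[lexeme]
        # Enforce the uniqueness assumption: one surface form per (lexeme, slot).
        if paradigm.get(slot, form) != form:
            raise ValueError(f"conflicting forms for {lexeme!r} in slot {slot!r}")
        paradigm[slot] = form
    return dict(lexicon)


def paradigm_size(paradigm):
    """|m| = |range(m)|: the number of distinct surface forms, not of slots."""
    return len(set(paradigm.values()))


def e_complexity(lexicon):
    """Average paradigm size over all lexemes in the lexicon."""
    sizes = [paradigm_size(m) for m in lexicon.values()]
    return sum(sizes) / len(sizes)


M = build_lexicon(TRIPLES)
print(M["walk"]["past"])          # walked
print(paradigm_size(M["walk"]))   # 4, although the paradigm lists 6 slots
print(e_complexity(M))            # 4.5
```

Because paradigm_size counts the range of the paradigm rather than its domain, merging or splitting the syncretic slots in TRIPLES would leave the result unchanged, which is the insensitivity to annotation choices described in §2.2.1.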