Marrying Universal Dependencies and Universal Morphology
Total Page:16
File Type:pdf, Size:1020Kb
Marrying Universal Dependencies and Universal Morphology Arya D. McCarthy1, Miikka Silfverberg2, Ryan Cotterell1, Mans Hulden2, and David Yarowsky1 1Johns Hopkins University 2University of Colorado Boulder farya, rcotter2, [email protected] fmiikka.silfverberg,[email protected] Abstract The Universal Dependencies (UD) and Uni- versal Morphology (UniMorph) projects each present schemata for annotating the mor- phosyntactic details of a language. Each project also provides corpora of annotated text in many languages—UD at the token level and UniMorph at the type level. As each cor- pus is built by different annotators, language- specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be Figure 1: Example of annotation disagreement in UD a boon to tasks such as parsing and homograph between two languages on translations of one phrase, disambiguation. To ease this interoperability, reproduced from Malaviya et al.(2018). The final word we present a deterministic mapping from Uni- in each, “refrescante”, is not inflected for gender: It has versal Dependencies v2 features into the Uni- the same surface form whether masculine or feminine. Morph schema. We validate our approach by Only in Portuguese, it is annotated as masculine to re- lookup in the UniMorph corpora and find a flect grammatical concord with the noun it modifies. macro-average of 64:13% recall. We also note incompatibilities due to paucity of data on ei- ther side. Finally, we present a critical evalu- its schema. On a dataset-by-dataset basis, they in- ation of the foundations, strengths, and weak- corporate annotator errors, omissions, and human nesses of the two annotation projects. decisions when the schemata are underspecified; one such example is in Figure 1. 1 Introduction A dataset-by-dataset problem demands a dataset- The two largest standardized, cross-lingual datasets by-dataset solution; our task is not to translate for morphological annotation are provided by the a schema, but to translate a resource. Starting Universal Dependencies (UD; Nivre et al., 2017) from the idealized schema, we create a rule-based and Universal Morphology (UniMorph; Sylak- tool for converting UD-schema annotations to Glassman et al., 2015; Kirov et al., 2018) projects. UniMorph annotations, incorporating language- Each project’s data are annotated according to its specific post-edits that both correct infelicities and own cross-lingual schema, prescribing how fea- also increase harmony between the datasets them- tures like gender or case should be marked. The selves (rather than the schemata). We apply this schemata capture largely similar information, so conversion to the 31 languages with both UD and one may want to leverage both UD’s token-level UniMorph data, and we report our method’s recall, treebanks and UniMorph’s type-level lookup tables showing an improvement over the strategy which and unify the two resources. This would permit a just maps corresponding schematic features to each leveraging of both the token-level UD treebanks other. Further, we show similar downstream per- and the type-level UniMorph tables of paradigms. formance for each annotation scheme in the task of Unfortunately, neither resource perfectly realizes morphological tagging. This tool enables a synergistic use of UniMorph Simple label Form PTB tag and Universal Dependencies, as well as teasing Present, 3rd singular “proves” VBZ out the annotation discrepancies within and across Present, other “prove” VBP projects. When one dataset disobeys its schema or Past “proved” VBD disagrees with a related language, the flaws may Past participle “proven” VBN not be noticed except by such a methodological Present participle “proving” VBG dive into the resources. When the maintainers of the resources ameliorate these flaws, the resources Table 1: Inflected forms of the English verb prove, move closer to the goal of a universal, cross-lingual along with their Penn Treebank tags inventory of features for morphological annotation. The contributions of this work are: where they store the inflected forms for retrieval as • We detail a deterministic mapping from the syntactic context requires. Instead, there needs UD morphological annotations to UniMorph. to be a mental process that can generate properly Language-specific edits of the tags in 31 lan- inflected words on demand. Berko(1958) showed guages increase harmony between converted this insightfully through the “wug”-test, an experi- UD and existing UniMorph data (x5). ment where she forced participants to correctly in- flect out-of-vocabulary lemmata, such as the novel • We provide an implementation of this map- noun wug. ping and post-editing, which replaces the UD Certain features of a word do not vary depending features in a CoNLL-U file with UniMorph on its context: In German or Spanish where nouns features.1 are gendered, the word for onion will always be • We demonstrate that downstream perfor- grammatically feminine. Thus, to prepare for later mance tagging accuracy on UD treebanks discussion, we divide the morphological features is similar, whichever annotation schema is of a word into two categories: the modifiable in- used (x7). flectional features and the fixed lexical features. A part of speech (POS) is a coarse syntactic • We provide a partial inventory of missing at- category (like “verb”) that begets a word’s partic- tributes or annotation inconsistencies in both ular menu of lexical and inflectional features. In UD and UniMorph, a guidepost for strength- English, verbs express no gender, and adjectives do ening and harmonizing each resource. not reflect person or number. The part of speech 2 Background: Morphological Inflection dictates a set of inflectional slots to be filled by the surface forms. Completing these slots for a given Morphological inflection is the act of altering the lemma and part of speech gives a paradigm: a base form of a word (the lemma, represented in mapping from slots to surface forms. Regular En- fixed-width type) to encode morphosyntac- glish verbs have five slots in their paradigm (Long, tic features. As an example from English, prove 1957), which we illustrate for the verb prove, us- takes on the form “proved” to indicate that the ac- ing simple labels for the forms in Table 1. tion occurred in the past. (We will represent all A morphosyntactic schema prescribes how lan- surface forms in quotation marks.) The process oc- guage can be annotated—giving stricter categories curs in the majority of the world’s widely-spoken than our simple labels for prove—and can vary languages, typically through meaningful affixes. in the level of detail provided. Part of speech The breadth of forms created by inflection creates a tags are an example of a very coarse schema, ig- challenge of data sparsity for natural language pro- noring details of person, gender, and number. A cessing: The likelihood of observing a particular slightly finer-grained schema for English is the word form diminishes. Penn Treebank tagset (Marcus et al., 1993), which A classic result in psycholinguistics (Berko, includes signals for English morphology. For in- 1958) shows that inflectional morphology is a fully stance, its VBZ tag pertains to the specially in- productive process. Indeed, it cannot be that hu- flected 3rd-person singular, present-tense verb form mans simply have the equivalent of a lookup table, (e.g. “proves” in Table 1). 1Available at https://www.github.com/ If the tag in a schema is detailed enough that unimorph/ud-compatibility. it exactly specifies a slot in a paradigm, it is called a morphosyntactic description (MSD).2 is an annotated treebank, making it a resource of These descriptions require varying amounts of de- token-level annotations. The schema is guided by tail: While the English verbal paradigm is small these treebanks, with feature names chosen for rele- enough to fit on a page, the verbal paradigm of vance to native speakers. (In x3.2, we will contrast the Northeast Caucasian language Archi can have this with UniMorph’s treatment of morphosyntac- over 1;500;000 slots (Kibrik, 1998). tic categories.) The UD datasets have been used in the CoNLL shared tasks (Zeman et al., 2017, 2018 3 Two Schemata, Two Philosophies to appear). Unlike the Penn Treebank tags, the UD and Uni- Morph schemata are cross-lingual and include a 3.2 UniMorph fuller lexicon of attribute-value pairs, such as PER- SON: 1. Each was built according to a different set In the Universal Morphological Feature Schema of principles. UD’s schema is constructed bottom- (UniMorph schema, Sylak-Glassman, 2016), there up, adapting to include new features when they’re are at least 212 values, spread across 23 attributes. identified in languages. UniMorph, conversely, is It identifies some attributes that UD excludes like top-down: A cross-lingual survey of the literature information structure and deixis, as well as pro- of morphological phenomena guided its design. viding more values for certain attributes, like 23 UniMorph aims to be linguistically complete, con- different noun classes endemic to Bantu languages. taining all known morphosyntactic attributes. Both As it is a schema for marking morphology, its part schemata share one long-term goal: a total inven- of speech attribute does not have POS values for tory for annotating the possible morphosyntactic punctuation, symbols, or miscellany (PUNCT, SYM, features of a word. and X in Universal Dependencies). Like the UD schema, the decomposition of a 3.1 Universal Dependencies word into its lemma and MSD is directly compara- The Universal Dependencies morphological ble across languages. Its features are informed by schema comprises part of speech and 23 additional a distinction between universal categories, which attributes (also called features in UD) annotating are widespread and psychologically “real” to speak- meaning or syntax, as well as language-specific ers; and comparative concepts, only used by lin- attributes.