The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection

Arya D. McCarthy♣, Ekaterina Vylomova♥, Shijie Wu♣, Chaitanya Malaviya♠, Lawrence Wolf-Sonkin♦, Garrett Nicolai♣, Christo Kirov♣,∗ Miikka Silfverberg♯, Sabrina J. Mielke♣, Jeffrey Heinz♭, Ryan Cotterell♣, and Mans Hulden♮

♣Johns Hopkins University ♥University of Melbourne ♠Allen Institute for AI ♦Google ♯University of Helsinki ♭Stony Brook University ♮University of Colorado
∗Now at Google

arXiv:1910.11493v2 [cs.CL] 25 Feb 2020

Abstract

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years' inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year's strong baselines or highly ranked systems from previous years' shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not in Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.

1 Introduction

While producing a sentence, humans combine various types of knowledge to produce fluent output: various shades of meaning are expressed through word selection and tone, while the language is made to conform to underlying structural rules via syntax and morphology. Native speakers are often quick to identify disfluency, even if the meaning of a sentence is mostly clear.

Automatic systems must also consider these constraints when constructing or processing language. Sufficiently strong language models can often reconstruct common syntactic structures, but they are insufficient to properly model morphology. Many languages implement large inflectional paradigms that mark both function and content words with varying levels of morphosyntactic information. For instance, Romanian verb forms inflect for person, number, tense, mood, and voice; meanwhile, Archi verbs can take on thousands of forms (Kibrik, 1998). Such complex paradigms produce large inventories of words, all of which must be producible by a realistic system, even though a large percentage of them will never be observed over billions of lines of linguistic input. Compounding the issue, good inflectional systems often require large amounts of supervised training data, which is infeasible in many of the world's languages.

This year's shared task concentrates on encouraging the construction of strong morphological systems that perform two related but different inflectional tasks. The first task asks participants to create morphological inflectors for a large number of under-resourced languages, encouraging systems that use highly-resourced, related languages as a cross-lingual training signal. The second task welcomes submissions that invert this operation in light of contextual information: given an unannotated sentence, lemmatize each word and tag it with a morphosyntactic description. Both of these tasks extend upon previous morphological competitions, and the best submitted systems now represent the state of the art in their respective tasks.

2 Tasks and Evaluation

2.1 Task 1: Cross-lingual transfer for morphological inflection

Annotated resources for the world's languages are not distributed equally: some languages simply have more, as they have more native speakers willing and able to annotate more data. We explore how to transfer knowledge from high-resource languages that are genetically related to low-resource languages.

The first task iterates on last year's main task: morphological inflection (Cotterell et al., 2018). Instead of giving some number of training examples in the language of interest, we provided only a limited number in that language. To accompany it, we provided a larger number of examples in either a related or unrelated language. Each test example asked participants to produce some other inflected form when given a lemma and a bundle of morphosyntactic features as input. The goal, thus, is to perform morphological inflection in the low-resource language, having hopefully exploited some similarity to the high-resource language. Models which perform well here can aid downstream tasks like machine translation in low-resource settings. All datasets were resampled from UniMorph, which makes them distinct from past years'.

The mode of the task is inspired by Zoph et al. (2016), who fine-tune a model pre-trained on a high-resource language to perform well on a low-resource language. We do not, though, require that models be trained by fine-tuning; joint modeling or any number of other methods may be explored instead.

Example The model will have access to type-level data in a low-resource target language, plus a high-resource source language. We give an example here of Asturian as the target language with Spanish as the source language.

    Low-resource target training data (Asturian)
    facer    "fechu"      V;V.PTCP;PST
    aguar    "aguà"       V;PRS;2;PL;IND
    ...

    High-resource source language training data (Spanish)
    tocar    "tocando"    V;V.PTCP;PRS
    bailar   "bailaba"    V;PST;IPFV;3;SG;IND
    mentir   "mintió"     V;PST;PFV;3;SG;IND
    ...

    Test input (Asturian)
    baxar    V;V.PTCP;PRS

    Test output (Asturian)
    "baxando"

Table 1: Sample language pair and data format for Task 1.

Evaluation We score the output of each system in terms of its predictions' exact-match accuracy and the average Levenshtein distance between the predictions and their corresponding true forms.

2.2 Task 2: Morphological analysis in context

Although inflection of words in a context-agnostic manner is a useful evaluation of the morphological quality of a system, people do not learn morphology in isolation.

In 2018, the second task of the CoNLL–SIGMORPHON Shared Task (Cotterell et al., 2018) required submitted systems to complete an inflectional cloze task (Taylor, 1953) given only the sentential context and the desired lemma. An example of the problem is given in the following lines:

    The _____ are barking.
        (dog)

A successful system would predict the plural form "dogs". Likewise, a Spanish word form "ayuda" may be a feminine noun or a third-person verb form, which must be disambiguated by context.

This year's task extends the second task from last year. Rather than inflect a single word in context, the task is to provide a complete morphological tagging of a sentence: for each word, a successful system will need to lemmatize it and tag it with a morphosyntactic description (MSD).

    The   dogs   are          barking        .
    the   dog    be           bark           .
    DET   N;PL   V;PRS;3;PL   V;V.PTCP;PRS   PUNCT

Context is critical: depending on the sentence, identical word forms realize a large number of potential inflectional categories, which will in turn influence lemmatization decisions. If the sentence were instead "The barking dogs kept us up all night", "barking" is now an adjective, and its lemma is also "barking".
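The two Task 1 metrics, exact-match accuracy and average Levenshtein distance, are straightforward to compute. The following is a minimal Python sketch, not the organizers' official evaluation script; the function names and sample data are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def evaluate(predictions, gold):
    """Exact-match accuracy and average Levenshtein distance
    between predicted inflected forms and the true forms."""
    assert len(predictions) == len(gold)
    n = len(gold)
    accuracy = sum(p == g for p, g in zip(predictions, gold)) / n
    avg_distance = sum(levenshtein(p, g) for p, g in zip(predictions, gold)) / n
    return accuracy, avg_distance
```

For example, `evaluate(["baxando", "fechu"], ["baxando", "fecho"])` returns an accuracy of 0.5 and an average distance of 0.5: one prediction matches exactly, the other is a single substitution away.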
3 Data

3.1 Data for Task 1

Language pairs We presented data in 100 language pairs spanning 79 unique languages. Data for all but four languages (Basque, Kurmanji, Murrinhpatha, and Sorani) are extracted from English Wiktionary, a large multi-lingual crowd-sourced dictionary with morphological paradigms for many lemmata. 20 of the 100 language pairs are either distantly related or unrelated; this allows speculation into the relative importance of data quantity and linguistic relatedness.

Data format For each language, the basic data consist of triples of the form (lemma, feature bundle, inflected form), as in Table 1. The first feature in the bundle always specifies the core part of speech (e.g., verb). For each language pair, separate files contain the high- and low-resource training examples. All features in the bundle are coded according to the UniMorph Schema, a cross-linguistically consistent universal morphological feature set (Sylak-Glassman et al., 2015a,b).

Extraction from Wiktionary For each of the Wiktionary languages, Wiktionary provides a number of tables, each of which specifies the full inflectional paradigm for a particular lemma. As in the previous iteration, tables were extracted using the template annotation procedure described by Kirov et al. (2018).

Sampling data splits From each language's collection of paradigms, we sampled the training, development, and test sets as in 2018. Crucially, as these sets are nested, the highest-count triples tend to appear in the smaller training sets. The final 2000 triples were randomly shuffled and then split in half to obtain development and test sets of 1000 forms each. The final shuffling was performed to ensure that the development set is similar to the test set; by contrast, the development and test sets tend to contain lower-count triples than the training set.

Other modifications We further adopted some changes to increase compatibility. Namely, we corrected some annotation errors created while scraping Wiktionary for the 2018 task, and we standardized Romanian t-cedilla and t-comma to t-comma. (The same was done with s-cedilla and s-comma.)

3.2 Data for Task 2

Our data for Task 2 come from the Universal Dependencies treebanks (UD; Nivre et al., 2018, v2.3), which provide pre-defined training, development, and test splits, as well as annotations in a unified annotation schema for morphosyntax and dependency relationships. Unlike the 2018 cloze task, which used UD data, we require no manual data preparation and are able to leverage all 107 monolingual treebanks.
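The Task 1 data handling described in Section 3.1 (tab-separated lemma, inflected form, feature-bundle triples; the held-out 2000 triples shuffled, then split in half into development and test sets) can be sketched as below. This is an illustrative reconstruction, not the organizers' sampling code; the function names, the fixed seed, and the exact file layout are our own assumptions.

```python
import random

def parse_triples(text):
    """Parse UniMorph-style lines: lemma <TAB> inflected form <TAB> feature bundle."""
    triples = []
    for line in text.splitlines():
        if line.strip():
            lemma, form, feats = line.split("\t")
            triples.append((lemma, form, feats))
    return triples

def dev_test_split(held_out, seed=0):
    """Shuffle the held-out triples, then split them in half, so that the
    development set is distributionally similar to the test set."""
    held_out = list(held_out)
    random.Random(seed).shuffle(held_out)
    mid = len(held_out) // 2
    return held_out[:mid], held_out[mid:]
```

With 2000 held-out triples, `dev_test_split` yields two disjoint sets of 1000 forms each; shuffling before the split is what keeps the two halves similar in their count distributions.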