arXiv:2006.11572v2 [cs.CL] 14 Jul 2020
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

Ekaterina Vylomova [1], Jennifer White [2], Elizabeth Salesky [3], Sabrina J. Mielke [3], Shijie Wu [3], Edoardo Ponti [2], Rowan Hall Maudslay [2], Ran Zmigrod [2], Josef Valvoda [2], Svetlana Toldova [4], Francis Tyers [11, 4], Elena Klyachko [4], Ilya Yegorov [5], Natalia Krizhanovsky [6], Paula Czarnowska [2], Irene Nikkarinen [2], Andrew Krizhanovsky [6], Tiago Pimentel [2], Lucas Torroba Hennigen [2], Christo Kirov [7], Garrett Nicolai [8], Adina Williams [9], Antonios Anastasopoulos [10], Hilaria Cruz [12], Eleanor Chodroff [13], Ryan Cotterell [2, 14], Miikka Silfverberg [8], Mans Hulden [15]

[1] University of Melbourne, [2] University of Cambridge, [3] Johns Hopkins University, [4] Higher School of Economics, [5] Moscow State University, [6] Karelian Research Centre, [7] Google AI, [8] University of British Columbia, [9] Facebook AI Research, [10] Carnegie Mellon University, [11] Indiana University, [12] University of Louisville, [13] University of York, [14] ETH Zürich, [15] University of Colorado Boulder

Abstract

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate the utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

1 Introduction

Human language is marked by considerable diversity around the world. Though the world's languages share many basic attributes (e.g., Swadesh, 1950 and more recently, List et al., 2016), grammatical features, and even abstract implications (proposed in Greenberg, 1963), each language nevertheless has a unique evolutionary trajectory that is affected by geographic, social, cultural, and other factors. As a result, the surface form of languages varies substantially. The morphology of languages can differ in many ways: some exhibit rich grammatical case systems (e.g., 12 in Erzya and 24 in Veps) and mark possessiveness, others might have complex verbal morphology (e.g., Oto-Manguean languages; Palancar and Léonard, 2016) or even "decline" nouns for tense (e.g., Tupi–Guarani languages). Linguistic typology is the discipline that studies these variations by means of a systematic comparison of languages (Croft, 2002; Comrie, 1989). Typologists have defined several dimensions of morphological variation to classify and quantify the degree of cross-linguistic variation. This comparison can be challenging as the categories are based on studies of known languages and are progressively refined with documentation of new languages (Haspelmath, 2007). Nevertheless, to understand the potential range of morphological variation, we take a closer look at three dimensions here: fusion, inflectional synthesis, and position of case affixes (Dryer and Haspelmath, 2013).

Fusion, our first dimension of variation, refers to the degree to which morphemes bind to one another in a phonological word (Bickel and Nichols, 2013b). Languages range from strictly isolating (i.e., each morpheme is its own phonological word) to concatenative (i.e., morphemes bind together within a phonological word); non-linearities such as ablaut or tonal morphology can also be present. From a geographic perspective, isolating languages are found in the Sahel Belt in West Africa, Southeast Asia and the Pacific. Ablaut–concatenative morphology and tonal morphology can be found in African languages. Tonal–concatenative morphology can be found in Mesoamerican languages (e.g., Oto-Manguean). Concatenative morphology is the most common system and can be found around the world. Inflectional synthesis, the second dimension considered, refers to whether grammatical categories like tense, voice or agreement are expressed as affixes (synthetic) or individual words (analytic) (Bickel and Nichols, 2013c). Analytic expressions are common in Eurasia (except the Pacific Rim, and the Himalaya and Caucasus mountain ranges), whereas synthetic expressions are used to a high degree in the Americas. Finally, affixes can variably surface as prefixes, suffixes, infixes, or circumfixes (Dryer, 2013). Most Eurasian and Australian languages strongly favor suffixation, and the same holds true, but to a lesser extent, for South American and New Guinean languages (Dryer, 2013). In Mesoamerican languages and African languages spoken below the Sahara, prefixation is dominant instead.

These are just three dimensions of variation in morphology, and the cross-linguistic variation is already considerable. Such cross-lingual variation makes the development of natural language processing (NLP) applications challenging. As Bender (2009, 2016) notes, many current architectures and training and tuning algorithms still present language-specific biases. The most commonly used language for developing NLP applications is English. Along the above dimensions, English is productively concatenative, a mixture of analytic and synthetic, and largely suffixing in its inflectional morphology. With respect to languages that exhibit inflectional morphology, English is relatively impoverished.¹ Importantly, English is just one morphological system among many. A larger goal of natural language processing is that the system work for any presented language. If an NLP system is trained on just one language, it could be missing important flexibility in its ability to account for cross-linguistic morphological variation.

¹ Note that many languages exhibit no inflectional morphology, e.g., Mandarin Chinese, Yoruba, etc.: Bickel and Nichols (2013a).

In this year's iteration of the SIGMORPHON shared task on morphological reinflection, we specifically focus on typological diversity and aim to investigate systems' ability to generalize across typologically distinct languages, many of which are low-resource. For example, if a neural network architecture works well for a sample of Indo-European languages, should the same architecture also work well for Tupi–Guarani languages (where nouns are "declined" for tense) or Austronesian languages (where verbal morphology is frequently prefixing)?

2 Task Description

The 2020 iteration of our task is similar to CoNLL-SIGMORPHON 2017 (Cotterell et al., 2017) and 2018 (Cotterell et al., 2018) in that participants are required to design a model that learns to generate inflected forms from a lemma and a set of morphosyntactic features that derive the desired target form. For each language we provide a separate training, development, and test set. More historically, all of these tasks resemble the classic "wug"-test that Berko (1958) developed to test child and human knowledge of English nominal morphology.

Unlike the task from earlier years, this year's task proceeds in three phases: a Development Phase, a Generalization Phase, and an Evaluation Phase, in which each phase introduces previously unseen data. The task starts with the Development Phase, which is an extended period of time (about two months) during which participants develop a model of morphological inflection. In this phase, we provide training and development splits for 45 languages representing the Austronesian, Niger-Congo, Oto-Manguean, Uralic and Indo-European language families. Table 1 provides details on the languages. The Generalization Phase is a short period of time (it started about a week before the Evaluation Phase) during which participants fine-tune their models on new data. At the start of the phase, we provide training and development splits for 45 new languages where approximately half are genetically related (belong to the same family) and half are genetically unrelated (are isolates or belong to a different family) to the languages presented in the Development Phase. More specifically, we introduce (surprise) languages from Afro-Asiatic, Algic, Dravidian, Indo-European, Niger-Congo, Sino-Tibetan, Siouan, Songhay, Southern Daly, Tungusic, Turkic, Uralic, and Uto-Aztecan families. See Table 2 for more [...]

[...] tradition of written form, while others have yet to incorporate a writing system. The six branches differ most notably in typology and syntax, with the Chadic language being the main source of differences, which has sparked discussion of the division of the family (Frajzyngier, 2018). For example, in the Egyptian and Semitic branches, the root of a verb may not contain vowels, while this is allowed in Chadic. Although only four of the six branches, excluding Chadic and Omotic, use a prefix and suffix in conjugation when adding a