Interpreting Sequence-To-Sequence Models for Russian Inflectional Morphology
David L. King, Andrea D. Sims, Micha Elsner
The Ohio State University
{king.2138, sims.120, elsner.14}@osu.edu

Abstract

Morphological inflection, as an engineering task in NLP, has seen a rise in the use of neural sequence-to-sequence models (Kann and Schütze, 2016; Cotterell et al., 2018; Aharoni and Goldberg, 2017). While these outperform traditional systems based on edit rule induction, it is hard to interpret what they are learning in linguistic terms. We propose a new method of analyzing morphological sequence-to-sequence models which groups errors into linguistically meaningful classes, making what the model learns more transparent. As a case study, we analyze a seq2seq model on Russian, finding that semantically- and lexically-conditioned allomorphy (e.g. inanimate nouns like ZAVOD 'factory' and animates like OTEC 'father' have different, animacy-conditioned accusative forms) is responsible for its relatively low accuracy. Augmenting the model with word embeddings as a proxy for lexical semantics leads to significant improvements in predicted wordform accuracy.

1 Introduction

Neural sequence-to-sequence models excel at learning inflectional paradigms from incomplete input (Table 1 shows an example inflection problem). These models, originally borrowed from neural machine translation (Bahdanau et al., 2014), read in a series of input tokens (e.g. characters, words) and output, or translate, them as another series. Although these models have become adept at mapping input to output sequences, like all neural models, they are relatively uninterpretable.

  Source     Features                   Target
  ABAŠ       pos=N, case=NOM, num=SG    ABAŠ
  JATAGAN    pos=N, case=INS, num=PL    JATAGANAMI

Table 1: An example inflection problem: the task is to map the Source and Features to the correct, fully inflected Target.
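To make the task format concrete, here is a minimal sketch of how an instance like those in Table 1 might be linearized into the flat token sequence a sequence-to-sequence model reads. The linearize helper and the feature notation are illustrative assumptions, not the exact input format of Faruqui et al. (2016) or any other shared-task system.

```python
# A minimal sketch (not any specific system's exact format): feature
# tokens are prepended to the lemma's character sequence, and the model
# is trained to emit the inflected form character by character.

def linearize(source: str, features: dict) -> list:
    """Flatten (lemma, features) into one input token sequence."""
    feature_tokens = [f"{k}={v}" for k, v in features.items()]
    return feature_tokens + list(source)

src = linearize("JATAGAN", {"pos": "N", "case": "INS", "num": "PL"})
# ['pos=N', 'case=INS', 'num=PL', 'J', 'A', 'T', 'A', 'G', 'A', 'N']
tgt = list("JATAGANAMI")  # gold output character sequence
```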
We present a novel error analysis technique based on previous systems for learning to inflect which relied on edit rule induction (Durrett and DeNero, 2013). By using this to interpret the output of a neural model, we can group errors into linguistically salient classes such as producing the wrong case form or the wrong inflection class.

Our broader linguistic contribution is to reconnect the inflection task to the descriptive literature on morphological systems. Neural models for inflection are now being applied as cognitive models of human learning in a variety of settings (Malouf, 2017; Silfverberg and Hulden, 2018; Kirov and Cotterell, 2018, and others). They are appealing cognitive models partly because of their high performance on benchmark tasks (Cotterell et al., 2016, and subsequent), and also because they make few assumptions about the morphological system they are trying to model, dispensing with overly restrictive notions of segmentable morphemes and discrete inflection classes. But while these constructs are theoretically troublesome, they are still important for describing many commonly studied languages; without them, it is relatively difficult to discover what a particular model has and has not learned about a morphological system. This is often the key question which prevents us from using a general-purpose neural network system as a cognitive model (Gulordava et al., 2018).

Our error analysis allows us to understand more clearly how the sequence-to-sequence model diverges from human behavior, giving us new information about its suitability as a cognitive model of the language learner.

As a case study, we apply our error analysis technique to Russian, one of the lowest-performing languages in SIGMORPHON 2016. We find a large class of errors in which the model incorrectly selects among lexically- or semantically-conditioned allomorphs. Russian has semantically-conditioned allomorphy in nouns and adjectives, and lexically-conditioned allomorphy (inflection classes) in nouns and verbs (Timberlake, 2004); Section 3 gives a brief introduction to the relevant phenomena. While these facts are commonly known to linguists, their importance to modeling the inflection task has not previously been pointed out. Section 4 shows that these phenomena account for most of Russian's increased difficulty relative to the other languages. In Section 6, we provide lexical-semantic information to the model, decreasing errors due to semantic conditioning of nouns by 64% and of verbs by 88%.

2 Background

The inflection task described above is an instance of the paradigm cell filling problem (PCFP; Ackerman et al., 2009), and models a situation which both computational and human learners face. For humans, the PCFP is closely related to the "wug test" (Berko, 1958): given some previously unseen word, how does a speaker produce a different inflected form? As Lignos and Yang (2016) and Blevins et al. (2017) point out, the same Zipfian distribution that makes other NLP tasks (e.g. machine translation) difficult is also at play in morphology, namely that no corpus will ever exist that has every wordform of every lexeme. For theoretical morphologists, the average difficulty of the PCFP is a measure of the learnability of a morphological system, with implications for language typology (Ackerman et al., 2009; Ackerman and Malouf, 2013; Albright, 2002; Bonami and Beniamine, 2016; Sims and Parker, 2016).

Ackerman et al.'s (2009) formulation of the PCFP relies on a simple concatenative model in which words are divided into stems and affixes, and in which each affix is treated as a discrete value. Cotterell et al. (2018) point out that this model is ill-suited to dealing with phenomena like phonological alternations or stem suppletion. Newer models (Silfverberg and Hulden, 2018; Malouf, 2017; Cotterell et al., 2018) use sequence-to-sequence inflection models to avoid these shortcomings.

Faruqui et al. (2016) introduced the use of attention-based neural sequence-to-sequence learning for the inflection task, building on models from machine translation (Bahdanau et al., 2014). Their model treats input as a linear series in which grammatical features and characters are encoded as one-hot embeddings and passed to a bidirectional encoder LSTM; output for each paradigm cell is produced by a separate decoder. Kann and Schütze (2016) extended Faruqui et al.'s architecture by using ensembling and by using a single decoder, shared across all output paradigm cells, to account for data sparsity. Later systems (Aharoni and Goldberg, 2017; Kann and Schütze, 2017) have made changes to the input representation and the architecture, for instance incorporating variants of hard attention and autoencoding. From a theoretical standpoint, all these models are "a-morphous" (Anderson, 1992) or "inferential-realizational" (Stump, 2001): rather than assuming a concatenative process which stitches discrete morphemes together into surface word forms, they learn a flexible, generalizable transduction, either between a stem and a surface form (Anderson, 1992; Stump, 2001) or between pairs of surface forms (Albright, 2002; Blevins, 2006).

Some older learning-based inflection systems, such as Durrett and DeNero (2013), exploit sequence alignment across strings. Alignment-based systems essentially treat morphology as concatenation. While they do not perform full-scale morphological analysis (since they do not account for phonological alternations), in languages which are mostly concatenative they do tend to isolate affix-like units as sequences of adjacent insertions or deletions. This property has been criticized in the neural literature (Faruqui et al., 2016), since it represents processes like vowel harmony by enumerating large sets of surface allomorphs, making the learning problem harder. We agree with these criticisms from the modeling standpoint, but we exploit the interpretability of the technique in our analysis of model results.
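To illustrate how alignment supports this kind of interpretation, the sketch below recovers an edit rule as the non-identity operations aligning a lemma with a form, roughly in the spirit of Durrett and DeNero (2013). It is a simplified stand-in for the actual analysis, assuming off-the-shelf string alignment (Python's difflib) and showing only a hint of the error-grouping step.

```python
from difflib import SequenceMatcher

def edit_rule(lemma: str, form: str):
    """Collect the non-identity edit operations aligning lemma -> form."""
    ops = SequenceMatcher(None, lemma, form).get_opcodes()
    return tuple((tag, lemma[i1:i2], form[j1:j2])
                 for tag, i1, i2, j1, j2 in ops if tag != "equal")

# Comparing the rule the model applied against the gold rule lets an
# error be labeled in linguistic terms (wrong case ending, wrong
# inflection class, ...) rather than as an opaque string mismatch.
gold = edit_rule("student", "studenta")  # (('insert', '', 'a'),)
pred = edit_rule("student", "student")   # () -- an inanimate-style ACC
if pred != gold:
    print("misselected allomorph:", pred, "expected:", gold)
```

Because affix-like material surfaces as contiguous insertions and deletions, the extracted rules can be grouped into the linguistically salient error classes described above.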
Our study of Russian concludes that semantically- and lexically-conditioned allomorphy constitutes a problem for current neural reinflection models. This is because such models are trained to map input character sequences to output character sequences; they do not typically have access to information about what the words they are inflecting mean. We show that, by providing word embeddings as meaning representations, we can reduce this source of error and bring Russian closer to the other languages studied in SIGMORPHON 2016.
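As a hedged sketch of how such meaning representations can enter the model, the snippet below appends a pretrained embedding of the lemma to the encoder input at every timestep. The dimensions and the injection point are assumptions for illustration; the architecture actually used in Section 6 may differ.

```python
import numpy as np

def augment_with_embedding(char_vectors, lemma_embedding):
    """Concatenate the lemma's word embedding onto each character vector."""
    tiled = np.tile(lemma_embedding, (char_vectors.shape[0], 1))
    return np.concatenate([char_vectors, tiled], axis=1)

chars = np.random.randn(7, 32)    # 7 characters x 32-dim char embeddings
lemma = np.random.randn(100)      # 100-dim pretrained word embedding
enc_in = augment_with_embedding(chars, lemma)  # shape (7, 132)
```

An embedding trained on distributional context can implicitly encode properties like animacy, which is exactly the information the character sequence alone withholds.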
Recently the NLP community has also pushed for greater transparency with neural models (xci, 2017; ana, 2019). Wilcox et al. (2018) showed that RNNs learn hierarchical sentence structure, such as island constraints. Faruqui et al. (2016) demonstrated that RNNs can automatically learn which vowel pairs participate in vowel harmony alternations. Our error analysis allows us to interpret what neural models are learning, reconnecting inflection tasks to linguistic intuitions by generalizing over error classes.

3 Russian Inflectional Morphology

We select Russian as our language of analysis because it was among the three worst-performing languages in the SIGMORPHON 2016 shared task, falling 4+ percentage points behind the other languages. Problems with the design of the Navajo and Maltese datasets may have been the source of the difficulties with those languages,¹ but this cannot explain the Russian results. The discrepancy hints at some linguistic property which distinguishes Russian from the other languages in the task.

  Case          Singular            Plural
  Nominative    -∅, -', -J, -IJ     -Y, -I, -II
  Accusative    N or G              N or G
  Genitive      -A, -JA, -IJA       -OV, -EJ, -EV, -IEV
  Dative        -U, -JU, -IJU       -AM, -JAM, -IJAM
  Instrumental  -OM, -EM, -IEM      -AMI, -JAMI, -IJAMI
  Locative      -E, -II             -AX, -JAX, -IJAX

Table 2: An example of class IA, showing the effect of animacy in the orthography of the singular and plural accusative forms; N or G indicates that the accusative is syncretic with the nominative or the genitive, depending on animacy.

In class IA nouns in the singular and plural (Table 2), and in classes IB, II, and III accusative plurals, the accusative exhibits syncretism with either the genitive (for animates) or the nominative (for inanimates). In the case of the animate noun STUDENT ('student'), for example, the nominative singular form is student and the accusative singular and genitive singular forms are both studenta.
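To see why this pattern is hard for a purely character-level model, consider the toy selection rule below: in these cells the accusative has no ending of its own but is borrowed from the genitive or nominative, conditioned on animacy, a semantic property invisible in the input string. The paradigms and animacy labels here are illustrative.

```python
def accusative(paradigm: dict, animate: bool) -> str:
    """Class IA accusative via syncretism: GEN if animate, NOM if not."""
    return paradigm["GEN"] if animate else paradigm["NOM"]

student = {"NOM": "student", "GEN": "studenta"}  # animate 'student'
zavod   = {"NOM": "zavod",   "GEN": "zavoda"}    # inanimate 'factory'

assert accusative(student, animate=True)  == "studenta"
assert accusative(zavod,   animate=False) == "zavod"
```

Nothing in the character sequences of student and zavod signals which branch to take; this is the gap that the word-embedding augmentation of Section 6 is meant to fill.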