Arxiv:1908.05838V2 [Cs.CL] 20 Aug 2019 Encompass More and More Languages, Modeling Proper Handling of the Huge Vocabulary
Total Page:16
File Type:pdf, Size:1020Kb
Pushing the Limits of Low-Resource Morphological Inflection Antonios Anastasopoulos and Graham Neubig Language Technologies Institute, Carnegie Mellon University {aanastas,gneubig}@cs.cmu.edu Abstract a g u à ··· Recent years have seen exceptional strides in P(y1 yK ) the task of automatic morphological inflec- softmax tion generation. However, for a long tail of 0 0 s1 ··· sK s ··· s languages the necessary resources are hard to 1 K come by, and state-of-the-art neural methods decoder t t x ··· x that work well under higher resource settings c1 ··· cK c1 cK perform poorly in the face of a paucity of attention attention data. In response, we propose a battery of im- t ··· t x ··· x provements that greatly improve performance h1 hM h1 hN under such low-resource conditions. First, we encoder encoder present a novel two-step attention architecture t ··· t x1 ··· xN for the inflection decoder. In addition, we in- 1 M vestigate the effects of cross-lingual transfer V PRS 2 PL IND a g u a r from single and multiple languages, as well as monolingual data hallucination. The macro- Figure 1: Visualization of our proposed two-step atten- averaged accuracy of our models outperforms tion architecture. The decoder first attends over the tag 0 the state-of-the-art by 15 percentage points.1 sequence T and then uses the updated decoder state s Also, we identify the crucial factors for suc- to attend over the character sequence X in order to pro- cess with cross-lingual transfer for morpho- duce the inflected form Y. (Example from Asturian.) logical inflection: typological similarity and a common representation across languages. recognition (Foley et al., 2018) and predictive key- boards (Breiner et al., 2019) for under-represented 1 Introduction languages, if they exist, largely still rely on uni- The majority of the world’s languages are cate- gram lexicons with performance inferior to the so- gorized as synthetic, meaning that they have rich phisticated language models used in high-resource morphology, be it fusional, agglutinative, polysyn- ones. Good inflection models would be invaluable thetic, or a mixture thereof. As Natural Language for predictive text technology in morphologically- Processing (NLP) keeps expanding its frontiers to rich languages as they could effectively enable arXiv:1908.05838v2 [cs.CL] 20 Aug 2019 encompass more and more languages, modeling proper handling of the huge vocabulary. of the grammatical functions that guide language Additionally, they could be very useful for generation is of utmost importance. building educational applications for languages of In the case of morphologically-rich languages, under-represented communities (along with their explicit modeling of the inflection processes has inverse, morphological analyzers). Encouraging significant potential to alleviate issues created by examples are the Yupik morphological analyzer data scarcity and the resulting lack of vocabu- (Schwartz et al., 2019) and the Inuktitut educa- lary coverage. Especially on low-resource, under- tional tools from the respective Native Peoples 2 represented languages and dialects, the potential communities. The social impact of such applica- for impact is much higher. For example, speech tions can be enormous, effectively raising the sta- tus of the languages slightly closer to the level of 1Our code is available at https://github.com/ antonisa/inflection. 2http://www.inuktitutcomputing.ca/ the dominant regional language. nation technique (§2.2). Third, we devise a train- Morphological inflection has been thoroughly ing schedule (§2.3) that substitutes structural bi- studied in monolingual high resource settings, es- ases for attention monotonicity. pecially through the recent SIGMORPHON chal- lenges (Cotterell et al., 2016, 2017, 2018). Low- 2.1 Model Architecture resource settings, in contrast, are relatively under- Our models are based on a sequence-to-sequence explored. One promising direction and the main model with attention (Bahdanau et al., 2015). In focus of the SIGMORPHON 2019 challenge (Mc- broad terms, the model is composed of three parts: Carthy et al., 2019b) is cross-lingual training, an encoder, the attention, and a recurrent decoder. which has been successfully applied in other low- In the setting of the inflection task, there is an ad- resource tasks such as Machine Translation (MT) ditional input provided (the set of morphological or parsing. tags) which requires an additional encoder. In this work we focus in this cross-lingual set- A visualization of our model is shown in Fig- ting for low-resource morphological inflection and ure1. First, an encoder transforms the input char- propose several simple, yet effective approaches acter sequence x1 ::: xN into a sequence of inter- to mitigating problems caused by extreme lack x x mediate representations h1 ::: hN. A second en- of data which, put together, improve accuracy by coder transforms the set of morphological tags 15 percentage points over a state-of-the-art base- t t t1 ::: tM into another sequence of states h1 ::: hM: line. This is achieved through the combination of a hx = encx(hx ; x ) and ht = enct(T): novel decoder architecture, a training regime that n n−1 n m alleviates the need for costly structural biases that In our implementation, we use a single layer bi- force attention monotonicity, and a data hallucina- directional recurrent encoder for the lemma, and tion technique. We also present thorough ablations a self-attention encoder (Vaswani et al., 2017) for and identify the crucial factors for success with our encoding the tags as there is no inherent order approach. (e.g. left-to-right) in their presentation. In prelim- Our system was the best performing system inary experiments, a self-attention lemma encoder in the 2019 SIGMORPHON shared task on mor- proved hard to train, while a recurrent tag encoder phological inflection when evaluated on accuracy, yielded quite competitive results. while it ranked third when evaluated with average Next, we have attention mechanisms that trans- Levenshtein distance. form the two sequences of input states into two sequences of context vectors via two matrices of 2 Task Definition and Approach attention weights (k is the current decoder time step): Morphological inflection is the process that cre- h i h i cx = P αx hx ct = P αt ht : ates grammatical forms (typically guided by sen- k n kn n k m km m tence structure) of a lexeme/lemma. As a com- Finally, the recurrent decoder computes a se- putational task it is framed as mapping from the quence of output states in a two-step process, from lemma and a set of morphological tags to the de- which a probability distribution over output char- sired form, which simplifies the task by removing acters can be computed: 0 t the necessity to infer the form from context. For an sk = sk−1 + ck example from Asturian, given the lemma aguar 0 0 x sk = dec(sk−1; ck; yk−1) and tags V;PRS;2;PL;IND, the task is to create the P(y ) = softmax(s0 ): indicative voice, present tense, 2nd person plural k k The attention mechanisms produce their weights form aguà. 0 as in Luong et al.(2015), with vx, vt, Ws , Ws , Let X = x1 ::: xN be a character sequence of the αx αt h h lemma, T = t1 ::: tM a set of morphological tags, Wαx , and Wαt being parameters to be learned: h i and Y = y1 ::: yK be an inflection target character t t s 0 h t αkm = softmax(v tanh( W t sk− ; W t hm )) sequence. The goal is to model P(Y j X; T). α 1 α x x h s0 h xi Our approach consists of three major compo- αkn = softmax(v tanh( Wαx sk; Wαx hn )): nents. First, we propose a novel two-step atten- The two-step attention process essentially first 0 tion decoder architecture (§2.1). Second, we aug- uses the decoder’s previous state sk−1 as the query ment the low-resource datasets with a data halluci- for attending over the tags. Then, it creates a tag- informed state s0 by adding the tag context ct to stem stem k k Original triple the previous state. The tag-informed state is then lemma π α ρ α κ ά μ π τ ω used as the query for attending over the source x characters and produce the context ck. The last step is then to update the recurrent state and pro- +V;2;SG;IPFV;PST παρέκαμπτες duce the output character. Ultimately, we desire that the provided tag set guides the generation, Hallucinated which also means influencing the attention over lemma π ξ ρακάμοτω the characters of the lemma. +V;2;SG;IPFV;PST π ξ ρέκαμοτες Additional Structural Biases for Attention In- Figure 2: Example of our hallucination process corporating structural biases in the model’s archi- (Greek). The lemma and inflected forms are aligned at the character level. The inside of stem-considered parts tecture or in the training objective can lead to im- (highlighted) are substituted with random characters, provements in performance, especially for tasks creating hallucinated triples (bottom). where the attention mechanism is expected to be- have similarly to an alignment model, like MT. This idea has been successfully applied to the in- loss Ll similar to Lample et al.(2018). How- flection task and is at the core of the state-of-the- ever, in order to encourage the encoder to learn art model (Wu and Cotterell, 2019). language-invariant representations, we reverse the One bias we deem important is coverage of all gradients flowing from that component into the en- input characters and tags from the attentions. Intu- coder during back-propagation. itively, this entails encouraging the model to “look at” the whole input.