SIGMORPHON 2020 Task 0 System Description: ETH Zürich Team

Martina Forster    Clara Meister
ETH Zürich

Abstract

This paper presents our system for the SIGMORPHON 2020 Shared Task. We build on the baseline systems, performing exact inference on models trained on language family data. Our systems return the globally best solution under these models. Our two systems achieve 80.9% and 75.6% accuracy on the test set. We ultimately find that, in this setting, exact inference does not seem to help or hinder the performance of morphological inflection generators, which stands in contrast to its effect on Neural Machine Translation (NMT) models.

1 Introduction

Morphological inflection generation is the task of generating a specific word form given a lemma and a set of morphological tags. It has a wide range of applications; in particular, it can be useful for morphologically rich but low-resource languages. If a language has complex morphology but only scarce data are available, vocabulary coverage is often poor. In such cases, morphological inflection can be used to generate additional word forms for training data.

Typologically diverse morphological inflection is the focus of Task 0 of the SIGMORPHON Shared Tasks (Vylomova et al., 2020), to which we submit this system. Specifically, the task requires the aforementioned transformation from lemma and morphological tags to inflected form. A main challenge of the task is that it covers a typologically diverse set of languages, i.e., the languages exhibit a wide range of structural patterns and features. Additionally, for a portion of these languages, only scant resources are available.

Our approach is to train models on language families rather than solely on individual languages. This strategy should help us overcome the problems frequently encountered for low-resource tasks, e.g., overfitting, by increasing the amount of training data used for each model. The strategy is viable due to the typological similarities between languages within the same family. We combine two of the neural baseline architectures provided by the task organizers, a multilingual Transformer (Wu et al., 2020) and a (neuralized) hidden Markov model with hard monotonic attention (Wu and Cotterell, 2019), albeit with a different decoding strategy: we perform exact inference, returning the globally optimal solution under the model.

2 Background

Neural character-to-character transducers (Faruqui et al., 2016; Kann and Schütze, 2016) define a probability distribution p_θ(y | x), where θ is a set of weights learned by a neural network and x and y are inputs and (possible) outputs, respectively. In the case of morphological inflection, x represents the lemma we are trying to inflect and the morphosyntactic descriptions (MSDs) indicating the inflection we desire; y is then a candidate inflected form of the lemma from the set of all valid character sequences Y. Note that valid character sequences are padded with distinguished tokens, BOS and EOS, indicating the beginning and end of the sequence.

The neural character-to-character transducers we consider in this work are locally normalized. Specifically, the model p_θ is a probability distribution over the set of possible characters which models p_θ(· | x, y_<t) for any time step t. By the chain rule of probability, p_θ(y | x) decomposes as

    p_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid x, y_{<t})        (1)

The decoding objective then aims to find the most probable sequence among all valid sequences:

    y^\star = \operatorname*{argmax}_{y \in \mathcal{Y}} \log p_\theta(y \mid x)        (2)

This is known as maximum a posteriori (MAP) decoding. While the above optimization problem implies that we find the global optimum y*, in practice we often only perform a heuristic search, e.g., beam search, since exact search can be quite computationally expensive due to the size of Y and the dependency of p_θ(· | x, y_<t) on all previous output tokens. For neural machine translation (NMT) specifically, while beam search often yields better results than greedy search, translation quality almost always decreases for beam sizes larger than 5. We refer the interested reader to the large number of works that have studied this phenomenon in detail (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019).

Exact decoding effectively stretches the beam size to infinity (i.e., it does not limit the beam), finding the globally best solution. While the effects of exact decoding have been explored for neural machine translation (Stahlberg and Byrne, 2019), to the best of our knowledge, they have not yet been explored for morphological inflection generation. This is a natural research question, as the architectures of morphological inflection generation systems are often based on those for NMT.
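To make the decomposition in Eq. (1) and the objective in Eq. (2) concrete, the minimal sketch below scores a candidate inflection under a locally normalized model. The step_log_probs interface is hypothetical, standing in for whatever next-character distribution the Transformer or HMM exposes; it illustrates the scoring only and is not the baseline implementation.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: given the input x (lemma plus MSD tags) and the
# characters generated so far, return log p_theta(. | x, y_<t) as a mapping
# from each candidate next character to its log-probability.
StepFn = Callable[[str, Sequence[str]], Dict[str, float]]


def sequence_log_prob(step_log_probs: StepFn, x: str, y: Sequence[str]) -> float:
    """Chain-rule score of a full candidate y (Eq. 1):
    log p(y | x) = sum_t log p(y_t | x, y_<t)."""
    total = 0.0
    for t, char in enumerate(y):
        step_dist = step_log_probs(x, y[:t])     # distribution over the next character
        total += step_dist.get(char, -math.inf)  # -inf if the character is impossible
    return total
```

Since every term is a log-probability and hence non-positive, the running score can only decrease as the sequence grows; this is the monotonicity property that the exact search in Section 4.2 exploits. MAP decoding (Eq. 2) asks for the y that maximizes this score over all of Y, which is what makes exact search expensive.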
3 Data

We use the data provided by the SIGMORPHON 2020 shared task, which features lemmas, inflections, and corresponding MSDs (following the UniMorph schema (Kirov et al., 2018)) for 90 languages in total. Data was released in two phases; the first phase included languages from five families: Austronesian, Niger-Congo, Uralic, Oto-Manguean, and Indo-European. Data from the second phase included languages belonging to the Afro-Asiatic, Algic, Australian, Dravidian, Germanic, Indo-Aryan, Iranian, Niger-Congo, Nilo-Saharan, Romance, Sino-Tibetan, Siouan, Tungusic, Turkic, Uralic, and Uto-Aztecan families. The full list of languages can be found on the task website: https://sigmorphon.github.io/sharedtasks/2020/task0/.

Due to the scarcity of resources available to the task organizers, many of the languages had only a few morphological forms annotated. For example, Zarma, a Songhay language, had only 56 available inflections in the training set and 9 in the development set.

4 System Description

Our systems are built using two model architectures provided as baselines by the task organizers: a multilingual Transformer (Wu et al., 2020) and a (neuralized) hidden Markov model (HMM) with hard monotonic attention (Wu and Cotterell, 2019). We then perform exact inference on the models. The following subsections explain the two components separately.

4.1 Model Architectures

The architectures of both models exactly follow those of the Transformer and HMM proposed as baselines for the SIGMORPHON 2020 Task 0. We do this in part to create a clear comparison between morphological inflection generation systems that perform inference with exact vs. heuristic decoding strategies.

We trained HMMs for each language family for a maximum of 50 epochs and Transformers for a maximum of 20,000 steps. Early stopping was performed if subsequent validation set losses differed by less than 1e-3. Batch sizes of 30 and 100, respectively, were used. Other training configurations followed those of the baseline systems.

Due to the resource scarcity for many of the task's languages, we used entire language families to train models rather than individual languages. Specifically, we aggregated the data from all languages of a given family, using a cross-lingual learning approach. We did not subsequently fine-tune the models on individual languages: we perform no additional training on individual languages, nor do we re-target the vocabulary during decoding. This means generation of invalid characters (i.e., characters invalid for a specific language) is possible.

4.2 Decoding

For decoding, we perform exact inference with a search strategy built on top of the SGNMT library (Stahlberg et al., 2017). Specifically, we use Dijkstra's search algorithm, which provably returns the optimal solution when path scores monotonically decrease with length. From Equation (1), we can see that the scoring function for sequences y is monotonically decreasing in t, therefore meeting this criterion. Additionally, to prevent a large
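As a concrete illustration of the decoding strategy, the sketch below implements a Dijkstra-style (best-first) search over partial character sequences. It is not our SGNMT-based setup, only a minimal self-contained version of the idea; it reuses the hypothetical step_log_probs interface from Section 2 and assumes an explicit EOS symbol in the model's vocabulary. Because each expansion adds a non-positive log-probability (equivalently, a non-negative path cost), the first EOS-terminated hypothesis popped from the priority queue is the global optimum of Eq. (2).

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "</s>"  # assumed end-of-sequence symbol

StepFn = Callable[[str, Sequence[str]], Dict[str, float]]


def exact_decode(step_log_probs: StepFn, x: str,
                 max_len: int = 100) -> Tuple[List[str], float]:
    """Best-first (Dijkstra-style) search for the globally best output.

    Scores never increase as a hypothesis is extended, so the first complete
    (EOS-terminated) hypothesis popped from the queue is optimal."""
    # Min-heap ordered by negative log-probability, i.e. accumulated path cost.
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == EOS:
            return prefix, -cost                 # optimal hypothesis found
        if len(prefix) >= max_len:
            continue                             # guard against unbounded growth
        for char, log_p in step_log_probs(x, prefix).items():
            if log_p == -math.inf:
                continue                         # skip impossible continuations
            heapq.heappush(frontier, (cost - log_p, prefix + [char]))
    return [], -math.inf                         # no complete hypothesis within max_len
```

In practice the frontier can grow quickly, so implementations typically bound the output length and memory use, and may prune hypotheses that can no longer beat the best complete score found so far.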
5 Results

Many of the languages that performed poorly were from the Germanic and Uralic families. Poor performance on these languages may stem from the fact that they belong to families that contain high-resource languages. As we trained on language family data and did not fine-tune the models, it is possible that lower-resource languages in a high-resource family, which are underrepresented in the training data, are not adequately modelled. In this setting, performance would likely improve noticeably with fine-tuning on the individual languages.

            Greedy          Beam-5
          acc    dist     acc    dist
    ang  0.720   0.52    0.726   0.48
    azg  0.921   0.22    0.920   0.28
    ceb  0.820   0.41    0.820   0.39
    cly  0.764   0.44    0.762   0.50
    cpa  0.846   0.22    0.845   0.23
    czn  0.800   0.48    0.793   0.54
    deu  0.977   0.03    0.976   0.03
    dje  0.188   1.88    0.000   2.38

Table 1: Accuracy and Levenshtein distance on the test set for greedy and beam search with beam size 5 for HMMs.

            Greedy          Beam-5
          acc    dist     acc    dist
    ang  0.574   0.76    0.578   0.75
    azg  0.808   0.63    0.813   0.62
    ceb  0.874   0.27    0.874   0.27
    cly  0.653   0.72    0.657   0.71
    cpa  0.651   0.52    0.653   0.52
    czn  0.695   0.62    0.702   0.59
    deu  0.883   0.19    0.882   0.19
    dje  0.938   0.12    0.938   0.12

Table 2: Accuracy and Levenshtein distance on the test set for greedy and beam search with beam size 5 for Transformers.

6 Conclusion

We perform exact inference on two baseline neural architectures for morphological inflection, a Transformer and a (neuralized) hidden Markov model with hard monotonic attention, to find the inflec-