NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications
Rebecca Knowles and Samuel Larkin and Darlene Stewart and Patrick Littell
National Research Council Canada
{Rebecca.Knowles, Samuel.Larkin, Darlene.Stewart, Patrick.Littell}@nrc-cnrc.gc.ca

Abstract

We describe the National Research Council of Canada (NRC) neural machine translation systems for the German–Upper Sorbian supervised track of the 2020 shared task on Unsupervised MT and Very Low Resource Supervised MT. Our models are ensembles of Transformer models, built using combinations of BPE-dropout, lexical modifications, and backtranslation.

1 Introduction

We describe the National Research Council of Canada (NRC) neural machine translation systems for the shared task on Unsupervised MT and Very Low Resource Supervised MT. We participated in the supervised track of the low resource task, building Upper Sorbian–German neural machine translation (NMT) systems in both translation directions. Upper Sorbian is a minority language spoken in Germany. We built baseline systems (standard Transformer (Vaswani et al., 2017) with a byte-pair encoding vocabulary (BPE; Sennrich et al., 2016b)) trained on all available parallel data (60,000 lines), which resulted in unusually high BLEU scores for a language pair with such limited data.

In order to improve upon this baseline, we used transfer learning with modifications to the training lexicon. We did this in two ways: by experimenting with the application of BPE-dropout (Provilkov et al., 2020) in the transfer learning setting (Section 2.3), and by modifying the Czech data used for training parent systems with word and character replacements in order to make it more "Upper Sorbian-like" (Section 2.4).

Our final systems were ensembles of systems built using transfer learning and these two approaches to lexicon modification, along with iterative backtranslation.

2 Approaches

2.1 General System Notes

In both translation directions, our final systems consist of ensembles of multiple systems, built using transfer learning (Section 2.2), BPE-dropout (Section 2.3), alternative preprocessing of Czech data (Section 2.4), and backtranslation (Section 2.5). We describe these approaches and related work in the following sections, providing implementation details for reproducibility in Sections 3, 4 and 5.

2.2 Transfer Learning

Zoph et al. (2016) proposed a transfer learning approach for neural machine translation, using language pairs with larger amounts of data to pre-train a parent system, followed by finetuning a child system on the language pair of interest. Nguyen and Chiang (2017) expand on that, showing improved performance using BPE and shared vocabularies between the parent and child. We follow this approach: we build disjoint source and target BPE models and vocabularies, with one vocabulary for German (DE) and one for the combination of Czech (CS) and Upper Sorbian (HSB); see Section 4.

We chose to use Czech–German data as the parent language pair due to the task suggestions, relative abundance of data, and the close relationship between Czech and Upper Sorbian (cf. Lin et al., 2019; Kocmi and Bojar, 2018). While Czech and Upper Sorbian cognates are often not identical at the character level (Table 1), there is a high level of character-level overlap; trying to take advantage of that overlap without assuming complete character-level identity is a motivation for the explorations in subsequent sections (Section 2.3, Section 2.4).

Another relatively high-resource language related to Upper Sorbian is Polish, but while the Czech and Upper Sorbian orthographies are fairly similar, mostly using the same characters for the same sounds (with a few notable exceptions), Polish orthography is more distinct. This, combined with the lack of a direct Polish–German parallel dataset in the constrained condition, led us to choose Czech as our transfer language for these experiments.

    Czech          Upper Sorbian
    analyzovat     analyzować
    donesl         donjesł
    externích      eksternych
    hospodářská    hospodarsce
    kreativní      kreatiwne
    okres          wokrjes
    potom          potym
    projekt        projekt
    sémantická     semantisku
    velkým         wulkim

Table 1: A sample of probable Czech–Upper Sorbian cognates and shared loanwords, mined from the Czech–German and German–Upper Sorbian parallel corpora and filtered by Levenshtein distance.

Other work on transfer learning for low-resource machine translation includes multilingual seed models (Neubig and Hu, 2018), dynamically adding to the vocabulary when adding languages (Lakew et al., 2018), and using a hierarchical architecture to use multiple language pairs (Luo et al., 2019).
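To make the shared-vocabulary setup concrete, the following is a minimal sketch assuming the subword-nmt toolkit's Python API (the reference BPE implementation of Sennrich et al., 2016b); the file names and merge count are illustrative and not the actual configuration (see Section 4):

    # Minimal sketch of the disjoint source/target BPE setup. One BPE model
    # covers German (DE); the other is learned on the concatenation of Czech
    # (CS) and Upper Sorbian (HSB), so that subwords (and their embeddings)
    # from CS-DE pre-training carry over to HSB-DE finetuning.
    import codecs
    from subword_nmt.learn_bpe import learn_bpe

    def learn_joint_bpe(corpus_paths, codes_path, num_symbols=10000):
        # Concatenate the input corpora, then learn one BPE model on the result.
        concat_path = codes_path + ".input"
        with codecs.open(concat_path, "w", encoding="utf-8") as joint:
            for path in corpus_paths:
                with codecs.open(path, encoding="utf-8") as f:
                    joint.writelines(f)
        with codecs.open(concat_path, encoding="utf-8") as infile, \
             codecs.open(codes_path, "w", encoding="utf-8") as outfile:
            learn_bpe(infile, outfile, num_symbols)

    # Shared CS+HSB (source) model: the HSB data is tiny, but including it
    # ensures its characters and frequent subwords are covered.
    learn_joint_bpe(["train.cs-de.cs", "train.hsb-de.hsb"], "codes.cs+hsb")
    # German (target) model, learned over the German sides of both corpora.
    learn_joint_bpe(["train.cs-de.de", "train.hsb-de.de"], "codes.de")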
2.3 BPE-Dropout

We apply the recently-proposed approach of performing BPE-dropout (Provilkov et al., 2020), which takes an existing BPE model and randomly drops some merges at each merge step when applying the model to text. The goal of this, besides leading to more robust subword representations in general, is to produce subword representations that are more likely to overlap between the pre-training (Czech–German) and finetuning (Upper Sorbian–German) stages. We hypothesized that, in the same way that BPE-dropout leads to robustness against accidental spelling errors and variant spellings (Provilkov et al., 2020), it could likewise lead to robustness to the kind of spelling variations we see between two related languages.

For example, consider the putative Czech–Upper Sorbian cognates and shared loanwords presented in Table 1. Sometimes a fixed BPE segmentation happens to separate shared characters into shared subwords (e.g. CS analy@@ z@@ ovat vs. HSB analy@@ z@@ ować), such that the presence of the former during pre-training can initialize at least some of the subwords that the model will later see in Upper Sorbian. However, other times the character-level differences lead to segmentations where no subwords are shared (e.g. CS hospodář@@ ská vs. HSB hospodar@@ sce, or CS potom vs. HSB po@@ tym). Considering a wider variety of segmentations would, we hypothesized, mean that Upper Sorbian subwords would have more chance of being initialized during Czech pre-training (see Appendix C).

Rather than modifying the NMT system itself to reapply BPE-dropout during training, we treated BPE-dropout as a preprocessing step (sketched below). Additionally, we experimented with BPE-dropout in the context of transfer learning, examining the effects of using source-side, both-sides, or no dropout in both parent and child systems.
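As a minimal sketch of this preprocessing view, assuming a recent subword-nmt release in which apply_bpe accepts a dropout probability (support added following Provilkov et al., 2020); the dropout rate and file names are illustrative:

    # Minimal sketch: BPE-dropout as corpus preprocessing. Re-running this
    # step produces a differently segmented copy of the training data, with
    # no changes to the NMT toolkit itself.
    import codecs
    from subword_nmt.apply_bpe import BPE

    with codecs.open("codes.cs+hsb", encoding="utf-8") as f:
        bpe = BPE(f)

    with codecs.open("train.cs-de.cs", encoding="utf-8") as infile, \
         codecs.open("train.cs-de.cs.bpe", "w", encoding="utf-8") as outfile:
        for line in infile:
            # Each merge is skipped with probability 0.1 at application time;
            # dropout=0 would recover the standard deterministic segmentation.
            outfile.write(bpe.process_line(line, dropout=0.1))

Because segmentation now happens outside the NMT toolkit, the corpus can simply be re-segmented (for example, between epochs, or as several concatenated copies prepared in advance) to expose the model to many segmentations of the same data.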
2.4 Pseudo-Sorbian

For the Upper Sorbian–German direction, we also experimented with two techniques for modifying the Czech–German parallel data so that the Czech side is more like Upper Sorbian. In particular, we concentrated on modification methods that require neither large amounts of data nor in-depth knowledge of the historical relationships between the languages, since both of these are often lacking for the lower-resourced language. We considered two variations of this idea:

• word-level modification, in which some frequent Czech words (e.g. prepositions) are replaced by likely Upper Sorbian equivalents, and
• character-level modification, where we attempt to convert Czech words at the character level to forms that may more closely resemble Upper Sorbian words.

Note that in neither case do we know which particular conversions are correct; we ourselves do not know enough about historical Western Slavic to predict the actual Upper Sorbian cognates of Czech words. Rather, we took inspiration from stochastic segmentation methods like BPE-dropout (Provilkov et al., 2020) and SentencePiece (Kudo and Richardson, 2018): when we have an idea of the possible solutions to the segmentation problem but do not know which one is the correct one, we can sample randomly from the possible segmentations as a sort of regularization, with the goal of discouraging the model from relying too heavily on a single segmentation scheme and giving it some exposure to a variety of possible segmentations. Whereas BPE-dropout and SentencePiece focus on possible segmentations of the word, our pseudo-Sorbian experiments focus on possible word- and character-level replacements. The goal was to discourage the parent Czech–German model from relying too heavily on regularities in Czech (e.g. the presence of particular frequent words, or of particular Czech character n-grams) and perhaps also to gain some prior exposure to Upper Sorbian words and characters that will occur in the genuine Upper Sorbian data; we can also think of this as a form of low-resource data augmentation (Fadaee et al., 2017; Wang et al., 2018). See Appendix C for an analysis of increased subword overlap between pseudo-Sorbian and test data, as compared to BPE-dropout and the baseline approach.

2.4.1 Word-level pseudo-Sorbian

To generate the word-level pseudo-Sorbian, we ran …

2.4.2 Character-level pseudo-Sorbian

Starting from a small set of plausible character correspondences (e.g. CS v to HSB w, CS d to HSB dź before front vowels, etc.), we wrote a program to randomly replace the appropriate Czech character sequences with possible correspondences in Upper Sorbian (a sketch of this idea appears below, after Section 2.4.3). Again, as Czech–Upper Sorbian correspondences are not entirely predictable (CS e might happen to correspond, in a particular cognate, to HSB e or ej or i or a or o, etc.), we cannot expect that any given result is correct Upper Sorbian. Rather, we can think of this process as attempting to train a system that can respond to inputs from a variety of possible (but not necessarily actual) Western Slavic languages, rather than just a system that can respond to precisely-spelled Czech and only Czech.

2.4.3 Combined pseudo-Sorbian

In initial testing, we determined that a combination of word-level and character-level modification performed best; we ran each process on the Czech–German corpus separately, then concatenated the resulting corpora and trained a parent model on it.
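As an illustration of the character-level replacement of Section 2.4.2, the following minimal sketch applies a handful of stochastic correspondence rules. The rules are only the examples mentioned above, standing in for the actual, larger hand-written rule set (which is not reproduced here), and the firing probabilities are invented for illustration:

    # Minimal sketch of stochastic character-level "pseudo-Sorbian" conversion.
    # RULES holds (pattern, replacement, probability) triples; each match is
    # rewritten with the given probability, so the same Czech word can surface
    # in many Upper-Sorbian-like (but not necessarily real Upper Sorbian) forms.
    import random
    import re

    # Illustrative rules only, built from the examples in Section 2.4.2;
    # probabilities are invented for this sketch.
    RULES = [
        (re.compile(r"v"), lambda: "w", 0.9),             # CS v -> HSB w
        (re.compile(r"d(?=[eěií])"), lambda: "dź", 0.5),  # CS d -> HSB dź before front vowels
        (re.compile(r"e"), lambda: random.choice(["e", "ej", "i", "a", "o"]), 0.3),
    ]

    def pseudo_sorbian(word):
        for pattern, replacement, probability in RULES:
            def maybe_replace(match):
                # Fire the rule stochastically; otherwise keep the original text.
                return replacement() if random.random() < probability else match.group(0)
            word = pattern.sub(maybe_replace, word)
        return word

    # Example: "velkým" may come out as "welkým", "wilkým", "walkým", ...
    # (the true HSB cognate, wulkim, is not reachable with these toy rules,
    # which is consistent with the goal of variety rather than correctness).
    print(" ".join(pseudo_sorbian(w) for w in "potom velkým projekt".split()))

For the combined variant (Section 2.4.3), a conversion like this and the word-level replacement would each be run over a separate copy of the Czech side, with the two resulting corpora concatenated before parent training.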