Pre-training via Leveraging Assisting Languages for Neural Machine Translation
Haiyue Song1 Raj Dabre2 Zhuoyuan Mao1 Fei Cheng1 Sadao Kurohashi1 Eiichiro Sumita2
1Kyoto University, Kyoto, Japan
2National Institute of Information and Communications Technology, Kyoto, Japan
{song, feicheng, zhuoyuanmao, [email protected], {raj.dabre, [email protected]

Abstract

Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks. However, large monolingual corpora might not always be available for the languages of interest (LOI). Thus, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. We utilize script mapping (Chinese to Japanese) to increase the similarity (number of cognates) between the monolingual corpora of helping languages and the LOI. An empirical case study of low-resource Japanese–English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. Using only Chinese and French monolingual corpora, we were able to improve Japanese–English translation quality by up to 8.5 BLEU in low-resource scenarios.

1 Introduction

Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) is known to give state-of-the-art (SOTA) translations for language pairs with an abundance of parallel corpora. However, most language pairs are resource-poor (Russian–Japanese, Marathi–English): they lack large parallel corpora, and the lack of bilingual training data can be compensated for by monolingual corpora. Although it is possible to utilise the popular back-translation method (Sennrich et al., 2016a), it is time-consuming to back-translate a large amount of monolingual data. Furthermore, poor-quality back-translated data tends to be of little help. Recently, another approach has gained popularity in which the NMT model is pre-trained through tasks that only require monolingual data (Song et al., 2019; Qi et al., 2018).

Pre-training using models like BERT (Devlin et al., 2018) has led to new state-of-the-art results in text understanding. However, BERT-like sequence models were not designed to be used for NMT, which is sequence to sequence (S2S). Song et al. (2019) recently proposed MASS, an S2S-specific pre-training task for NMT, and obtained new state-of-the-art results in low-resource settings. MASS assumes that a large amount of monolingual data is available for the languages involved, but some language pairs lack both parallel and monolingual corpora and are therefore "truly low-resource" and challenging.

Fortunately, languages are not isolated: they often belong to "language families" in which they share similar orthography (written script; shared cognates), similar grammar, or both. Motivated by this, in this paper we hypothesize that we should be able to leverage large monolingual corpora of other, assisting languages to help the monolingual pre-training of NMT models for the languages of interest (LOI) that may lack monolingual corpora. Wherever possible, we subject the pre-training corpora to script mapping, which should help minimize the vocabulary and distribution differences between the pre-training, main training (fine-tuning) and testing datasets. This should help the already consistent pre-training and fine-tuning objectives leverage the data much better and thereby, possibly, boost translation quality.

To this end, we experiment with ASPEC Japanese–English translation in a variety of low-resource settings for the Japanese–English parallel corpora. Our experiments reveal that while it is possible to leverage unrelated languages for pre-training, using related languages is extremely important. We utilized Chinese-to-Japanese script mapping to maximize the similarities between the assisting languages (Chinese and French) and the languages of interest (Japanese and English). We show that using only monolingual corpora of Chinese and French for pre-training can improve Japanese–English translation quality by up to 8.5 BLEU.

The contributions of our work are as follows:
1. Leveraging assisting languages: We present a novel study of leveraging monolingual corpora of related and unrelated languages for NMT pre-training.
2. Empirical evaluation: We compare existing and proposed techniques in a variety of corpora settings to verify our hypotheses.
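The paper only references MASS rather than describing it, so the following minimal Python sketch of the masked-span idea behind it may help: a contiguous span of tokens is masked on the encoder side and the decoder is trained to reconstruct that span from monolingual data alone. The function and mask-token names (make_mass_example, MASK) are illustrative assumptions, not taken from the paper or the MASS codebase.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; a real system defines its own vocabulary token

def make_mass_example(tokens, mask_ratio=0.5, seed=None):
    """Build one MASS-style (encoder_input, decoder_target) pair from a
    monolingual sentence: mask a contiguous span covering roughly
    `mask_ratio` of the tokens and ask the decoder to reconstruct it."""
    rng = random.Random(seed)
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))
    start = rng.randint(0, n - span_len)

    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target

# Example with a whitespace-tokenized sentence (real data would use subword units).
enc, dec = make_mass_example("large monolingual corpora help pre-training".split(), seed=0)
print(enc)  # encoder input containing a contiguous masked span
print(dec)  # the tokens the decoder must reconstruct
```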
2 Related work

Our research lies at the intersection of work on monolingual pre-training for NMT and work on leveraging multilingualism for low-resource translation.

Pre-training has enjoyed great success in other NLP tasks with the development of methods like BERT (Devlin et al., 2018). Song et al. (2019) recently proposed MASS, a new state-of-the-art NMT pre-training task that jointly trains the encoder and the decoder. Our approach builds on the initial idea of MASS, but focuses on complementing the potential scarcity of monolingual corpora for the languages of interest using relatively larger monolingual corpora of other (assisting) languages.

On the other hand, leveraging multilingualism involves cross-lingual transfer (Zoph et al., 2016), which addresses the low-resource issue by using data from different language pairs. Dabre et al. (2017) showed the importance of transfer learning between languages belonging to the same language family, but corpora might not always be available in a related language. A mapping between Chinese and Japanese characters (Chu et al., 2012) was shown to be useful for Chinese–Japanese dictionary construction (Dabre et al., 2015). Mappings between scripts, or unification of scripts (Hermjakob et al., 2018), can artificially increase the similarity between languages, which motivates most of our work.

3 Proposed Method: Using Assisting Languages

We propose a novel monolingual pre-training method for NMT which leverages monolingual corpora of assisting languages to overcome the scarcity of monolingual and parallel corpora of the languages of interest (LOI). The framework of our approach, shown in Figure 1, consists of script mapping, data selection, pre-training and fine-tuning.

[Figure 1: An overview of our proposed method, consisting of script mapping, data selection, pre-training and fine-tuning.]

3.1 Data Pre-processing

Blindly pre-training an NMT model on vast amounts of monolingual data belonging to the assisting languages and the LOI might improve translation quality slightly. However, divergences between the languages, especially between their scripts (Hermjakob et al., 2018), as well as differences in the data distributions across the different training phases, are known to impact the final result. Motivated by past work on using related languages (Dabre et al., 2017), orthography mapping/unification (Hermjakob et al., 2018; Chu et al., 2012) and data selection for MT (Axelrod et al., 2011), we propose to improve the efficacy of pre-training by reducing data and language divergence.

3.1.1 Script Mapping

Previous research has shown that enforcing shared orthography (Sennrich et al., 2016b; Dabre et al., 2015) has a strong positive impact on translation. Following this, we propose to leverage existing script mapping rules [1] or script unification mechanisms to, at the very least, maximize the possibility of cognate sharing and thereby bring the assisting language closer to the LOI. This should strongly benefit languages such as Hindi, Punjabi and Bengali, which belong to the same family but are written in different scripts.

For languages such as Korean, Chinese and Japanese there may exist a many-to-many mapping between their scripts. Thus, an incorrect mapping of characters (the basic units of a script) might produce wrong words and reduce cognate sharing. We propose two solutions to address this; a sketch of the second is given at the end of this subsection.

1. One-to-one mapping: We ignore word-level information and map each character in one language to its corresponding character in the other language, simply taking the first entry in the mapping table.

2. Many-to-many mapping with LM scoring: A more sophisticated solution in which, for each tokenized word-level segment in one language, we enumerate all possible combinations of mapped characters and use a language model in the other language to select the character combination with the highest score.

[1] Transliteration is another option, but transliteration systems are relatively unreliable compared to handcrafted rule tables.
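The Python sketch below illustrates the two mapping solutions under simplifying assumptions: `mapping` stands for a hypothetical character mapping table (each source character maps to one or more candidate target characters), and `lm_score` is a stand-in for any language model that scores a candidate string, higher being better. Neither name comes from the paper.

```python
from itertools import product
from typing import Callable, Dict, List

def one_to_one(segment: str, mapping: Dict[str, List[str]]) -> str:
    """Solution 1: always take the first candidate; characters without an
    entry in the mapping table are kept unchanged."""
    return ''.join(mapping.get(ch, [ch])[0] for ch in segment)

def map_with_lm(segment: str,
                mapping: Dict[str, List[str]],
                lm_score: Callable[[str], float]) -> str:
    """Solution 2: enumerate every combination of candidate characters for a
    word-level segment and keep the combination the language model prefers."""
    candidates_per_char = [mapping.get(ch, [ch]) for ch in segment]
    combinations = (''.join(chars) for chars in product(*candidates_per_char))
    return max(combinations, key=lm_score)
```

Because the segments are word-level, the number of enumerated combinations stays small; for longer units a beam search with incremental language-model scoring would be the natural extension.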
3.1.2 Note on Chinese–Japanese Scripts

Japanese is written in Kanji, which was borrowed from China. Over time the written scripts have diverged and the pronunciations naturally differ, but a significant number of cognates are written in both languages. As such, pre-training on Chinese should benefit translation involving Japanese. Chu et al. (2012) created a mapping table between the two character sets, which can be leveraged to further increase the number of cognates.

3.1.3 Data Selection

Often, the pre-training monolingual data and the [...]ically the monolingual corpus). When selecting monolingual data of the languages of interest, we can first calculate the length distribution of the parallel data as the target distribution (the ratio of each sentence length in TargetFile) and then fill this distribution by selecting sentences from monolingual data of the same language (Algorithm 1).

Algorithm 1: Length Distribution Data Selection
  Input: TargetFile, InputFile, SelectNum
  Output: SelectedLines
  TargetD <- {};  CurrentD <- {};  SelectedLines <- {}
  TargetNum <- number of lines in TargetFile
  foreach Line in TargetFile do
      TargetD[len(Line)] += 1
  foreach Line in InputFile do
      if CurrentD[len(Line)] / SelectNum < TargetD[len(Line)] / TargetNum then
          CurrentD[len(Line)] += 1
          SelectedLines <- SelectedLines ∪ {Line}
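A minimal runnable Python version of Algorithm 1 follows. It is a sketch under two assumptions the paper does not spell out: len(Line) is taken to be the number of whitespace-separated tokens, and SelectNum is the desired number of selected sentences; the early exit once enough lines are selected is an added convenience.

```python
from collections import Counter

def length_distribution_selection(target_lines, input_lines, select_num):
    """Select sentences from input_lines so that their sentence-length
    distribution approximates that of target_lines (Algorithm 1)."""
    # Target length distribution, counted over the (parallel) target file.
    target_dist = Counter(len(line.split()) for line in target_lines)
    target_num = len(target_lines)

    current_dist = Counter()
    selected = []
    for line in input_lines:
        length = len(line.split())
        # Accept the line only while this length is still under-represented
        # relative to its share in the target distribution.
        if current_dist[length] / select_num < target_dist[length] / target_num:
            current_dist[length] += 1
            selected.append(line)
            if len(selected) == select_num:  # convenience stop, not in the paper
                break
    return selected

# Example usage: pick monolingual sentences matching a small parallel corpus.
parallel_src = ["we propose a new method .", "results improve ."]
monolingual = ["this is a long example sentence here .",
               "a five token sentence .", "short one .", "another five token line ."]
print(length_distribution_selection(parallel_src, monolingual, select_num=2))
```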