arXiv:1805.02282v1 [cs.CL] 6 May 2018
Multi-Domain Neural Machine Translation

Sander Tars and Mark Fishel
Institute of Computer Science
University of Tartu, Estonia
[email protected], [email protected]

© 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

Abstract

We present an approach to neural machine translation (NMT) that supports multiple domains in a single model and allows switching between the domains when translating. The core idea is to treat text domains as distinct languages and use multilingual NMT methods to create multi-domain translation systems; we show that this approach results in significant translation quality gains over fine-tuning. We also explore whether the knowledge of pre-specified text domains is necessary. It turns out that it is, but also that quite high translation quality can be reached when the domain is not known, in some cases even higher than with known domains.

1 Introduction

Data-driven machine translation (MT) systems depend on the text domain of their training data. In a typical in-domain MT scenario the amount of parallel texts from a single domain is not enough to train a good translation system, even more so for neural machine translation (NMT; Bahdanau et al., 2014); thus models are commonly trained on a mixture of parallel texts from different domains and then later fine-tuned to in-domain texts (Luong and Manning, 2015).

In-domain fine-tuning has two main shortcomings: it depends on the availability of sufficient amounts of in-domain data in order to avoid over-fitting, and it results in degraded performance for all other domains. The latter means that for translating multiple domains one has to run an individual NMT system for each domain.

In this work we treat text domains as distinct languages: for example, instead of English-to-Estonian translation we see it as translating English news to Estonian news. We test two multilingual NMT approaches (Johnson et al., 2016; Östling and Tiedemann, 2017) in a bilingual multi-domain setting and show that both outperform single-domain fine-tuning on all the text domains in our experiments.

However, this only works when the text domain is known both when training and when translating. In some cases the text domain of the input segment is unknown; for example, web MT systems have to cope with a variety of text domains. Also, some parallel texts do not have a single domain, since they are either a mix of texts from different sources (like crawled corpora) or naturally constitute a highly heterogeneous mix of texts (like subtitles or Wikipedia articles).

We address these issues by replacing known domains with automatically derived ones. At training time we cluster parallel sentences and then apply the multi-domain approach to these clusters. When translating, the input segments are classified as belonging to one of these clusters and translated with this automatically derived information.

In the following we review related work in Section 2, then present our methodology of multi-domain NMT and sentence clustering in Section 3. After that, we describe our experiments in Sections 4 and 5 and discuss the results in Section 6. Section 7 concludes the paper.

2 Related Work

The baseline to which we compare our work is fine-tuning NMT systems to a single text domain (Luong and Manning, 2015). There, the NMT system is first trained on a mix of parallel texts from different domains and then fine-tuned via continued training on just the in-domain texts. The method shows improved performance on in-domain test data but degrades performance on other domains.

In (Sennrich et al., 2016a) the NMT system is parametrized with one additional input feature (politeness), which is included as part of the input sequence, similarly to one of our two approaches (in our work, the domain tag approach). However, their goal is different from ours.

In (Kobus et al., 2017) additional word features are used for specifying the text domain together with the same approach as (Sennrich et al., 2016a). Although both methods overlap with the first part of our work (domain features and domain tags), they only test these methods on pre-specified domains, while we include automatic domain clustering and identification. Also, they use in-domain trained NMT systems as baselines even for small parallel corpora and do experiments with a different NMT architecture. Finally, their results show very modest improvements, while in our case the improvements are much greater.

Other approaches also define a mixture of domains, for example (Britz et al., 2017; Chen et al., 2016). However, both define custom NMT methods and also limit the experiments to the cases where the text domain is known.

3 Methodology

In the following we describe two different approaches to treating text domains as distinct languages and using multilingual methods, resulting in multi-domain NMT models. The first approach is inspired by Google's multilingual NMT (Johnson et al., 2016) and the second one by the cross-lingual language models (Östling and Tiedemann, 2017). Then we describe our methods of unsupervised domain segmentation used in our experiments in comparison with the pre-specified text domains.

3.1 Domain as a Tag

The first approach is based on (Johnson et al., 2016). Their method of multilingual translation is based on training the NMT model on data from multiple language pairs, while prepending a token specifying the target language to the beginning of the source sequence. No changes to the NMT architecture are required with this approach. They show that the method improves NMT for all languages involved; as an additional benefit, there is no increase in the number of parameters, since all language pairs are included in the same model.

We adapt the language tag approach to text domains, prepending the domain ID to each source sentence; thus, for instance, “How you doin’ ?” from OpenSubtitles2016 (Lison and Tiedemann, 2016) becomes “OpenSubs How you doin’ ?”.

The described method has two advantages. Firstly, it is independent of the NMT architecture, and scaling to more domains means simply adding data for these domains. We can assign a domain to each sentence pair of the training set, or set the domain to “other” for sentences whose domain we cannot or do not want to identify.

Secondly, in a multilingual NMT model, all parameters are implicitly shared by all the language pairs being modeled. This forces the model to generalize across language boundaries during training. It is observed that when language pairs with little available data and language pairs with abundant data are mixed into a single model, translation quality on the low-resource language pair is significantly improved.

We expect this to be even more useful for text domains. Traditional tuning to a low-resource domain, or to any specific domain for that matter, would likely result in over-fitting to that domain. Our approach, where all parameters are shared, learns target domain representations without harming other domains' results while maintaining the ability to generalize also on in-domain translation, because little to no over-fitting will be caused. Furthermore, since domains are much more similar than languages, we expect the parameter sharing to have a stronger effect.

3.2 Domain as a Feature

The second approach is based on (Östling and Tiedemann, 2017) for continuous multilingual language models. The authors propose to use a single RNN model with language vectors that indicate what language is used. As a result each language gets its own embedding, thus ending up with a language model with a predictive distribution $p(x_t \mid x_{1 \ldots t-1}, l)$ which is a continuous function of the language vector $l$.

In our approach the same idea is implemented via word features of Nematus (Sennrich et al., 2017), with their learned embeddings replacing the language vector of (Östling and Tiedemann, 2017). For example, translating “This is a sentence .” to the Estonian Wikipedia domain would mean an input of “This|2wi is|2wi a|2wi sentence|2wi .|2wi”¹

Having a single language model learn several languages helps similar languages improve each other's representations (Östling and Tiedemann, 2017). Also, they point out that this greatly alleviates the problem of sparse data for smaller languages. We expect the same effect for text domains, especially since similarity between different domains of the same languages is higher than between different languages. Moreover, similarly to the domain tag approach, the usage of many domains in one

(Pagliardini et al., 2017). After that, we apply KMeans clustering to identify the clusters in the set of calculated sentence vectors. Finally, we tag each sentence with the label that it was assigned by KMeans. To find the optimal number of clusters, we create several versions with different numbers of clusters.

To tag the test/dev set sentences, we train a FastText (Bojanowski et al., 2016; Joulin et al., 2016) supervised classification model on the tagged training set. For each of the cluster versions and for each language pair, we train a separate FastText model. The additional benefit of this kind of clustering is that each new input sentence can be efficiently assigned its cluster. Also, because the train-set clusters are potentially more homogeneous, the new sentence is hypothetically assigned a more appropriate domain than it would be assigned in the case of the pre-defined domains.
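Once every sentence carries a domain label, whether pre-specified or assigned by clustering, the two input encodings of Sections 3.1 and 3.2 differ only in where the label is attached. The following is a minimal sketch of that preprocessing step, not the code used in the experiments. The function names are hypothetical, the label spellings ("OpenSubs", "2wi") and the "|" separator are copied from the examples in the text, and tokenization is assumed to be plain whitespace splitting.

```python
from typing import Optional

# Fallback label for sentences whose domain is unknown (Section 3.1).
OTHER = "other"


def add_domain_tag(sentence: str, domain: Optional[str] = None) -> str:
    """Domain as a tag: place a single domain token before the source sentence."""
    return f"{domain or OTHER} {sentence}"


def add_domain_feature(sentence: str, domain: Optional[str] = None) -> str:
    """Domain as a feature: attach the domain label to every token.

    Assumption: tokens are separated by whitespace and the feature is
    joined to the token with "|", as in the example of Section 3.2.
    """
    label = domain or OTHER
    return " ".join(f"{token}|{label}" for token in sentence.split())


if __name__ == "__main__":
    print(add_domain_tag("How you doin' ?", "OpenSubs"))
    # OpenSubs How you doin' ?
    print(add_domain_feature("This is a sentence .", "2wi"))
    # This|2wi is|2wi a|2wi sentence|2wi .|2wi
```

With automatically derived domains, the cluster label predicted for a sentence would simply be passed as the `domain` argument in place of a corpus name; the rest of the pipeline stays unchanged.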