A Wikipedia-based Corpus for Contextualized Machine Translation

Jennifer Drexler∗, Pushpendre Rastogi†, Jacqueline Aguilar‡, Benjamin Van Durme‡, Matt Post‡

∗Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
†Center for Language and Speech Processing, Johns Hopkins University
‡Human Language Technology Center of Excellence, Johns Hopkins University
[email protected], {pushpendre,jacqui.aguilar,vandurme,post}@jhu.edu

Abstract

We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. This corpus comprises (i) a set of curated English Wikipedia articles describing news events along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality when including language models built over these English documents and interpolated with other, separately-derived, more general language model sources. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.

Keywords: Machine Translation, Domain Adaptation, Corpus

1. Introduction

We describe a corpus for target-contextualized machine translation (MT). Here, the task is to improve the translation of source documents using language models built over presumably related, comparable documents in the target language. As a motivating example, this could be useful in a situation where there is a collection of in-language documents related to a topic, but a specific document of interest is available only in another language. Our corpus comprises (i) a set of curated English Wikipedia articles describing news events, along with (ii) their Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) human-produced reference translations of all the Spanish documents. In experiments, we translated these Spanish documents using a general translation system built on out-of-domain data. We then built multiple "in-domain" language models on the comparable English documents (one for each pair of documents) and interpolated them with the larger model, evaluating the effect on translation quality, effectively casting this task as one of (hyper) domain adaptation for MT. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score (Papineni et al., 2002).

Figure 1: The data configuration. We collected comparable English and Spanish documents along with sources cited within the Spanish articles. Reference translations were then collected for all Spanish documents; in the experiments, we use both the original (comparable) English articles (§3.3.) and the translated articles (§3.1.) to build a language model to aid translation of each set of corresponding source documents.
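As a reference point for the scores reported below, corpus-level BLEU (Papineni et al., 2002) can be sketched in a few lines. This is a simplified, single-reference illustration (the experiments themselves would have used a standard toolkit implementation); all function names here are our own:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment: the geometric mean
    of clipped n-gram precisions, scaled by the brevity penalty."""
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(hyp, n), ngrams(ref, n)
            total[n - 1] += sum(hc.values())
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

An identical hypothesis and reference score 100; any divergence in the n-gram overlap lowers the score toward 0.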

This work relates to previous efforts in domain adaptation for machine translation (Eidelman et al., 2012; Langlais et al., 2000; Langlais and Lapalme, 2002; Brousseau et al., 1995; Dymetman et al., 1994). Many of these previous problem formulations were concerned with recognizing the general domain, or topic, of the material to be translated. In comparison, we are concerned here with adapting the translation via access to information related to the exact events being described, presented in the form of a language model. We hope this might help lead to richer approaches to MT, where coreference and semantics might be measurably brought to bear.

2. Dataset

Our dataset consists of 9 English articles from Wikipedia on events of interest to the Spanish-speaking portion of the world. We manually extracted and sentence-segmented each article along with the Spanish Wikipedia article on the same topic.1 It is important to note that these are not parallel documents, but are instead at best "comparable corpora", existing in the form of independently-written articles on the same topic. Many such articles on Wikipedia, in fact, are written by authors writing independently in their native language on the same topic, but from their own perspective, in contrast to simply filling out the inter-language resources of the encyclopedia by translating documents from high- to low-resource languages. Our dataset leverages this fact, providing a relatively small corpus with high information content on the Spanish side, and comparable but impoverished versions on the target (English) side. This is pictured in Figure 1.

1 Documents were taken from Wikipedia on July 2, 2013.

In addition to the comparable document pairs, we reached into the Spanish documents and collected many of the sources cited within them. Many of these citations linked outside of Wikipedia, often to news websites. These sources were then also manually extracted and sentence-segmented. In total, we collected 695 source sentences; Table 1 contains a list of the Wikipedia articles along with sentence-level extraction statistics.

All Spanish data was then manually translated by a single bilingual speaker to produce the reference translations used for evaluation.

Topic                              Articles (Eng.)   Articles (Spa.)   Sources (Spa.)
Bolivia gas war                          236               114                21
Catalan autonomy protest                  31                45                87
Chavez death                             122                52                49
Chilean mining accident                  474               140               144
Comayagua prison fire                     53                11                50
H1N1                                     151                92                66
Papal election                           147                70                82
Pemex explosion                           29                82               117
Venezuelan presidential election         110                44                79
Total                                  1,343               650               695

Table 1: The topics and sentence counts of the Spanish Wikipedia articles used as the basis for data collection. For each topic, we have the comparable Wikipedia articles (both English and Spanish) and the sources cited within the Spanish articles.

3. Experiments

We used the Joshua Machine Translation Toolkit2 (Post et al., 2013). We include both baseline results from translating the Spanish source articles and results from a method of adapting the baseline machine translation system to the domain of a specific document. The baseline models were trained on the Spanish–English portion of version 7 of the Europarl Corpus (Koehn, 2005). From these data, we extracted Hiero grammars (Chiang, 2005) using the standard extraction settings (Chiang, 2007). We then built a 5-gram Kneser-Ney-smoothed language model from the English side of this data using KenLM (Heafield, 2011). The parameters of the decoder's linear model were tuned with Z-MERT (Och, 2003; Zaidan, 2009), which is included in the Joshua toolkit.

2 joshua-decoder.org

We describe three experiments. In the first, we use the reference translations of the articles as our set of target knowledge. Since they were produced from the Spanish versions, these reference translations can be thought of as parallel Wikipedia documents, in contrast to the comparable ones that we extracted. In the second, the English Wikipedia articles serve this role. We explore a range of interpolation weights in order to determine whether any setting will result in some improvement. The purpose of these two experiments is to determine whether in-domain language models built even on very small, targeted pieces of data might be useful in improving machine translation quality: first, in a parallel setting, and second, in a comparable setting. Finally, in a third set of experiments, we test whether these ideas can generalize when we have very small amounts of data, using a proxy setting to determine the interpolation weights.

3.1. Experiment 1: Can the contextual data help (parallel setting)?

The contextualized target-side language models are built on very small amounts of data, numbering at most a few hundred sentences (Table 1). This made them too unreliable to use as separate language models during decoding (with their own weight assigned in the decoder's linear model). Instead, our approach was to interpolate the large, general language model trained on millions of sentences (Europarl) with the domain-specific one.

For this first set of experiments, instead of using the (comparable) English Wikipedia documents to build the contextualized language model, we looked at what should be an easier setting: using the translations of the articles to build a language model when translating the sources, and vice versa. Table 2 presents the results for four settings of the interpolation weight λ, which determines how much weight is given to the large LM. With a weight of λ = 0.7, we see improvements in BLEU score from including the in-domain LM. To be clear, when translating the articles, the contextualized LM is built on the sources, and when translating the sources, the LM is built on the articles. Note that the setting λ = 1.0 is the baseline setting, with all the weight assigned to the large, out-of-domain language model.

λ          1.0     0.7     0.3     0.0
Articles   28.97   31.45   31.27   23.69
Sources    26.76   27.89   27.50   18.01

Table 2: BLEU scores (averaged across each article or group of source articles) from changing the interpolation weight λ, which denotes the weight assigned to the large, general-purpose language model. Articles denotes the average score from translating the Spanish articles, while Sources denotes the source documents cited within those articles.

3.2. Experiment 2: Can contextualized LMs help (comparable setting)?

For the second set of experiments, we use the comparable English Wikipedia articles to build the in-domain language model. Table 3 contains the results. Note that the scores are all slightly lower than when using the parallel setting, which makes sense, since the comparable data are not certain to be as close. Importantly, however, the best setting remains unchanged.
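The interpolation scheme used throughout these experiments, together with the grid search over λ described in §3.3, can be sketched as follows. This is an illustrative toy in which add-one-smoothed unigram models stand in for the paper's 5-gram KenLM models, and λ is selected by held-out perplexity (one common criterion); all class and function names here are our own:

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram LM; a stand-in for the real n-gram models."""
    def __init__(self, sentences, vocab):
        tokens = [w for sent in sentences for w in sent]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.v = len(vocab)

    def prob(self, word):
        # Add-one smoothing keeps unseen words from zeroing out the product.
        return (self.counts[word] + 1) / (self.total + self.v)

def interpolated(general, domain, lam):
    """P(w) = lam * P_general(w) + (1 - lam) * P_domain(w), where lam is the
    weight on the large, general-purpose model, as in Tables 2 and 3."""
    return lambda w: lam * general.prob(w) + (1 - lam) * domain.prob(w)

def perplexity(prob, heldout):
    """Perplexity of a word-probability function on held-out sentences."""
    n = sum(len(sent) for sent in heldout)
    logp = sum(math.log(prob(w)) for sent in heldout for w in sent)
    return math.exp(-logp / n)

def tune_lambda(general, domain, heldout, grid=(1.0, 0.7, 0.3, 0.0)):
    """Grid search for the interpolation weight on held-out text, analogous to
    the proxy tuning of Experiment 3."""
    return min(grid, key=lambda lam: perplexity(interpolated(general, domain, lam), heldout))
```

In the actual experiments, the interpolated model is a feature of the decoder's linear model and the weights are compared by BLEU; the sketch above only illustrates the mechanics of mixing a large general LM with a small contextualized one and picking a mixture weight on held-out data.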

λ          1.0     0.7     0.3     0.0
Sources    26.76   27.32   26.30   16.18

Table 3: Averaged BLEU scores from varying λ, this time using an LM built on the comparable target-language corpora (English Wikipedia articles, instead of translated Spanish articles).

3.3. Experiment 3: Can they generalize?

The above experiments demonstrate that there are interpolation weights that result in an improved BLEU score. However, how can we automatically find that weight? The conventional approach is to set it using tuning data of the same sort. However, our sources, articles, and their corresponding in-domain LMs are small enough that this is difficult. We could reserve some of the data as development data for tuning this parameter. Instead, we used a proxy approach that uses unrelated (and abundant) data set up in an analogous situation to learn the interpolation weights.

This analogous data was created as follows: we generated development data from the Spanish–English portion of the News Commentary Corpus, released as part of WMT 2013.3 We split this data into its constituent news stories, and selected the longest ones. We split each of these stories in half, treating the English side of the first half as context for the construction of a language model and using the second half to tune the interpolation weight against. This language model was then interpolated with the much larger Europarl language model, searching over interpolation weights with grid search. The best interpolation weight is then used to interpolate the same Europarl LM with the in-domain LM built over the English Wikipedia article associated with whatever Spanish source document was being translated. A single interpolation weight is thus learned for all of the articles.

3 statmt.org/wmt13/

The best weight found in these "proxy experiments" over the weights presented in Table 2 was also 0.7. For these experiments, the proxy situation allows us to discover the best interpolation weight, providing nice gains in BLEU score over the baseline.

4. Summary

The domain adaptation method used here is based on a simple idea: we perform optimization using models tailored to the development data, and then replace those models at test time with models adapted to the test domain. Here, the news commentary corpus language model used during tuning is replaced with the target-contextualized language model at test time. In this way, our methods can be used for documents for which no matching development data is available, and can easily scale to the translation of large document collections.

Along with these experiments, we release the associated data collected from Wikipedia and translated.4 This data enables research in this specific domain-adaptation scenario, and the results suggest that further research in this area could be productive. We hope this collection will help spur discussion on the task of contextualized machine translation, and we are willing to extend the resource based on community interest.

4 hltcoe.jhu.edu/publications/data-sets-and-resources/

Acknowledgments

The authors thank the Human Language Technology Center of Excellence for supporting this work via its summer workshop program.

5. References

Brousseau, J., Drouin, C., Foster, G. F., Isabelle, P., Kuhn, R., Normandin, Y., and Plamondon, P. (1995). French speech recognition in an automatic dictation system for translators: the TransTalk project. In Eurospeech.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
Dymetman, M., Brousseau, J., Foster, G., Isabelle, P., Normandin, Y., and Plamondon, P. (1994). Towards an automatic dictation system for translators: The TransTalk project. arXiv preprint cmp-lg/9409012.
Eidelman, V., Boyd-Graber, J., and Resnik, P. (2012). Topic models for dynamic translation model adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 115–119. Association for Computational Linguistics.
Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5.
Langlais, P. and Lapalme, G. (2002). TransType: Development-evaluation cycles to boost translator's productivity. Machine Translation, 17(2):77–98.
Langlais, P., Foster, G., and Lapalme, G. (2000). TransType: a computer-aided translation typing system. In Proceedings of the 2000 NAACL-ANLP Workshop on Embedded Machine Translation Systems - Volume 5, pages 46–51. Association for Computational Linguistics.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, Sapporo, Japan, July.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, July.
Post, M., Ganitkevitch, J., Orland, L., Weese, J., Cao, Y., and Callison-Burch, C. (2013). Joshua 5.0: Sparser, better, faster, server. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 206–212, Sofia, Bulgaria, August. Association for Computational Linguistics.

Zaidan, O. F. (2009). Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.
