A Wikipedia-based Corpus for Contextualized Machine Translation

Jennifer Drexler∗, Pushpendre Rastogi†, Jacqueline Aguilar‡, Benjamin Van Durme‡, Matt Post‡

∗Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology
†Center for Language and Speech Processing, Johns Hopkins University
‡Human Language Technology Center of Excellence, Johns Hopkins University
[email protected], {pushpendre,jacqui.aguilar,vandurme,post}@jhu.edu

Abstract

We describe a corpus for and experiments in target-contextualized machine translation (MT), in which we incorporate language models from target-language documents that are comparable in nature to the source documents. This corpus comprises (i) a set of curated English Wikipedia articles describing news events along with (ii) their comparable Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) English reference translations of all the Spanish data. In experiments, we evaluate the effect on translation quality when including language models built over these English documents and interpolated with other, separately-derived, more general language model sources. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score.

Keywords: Machine Translation, Domain Adaptation, Corpus

1. Introduction

We describe a corpus for target-contextualized machine translation (MT). Here, the task is to improve the translation of source documents using language models built over presumably related, comparable documents in the target language. As a motivating example, this could be useful in a situation where there is a collection of in-language documents related to a topic, but a specific document of interest is available only in another language. Our corpus comprises (i) a set of curated English Wikipedia articles describing news events, along with (ii) their Spanish counterparts, (iii) a number of the Spanish source articles cited within them, and (iv) human-produced reference translations of all the Spanish documents. In experiments, we translated these Spanish documents using a general translation system built on out-of-domain data. We then built multiple "in-domain" language models on the comparable English documents (one for each pair of documents) and interpolated them with the larger model, evaluating the effect on translation quality, effectively casting this task as one of (hyper) domain adaptation for MT. We find that even under this simplistic baseline approach, we achieve significant improvements as measured by BLEU score (Papineni et al., 2002).

Figure 1: The data configuration. We collected comparable English and Spanish documents along with sources cited within the Spanish articles. Reference translations were then collected for all Spanish documents; in the experiments, we use both the original (comparable) English articles (§3.3.) and the translated articles (§3.1.) to build a language model to aid translation of each set of corresponding source documents.
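As a reference point for the scores reported below, corpus-level BLEU (Papineni et al., 2002) can be sketched in a few lines. This is a simplified, single-reference illustration (the experiments themselves would have used a standard toolkit implementation); all function names here are our own:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment: the geometric mean
    of clipped n-gram precisions, scaled by the brevity penalty."""
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(hyp, n), ngrams(ref, n)
            total[n - 1] += sum(hc.values())
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

An identical hypothesis and reference score 100; any divergence in the n-gram overlap lowers the score toward 0.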

This work relates to previous efforts in domain adaptation for machine translation (Eidelman et al., 2012; Langlais et al., 2000; Langlais and Lapalme, 2002; Brousseau et al., 1995; Dymetman et al., 1994). Many of these previous problem formulations were concerned with recognizing the general domain, or topic, of the material to be translated. In comparison, we are concerned here with adapting the translation via access to information related to the exact events being described, presented in the form of a language model. We hope this might help lead to richer approaches to MT, where coreference and semantics might be measurably brought to bear.

2. Dataset

Our dataset consists of 9 English articles from Wikipedia on events of interest to the Spanish-speaking portion of the world. We manually extracted and sentence-segmented each article along with the Spanish Wikipedia article on the same topic.1 It is important to note that these are not parallel documents, but are instead at best "comparable corpora", existing in the form of independently-written articles on the same topic. Many such articles on Wikipedia, in fact, are written by authors writing independently in their native language on the same topic, but from their own perspective, in contrast to simply filling out the inter-language resources of the encyclopedia by translating documents from high- to low-resource languages. Our dataset leverages this fact, providing a relatively small corpus with high information content on the Spanish side, and comparable but impoverished versions on the target (English) side. This is pictured in Figure 1.

1 Documents were taken from Wikipedia on July 2, 2013.

In addition to the comparable document pairs, we reached into the Spanish documents and collected many of the sources cited within them. Many of these citations linked outside of Wikipedia, often to news websites. These sources were then also manually extracted and sentence-segmented. In total, we collected 695 source sentences; Table 1 contains a list of the Wikipedia articles along with sentence-level extraction statistics.

All Spanish data was then manually translated by a single bilingual speaker to produce the reference translations used for evaluation.

Topic                              Articles (Eng.)   Articles (Spa.)   Sources (Spa.)
Bolivia gas war                          236               114                21
Catalan autonomy protest                  31                45                87
Chavez death                             122                52                49
Chilean mining accident                  474               140               144
Comayagua prison fire                     53                11                50
H1N1                                     151                92                66
Papal election                           147                70                82
Pemex explosion                           29                82               117
Venezuelan presidential election         110                44                79
Total                                  1,343               650               695

Table 1: The topics and sentence counts of the Spanish Wikipedia articles used as the basis for data collection. For each topic, we have the comparable Wikipedia articles (both English and Spanish) and the sources cited within the Spanish articles.

3. Experiments

We used the Joshua Machine Translation Toolkit2 (Post et al., 2013). We include both baseline results from translating the Spanish source articles and results from a method of adapting the baseline machine translation system to the domain of a specific document. The baseline models were trained on the Spanish–English portion of version 7 of the Europarl Corpus (Koehn, 2005). From these data, we extracted Hiero grammars (Chiang, 2005) using the standard extraction settings (Chiang, 2007). We then built a 5-gram Kneser-Ney-smoothed language model from the English side of this data using KenLM (Heafield, 2011). The parameters of the decoder's linear model were tuned with Z-MERT (Och, 2003; Zaidan, 2009), which is included in the Joshua toolkit.

2 joshua-decoder.org

We describe three experiments. In the first, we use the reference translations of the articles as our set of target knowledge. Since they were produced from the Spanish versions, these reference translations can be thought of as parallel Wikipedia documents, in contrast to the comparable ones that we extracted. In the second, the English Wikipedia articles serve this role. We explore a range of interpolation weights in order to determine whether any setting will result in some improvement. The purpose of these two experiments is to determine whether in-domain language models built even on very small, targeted pieces of data might be useful in improving machine translation quality: first, in a parallel setting, and second, in a comparable setting. Finally, in a third set of experiments, we test whether these ideas can generalize when we have very small amounts of data, using a proxy setting to determine the interpolation weights.

3.1. Experiment 1: Can the contextual data help (parallel setting)?

The contextualized target-side language models are built on very small amounts of data, numbering at most a few hundred sentences (Table 1). This made them too unreliable to use as separate language models during decoding (with their own weight assigned in the decoder's linear model). Instead, our approach was to interpolate the large, general language model trained on millions of sentences (Europarl) with the domain-specific one.

For this first set of experiments, instead of using the (comparable) English Wikipedia documents to build the contextualized language model, we looked at what should be an easier setting: using the translations of the articles to build a language model when translating the sources, and vice versa. Table 2 presents the results for four settings of the interpolation weight λ, which determines how much weight is given to the large LM. With a weight of λ = 0.7, we see improvements in BLEU score from including the in-domain LM. To be clear, when translating the articles, the contextualized LM is built on the sources, and when translating the sources, the LM is built on the articles. Note that the setting λ = 1.0 is the baseline setting, with all the weight assigned to the large, out-of-domain language model.

λ          1.0     0.7     0.3     0.0
Articles   28.97   31.45   31.27   23.69
Sources    26.76   27.89   27.50   18.01

Table 2: BLEU scores (averaged across each article or group of source articles) from changing the interpolation weight λ, which denotes the weight assigned to the large, general-purpose language model. Articles denotes the average score from translating the Spanish articles, while Sources denotes the source documents cited within those articles.

3.2. Experiment 2: Can contextualized LMs help (comparable setting)?

For the second set of experiments, we use the comparable English Wikipedia articles to build the in-domain language model. Table 3 contains the results. Note that the scores are all slightly lower than when using the parallel setting, which makes sense, since the comparable data are not certain to be as close. Importantly, however, the best setting remains unchanged.
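The interpolation scheme used throughout these experiments, together with the grid search over λ described in §3.3, can be sketched as follows. This is an illustrative toy in which add-one-smoothed unigram models stand in for the paper's 5-gram KenLM models, and λ is selected by held-out perplexity (one common criterion); all class and function names here are our own:

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram LM; a stand-in for the real n-gram models."""
    def __init__(self, sentences, vocab):
        tokens = [w for sent in sentences for w in sent]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.v = len(vocab)

    def prob(self, word):
        # Add-one smoothing keeps unseen words from zeroing out the product.
        return (self.counts[word] + 1) / (self.total + self.v)

def interpolated(general, domain, lam):
    """P(w) = lam * P_general(w) + (1 - lam) * P_domain(w), where lam is the
    weight on the large, general-purpose model, as in Tables 2 and 3."""
    return lambda w: lam * general.prob(w) + (1 - lam) * domain.prob(w)

def perplexity(prob, heldout):
    """Perplexity of a word-probability function on held-out sentences."""
    n = sum(len(sent) for sent in heldout)
    logp = sum(math.log(prob(w)) for sent in heldout for w in sent)
    return math.exp(-logp / n)

def tune_lambda(general, domain, heldout, grid=(1.0, 0.7, 0.3, 0.0)):
    """Grid search for the interpolation weight on held-out text, analogous to
    the proxy tuning of Experiment 3."""
    return min(grid, key=lambda lam: perplexity(interpolated(general, domain, lam), heldout))
```

In the actual experiments, the interpolated model is a feature of the decoder's linear model and the weights are compared by BLEU; the sketch above only illustrates the mechanics of mixing a large general LM with a small contextualized one and picking a mixture weight on held-out data.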

λ          1.0     0.7     0.3     0.0
Sources    26.76   27.32   26.30   16.18

Table 3: Averaged BLEU scores from varying λ, this time using an LM built on the comparable target-language corpora (English Wikipedia articles, instead of translated Spanish articles).

3.3. Experiment 3: Can they generalize?

The above experiments demonstrate that there are interpolation weights that result in an improved BLEU score. However, how can we automatically find that weight? The conventional approach is to set it using tuning data of the same sort. However, our sources, articles, and their corresponding in-domain LMs are small enough that this is difficult. We could reserve some of the data as development data for tuning this parameter. Instead, we used a proxy approach that uses unrelated (and abundant) data set up in an analogous situation to learn the interpolation weights.

This analogous data was created as follows: we generated development data from the Spanish–English portion of the News Commentary Corpus, released as part of WMT 2013.3 We split this data into its constituent news stories, and selected the longest ones. We split each of these stories in half, treating the English side of the first half as context for the construction of a language model and using the second half to tune the interpolation weight against. This language model was then interpolated with the much larger Europarl language model, searching over interpolation weights with grid search. The best interpolation weight is then used to interpolate the same Europarl LM with the in-domain LM built over the English Wikipedia article associated with whatever Spanish source document was being translated. A single interpolation weight is thus learned for all of the articles.

3 statmt.org/wmt13/

The best weight found in these "proxy experiments" over the weights presented in Table 2 was also 0.7. For these experiments, the proxy situation allows us to discover the best interpolation weight, providing nice gains in BLEU score over the baseline.

4. Summary

The domain adaptation method used here is based on a simple idea: we perform optimization using models tailored to the development data, and then replace those models at test time with models adapted to the test domain. Here, the news commentary corpus language model used during tuning is replaced with the target-contextualized language model at test time. In this way, our methods can be used for documents for which no matching development data is available, and can easily scale to the translation of large document collections.

Along with these experiments, we release the associated data collected from Wikipedia and translated.4 This data enables research in this specific domain-adaptation scenario, and the results suggest that further research in this area could be productive. We hope this collection will help spur discussion on the task of contextualized machine translation, and we are willing to extend the resource based on community interest.

4 hltcoe.jhu.edu/publications/data-sets-and-resources/

Acknowledgments

The authors thank the Human Language Technology Center of Excellence for supporting this work via its summer workshop program.

5. References

Brousseau, J., Drouin, C., Foster, G. F., Isabelle, P., Kuhn, R., Normandin, Y., and Plamondon, P. (1995). French speech recognition in an automatic dictation system for translators: the TransTalk project. In Eurospeech.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
Dymetman, M., Brousseau, J., Foster, G., Isabelle, P., Normandin, Y., and Plamondon, P. (1994). Towards an automatic dictation system for translators: The TransTalk project. arXiv preprint cmp-lg/9409012.
Eidelman, V., Boyd-Graber, J., and Resnik, P. (2012). Topic models for dynamic translation model adaptation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 115–119. Association for Computational Linguistics.
Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Proceedings of the Workshop on Statistical Machine Translation.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5.
Langlais, P. and Lapalme, G. (2002). TransType: Development-evaluation cycles to boost translator's productivity. Machine Translation, 17(2):77–98.
Langlais, P., Foster, G., and Lapalme, G. (2000). TransType: a computer-aided translation typing system. In Proceedings of the 2000 NAACL-ANLP Workshop on Embedded Machine Translation Systems - Volume 5, pages 46–51. Association for Computational Linguistics.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, Sapporo, Japan, July.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, July.
Post, M., Ganitkevitch, J., Orland, L., Weese, J., Cao, Y., and Callison-Burch, C. (2013). Joshua 5.0: Sparser, better, faster, server. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 206–212, Sofia, Bulgaria, August. Association for Computational Linguistics.

Zaidan, O. F. (2009). Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.
