Large SMT data-sets extracted from Wikipedia

Dan Tufiș 1, Radu Ion 1, Ștefan Dumitrescu 1, Dan Ștefănescu 2
1 Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, 050711, Bucharest
{tufis, radu, sdumitrescu}@racai.ro
2 The University of Memphis, Department of Computer Science, Memphis, TN, 38152, USA
[email protected]

Abstract
The article presents experiments on mining Wikipedia for extracting SMT-useful sentence pairs in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score, based on which several evaluations have been conducted to estimate the similarity thresholds that yield the most useful data for training SMT systems for the three language pairs. The experiments showed that for a similarity score higher than 0.7 all sentence pairs in the three language pairs were fully parallel. However, including in the training sets sentence pairs that are less parallel (that is, with a lower similarity score) brought significant improvements in translation quality (BLEU-based evaluations). The optimized SMT systems were evaluated on unseen test sets, also extracted from Wikipedia. As one of the main goals of our work was to help Wikipedia contributors translate (with as little post-editing as possible) new articles from major languages into less-resourced languages and vice versa, we call this type of translation experiment "in-genre" translation. As in the case of "in-domain" translation, our evaluations showed that using only "in-genre" training data for translating new texts of the same genre is better than mixing the training data with "out-of-genre" texts, even parallel ones.

Keywords: comparable corpora, in-genre translation, similarity-based text mining, Wikipedia

1. Introduction

SMT engines like Moses (http://www.statmt.org/moses/) produce better translations when presented with larger and larger parallel corpora. For a given test set, it is also known that Moses produces better translations when presented with in-domain training data (data sampled from the same domain as the test data, e.g. news, laws, medicine, etc.), but collecting parallel data from a given domain, in sufficiently large quantities to be of use for statistical translation, is not an easy task. To date, OPUS (http://opus.lingfil.uu.se/) (Tiedemann, 2012) is the largest online collection of parallel corpora, comprising juridical texts (EUROPARL and EUconst; JRC-Acquis and the DGT Translation Memories are other examples of large parallel juridical texts), medical texts (EMEA), technical texts (e.g. software KDE manuals, PHP manuals), movie subtitle corpora (e.g. OpenSubs) and news (SETIMES), but these corpora are not available for all language pairs, nor are their sizes similar across domains. To alleviate this data scarcity, significant research and development efforts have been invested in harvesting parallel texts from the web (Resnik & Smith, 2003), which is regarded as a huge multilingual comparable corpus with presumably large quantities of parallel data.

Comparable corpora have been widely recognized as valuable resources for multilingual information extraction, but few large datasets have been publicly released and even fewer evaluated in the context of specific cross-lingual applications. One of the most tempting uses of comparable corpora mining is in statistical machine translation for under-resourced language pairs (e.g. Lithuanian-English) or limited/specialized domains (e.g. automotive).

Within the ACCURAT European project we developed LEXACC, a language-independent text miner for comparable corpora (Ștefănescu et al., 2012), able to extract cross-lingually similar fragments of text, similarity being judged by means of a linear weighted combination of multiple features. The text miner modules (sentence indexing, searching, filtering), the features and the learning of the optimal weights used by the similarity scoring function which endorses the extracted pairs are described at length in (Ștefănescu & Ion, 2013). Unlike most comparable corpora miners, which use binary classifiers, our miner associates with each extracted pair (s, t) a symmetric similarity score, sim_score(s, t) ∈ [0, 1]. Based on these scores, one can experiment with various subsets of an extracted data set (DS), selecting only the sentence pairs with similarity scores higher than or equal to a chosen threshold value. Starting with a minimal threshold value for the similarity score, th_1 = 0.1, the experimenter will get the full data set DS_1 which, certainly, contains many noisy pairs. Increasing the threshold th_i, one gets fewer sentence pairs, but more (cross-lingually) similar ones: DS_1 ⊃ ... ⊃ DS_i ⊃ ..., with 0.1 < ... < th_i < ... < 1.0 and DS_i = {(s, t) | sim_score(s, t) ≥ th_i}.

The parallel text miner was used to extract sentence pairs from large corpora collected from the web (Skadiņa et al., 2012) for a wide variety of language pairs: English-German, English-Greek, English-Estonian, English-Lithuanian, English-Latvian, English-Romanian and English-Slovene. Although the "optimal" similarity scores slightly differed from one language pair to another, they were comparable, lying in the interval [0.35, 0.5]. To evaluate the value of the mined parallel sentences, the extracted German-English sentences were used for domain adaptation of a baseline SMT system (trained on the Europarl+news corpus). This experiment, described in (Ștefănescu et al., 2012), showed significant quality improvements over the baseline (+6.63 BLEU points) when translating texts in the automotive domain.
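To make the threshold-based subset selection described above concrete, the sketch below filters a mined data set by sim_score. It is only an illustration: the tab-separated file layout and the file name are assumptions made for the example, not LEXACC's actual output format.

```python
# Minimal sketch of threshold-based subset selection over mined sentence pairs.
# Assumes (hypothetically) one "source<TAB>target<TAB>sim_score" triple per line.
from typing import List, Tuple

Pair = Tuple[str, str, float]

def load_pairs(path: str) -> List[Pair]:
    """Read (source, target, score) triples from a tab-separated file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip malformed lines
            src, tgt, score = fields
            pairs.append((src, tgt, float(score)))
    return pairs

def subset(pairs: List[Pair], threshold: float) -> List[Pair]:
    """DS_i = {(s, t) | sim_score(s, t) >= th_i}; raising the threshold yields
    smaller but more parallel subsets (DS_0.1 ⊇ DS_0.2 ⊇ ... ⊇ DS_0.9)."""
    return [p for p in pairs if p[2] >= threshold]

if __name__ == "__main__":
    pairs = load_pairs("wiki.en-ro.lexacc.tsv")  # hypothetical file name
    for th in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"th={th:.1f}: {len(subset(pairs, th))} sentence pairs")
```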

The evaluation of LEXACC in the ACCURAT project encouraged us to explore its efficacy on a well-known multilingual, strongly comparable corpus, namely Wikipedia (http://www.wikipedia.org), and to investigate the feasibility of what we called "in-genre" statistical machine translation: translating documents of the same genre (not necessarily the same domain) based on training data of the respective genre. Wikipedia is a strongly comparable multilingual corpus and it is large enough to ensure reasonable volumes of training data for "in-genre" translation. One useful property of Wikipedia is that, in spite of covering many domains, articles are characterized by the same writing style and formatting conventions (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style, http://en.wikipedia.org/wiki/Wikipedia:TONE). The authors are given precise guidelines content-wise, on language use, on formatting, and several other editorial instructions to ensure genre unity across various topics, as in traditional encyclopedias. Therefore, we refer to the translation of a Wikipedia-like article, irrespective of its topic, based on training data extracted from Wikipedia, as "in-genre" (encyclopedic) translation. The parallel data mining processing flow (Ștefănescu & Ion, 2012) was recently improved by what we called "boosted mining" (Tufiș et al., 2013a,b), and a larger and quite different experiment was conducted with the aim of evaluating the usability of texts mined from comparable corpora for the statistical translation of texts similar in genre to the extraction corpora.

2. Related work

The interlingual linking in Wikipedia makes it possible to extract documents in different languages such that the linked documents are versions of each other. If the linked documents were always faithful translations of some hub (original) documents, Wikipedia would be a huge multilingual parallel corpus of encyclopedic texts. Yet, this is not the case and, frequently, articles originally written in one language (English, most of the time) are adapted translations (usually shortened) in other languages. It also happens that there are articles with no links to other languages, or with links to only one or two (major) languages; for instance, for the Romanian-Slovak language pair we identified only 349 linked pairs, in spite of both languages being in the "100,000+" category, with more than 400,000 articles together. Thus, Wikipedia is arguably the largest strongly comparable corpus available online and the best data collection for "in-genre" translation experiments. The major motivation of the current endeavour is the validation of the idea that it is possible to develop systems capable of supporting the translation (with minimal post-editing) of Wikipedia articles from less-resourced languages into major languages and vice versa. Currently, Wikipedia does not offer an integrated translation engine to assist the translation task, but this could be an option worthy of consideration.

It is, therefore, not surprising that several researchers were tempted to explore parallel sentence mining on Wikipedia. Among the first, Adafre and de Rijke (2006) approached Wikipedia mining for parallel data using an MT system (Babelfish) to translate from English to Dutch and then, by word overlapping, measured the similarity between the translated sentences and the original sentences. They also experimented with an automatically induced (phrase) translation lexicon built from the titles of the linked articles, measuring the similarity of source (English) and target (Dutch) sentences. Experiments were performed on 30 randomly selected English-Dutch document pairs, yielding a few hundred parallel sentence pairs.

A few years later, Mohammadi and GhasemAghaee (2010) developed Adafre and de Rijke's method for another language pair (Persian-English) by imposing additional conditions on the extracted sentence pairs: the lengths of the parallel sentence candidates had to correlate and the Jaccard similarity of the lexicon entries mapped to the source (Persian) and target (English) sentences had to be as high as possible. Their experiments did not generate a parallel corpus, but only a couple of hundred parallel sentences intended as a proof of concept.

Gamallo and Lopez (2010) collected the linked documents in Wikipedia (CorpusPedia), building a large comparable corpus (English-Spanish, English-Portuguese and Spanish-Portuguese), with similar documents classified according to their main topics. However, they did not measure the similarity among the aligned documents, nor did they extract the "parallel" sentences.

Another experiment, due to Smith et al. (2010), addressed large-scale parallel sentence mining from Wikipedia. Based on binary Maximum Entropy classifiers (Munteanu & Marcu, 2005), they automatically extracted large volumes of parallel sentences for English-Spanish (almost 2M pairs), English-German (almost 1.7M pairs) and English-Bulgarian (more than 145K pairs). According to Munteanu and Marcu (2005), a binary classifier can be trained to differentiate between parallel and non-parallel sentences using various features such as word alignment log probability, number of aligned/unaligned words, longest sequence of aligned words, etc. To enrich the feature set, Smith et al. (2010) proposed to automatically extract a bilingual dictionary from the Wikipedia document pairs and use this dictionary to supplement the word alignment lexicon derived from existing parallel corpora. Furthermore, they released their English-Spanish and English-German test sets (but not the training data), and so a direct comparison was made possible (Section 4).

3. Mining Wikipedia

Among the 287 language editions of Wikipedia (http://meta.wikimedia.org/wiki/List_of_Wikipedias), created under the auspices of the Wikimedia Foundation, the English, German and Spanish ones are listed in the best populated category, with more than 1,000,000 articles: English is the largest collection with 4,468,164 articles, German is the third largest with 1,765,430 articles (it was the second when we downloaded it), while Spanish is the 8th with 1,086,871 articles (it was the sixth before). The Romanian Wikipedia is in the medium populated category and, with 241,628 articles, is the 28th largest collection (it was the 25th at the time of dump downloading). These statistics were consulted on March 9th, 2014, and naturally vary over time; all our experiments are based on the dump of December 22nd, 2012.

For our experiments we selected three very large Wikipedias (English, German and Spanish) and a medium-sized Wikipedia (Romanian), intended for SMT experiments on three language pairs: English-German, English-Spanish and English-Romanian.

With the monolingual Wikipedias selected for parallel sentence mining, we downloaded the "database backup dumps" (http://dumps.wikimedia.org/backupindex.html, dump of December 22nd, 2012), for which Wikipedia states that they contain "a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML". Parsing the English XML dump, we kept only the proper encyclopedic articles which contain links to their corresponding articles in Spanish, German or Romanian. Thus, we removed articles that were talks (e.g. Talk:Atlas Shrugged), logs (e.g. Wikipedia:Deletion log), user related articles (e.g. User:AnonymousCoward), membership related articles (e.g. Wikipedia:Building Wikipedia membership), manual and rule related articles, etc.

For each language, the retained articles were processed, using regular expressions, to remove the XML markup in order to keep only the raw, UTF-8 encoded text, which was saved into a separate file. Non-textual entries like images or tables were stripped off. Each text document was then sentence-split using an in-house sentence splitter based on a Maximum Entropy classifier.
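A much simplified illustration of this filtering and clean-up step is sketched below; the namespace prefixes and the regular expressions are assumptions made for the example, not the exact rules of our pipeline (which also relied on an in-house Maximum Entropy sentence splitter).

```python
import re

# Illustrative namespace filter: keep only encyclopedic articles, dropping
# talks, logs, user pages, etc. (prefix list is an assumption).
NON_ARTICLE_PREFIXES = ("Talk:", "Wikipedia:", "User:", "File:",
                        "Template:", "Category:", "Help:", "Portal:")

def is_proper_article(title: str) -> bool:
    return not title.startswith(NON_ARTICLE_PREFIXES)

def strip_wiki_markup(text: str) -> str:
    """Very rough wikitext cleaning: drop templates, tables, images and tags,
    keep the human-visible text of internal links."""
    text = re.sub(r"\[\[(?:File|Image):[^\]]*\]\]", " ", text)        # images
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                        # {{templates}}
    text = re.sub(r"\{\|.*?\|\}", " ", text, flags=re.S)               # {| tables |}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)      # [[link|label]]
    text = re.sub(r"<[^>]+>", " ", text)                                # HTML tags
    text = re.sub(r"'{2,}", "", text)                                   # bold/italics
    return re.sub(r"\s+", " ", text).strip()

print(is_proper_article("Talk:Atlas Shrugged"))            # False
print(strip_wiki_markup("The '''[[Danube|Danube river]]''' flows."))  # The Danube river flows.
```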
Language pair | Document pairs | Size on disk | Size ratio (L1/L2)
English-German | 715,555 | 2.8 Gb (EN), 2.3 Gb (DE) | 1.22
English-Romanian | 122,532 | 0.78 Gb (EN), 0.2 Gb (RO) | 3.91
English-Spanish | 573,771 | 2.5 Gb (EN), 1.5 Gb (ES) | 1.66

Table 1: Linked documents for three language pairs

Table 1 lists the number of sentence-split Wikipedia comparable document pairs (identified by following the interlingual links) for each language pair considered. Looking at the size ratio of the linked documents for each language pair, it is apparent that the Romanian documents are much shorter than the linked English ones. The size ratios for the other language pairs are more balanced, coming closer to the expected language-specific ratio for a parallel corpus.

For extracting SMT-useful sentence pairs (we deliberately avoid the term "parallel sentences" because, as will be shown, our experiments consider not only parallel sentences), we applied a two-step procedure as follows:
a) We extracted the initial translation lexicons for the English-Romanian and English-German language pairs from the JRC-Acquis parallel corpora. For the English-Spanish pair we used the EUROPARL parallel corpus (as Smith et al. (2010) did). We ran GIZA++ (Gao & Vogel, 2008) and symmetrized the extracted translation lexicons between the source and target languages. The Romanian-English lexicon extracted with GIZA++ was merged with an in-house dictionary generated from our wordnet (Tufiș et al., 2013c), aligned to the Princeton WordNet. With these lexicons we performed the first phase of LEXACC extraction of comparable sentence pairs from the respective Wikipedias. Let us call this dataset, for a language pair L1-L2, WikiBase(L1, L2). The SMT experiments with WikiBase for the three language pairs are described in Section 3.2; the subsets of WikiBase(L1, L2) which maximized the BLEU scores were considered the most useful part for the second phase (boosting) of the experiments.
b) GIZA++ was run again and its results symmetrized on the most SMT-useful sentence pairs of WikiBase(L1, L2), as resulting from the first step. The new translation lexicons were merged with the initial ones and used for a second phase of LEXACC extraction, thus obtaining a new and larger data set which we refer to as WikiTrain(L1, L2). The most useful parts of WikiTrain were identified based on their impact on the BLEU score for the test set, as described in Section 3.3, and used for the training of the WikiTranslators.

3.1 Phase 1: Building WikiBase (L1, L2)

Table 2 lists, for different similarity scores used as extraction thresholds, the number of SMT-useful sentence pairs found in each language pair dataset, as well as the number of words (ignoring punctuation) per language in the respective sets of sentence pairs. As mentioned before, data extracted with a given similarity score threshold is a proper subset of any data extracted with a lower similarity score threshold.

Sim. score | EN-RO | EN-DE | EN-ES
0.9 | 42,201 pairs; 0.81M EN / 0.83M RO words | 38,390 pairs; 0.55M EN / 0.54M DE words | 91,630 pairs; 1.13M EN / 1.16M ES words
0.8 | 112,341 pairs; 2.36M EN / 2.4M RO words | 119,480 pairs; 2.08M EN / 2.01M DE words | 576,179 pairs; 10.5M EN / 11.29M ES words
0.7 | 142,512 pairs; 2.99M EN / 3.04M RO words | 190,135 pairs; 3.49M EN / 3.37M DE words | 1,219,866 pairs; 23.73M EN / 25.93M ES words
0.6 | 169,662 pairs; 3.58M EN / 3.63M RO words | 255,128 pairs; 4.89M EN / 4.7M DE words | 1,579,692 pairs; 31.02M EN / 33.71M ES words
0.5 | 201,263 pairs; 4.26M EN / 4.33M RO words | 322,011 pairs; 6.45M EN / 6.19M DE words | 1,838,794 pairs; 36.51M EN / 39.55M ES words
0.4 | 252,203 pairs; 5.42M EN / 5.48M RO words | 412,608 pairs; 8.47M EN / 8.13M DE words | 2,102,025 pairs; 42.32M EN / 45.57M ES words
0.3 | 317,238 pairs; 6.89M EN / 6.96M RO words | 559,235 pairs; 11.8M EN / 11.4M DE words | 2,656,915 pairs; 54.93M EN / 58.52M ES words

Table 2: WikiBase: number of parallel sentences and words for each language pair, for a given threshold

Depending on the similarity threshold, the extracted pairs of sentences may be really parallel, may contain truly parallel fragments, may be similar in meaning but with different wording, or may be lexically unrelated in spite of domain similarity. That is, the lower the threshold, the higher the noise.

By random manual inspection of the generated sentence pairs, we saw that, in general, irrespective of the language pair, sentence pairs with a translation similarity measure equal to or higher than 0.7 are parallel. Pairs with a translation similarity measure between 0.3 and 0.6 have extended parallel fragments which an accurate word or phrase aligner easily detects. Further down the threshold scale, below 0.3, we usually find sentences that roughly speak of the same event but are not actual translations of each other. The noisiest data sets were extracted for the 0.1 and 0.2 similarity thresholds and we dropped them from the SMT experiments.

3.2 SMT experiments with WikiBase

In order to select the most MT-useful parts of WikiBase for the three considered language pairs, we built three baseline Moses-based SMT systems using only parallel sentences, that is, those pairs extracted with a similarity score higher than or equal to 0.7 (see Table 2). We incrementally extended the training data by lowering the similarity score threshold and, using the same test set, observed the variation of the BLEU score. The purpose of the evaluation of these SMT systems was only to indicate the best threshold for selecting the training set from WikiTrain for the subsequent building of the WikiTranslators. As the standard SMT system we chose Moses surface-to-surface translation, the wbe-msd-bidirectional-fe lexical reordering model, a maximum phrase length of 4 words, and the default values for the rest of the parameters.

The target language model (LM) for all experiments was trained on the entire monolingual, sentence-split English Wikipedia after removing the administrative articles as described in Section 3. The language model was limited to 5-grams and the counts were smoothed by the interpolated Kneser-Ney method.

Since we experimentally noticed that the additional sentence pairs extracted for a threshold of 0.6 were almost as parallel as those extracted for higher thresholds, we included this interval too in the sampling process for the test set design. Thus, we proceeded to randomly sample 2,500 sentence pairs from similarity intervals ensuring parallelism ([0.6, 0.7), [0.7, 0.8), [0.8, 0.9) and [0.9, 1]). We obtained 10,000 parallel sentence pairs for each language pair. Additionally, we extracted 1,000 parallel sentence pairs as a development set (devset). These 11,000 sentences were removed from the WikiBase(L1, L2) sets that were meant as training corpora for each language pair. When sampling parallel sentence pairs (test and devsets), we were careful to observe Moses' filtering constraints: both the source and target sentences must have at least 4 words and at most 60 words, and the ratio of the longer sentence of the pair (in tokens) over the shorter one must not exceed 2. Duplicates were also removed (a sketch of this filtering is given below).
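The following minimal sketch mirrors the length, ratio and duplicate constraints just described; it is a plain-Python stand-in written for illustration, not Moses' own cleaning script.

```python
def keep_pair(src: str, tgt: str, min_len: int = 4, max_len: int = 60,
              max_ratio: float = 2.0) -> bool:
    """Length and ratio constraints described above, applied to one sentence pair."""
    s, t = src.split(), tgt.split()
    if not (min_len <= len(s) <= max_len and min_len <= len(t) <= max_len):
        return False
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

def clean_pairs(pairs):
    """Filter (source, target) pairs and drop exact duplicates (first occurrence kept)."""
    seen, kept = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen or not keep_pair(src, tgt):
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

# Example: the second pair is too short, the third duplicates the first.
sample = [("the cat sits on the mat", "pisica stă pe covor și doarme"),
          ("hello", "salut"),
          ("the cat sits on the mat", "pisica stă pe covor și doarme")]
print(clean_pairs(sample))   # keeps only the first pair
```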
Further on, we trained seven translation models (TM) for each language pair, over cumulative threshold intervals beginning with 0.3: TM1 for [0.3, 1], TM2 for [0.4, 1], ..., TM7 for [0.9, 1]. The resulting seven training corpora were filtered with Moses' cleaning script with the same restrictions mentioned above. For every language, both the training corpora and the test set were tokenized using Moses' tokenizer script and truecased. The quality of the translation systems was measured as usual in terms of their BLEU score (Papineni et al., 2002) on the same test data.

We have to emphasize that the removal of the test and development set sentences from the training corpora does not ensure an unbiased evaluation of the BLEU scores, since their context still remained in the training corpora. This requires some more explanation. For each extracted sentence pair, LEXACC stores in a bookkeeping file the ID of the document pair out of which the extraction was done. This information could be used to identify and eliminate from the training set all the pairs coming from the same documents from which the development and evaluation sets were selected. Yet, due to the nature of Wikipedia article authoring, even this strategy of filtering the development and evaluation data would not ensure an unbiased evaluation. Wikipedia contributors are given specific instructions for authoring documents (see http://en.wikipedia.org/wiki/Wikipedia:Translation) and, because of compliance with these instructions, one inevitably finds in different documents almost identical sentences except for a few named entities. Indeed, we found examples of sentence pairs in the trainset that are similar, but not identical, to sentences in the test set, yet come from different document pairs. Certainly, one could build a tougher test set by removing all similar (pattern-based) sentences from the trainset, but we did not do that because it would have been beyond the purpose of this work. As we mentioned before, this evaluation was meant only for estimating the most useful extraction level for the second phase of training the WikiTranslators.

TM based on WikiBase | BLEU RO>EN | BLEU DE>EN | BLEU ES>EN
TM [0.3, 1] | 37.24 | 39.16 | 47.59
TM [0.4, 1] | 37.71 | 39.46 | 47.52
TM [0.5, 1] | 37.99 | 39.52 | 47.53
TM [0.6, 1] | 37.85 | 39.50 | 47.44
TM [0.7, 1] | 37.39 | 39.24 | 47.28
TM [0.8, 1] | 36.89 | 38.57 | 46.27
TM [0.9, 1] | 32.76 | 34.73 | 39.68

Table 3: Comparison among SMT systems trained on various parts of WikiBase

Table 3 summarizes the results of this first-step experiment, with the best BLEU scores identifying the most MT-useful parts of WikiBase(L1, L2). We considered TM [0.7, 1] as the baseline for all language pairs.

3.3 Phase 2: Building WikiTrain (L1, L2)

The experiments on WikiBase revealed that the most useful training data was extracted by using LEXACC with a 0.5 similarity score for the German-English and Romanian-English language pairs and 0.3 for the Spanish-English pair (see Table 3). We reran GIZA++ on these subsets of WikiBase to extract new lexicons. The new lexicons were merged with the initial ones and the LEXACC extraction was repeated; the resulting mined comparable sentence pairs are denoted as WikiTrain.
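The overall two-phase procedure (Sections 3 and 3.3) can be summarized as follows. This is only a structural sketch: the callables stand in for the external tools (GIZA++ with symmetrization, LEXACC, Moses training plus BLEU scoring) and do not correspond to real APIs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str, float]   # (source sentence, target sentence, sim_score)

def boosted_mining(
    seed_corpus: List[Tuple[str, str]],      # existing parallel data (e.g. JRC-Acquis)
    wiki_docs: object,                        # linked comparable document pairs
    thresholds: Sequence[float],              # candidate extraction thresholds
    learn_lexicon: Callable,                  # placeholder for GIZA++ + symmetrization
    mine: Callable,                           # placeholder for LEXACC
    bleu_of: Callable[[List[Pair]], float],   # placeholder for Moses training + BLEU
    merge_lexicons: Callable,
) -> List[Pair]:
    """Control flow of the two-phase ("boosted") extraction."""
    def subset(pairs: List[Pair], th: float) -> List[Pair]:
        # cumulative interval [th, 1], as used for the TM[th, 1] models
        return [p for p in pairs if p[2] >= th]

    # Phase 1: seed lexicon from existing parallel data, first LEXACC pass -> WikiBase.
    lexicon = learn_lexicon(seed_corpus)
    wikibase = mine(wiki_docs, lexicon)

    # Select the threshold whose subset maximizes BLEU on the held-out test set.
    best_th = max(thresholds, key=lambda th: bleu_of(subset(wikibase, th)))

    # Phase 2: re-learn a lexicon from the most useful pairs, merge it with the
    # seed lexicon and mine again -> WikiTrain.
    best_pairs = [(s, t) for s, t, _ in subset(wikibase, best_th)]
    boosted = merge_lexicons(lexicon, learn_lexicon(best_pairs))
    return mine(wiki_docs, boosted)
```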


Sim. score | EN-RO | EN-DE | EN-ES
0.9 | 66,777 pairs; 1.08M EN / 1.09M RO words | 97,930 pairs; 1.07M EN / 1.04M DE words | 113,946 pairs; 1.16M EN / 1.19M ES words
0.8 | 152,015 pairs; 2.69M EN / 2.7M RO words | 272,358 pairs; 3.7M EN / 3.6M DE words | 597,992 pairs; 9.73M EN / 10.51M ES words
0.7 | 189,875 pairs; 3.36M EN / 3.37M RO words | 434,019 pairs; 6.2M EN / 5.93M DE words | 1,122,379 pairs; 19.94M EN / 21.82M ES words
0.6 | 221,661 pairs; 3.96M EN / 3.97M RO words | 611,868 pairs; 8.94M EN / 8.53M DE words | 1,393,444 pairs; 25.07M EN / 27.41M ES words
0.5 | 260,287 pairs; 4.72M EN / 4.72M RO words | 814,041 pairs; 12.36M EN / 11.79M DE words | 1,587,276 pairs; 28.99M EN / 31.57M ES words
0.4 | 335,615 pairs; 6.33M EN / 6.32M RO words | 1,136,734 pairs; 18.09M EN / 17.31M DE words | 1,807,892 pairs; 33.62M EN / 36.37M ES words
0.3 | 444,102 pairs; 8.71M EN / 8.70M RO words | 1,848,651 pairs; 31.41M EN / 30.18M DE words | 2,288,163 pairs; 44.02M EN / 47.18M ES words

Table 4: WikiTrain: number of similar sentences and words for each language pair, for a given threshold

Table 4 shows the results of the boosted extraction process. As one can see, the amount of extracted data, at each similarity score level, is significantly increased for the English-Romanian and English-German language pairs. For English-Spanish, except for the similarity scores 0.8 and 0.9, the number of sentence pairs is smaller than in WikiBase. The reason is that in this round we detected (and eliminated) several pairs identical to those in the training and development sets, as well as several duplicated pairs in the training set. Even so, the English-Spanish WikiTrain was the largest trainset and contains the highest percentage of fully parallel sentence pairs.

3.4 SMT experiments with WikiTrain

The WikiTrain corpora were used with the same experimental setup as described in Section 3.2. The training of each translation system was followed by the evaluation on the respective test sets (10,000 pairs) in both translation directions. The results are presented in Table 5 (for a fair comparison with the data in Table 3, we did not use MERT optimization here).

Direction | Best TM (based on WikiTrain) | BLEU
RO>EN | TM [0.5, 1] | 41.09
DE>EN | TM [0.5, 1] | 40.82
ES>EN | TM [0.3, 1] | 46.28
EN>RO | TM [0.5, 1] | 29.61
EN>DE | TM [0.5, 1] | 35.18
EN>ES | TM [0.3, 1] | 46.00

Table 5: Best translation SMT systems, trained on WikiTrain

Having much more training data, in the case of Romanian>English and German>English the BLEU scores significantly increased (by 3.1 and 2.58 points respectively). For Spanish>English, the decrease in the number of sentences in WikiTrain as compared to WikiBase negatively impacted the new BLEU score, which is 1.31 points lower, suggesting that removing duplicates was not the best filtering option. As expected, the translations into the non-English languages are less accurate due to the more complex morphology of the target language (most of the errors are morphological ones), but the BLEU scores are still very high, better than most of the results we are aware of (for in-genre or in-domain experiments).
4. Comparison with other works

Translation for the Romanian-English language pair was also studied in (Boroș et al., 2013; Dumitrescu et al., 2012; 2013), among others. In those experiments we had an explicit interest in using in-domain vs. out-of-domain test/train data, and in various configurations of the Moses decoder for surface-to-surface and factored translation. Out of the seven domain-specific corpora of (Boroș et al., 2013), one was based on Wikipedia. The translation experiments on Romanian>English, similar to those reported here, were surface-based, with training on parallel sentence pairs extracted from Wikipedia by LEXACC at a fixed threshold of 0.5 (called "WIKI5"), without MERT optimization. A random selection of 1,000 unseen Wikipedia Romanian test sentences (the test set construction followed the same methodology described in this article) was translated into English using combinations of:
• a WIKI5-based translation model (240K sentence pairs) / WIKI5-based language model;
• a global translation model (1.7M sentence pairs) / global language model named "ALL", trained on the concatenation of all seven specific corpora.
Table 6 gives the BLEU scores for the Moses configuration similar to ours.

 | WIKI5 TM | ALL TM
WIKI5 LM | 29.99 | 29.95
ALL LM | 29.51 | 29.95

Table 6: BLEU scores (RO>EN) on the 1,000-sentence Wikipedia test set of Boroș et al. (2013)

The results of Boroș et al. (2013) confirm the conclusion we claimed earlier: the ALL system does not perform better than the in-genre WIKI5 system. The large difference between the BLEU score reported herein (41.09) and the 29.99 in (Boroș et al., 2013) may be explained by various factors. First and most importantly, our current language model was entirely in-genre for the test data and much larger: it was built from the entire Romanian Wikipedia (more than 500,000 sentences), while the language model in (Boroș et al., 2013) was built only from the Romanian sentences paired to English sentences (less than 240,000 sentences). Our translation model was built from more than 260,000 sentence pairs versus the 234,879 sentence pairs of WIKI5.

Another explanation might be the use of different Moses filtering parameters (e.g. the length filtering parameters) and of different test sets. As suggested by other researchers, Wikipedia-like documents are more difficult to translate than, for instance, legal texts. Boroș et al. (2013) report BLEU scores on JRC-Acquis test sets (with domain-specific training) almost double those obtained on Wikipedia test sets.

The most similar experiments to ours were reported by Smith et al. (2010). They mined for parallel sentences from Wikipedia, producing parallel corpora of sizes even larger than ours. While they used all the extracted sentence pairs for training, we used only those subsets that observed a minimal similarity score. We checked whether their test sets for English-Spanish (500 pairs) and for English-German (314 pairs) contained sentences present in our training sets and, as this was the case, we eliminated from the training data several sentence pairs (about 200 sentence pairs from the English-Spanish training corpus and about 140 sentence pairs from the English-German training corpus). We retrained the two systems on the slightly modified training corpora. Since they used MERT-optimized translation systems in their experiments, we also optimized, by MERT, our new TM [0.5, 1] for German>English and our new TM [0.4, 1] for Spanish>English translation systems, using the respective devsets (each containing 1,000 sentence pairs, as described in Section 3.2). Although in our earlier experiments on Spanish>English the model TM [0.3, 1] was the best performing on our test data, on the Microsoft test set the model that achieved the best BLEU score was TM [0.4, 1], exceeding the performance of the TM [0.3, 1] model by 0.32 BLEU points.

Their test sets for English-Spanish and for English-German were translated (after being truecased) with our best translation models and also with Google Translate (as of mid-February 2013). Table 7 summarizes the results. In this table, "Large+Wiki" denotes the best translation model of Smith et al. (2010), which was trained on many corpora (including Europarl and JRC-Acquis) and on more than 1.5M parallel sentences mined from Wikipedia. "TM [0.4, 1]" and "TM [0.5, 1]" are our WikiTrain translation models, as already explained. "Train data size" gives the size of the training corpora in multiples of 1,000 sentence pairs.

Language pair | Train data size (sentence pairs) | System | BLEU
Spanish-English | 9,642K | Large+Wiki | 43.30
Spanish-English | 2,288K | TM [0.4, 1] | 50.19
Spanish-English | - | Google | 44.43
German-English | 8,388K | Large+Wiki | 23.30
German-English | 814K | TM [0.5, 1] | 24.64
German-English | - | Google | 21.64

Table 7: Comparison between SMT systems on the Wikipedia test sets provided by Smith et al. (2010)
For the Spanish-English test set of Smith et al. (2010) our result is significantly better (by 6.89 BLEU points) than theirs, in spite of using almost 4 times less training data. For the German-English pair, the BLEU score difference between the TM [0.5, 1] and Large+Wiki systems is smaller, but still statistically significant (1.34 points), and one should also notice that our system used 10 times less training data. Surprisingly, our TM [0.5, 1] for German-English performed much worse on the new test set than on our own test set (24.64 versus 40.82 BLEU points; note that the latter value was obtained on a very different and much larger test set and also without MERT optimization, yet the difference is large enough to raise suspicions about the test set used for this comparison), which was not the case for the Spanish-English language pair. We suspected that some German-English translation pairs in the Smith et al. (2010) test set were not entirely parallel. This idea was supported by the correlation of the evaluation results between our translations and Google's for Spanish-English and German-English. Also, their reported results on German-English were almost half of the ones they obtained for Spanish-English.

Therefore, we checked the German-English and Spanish-English test sets (supposed to be parallel) by running the LEXACC miner to see the similarity scores for the paired sentences. The results confirmed our guess. The first insight was that the test sets contained many pieces of text that looked like section titles. Such short sentences were ignored by LEXACC. Out of the considered sentence pairs (ignoring the sentences with fewer than 3 words), for Spanish-English LEXACC identified more than 92% as potentially useful SMT pairs (with a similarity score higher than or equal to 0.3, the extraction threshold for Spanish-English sentence pairs), while for German-English LEXACC identified only 35% as potentially useful SMT pairs (with a similarity score higher than or equal to 0.5, the extraction threshold for German-English sentence pairs). Even when the threshold for German-English was lowered to 0.3 and titles were included, only 45% passed the LEXACC filtering.

As for the parallelism status of the sentence pairs in the test sets (i.e. similarity scores higher than 0.6 for both languages), the percentages were 78% for Spanish-English and only 29% for German-English. Without ignoring the short sentences (easy to translate), these percentages were a little higher (80.8% for Spanish-English and 32.82% for German-English). These evaluations also show that LEXACC is too conservative in its rankings: we noticed almost parallel sentences in the Spanish-English test set even for a similarity score of 0.1, while for German-English the same happens for similarity scores lower than 0.3. For example, the following Spanish-English pair received a similarity score of 0.1: "Sin embargo, el museo, llamado no fue terminado sino hasta el 10 de abril de 1981, dos días antes del vigésimo aniversario del vuelo de Yuri Gagarin." > "However, it took until April 10, 1981 (two days before the 20th anniversary of Yuri Gagarin's flight) to complete the preparatory work and open the Memorial Museum of Cosmonautics."; and the following German-English pair received 0.29: "Die 64,5 Prozent, welche die SPD unter seiner Führung erzielte, waren das höchste Ergebnis, welches je eine Partei auf Bundeslandsebene bei einer freien Wahl in Deutschland erzielt hatte." > "In the election that was conducted in the western part of Berlin two months later, his popularity gave the SPD the highest win with 64.5% ever achieved by any party in a free election in Germany." The most plausible explanation was that one of LEXACC's parameters (the cross-linking factor) strongly discourages long-distance reordering, which was quite frequent in the German-English test set and also has a few instances in the Spanish-English test set.

5. Conclusions

Wikipedia is a rich resource for parallel sentence mining in SMT. Comparing different translation models containing MT-useful data, ranging from strongly comparable to parallel, we concluded that there is sufficient empirical evidence not to dismiss sentence pairs that are not fully parallel on the suspicion that the inherent noise they contain might be detrimental to translation quality. On the contrary, our experiments demonstrated that in-genre comparable data are strongly preferable to out-of-genre parallel data. However, there is an optimum level of similarity between the comparable sentences which, according to our similarity metrics (for the language pairs we worked with), is around 0.4 or 0.5. Additionally, the two-step procedure we presented demonstrated that an initial in-genre translation dictionary is not necessary; it can be constructed subsequently, starting with a dictionary extracted from whatever parallel data is available.

We want to mention that it is not the case that our extracted Wikipedia data is the maximally MT-useful data. First of all, LEXACC may be improved in many ways, which is a matter for future developments. For instance, although the cross-linking feature is highly relevant for language pairs with similar word ordering, it is not very effective for language pairs showing long-distance reordering. We also noticed that a candidate pair which did not include the same numeric entities (e.g. numbers, dates and times) in both parts was dropped from further consideration. Thirdly, the extraction parameters of LEXACC were not re-estimated for the WikiTrain construction. Additionally, we have to mention that LEXACC evaluated and extracted only full sentences: a finer-grained (sub-sentential) extractor would be likely to generate more MT-useful data. Also, one should note that the evaluation figures are only indicative of the potential of Wikipedia as a source for SMT training. In previous work it was shown that using factored models for inflectional target languages (Boroș et al., 2013) and cascading translators (Tufiș & Dumitrescu, 2012) may significantly improve (by several BLEU points) the translation accuracy of an SMT system. Some other techniques may be used to improve at least the translations into English. For instance, given that English adjectives and all functional words are not inflected, a very effective approach for an inflectional source language would be to lemmatize all words in these categories. Another idea is to split compound words of a source language (such as German) into their constituents (a toy illustration is sketched at the end of this section). Both simplifications are computationally inexpensive (and for many languages appropriate tools are publicly available) and may significantly reduce the number of out-of-vocabulary input tokens.

The parallel Wiki corpora, including the test sets (containing 10,000 sentence pairs) and the dev sets (containing 1,000 sentence pairs), are freely available online (http://dev.racai.ro/dw). The archives contain all the extracted sentence pairs, beginning with the similarity threshold 0.1 (see Table 8), thus much more, but, to a large extent, noisy data. Yet, as said before, several useful sentence pairs for each language pair might still be recovered from it.

Sim. score | EN-RO | EN-DE | EN-ES
0.1 | 2,418,227 pairs; 63.99M EN / 63.96M RO words | 11,694,784 pairs; 295.22M EN / 292.56M DE words | 3,280,305 pairs; 84.30M EN / 87.37M ES words

Table 8: Full unfiltered datasets

The LEXACC text miner, along with other miners, is freely available on the ACCURAT project site (http://www.accurat-project.eu/), with the newest version on RACAI's META-SHARE clone (http://ws.racai.ro:9191/repository/search/?q=lexacc).
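As a toy illustration of the compound-splitting idea mentioned above, the following sketch splits a word into two parts that are both frequent in a monolingual vocabulary; the frequency dictionary, the minimum part length and the handling of the German linking "s" are illustrative assumptions, not a description of any existing tool.

```python
# Toy frequency-based compound splitter (illustrative assumptions throughout).
def split_compound(word: str, vocab: dict, min_part: int = 4):
    """Split `word` into two known parts if both are frequent enough;
    otherwise return the word unchanged."""
    best, best_score = [word], vocab.get(word.lower(), 0)
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i].lower(), word[i:].lower()
        # allow a linking "s" between the parts (Fugen-s), e.g. Arbeit+s+markt
        if left.endswith("s") and left[:-1] in vocab:
            left = left[:-1]
        if left in vocab and right in vocab:
            score = min(vocab[left], vocab[right])   # min-frequency heuristic
            if score > best_score:
                best, best_score = [word[:i], word[i:]], score
    return best

vocab = {"arbeit": 500, "markt": 800, "arbeitsmarkt": 40}
print(split_compound("Arbeitsmarkt", vocab))   # -> ['Arbeits', 'markt']
```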
References

Adafre, S.F. & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), April 3-7, 2006, Trento, Italy, pp. 62-69.
Boroș, T., Dumitrescu, Ș.D., Ion, R., Ștefănescu, D. & Tufiș, D. (2013). Romanian-English Statistical Translation at RACAI. In E. Mitocariu, M.A. Moruz, D. Cristea, D. Tufiș, M. Clim (eds.), Proceedings of the 9th International Conference "Linguistic Resources and Tools for Processing the Romanian Language", 16-17 May 2013, Miclăușeni, Romania. "Alexandru Ioan Cuza" University Publishing House, ISSN 1843-911X, pp. 81-98.
Dumitrescu, Ș.D., Ion, R., Ștefănescu, D., Boroș, T. & Tufiș, D. (2013). Experiments on Language and Translation Models Adaptation for Statistical Machine Translation. In Dan Tufiș, Vasile Rus, Corina Forăscu (eds.), Towards Multilingual Europe 2020: A Romanian Perspective, 2013, pp. 205-224.
Dumitrescu, Ș.D., Ion, R., Ștefănescu, D., Boroș, T. & Tufiș, D. (2012). Romanian to English Automatic MT Experiments at IWSLT12. In Proceedings of the International Workshop on Spoken Language Translation, December 6-7, 2012, Hong Kong, pp. 136-143.
Gao, Q. & Vogel, S. (2008). Parallel implementations of a word alignment tool. In Proceedings of ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, June 20, 2008, The Ohio State University, Columbus, Ohio, USA, pp. 49-57.
Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the tenth Machine Translation Summit, Phuket, Thailand, pp. 79-86.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. & Herbst, E. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 177-180.
Mohammadi, M. & GhasemAghaee, N. (2010). Building bilingual parallel corpora based on Wikipedia. In Second International Conference on Computer Engineering and Applications (ICCEA 2010), Vol. 2, IEEE Computer Society, Washington, DC, USA, pp. 264-268.
Munteanu, D. & Marcu, D. (2005). Improving machine translation performance by exploiting comparable corpora. Computational Linguistics, 31(4), pp. 477-504.
Otero, P.G. & Lopez, I.G. (2010). Wikipedia as Multilingual Source of Comparable Corpora. In Proceedings of BUCC, Malta, pp. 21-25.
Papineni, K., Roukos, S., Ward, T. & Zhu, W.J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002, Philadelphia, USA, pp. 311-318.
Resnik, P. & Smith, N. (2003). The Web as a Parallel Corpus. Computational Linguistics, 29(3), MIT Press, Cambridge, MA, USA, pp. 349-380.
Skadiņa, I., Aker, A., Glaros, N., Su, F., Tufiș, D., Verlic, M., Vasiljevs, A. & Babych, B. (2012). Collecting and Using Comparable Corpora for Statistical Machine Translation. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), May 23-26, 2012, Istanbul, Turkey, ISBN 978-2-9517408-7-7.
Smith, J.R., Quirk, C. & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 403-411.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiș, D. & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, 2006, ISBN 2-9517408-2-4, EAN 9782951740822.
Ștefănescu, D., Ion, R. & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy, May 28-30, 2012, pp. 137-144.
Ștefănescu, D. & Ion, R. (2013). Parallel-Wiki: A collection of parallel sentences extracted from Wikipedia. In Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics, March 24-30, 2013, Samos, Greece.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), May 23-26, 2012, Istanbul, Turkey, ISBN 978-2-9517408-7-7.
Tufiș, D., Ion, R., Dumitrescu, Ș.D. & Ștefănescu, D. (2013a). Wikipedia as an SMT Training Corpus. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2013), Hissar, Bulgaria, September 7-13, 2013.
Tufiș, D., Ion, R. & Dumitrescu, Ș.D. (2013b). Wiki-Translator: Multilingual Experiments for In-Domain Translations. Computer Science Journal of Moldova, vol. 21, no. 3(63), 2013, pp. 1-28.
Tufiș, D., Barbu Mititelu, V., Ștefănescu, D. & Ion, R. (2013c). The Romanian Wordnet in a Nutshell. Language Resources and Evaluation, Springer, Vol. 47, Issue 4, ISSN 1574-020X, DOI: 10.1007/s10579-013-9230-7, pp. 1305-1314.
