Arxiv:2005.06166V1 [Cs.CL] 13 May 2020

Parallel Corpus Filtering via Pre-trained Language Models Boliang Zhang, Ajay Nagesh, and Kevin Knight DiDi Labs fboliangzhang, ajaynagesh, [email protected] Abstract (2018) and Belinkov and Bisk(2018) show that neural translation models are far more sensitive to Web-crawled data provides a good source of noisy parallel training data than statistical machine parallel corpora for training machine translation models. It is automatically obtained, but translation. Data selection methods that can fil- extremely noisy, and recent work shows that ter noisy parallel sentences from large-scale web neural machine translation systems are more crawled resources are in demand. sensitive to noise than traditional statistical ma- In this paper, we study the problem in a real- chine translation methods. In this paper, we world scenario where we crawl a large Japanese- propose a novel approach to filter out noisy Chinese parallel corpus from various websites and sentence pairs from web-crawled corpora via build open-domain machine translation systems pre-trained language models. We measure sentence parallelism by leveraging the multilin- between Japanese and Chinese, by filtering the gual capability of BERT and use the Genera- web crawled parallel corpus. In addition, a small tive Pre-training (GPT) language model as a amount of clean parallel data is available, in the domain filter to balance data domains. We software domain. In order to confirm our results evaluate the proposed method on the WMT on a public data, we also apply our filter to the 2018 Parallel Corpus Filtering shared task, and WMT 2018 German-English Parallel Corpus Fil- on our own web-crawled Japanese-Chinese tering shared task. parallel corpus. Our method significantly out- Previous work on parallel corpus filtering per- performs baselines and achieves a new state- of-the-art. In an unsupervised setting, our forms poorly in our scenario as it either requires method achieves comparable performance to large clean parallel corpora or dictionaries (Xu the top-1 supervised method. We also evalu- and Koehn, 2017; Artetxe and Schwenk, 2019; ate on a web-crawled Japanese-Chinese paral- Junczys-Dowmunt, 2018; Chaudhary et al., 2019), lel corpus that we make publicly available. or relies on multilingual word embeddings and ne- glects context when measuring translation paral- 1 Introduction lelism (Hangya and Fraser, 2018). Training modern neural machine translation (NMT) In this paper, we propose a simple but effec- systems requires large parallel-text resources. tive parallel corpus filtering method. Multilingual Publicly-available parallel corpora are mostly BERT (Devlin et al., 2019) projects multilingual arXiv:2005.06166v1 [cs.CL] 13 May 2020 paired with English, such as German-English, sentences into a shared space and has shown a great French-English, Chinese-English, etc., and their potential for cross-lingual model transfer (Pires domains are limited. For building machine transla- et al., 2019). We use pre-trained multilingual tion systems between non-English language pairs, BERT as prior knowledge and fine-tune it on a such as Chinese and Japanese, existing parallel synthetic dataset. This multilingual BERT-based corpora are insufficient and often low quality. To classifier forms an acceptability filter that deter- address this problem, system builders have trained mines whether or not a sentence pair consists of a NMT systems on web-crawled data and achieved bona-fide translation. promising results (Xu and Koehn, 2017; Junczys- As the domain of training data largely affects Dowmunt, 2018; Schwenk, 2018; Schwenk et al., machine translation model performance, we also in- 2019). However, data automatically crawled from troduce a domain filter. It uses the pre-trained Gen- the web is extremely noisy. Khayrallah and Koehn erative Pre-training (GPT) as in-domain language model and is an extension of the existing cross- similarity of the source and target sentence repre- entropy difference based domain filter (Moore and sentations. The sentence representations are pro- Lewis, 2010; Junczys-Dowmunt, 2018). duced by a sentence encoder trained on clean paral- We evaluate our proposed method on the WMT lel data via a neural encoder-decoder architecture. 2018 German-English Parallel Corpus Filtering Other works based on sentence embeddings include shared task and achieve a new state-of-the-art. Our Hangya and Fraser(2018) and Littell et al.(2018), unsupervised method achieves comparable perfor- as well as Schwenk et al.(2019), which mines mil- mance to the top system that is trained on millions of parallel sentences in 1620 language pairs lions of clean parallel sentence pairs. Our proposed from Wikipedia. These encoder-decoder based methods also significantly outperform baselines in methods require large amounts of clean parallel our own Japanese-Chinese parallel corpus filtering training data and are not applicable in our sce- task. nario where available data is noisy. Ondrej Bojar We make the following contributions: (2020) organize an open domain translation chal- lenge where participants are provided a large, noisy • We propose a novel approach to filter noisy set of Japanese-Chinese segment pairs built from parallel corpora by using pre-trained language web data, and the task is to clean the noisy data and models. Our approach outperforms strong build an end-to-end machine translation system. baselines and achieves a new state-of-the-art. Work on data selection is also related. Moore and Lewis(2010); Junczys-Dowmunt(2018) se- • We devise an unsupervised filtering approach lect domain-related data by computing the cross- that does not require an identifiable clean sub- entropy difference between in-domain and out- set of parallel segments. Our unsupervised domain language models. Duh et al.(2013) use method matches the results of previous super- neural language models for data selection. Axel- vised methods. rod et al.(2011) and Axelrod et al.(2015) expand cross-entropy difference filtering to both sides of • We release a large web-crawled Japanese- the parallel corpus. Since we aim to build a general Chinese parallel corpus which can be a useful machine translation system, instead of selecting resource for machine translation research on data that are relevant to a specific domain, we se- 1 non-English language pairs. lect data whose domains are as general as possible, by using Generative Pre-training (GPT) models 2 Related Work trained on large and diverse corpora. Several recent works address parallel corpus filter- 3 Method ing. Denkowski et al.(2012), Dyer et al.(2010) and Heafield(2011) use language models and word In this section we introduce a language detection alignments to determine how likely sentences are filter, a translation-acceptability filter, and a do- to be a good translation of another. Xu and Koehn main filter. Each filter produces a score for every (2017) introduce a noise filtering tool, Zipporah, candidate source/target sentence pair. The partial that discriminates parallel and non-parallel sen- score produced by each filter ranges from 0 to 1. tences based on word-frequency vectors and a dic- Values beyond this range are normalized by min- tionary. Junczys-Dowmunt(2018) proposes a dual max normalization: y^ = (y − min)=(max − min). conditional cross-entropy filtering method, which The final score is the product of the partial scores. achieved first place in the WMT 2018 German- English Parallel Corpus Filtering shared task. They 3.1 Language Detection Filter train two translation models in inverse directions on Targeting a web-crawler at a given language pair millions of parallel sentences and score sentence still results in many pages written in the wrong pairs based on the word-normalized conditional language. For example, while a URL pair may cross-entropy from the translation models. Artetxe clearly indicate translation (e.g., “.jp” and “.zh”), it and Schwenk(2019) and Schwenk(2018) propose may happen that the text content is simply copied a margin-based scoring method that compares the rather than translated. We observe this in both 1http://iwslt.org/doku.php?id=open_ our Japanese-Chinese data and the German-English domain_translation Paracrawl data set. It is necessary to filter out sentence pairs with undesired languages. Lample et al.(2018), where the cross-lingual word We adopt the fastText (Joulin et al., 2017, 2016) embeddings learned in an unsupervised manner are language identification toolkit in our language de- loosely aligned. However, after fine-tuning on a tection filter. For each sentence, the toolkit pro- few anchor pairs (word translations), they become duces a list of language candidates and their cor- more aligned. responding confidence scores. We select the lan- Similarly, we use an unsupervised synthetic guage that has the highest confidence score from training set as anchors to fine-tune multilingual fastText as the language of the sentence. Sentence BERT with a binary classification objective. Xu pairs that have both of the elements detected as the and Koehn(2017) did similar work to train a fil- desired language are assigned score 1 and other- tering classifier on synthetic data, but via bag-of- wise 0. By discarding sentence pairs with undesired words translation features. language IDs, we filter out 27% of our Chinese- Synthetic Training Set. In cases where a small Japanese parallel sentences and nearly 70% of the number of clean parallel sentence pairs are avail- German-English parallel sentences from Paracrawl able, we use them as positive training samples data set. for our classifier. In Japanese-Chinese filtering, we use around 300k sentence pairs, mostly from 3.2 Acceptability Filter open-source software documentation,2 as our positive samples. In extreme cases where no identifi- In this section, we introduce our translation accept- able, clean parallel data is available, we sub-select ability filter, one of the main contributions in the high quality parallel sentences, which are used as paper. It aims to measure the parallelism of sen- positive samples, from the noisy parallel corpus tence pairs and filter out sentence pairs that are not based on the Hunalign (Varga et al., 2007) sentence- mutual translations.

Load more