
ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Marta Bañón¹, Pinzhen Chen², Barry Haddow², Kenneth Heafield², Hieu Hoang², Miquel Esplà-Gomis³, Mikel Forcada³, Amir Kamran⁵, Faheem Kirefu², Philipp Koehn⁴, Sergio Ortiz-Rojas¹, Leopoldo Pla³, Gema Ramírez-Sánchez¹, Elsa Sarrías³, Marek Strelec², Brian Thompson⁴, William Waites², Dion Wiggins⁶, Jaume Zaragoza¹

¹Prompsit, ²University of Edinburgh, ³University of Alicante, ⁴Johns Hopkins University, ⁵TAUS, ⁶Omniscien Technologies

Abstract

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

1 Introduction

Parallel corpora are essential for building high-quality machine translation systems and have found uses in many other natural language applications, such as learning paraphrases (Bannard and Callison-Burch, 2005; Hu et al., 2019) or cross-lingual projection of language tools (Yarowsky et al., 2001).

We report on work to create the largest publicly available parallel corpora by crawling hundreds of thousands of web sites, using open source tools. The processing pipeline consists of the following steps: crawling, text extraction, document alignment, sentence alignment, and sentence pair filtering. We describe these steps in detail in Sections 4–8. For some of these steps we evaluate several methods empirically in terms of their impact on machine translation quality. We provide the data resources used in these evaluations as benchmarks for future research.

As part of this effort, several open source components have been developed. These are integrated into the open-source tool Bitextor,[1] a highly modular pipeline that allows harvesting parallel corpora from multilingual websites or from preexisting or historical web crawls such as the one available as part of the Internet Archive.[2]

[1] https://github.com/bitextor/bitextor
[2] https://archive.org/

The execution of the pipeline has focused on official European Union languages, but has also targeted Russian, Sinhala, Nepali, Tagalog, Swahili, and Somali. We show that the obtained parallel corpora improve state-of-the-art results on common benchmarks, such as the WMT Shared Task on News Translation.

2 Related Work

While the idea of mining the web for parallel data had already been pursued in the 20th century (Resnik, 1999), the most serious efforts have been limited to large companies such as Google (Uszkoreit et al., 2010) and Microsoft (Rarrick et al., 2011), or to targeted efforts on specific domains such as the Canadian Hansards and Europarl (Koehn, 2005). The book Bitext Alignment (Tiedemann, 2011) describes some of the challenges in greater detail.

2.1 Acquisition Efforts

Most publicly available parallel corpora are the result of targeted efforts to extract translations from a specific source. The French–English Canadian Hansards[3] were used in the earliest work on statistical machine translation. A similarly popular corpus is Europarl (Koehn, 2005), used throughout the WMT evaluation campaign.

[3] https://www.isi.edu/natural-language/download/hansard/

Multi-lingual web sites are attractive targets. Rafalovitch and Dale (2009) and Ziemski et al. (2015) extract data from the United Nations, Täger (2011) from European patents, and Lison and Tiedemann (2016) from a collection of TV and movie subtitles. Cettolo et al. (2012) describe the creation of a multilingual parallel corpus of subtitles from the TED Talks website, which is popular due to its use in the IWSLT evaluation campaign.

There are also various efforts targeted at a single language pair. Martin et al. (2003) build a parallel corpus for Inuktitut–English. Utiyama and Isahara (2003) and Fukushima et al. (2006) worked on creating Japanese–English corpora. Uchiyama and Isahara (2007) report on efforts to build a Japanese–English patent corpus, and Macken et al. (2007) on efforts towards a broad-based Dutch–English corpus. Li and Liu (2008) mine the web for a Chinese–English corpus. A large Czech–English corpus from various sources was collected (Bojar et al., 2010), linguistically annotated (Bojar et al., 2012), and has been continuously extended to over 300 million words (Bojar et al., 2016).

All these efforts rely on methods and implementations that are quite specific to each use case, not documented in great detail, and not publicly available. A discussion of the pitfalls encountered during the construction of parallel corpora is given by Kaalep and Veskis (2007). A large collection of corpora is maintained at the OPUS web site[4] (Tiedemann, 2012).

[4] http://opus.lingfil.uu.se/

2.2 Document Alignment

Document alignment can be defined as a matching task that takes a pair of documents and computes a score reflecting the likelihood that they are translations of each other. For efficiency, the task is typically limited to a single web domain (all web pages from www.aaa.com and aaa.com, possibly aaa.de, but not bbb.com).

Matching may take the HTML structure into account, or rely purely on the textual content. Examples of structural matching are the use of edit distance between linearized documents (Resnik and Smith, 2003) and the probability of a probabilistic DOM-tree alignment model (Shi et al., 2006). The URL is a very powerful matching indicator for some domains, typically exploited with a predefined set of patterns for language marking or with simple Levenshtein distance (Le et al., 2016).

Content matching requires crossing the language barrier at some point, typically by using bilingual dictionaries or by translating one of the documents into the other document's language (Uszkoreit et al., 2010).

Documents may be represented by vectors over word frequencies, typically tf-idf-weighted. Vectors may also be constructed over bigrams (Dara and Lin, 2016) or even higher-order n-grams (Uszkoreit et al., 2010). The vectors are then typically matched with cosine similarity (Buck and Koehn, 2016a). The raw vectors may be re-centered around the mean vector for a web domain (Germann, 2016).

Document alignment quality can be improved with additional features, such as the ratio of shared links, the similarity of link URLs, the ratio of shared images, a binary feature indicating whether the documents are linked, DOM structure similarity (Esplà-Gomis et al., 2016), shared numbers (Papavassiliou et al., 2016), or shared named entities (Lohar et al., 2016).

Guo et al. (2019) introduce the use of document embeddings, constructed from sentence embeddings, for the document alignment task.
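To make the content-matching approach concrete, the following is a minimal sketch of tf-idf document alignment within a single web domain: represent each document as a tf-idf vector, score all cross-lingual pairs by cosine similarity, and greedily keep the best one-to-one matches. This is an illustration only, not the Bitextor implementation; all function names are invented, and the structural features and re-centering discussed above are omitted.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Turn tokenized documents (lists of terms) into sparse tf-idf vectors."""
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        return [{term: count * math.log(n / df[term])
                 for term, count in Counter(doc).items()}
                for doc in docs]

    def cosine(u, v):
        """Cosine similarity of two sparse vectors stored as dicts."""
        dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def align_documents(src_docs, trg_docs):
        """Greedy 1-1 document alignment within a web domain.
        Assumes the target documents were machine-translated into the
        source language beforehand, so that terms are comparable."""
        vecs = tfidf_vectors(src_docs + trg_docs)  # shared idf statistics
        src_vecs, trg_vecs = vecs[:len(src_docs)], vecs[len(src_docs):]
        candidates = sorted(((cosine(u, v), i, j)
                             for i, u in enumerate(src_vecs)
                             for j, v in enumerate(trg_vecs)), reverse=True)
        taken_src, taken_trg, pairs = set(), set(), []
        for score, i, j in candidates:  # best-scoring pairs claimed first
            if score > 0 and i not in taken_src and j not in taken_trg:
                pairs.append((i, j, score))
                taken_src.add(i)
                taken_trg.add(j)
        return pairs

Scoring all pairs exhaustively is quadratic in the number of documents, which is tolerable only because matching is restricted to one web domain at a time, as noted above.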
2.3 Sentence Alignment

Early sentence aligners (Brown et al., 1991; Gale and Church, 1993) use scoring functions based only on the number of words or characters in each sentence, combined with alignment algorithms based on dynamic programming. Europarl, for example, used metadata to align paragraphs, typically consisting of 2–5 sentences, and applied Gale and Church (1993)'s method to align sentences within corresponding paragraphs. Later work added lexical features and heuristics to speed up search, such as limiting the search space to be near the diagonal (Moore, 2002; Varga et al., 2005).

More recent work introduced scoring methods that use MT to bring both documents into the same language (Sennrich and Volk, 2010) or that use pruned phrase tables from a statistical MT system (Gomes and Lopes, 2016). Both methods "anchor" high-probability 1–1 alignments in the search space and then fill in and refine the remaining alignments. Sennrich and Volk (2011) later propose an extension in which an SMT system is bootstrapped from an initial alignment and then used in Bleualign.

Vecalign (Thompson and Koehn, 2019) is a sentence alignment method that relies on bilingual sentence embeddings and achieves linear run time with a coarse-to-fine dynamic programming algorithm.
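As an illustration of the early length-based approach, the sketch below aligns two sentence lists with dynamic programming over 1-1, 1-0, 0-1, 2-1, and 1-2 operations. The cost function is a deliberately crude stand-in for Gale and Church (1993)'s probabilistic length-ratio model, and the lexical features and diagonal-pruning heuristics mentioned above are omitted.

    import math

    def length_cost(src_chars, trg_chars):
        """Cost of aligning segments with these total character counts.
        A squared log-ratio plus a fixed insertion/deletion penalty is a
        crude stand-in for the Gale-Church probabilistic model."""
        if src_chars == 0 or trg_chars == 0:
            return 4.0  # penalty for unmatched sentences
        return math.log(src_chars / trg_chars) ** 2

    def align(src, trg):
        """Minimize total length-based cost over alignment operations."""
        m, n = len(src), len(trg)
        INF = float("inf")
        cost = [[INF] * (n + 1) for _ in range(m + 1)]
        back = [[None] * (n + 1) for _ in range(m + 1)]
        cost[0][0] = 0.0
        for i in range(m + 1):
            for j in range(n + 1):
                if cost[i][j] == INF:
                    continue
                for di, dj in ((1, 1), (1, 0), (0, 1), (2, 1), (1, 2)):
                    ii, jj = i + di, j + dj
                    if ii > m or jj > n:
                        continue
                    step = length_cost(sum(len(s) for s in src[i:ii]),
                                       sum(len(t) for t in trg[j:jj]))
                    if cost[i][j] + step < cost[ii][jj]:
                        cost[ii][jj] = cost[i][j] + step
                        back[ii][jj] = (i, j)
        # trace back the lowest-cost path into a list of aligned groups
        beads, pos = [], (m, n)
        while pos != (0, 0):
            i, j = back[pos[0]][pos[1]]
            beads.append((src[i:pos[0]], trg[j:pos[1]]))
            pos = (i, j)
        return beads[::-1]

The full quadratic dynamic program shown here is exactly what the diagonal-pruning heuristics of Moore (2002) and Varga et al. (2005), and the coarse-to-fine search of Vecalign, were designed to avoid on long documents.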
2.4 Sentence Pair Filtering

Parallel corpora that have been crawled from unverified web sites and processed by error-prone extraction and alignment methods are likely to contain noise, such as random text fragments, text in the wrong language, translations produced by machine translation tools or bad translators, and misaligned sentence pairs. Such noise is especially harmful for neural machine translation (Khayrallah and Koehn, 2018), so filtering it out is an essential processing step.

There is a robust body of work on filtering out noise in parallel data, but the topic has recently gained a lot of momentum, partly due to the lack of robustness of neural models, and fostered by recent shared tasks on parallel corpus filtering under high-resource (Koehn et al., 2018) and low-resource data conditions (Koehn et al., 2019).

Most participants in these shared tasks used three components: pre-filtering rules, scoring functions for sentence pairs, and a classifier that learned weights for feature functions.

Pre-filtering rules. Some of the training data can be discarded based on simple deterministic filtering rules.

Scoring functions. One approach is to first train a translation system on the clean data, then use it to translate the non-English side into English, and use monolingual matching methods to compare the output against the English side of each sentence pair.

Learning weights. Learning feature weights by directly optimizing machine translation quality is computationally intractable due to the high cost of training these systems to evaluate different weight settings. A few participants instead used a classifier that learns how to distinguish between good and bad sentence pairs (where bad sentence pairs are either synthesized by scrambling good sentence pairs or selected from the raw crawled data).

A novel method that was central to the best-performing submission in WMT 2019 was the use of cross-lingual sentence embeddings that were directly trained from parallel sentence pairs (Chaudhary et al., 2019). Other submissions used monolingual word embeddings (Soares and Costa-jussà, 2019; Kurfalı and Östling, 2019; Bernier-Colborne and Lo, 2019).
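To make the pre-filtering component concrete, here is a minimal sketch of deterministic rules of the kind described above. The thresholds and the specific rules are illustrative only; they are not the exact configuration of ParaCrawl or of any shared-task participant.

    MAX_TOKENS = 100   # maximum sentence length, in tokens
    MAX_RATIO = 2.0    # maximum token-count ratio between the two sides

    def keep_pair(src, trg):
        """Deterministic pre-filtering: reject sentence pairs that are
        almost certainly noise. Thresholds are illustrative, not tuned."""
        src_toks, trg_toks = src.split(), trg.split()
        if not src_toks or not trg_toks:
            return False                 # empty side
        if len(src_toks) > MAX_TOKENS or len(trg_toks) > MAX_TOKENS:
            return False                 # overly long sentence
        ratio = len(src_toks) / len(trg_toks)
        if ratio > MAX_RATIO or ratio < 1.0 / MAX_RATIO:
            return False                 # mismatched lengths
        if src.strip() == trg.strip():
            return False                 # untranslated copy
        if sum(c.isalpha() for c in src) < 0.5 * len(src):
            return False                 # mostly numbers or punctuation
        return True

    pairs = [("A short example sentence.", "Une courte phrase d'exemple.")]
    clean = [(s, t) for s, t in pairs if keep_pair(s, t)]

Such rules are cheap enough to run on billions of candidate pairs before the more expensive scoring functions are applied.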
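The embedding-based scoring central to the WMT 2019 submission can be sketched as follows, assuming sentence embeddings precomputed with a bilingual encoder of the kind used by Chaudhary et al. (2019). The margin normalization shown, dividing each pair's cosine similarity by the mean similarity of its nearest neighbours, is one common variant of such scoring, not necessarily the exact function of the cited submission.

    import numpy as np

    def margin_scores(src_emb, trg_emb, k=4):
        """Score candidate sentence pairs (row i of src_emb against row i
        of trg_emb) by cosine similarity, normalized by each sentence's
        mean similarity to its k nearest neighbours on the other side.
        This penalizes "hub" sentences that are close to everything.
        Both arrays are (n, d), assumed L2-normalized, with k < n."""
        sim = src_emb @ trg_emb.T        # all-pairs cosine similarities
        pair = np.diag(sim)              # similarity of each candidate pair
        nn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source
        nn_trg = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target
        return pair / ((nn_src + nn_trg) / 2.0)

    # Usage: keep pairs whose score clears a threshold tuned on held-out data.
    # scores = margin_scores(src_emb, trg_emb)
    # kept = [p for p, s in zip(pairs, scores) if s > 1.05]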