WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Holger Schwenk (Facebook AI), Vishrav Chaudhary (Facebook AI), Shuo Sun (Johns Hopkins University), Hongyu Gong (University of Illinois Urbana-Champaign), Francisco Guzmán (Facebook AI)

Abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus is freely available.[1] To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting for training MT systems between distant languages without the need to pivot through English.

1 Introduction

Most of the current approaches in Natural Language Processing are data-driven. The size of the resources used for training is often the primary concern, but the quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, i.e. sentences which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). Several projects rely on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

Wikipedia is probably the largest free multilingual resource on the Internet. Its content is very diverse and covers many topics, and articles exist in more than 300 languages. Some content on Wikipedia was human-translated from an existing article into another language, not necessarily from or into English; such translated articles may later have been edited independently and are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation,[2] but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs.

In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a), and a freely available encoder for 93 languages. We address the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.

The paper is organized as follows. In the next section, we discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. In Section 5, we assess the quality of the extracted bitexts by training NMT systems for a subset of language pairs and evaluating them on the TED corpus (Qi et al., 2018) for 45 languages. The paper concludes with a discussion of future research directions.

[1] https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
[2] https://en.wikipedia.org/wiki/Wikipedia:Translation
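At this scale, brute-force pairwise comparison of sentence embeddings is impractical, hence the reliance on fast indexing and similarity search. The sketch below is not the authors' exact pipeline; it only illustrates the general idea with the FAISS library (an assumption, since no specific library is named in this excerpt), and it assumes the multilingual sentence embeddings have already been computed, e.g. with the encoder referenced in footnote [1], and are available as float32 NumPy matrices.

```python
# Minimal sketch of large-scale similarity search over sentence embeddings.
# Assumptions: embeddings are precomputed float32 matrices; FAISS stands in for
# whatever indexing/search library is actually used.
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss


def build_index(embeddings):
    """Index L2-normalized embeddings so that inner product equals cosine similarity."""
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact search; use IVF/PQ variants at scale
    index.add(embeddings)
    return index


def nearest_neighbors(index, queries, k=4):
    """Return cosine similarities and indices of the k nearest target sentences."""
    queries = np.ascontiguousarray(queries, dtype=np.float32)
    faiss.normalize_L2(queries)
    return index.search(queries, k)  # (similarities, indices)


if __name__ == "__main__":
    # Random vectors stand in for real sentence embeddings of two languages.
    rng = np.random.default_rng(0)
    tgt = rng.standard_normal((10_000, 1024)).astype(np.float32)
    src = rng.standard_normal((8, 1024)).astype(np.float32)
    sims, ids = nearest_neighbors(build_index(tgt), src, k=4)
    print(sims.shape, ids.shape)  # (8, 4) (8, 4)
```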
2 Related work

There is a large body of research on mining parallel sentences in monolingual text collections, usually named "comparable corpora". Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models, and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web crawls such as ParaCrawl[3] for filtering good-quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource (Koehn et al., 2019) languages.

[3] http://www.paracrawl.eu/
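As a toy illustration of the Jaccard-based scoring used by STACC, the snippet below scores one candidate sentence pair against a seed lexicon. It is only a sketch of the general idea: the actual method induces its seed translations from IBM alignments and applies set expansion operations, neither of which is reproduced here, and the lexicon shown is invented for the example.

```python
# Jaccard scoring of a candidate sentence pair, in the spirit of STACC
# (Etchegoyhen and Azpeitia, 2016). The seed lexicon below is a made-up example.
def jaccard_score(src_tokens, tgt_tokens, seed_lexicon):
    """Jaccard coefficient between the lexicon-translated source set and the target set."""
    translated = {seed_lexicon[t] for t in src_tokens if t in seed_lexicon}
    target = set(tgt_tokens)
    union = translated | target
    return len(translated & target) / len(union) if union else 0.0


lexicon = {"the": "das", "house": "haus", "is": "ist", "green": "grün"}
print(jaccard_score("the house is green".split(), "das haus ist grün".split(), lexicon))  # 1.0
```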
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario (Chaudhary et al., 2019; Koehn et al., 2019). Closest to this approach is the research described in España-Bonet et al. (2017), Hassan et al. (2018), Guo et al. (2018), and Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Related ideas have also been proposed in Bouamor and Sajjad (2018) and Grégoire and Langlais (2017); in those works, however, mining is not solely based on multilingual sentence embeddings, which are instead part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts. In follow-up work, the same underlying mining approach was applied to a huge collection of Common Crawl texts (Schwenk et al., 2019), and hierarchical mining in Common Crawl texts was performed by El-Kishky et al. (2020).[4]

Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006): an MT system was used to translate Dutch sentences into English and to compare them with the English texts, yielding several hundred Dutch/English bitexts. Later, a similar technique was applied to Persian/English (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and López, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents in three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features, such as Wikipedia entities, to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine bitexts from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposed an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages.[5] We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.

[4] http://www.statmt.org/cc-aligned/
[5] https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

3 Distance-based mining approach

The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, such that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is not globally consistent, e.g. (Guo et al., 2018); this is particularly true when mining bitexts for many different language pairs.

3.2 Multilingual sentence embeddings

Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. (España-Bonet et al., 2017; Hassan et al., 2018; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the freely available encoder covering 93 languages already mentioned in the introduction (see footnote [1]).
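To make the distance-based scoring concrete, the sketch below contrasts an absolute threshold on cosine similarity, as in Schwenk (2018), with the ratio-margin score described by Artetxe and Schwenk (2018a), which divides a candidate pair's cosine similarity by the average similarity of each sentence to its k nearest neighbours. This is only an illustrative instantiation: the embedding dimension, threshold values, and the random vectors standing in for real sentence embeddings are all assumptions made for the example.

```python
# Sketch of distance-based candidate scoring in a joint multilingual embedding
# space: absolute cosine threshold vs. ratio-margin score. Constants are illustrative.
import numpy as np


def cosine_matrix(src, tgt):
    """Pairwise cosine similarities between L2-normalized source and target embeddings."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def margin_scores(src, tgt, k=4):
    """Ratio margin: cos(x, y) divided by the mean cosine of the k nearest neighbours of x and of y."""
    sim = cosine_matrix(src, tgt)
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # average k-NN similarity per source sentence
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # average k-NN similarity per target sentence
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)


# Random vectors stand in for real multilingual sentence embeddings.
rng = np.random.default_rng(0)
src_emb = rng.standard_normal((50, 1024)).astype(np.float32)
tgt_emb = rng.standard_normal((60, 1024)).astype(np.float32)

sim = cosine_matrix(src_emb, tgt_emb)
pairs_absolute = np.argwhere(sim > 0.8)                        # absolute threshold (illustrative value)
pairs_margin = np.argwhere(margin_scores(src_emb, tgt_emb) > 1.06)  # margin threshold (illustrative value)
print(len(pairs_absolute), len(pairs_margin))
```

The margin score addresses the inconsistency noted above: instead of asking whether a similarity is high in absolute terms, it asks whether a candidate pair stands out relative to its neighbourhood, which is more comparable across language pairs.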