WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Holger Schwenk (Facebook AI), Vishrav Chaudhary (Facebook AI), Shuo Sun (Johns Hopkins University), Hongyu Gong (University of Illinois Urbana-Champaign), Francisco Guzmán (Facebook AI)

Abstract

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus is freely available.1 To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

1 Introduction

Most of the current approaches in Natural Language Processing are data-driven. The size of the resources used for training is often the primary concern, but the quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, i.e. sentences which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). Several projects rely on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

Wikipedia is probably the largest free multilingual resource on the Internet. Its content is very diverse and covers many topics, and articles exist in more than 300 languages. Some content on Wikipedia was human translated from an existing article in another language, not necessarily from or into English. Such translated articles may have later been independently edited and are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation,2 but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs.

In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a), and a freely available encoder for 93 languages. We tackle the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.

The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. In Section 5, we assess the quality of the extracted bitexts by training NMT systems for a subset of language pairs and evaluating them on the TED corpus (Qi et al., 2018) for 45 languages. The paper concludes with a discussion of future research directions.

1 https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
2 https://en.wikipedia.org/wiki/Wikipedia:Translation


Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 1351–1361, April 19–23, 2021. ©2021 Association for Computational Linguistics

2 Related work

There is a large body of research on mining parallel sentences in monolingual text collections, usually named "comparable corpora". Initial approaches to bitext mining relied on heavily engineered systems, often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005), or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web crawls such as ParaCrawl3 for filtering good-quality sentence pairs has been explored in the shared tasks for high-resource (Koehn et al., 2018) and low-resource (Koehn et al., 2019) languages.

In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in (Schwenk, 2018; Artetxe and Schwenk, 2018a,b). This approach has also proven to perform best in a low-resource scenario (Chaudhary et al., 2019; Koehn et al., 2019). Closest to this approach is the research described in España-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Related ideas have also been proposed in Bouamor and Sajjad (2018) or Grégoire and Langlais (2017); however, in those works, mining is not solely based on multilingual sentence embeddings, but is part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts. In follow-up work, the same underlying mining approach was applied to a huge collection of Common Crawl texts (Schwenk et al., 2019). Hierarchical mining in Common Crawl texts was performed by El-Kishky et al. (2020).4

Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006): an MT system was used to translate Dutch sentences into English and to compare them with the English texts, yielding several hundred Dutch/English bitexts. Later, a similar technique was applied to Persian/English (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia, such as the topic categories of documents, was used in the alignment of multilingual corpora (Otero and López, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by the translation equivalents in three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features, such as Wikipedia entities, to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine bitexts from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages. Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposed an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages.5 We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.

3 http://www.paracrawl.eu/
4 http://www.statmt.org/cc-aligned/
5 https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/

3 Distance-based mining approach

The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding. The distance in that space can then be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018). This is particularly true when mining bitexts for many different language pairs.

3.1 Margin criterion

The alignment quality can be substantially improved by using a margin criterion (Artetxe and Schwenk, 2018a). The margin between two candidate sentences x and y is defined as the ratio between their cosine similarity and the average cosine similarity of their nearest neighbors in both directions:

M(x, y) = \frac{\cos(x, y)}{\sum_{z \in NN_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in NN_k(y)} \frac{\cos(y, z)}{2k}}    (1)

where NN_k(x) denotes the k unique nearest neighbors of x in the other language, and analogously for NN_k(y). We used k = 4 in all experiments.

We follow the "max" strategy of Artetxe and Schwenk (2018a): the margin is first calculated in both directions for all sentences in languages L1 and L2. We then create the union of these forward and backward candidates. Candidates are sorted, and pairs whose source or target sentence was already used are omitted. We then apply a threshold on the margin score to decide whether two sentences are mutual translations or not.

The complexity of a distance-based mining approach is O(N × M), where N and M are the number of sentences in each monolingual corpus. This makes a brute-force approach with exhaustive distance calculations intractable for large corpora. The languages with the largest Wikipedia are English and German, with 134M and 51M sentences, respectively. Mining this pair alone would require 6.8 × 10^15 distance calculations.6 We show in Section 3.3 how to tackle this computational challenge.

3.2 Multilingual sentence embeddings

Distance-based bitext mining requires a joint sentence embedding for all the considered languages. One may be tempted to train a bilingual embedding for each language pair, e.g. (España-Bonet et al., 2017; Hassan et al., 2018; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit (Artetxe and Schwenk, 2018b). Training one joint multilingual embedding on many languages at once also has the advantage that low-resource languages can benefit from their similarity to other languages in the same language family. For example, we were able to mine parallel data for several Romance (minority) languages like Aragonese, Lombard, Mirandese or Sicilian, although data in those languages was not used to train the multilingual LASER embeddings. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description of how LASER was trained.

3.3 Fast similarity search

In this work, we use the open-source FAISS library,7 which implements highly efficient algorithms to perform similarity search on billions of vectors (Johnson et al., 2017). Our sentence representations being 1024-dimensional, all English sentences require 134·10^6 × 1024 × 4 = 536 GB of memory. Therefore, dimensionality reduction and data compression are needed for efficient search. We chose a rather aggressive compression based on a 64-bit product quantizer (Jégou et al., 2011), partitioning the search space into 32k cells.8 We build one FAISS index for each language.

The compressed FAISS index for English requires only 9.2 GB, i.e. more than fifty times smaller than the original sentence embeddings. This makes it possible to load the whole index on a standard GPU and to run the search very efficiently on multiple GPUs in parallel, without the need to shard the index. The overall mining process for German/English requires less than 3.5 hours on 8 GPUs, including the nearest-neighbor search in both directions and the scoring of all candidates.

6 Strictly speaking, Cebuano and Swedish are larger than German, yet mostly consist of template/machine-translated text.
7 https://github.com/facebookresearch/faiss
8 FAISS index type OPQ64,IVF32768,PQ64, see https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
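The margin criterion of Eq. 1 combined with nearest-neighbor search can be sketched as follows. This is an illustrative NumPy-only reimplementation, not the authors' code: the exact k-NN search stands in for the compressed FAISS index described above, and the function names (`knn_cos`, `mine`) are our own.

```python
import numpy as np

def knn_cos(q, db, k):
    """Exact cosine k-NN: rows of q and db must be L2-normalized."""
    sims = q @ db.T                                  # cosine similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]           # top-k neighbors per row
    return np.take_along_axis(sims, idx, axis=1), idx

def mine(x_emb, y_emb, k=4, threshold=1.04):
    """Score each x_i against its best candidate y_j with the ratio margin."""
    x = x_emb / np.linalg.norm(x_emb, axis=1, keepdims=True)
    y = y_emb / np.linalg.norm(y_emb, axis=1, keepdims=True)
    sim_xy, nn_xy = knn_cos(x, y, k)   # neighbors of each x in language 2
    sim_yx, _ = knn_cos(y, x, k)       # neighbors of each y in language 1
    avg_x = sim_xy.mean(axis=1)        # (1/k) * sum_z cos(x, z)
    avg_y = sim_yx.mean(axis=1)
    pairs = []
    for i in range(x.shape[0]):
        j = int(nn_xy[i, 0])           # best candidate for sentence x_i
        # Eq. 1: cos(x, y) over the mean of the two neighborhood averages
        margin = sim_xy[i, 0] / ((avg_x[i] + avg_y[j]) / 2.0)
        if margin >= threshold:
            pairs.append((i, j, float(margin)))
    return pairs
```

At Wikipedia scale, `knn_cos` would be replaced by a compressed FAISS index (the paper uses OPQ64,IVF32768,PQ64) queried on GPU, and forward and backward candidates would be merged and deduplicated following the "max" strategy.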

4 Bitext mining in Wikipedia

For each Wikipedia article, it is possible to get the link to the corresponding article in other languages. This could be used to mine sentences limited to the respective articles. On the one hand, this local mining has several advantages: 1) mining is very fast since each article usually has only a few hundred sentences; 2) it seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere else in the whole Wikipedia. On the other hand, we hypothesize that the margin criterion will be less efficient, since one article usually has few sentences which are similar. This may lead to many sentences in the overall mined corpus of the type "NAME was born on DATE in CITY", "BUILDING is a monument in CITY built on DATE", etc. Although those alignments may be correct, we hypothesize that they are of limited use to train an NMT system.

The other option is to consider the whole Wikipedia for each language: for each sentence in the source language, we mine among all target sentences. This global mining has the advantage that we can try to align two languages even though there are only a few articles in common. A drawback is a potentially increased risk of misalignment. In this work, we chose the global mining option.

4.1 Corpus preparation

Extracting the textual content of Wikipedia articles in all languages is a rather challenging task, i.e. removing all tables, citations, footnotes or formatting markup. There are several ways to download Wikipedia content. In this study, we use the so-called CirrusSearch dumps, since they directly provide the textual content without any meta information.9 We downloaded this dump in March 2019. A total of about 300 languages are available, but the size obviously varies a lot between languages. We applied the following processing: 1) extract the textual content; 2) split the paragraphs into sentences; 3) remove duplicate sentences; and 4) perform language identification and remove sentences which are not in the expected language.

It should be pointed out that sentence segmentation is not a trivial task. Some languages, namely Thai, do not use specific symbols to mark the end of a sentence. We are not aware of a freely available sentence segmenter for Thai and had to exclude it. We used a freely available Python tool10 to detect sentence boundaries. Regular expressions were used for most of the Asian languages, falling back to English rules for the remaining languages. This gives us 879 million sentences in 300 languages. The margin criterion to mine for parallel data requires that the texts do not contain duplicates; removing them discards about 25% of the sentences.11

LASER's sentence embeddings are totally language agnostic. This has the side effect that sentences in other languages (e.g. citations or quotes) may be considered closer in the embedding space than a potential translation in the target language. Table 1 illustrates this problem: the algorithm would not select the German sentence although it is a perfect translation, since the sentences in the other languages are also valid translations, which yields a very small margin. To avoid this problem, we perform language identification (LID) on all sentences and remove those which are not in the expected language. LID is performed with fastText12 (Joulin et al., 2016). fastText does not support all the 300 languages present in Wikipedia, and we disregarded the missing ones (which typically have only a few sentences anyway). After deduplication and LID, we dispose of 595M sentences in 182 languages. English accounts for 134M sentences, and German, with 51M sentences, is the second largest language. The sizes for all languages are given in Tables 3 and 5 (in the appendix).

  L1 (French)   Ceci est une très grande maison
  L2 (German)   Das ist ein sehr großes Haus
                This is a very big house
                Ez egy nagyon nagy ház
                Ini rumah yang sangat besar

Table 1: Illustration of how sentences in the wrong language can hurt the alignment process with a margin criterion. See text for a detailed discussion.

4.2 Threshold optimization

Artetxe and Schwenk (2018a) optimized their mining approach for each language pair on a provided corpus of gold alignments. This is not possible when mining Wikipedia, in particular when considering many language pairs. In this work, we use an evaluation protocol inspired by the WMT shared task on parallel corpus filtering for low-resource conditions (Koehn et al., 2019): an NMT system is trained on the extracted bitexts, for different thresholds, and the resulting BLEU scores are compared. We choose newstest2014 of the WMT evaluations since it provides N-way parallel test sets for English, French, German and Czech. We favoured translation between two morphologically rich languages from different families and considered the following language pairs: German/English, German/French, Czech/German and Czech/French. The sizes of the mined bitexts are in the range of 100k to more than 2M sentences (see Table 2 and Figure 1). We did not try to adapt the architecture of the NMT system to the size of the bitexts and used the same architecture for all systems: the encoder and decoder are 5-layer transformer models as implemented in fairseq (Ott et al., 2019). The goal of this study is not to develop the best performing NMT system for the considered language pairs, but to compare different mining parameters.

[Figure 1: BLEU scores (continuous lines) for several NMT systems trained on bitexts extracted from Wikipedia for different margin thresholds. The sizes of the mined bitexts are depicted as dashed lines.]

  Bitexts                de-en   de-fr   cs-de   cs-fr

  Europarl (all)          1.9M    1.9M    568k    627k
                          21.5    23.6    14.9    21.5

  Europarl (same size)    1.0M    370k    200k    220k
                          21.2    21.1    12.6    19.2

  Mined Wikipedia         1.0M    372k    201k    219k
                          24.4    22.7    13.1    16.3

  Europarl + Wikipedia    3.0M    2.3M    768k    846k
                          25.5    25.6    17.7    24.0

Table 2: Comparison of NMT systems trained on the Europarl corpus and on bitexts automatically mined in Wikipedia by our approach at a threshold of 1.04. We give the number of sentences (first line) and the BLEU score (second line of each block) on newstest2014.

The evolution of the BLEU score as a function of the margin threshold is given in Figure 1. Decreasing the threshold naturally leads to more mined data; we observe an exponential increase of the data size. The performance of the NMT systems trained on the mined data changes as expected: the BLEU score first improves with increasing amounts of training data, reaches a maximum, and then decreases as the additional data gets more and more noisy, i.e. contains wrong translations. It is also not surprising that a careful choice of the margin threshold is more important in a low-resource setting, where every additional parallel sentence counts. According to Figure 1, the optimal value of the margin threshold seems to be 1.05 when many sentences can be extracted, in our case German/English and German/French. When less parallel data is available, i.e. Czech/German and Czech/French, a value in the range of 1.03–1.04 seems to be a better choice. Aiming at one threshold for all language pairs, we chose a value of 1.04, which appears to be a good compromise for most language pairs. For the open release of this corpus, however, we provide all mined sentences with a margin of 1.02 or better. This enables end users to choose an optimal threshold for their particular applications, although we do not expect many sentence pairs with a margin as low as 1.02 to be good translations.

For comparison, we also trained NMT systems on the Europarl corpus V7 (Koehn, 2005), i.e. professional human translations, first on all available data, and then on the same number of sentences as the mined ones (see Table 2). With the exception of Czech/French, we were able to achieve better

9 https://dumps.wikimedia.org/other/cirrussearch/
10 https://pypi.org/project/sentence-splitter/
11 The Cebuano and Waray Wikipedias were largely created by a bot and contain more than 65% duplicates.
12 https://fasttext.cc/docs/en/language-identification.html

BLEU scores with the mined bitexts in Wikipedia than with Europarl data of the same size. Adding the mined bitexts to the full Europarl corpus leads to further improvements of 1.1 to 3.1 BLEU.

5 Result analysis

We ran the alignment process for all possible combinations of languages in Wikipedia. This yielded 1620 language pairs for which we were able to mine at least ten thousand sentences. Remember that mining L1→L2 is identical to L2→L1 and is counted only once. We analyze and evaluate the extracted bitexts in two ways. First, we discuss the amount of extracted sentences (Section 5.1). We then turn to a qualitative assessment by training NMT systems for all language pairs with more than twenty-five thousand mined sentences (Section 5.2).

5.1 Quantitative analysis

Due to space limits, Table 3 summarizes the number of extracted parallel sentences only for languages which have a total of at least five hundred thousand parallel sentences (with all other languages, at a margin threshold of 1.04). Additional results are given in Table 5 in the appendix. There are many reasons which can influence the number of mined sentences. Obviously, the larger the monolingual texts, the more likely it is to mine many parallel sentences. Not surprisingly, we observe that more sentences could be mined when English is one of the two languages. Let us point out some languages for which it is usually not easy to find parallel data with English, namely Indonesian (1M), Hebrew (545k), Farsi (303k) or Marathi (124k sentences). The largest mined texts not involving English are Russian/Ukrainian (2.5M), Catalan/Spanish (1.6M), those between the Romance languages French, Spanish, Italian and Portuguese (480k–923k), and German/French (626k).

It is striking that we were able to mine more sentences when Galician and Catalan are paired with Spanish than with English. On the one hand, this could be explained by the fact that LASER's multilingual sentence embeddings may be better since the involved languages are linguistically very similar. On the other hand, it could be that the Wikipedia articles in both languages share a lot of content, or are obtained by mutual translation.

Services from the European Commission provide human translations of (legal) texts in all the 24 official languages of the European Union. This N-way parallel corpus enables training of MT systems to directly translate between these languages, without the need to pivot through English. This is usually not the case when translating between other major languages, for example in Asia. Some interesting language pairs for which we were able to mine more than 100k sentences include Korean/Japanese (222k), Russian/Japanese (196k), Indonesian/Vietnamese (146k), and Hebrew/Romance languages (120–150k sentences).

Overall, we were able to extract at least ten thousand parallel sentences for 96 different languages. For several low-resource languages, we were able to extract more parallel sentences with languages other than English. These include, among others, Aragonese with Spanish, Lombard with Italian, Breton with several Romance languages, Western Frisian with Dutch, Luxembourgish with German, and Egyptian Arabic and Wu Chinese with the respective major language.

Finally, Cebuano (ceb) clearly falls apart: it has a rather huge Wikipedia (17.9M filtered sentences), but most of it was generated by a bot, as for the Waray language.13 This certainly explains why only a very small number of parallel sentences could be extracted. Although the same bot was also used to generate articles in the Swedish Wikipedia, our alignments seem to be better for that language.

5.2 Qualitative evaluation

Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on the bitexts. We identified a publicly available dataset which provides test sets for many language pairs: translations of TED talks, as proposed in the context of a study on pretrained word embeddings for NMT14 (Qi et al., 2018). We would like to emphasize that we did not use the training data provided by TED; we only trained on the mined sentences from Wikipedia. The goal of this study is not to build state-of-the-art NMT systems for the TED task, but to get an estimate of the quality of our extracted data for many language pairs. In particular, there may be a mismatch in topic and language style between Wikipedia texts and the transcribed and translated TED talks. NMT systems are trained with a transformer model from fairseq (Ott et al., 2019) with the

13 https://en.wikipedia.org/wiki/Lsjbot
14 https://github.com/neulab/word-embeddings-for-nmt

[Table 3 body omitted: the full matrix of extracted sentence counts (in thousands) for all pairs of the languages az, ba, be, bg, bn, bs, ca, cs, da, de, el, en, eo, es, et, eu, fa, fi, fr, gl, he, hi, hr, hu, id, is, it, ja, kk, ko, lt, mk, ml, mr, ne, nl, no, oc, pl, pt, ro, ru, sh, si, sk, sl, sq, sr, sv, sw, ta, te, tl, tr, tt, uk, vi, zh and ar, together with each language's name and family (e.g. Azerbaijani/Turkic, German/Germanic, Spanish/Romance, Japanese/Japonic).]

Table 3: WikiMatrix: number of extracted sentences for each language pair (in thousands), e.g. Es-Ja = 219 corresponds to 219,260 sentences (rounded). The column "size" gives the number of lines in the monolingual texts after deduplication and LID.
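The released bitexts are distributed per language pair as tab-separated files. A minimal loader might look as follows; note that both the column layout (margin score, source sentence, target sentence) and the 1.04 default threshold are assumptions about the released files, not details stated in this excerpt:

```python
def load_bitext(path, min_score=1.04):
    """Load a WikiMatrix-style TSV of (margin score, source, target) lines.

    The column layout and the default threshold are assumptions about the
    released data; adjust them to the actual files.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # split at most twice so tabs never break the target sentence
            score, src, tgt = line.rstrip("\n").split("\t", 2)
            if float(score) >= min_score:
                pairs.append((src, tgt))
    return pairs
```

Raising the threshold trades corpus size (the counts in Table 3) for alignment precision.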

[Table 4 body omitted: a matrix of BLEU scores for NMT systems between 36 source languages (rows) and 36 target languages (columns): ar, bg, bs, cs, da, de, el, en, eo, es, fr, fr-ca, gl, he, hr, hu, id, it, ja, ko, mk, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sr, sv, tr, uk, vi, zh-cn, zh-tw. The best score is 37.3 for pt-br into en.]

Table 4: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only (with at least twenty-five thousand parallel sentences). No other resources were used.
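As a reminder of what the scores in the table measure, corpus-level BLEU can be sketched in a few lines. This is a toy version with whitespace tokenization and no smoothing, not the SacreBLEU implementation (Post, 2018) actually used for the reported numbers:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Toy corpus-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # clip hypothesis n-gram counts by their reference counts
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

SacreBLEU additionally fixes the tokenization and reports a signature so that scores such as these are comparable across papers.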

parameter settings shown in Figure 2 in the appendix. Since the TED development and test sets were already tokenized, we first detokenize them using Moses. We trained NMT systems for all possible language pairs with more than 25k mined sentences. This gives us in total 1886 language pairs in 45 languages. We train L1 → L2 and L2 → L1 with the same mined bitexts L1/L2. Scores on the test sets were computed with SacreBLEU (Post, 2018); see Table 4. Some additional results are reported in Table 6 in the annex. 23 NMT systems achieve BLEU scores over 30, the best one being 37.3 for Brazilian Portuguese to English. Several results are worth mentioning, like Farsi/English: 16.7, Hebrew/English: 25.7, Indonesian/English: 24.9 or English/Hindi: 25.7. We also achieve interesting results for translation between various non-English language pairs for which it is usually not easy to find parallel data, e.g. Norwegian↔Danish ≈33, Norwegian↔Swedish ≈25, Indonesian↔Vietnamese ≈16 or Japanese↔Korean ≈17.

Our results on the TED set give an indication of the quality of the mined parallel sentences. These BLEU scores should of course be appreciated in the context of the sizes of the mined corpora, as given in Table 3. Finally, we would like to point out that we ran our approach on all available languages in Wikipedia, independently of the quality of LASER's sentence embeddings for each one.

6 Conclusion

We have presented an approach to systematically mine for parallel sentences in the textual content of Wikipedia, for all possible language pairs. We use a mining approach based on massively multilingual sentence embeddings (Artetxe and Schwenk, 2018b) and a margin criterion (Artetxe and Schwenk, 2018a). The same approach is used for all language pairs without the need for language-specific optimization. In total, we make available 135M parallel sentences in 96 languages, out of which only 34M sentences are aligned with English. We were able to mine more than ten thousand sentences for 1620 different language pairs. This corpus of parallel sentences is freely available.15 We also performed a large-scale evaluation of the quality of the mined sentences by training 1886 NMT systems and evaluating them on the 45 languages of the TED corpus (Qi et al., 2018). This approach was recently extended to mine parallel sentences in Common Crawl texts (Schwenk et al., 2019).

15 https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
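The margin criterion (Artetxe and Schwenk, 2018a) can be sketched as follows for small corpora. This dense-matrix version is illustrative only: the actual pipeline runs efficient k-NN search (e.g. FAISS) over LASER embeddings rather than building a full similarity matrix, and the function name here is not from the released code:

```python
import numpy as np

def margin_scores(x, y, k=4):
    """Ratio-margin scores between candidate pairs: the cosine similarity
    of (x_i, y_j) divided by the average similarity of each side to its k
    nearest neighbors. Rows of x and y are assumed L2-normalized sentence
    embeddings; a pair is kept if its score exceeds a threshold."""
    sim = x @ y.T                                       # cosine similarity matrix
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # x_i -> its k-NN among y
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # y_j -> its k-NN among x
    return sim / (0.5 * (knn_x[:, None] + knn_y[None, :]))
```

Normalizing by the neighborhood similarity penalizes "hub" sentences that are close to everything, which absolute cosine thresholds handle poorly.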

References

Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In EACL, pages 16–23.

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT: Wikis and blogs and other dynamic text sources.

Ahmad Aghaebrahimian. 2018. Deep neural networks at the service of multilingual parallel sentence extraction. In COLING.

Mikel Artetxe and Holger Schwenk. 2018a. Margin-based parallel corpus mining with multilingual sentence embeddings. arXiv:1811.01136.

Mikel Artetxe and Holger Schwenk. 2018b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv:1812.10464.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2017. Weighted set-theoretic alignment of comparable sentences. In BUCC, pages 41–45.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martínez Garcia. 2018. Extracting parallel sentences from comparable corpora with STACC variants. In BUCC.

Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In BUCC.

Christian Buck and Philipp Koehn. 2016. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation, pages 554–563, Berlin, Germany. Association for Computational Linguistics.

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (WMT).

Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5960–5969.

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, and Josef van Genabith. 2017. An empirical analysis of NMT-derived interlingual embeddings and their use in parallel sentence identification. IEEE Journal of Selected Topics in Signal Processing, pages 1340–1348.

Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-theoretic alignment for comparable corpora. In ACL, pages 2009–2018.

Simon Gottschalk and Elena Demidova. 2017. MultiWiki: Interlingual text passage alignment in Wikipedia. ACM Transactions on the Web (TWEB), 11(1):6.

Francis Grégoire and Philippe Langlais. 2017. BUCC 2017 shared task: A first attempt toward a deep learning framework for identifying parallel sentences in comparable corpora. In BUCC, pages 46–50.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. arXiv:1807.11906.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv:1803.05567.

Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv:1607.01759.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Philipp Koehn, Francisco Guzmán, Vishrav Chaudhary, and Juan M. Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Brussels, Belgium. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

P. Lison and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In LREC.

Mehdi Zadeh Mohammadi and Nasser GhasemAghaee. 2010. Building bilingual parallel corpora based on Wikipedia. In 2010 Second International Conference on Computer Engineering and Applications, pages 264–268.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.

P. Otero, I. López, S. Cilenis, and Santiago de Compostela. 2011. Measuring comparability of multilingual corpora extracted from Wikipedia. Iberian Cross-Language Natural Language Processing Tasks (ICL), page 8.

Pablo Gamallo Otero and Isaac González López. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC, pages 21–25.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Philip Resnik. 1999. Mining the web for bilingual text. In ACL.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349–380.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In ACL, pages 228–234.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2019. CCMatrix: Mining billions of high-quality parallel sentences on the web. arXiv:1911.04944.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In NAACL, pages 403–411.

J. Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Dan Tufiş, Radu Ion, Ştefan Daniel Dumitrescu, and Dan Ştefănescu. 2013. Wikipedia as an SMT training corpus. In RANLP, pages 702–709.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable measures for aligning Japanese-English news articles and sentences. In ACL.

Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv:1902.08564.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In LREC.

A Appendix

Table 5 provides the amounts of mined parallel sentences for languages which have a rather small Wikipedia. Aligning those languages obviously yields only a small amount of parallel sentences. Therefore, we only provide these results for alignment with high-resource languages. It is also likely that several of these alignments are of low quality since the LASER embeddings were not directly trained on most of these languages, but we still hope to achieve reasonable results since other languages of the same family may be covered.

Figure 2 gives the detailed configuration which was used to train NMT models on the mined data in Section 5. A 5,000-subword vocabulary was learned using SentencePiece (Kudo and Richardson, 2018). Decoding was done with beam size 5 and length normalization 1.2.

    --arch transformer
    --share-all-embeddings
    --encoder-layers 5
    --decoder-layers 5
    --encoder-embed-dim 512
    --decoder-embed-dim 512
    --encoder-ffn-embed-dim 2048
    --decoder-ffn-embed-dim 2048
    --encoder-attention-heads 2
    --decoder-attention-heads 2
    --encoder-normalize-before
    --decoder-normalize-before
    --dropout 0.4
    --attention-dropout 0.2
    --relu-dropout 0.2
    --weight-decay 0.0001
    --label-smoothing 0.2
    --criterion label_smoothed_cross_entropy
    --optimizer adam
    --adam-betas '(0.9, 0.98)'
    --clip-norm 0
    --lr-scheduler inverse_sqrt
    --warmup-updates 4000
    --warmup-init-lr 1e-7
    --lr 1e-3
    --min-lr 1e-9
    --max-tokens 4000
    --update-freq 4
    --max-epoch 100
    --save-interval 10

Figure 2: Model settings for NMT training with fairseq.

[Table 5 body omitted: for each of about 45 languages with a small Wikipedia (Aragonese, Egyptian Arabic, Assamese, South Azerbaijani, Bavarian, Bishnupriya, Breton, Chechen, Cebuano, Central Kurdish, Chuvash, Maldivian, Faroese, Western Frisian, Gaelic, Irish, Goan Konkani, Haitian Creole, Iloko, Ido, Javanese, Georgian, Kurdish, Latin, Luxembourgish, Lombard, Malagasy, Eastern Mari, Minangkabau, Mongolian, Mirandese, Low German/Saxon, Pashto, Romansh, Yakut, Sicilian, Sindhi, Sundanese, Turkmen, Tajik, Uighur, Urdu, Walloon, Wu Chinese, Yiddish), its language family, the size of its monolingual corpus, and the number of extracted sentences (in thousands) when aligned with ca, da, de, en, es, fr, it, nl, pl, pt, sv, ru and zh.]

Table 5: WikiMatrix (part 2): number of extracted sentences (in thousands) for languages with a rather small Wikipedia. Alignments with other languages yield less than 5k sentences and are omitted for clarity.

Finally, Table 6 gives the BLEU scores on the TED corpus when translating into and from English for some additional languages.

    Lang   xx -> en   en -> xx
    et       15.9       14.3
    eu       10.1        7.6
    fa       16.7        8.8
    fi       10.9       10.9
    lt       13.7       10.0
    hi       17.8       21.9
    mr        2.6        3.5

Table 6: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only. No other resources were used.
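The decoding settings used in the appendix (beam size 5, length normalization 1.2) rank finished beam hypotheses by length-normalized log-probability. A minimal sketch of that reranking step follows; the helper name is illustrative and not fairseq's API:

```python
def rank_hypotheses(hypotheses, alpha=1.2):
    """Sort finished beam hypotheses by length-normalized score,
    log P(y) / len(y)**alpha, best first. `hypotheses` is a list of
    (tokens, log_prob) pairs; alpha=1.2 matches the decoding setting
    described in the appendix."""
    return sorted(hypotheses,
                  key=lambda h: h[1] / len(h[0]) ** alpha,
                  reverse=True)
```

With alpha > 1 the normalizer grows faster than the hypothesis length, so longer translations are penalized less per token; this counteracts beam search's bias toward short outputs.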
