Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

Yunsu Kim² Hendrik Rosendahl¹,² Nick Rossenbach² Jan Rosendahl² Shahram Khadivi¹ Hermann Ney²
¹eBay, Inc., Aachen, Germany ({hrosendahl,skhadivi}@ebay.com)
²RWTH Aachen University, Aachen, Germany ({surname}@cs.rwth-aachen.de)

Abstract

We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation. We train a multilayer perceptron on top of the sentence embeddings to extract good bilingual sentence pairs from nonparallel or noisy parallel data. Our approach shows promising performance on sentence alignment recovery and the WMT 2018 parallel corpus filtering tasks with only a single model.

1 Introduction

Data crawling is increasingly important in machine translation (MT), especially for neural network models. Without sufficient bilingual data, neural machine translation (NMT) fails to learn meaningful translation parameters (Koehn and Knowles, 2017). Even for high-resource language pairs, it is common to augment the training data with web-crawled bilingual sentences to improve the translation performance (Bojar et al., 2018).

Using crawled data in MT typically involves two core steps: mining and filtering. Mining of parallel sentences, i.e. aligning source and target sentences, is usually done with lots of heuristics and features: document/URL meta information (Resnik and Smith, 2003; Esplà-Gomis and Forcada, 2009), sentence lengths with a self-induced lexicon (Moore, 2002; Varga et al., 2005; Etchegoyhen and Azpeitia, 2016), or word alignment statistics and linguistic tags (Ştefănescu et al., 2012; Kaufmann, 2012).

Filtering aligned sentence pairs also often involves heavy feature engineering (Taghipour et al., 2011; Xu and Koehn, 2017). Most of the participants in the WMT 2018 parallel corpus filtering task use large-scale neural MT models and language models as the features (Koehn et al., 2018).

Bilingual sentence embeddings can be an elegant and unified solution for parallel corpus mining and filtering. They compress the information of each sentence into a single vector, which lies in a shared space between the source and target languages. Scoring a source-target sentence pair is done by computing the similarity between the source embedding vector and the target embedding vector. This is much more efficient than scoring by decoding, e.g. with a translation model.

Bilingual sentence embeddings have been studied primarily for transfer learning of monolingual downstream tasks across languages (Hermann and Blunsom, 2014; Pham et al., 2015; Zhou et al., 2016). However, few papers apply them to bilingual corpus mining; many require parallel training data with additional pivot languages (Espana-Bonet et al., 2017; Schwenk, 2018) or lack an investigation into the similarity between the embeddings (Guo et al., 2018).

This work solves these issues as follows:

• We propose a simple end-to-end training approach of bilingual sentence embeddings with parallel and monolingual data only of the corresponding language pair.
• We use a multilayer perceptron (MLP) as a trainable similarity measure to match source and target sentence embeddings.
• We compare various similarity measures for embeddings in terms of score distribution, geometric interpretation, and performance in downstream tasks.
• We demonstrate competitive performance in sentence alignment recovery and parallel corpus filtering tasks without a complex combination of translation/language models.

• We analyze the effect of negative examples on training an MLP similarity, using different levels of negativity.

2 Related Work

Bilingual representations of a sentence were at first built by averaging pre-trained bilingual word embeddings (Huang et al., 2012; Klementiev et al., 2012). The compositionality from words to sentences was integrated into end-to-end training in Hermann and Blunsom (2014).

Explicit modeling of a sentence-level bilingual embedding was first discussed in Chandar et al. (2013), training an autoencoder on monolingual sentence embeddings of two languages. Pham et al. (2015) jointly learn bilingual sentence and word embeddings by feeding a shared sentence embedding to n-gram models. Zhou et al. (2016) add document-level alignment information to this model as a constraint in training.

Recently, sequence-to-sequence NMT models have been adapted to learn cross-lingual sentence embeddings. Schwenk and Douze (2017) connect multiple source encoders to a shared decoder of a pivot target language, forcing the encoder representations to be consistent. Schwenk (2018) extends this work to use a single encoder for many source languages. Both methods rely on N-way parallel training data, which is seriously limited to certain languages and domains. Artetxe and Schwenk (2018b) relax this data condition to pairwise parallel data including the pivot language, but this is still unrealistic for many scenarios (see Section 4.2). In contrast, our method needs only parallel and monolingual data for the source and target languages of concern, without any pivot languages.

Hassan et al. (2018) train a bidirectional NMT model with a single encoder-decoder, taking the average of the top-layer encoder states as the sentence embedding. They do not include any details on the data or the translation performance before/after filtering with this embedding. Junczys-Dowmunt (2018) applies this method to the WMT 2018 parallel corpus filtering task, yet shows significantly worse performance than a combination of translation/language models. Our method shows comparable results to such model combinations in the same task.

Guo et al. (2018) replace the decoder with a feedforward network and use the parallel sentences as input to the two encoders. Similarly to our work, the feedforward network measures the similarity of sentence pairs, except that the source and target sentence embeddings are combined via dot product instead of concatenation. Their model, however, does not directly optimize the source and target sentences to be translations of each other; it only attaches two encoders at the output level, without a decoder.

Based on the model of Artetxe and Schwenk (2018b), Artetxe and Schwenk (2018a) scale the cosine similarity between sentence embeddings with the average similarity of the nearest neighbors. Searching for the nearest neighbors among hundreds of millions of sentences may cause a huge computational problem. In contrast, our similarity calculation is much quicker and supports batch computation, while preserving strong performance in parallel corpus filtering.

Neither of the above-mentioned methods utilizes monolingual data. We integrate autoencoding into NMT to maximize the usage of parallel and monolingual data together in learning bilingual sentence embeddings.

3 Bilingual Sentence Embeddings

A bilingual sentence embedding function maps sentences from both the source and target language into a single joint vector space. Once we obtain such a space, we can search for a similar target sentence embedding given a source sentence embedding, or vice versa.

3.1 Model

In this work, we learn bilingual sentence embeddings via NMT and autoencoding, given parallel and monolingual corpora. Since our purpose is to pair source and target sentences, translation is a natural base task to connect sentences in two different languages. We adopt a basic encoder-decoder approach from Sutskever et al. (2014): the encoder produces a fixed-length embedding of a source sentence, which is used by the decoder to generate the target hypothesis.

First, the encoder takes a source sentence f_1^J = f_1, ..., f_j, ..., f_J (length J) as input, where each f_j is a source word. It computes hidden representations h_j ∈ R^D for all source positions j:

    h_1^J = enc_src(f_1^J)    (1)

enc_src is implemented as a bidirectional recurrent neural network (RNN). We denote a target output sentence by e_1^I = e_1, ..., e_i, ..., e_I (length I). The decoder is a unidirectional RNN whose internal state for a target position i is

    s_i = dec(s_{i−1}, e_{i−1})    (2)

where its initial state is the element-wise max-pooling of the encoder representations h_1^J:

    s_0 = maxpool(h_1^J) = [max_{j=1,...,J} h_{j1}, ..., max_{j=1,...,J} h_{jD}]^T    (3)

We empirically found that max-pooling performs much better than averaging or choosing the first (h_1) or last (h_J) representation. Finally, an output layer predicts a target word e_i:

    p_θ(e_i | e_1^{i−1}, f_1^J) = softmax(linear(s_i))    (4)

where θ denotes the set of model parameters. Note that the decoder has access to the source sentence only through s_0, which we take as the sentence embedding of f_1^J. This assumes that the source sentence embedding contains sufficient information for translating to a target sentence, which is desired for a bilingual embedding space.

However, this plain NMT model can generate only source sentence embeddings through the encoder. The decoder cannot process a new target sentence without a proper source language input. We could perform decoding with an empty source input and take the last decoder state s_I as the sentence embedding of e_1^I, but this is not compatible with the source embedding and contradicts the way in which the model is trained.

Therefore, we attach another encoder of the target language to the same (target) decoder:

    h̄_1^I = enc_tgt(e_1^I)    (5)
    s_0 = [max_{i=1,...,I} h̄_{i1}, ..., max_{i=1,...,I} h̄_{iD}]^T    (6)

enc_tgt has the same architecture as enc_src. The model now has an additional information flow from a target input sentence to the same target (output) sentence, also known as a sequential autoencoder (Li et al., 2015).

[Figure 1: Our proposed model for learning bilingual sentence embeddings. A decoder (above) is shared over two encoders (below). The decoder accepts a max-pooled representation from either one of the encoders as its first state s_0, depending on the training objective (Equations 7 and 8).]

Figure 1 is a diagram of our model. A decoder is shared between the NMT and autoencoding parts; it takes either a source or a target sentence embedding and does not differentiate between the two when producing an output. The two encoders are thus constrained to provide mathematically consistent representations over the languages (to the decoder).

Note that our model does not have any attention component (Bahdanau et al., 2014). The attention mechanism in NMT makes the decoder attend to encoder representations at all source positions. This is counterintuitive for our purpose; we need to optimize the encoder to produce a single representation vector, but the attention model allows the encoder to distribute information over many different positions. In our initial experiments, the same model with the attention mechanism showed exorbitantly bad performance, so we removed it in the main experiments of Section 4.
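To make the architecture concrete, the following is a minimal PyTorch sketch of the two-encoder/shared-decoder layout; a GRU stands in for the unspecified RNN cell, and the sizes follow Section 4. Class and method names are ours, not from the authors' implementation.

```python
import torch
import torch.nn as nn

class BilingualSentenceModel(nn.Module):
    """Sketch of Figure 1: two encoders sharing one decoder, no attention."""

    def __init__(self, src_vocab, tgt_vocab, dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        # Bidirectional encoders; two directions of size dim//2 give dim-sized states.
        self.enc_src = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.enc_tgt = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        # Unidirectional decoder without attention (Equation 2).
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)  # output layer of Equation 4

    def embed(self, tokens, side):
        """Sentence embedding: element-wise max-pooling (Equations 3 and 6)."""
        if side == "src":
            states, _ = self.enc_src(self.src_emb(tokens))
        else:
            states, _ = self.enc_tgt(self.tgt_emb(tokens))
        return states.max(dim=1).values            # (batch, dim)

    def forward(self, in_tokens, dec_inputs, side):
        """Logits over target words; dec_inputs are the shifted target tokens."""
        s0 = self.embed(in_tokens, side).unsqueeze(0)   # decoder initial state s_0
        dec_states, _ = self.dec(self.tgt_emb(dec_inputs), s0)
        return self.out(dec_states)                # (batch, length, tgt_vocab)
```

The single `embed` entry point for both languages mirrors the constraint above: whichever encoder is used, the decoder consumes the pooled vector in exactly the same way.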

3.2 Training and Inference

Let θ_enc_src, θ_enc_tgt, and θ_dec be the parameters of the source encoder, the target encoder, and the (shared) decoder, respectively. Given a parallel corpus P and a target monolingual corpus M_tgt, the training criterion of our model is the cross-entropy on two input-output paths. The NMT objective (Equation 7) trains θ_1 = {θ_enc_src, θ_dec}, and the autoencoding objective (Equation 8) trains θ_2 = {θ_enc_tgt, θ_dec}:

    L_emb(θ) = − ∑_{(f_1^J, e_1^I) ∈ P} log p_{θ_1}(e_1^I | f_1^J)    (7)
               − ∑_{e_1^I ∈ M_tgt} log p_{θ_2}(e_1^I | e_1^I)    (8)

where θ = {θ_1, θ_2}. During training, each mini-batch contains examples of both objectives with a 1:1 ratio. In this way, we prevent one encoder from being optimized more than the other, forcing the two encoders to produce balanced sentence embeddings that fit the same decoder.

The autoencoding part can be trained with a separate target monolingual corpus. To provide a stronger training signal for the shared embedding space, we also use the target side of P; the model learns to produce the same target sentence from the corresponding source and target inputs.

In order to guide the training towards bilingual representations, we initialize the word embedding layers with a pre-trained bilingual word embedding. The word embedding for each language is trained with a skip-gram algorithm (Mikolov et al., 2013) and later mapped across the languages with adversarial training (Conneau et al., 2018) and self-dictionary refinements (Artetxe et al., 2017).

Our model can also be built in the opposite direction, i.e. with a target-to-source NMT model and a source autoencoder:

    L_emb(θ) = − ∑_{(f_1^J, e_1^I) ∈ P} log p_{θ_2}(f_1^J | e_1^I)    (9)
               − ∑_{f_1^J ∈ M_src} log p_{θ_1}(f_1^J | f_1^J)    (10)

Once the model is trained, we need only the encoders to query sentence embeddings. Let a and b be the embeddings of a source sentence f_1^J and a target sentence e_1^I, respectively:

    a = maxpool(enc_src(f_1^J))    (11)
    b = maxpool(enc_tgt(e_1^I))    (12)
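A sketch of one training update with the 1:1 mixed mini-batches of Equations 7 and 8, reusing the BilingualSentenceModel sketch from Section 3.1; padding, masking, the BOS token id, and the learning rate schedule of Section 4 are assumptions and are simplified here.

```python
import torch
import torch.nn.functional as F

def shift_right(tokens, bos_id=1):
    """Teacher-forcing decoder input: prepend BOS, drop the last token."""
    bos = torch.full_like(tokens[:, :1], bos_id)
    return torch.cat([bos, tokens[:, :-1]], dim=1)

def train_step(model, optimizer, nmt_batch, ae_batch):
    """One update mixing the NMT objective (Eq. 7) and autoencoding (Eq. 8)."""
    src, tgt = nmt_batch      # parallel pair: source -> target (Eq. 7)
    mono = ae_batch           # target sentence reproduced from itself (Eq. 8)

    logits_nmt = model(src, shift_right(tgt), side="src")
    logits_ae = model(mono, shift_right(mono), side="tgt")

    # F.cross_entropy expects (batch, classes, length) for sequence targets.
    loss = (F.cross_entropy(logits_nmt.transpose(1, 2), tgt)
            + F.cross_entropy(logits_ae.transpose(1, 2), mono))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Summing the two cross-entropy terms inside one update realizes the balanced optimization of the shared decoder described above.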
3.3 Computing Similarities

The next step is to evaluate how close two embeddings are to each other, i.e. to compute a similarity measure between them. In this paper, we consider two types of similarity measures.

Predefined mathematical functions: Cosine similarity is a conventional choice for measuring similarity in vector space modeling for information retrieval or text mining (Singhal, 2001). It computes the angle between two vectors (rotation) and ignores their lengths:

    cos(a, b) = (a · b) / (‖a‖ ‖b‖)    (13)

Euclidean distance indicates how much distance must be traveled to move from the end of one vector to that of the other (transition). We negate this distance to use it as a similarity measure:

    Euclidean(a, b) = −‖a − b‖    (14)

However, these simple measures, i.e. a single rotation or transition, might not be sufficient to define the similarity of complex natural language sentences across different languages. Moreover, the learned joint embedding space is not necessarily perfect in the sense of vector space geometry; even if we train it with a decent algorithm, the structure and quality of the embedding space are highly dependent on the amount of parallel training data and its domain. This might hinder the simple functions from working well for our purpose.

Trainable multilayer perceptron: To model relations of sentence embeddings by combining rotation, shift, and even nonlinear transformations, we train a small multilayer perceptron (MLP) (Bishop et al., 1995) and use it as a similarity measure. We design the MLP network q(a, b) as a simple binary classifier whose input is the concatenation of the source and target sentence embeddings, [a; b]^T. It is passed through feedforward hidden layers with nonlinear activations. The output layer has a single node with sigmoid activation, representing how probable it is that the source and target sentences are translations of each other.

To train this model, we must have positive examples (real parallel sentence pairs, P_pos) and negative examples (nonparallel or noisy sentence pairs, P_neg). The training criterion is the binary cross-entropy:

    L_sim = − ∑_{(a,b) ∈ P_pos} log q(a, b) − ∑_{(a,b) ∈ P_neg} log(1 − q(a, b))    (15)
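Since the MLP is trained with scikit-learn in our experiments (Section 4), the classifier can be sketched as below. The use of MLPClassifier and the array shapes are our assumption of a plausible setup; the two ReLU hidden layers of size 512 follow the stated configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_mlp_similarity(pos_pairs, neg_pairs):
    """Fit q(a, b) of Equation 15 on concatenated embeddings [a; b].

    pos_pairs, neg_pairs: arrays of shape (n, 2*D), one [a; b] row per pair.
    """
    X = np.vstack([pos_pairs, neg_pairs])
    y = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])
    mlp = MLPClassifier(hidden_layer_sizes=(512, 512), activation="relu",
                        max_iter=1000)
    return mlp.fit(X, y)  # minimizes the binary cross-entropy of Equation 15

def mlp_similarity(mlp, a, b):
    """Probability that (a, b) is a parallel pair (the sigmoid output node)."""
    return mlp.predict_proba(np.concatenate([a, b])[None, :])[0, 1]
```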

This naturally fits the main task of interest: parallel corpus filtering (Section 4.2). Note that the output of the MLP can be quite biased towards the extremes (0 or 1) in order to clearly distinguish good from bad examples. This has both advantages and disadvantages, as explained in Section 5.1.

Our MLP similarity can be optimized differently for each embedding space. Furthermore, the user can inject domain-specific knowledge into the MLP similarity by training it only with in-domain parallel data. The resulting MLP would devalue not only nonparallel sentence pairs but also out-of-domain instances.

4 Evaluation

We evaluated our bilingual sentence embedding and the MLP similarity on two tasks: sentence alignment recovery and parallel corpus filtering. The sentence embedding was trained with the WMT 2018 English-German parallel data and 100M German sentences from the News Crawl monolingual data¹, where German is the autoencoded language. All sentences were lowercased and limited to a length of 60. We learned byte pair encoding (Sennrich et al., 2016) jointly for the two languages with 20k merge operations. We pre-trained bilingual word embeddings on 100M sentences from the News Crawl data for each language using FASTTEXT (Bojanowski et al., 2017) and MUSE (Conneau et al., 2018).

Our sentence embedding model has a 1-layer RNN encoder/decoder, where the word embedding and hidden layers have a size of 512. Training was done with stochastic gradient descent with an initial learning rate of 1.0, a batch size of 120 sentences, and a maximum of 800k updates. After 100k updates, we reduced the learning rate by a factor of 0.9 every 50k updates.

Our MLP similarity model has 2 hidden layers of size 512 with ReLU activations (Nair and Hinton, 2010), trained with SCIKIT-LEARN (Pedregosa et al., 2011) for a maximum of 1,000 updates. For the positive training set, we used newstest2007-2015 from WMT (around 21k sentences). Unless otherwise noted, we took a comparable number of negative examples from the worst-scored sentence pairs of the ParaCrawl² English-German corpus, scored with our bilingual sentence embedding and cosine similarity.

Note that the negative examples are selected via cosine similarity, but the similarity values themselves are not used in the MLP training (Equation 15). Thus the MLP does not learn to mimic the cosine similarity function; it yields a new ranking of sentence pairs that also encodes the domain information.

4.1 Sentence Alignment Recovery

In this task, we corrupt the sentence alignments of a parallel test set by shuffling one side and then try to find the original alignments, which is also known as corpus reconstruction (Schwenk and Douze, 2017). Given a source sentence, we compute a similarity score with every possible target sentence in the data and take the top-scored one as the alignment. The error rate is the number of incorrect sentence alignments divided by the total number of sentences. We compute this also in the opposite direction and take the average of the two error rates. It is an intrinsic evaluation for parallel corpus mining. We choose two test sets: WMT newstest2018 (2998 lines) and IWSLT tst2015 (1080 lines).

As baselines, we used character-level Levenshtein distance and length-normalized posterior scores of German→English/English→German NMT models. Each NMT model is a 3-layer base Transformer (Vaswani et al., 2017) trained on the same training data as the sentence embedding.

                                        Error [%]
    Method                            WMT     IWSLT
    Levenshtein distance              37.4    54.6
    NMT de-en + en-de                  1.7    13.3
    Our method (cosine similarity)     4.3    13.8
    Our method (MLP similarity)       89.9    72.6

Table 1: Sentence alignment recovery results. Our method uses cosine similarity except in the last row.

Table 1 shows the results. The Levenshtein distance gives poor performance. The NMT models are better than the other methods, but computing posteriors for all possible pairs of source and target sentences takes too long (about 12 hours for the WMT test set). This is not feasible for a real mining task with hundreds of millions of sentences.

Our bilingual sentence embeddings (with cosine similarity) show error rates close to those of the NMT models, especially on the IWSLT test set. Computing similarities between embeddings is extremely fast (about 3 minutes for the WMT test set), which perfectly fits mining scenarios.

However, the MLP similarity performs badly in aligning sentence pairs. Given a source sentence, it assigns all reasonably similar target sentences a score of 1 and does not precisely distinguish between them. A detailed investigation of this behavior is given in Section 5.1. As we will see, this is, ironically, very effective in parallel corpus filtering.

¹http://www.statmt.org/wmt18/translation-task.html
²https://www.paracrawl.eu/
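The evaluation metric of this section can be sketched in a few lines of NumPy; with cosine similarity the whole score matrix is a single matrix product, which is why scoring takes minutes rather than hours.

```python
import numpy as np

def alignment_error_rate(src_emb, tgt_emb):
    """Average error of both alignment directions (Section 4.1).

    src_emb, tgt_emb: (N, D) arrays whose i-th rows are mutual translations.
    """
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T                                  # (N, N) cosine similarities
    gold = np.arange(len(src_emb))
    err_st = np.mean(sim.argmax(axis=1) != gold)   # source -> target
    err_ts = np.mean(sim.argmax(axis=0) != gold)   # target -> source
    return (err_st + err_ts) / 2
```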

                                                           BLEU [%]
                                                    10M words            100M words
    Method                                        test2017  test2018  test2017  test2018
    Random sampling                                 19.1      23.1      23.2      29.3
    Pivot-based embedding (Schwenk and Douze, 2017) 26.1      32.4      30.0      37.5
    NMT + LM, 4 models (Rossenbach et al., 2018)    29.1      35.2      31.3      38.2
    Our method (cosine similarity)                  23.0      28.4      27.9      34.4
    Our method (MLP similarity)                     29.2      35.4      30.6      37.5

Table 2: Parallel corpus filtering results (German→English).

4.2 Parallel Corpus Filtering

We also test our methods on the WMT 2018 parallel corpus filtering task (Koehn et al., 2018).

Data: The task is to score each line of a very noisy, web-crawled corpus of 104M parallel lines (ParaCrawl English-German). We pre-filtered the given raw corpus with the heuristics of Rossenbach et al. (2018). Only the data for the WMT 2018 English-German news translation task is allowed to train scoring models. The evaluation procedure is: subsample the top-scored lines amounting to 10M or 100M words, train a small NMT model with the subsampled data, and check its translation performance. We follow the official pipeline, except that we train a 3-layer Transformer NMT model using Sockeye (Hieber et al., 2017) for evaluation.

Baselines: We have three comparative baselines: 1) random sampling, 2) a bilingual sentence embedding learned with a third pivot target language (Schwenk and Douze, 2017), and 3) a combination of source-to-target/target-to-source NMT models and source/target LMs (Rossenbach et al., 2018), a top-ranked system in the official evaluation.

Note that the second method violates the official data condition of the task, since it requires parallel data in German-Pivot and English-Pivot. This method is also impractical when learning multilingual embeddings for English and other languages, since it is hard to collect pairwise parallel data involving a non-English pivot language (except among European languages). We trained this method on the N-way parallel UN corpus (Ziemski et al., 2016) with French as the pivot language. The size of this model is the same as that of our autoencoding-based model, except for the word embedding layers.

The results are shown in Table 2, where cosine similarity is used by default for the sentence embedding methods except in the last row. Pivot-based sentence embedding (Schwenk and Douze, 2017) improves upon random sampling, but it has an impractical data condition. The four-model combination of NMT models and LMs (Rossenbach et al., 2018) provides another 1-3% BLEU improvement. Note that, for this third method, each model costs 1-2 weeks to train.

Our bilingual sentence embedding method greatly improves over the random sampling baseline: up to +5.3% BLEU in the 10M-word case and +5.1% BLEU in the 100M-word case. With our MLP similarity, the improvement grows to +12.3% BLEU in the 10M-word case and +8.2% BLEU in the 100M-word case. It significantly outperforms the pivot-based embedding method and gets close to the performance of the four-model combination. Note that we use only a single model trained with only the given parallel/monolingual data for the corresponding language pair, i.e. English-German. In contrast to the sentence alignment recovery experiments, the MLP similarity boosts the filtering performance by a large margin.
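The subsampling step of the official pipeline can be sketched as below; the function name is ours, and whether the word budget is counted on the source or the English side is an assumption here.

```python
def subsample(scored_pairs, word_budget):
    """Keep the top-scored sentence pairs up to a word budget (Section 4.2).

    scored_pairs: iterable of (score, src_sentence, tgt_sentence) tuples;
    word_budget: e.g. 10_000_000 or 100_000_000 words.
    """
    kept, n_words = [], 0
    for score, src, tgt in sorted(scored_pairs, key=lambda p: -p[0]):
        if n_words >= word_budget:
            break
        kept.append((src, tgt))
        n_words += len(tgt.split())  # assumption: count English (target) words
    return kept
```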

5 Analysis

In this section, we provide more in-depth analyses comparing 1) various similarity measures and 2) different choices of the negative training set for the MLP similarity model.

5.1 Similarity Measures

In Table 3, we compare sentence alignment recovery performance with different similarity measures.

                       Error [%]
    Similarity    de-en   en-de   Average
    Euclidean       7.9    99.8     53.8
    Cosine          4.3     4.2      4.3
    CSLS            1.9     2.2      2.1
    MLP            85.0    94.8     89.9

Table 3: Sentence alignment recovery results with different similarity measures (newstest2018).

Euclidean distance shows worse performance than cosine similarity. This means that in a sentence embedding space, we should consider rotation more than transition when comparing two vectors. In particular, the English→German direction gives a peculiarly bad result with Euclidean distance. This is due to a hubness problem in a high-dimensional space, where some vectors are highly likely to be nearest neighbors of many others.

[Figure 2: Schematic diagram of the hubness problem. Filled circles indicate German sentence embeddings, while empty circles denote English sentence embeddings. All embeddings are assumed to be normalized.]

Figure 2 illustrates why Euclidean distance is more prone to hubs than cosine similarity. Assume that the German sentence embeddings a_n and the English sentence embeddings b_n should match each other for the same index n, e.g. (a_1, b_1) is a correct match. With cosine similarity, the nearest neighbor of a_n is always b_n for all n = 1, ..., 4 and vice versa, considering only the angles between the vectors. However, when using Euclidean distance, there is a discrepancy between the German→English and English→German directions: the nearest neighbor of each a_n is b_n, but the nearest neighbor of all b_n is a_4. This leads to a serious performance drop only in English→German. The figure is depicted in a two-dimensional space for simplicity, but the hubness problem becomes worse in an actual high-dimensional space of sentence embeddings.

Cross-domain similarity local scaling (CSLS) was developed to counteract the hubness problem by penalizing similarity values in dense areas of the embedding distribution (Conneau et al., 2018):

    CSLS(a, b) = 2 · cos(a, b)    (16)
                 − (1/K) ∑_{b′ ∈ NN(a)} cos(a, b′)    (17)
                 − (1/K) ∑_{a′ ∈ NN(b)} cos(a′, b)    (18)

where K is the number of nearest neighbors. CSLS outperforms cosine similarity in our experiments. For a large-scale mining scenario, however, the measure requires heavy computation for the penalty terms (Equations 17 and 18), i.e. a nearest-neighbor search over all combinations of source and target sentences and sorting the scores over, e.g., a few hundred million instances.
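Given a precomputed cosine similarity matrix, CSLS (Equations 16-18) can be vectorized as in the following NumPy sketch; for mining at the scale discussed above, one would replace the full sort with an approximate nearest-neighbor index.

```python
import numpy as np

def csls(sim, k=10):
    """CSLS scores from a cosine similarity matrix (Equations 16-18).

    sim: (N_src, N_tgt) matrix with sim[i, j] = cos(a_i, b_j); k is K.
    """
    # Mean similarity to the k nearest neighbors in the other language:
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)  # Eq. 17
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)  # Eq. 18
    return 2.0 * sim - r_src - r_tgt
```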

The MLP similarity does not perform well here, as opposed to its results in parallel corpus filtering. To explain this, we depict the score distributions of cosine and MLP similarity over the ParaCrawl corpus in Figure 3.

[Figure 3: The score distribution of similarity measures, for (a) cosine similarity and (b) MLP similarity. The sentences are sorted by their similarity scores. Cosine similarity values are linearly rescaled to [0, 1].]

With cosine similarity, only a small fraction of the corpus receives low- or high-range scores (smaller than 0.2 or larger than 0.6); the remaining sentences are distributed almost uniformly over the score range in between. The distribution curve of the MLP similarity has a completely different shape: it has a strong tendency to classify a sentence pair as either extremely bad or extremely good. Nearly 80% of the corpus is scored with zero, and only 3.25% gets scores between 0.99 and 1.0. Table 4 shows some example sentence pairs with extreme MLP similarity values.

    (1) German:  the requested URL / dictionary / m / mar eisimpleir.htm was not found on this server.
        English: additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
        Cosine: 0.185   MLP: 0.000
    (2) German:  becoming Prestigious In The Right Way
        English: how I Feel About School
        Cosine: 0.199   MLP: 0.000
    (3) German:  nach dieser Aussage sollte die türkische Armee somit eine internationale Intervention gegen Syrien provozieren.
        English: according to his report, the Turkish army was aiming to provoke an international intervention against Syria.
        Cosine: 0.563   MLP: 1.000
    (4) German:  allen Menschen und Beschäftigten, die um Freiheit kämpfen oder bei Kundgebungen ums Leben kamen, Achtung zu bezeugen und die unverzügliche Freilassung aller Inhaftierten zu fordern
        English: to pay tribute to all people and workers who have been fighting for freedom or fallen in demonstrations and demand the immediate release of all detainees
        Cosine: 0.427   MLP: 0.999

Table 4: Example sentence pairs in the ParaCrawl corpus (Section 4.2) with their similarity values.

This is the reason why the MLP similarity does a good job in filtering, especially in selecting a small portion (10M words) of good parallel sentences. Table 4 compares cosine similarities and MLP scores for some sentence pairs in the raw corpus of our filtering task (Section 4.2). The first two sentence pairs are absolutely nonparallel; both similarity measures give low scores, but the MLP similarity emphasizes the bad quality with zero scores. The third example is a decent parallel sentence pair with a minor ambiguity: his in English may or may not be a translation of dieser in German, depending on the document-level context. Both measures regard this sentence pair as a positive example.

The last example is parallel, but the translation involves severe reordering: long-distance changes in verb positions, switching the order of relative clauses, etc. Here, cosine similarity has trouble rating this case highly even though it is perfectly parallel, eventually filtering it out from the training data. Our MLP similarity, on the other hand, correctly evaluates this difficult case with a nearly perfect score.

However, the MLP is not optimized for precise differentiation among the good parallel matches. It is thus not appropriate for sentence alignment recovery, which requires exact 1-1 matching of potential source-target pairs. The steep drop in the curve of Figure 3b also explains why it performs slightly worse than the best system in the 100M-word filtering task (Table 2): the subsampling exceeds the dropping region and includes many zero-scored sentence pairs, for which the MLP similarity cannot measure quality well.

5.2 Negative Training Examples

In MLP similarity training, we can use publicly available parallel corpora as positive sets. For the negative sets, however, it is not clear which data we should use: entirely nonparallel sentences, partly parallel sentences, or sentence pairs of a quality in between. We experimented with negative examples of different quality levels in Table 5. Here is how we vary the negativity (see the sketch after this list):

1. Score the sentence pairs of the ParaCrawl corpus with our bilingual sentence embedding using cosine similarity.
2. Sort the sentence pairs by the scores.
3. Divide the sorted corpus into five portions by top-scored cuts of 20%, 40%, 60%, 80%, and 100%.
4. Take the last 100k lines of each portion.
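The procedure above can be sketched as follows; the function name and the `scored_corpus` layout are ours, used only for illustration.

```python
def build_negative_sets(scored_corpus, cuts=(0.2, 0.4, 0.6, 0.8, 1.0), n=100_000):
    """Negative sets of varying quality for MLP training (Section 5.2).

    scored_corpus: list of (cosine_score, sentence_pair) tuples.
    For each cut, keep the n worst-scored pairs of the top-scored portion.
    """
    ranked = sorted(scored_corpus, key=lambda p: -p[0])   # best pairs first
    sets = {}
    for cut in cuts:
        portion = ranked[:int(cut * len(ranked))]
        sets[f"{int(cut * 100)}% worst"] = [pair for _, pair in portion[-n:]]
    return sets
```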

A negative set from the 20%-worst part stands for relatively less problematic sentence pairs, intended for elaborate classification between perfect parallel sentences (the positive set) and almost perfect ones. With the 100%-worst examples, we focus on removing absolutely nonsensical pairings of sentences. As a simple baseline, we also take 100k sentence pairs randomly without scoring, representing mixed levels of negativity.

    Negative examples   BLEU [%]
    Random sampling     33.3
    20% worst           29.9
    40% worst           33.3
    60% worst           33.7
    80% worst           32.1
    100% worst          25.7

Table 5: Parallel corpus filtering results (10M-word task) with different negative sets for training the MLP similarity (newstest2016, i.e. the validation set).

The results in Table 5 show that a moderate level of negativity (60%-worst) is most suitable for training an MLP similarity model. If the negative set contains too many excellent examples, the model may mark acceptable parallel sentence pairs with zero scores. If the negative set consists only of certainly nonparallel sentence pairs, the model is weak at discriminating mid-quality instances, some of which are crucial for improving the translation system.

Random selection of sentence pairs also works surprisingly well compared to carefully tailored negative sets. It does not require us to score and sort the raw corpus, so it is very efficient, sacrificing only a little performance. We hypothesize that the average negativity level of this random set is also moderate and similar to that of the 60%-worst set.

6 Conclusion

In this work, we present a simple method to train bilingual sentence embeddings by combining vanilla RNN NMT (without an attention component) and a sequential autoencoder. By optimizing a shared decoder with the combined training objectives, we force the source and target sentence embeddings to share their space. Our model is trained with parallel and monolingual data of the corresponding language pair, with neither pivot languages nor N-way parallel data. We also propose to use a binary classification MLP as a similarity measure for matching source and target sentence embeddings.

Our bilingual sentence embeddings show consistently strong performance on both sentence alignment recovery and the WMT 2018 parallel corpus filtering task with only a single model. We compare various similarity measures for bilingual sentence matching, verifying that cosine similarity is preferable for a mining task while our MLP similarity is very effective in a filtering task. We also show that a moderate level of negativity is appropriate for training the MLP similarity, using either random examples or mid-range scored examples from a noisy parallel corpus.

Future work would be regularizing the MLP training to obtain a smoother distribution of the similarity scores, which could remedy the weakness of the MLP similarity (Section 5.1). Furthermore, we plan to adjust our learning procedure towards the downstream tasks, e.g. with an additional training objective maximizing the cosine similarity between the source and target encoders (Arivazhagan et al., 2019). Our method should also be tested on many other language pairs which do not have parallel data involving a pivot language.

Acknowledgments

This work has received funding from the European Research Council (ERC) (under the European Union's Horizon 2020 research and innovation programme, grant agreement No 694537, project "SEQCLAS"), the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project "CoreTec"), and eBay Inc. The GPU cluster used for the experiments was partially funded by DFG Grant INST 222/1168-1. The work reflects only the authors' views and none of the funding agencies is responsible for any use that may be made of the information it contains.

References

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine translation. arXiv:1903.07091.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), volume 1, pages 451–462.

Mikel Artetxe and Holger Schwenk. 2018a. Margin-based parallel corpus mining with multilingual sentence embeddings. arXiv:1811.01136.

Mikel Artetxe and Holger Schwenk. 2018b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv:1812.10464.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.

Christopher M. Bishop et al. 1995. Neural Networks for Pattern Recognition. Oxford University Press.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Belgium, Brussels.

A. P. Sarath Chandar, Mitesh M. Khapra, Balaraman Ravindran, Vikas Raykar, and Amrita Saha. 2013. Multilingual deep learning. In Deep Learning Workshop at NIPS.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).

Dan Ştefănescu, Radu Ion, and Sabine Hunsicker. 2012. Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation, pages 137–144.

Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, and Josef van Genabith. 2017. An empirical analysis of NMT-derived interlingual embeddings and their use in parallel sentence identification. IEEE Journal of Selected Topics in Signal Processing, 11(8):1340–1350.

Miquel Esplà-Gomis and Mikel L. Forcada. 2009. Bitextor, a free/open-source software to harvest translation memories from multilingual websites. In Proceedings of MT Summit XII, Ottawa, Canada.

Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-theoretic alignment for comparable corpora. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2009–2018.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-hsuan Sung, Brian Strope, et al. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 58–68.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895.

Max Kaufmann. 2012. JMaxAlign: A maximum entropy parallel sentence alignment tool. In Proceedings of COLING 2012: Demonstration Papers, pages 277–288.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the 1st ACL Workshop on Neural Machine Translation (WNMT 2017), pages 28–39.

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1106–1115.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Robert C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Conference of the Association for Machine Translation in the Americas, pages 135–144. Springer.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Hieu Pham, Thang Luong, and Christopher Manning. 2015. Learning distributed representations for multilingual text sequences. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 88–94.

Philip Resnik and Noah A. Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3).

Nick Rossenbach, Jan Rosendahl, Yunsu Kim, Miguel Graça, Aman Gokrani, and Hermann Ney. 2018. The RWTH Aachen University filtering system for the WMT 2018 parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 946–954.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 228–234.

Holger Schwenk and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 157–167.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715–1725.

Amit Singhal. 2001. Modern information retrieval: A brief overview. Bulletin of the Technical Committee on Data Engineering, page 35.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 3104–3112. MIT Press.

Kaveh Taghipour, Shahram Khadivi, and Jia Xu. 2011. Parallel corpus refinement as an outlier detection algorithm. In Proceedings of the 13th Machine Translation Summit (MT Summit XIII), pages 414–421.

Dániel Varga, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel corpora for medium density languages. In Proceedings of RANLP 2005, pages 590–596.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Hainan Xu and Philipp Koehn. 2017. Zipporah: a fast and scalable data cleaning system for noisy web-crawled parallel corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1403–1412.

Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.