Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

Yunsu Kim², Hendrik Rosendahl¹,², Nick Rossenbach², Jan Rosendahl², Shahram Khadivi¹, Hermann Ney²
¹eBay, Inc., Aachen, Germany  {hrosendahl, [email protected]}
²RWTH Aachen University, Aachen, Germany  [email protected]

Abstract

We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation. We train a multilayer perceptron on top of the sentence embeddings to extract good bilingual sentence pairs from nonparallel or noisy parallel data. Our approach shows promising performance on sentence alignment recovery and the WMT 2018 parallel corpus filtering tasks with only a single model.

1 Introduction

Data crawling is increasingly important in machine translation (MT), especially for neural network models. Without sufficient bilingual data, neural machine translation (NMT) fails to learn meaningful translation parameters (Koehn and Knowles, 2017). Even for high-resource language pairs, it is common to augment the training data with web-crawled bilingual sentences to improve the translation performance (Bojar et al., 2018).

Using crawled data in MT typically involves two core steps: mining and filtering. Mining of parallel sentences, i.e. aligning source and target sentences, is usually done with lots of heuristics and features: document/URL meta information (Resnik and Smith, 2003; Esplà-Gomis and Forcada, 2009), sentence lengths with a self-induced lexicon (Moore, 2002; Varga et al., 2005; Etchegoyhen and Azpeitia, 2016), word alignment statistics and linguistic tags (Ştefănescu et al., 2012; Kaufmann, 2012).

Filtering aligned sentence pairs also often involves heavy feature engineering (Taghipour et al., 2011; Xu and Koehn, 2017). Most of the participants in the WMT 2018 parallel corpus filtering task use large-scale neural MT models and language models as the features (Koehn et al., 2018).

Bilingual sentence embeddings can be an elegant and unified solution for parallel corpus mining and filtering. They compress the information of each sentence into a single vector, which lies in a shared space between source and target languages. Scoring a source-target sentence pair is done by computing the similarity between the source embedding vector and the target embedding vector. It is much more efficient than scoring by decoding, e.g. with a translation model.

Bilingual sentence embeddings have been studied primarily for transfer learning of monolingual downstream tasks across languages (Hermann and Blunsom, 2014; Pham et al., 2015; Zhou et al., 2016). However, few papers apply them to bilingual corpus mining; many of them require parallel training data with additional pivot languages (España-Bonet et al., 2017; Schwenk, 2018) or lack an investigation into similarity between the embeddings (Guo et al., 2018).

This work solves these issues as follows:

• We propose a simple end-to-end training approach of bilingual sentence embeddings with parallel and monolingual data only of the corresponding language pair.
• We use a multilayer perceptron (MLP) as a trainable similarity measure to match source and target sentence embeddings (see the sketch after this list).
• We compare various similarity measures for embeddings in terms of score distribution, geometric interpretation, and performance in downstream tasks.
• We demonstrate competitive performance in sentence alignment recovery and parallel corpus filtering tasks without a complex combination of translation/language models.
• We analyze the effect of negative examples on training an MLP similarity, using different levels of negativity.
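As a rough illustration of the MLP similarity in the contribution list above, the following is a minimal, hypothetical PyTorch sketch: the source and target sentence embeddings are concatenated and fed to a small feedforward network that outputs a match score, trained on parallel pairs against mismatched negative examples. The class name, layer sizes, and the binary objective are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a trainable MLP similarity over sentence-embedding pairs.
# Assumptions (not the paper's settings): embedding size 512, one hidden layer,
# binary "parallel vs. non-parallel" targets.
import torch
import torch.nn as nn


class MLPSimilarity(nn.Module):
    """Scores a (source, target) embedding pair via a small feedforward network."""

    def __init__(self, emb_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),  # concatenated pair as input
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> torch.Tensor:
        # src_emb, tgt_emb: (batch, emb_dim); returns one score per pair in (0, 1).
        pair = torch.cat([src_emb, tgt_emb], dim=-1)
        return torch.sigmoid(self.net(pair)).squeeze(-1)


# Training step with positive (parallel) and negative (mismatched) pairs:
model = MLPSimilarity()
loss_fn = nn.BCELoss()
src = torch.randn(8, 512)                    # stand-in source sentence embeddings
tgt = torch.randn(8, 512)                    # stand-in target sentence embeddings
labels = torch.randint(0, 2, (8,)).float()   # 1 = parallel, 0 = negative example
loss = loss_fn(model(src, tgt), labels)
loss.backward()
```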
2 Related Work

Bilingual representation of a sentence was at first built by averaging pre-trained bilingual word embeddings (Huang et al., 2012; Klementiev et al., 2012). The compositionality from words to sentences is integrated into end-to-end training in Hermann and Blunsom (2014).

Explicit modeling of a sentence-level bilingual embedding was first discussed in Chandar et al. (2013), training an autoencoder on monolingual sentence embeddings of two languages. Pham et al. (2015) jointly learn bilingual sentence and word embeddings by feeding a shared sentence embedding to n-gram models. Zhou et al. (2016) add document-level alignment information to this model as a constraint in training.

Recently, sequence-to-sequence NMT models were adapted to learn cross-lingual sentence embeddings. Schwenk and Douze (2017) connect multiple source encoders to a shared decoder of a pivot target language, forcing the consistency of encoder representations. Schwenk (2018) extends this work to use a single encoder for many source languages. Both methods rely on N-way parallel training data, which are seriously limited to certain languages and domains. Artetxe and Schwenk (2018b) relax this data condition to pairwise parallel data including the pivot language, but it is still unrealistic for many scenarios (see Section 4.2). In contrast, our method needs only parallel and monolingual data for the source and target languages of concern, without any pivot languages.

Hassan et al. (2018) train a bidirectional NMT model with a single encoder-decoder, taking the average of top-layer encoder states as the sentence embedding. They do not include any details on the data or translation performance before/after the filtering with this embedding. Junczys-Dowmunt (2018) applies this method to the WMT 2018 parallel corpus filtering task, yet shows significantly worse performance than a combination of translation/language models. Our method shows comparable results to such model combinations in the same task.

Guo et al. (2018) replace the decoder with a feedforward network and use the parallel sentences as input to the two encoders. Similarly to our work, the feedforward network measures the similarity of sentence pairs, except that the source and target sentence embeddings are combined via dot product instead of concatenation. Their model, however, does not directly optimize the source and target sentences to be translations of each other; it only attaches two encoders at the output level without a decoder.

Based on the model of Artetxe and Schwenk (2018b), Artetxe and Schwenk (2018a) scale cosine similarity between sentence embeddings by the average similarity of the nearest neighbors. Searching for the nearest neighbors among hundreds of millions of sentences may cause a huge computational problem. On the other hand, our similarity calculation is much quicker and supports batch computation while preserving strong performance in parallel corpus filtering.

None of the above-mentioned methods utilizes monolingual data. We integrate autoencoding into NMT to maximize the usage of parallel and monolingual data together in learning bilingual sentence embeddings.

3 Bilingual Sentence Embeddings

A bilingual sentence embedding function maps sentences from both the source and target language into a single joint vector space. Once we obtain such a space, we can search for a similar target sentence embedding given a source sentence embedding, or vice versa.
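To make this search step concrete, here is a minimal sketch (not from the paper) of retrieving the most similar target sentence embeddings for one source sentence embedding with cosine similarity; the function name and the use of plain NumPy are illustrative assumptions.

```python
import numpy as np


def nearest_target(src_emb: np.ndarray, tgt_embs: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k most similar target embeddings for a source embedding.

    src_emb:  (D,)   source sentence embedding
    tgt_embs: (N, D) candidate target sentence embeddings
    """
    # Cosine similarity = dot product of length-normalized vectors.
    src = src_emb / (np.linalg.norm(src_emb) + 1e-8)
    tgt = tgt_embs / (np.linalg.norm(tgt_embs, axis=1, keepdims=True) + 1e-8)
    scores = tgt @ src                 # (N,) similarity of each candidate
    return np.argsort(-scores)[:k]     # indices of the best-matching targets
```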
3.1 Model

In this work, we learn bilingual sentence embeddings via NMT and autoencoding given parallel and monolingual corpora. Since our purpose is to pair source and target sentences, translation is a natural base task to connect sentences in two different languages. We adopt a basic encoder-decoder approach from Sutskever et al. (2014). The encoder produces a fixed-length embedding of a source sentence, which is used by the decoder to generate the target hypothesis.

First, the encoder takes a source sentence $f_1^J = f_1, \dots, f_j, \dots, f_J$ (length $J$) as input, where each $f_j$ is a source word. It computes hidden representations $h_j \in \mathbb{R}^D$ for all source positions $j$:

$h_1^J = \mathrm{enc}_{\mathrm{src}}(f_1^J)$   (1)

$\mathrm{enc}_{\mathrm{src}}$ is implemented as a bidirectional recurrent neural network (RNN). We denote a target output sentence by $e_1^I = e_1, \dots, e_i, \dots, e_I$ (length $I$). The decoder is a unidirectional RNN whose internal state for a target position $i$ is:

$s_i = \mathrm{dec}(s_{i-1}, e_{i-1})$   (2)

where its initial state $s_0$ is the element-wise max-pooling of the encoder representations.

We also need the sentence embedding of a target sentence without a proper source language input. We can perform decoding with an empty source input and take the last decoder state $s_I$ as the sentence embedding of $e_1^I$, but this is not compatible with the source embedding and contradicts the way in which the model is trained.

Therefore, we attach another encoder of the target language to the same (target) decoder:

$\bar{h}_1^I = \mathrm{enc}_{\mathrm{tgt}}(e_1^I)$   (5)

$s_0 = \big[ \max_{i=1,\dots,I} \bar{h}_{i1}, \; \dots, \; \max_{i=1,\dots,I} \bar{h}_{iD} \big]^{\top}$   (6)

$\mathrm{enc}_{\mathrm{tgt}}$ has the same architecture as $\mathrm{enc}_{\mathrm{src}}$. The model now has an additional information flow from a target input sentence to the same target (output) sentence, also known as a sequential autoencoder (Li et al., 2015).

[Figure 1: Our proposed model for learning bilingual sentence embeddings. A decoder (above) is shared over two encoders (below). The decoder accepts a max-pooled representation from either one of the encoders as its first state $s_0$, depending on the training objective (Equations 7 and 8).]

Figure 1 is a diagram of our model. A decoder is shared between the NMT and autoencoding parts; it takes either a source or a target sentence embedding and does not differentiate between the two when producing an output. The two encoders are constrained to provide mathematically consistent representations over the languages (to the decoder).

Note that our model does not have any attention component (Bahdanau et al., 2014). The attention mechanism in NMT makes the decoder attend to encoder representations at all source positions.
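To summarize the architecture of Section 3.1, the following is a minimal PyTorch sketch of two language-specific encoders sharing one attention-free decoder, with the sentence embedding taken as the element-wise max-pooling over encoder states (in the spirit of Equations 1, 2, 5, and 6). The GRU cells, vocabulary sizes, and dimensions are placeholder assumptions; the paper specifies only bidirectional RNN encoders and a unidirectional RNN decoder.

```python
# Sketch of the model in Figure 1: enc_src and enc_tgt feed one shared decoder.
# All hyperparameters below are illustrative, not the paper's configuration.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional RNN; 2*hidden_dim corresponds to the embedding size D.
        self.rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        states, _ = self.rnn(self.embed(tokens))   # (batch, length, 2*hidden_dim)
        return states.max(dim=1).values            # element-wise max-pooling -> s0


class SharedDecoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 256, state_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, state_dim, batch_first=True)  # no attention
        self.output = nn.Linear(state_dim, vocab_size)

    def forward(self, s0: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        # s0: sentence embedding from either encoder, used as the initial state.
        states, _ = self.rnn(self.embed(prev_tokens), s0.unsqueeze(0))
        return self.output(states)                 # logits over the target vocabulary


# One decoder shared by the translation (src->tgt) and autoencoding (tgt->tgt) paths:
enc_src, enc_tgt = Encoder(vocab_size=8000), Encoder(vocab_size=8000)
dec = SharedDecoder(vocab_size=8000)

src_batch = torch.randint(0, 8000, (4, 20))        # source sentences (token ids)
tgt_batch = torch.randint(0, 8000, (4, 22))        # target sentences (token ids)

# Teacher forcing: the decoder sees the target prefix and predicts the next word.
logits_nmt = dec(enc_src(src_batch), tgt_batch[:, :-1])   # NMT path
logits_ae = dec(enc_tgt(tgt_batch), tgt_batch[:, :-1])    # autoencoding path
```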