Improving Unsupervised Word-by-Word Translation with Language Models and Denoising

Yunsu Kim    Jiahui Geng    Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Aachen, Germany
{surname}@cs.rwth-aachen.de

Abstract

Unsupervised learning of cross-lingual word embedding offers elegant matching of words across languages, but has fundamental limitations in translating sentences. In this paper, we propose simple yet effective methods to improve word-by-word translation of cross-lingual embeddings, using only monolingual corpora but without any back-translation. We integrate a language model for context-aware search, and use a novel denoising autoencoder to handle reordering. Our system surpasses state-of-the-art unsupervised neural translation systems without costly iterative training. We also analyze the effect of vocabulary size and denoising type on the translation performance, which provides better understanding of learning the cross-lingual word embedding and its usage in translation.

1 Introduction

Building a machine translation (MT) system requires lots of bilingual data. Neural MT models (Bahdanau et al., 2015), which have become the current standard, are even more difficult to train without huge bilingual supervision (Koehn and Knowles, 2017). However, bilingual resources are still limited to some of the selected language pairs—mostly from or to English.

A workaround for zero-resource language pairs is translating via an intermediate (pivot) language. To do so, we need to collect parallel data and train MT models for source-to-pivot and pivot-to-target individually; it takes a double effort and the decoding is twice as slow.

Unsupervised learning is another alternative, where we can train an MT system with only monolingual corpora. Decipherment methods (Ravi and Knight, 2011; Nuhn et al., 2013) are the first work in this direction, but they often suffer from a huge latent hypothesis space (Kim et al., 2017).

Recent work by Artetxe et al. (2018) and Lample et al. (2018) trains sequence-to-sequence MT models of both translation directions together in an unsupervised way. They do back-translation (Sennrich et al., 2016a) back and forth for every iteration or batch, which needs an immensely long time and careful tuning of hyperparameters for massive monolingual data.

Here we suggest rather simple methods to build an unsupervised MT system quickly, based on word translation using cross-lingual word embeddings. The contributions of this paper are:

• We formulate a straightforward way to combine a language model with cross-lingual word similarities, effectively considering context in lexical choices.
• We develop a postprocessing method for word-by-word translation outputs using a denoising autoencoder, handling local reordering and multi-aligned words.
• We analyze the effect of different artificial noises for the denoising model and propose a novel noise type.
• We verify that cross-lingual embedding on subword units performs poorly in translation.
• We empirically show that a cross-lingual mapping can be learned using a small vocabulary without losing the translation performance.

The proposed models can be efficiently trained with off-the-shelf software with little or no changes in the implementation, using only monolingual data. The provided analyses help for better learning of cross-lingual word embeddings for translation purposes. Altogether, our unsupervised MT system outperforms the sequence-to-sequence neural models even without training signals from the opposite translation direction, i.e. via back-translation.

2 Cross-lingual Word Embedding

As a basic step for unsupervised MT, we learn a word translation model from monolingual corpora of each language. In this work, we exploit cross-lingual word embedding for word-by-word translation, which is state-of-the-art in terms of type translation quality (Artetxe et al., 2017; Conneau et al., 2018).

Cross-lingual word embedding is a continuous representation of words whose vector space is shared across multiple languages. This enables distance calculation between word embeddings across languages, which is actually finding translation candidates.

We train cross-lingual word embedding in a fully unsupervised manner:

1. Learn monolingual source and target embeddings independently. For this, we run the skip-gram algorithm augmented with character n-grams (Bojanowski et al., 2017).
2. Find a linear mapping from source embedding space to target embedding space by adversarial training (Conneau et al., 2018). We do not pre-train the discriminator with a seed dictionary, and consider only the top V_cross-train words of each language as input to the discriminator.

Once we have the cross-lingual mapping, we can transform the embedding of a given source word and find a target word with the closest embedding, i.e. nearest neighbor search. Here, we apply cross-domain similarity local scaling (Conneau et al., 2018) to penalize the word similarities in dense areas of the embedding distribution.

We further refine the mapping obtained from Step 2 as follows (Artetxe et al., 2017):

3. Build a synthetic dictionary by finding mutual nearest neighbors for both translation directions in vocabularies of V_cross-train words.
4. Run a Procrustes problem solver with the dictionary from Step 3 to re-train the mapping (Smith et al., 2017).
5. Repeat Steps 3 and 4 for a fixed number of iterations to update the mapping further.

3 Sentence Translation

In translating sentences, cross-lingual word embedding has several drawbacks. We describe each of them and our corresponding solutions.

3.1 Context-aware Beam Search

Word translation using nearest neighbor search does not consider the context around the current word. In many cases, the correct translation is not the nearest target word but another close word with morphological variations or synonyms, depending on the context.

The reasons are two-fold: 1) Word embedding is trained to place semantically related words nearby, even though they have opposite meanings. 2) The hubness problem of high-dimensional embedding space hinders a correct search, where lots of different words happen to be close to each other (Radovanović et al., 2010).

In this paper, we integrate context information into word-by-word translation by combining a language model (LM) with cross-lingual word embedding. Let f be a source word in the current position and e a possible target word. Given a history h of target words before e, the score of e being the translation of f is:

    L(e; f, h) = λ_emb · log q(f, e) + λ_LM · log p(e|h)

Here, q(f, e) is a lexical score defined as:

    q(f, e) = (d(f, e) + 1) / 2

where d(f, e) ∈ [−1, 1] is the cosine similarity between f and e. It is transformed to the range [0, 1] to make it similar in scale to the LM probability. In our experiments, we found that this simple linear scaling is better than sigmoid or softmax functions in the final translation performance.

Accumulating the scores per position, we perform a beam search to allow only reasonable translation hypotheses.
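Below is a minimal sketch (not the authors' code) of one step of this beam search. It assumes pre-mapped source and target embeddings as numpy arrays and an abstract lm_logprob(history, word) callable standing in for the LM; candidate retrieval uses plain cosine similarity, whereas the paper additionally applies CSLS to the similarities. The function and parameter names (beam_step, lam_emb, lam_lm, topk) are hypothetical.

```python
import numpy as np

def lexical_score(src_vec, tgt_vec):
    """q(f, e) = (d(f, e) + 1) / 2, with d the cosine similarity between f and e."""
    d = np.dot(src_vec, tgt_vec) / (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec))
    return (d + 1.0) / 2.0

def beam_step(beam, src_vec, tgt_vocab, tgt_embs, lm_logprob,
              lam_emb=1.0, lam_lm=0.1, beam_size=10, topk=100):
    """Extend each hypothesis by one target word for the current source word.

    beam      : list of (history, score) pairs, history being a tuple of target words
    tgt_vocab : list of target words; tgt_embs the matching embedding matrix (mapped space)
    lm_logprob: callable returning log p(word | history), e.g. backed by an n-gram LM
    """
    # Retrieve the top-k closest target words as translation candidates.
    sims = tgt_embs @ src_vec / (np.linalg.norm(tgt_embs, axis=1) * np.linalg.norm(src_vec))
    candidates = np.argsort(-sims)[:topk]

    expanded = []
    for history, acc in beam:
        for idx in candidates:
            e = tgt_vocab[idx]
            # L(e; f, h) = lam_emb * log q(f, e) + lam_lm * log p(e | h)
            score = (lam_emb * np.log(lexical_score(src_vec, tgt_embs[idx]) + 1e-9)
                     + lam_lm * lm_logprob(history, e))
            expanded.append((history + (e,), acc + score))

    # Prune to the best hypotheses, as in the context-aware beam search above.
    expanded.sort(key=lambda h: h[1], reverse=True)
    return expanded[:beam_size]
```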
3.2 Denoising

Even when we have correctly translated words for each position, the output is still far from an acceptable translation. We adopt a sequence denoising autoencoder (Hill et al., 2016) to improve the translation output of Section 3.1. The main idea is to train a sequence-to-sequence neural network model that takes a noisy sentence as input and produces a (denoised) clean sentence as output, both of which are of the same (target) language. The model was originally proposed to learn sentence embeddings, but here we use it directly to actually remove noise in a sentence.

Training label sequences for the denoising network would be target monolingual sentences, but we do not have their noisy versions at hand. Given a clean target sentence, the noisy input should ideally be the word-by-word translation of the corresponding source sentence. However, such bilingual sentence alignment is not available in our unsupervised setup. Instead, we inject artificial noise into a clean sentence to simulate the noise of word-by-word translation. We design different noise types after the following aspects of word-by-word translation.

3.2.1 Insertion

Word-by-word translation always outputs a target word for every position. However, there are plenty of cases where multiple source words should be translated to a single target word, or where some source words should rather not be translated to any word to make a fluent output. For example, the German sentence "Ich höre zu." would be translated to "I'm listening to." by a word-by-word translator, but "I'm listening." is more natural in English (Figure 1).

[Figure 1: Example of denoising an insertion noise. Word-by-word: "Ich höre zu" → "I'm listening to"; denoising: "I'm listening to" → "I'm listening".]

We pretend to have extra target words which might be translations of redundant source words, by inserting random target words into a clean sentence:

1. For each position i, sample a probability p_i ∼ Uniform(0, 1).
2. If p_i < p_ins, sample a word e from the V_ins most frequent target words and insert it before position i.

We limit the inserted words to the V_ins most frequent words because target insertion occurs mostly with common words, e.g. prepositions or articles, as in the example above. We insert words only before—not after—a position, since an extra word after the ending word (usually a punctuation mark) is not probable.

3.2.2 Deletion

Similarly, word-by-word translation cannot handle the contrary case: when a source word should be translated into more than one target word, or when a target word should be generated from no source word for fluency. For example, the German word "im" must become "in the" in English, but word translation generates only one of the two English words. Another example is shown in Figure 2.

[Figure 2: Example of denoising a deletion noise. Word-by-word: "eine der besten" → "one the best"; denoising: "one the best" → "one of the best".]

To simulate such situations, we drop some words randomly from a clean target sentence (Hill et al., 2016):

1. For each position i, sample a probability p_i ∼ Uniform(0, 1).
2. If p_i < p_del, drop the word in position i.

3.2.3 Reordering

Also, translations generated word-by-word are not in the order of the target language. In our beam search, the LM only assists in choosing the right word in context but does not modify the word order. A common reordering problem of German→English is illustrated in Figure 3.

[Figure 3: Example of denoising the reordering noise. Word-by-word: "was ich gesagt habe" → "what I said have"; denoising: "what I said have" → "what I have said".]

From a clean target sentence, we corrupt its word order by random permutations. We limit the maximum distance between an original position and its new position like Lample et al. (2018):

1. For each position i, sample an integer δ_i from [0, d_per].
2. Add δ_i to index i and sort the incremented indices i + δ_i in increasing order.
3. Rearrange the words to be in the new positions, to which their original indices have moved by Step 2.

This is a generalized version of swapping two neighboring words (Hill et al., 2016). Reordering is highly dependent on each language, but we found that this noise is generally close to word-by-word translation outputs.
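The three noise procedures are simple enough to sketch directly. The following is a minimal, hypothetical implementation of the insertion, deletion, and permutation noises described above; the function names and the small common_words list standing in for the V_ins most frequent target words are placeholders, not the authors' code.

```python
import random

def insertion_noise(words, p_ins=0.1, common_words=("the", "a", "of", "to", "in")):
    """With probability p_ins per position, insert a frequent target word before it."""
    out = []
    for w in words:
        if random.random() < p_ins:
            out.append(random.choice(common_words))  # stand-in for the V_ins most frequent words
        out.append(w)
    return out

def deletion_noise(words, p_del=0.1):
    """Drop each word independently with probability p_del."""
    return [w for w in words if random.random() >= p_del]

def reordering_noise(words, d_per=3):
    """Local permutation: index i moves according to the key i + delta_i, delta_i in [0, d_per]."""
    keys = [i + random.randint(0, d_per) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: keys[i])
    return [words[i] for i in order]

def make_noisy(sentence, p_ins=0.1, p_del=0.1, d_per=3):
    """Compose the three noises to simulate a word-by-word translation of a clean sentence."""
    words = sentence.split()
    words = insertion_noise(words, p_ins)
    words = deletion_noise(words, p_del)
    words = reordering_noise(words, d_per)
    return " ".join(words)

# Example: a noisy/clean training pair for the denoiser.
clean = "one of the best examples in the training data"
print(make_noisy(clean), "->", clean)
```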

System                   de-en      en-de      fr-en      en-fr
                         BLEU [%]   BLEU [%]   BLEU [%]   BLEU [%]
Word-by-Word             11.1        6.7       10.6        7.8
 + LM                    14.5        9.9       13.6       10.9
 + Denoising             17.2       11.0       16.5       13.9
(Lample et al., 2018)    13.3        9.6       14.3       15.1
(Artetxe et al., 2018)    -           -        15.6       15.1

Table 1: Translation results on German↔English newstest2016 and French↔English newstest2014. Beam size is 10 and top 100 words are considered in the nearest neighbor search.

Insertion, deletion, and reordering noises were applied to each mini-batch with different random seeds, allowing the model to see various noisy versions of the same clean sentence over the epochs.

Note that the deletion and permutation noises are integrated in the neural MT training of Artetxe et al. (2018) and Lample et al. (2018) as additional training objectives, whereas we optimize an independent model solely for denoising without architecture changes. This allows us to easily train a larger network with more data. The insertion noise is of our original design, which we found to be the most effective (Section 4.1).

4 Experiments

We applied the proposed methods to the WMT 2016 German↔English task and the WMT 2014 French↔English task. For German/English, we trained word embeddings with 100M sentences sampled from the News Crawl 2014-2017 monolingual corpora. For French, we used News Crawl 2007-2014 (around 42M sentences). The data was lowercased and filtered to a maximum sentence length of 100. German compound words were split beforehand. Numbers were replaced with category labels and recovered after decoding by looking at the source sentence. Also, frequent casing was applied to the translation output.

fasttext (Bojanowski et al., 2017) was used to learn monolingual embeddings for only the words with minimum count 10. MUSE (Conneau et al., 2018) was used for cross-lingual mappings with V_cross-train = 100k and 10 refinement iterations (Steps 3-5 in Section 2). Other parameters follow the values in Conneau et al. (2018). With the same data, we trained 5-gram count-based LMs using KenLM (Heafield, 2011) with its default settings.

Denoising autoencoders were trained using Sockeye (Hieber et al., 2017) on News Crawl 2016 for German/English and News Crawl 2014 for French. We considered only the top 50k frequent words for each language and mapped other words to <unk>. The unknowns in the denoised output were replaced with missing words from the noisy input by a simple line search. We used a 6-layer Transformer encoder/decoder (Vaswani et al., 2017) for the denoisers, with embedding/hidden layer size 512, feedforward sublayer size 2048 and 8 attention heads.

As a validation set for the denoiser training, we used newstest2015 (German↔English) or newstest2013 (French↔English), where the input and output sides both have the same clean target sentences, encouraging the denoiser to keep at least the clean part of word-by-word translations. Here, a noisy input showed a slight degradation of performance; the model seemed to overfit to specific noises in the small validation set.

Optimization of the denoising models was done with Adam (Kingma and Ba, 2015): initial learning rate 0.0001, checkpoint frequency 4000, no learning rate warmup, multiplying the learning rate by 0.7 when the perplexity on the validation set did not improve for 3 checkpoints. We stopped the training if it did not improve for 8 checkpoints.
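For reference, a rough sketch of how the monolingual embeddings and the n-gram LM in this setup might be built and queried with the fasttext and kenlm Python bindings is shown below. File names and the embedding dimension are placeholders, and the cross-lingual mapping itself (adversarial training plus refinement) is assumed to be learned separately with the MUSE toolkit.

```python
import fasttext
import kenlm

# Monolingual embeddings: skip-gram with character n-grams (fasttext),
# keeping only words with minimum count 10 as described above.
de_emb = fasttext.train_unsupervised("news.de.txt", model="skipgram",
                                     dim=300, minCount=10)
en_emb = fasttext.train_unsupervised("news.en.txt", model="skipgram",
                                     dim=300, minCount=10)
de_emb.save_model("emb.de.bin")
en_emb.save_model("emb.en.bin")

# The cross-lingual mapping (adversarial training + Procrustes refinement)
# is learned on top of these embeddings with the MUSE toolkit, run separately.

# 5-gram count-based LM built with KenLM's lmplz, then queried from Python.
# The .arpa file is assumed to have been produced beforehand, e.g.:
#   lmplz -o 5 < news.en.txt > lm.en.arpa
lm = kenlm.Model("lm.en.arpa")
print(lm.score("this is a test", bos=True, eos=True))  # log10 probability
```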

Table 1 shows the results. The LM improves the word-by-word baselines consistently in all four tasks, giving at least +3% BLEU. When our denoising model is applied on top of it, we gain around an additional +3% BLEU. Note that our methods do not involve any decoding steps to generate pseudo-parallel training data, but still perform better than the unsupervised MT systems that rely on repetitive back-translation (Artetxe et al., 2018; Lample et al., 2018), by up to +3.9% BLEU. The total training time of our method is only 1-2 days with a single GPU.

4.1 Ablation Study: Denoising

To examine the effect of each noise type in the denoising autoencoder, we tuned each parameter of the noise and combined them incrementally (Table 2).

d_per   p_del   V_ins   BLEU [%]
2       -       -       14.7
3       -       -       14.9
5       -       -       14.9
3       0.1     -       15.7
3       0.3     -       15.1
3       0.1     10      16.8
3       0.1     50      17.2
3       0.1     500     16.8
3       0.1     5000    16.5

Table 2: Translation results with different values of denoising parameters for German→English.

Firstly, for permutations, a significant improvement is achieved from d_per = 3, since a local reordering usually involves a sequence of 3 to 4 words. With d_per > 5, it shuffles too many consecutive words together, yielding no further improvement. This noise cannot handle long-range reordering, which is usually a swap of words that are far from each other, keeping the words in the middle as they are.

Secondly, we applied the deletion noise with different values of p_del. A value of 0.1 gives +0.8% BLEU, but we immediately see a degradation with larger values; it is hard to observe one-to-many translations more than once in each sentence pair.

Finally, we optimized V_ins for the insertion noise, fixing p_ins = 0.1. Increasing V_ins is generally not beneficial, since it provides too much variation in the inserted word; it might not be related to its neighboring words. Overall, we observe the best result (+1.5% BLEU) with V_ins = 50.

4.2 Ablation Study: Vocabulary

We also examined how the translation performance varies with different vocabularies of cross-lingual word embedding in Table 3.

Vocabulary                      BLEU [%]
BPE    20k merges               10.4
BPE    50k merges               12.5
BPE    100k merges              13.0
Word   V_cross-train = 20k      14.4
Word   V_cross-train = 50k      14.4
Word   V_cross-train = 100k     14.5
Word   V_cross-train = 200k     14.4

Table 3: Translation results with different vocabularies for German→English.

The first three rows show that BPE embeddings perform worse than word embeddings, especially with smaller vocabulary sizes. For small BPE tokens (1-3 characters), the contexts they meet during the embedding training are much more varied than for a complete word, and a direct translation of such a small token to a BPE token of another language would be very ambiguous.

For word-level embeddings, we compared different vocabulary sizes used for training the cross-lingual mapping (the second step in Section 2). Surprisingly, cross-lingual word embedding learned only on the top 20k words is comparable to that of 200k words in translation quality. We also increased the search vocabulary to more than 200k, but the performance only degrades. This means that word-by-word translation with cross-lingual embedding depends highly on the frequent word mappings, and learning the mapping between rare words does not have a positive effect.

5 Conclusion

In this paper, we proposed a simple pipeline to greatly improve sentence translation based on cross-lingual word embedding. We achieved context-aware lexical choices using beam search with an LM, and solved insertion/deletion/reordering problems using a denoising autoencoder. Our novel insertion noise shows a promising performance even combined with other noise types. Our methods do not need back-translation steps but still outperform costly unsupervised neural MT systems. In addition, we proved that for general translation purposes, an effective cross-lingual mapping can be learned using only a small set of frequent words, not on subword units.

Acknowledgments

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 694537 (SEQCLAS). The GPU computing cluster was partially funded by Deutsche Forschungsgemeinschaft (DFG) under grant INST 222/1168-1 FUGG. The work reflects only the authors' views and neither ERC nor DFG is responsible for any use that may be made of the information it contains.

References

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 937–947, Valencia, Spain.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 451–462, Vancouver, Canada.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2008), pages 771–779, Columbus, OH, USA.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In EMNLP 6th Workshop on Statistical Machine Translation (WMT 2011), pages 187–197, Edinburgh, Scotland.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pages 1367–1377, San Diego, CA, USA.

Yunsu Kim, Julian Schamper, and Hermann Ney. 2017. Unsupervised training for large vocabulary translation using sparse lexicon and word classes. In 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 650–656, Valencia, Spain.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In ACL 2017 1st Workshop on Neural Machine Translation (NMT 2017), pages 28–39, Vancouver, Canada.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT 2003), pages 48–54, Edmonton, Canada.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, Canada.

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pages 1569–1576, Sofia, Bulgaria.

Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. On the existence of obstinate results in vector space models. In 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), pages 186–193, Geneva, Switzerland.

Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 12–21, Portland, OR, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 86–96, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725, Berlin, Germany.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 5998–6008, Long Beach, CA, USA.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 1959–1970, Vancouver, Canada.
