Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences

Xiangyu Duan¹, Baijun Ji¹, Hao Jia¹, Min Tan¹, Min Zhang¹∗, Boxing Chen², Weihua Luo², Yue Zhang³
¹ Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University
² Alibaba DAMO Academy
³ School of Engineering, Westlake University
{xiangyuduan,minzhang}@suda.edu.cn; {bjji,hjia,mtan2017}@stu.suda.edu.cn; {boxing.cbx,weihua.luowh}@alibaba-inc.com; [email protected]
∗ Corresponding Author.

Abstract

In this paper, we propose a new task of machine translation (MT), which is based on no parallel sentences but can refer to a ground-truth bilingual dictionary. Motivated by the ability of a monolingual speaker learning to translate via looking up the bilingual dictionary, we propose the task to see how much potential an MT system can attain using the bilingual dictionary and large-scale monolingual corpora, while being independent of parallel sentences. We propose Anchored Training (AT) to tackle the task. AT uses the bilingual dictionary to establish anchoring points for closing the gap between the source language and the target language. Experiments on various language pairs show that our approaches are significantly better than various baselines, including dictionary-based word-by-word translation, dictionary-supervised cross-lingual word embedding transformation, and unsupervised MT. On distant language pairs that are hard for unsupervised MT to perform well on, AT performs remarkably better, achieving performances comparable to supervised SMT trained on more than 4M parallel sentences.¹

¹ Code is available at https://github.com/mttravel/Dictionary-based-MT

1 Introduction

Motivated by a monolingual speaker acquiring translation ability by referring to a bilingual dictionary, we propose a novel MT task in which no parallel sentences are available, while a ground-truth bilingual dictionary and large-scale monolingual corpora can be utilized. This task departs from the unsupervised MT task, in which no parallel resources, including the ground-truth bilingual dictionary, are allowed (Artetxe et al., 2018c; Lample et al., 2018b). This task is also distinct from the supervised/semi-supervised MT task, which mainly depends on parallel sentences (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018; Sennrich et al., 2016a).

The bilingual dictionary is often utilized as a seed in bilingual lexicon induction (BLI), which aims to induce more word pairs within the language pair (Mikolov et al., 2013). Another utilization of the bilingual dictionary is for translating low-frequency words in supervised NMT (Arthur et al., 2016; Zhang and Zong, 2016). We are the first to utilize the bilingual dictionary and large-scale monolingual corpora to see how much potential an MT system can achieve without using parallel sentences. This differs from using artificial bilingual dictionaries generated by unsupervised BLI to initialize an unsupervised MT system (Artetxe et al., 2018c,b; Lample et al., 2018a); we use the ground-truth bilingual dictionary and apply it throughout the training process.

We propose Anchored Training (AT) to tackle this task. Since word representations are learned over monolingual corpora without any parallel sentence supervision, the representation distances between the source language and the target language are often quite large, leading to significant translation difficulty. As one solution, AT selects words covered by the bilingual dictionary as anchoring points to drive the source language space and the target language space closer, so that translation between the two languages becomes easier. Furthermore, we propose Bi-view AT, which places anchors based on either the source language view or the target language view, and combines both views to enhance translation quality.

Experiments on various language pairs show that AT performs significantly better than various baselines, including word-by-word translation through looking up the dictionary, unsupervised MT, and dictionary-supervised cross-lingual word embedding transformation that makes the distances between both languages closer. Bi-view AT further improves AT performance due to the mutual strengthening of both views of the monolingual data. When combined with cross-lingual pretraining (Lample and Conneau, 2019), Bi-view AT achieves performances comparable to traditional SMT systems trained on more than 4M parallel sentences. The main contributions of this paper are as follows:

• A novel MT task is proposed which can only use the ground-truth bilingual dictionary and monolingual corpora, while being independent of parallel sentences.

• AT is proposed as a solution to the task. AT uses the bilingual dictionary to place anchors that encourage the monolingual spaces of both languages to become closer, so that translation becomes easier.

• A detailed evaluation on various language pairs shows that AT, especially Bi-view AT, performs significantly better than various methods, including word-by-word translation, unsupervised MT, and cross-lingual embedding transformation. On distant language pairs where unsupervised MT struggles to be effective, AT and Bi-view AT perform remarkably better.

2 Related Work

The bilingual dictionaries used in previous works are mainly for bilingual lexicon induction (BLI), which independently learns the embedding in each language using monolingual corpora, and then learns a transformation from one embedding space to another by minimizing the squared Euclidean distances between all word pairs in the dictionary (Mikolov et al., 2013; Artetxe et al., 2016). Later efforts for BLI include optimizing the transformation further through new training objectives, constraints, or normalizations (Xing et al., 2015; Lazaridou et al., 2015; Zhang et al., 2016; Artetxe et al., 2016; Smith et al., 2017; Faruqui and Dyer, 2014; Lu et al., 2015).
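For concreteness, the dictionary-supervised transformation just described can be sketched in a few lines. This is a minimal illustration, assuming the embeddings are given as plain word-to-vector dictionaries; the function names and the exact least-squares solve are our choices for the sketch, not details taken from the cited works:

    import numpy as np

    def learn_linear_map(dict_pairs, src_emb, tgt_emb):
        # Learn W minimizing sum ||W x_src - y_tgt||^2 over dictionary
        # pairs, i.e. the BLI objective of Mikolov et al. (2013).
        pairs = [(s, t) for s, t in dict_pairs if s in src_emb and t in tgt_emb]
        X = np.stack([src_emb[s] for s, _ in pairs])   # n x d_src
        Y = np.stack([tgt_emb[t] for _, t in pairs])   # n x d_tgt
        # lstsq solves min ||X A - Y||_F^2; the map W is its transpose.
        A, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return A.T                                     # d_tgt x d_src

    def induce_translation(word, W, src_emb, tgt_emb):
        # Map the source vector into target space and return the nearest
        # target word by cosine similarity, inducing a new word pair.
        q = W @ src_emb[word]
        q = q / np.linalg.norm(q)
        def score(t):
            return q @ tgt_emb[t] / np.linalg.norm(tgt_emb[t])
        return max(tgt_emb, key=score)

A transformation of this kind serves as the dictionary-supervised cross-lingual word embedding transformation baseline in our experiments.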
Besides, the bilingual dictionary is also used for supervised NMT, which requires large-scale parallel sentences (Arthur et al., 2016; Zhang and Zong, 2016). To our knowledge, we are the first to use the bilingual dictionary for MT without using any parallel sentences.

Our work is closely related to unsupervised NMT (UNMT) (Artetxe et al., 2018c; Lample et al., 2018b; Yang et al., 2018; Sun et al., 2019), which does not use parallel sentences either. The difference is that UNMT may use an artificial dictionary generated by unsupervised BLI for initialization (Artetxe et al., 2018c; Lample et al., 2018a), or abandon the artificial dictionary by using joint BPE so that multiple BPE units can be shared by both languages (Lample et al., 2018b). We use the ground-truth dictionary instead and apply it throughout a novel training process. Moreover, UNMT only works well on close language pairs such as English-French, while performing remarkably badly on distant language pairs, for which aligning the embeddings of the two languages is quite challenging. We use the ground-truth dictionary to alleviate this problem, and experiments on distant language pairs show the necessity of using the bilingual dictionary.

Other utilizations of the bilingual dictionary for tasks beyond MT include cross-lingual dependency parsing (Xiao and Guo, 2014), unsupervised cross-lingual part-of-speech tagging and semi-supervised cross-lingual super sense tagging (Gouws and Søgaard, 2015), multilingual word embedding training (Ammar et al., 2016; Duong et al., 2016), and transfer learning for low-resource language modeling (Cohn et al., 2017).

3 Our Approach

There are multiple freely available bilingual dictionaries, such as the Muse dictionary² (Conneau et al., 2018), Wiktionary³, and PanLex⁴. We adopt the Muse dictionary, which contains 110 large-scale ground-truth bilingual dictionaries.

² https://github.com/facebookresearch/MUSE
³ https://en.wiktionary.org/wiki/Wiktionary:Main_Page
⁴ https://panlex.org/

We propose to inject the bilingual dictionary into MT training by placing anchoring points on the large-scale monolingual corpora, driving the semantic spaces of both languages closer so that MT training without parallel sentences becomes easier. We present the proposed Anchored Training (AT) and Bi-view AT in the following.
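As a rough illustration of the data side of this idea, the sketch below loads a Muse-style dictionary (one whitespace-separated word pair per line, the format of the released Muse files) and replaces covered words in a monolingual sentence with their translations, creating anchoring points. The function names, the random choice among multiple translations, and the anchoring probability are illustrative assumptions, not settings from this paper:

    import random
    from collections import defaultdict

    def load_dictionary(path):
        # Muse ground-truth dictionaries list one "source target" pair
        # per line; a source word may carry several translations.
        translations = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                src, tgt = line.split()[:2]
                translations[src].append(tgt)
        return translations

    def anchor_sentence(tokens, translations, anchor_prob=0.5):
        # Replace dictionary-covered words with a translation, turning
        # them into anchoring points shared by both languages.
        return [random.choice(translations[w])
                if w in translations and random.random() < anchor_prob
                else w
                for w in tokens]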
3.1 Anchored Training (AT)

Since word embeddings are trained on monolingual corpora independently, the embedding spaces of both languages are quite different, leading to significant translation difficulty. AT forces the words of a translation pair to share the same word embedding as an anchor. We place multiple anchors by selecting words covered by the bilingual dictionary. With stable anchors, the embedding spaces of both languages become closer.

Figure 1: Illustration of (a) AT and (b) Bi-view AT. We use a source language sentence "s1 s2 s3 s4" and a target language sentence "t1 t2 t3 t4 t5" from the large-scale monolingual corpora as an example. An anchoring point replaces a word with its translation based on the bilingual dictionary. Thin arrows denote NMT decoding, thick arrows denote training an NMT model, and dashed arrows denote generating the anchored sentence based on the dictionary. Words with primes, such as t′1, denote the decoding output of a thin arrow.

… sentences constitute a sentence pair for training the source-to-target translation model. Note that dur…
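The excerpt breaks off here, but the surviving fragment above indicates that anchored sentences and their originals constitute sentence pairs for training the translation model. A hedged sketch of assembling such pairs from monolingual data, reusing anchor_sentence from the previous sketch; the pairing direction and the filtering of unanchored sentences are our reading of the fragment, not confirmed details:

    def build_anchored_pairs(monolingual_sentences, translations):
        # With a target-to-source dictionary, the anchored sentence is
        # pulled toward the source side, so (anchored, original) pairs can
        # serve as pseudo-parallel data for a source-to-target model.
        pairs = []
        for sent in monolingual_sentences:
            tokens = sent.split()
            anchored = anchor_sentence(tokens, translations)
            if anchored != tokens:  # keep sentences with at least one anchor
                pairs.append((" ".join(anchored), sent))
        return pairs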