The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Translating Pro-Drop Languages with Reconstruction Models

Longyue Wang (ADAPT Centre, Dublin City University), Zhaopeng Tu* (Tencent AI Lab), Shuming Shi (Tencent AI Lab), Tong Zhang (Tencent AI Lab), Yvette Graham (ADAPT Centre, Dublin City University), Qun Liu (ADAPT Centre, Dublin City University)

*Zhaopeng Tu is the corresponding author.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. To date, very little attention has been paid to the dropped pronoun (DP) problem within neural machine translation (NMT). In this work, we propose a novel reconstruction-based approach to alleviating DP translation problems for NMT models. Firstly, DPs within all source sentences are automatically annotated with parallel information extracted from the bilingual training corpus. Next, the annotated source sentence is reconstructed from hidden representations in the NMT model. With auxiliary training objectives, in terms of reconstruction scores, the parameters of the NMT model are guided to produce enhanced hidden representations that embed the annotated DP information as much as possible. Experimental results on both Chinese–English and Japanese–English dialogue translation tasks show that the proposed approach significantly and consistently improves translation performance over a strong NMT baseline that is directly built on training data annotated with DPs.

Input  (它) 根本 没 那么严重
Ref    It is not that bad
SMT    Wasn ’t that bad
NMT    It ’s not that bad

Input  这块 面包 很 美味 ! 你 烤 的 (它) 吗 ?
Ref    The bread is very tasty ! Did you bake it ?
SMT    This bread , delicious ! Did you bake ?
NMT    The bread is delicious ! Are you baked ?

Table 1: Examples of translating DPs, where words in brackets are dropped pronouns that are invisible in decoding. The NMT model succeeds in translating a simple dummy pronoun (upper panel) but fails on a more complicated one (bottom panel); the SMT model fails in both cases.

Introduction

In pro-drop languages, such as Chinese and Japanese, pronouns can be omitted from sentences when it is possible to infer the referent from the context. When translating sentences from a pro-drop language to a non-pro-drop language (e.g., Chinese to English), machine translation systems generally fail to translate invisible dropped pronouns (DPs). This problem is especially severe in informal genres such as dialogues and conversation, where pronouns are more frequently omitted to make utterances compact (Yang, Liu, and Xue 2015). For example, our analysis of a large Chinese–English dialogue corpus showed that around 26% of pronouns were dropped from the Chinese side of the corpus. This high proportion within informal genres shows the importance of addressing the challenge of translating dropped pronouns.

Researchers have investigated methods of alleviating the DP problem for conventional Statistical Machine Translation (SMT) models, showing promising results (Le Nagard and Koehn 2010; Xiang, Luo, and Zhou 2013; Wang et al. 2016a). Modeling DP translation for the more advanced Neural Machine Translation (NMT) models, however, has received substantially less attention, resulting in low performance in this respect even for state-of-the-art approaches. NMT models, due to their ability to capture semantic information with distributed representations, currently manage to successfully translate some simple DPs, but still fail on more complex ones. Table 1 includes typical examples where our strong baseline NMT system fails to accurately translate dropped pronouns.
In this paper, we narrow this performance gap by improving DP translation within NMT models, thereby improving translation quality for pro-drop languages.

More specifically, we propose a novel reconstruction-based approach to alleviate DP problems for NMT. Firstly, we explicitly and automatically label DPs for each source sentence in the training corpus using alignment information from the parallel corpus (Wang et al. 2016a). Accordingly, each training instance is represented as a triple (x, y, x̂), where x and y are the source and target sentences, and x̂ is the labelled source sentence. Next, we apply a standard encoder-decoder NMT model to translate x, and obtain two sequences of hidden states from the encoder and decoder. This is followed by the introduction of an additional reconstructor (Tu et al. 2017b) that reconstructs the labelled source sentence x̂ from the hidden states of the encoder, the decoder, or both components. The central idea is to guide the corresponding hidden states to embed the recalled source-side DP information, and subsequently to help the NMT model generate the missing pronouns from these enhanced hidden representations. To this end, the reconstructor produces a reconstruction loss, which measures how well the DPs can be recalled and serves as an auxiliary training objective. Additionally, the likelihood score produced by the standard encoder-decoder measures the quality of the general translation, while the reconstruction score measures the quality of DP translation; a linear interpolation of these scores is employed as the overall score for a given translation.
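To make the labelling step concrete, the sketch below shows one way alignment-based DP annotation could be implemented. It is a minimal approximation of the procedure of Wang et al. (2016a), not their actual system: the toy pronoun table, the `annotate_dps` name, and the insertion-position heuristic are all our own illustrative assumptions.

```python
# A minimal sketch of alignment-based DP annotation (our simplification of
# Wang et al. 2016a; names and heuristics here are illustrative only).

# Assumption: a tiny hand-written mapping from English pronouns to the
# Chinese pronouns they typically realize.
EN_TO_ZH_PRONOUN = {"it": "它", "i": "我", "you": "你"}

def annotate_dps(src_tokens, tgt_tokens, alignment):
    """Return the labelled source sentence x̂ for a training pair (x, y).

    alignment: set of (src_idx, tgt_idx) pairs from a word aligner
    (e.g., GIZA++). A target pronoun aligned to no source token is
    treated as a pronoun dropped on the source side.
    """
    aligned_tgt = {t for (_, t) in alignment}
    labelled = list(src_tokens)
    for t, tgt_word in enumerate(tgt_tokens):
        zh = EN_TO_ZH_PRONOUN.get(tgt_word.lower())
        if zh is not None and t not in aligned_tgt:
            # Heuristic (our assumption): insert the recovered pronoun just
            # after the source token(s) aligned to the preceding target word,
            # defaulting to the sentence end. A real implementation would also
            # track index shifts when several pronouns are inserted.
            prev = [s for (s, tt) in alignment if tt == t - 1]
            pos = max(prev) + 1 if prev else len(labelled)
            labelled.insert(pos, "<DP>" + zh)
    return labelled

# Second example from Table 1: the object pronoun 它 ("it") is dropped.
x = "这块 面包 很 美味 ! 你 烤 的 吗 ?".split()
y = "The bread is very tasty ! Did you bake it ?".split()
# Toy alignment covering every target word except "it" (target index 9).
alignment = {(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 5),
             (8, 6), (5, 7), (6, 8), (7, 8), (9, 10)}
print(" ".join(annotate_dps(x, y, alignment)))
# -> 这块 面包 很 美味 ! 你 烤 的 <DP>它 吗 ?  — the x̂ of the triple (x, y, x̂)
```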
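The interpolation itself reduces to a single line. The following schematic is our rendering of how the two scores could be combined, with `lambda_` as a hypothetical interpolation weight; the paper's reconstructor is a full attention-based network, which we abstract away behind plain log-scores here.

```python
# Schematic form of the interpolated score (our rendering; lambda_ is a
# hypothetical weight, and the two inputs stand in for the encoder-decoder
# likelihood and the reconstructor score).

def overall_score(log_likelihood, log_reconstruction, lambda_=1.0):
    """Linear interpolation of translation and reconstruction scores.

    - log_likelihood:     log P(y | x), quality of the general translation
    - log_reconstruction: log R(x_hat | hidden states), how well the
      DP-labelled source can be recalled (quality of DP translation)
    """
    return log_likelihood + lambda_ * log_reconstruction

# In training, the negated score summed over the corpus is minimized, so the
# reconstruction term acts as an auxiliary objective shaping the hidden
# states. In testing, the same interpolation can rerank candidates:
def rerank(candidates):
    """candidates: list of (y, log_likelihood, log_reconstruction) triples."""
    return max(candidates, key=lambda c: overall_score(c[1], c[2]))
```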
Experiments on a large-scale Chinese–English corpus show that the proposed approach significantly improves translation performance by addressing the DP translation problem. Furthermore, when reconstruction is applied only in training, it improves parameter training by producing better hidden representations that embed the DP information. Results show an improvement over a strong NMT baseline system of +1.35 BLEU points without any increase in decoding speed. When additionally applying reconstruction during testing, we obtain a further +1.06 BLEU point improvement with only a slight decrease in decoding speed of approximately 18%. Experiments on a Japanese–English translation task show a significant improvement of 1.29 BLEU points, demonstrating the potential universality of the proposed approach across language pairs.

Contributions. Our main contributions can be summarized as follows:

1. We show that although NMT models advance SMT models on translating pro-drop languages, there is still large room for improvement;
2. We introduce a reconstruction-based approach to improve dropped pronoun translation;
3. We release a large-scale bilingual dialogue corpus, which consists of 2.2M Chinese–English sentence pairs.¹

¹Our released corpus is available at https://github.com/longyuewangdcu/tvsub.

Background

Pro-Drop Language Translation

A pro-drop language is a language in which certain classes of pronouns are omitted to make the sentence compact yet comprehensible when the identity of the pronouns can be inferred from the context. Since pronouns carry rich anaphoric knowledge in discourse and sentences in dialogue are generally short, DPs not only result in missing pronoun translations, but also harm the structure and even the semantics of the output. Take the second case in Table 1 as an example: when the object pronoun "它" is dropped, the sentence is translated into "Are you baked?", while the correct translation should be "Did you bake it?". Such omissions may not be problematic for humans, since they can easily recall missing pronouns from the context. They do, however, cause challenges for machine translation from a source pro-drop language to a target non-pro-drop language, since translation of such dropped pronouns generally fails.

Genres     Sents   ZH-Pro   EN-Pro   DP
Dialogue   2.15M   1.66M    2.26M    26.55%
Newswire   3.29M   2.27M    2.45M    7.35%

Table 2: Extent of DP in different genres. The Dialogue corpus consists of subtitles extracted from movie subtitle websites; the Newswire corpus is CWMT2013 news data.

As shown in Table 2, we analyzed two large Chinese–English corpora and found that around 26.55% of English pronouns can be dropped in the dialogue domain, while only 7.35% were dropped in the newswire domain. DPs in formal text genres (e.g., newswire) are not as common as in informal genres (e.g., dialogue), and the most frequently dropped pronoun in Chinese newswire is the third person singular 它 ("it") (Baran, Yang, and Xue 2012), which may not be crucial to translation performance. As the dropped pronoun phenomenon is more prevalent in informal genres, we test our method in the dialogue domain.

Encoder-Decoder Based NMT

Neural machine translation (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015) has greatly advanced the state of the art in machine translation. The encoder-decoder architecture is now widely employed, in which the encoder summarizes the source sentence x = x_1, x_2, ..., x_J into a sequence of hidden states {h_1, h_2, ..., h_J}. Based on the encoder-side hidden states, the decoder generates the target sentence y = y_1, y_2, ..., y_I word by word with another sequence of decoder-side hidden states {s_1, s_2, ..., s_I}:

P(y|x) = \prod_{i=1}^{I} P(y_i | y_{<i}, x) = \prod_{i=1}^{I} g(y_{i-1}, s_i, c_i)    (1)

where g(·) is a softmax layer. The decoder hidden state s_i at step i is computed as

s_i = f(y_{i-1}, s_{i-1}, c_i)    (2)

where f(·) is an activation function and c_i = \sum_{j=1}^{J} \alpha_{i,j} h_j is a weighted sum of the encoder hidden states, with \alpha_{i,j} the alignment probability calculated by an attention model (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015). The parameters of the NMT model are trained to maximize the likelihood of a set of training examples {(x^n, y^n)}_{n=1}^{N}:

\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log P(y^n | x^n; \theta)    (3)
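To illustrate Eqs. (1)–(3) concretely, the NumPy sketch below runs a few steps of a minimal attention-based decoder. The additive attention form, the tanh state update in place of a GRU/LSTM, and all dimensions and weight names are simplifying assumptions made for illustration, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
J, d, V = 5, 8, 100          # source length, hidden size, target vocab size

# Encoder-side hidden states {h_1, ..., h_J} (random stand-ins here).
H = rng.standard_normal((J, d))

# Toy parameters (assumption: single-layer additive attention and a plain
# tanh cell in place of the recurrent unit used in practice).
W_a, U_a = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v_a = rng.standard_normal(d)
W_s = rng.standard_normal((d, 3 * d))   # decoder state update, Eq. (2)
W_o = rng.standard_normal((V, 2 * d))   # output layer g(.), Eq. (1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev):
    """One step i: attention -> context c_i -> state s_i -> P(y_i | y_<i, x)."""
    # Alignment probabilities alpha_{i,j} from additive attention scores.
    scores = np.tanh(H @ U_a.T + s_prev @ W_a.T) @ v_a
    alpha = softmax(scores)
    c = alpha @ H                                            # context c_i
    s = np.tanh(W_s @ np.concatenate([y_prev, s_prev, c]))   # Eq. (2)
    p = softmax(W_o @ np.concatenate([s, c]))                # softmax g, Eq. (1)
    return s, p

# Accumulating log P(y_i | y_<i, x) over steps gives log P(y|x); summed over
# the corpus, this is the training objective of Eq. (3).
y_prev, s = np.zeros(d), np.zeros(d)
log_lik = 0.0
for gold_id in [3, 17, 42]:                 # toy target word ids
    s, p = decoder_step(y_prev, s)
    log_lik += np.log(p[gold_id])
    y_prev = rng.standard_normal(d)         # embedding of y_i (toy stand-in)
print(f"log P(y|x) = {log_lik:.3f}")
```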