
Chinese Character Decomposition for Neural MT with Multi-Word Expressions

Lifeng Han[1], Gareth J. F. Jones[1], Alan F. Smeaton[2] and Paolo Bolzoni
[1] ADAPT Research Centre, [2] Insight Centre for Data Analytics
School of Computing, Dublin City University, Dublin, Ireland
lifeng.@adaptcentre.ie, [email protected]

Abstract

Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embeddings. However, questions remain about which of the different decomposition levels of Chinese character representations, radical and strokes, is best suited for MT. To investigate the impact of Chinese decomposition embeddings in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate whether the combination of decomposed Multiword Expressions (MWEs) can enhance model learning. MWE integration into MT has seen more than a decade of exploration; however, decomposed MWEs have not previously been explored.

1 Introduction

Neural Machine Translation (NMT) (Cho et al., 2014; Johnson et al., 2016; Vaswani et al., 2017; Lample and Conneau, 2019) has recently replaced Statistical Machine Translation (SMT) (Brown et al., 1993; Och and Ney, 2003; Chiang, 2005; Koehn, 2010) as the state-of-the-art for Machine Translation (MT). However, research questions still remain, such as how to deal with out-of-vocabulary (OOV) words, how best to integrate linguistic knowledge, and how best to correctly translate multi-word expressions (MWEs) (Sag et al., 2002; Moreau et al., 2018; Han et al., 2020a). For OOV word translation in European languages, substantial improvements have been made on rare and unseen words by incorporating sub-word knowledge using Byte Pair Encoding (BPE) (Sennrich et al., 2016). However, such methods cannot be directly applied to Chinese, Japanese and other ideographic languages.

Sub-character level information, such as Chinese ideographs and radicals, has been integrated as learning knowledge to enhance features in NMT systems (Han and Kuang, 2018; Zhang and Matsumoto, 2018; Zhang and Komachi, 2018). Han and Kuang (2018), for example, explain that the meaning of some unseen or low-frequency Chinese characters can be estimated and translated using the radicals decomposed from those characters, as long as the learning model can acquire knowledge of these radicals within the training corpus.

Chinese characters often include two pieces of information, with semantics encoded within the radical and a phonetic part. The phonetic part is related to the pronunciation of the overall character, either the same or similar. For instance, Chinese characters with the two-stroke radical 刂 (tí dāo páng) ordinarily relate to knife in meaning, such as the Chinese character 劍 (jiàn, sword) and the multi-character expression 鋒利 (fēnglì, sharp). The radical 刂 (tí dāo páng) preserves the meaning of knife because it is a variation of a drawing of a knife evolving from the original bronze inscription (Fig. 4 in the Appendices).

Not only can the radical part of a character be decomposed into smaller fragments of strokes, but the phonetic part can also be decomposed. Thus there are often several levels of decomposition that can be applied to Chinese characters by combining different levels of decomposition of each part of the character. As one example, Figure 1 shows the three decomposition levels from our model and the full stroke form of the above mentioned characters 劍 (jiàn) and 鋒 (fēng).
To date, little work has been carried out to investigate the full potential of these alternative levels of decomposition of Chinese characters for the purpose of Machine Translation (MT).

In this work, we investigate Chinese character decomposition, and another area related to Chinese characters, namely Chinese MWEs. We firstly investigate translation at increasing levels of decomposition of Chinese characters using the underlying radicals, as well as the further Chinese character strokes (corresponding to ever-smaller units), breaking down characters into component parts, as this is likely to reduce the number of unknown words. Then, in order to better deal with MWEs, which occur commonly in general contexts (Sag et al., 2002), and working in the opposite direction in terms of meaning representation, we investigate translating larger units of Chinese text, with the aim of restricting translation of larger groups of Chinese characters that should be translated together as one unit. In addition to investigating the effects of decomposing characters, we simultaneously apply methods of incorporating MWEs into translation. MWEs can appear in Chinese in a range of forms, such as fixed (or semi-fixed) expressions, metaphors, idiomatic phrases, and institutional, personal or location names, amongst others.

In summary, in this paper we investigate: (i) the degree to which Chinese radical and stroke sequences represent the original word and character sequences that they are composed of; (ii) the difference in performance achieved by each decomposition level; (iii) the effect of radical and stroke representations in MWEs for MT. Furthermore, we offer:

• an open-source suite of Chinese character decomposition extraction tools;

• a Chinese ⇔ English MWE corpus where the Chinese characters have been decomposed, available at radical4mt[1].

The rest of this paper is organized as follows: Section 2 provides details of related work on character and radical related MT; Sections 3 and 4 introduce our Chinese decomposition procedure into radicals and strokes, and our experimental design; Section 5 provides details of our evaluations from both automatic and human perspectives; Section 6 describes conclusions and plans for future work.

[1] https://github.com/poethan/MWE4MT

2 Related Work

Chinese character decomposition has been explored recently for MT. For instance, Han and Kuang (2018) and Zhang and Matsumoto (2018) considered radical embeddings as additional features for Chinese → English and Japanese ⇔ Chinese NMT. Han and Kuang (2018) tested a range of encoding models including word+character, word+radical, and word+character+radical. This final setting with word+character+radical achieved the best performance on a standard NIST[2] MT evaluation data set for Chinese → English. Furthermore, Zhang and Matsumoto (2018) applied radical embeddings as additional features to character level LSTM-based NMT for Japanese → Chinese translation. None of the aforementioned work, however, investigated the performance of decomposed character sequences and the effects of varied decomposition degrees in combination with MWEs. Subsequently, Zhang and Komachi (2018) developed bidirectional English ⇔ Japanese, English ⇔ Chinese and Chinese ⇔ Japanese NMT with word, character, ideograph (where the phonetic and semantic parts of characters are separated) and stroke levels, with experiments showing that the ideograph level was best for ZH→EN MT, while the stroke level was best for JP→EN MT. Although their ideograph and stroke level settings replaced the original character and word sequences, there was no investigation of intermediate decomposition performance, and they only used the BLEU score for automated evaluation, with no human assessment involved. This gives us the inspiration to explore the performance of intermediate level embeddings, between ideographs and strokes, for the MT task.

[2] https://www.nist.gov/programs-projects/machine-translation

3 Chinese Character Decomposition

In this section, we introduce a character decomposition approach and the extraction tools which we apply in this work (the code will be publicly available). We utilize the open source IDS dictionary[3], which was derived from the CHISE (CHaracter Information Service Environment) project[4]. It is comprised of 88,940 Chinese characters from CJK (Chinese, Japanese, Korean) Unified Ideographs and the corresponding decomposition sequences of each character.

[3] https://github.com/cjkvi/cjkvi-ids
[4] http://www.chise.org/

劍 (jiàn) | 鋒 (fēng)
Level-1: 僉 (phonetic, qiān) ⺉ (semantic, knife) | ⾦ (semantic, metal) 夆 (phonetic, féng)
Level-2: 亼吅从 ⺉ | ⼈王丷 夂丰
Level-3: ⼈⼀⼝⼝⼈⼈ ⺉ | ⼈⼀⼟丷 夂三⼁
…
Full-stroke: ⼃㇏⼀⼁�⼀⼁�⼀⼃㇏⼃㇏ ⼁⼅ | ⼃㇏⼀⼀⼁⼂㇀⼀ ㇀㇇㇏⼀⼀⼀⼁

Figure 1: Examples of the decomposition of Chinese characters.

Most characters are decomposed as a single sequence, but a character can have up to four possible decomposed representations. The reason for this is that a character can come from different sources, such as Chinese Hanzi (G, H, T for Mainland China, Hong Kong, and Taiwan), Japanese (J), Korean (K), and Vietnamese ChuNom (V), etc.[5] Even though they share the same Hanzi root, the historical development of the languages and writing systems in the different territories has resulted in certain degrees of variation in their appearance, for instance (且, qiě) vs. (目, mù) in the second character example in Figure 2.

[5] Universal Coded Character Set (10646:2017) standards.iso.org/ittf/PubliclyAvailableStandards

Character | Decomposition | Decomposition
丽 (lì) | ⿱⼀⿰⿵⼌⼂⿵⼌⼂ [G] | ⿰⿱⼀⿵⼌⼂⿱⼀⿵⼌⼂ [T]
具 (jù) | ⿱⿴且⼀八 [GTKV] | ⿳⽬⼀八 [J]
函 (hán) | ⿶⼐⿻了⿱丷八 [GTV] | ⿶⼐⿻丂⿱丷八 [JK]
勇 (yǒng) | ⿱甬⼒ [GTV] | ⿱⿱龴⽥⼒ [JK]
Character construction operators: ⿱ up-down; ⿰ left-right; ⿵ ⿶ ⿴ inside-outside; ⿻ embedded.

Figure 2: Character examples from the IDS dictionary; the grey parts of the decomposition graphs represent the construction structure of the character.

Figure 2 shows example characters that have two different decomposition sequences. In our experiments, when there is more than one decomposed representation of a given character, we choose the Chinese mainland decomposition standard (G) for the model, since the corpora we use correspond best to simplified Chinese as used in mainland China. The examples in Figure 2 also show the general construction and corresponding decomposition styles of Chinese characters, such as left-right, up-down, inside-outside, and embedded, amongst others. To obtain the decomposition level L representation of a Chinese character α, we go through the IDS file L times. Each time, we search the IDS file character list to match the newly generated smaller-sized characters and replace them with their decomposed representations, recursively.
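To make the level-L procedure concrete, the sketch below builds a character-to-decomposition map from IDS-style entries, preferring the mainland (G) variant as described above, and then expands a string one level per pass. The entry format, the file name ids.txt, and the helper names are illustrative assumptions rather than the exact cjkvi-ids grammar.

```python
import re

# Sketch: parse IDS-style entries ("U+5177<TAB>具<TAB>⿱⿴且⼀八[GTKV]<TAB>⿳⽬⼀八[J]",
# following the Figure 2 examples), preferring the mainland (G) variant.
def parse_ids_line(line):
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    char, variants = fields[1], fields[2:]
    for v in variants:
        m = re.match(r"(.+)\[([A-Z]+)\]$", v)
        if m and "G" in m.group(2):  # prefer the mainland (G) standard
            return char, m.group(1)
    # otherwise fall back to the first variant, stripping any source tag
    return char, re.sub(r"\[[A-Z]+\]$", "", variants[0])

def decompose(text, ids, level):
    # One pass over the text per level: every character found in the IDS
    # map is replaced by its decomposition; unknown pieces pass through.
    for _ in range(level):
        text = "".join(ids.get(ch, ch) for ch in text)
    return text

with open("ids.txt", encoding="utf-8") as f:
    ids = dict(entry for entry in map(parse_ids_line, f) if entry)

print(decompose("劍鋒", ids, 3))  # level-3 representation of 劍 and 鋒
```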

4 NMT Experiments

We test the various levels of decomposed Chinese and Chinese MWEs using publicly available data from the WMT-2018 shared task for Chinese to English, using the preprocessed (word segmented) data as training data (Bojar et al., 2018). We preserve the original word boundaries in the decomposition sequences. To obtain better generalizability of our decomposition models, we use a large training set, the first 5 million parallel sentences, across all learning steps. The corpora "newsdev2017", used for development, and "newstest2017", used for testing, are from the WMT-2017 MT shared task (Bojar et al., 2017); they include 2002 and 2001 parallel Chinese ⇔ English sentences respectively. We use the THUMT toolkit (Zhang et al., 2017), an implementation of several attention-based Transformer architectures (Vaswani et al., 2017) for NMT, and set up the encoder-decoder with 7+7 layers. Batch size is set to 6250. For sub-word encoding with BPE, we use 32K BPE operations learned from the bilingual training set. We use Google's Colab platform to run our experiments.[6] We call our baseline model using character sequences (with word boundaries) the character sequence model. For the MWE integrated models, we apply the same bilingual MWE extraction pipeline as in our previous work (Han et al., 2020b), similar to Rikters and Bojar (2017), which is an automated pre-defined PoS pattern-based extraction procedure, with the filtering threshold set to 0.85 to remove lower quality translation pairs. We integrate these extracted bilingual MWEs back into the training set to investigate if they help the MT learning. In the decomposed models, we replace the original Chinese character sequences from the corpus with decomposed character-piece sequence inputs for training, development and testing (keeping the original word boundaries).

[6] https://colab.research.google.com
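As a rough illustration of this integration step, the sketch below keeps extracted bilingual MWE pairs scoring at or above the 0.85 threshold and appends the survivors to the parallel training files as extra short sentence pairs. The TSV layout and file names are assumptions for illustration; the actual pipeline is the PoS pattern-based one of Han et al. (2020b).

```python
# Sketch: filter extracted bilingual MWE candidates by quality score and
# append the retained pairs to the parallel training corpus.
# Assumed (illustrative) input format: one pair per line, "zh<TAB>en<TAB>score".

THRESHOLD = 0.85

def load_mwe_pairs(path, threshold=THRESHOLD):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            zh, en, score = line.rstrip("\n").split("\t")
            if float(score) >= threshold:  # drop lower-quality pairs
                pairs.append((zh, en))
    return pairs

def append_to_training(pairs, src_path, tgt_path):
    # Each retained MWE pair becomes an extra (short) parallel sentence.
    with open(src_path, "a", encoding="utf-8") as src, \
         open(tgt_path, "a", encoding="utf-8") as tgt:
        for zh, en in pairs:
            src.write(zh + "\n")
            tgt.write(en + "\n")

pairs = load_mwe_pairs("mwe_candidates.zh-en.tsv")
append_to_training(pairs, "train.zh", "train.en")
```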
5 Evaluation

In order to assess the performance of each model employing a different meaning representation in terms of decomposition and MWEs, we carried out both automatic evaluation using BLEU (Papineni et al., 2002), shown in Fig. 3, and human evaluation (Direct Assessment) of the system outputs. Since decomposition level 3 yields generally higher scores than the other two levels, we also applied decomposition of the MWEs at level 3 and concatenated the bilingual glossaries to the training data.

Figure 3: Chinese→English BLEU scores for increasing learning steps; RXD1/2/3 represent the decomposition levels of Chinese characters. RXD1 indicates the ideograph level from Zhang and Komachi (2018).

We used the models with the most learning steps, 180K, and ran the human evaluation on the Amazon Mechanical Turk crowd-sourcing platform[7], including the strict quality control measures of Graham et al. (2016). Direct Assessment scores for systems were calculated as in Graham et al. (2019) by first computing an average score per translation and then calculating the overall average for a system from its average scores for translations. Significance tests in the form of the Wilcoxon rank-sum test are then applied to the score distributions of the latter to identify systems that significantly outperform other systems in the human evaluation.

[7] https://www.mturk.com

Results of the Direct Assessment human evaluation are shown in Table 1, where similarly performing systems are clustered together (denoted by horizontal lines in the table); systems in a given lower-ranked cluster are significantly outperformed by all systems in a higher-ranked cluster. Amongst the six models included in the human evaluation, the first five, the baseline, MWE, RXD1, RXD3MWE, and RXD3, form a cluster with very similar performance according to human assessors, not outperforming each other with any significance. RXD2, on the other hand, is far behind the other models in terms of performance according to the human judges (and also the automated BLEU score), performing significantly worse than all other runs (at p < 0.05). Following the tradition of the WMT shared task workshops, we cluster the first five models into one group and RXD2 into a second group. Furthermore, the human evaluation results in Table 1 show that the top five models all achieve high performance, on par with the state-of-the-art in Chinese to English MT.
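A compact sketch of this aggregation and testing procedure, under the assumption that raw 0-100 DA scores are z-standardized per assessor before averaging; the record layout and function names below are illustrative, and the rank-sum test comes from scipy.

```python
# Sketch of Direct Assessment aggregation: standardize raw scores per
# assessor, average per translation, then average per system; compare two
# systems with a Wilcoxon rank-sum test over per-translation scores.
from collections import defaultdict
from statistics import mean, pstdev
from scipy.stats import ranksums

# records: iterable of (assessor_id, system_id, translation_id, raw_score)
def system_scores(records):
    records = list(records)
    by_assessor = defaultdict(list)
    for assessor, _, _, score in records:
        by_assessor[assessor].append(score)
    norm = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_assessor.items()}

    per_translation = defaultdict(list)  # (system, translation) -> z scores
    for assessor, system, translation, score in records:
        mu, sd = norm[assessor]
        per_translation[(system, translation)].append((score - mu) / sd)

    per_system = defaultdict(list)  # system -> average z per translation
    for (system, _), zs in per_translation.items():
        per_system[system].append(mean(zs))
    return per_system

def outperforms(per_system, a, b, alpha=0.05):
    # One-sided rank-sum check that system a scores higher than system b.
    stat, p = ranksums(per_system[a], per_system[b])
    return stat > 0 and p / 2 < alpha
```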
System | Ave. raw | Ave. z | n | N
BASE | 73.2 | 0.161 | 1,232 | 1,639
MWE | 71.6 | 0.125 | 1,262 | 1,659
RXD1 | 71.6 | 0.113 | 1,257 | 1,672
RXD3MWE | 71.3 | 0.109 | 1,214 | 1,593
RXD3 | 70.2 | 0.073 | 1,260 | 1,626
-----
RXD2 | 53.9 | −0.533 | 1,227 | 1,620

Table 1: Human evaluation results for systems using Direct Assessment, where Ave. raw = the average score for translations calculated from raw Direct Assessment scores, Ave. z = the average score for translations after score standardization by each human assessor's mean and standard deviation, n = the number of distinct translations included in the human evaluation (the sample size used in significance testing), and N = the number of human assessments (including repeat assessments). The horizontal line marks the cluster boundary between the top five systems and RXD2.

We also discovered that the decomposed models generated fewer system parameters for the neural nets to learn, which potentially reduces computational complexity. For instance, the total trainable variable size of the character sequence baseline model is 89,456,896, while this number decreased to 80,288,000 and 80,591,104 respectively for the RXD3 and RXD2 models (a 10.25% drop for RXD3). As mentioned by Goodfellow et al. (2016), in NLP tasks the total number of possible words is so large that word sequence models have to operate on an extremely high-dimensional and sparse discrete space. The decomposition models reduce the overall inventory of possible tokens for the model to learn, which is more space efficient.
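As a back-of-envelope check, the gap between these totals divides exactly by 512, which would be consistent with the savings coming from smaller embedding tables; note that the embedding width of 512 is our assumption for illustration, not a setting reported in the paper.

```python
# Back-of-envelope: parameter savings attributable to a smaller vocabulary,
# assuming an embedding width of 512 (an assumption, not a reported setting).
BASE, RXD3, RXD2 = 89_456_896, 80_288_000, 80_591_104
EMB_DIM = 512

for name, total in [("RXD3", RXD3), ("RXD2", RXD2)]:
    saved = BASE - total
    rows = saved // EMB_DIM
    print(f"{name}: {saved:,} fewer parameters ({saved / BASE:.2%} drop; "
          f"~{rows:,} fewer embedding rows at width {EMB_DIM})")
# RXD3: 9,168,896 fewer parameters (10.25% drop; ~17,908 fewer embedding rows ...)
```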
Regarding the automatic and human evaluation results, where decomposition level 2 achieved a surprisingly lower score than the other levels, error analysis revealed an important insight. Level 1 decomposition encodes the original character sequences into radical representations, which typically contain the semantic and phonetic parts of a character, and level 3 gives a deeper decomposition of the character into stroke-level pieces with their sequence order. In contrast, level-2 decomposition appears to introduce intermediate characters that mislead model learning. These intermediate-level characters are usually constructed from fewer strokes than the original root character, but can be decomposed from it. As in Figure 1, from decomposition level 2 we get the new characters 从 (cóng) and 王 (wáng) respectively from 劍 (jiàn, sword) and 鋒 (fēng, edge/sharp point), but these carry no meaning from their father characters, instead meaning "from" and "king" respectively. In summary, decomposition level-2 tends to generate intermediate characters that preserve neither the meaning of the original root character's radical nor that of the strokes, but are rather smaller independent characters with fewer strokes that carry other meanings.

6 Conclusions and Future Work

In this work, we examined varying degrees of Chinese character decomposition and their effect on Chinese to English NMT with an attention architecture. To the best of our knowledge, this is the first work on detailed decomposition levels of Chinese characters for NMT, and on decomposition representations for MWEs. We conducted experiments for decomposition levels 1 to 3; we also had a look at level 4 decomposition, which appears similar to level 3 sequences. We publish our extraction toolkit free for academic usage. We conducted automated evaluation with the BLEU metric, and crowd-sourced human evaluation with the Direct Assessment (DA) methodology. Our conclusion is that Chinese character decomposition levels 1 and 3 can be used to represent or replace the original character sequence in an MT task, and that this achieves similar performance to the original character sequence model in our NMT setting. However, decomposition level 2 is not suitable to represent the original character sequence in meaning, at least for MT. We leave it to future work to explore the performance of the different decomposition levels in other NLP tasks.

Another finding from our experiments is that while adding bilingual MWE terms can increase both character level and decomposed level MT scores according to the automatic metric BLEU, the human evaluation shows no statistical significance between them. Significance testing using further automated evaluation metrics, such as METEOR (Banerjee and Lavie, 2005) and LEPOR (Han et al., 2012; Han, 2014) in addition to BLEU, will be carried out in our future work.

We will consider different MWE integration methods in future work and reduce the training set to investigate the differences in low-resource scenarios (5 million sentence pairs were used for the training set in this work). We will also sample a set of the testing results and conduct a human analysis of MWE translation accuracy across the different representation models. We will further investigate different strategies for combining several levels of decomposition and their corresponding performance in semantic representation tasks such as MT. Finally, the IDS file we applied in this work limited the full stroke level capability, and we will look for alternative methods to achieve full-stroke level character sequence extraction for investigation in NLP tasks.

Acknowledgments

We thank Yvette Graham for helping with the human evaluation, Eoin Brophy for help with Colab, and the anonymous reviewers for their thorough reviews and insightful feedback. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. The input of Alan Smeaton is part-funded by Science Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre).

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272–307, Brussels, Belgium. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan. Association for Computational Linguistics.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28.

Yvette Graham, Barry Haddow, and Philipp Koehn. 2019. Translationese in machine translation evaluation. CoRR, abs/1906.09833.
Lifeng Han. 2014. LEPOR: An Augmented Machine Translation Evaluation Metric. University of Macau.

Lifeng Han, Gareth Jones, and Alan Smeaton. 2020a. AlphaMWE: Construction of multilingual parallel corpora with MWE annotations. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics.

Lifeng Han, Gareth Jones, and Alan Smeaton. 2020b. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France. European Language Resources Association.

Lifeng Han and Shaohui Kuang. 2018. Incorporating Chinese radicals into neural machine translation: Deeper than character level. In Proceedings of the ESSLLI-2018 Student Session, pages 54–65. Association for Logic, Language and Information (FoLLI).

Lifeng Han, Derek F. Wong, and Lidia S. Chao. 2012. LEPOR: A robust evaluation metric for machine translation with augmented factors. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 441–450. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Erwan Moreau, Ashjan Alsulaimani, Alfredo Maldonado, Lifeng Han, Carl Vogel, and Koel Dutta Chowdhury. 2018. Semantic reranking of CRF label sequences for verbal multiword expression identification. In Multiword Expressions at Length and in Depth: Extended Papers from the MWE 2017 Workshop, pages 177–207. Language Science Press.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Mātīss Rikters and Ondřej Bojar. 2017. Paying attention to multi-word expressions in neural machine translation. In Proceedings of the 16th Machine Translation Summit (MT Summit 2017), Nagoya, Japan.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing, pages 1–15, Berlin, Heidelberg. Springer Berlin Heidelberg.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems, pages 6000–6010.

Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An open source toolkit for neural machine translation. ArXiv, abs/1706.06415.

Jinyi Zhang and Tadahiro Matsumoto. 2018. Improving character-level Japanese-Chinese neural machine translation with radicals as an additional input feature. CoRR, abs/1805.02937.

Longtu Zhang and Mamoru Komachi. 2018. Neural machine translation of logographic language using sub-character level information. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 17–25, Brussels, Belgium. Association for Computational Linguistics.

Appendices

Appendix A: Chinese Character Knowledge

Figure 4 demonstrates the meaning-preservation root of Chinese radicals, showing how the Chinese character 刀 (dāo), meaning knife, evolved from its bronze inscription form to the contemporary character and radical form 刂 (named tí dāo páng). NMT for Asian languages has included translation at the level of phrase, word, and character sequences (see Figure 5).
Appendix B: More Details of Evaluation

The evaluation scores of the character sequence baseline NMT, character decomposed NMT and MWE-NMT models according to the BLEU metric are presented in Fig. 3. The RXD1 model, decomposition level 1, is the ideograph model Zhang and Komachi (2018) used for their experiments, where the phonetic (声旁 shēng páng) and semantic (形旁 xíng páng) parts of a character are separated initially.

From the automated evaluation results, we see that the decomposition model RXD3 has very close BLEU scores to the baseline character sequence model (both with word boundaries). This is very interesting, since the level 3 Chinese decomposition is typically impossible (or too difficult) for even native speakers to read and understand. Furthermore, by adding the decomposed MWEs back into the learning corpus, "rxd3+MWE" (RXD3MWE) yields higher BLEU scores at some learning steps than the baseline model. To gain further insight, we provide the learning curves over the learning steps with their corresponding automated scores in Figure 6.

The BLEU score increase from adding MWEs is in general larger for the decomposed models (from RXD3 to RXD3MWE) than for the original character sequence models (from BASE to BASEMWE). Furthermore, the increase in performance from adding MWEs to the decomposed model is very consistent, compared to the conventional character sequence model; for instance, the performance of BASEMWE drops surprisingly at 100K learning steps.

Appendix C: Looking into MT Examples

From the learning curves in Fig. 6, we suggest that with 5 million training sentences and 7+7 layers of encoder-decoder neural nets, the Transformer model's learning curve becomes too flat after 100K learning steps, and this applies to both the original character sequence model and the decomposition models. In light of this, we look at the MT outputs for the head sentences of the testing file from the 100K learning step models, and provide some insight into the errors made by each model. Even though the automated BLEU metric gives the baseline model a higher score (21.56) than the RXD3 model (20.75), the translation of some Chinese MWE terms is better with the RXD3 model. For instance, in Figure 7, the Chinese MWE 商场 (shāngchǎng) in the first sentence is correctly translated as mall by the RXD3 model but translated as shop by the baseline character sequence model; the MWE 楼梯间 (lóutījiān) in the second sentence is correctly translated as stairwell by the RXD3 model while translated as stairs by the baseline. Furthermore, the MWE 近日 (jìnrì), meaning recently, is missed entirely by the original character sequence model, which results in a misleadingly ambiguous translation of an even larger context, i.e., did the chef move to San Francisco (SF) recently or this week? We cannot tell from the character base sequence model; however, the MWE 近日 (jìnrì) is correctly translated by the RXD3 model, and the overall meaning of the sentence is clear: the chef moved to SF recently and was found dead this week.

We also attach the translations of these two sentences by the four other models. With regard to the first sentence MWEs, all four models translate San Francisco mall correctly, as REF and RXD3 do, beating the BASE model. In terms of the second sentence MWEs, BASEMWE and RXD2 drop the MWE 近日 (jìnrì, recently), as the BASE model does, and all four models drop the translation of the MWE 楼梯间 (lóutījiān, stairwell).

[Figure 4 graphic: Chinese radical 刂 (dāo, knife) evolution across scripts, by period: 商 Shang Dynasty (1600–1046BC), oracle bone script; ⻄周 Western Zhou (1045–771BC), bronze inscriptions; 戰國 Warring States period (476–221BC), bronze inscription; (202BC–220), 篆 seal script on silk; 東漢 Eastern Han (from 57AD on), regular script.]

Figure 4: Example Chinese radical, 刂 (dāo), where the character evolved from the leftmost pictogram to the present day regular script (rightmost) containing only two strokes. The two strokes are called 豎 (shù, vertical) + 豎鈎 (shù gōu, vertical with hook). The corresponding character representation is 刀 (dāo).

Word level: 28 / 歲 / 廚師 / 被 / 發現 / 死 / 於 / 舊金⼭ / 一家 / 商場
Character level: 28 歲廚師被發現死於舊金⼭一家商場
Pronunciation: èr bā suì chú shī bèi fā xiàn sǐ yú jiù jīn shān yī jiā shāng chǎng
Radical level: 㥂 횧 止戌 广尌 ¤帀 衤皮 癶 王見 歹匕 方仒 萑臼 人王丷 ⼭ 一 宀豕 亠丷冏 土昜
English Ref.: 28-Year-Old Chef Found Dead at San Francisco Mall

Figure 5: Example of Chinese word to character level changes for MT. Pronunciation is Mandarin Pinyin. The English reference here is taken from the corpus we used for our experiments.

[Figure 6 plot: BLEU scores (y-axis, approximately 12–22) against learning steps (x-axis, 20,000–180,000) for the BASE, RXD1, RXD2, RXD3 and RXD3MWE models.]

Figure 6: Learning curves of the different models, measured with the BLEU metric.

Sentence 1
src: 28 岁 厨师 被 发现 死 于 旧⾦⼭ ⼀家 商场
ref: 28 @-@ Year @-@ Old Chef Found Dead at San Francisco Mall
base: the 28 @-@ year @-@ old chef was found dead in a shop in San Francisco
baseMWE: 28 @-@ year @-@ old chef was found dead at a San Francisco mall
rxd1: the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
rxd2: the 28 @-@ year @-@ old chef was found dead in a San Francisco mall
rxd3: the 28 @-@ year @-@ old chef was found dead at a San Francisco mall
rxd3MWE: 28 @-@ year @-@ old chef was found dead at a San Francisco mall

Sentence 2
src: 近⽇ 刚 搬 ⾄ 旧⾦⼭ 的 ⼀位 28 岁 厨师 本周 被 发现 死 于 当地 ⼀家 商场 的 楼梯间 。
ref: a 28 @-@ year @-@ old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week .
base: a 28 @-@ year @-@ old chef who has moved to San Francisco this week was found dead on the stairs of a local mall .
baseMWE: a 28 @-@ year @-@ old chef who recently moved to San Francisco was found dead this week at a local mall .
rxd1: a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead in a local shopping mall this week .
rxd2: a 28 @-@ year @-@ old San Francisco chef was found dead in a local mall this week .
rxd3: a 28 @-@ year @-@ old chef who recently moved to San Francisco has been found dead on a stairwell in a local mall this week .
rxd3MWE: a 28 @-@ year @-@ old chef recently moved to San Francisco was found dead this week at a local mall .

Figure 7: Samples of the English MT output at 100K learning steps: RXD1, RXD2 and RXD3 are the Chinese decomposition models at levels 1 to 3, BASE is the character sequence model, BASEMWE and RXD3MWE are the character sequence model with MWEs and the decomposition level 3 model with decomposed MWEs, and src/ref represent the source/reference.