Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Lifeng Han¹, Gareth J. F. Jones¹, Alan F. Smeaton² and Paolo Bolzoni
¹ ADAPT Research Centre ² Insight Centre for Data Analytics
School of Computing, Dublin City University, Dublin, Ireland

Abstract

Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about the different decomposition levels of Chinese character representations, radical and strokes, best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate if the combination of decomposed Multiword Expressions (MWEs) can enhance model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.

1 Introduction

Neural Machine Translation (NMT) (Cho et al., 2014; Johnson et al., 2016; Vaswani et al., 2017; Lample and Conneau, 2019) has recently replaced Statistical Machine Translation (SMT) (Brown et al., 1993; Och and Ney, 2003; Chiang, 2005; Koehn, 2010) as the state-of-the-art for Machine Translation (MT). However, research questions still remain, such as how to deal with out-of-vocabulary (OOV) words, how best to integrate linguistic knowledge and how best to correctly translate multi-word expressions (MWEs) (Sag et al., 2002; Moreau et al., 2018; Han et al., 2020a). For OOV word translation for European languages, substantial improvements have been made in terms of rare and unseen words by incorporating sub-word knowledge using Byte Pair Encoding (BPE) (Sennrich et al., 2016). However, such methods cannot be directly applied to Chinese, Japanese and other ideographic languages.

Integrating sub-character level information, such as Chinese ideographs and radicals, as learning knowledge has been used to enhance features in NMT systems (Han and Kuang, 2018; Zhang and Matsumoto, 2018; Zhang and Komachi, 2018). Han and Kuang (2018), for example, explain that the meaning of some unseen or low frequency Chinese characters can be estimated and translated using radicals decomposed from the Chinese characters, as long as the learning model can acquire knowledge of these radicals within the training corpus.

Chinese characters often include two pieces of information, with semantics encoded within radicals and a phonetic part. The phonetic part is related to the pronunciation of the overall character, either the same or similar. For instance, Chinese characters with this two-stroke radical, 刂 (tí dāo páng), ordinarily relate to knife in meaning, such as the Chinese character 劍 (jiàn, sword) and the multi-character expression 鋒利 (fēnglì, sharp). The radical 刂 (tí dāo páng) preserves the meaning of knife because it is a variation of a drawing of a knife evolving from the original bronze inscription (Fig. 4 in Appendices).

Not only can the radical part of a character be decomposed into smaller fragments of strokes but the phonetic part can also be decomposed. Thus there are often several levels of decomposition that can be applied to Chinese characters by combining different levels of decomposition of each part of the Chinese character. As one example, Figure 1 shows the three decomposition levels from our model and the full stroke form of the above mentioned characters 劍 (jiàn) and 鋒 (fēng). To date, little work has been carried out to investigate the full potential of these alternative levels of decomposition of Chinese characters for the purpose of Machine Translation (MT).

Character: 劍 (jiàn) | 鋒 (fēng)
Level-1: 僉 (phonetic, qiān) ⺉ (semantic, knife) | ⾦ (semantic, metal) 夆 (phonetic, féng)
Level-2: 亼吅从 ⺉ | ⼈王丷 夂丰
Level-3: ⼈⼀⼝⼝⼈⼈ ⺉ | ⼈⼀⼟丷 夂三⼁
…
Full-stroke: ⼃㇏⼀⼁�⼀⼁�⼀⼃㇏⼃㇏ ⼁⼅ | ⼃㇏⼀⼀⼁⼂㇀⼀ ㇀㇇㇏⼀⼀⼀⼁
Figure 1: Examples of the decomposition of Chinese characters.

In this work, we investigate Chinese character decomposition, and another area related to Chinese characters, namely Chinese MWEs. We first investigate translation at increasing levels of decomposition of Chinese characters using underlying radicals, as well as the additional Chinese character strokes (corresponding to ever-smaller units), breaking down characters into component parts as this is likely to reduce the number of unknown words. Then, in order to better deal with MWEs, which occur commonly in general contexts (Sag et al., 2002), and working in the opposite direction in terms of meaning representation, we investigate translating larger units of Chinese text, with the aim of restricting translation of larger groups of Chinese characters that should be translated together as one unit. In addition to investigating the effects of decomposing characters we simultaneously apply methods of incorporating MWEs into translation. MWEs can appear in Chinese in a range of ways, such as fixed (or semi-fixed) expressions, metaphor, idiomatic phrases, and institutional, personal or location names, amongst others.

In summary, in this paper, we investigate: (i) the degree to which Chinese radical and stroke sequences represent the original word and character sequences that they are composed of; (ii) the difference in performance achieved by each decomposition level; (iii) the effect of radical and stroke representations in MWEs for MT. Furthermore, we offer:
• an open-source suite of Chinese character decomposition extraction tools;
• a Chinese ⇔ English MWE corpus where Chinese characters have been decomposed, available at radical4mt¹.

¹ https://github.com/poethan/MWE4MT

The rest of this paper is organized as follows: Section 2 provides details of related work in character and radical related MT; Sections 3 and 4 introduce our Chinese decomposition procedure into radical and strokes, and our experimental design; Section 5 provides details of our evaluations from both automatic and human perspectives; Section 6 describes conclusions and plans for future work.

2 Related Work

Chinese character decomposition has been explored recently for MT. For instance, Han and Kuang (2018) and Zhang and Matsumoto (2018) considered radical embeddings as additional features for Chinese → English and Japanese ⇔ Chinese NMT. Han and Kuang (2018) tested a range of encoding models including word+character, word+radical, and word+character+radical. This final setting with word+character+radical achieved the best performance on a standard NIST² MT evaluation data set for Chinese → English. Furthermore, Zhang and Matsumoto (2018) applied radical embeddings as additional features to character level LSTM-based NMT on Japanese → Chinese translation. None of the aforementioned work has, however, investigated the performance of decomposed character sequences and the effects of varied decomposition degrees in combination with MWEs.

² https://www.nist.gov/programs-projects/machine-translation

Subsequently, Zhang and Komachi (2018) developed bidirectional English ⇔ Japanese, English ⇔ Chinese and Chinese ⇔ Japanese NMT with word, character, ideograph (the phonetics and semantics parts of characters are separated) and stroke levels, with experiments showing that the ideograph level was best for ZH→EN MT, while the stroke level was best for JP→EN MT. Although their ideograph and stroke level setting replaced the original character and word sequences, there was no investigation of intermediate decomposition performance, and they only used BLEU score for automated evaluation with no human assessment involved. This gives us inspiration to explore the performance of intermediate level embedding between ideograph and strokes for the MT task.

3 Chinese Character Decomposition

In this section, we introduce a character decomposition approach and the extraction tools which we apply in this work (code will be publicly available). We utilize the open source IDS dictionary³ which was derived from the CHISE (CHaracter Information Service Environment) project⁴. It comprises 88,940 Chinese characters from CJK (Chinese, Japanese, Korean script) Unified Ideographs and the corresponding decomposition sequences of each character. Most characters are decomposed as a single sequence, but characters can have up to four possible decomposed representations.

³ https://github.com/cjkvi/cjkvi-ids
⁴ http://www.chise.org/

Character | Decomposition | Decomposition
丽 (lì) | ⿱⼀⿰⿵⼌⼂⿵⼌⼂[G] | ⿰⿱⼀⿵⼌⼂⿱⼀⿵⼌⼂[T]
具 (jù) | ⿱⿴且⼀八[GTKV] | ⿳⽬⼀八[J]
函 (hán) | ⿶⼐⿻了⿱丷八[GTV] | ⿶⼐⿻丂⿱丷八[JK]
勇 (yǒng) | ⿱甬⼒[GTV] | ⿱⿱龴⽥⼒[JK]
Character construction: ⿱: up-down, ⿰: left-right, ⿵⿶⿴: inside-outside, ⿻: embedded
Figure 2: Character examples from IDS dictionary; the grey parts of decomposition graphs represent the construction structure of the character.

The reason for this is that the character can come from different resources, such as Chinese Hanzi (G, H, T for Mainland, Hong Kong, and Taiwan), Japanese Kanji (J), Korean Hanja (K), and Vietnamese ChuNom (V), etc.⁵ Even though they have the same root of Hanzi, the historical development of languages and writing systems in different territories has resulted in certain degrees of variation in their