arXiv:2104.04497v1 [cs.CL] 9 Apr 2021
Chinese Character Decomposition for Neural MT with Multi-Word Expressions

Lifeng Han¹, Gareth J. F. Jones¹, Alan F. Smeaton² and Paolo Bolzoni
¹ ADAPT Research Centre ² Insight Centre for Data Analytics
School of Computing, Dublin City University, Dublin, Ireland
[email protected], [email protected]

Abstract

Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals or strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate if the combination of decomposed Multiword Expressions (MWEs) can enhance the model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.

1 Introduction

Despite Neural Machine Translation (NMT) (Cho et al., 2014; Johnson et al., 2016; Vaswani et al., 2017; Lample and Conneau, 2019) having recently replaced Statistical Machine Translation (SMT) (Brown et al., 1993; Och and Ney, 2003; Chiang, 2005; Koehn, 2010) as the state-of-the-art, research questions still remain, such as how to deal with out-of-vocabulary (OOV) words, how best to integrate linguistic knowledge and how best to correctly translate multi-word expressions (MWEs) (Sag et al., 2002; Moreau et al., 2018; Han et al., 2020a). For OOV word translation for European languages, substantial improvements have been made in terms of rare and unseen words by incorporating sub-word knowledge using Byte Pair Encoding (BPE) (Sennrich et al., 2016). However, such methods cannot be directly applied to Chinese, Japanese and other ideographic languages.

Integrating sub-character level information, such as Chinese ideographs and radicals, as learning knowledge has been used to enhance features in NMT systems (Han and Kuang, 2018; Zhang and Matsumoto, 2018; Zhang and Komachi, 2018). Han and Kuang (2018), for example, explain that the meaning of some unseen or low frequency Chinese characters can be estimated and translated using radicals decomposed from the Chinese characters, as long as the learning model can acquire knowledge of these radicals within the training corpus.

Chinese characters often include two pieces of information, with semantics encoded within radicals and a phonetic part. The phonetic part is related to the pronunciation of the overall character, either the same or similar. For instance, Chinese characters with this two-stroke radical, 刂 (tí dāo páng), ordinarily relate to knife in meaning, such as the Chinese character 劍 (jiàn, sword) and the multi-character expression 鋒利 (fēnglì, sharp). The radical 刂 (tí dāo páng) preserves the meaning of knife because it is a variation of a drawing of a knife evolving from the original bronze inscription (Fig. 4 in Appendices).

Not only can the radical part of a character be decomposed into smaller fragments of strokes but the phonetic part can also be decomposed. Thus there are often several levels of decomposition that can be applied to Chinese characters by combining different levels of decomposition of each part of the Chinese character. As one example, Figure 1 shows the three decomposition levels from our model and the full stroke form of the above mentioned characters 劍 (jiàn) and 鋒 (fēng). To date, little work has been carried out to investigate the full potential of these alternative levels of decomposition of Chinese characters for the purpose of Machine Translation (MT).

In this work, we investigate Chinese character decomposition, and additionally we investigate another area related to Chinese characters, namely Chinese MWEs. We firstly investigate translation at increasing levels of decomposition of Chinese characters using underlying radicals, as well as the additional Chinese character strokes (corresponding to ever-smaller units), breaking down characters into component parts as this is likely to reduce the number of unknown words. Then, in order to better deal with MWEs which have a common occurrence in the general context (Sag et al., 2002), and working in the opposing direction in terms of meaning representation, we investigate translating larger units of Chinese text, with the aim of restricting translation of larger groups of Chinese characters that should be translated together as one unit. In addition to investigating the effects of decomposing characters we simultaneously apply methods of incorporating MWEs into translation. MWEs can appear in Chinese in a range of ways, such as fixed (or semi-fixed) expressions, metaphor, idiomatic phrases, and institutional, personal or location names, amongst others.

In summary, in this paper, we investigate (i) the degree to which Chinese radical and stroke sequences represent the original word and character sequences that they are composed of; (ii) the difference in performance achieved by each decomposition level; (iii) the effect of radical and stroke representations in MWEs for MT. Furthermore, we offer (available at radical4mt¹):

• an open-source suite of Chinese character decomposition extraction tools;
• a Chinese ⇔ English MWE corpus where Chinese characters have been decomposed.

The rest of this paper is organized as follows: Section 2 provides some related work in character and radical related MT; Sections 3 and 4 introduce our Chinese decomposition procedure into radical and strokes, and the experimental design; Section 5 provides our evaluations from both automatic and human perspectives; Section 6 includes conclusions and plans for future work.

¹ https://github.com/poethan/MWE4MT

2 Related Work

Chinese character decomposition has been explored recently for MT. For instance, Han and Kuang (2018) and Zhang and Matsumoto (2018) considered radical embeddings as additional features for Chinese → English and Japanese ⇔ Chinese NMT. In Han and Kuang (2018), a range of encoding models including word+character, word+radical, and word+character+radical were tested. The final setting with word+character+radical achieved the best performance on a standard NIST² MT evaluation data set for Chinese → English. Furthermore, Zhang and Matsumoto (2018) applied radical embeddings as additional features to character level LSTM-based NMT on Japanese ⇔ Chinese translation. None of the aforementioned work has however investigated the performance of decomposed character sequences and the effects of varied decomposition degrees in combination with MWEs.

Subsequently, Zhang and Komachi (2018) developed bidirectional English ⇔ Japanese, English ⇔ Chinese and Chinese ⇔ Japanese NMT with word, character, ideograph (the phonetics and semantics parts of characters are separated) and stroke levels, with experiments showing that the ideograph level was best for ZH→EN MT, while the stroke level was best for JP→EN MT. Although their ideograph and stroke level setting replaced the original character and word sequences, there was no investigation of intermediate decomposition performance, and they only used BLEU score as the automated evaluation with no human assessment involved. This gives us inspiration to explore the performance of intermediate level embedding between ideograph and strokes for the MT task.

² https://www.nist.gov/programs-projects/machine-translation

3 Chinese Character Decomposition

We introduce the character decomposition approach and the extraction tools which we apply in this work (code will be publicly available). We utilize the open source IDS dictionary³ which was derived from the CHISE (CHaracter Information Service Environment) project⁴. It is comprised of 88,940 Chinese characters from CJK (Chinese, Japanese, Korean script) Unified Ideographs and the corresponding decomposition sequences of each character.

³ https://github.com/cjkvi/cjkvi-ids
⁴ http://www.chise.org/

Level         劍 (jiàn)                                  鋒 (fēng)
Level-1:      僉 (phonetic, qiān) ⺉ (semantic, knife)    ⾦ (semantic, metal) 夆 (phonetic, féng)
Level-2:      亼吅从 ⺉                                   ⼈王丷 夂丰
Level-3:      ⼈⼀⼝⼝⼈⼈ ⺉                              ⼈⼀⼟丷 夂三⼁
…             …                                          …
Full-stroke:  ⼃㇏⼀⼁�⼀⼁�⼀⼃㇏⼃㇏ ⼁⼅                  ⼃㇏⼀⼀⼁⼂㇀⼀ ㇀㇇㇏⼀⼀⼀⼁

Figure 1: Examples of the decomposition of Chinese characters.

Character     Decomposition              Decomposition
丽 (lì)       ⿱⼀⿰⿵⼌⼂⿵⼌⼂[G]      ⿰⿱⼀⿵⼌⼂⿱⼀⿵⼌⼂[T]
具 (jù)       ⿱⿴且⼀八[GTKV]            ⿳⽬⼀八[J]
函 (hán)      ⿶⼐⿻了⿱丷八[GTV]         ⿶⼐⿻丂⿱丷八[JK]
勇 (yǒng)     ⿱甬⼒[GTV]                 ⿱⿱龴⽥⼒[JK]

Character construction: ⿱: up-down, ⿰: left-right, ⿵⿶⿴: inside-outside, ⿻: embedded

Figure 2: Character examples from IDS dictionary; the grey parts of decomposition graphs represent the construction structure of the character.

Most characters are decomposed as a single sequence, but characters can have up to four possible decomposed representations. The reason for this is that the character can come from different resources, such as Chinese Hanzi (G, H, T for Mainland, Hong Kong, and Taiwan), Japanese Kanji (J), Korean Hanja (K), and Vietnamese ChuNom (V), etc.⁵ Even though they have the same root of Hanzi, the historical development of languages and writing systems in different territories has resulted in their certain degree of variations