Chinese Character Decomposition for Neural MT with Multi-Word Expressions

Lifeng Han¹, Gareth J. F. Jones¹, Alan F. Smeaton² and Paolo Bolzoni
¹ ADAPT Research Centre, ² Insight Centre for Data Analytics
School of Computing, Dublin City University, Dublin, Ireland
Abstract

Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals or strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT.

[...] incorporating sub-word knowledge using Byte Pair Encoding (BPE) (Sennrich et al., 2016). However, such methods cannot be directly applied to Chinese, Japanese and other ideographic languages.

Integrating sub-character level information, such as Chinese ideographs and radicals, as learning knowledge has been used to enhance features in NMT systems (Han and Kuang, 2018; Zhang and Matsumoto, 2018; Zhang and Komachi, 2018). Han and Kuang (2018), for example, explain that the meaning of some unseen or low frequency Chinese characters can be estimated and translated using radicals decomposed from the Chinese characters, as long as the learning model can acquire knowledge of these radicals within the training corpus.

Chinese characters often include two pieces of information, with semantics encoded within radi-