Combining Character and Word Information in Neural Machine Translation Using a Multi-Level Attention

Huadong Chen†, Shujian Huang†∗, David Chiang‡, Xinyu Dai†, Jiajun Chen†
†State Key Laboratory for Novel Software Technology, Nanjing University
{chenhd,huangsj,daixinyu,chenjj}@nlp.nju.edu.cn
‡Department of Computer Science and Engineering, University of Notre Dame
[email protected]

Abstract

Natural language sentences, being hierarchical, can be represented at different levels of granularity, like words, subwords, or characters. But most neural machine translation systems require the sentence to be represented as a sequence at a single level of granularity. It can be difficult to determine which granularity is better for a particular translation task. In this paper, we improve the model by incorporating multiple levels of granularity. Specifically, we propose (1) an encoder with character attention which augments the (sub)word-level representation with character-level information; (2) a decoder with multiple attentions that en-

...of (sub)words are based purely on their contexts, but the potentially rich information inside the unit itself is seldom explored. Taking the Chinese word 被打伤 (bei-da-shang) as an example, the three characters in this word are a passive voice marker, "hit", and "wound", respectively. The meaning of the whole word, "to be wounded", is fairly compositional. But this compositionality is ignored if the whole word is treated as a single unit.

Secondly, obtaining word or subword boundaries can be non-trivial. For languages like Chinese and Japanese, a word segmentation step is needed, which must usually be trained on labeled data.
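To illustrate the segmentation step, here is a minimal Python example using jieba, a popular off-the-shelf Chinese segmenter; the tool choice is ours for illustration, and the paper does not prescribe a specific segmenter:

```python
import jieba  # statistical Chinese word segmenter

sentence = "他被打伤了"  # "He was wounded."
tokens = list(jieba.cut(sentence))
print(tokens)  # e.g. ['他', '被打伤', '了']; the exact split depends on jieba's dictionary
```

Any segmentation error here (for example, splitting 被打伤 differently) propagates into the word-level vocabulary, which is part of the motivation for keeping character-level information available.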
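To make proposal (1) from the abstract concrete, the sketch below shows one way an encoder can attend over the characters inside each (sub)word (so a word like 被打伤 can draw on the embeddings of 被, 打, and 伤) and gate the resulting summary into the (sub)word representation. This is a minimal PyTorch sketch under our own assumptions; the module names, dot-product scoring, and gated combination are illustrative, not the paper's exact architecture.

```python
# Minimal sketch: augmenting (sub)word states with attention over their own
# characters. Dimensions, scoring, and gating are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAugmentedEncoder(nn.Module):
    def __init__(self, n_chars, char_dim, word_dim):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.query = nn.Linear(word_dim, char_dim)  # word state -> attention query
        self.proj = nn.Linear(char_dim, word_dim)   # char summary -> word space
        self.gate = nn.Linear(2 * word_dim, word_dim)

    def forward(self, word_states, char_ids):
        # word_states: (batch, n_words, word_dim), contextual (sub)word vectors
        # char_ids:    (batch, n_words, max_chars), 0 = padding; each word is
        #              assumed to contain at least one real character
        chars = self.char_emb(char_ids)                            # (b, w, c, d_c)
        q = self.query(word_states).unsqueeze(2)                   # (b, w, 1, d_c)
        scores = (q * chars).sum(-1)                               # (b, w, c) dot products
        scores = scores.masked_fill(char_ids == 0, float("-inf"))  # ignore padding
        alpha = F.softmax(scores, dim=-1)                          # attention over characters
        summary = self.proj((alpha.unsqueeze(-1) * chars).sum(2))  # (b, w, word_dim)
        g = torch.sigmoid(self.gate(torch.cat([word_states, summary], dim=-1)))
        return g * word_states + (1 - g) * summary                 # character-aware word states

# Example: 2 sentences, 5 words each, up to 4 characters per word.
enc = CharAugmentedEncoder(n_chars=6000, char_dim=64, word_dim=256)
out = enc(torch.randn(2, 5, 256), torch.randint(1, 6000, (2, 5, 4)))
print(out.shape)  # torch.Size([2, 5, 256])
```

Proposal (2), the decoder with multiple attentions, would then attend over representations at both granularities during generation; we leave that component out of the sketch.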