Neural Machine Translation of Logographic Languages Using Sub-character Level Information

Longtu Zhang and Mamoru Komachi
Tokyo Metropolitan University
6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan
[email protected] [email protected]

Abstract

Recent neural machine translation (NMT) systems have been greatly improved by encoder-decoder models with attention mechanisms and sub-word units. However, important differences between languages with logographic and alphabetic writing systems have long been overlooked. This study focuses on these differences and uses a simple approach, utilizing decomposed sub-character level information for logographic languages, to improve the performance of NMT systems. Our results indicate that our approach not only improves the translation capabilities of NMT systems between Chinese and English, but also further improves NMT systems between Chinese and Japanese, because it utilizes the shared information brought by similar sub-character units.

1 Introduction

Neural machine translation (NMT) systems (Cho et al., 2014) based on sequence-to-sequence models (Sutskever et al., 2014) have recently become the de facto standard architecture. The models use attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015) to keep records of all encoding results, and can focus on particular parts of these results during decoding, so that the model can produce longer and more accurate translations. Sub-word units are another technique, first introduced by Sennrich et al. (2016) through their application of the byte pair encoding (BPE) algorithm; they are used to break up words in both source and target sentences into sequences of smaller units, learned without supervision. This alleviates the risk of producing <unk> symbols when the model encounters infrequent "unknown" words, also known as the out-of-vocabulary (OOV) problem. Moreover, sub-word units, which can be viewed as learned stems and affixes, can help the NMT model better encode the source sentence and decode the target sentence, particularly when the source and target languages share some similarities.

Almost all of the methods used to improve NMT systems were developed for alphabetic languages such as English, French, and German as either the source or target language, or both. An alphabetic language typically uses an alphabet: a small set of letters (basic writing symbols), each of which roughly represents a phoneme in the spoken language. Words are composed of ordered letters, and sentences are composed of space-segmented, ordered words. However, in other major writing systems, namely logographic (or character-based) languages such as Chinese, Japanese, and traditional Korean, strokes are used to construct ideographs; ideographs are used to construct characters, which are the basic units of meaningful words. Words can then further compose sentences. In alphabetic languages, sub-word units are easy to identify, whereas in logographic languages, a similar effect can be achieved only if sub-character level information is taken into consideration.[1]

[1] Taking the ASPEC corpus as an example, the average word lengths are roughly 1.5 characters (Chinese words, tokenized by the Jieba tokenizer), 1.7 characters (Japanese words, tokenized by the MeCab tokenizer), and 5.7 characters (English words, tokenized by the Moses tokenizer), respectively. Therefore, when a sub-word model of similar vocabulary size is applied directly, English sub-words usually contain several letters, which is more effective in facilitating NMT, whereas Chinese and Japanese sub-words are largely just characters.
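As a rough illustration of the token-length gap noted in footnote [1], the average number of characters per token in a pre-tokenized corpus can be computed as follows. This is only a minimal sketch: the file names are hypothetical, and the files are assumed to be whitespace-tokenized, one sentence per line.

```python
def average_token_length(path):
    """Mean number of characters per whitespace-separated token in a pre-tokenized file."""
    total_chars = total_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            total_tokens += len(tokens)
            total_chars += sum(len(t) for t in tokens)
    return total_chars / total_tokens if total_tokens else 0.0

# e.g., compare average_token_length("aspec.zh.jieba"),
#               average_token_length("aspec.ja.mecab"),
#               average_token_length("aspec.en.moses")
```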

Having noticed this significant difference between these two writing systems, Shi et al. (2015), Liu et al. (2017), Peng et al. (2017), and Cao et al. (2017) used stroke-level information for logographic languages when constructing word embeddings; Toyama et al. (2017) used visual information for strokes and Japanese Kanji radicals in a text classification task.[2]

[2] To be more precise, there is another so-called syllabic writing system, which uses individual symbols to represent syllables rather than phonemes. Japanese hiragana and katakana are actually syllabic symbols rather than ideographs. In this paper, we focus only on the logographic part.

Some studies have performed NMT tasks using various sub-word "equivalents". For instance, Du and Way (2017) trained factored NMT models using "Pinyin"[3] sequences on the source side. Unfortunately, they did not apply a BPE algorithm during training, and their model also cannot perform factored decoding. Wang et al. (2017)[4] directly applied a BPE algorithm to character sequences before building NMT models. However, they did not take advantage of sub-character level information during the training of the sub-word and NMT models. Kuang and Han (2018) also attempted to use a factored encoder for Chinese NMT systems using radical data. It is worth noting that although the idea of using ideographs and strokes in NLP tasks (particularly in NMT tasks) is not new, no previous NMT research has focused on the decoding process. If it is also possible to construct an ideograph/stroke decoder, we can further investigate translations between logographic languages. Additionally, no NMT research has previously used stroke data.

[3] An official Romanization system for standard Chinese in mainland China. Pinyin includes both letters and diacritics, which represent phonemic and tonal information, respectively.
[4] We use the term "logographic" to refer to writing systems such as Chinese characters and Japanese Kanji, and "ideograph" to refer to the character components.

To summarize, there are three potential information gaps associated with current studies on NMT systems for logographic languages using sub-character level data: 1) no research has been performed on the decoding process; 2) no studies have trained models using sub-character level sub-words; and 3) no studies have attempted to build NMT models for logographic language pairs, despite their sharing many similarities. This study investigates whether sub-character information can facilitate both encoding and decoding in NMT systems and between logographic language pairs, and aims to determine the best sub-character unit granularity for each setting.

The main contributions of this study are three-fold:

1. We create a sub-character database of Chinese character-based languages, and conduct MT experiments using various types of sub-character NMT models.

2. We facilitate the encoding or decoding process by using sub-character sequences on either the source or target side of the NMT system. This improves translation performance; if sub-character information is shared between the encoder and decoder, it further benefits the NMT system.

3. Specifically, Chinese ideograph data and Japanese stroke data are the best choices for the relevant NMT tasks.

2 Background

2.1 NMT with Attention Mechanisms and Sub-word Units

In this study, we applied a sequence-to-sequence model with an attention mechanism (Bahdanau et al., 2015). The basic recurrent unit is the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997). Because of the nature of the sequence-to-sequence model, the vocabulary size must be limited for the computational efficiency of the Softmax function. In such cases, the decoder outputs an <unk> symbol for any word that is not in the vocabulary, which harms the translation quality. This is called the out-of-vocabulary (OOV) problem.
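For reference, the additive attention cited here (Bahdanau et al., 2015) can be summarized in its standard form, where s_{i-1} is the previous decoder state, h_j is the j-th encoder annotation, and T_x is the source length:

```latex
e_{ij} = v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j .
```

The context vector c_i is then combined with the decoder state to predict the next target symbol through the Softmax layer mentioned above.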
Sub-word unit algorithms (such as the BPE algorithm) first break up a sentence into the smallest possible units. Then, two adjacent units at a time are merged according to some criterion (e.g., the co-occurrence frequency). Finally, after n steps, the algorithm collects the merged units as "sub-word" units. By using sub-word units, it is possible to represent a large number of words with a small vocabulary. Originally, sub-word units were applied only to unknown words (Sennrich et al., 2016). However, in the more recent GNMT (Wu et al., 2016) and Transformer (Vaswani et al., 2017) systems, all words are broken up into sub-word units to better represent the shared information.

For alphabetic languages, researchers have indicated that sub-word units are useful for solving OOV problems, and that shared information can further improve translation quality. The Sentencepiece project[5] compared several combinations of word-pieces (Kudo, 2018) and BPE sub-word models in English/Japanese NMT tasks; the sub-word units were trained on character (Japanese Kanji and Hiragana/Katakana) sequences. Similarly, Wang et al. (2017) attempted to compare the effects of different segmentation methods on NMT tasks, including "BPE" units trained on Chinese character sequences.

[5] https://github.com/google/sentencepiece
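The merge procedure described in Section 2.1 can be sketched in a few lines of Python. This is a toy illustration of the BPE idea, not the implementation used in this paper; for logographic data the "smallest possible units" would be characters, ideographs, or strokes, depending on the chosen granularity.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {token-as-symbol-tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the given adjacent pair with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(tokenized_sentences, num_merges):
    """Learn BPE merge operations from whitespace-tokenized sentences."""
    # Start from the smallest units: the individual symbols of each token.
    vocab = Counter(tuple(tok) for sent in tokenized_sentences for tok in sent.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```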

2.2 Sub-character Units in NLP

In alphabetic languages, the smallest unit for sub-word unit training is the letter; in character-based languages, the smallest units should be sub-character units, such as ideographs or strokes. Because sub-character units are shared across different characters and carry similar meanings, it is possible to build a significantly smaller vocabulary that covers a large amount of training data. This has been researched quite extensively within tasks such as word embeddings, as mentioned previously.

Character | Meaning | Semantic ideograph | Phonetic ideograph | Pinyin
驰 | run | 马 (horse) | 也 | chí
池 | pool | 水 (氵) (water) | 也 | chí
施 | impose | 方 (direction) | 也 | shī
弛 | loosen | 弓 (bow) | 也 | chí
地 | land | 土 (soil) | 也 | dì
驱 | drive | 马 (horse) | 区 | qū

Table 1: Examples of decomposed ideographs of Chinese characters. Composing ideographs of different functionality may be shared across different characters.

As we can see from the examples in Table 1, there are several independent Chinese characters. Each character can be split into at least two ideographs: a semantic ideograph and a phonetic ideograph.[6] More importantly, the same ideograph can be shared by different characters denoting similar meanings. For example, the first five characters (驰, 池, 施, 弛, and 地) have similar pronunciations (and they are written similarly in Pinyin) because they share the same phonetic ideograph "也". Similarly, semantic ideographs can be shared across characters and denote a similar semantic meaning. For example, the first character "驰" and the last character "驱" share the same semantic ideograph "马" (meaning "horse"), and their semantic meanings are closely related ("run" and "drive", respectively). A few ideographs can also be treated as standalone characters.

[6] Semantic ideographs denote the meaning of a character, whereas phonetic ideographs denote the pronunciation.

To the best of our knowledge, however, no research has been performed on logographic language NMT beyond character-level data, except for the work of Du and Way (2017), who attempted to use Pinyin sequences instead of character sequences in Chinese–English NMT tasks. Considering the fact that there are a large number of homophones and homonyms in Chinese languages, it was difficult for this method to reconstruct characters in the decoding step.

3 NMT Using Sub-character Level Units

3.1 Ideograph Information

When building the NMT vocabulary, the use of sub-characters (instead of words, characters, and character-level sub-words) can greatly condense the vocabulary. For example, a vocabulary can be decreased from 6,000 to 10,000 character types[7] to hundreds[8] of ideographs. Table 2 presents two Chinese words composed of four different characters that have very close meanings. Character-based NMT models treat these characters separately as one-hot vectors. In contrast, if the two words are broken down into ideograph sequences, they overlap significantly, and only two ideographs are needed to compose the vocabulary of the two words. The computational load is reduced, and the chances of training the neurons responsible for low-frequency vocabulary items increase.

[7] According to the ASPEC corpus.
[8] 214 as defined in the UNICODE 10.0 standard and 517 as defined in the CNS11643 charset.

Word | Meaning | Ideographs
树木 | Wood | 木对木
森林 | Forest | 木木木木木

Table 2: Examples of multi-character words in Chinese and their ideograph sequences.

Moreover, sub-character units can serve as building blocks for constructing characters that are not present in the training data, because all CJK characters are designed to be composed of a limited number of ideographs in the UNICODE standards.
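To make the vocabulary argument concrete, the following sketch decomposes the two words of Table 2 with a tiny hand-written ideograph table (it covers only this example, with the decompositions taken from Table 2; the full mapping used in this paper is derived from the CNS11643 component data) and compares the resulting vocabulary sizes:

```python
# Toy decomposition table covering only the Table 2 example.
IDEOGRAPHS = {"树": "木对", "木": "木", "森": "木木木", "林": "木木"}

def to_ideographs(word):
    """Map a character sequence to the concatenation of its ideograph sequences."""
    return "".join(IDEOGRAPHS.get(ch, ch) for ch in word)

words = ["树木", "森林"]
char_vocab = {ch for w in words for ch in w}               # {树, 木, 森, 林}: 4 types
ideo_vocab = {i for w in words for i in to_ideographs(w)}  # {木, 对}: 2 types
print(to_ideographs("树木"), to_ideographs("森林"))          # 木对木 木木木木木
print(len(char_vocab), len(ideo_vocab))                    # 4 2
```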

3.2 Stroke Information

All ideographs can be further decomposed into strokes, which are smaller units with an even smaller number of types. Therefore, we also propose training our models on stroke sequences. There are five basic stroke types for Chinese characters and Japanese Kanji: "horizontal" (一), "vertical" (丨), "right falling" (㇃), "left falling" (丿), and "break" (㇕). Each stroke type can be further sub-categorized into several stroke variations. For example, left falling strokes include both long and short left fallings (㇓ and ㇒), while a break has many more variations, such as ㇄, ㇅, ㇆, and ㇉ (details can be found in Appendix A).

In practice, the CNS11643 charset[9] is used to transform each character into a stroke sequence; unfortunately, only "stroke-type" information is available there. In this study, we manually transcribed all ideographs into stroke sequences using 33 pre-defined strokes.

[9] The CNS11643 charset is published and maintained by the Taiwan government. http://www.cns11643.gov.tw/AIDB/welcome_en.do

3.3 Character Decomposition

The CNS11643 charset is used to facilitate character decomposition, where Chinese, Japanese, and Korean characters are merged into a single character type based on similarities in their forms and meanings. This is potentially beneficial; for example, if Chinese and Japanese vocabularies are built, they will authentically share some common types. There are 517 so-called "components" (i.e., ideographs) pre-defined in CNS11643, which ensures that all characters can be divided into certain sequences of components. For example, the character "可" can be split into "丁" and "口", and the character "君" can be split into "尹" and "口". Details can be found on the CNS11643 website.[10] Using this ideograph decomposition information, all Chinese and Japanese sentence data can be transformed into new ideograph sequences; then, using the manually transcribed stroke decomposition data introduced in Section 3.2, we can also obtain new stroke sequences.

[10] http://www.cns11643.gov.tw/search.jsp?ID=13

Note that although there are no clear indications of how the components/strokes are structured together, the sequence potentially contains structural information, because the writing of characters always follows a certain order, such as "up-down", "outside-in", etc. We also note that UNICODE 10.0 has introduced symbols indicating sub-character structures (Ideographic Description Characters), which provide a clearer indication of character compositions. We will make further use of this information in future studies.

To ensure that there are no duplicated ideograph and stroke sequences for different characters and multi-character words, we post-tag the duplicated sequences using "_1", "_2", etc. Table 3 shows an example of character decomposition in Chinese and Japanese.[11]

[11] For example, the ideograph and stroke sequences for the character 景 are the same as those for the character 晾 (meaning "to dry in the sun"). However, these two characters have different architectures ("top-down" vs. "left-right"). Post-tags are thus appended in order to distinguish them. Similarly, the character 风 shares the same ideograph and stroke sequences with another character, and thus must be post-tagged.

Language | Word
JP-character | 風 景
JP-ideograph | 几一虫 日亠口小_1
JP-stroke | 丿㇈㇐㇑㇕㇐㇑㇀㇔ ㇑㇕㇐㇐㇔㇐㇑㇆㇐㇚㇒㇔_1
CN-character | 风 景
CN-ideograph | 几㐅 日亠口小_1
CN-stroke | 丿㇈㇒㇔ ㇑㇕㇐㇐㇔㇐㇑㇆㇐㇚㇒㇔_1
EN | landscape

Table 3: The Japanese word 風景 and the Chinese word 风景 both mean "landscape" in English; they differ only in the middle part of the first character. Note the "_1" tags at the ends of some decomposed sequences, which distinguish between possible duplications.
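The post-tagging step described in Section 3.3 can be sketched as follows. This is a minimal illustration only: the decompose function is assumed to come from the CNS11643-derived data, and the helper name and example are hypothetical.

```python
def build_tagged_map(decompose, characters):
    """Map each character to its decomposed sequence, appending "_1", "_2", ... to later
    characters that collide with an earlier one, so that the mapping stays one-to-one
    (and is therefore reversible when decoding on the target side)."""
    seen = {}      # decomposed sequence -> number of characters already mapped to it
    mapping = {}   # character -> possibly post-tagged sequence
    for ch in characters:
        seq = decompose(ch)
        n = seen.get(seq, 0)
        mapping[ch] = seq if n == 0 else f"{seq}_{n}"
        seen[seq] = n + 1
    return mapping

# e.g., if decompose("景") == decompose("晾"), one of the two receives the "_1" suffix,
# mirroring the tags shown in Table 3.
```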

4 Experiments on Chinese–Japanese–English Translation

To answer our research questions, we set up a series of experiments to compare NMT models of logographic languages trained on word sequences, character-level sub-word unit sequences, and ideograph- and stroke-level sub-word unit sequences. We performed two lines of experiments:

1. We trained NMT models between combinations of a logographic and an alphabetic language, i.e., Japanese or Chinese paired with English. In each model, we varied the data granularity of the logographic language, using "character level" or "sub-character level" (ideograph level and stroke level) granularities. We used the character level NMT models as our baselines, and investigated whether the sub-character level NMT models could outperform them.

2. We trained NMT models between a pair of logographic languages, i.e., Chinese and Japanese. Similarly, we used data sets of different granularities: 1) models without sub-character level data; 2) models with sub-character level data on both the source and target sides (to confirm the results of the previous experiment), with or without shared vocabularies (namely, ideograph models, stroke models, ideograph–stroke models, stroke–ideograph models, and ideograph/stroke models with shared vocabularies); and 3) Pinyin baselines following Du and Way (2017), where Pinyin word sequences with tones and character sequences with Pinyin factors are used on the encoder side.

4.1 Dataset

We trained our baselines and proposed models using Chinese, Japanese, and English. The Asian Scientific Paper Excerpt Corpus (ASPEC; Nakazawa et al., 2016) and the Casia2015 corpus[12] were used for this purpose.

[12] http://nlp.nju.edu.cn/cwmt-wmt/, provided by the Institute of Automation, Chinese Academy of Sciences.

ASPEC contains a Japanese–English paper abstract corpus of 3 million parallel sentences (ASPEC-JE) and a Japanese–Chinese paper excerpt corpus of 680,000 parallel sentences (ASPEC-JC). We used the first million confidently aligned parallel sentences of ASPEC-JE and all of the ASPEC-JC data to cover the Japanese–English and Japanese–Chinese language pairs. The Casia2015 corpus contains approximately 1 million parallel Chinese–English sentences, all of which were used to cover the Chinese–English language pair.

During training, the maximum length hyperparameter was adjusted to ensure 90% coverage of the training data. For development and testing, the ASPEC corpus has an official split into development and test sets; however, because the Casia2015 corpus has no such split, we made random selections from it of 1,000 sentences each for the development and test sets.

4.2 Settings

Different pre-tokenization methods were applied to the data in the three languages (where applicable). A Moses tokenizer was applied to the English data; a Jieba[13] tokenizer using the default dictionary was applied to the Chinese data; and a MeCab[14] tokenizer using the IPA dictionary was applied to the Japanese data. For the Pinyin baselines, the pypinyin[15] Python library was used to transcribe the Chinese character sequences into Pinyin sequences.

[13] https://github.com/fxsjy/jieba
[14] http://taku910.github.io/mecab/
[15] https://github.com/mozillazg/python-pinyin

In both of the experiment lines discussed above, data at the "word", "character", "ideograph", and "stroke" levels were used in combination. For "word" level data, only dictionary-based segmentation was applied; for the other three levels of data, byte pair encoding (BPE) models were trained and applied, with a vocabulary size of 8,000. In the second line of experiments, where both the source and target sides were logographic languages, we added "character" level data without BPE ("char") for comparison. Additionally, shared vocabularies were applied when both the source and target sides had the same data granularity level (meaning that both would use the same vocabulary).[16]

[16] The shared vocabulary can be trained by a BPE model on a concatenated corpus of source and target sentences.
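A sketch of this per-language pre-tokenization step is shown below. It assumes the jieba, mecab-python3, and sacremoses packages as stand-ins for the Jieba, MeCab, and Moses tokenizers named above; the exact bindings, dictionaries, and options used by the authors (beyond "default" and "IPA") are not specified in the paper. The Pinyin baselines would add an analogous step that maps the Chinese character sequence to Pinyin with the pypinyin library.

```python
import jieba                      # Chinese word segmentation
import MeCab                      # Japanese morphological analysis (mecab-python3; needs a dictionary installed)
from sacremoses import MosesTokenizer

mecab = MeCab.Tagger("-Owakati")  # wakati-gaki: space-separated surface forms
moses = MosesTokenizer(lang="en")

def pretokenize(line, lang):
    """Return a whitespace-tokenized version of one sentence."""
    if lang == "zh":
        return " ".join(jieba.cut(line))
    if lang == "ja":
        return mecab.parse(line).strip()
    return " ".join(moses.tokenize(line, escape=False))

print(pretokenize("The system was trained on parallel sentences.", "en"))
```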

A basic RNNsearch model (Bahdanau et al., 2015) with two layers of long short-term memory (LSTM) units was used. The hidden size was 512. A normalized Bahdanau attention mechanism was applied at the output layer of the decoder. We developed our model based on TensorFlow[17] and its neural machine translation tutorial.[18]

[17] https://github.com/tensorflow
[18] https://github.com/tensorflow/nmt

The model was trained on a single GeForce GTX TITAN X GPU. During training, the SGD optimizer was used, and the learning rate was set to 1.0. The training batch size was 128, and the total number of global training steps was 250,000. We also decayed the learning rate as training progressed: after two-thirds of the training steps, we set the learning rate to be four times smaller until the end of training. Additionally, we set the drop-out rate to 0.2 during training.

BLEU was used as the evaluation metric in our experiments. For Chinese and Japanese output, KyTea tokenization was applied before computing BLEU, following the WAT (Workshop on Asian Translation) leaderboard standard. To validate the significance of our results, we ran bootstrap re-sampling (Koehn, 2004) for all results using Travatar (Neubig, 2013) at a significance level of p = 0.0001.
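For reference, paired bootstrap re-sampling compares two systems by repeatedly re-scoring them on resampled test sets. The sketch below is an illustration only (the paper uses the Travatar implementation); it assumes sacrebleu is installed and that the hypothesis and reference lists are already tokenized consistently (e.g., with KyTea for Chinese and Japanese) and have equal length.

```python
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Estimate how often system A beats system B in corpus BLEU on bootstrap samples."""
    rng = random.Random(seed)
    indices = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]   # resample sentences with replacement
        a = [sys_a[i] for i in sample]
        b = [sys_b[i] for i in sample]
        r = [refs[i] for i in sample]
        bleu_a = sacrebleu.corpus_bleu(a, [r], tokenize="none").score  # inputs pre-tokenized
        bleu_b = sacrebleu.corpus_bleu(b, [r], tokenize="none").score
        if bleu_a > bleu_b:
            wins_a += 1
    return 1.0 - wins_a / n_samples   # approximate p-value for "A is not better than B"
```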
4.3 Results

4.3.1 NMT of Logographic and Alphabetic Language Pairs

Table 4 shows the experimental results for the Japanese/English and Chinese/English language pairs in both translation directions. Generally, for each of the experimental settings, the models using ideograph and stroke data outperformed the baseline systems, regardless of the language pair or translation direction. However, for the Japanese/English language pair, the stroke sequence models performed better, whereas for the Chinese/English language pair, the ideograph sequence models worked better. The reasons for these differences are discussed in detail in Section 5.

English–Japanese NMT | BLEU
EN_word → JP_word | 36.1
EN_word → JP_character | 38.3
EN_word → JP_ideograph | 40.3*
EN_word → JP_stroke | 41.3*

Japanese–English NMT | BLEU
JP_word → EN_word | 25.5
JP_character → EN_word | 26.3
JP_ideograph → EN_word | 26.8*
JP_stroke → EN_word | 27.0*

English–Chinese NMT | BLEU
EN_word → CN_word | 11.8
EN_word → CN_character | 10.3
EN_word → CN_ideograph | 14.6*
EN_word → CN_stroke | 14.1*

Chinese–English NMT | BLEU
CN_word → EN_word | 14.7
CN_character → EN_word | 14.5
CN_ideograph → EN_word | 15.6*
CN_stroke → EN_word | 15.5*

Table 4: Experimental results (BLEU scores) of NMT systems for the Japanese/English and Chinese/English language pairs. Scores marked by * are statistically significant at p = 0.0001.

4.3.2 NMT of Logographic Language Pairs

Table 5 shows the results for all baselines and proposed models. Among the character-level baselines, the "char" models and "bpe" models outperformed the "word" models in both translation directions. When we applied a shared vocabulary to the "bpe" models, they achieved the best baseline BLEU scores in both translation directions. These character-level baselines conform with previous studies indicating that sub-word units improve the performance of NMT systems, and that whenever the source and target side data have similarities in their writing systems, shared vocabularies further enhance performance.

The sub-character level models in which only one side uses sub-character level data aim to replicate the results presented in Section 4.3.1. For the Japanese–Chinese translation direction, half of these models showed a significant improvement over the baselines, whereas for the Chinese–Japanese translation direction, five out of six models showed significant improvements.

When both the source and target sides used the same sub-character level data (either ideograph or stroke data), the experimental results also showed a significant improvement over the character baselines. Additionally, the ideograph models outperformed the stroke models. When shared vocabularies were applied to these models, the ideograph models exhibited slight performance improvements (0.1 to 0.4 BLEU points), while the stroke models exhibited dramatically decreased performance (0.9 to 1.1 BLEU points). However, no model here outperformed the sub-character baselines.

To further exploit the power of sub-character units, the last group of models, having different levels of sub-character units on the source and target sides, was trained. The results conform with what we found in Section 4.3.1: the models using Chinese ideograph data and Japanese stroke data exhibited the best performance, regardless of whether these were applied on the source or target side. For the Japanese–Chinese translation direction, the best BLEU score was 33.8, produced by the Japanese-stroke to Chinese-ideograph model; for the Chinese–Japanese direction, the best BLEU score was 43.9, produced by the Chinese-ideograph to Japanese-stroke model.

JP–CN NMT | CN_word | CN_char | CN_bpe | CN_ideograph | CN_stroke
JP_word | 29.6 | - | - | 30.8 | 30.3
JP_char | - | 31.6 | - | 32.0* | 32.1*
JP_bpe | - | - | 31.5 (31.7) | 31.6 | 31.7
JP_ideograph | 30.4 | 33.1* | 33.3* | 32.0* (32.4*) | 33.4*
JP_stroke | 30.3 | 33.4* | 32.6* | 33.8* | 32.1* (31.2)

CN–JP NMT | JP_word | JP_char | JP_bpe | JP_ideograph | JP_stroke
CN_word | 40.0 (40.0) | - | - | 40.5 | 40.1
CN_char | 42.1 (40.4) | 41.7 | - | 43.1* | 42.2*
CN_bpe | 42.1 | - | 42.0 (42.3) | 43.1* | 42.2*
CN_ideograph | 43.2* | 43.5* | 43.0* | 42.6* (42.7*) | 43.9*
CN_stroke | 43.0* | 43.3* | 42.5* | 42.9* | 42.2* (41.1)

Table 5: Experimental results (BLEU scores) for Japanese/Chinese NMT systems. The row and column headers indicate which source and target data were used in training. In particular, "word" and "char" denote word and character level data without BPE segmentation, while "bpe" (character level), "ideograph", and "stroke" (sub-character level) denote data with BPE segmentation. The scores in parentheses are for models with a shared vocabulary, wherever applicable. The italic numbers represent the two Pinyin baselines used for comparison, namely the "WdPyT" model, which uses Pinyin words with tones as the source data, and the "factored-NMT" model, which uses Pinyin characters as factors (Du and Way, 2017); note that these two baselines can only have Chinese data on the encoder side. A superscript * indicates that a score is significantly better than the best baseline result.

5 Discussion

5.1 Translation Examples

Table 6 shows some translation examples. There is a rare proper noun, "松下電器 (Matsushita Electric)", in the source sentence, which is an OOV item. The word baseline model cannot decode it; therefore, an <unk> symbol is produced. The character baseline model avoids the OOV problem. However, the underlined parts in both baseline translations appear to be word-for-word translations of the Japanese source phrase ("松下 電器 グループ で は"), which becomes a prepositional phrase in Chinese ("在 松下 电器 集团 中", "in the Matsushita Electric Group"). This makes the translation ungrammatical, because the sentence is left without a noun phrase as its subject. Our best model (i.e., the sub-character based NMT model using Japanese stroke data and Chinese ideograph data) solves both problems by better encoding the source sentence, producing a translation that contains no OOV symbol and has a noun phrase as the sentence subject.

5.2 Strokes vs. Ideographs

The experimental results show that in NMT models, different logographic languages appear to prefer sub-character units of different granularities. A very clear tendency, observed consistently in both lines of experiments, was that ideographs worked better for Chinese and strokes worked better for Japanese. This difference might arise from differences in the writing systems: in addition to Kanji (Chinese characters), Japanese uses Hiragana and Katakana, which are standalone syllabic symbols.

Moreover, as described in Section 4, stroke models tended to perform more poorly than ideograph models. This probably occurred because, to achieve a fair comparison between all baseline and proposed models, the same hyperparameter configurations were used. For example, the embedding dimensions for all vocabularies were set to 300. This might be appropriate for character-based and ideograph vocabularies with sizes larger than 500; however, the stroke data has a vocabulary size of only approximately 30, for which this dimensionality is disproportionate. This phenomenon might also account for the decrease in BLEU scores when shared vocabularies were applied to the stroke models.

Model | Sentence
Source | 松下 電器 グループ で は , 経営 理念 の 基本 と し て 1991 年 に 「 環境 宣言 」 を 制定 し た 。
Reference | 作为 经营 理念 , 松下电气集团 于 1991年 制定 了 《 环境 宣言 》 。
Baseline (Word) | 在 <unk> 集团 中 , 1991年 制定 了 “ 环境 宣言 ” 作为 经营 理念 的 基础 。
Baseline (Char) | 在 松下电器 集团 中 , 作为 经营 理念 的 基础 , 1991年 制定 了 《 环境 宣言 》 。
Best Model (JP-stroke to CN-ideograph) | 松下电器集团 , 作为 经营 理念 的 基础 , 1991年 制定 了 “ 环境 宣言 ” 。
English Translation | The Matsushita Electric Group enacted the "Environmental Declaration" as the basis of its business philosophy in 1991.

Table 6: Translation examples from Japanese–Chinese NMT systems. Note that the proper noun "松下电器" is handled properly by the sub-character based translation system.

5.3 The Encoding and Decoding Process

In comparison with character level data, sub-character level data (such as ideographs and strokes) can be used to generate much smaller and more concentrated vocabularies. This is helpful during both the encoding and decoding processes. Vocabularies constructed from character-level data are known to be very skewed, containing both very frequent and very rare items. As a result, during training, the neurons responsible for high-frequency words might be updated many times, while the neurons responsible for low-frequency words might be updated only a very limited number of times. This potentially harms translation performance for low-frequency words.

However, this problem can be alleviated by applying sub-character units. Because ideographs and strokes are repeatedly shared by different characters, no items occur with very low frequencies. More instances can be found in the training data, even for the least frequent sub-character items. Therefore, the translation performance for low-frequency items can be much better.

6 Conclusions and Future Work

This study was the first attempt to use sub-character units in NMT models. Our results not only confirmed the positive effects of using ideograph and stroke sequences in NMT tasks, but also indicated that different logographic languages actually prefer different sub-character granularities (namely, ideographs for Chinese and strokes for Japanese). Finally, this paper presented a simple method for extending an available corpus from the character level to the sub-character level. During this process, we maintained a one-to-one relationship between the original characters and the transformed sub-character sequences. As a result, this simple and straightforward method achieved consistently better results for NMT systems that translate logographic languages, and it could easily be applied to similar scenarios.

Many questions remain to be answered in future work. The first concerns the relative benefits of ideograph data and stroke data, and the effects of shared vocabularies. We have not yet explained why there are considerable differences in performance. In particular, for NMT models in which both sides use stroke data, why does performance drop dramatically when shared vocabularies are applied? We discussed possible reasons for this phenomenon in Section 5.2; however, further investigation is needed.

Another important issue is as follows: when characters are transformed into ideographs and strokes, no structural information is considered. This causes repetitions in the data, and we must add tags at the ends of the duplicated sequences to differentiate them. A better way to solve this problem would be to encode structural information directly in the data.

Acknowledgments

This work was partially supported by JSPS Grant-in-Aid for Young Scientists (B) Grant Number JP16K16117.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2017. Investigating stroke-level information for learning Chinese word embeddings. In ISWC-16.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014, pages 1724–1734.

Jinhua Du and Andy Way. 2017. Pinyin as subword unit for Chinese-sourced neural machine translation. In Irish Conference on Artificial Intelligence and Cognitive Science, page 11.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP 2004.

Shaohui Kuang and Lifeng Han. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. In 30th European Summer School in Logic, Language and Information.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL 2018.

Frederick Liu, Han Lu, Chieh Lo, and Graham Neubig. 2017. Learning character-level compositionality with visual features. In ACL 2017, pages 2059–2068.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015, pages 1412–1421.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In LREC-9, pages 2204–2208.

Graham Neubig. 2013. Travatar: A forest-to-string machine translation engine based on tree transducers. In ACL 2013: System Demonstrations, pages 91–96.

Haiyun Peng, Erik Cambria, and Xiaomei Zou. 2017. Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level. In FLAIRS-30.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL 2016, pages 1715–1725.

Xinlei Shi, Junjie Zhai, Xudong Yang, Zehua Xie, and Chao Liu. 2015. Radical embedding: Delving deeper to Chinese radicals. In ACL-IJCNLP 2015, pages 594–598.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Yota Toyama, Makoto Miwa, and Yutaka Sasaki. 2017. Utilizing visual forms of Japanese characters for neural review classification. In IJCNLP 2017, pages 378–382.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 6000–6010.

Yining Wang, Long Zhou, Jiajun Zhang, and Chengqing Zong. 2017. Word, subword or character? An empirical study of granularity in Chinese-English NMT. In China Workshop on Machine Translation, pages 30–42. Springer.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144 [cs].

Appendix A Strokes in CNS11643 Charset

Table: stroke variations grouped by type (Horizontal, Vertical, Left-falling, Right-falling, Break); stroke glyphs omitted.
