Character Decomposition for Japanese-Chinese Character-Level Neural Machine Translation
Jinyi Zhang
Graduate School of Engineering, Gifu University, Gifu, Japan
Email: [email protected]

Tadahiro Matsumoto
Faculty of Engineering, Gifu University, Gifu, Japan
Email: [email protected]

Abstract—After years of development, Neural Machine Translation (NMT) has produced richer translation results than ever over various language pairs, becoming a new machine translation model with great potential. An NMT model, however, can only translate words/characters contained in its training data, and one problem in NMT is the handling of low-frequency words/characters in the training data. In this paper, we propose a method for removing characters whose frequencies of appearance are below a given minimum threshold by decomposing such characters into their components and/or pseudo-characters, using a Chinese character decomposition table that we made. Experiments on Japanese-to-Chinese and Chinese-to-Japanese NMT with the ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus show that the BLEU scores, the training time and the number of parameters vary with the given minimum threshold for decomposed characters.

Keywords—character decomposition; neural machine translation; Japanese-Chinese; character-level; LSTM; encoder; decoder

I. INTRODUCTION

Machine translation performance has greatly improved over Statistical Machine Translation (SMT) due to the appearance of Neural Machine Translation (NMT). For NMT, one problem is the handling of low-frequency words/characters in the vocabulary of the training data [1]. For NMT models, the computational complexity becomes enormous as the vocabulary size increases. Therefore, in a general word-level NMT model, the vocabulary size (the number of different words) is usually limited to about tens of thousands of words, and the remaining low-frequency words are uniformly treated as unknown words. The increase of unknown words reduces translation performance, so the handling of low-frequency words is a major problem in NMT.

Byte Pair Encoding (BPE), proposed by Sennrich et al. [2], makes NMT models capable of open-vocabulary translation by encoding low-frequency and unknown words as sequences of subword units, and has been used to address the low-frequency word problem.

However, Chinese mainly uses Chinese characters (Hanzi), which are logograms. Many Chinese words are written with one or two Chinese characters; as a result, it is difficult to divide a Chinese word into high-frequency subword units. Therefore, the character level is considered suitable for NMT between Japanese and Chinese. Character-level NMT also has the advantage that errors and fluctuations do not occur in the process of dividing sentences into words (word segmentation).

Compared with word-level NMT, the vocabulary size is kept small in character-level NMT, but there are still many characters of extremely low frequency in the vocabulary. At the word level, replacing a low-frequency word of low statistical reliability with a related high-frequency word has been attempted [3], but such a substitution is difficult for characters. Therefore, we devised a method for reducing low-frequency characters in character-level NMT between Japanese and Chinese by dividing low-frequency Chinese characters into constituent elements of the character (radicals: traditionally recognized components of Chinese characters) and pseudo partial characters. We investigated the effects of the method on the translation results and on the number of parameters of the model by experiments.
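To make this preprocessing idea concrete, the sketch below replaces every character rarer than a frequency threshold with its components and emits space-separated characters for character-level NMT. It is a minimal illustration only: the table format, the example entries, the function name and the threshold value are our assumptions for exposition, not the actual decomposition table or settings used in the experiments.

```python
from collections import Counter

# Hypothetical decomposition table: character -> list of components
# (radicals and/or pseudo partial characters). The entries are illustrative.
DECOMPOSITION_TABLE = {
    "鮭": ["魚", "圭"],  # rare character split into fish radical + phonetic part
    "賑": ["貝", "辰"],
}

def decompose_rare_chars(sentences, table, min_freq=10):
    """Replace characters occurring fewer than min_freq times with their components."""
    freq = Counter(ch for sent in sentences for ch in sent)
    processed = []
    for sent in sentences:
        chars = []
        for ch in sent:
            if freq[ch] < min_freq and ch in table:
                chars.extend(table[ch])  # components are shared with other characters
            else:
                chars.append(ch)
        processed.append(" ".join(chars))  # character-level NMT input format
    return processed
```

Applied to the training corpus, such a pass shrinks the effective character vocabulary, because the components of rare characters are shared with characters that already appear frequently.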
We used Luong's NMT system [4] as the base system; it follows an encoder-decoder architecture with global attention, applied at the character level. We chose character-level NMT as the baseline because character-level NMT between Japanese and Chinese gives better translation performance than word-level NMT.

The main contributions of this paper are the following. We created a Chinese character decomposition table for finding the constituent elements of a character. We demonstrate the possibility of improving the translation performance of NMT systems by dividing Chinese and Japanese characters into constituent elements and sharing them with the other characters in the vocabulary, without changing the neural network architecture. We believe this capability makes our approach applicable to different NMT architectures.

In the remainder of this paper, Section II presents related work. Section III gives a brief explanation of the architecture of the NMT system that we used as the base system and of the ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus. Section IV describes the proposed method: how to divide Chinese and Japanese characters into constituent elements and share them with the other characters in the vocabulary. Section V reports the experimental framework and the results obtained in Japanese-to-Chinese and Chinese-to-Japanese translation (with ASPEC-JC [5]). Finally, Section VI concludes with the contributions of this paper and further work.

II. RELATED WORK

The characters used in a language are usually much fewer than its words. Character-level neural language models [6] and character-level MT have been explored and have achieved respectable results. Previous works on POS tagging [7], named entity recognition [8], parsing [9], learning word representations [10], and character embeddings [11] have shown different advantages of using character-level information in Natural Language Processing (NLP).

Besides, subword-based representations (intermediate between word-based and character-based representations) have been explored in NMT [2] and are applied to English and other Western languages, where most words consist of several to a dozen characters. In contrast, Chinese characters, which are used in Chinese, Japanese and some other Asian languages, are typical logograms. A logogram is a character that represents a concept or thing, namely a word; thus, it is difficult to split such words into subwords. Recently, Meng et al. [12] found that character-based models consistently outperform subword-based and word-based models in deep learning-based Chinese NLP tasks.

For Chinese-Japanese NMT, sub-character-level information has improved translation performance [13], by using sub-character sequences on either the source or the target side; however, their character decomposition still needs further exploration. Du and Way [14] trained factored NMT models using "Pinyin" sequences on the source side (Pinyin is the official romanization system for Chinese), but this work applies only to Chinese source-side NMT. Zhang and Matsumoto [15] also attempted to use a factored encoder for a Japanese-Chinese NMT system using radical information; they did not achieve good results in Chinese-to-Japanese NMT. Wang et al. [16] directly applied a BPE algorithm to sequences before building NMT models; this method has only been tested in the Chinese-to-English direction and is not comprehensive enough.

III. NEURAL MACHINE TRANSLATION AND ASPEC-JC CORPUS

A. Neural Machine Translation

NMT adopts a purely neural network approach to compute the conditional probability $p(y \mid x)$ of the target sentence $y$ given the source sentence $x$. We follow the NMT architecture of Luong et al. [4], which we briefly describe here. This NMT system is implemented as a global attentional encoder-decoder neural network with Long Short-Term Memory (LSTM), and we simply use it at the character level.

The decoder is a recurrent neural network with LSTM units that predicts a target sequence $y = (y_1, \ldots, y_n)$. Every word (or character, in the case of character-level NMT) $y_i$ is predicted based on a recurrent hidden state $s_i$, the previously predicted word (or character) $y_{i-1}$, and a context vector $c_i$. $c_i$ is computed as the weighted sum of the annotations $h_j$, and the weight of each annotation $h_j$ is computed through an alignment (or attention) model $\alpha_{ij}$, which models the probability that $y_i$ is aligned to $x_j$.

The forward states of the encoder are expressed as:

$$\overrightarrow{h}_j = \tanh\!\left(\overrightarrow{W} E x_j + \overrightarrow{U}\,\overrightarrow{h}_{j-1}\right) \qquad (1)$$

where $E \in \mathbb{R}^{m \times V_x}$ is a word embedding matrix, $\overrightarrow{W} \in \mathbb{R}^{n \times m}$ and $\overrightarrow{U} \in \mathbb{R}^{n \times n}$ are weight matrices; $m$, $n$ and $V_x$ are the word embedding size, the number of hidden units, and the vocabulary size of the source language, respectively.
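As a concrete reading of Eq. (1) and of the attention description above, the following NumPy sketch computes the forward encoder states and a context vector. It is a simplification under stated assumptions: it uses the plain tanh recurrence of Eq. (1) rather than full LSTM gating, and the dot-product score used to obtain the weights $\alpha_{ij}$ is only one of the scoring variants described by Luong et al. [4], not necessarily the one used in our system.

```python
import numpy as np

def encode_forward(x_ids, E, W, U):
    """Forward encoder states per Eq. (1): h_j = tanh(W E x_j + U h_{j-1})."""
    n = W.shape[0]
    h = np.zeros(n)
    states = []
    for j in x_ids:                       # j indexes a source character
        h = np.tanh(W @ E[:, j] + U @ h)  # E[:, j] is the embedding of x_j
        states.append(h)
    return np.stack(states)               # shape: (source length, n)

def context_vector(s_i, annotations):
    """Attention weights alpha_ij over annotations h_j, then weighted sum c_i."""
    scores = annotations @ s_i            # dot-product scoring (one possible choice)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over source positions
    return alpha @ annotations            # context vector c_i

# Toy usage with random parameters: m = 8, n = 16, V_x = 50.
rng = np.random.default_rng(0)
E = rng.normal(size=(8, 50))
W = rng.normal(size=(16, 8))
U = rng.normal(size=(16, 16))
H = encode_forward([3, 17, 42], E, W, U)
c = context_vector(rng.normal(size=16), H)
```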
B. ASPEC-JC Corpus

We implement our system with the ASPEC-JC corpus, which was constructed by manually translating Japanese scientific papers into Chinese [5]. The Japanese scientific papers are either the property of the Japan Science and Technology Agency (JST) or stored in Japan's Largest Electronic Journal Platform for Academic Societies (J-STAGE).

ASPEC-JC is composed of four parts, on the assumption that it would be used for machine translation research: training data (672,315 sentence pairs), development data (2,090 sentence pairs), development-test data (2,148 sentence pairs) and test data (2,107 sentence pairs).

ASPEC-JC contains both abstracts and some parts of the body texts. It covers only eight fields, "Medicine", "Information", "Biology", "Environmentology", "Chemistry", "Materials", "Agriculture" and "Energy", because it was difficult to include all scientific fields. These fields were selected by investigating the important scientific fields in China and the usage tendency of literature databases by researchers and engineers in Japan. In these fields, sentences belonging to the same article are not included.

Compared with other language pairs such as English-French, whose corpora usually comprise millions of parallel sentences, the ASPEC-JC corpus has only about 672k sentence pairs. Moreover, an LSTM-plus-attention model is usually more robust than the Transformer model [17] on smaller datasets, due to its smaller number of parameters [12].

IV. REDUCTION OF LOW-FREQUENCY CHARACTERS BY CHARACTER DECOMPOSITION

During the training and translation process, the training