Korean-To-Japanese Neural Machine Translation System Using Hanja Information
Hwichan Kim, Tosho Hirasawa, Mamoru Komachi
Tokyo Metropolitan University
6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan

Abstract

In this paper, we describe our TMU neural machine translation (NMT) system submitted for the Patent task (Korean→Japanese) of the 7th Workshop on Asian Translation (WAT 2020, Nakazawa et al., 2020). We propose a novel method to train a Korean-to-Japanese translation model. Specifically, we focus on the vocabulary overlap between Korean Hanja words and Japanese Kanji words and propose strategies to leverage Hanja information. Our experiments show that Hanja information is effective within a specific domain, improving the BLEU score by +1.09 points over the baseline.

Korean (Hangul)   자연 언어 처리
Korean (Hanja)    自然 言語 処理
Japanese (Kanji)  自然 言語 処理
English           Natural Language Processing

Table 1: Example of conversion of SK words in Hangul into Hanja and the Japanese translation in Kanji.

1 Introduction

The Japanese and Korean languages have a strong connection with Chinese owing to cultural and historical reasons (Lee and Ramsey, 2011). Many words in Japanese are composed of Chinese characters called Kanji. By contrast, Korean uses the Korean alphabet, called Hangul, to write sentences in almost all cases. However, Sino-Korean¹ (SK) words, which can be converted into Hanja words, account for 65 percent of the Korean lexicon (Sohn, 2006). Table 1 presents an example of the conversion of SK words into Hanja, which is compatible with Japanese Kanji words.

In addition, several studies have suggested that overlapping tokens between the source and target languages can improve translation accuracy (Sennrich et al., 2016; Zhang and Komachi, 2019). Park and Zhao (2019) trained a Korean-to-Chinese translation model by converting Korean SK words from Hangul into Hanja to increase the vocabulary overlap. In other words, a vocabulary overlap in NMT means that corresponding source and target words share the same embedding. Conneau et al. (2018) and Lample and Conneau (2019) improved translation accuracy by bringing the embeddings of source words and their corresponding target words closer together. From this, we hypothesize that if the embeddings of corresponding words are closer, translation accuracy will improve.

Based on this hypothesis, we propose two approaches to train a translation model. First, we follow the method of Park and Zhao (2019) and increase the vocabulary overlap to improve Korean-to-Japanese translation accuracy; to this end, we perform Hangul-to-Hanja conversion as a pre-processing step before training the translation model. Second, we propose an approach to obtain Korean and Japanese embeddings in which Korean SK words and their corresponding Japanese Kanji words are close. SK words written in Hangul and their counterparts in Japanese Kanji are superficially different, but we make their embeddings close by using an additional loss function when training the translation model.
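To make the first approach concrete, the following is a minimal sketch of the Hangul-to-Hanja conversion pre-processing, assuming a word-segmented corpus and a hypothetical SK-word dictionary SK_TO_HANJA; the paper does not specify which conversion tool or dictionary was actually used.

    # Minimal sketch of Hangul-to-Hanja conversion as a pre-processing step.
    # SK_TO_HANJA is a hypothetical dictionary; in practice a Hanja converter
    # or morphological analyzer would provide the mapping of SK words.
    SK_TO_HANJA = {
        "자연": "自然",
        "언어": "言語",
        "처리": "処理",
    }

    def convert_sk_to_hanja(sentence: str) -> str:
        """Replace Sino-Korean (SK) tokens with Hanja; leave other tokens unchanged."""
        tokens = sentence.split()  # assumes the corpus is already word-segmented
        return " ".join(SK_TO_HANJA.get(token, token) for token in tokens)

    print(convert_sk_to_hanja("자연 언어 처리"))  # -> "自然 言語 処理"

Applying such a conversion to the Korean side of the parallel data is what increases the surface overlap with the Japanese Kanji vocabulary, which is the effect the first approach relies on.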
In addition, in this study we used the Japan Patent Office Patent Corpus 2.0, which consists of four domains, namely chemistry (Ch), electricity (El), mechanical engineering (Me), and physics (Ph), and whose training, development, and test-n1² data carry domain information. Our methods are more effective when terms are derived from Chinese characters, and because some domains contain many more such terms than others, we expect the effect to differ per domain. To examine in which domains Hanja information is most useful, we therefore perform domain adaptation by fine-tuning the model pre-trained on all the training data with domain-specific data.

In this study, we examine the effect of Hanja information and domain adaptation in Korean-to-Japanese translation. The main contributions of this study are as follows:

• We demonstrate that Hanja information is effective for Korean-to-Japanese translation within a specific domain.

• In addition, our experiment shows that the translation model using Hanja information tends to translate literally.

¹ Sino-Korean (SK) refers to Korean words of Chinese origin.
² Here, test-n, test-n1, test-n2, and test-n3 denote the test data, and test-n consists of test-n1, test-n2, and test-n3.

Korean (Hangul)  그래서 N의 함유량은 0.01 %이하로 한정한다.
Korean (Hanja)   그래서 N의 含有量은 0.01 %以下로 限定한다.
Japanese         そのため Nの 含有量は 0.01 %以下に 限定する。
English          Therefore, N content 0.01 % or less limit.

Table 2: Korean sentence and the sentence in which its SK words are converted into Hanja, together with their Japanese and English translations.

2 Related Work

Several studies have been conducted on Korean–Japanese neural machine translation (NMT). Park et al. (2019) trained a Korean-to-Japanese translation model using a transformer-based NMT system with relative positioning, back-translation, and multi-source methods. There have been other attempts that combine statistical machine translation (SMT) and NMT (Ehara, 2018; Hyoung-Gyu Lee and Lee, 2015). Previous studies on Korean–Japanese NMT did not use Hanja information, whereas we train a Korean-to-Japanese translation model using data in which SK words were converted into Hanja words.

Sennrich et al. (2016) proposed byte-pair encoding (BPE), a sub-word segmentation method, and suggested that sharing tokens through joint BPE is effective for training translation models between European language pairs. Zhang and Komachi (2019) increased the overlap of tokens between Japanese and Chinese by decomposing Japanese Kanji and Chinese characters to the ideograph or stroke level, improving the accuracy of Chinese–Japanese unsupervised NMT. Following these studies, we convert Korean SK words from Hangul into Hanja to increase the vocabulary overlap.

Conneau et al. (2018) proposed a method to build a bilingual dictionary by aligning the source and target word embedding spaces using a rotation matrix. They showed that word-by-word translation using the bilingual dictionary can improve translation accuracy in low-resource language pairs. Lample and Conneau (2019) improved translation accuracy by fine-tuning a pre-trained cross-lingual language model (XLM) and observed that the bilingual word embeddings in XLM are similar. Based on these facts, we hypothesize that if the embeddings of corresponding words become closer, translation accuracy will improve. In this study, we use Hanja information to make the embeddings of corresponding words closer to each other.

Some studies have focused on exploiting Hanja information. Park and Zhao (2019) focused on the linguistic connection between Korean and Chinese. They trained a translation model on parallel data in which the SK words of the Korean sentences were converted into Hanja, demonstrated that this improves translation quality, and showed that the conversion into Hanja helps in translating homophonous SK words. Yoo et al. (2019) proposed an approach for training Korean word representations using data in which SK words were converted into Hanja words, and demonstrated the effectiveness of this representation learning method on several downstream tasks, such as news headline generation and sentiment analysis. To train our Korean-to-Japanese translation model, we combine these two approaches, using Hanja information both for training the translation model and for the word embeddings, to improve translation quality.

Among domain adaptation approaches for NMT, Bapna and Firat (2019) and Gu et al. (2019) took an NMT model pre-trained on massive parallel data and retrained it on small parallel data within the target domain. In addition, Hu et al. (2019) proposed an unsupervised adaptation method that retrains a pre-trained NMT model using pseudo in-domain data. In this study, we perform domain adaptation by fine-tuning a pre-trained NMT model with domain-specific data to examine whether Hanja information is useful, and if so, in which domains.
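The fine-tuning step itself is the standard continue-training recipe. The sketch below assumes a generic PyTorch setup with hypothetical helpers build_model, load_parallel_data, and train_one_epoch (the paper does not describe its training code); it only illustrates restoring the model pre-trained on all domains and continuing training on one domain's data.

    import torch

    def domain_adapt(checkpoint_path: str, domain: str, epochs: int = 5):
        """Fine-tune a pre-trained NMT model on one domain (e.g. "Ch", "El", "Me", "Ph")."""
        model = build_model()                                # hypothetical model constructor
        model.load_state_dict(torch.load(checkpoint_path))   # restore the model trained on all domains

        domain_data = load_parallel_data(domain)             # hypothetical domain-specific data loader
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is illustrative only

        for _ in range(epochs):
            train_one_epoch(model, domain_data, optimizer)    # hypothetical training loop
        return model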
Model             Partition  Sent.    Korean Tokens  Korean Types  Japanese Tokens  Japanese Types
Baseline          train      999,758  31,151,846     21,936        31,065,360       24,178
                  dev        2,000    106,433        6,475         104,307          6,098
                  test-n     5,230    272,975        9,530         269,876          9,018
                  test-n1    2,000    108,327        6,425         106,947          6,137
                  test-n2    3,000    153,195        7,522         151,253          6,656
                  test-n3    230      11,453         1,666         11,676           1,644
Hanja-conversion  train      999,755  32,066,032     26,046        30,474,136       27,541
                  dev        2,000    109,460        6,944         103,994          6,119
                  test-n     5,230    298,404        9,832         269,146          9,166
                  test-n1    2,000    111,543        6,941         106,653          6,162
                  test-n2    3,000    175,131        6,410         150,844          6,717
                  test-n3    230      11,730         1,759         11,649           1,634

Table 3: Statistics of parallel data after each pre-processing.

3 NMT with Hanja Information

Our model is based on the transformer architecture (Vaswani et al., 2017), and we share the embedding weights between the encoder input, decoder input, and output to make better use of the vocabulary overlap between Korean and Japanese.

We make the embedding of each SK word and that of its corresponding Japanese Kanji word closer to each other. We use a loss function to achieve this goal as follows:

L = L_T + \sum_{n=1}^{N} (1 - Sim(E(S_n), E(K_n)))    (1)

where L is the per-batch loss function, L_T is the loss of the transformer architecture, S and K are the lists of SK words and their corresponding Japanese Kanji words in the batch, respectively, and N is the length of the S and K lists (e.g., when the sentence is the example in Table 2 and the batch size is one, S = (함유량, 이
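As a rough PyTorch-style sketch of the auxiliary term in Equation (1), assuming that Sim is cosine similarity and that the indices of the aligned SK/Kanji word pairs in the batch have already been extracted (neither detail is given in this excerpt):

    import torch
    import torch.nn.functional as F

    def loss_with_hanja_term(transformer_loss, embedding, sk_ids, kanji_ids):
        """Equation (1): L = L_T + sum_n (1 - Sim(E(S_n), E(K_n)))."""
        sk_vecs = embedding[sk_ids]         # E(S_n) for the SK words in the batch
        kanji_vecs = embedding[kanji_ids]   # E(K_n) for their Japanese Kanji counterparts
        sim = F.cosine_similarity(sk_vecs, kanji_vecs, dim=-1)  # assumed form of Sim
        return transformer_loss + (1.0 - sim).sum()

    # Toy usage with a random shared embedding table of 100 words, dimension 16.
    embedding = torch.randn(100, 16)
    loss = loss_with_hanja_term(torch.tensor(2.3), embedding,
                                torch.tensor([5, 7]), torch.tensor([50, 70]))

Because the embedding weights are shared between the encoder input, decoder input, and output, pulling E(S_n) toward E(K_n) affects both the source and target sides of the model.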