Korean-To-Chinese Machine Translation Using Chinese Character As Pivot Clue
Jeonghyeok Park1,2,3 and Hai Zhao1,2,3,∗
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, China
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

∗ Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).

Pacific Asia Conference on Language, Information and Computation (PACLIC 33), pages 522-530, Hakodate, Japan, September 13-15, 2019. Copyright © 2019 Jeonghyeok Park and Hai Zhao

Abstract

Korean-Chinese is a low-resource language pair, but Korean and Chinese have a lot in common in terms of vocabulary. Sino-Korean words, which can be converted into corresponding Chinese characters, account for more than fifty percent of the entire Korean vocabulary. Motivated by this, we propose a simple, linguistically motivated solution that improves the performance of Korean-to-Chinese neural machine translation models by using this common vocabulary. We adopt Chinese characters as a translation pivot by converting the Sino-Korean words in Korean sentences to Chinese characters and then training the machine translation model with the converted Korean sentences as source sentences. The experimental results on Korean-to-Chinese translation demonstrate that models with the proposed method improve translation quality by up to 1.5 BLEU points in comparison to the baseline models.

1 Introduction

Neural machine translation (NMT) using a sequence-to-sequence structure has achieved remarkable performance for most language pairs (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014; Luong and Manning, 2015). Many studies on NMT have tried to improve translation performance by changing the structure of the network model or adding new strategies (Wu and Zhao, 2018; Zhang et al., 2018; Xiao et al., 2019). Meanwhile, there have been few attempts to improve the performance of NMT models by using linguistic characteristics for several language pairs (Sennrich and Haddow, 2016). On the other hand, most of the recently proposed statistical machine translation (SMT) systems have attempted to improve translation performance by using linguistic features including part-of-speech (POS) tags (Ueffing and Ney, 2013), syntax (Zhang et al., 2007), semantics (Rafael and Marta, 2011), reordering information (Zang et al., 2015; Zhang et al., 2016) and so on.

In this work, we focus on machine translation between Korean and Chinese, which have few parallel corpora but share a well-known cultural heritage, the Sino-Korean words. Chinese loanwords used in Korean are called Sino-Korean words, and they can also be written in Chinese characters, which are still used by modern Chinese people. Such a shared vocabulary brings the two languages closer despite their huge linguistic difference and opens up the possibility of better machine translation.

Because of Korea's long history of contact with China, Koreans used Chinese characters as their writing system, and even after the adoption of Hangul (한글 in Korean) as the standard script, Chinese characters have had a considerable influence on Korean vocabulary. Currently, the writing system adopted by modern Korean is Hangul, but Chinese characters continue to be used in Korean, and the Chinese characters used in Korean are called "Hanja". Korean vocabulary can be categorized into native Korean words, Sino-Korean words, and loanwords from other languages. The Sino-Korean vocabulary refers to Korean words of Chinese origin, which can be converted into corresponding Chinese characters and account for about 57% of Korean vocabulary. Table 1 shows some sentence pairs of Korean and Chinese with the converted Sino-Korean words. In Table 1, some Chinese words are shared between the converted Korean sentence and the Chinese sentence.

Systems      Sentences
Korean       명령은 아래와 같이 반포되었다.
HH-Convert   命令은 아래와 같이 颁布되었다.
Chinese      命令颁布如下。
English      The command was promulgated as follows.
Korean       양국은 광범한 영역에서의 공동 이익을 확인했다.
HH-Convert   两国은 广范한 领域에서의 共同 利益을 确认했다.
Chinese      两国在广泛的领域确认了共同利益。
English      The two countries have confirmed common interests in a wide range of areas.

Table 1: HH-Convert is the Korean sentence converted by the Hangul-Hanja conversion of the Hanjaro. The underline denotes the Sino-Korean word and its corresponding Chinese characters in the Korean sentence and the HH-Convert sentence, respectively.

In this paper, we present a novel yet straightforward method for better Korean-to-Chinese MT by exploiting the connection of Sino-Korean vocabulary. We convert all Sino-Korean words in Korean sentences into Chinese characters and take the converted Korean sentences as the updated source data for later MT model training. Our method is applied to two types of NMT models, the recurrent neural network (RNN) and the Transformer, and shows significant translation performance improvement.

2 Related Work

There have been studies of linguistic annotation, such as dependency labels (Wu et al., 2018; Li et al., 2018a; Li et al., 2018b), semantic role labels (Guan et al., 2019; Li et al., 2019) and so on. Sennrich and Haddow (2016) showed that various linguistic features can be valuable for NMT. In this work, we focus on the linguistic connection between Korean and Chinese to improve Korean-to-Chinese NMT.

There are several studies on Korean-Chinese machine translation. For example, Kim et al. (2002) proposed a verb-pattern-based Korean-to-Chinese MT system that uses pattern-based knowledge and consistently manages linguistic peculiarities between language pairs to improve MT performance. Li et al. (2009) improved the translation quality of Chinese-to-Korean SMT by using Chinese syntactic reordering for an adequate generation of Korean verbal phrases.

Since Chinese and Korean belong to entirely different language families in terms of typology and genealogy, many studies have also tried to analyze the sentence structure and word alignment of the two languages and then proposed specific methods for their concerns (Huang and Choi, 2000; Kim et al., 2002; Li et al., 2008). Lu et al. (2015) proposed a method of translating Korean words into Chinese using Chinese character knowledge.

There are several attempts to exploit the connection between the source language and the target language in machine translation. Kuang et al. (2018) proposed methods to shorten the distance between the source and target words in an NMT model, and thus strengthen their association, through a technique bridging source and target word embeddings. For other low-resource language pairs, using a pivot language to overcome the limitation of an insufficient parallel corpus has been a choice (Habash and Hu, 2009; Zahabi et al., 2013; Ahmadnia et al., 2017). Chu et al. (2013) built a Chinese character mapping table for Japanese, Traditional Chinese, and Simplified Chinese and verified the effectiveness of shared Chinese characters for Chinese-Japanese MT. Zhao et al. (2013) used the Chinese character, a common form of both languages, as a translation bridge in a Vietnamese-Chinese SMT model, and improved the translation quality by converting Vietnamese syllables into Chinese characters with a pre-specified dictionary. Partially motivated by this work, we turn to Korean in terms of NMT models by fully exploiting the shared Sino-Korean vocabulary between Korean and Chinese.

[Table 2 content: two Korean news headlines containing Chinese characters such as 北, 北美 and 南北; the Hangul text is not recoverable from the extraction.]

Table 2: News headlines with Chinese characters. The underline denotes Chinese characters.

3 Sino-Korean Words and Chinese Characters

Korea belongs to the Chinese cultural sphere, that is, the regions and countries of East Asia that China has historically influenced. Before the creation of Hangul (the Korean alphabet), all documents were written in Chinese characters, and Chinese characters continued to be used even after the creation of Hangul.

Today, the standard writing system in Korea is Hangul, and the use of Chinese characters in Korean sentences is rare, but Chinese characters have left a significant influence on Korean vocabulary. About 290,000 (57%) of the 510,000 words in the Standard Korean Language Dictionary published by the National Institute of Korean Language belong to […] Chinese, many homophones were created in their vocabulary in the process of translating the Chinese words into their language. Around 35% of the Sino-Korean words registered in the Standard Korean Language Dictionary are homophones. Thus, converting Sino-Korean words into (usually different) Chinese characters will have an impact similar to semantic disambiguation. For example, the Korean word uisa (의사 in Korean) has many homophones and can have several meanings. To clarify the meaning of the word uisa in a Korean context, it is occasionally written in Chinese characters as follows: 医师 (doctor), 意思 (mind), 义士 (martyr), 议事 (proceedings).

In addition, there is a difference between the Chinese characters (Hanja) used in Korea and the Chinese characters used in China. Written Chinese can be divided into two categories: Traditional Chinese and Simplified Chinese. The Chinese characters used in China and Korea are Simplified Chinese and Traditional Chinese, respectively.

4 The Proposed Approach

The proposed approach for Korean-to-Chinese MT has two phases: Hangul-Hanja conversion and NMT model training. We first convert the Sino-Korean words of the Korean input sentences into Chinese characters, and then convert the Traditional Chinese characters of the converted Korean input sentences into Simplified Chinese characters so that the source and target vocabularies share common units.