Korean-To-Chinese Word Translation Using Chinese Character Knowledge
Total Page:16
File Type:pdf, Size:1020Kb
言語処理学会 第21回年次大会 発表論文集 (2015年3月) Korean-to-Chinese Word Translation using Chinese Character Knowledge Yuanmei Lu*, Toshiaki Nakazawa*,** and Sadao Kurohashi* *Graduate School of Informatics, Kyoto University **Japan Science and Technology Agency flu, [email protected], [email protected] 1 Introduction ity. The quality of the statistical machine translation highly relies on the amount of parallel corpora avail- able, and improving the lexical coverage of the par- 2 Sino-Korean Words allel corpora seems to play an important role in reducing the number of out-of-vocabulary (OOV) 2.1 Chinese Words in Korean and words. However, the number of vocabularies of lan- Japanese guages keeps growing, especially for technical terms. It is impossible to cover all the newly appeared In Korean, a significant portion of the words are words by augmenting the parallel corpora; there- composed of Sino-Korean. The Standardized Ko- fore we need to prepare bilingual dictionaries for rean Language Dictionary (표준국어대사전, 標準国 the new words, or translate them separately, for ex- 語大辞典) published by National Institute of the Ko- ample, using the transliteration technique. rean Language (NIKL) (국립국어원, 國立國語院) in 2004 contains near 57% of Sino-Korean words (漢 There are some parallel dictionaries available for 字語); the Survey of Korean Vocabulary frequency, limited language pairs and limited domains. In which is conducted in 1956, has shown that about addition, we can extract parallel resources from 70% of the frequently used words are Sino-Korean. Wikipedia. It offers hyper-linked pages of the same Nowadays, this kind of words are much frequently topic in different languages, and the title pairs of used in writings, such as newspapers and disser- the linked pages can be used as a parallel dictio- tations. Most of the Sino-Korean words are not nary. However, the coverage is not sufficient for written in Hanzi, but in Hangul. However, we can both cases especially for technical terms. Although convert them into Hanzi because there is a corre- there may exist enough resources between English spondence between them. Actually, some papers are and the other language, there is less resources be- published with combinations of Hangul and Hanzi tween two non-English languages, such as Korean, in order to specify definitions of vocabularies or em- Chinese and Japanese. phasize them. As in the same linguistic area, Korean, Chinese and The situation is similar in Japanese. According to Japanese have much in common in their languages. Studies on the Vocabulary of Modern Newspapers One of the aspects is that these languages use Chi- III published by National Institute for Japanese Lan- nese characters. In Japan, they use Kanji, which guage and Linguistics (国立国語研究所), the percent- is originated from Chinese, and Korean use Sino- age of Kanji (Chinese characters used in Japanese) Korean vocabularies, in which characters (Hangul) words are over 70% in newspapers. The Kanji words can be converted to corresponding Chinese charac- such as \使用 (use) " are preferably used than the ters (Hanzi). Even though the forms are different, native Japanese words such as \使う (to use) " in most of the vocabularies in these three languages Japanese formal writings. have one-to-one correspondence in character. In this paper, we propose a method of translating Ko- rean words into Chinese using the Chinese charac- 2.2 Related Work ter knowledge. We use the Hangul-to-Hanzi map- ping table to generate translation candidates and Since Korean characters are phonogram, we can find rank the candidates considering the possibility of a corresponding Hangul for a given Hanzi. Ac- the character combination and contextual similar- tually, almost all of the Hanzi can be converted ― 95 ― Copyright(C) 2015 The Association for Natural Language Processing. All Rights Reserved. to one (or scarcely several) Korean characters. (Huang,et,al. 2000) constructed a Chinese-Korean Table 1: A portion of the mapping table Character Transfer Table (CKCT Table) to reflect 한 闲, 韩, 恨, 限, 汉 the correspondence between Hanzi and Hangul. The 자 姐, 字, 磁, 子, 仔, 姿, 刺, 自, 资, 瓷 table contains 436 Hangul with corresponding 6763 Hanzi [1]. The number of daily-used Hanzi in Ko- rea is known as only 1800*1, and 3500 Hanzi are re- phological analyzer*3. The precision of the analyzer quired to learn for practical Chinese character level is announced to be higher than 95%. test*2. Obviously, many of the Hanzi in their table cannot be considered as practical ones. 3.2 Translation by Dictionary Matching In the following sections, we introduce the proposed method of our system in detail, and further demon- Some of the Korean nouns are translated into Chi- strate the result of conducted experiment . nese with a parallel dictionary as the initial step. As is well known, Wikipedia offers a wide range of parallel data for many languages, among them is aligned Wikipedia titles. In our method, we use the 3 Proposed Method aligned Wikipedia title pairs of Chinese and Korean as a parallel dictionary. In addition, we apply the Figure 1 shows the overview of our Korean-to- following processes to improve the quality and cov- Chinese word translation system. In this study, we erage of the parallel dictionary: only focus on the translation of Korean nouns be- cause the large number of technical terms are nouns. • Make full use of redirect pages of each page, and Given a Korean sentence, we first apply morphologi- validate the correctness using the first sentence cal analyses to extract Korean nouns. Then we look of the definition part to augment the parallel dictionary. up the Chinese translations of the Korean words in • a Korean-Chinese parallel dictionary. The words Convert Chinese characters of traditional Chi- not included in the parallel dictionary are passed to nese into simplified Chinese. the next step: generating possible Chinese character combinations as the translation candidates using the The Korean nouns which cannot be translated with Hangul-Hanzi mapping table. The candidates are the parallel dictionary are passed to the next pro- ranked by the combination score and context simi- cess. In addition, the Korean nouns which have mul- larity score. The combination score represents the tiple translation candidates (such as homonyms or possibility of the sequence of the chinese characters ambiguous words) are also passed to the next pro- calculated on the large Chinese web corpus. The cess. context similarity score considers the context of the input sentence and that of the sentences in the large Chinese web corpus. 3.3 Generating Translation Candidates The aim of this step is to generate possible trans- 3.1 Extracting Nouns lation candidates by combining Hanzi characters converted from the Hangul characters using the Korean words are separated from each other with Hangul-Hanzi mapping table. For instance, using spaces, however, each words splitted by a space the mapping table in Table 1, we can generate the 한자 may contain one or more morphological elements. translation candidates for \ (Chinese charac- For example, Korean word 학교에서 is composed ter, 汉字)": 闲姐, 韩姐,... 汉字, 汉子.... Whether of 학교 (noun) and 에서 (particle). Since most of these combinations have the proper meaning or not the Sino-Korean words are nouns, to obtain Sino- is still unknown. Most of the combined words may Korean words, we need a Korean morphological an- have no practical meaning. So we need to select the alyzer to extract nouns. most appropriate combination. For the given Korean sentences that contain Sino- Korean vocabularies, apart from splitting words 3.4 Rank the Translation Candidates with spaces, we extracted nouns from these sen- Now we have a large amount of combinations of tences with the help of a Java-library based mor- Hanzi characters. In order to select the most appro- priate one, we utilize combination score and context *1Wikipedia:상용한자 *2Korea Foreign Language Evaluation Institute *3KOMORAN ver 2.3 (Java Korean morphological ana- http://www.pelt.or.kr/cs/10/main/main.aspx lyzer) ― 96 ― Copyright(C) 2015 The Association for Natural Language Processing. All Rights Reserved. Figure 1: Lexicon Construction System similarity score calculated using a large Chinese web corpus. Table 2: The Korean-Chinese aligned resource Hanzi Meaning in Korean Hangul 强 강할 강 京 서울 경 3.4.1 Combination Score 界 지경 계 Combination score Scombi measures the strength of 計 셀 계 the link between the characters. For example, the combination score for \汉字" is calculated as S (汉字) = P (汉 j 字) × P (字 j 汉) combi The specified value of α ([0,..1]) is determined with 5-fold cross validation. We divide the set into 5 parts and recursively select one of them to test the 3.4.2 Context Similarity Score precision for each α. The final translation result For each combination, character-based context vec- is get with the best performed α. The character tor is constructed using the web corpus. We use sen- combination with the highest score is regarded as tences which contains the combination as the con- the final translation result. text window, and each element of the vector is the co-occurrence count of Chinese characters. We ig- nore stop characters such as 的 and 了 *4 and char- acters with less than 100 times of occurrence. 4 Experiment We also construct another context vector of the in- 4.1 Settings put Korean sentence using the formerly translated Korean words. The context similarity score Scontext For the Hanja-Hanzi mapping table, [Chu, et, al. can be calculated as the cosine similarity of the two 2012] have produced a Chinese character map- context vectors. ping table for Japanese (Kanji), Traditional Chi- nese (TC) and Simplified Chinese (SC) [2], thus for constructing a table further for Japanese, Chinese 3.4.3 Interpolation and Korean, we need to construct rather Japanese- The combination score is useful to examine if the Korean or Chinese-Korean table and merge these combination is appropriate or not, and the context two tables.