A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain

Jin-Xia HUANG, Sun-Mee BAE, Key-Sun CHOI
Department of Computer Science
Korea Advanced Institute of Science and Technology / KORTERM / BOLA
373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701
{hgh, sbae, kschoi}@world.kaist.ac.kr

Abstract

Sino-Korean words, which were historically borrowed from Chinese, can be represented in both Hanja (Chinese character) and Hangeul (Korean character) writings. Previous Korean Input Method Editors (IMEs) provide only a simple dictionary-based approach to Hangeul-Hanja conversion. This paper presents a sentence-based statistical model for Hangeul-Hanja conversion, with word tokenization included as a hidden process. As a result, we reach 91.4% character accuracy and 81.4% word accuracy in the terminology domain, when only very limited Hanja data is available.

1 Introduction

More than half of all Korean words are Sino-Korean words (Chang, 1996). These words were historically borrowed from Chinese and can be represented in both Hanja and Hangeul writings. Hanja writing is rarely used in modern Korean, but it still plays an important role in word sense disambiguation (WSD) and word origin tracing, especially in the terminology, proper noun and compound noun domains.

Automatic Hangeul-Hanja conversion is difficult for a system for several reasons. There are 473 Hangeul characters (syllables) with Hanja correspondences, mapping to 4,888 common Hanja characters (Kim, 2003). Each of these Hangeul characters can correspond to anywhere from one to sixty-four Hanja characters, so it is difficult for a system to select the correct Hanja correspondence. Besides that, Hangeul characters/words that look Sino-Korean can also be native Korean characters/words, depending on their meaning. For example, "수술 (susul)" can correspond to a native Korean word meaning "stamen", the Sino-Korean word "手術" meaning "operation", or a mixed word meaning "fringe" (Bae, 2000). This means that in Hangeul-Hanja conversion, the same word may either be converted to Hanja or remain in Hangeul writing. In addition, compound Sino-Korean words can be written in both with-space and without-space formats even after part-of-speech (POS) tagging, because spacing is very flexible in Korean. For example, "한자변환" (Hanja conversion) can appear both as "한자변환" and as "한자 변환". This means that compound word tokenization should be included as a pre-processing step in Hangeul-Hanja conversion. Automatic Hangeul-Hanja conversion also suffers from another problem: there are not enough Hanja corpora for a statistical approach. In modern Korean, generally only a few Sino-Korean words are written in Hanja, and the same Sino-Korean word with the same meaning can appear in either Hangeul or Hanja writing even within the same text. A small sketch of this candidate ambiguity and tokenization problem is given at the end of this section.

This paper presents a sentence-based statistical model for Hangeul-Hanja conversion. The model includes a transfer model (TM) and a language model (LM), in which word tokenization is included as a hidden process for compound word tokenization. To answer questions such as whether to adapt the model to the character or word level, and whether to limit the conversion target to nouns only or expand it to other part-of-speech (POS) tags, a series of experiments has been performed. As a result, our system shows significantly better results with only very limited Hanja data, compared to the dictionary-based conversion approach used in commercial products.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 describes our model. Section 4 discusses several factors considered in the model implementation and experiment design. Section 5 gives the evaluation approaches and a series of experiment results. Section 6 presents our conclusion.
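The following is a minimal illustrative sketch (not part of the original paper) of the ambiguity just described: a single Hangeul input can be tokenized in several ways, and each token may map to several Hanja candidates or stay in Hangeul. The toy dictionary entries and the function names `tokenizations` and `conversions` are hypothetical.

```python
# Toy illustration (hypothetical data): one Hangeul input, several tokenizations,
# several candidate writings per token. This is only a sketch of the problem,
# not the authors' system.

from typing import Dict, List

# Toy candidate dictionary: Hangeul word -> possible Hanja (or Hangeul) writings.
TOY_CANDIDATES: Dict[str, List[str]] = {
    "한자": ["漢字"],
    "변환": ["變換"],
    "한자변환": ["漢字變換"],
    "수술": ["수술", "手術"],  # native word (stamen) or Sino-Korean word (operation)
}

def tokenizations(text: str) -> List[List[str]]:
    """Enumerate all ways to split `text` into dictionary words or single syllables."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in TOY_CANDIDATES or end == 1:  # unknown syllables pass through
            for rest in tokenizations(text[end:]):
                results.append([head] + rest)
    return results

def conversions(tokens: List[str]) -> List[List[str]]:
    """Expand each token into its candidate writings (cartesian product)."""
    if not tokens:
        return [[]]
    head_cands = TOY_CANDIDATES.get(tokens[0], [tokens[0]])
    return [[c] + rest for c in head_cands for rest in conversions(tokens[1:])]

if __name__ == "__main__":
    for tokens in tokenizations("한자변환"):
        for cand in conversions(tokens):
            print(tokens, "->", cand)
```

Even in this toy setting, a converter must both choose a tokenization ("한자변환" vs. "한자" + "변환") and rank the candidate writings, which is what the statistical model in Section 3 is designed to do.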
2 Related Work

There are several related areas, according to the tasks and approaches involved. The first is previous work on Korean Hanja, Japanese Kanji (Chinese characters in Japanese) and Chinese Pinyin input methods; the second is English-Korean transliteration.

Korean IMEs (Haansoft, 2002; Microsoft, 2002) support word-based Hangeul-to-Hanja conversion. They provide all possible Hanja correspondences for all Hanja-related Hangeul words in a user-selected range, without any candidate ranking or Sino-Korean word recognition. The user has to select the Sino-Korean words and pick out the correct Hanja correspondence. Word tokenization is performed by a left-first longest-match method; no context or statistical information is considered when providing correspondences, except for a last-used-first approach in one Korean IME (Microsoft, 2002).

A multiple-knowledge-source based Hangeul-Hanja conversion method has also been proposed (Lee, 1996). It was a knowledge-based approach which used case frames, noun-noun collocations, co-occurrence patterns between two nouns, last-used-first and frequency information to distinguish the senses of Sino-Korean words and select the correct Hanja correspondence for a given Hangeul writing. Lee (1996) reported that for practical use, a sufficiently large knowledge base, including a case-frame dictionary, a collocation base and co-occurrence patterns, would have to be developed.

Several methods have been proposed for Japanese Kana-Kanji conversion, including last-used-first, most-used-first, nearby-character, collocation and case-frame based approaches. The word co-occurrence pattern approach (Yamashita, 1988) and the case-frame based approach (Abe, 1986) were reported to achieve quite high precision. The disadvantages are that a large knowledge base must be developed in advance, and that a syntactic analyzer is required for the case-frame based approach.

Chinese Pinyin conversion is a task similar to Hangeul-Hanja conversion, except that all Pinyin syllables are converted to Chinese characters. To convert Pinyin P to Chinese characters H, Chen and Lee (2000) used Bayes' law to maximize Pr(H|P), with a LM Pr(H) and a typing model Pr(P|H). The typing model reflects online typing errors, and also measures whether the input is an English or a Chinese word. According to their report, the statistics-based Pinyin conversion method showed better results than rule-based and heuristic-based methods. Hangeul-Hanja conversion normally does not need to convert online input, so we assume the user input is perfect and employ a transfer model instead of the typing model of Chen and Lee (2000).

The third related area is transliteration. In statistical English-Korean transliteration, to convert an English word E to a Korean word K, a model may use a Korean LM Pr(K) and a TM Pr(E|K) to maximize Pr(K|E) (Lee, 1999; Kim et al., 1999), or use an English LM Pr(E) and a TM Pr(K|E) to maximize Pr(E,K) (Jung et al., 2000).

3 The Model

Unlike previous Hangeul-Hanja conversion methods in Korean IMEs, our system uses statistical information in both Sino-Korean word recognition and the selection of the best Hanja correspondence. The model consists of two sub-models: a Hangeul-Hanja TM and a Hanja LM. Together they provide a unified approach to the whole conversion process, including compound word tokenization, Sino-Korean word recognition, and selection of the correct Hanja correspondence.

Let S be a Hangeul string (block) not longer than a sentence. For any hypothesized Hanja conversion T, the task is to find the most likely T*, i.e. the most likely sequence of Hanja and/or Hangeul characters/words, so as to maximize the probability Pr(S, T): T* = argmax_T Pr(S, T).

Pr(S, T) could be the transfer probability Pr(T|S) itself. In addition, like the model in the Pinyin IME of Chen and Lee (2000), we also use a Hanja LM Pr(T) to measure the probabilities of hypothesized Hanja and/or Hangeul sequences. The model is a sentence-based model, which chooses the probable Hanja/Hangeul word according to the context. The model thus has two parts, the TM Pr(T|S) and the LM Pr(T). We have:

    T* = argmax_T Pr(S, T) = argmax_T Pr(T|S) Pr(T)    (1)

T is a word sequence composed of t1, t2, ..., tn, where ti can be either a Hanja or a Hangeul word/character. Note that the model in equation (1) does not follow Bayes' law. It is only a combination of the TM and the LM, in which the TM reflects the transfer probability and the LM reflects context information. Using a linearly interpolated bigram as the LM, the model in equation (1) can be rewritten as equation (2):

    Pr(S, T) ≈ ∏_{i=1..n} Pr(t_i | s_i) { β Pr(t_i | t_{i-1}) + (1 − β) Pr(t_i) }    (2)

Word tokenization is also a hidden process in model (2), so both T = t1, t2, ..., tn and T' = t'1, t'2, ..., t'm can be correspondences of a given source sentence S. In practice, a Viterbi algorithm is used to search for the best sequence T*.

We do not use the noisy channel model T* = argmax_T Pr(T|S) = argmax_T Pr(S|T) Pr(T) to obtain T*, because most Hanja characters have only one Hangeul writing, so most of the probabilities Pr(S|T) tend to be 1. If we used the noisy channel model for Hangeul-Hanja conversion, the model would therefore be weakened to the Hanja LM Pr(T) in most cases.
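As an illustration of how equation (2) can be searched with the Viterbi algorithm, with tokenization left as a hidden choice, here is a minimal sketch. The probability tables (TM, UNIGRAM, BIGRAM), the constants BETA and FLOOR, the unknown-syllable pass-through, and the function names are hypothetical placeholders chosen for this sketch, not the authors' actual models or data.

```python
# Minimal Viterbi sketch for the model in equation (2):
#   Pr(S, T) ≈ Π_i Pr(t_i | s_i) * (β·Pr(t_i | t_{i-1}) + (1 − β)·Pr(t_i))
# Tokenization is hidden: candidate source words of different lengths may start
# at every character position. All probabilities below are tiny toy values.

import math
from collections import defaultdict
from typing import Dict, List, Tuple

BETA = 0.8    # bigram interpolation weight (assumed value)
FLOOR = 1e-6  # back-off probability for unseen events (assumed)

# Transfer model Pr(t | s): Hangeul source word -> {target writing: probability}.
TM: Dict[str, Dict[str, float]] = {
    "한자": {"漢字": 0.9, "한자": 0.1},
    "변환": {"變換": 0.8, "변환": 0.2},
    "한자변환": {"漢字變換": 0.7, "한자변환": 0.3},
}

# Unigram and bigram LM probabilities over target tokens (hypothetical counts).
UNIGRAM: Dict[str, float] = defaultdict(lambda: FLOOR,
                                        {"漢字": 0.02, "變換": 0.02, "漢字變換": 0.01})
BIGRAM: Dict[Tuple[str, str], float] = defaultdict(lambda: FLOOR, {("漢字", "變換"): 0.3})

def lm(prev: str, t: str) -> float:
    """Linearly interpolated bigram LM: β·Pr(t|prev) + (1 − β)·Pr(t)."""
    return BETA * BIGRAM[(prev, t)] + (1.0 - BETA) * UNIGRAM[t]

def convert(s: str) -> Tuple[float, List[str]]:
    """Return the best-scoring conversion of Hangeul string `s` (tokenization is hidden)."""
    # chart[i] maps last-emitted token -> (log score, output sequence) for prefix s[:i].
    chart: List[Dict[str, Tuple[float, List[str]]]] = [dict() for _ in range(len(s) + 1)]
    chart[0]["<s>"] = (0.0, [])
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            src = s[i:j]
            # Unknown single syllables pass through unchanged with a floor probability.
            cands = TM.get(src, {src: FLOOR} if j == i + 1 else None)
            if cands is None:
                continue
            for prev, (score, seq) in chart[i].items():
                for t, p_tm in cands.items():
                    new = score + math.log(p_tm) + math.log(lm(prev, t))
                    if t not in chart[j] or new > chart[j][t][0]:
                        chart[j][t] = (new, seq + [t])
    return max(chart[len(s)].values(), key=lambda x: x[0])

if __name__ == "__main__":
    score, tokens = convert("한자변환")
    print(tokens, score)
```

The log-domain scoring, the floor value for unseen events, and the pass-through handling of unknown syllables are implementation choices made only for this sketch; the paper itself specifies only that the search over equation (2) is performed with a Viterbi algorithm.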
4 Implementation

Several factors should be considered in the model implementation. For example, we could adapt the model to the character level or the word level; we could adopt a TM weight as an interpolation coefficient and find the weight that gives the best result; and we could consider utilizing a Chinese corpus to try to overcome the sparseness problem of the Hanja data.

4.2 Transfer Model Weight

Our model in equation (2) is not derived from Bayes' law. We simply use the conditional probability Pr(T|S) to reflect the Hangeul-Hanja conversion possibility, and assume that the Hanja LM Pr(T) is helpful for smoothing the output. Since the model is only a combination model, we need an interpolation coefficient α, a TM weight, to get
