Advances in

Edit Calculation by Phonetic Rules and Word-length Normalization

Byunggul Bae+, Seung-Shik Kang+, Byung-Yeon Hwang++ Department of Computer Science Kookmin University+ 77, Jeongneung-ro, Seongbuk-gu, Seoul 136-792, Korea+ School of Computer Science and Information Engineering The Catholic University of Korea++ 43 Jibong-ro, Wonmi-gu, Bucheon-si, Gyeoinggi-do, 420-743, Korea++ [email protected], [email protected], [email protected]

Abstract: calculation is a well-known criterion for string-to-string comparison. It seems to be an appropriate method for alphabetic languages like English and European languages. However, the conventional edit distance criterion is not the best one for agglutinative languages like Korean. The reason is supposed that two or more letter units make a Korean character which is called as a syllable. This mechanism of syllable construction in the Korean language causes an inefficiency of edit distance calculation. We explored a new edit distance method by consonant normalization and the normalization factor.

Key-Words: edit distance, Korean character, consonant normalization, normalization factor

1 Introduction edit distance. Suppose that a complex vowel Edit distance is defined as a cost of unit operations ‘ㅘ’(wa) is substituted by ‘ㅝ’(wo). There are two that is required to transform a string to the other candidates of edit distance. Edit distance would be 1 string in which those two strings become the same because only a letter is different. However, it may string. There are four unit operations: insertion, be 0.5 because half of complex vowel is different deletion, replacement, and transposition in which from the key-stroke point of view, ‘ㅏ’ and ‘ㅓ’. insertions and deletions have equal costs and replacements have twice the cost of an insertion [1,2]. 2 Related Works Edit distance is widely used for string-to-string There are several distance measures: Hamming, comparison in alphabetic languages like English and Levenshtein, Damerau-Levenshtein, and Jaro- European languages. However, the conventional Winkler distance. is a edit distance metric is not so good for agglutinative between two strings of equal length [5]. It is used in languages like Korean [3,4]. We investigated the an to count the number of reason and found that word construction mechanism substitutions while transferring a text string. should be considered for the edit distance is a string-to-string distance calculation. In the Korean language, two or more metric that is recursively defined as in equation (1), letter units make a Korean character which is called where distance measures are insertion, deletion, and a syllable. Open syllable consists of a consonant substitution [6]. followed by a vowel, and close syllable consists of three letters, consonant + vowel + consonant. Furthermore, when we write a complex consonant or a complex vowel, two key-strokes are required. That is, a syllable is completed in a computer environment by two to five key-strokes from an input device like keyboard or mouse. The input mechanism of complex consonants and (1) complex vowels needs two key-strokes, but they make a letter in a Korean word. This is what makes Damerau-Levenshtein distance is an extension of it hard to set a specific guide-line to calculate the the Levenshtein distance by adding a new operation:

ISBN: 978-1-61804-126-5 315 Advances in Computer Science

transposition [7]. Jaro-Winkler distance has been 3.2 Syllable-based Edit Distance designed and best suited for short strings such as Korean word is defined as a sequence of syllables. person names [8]. Equation (2) describes the Jaro Syllable-based edit distance is to count the number distance dj of two given strings s1 and s2, where m is of different syllables in a string. This metric is the number of matching characters and t is half the adequate for long strings like text-to-text similarity, number of transpositions. but it cannot be used for short strings because all the distance score is 0 or 1 regardless of the similarity of syllables. The difference between syllables ‘다’ and ’답’ is the final consonant ‘ ’ and ‘ㅂ’ which is one third of the syllable. Syllable-based metric (2) imposed a critical problem when typo errors occur on a syllable boundary.

3 Edit Distance for Korean Words 3.3 Hybrid Edit Distance Korean word is a sequence of syllables which is Roh(2010) points out the inefficiencies of syllable- called as a Korean character. Korean syllable is a based and letter-based method [3]. Syllable-based combination of consonants(C) and vowels(V) in metric has a drawback in which one letter difference 1 which only C1VC2 format is allowed. Korean and whole syllable difference are treated as the same 2 character can be considered to be very similar to distance. Chinese character in which it consists of one or more basic letter units. However, the internal dsyl(공원, 고원) = dsyl(공원, 낙원) = 1 structure of Korean character is wholly different from Chinese character in that Korean character is In the above example, dsyl(공, 고) is the same separated into letters of consonant and vowels. distance as dsyl(공, 낙), though there are two General edit distance metric has been devised and matching letters in ‘공’ and ‘고’, where there is no applied for a simple, linear, and one-dimensional matching letter in ‘공’ and ‘낙’. string like English words [9,10]. When we try to Roh(2010) proposed a hybrid method of syllable apply this edit distance metric to the two- and letter-based metric. He defined a β –distance dimensional Korean word system, there are three for syllable-based metric and α -distance for letter- possible ways according to the basic operations. based metric. If there is one or two letters match

then α -distance is applied, and β –distance is used 3.1 Letter-based Edit Distance for whole syllable difference. In the letter-based Letter-based metric is just like to apply conventional metric, three letter-distance 3α is the maximum in Levenshtein distance to a letter string. A word is one syllable. supposed to be a sequence of consonants or vowels with no consideration of syllable construction. This method ignores a syllable boundary so that a 4 Phoneme-based Edit Distance distance calculation for misspelled or transposition Hybrid edit distance has been devised for the errors across the syllable boundaries are easily syllable structure of the Korean language. This handled. method divides a distance metric into syllable Letter-based metric easily computes transposition distance and letter distance. However, this hybrid errors that cannot be handled in syllable-based method did not completely solve the syllable- metric, but it does not consider the syllable boundary changes. In this paper, we propose a new characteristics of the Korean language where each method that considers phonetic change rules of the word is a sequence of well-formed syllables CV or Korean language. C1VC2. Syllable-based system naturally put a limitation of consonant and vowel combination. So, 4.1 Phoneme Normalization most typo errors are affected by the syllable Our new approach of Korean edit distance formation rules and edit distance metric should calculation introduces a phonetic pronunciation. As reflect this kind of syllable-structured characteristics.

2 The syllable concept is merely the basics of the Korean writing system. In other points of view, Korean words are 1 Final consonant C2 is optional. treated as a sequence of consonants and vowels.

ISBN: 978-1-61804-126-5 316 Advances in Computer Science

a preprocessing of input strings, consonants can be Zero means that two word strings are equal and 1 normalized to a phoneme representative. In this means that they are wholly different. For the approach, nine phonetic transformation rules in example word pairs in Fig. 1, distance normalization Lee(2008) are sequentially applied [11]. After then, score of <국어, 숙어> is 0.16(=1/6) and that of letter-based distance metric is applied to the <나무가지, 나뭇가지> is 0.08(=1/12). These distance modified input strings. Table 1 shows phonetically values mean that long word-length pair is more equal consonants and their corresponding similar than short word-length pair. representatives.

Table 1. Consonant representatives 5 The Experimentation Consonants Representative Experiments on normalization effects between two ㄱ, ㅋ, ㄲ ㄱ similar word pairs have been conducted. ㄷ, ㄸ, ㅌ ㄷ Experimental data are 699 confusing word pairs that have been built by the Ministry of Culture and ㅂ, ㅃ, ㅍ ㅂ Tourism [12]. Non-similar word pairs have been ㅅ, ㅆ ㅅ automatically constructed by all the other words except the one in the similar word pair. That is, non- ㅈ, ㅉ, ㅊ ㅈ similar word pairs are 688 for each word.

4.2 Edit Distance Normalization 5.1 Threshold Value of Normalization Edit distance normalization is introduced for two Table 2 shows the necessity of the normalization reasons. First, word-length independent metric will effect. X-axis is the threshold value for similarity get a better result than word-length dependent decision and y-axis is the percentage of word pairs metric. Second, non-similar words are easily filtered according to the threshold value. If we make a out when there are lots of word pairs for edit decision that a normalization score is less than 0.4, distance calculation. then 93.56% of word pairs are determined to be the synonym word.

Fig. 1. Example of conventional edit distance

Fig. 1 shows two examples of edit distance calculation. The meaning of edit distance 1 in these two cases is not the same. It is common that edit distance 1 or 2 between two short words shows that these two words are different words. In Fig. 1, the meaning of ‘국어’(national language) and ‘숙어’(phrase) are different, where ‘나무가지’(tree branch: old spelling standard) and ‘나뭇가지’(tree branch: new spelling standard) have the same meaning. Fig. 2. Normalization for similar word pairs In order to increase the discrimination power of the edit distance metric, word-length normalization is introduced. In equation (3), α is an edit distance Threshold value for similar or non-similar word pair and |wi| is a word length. is controversy that we should induce the best threshold value. Fig. 4 shows the result of N = α / max(|w1|, |w2|) (3) comparing the accuracy and threshold value 0.5 has where |wi| = 3 syl_length(wi) got the best result.3

Word length |wi| is a syllable length of the word multiplied by 3 in which the length of a syllable is 3 For non-similar 243951 word pairs, 242732 word defined as 3. The range of N is between 0 and 1. pairs have been determined correctly.

ISBN: 978-1-61804-126-5 317 Advances in Computer Science

For the edit distance 1 and 2, our method got excellent results.

6 Conclusion We proposed a new method of edit distance calculation for the syllable-structured word similarity. This metric has been devised to improve the performance of the syllable-based and letter- based metrics for word similarities. Edit distance performance has been improved through the phonetic pronunciation rules and word-length normalization. Experimental results show that phoneme-based metric got a better result compared to the methods of letter-based, syllable-based and hybrid distance. Fig. 4. Accuracy of normalization factor Phoneme-based metric is improved by applying a normalization method which is better than other methods both for similar word pairs and non-similar 5.2 Performance Evaluation word pairs. We performed an experimentation and the results are shown in Table 2. Threshold value of the proposed method was set to 0.5. The methods in References: Table 2 are as follows. [1] R. A. Wagner, M. J. Fischer, “The String-to- String Correction Problem”, Journal of the - SylED: Syllable-based distance metric ACM, Vol.21, No.1. , 1974, pp. 168-173. - LetED: Letter-based distance metric [2] G. Navarro, “A guided tour to approximate - HybED: Hybrid metric of syllable and letter-based string matching”, ACM Computing Surveys, - PhoED: Phoneme-based distance metric Vol.33, No.1, 2001, pp.31-88. - NphoED: Normalized phoneme-based metric [3] K. Roh, J. Kim, E. Kim, K. Park, and H. Cho, “Edit Distance Problem for the Korean Table 2 shows that phoneme-based and normalized Alphabet,” Journal of KIISE, Vol.37, No.2, phoneme-based metric got a better result than the 2010, pp.103-109. other methods. Especially, in the distance 2 and 5, [4] B. Bae, I. Park, and S. Kang, “Edit Distance phoneme based metric with normalization are better Calculation Method by using Phonological than the other edit distance metrics. Rules for Korean,” Proceedings of the 38th KIISE Fall Conference, Vol.38, No.2(B) , 2011, Table 2. Edit distance for similar word pairs pp.221-222. Dist. SylED LetED HybED PhoED NphoED [5] R. W. Hamming, “Error detecting and error 0 0 0 0 14.2 22.3 correcting codes”, Bell System Technical 1 · 68.4 68.4 68.7 71.2 Journal, Vol.29, No.2, 1950, pp.147-160. [6] V. Levenshtein, “Binary Codes Capable of 2 · 83.5 82.6 84.8 85.1 Correcting Deletions, Insertions and 3 75.9 89.6 89.1 91.6 91.6 Reversals,” Sov. Phys, Dokl., Vol.10, 1966, 4 94.2 93.8 95.7 95.7 · pp.707-710. 5 · 96.2 95.8 96.7 96.7 [7] F. J. Damerau, “A Technique for Computer 6 96.2 98.4 98.0 98.6 98.6 Detection and Correction of Spelling Errors,” 7 · 99.3 98.5 99.0 99.0 Communications of the ACM, 1964, pp.171- 8 99.7 99.7 99.4 99.4 176. · [8] W. E. Winkler, “String Comparator Metrics 9 99.3 99.9 99.9 99.9 99.9 and Enhanced Decision Rules in the Fellegi- 10 · 100.0 100.0 100.0 100.0 Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, NphoED metric got the best performance of 91.6% 1990, pp.354-359. in an edit distance 3 and 98.6% in the edit distance 6. [9] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Computer Science and

ISBN: 978-1-61804-126-5 318 Advances in Computer Science

Computational Biology, Cambridge Univ. Press, 2007. [10] K. Kukich, “Techniques for Automatically Correcting Words in Text,”, Journal of ACM Computing Surveys, Vol.24, 1992, pp.377-439. [11] H. Lee, A Study on the Efficient Education of Pronunciation in Korean Phonetic Transformation Rules, M.S. Thesis, Donga University, 2008. [12] S. Chang, S. Kim, and S. Chung, This Slip of the Tongue that Slip of the Pen, Ministry of Culture and Tourism, 2000.

ISBN: 978-1-61804-126-5 319