A Korean Search Pattern in the LIKE Operation
Sung Chul Park1, Eun Hyang Lo1, Jong Chul Park2, Young Chul Park1
1 Kyungpook National University, Department of EECS, 1370, Sangyeok-dong, Buk-gu, Daegu 702-701, Korea
2 FusionSoft Co., Ltd., Research and Development 3-Division, 969-7 Dongcheon-dong, Buk-gu, Daegu, 702-250, Korea

Keywords: SQL, LIKE, string pattern, Korean, Unicode.

Abstract: The string pattern search operator LIKE of SQL was developed with English in mind, so that each search pattern of the operator works on individual characters of the English alphabet. For finding Korean, search patterns of the operator can be expressed with both the alphabet and the syllables of Korean. As a phonetic symbol, each Korean syllable is composed either of a leading sound and a medial sound or of a leading sound, a medial sound, and a trailing sound. Exploiting that characteristic of Korean syllables, this paper proposes, in addition to the traditional complete-syllable based search pattern of Korean, an incomplete-syllable based search pattern of Korean as a pattern of the operator LIKE. The new pattern finds Korean syllables having specific leading sounds, specific medial sounds, or both specific leading sounds and medial sounds. Formulating predicates that are equivalent to the incomplete-syllable based search pattern of Korean with existing SQL expressions is cumbersome and might cause portability problems for applications, depending on the underlying character set of the DBMS.

Chul Park S., Hyang Lo E., Chul Park J. and Chul Park Y. (2007). A KOREAN SEARCH PATTERN IN THE LIKE OPERATION. In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 457-464. DOI: 10.5220/0002376304570464. Copyright © SciTePress

1 INTRODUCTION

The operator LIKE of the database language SQL is a string pattern search operator. Given a string pattern, the operator identifies the column values that match the pattern. As pattern characters of the string pattern, the standard SQL (American National Standards Institute 1992; Melton & Simson 1993) permits normal characters and reserved characters. The operator LIKE was developed with English in mind, so that each search pattern of the operator works on individual characters of the English alphabet. For finding Korean, search patterns of the operator can be expressed with both the alphabet and the syllables of Korean. When a letter of the Korean alphabet is used as a search pattern, that letter itself is matched with the pattern, and when a Korean syllable is used as a search pattern, that syllable itself is matched with the pattern.

The string pattern of the operator LIKE allows any combination of consonants and vowels of the English alphabet. For example, with the string pattern ‘M% Ave% %a%’, strings like “Maple Avenue, Evanston” and “Martin Ave. Chicago” can be matched. Traditionally, the string pattern for Korean syllables has been a complete-syllable based one. For example, finding strings that start with the Korean syllable ‘박’ (‘Park’, as it sounds in English) can be done with the string pattern ‘박%’. However, the problem of finding Korean syllables having specific combinations of Korean letters has not been addressed in the literature. We will come back to the detailed specification of that problem shortly, right after introducing the alphabet and syllables of Korean.

The modern Korean alphabet consists of 30 consonants (‘ㄱ’, ‘ㄲ’, ‘ㄳ’, ‘ㄴ’, ‘ㄵ’, ‘ㄶ’, ‘ㄷ’, ‘ㄸ’, ‘ㄹ’, ‘ㄺ’, ‘ㄻ’, ‘ㄼ’, ‘ㄽ’, ‘ㄾ’, ‘ㄿ’, ‘ㅀ’, ‘ㅁ’, ‘ㅂ’, ‘ㅃ’, ‘ㅄ’, ‘ㅅ’, ‘ㅆ’, ‘ㅇ’, ‘ㅈ’, ‘ㅉ’, ‘ㅊ’, ‘ㅋ’, ‘ㅌ’, ‘ㅍ’, and ‘ㅎ’, in lexicographic order) and 21 vowels (‘ㅏ’, ‘ㅐ’, ‘ㅑ’, ‘ㅒ’, ‘ㅓ’, ‘ㅔ’, ‘ㅕ’, ‘ㅖ’, ‘ㅗ’, ‘ㅘ’, ‘ㅙ’, ‘ㅚ’, ‘ㅛ’, ‘ㅜ’, ‘ㅝ’, ‘ㅞ’, ‘ㅟ’, ‘ㅠ’, ‘ㅡ’, ‘ㅢ’, and ‘ㅣ’, in lexicographic order). As a phonetic symbol, each Korean syllable is composed either of a leading sound and a medial sound or of a leading sound, a medial sound, and a trailing sound. Only consonants can be used for the leading sounds and trailing sounds of Korean syllables. In Korean, the consonants that are used for leading sounds and trailing sounds are called leading consonants and trailing consonants, respectively. For the medial sounds of Korean syllables, only vowels can be used, and all the Korean vowels are used for the medial sounds. Among the 30 modern Korean consonants, 19 consonants (‘ㄱ’, ‘ㄲ’, ‘ㄴ’, ‘ㄷ’, ‘ㄸ’, ‘ㄹ’, ‘ㅁ’, ‘ㅂ’, ‘ㅃ’, ‘ㅅ’, ‘ㅆ’, ‘ㅇ’, ‘ㅈ’, ‘ㅉ’, ‘ㅊ’, ‘ㅋ’, ‘ㅌ’, ‘ㅍ’, and ‘ㅎ’, in lexicographic order) can be used as leading consonants and 27 consonants (‘ㄱ’, ‘ㄲ’, ‘ㄳ’, ‘ㄴ’, ‘ㄵ’, ‘ㄶ’, ‘ㄷ’, ‘ㄹ’, ‘ㄺ’, ‘ㄻ’, ‘ㄼ’, ‘ㄽ’, ‘ㄾ’, ‘ㄿ’, ‘ㅀ’, ‘ㅁ’, ‘ㅂ’, ‘ㅄ’, ‘ㅅ’, ‘ㅆ’, ‘ㅇ’, ‘ㅈ’, ‘ㅊ’, ‘ㅋ’, ‘ㅌ’, ‘ㅍ’, and ‘ㅎ’, in lexicographic order) can be used as trailing consonants.

Figure 1 illustrates two Korean syllables: one syllable ‘배’ (abdomen, pear, or vessel, in English) that is composed of a leading sound ‘ㅂ’ and a medial sound ‘ㅐ’, and another syllable ‘삶’ (life, in English) that is composed of a leading sound ‘ㅅ’, a medial sound ‘ㅏ’, and a trailing sound ‘ㄻ’. The lexicographic order among Korean syllables follows the order of <a leading consonant, a vowel, a trailing consonant> that constitute the syllables. That is, it keeps the order of leading consonants; for the same leading consonant, it keeps the order of vowels; and for the same leading consonant and the same vowel, it keeps the order of trailing consonants. For the same leading consonant and the same vowel, the syllable that does not have any trailing consonant precedes the syllables that have trailing consonants.

[Figure 1: Components of Korean syllables. The syllable ‘배’ is labeled with a Leading Sound and a Medial Sound; the syllable ‘삶’ is labeled with a Leading Sound, a Medial Sound, and a Trailing Sound.]

In this paper, we are concerned with finding Korean syllables that have specific leading sounds, specific medial sounds, or both specific leading sounds and medial sounds. Our goal is to specify such combinations of Korean letters directly in the string patterns of the operator LIKE without any notational difficulty. For that purpose, we have devised a two-dimensional table, which we call the Korean syllable map. As shown in Figure 2, the Korean syllable map has rows and columns representing the leading consonants and the vowels of Korean, respectively. Each cell in the map, determined by a specific row and a specific column, contains all the 28 contiguous syllables that can be constructed with the leading consonant of the row and the vowel of the column. The rows, the columns, and the syllables in each cell are arranged according to the lexicographic order of the consonants, vowels, and syllables, respectively. Accordingly, all the 11,172 modern Korean syllables in Unicode (Unicode, Inc. 2006) can be mapped into the Korean syllable map.

          0            1            ...  4            ...  6            ...  20
          ㅏ           ㅐ           ...  ㅓ           ...  ㅕ           ...  ㅣ
0   ㄱ    가각갂…갚갛  개객갞…갶갷  ...  거걱걲…겊겋  ...  겨격겪…곂곃  ...  기긱긲…깊깋
1   ㄲ    까깍깎…깦깧  깨깩깪…꺂꺃  ...  꺼꺽꺾…껖껗  ...  껴껵껶…꼎꼏  ...  끼끽끾…낖낗
...
7   ㅂ    바박밖…밮밯  배백밲…뱊뱋  ...  버벅벆…벞벟  ...  벼벽벾…볖볗  ...  비빅빆…빞빟
...
11  ㅇ    아악앆…앞앟  애액앢…앺앻  ...  어억얶…엎엏  ...  여역엮…옆옇  ...  이익읶…잎잏
...
18  ㅎ    하학핚…핲핳  해핵핶…햎햏  ...  허헉헊…헢헣  ...  혀혁혂…혚혛  ...  히힉힊…힢힣

Figure 2: The Korean syllable map for Unicode.

In the Korean syllable map, the indexes of rows, which we call row_indexes, start from 0 (for the leading consonant ‘ㄱ’) and end with 18 (for the leading consonant ‘ㅎ’), and the indexes of columns, which we call column_indexes, start from 0 (for the vowel ‘ㅏ’) and end with 20 (for the vowel ‘ㅣ’). We call the row with row_index i ROWi, the column with column_index j COLUMNj, and the cell with row_index i and column_index j CELLi,j. Let the first syllable and the last syllable in CELLi,j be FSi,j and LSi,j, respectively. Then the syllables in CELLi,j, ROWi, and COLUMNj are in the ranges [FSi,j-LSi,j], [FSi,0-LSi,20], and [FS0,j-LS0,j | FS1,j-LS1,j | … | FS18,j-LS18,j], respectively. For example, the syllables in CELL7,4, ROW7, and COLUMN4 are in the ranges [FS7,4-LS7,4] (i.e., [버–벟]), [FS7,0-LS7,20] (i.e., [바–빟]), and [FS0,4-LS0,4 | FS1,4-LS1,4 | … | FS18,4-LS18,4] (i.e., [거–겋 | 꺼–껗 | … | 허–헣]), respectively.

Based on the Korean syllable map, the problem of finding Korean syllables that have specific leading sounds, specific leading sounds and medial sounds, or specific medial sounds becomes the problem of finding syllables that are located in specific ROWs, CELLs, or COLUMNs of the Korean syllable map, respectively. Accordingly, string patterns for finding Korean syllables that have specific leading sounds, specific leading sounds and medial sounds, or specific medial sounds are called string patterns of Type_ROW, Type_CELL, or Type_COLUMN, respectively.

A simple solution of specifying those three types of Korean search patterns could be expressing each […] Unicode is moved to the environment that uses the character set of KS X 1001, or vice versa, the values of FSi,j and LSi,j must be modified manually. This means that SQL applications adopting that simple solution might have a portability problem.

Second, the simple solution might have a performance problem in executing search patterns of Type_COLUMN. Search patterns of Type_ROW and Type_CELL can be executed by checking whether a given syllable lies in a single specified range of syllables. However, a Type_COLUMN search pattern has 19 ranges of syllables, so multiple comparisons are needed to check whether a given syllable matches the search pattern.

This paper presents an intuitive, uniform, and simple way of expressing the three types of Korean search patterns that is free from the portability problem of SQL applications.
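The ranges of the Korean syllable map can be illustrated with a short Python sketch. It is not the method proposed in this paper; it only reproduces the range-based "simple solution" discussed above. It assumes the Unicode Hangul Syllables block (U+AC00 to U+D7A3), where the 11,172 syllables are laid out in the same <leading consonant, vowel, trailing consonant> order as the map. All function and constant names are hypothetical and chosen for illustration.

```python
# Assumption: Unicode code points, Hangul Syllables block U+AC00..U+D7A3.
HANGUL_BASE = 0xAC00   # code point of '가', i.e. FS(0, 0)
NUM_ROWS = 19          # leading consonants, row_index 0..18
NUM_COLUMNS = 21       # vowels, column_index 0..20
CELL_SIZE = 28         # 1 syllable without + 27 syllables with a trailing consonant


def first_syllable(i: int, j: int) -> str:
    """FS(i, j): the first syllable of CELL(i, j)."""
    if not (0 <= i < NUM_ROWS and 0 <= j < NUM_COLUMNS):
        raise ValueError("row_index or column_index out of range")
    return chr(HANGUL_BASE + (i * NUM_COLUMNS + j) * CELL_SIZE)


def last_syllable(i: int, j: int) -> str:
    """LS(i, j): the last syllable of CELL(i, j)."""
    return chr(ord(first_syllable(i, j)) + CELL_SIZE - 1)


def cell_ranges(i: int, j: int) -> list[tuple[str, str]]:
    """Type_CELL: one range [FS(i, j) - LS(i, j)]."""
    return [(first_syllable(i, j), last_syllable(i, j))]


def row_ranges(i: int) -> list[tuple[str, str]]:
    """Type_ROW: one range [FS(i, 0) - LS(i, 20)]."""
    return [(first_syllable(i, 0), last_syllable(i, NUM_COLUMNS - 1))]


def column_ranges(j: int) -> list[tuple[str, str]]:
    """Type_COLUMN: 19 ranges, one per leading consonant."""
    return [(first_syllable(i, j), last_syllable(i, j)) for i in range(NUM_ROWS)]


def matches(syllable: str, ranges: list[tuple[str, str]]) -> bool:
    """Check whether a single syllable lies in any of the given ranges."""
    return any(first <= syllable <= last for first, last in ranges)


def map_indexes(syllable: str) -> tuple[int, int]:
    """Return (row_index, column_index) of a precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_ROWS * NUM_COLUMNS * CELL_SIZE:
        raise ValueError("not a modern Hangul syllable")
    return divmod(offset // CELL_SIZE, NUM_COLUMNS)


if __name__ == "__main__":
    print(cell_ranges(7, 4))            # the range for CELL 7,4
    print(row_ranges(7))                # the range for ROW 7
    print(len(column_ranges(4)))        # number of ranges for COLUMN 4
    print(matches("벅", column_ranges(4)), map_indexes("박"))
```

Because the constants are tied to Unicode code points, the computed values of FSi,j and LSi,j would have to change under another character set such as KS X 1001, which is the portability concern raised above. The `column_ranges` function also makes the performance concern visible: a Type_COLUMN check walks up to 19 ranges, whereas Type_ROW and Type_CELL need only one.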