
Hangul Tree Classifier for Type Clustering Using Horizontal and Vertical Strokes

Young-Bin Kwon, Computer Vision Lab., Dept. of Computer ., Chung-Ang University, Seoul, 156-756, Korea. E-mail: [email protected]

Abstract

Hangul is generally clustered into six different types. Type clustering acts as a coarse classification in syllable matching and serves as the pre-processing stage for segmentation in the grapheme matching process. In this paper, we define a set of grapheme regions for the pre-consonant, vowel, and post-consonant that can absorb changes in font shape and noise. We also propose a method for extracting the main stroke of the horizontal vowel and vertical vowel that appear in the region.

1. Introduction

Hangul, the Korean script, has a much larger number of character combinations than English or Japanese, and the combinations are more complex, which makes it more difficult to recognize [1][7]. Research on Hangul recognition has produced promising results with two approaches: Jaso-unit (similar to grapheme) recognition and syllable-unit recognition. In Jaso-unit recognition [2,3], the number of characters to be matched is reduced to 40, which shortens the recognition time. Syllable-unit recognition does not require such a Jaso-partitioning process, but it has a large problem of its own because the number of characters to be recognized reaches 11,172. Most studies that use syllable-unit recognition operate only on a restricted set of characters selected by frequency of occurrence.

A study by Do [4,5], which applies the structural method, obtains all the runs in four directions (horizontal, vertical, diagonal, and anti-diagonal) and then sets the left-most and the top of the image as the base location for obtaining the horizontal and vertical strokes. Lee [6] suggested the MRLP method as a feature for type classification. MRLP creates a histogram by projecting the runs with the maximum length that exist in each row or column. The method implemented by Lee [3] determines the type through a type-classification neural network that uses the mesh vector as its input and then extracts the mesh vector from the sub-region predefined according to its style.

Hangul type classification is most effective when it is performed on -level but can only be applied as a pre-processing stage for the Jaso-partition process. In this research, we propose a method for classifying Hangul characters into 6 different types to enhance the Jaso-partition process with a high-level classification effect.

2. Type-categorization based on the principles of Hunminjeongum

The Jaso (similar to a grapheme) of Hangul is formed as an association of basic strokes, and the Umjeol (similar to a syllable, one phonetic unit of a single Korean character) is formed as a 2-dimensional association of Jaso so that it can express most natural sounds. Therefore, Hangul recognition should eventually be implemented based on Jaso units. A reliable Hangul type classification method with a high classification rate is required because Jaso segmentation and Jaso region establishment are essential to Jaso-unit recognition.

2.1. The essence of the foundation of Hunminjeongum

There are 14 Korean consonants: ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, and ㅎ. There are 10 vowels for Hangul: ㅏ, ㅑ, ㅓ, ㅕ, ㅗ, ㅛ, ㅜ, ㅠ, ㅡ, and ㅣ. Three rules for extending the set of opening phoneme elements (consonants), middle phoneme elements (vowels), and closing phoneme elements (consonants) are defined in the solution book of combination of Jaso.

2.2. Hangul types and the Jaso region

Hangul characters can be classified into 6 different types according to the shape of the vowel and the existence of a closing phoneme element. In our study, Hangul characters are first classified into 3 different types as shown in Table 1 and then further divided into two categories by the existence of a closing phoneme element.
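As a minimal sketch of the six-type scheme, the snippet below derives the type label from a Unicode code point rather than from a character image, which is one way to produce reference labels when evaluating a type classifier. The function name `hangul_type`, the vowel grouping sets, and the numbering (types 1-2 for vertical vowels, 3-4 for horizontal vowels, 5-6 for mixed vowels, with the even type having a closing phoneme element) are assumptions inferred from the examples given in this paper, not the author's implementation.

```python
# Unicode Hangul syllables: code = 0xAC00 + (initial * 21 + vowel) * 28 + final
HANGUL_BASE = 0xAC00
NUM_VOWELS, NUM_FINALS = 21, 28
NUM_SYLLABLES = 19 * NUM_VOWELS * NUM_FINALS  # 11,172 syllables

# Vowel indices in Unicode order, grouped by stroke layout (assumed grouping).
VERTICAL = {0, 1, 2, 3, 4, 5, 6, 7, 20}   # ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅣ
HORIZONTAL = {8, 12, 13, 17, 18}          # ㅗ ㅛ ㅜ ㅠ ㅡ
# Every other index is a mixed vowel:       ㅘ ㅙ ㅚ ㅝ ㅞ ㅟ ㅢ


def hangul_type(syllable: str) -> int:
    """Return the assumed type number (1-6) of a precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable")
    vowel = (offset // NUM_FINALS) % NUM_VOWELS
    has_final = offset % NUM_FINALS != 0  # final index 0 means no closing element
    if vowel in VERTICAL:
        base = 1
    elif vowel in HORIZONTAL:
        base = 3
    else:
        base = 5
    return base + (1 if has_final else 0)
```

For instance, under these assumptions `hangul_type("커")` gives type 1 and `hangul_type("궐")` gives type 6, matching the examples discussed under type pattern verification.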

1051-4651/02 $17.00 (c) 2002 IEEE

Table 1. Type classification of Hangul

Each region illustrated in Figure 1 is defined as a region where only a portion of the Jaso appears. A portion of the opening phoneme element should always appear in the opening phoneme element consonant region, and a portion of the vowel appears if either a horizontal vowel or a vertical vowel exists in the region. If a closing phoneme element exists, then parts of the closing phoneme element, except for the ‘ㄱ’ and ‘ㅋ’ characters, will appear in the closing phoneme element consonant region. Therefore, we can effectively extract the desired Jaso from the corresponding region by investigating the connection elements in the region. The suggested Jaso region is more robust against the various character types and noise than the peak used in the related studies [4,5].

Figure 1. Essential Jaso Region for Hangul characters

2.3. Proposed Tree Classifier

A tree classifier obtains stable characteristics by simplifying the complex characteristics step by step and then gathering the types. The types are further classified within each group.

In this study, we extended the tree classifier [4], which is devoted to the structural characteristics of Hangul, for classification as shown in Figure 3. The suggested tree classifier performs its horizontal vowel classification at the bottom-most level of the tree. The region in which the horizontal vowel can appear and the limitations set by the number of horizontal branches are used to determine the existence of the horizontal vowel.

Figure 2. Stroke Extraction using Essential Region: (a) original character, (b) horizontal stroke, (c) vertical stroke

Extracting the vertical stroke and the horizontal stroke

The first step of extraction obtains the horizontal and vertical runs that are thicker than the maximum thickness of a stroke and then merges the neighbouring runs together. This method can effectively absorb the change of the vertical and horizontal strokes caused by slant or noise. Figure 2-(a) is the normalized input character image, and Figure 2-(b) and (c) show the extracted horizontal and vertical strokes. The region marked with ‘O’ is the portion of the stroke that appears in the horizontal vowel region and the vertical vowel region. However, a method for separating the strokes is required because various strokes may be merged together by this method. The separation method relies on detecting the divergence and the change in thickness of the stroke.

Distinguishing between the vertical vowel and the long horizontal vowel

The vertical vowel in Hangul can be obtained by finding the stroke that fits the characteristics of a vertical vowel among the vertical strokes extracted from the vertical vowel region. This can be thought of as discarding the strokes that are not a vertical vowel from the candidate group. Because the vertical vowel region resides in the upper-right corner, an extracted vertical stroke that is not a vertical vowel can be assumed to come from an opening phoneme element consonant. We have analyzed the types of vertical strokes that can arise from an opening phoneme element consonant. Because they have a vertical stroke on the right-hand side, the characters ㄱ, ㄲ, ㅁ, ㅂ, ㅃ, and ㅋ should be discarded if they appear in the vertical vowel region. ㄱ, ㄲ, ㅁ, and ㅋ can be discarded using the information of the horizontal stroke that appears at the top, and ㅂ and ㅃ can be discarded using the vertical stroke that occurs on the left-hand side and the information of the horizontal stroke that appears at the bottom. If the long horizontal vowel has been detected accurately, all of the vertical vowel candidates can be discarded. This is because a vertical stroke cannot co-exist with a long horizontal vowel.
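The run-based stroke extraction step described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions: the image is a binary grid (1 for ink), "thicker than the maximum thickness of a stroke" is read as "a run longer than `max_thickness` pixels", neighbouring runs are merged when they lie on adjacent rows and their column spans overlap, and the function names `find_runs`, `merge_runs`, and `extract_strokes` are hypothetical. The separation of merged strokes is not covered.

```python
def find_runs(image, max_thickness):
    """Return (row, start, end) for every run of 1-pixels longer than max_thickness."""
    runs = []
    for r, row in enumerate(image):
        start = None
        for c, pixel in enumerate(list(row) + [0]):  # sentinel 0 closes a trailing run
            if pixel and start is None:
                start = c
            elif not pixel and start is not None:
                if c - start > max_thickness:
                    runs.append((r, start, c - 1))
                start = None
    return runs


def merge_runs(runs):
    """Merge runs on neighbouring rows with overlapping spans.

    Each stroke is a bounding box [top, bottom, left, right].
    """
    strokes = []
    for r, s, e in runs:  # runs are already ordered by row
        for box in strokes:
            _, bottom, left, right = box
            if r - bottom <= 1 and s <= right and e >= left:
                box[1] = r
                box[2] = min(left, s)
                box[3] = max(right, e)
                break
        else:
            strokes.append([r, r, s, e])
    return strokes


def extract_strokes(image, max_thickness):
    """Return (horizontal_strokes, vertical_strokes) as bounding boxes."""
    horizontal = merge_runs(find_runs(image, max_thickness))
    transposed = [list(col) for col in zip(*image)]
    # Boxes found on the transposed image are [left, right, top, bottom].
    vertical = [[t, b, l, r]
                for l, r, t, b in merge_runs(find_runs(transposed, max_thickness))]
    return horizontal, vertical
```

Reusing the horizontal routine on the transposed image keeps the two directions symmetric, so any change to the merging rule applies to both stroke types.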


Figure 3. Proposed Tree Classification

Distinguishing the closing phoneme element

Distinguishing the closing phoneme element that becomes the prop can be divided into two different cases according to the type of the vowel, as shown in the tree classifier of Figure 3. If a vertical vowel exists, the main characteristic is contained in the bottom point of the corresponding vertical stroke. If the bottom point of the vertical stroke is below the threshold, it can be concluded that a closing phoneme element exists. If it is larger than the threshold, the existence of a closing phoneme element is determined by the characteristic of the branch on the left side that is generated from the vertical stroke. This is because the vertical stroke extracted from the vertical vowel region becomes longer if the vertical stroke of the closing phoneme element consonant makes contact with the vertical vowel. A prop cannot exist if the long horizontal vowel is at the bottom, whereas it can if the vowel is located in the middle layer. If the long horizontal vowel exists in the middle layer and no prop exists, then the horizontal vowel can only be ‘ㅜ’ or ‘ㅠ’, which means that the existence of the prop can easily be determined by extracting the characteristic of the vertical branch.

Estimating the number of vertical branches for the vertical vowel

Table 2 shows the association rule between a horizontal vowel and a vertical vowel, where ‘O’ indicates a possible association and ‘X’ an impossible one.

As shown in Table 2, the possible association of a horizontal vowel and a vertical vowel (v-vowel) is limited to 4 pairs, which means that the number and direction of the vertical branch can be an important factor in limiting the shape of the horizontal vowel. If the number of horizontal branches (h-branch) for the vertical vowel is not known correctly, then a fatal error may occur in the extraction process of the horizontal stroke. Therefore, we propose the use of a pessimistic minus strategy (PMS) for the process of estimating the number of horizontal branches. In the PMS, if the number of horizontal branches is n, the number is taken to be n-1 when a contact is made between a horizontal branch and a vertical branch. The reason for this is to reduce the limitation on the association rule for the horizontal stroke by reducing the number of horizontal branches, as shown in Table 2.

Table 2. Combination rule of Horizontal and Vertical Vowel

h-vowel \ v-vowel    ㅣ    ㅏ, ㅐ (right)    ㅓ, ㅔ (left)    ㅑ, ㅒ, ㅕ, ㅖ
ㅗ                   O     O                X               X
ㅜ                   O     X                O               X
ㅡ                   O     X                X               X
ㅛ, ㅠ               X     X                X               X

In a vertical vowel with a single vertical stroke, the length of the horizontal run that makes up a horizontal stroke is measured to select the horizontal branch candidates. The horizontal run with the maximum length is compared with the threshold values, and a horizontal branch candidate is selected if its length lies between the minimum and the maximum threshold.

A horizontal stroke that extends to the left can be misjudged as a horizontal branch if a part of a vertical vowel makes contact with the opening phoneme element, horizontal vowel, and closing phoneme element. Therefore, the run extending to the lower left side must be tracked, and any run whose length exceeds the maximum threshold should be discarded from the set of candidates. The same process is applied to the left vertical stroke to select the horizontal branch for a vertical vowel with a double vertical stroke.

Mixed vowel selection

The long horizontal vowels in types 3 and 4 in Table 1 have already been classified in the upper-level nodes of the tree classifier. The change in location is more rapid for the horizontal vowel than for the vertical vowel, and the characteristics that appear in the horizontal vowel tend to be similar to those of the horizontal stroke of an opening phoneme element or closing phoneme element.
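As an illustration of the combination rule of Table 2 and the pessimistic minus strategy (PMS), the following sketch encodes the rule as a lookup keyed by the branch pattern of the vertical vowel. The names `ALLOWED_H_VOWELS`, `pms_branch_count`, and `can_combine` are hypothetical, the branch counts (0 for ㅣ, 1 for ㅏ, ㅐ, ㅓ, ㅔ, 2 for ㅑ, ㅒ, ㅕ, ㅖ) are an assumption based on the glyph shapes, and clamping the count at zero is an added safeguard that the paper does not state.

```python
# (horizontal branch count, branch direction) of the vertical vowel
# -> horizontal vowels that may combine with it (assumed reading of Table 2).
ALLOWED_H_VOWELS = {
    (0, None): {"ㅗ", "ㅜ", "ㅡ"},   # v-vowel ㅣ
    (1, "right"): {"ㅗ"},            # v-vowels ㅏ, ㅐ
    (1, "left"): {"ㅜ"},             # v-vowels ㅓ, ㅔ
    # two-branch v-vowels (ㅑ, ㅒ, ㅕ, ㅖ) and h-vowels ㅛ, ㅠ never combine
}


def pms_branch_count(n, has_contact):
    """Pessimistic minus strategy: count n branches as n-1 when a horizontal
    branch touches a vertical branch (never below zero, an added assumption)."""
    return max(n - 1, 0) if has_contact else n


def can_combine(h_vowel, branch_count, direction=None):
    """Check whether a horizontal vowel may co-occur with the vertical vowel."""
    key = (branch_count, direction if branch_count == 1 else None)
    return h_vowel in ALLOWED_H_VOWELS.get(key, set())
```

With this encoding, a vertical vowel measured with two branches and a detected contact is treated as a one-branch vowel, so `can_combine("ㅗ", pms_branch_count(2, True), "right")` accepts a combination that the raw count would have rejected.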


The classification of the horizontal vowel is processed at the bottom level of the tree classifier to resolve this problem. The region in which the horizontal vowel appears has been minimized using the grouping information of the Hangul type obtained at the upper-level nodes. The change in the horizontal vowel caused by a contact or noise has been absorbed using the restriction rule based on the number and direction of the horizontal stroke.

We propose a method that verifies the existence of a horizontal vowel by evaluating whether the extracted horizontal stroke follows the association rule. The possibility that it is a horizontal vowel becomes very high, regardless of the character shape, if no vertical branch appears on the left-hand side of the horizontal stroke and the left end of the horizontal vowel reaches the very left end of the normalized character image, as it does in most character shapes. Also, if a vertical branch appears in the center of the horizontal stroke, the possibility of it being a horizontal vowel becomes even higher. In this study, the number of strokes in the candidate group is first reduced by verifying the existence of a vertical branch to the left side of the horizontal stroke and the minimum x-coordinate value of the horizontal stroke. The length of the horizontal stroke is then obtained by tracking the final remaining horizontal stroke to the upper-right side. The horizontal vowel is determined by verifying the existence of a vertical branch either on top or bottom of the center of the horizontal stroke, using facts such as the direction and number of the horizontal branches of the vertical vowel at the top of the tree classifier and the length of the horizontal stroke.

Type pattern Verification

After the existence of the vertical vowel, closing phoneme element, and horizontal vowel has been determined by the tree classifier, the Hangul type is determined. However, if ‘ㅋ’ or ‘ㅍ’ appears as the opening phoneme element of types 1 and 2, the error of a horizontal stroke being judged as the main stroke of the horizontal vowel cannot be solved. For example, the horizontal strokes of ‘커’ and ‘파’ in type 1 are judged as the horizontal vowels of ‘귀’ and ‘과’ in type 5, or the horizontal strokes of ‘컬’ and ‘팔’ in type 2 are judged as the horizontal vowels of ‘궐’ and ‘괄’ in type 6. The characteristic of ‘ㅋ’ is the angle between the horizontal stroke and the declined stroke (diagonal/anti-diagonal), while the characteristic of ‘ㅍ’ is the x-coordinate of the contact of the vertical stroke between the two horizontal strokes.

3. Experiments and Result Analysis

The fonts used for experimentation were Myongjo, Gothic, and New Myongjo, which are the most widely used fonts. Table 3 shows the results of existing work on printed-font Hangul recognition that classified Hangul into 6 types and had the highest classification rate.

Table 3. Performance Comparison (1405 most frequently used character set)

Method             Gothic     Myongjo    New Myongjo    Total
Do’s Method        96.58%     95.87%     94.09%         95.52%
Proposed Method    99.99%     99.86%     99.57%         99.81%

4. Conclusion

This paper proposed a Jaso domain that can absorb font changes and noise, and a method that extracts the horizontal stroke and vertical stroke, which are the main strokes of the horizontal vowel and vertical vowel. The main idea of Hangul type classification is extracting the correct vertical vowel and horizontal vowel from the candidates, which were selected by extracting the vertical stroke from the vertical vowel domain and the horizontal stroke from the horizontal vowel domain. This paper has also proposed a method in which the number of horizontal branches is estimated in order to use the information on the compound of horizontal vowel and vertical vowel in sentence structure.

References

[1] S. Mori, C. Y. Suen and K. Yamamoto, “Historical Review of OCR Research and Development”, Proc. of the IEEE, Vol. 80, No. 7, 1992.
[2] S. B. Cho and J. Kim, “Hierarchical Neural Network for Printed Hangul Characters”, KISS Journal, Vol. 17, No. 3, pp. 306-316, 1990.
[3] J. S. Lee, O. J. Kwon and S. Y. Bong, “Highly Accurate Hangul Recognition using Improved Jaso Recognition Method”, KISS Journal, Vol. 23, No. 8, pp. 841-851, 1996.
[4] J. I. Do, “Jaso Separation for Printed Hangul Character Recognition”, Proc. of KISS fall conference, Vol. 17, No. 2, pp. 175-178, 1990.
[5] J. I. Do, “Development of Hangul Document Recognizer”, Review of KISS, Vol. 9, No. 1, pp. 22-32, 1991.
[6] K. S. Lee and H. I. Choi, “Hangul Character Type Classification using Fuzzy Inference”, Proc. of IPIU, Vol. 5, pp. 164-172, 1993.
[7] S. B. Cho and J. Kim, “Recognition of Large-Set Printed Hangul (Korean Script) by Two-Stage Backpropagation Neural Classifier”, Pattern Recognition, Vol. 25, No. 11, pp. 1353-1360, 1992.
