Segmentation of Touching Character Printed Lanna Script Using Junction Point

Segmentation of Touching Character Printed Lanna Script Using Junction Point

Journal of Engineering Science and Technology Vol. 13, No. 10 (2018) 3331 - 3343 © School of Engineering, Taylor’s University SEGMENTATION OF TOUCHING CHARACTER PRINTED LANNA SCRIPT USING JUNCTION POINT RUJIPAN KOSARAT*, NUALSAWAT HIRANSAKOLWONG Department of Computer Science, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Bangkok 10520, Thailand *Corresponding Author: [email protected] Abstract In the northern part of Thailand since 1802, Lanna characters were popular as ancient characters. The segmentation of printed documents in Lanna characters is a challenging problem, such as the partial overlapping of characters and touching characters. This paper focuses on only the touching characters such as touching between consonants and vowels. Segmentation method begins with the horizontal histogram and then vertical histogram for segmentation of text lines and characters, respectively. The results are characters consisted of correct clear characters, partial overlapping characters, and touching characters. The proposed method computes the left edge junction points and right edge junction points. Then find their maximum numbers and find the value of its row to separate consonant and vowel from touching. The trial over the text documents printed in Lanna characters can be processed with an accuracy of 95.81%. Keywords: Histogram, Line and character Segmentation, Touching character. 3331 3332 R. Kosarat and N. Hiransakolwong 1. Introduction Lanna language, commonly called Northern Thai language or Kam Mueang, is the language most used in Lanna empire, the Northern part of Thailand. Lanna language developed from ancient Mon script similarity with religion scripts of Laos and Burmese. This font of Lanna characters was used in the Lanna Kingdom since 1802. Currently, this language is used in religious that found in the temple of northern Thailand. The Lanna language is a likeness of northern Thai and Chiang Saeng languages. Six million people in north of Thailand and several thousand people in Laos used Lanna for communication, however, although nowadays the Lanna alphabets were inscribed on many places such as the wall in temples, ancient books and stone inscriptions, still there are very few people who know Lanna language. An automatic segmentation of printed Lanna document is very useful for teaching in the classroom and can be applied to different applications such as read automatically and e-learning. Nowadays, there are so many OCR (Optical Character Recognition) available online in many languages, however, it is not available for Lanna language. Therefore, this paper needs to focus the Lanna character segmentation and do the Lanna character recognition in the future work for planning to do OCR available online for Lanna Language in the future. The Lanna language contains 42 consonants, 14 single vowels and 8 top vowels, 12 transform consonants, 9 forced tone marks, and 10 number characters. The mix between consonants and vowels in Lanna language styles is similar to Thai characters. The Lanna language is consists of consonants around vowels. A vowel can be placed on the top, bottom, left and right of consonants as shown in Fig. 1. Current font Lanna on palm leaves or published books were mainly written in four levels, as shown in Fig. 2. Level of words at the present time in Lanna language has four levels. The first level is a tone, which is placed on top of the consonant. The second level is a vowel. The third level is consonants in the main line and the fourth level is a vowel under consonant. There are two types of overlapping characters; partial overlapping and touching overlapping. Touching and partial overlapping of characters are the first encountered problem like any other languages. Fig. 1. An example of Lanna language. Fig. 2. Level of words at the present time in Lanna language. Journal of Engineering Science and Technology October 2018, Vol. 13(10) Segmentation of Touching Character Printed Lanna Script using . 3333 Some overlapping characters are letters of any two consonants and/or two vowels overlapping each other (in any direction) any levels shown in Fig. 3. A touching character is a character that contains a consonant and a vowel touching each other between levels shown in Fig. 4. Over most touching character is found a few touching conjunct consonants in Lanna characters as shown in Figs. 5 and 6. Fig. 3. Some partial overlapping Lanna characters in boundary boxes. Fig. 4. Examples of touching Lanna characters in boundary boxes. Fig. 5. Touching a consonant and a vowel between levels. Fig. 6. Examples of few touching characters in Lanna characters. To many researches, the most problem for segmentation of the characters is the touching characters in several other languages such as English language with a handwritten style [1], Thai language [2], Japanese language [3], Chinese language [4], Telugu language [5], Arabic language [6] and Hindi language [7]. Most of the researches proposed a solution of touching characters for OCR [8-17]. Many languages most problems are caused by the distortion of the image quality such as writing, photography, scanning and quality of ink. Some characters cannot avoid the overlap between a consonant and a vowel. This paper focuses on touching Lanna characters from a data set of printed text documents in Lanna language. The data set of Lanna language images is created by scanning from various books in Lanna language. This method used the median filter to filter noises from the images. The horizontal and vertical histogram technique was used for segmentation lines and characters. The output is consists of correct clear characters, partial overlapping characters and touching characters. Finally, the methods will segmentation the touching characters by using junction point of edges. Journal of Engineering Science and Technology October 2018, Vol. 13(10) 3334 R. Kosarat and N. Hiransakolwong 2. Literature Review Currently, there are various methods for segmentation characters into text documents, but the issue of character discrimination is not yet solvable in many languages. Singh et al. [18] proposed an approach for offline handwritten Devnagari character Recognition using the curvelet transform, the character geometry, and comparing their recognition performances using two different classifiers; the Support Vector Machine (SVM) with Radial Basis Function (RBF), and the k-Nearest Neighbors (k-NN) classifier. This paper reported experimental results of performances by comparing of Curvelets and geometry-based features with using k-NN and SVM with RBF classifier, the percentage accuracy classifier of the Geometry-based feature at 73.4 and 35.9 percent respectively. Likewise, the percentage accuracy classifier of the Curvelet-based feature was at 93.8 and 91.5 percent respectively. Comparison results show that k-NN classifier gives better results than SVM with RBF and the Curvelet-based features are more suitable for Devnagari character recognition. There are not much research papers in character segmentation of Lanna language. Pravesjit and Thammano [19] with example of previous work with the Lanna languages, proposed a method combined Otsu’s method and multi- thresholding method and then used thinning algorithm to extract the skeleton of the touching characters in Lanna handwritten documents. Finally, use the junction points to separate touching characters with an accuracy of 86.67%. The accuracy rate is quite low. At the present time, there are so many OCR available online in many languages. Unfortunately, there is not available for Lanna language. Therefore, the proposed method is not able to find a previous work of segmentation in printed Lanna documents for comparison purposes. 3. Proposed Algorithm The main program of proposed algorithm begins with the input of Lanna text document images (as shown in Figs. 7 to 11). For each image, the algorithm will do pre-processing, segmentation (Line and character), and then segmentation of touching character as follows: Start Input image Preprocessing Segmentation Line, Character Touching Character End Fig. 7. Main program of the proposed algorithm. Journal of Engineering Science and Technology October 2018, Vol. 13(10) Segmentation of Touching Character Printed Lanna Script using . 3335 Touching character Start IF touching No Yes character ? IF overlapping Input touching character Yes character ? No Segmentation Correct clear Future work character End Output Return Fig. 8. Touching character subroutine. Segmentation Start Getting input, a touching character An example case 1 An example Case 2 … Find horizontal histogram Find horizontal histogram s1 Find s1 s1 Find junction point Return Fig. 9. Segmentation subroutine. Find s1 Start Boundary box of character w = width, h = height Valley is in Yes z = z5 z5 ? No z = z3 Return Fig. 10. Find s1 subroutine. Journal of Engineering Science and Technology October 2018, Vol. 13(10) 3336 R. Kosarat and N. Hiransakolwong Find junction point Start Left edge junction point in z Right edge junction point in z (ljp) (rjp) L = max (ljpi) R = max (rjpi) Cutting touching Character at row = (L + R) / 2 Output Return Fig. 11. Find junction point subroutine. 3.1. Preprocessing For the preparation of the dataset, the system scans the books of the Lanna language in greyscale, with a resolution of 300 dpi, totally of 112 scanned images. Dataset images may have noise from the scanning process. The system uses a median filter [20] to reduce noises. Text images are converted to binary images by using Otsu's method [21]. A binary image represents 1 (black pixels) as an object and 0 (white pixels) as background. 3.2. Line and character segmentation approach After the image was passed the pre-processing section, the system used the horizontal and vertical histograms [22] to segment lines and characters, respectively. The main segmentation is according to the following pattern. First, identify a line of text on the page: line segmentation methods based on technical histogram horizontally. Horizontal lines are used to calculate the frequency of black pixels in each row, which represents the horizontal histogram in each row.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    13 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us