A Script Independent Approach for Handwritten Bilingual Kannada and Telugu Digits Recognition

IJMI International Journal of Machine Intelligence ISSN: 0975–2927 & E-ISSN: 0975–9166, Volume 3, Issue 3, 2011, pp-155-159 Available online at http://www.bioinfo.in/contents.php?id=31 A SCRIPT INDEPENDENT APPROACH FOR HANDWRITTEN BILINGUAL KANNADA AND TELUGU DIGITS RECOGNITION DHANDRA B.V.1, GURURAJ MUKARAMBI1, MALLIKARJUN HANGARGE2 1Department of P.G. Studies and Research in Computer Science Gulbarga University, Gulbarga, Karnataka. 2Department of Computer Science, Karnatak Arts, Science and Commerce College Bidar, Karnataka. *Corresponding author. E-mail: [email protected],[email protected],[email protected] Received: September 29, 2011; Accepted: November 03, 2011 Abstract- In this paper, handwritten Kannada and Telugu digits recognition system is proposed based on zone features. The digit image is divided into 64 zones. For each zone, pixel density is computed. The KNN and SVM classifiers are employed to classify the Kannada and Telugu handwritten digits independently and achieved average recognition accuracy of 95.50%, 96.22% and 99.83%, 99.80% respectively. For bilingual digit recognition the KNN and SVM classifiers are used and achieved average recognition accuracy of 96.18%, 97.81% respectively. Keywords- OCR, Zone Features, KNN, SVM 1. Introduction recognition of isolated handwritten Kannada numerals Recent advances in Computer technology, made every and have reported the recognition accuracy of 90.50%. organization to implement the automatic processing Drawback of this procedure is that, it is not free from systems for its activities. For example, automatic thinning. U. Pal et al. [11] have proposed zoning and recognition of vehicle numbers, postal zip codes for directional chain code features and considered a feature sorting the mails, ID numbers, processing of bank vector of length 100 for handwritten Kannada numeral cheques etc. To recognize such documents developing recognition, achieved reasonably high accuracy, but the of handwritten optical character recognition system is time complexity of their algorithm is more. essential. Recently, a piece of work on bi-lingual and tri-lingual In this direction, many researchers have developed the (i.e. English and Devanagri, Kannada, Telugu and numeral recognition systems by using various feature Devanagri,) numeral recognition of Indian scripts has extraction methods such as statistical features, been proposed in [12, 13]. topological and geometrical features, global From the literature survey, it is evident that most of the transformation and series expansion features like Hough authors have been attempted to recognize digits of a Transform, Fourier Transform, Wavelets, Moments, etc. script. However, in day today life we come across with A survey on different feature extraction methods for many documents which contain bi-script digits. For character recognition is reported in [1]. A review on example, people who are living on the border area of two different pattern recognition methods is given in [2]. states like Karnataka and Andhra Pradesh (south Indian Extensive work has been carried out for recognition of states) have a practice of using both Kannada and Telugu characters and numerals in foreign languages like digits in a single document (i.e. postal zip codes, English, Chinese, Japanese, and Arabic. With respect to application forms, railway reservation slip and bank the Indian scripts, a major work can be found in [3, 4] on cheques etc.). Bengali and Tamil scripts, where as the work on To read such documents monolingual OCR will not work. handwritten Kannada and Telugu numerals recognition is Therefore, the designing of bilingual and multilingual in infant stage. OCRs is essential for multi-script and multi-lingual country Recognition of handwritten Kannada and Telugu like India. This has motivated us to design a simple and characters are complex task due to the unconstrained robust bilingual handwritten (Kannada and Telugu) OCR shapes, variation in writing style and different kinds of system. noise that break the strokes primitives in the characters In this paper, Section 2 contains about Kannada/Telugu or change in their topology. Most of the methods have scripts. Section 3 is about bilingual OCR. The employed fuzzy features [5, 6], templates and preprocessing of the images and data collection is deformable templates [7, 8], Structural and Statistical presented in Section 4. Feature extraction procedure is features [9]. Dinesh Acharya et al. [10] have used the 10- discussed in Section 5, the experimental details and segment string, water reservoir, horizontal/vertical results obtained are presented in Section 6. Conclusion strokes, and end point as the potential features for is the subject matter of Section 7. 155 Bioinfo Publications A script independent approach for handwritten bilingual Kannada and Telugu digits recognition 2. About Kannada/Telugu Scripts height and width (i.e.32 x 32 pixels).The normalized digit Kannada is one of the most well known Dravidian image is used for extracting the features. A sample languages of India. It is the official language of the dataset of handwritten Kannada and Telugu digits are southern Indian state of Karnataka. It is spoken by about shown in Fig.1 and Fig.2 respectively. As an example 44 million people in the Indian states of Karnataka, normalized Kannada digit is shown in Fig3. Andhra Pradesh, Tamil Nadu and Maharashtra. Its writing system is alpha syllabary in which all consonants have an inherent vowel. Other vowels are indicated with diacritics, which can appear above, below, before or after the consonants. When consonants appear at the beginning of a syllable, vowels are written as independent letters and when they appear together without intervening vowels, the second consonant is written as a special conjunct symbol, usually below the first. The direction of writing is left to right in horizontal lines. It has 14 vowels and 36 consonants and 10 digits. Fig.1 - Sample handwritten Kannada digits 0 to 9 The Telugu script is derived from the Telugu-Kannada script and developed independently at the same time as the Kannada script, as evidenced by their strong orthographical resemblance to one another. 3. Bilingual OCR Fig. 2 - Sample handwritten Telugu digits 0 to 9 Recognition of bilingual documents can be approached in two ways (1) Through script identification (2) Bilingual approach, in this approach; the OCR to be employed for the recognition of the bilingual documents (Kannada/Telugu) can be activated based on the script recognition of the input word/character. This approach (a) (b) reduces the search space in the database and allows for the Kannada and Telugu characters recognition to be Fig.3 - (a) Binary Image: Before Size normalization handled independently from each other. (b) Image after Size normalized In bilingual approach, characters are handled in the same manner, irrespective of the script they belong to. In 5. Feature Extraction any classification problem, the feature dimension is very Feature extraction is a problem of extracting the relevant much dependent on the number of classes. In the information from the preprocessed data for classification proposed work, the total number of classes to be of underlying objects/characters. The preprocessed digit classified is 20 (Kannada script digits -10 and Telugu image is used as an input for feature extraction. For script digits - 10). However, as the number of classes extracting the potential feature from the handwritten digit increases, it is prudent to divide the classification image, the frame containing the problem. Hence, the classification problem of preprocessed/normalized image is divided into non- Kannada/Telugu bilingual digit recognition is reduced to overlapping zones of size 8 x 8 and obtained 64 zones. 11 classes based on the observation that all the digits For each zone, the pixel density is computed and there have the similar shape except digit three of both the pixel densities are used as a feature for recognition. scripts. Hence, 64 features vector is used for recognition of a digit. 4. Data Set and Preprocessing Algorithm: Zone based pixel density feature There is no standard database available for south Indian extraction system scripts. Hence, we have created own database. Input: Preprocessed digit Image. Handwritten documents were collected from different Output: Features for Classification and Recognition. professionals belonging to Schools, Colleges, Doctors Start and Lawyers etc. The collected documents were 1. Divide the input digit image into 64 zones of scanned through a flatbed HP scanner at 300 dpi and size 8 x 8. binarized using global threshold (i.e. Otus’s) method. 2. Compute the pixel density for each zone. Unconstrained and isolated 1000 digits of each script 3. Repeat this procedure sequentially for all were manually segmented from the scanned documents. zones. The segmented digit images contain noise and that 4. Finally, 64 features will be obtained for arises due to printer, scanner, print quality, etc. Noise classification and recognition using KNN and removal is performed by employing morphological SVM classifier. opening operation. All isolated handwritten Kannada and Stop Telugu digit images are normalized into a common 156 International Journal of Machine Intelligence ISSN: 0975–2927 & E-ISSN: 0975–9166, Volume 3, Issue 3, 2011 Dhandra BV, Gururaj Mukarambi, Mallikarjun Hangarge 6. Experimental Results and Discussions The experimental results have shown high recognition For experimentation, each 1000 handwritten digits of accuracies for Kannada and Telugu handwritten digits of Kannada and Telugu script were considered. In this zone size 8 x 8 as 96.22% and 99.80% with SVM paper, we have employed two approaches for classifier. It is to be noted that large size zones have experimentations one is script dependent digit failed to capture the essential part to distinguish the recognition and another is bilingual digit recognition. digits. The comparative analysis for handwritten Throughout the experimentation, 50% of the dataset is Kannada and Telugu digits recognition are shown in used for training and the rest is for testing with KNN and Table 3 and Table 4.

A Script Independent Approach for Handwritten Bilingual Kannada and Telugu Digits Recognition

Proposal for a Kannada Script Root Zone Label Generation Ruleset (LGR)

The Unicode Standard, Version 4.0--Online Edition

WO 2010/131256 Al

Tugboat, Volume 19 (1998), No. 2 115 an Overview of Indic Fonts For

Comparison of Different Orthographies for Machine

Phonetic Dictionary for Natural Language Processing: Kannada

The Unicode Standard, Version 3.0, Issued by the Unicode Consor- Tium and Published by Addison-Wesley

Initial Literacy in Devanagari: What Matters to Learners

Evolution of Script in India

Han'gŭlization and Romanization

UC Berkeley Proposals from the Script Encoding Initiative

Kannada to Tamil Letters