Zone Based Hybrid Feature Extraction Algorithm for Handwritten Numeral
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Zone-Based Hybrid Feature Extraction Algorithm for Handwritten Numeral Recognition of Four Indian Scripts *S.V.Rajashekararadhya Vanaja Ranjan P Research Scholar Asst.Professor Department of Electrical Engineering, CEG Department of Electrical Engineering, CEG Anna Anna University, Chennai, India University, Chennai, India *e-mail: [email protected] e-mail: [email protected] Abstract-India is a multi-lingual and multi-script country, recognition and post processing. Research in HCR is popular where eighteen official scripts are accepted and there are over for various practical application potential such as, reading aids hundred regional languages. In this paper we propose a zone- for the blind, bank cheques, vehicle number plates, automatic based hybrid feature extraction system. The character centroid is pin code reading to sort postal mail. There is a lot of demand computed and the image (character/numeral) is further divided on Indian scripts character recognition and a review of the into n equal zones. An average angle from the character centroid OCR work done on Indian languages is excellently reviewed to the pixels present in the zone, is computed (one feature). in [5]. In [6] a survey on the feature extraction methods for Similarly, the zone centroid is also computed (two features). The character recognition is reviewed. The feature extraction average angle from the zone centroid to the pixels present in the method includes Template matching, Deformable templates, zone is computed (one feature). This procedure is sequentially Unitary Image transforms, Graph description, Projection repeated for all the zones/grids/boxes present in the numeral Histograms, Contour profiles, Zoning, Geometric moment image. There could be some zones that are empty; then, the value invariants, Zernike Moments, Spline curve approximation, of that particular zone image in the feature vector is zero. Finally, Fourier descriptors , Gradient feature, Gabor feature etc. 4*n such features are extracted. The nearest neighbor and support vector machine classifiers are used for subsequent India is a multi-lingual and multi-script country, classification and recognition purposes. We obtained 97.85 %, comprising eighteen official languages, namely, Assamese, 96.8 %, 95.1% and 95 % recognition rates for Kannada, Telugu, Bangla, English, Gujarati, Hindi, Kankanai, Kannada, Tamil and Malayalam numerals respectively, using support Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, vector machine. Rajasthani, Sanskrit, Tamil, Telugu and Urdu. Recognition of handwritten Indian scripts is difficult because of the presence Keywords—Handwriiten character recognition, image of numerals, vowels, consonants, vowel modifiers and processing, Indian scripts, Feature extraction, Nearest neighbor compound characters. classifier, Support vector machine We will now briefly review the few important works done I. INTRODUCTION towards HCR with reference to the Indian language scripts. In Handwritten character recognition (HCR) is an important [7] the grid based feature extraction method is used to area in image processing and pattern recognition fields. HCR recognize the handwritten Bangla numerals using a multifier has received extensive attention in academic and production classifier. An accuracy of 97.23% is reported. The recognition fields. The recognition system can be either on-line or off-line. of conjunctive Bangla Characters by the artificial neural In on-line handwriting recognition, words are generally network is reported in [8]. Recognition of Devanagari written on a pressure sensitive surface (digital tablet PCs) from characters using gradient features and fuzzy-neural network is which real time information, such as the order of the stroke reported in [9] and [10] respectively. We found curvature made by the writer is obtained and preserved. This is feature for recognizing Oriya characters in [11]. significantly different to off-line handwriting recognition In [12] the zone/grid based feature extraction for where no dynamic information is available [1]. Off-line handwritten Hindi numerals is reported. For extracting the handwriting recognition is the process of finding letters and features, each character image is divided into 24 zones. By words that are present in the digital image of the handwritten considering the bottom left corner of the image as an absolute text. It is the subfield of optical character recognition(OCR). reference, the average vector distance for the pixels present in Several methods of recognition of English, Latin, Arabic, the zone is computed. Character recognition for Telugu scripts Chinese scripts are excellently reviewed in [1, 2, 3, 4]. using multi-resolution analysis and associative memory is There are five major stages in the HCR problem: Image reported in [13]. The Fuzzy technique and neural network preprocessing, segmentation, feature extraction, training and based off-line Tamil character recognition are found in [14]. 5295 978-1-4244-2794-9/09/$25.00 ©2009 IEEE In [15] the author has described a scheme for extracting Kannada is one of the major Dravidian languages of South features from the gray scale images of the handwritten India and one of the earliest languages evidenced characters on their state-space map with eight directional epigraphically in India, and spoken by about 50 million people space variations. In [16] for feature computation, the bounding in the Indian states of Karnataka, Tamil Nadu, Andhra Pradesh box of a numeral image is segmented into blocks and the and Maharasthra. The script has 49 characters in its directional features are computed in each of the blocks. These alphasyllabary and is phonetic. The characters are classified blocks are then down-sampled by a Gaussian filter and the into three categories: swaras(vowels), vyanjans(consonants) features obtained from the down-sampled blocks are fed to a and yogavaahas (part vowel, part consonants). The script also modified quadratic classifier for recognition. Off-line HCR for includes 10 different Kannada numerals of the decimal numeral recognition is also found in [17-21]. number system. Recognition of isolated handwritten Kannada numerals Telugu is a Dravidian language and it is the third most based on the image fusion method is found in [17]. In [18] popular script in India. It is the official language of the south Zone and Distance metric based feature extraction method is Indian state, Andhra Pradesh and also spoken by neighboring used. The character centroid is computed and the image is states. The Telugu script is closely related to the Kannada further divided into n equal zones. The average distance from script. Telugu is a syllabic language. Similar to most the character centroid to pixels present in the zone is languages of India, each symbol in the Telugu script computed. This procedure is repeated for all the zones present represents a complete syllable. Officially, there are 10 in the numeral image. Finally n such features are extracted for numerals, 18 vowels, 36 consonants, and three dual symbols. classification and recognition. Tamil is a Dravidian language and one of the oldest In [19] the zone and vertical projection distance metric is languages in the world. It is the official language of the Indian used. The preprocessed numeral image (50x50) is fed to the state of Tamil Nadu; it also has official status in Sri Lanka, feature extraction module. The image is divided into 25 zones Malaysia and Singapore. The Tamil script has 10 numerals, 12 (each zone size is 10x10). The average pixel distance of each vowels, 18 consonants and five grantha letters. The script, column present in the zone is computed vertically. In [20] however, is syllabic and not alphabetic. The complete script, image is further divided in to n equal zones. The average therefore, consists of 31 letters in their independent form, and distance from the zone centroid to pixels present in the zone is an additional 216 combining letters representing every computed. possible combination of a vowel and a consonant. From the above literature survey it is clear that not much Malayalam is a Dravidian language and it is the eighth work has been done on the south-Indian scripts. This most popular script in India, and spoken by about 30 million motivated us to work on south-Indian scripts. The selection of people in the Indian state of Kerala. Both the language and the the feature extraction method is also a very important factor writing system are closely related to Tamil. However, for achieving efficient character recognition. In this paper we, Malayalam has its own script. The script has 16 vowels, 37 propose a simple and efficient zone-based hybrid feature consonants and 10 numerals. extraction algorithm. The Nearest Neighbor Classifier (NNC) and Support Vector Machine (SVM) classifiers are used for The challenging part of the Indian handwritten the recognition and classification of a numeral image. We numeral/character recognition is the distinction between the have tested our method for Kannada, Telugu, Tamil and similar-shaped components. A very small variation between Malayalam numerals and also experimentation is carried out two characters or numerals leads to recognition complexity on MNIST digit database and ISI Bangla digit database. and a certain degree of recognition accuracy. The style of writing the characters is highly different