© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)

FEATURE EXTRACTION AND RECOGNITION OF HANDWRITTEN GUJARATI NUMERALS

1P E Ajmire, 2Fouzia I Khandwani 1Head of the Department, Computer Science & Application, G S Science, Arts and Commerce College, Khamgaon, (M.S.), . 444303 2Assistant Professor, Computer Science & Application, G S Science, Arts and Commerce College, Khamgaon,(M.S.), India 444303

Abstract: Intensive research has been done on optical character recognition (OCR) and a large of articles have been published on this topic during the last few decades. This research work deals with the feature extraction and Classification of Guajarati number recognition. Many researchers are working on the handwritten Guajarati number recognition. The was adapted from the script to write the . This paper intends to bring Gujarati character recognition in attention. Methods based on artificial neural network (ANN), support vector machine (SVM) and naive Bayes (NB) classifier are exercised for handwritten Gujarati numerals recognition and average accuracy of recognition is 98.33%.

Keywords: Handwritten Gujarati numeral recognition, Feature Extraction, Classifier Support Vector Machine (SVM).

I. INTRODUCTION Handwritten character recognition is a challenging task and many people are striving to convert handwritten literature to computer readable format. A subset of this task is recognition of handwritten numerals. Recognizing handwritten numerals is difficult compared to printed numerals because handwritten numerals may vary from person to person with respect to the individual writing style, size, curve, strokes and thickness of numerals. Gujarati belongs to Devanagari family of languages and is spoken by over 55 million people in – a western state of India. All over the world, there are more than 65 million Gujarati speakers. Ideal shapes of the Gujarati numerals and corresponding English numerals are shown in Figure 1.

Figure 1 Gujarati numerals and corresponding English numerals

It is important to note that effectiveness of features and accuracy of recognition techniques are script dependent, i.e., one kind of features and technique may provide accurate results for few scripts but may give erroneous results for the others. Experimentation on a large scale is therefore extremely important to validate the suitability of features and recognition techniques for a particular script. The paper does exactly the same and validates the proposal through systematic and exhaustive experimentations.

II. LITRATURE SURVEY Diversity in India, the scripts of India also show a wide range of variety in the characters and numerals of various scripts. One can trace the complexity among the Indian scripts. Many of the Indian regional scripts do not have Shirorekha like Gujarati, Oriya, etc. Prime scripts used for handwritten recognition are: Gujarati, Devanagari, Kannada, Gurumukhi, Oriya, Tamil, Telugu, Bengali, etc. The Gujarati script was adapted from the Devanagari script to write the Gujarati language. The earliest known document in the Gujarati script is a manuscript dating from 1592, and the script first appeared in print in a 1797 advertisement. Until the 19th century it was used mainly for writing letters and keeping accounts, while the Devanagari script was used for literature and academic writings. Gujarati, an Indo-Aryan language spoken by about 46 million people in the Indian states of Gujarat, Maharashtra, Rajasthan, Karnataka and Madhya Pradesh, and also in Bangladesh, Fiji, Kenya, Malawi, Mauritius, Oman, Pakistan, Reunion, Singapore, South Africa, Tanzania, Uganda, United Kingdom, USA, Zambia and Zimbabwe. The major difference between Gujarati and Devanagari is the lack of the top horizontal bar in Gujarati. Otherwise the two scripts are fairly similar [1]. Antani and Agnihotri [2] in 1999 have given the primitive effort to Gujarati printed text. The author has created the data sets from scanned images, at 100 dpi, of printed Gujarati text as well as from various sites of internet from 15 font families. For training 5 fonts created 10 samples each. The images were scaled up and then scaled down to a fixed size so that all the samples should be of same size i.e. 30x20. It does not have skew correction or noise removal etc. for feature extraction the author computed both invariant moments and raw moments. Also image pixel values are used as features creating 30x20= 600 dimensional binary feature space. For classification the author has used two classifiers, K-NN classifier and minimum hamming distance classifier. The best recognition rate was for 1-NN for 600 dimensional binary features space i.e. 67% 1-NN in regular

JETIR1901579 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 617

© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162) moment space gave 48% while minimum distance classifier had the recognition rate of 39%. The Euclidean minimum distance classifier recognized only 41.33%. Dholakia [3] attempted to use wavelet features, GRNN classifier and KNN classifier on the printed Gujarati text of font sizes 11 to 15 with styles regular, bold and italic by finding the confusing sets of the characters. They collected 4173 samples of middle zone glyphs of initial size 32x32 and 16x16 wavelet coefficients have been extracted creating the feature vector. Two sets of the randomly selected glyphs (2802 symbols) were used for training and 1371 symbols were used for testing. Two classifiers GRNN and KNN with Euclidean distance as similarity measure are used producing 97.59 and 96.71 as their respective recognition rates. In 2005, Jignesh Dholakia et. al [4] have presented an algorithm to identify various zones used for Gujarati printed text. In the algorithm they have proposed the use of horizontal and vertical profiles. They have identified these zones by slope of lines created by upper left corner of rectangle created by the boundaries of connected components from line level and not word level the 3 different document images, 20 lines were extracted where 19 were detected with correct zone boundary. The line where it failed was very much skewed. Desai [5] collected 300 samples of 300dpi with initial size of each numeral as 90x90. The author then adjusted the contrast by CLAHE i.e. contrast limited adaptive histogram equalization algorithm considering 8x8 tiles and 0.01 as contrast enhancement constant. The boundaries are then smoothed out by median filter of 3x3 neighborhoods. Image is then reconstructed to the size of 16x16 pixels using nearest neighbor interpolation. For feature extraction four profile vectors are used as an abstracted feature of identification of digit. Five more patterns for each digit are created in both clockwise and anticlockwise directions with the difference of 2degrees each up to 10◦. A feed forward back propagation neural network is used for Gujarati numeral classification with 278 sets of various digits. Out of these 278 sets, 11 sets were created by a standard font. From the 265 sets the author recorded the success rate for standard fonts as 71.82%, for handwritten training sets as 91.0% while for testing sets as a score of 81.5% was recorded. Another effort contributed for Gujarati script was by Shah & Sharma [6]. They used template matching and Fringe distance classifier as distance measure. Initially the sample images were filtered using low pass filter. Then the binary conversion was done by considering the optimal threshold method. Skew detection and correction is done within 0.05◦. They segmented printed characters in terms of lines, words and connected components. By this effort, for connected component recognition rate was 78.34%. For upper modifier recognition rate was 50% where as for lower modifier it was 77.55% and for punctuation marks it was 29.6%. Cumulative for overall it was 72.3%. The characteristics which constrain hand writing may be combined in order to define handwriting categories for which the results of automatic processing are satisfactory.

III. DATABASE PREPARATION Standard database of handwritten Gujarati is not available till today [7]. So, we have developed our own database. The database is prepared by scanning the images of Handwritten Gujarati numbers which are collected on a special designed sheet, from individuals of various age, gender and professions. Source document are scanned at 300 dpi using HP flatbed scanner and stored in JPG format.

Figure 2 Sampled database for zero

Figure 3 Sampled database for one

Figure 4 Sampled database for two

Figure 5 Sampled database for three

Figure 6 Sampled database for four

Figure 7 Sampled database for five

Figure 8 Sampled database for six

JETIR1901579 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 618

© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)

Figure 9 Sampled database for seven

Figure 10 Sampled database for eight

Figure 11 Sampled database for nine

IV. PROPOSED WORK In handwritten character recognition system database design and pre-processing are important part of the system. For the task of identifying handwritten digits, the most important thing is to bring all the segmented numerals in a standard normal form. This is absolutely necessary for handwritten character recognition system as writers may write using different types of pens, papers and they may even follow different writing styles. We proposed the following algorithm for the recognition of Gujarati numbers. 1. Data Acquisition 2. Pre-processing 3. Split the image. 4. Resize each image. 5. Extract the features for image separately. 6. Classification.

V. FEATURES EXTRACTION Feature extraction is backbone of any recognition system. At the same time, feature selection and number of features are also important [8]. For this recognition system we use minimum shape and statistical features. Handwritten database is design for this work. For this task the person from different age, gender, educational qualifications are selected [9]. After pre-processing [10-12], character image is normalized to 40 x 40 pixel size. Histogram oriented gradient features are extracted [13]. The selected features are as follows:  Convex area  Filled area  Euler number  Eccentricity

Convex Area A set of points is defined to be convex if it contains the line segments connecting each pair of its points. In mathematics, the convex area or convex envelope of a set X of points in the Euclidean plane or Euclidean space is the smallest convex set that contains X. For instance, when X is a bounded subset of the plane, the convex hull may be visualized as the shape enclosed by a rubber band stretched around X. Formally, the convex area may by defined as the intersection of all convex sets containing X or as the set of all convex combinations of points in X. With the latter definition, convex hulls may be extended from Euclidean spaces to arbitrary real vector spaces; they may also be generalized further, to oriented matroids.

Filled Area An area graph displays elements in Y as one or more curves and fills the area beneath each curve. When Y is matrix, the curves are stacked showing the relative contribution of each row element to the total height of the curves at each X interval. Some area object properties that you set on an individual area object set the values for all area objects in the graph. See Area Properties for information on specific properties.

Euler Number In mathematics, the Euler numbers are a sequence En of integers A122045 in defined by the Taylor series expansion t = 2 e t + e-- t = Σ n = 0 ∞ En The Euler number appears as a special value of the Euler polynomials. The odd-indexed Euler numbers are all zero. The even-indexed ones have alternating. The Euler numbers appear in the Taylor series expansions of the secant and hyperbolic secant functions. The latter is the function in the definition. They also occur in combinatorics, specifically when counting the number of alternating permutations of a set with an even number of elements.

JETIR1901579 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 619

© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)

Eccentricity Any conic section can be defined as the locus of points whose distances to a point and a line are in a constant ratio that ratio is called eccentricity, denoted as e. All types of conic sections, arranged with increasing eccentricity. Note that curvature decreases with eccentricity, and that none of these curves intersect. In mathematics, the eccentricity, denoted e or ε {\displaystyle \varepsilon}, is a parameter associated with every conic section. It can be thought of as a measure of how much the conic section deviates from being circular. In particular,  The eccentricity of a circle is zero.  The eccentricity of an ellipse which is not a circle is greater than zero but less than 1.  The eccentricity of a parabola is 1.  The eccentricity of a hyperbola is greater than 1. Furthermore, tow conic sections are similar (identically shaped) if and only if they have the same eccentricity.

VI. EXPERIMENTAL RESULTS Experimentation is carried out on 10 handwritten Gujarati numerals (0 to 9). The database contains 50 images of each numeral. Database is prepared by writers of various age groups and qualification and of both genders. For classification SVM is used. Table 1 Results on Test Performance o(se) o(th) o(tw) o(fo) o(on) o(ni) o(z) o(ei) o(si) o(fi)

MSE 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02

Percent Correct 100 100 100 100 100 100 100 100 100 88.9

On Test = 98.88% Table 2 Results on Train Performance o(se) o(th) o(tw) o(fo) o(on) o(ni) o(z) o(ei) o(si) o(fi)

MSE 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 Percent 100 100 100 100 100 100 100 100 100 100 Correct

On Train = 100% Table 3 Results on Cross Validation Performance o(se) o(th) o(tw) o(fo) o(on) o(ni) o(z) o(ei) o(si) o(fi)

MSE 0.02 0.01 0.01 0.01 0.02 0.01 0.02 0.02 0.01 0.02 Percent 100 100 100 100 100 100 100 83.3 100 90.9 Correct

On Cross Validation = 97.423%

VII. CONCLUSION This paper addresses the problem of handwritten Gujarati numeral recognition. The statistical classifier SVM is used for the classification. The performance of classification model is mainly based on the feature extraction. To improve the performance of this prototype, one can add more features. As a whole, this model offers a satisfactory recognition rate. The overall accuracy of recognition of handwritten Gujarati numerals is given in the table below.

Table 4 SVM Classifier Accuracy SVM Classifier

On Accuracy

Training 100

Cross Validation 97.42

Testing 98.88

JETIR1901579 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 620

© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)

Overall Accuracy 98.33

REFERENCES [1] Mamta Maloo , Dr. K.V. Kale, “Gujarati Script Recognition: A Review”, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 4, No 1, pp.480-489, July 2011. [2] S. Antani, L. Agnihotri. “Gujarati Character Recognition”. Fifth International Conference on Document Analysis and Recognition (ICDAR'99), p. 418, 1999. [3] J. Dholakia, A. Yajnik, A. Negi “Wavelet Feature Based Confusion Character Sets for Gujarati Script”, ICCIMA, p 366-371, 2007. [4] J. Dholakia, A. Negi, S. Rama Mohan “Zone Identification in the Printed Gujarati Text”, Proc. of 8th ICDAR, 2005 p272-276. [5] A. Desai, “Gujarati handwritten numeral optical character reorganization through neural network”. Pattern Recognition Vol 43 pp. 2582-2589, 2010. [6] S K Shah and A Sharma “Design and Implementation of Optical Character Recognition System to Recognize Gujarati Script using Template Matching” IE (I) Journal−ET Vol.86, pp. 44-49, 2006. [7] Sk Md Obaidullah, Anamika Mondal, Nibaran Das, and Kaushik Roy, “Script Identification from Printed Indian Document Images and Performance Evaluation Using Different Classifiers”, Applied Computational Intelligence and Soft Computing 2014. [8] P E Ajmire, R V Dharaskar, V M Thakare and S E Warkhede, “Structural Features for Character Recognition System-A Review”, International Journal of Advanced Research in Computer Science, Vol. 3, Issue 3, pgs 766-768, 2012. [9] P E Ajmire and P U Talole, “Comparative Study of Feature Extraction and classification Techniques for Handwritten Devanagari Script”, International Journal of Advanced Research in Computer and Communication Engineering, 2017, Vol. 6, Issue 2, pgs 135-139. [10] H K Anasuya Devi, “Thinning: A preprocessing technique of OCR for the Brahmi Script”, Ancient Asia Vol. 1, 2006. [11] Dipti Deodhare, NNR Ranga and Suri R. Amit, “Preprocessing and Image Enhancement Algorithms for a Form-based Intelligent Character Recognition System”, International Journal of Computer Science & Applications, Vol.II, No.II, pp.131- 144, 2005. [12] P.E.Ajmire R.V Dharaskar, V. M. Thakare,S.E.Warkhede, “Structural Features for Character Recognition System-A Review”, International Journal of Advanced Research in Computer Science,Vol 3, Issue 3, pp. 766-768, 2013. [13] P E Ajmire, R V Dharaskar and V M Thakare, "Handwritten Devanagari (Marathi) Character Recognition using MLP and SVM", International journal of Pure and Applied Research in Engineering and Technology, Vol 4(4), pp. 1-10, 2015.

JETIR1901579 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 621