Multiclass Recognition of Offline Handwritten Devanagari Characters Using CNN
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Mathematical, Engineering and Management Sciences Vol. 5, No. 6, 1429-1439, 2020 https://doi.org/10.33889/IJMEMS.2020.5.6.106 Multiclass Recognition of Offline Handwritten Devanagari Characters using CNN Mamta Bisht Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, India. Corresponding author: [email protected] Richa Gupta Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, India. E-mail: [email protected] (Received May 16, 2020; Accepted July 17, 2020) Abstract The handwriting style of every writer consists of variations, skewness and slanting nature and therefore, it is a stimulating task to recognise these handwritten documents. This article presents a study on various methods available in literature for Devanagari handwritten character recognition and performs its implementation using Convolutional neural network (CNN). Available methods are studied on different parameters and a tabular comparison is also presented which concludes superiority of CNN model in character recognition task. The proposed CNN model results in well acceptable accuracy using dropout and stochastic gradient descent (SGD) optimizer. Keywords- Optical character recognition (OCR), Devanagari script, Numeral recognition, Character recognition, Convolutional neural network. 1. Introduction The recognition of scanned documents in term of characters/words is one of the major applications of pattern recognition known as optical character recognition (OCR). The scanned documents either printed or handwritten, involves pre-processing, segmentation, feature extraction followed by classification and post-processing steps in OCR. For scanned printed documents various tools are available in market with high recognition accuracy (Breuel, 2008). Some tools also perform well on neatly handwritten documents. But the precision of recognition system requires more accurate performance (Adak, 2019). Therefore, it is an active research area to improve results of recognition system. The various application of this research is in the field of handwritten notes/forms digitization, bank cheque processing, postal code recognition etc. In India, Devanagari is one of the most popular scripts in central and Northern part. In rural areas, most of the documentation work is done in Devanagari script. The national language of is also scripted in Devanagari. This paper focuses on the existing work introduced in literature for Devanagari handwritten character recognition. This script consists total 49 basic characters which comprises 13 vowels (swar) and 33 consonants (vyanjan) and 3 compound characters as mentioned in Table 1. Every character consists a horizontal line in its top, called as headline (or shirorekha) which added when the characters combine. The way of writing of this script is from left to right direction on paper. One of the fundamental characteristics of this script is that when the consonant is trailed by vowel, the shape of consonant is modified with the addition of modifier (or matra) in left, right, 1429 International Journal of Mathematical, Engineering and Management Sciences Vol. 5, No. 6, 1429-1439, 2020 https://doi.org/10.33889/IJMEMS.2020.5.6.106 top or bottom position corresponding to that vowel. Sometimes, orthographic shaped character is also formed when consonant or vowel gets added with another consonant known as conjuncts (or Sanyuktakshar). The examples are shown in Table 1. Table 1. Devanagari characters Swar (Vowels) अ आ इ ई उ ऊ ए ऐ ओ औ अं अः ऋ Vyanjan (Consonants) क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह Compound Characters क् +ष= क्ष, त्+र= त्र, ज्+ञ=ज्ञ Modified Characters (for consonant क (Ka)) का कक की कु कू के कै को कौ कं कः Conjuncts (Few Examples) क्र त्र्य Numerals ० १ २ ३ ४ ५ ६ ७ ८ ९ The major contributions of this article are as: (1) comparative study of offline Devanagari handwriting recognition in terms of numeral and characters. (2) Propose CNN architecture for Devanagari Handwritten Character Dataset (DHCD) recognition with dropout and SGD optimizer. (3) Classification using Random forest classifier on same dataset and validate CNN performance for multiclass classification. (4) Validate proposed CNN model on other dataset, MNIST. The rest of the paper is arranged in five sections. Section 2 presents the challenges of Devanagari handwriting recognition system. Section 3 discusses the different existing recognition schemes for offline handwritten Devanagari script at numeral and character level. Section 4 and Section 5 presents the experimental setup and results for Devanagari character recognition using CNN model. Section 6 concludes the article and presents some future aspects. 2. Challenges in Devanagari Handwriting Recognition Lots of challenges are identified for offline Devanagari handwritten documents recognition. The variations in the writing style of every writer, skewness and slanting nature etc., causes difficulty in the segmentation of document from text line to word and then word to character. This script has large number of character sets due to modifiers as compared to non- Indic script like Latin which make recognition system more challenging. Poorly scanned document or image captured from low resolution camera, spots, broken strokes, blurring etc. added noise to the document. The historical documents recognition is also very challenging task because of the availability of low- quality manuscript, missing of standard character and unknown font size etc. Presence of many confusing strokes and compound/conjunct character added more challenges to develop a Devanagari handwriting recognition system. 3. Recognition of Offline Handwritten Devanagari Script Many Indic languages are written in Devanagari script like Hindi, Marathi, Sanskrit, Nepali etc. Indic script consists more variations in shape of characters as compared to non- Indic scripts like Latin, Chinese, Korean and Japanese etc. which introduces more challenges to build a handwriting recognition system for Indic scripts (Bharath and Madhvanath, 2009). Kumar et al. (2018) discussed key issuess for character and numeral identification in Indic and non-Indic scripts. A review on different online and offline character recognition including Indic and non- Indic scripts are presented by Kaur et al. (2019). Singh et al. (2012) have discussed a review related to potential aspects of OCR in various fields. A review of methodologies applied in the 1430 International Journal of Mathematical, Engineering and Management Sciences Vol. 5, No. 6, 1429-1439, 2020 https://doi.org/10.33889/IJMEMS.2020.5.6.106 Indian language scripts recognition is described by Pal and Chaudhuri (2004) and Pal et al. (2012). Prasad (2014) also discussed an in-depth literature survey of Indic script recognition systems. Nowadays, convolution neural network (CNN) shows its importance in various fields and Indic script recognition is one of these (Mehrotra et al., 2013; Acharya et al., 2015; Maitra et al., 2015; Singh et al., 2016). The details of numeral and character recognition are presented in following subsections. 3.1 Devanagari Numeral Recognition In literature many feature extraction and classification techniques are used. Density, moment of right, left, upper and lower profile and descriptive component etc. are the main part of statistical features used by Bajaj et al. (2002). Elnagar and Harous (2003) worked on structural (or shape) features i.e., strokes and cavity features of thinned cursive handwritten Hindi numerals. Ramteke and Mehrotra (2006) have been worked on features which are based on moment, image partition, principle component analysis (PCA), correlation coefficient and perturbed moments. Garain et al. (2006) explores the strength of a clonal selection algorithm for 10-class classification problem of handwritten numerals. Sharma et al. (2006) divided the numeral (also applied for character) into blocks and then calculated the chain code histogram in each block. They used quadratic classifier for classification of obtained chain code features. Hanmandlu and Murthy (2007) represent the numerals in the form of modifying exponential relationship functions processed to the fuzzy sets. Box approach is used to obtain normalized distance features and then fitted to fuzzy set classifier. Hanmandlu et al. (2007b) also worked on box approach extracted features of handwritten Hindi numerals followed by bacterial foraging method. Patil and Sontakke (2007) explored general fuzzy neural network classifier for structural features of Devanagari numerals. Pal et al. (2007a) extracted directional information of the contour points of the handwritten numerals and then discussed a Modified Quadratic Discriminant Function (MQDF) for classification. Pal et al. (2009a) worked on postal automation by Indian multi-script full pin code string recognition using gradient features and MQDF classifier. Basu et al. (2010) presented a quad-tree-based longest run (QTLR) feature extraction scheme from the numeric digit designs followed by SVM classifier. Rajput and Mali (2010) used Fourier Descriptors to identify the figure of isolated Marathi handwritten numerals as features and then apply those features into three unlike classifiers namely, nearest neighbourhood (NN), K-nearest neighbourhood (KNN) and SVM. Acharya et al. (2015) presented Deep Convolutional Neural Network (CNN) model for Devanagari handwritten characters including numerals.