A Benchmark Dataset for Devnagari Document Recognition Research
Total Page:16
File Type:pdf, Size:1020Kb
Recent Advances in Telecommunications, Signals and Systems A Benchmark Dataset for Devnagari Document Recognition Research RAJIV KUMAR, AMRESH KUMAR, PERVEZ AHMED Intelligent System Research Group Sharda University Greater Noida INDIA [email protected] http://www.sharda.ac.in Abstract: - A benchmark dataset is required for the development of an efficient and a reliable recognition system. Unfortunately, no comprehensive benchmark dataset exists for handwritten Devnagari optical document recognition research, at least in the public domain. This paper is an effort in this direction. In here, we introduce a comprehensive dataset that we referred to as CPAR-2012 dataset, for such benchmark studies, also present some preliminary recognition results. The dataset includes 35,000 isolated handwritten numerals, 83,300 characters, 2,000 constrained and 2,000 unconstrained handwritten pangrams. It is organized in a relational data model that contains text images along with their writer's information and related handwriting attributes. We collected the handwriting samples from 2,000 subjects who were chosen from different age, ethnicity, and educational background, regional and linguistic groups. The samples reflect expected variations in Devnagari handwriting. The digit recognition results using recognition schemes that uses simple most features & four neural network classifiers & KNN, and classifier ensemble have also been reported for benchmarking. Key-Words: - Benchmark Dataset, Character Extraction, Devnagari Handwritten Numeral Recognition, K- Nearest Neighborhood Classifier (KNN), Neural Networks Classifiers. 1 Introduction engineers to put in their best efforts in building an A large number of languages, dialects and scripts acceptable character recognition system for almost are being used in India. Among them all major languages [1,2] including Hindi Devnagarithe script of the national language language [3,4] and the effort is still on. (Hindi) is the most widely used script. In such a Initially, the focus of the research in character multilingual environment, a computer-mediated recognition was mainly on the development of system that can transcribe Devnagari text into the feature extraction, selection and classification text of the regional languages and vice-versa would methods. Their performance was measured on small be a desired communication option. In such a experimental datasets [9,12,14]. Often the test heterogeneous community, a system that can dataset consisted of either artificial or digitized transcribe unconstrained handwritten text of one character image samples. Therefore, the language into another with acceptable accuracy recognition reliability and robustness of such would be an asset because it can facilitate methods could not be considered of practical communications from anywhere and at any time importance. To overcome these limitations, need by capturing the data at its source. Thereby, such for obtaining the experimental dataset from real- data capturing ability can pave the way for the life handwriting sample was realized. Currently, development of many real life applications, like several real-life datasets are available for an automatic reading machine for visually benchmark studies [1-9]. These real-life challenged people. Undoubtedly, development of handwriting samples are not only helping in such applications is a formidable task as it needs an measuring the reliability and robustness of the accurate and efficient optical character recognition recognition systems but also providing insights into system as a front-end system and equally efficient the phenomena that control handwriting attributes and intelligible text to speech generator as a back- like stroke formation styles. Such insights, in turn, end system. In practice, development of these are helping in devising better recognition systems poses many challenges. However, despite techniques [3]. This work is an attempt to create a challenges, the social and humanitarian obligations benchmark dataset for Devnagari digital document and most importantly automated data capturing research. As compared to Latin languages [4-6], requirements have motivated scientists and efforts to collect benchmark dataset for Hindi language have been very minimal. ISBN: 978-1-61804-169-2 258 Recent Advances in Telecommunications, Signals and Systems Table 1: Dataset available for Devnagrai Scripts and other Indian Languages Number of Year Ref. Language Dataset Size Dataset Type Writers Devnagari/ Bangla/ 22, 546/ 14, 650/ 2, 2007 [7] Telugu/ Oriya/ Kannada/ 220/ 5, 638/ 4, Isolated numerals N/A Tamil 820/ 2, 690 Strings of Online numerals / 2007 [8] Bangla 8,348 / 23,392 N/A Offline-Isolated Numerals 2009 [9] Devnagari 22,556 Isolated Numerals 1,049 2010 [10] Tamil and Kannada 100,000 Words 600 2011 [11] Kannada 4,298 / 26,115 Text lines / Words 204 2011 [12] Hindi & Marathi 26,720 Words of Legal Amounts N/A Bangla only & Bangla 2012 [13] 100 & 50 Words N/A mixed with English However, few noticeable efforts to create a Experience shows, real-life handwriting samples realistic benchmark dataset for Indian languages are needed to build a realistic system. The study have appeared in literature [7-13] recently, these reveals the scarcity of such realistic datasets for are described in Section 2. In Section 3, we describe Devnagrai character recognition research. For this data collection process, storage format and data purpose, we have developed a large dataset attributes. Section 4 describes the experiments named as CPAR-2012 (Centre for Pattern Analysis with isolated numeral dataset. In Section 5 we and Recognition-2012). It contains samples that present an analysis of our recognition results of would help in understanding the complex structure initial experiments that we conducted to assess of Devnagari characters, discovering features, the recognition performance of standard training and testing the system in real techniques on this dataset. We conclude the paper environment. with Section 6, where we summarize our As compared to other datasets, the CPAR-2012 observations and highlight our efforts of building dataset provides more information. It is much an integrated research environment to support the larger than the largest dataset of Devnagari digits as sharing of benchmark dataset & algorithms, and reported in Bhattacharya and Chaudhary [5]. processes for continuous data collection and Moreover, CPAR-2012 has a variety of data storage from maximum possible sources. samples including digits, characters, and text images. In addition to these, it contains writers’ information and color images. The color images 2 Related Work are in a format that reflects the worst image representation scenario that may be expected in real The test dataset creation for character recognition life applications. Hence, it would help in a research of Indian languages is a recent effort [7- realistic assessment of the robustness of the system. 13,15]. Table 1 shows that serious efforts to collect Devenagri datasets began in the preceding decade [3, 5, 6] but these datasets are small as 3 Data Collection compared to the datasets of other major languages. We collected data from writers belonging to Moreover, most of the datasets for Indian diverse population strata. They belonged to languages contain numerals only [7,9]. different age groups (from 6 to 77 years), genders, Surprisingly, none of them have Devnagari educational backgrounds (from 3rd grade to post characters, words and texts with the writers’ graduate levels), professions (software engineers, information. The writer’s information is useful for professors, students, accountants, housewives and handwriting analysis application development. retired persons), regions (Indian states: Bihar, Uttar However, the work of Nethravathi et al. [10] on Pradesh, Haryana, Punjab, National Capital Region design of Tamil and Kannada word recognition (NCR), Madhya Pradesh, Karnataka, Kerala, indicates that they had collected and compiled a Rajasthan, and countries: Nigeria, China and large handwritten document dataset but that too Nepal). Two thousand writers participate in this did not incorporate writer’s information. ISBN: 978-1-61804-169-2 259 Recent Advances in Telecommunications, Signals and Systems experiment. In order to collect their personal After step 4, we filtered out all those labeled information and handwriting samples we designed components whose areas were less than a specified two forms. The first form referred to as Form-1, is threshold. Under these criteria, 1,700 out of 2,000 for isolated numerals, characters and writer forms were accepted and processed. In each information collection as shown in Fig. 1. It has a accepted from 154 bounding boxes were detected, header block that contains three fields: Text Type, cropped, stored and displayed for verification see Writer Number and Form Number followed by a Fig. 2. Through this process, we accepted machine printed Hindi character block used as constrained handwritten samples of 15,000 specimen to write these characters in the space numerals and 83, 300 characters. The process provided in the following block. The last block is rejected poor quality samples (1400 samples). reserved for writer information where every writer These samples have also been stored in the wrote his/her name, age, gender, education level, database for further investigation. profession, writing instrument/color, designation, psychological condition, region and nationality. The second form that