A Benchmark Dataset for Devnagari Document Recognition Research

Total Page:16

File Type:pdf, Size:1020Kb

A Benchmark Dataset for Devnagari Document Recognition Research Recent Advances in Telecommunications, Signals and Systems A Benchmark Dataset for Devnagari Document Recognition Research RAJIV KUMAR, AMRESH KUMAR, PERVEZ AHMED Intelligent System Research Group Sharda University Greater Noida INDIA [email protected] http://www.sharda.ac.in Abstract: - A benchmark dataset is required for the development of an efficient and a reliable recognition system. Unfortunately, no comprehensive benchmark dataset exists for handwritten Devnagari optical document recognition research, at least in the public domain. This paper is an effort in this direction. In here, we introduce a comprehensive dataset that we referred to as CPAR-2012 dataset, for such benchmark studies, also present some preliminary recognition results. The dataset includes 35,000 isolated handwritten numerals, 83,300 characters, 2,000 constrained and 2,000 unconstrained handwritten pangrams. It is organized in a relational data model that contains text images along with their writer's information and related handwriting attributes. We collected the handwriting samples from 2,000 subjects who were chosen from different age, ethnicity, and educational background, regional and linguistic groups. The samples reflect expected variations in Devnagari handwriting. The digit recognition results using recognition schemes that uses simple most features & four neural network classifiers & KNN, and classifier ensemble have also been reported for benchmarking. Key-Words: - Benchmark Dataset, Character Extraction, Devnagari Handwritten Numeral Recognition, K- Nearest Neighborhood Classifier (KNN), Neural Networks Classifiers. 1 Introduction engineers to put in their best efforts in building an A large number of languages, dialects and scripts acceptable character recognition system for almost are being used in India. Among them all major languages [1,2] including Hindi Devnagarithe script of the national language language [3,4] and the effort is still on. (Hindi) is the most widely used script. In such a Initially, the focus of the research in character multilingual environment, a computer-mediated recognition was mainly on the development of system that can transcribe Devnagari text into the feature extraction, selection and classification text of the regional languages and vice-versa would methods. Their performance was measured on small be a desired communication option. In such a experimental datasets [9,12,14]. Often the test heterogeneous community, a system that can dataset consisted of either artificial or digitized transcribe unconstrained handwritten text of one character image samples. Therefore, the language into another with acceptable accuracy recognition reliability and robustness of such would be an asset because it can facilitate methods could not be considered of practical communications from anywhere and at any time importance. To overcome these limitations, need by capturing the data at its source. Thereby, such for obtaining the experimental dataset from real- data capturing ability can pave the way for the life handwriting sample was realized. Currently, development of many real life applications, like several real-life datasets are available for an automatic reading machine for visually benchmark studies [1-9]. These real-life challenged people. Undoubtedly, development of handwriting samples are not only helping in such applications is a formidable task as it needs an measuring the reliability and robustness of the accurate and efficient optical character recognition recognition systems but also providing insights into system as a front-end system and equally efficient the phenomena that control handwriting attributes and intelligible text to speech generator as a back- like stroke formation styles. Such insights, in turn, end system. In practice, development of these are helping in devising better recognition systems poses many challenges. However, despite techniques [3]. This work is an attempt to create a challenges, the social and humanitarian obligations benchmark dataset for Devnagari digital document and most importantly automated data capturing research. As compared to Latin languages [4-6], requirements have motivated scientists and efforts to collect benchmark dataset for Hindi language have been very minimal. ISBN: 978-1-61804-169-2 258 Recent Advances in Telecommunications, Signals and Systems Table 1: Dataset available for Devnagrai Scripts and other Indian Languages Number of Year Ref. Language Dataset Size Dataset Type Writers Devnagari/ Bangla/ 22, 546/ 14, 650/ 2, 2007 [7] Telugu/ Oriya/ Kannada/ 220/ 5, 638/ 4, Isolated numerals N/A Tamil 820/ 2, 690 Strings of Online numerals / 2007 [8] Bangla 8,348 / 23,392 N/A Offline-Isolated Numerals 2009 [9] Devnagari 22,556 Isolated Numerals 1,049 2010 [10] Tamil and Kannada 100,000 Words 600 2011 [11] Kannada 4,298 / 26,115 Text lines / Words 204 2011 [12] Hindi & Marathi 26,720 Words of Legal Amounts N/A Bangla only & Bangla 2012 [13] 100 & 50 Words N/A mixed with English However, few noticeable efforts to create a Experience shows, real-life handwriting samples realistic benchmark dataset for Indian languages are needed to build a realistic system. The study have appeared in literature [7-13] recently, these reveals the scarcity of such realistic datasets for are described in Section 2. In Section 3, we describe Devnagrai character recognition research. For this data collection process, storage format and data purpose, we have developed a large dataset attributes. Section 4 describes the experiments named as CPAR-2012 (Centre for Pattern Analysis with isolated numeral dataset. In Section 5 we and Recognition-2012). It contains samples that present an analysis of our recognition results of would help in understanding the complex structure initial experiments that we conducted to assess of Devnagari characters, discovering features, the recognition performance of standard training and testing the system in real techniques on this dataset. We conclude the paper environment. with Section 6, where we summarize our As compared to other datasets, the CPAR-2012 observations and highlight our efforts of building dataset provides more information. It is much an integrated research environment to support the larger than the largest dataset of Devnagari digits as sharing of benchmark dataset & algorithms, and reported in Bhattacharya and Chaudhary [5]. processes for continuous data collection and Moreover, CPAR-2012 has a variety of data storage from maximum possible sources. samples including digits, characters, and text images. In addition to these, it contains writers’ information and color images. The color images 2 Related Work are in a format that reflects the worst image representation scenario that may be expected in real The test dataset creation for character recognition life applications. Hence, it would help in a research of Indian languages is a recent effort [7- realistic assessment of the robustness of the system. 13,15]. Table 1 shows that serious efforts to collect Devenagri datasets began in the preceding decade [3, 5, 6] but these datasets are small as 3 Data Collection compared to the datasets of other major languages. We collected data from writers belonging to Moreover, most of the datasets for Indian diverse population strata. They belonged to languages contain numerals only [7,9]. different age groups (from 6 to 77 years), genders, Surprisingly, none of them have Devnagari educational backgrounds (from 3rd grade to post characters, words and texts with the writers’ graduate levels), professions (software engineers, information. The writer’s information is useful for professors, students, accountants, housewives and handwriting analysis application development. retired persons), regions (Indian states: Bihar, Uttar However, the work of Nethravathi et al. [10] on Pradesh, Haryana, Punjab, National Capital Region design of Tamil and Kannada word recognition (NCR), Madhya Pradesh, Karnataka, Kerala, indicates that they had collected and compiled a Rajasthan, and countries: Nigeria, China and large handwritten document dataset but that too Nepal). Two thousand writers participate in this did not incorporate writer’s information. ISBN: 978-1-61804-169-2 259 Recent Advances in Telecommunications, Signals and Systems experiment. In order to collect their personal After step 4, we filtered out all those labeled information and handwriting samples we designed components whose areas were less than a specified two forms. The first form referred to as Form-1, is threshold. Under these criteria, 1,700 out of 2,000 for isolated numerals, characters and writer forms were accepted and processed. In each information collection as shown in Fig. 1. It has a accepted from 154 bounding boxes were detected, header block that contains three fields: Text Type, cropped, stored and displayed for verification see Writer Number and Form Number followed by a Fig. 2. Through this process, we accepted machine printed Hindi character block used as constrained handwritten samples of 15,000 specimen to write these characters in the space numerals and 83, 300 characters. The process provided in the following block. The last block is rejected poor quality samples (1400 samples). reserved for writer information where every writer These samples have also been stored in the wrote his/her name, age, gender, education level, database for further investigation. profession, writing instrument/color, designation, psychological condition, region and nationality. The second form that
Recommended publications
  • Easychair Preprint an Efficient Approach for Handwritten
    EasyChair Preprint № 4370 An Efficient Approach for Handwritten Devanagari Script Recognition Manoj Sonkusare, Roopam Gupta and Asmita Moghe EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair. October 12, 2020 An efficient approach for Handwritten Devanagari Script Recognition Manoj Sonkusare1, Roopam Gupta2, Asmita Moghe3 1Research Scholar, 2Professor, 3Professor, 1,2,3 Department of IT, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India [email protected] Abstract. Handwritten text recognition is a challenging task because of the vast changes in writing styles. In India, a massive number of people use Devanagari Script to write their documents, but due to large complexity, research work accomplished on this script is much lesser as compared to English script. Hence, recognition of handwritten Devanagari Script is amongst the most demanding research areas in the field of image processing. Feature extraction and recognition are key steps of OCR which affects the accuracy of the character recognition system. This paper gives an efficient approach for handwritten Devanagari script recognition based on various parameters like database used, sample size, training and test set ratio, class size, data normalization size, recognition accuracy, etc. Keywords: OCR, Devanagari Script, ANN, DCNN, CNN, KNN, SVM, DBN. 1 Introduction Optical Character Recognition (OCR) is a technique to turn the scanned image of handwritten or printed text into a digital form. Handwritten character recognition is active field of research which possesses a substantial importance in digital image processing. It has many applications such as automation of various organizations like post offices, government and private offices, searching data from documents and books, processing of cheque in banks, etc.
    [Show full text]
  • Importance of Andhra Pradesh Mother Tongue-A Study on Telugu Language
    RESEARCH PAPER SOCIAL SCIENCE Volume : 4 | Issue : 12 | Dec 2014 | ISSN - 2249-555X Importance of Andhra Pradesh Mother Tongue-a Study on Telugu Language KEYWORDS Andhra, Baasha, language, Telugu, Poet M. Venkatalakshmamma N. Munirathnamma Department of Telugu Studies, Sri Venkateswara Department of Telugu Studies, Sri Venkateswara University, Tirupati-517 502, Andhra Pradesh, India. University, Tirupati-517 502, Andhra Pradesh, India. ABSTRACT Telugu is the most widely spoken language amongst those using the Brahmi script. These comprise the languages of south India (Tamil, Telugu, Malayalam, Kannada, Tulu and others such as Sinhala (spoken in Sri Lanka)and languages spoken in South East Asia such as Burmese, Thai and Cambodian) In terms of popula- tion, Telugu ranks second to Hindi among the Indian languages. The main languages spoken in Andhra Pradesh are Telugu, Urdu, Hindi, Banjara, and English followed by Tamil, Kannada, Marathi and Oriya. Telugu is the principal and official language of the State. It was also referred to as `Tenugu' in the past. `Andhra' is the name given to it since the medieval times. Some argued that `Telugu' was a corruption of `Trilinga' (Sanskrit meaning three `lingas'). Its vocabulary is very much influenced by Sanskrit. In the course of time, some Sanskrit expressions used in Telugu got so naturalized that people regarded them as pure Telugu words. INTRODUCTION poet called Telugu as “Sundara Telugu”. Telugu language has written literature from more than thousand years. The language which is spoken is called Nicolas called Telugu as “Italian of the east”. language. When there is development in language there will be development of society.
    [Show full text]
  • A Review of Research on Devnagari Character Recognition
    International Journal of Computer Applications (0975 – 8887) Volume 12– No.2, November 2010 A Review of Research on Devnagari Character Recognition Vikas J Dongre Vijay H Mankar Department of Electronics & Telecommunication, Government Polytechnic, Nagpur, India ABSTRACT surveying the complete solutions. Although the off-line and English Character Recognition (CR) has been extensively on-line character recognition techniques have different studied in the last half century and progressed to a level, approaches, they share a lot of common problems and sufficient to produce technology driven applications. But solutions. Since it is relatively more complex and requires same is not the case for Indian languages which are more research compared to on-line and machine-printed complicated in terms of structure and computations. Rapidly recognition, off-line handwritten character recognition is growing computational power may enable the selected as a focus of attention in this article. implementation of Indic CR methodologies. Digital Handwriting Recognition Technology has been improving document processing is gaining popularity for application to much under the purview of pattern recognition and image office and library automation, bank and postal services, processing since a few decades. Hence various soft publishing houses and communication technology. computing methods involved in other types of pattern and Devnagari being the national language of India, spoken by image recognition can as well be used for DOCR. more than 500 million people, should be given special attention so that document retrieval and analysis of rich Seminal and comprehensive work in DOCR is carried out by ancient and modern Indian literature can be effectively done. R.M.K.
    [Show full text]
  • Magahi and Magadh: Language and People
    G.J.I.S.S.,Vol.3(2):52-59 (March-April, 2014) ISSN: 2319-8834 MAGAHI AND MAGADH: LANGUAGE AND PEOPLE Lata Atreya1 , Smriti Singh2, & Rajesh Kumar3 1Department of Humanities and Social Sciences, Indian Institute of Technology Patna, Bihar, India 2Department of Humanities and Social Sciences, Indian Institute of Technology Patna, Bihar, India 3Department of Humanities and Social Sciences, Indian Institute of Technology Madras, Tamil Nadu, India Abstract Magahi is an Indo-Aryan language, spoken in Eastern part of India. It is genealogically related to Magadhi Apbhransha, once having the status of rajbhasha, during the reign of Emperor Ashoka. The paper outlines the Magahi language in historical context along with its present status. The paper is also a small endeavor to capture the history of Magadh. The paper discusses that once a history of Magadh constituted the history of India. The paper also attempts to discuss the people and culture of present Magadh. Keywords: Magahi, Magadhi Apbhransha, Emperor Ashoka. 1 Introduction The history of ancient India is predominated by the history of Magadh. Magadh was once an empire which expanded almost till present day Indian peninsula excluding Southern India. Presently the name ‘Magadh’ is confined to Magadh pramandal of Bihar state of India. The prominent language spoken in Magadh pramandal and its neighboring areas is Magahi. This paper talks about Magahi as a language, its history, geography, script and its classification. The paper is also a small endeavor towards the study of the history of ancient Magadh. The association of history of Magadh with the history and culture of ancient India is outlined.
    [Show full text]
  • Pennsylvania Department of Education INTERNATIONAL
    Prepared for: Pennsylvania Department of Education by: INTERNATIONAL SERVICE CENTER 21 South River Street Harrisburg, PA 17101 India 2 TABLE OF CONTENTS PAGE India……………………………………………………………………………......................... 1 History…………………………………………………………………………......................... 2 Geography…………………………………………………………………….......................… 4 Climate…………………………………………………………………………........................ 5 Education………………………………………………………………………......................... 6 Traditions………………………………………………………………………......................... 6 People and Language…………………………………………………………........................... 7 Religion…………………………………………………………………………........................ 8 Holidays………………………………………………………………………........................... 8 Economy……………………………………………………………………….......................... 9 Bibliography……………………………………………………………………........................ 10 Suggested Reading……………………………………………………………........................... 11 India 1 INDIA Map: Capital: New Delhi Time Difference: 10.5 hours ahead of Washington, DC during Standard Time Population: 1,166,079,217 (July 2009 est.) Official Languages: Hindi, Bengali, Telugu, Marathi, Tamil, Urdu, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Assamese, Kashmiri, Sindhi, and Sanskrit Type of Government: Federal republic Administrative Divisions: 28 states and 7 union territories Total Area: 3,287,590 sq km (1,269,346 sq mi) Area: Slightly more than one-third the size www.cia.gov of the US India 2 HISTORY The Indus Valley civilization, among the earliest in history, thrived on the Indian subcontinent
    [Show full text]
  • Minority Rights and Ethnic Conflict in Assam, India Robert G
    Boston College Third World Law Journal Volume 14 | Issue 1 Article 5 1-1-1994 Minority Rights and Ethnic Conflict in Assam, India Robert G. Gosselink Follow this and additional works at: http://lawdigitalcommons.bc.edu/twlj Part of the Asian Studies Commons, Civil Rights and Discrimination Commons, Demography, Population, and Ecology Commons, Foreign Law Commons, Inequality and Stratification Commons, Race and Ethnicity Commons, and the South and Southeast Asian Languages and Societies Commons Recommended Citation Robert G. Gosselink, Minority Rights and Ethnic Conflict in Assam, India, 14 B.C. Third World L.J. 83 (1994), http://lawdigitalcommons.bc.edu/twlj/vol14/iss1/5 This Notes is brought to you for free and open access by the Law Journals at Digital Commons @ Boston College Law School. It has been accepted for inclusion in Boston College Third World Law Journal by an authorized administrator of Digital Commons @ Boston College Law School. For more information, please contact [email protected]. MINORITY RIGHTS AND ETHNIC CONFLICT IN ASSAM, INDIA ROBERT C. GoSSELINK* How will this nationality, the Assamese, be able to keep its numeri­ cal position as the majority in Assam in the face of uncontrolled and unassimilated immigration. In the absence of any ar­ rangement in the form of assimilation of immigrants into its linguistic fold or of a constitutional provision for maintaining its majority position, a weak nationality in the face of a ceaseless influx of people belonging to a strong linguistic nation may face another eventuality. 1 1. INTRODUCTION From both a utopian ideal and a political necessity, India always has been committed to unity through diversity.
    [Show full text]
  • A Comparative Study on Handwritten Devanagari Character Recognition
    EasyChair Preprint № 3668 A Comparative study on Handwritten Devanagari Character Recognition Manoj Sonkusare, Roopam Gupta and Asmita Moghe EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair. June 22, 2020 A Comparative study on Handwritten Devanagari Character Recognition Manoj Sonkusare1, Roopam Gupta2, Asmita Moghe3 1Research Scholar, 2Professor, 3Professor, 1,2,3 Department of IT, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India [email protected] Abstract. Handwritten text recognition is a challenging task because of the vast changes in writing styles. In India, a massive number of people use Devanagari Script to write their documents, but due to large complexity, research work accomplished on this script is much lesser as compared to English script. Hence, recognition of handwritten Devanagari Script is amongst the most demanding research areas in the field of image processing. Feature extraction and recognition are key steps of OCR which affects the accuracy of the character recognition system. This paper gives a comparative study on distinct techniques used for feature extraction and classification by the researchers over the last few years. Keywords: OCR, Devanagari Script, ANN, CNN, K-NN, SVM. 1 Introduction Optical Character Recognition (OCR) is a technique to turn the scanned image of handwritten or printed text into a digital form. Handwritten character recognition is active field of research which possesses a substantial importance in digital image processing. It has many applications such as automation of various organizations like post offices, government and private offices, searching data from documents and books, processing of cheque in banks, etc. Handwritten character recognition and printed character recognition are two types of OCR [1], [2].
    [Show full text]
  • Proposal for a Devanagari Script Root Zone Label Generation Rule-Set (LGR)
    Proposal for a Devanagari Script Root Zone Label Generation Rule-Set (LGR) LGR Version: 3.0 Date: 2018-07-27 Document version: 6.1 Authors: Neo-Brahmi Generation Panel [NBGP] 1 General Information/ Overview/ Abstract This document lays down the Label Generation Rule Set for the Devanagari script. Three main components of the Devanagari Script LGR i.e. Code point repertoire, Variants and Whole Label Evaluation Rules have been described in detail here. All these components have been incorporated in a machine-readable format in the accompanying XML file named "Proposal-lgr-devanagari-20180727.xml". In addition, a document named “Devanagari-test-labels-20180727.txt” has been provided. It contains a list of valid and invalid labels as per the Whole Label Evaluation laid down in Section 7 of this document. The labels have been tagged as valid and invalid under the specific rules1. In addition, the file also lists the set of labels which can produce variants as laid down in Section 6 of this document. 2 Script for which the LGR is proposed ISO 15924 Code: Deva ISO 15924 Key N°: 315 ISO 15924 English Name: Devanagari (Nagari) Latin transliteration of native script name: dévanâgarî 1 The categorization of invalid labels under specific rules is given as per the general understanding of the LGR Tool by the NBGP. During testing with any LGR tool, whether a particular label gets flagged under the same rule or the different one is totally dependent on the internal implementation of the LGR Tool. In case of discrepancy among the same, the fact that it is an invalid label should only be considered.
    [Show full text]
  • Genealogical Classification of New Indo-Aryan Languages and Lexicostatistics
    Anton I. Kogan Institute of Oriental Studies of the Russian Academy of Sciences (Russia, Moscow); [email protected] Genealogical classification of New Indo-Aryan languages and lexicostatistics Genetic relations among Indo-Aryan languages are still unclear. Existing classifications are often intuitive and do not rest upon rigorous criteria. In the present article an attempt is made to create a classification of New Indo-Aryan languages, based on up-to-date lexicosta- tistical data. The comparative analysis of the resulting genealogical tree and traditional clas- sifications allows the author to draw conclusions about the most probable genealogy of the Indo-Aryan languages. Keywords: Indo-Aryan languages, language classification, lexicostatistics, glottochronology. The Indo-Aryan group is one of the few groups of Indo-European languages, if not the only one, for which no classification based on rigorous genetic criteria has been suggested thus far. The cause of such a situation is neither lack of data, nor even the low level of its historical in- terpretation, but rather the existence of certain prejudices which are widespread among In- dologists. One of them is the belief that real genetic relations between the Indo-Aryan lan- guages cannot be clarified because these languages form a dialect continuum. Such an argu- ment can hardly seem convincing to a comparative linguist, since dialect continuum is by no means a unique phenomenon: it is characteristic of many regions, including those where Indo- European languages are spoken, e.g. the Slavic and Romance-speaking areas. Since genealogi- cal classifications of Slavic and Romance languages do exist, there is no reason to believe that the taxonomy of Indo-Aryan languages cannot be established.
    [Show full text]