Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning

International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning E. K. Vellingiriraj, M. Balamurugan, and P. Balasubramanie different categories. Document registration of land and Abstract—The aim of this paper is to develop a system that building which are donated by the kings to the people are involves character recognition of Brahmi, Grantha and encrypted with palm manuscripts. The literary works, Vattezuthu Characters from palm manuscripts of Historical grammar, astrology, science and technology, etc., are Tamil Ancient Documents, anaylsed the text and machine encrypted in palm manuscripts. Historical moments of translated the present Tamil digital text format. Though many researchers have implemented various algorithms and places and dominion are also encrypted. Character techniques for character recognition in different languages, recognition is one of the most difficult tasks in the pattern Ancient characters conversion still poses a big challenge. recognition system. There are a lot of difficulties in image Because Image recognition technology has reached near- processing techniques. To solve these, one should know perfection when it comes to scanning English and other how to i) separate the characters in the segmentation process, language text. But optical character recognition (OCR) ii) to recognize unlimited character fonts and sculpting software capable of digitizing printed Tamil text with high levels of accuracy is still elusive. Only a few people are familiar styles in noisy image and iii) distinguish characters that with the ancient characters and make attempts to convert them have the same shape, but have different pronunciation in into written documents manually. The proposed system characters like (Ng) and (iT). Many researchers have tried to overcomes such a situation by converting all the ancient apply many techniques for breaking through the complex historical documents from inscriptions and palm manuscripts problems of character recognition. The most difficult thing into Tamil digital text format. It converts the digital text is to recognize the sculpting of Brahmi characters in stone format using Tamil Unicode. Our algorithm comprises different stages: i) image preprocessing, ii) feature extraction, compared to the other character recognition from different iii) character recognition and iv) digital text conversion. The sources. The Brahmi characters were sculpted in stones, first phase conversion accuracy of the Brahmi script rate of clay pot, copper plate etc. In this paper, a character our algorithm is 91.57% using the neural network and image recognition system based on a statistical approach with a zoning method. The second phase of the vettezhuthu character new feature set is proposed. set is to be implemented. Conversion accuracy of Vattezhuthu is 89.75% Index Terms—Character recognition, vattezhuthu, II. BRAHMI AND VATTEZHUTHU SCRIPTS segmentation, image zoning, machine translation. Brahmi script is one of the most important writing systems in the world by virtue of its time depth and influence. It represents the earliest post-Indus corpus of I. INTRODUCTION texts, and some of the earliest historical inscriptions found Tamil is one of the oldest languages in the world with a in India. The best-known Brahmi inscriptions are the rock- rich literature. In the ancient days, the writers, especially in cut edicts of Ashoka in North-central India, dated to 250– Tamilnadu, have used palm leaves and inscriptions to 232 BCE. This elegant script appeared in India most encrypt their writings. A very good example of the usage of certainly by the 5th century BCE, but the fact is that it had Palm leaf manuscripts is to store the history of Tamil many local variants even in the early texts which suggest grammar book named Tholkappiyam which is written that its origin lies further back in time. There are several during 4th B.C. The ancient literature includes many palm theories on the origin of the Brahmi script. Brahmic scripts leaf manuscripts that contain rare commentaries on Sangam are descended from the Brahmi script. The most reliable of works, unpublished portions of classics, Saiva, Vaishnava these are short Brahmi inscriptions dated to the 4th century and Jain works, poetry of all descriptions, medical works of BCE and published by Coningham et al. [1] but scattered exceptional values, food, astronomy & astrology, Vaastu & press reports have claimed both dates as early as the 6th Kaama Shastra, jewelry, music, dance & drama, medicine, century BCE and that the characters are identifiably Tamil Siddha and so on. Palm manuscripts are utilized for 3 Brahmi though these latter claims do not appear to have been published academically. Manuscript received September 14, 2016; revised December 5, 2016. The Vatteluttu (or Vattezhuttu) is an alphabet used in the E. K. Vellingiriraj and P. Balasubramanie are with the Department of southern India mainly in the states of Kerala and Tamil Computer Science and Engineering, Kongu Engineering College, Nadu. It has grown from the Brahmi script of southern part Perundurai, Tamil Nadu, India (e-mail: [email protected], [email protected]). from around 6th century CE, and utilized to write Tamil and M. Balamurugan is with the Department of Computer Science and Malayalam languages. The main historical development in it Engineering, Christ University, Bangalore, Karnataka, India (e-mail: is that the insignificant signs obtained from Brahmi to [email protected]). doi: 10.18178/ijlll.2016.2.4.88 164 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 represent Tamil alphabets are removed from Vatteluttu. approach that is based on the storing and manipulation of From the 8th century CE, Tamil language has been written information by a biological neural system. This is an in both Vatteluttu and Tamil scripts. The reason is that artificial neural system and it is named as “neural networks”. different kingdoms employed different scripts in the era of This system is expected to solve all the problems with scattered political nature of southern India. During 15th automatic reasoning, including pattern recognition problems. century CE, Vatteluttu script was supplanted by Tamil script It classifies patterns on the basis of predictable properties of in Tamil-speaking areas. Similarly, in 12th century CE in neural networks. However, it represents only a little amount Kerala, the Malayalam script was developed from the old of information about semantic from a network [14]-[16]. Grantha script and so it became phonetically suitable to the Malayalam language. This situation led to the misuse of Vatteluttu instead of the Malayalam script. Fig. 1 shows the IV. METHODOLOGY family of Ancient Brahmi and Vattezhuthu script that All the details of the proposed system design are belong to 500BCE. presented below. Initially, the framework of the ancient Brahmi character recognition system is started. Then the component details are given. The Fig. 2 shows the vattezhuthu set. The system gets the input image of Brahmi characters and converts it into equivalent current Tamil digital format. Fig. 1. Ancient script family. Fig. 2. Vattezhuthu character set. III. LITERATURE REVIEW Three major approaches used commonly for recognition A. Overview of System Architecture of handwritten characters are statistical, structural or Fig. 3 shows the overall architecture of Tamil Brahmi syntactic and neural network based approaches. character recognition system for stone inscriptions. The major components of this system are image preprocessing, A. Statistical Approach feature extraction and character recognition. The Statistical or probability functions are used for architecture of the system is given below. formulating a recognition algorithm in statistical approach. Features to be given as input are derived from a set of measurements used for pattern recognition. There prevails a difficulty in representing the pattern of classification on the basis of structural information in this statistical approach [2]-[8]. B. Structural or Syntactic Approach Syntactic approach employs syntactic or structural information about patterns to obtain information relevant to patterns. It extracts similar patterns and formulates pattern syntax or structural rules. The information about the pattern syntax rules helps highlight, classify and identify unknown patterns. The structural approach is favorable for evolving a Fig. 3. Architecture of ancient Tamil Brahmi character recognition system recognition system for handwritten characters. The reason is for stone inscription. that it uses structural rules to build much pattern syntax for handwritten characters, but it is not easy to formulate B. Structure of the System learning structural rules [9]-[13]. Based on the system framework formulated, the Brahmi C. Neural Network Based Approach image is converted into Tamil text format [17]. The Neural pattern recognition is a pattern recognition framework involves i) capturing of an image, ii) 165 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 preprocessing, iii) feature extraction, iv) character recognition and v) text conversion. C. Image Capturing Fig. 7. Character segmentation. In the first stage, the Brahmi Inscription images are c) Image resizing: After cropping the particular

Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning

Cross-Language Framework for Word Recognition and Spotting of Indic Scripts

Recognition of Ancient Tamil Characters in Stone Inscription Using Improved Feature Extraction M

SC22/WG20 N896 L2/01-476 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale De Normalisation

“Lost in Translation”: a Study of the History of Sri Lankan Literature

418338 1 En Bookbackmatter 205..225

Updated Proposal to Encode Tulu-Tigalari Script in Unicode

Intelligence System for Tamil Vattezhuttuoptical

Present Tense Past Tense Past Participle with Tamil Meaning

Study Report on Gaja Cyclone 2018 Study Report on Gaja Cyclone 2018

Comment on L2/13-065: Grantha Characters from Other Indic Blocks Such As Devanagari, Tamil, Bengali Are NOT the Solution

Finite-State Script Normalization and Processing Utilities: the Nisaba Brahmic Library

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset