<<

International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016

Text Analysis and Information Retrieval of Historical Tamil Ancient Documents Using Machine Translation in Image Zoning

E. K. Vellingiriraj, M. Balamurugan, and P. Balasubramanie

 different categories. Document registration of land and Abstract—The aim of this paper is to develop that building which are donated by the kings to the people are involves character recognition of Brahmi, Grantha and encrypted with palm manuscripts. The literary works, Vattezuthu Characters from palm manuscripts of Historical grammar, astrology, science and technology, etc., are Tamil Ancient Documents, anaylsed the text and machine encrypted in palm manuscripts. Historical moments of translated the present Tamil digital text format. Though many researchers have implemented various algorithms and places and dominion are also encrypted. Character techniques for character recognition in different languages, recognition is one of the most difficult tasks in the pattern Ancient characters conversion still poses a big challenge. recognition system. There are a lot of difficulties in image Because Image recognition technology has reached near- processing techniques. To solve these, one should know perfection when it comes to scanning English and other how to ) separate the characters in the segmentation process, language text. But optical character recognition (OCR) ii) to recognize unlimited character fonts and sculpting software capable of digitizing printed Tamil text with high levels of accuracy is still elusive. Only a few people are familiar styles in noisy image and iii) distinguish characters that with the ancient characters and make attempts to convert them have the same shape, but have different pronunciation in into written documents manually. The proposed system characters like (Ng) and (iT). Many researchers have tried to overcomes such a situation by converting all the ancient apply many techniques for breaking through the complex historical documents from inscriptions and palm manuscripts problems of character recognition. The most difficult thing into Tamil digital text format. It converts the digital text is to recognize the sculpting of Brahmi characters in stone format using Tamil . Our algorithm comprises different stages: i) image preprocessing, ii) feature extraction, compared to the other character recognition from different iii) character recognition and iv) digital text conversion. The sources. The Brahmi characters were sculpted in stones, first phase conversion accuracy of the Brahmi rate of clay pot, copper plate etc. In this paper, a character our algorithm is 91.57% using the neural network and image recognition system based on a statistical approach with a zoning method. The second phase of the vettezhuthu character new feature set is proposed. set is to be implemented. Conversion accuracy of Vattezhuthu is 89.75%

Index Terms—Character recognition, vattezhuthu, II. BRAHMI AND VATTEZHUTHU SCRIPTS segmentation, image zoning, machine translation. is one of the most important systems in the world by virtue of its time depth and influence. It represents the earliest post-Indus corpus of I. INTRODUCTION texts, and some of the earliest historical inscriptions found Tamil is one of the oldest languages in the world with a in . The best-known Brahmi inscriptions are the rock- rich literature. In the ancient days, the writers, especially in cut edicts of in North-central India, dated to 250– Tamilnadu, have used palm leaves and inscriptions to 232 BCE. This elegant script appeared in India most encrypt their . A very good example of the usage of certainly by the 5th century BCE, but the fact is that it had Palm leaf manuscripts is to store the history of Tamil many local variants even in the early texts which suggest grammar book named Tholkappiyam which is written that its origin lies further back in time. There are several during 4th B.C. The ancient literature includes many palm theories on the origin of the Brahmi script. leaf manuscripts that contain rare commentaries on Sangam are descended from the Brahmi script. The most reliable of works, unpublished portions of classics, Saiva, Vaishnava these are short Brahmi inscriptions dated to the 4th century and Jain works, poetry of all descriptions, medical works of BCE and published by Coningham et al. [1] but scattered exceptional values, food, astronomy & astrology, Vaastu & press reports have claimed both dates as early as the 6th Kaama Shastra, jewelry, music, dance & drama, medicine, century BCE and that the characters are identifiably Tamil Siddha and so on. Palm manuscripts are utilized for 3 Brahmi though these latter claims do not appear to have been published academically.

Manuscript received September 14, 2016; revised December 5, 2016. The (or Vattezhuttu) is an used in the . K. Vellingiriraj and P. Balasubramanie are with the Department of southern India mainly in the states of and Tamil Computer Science and Engineering, Kongu Engineering College, Nadu. It has grown from the Brahmi script of southern part Perundurai, , India (e-mail: [email protected], [email protected]). from around 6th century CE, and utilized to write Tamil and M. Balamurugan is with the Department of Computer Science and languages. The main historical development in it Engineering, Christ University, , , India (e-mail: is that the insignificant signs obtained from Brahmi to [email protected]). doi: 10.18178/ijlll.2016.2.4.88 164 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 represent Tamil are removed from Vatteluttu. approach that is based on the storing and manipulation of From the 8th century CE, has been written information by a biological neural system. This is an in both Vatteluttu and Tamil scripts. The reason is that artificial neural system and it is named as “neural networks”. different kingdoms employed different scripts in the era of This system is expected to solve all the problems with scattered political nature of southern India. During 15th automatic reasoning, including pattern recognition problems. century CE, Vatteluttu script was supplanted by It classifies patterns on the basis of predictable properties of in Tamil-speaking areas. Similarly, in 12th century CE in neural networks. However, it represents only a little amount Kerala, the was developed from the old of information about semantic from a network [14]-[16]. and so it became phonetically suitable to the Malayalam language. This situation led to the misuse of Vatteluttu instead of the Malayalam script. Fig. 1 shows the IV. METHODOLOGY family of Ancient Brahmi and Vattezhuthu script that All the details of the proposed system design are belong to 500BCE. presented below. Initially, the framework of the ancient Brahmi character recognition system is started. Then the component details are given. The Fig. 2 shows the vattezhuthu set. The system gets the input image of Brahmi characters and converts it into equivalent current Tamil digital format.

Fig. 1. Ancient script family.

Fig. 2. Vattezhuthu character set.

III. LITERATURE REVIEW Three major approaches used commonly for recognition A. Overview of System Architecture of handwritten characters are statistical, structural or Fig. 3 shows the overall architecture of Tamil Brahmi syntactic and neural network based approaches. character recognition system for stone inscriptions. The major components of this system are image preprocessing, A. Statistical Approach feature extraction and character recognition. The Statistical or probability functions are used for architecture of the system is given below. formulating a recognition algorithm in statistical approach. Features to be given as input are derived from a set of measurements used for pattern recognition. There prevails a difficulty in representing the pattern of classification on the basis of structural information in this statistical approach [2]-[8]. B. Structural or Syntactic Approach Syntactic approach employs syntactic or structural information about patterns to obtain information relevant to patterns. It extracts similar patterns and formulates pattern syntax or structural rules. The information about the pattern syntax rules helps highlight, classify and identify unknown patterns. The structural approach is favorable for evolving a Fig. 3. Architecture of ancient Tamil Brahmi character recognition system recognition system for handwritten characters. The reason is for stone inscription. that it uses structural rules to build much pattern syntax for handwritten characters, but it is not easy to formulate B. Structure of the System learning structural rules [9]-[13]. Based on the system framework formulated, the Brahmi C. Neural Network Based Approach image is converted into Tamil text format [17]. The Neural pattern recognition is a pattern recognition framework involves i) capturing of an image, ii)

165 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 preprocessing, iii) feature extraction, iv) character recognition and v) text conversion.

C. Image Capturing Fig. 7. Character segmentation.

In the first stage, the Brahmi Inscription images are c) Image resizing: After cropping the particular character, collected from various places. The snapshot images are each character may be of different size. So, each character captured with high quality or high resolution High has to be changed to be of equal size. The character image is Definition / Digital Single Lens Reflex (HD / DSLR) resized as 100×100 pixel image. camera and stored in JPEG format (Fig. 4 and Fig. 5). d) Image thinning: The dark pixel (i.e. A character) is converted to be the thinning character. A thickened character extracts easily the thin character by using a nearest pixel darken to lighten color change to a particular range. e) Image binarization: A character stores the Boolean matrix that is used to store either 0’s or 1’s. The dark pixel is stored as 1’s and light pixel is stored as 0’s using the image zoning technique. Fig. 4. A Vattezhuthu script in temple epigraphy. E. Data Set Training Different classes of Brahmi characters and Vattezhuthu D. Image Preprocessing from various writers of all the 237 Vattezhuthu characters During the image preprocessing stage, a Brahmi character are collected and stored in the training data sets. image is prepared for feature extraction. This stage consists of four sub-processes related to an image: a) cropping, b) F. Character Recognition segmentation, c) resizing, d) thickening and e) binarization. A data set of Vattezhuthu characters is matched with the All these sub-processes are explained below. current user data set. The edge detection algorithm is used a) Image cropping: The Brahmi Inscription image has the to identify each character in the training datasets using white space for each character, which means that in the image zoning. inscription image, the Brahmi characters have spaces in G. Unicode Text between characters. In the stone image, the characters are easily identifiable from each pixel color. The image is After matching the character, the equivalent current changed to grayscale or black and white image and the Digital Tamil text is converted using Unicode. noisy and unwanted spaces are removed. H. Retrieve from the Database After converting the Unicode text, spelling and meaning are checked in the dictionary database. If a word does not give the meaning in the sentence, the nearest meaning is to be searched from the database.

V. PROPOSED ALGORITHM For Zoning, let ZM= {z1, z2, z3,..., zM} be a method of zoning. The zoning based classification deals with the way

Fig. 5. Using Sobel’s edge detection method for segmentation. in which each feature is detected from a pattern x that has an influence on each zone of ZM. Let the classification of a b) Segmentation: Using the extraction [18] pattern x into one class of Ω = ,, … . . be considered technique, each character can be segmented by the using the feature set F = ,, … . . . In this, x can be individual characters. Segmentation is divided into three described by the feature matrix, Ax, of T rows (features) and groups: line segmentation, word segmentation and character M columns (zones). segmentation. In the present approach, the line (Fig. 6) and character segmentation are applied (Fig. 7). 1,1 1,2 … … . 1, … 1, 2,1 2,2 … … . 2, … 2, ……………………………….. Ax = , 1, 2 … … . , … , … …..…..…..….….. , 1, 2 … … . , … ,

with , = ∑ where wij represents the weight that determines the degree of influence of an instance of feature fi on zone zj. Different decompositions of the character of an image and the density of foreground pixels are measured in each zone to obtain Fig. 6. Line segmentation.

166 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016 some features so as to create the first feature subset. The 95 number of black pixels is divided by the total number of Brahmi 94 93.45 pixels in order to find the density of each zone. Vattezhuthu 93 92.75 A c 92 91.57 c 91.11 90.24 VI. RESULTS AND DISCUSSION 91 90.75 90 A. Database r 89.12 89.75 89 To measure the effectiveness of the proposed OCR a 88 system, the Vattezhuthu database is used as a resource for c 87 training and testing processes. The database includes 5000 y characters stored for training and 4000 handwritten 86 Vattezhuthu characters produced by 60 writers, who wrote 5 85 Consonantal Overall samples of each 30 classes for testing. The Vattezhuthu Vowels characters are old one and so the writers have not written Brahmi and Vattezhuthu Scripts Character the test characters clearly and correctly. After some practice, Recognition the writers have written 5 samples of X 30 characters, totally 150 characters from each writer. The training set is In the Table I shows that the comparison of both Brahmi composed of 11 vowels, 18 consonants and 198 consonantal and vattezhuthu ancient character's accuracy of proposed vowels, totally 227 characters of all class and the testing set algorithms for Machine translation of the Tamil digital text. is composed of approximately 30 characters of each class. The conversion rate of vowels in Brahmi character is The lack of character recognition is due to three reasons. 93.45% [19] and the Vattezhuthu is 91.11%. The One is that the consonantal characters are similar to Consonants is 92.75% of Brahmi characters and 90.75% of consonants as shown in Fig. 8. vattezhuthu. The consonantal vowels of Brahmi are 90.24% and the vattezhuthu is 89.12%. The overall character conversion is 91.57% of Brahmi characters and the 89.75% of vattezhuthu. So the comparison of these results is Brahmi character set is more accuracy with the vattezhuthu.

Fig. 8. Similarity of consonantal vowels.

VII. CONCLUSION AND SCOPE FOR FUTURE WORK The second one, the writers do not write the Brahmi characters properly. Moreover, the characters are stored in In the present paper on Phase II, an approach for different strokes and styles with ambiguity as revealed in character recognition of vattezhuthu characters using the Fig. 9. image zone based on classification and recognition is proposed. The next phase implements this approach to Grantha. This is the new way of approaching Ancient vattezhuthu character recognition, the result of which is found to be above 90% for and vowel letter recognition, whereas the consonantal vowel recognition is somewhat low due to the lack of similarity of letters. The overall output of the proposed algorithm is 89.75% accurate which higher than that of the existing systems. Nevertheless, Fig. 9. Unknown characters in the data set. the Brahmi character and vattezhuthu recognition systems still contain many problems that require more efficient The same character with different style and stroke from algorithms to solve the three reasons mentioned above. In different writers is given in Fig. 10. future, the system can be modified with the incorporation of better feature extraction methods to improve the accuracy rate. The writers shall get trained in writing the Brahmi letters to create the data set. By that, the error rate of

Fig. 10. Maa character from different writers. ambiguous letters will be reduced. In addition, the system in future will recognize the Brahmi letters and vattezhuthu Third one, the data set stores only a minimum number of from the stone inscriptions for improving the result in using data, i.e. only 6000 characters are stored in the database for some hybrid algorithms. training set. REFERENCES TABLE I: COMPARISON OF THE RESULTS OF PROPOSED ALGORITHMS [1] R. A. E. Coningham, F. R. Allchin, C. M. Batt, and D. Lucy, Consonantal “Passage to India? Anuradhapura and the early use of the Brahmi Vowels Consonants Overall Vowels Script,” Cambridge Archaeological Journal, vol. 6, no. 1, pp. 73-97, 1996. Brahmi 93.45 92.75 90.24 91.57 [2] S. Seth, G. Nagy, and M. Viswanathan, “A prototype document Vattezhuthu 91.11 90.75 89.12 89.75 image analysis system for technical journals,” Computer, vol. 25, no. 7, pp. 10-22, Jul. 1992.

167 International Journal of Languages, Literature and Linguistics, Vol. 2, No. 4, December 2016

[3] F. Shafait, T. M. Breuel, and D. Keysers, “Performance evaluation [15] P. Sanguansat, S. Jitapunkul, and W. Asdornwised, “Online Thai and benchmarking of six-page segmentation algorithms,” IEEE Trans. handwritten character recognition using hidden Markov models and Pattern Analysis Machine Intelligence, vol. 30, no. 6, pp. 941-954, support vector machines,” in Proc. the International Symposium on Jun. 2008. Communications and Information Technologies, 2004, pp. 492-497. [4] Y. Liang, M. C. Fairhurst, and R. M. Guest, “A synthesisd word [16] R. Budsayaplakorn, W. Asdornwised, and S. Jitapunkul, “On-line approach to word retrieval in handwritten documents,” Elsevier Thai handwritten character recognition using hidden Markov model Pattern Recognition, vol. 45, pp. 4225-4236, Jun. 2012. and fuzzy logic,” in Proc. the IEEE 13th Workshop on Neural [5] G. Pirlo and D. Impedovo, “Adaptive membership functions for Networks for Signal Processing, pp. 537-546, 2003. handwritten character recognition by voronoi-based image zoning,” [17] E. K. Vellingiriraj and P. Balasubramanie, “Recognition of ancient IEEE Trans on Image Processing, vol. 21, no. 9, pp. 3827-3836, Sep Tamil handwritten characters in palm manuscripts using genetic 2012. algorithm,” International Journal of Scientific Engineering and [6] C. Pornpanomchai, S. Jeungudomporn, V. Wongsawangtham, and N. Technology, vol. 2, issue 5, pp. 342-346, 2013. Chatsumpun, “Thai handwritten character recognition by genetic [18] K. Siriboon, A. Jirayusakul, and B. Kruatrachue, “HMM topology algorithm (THCRGA),” IACSIT Journal of Engineering and selection for on-line Thai handwriting recognition,” The First Technology, vol. 3, no. 2, Apr. 2011. International Symposium on Cyber Worlds, pp. 142-145, 2002. [7] Q.-F. Wang, F. Yin, and C.-L. Liu, “Handwritten Chinese text [19] E. K. Vellingiriraj and P. Balasubramanie, “Automatic digitization of recognition by integrating multiple contexts,” IEEE Transaction on ancient Brahmi characters into Tamil digital texts using image zoning Pattern Analysis and Machine Intelligence, vol. 34, no. 8, Aug. 2012. from palm manuscripts and stone inscriptions,” in Proc. International [8] T. M. Jose and A. Wahi, “Recognition of Tamil handwritten Conference on Digital Humanities (CDH2015), The Open University characters using Daubechies wavelet transforms and feed-forward of Hong Kong, Hong Kong, vol. 1, issue 1, p. 49, Dec. 17-18, 2015. back propagation network,” International Journal of Computer Applications, vol. 64, no. 8, pp. 0975-8887, Feb. 2013. [9] J. Chen and D. Lopresti, “Model based ruling line detection in noisy E. K. Vellingiriraj completed his BSc (computer handwritten documents,” Pattern Recognition Letters, 2012. science) degree in 1999, then received his M.C.A. [10] A Bharath and S. Madhvanath, “HMM-based lexicon-driven and degree in 2002 and completed M.E. degree in lexicon-free word recognition for online handwritten Indic scripts,” software engineering from Anna University, IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. Coimbatore, India in 2010. He completed MBA (HR) 34, no. 4, Apr. 2012. in Indira Gandhi National Open University (IGNOU), [11] C. Pornpanomchai, D. N. Batanov, and N. Dimmitt, “Recognizing New Delhi in 2010. He is very much interested in Thai handwritten characters and words for humancomputer which led him to completed M.A. interaction,” International Journal of Human-Computer Studies, pp. Tamil in TNOU, . Currently he is doing his 259-279, 2001. research in Natural Language Processing of Character Recognition from [12] C. Pornpanomchai, N. Pisitviroj, P. Panyasrivarom, and P. Historical Palm Manuscripts and Stone Inscriptions in Anna University, Prutkraiwat, “Thai handwritten character recognition by Euclidean nd Chennai. He is also completed master degree in psychology, sociology and distance,” in Proc. the 2 International Conference on Digital Image political science in Tamilnadu Open University, Chennai. Totally, he has Processing (ICDIP 2010), 2010, pp. 53-58. 13 years’ experience including 3 years industrial experience and 10 years [13] C. Pornpanomchai and M. Daveloh, “Printed Thai character of teaching. He presented his research work in international conferences recognition by genetic algorithm,” in Proc. The International held at , Thailand, and Hongkong. The Tamil Nadu Conference on Machine Learning and Cybernetics, 2007, pp. 3354- Government recognized him for presenting his paper on 12th International 3359. Tamil Internet Conference held at Malaysia. And also UGC Granted to [14] B. Kruatrachue, N. Pantrakarn and K. Siriboon, “State machine present his paper on First International Conference on , induction with positive and negative for Thai character recognition,” Mauritius on coming July 2014. Erode Tamil Mandram and Pavendar in Proc. The International Conference on Communications, Circuits Bharathidasan Tamil Mandram have announced the cash award for his and Systems, 2007, pp. 971-975. research work. He has created his own Tamil Blog Spot and developed various Android apps for Tamil people and others.

168