
Toward a Handwritten Recognition System Using Canonical Representation for Multi-Script Documents
Yassine El Malki, Youssef Es-Saady, Driss Mammass
IRF-SIC Laboratory, Ibn Zohr University, Agadir
{y.elmalki; y.essaady; mammass}@uiz.ac.ma

Abstract: We present a multilingual handwriting recognition system based on a canonical representation. After the word or character image is preprocessed, its skeleton is transformed into vertical and horizontal strokes only; this is called the canonical representation. The word or character is then represented by a vector of the values of its intersection pixels, where each value depends on the pixel's 8 surrounding neighbors. Finally, a matching algorithm compares the character's vector with a codebook that contains the vector of each character of the language. The vector that represents the character will be used in the classification step, and the system will be applied to databases of multi-script documents that contain text in Latin, Arabic, and/or Tifinagh scripts.

Index Terms: Optical Character Recognition; Multilingual; Amazigh; Tifinagh; Handwritten Recognition; Multi-Script Documents

1. Introduction
Libraries around the world hold large numbers of documents. Scanning these documents makes their content accessible everywhere and to everyone. However, searching scanned documents manually, page by page, is difficult for readers who seek specific information. The solution is Optical Character Recognition (OCR), the process of transforming an image of text into machine-encoded text (for example ASCII codes) that can easily be searched. This area has attracted much research in many languages, especially for Latin, Chinese and Arabic scripts [Bozinovic and Srihari 1989], [Bazzi et al 1999], [Zhang et al 2009], [Agrawal et al 2009], [Radwan et al 2013], [Es-saady et al 2014]. The variability of handwriting makes recognition very challenging.
Many local languages, such as Amazigh, have been integrated into information systems. In recent years, many researchers have therefore been interested in Amazigh character recognition, for example [Skounti 2003], [Zenkouar et al 2004], [Es-saady et al 2010], [Es-saady et al 2011], [Es-saady et al 2014], because Amazigh characters need specific processing. In North Africa, the local Tifinagh system, which is the writing system of the Amazigh language, is widely used. The Tifinagh script is among the oldest scripts in the world; it dates back to the 3rd century BC. Like the scripts of many other languages, it has undergone many changes from its original form (such as the addition of dots and diacritical marks). It is found on stones and tombs at some historical sites in Morocco and in the Tuareg areas. The Amazigh alphabet called "Tifinagh-IRCAM", adopted by the Royal Institute of the Amazigh Culture, was officially recognized by the International Organization for Standardization (ISO) as part of the Basic Multilingual Plane [Zenkouar 2004]. Figure 1 shows the repertoire of Tifinagh that is recognized and used in Morocco, with the corresponding Latin characters. The number of alphabetical phonetic entities is 33, but only 31 letters are encoded, plus a modification of a few letters to form the two phonetic units (gʷ) and (kʷ).

Fig. 1. Tifinagh characters adopted by the Royal Institute of the Amazigh Culture, with their Latin equivalents

The Amazigh have had their own script since ancient times [Gaci 2011]. However, when writing Amazigh documents, the Amazigh people also used the alphabet and/or the language of the dominant peoples with whom they interacted. Three systems have been used to transcribe Amazigh [Skounti et al. 2003]:
• Tifinagh, the authentic alphabet, found in Libyc inscriptions since ancient times.
• The Arabic script, after the arrival of the Arabs at the end of the 6th century.
• The Latin script, since the 19th century, first by colonial scholars and later by national researchers.
Some documents held in libraries contain several languages in the same document, as shown in Figure 2, which is extracted from a dictionary of the Tuareg language [Foucauld 1951]. This dictionary uses both Amazigh and Latin scripts; the Amazigh words are also written in Latin characters.

Fig. 2. An extract of the Tuareg dictionary [Foucauld 1951]

Many works have aimed to process multilingual documents. For example, [Ambekar et al 2013] presented an OCR for printed English and Devanagari text; they used KNN for classification on 10 samples of each character of the two languages (610 samples: 260 for English and 350 for Devanagari) and reached 95% for English text and 93% for Devanagari text. [G.S Lehal 2013] built an OCR for Gurmukhi and English: the script (Gurmukhi or English) is identified first, then the appropriate OCR is applied for each language. The tests were run on 4 sets constructed from 105 page images (76 Gurmukhi pages and 29 English pages), and the recognition rates varied with the set used. [Tan 1998] described an automatic method for identifying Chinese, English, Greek, Russian, Malayalam and Persian text. [Das et al 2011] used an OCR system for Telugu, English and Hindi; they extracted 8 features, applied a KNN classifier, and obtained 93% accuracy.

To process multilingual documents, we use the script-free system proposed by [Al Abodi and Li 2014], with some alterations; it can operate on different languages in the same text image. The scheme of the system is presented in Figure 3. The paper is organized as follows: Section 2 presents the preprocessing step, Section 3 describes the canonical representation, and Section 4 gives a conclusion and some future work.


Fig. 3. Scheme of the proposed system

2. Preprocessing
As shown in Figure 3, preprocessing refines the scanned input image in several steps: binarization, noise removal and image resizing. We used the [Otsu 1979] method for binarization. This thresholding step removes background noise from the image prior to character extraction and text recognition. The method performs a statistical analysis of the gray-level histogram and selects the threshold that maximizes a criterion function (the between-class variance).

The database images contain gaps between pixels, caused by the writing style or the type of pen used. To fill these gaps, we applied the widely known Run Length Smoothing Algorithm (RLSA) in both the vertical and horizontal directions. For instance, with an RLSA threshold of 4, every run of zeros between two ones is set to one if its length is less than the threshold:
10010001000001001 becomes 11111111000001111
Afterwards, a skeletonization step is applied to the image to reduce the strokes to a width of one pixel. Figure 4 shows an example of an Amazigh character and the result of the skeletonization step.
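The binarization threshold and the gap filling described above can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments: the function names `otsu_threshold` and `rlsa_1d` are hypothetical, the histogram is assumed to be a list of pixel counts per gray level, and RLSA is shown on a single row (a 2-D version would apply it to every row and every column).

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance.

    histogram[g] is the number of pixels with gray level g.
    Pixels with level <= t form the first class, the others the second.
    """
    total = sum(histogram)
    weighted_total = sum(g * n for g, n in enumerate(histogram))
    best_t, best_score = 0, -1.0
    count_low = 0   # number of pixels in the first class
    sum_low = 0     # sum of gray levels in the first class
    for t, n in enumerate(histogram):
        count_low += n
        sum_low += t * n
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Proportional to the between-class variance w0 * w1 * (mu0 - mu1)^2.
        score = count_low * count_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def rlsa_1d(bits, threshold):
    """Set to 1 every run of 0s enclosed by two 1s and shorter than threshold."""
    out = list(bits)
    last_one = None  # index of the most recent 1
    for i, b in enumerate(bits):
        if b == 1:
            if last_one is not None and i - last_one - 1 < threshold:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out


row = [int(c) for c in "10010001000001001"]
smoothed = "".join(str(b) for b in rlsa_1d(row, 4))
print(smoothed)
```

With the sequence from the text and a threshold of 4, the runs of length 2 and 3 are filled while the run of length 5 is left unchanged.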

Fig. 4. a) Original image of Yab, b) binarization, c) RLSA gap filling, d) skeletonization

3. Canonical Representation
3.1 Pixel Value
We represent each pixel by a code obtained by summing the values of its 8 surrounding pixels, in a manner similar to Freeman coding. The pixel at the top has position 0, and the positions of the other pixels are assigned clockwise: the top-right pixel has position 1, the right pixel has position 2, and so on, ending with the top-left pixel at position 7. The value of a surrounding pixel is $2^{\text{pixel\_position}}$. Figure 5 shows the coding scheme of the surrounding pixels. There are 256 unique configurations of surrounding pixels calculated in this way. Since each of the 8 neighbors is either present or absent, this is the number of all geometrically possible combinations of the 8 neighboring pixels:

$$\sum_{k=0}^{8} \binom{8}{k} = 2^{8} = 256$$


Figure 6 below shows some pixels with their codes:

Fig. 5. a) Values of the surrounding pixels, b) the code of the pixel is 145 = 128 + 1 + 16

Fig. 6. Some pixel codes
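The pixel coding of Section 3.1 can be sketched in a few lines. This is an illustrative sketch only: the names `NEIGHBOUR_OFFSETS` and `pixel_code` are hypothetical, the image is assumed to be a list of rows of 0/1 values, and neighbors outside the image are assumed to count as empty.

```python
# (row offset, column offset) of the 8 neighbours, starting at the top
# (position 0) and moving clockwise to the top left (position 7).
NEIGHBOUR_OFFSETS = [
    (-1, 0), (-1, 1), (0, 1), (1, 1),
    (1, 0), (1, -1), (0, -1), (-1, -1),
]


def pixel_code(image, row, col):
    """Sum 2**position over the neighbours of (row, col) that are set."""
    code = 0
    for position, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        r, c = row + dr, col + dc
        inside = 0 <= r < len(image) and 0 <= c < len(image[0])
        if inside and image[r][c]:
            code += 2 ** position
    return code


# Neighbours at the top left (128), top (1) and bottom (16), as in Fig. 5 b).
example = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
print(pixel_code(example, 1, 1))
```

For this 3 x 3 example the code of the central pixel is 128 + 1 + 16 = 145, the value given in the caption of Figure 5.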

3.2 The Canonical Form
Writers have different writing styles, which is a problem for OCR systems because the shape, width and height of the characters change from one writer to another. To reduce this variety of character forms, we used the canonical representation introduced by [Al Abodi and Li 2014]. The main idea is to keep only horizontal and vertical strokes: the pixels of the image are processed and, depending on their values, the strokes are transformed to obtain the canonical representation. The procedure is illustrated by the example that follows.


For example, in Figure 7 below, we examine the pixels from the top left to the bottom right of the image. In this case, the first white pixel is located at (4, 12) (column 4, line 12) and has the value 18. This value means that the pixel has an upper-right neighbor and a bottom neighbor. The upper-right neighbor does not belong to the same line or the same column as the processed pixel, so we follow the northeast direction until we find a pixel that has a pixel above it or to its upper right and no pixel on its left; here this is the pixel (17, 4). The pixels on this path are given the value 255 and are then deleted. For each deleted pixel, we draw one pixel in column 4 (the column of the starting pixel) and another one in line 4 (the line of the ending pixel).

Fig. 7. Example of the canonical form transformation of the Tifinagh character Ya
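One replacement step of the kind described in the example above can be sketched as follows. This is only an illustration of the rule "one pixel in the column of the starting pixel and one in the line of the ending pixel", not the authors' implementation. It assumes that the diagonal path has already been traced, and the function name `replace_diagonal` and the (row, column) coordinate order are choices made for the sketch.

```python
def replace_diagonal(image, path):
    """Replace a traced diagonal path by a vertical and a horizontal stroke.

    image: list of rows of 0/1 values.
    path:  list of (row, col) pixels, from the starting pixel to the
           ending pixel of the diagonal stroke.
    Returns a new image; the input is left unchanged.
    """
    result = [list(r) for r in image]
    start_col = path[0][1]   # column of the starting pixel
    end_row = path[-1][0]    # line (row) of the ending pixel

    # Delete every pixel of the diagonal path.
    for row, col in path:
        result[row][col] = 0

    # For each deleted pixel, draw one pixel in the starting column
    # and one pixel in the ending line.
    for row, col in path:
        result[row][start_col] = 1
        result[end_row][col] = 1
    return result


# A diagonal going northeast from the bottom left to the top right.
diagonal = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
path = [(3, 0), (2, 1), (1, 2), (0, 3)]
canonical = replace_diagonal(diagonal, path)
for line in canonical:
    print("".join(str(v) for v in line))
```

In this small example the diagonal becomes a vertical stroke in the first column joined to a horizontal stroke in the first line.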

We repeat the same process until all diagonal pixels are deleted, from c to h. The output is the canonical form of the character. Figure 8 shows some Tifinagh characters and their canonical forms, and Figure 9 shows an example of the canonical form of an Arabic word.

Fig. 8. Example of the canonical form of some Tifinagh characters

Fig. 9. Example of the canonical form of an Arabic word (مارث) from the IFN/ENIT database [Pechwitz et al 2002]

3.3 The Character's Vector
The vector is constructed from the values of the starting pixel and the intersection pixels of the character. The starting pixel is the upper-right pixel of the character's canonical form explained previously. For example, the vector of the character Yadd shown below is (64, 64, 64, 20, 21, 5).

The following table presents some Amazigh characters and their corresponding vectors:

4. Conclusion and Perspectives
In this paper, we presented a multi-script system that can be applied to different languages present in the same document image. The system was tested on some characters of the AMHCD database [Es-Saady et al 2011]; for example, the recognition rates were 91.53% for Yadd, 80.13% for Ya and 96.84% for Yan, using exact matching between the codebook vector and the character vector. However, characters with diagonal strokes produce different vectors because, depending on the stroke orientation, the canonical representation is vertical for one character and horizontal for another. This problem can be addressed with an SVM classifier with K-fold validation. In future work, we will improve the canonical form method. In addition, we will test our approach on a database of multi-script documents.
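The exact matching between a character vector and the codebook can be sketched as a simple lookup. This is a minimal illustration under stated assumptions: the function name `match_character` and the dictionary layout are hypothetical, only the Yadd vector comes from Section 3.3, and the second codebook entry is a made-up placeholder.

```python
def match_character(vector, codebook):
    """Return the label whose codebook vector equals vector, or None."""
    for label, reference in codebook.items():
        if tuple(reference) == tuple(vector):
            return label
    return None


codebook = {
    "Yadd": (64, 64, 64, 20, 21, 5),       # vector given in Section 3.3
    "placeholder_char": (64, 20, 5),       # hypothetical entry, illustration only
}

print(match_character((64, 64, 64, 20, 21, 5), codebook))
print(match_character((64, 64, 20, 21, 5), codebook))
```

A vector that differs in a single value, as can happen for characters with diagonal strokes, finds no match under this scheme, which is why a trained classifier is a natural next step.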

References
A. Skounti, A. Lemjidi, M. Nami (2003). Tirra aux origines de l’écriture au Maroc. Publications de l’Institut Royal de la Culture Amazigh, Rabat.
Aarti G. Ambekar, Chhaya S. Hinge, Samidha S. Kulkarni (2013). “Bilingual OCR for Printed English and Devnagari Text”. Indian Journal of Research. ISSN 2250-1991.
Charles de Foucauld (1951). “Dictionnaire touareg-français : dialecte de l'Ahaggar”. Paris: Imprimerie Nationale de France, 1951-1952, tome III (L-OU), pp. 0971-1547.
Elsayed Radwan (2013). “Hybrid of Rough Neural Networks for Arabic/Farsi Handwriting Recognition”. International Journal of Advanced Research in Artificial Intelligence (IJARAI). Vol. 2, No. 2, pp. 39-47.
G.S Lehal (2013). “A Bilingual Gurmukhi-English OCR Based on Multiple Script Identifiers and Language Models”. International Workshop on Multilingual OCR. ISBN 978-1-4503-2114-3.
I. Bazzi, R. Schwartz and J. Makhoul (1999). “An Omni font Open-Vocabulary OCR System for English and Arabic”. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 21, No. 6, pp. 495-504.
L. Zenkouar (2004). “L’écriture Amazighe Tifinaghe et Unicode”. Etudes et documents berbères. Paris (France). No. 22, pp. 175-192.
M Swamy Das, C R K Reddy, D Sandhya Rani, A Govardhan (2011). “Script identification from Multilingual Telugu, Hindi and English Text Documents”. International Journal of Wisdom Based Computing. Vol. 1 (3).
M. Pechwitz, S. Snoussi Maddouri, V. Margner, N. Ellouze, H. Amiri (2002). “IFN/ENIT - database of handwritten Arabic words”. In Proceedings of CIFED, pp. 129-136.
N. Otsu (1979). “A threshold selection method from gray-level histograms”. IEEE Trans. Syst., Man, Cybern. Vol. 9, pp. 62-66.
Pooja Agrawal, M. Hanmandlu, Brejesh Lall (2009). “Coarse Classification of Handwritten Hindi Characters”. International Journal of Advanced Science and Technology. Vol. 10, pp. 43-54.
R. M. Bozinovic and S. N. Srihari (1989). “Off-line cursive script word recognition”. IEEE Trans. on Pattern Anal. Mach. Intell. Vol. 11, No. 1, pp. 68-83.
T. N. Tan (1998). “Rotation Invariant Texture Features and their use in Automatic Script Identification”. IEEE Trans. on PAMI, pp. 751-756.
Y. Es-Saady, A. Rachidi, M. El Yassa, D. Mammass (2011). “AMHCD: A Database for Amazigh Handwritten Character Recognition Research”. International Journal of Computer Applications (0975-8887 IJCA). Vol. 27(4), pp. 44-49. ISBN 978-93-80864-53-2.
Y. Es-Saady, M. Amrouch, A. Rachidi, M. El Yassa, D. Mammass (2014). “Handwritten Tifinagh Character Recognition Using Baselines Detection Features”. International Journal of Scientific & Engineering Research. Vol. 5. ISSN 2229-5518.
Z. Gaci (2011). Quel système d’écriture pour la langue berbère (le Qabyle). Mémoire de magister.
Z. Zhang, L. Jin, K. Ding, X. Gao (2009). “Character-SIFT: A Novel Feature for Offline Handwritten Chinese Character Recognition”. 10th International Conference on Document Analysis and Recognition. pp. 763-767.
J. Al Abodi, X. Li (2014). “An effective approach to offline Arabic handwriting recognition”. Comput Electr Eng. Vol. 40, pp. 1883-1901.