
Toward a Handwritten Recognition System Using Canonical Representation for Multi-Script Documents

Yassine El Malki, Youssef Es-Saady, Driss Mammass
IRF-SIC Laboratory, IbnoZohr University, Agadir, Morocco
{y.elmalki; y.essaady; mammass}@uiz.ac.ma

Abstract: We present a multilingual handwriting recognition system based on canonical representation. After the word/character image is preprocessed, its skeleton is transformed into vertical and horizontal strokes only; this is called the canonical representation. The word/character is then represented by a vector of the values of its intersection pixels, where each value depends on the pixel's 8 surrounding neighbors. Finally, an algorithm matches the character's vector against a codebook containing the vector of each character of the language. The vector that represents the character will be used in the classification step, and the system will be applied to databases of multi-script documents, which contain text in Latin, Arabic, and/or Tifinagh.

Index Terms: Optical Character Recognition; Multilingual; Amazigh; Tifinagh; Handwritten Recognition; Multi-Script Documents

1. Introduction

Libraries around the world hold large numbers of documents. Scanning those documents makes their content accessible everywhere and to everyone. However, searching a scanned document manually, page by page, is a difficult task for anyone looking for specific information. The solution is Optical Character Recognition (OCR), the process of transforming an image of text into machine-readable text (for example ASCII codes) that can easily be searched. This area of research has attracted many researchers working on many languages, especially on Latin, Chinese, and Arabic scripts [Bozinovic and Srihari 1989], [Bazzi et al 1999], [Zhang et al 2009], [Agrawal et al 2009], [Radwan et al 2013], [Es-saady et al 2014].
The variable nature of handwriting makes recognition very challenging. Many local languages, such as Amazigh, have been integrated into information systems. In recent years, many researchers have therefore been interested in Amazigh character recognition, such as [Skounti 2003], [Zenkouar et al 2004], [Es-saady et al 2010], [Es-saady et al 2011], [Es-saady et al 2014], because Amazigh characters need specific processing.

In North Africa, the local Tifinagh writing system, which is the writing system of the Amazigh language, is widely used. The Tifinagh script is among the oldest scripts in the world; it dates back to the 3rd century BC. Like many other scripts, such as the Arabic script, which changed from its original form through the addition of dots and diacritical marks, Tifinagh has undergone many changes from its original form. It is found on stones and tombs at historical sites in Morocco, Algeria, Tunisia, and the Tuareg areas of the Sahara. The Amazigh alphabet called "Tifinagh-IRCAM", adopted by the Royal Institute of the Amazigh Culture, was officially recognized by the International Organization for Standardization (ISO) as part of the Basic Multilingual Plane [Zenkouar 2014]. Figure 1 shows the repertoire of Tifinagh characters recognized and used in Morocco, with their Latin correspondents. The alphabet has 33 phonetic entities, but Unicode encodes only 31 letters plus a modifier applied to a few letters to form the two phonetic units (gʷ) and (kʷ).

Fig. 1: Tifinagh characters adopted by the Royal Institute of the Amazigh Culture with their correspondents in Latin characters

The Amazigh have had their own writing system since ancient times [Gaci 2011]. However, when writing Amazigh documents, the Amazigh people also used the alphabet and/or the language of the dominant peoples with whom they interacted.
Three systems have been used to transcribe the Amazigh language [Skounti et al 2003]: (1) Tifinagh, the authentic alphabet, used in Libyc inscriptions since ancient times; (2) the Arabic script, after the arrival of the Arabs at the end of the 6th century; (3) the Latin script, since the 19th century, first by colonial scientists and later by national researchers.

Some documents held in libraries contain several languages in the same document, as shown in Figure 2, which is extracted from a dictionary of the Touareg language [Foucauld 1951]. This dictionary contains both Amazigh and Latin scripts; the Amazigh words are also written in Latin characters.

Fig. 2: An extract of the Touareg dictionary [Foucauld 1951]

Many works have aimed to process multilingual documents. [Ambekar et al 2013] presented an OCR for printed English and Devanagari text. They used KNN for classification on 10 samples of each character of the two languages (610 samples: 260 for English and 350 for Devanagari) and reached 95% for English text and 93% for Devanagari text. [G.S Lehal 2013] used an OCR for Gurmukhi and English: the script is first identified as either Gurmukhi or English, and the proper OCR is then used for each language. The tests were held on 4 sets constructed from 105 page images (76 Gurmukhi pages and 29 English pages), and the rates obtained differ depending on the set used. [Tan 1998] described an automatic method for the identification of Chinese, English, Greek, Russian, Malayalam, and Persian text. [Das et al 2011] used an OCR system for Telugu, English, and Hindi; they extracted 8 features, used a KNN classifier, and obtained 93% accuracy.

To process multilingual documents, we used, with some alterations, the script-free system proposed by [Al Abodi and Li 2014], which can handle different languages in the same text image.
The scheme of the system is presented in Figure 3. The paper is organized as follows: Section 2 presents the preprocessing step, Section 3 describes the canonical representation process, and Section 4 gives a conclusion and some future work.

Fig. 3: Proposed system scheme

2. Preprocessing

As shown in Figure 3, the preprocessing procedure, which refines the scanned input image, includes several steps: binarization, noise removal, and image resizing. We used the [Otsu 1979] method for binarization. This thresholding is performed to remove background noise from the picture prior to character extraction and text recognition. The method performs a statistical analysis of the image histogram and defines a function (the between-class variance) whose maximum gives the estimated threshold.

The database images present gaps between pixels, caused by the writing style or by the type of pen used. To fill these gaps, we used the widely known Run Length Smoothing Algorithm (RLSA) in both the vertical and horizontal directions. Given a sequence of binary values and an RLSA threshold of, for example, 4, every run of zeros between two ones is set to 1 if its length is less than the threshold. For instance, 10010001000001001 becomes 11111111000001111.

A skeletonization step is then applied to the image so that each stroke is only one pixel wide. Figure 4 shows an example of an Amazigh character and the result of the skeletonization step.

Fig. 4: a) original image of Yab, b) binarization, c) RLSA gap filling, d) skeletonization

3. Canonical Representation

3.1 Pixel Value

We represent each pixel by a code obtained by summing the values of its 8 surrounding pixels, which is similar to Freeman coding.
The top pixel has position 0, and positions are assigned to the other pixels clockwise: the top right pixel has position 1, the right pixel has position 2, and so on, up to the top left pixel, which has position 7. The value of a surrounding pixel is 2^pixel_position, and the code of the central pixel is the sum of the values of the surrounding pixels that are set. Figure 5 shows the coding scheme of the surrounding pixels. Each of the 8 neighboring pixels is either present or absent, so there are 2^8 = 256 geometrically possible combinations and therefore 256 unique codes. Figure 6 presents some pixels with their codes.

Fig. 5: a) the values of the surrounding pixels, b) the code of the pixel is 145 = 128 + 1 + 16

Fig. 6: Some pixel codes

3.2 The Canonical Form

Writers have different writing styles, which is a problem for OCR systems because the shape, width, and height of the characters change from one writer to another. To reduce this variety of character forms, we used the canonical representation introduced by [Al Abodi and Li 2014]. The main concept is to keep only horizontal and vertical strokes: processing techniques are applied to the pixels of the image and, depending on their values, the canonical representation is obtained as described by the following algorithm.

For example, in Figure 7 the pixels are examined from the top left to the bottom right of the image. In this case, the first white pixel is located at (4, 12) (column 4, line 12) and has the value 18. This value means that the pixel has an upper right neighbor and a bottom neighbor.
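This reading of the value 18 can be checked with a minimal sketch of the neighbor coding of Section 3.1. The function names (`pixel_code`, `decode`), the use of nested lists for the image, and the choice to treat pixels outside the image as background are assumptions made for illustration; they are not taken from the authors' implementation.

```python
# Offsets (row, column) of the 8 neighbors, indexed by position:
# position 0 is the top pixel, then clockwise up to position 7 (top left).
NEIGHBOR_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                    (1, 0), (1, -1), (0, -1), (-1, -1)]
NEIGHBOR_NAMES = ["top", "top right", "right", "bottom right",
                  "bottom", "bottom left", "left", "top left"]


def pixel_code(image, row, col):
    """Sum 2**position over the neighbors of (row, col) that are set.

    Assumption: `image` is a list of rows holding 0/1 values, and
    neighbors that fall outside the image count as background.
    """
    rows, cols = len(image), len(image[0])
    code = 0
    for position, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols and image[r][c]:
            code += 2 ** position
    return code


def decode(code):
    """List the names of the neighbors encoded in a pixel code."""
    return [NEIGHBOR_NAMES[p] for p in range(8) if code & (1 << p)]


# A 3x3 patch whose center has an upper right and a bottom neighbor.
patch = [[0, 0, 1],
         [0, 1, 0],
         [0, 1, 0]]
print(pixel_code(patch, 1, 1))  # 18 = 2 + 16
print(decode(18))               # ['top right', 'bottom']
print(decode(145))              # ['top', 'bottom', 'top left']
```

The last line decodes the value 145 = 128 + 1 + 16 from Figure 5 in the same way.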
The upper right pixel does not belong to the same line or the same column as the processed pixel, so we must follow the northeast direction until we find a pixel that has no pixel above it or to its upper right and no pixel to its left, which is the pixel at (17, 4).
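Returning to the gap-filling step of Section 2, the RLSA rule stated there can be sketched as follows. The function names `rlsa_1d` and `rlsa_2d` are hypothetical, and the way the two directions are combined (a horizontal pass followed by a vertical pass on its result) is an assumption, since the text only says that RLSA is applied in both directions.

```python
def rlsa_1d(bits, threshold):
    """Set to 1 every run of zeros lying between two ones
    whose length is less than `threshold`."""
    out = list(bits)
    last_one = None  # index of the most recent 1 seen so far
    for i, bit in enumerate(bits):
        if bit == 1:
            if last_one is not None and i - last_one - 1 < threshold:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out


def rlsa_2d(image, threshold):
    """Assumed combination: smooth every row, then every column."""
    rows = [rlsa_1d(row, threshold) for row in image]
    cols = [rlsa_1d(list(col), threshold) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# The sequence used as an example in Section 2, with threshold 4.
sequence = [int(ch) for ch in "10010001000001001"]
smoothed = rlsa_1d(sequence, 4)
print("".join(str(b) for b in smoothed))  # 11111111000001111
```

The runs of 2 and 3 zeros are filled because they are shorter than the threshold, while the run of 5 zeros is left untouched, which reproduces the sequence given in the text.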
