Script Recognition by Statistical Analysis of the Image Texture
Total Page:16
File Type:pdf, Size:1020Kb
X International Symposium on Industrial Electronics INDEL 2014, Banja Luka, November 0608, 2014 Script Recognition by Statistical Analysis of the Image Texture Invited Paper Darko Brodić Technical Faculty in Bor University of Belgrade Bor, Serbia [email protected] Abstract—The paper proposed an algorithm for the script The proposed approach incorporates the statistical analysis identification using the statistical analysis of the texture obtained of the texture. Texture is suitable for extracting similarities and by script mapping. First, the algorithm models the script sign by dissimilarities between images. However, the novelty of the the equivalent script type. The script type is determined by the proposed approach [8] is given by specific text modeling and position of the letter in the baseline area. Furthermore, the 1-D texture analysis. During text modeling the number of extraction of the features is performed. This step of the algorithm variables is substantially reduced. Furthermore, the image is is based on the script type occurrence and co-occurrence pattern replaced with text. Hence, the image which represents a 2-D analysis. Then, the resultant features are compared. Their image is replaced with the text given by 1-D "image". All differences simplify the script feature classification. The aforementioned contributes to the algorithm's speed. algorithm is tested on the German and Slavic printed documents incorporating different scripts. The experiment gives the results The paper is organized as follows. Section 2 describes the that are promising. proposed algorithm. Section 3 explains the experiment. Section 4 presents the results and gives the discussion. Section 5 makes Keywords - Coding, Cultural heritage, Historical documents, conclusions. Script recognition, Statistical analysis. II. THE METHODS I. INTRODUCTION The proposed algorithm is a multi-stage method. It includes Script recognition is a part of document image analysis [1]. the following stages: Many techniques have been proposed for the script recognition. They are typically classified into global or local ones. 1. Coding, Global methods characterize the processing of the large 2. Feature extraction, image areas, which are subjected to the statistical or frequency- 3. Feature classification, domain analysis [2]. However, the image area normalization is mandatory. [3]. On contrary, the local methods split up text 4. Script identification criteria. into small pieces. They typically represents characters, words or lines. After that, the black pixel runs analysis is performed Figure 1 illustrates the multi-stage method flow. [4]. A. Coding The proposed algorithm integrates the local and global The first step of the algorithm represents a coding. It is approach. First, it extracts characters from the text. Then, it established using into account the position of the letter in the codes the characters according to their script type [5]. The text line. To explain it, the text line needs to be considered. It coded text is obtained, which is an input to an occurrence consists of three vertical zones [9]: (frequency analysis) and co-occurrence (statistical analysis) similarly as in global methods. As the results of • Upper zone, aforementioned analysis, statistical measures of the gray-level co-occurrence matrix (GLCM) are extracted [6],[7]. To classify • Middle zone, the results a linear discrimination function is proposed. It • Lower zone. represents a key point in a decision-making process of the script discrimination. Figure 2 illustrates the vertical zones that belong to the text line. 168 Čuvaj uši svoje da slušaju samo svete i časne razgovore, a ne Input document ružne i svjetovne, jer je napisano: (a) Coding 2 1 1 1 4 1 2 2 1 1 1 4 1 2 1 1 2 1 2 1 4 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 4 1 2 1 1 1 1 4 1 1 4 1 1 1 3 2 1 1 1 1 Feature (b) extraction Чувај уши своје да слушају само свете и Feature classification часне разговоре, а не ружне и свјетовне, јер је написано: Criteria for decision- (c) making 3 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 Script 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2 1 1 identification (d) Figure 1. The multi-stage method flow Чувај уши своје да слушају само свете и часне разговоре, а не ружне и свјетовне, јер је написано: (e) 2 3 1 1 4 3 1 1 1 1 1 4 1 3 1 1 1 3 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 1 1 3 3 1 1 1 1 1 Figure 2. Vertical zones in text line 1 4 1 1 1 1 1 1 4 1 3 4 1 1 1 1 1 1 1 1 1 (f) Using into account text line zones, one can distinguish four different script types [9]: Figure 3. Same text given in different Slavic scripts: (a) Original text in • Base letter (B), Latin script, and (b) its coded counterpart; (c) Original text in Glagolitic script, and (d) its coded counterpart; (e) Original text in Cyrillic script, and (f) • Ascender letter (A), its coded counterpart. • Descendent letter (D), Füllest wieder Busch und Tal Still mit Nebelglanz, • Full letter (F). Lösest endlich auch einmal Meine Seele ganz; Base letters (B) like the letter x, occupy a middle zone only; (a) ascender letters (A), like the letter t, spread over the middle and upper zones; while descendent letters (D), like the letter p, 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 2 1 2 2 2 1 2 include the middle and lower zones. Few letters like the capital 2 1 1 2 2 1 2 1 2 3 2 1 1 1 2 2 1 1 1 2 1 1 2 2 1 1 2 1 letter Lj (in Serbian or Croatian Latin alphabet) comprise all 1 1 2 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 3 1 1 1 three zones. They are classified as a full letter (F). This way, all (b) letters are coded according to their script type classification. To organize data, the following mapping is made [5,10]: Füllest wieder Busch und Tal Still mit Nebelglanz, Lösest endlich auch einmal Meine Seele ganz; B→→ 1, A 2, D →→ 3, F 4 (1) (c) This way, the coded text is established (Appendix contains 4 2 2 2 1 4 2 1 1 1 2 1 1 2 1 4 1 4 1 1 2 2 1 2 2 2 1 2 the alphabets and equivalent codes [10]). 2 1 2 2 2 1 2 1 2 3 2 1 1 3 2 2 4 1 4 2 1 1 2 2 1 1 4 1 1 1 4 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 3 1 1 3 B. Feature Extraction (d) Feature extraction is based on statistical analysis of the coded text. Figure 3 illustrates the text written in different Figure 4. Same text given in different German scripts: (a) Original text in Slavic scripts and their coded text, while Figure 4 shows Latin script, and (b) its coded counterpart; (c) Original text in Fraktur script, German text written in Fraktur and Latin with their coded text. and (d) its coded counterpart 169 In the first step, the script type distribution of coded text is GG 2 (8) analyzed. As a result, four script features are extracted. After Contrats = ∑∑Ci(,jj )⋅− (i ) the script type distribution analysis, the coded text is subjected ij to the co-occurrence analysis [6,7]. This way, the texture features are calculated. They use the conditional joint probabilities of all pair wise combinations of grey levels in the GG = +−2 (9) window of interest (WOI). WOI is determined by the inter- Inverse Different Moment ∑∑Ci( , j ) / [1 ( i j ) ] ij pixel distance ∆x and ∆y shown in Figure 5. GG Homogenei ty = ∑∑Ci( , j ) / [1+− ( i j )] (10) ij GG (11) Correlation = ∑∑(i−µx ) ⋅− (j µ y ) ⋅Ci (, j )/( σσ xy ⋅ ) ij Figure 5. Window of interest (WOI). The method starts from the top left corner and counts C. Feature Classification number of each reference pixel occurrences in respect to According to the aforementioned script type distributions neighbour pixel relationship. At the end of this process, the and GLCM features, the text examples from Figures 3 and 4 element (i, j) gives the number of how many times the gray are analyzed. levels i and j appears as a sequence of two pixels located at ∆x and ∆y. This way, GLCM P for an image I with M rows and N Table I shows the script type distributions, which are columns is given as [6,7]: obtained from the same text written in different Slavic scripts [11]. MN1, ifIxy ( , )= i , Ix ( +∆ xy , +∆ y ) = j Pi(, j )= ∑∑ TABLE I. SCRIPT TYPE DISTRIBUTIONS BETWEEN SLAVIC SCRIPTS = = 0, otherwise xy11 (2) Different Slavic Scripts Script Type Furthermore, matrix P is normalized giving a matrix C: Cyrillic Glagolitic Latin Base 0.7786 0.8015 0.6336 G Ascender 0.0153 0.1679 0.2595 Cij(, )= Pij (, )/∑ Pij (, ) Descendent 0.1298 0.0305 0.0229 ij, (3) Full 0.0763 0.0000 0.0840 In our case, the coded text is given as 1-D image, which leads to following: ∆x = ± 1, ∆y = 0 [10].