ARABIC HANDWRITING TEXT RECOGNITION AND RETRIEVAL IN HISTORICAL DOCUMENT IMAGES

Thesis submitted in partial fulfillment of the requirements for the degree of “DOCTOR OF PHILOSOPHY”

by Raid Saabni

Submitted to the Senate of Ben-Gurion University of the Negev

September 2010

Beer-Sheva

ARABIC HANDWRITING TEXT RECOGNITION AND RETRIEVAL IN HISTORICAL DOCUMENT IMAGES

Thesis submitted in partial fulfillment of the requirements for the degree of “DOCTOR OF PHILOSOPHY”

by Raid Saabni

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor Dr. Jihad El-Sana Approved by the Dean of the Kreitman School of Advanced Graduate Studies

September 2010

Beer-Sheva

This work was carried out under the supervision of

Dr. Jihad El-Sana

in the Department of Computer Science

Faculty of Natural Sciences

Acknowledgment

I would like to thank my advisor, Dr. Jihad El-Sana, for his friendship and for his knowledgeable and invaluable guidance, advice, and support throughout the development of the results presented in this thesis. I would also like to thank Mohammad Cheriet, Volker Margner, Haikal El-Abed, the Triangle R&D Center, and the Ministry of Science for enjoyable, fruitful, and friendly collaborations and support. Finally, I want to express my deepest gratitude to my wonderful family: my wife Maryam, my son Rida, and my daughter Nora, for their support and for being such a pleasing part of my life during the years I worked on this thesis.


Contents

Acknowledgment i

Abstract xi

1 Introduction 1
  1.1 Data capture ...... 2
  1.2 Binarization ...... 2
  1.3 Page Analysis ...... 4
  1.4 Preprocessing ...... 5
  1.5 Feature Extraction ...... 6
  1.6 Matching-Classification ...... 6
  1.7 Contribution of this thesis ...... 8

2 The Arabic Script 11
  2.1 Background ...... 12
  2.2 Letters and Strokes ...... 13

3 Related Work 17
  3.1 Text Recognition ...... 17
    3.1.1 Recognition of isolated letters ...... 17
    3.1.2 Segmentation-based Methods ...... 19
    3.1.3 Segmentation-free Methods ...... 20
    3.1.4 Delayed Strokes ...... 22
  3.2 Arabic HWR Databases ...... 23
  3.3 Word Spotting ...... 25
  3.4 Text Line Extraction ...... 28
    3.4.1 Top-down approaches ...... 28
    3.4.2 Bottom-up approaches ...... 29

4 Arabic Language Statistics for Efficient Text Recognition 31
  4.1 The Arabic Language Analysis ...... 32
    4.1.1 Additional Strokes and Word-Parts ...... 32
    4.1.2 Loops, Ascenders, and Descenders ...... 33
  4.2 Arabic Language and Script Properties ...... 35
    4.2.1 Valid Word-Parts ...... 36
    4.2.2 Valid Word-Parts Analysis ...... 37
    4.2.3 Reducing Search Space ...... 37
  4.3 Discussion ...... 39

5 Comprehensive Synthetic Arabic Database for HWR 43
  5.1 Our Approach ...... 45
    5.1.1 Extracting Handwriting Fonts ...... 46
    5.1.2 Synthesizing Word-parts and Word Shapes ...... 47
    5.1.3 Dimensionality Reduction and Clustering ...... 51
  5.2 Experimental Results ...... 52

6 Hierarchical On-line Arabic Handwriting Recognition 57
  6.1 Our Approach ...... 58
    6.1.1 Geometric Preprocessing ...... 59
    6.1.2 Features Extraction ...... 59
    6.1.3 Shape Context ...... 60
    6.1.4 Word-Part Recognition ...... 61
    6.1.5 Dynamic Time Warping ...... 61
  6.2 Experimental Results ...... 62

7 Segmentation-Free Online Arabic Handwriting Recognition 65
  7.1 Recognition Framework ...... 65
  7.2 Optimization ...... 68
  7.3 Results and Discussion ...... 69

8 Language-Independent Text Lines Extraction Using Seam Carving 75
  8.1 Our Approach ...... 76
    8.1.1 Preprocessing ...... 77
    8.1.2 Energy function ...... 78
    8.1.3 Seam Generation ...... 78
    8.1.4 Component Collection ...... 79
  8.2 Experimental Results ...... 80

9 Keyword Searching for Arabic Handwritten Documents 85
  9.1 Our Approach ...... 86
    9.1.1 Component Labeling ...... 86
    9.1.2 Simplification ...... 87
    9.1.3 Feature Extraction ...... 88
  9.2 Matching ...... 88
    9.2.1 Pruning ...... 90
    9.2.2 Rule-based system ...... 91
  9.3 Experimental results ...... 91

10 Word Spotting using Chamfer Distance and Dynamic Time Warping 95
  10.1 Our Approach ...... 96
    10.1.1 Line Extraction and Component Labeling ...... 96
    10.1.2 Computing the Similarity Distance ...... 97
    10.1.3 Matching ...... 99
    10.1.4 Dynamic Time Warping ...... 100
    10.1.5 Clustering Process ...... 101
  10.2 Experimental results ...... 101

11 Conclusion and Future Work 105

Bibliography 109

List of Figures

1.1 Historical Arabic Documents exhibiting common aging symptoms such as yellowing. ...... 4

2.1 Development of the Phoenician alphabet as the origin of the different scripts; oriental including Arabic with the down arrow and western with the up arrow. ...... 11

2.2 The word MOBARAT and its letters and word-parts. Leaves represent letters and subtrees represent word-parts. ...... 13

2.3 Different styles of writing additional strokes for the letters (¼) and the ligature (È). ...... 13

2.4 Delayed strokes in Arabic script may appear under or above the letter body. The boxed pairs represent common variants (e.g., three dots are often written as a circumflex “hat”). These seven strokes appear in letters used in writing standard Arabic. Eleven additional strokes exist for writing additional letters in other languages (Urdu, Pashto, Farsi, etc.). ...... 14

2.5 A list of additional strokes. The first column depicts the different basic elements used in additional stroke classes, the third column shows different styles of writing the different classes of the additional strokes, and the fourth column includes a list of letters that have these basic strokes and their possible positions. ...... 15

4.1 The word “MOHAMAD” in two different writing styles. ...... 33


4.2 Loops in different letters; * stands for optional loops. The rightmost column depicts different styles of writing characters with loops, while the first column gives examples of some common problems. ...... 34

4.3 The distribution of word-parts by their length. ...... 39

4.4 The distribution of words by their degree. ...... 40

4.5 Distribution of word-parts by the number of ascenders with one descender. ...... 42

5.1 This table presents part of the data integrated with each shape in the database. The data presented is the number of loops and, for each loop, its shape and position. (Loop properties: Rnd = Rounded, Dn = Down, Triple, Double, Degen = Degenerated loop.) ...... 44

5.2 A flow diagram sample for generating a compact set of shapes for given word-parts and words. ...... 45

5.3 Samples of the shapes generated for the word (ÑêÓ) where the letters (è) and (Ð) include different numbers of loops. The images in each row are in decreasing order with respect to the number of loops. ...... 48

5.4 Column (c) shows some examples of low probabilities of loop existence for the letter (Ð). High probabilities, as can be seen in column (a), show obvious loops which are easy to extract, while shapes with medium probabilities are in column (b). The size of the suspected loop and the ratio between the diameters are used to calculate these probabilities. ...... 49

5.5 Three samples of synthetically generating the word ”YÒm×”. ...... 50

5.6 Different samples of the shapes generated for writing the word ”I»QÓ”. In the first row we can see the one-pixel-width shapes, and in the second row the results after the dilation process. Results after feature-point smoothing can be seen in the third row, and after the scanning-process imitation in the fourth row. In the fifth row, we can see results of using different layout techniques to combine word-part shapes to generate Arabic words. ...... 53

6.1 The flow of our system. ...... 58


6.2 The response time of our system. ...... 63

7.1 (a) The projection of the delayed stroke Z in the letter ¼ (k); (b) the delayed stroke is projected to the letter body; (c) the newly generated PPS (p1 to p53). ...... 66

7.2 A word-part network: each path from the start node to a leaf represents a wpi, which is formally defined as [(final + medial* + initial) | isolated]. ...... 67

7.3 This graph shows the recognition rates of the compared systems: our proposed system and three other systems, changing one factor each time. The factors we changed are: SD = Synthetic Database, SF = Sliding Window features, and DTW = Dynamic Time Warping classifier. ...... 73

8.1 (a) Calculating a Signed Distance Map of a given binary image, (b) calculating the energy map of all different seams, (c) finding the seam with minimal energy cost, and (d) extracting the components that intersect the minimal-energy seam. ...... 76

8.2 Random samples from the tested pages: (a) Arabic, (b) Chinese, (c) German and (d) Greek. The extracted lines are shown in different colors. ...... 82

8.3 Different documents with fluctuating lines. The components in red are touching components, which were determined during the line extraction process. We can see in (d) the original touching component (1), the primary splitting in (2), while in (3) we can see the desired result. ...... 83

9.1 Meta-components with different numbers of additional strokes. ...... 87

9.2 Horizontal and vertical density histograms on top of a simplified word-part. ...... 88

9.3 The deletion of similar segments can lead to a different word-part, which illustrates the need for different costs for insertion, deletion, and substitution. ...... 89

9.4 Columns (c) and (g) show the similarity of the density histograms of the same word-parts. ...... 91


9.5 The results from the first (a) and the fourth (b) search schemes; the final results, using the accumulative process, are shown in (c) and (d). ...... 92

10.1 This figure depicts the spotting process, starting at the top left with the binary image and ending at the bottom left with the clusters of spotted words. ...... 97

10.2 In the first row we can see the gradient edge map for the template word-part image of the word-part Ghayr. In the second row we see the gradient edge map for the same word-part as an input image. ...... 100

10.3 In this figure we present three different resultant clusters from the presented system for three Arabic word-parts. The manually assigned word-parts are in red and the different shapes of the same word-part in each cluster are in black. ...... 102

Abstract

One of the most impactful aspects of the digital revolution has been the widespread dissemination of content that has, until now, been difficult or even impossible to access. Specifically, rare documents and old manuscripts that are kept in brick-and-mortar libraries around the world are being digitized, converted to searchable text, and made available on-line. The implications of opening up this treasure for the greater good of mankind are too obvious to enumerate. In this research, we address the technical issue of converting handwritten Arabic to a form that can be indexed and searched. The majority of these historic documents were written before the advent of printing presses, and while Arabic calligraphy as an art form is one of the most beautiful, it presents significant and unique challenges for automated text recognition algorithms. Considering that a large fraction of these documents are well preserved, and that current technologies for binarization and image enhancement can guarantee sufficient results, we have set our target in this research on developing new techniques for searching and spotting words in documents, by improving existing methods and developing new ones for converting written words (on-line and off-line) to text, together with new word-matching techniques. In the first part of this research, we study and analyze the Arabic script and its special properties. In Chapter 4, we present comprehensive statistics on the mutual presence of letters, words, word-parts, additional strokes, and dots in the Arabic language. A deeper analysis of the shapes of the different letters and word-parts is carried out to help develop efficient methods for on-line/off-line recognition of written Arabic. Special attention has been given to databases for training and testing the developed systems; therefore, in Chapter 5 we present a novel method to synthetically generate a comprehensive database of Arabic words in various handwriting styles. We have used this database for training the systems we present in the four following chapters. In Chapters 6 and 7, we present our novel techniques for developing an open-vocabulary on-line handwriting recognizer for Arabic script, based on DTW and HMM, respectively. In the last three chapters, we present our novel approach for building a word-spotting system for handwritten Arabic manuscripts. In Chapter 8 we present our novel approach for extracting lines from multi-skewed script images.

In Chapters 9 and 10 we present two novel approaches for searching and spotting words in handwritten manuscripts.

Chapter 1

Introduction

Recognition of Arabic handwritten manuscripts is a necessary prerequisite for the systematic and efficient storage, indexing, and study of invaluable texts often stored in suboptimal environmental conditions. The recognition of these texts goes through a number of stages, starting with scanning, image enhancement, and page layout segmentation, and ending with shape matching and labeling. Removing all or most of the background noise and binarizing the foreground text is an essential step that precedes and simplifies the segmentation process. In Arabic scripts, there is usually much overlap between letters and words appearing in old handwritten manuscripts. As such, segmentation can only be reliably applied to lines and, in some cases, to words. Once line segmentation is achieved, a profitable way forward would involve some form of image registration between a set of index words (pre-selected for their significance) and the binarized segmented lines of text within individual pages. Preprocessing is the key to processing and understanding historical and ancient documents. If a document image is preprocessed well, many of the common tools used for processing modern documents, or modified versions of them, can be applied to historical document images. However, as these documents are usually severely degraded, unstructured, and variable in content and appearance, the preprocessing itself becomes a difficult problem. Document Image Analysis and Recognition (DIAR) refers to the process of converting a raster image of a document page (a matrix of pixels) to a symbolic form consisting of textual (characters, digits, punctuation, words) and graphical (lines, geometric shapes, etc.) objects; for a complete survey see [87]. Document descriptions in terms of these higher-level objects are significantly more compact than their image counterparts. More importantly, the rich semantic content of such descriptions makes it possible to manipulate these documents to serve a variety of uses, such as searching them for specific patterns or classifying and combining them according to some criteria. Most DIAR systems consist of the following main stages:

1. Data capture, to generate an initial electronic copy (a raster image) of the document.


2. Binarization, to separate the foreground (ink) from the background (paper).

3. Page analysis and segmentation, to extract major text blocks, separate them from graphics (figures, logos, etc.), and segment them into columns, paragraphs, lines, words, word parts, and additional strokes.

4. Preprocessing, to reduce noise, correct for skew, and convert pixels to shapes (contours or skeletons, sliding windows).

5. Feature extraction, to characterize segmented objects by key details such as the number of their loops, their concavities and protrusions, and their corners.

6. Classification, to assign each segmented object to a “class” based on its features.

7. Post-processing, using lexicons, statistics, and natural language processing.

1.1 Data capture

The initial electronic version of a document is usually obtained by scanning it. This can be a labor-intensive process, especially if the documents involved are rare fragile manuscripts that require special handling. Large-scale digitization projects (such as Google’s ongoing project to scan more than 50 million books in the library collections of Harvard, Stanford, Oxford, and the University of Michigan, as well as the New York Public Library) utilize machines with robotic arms that can mechanically flip a book’s pages in front of a 16-megapixel digital camera. An important design consideration at this stage is image resolution. Low-resolution scans can be obtained relatively quickly and their storage requirements are manageable. In many situations, resolutions as low as 300 to 600 DPI are sufficient to allow for high-accuracy feature extraction and classification. However, certain documents may require higher resolution scans, e.g., 1200 DPI, to achieve acceptable levels of accuracy.
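As a rough, back-of-the-envelope illustration of this trade-off, the following Python snippet computes the pixel dimensions and uncompressed grayscale size of an A4 page at the resolutions mentioned above; the numbers are indicative only.

    # Back-of-the-envelope page sizes at different scan resolutions.
    # A4 = 210 x 297 mm, i.e. roughly 8.27 x 11.69 inches; the DPI values
    # are the ones discussed in the text.
    A4_INCHES = (8.27, 11.69)  # width, height in inches

    for dpi in (300, 600, 1200):
        w = round(A4_INCHES[0] * dpi)
        h = round(A4_INCHES[1] * dpi)
        grayscale_mb = w * h / (1024 ** 2)  # 1 byte per pixel, uncompressed
        print(f"{dpi:>5} DPI: {w} x {h} pixels, ~{grayscale_mb:.0f} MB uncompressed grayscale")

Uncompressed storage grows quadratically with DPI (roughly 8 MB per page at 300 DPI versus 130 MB at 1200 DPI), which is why resolution is chosen per collection rather than maximized by default.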

1.2 Binarization

The goal of binarization is to classify the image pixels as either foreground or background pixels by using an appropriate threshold. This can be straightforward for a clean document page with a white background and text in black

ink, but may prove quite challenging for pages in a historical document that have degraded over time. It is not uncommon for historic documents to have smears, stains, large variations in the color and texture of the background, seepage of ink, yellowing, dirt, and fingerprints (see Figure 1.1). An additional complication for Arabic documents is the presence of delayed strokes (dots,

é jJ ¯ , é Öޕ , èY ƒ , èQå„ », etc.) that may be hard to distinguish from background noise and could be lost during binarization. Pages with a uniform background can generally be binarized using a single global threshold. Otherwise, an adaptive local thresholding strategy is necessary. As might be expected, local thresholding is significantly slower and more compute-intensive than global thresholding. Generally, thresholding techniques fall into one of the following categories [110]:

• Histogram shape-based methods, where algorithms analyze the peaks, valleys, and curvatures of the smoothed histogram.

• Cluster-based methods, where gray-level samples are clustered in two parts as either background or foreground, or alternately are modeled as a mixture of two Gaussians.

• Entropy-based methods that try to minimize a cost function based on the entropy of the foreground and background regions in order to determine a threshold value or a formula for pixel values.

• Object attribute-based methods that search for a measure of similarity between the gray-level and the binarized images, such as fuzzy shape similarity, edge coincidence, etc.

• Spatial methods that use higher-order probability distributions and/or correlations between pixels.

• Local methods that adapt the threshold value at each pixel to the local image characteristics.

Arabic scripts contain delayed strokes; their loss during thresholding and noise removal would lead to errors in subsequent classification stages. Ideally, the feature extraction method should be robust to noise and changes in illumination. Traditional handwriting recognition methods using structural features depend on thinning, stroke estimation, etc., and are not robust against such degradations. In other areas of pattern recognition, it was found that image-based methods are robust to such degradations; the case of fingerprints is especially comparable. Minutiae-based fingerprint matching requires similar preprocessing steps, such as noise removal and binarization, which lead to the loss of important discriminating information in fingerprints. Later, image-based fingerprint recognition techniques proved robust to such degradations in fingerprint images [56] and gave good results with highly degraded images.


Figure 1.1: Historical Arabic documents exhibiting common aging symptoms such as yellowing.
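To make the global-versus-local distinction above concrete, the following Python sketch contrasts the two strategies using OpenCV; the file name and the local-window parameters are illustrative assumptions, not values used in this thesis.

    import cv2

    # "page.png" is a hypothetical scanned page; parameters are not tuned.
    gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Global thresholding (Otsu): adequate for pages with a uniform background.
    _, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Local (adaptive) thresholding: each pixel is compared against a
    # Gaussian-weighted mean of its neighborhood (block size 35, offset 10),
    # which copes better with stains and uneven illumination in degraded pages.
    local_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 35, 10)

    cv2.imwrite("page_global.png", global_bw)
    cv2.imwrite("page_local.png", local_bw)

The local variant is noticeably slower, as noted above, since a neighborhood statistic has to be computed for every pixel rather than a single histogram for the whole page.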

1.3 Page Analysis

Segmentation is the process of decomposing an image into a sequence of sub-images that represent different elementary classes. Segmentation can be carried out at several levels of granularity, from delineating text and graphics blocks on the page to isolating individual characters or fractions of characters. In general, segmentation proceeds top-down by identifying text blocks, then lines within blocks, then words within lines, then characters and character fractions (graphemes) within words. At the word and character levels, segmentation is relatively straightforward for printed Roman scripts. However, segmentation becomes quite challenging for handwritten documents or for semi-cursive languages such as Arabic [72, 42]. The basic assumption behind all

segmentation algorithms is that letters in the word are connected horizontally and are discontinuous vertically. The initial attempts to segment the characters from the words [100, 83] were not very successful. To overcome the problem of accurately segmenting the characters, later researchers over-segmented the words (into more segments than the number of characters) [82]. Another group of researchers claims that there is no need to segment the words. Stroke-based algorithms [50] intend to reconstruct some of the dynamic information lost during off-line recognition and use it for classification. Other techniques use character-level Hidden Markov Models (HMMs) [51] that use low-level features to learn characters without segregating them from the words. Later, word-level HMMs use the information from the character-level HMMs to recognize the entire word. Recently, more researchers are resorting to a segmentation-free recognition approach. This is evident in a recent survey [69] showing that more recent works use segmentation-free methods, while earlier ones use segmentation-based methods.
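As a minimal illustration of the top-down idea, the sketch below splits a clean, deskewed binary page into line bands using a horizontal projection profile; it is a toy example under those assumptions and does not handle the skewed or touching lines addressed later in this thesis (Chapter 8).

    import numpy as np

    def split_lines(binary, min_height=5):
        """Toy top-down line segmentation via a horizontal projection profile.

        `binary` is a 2-D array with ink pixels equal to 1. Rows with no ink
        are treated as inter-line gaps; runs of ink rows become line bands.
        Bands shorter than `min_height` rows are discarded as specks.
        """
        profile = binary.sum(axis=1)      # ink pixels per row
        ink_rows = profile > 0
        bands, start = [], None
        for y, has_ink in enumerate(ink_rows):
            if has_ink and start is None:
                start = y                 # a new band begins
            elif not has_ink and start is not None:
                if y - start >= min_height:
                    bands.append((start, y))
                start = None
        if start is not None and len(ink_rows) - start >= min_height:
            bands.append((start, len(ink_rows)))
        return [binary[top:bottom] for top, bottom in bands]

The same profile idea, applied to columns instead of rows, underlies the vertical-projection word and character segmentation schemes surveyed in Chapter 3.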

1.4 Preprocessing

Before the pattern of pixels comprising the image of a segment (generally a character, word-part, or complete word) can be recognized as readable text, it must be “cleaned up” to remove irrelevant artifacts [43]. This so-called “pixel-level processing” involves filtering to remove “salt-and-pepper” noise and repeated addition and removal of “on” pixels at region boundaries in order to smooth them out and to fill narrow gaps between regions. Additional processing involves slant and skew correction, size scaling, and normalization. An important (but optional) next step is thinning, to find the approximate center lines, or skeletons, of regions. This can be viewed as reducing an image to its essence by removing irrelevant details, such as boldface in the case of printed text, or thick strokes in handwritten text. As with binarization, Arabic presents a challenge to pixel-level processing because of the extra strokes associated with the letters of the alphabet. Even when we assume that a document is reasonably noise-free, preprocessing algorithms might incorrectly interpret such strokes as noise, causing them to be removed from the image. It is thus critical to tune existing preprocessing techniques to ensure correct handling of delayed strokes in Arabic [26, 108, 106]. The delicacy of the Arabic script makes such preprocessing a critical task. Balancing between noise removal, thresholding, and maintaining the structure of the words is difficult. Initially, traditional preprocessing techniques will be used followed by structural feature extraction, but at later stages image-based feature extraction techniques that do not require preprocessing will be investigated.
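The sketch below illustrates one typical pixel-level pipeline (despeckling, boundary smoothing, thinning) using SciPy and scikit-image; the filter and structuring-element sizes are arbitrary examples, and, as noted above, overly aggressive settings would erase exactly the small delayed strokes (dots, hamza) that must be preserved.

    import numpy as np
    from scipy.ndimage import median_filter
    from skimage.morphology import binary_closing, skeletonize

    def preprocess(binary: np.ndarray) -> np.ndarray:
        """Illustrative pixel-level preprocessing of a binary image (ink = True)."""
        # 3x3 median filter removes isolated salt-and-pepper pixels.
        despeckled = median_filter(binary.astype(np.uint8), size=3) > 0
        # Morphological closing fills narrow gaps and smooths region boundaries.
        smoothed = binary_closing(despeckled, np.ones((3, 3), dtype=bool))
        # Thinning reduces strokes to one-pixel-wide center lines (skeletons).
        return skeletonize(smoothed)

In practice the despeckling step would have to be gated so that genuine dots, which are only a few pixels larger than noise specks, are not filtered away together with the noise.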


1.5 Feature Extraction

To aid in the final step of recognition, a set of features is extracted from the images produced by the preprocessing step. The choice of what and how many features to extract affects both the quality and the performance of this stage. In general, features should be invariant to the expected distortions and variations in the character shapes. Furthermore, enough features should be extracted to allow good recognition rates while remaining within reasonable time and space limits for the training and recognition steps. There are many methods and feature groups in the literature, such as Fourier descriptors, Hough transforms, moments, invariant mappings, and geometrical features (see [95, 120] for a complete survey). It has also been proposed that relevant features be extracted automatically using unsupervised neural networks. Pixel features are widely used under different names; after all, a binary character image can be described by the spatial distribution of its black pixels. The width-to-length proportion can be used to differentiate characters from each other. To compute the pixel distribution features, the segment image is divided into different zones according to the baseline.
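As an illustration of such pixel-distribution features, the sketch below computes simple zoning densities over a regular grid; the grid size and the appended aspect-ratio feature are arbitrary examples, and it ignores the baseline-dependent zoning described above.

    import numpy as np

    def zone_density_features(segment: np.ndarray, rows: int = 3, cols: int = 4) -> np.ndarray:
        """Toy zoning features: the fraction of ink pixels (ink = 1) in each
        cell of a rows x cols grid, plus the width-to-height proportion."""
        h, w = segment.shape
        feats = []
        for i in range(rows):
            for j in range(cols):
                cell = segment[i * h // rows:(i + 1) * h // rows,
                               j * w // cols:(j + 1) * w // cols]
                feats.append(cell.mean() if cell.size else 0.0)
        feats.append(w / h)   # simple global shape feature mentioned in the text
        return np.array(feats)

Because each value is a density rather than a raw pixel count, the resulting vector is largely insensitive to the size of the segment, which is one of the invariances asked of features above.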

1.6 Matching-Classification

Given a set of features that have been extracted from an image, the goal of classification is to find a match for this image from a set of pre-defined images. For non-cursive printed text this process can be carried out at the character level and is relatively efficient due to the small number of candidate template images (a language's alphabet and a few punctuation symbols). This becomes problematic for handwritten cursive text, however, because of the potential difficulty of segmenting words into individual characters. To avoid segmentation errors, image matching can be done at the word, rather than the character, level. These techniques are commonly referred to as “segmentation-free” approaches [46, 65]. Segmentation at word boundaries is arguably easier than at individual character boundaries because of inter-word spacing. However, the set of template images that must be examined for possible matches grows exponentially with the number of characters in a word, which makes this approach computationally unattractive. Arabic script, even in printed form, is semi-cursive: a given word may consist of several fully connected word-parts (one or more characters each) that are separated by small intra-word spaces. This suggests an approach that matches at the word-part level. Image matching is a fairly mature field with many approaches to choose from. In general, image matching algorithms fall into one of two categories [75]: 1) pixel-by-pixel matching algorithms [93], which rank candidate matches using the distance between the pixel representations of the images being compared; common distance measures include the Euclidean Distance Map (EDM), the XOR difference, and the Sum of Squared Differences (SSD); and 2) feature-based matching algorithms, which rank candidate matches using distance measures in the feature space of the images being compared.

The following are among the most common methods for feature-based matching:

• The Scott and Longuet-Higgins algorithm recovers an affine warping transform between sample points taken from the edges of the template and candidate images. The matching cost used is the residual between template points and warped candidate points.

• The shape context matching method establishes correspondences between the outlines of the images being compared. The outlines are sampled and shape context histograms are generated for each sample point. The matching cost is determined from the cost associated with the chosen correspondences.

• The correspondence correlation technique [93] recovers the correspondences between points of interest in the two images. These correspondences are then used to construct a similarity measure.

• The Dynamic Time Warping (DTW) algorithm [96, 95] measures the similarity between two sequences that may vary in time or speed. For instance, similarities in patterns running at different speeds would be detected even if there were accelerations and decelerations over the course of the observation. DTW has been applied to video, audio, and graphics; well-known applications include automatic speech recognition, handwriting recognition, and word spotting. The method finds an optimal match between two sequences by warping them non-linearly in the time dimension. The optimization is performed using dynamic programming, and its complexity for the one-dimensional case is polynomial (a minimal sketch is given after this list).

• The Hidden Markov Model (HMM) classification technique: For the last three decades Hidden Markov Models have been used successfully for automatic speech recognition. Due to the success of HMMs in modeling sequential data, they have also been adopted for modeling letters in handwriting recognition. Many variations of HMM models have been adapted and used in text recognition research. Discrete, continuous, and semi-continuous types have been used with various topologies, ranging from ergodic to left-to-right models with no state skipping. HMM-based algorithms have been designed to handle letters, words, strokes, or pseudo-characters using one-dimensional, two-dimensional, or planar Hidden Markov Models. Results were very encouraging in the handwritten case and appear to handle the cursiveness well.


• Artificial Neural Networks (ANN): Artificial neural networks are used widely in machine learning applications. They are usually used for classification problems or for problems that require some type of cognitive ability, such as optical character recognition (OCR), speech synthesis, and data classification. An artificial neural network is software or hardware that tries to simulate the working of the human brain: an interconnected network of many artificial neurons, which are objects used to simulate the neurons in the human brain. In this approach, an artificial neural network is trained to identify similarities and patterns among different handwriting samples. Artificial neural network techniques have also proven helpful in the preprocessing that must take place before a handwriting sample can be considered suitable input for the network.

• Support Vector Machines (SVM): A Support Vector Machine performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks; in fact, an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function, and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training.
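Following up on the Dynamic Time Warping entry above, the sketch below gives the textbook dynamic-programming formulation of DTW for two feature sequences; it is illustrative only and is not the exact variant used in Chapters 6 and 10.

    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """Classic O(len(a) * len(b)) DTW between two sequences of feature
        vectors (each row is one frame)."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return float(D[n, m])

In practice a warping-window constraint (e.g., a Sakoe-Chiba band) is usually added to bound the search; the unconstrained quadratic version above is enough to convey the idea.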

In this thesis, we research several aspects of historical document image analysis. The main aspects we have researched are matching algorithms for word shapes for on-line text recognition and for off-line keyword searching and spotting. The fact that a major fraction of these historical documents is easy to binarize, while Arabic script is complex to recognize due to its inherent cursiveness and many additional strokes, motivated our research to focus on the steps following the binarization process: line extraction, feature extraction, synthetic databases, training, and matching techniques.

1.7 Contribution of this thesis

We have published several papers describing the research we have performed and present in this thesis. The first paper was published in the Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR 2008) [107].


The paper describes a system for searching Arabic keywords in modern and historical handwritten documents. An additional system for matching and searching Arabic words in historical Arabic manuscripts, based on the chamfer distance and DTW, was described in a recently published paper [105] in the Document Recognition and Retrieval Conference (DRR 2011). A paper [103] describing a complete system for spotting Arabic words and indexing historical manuscripts, based on [107], was recently submitted to the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI). In the field of Arabic on-line handwriting recognition, we have published two papers. The first, describing a segmentation-free on-line system for recognizing Arabic handwritten words, was accepted for publication in the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI) [22]. The second, describing a hierarchical on-line system for Arabic word recognition, was published in the Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR 2009) [108]. In ICDAR 2009, we published an additional paper describing a novel method for efficient generation of a comprehensive database for Arabic script [106]. A generalized version of this work, describing our method for generating a comprehensive synthetic Arabic database for on/off-line text recognition research [104], was submitted for publication in the International Journal on Document Analysis and Recognition (IJDAR). A paper describing a method for language-independent line extraction using seam carving in multi-skewed images was accepted to appear in the Proceedings of the Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011). The paper [106] presenting statistics and properties of the Arabic language and script for efficient text recognition is being prepared for submission to the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI).


Chapter 2

The Arabic Script

The Arabic language originated from the earliest-known alphabet, the North Semitic alphabet, which was developed in Syria around 1700 B.C. The North Semitic alphabet was the source of several alphabets, including Arabic, Hebrew, and Phoenician, the latter being the origin of western alphabets [109]. Figure 2.1 [126] shows the script development together with the alphabet of each script. The spread of Islam carried the Arabic script to regions outside the Arab world, and it became the script of more than twenty languages, such as Farsi, Urdu, Malay, Swahili, Hausa, and Ottoman Turkish.

Figure 2.1: Development of the Phoenician alphabet as the origin of the differ- ent scripts; oriental including Arabic with the down arrow and western with the up arrow.


2.1 Background

Unlike western scripts, Arabic script is written from right to left in a semi-cursive manner, in handwriting as well as in machine printing. A discrete format of Arabic handwriting in predefined boxes has been proposed but never became popular. On the one hand, the Arabic script is similar to western scripts in that it has a strict alphabet consisting of letters, numerals, punctuation marks, spaces, and special marks. On the other hand, it is different in the way it combines letters into words and the way it treats vowels. The Arabic script consists of 28 basic letters, 12 additional special letters, and 8 diacritics. A letter in Arabic usually has several (2 to 4) different shapes – initial, medial, final, and isolated – according to its adjacent letters and its position within the word. As a result, the 28 basic letters in Arabic script have 120 different shapes. Among the basic letters, six are disconnective, i.e., they interrupt the cursiveness of a word by prohibiting the connection to the following letters, splitting words into connected groups of letters, which are called components. Each component includes one or more letters and forms a part of a word. We refer to each connected component as a word-part. Spaces separating different words within a sentence are usually wider than those between consecutive word-parts within the same word. Some writing styles, especially in handwriting, do not respect the spacing rules and require complex context analysis to determine the right word. Let us consider the word è@PAJ.Ó, which means “match” in English. It includes four word-parts: AJ.Ó, P, @, è. The word-part AJ.Ó includes three letters, Ó (m), J. (b), A (a), in their initial, middle, and final forms, respectively. Each of the remaining three word-parts consists of a single letter in its isolated form (a one-letter word-part is always the isolated form of that letter). Figure 2.2 depicts the hierarchical structure of this word: the root includes the complete word, the internal nodes include the word-parts, and the leaves include the letters of the word. It is easy to define the grammar rules that generate a sentence in Arabic. Let blank and space be the space between words and the space between word-parts (usually smaller than blank), respectively. The formal rules to generate Arabic text are:

Word Part = Lisolated
Word Part = Linitial + (Linternal)* + Lfinal
Word = Word Part + (space + Word Part)*
Sentence = Word + (blank + Word)*

where '+' indicates concatenation and '*' indicates zero or more repetitions; Lisolated, Linitial, Linternal, and Lfinal indicate the isolated, initial, internal, and final shapes of a letter.
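A minimal Python sketch of these rules is given below; the letter-shape sets are hypothetical placeholder tags standing in for the actual glyph forms, so it only demonstrates the structure of the grammar, not a real Arabic letter inventory.

    import itertools

    # Placeholder letter-shape inventories (not real Arabic glyph forms).
    ISOLATED = ["m_iso", "b_iso"]
    INITIAL  = ["m_init", "b_init"]
    INTERNAL = ["b_med"]
    FINAL    = ["a_fin"]

    def word_parts(max_internal=1):
        """Enumerate the word-parts permitted by the grammar:
        word_part = isolated | initial + internal* + final."""
        parts = list(ISOLATED)                     # one-letter word-parts
        for k in range(max_internal + 1):          # number of internal letters
            for first in INITIAL:
                for mid in itertools.product(INTERNAL, repeat=k):
                    for last in FINAL:
                        parts.append("+".join([first, *mid, last]))
        return parts

    print(word_parts())

Run with the placeholder sets above, the sketch simply enumerates m_iso, b_iso, m_init+a_fin, b_init+a_fin, m_init+b_med+a_fin, and b_init+b_med+a_fin, mirroring how a word-part either is a single isolated letter or starts with an initial form and ends with a final form.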


مباراة — مبا ر ا ة — مـ ـبـ ـا

Figure 2.2: The word MOBARAT and its letters and word-parts. Leaves repre- sent letters and subtrees represent word-parts.

2.2 Letters and Strokes

One could perceive an Arabic letter as a letter body and a set of additional strokes that may include dots and/or short strokes. These additional strokes, which are written below or above the letter body, are usually called delayed strokes, as they are written after completing the main body of the word-part. Note that additional strokes do not include demarcations or vowelizations, which are not mandatory, especially in handwriting. We refer to the additional strokes of a letter or letters within a component as the complementary part. Some of the letters have only a body part and do not have any additional strokes, and several letters can share the same body part, differing only by their complementary parts.


Figure 2.3: Different styles of writing additional strokes for the letters (¼ ) and the ligature (È).

The delayed strokes include hamza (Z), madda (@), and vertical strokes that may not be connected to the letter body in handwriting (see Figure 2.5). In addition, dots may appear as a single dot, a pair, or a triplet.

Figure 2.4: Delayed strokes in Arabic script may appear under or above the letter body. The boxed pairs represent common variants (e.g., three dots are often written as a circumflex “hat”). These seven strokes appear in letters used in writing standard Arabic. Eleven additional strokes exist for writing addi- tional letters in other languages (Urdu, Pashto, Farsi, etc.).

Handwriting often includes additional delayed strokes that do not exist in the printed form. The letters ( ), ( ) and the ligature (B) may carry a delayed stroke in the form of a vertical segment, similar to the Arabic letter (@), drawn right over the letter. These are easy to detect on-line, as they involve a pen lift. The letter (º) at the beginning or in the middle of a word-part may also carry a horizontal stroke, as shown in Figure 2.3. In addition, the delayed strokes that consist of a pair or a triplet of dots may be written as a short horizontal stroke or a hat-like shape (∧), respectively. Figure 2.5 depicts different shapes that approximate the pair and the triplet. In some writing styles the ligature (B) is written in an alternative way without additional strokes, as seen in the first column of Figure 2.3; this alternative way of writing involves one loop and is written as one continuous stroke. Arabic script is similar to Roman script in that it uses spaces and punctuation marks to separate words. However, certain characteristics relating to the obligatory dots and strokes of the Arabic script distinguish it from Roman script, making the recognition of words in Arabic script more difficult than in Roman script. Eliminating, adding, or moving a dot or stroke could produce a completely different letter and, as a result, produce a word other than the one that was intended. The number of possible variations of delayed strokes is greater than in Roman script, as shown in Figure 2.4; there are only three such strokes used for English: the cross in the letter t, the slash in x, and the dots in i and j. Some of the 120 different Arabic letter shapes may look the



Figure 2.5: A list of additional strokes. The first column depicts the different basic elements used in additional stroke classes, the third column shows dif- ferent styles of writing the different classes of the additional strokes, and the fourth column includes a list of letters that have these basic strokes and their possible positions.

same when ignoring the complementary parts and considering only the letter body. Grouping the 28 different shapes of the isolated forms of the Arabic letters based on their body part reduces their number to 17 shapes. This number increases to close to 40 classes when considering all 120 different letter forms. Identifying the complementary parts and associating them with the right main body can be used to recognize a letter or to distinguish between similar letters. For example, the triplets (H, H, H) have exactly one shape for each form when ignoring the dots. Delayed strokes are very important in distinguishing between letters that have the same body part and differ only in the complementary part (the number and/or place of the additional strokes). Ignoring delayed strokes reduces the number of basic shapes of Arabic letters, but increases ambiguity.

Finally, Arabic script allows a top-down writing style called vertical ligatures, which is very common: letters in a word may be written above their subsequent letters. In this style, the position of letters cannot be predefined relative to the baseline of the word, which further complicates the recognition task, particularly in comparison with the Roman script.



Table 2.1: The full list of Arabic letters and their position within a word.

Chapter 3

Related Work

3.1 Text Recognition

From its early stages, when recognizers were designed only for the printed form of the script, Arabic text recognition research has faced difficulties in segmenting a word into individual letters. Although this chapter focuses on on-line Arabic text recognition, research in the general field of Arabic text recognition is also presented. Segmenting handwritten Arabic script is much harder than segmenting the printed form, even though both are cursive, since the former allows much more flexibility. Generally, research in this field can be classified into two classes: on-line and off-line recognition, where off-line recognition can address either printed or handwritten scripts. Three methodologies have been suggested to handle the cursiveness of the Arabic script in text recognition systems: 1) recognition of isolated letters; 2) segmentation-based methods; and 3) segmentation-free methods. In the following sections we briefly overview related work conducted in each of these methodologies.

3.1.1 Recognition of isolated letters

Several papers dealt with the recognition of isolated Arabic characters, a form that does not exist in Arabic writing. These approaches aimed to provide on-line Arabic text recognition systems for this artificial form as a temporary, easy solution. In other cases, the purpose was to test the classification and recognition process while ignoring the segmentation phase. The off-line recognition of Arabic texts has also attracted the interest of several researchers. Elsheikh and Guindi [41] and Mahmoud [72] extracted information from inner and outer boundaries by following the contour and then used Fourier descriptors to recognize characters. Alshebeili et al. [16] presented an Arabic character recognition system for isolated forms of printed Arabic script using features extracted from the 2D slice of the spectrum estimated by the Fourier spectrum of the character.

Mahmoud [73] used an HMM-based system to recognize off-line handwritten Arabic (Indian) numerals. Angle, distance, and horizontal and vertical span features were extracted from these numerals as units for training and testing the HMM. The number of states was estimated by performing several experiments, which showed that the best results were achieved using an HMM model with 10 states. The system achieved an average recognition rate of 97.99% on a large private database using 120 features presented as 12 observations of 10 features per digit.

Several approaches were developed to recognize isolated Arabic letters or digits. Al-Emami and Usher [5] developed an on-line Arabic handwriting recognition system based on decision-tree techniques. Al-Taani [9] used a structural approach to develop an on-line Arabic digit recognizer. Primitives representing specific strings are extracted from each digit, and the grammars that construct these strings are then used to identify digits. The system was tested using 100 different writers and high recognition rates were reported. Mezghani et al. [78] developed an on-line recognition system for isolated Arabic letters using Fourier descriptors and Kohonen maps. They reported a satisfying recognition rate on 7244 samples of 17 classes written by 17 writers. Fourier descriptors and tangents, extracted along the boundary, were used to represent the characters. Mezghani et al. [79] also used Self-Organizing Kohonen Maps (SOM) and features taken from the extracted elliptic Fourier coefficients of the handwritten stroke to recognize isolated on-line Arabic characters. In a recent paper, Mezghani et al. [80] investigated Bayes classification of on-line Arabic characters using histograms of tangent differences and Gibbs modeling. Alsalakh and Safadi [15] introduced a system for Arabic on-line handwriting recognition (“AraPen”) based on Dynamic Time Warping (DTW). It was designed to handle non-cursive character recognition and adapted to the cursive case. In the non-cursive case, the system was tested on a small corpus and achieved high recognition rates. The recognition rates went down dramatically, to less than 50%, when the system was adapted to cursive scripts. Baghshah et al. [21] developed a system to recognize on-line isolated Persian handwritten letters. The system is based on a fuzzy logic classifier and yields a high recognition rate on the Razavi and Kabir database [98]. Halavati et al. [52] used visual features and fuzzy logic classifiers to develop a system for on-line recognition of Persian handwriting. Alimi and Ghorbel [11] developed an on-line recognition system for isolated Arabic characters using dynamic programming algorithms. They reported high recognition rates using different database sizes and replications of characters.

Despite the high average recognition rates reported for these systems, this artificial approach that instructs users to write isolated forms of Arabic characters never became popular.


3.1.2 Segmentation-based Methods

The approaches in this category segment words into individual letters (or pseudo-letters), which are then sent to a character recognizer. The combination of the recognition results is used to generate a ranked list of possible matching words. This approach is theoretically stronger in handling large dictionaries while using a constant or limited number of classes for classification. Some segmentation-based approaches try to segment the input word into characters or constituent strokes using open and closed curves, vertical and horizontal strokes, or cusp and inflection points [18, 1, 14]. Other approaches rely on projections and/or histogram techniques to segment words into characters [19, 39, 57, 10, 33]. Segmentation approaches based on HMM models [49] or morphological rules [115] were also developed.

Several off-line approaches were based on recognizing Arabic words by segmenting words into their constituent strokes and then collecting the recognized strokes into letters and words [14, 19, 57, 10]. These approaches used projection and histogram techniques, such as vertical and horizontal projection, to split a word or word-parts into basic segments. Al-Emmami [39] used a similar approach for on-line Arabic text recognition. The performance of these approaches depends heavily on the level of script consistency, which explains their acceptable results for printed Arabic script. This dependence on script consistency also explains their inadequate results when recognizing Arabic handwriting.

Bushofa and Spann [33] segment words into characters by detecting the baseline of a printed word and analyzing the geometric behavior of its connection points. They considered the different forms of the Arabic characters, while ignoring additional strokes. They extracted features from the smoothed skeleton of the segmented characters. Gouda and Rashwan [49] presented an HMM-based method for segmenting printed Arabic script and reported a high rate of correct segmentation. They used a sliding window to extract features for training and recognition and adopted the invariant moments described by El-Khaly and Sid-Ahmed [40]. Sari et al. [115] presented an algorithm for segmenting handwritten Arabic words into characters based on morphological rules and reported satisfying results. For a complete survey of character segmentation see [99]. Broumandnia et al. [30] introduced a novel scheme based on the wavelet transform to segment printed Farsi/Arabic words into characters. They employ a novel wavelet transform, which is used to detect the underlying horizontal edges and the baseline. The projection of the horizontal edges on the baseline reveals the segmentation points. They compared the presented algorithm to three schemes, closed contour, structural, and holistic, in terms of precision, speed, and robustness against Gaussian noise. The presented results indicate the superiority of their algorithm in terms of precision and showed that it improves recognition speed.


3.1.3 Segmentation-free Methods

In this approach, recognition is performed globally on the whole representation of the word. It was originally introduced for speech recognition, where the identification of individual characters is almost impossible. Trinkle et al. [47] used a hybrid system of three subsystems for printed Arabic text recognition. The first subsystem uses a sliding window that breaks the image into pieces smaller than a character by looking at special features such as local maxima. These segments are combined into groups (using all possible unions) and scored by a neural network trained on character images. This subsystem uses a dynamic programming algorithm to determine the best-cost union for words from a lexicon. The second subsystem uses a neural network to detect the probable locations of characters within a word and to recognize the characters in the next step. The third subsystem computes word-level features such as loops, lines, and endpoints on the word image as a whole, and does not consider character-level information in any way. The recognition results are combined using a decision strategy to produce a single recognition response. Maddouri and Amiri [71] used global and local features to classify and recognize words from a small vocabulary of 70 words used for check verification. They used a transparent neural network and reported high recognition rates for different layers. Fadi et al. [25] proposed a method for on-line Arabic text recognition that performs segmentation during the training phase, to avoid large dictionaries of word models, and uses a segmentation-free approach in the recognition phase. A novel approach was used to embed the additional strokes in the main body of the character.

Pechwitz and Maergner [91] presented an off-line recognition system for Arabic handwriting using a semi-continuous HMM. They used a sliding window that moves from right to left to collect features directly from the normalized gray image pixels. They then applied the Karhunen-Loeve transformation to reduce the number of features in each frame. They used a seven-state model for each character shape. Tests were performed using the IFN/ENIT database of handwritten Arabic words and achieved an 89% maximal recognition rate. Khorsheed [60] used the Hidden Markov Model Toolkit (HTK) to develop an off-line printed Arabic text recognition system. After decomposing the document image into text line images, a narrow sliding window is used to extract a set of simple statistical features. The system was applied to a data corpus that includes Arabic text from more than 600 A4-size sheets typewritten in multiple computer-generated fonts and achieved a 95% maximal recognition rate. Benouareth et al. [24] presented an off-line segmentation-free recognition

system for unconstrained Arabic handwritten words using a discrete HMM with explicit state duration. The explicit state-duration modeling was used to improve the discriminating capacity of the HMM and to enable the recognition of difficult patterns in unconstrained Arabic handwriting. They used a new version of the Viterbi algorithm that takes explicit state duration into account to perform efficient training and testing. A set of statistical and structural features was extracted from the word image using a sliding-window approach based on the vertical projection histogram. Experiments using the IFN/ENIT database achieved a 90.2% average recognition rate. Al-Hajj et al. [6] presented a segmentation-free system for off-line recognition of cursive Arabic handwritten words. They used three HMM-based classifiers, and combinations of their results are used to determine the recognized words. Sliding windows with different orientations were used to extract pixel-level features, such as pixel density distribution and local pixel configurations, from the binary image. They tested different combination schemes and achieved a recognition rate of up to 90.96% using the IFN/ENIT database. Khorsheed [59] presented a segmentation-free method for off-line recognition of cursive handwritten Arabic script. Structural features were extracted from the skeleton of the words after segmentation into elementary strokes. These features are used to train a single hidden Markov model. The HMM is composed of multiple character models, where each model represents a single letter of the alphabet. The proposed method achieved recognition rates of 72% and 87%, the latter after consulting a word dictionary, on samples extracted from a historical handwritten manuscript [58]. Dehghan et al. [36] presented a segmentation-free approach for off-line handwritten Farsi/Arabic word recognition. A discrete HMM was used for the recognition process and Kohonen self-organizing maps were used for vector quantization of the feature vectors. A sliding window on the histogram of the chain-code directions was used to generate the feature vectors. The width of the sliding window was fixed to twice the stroke width and divided into five horizontal zones. Experiments carried out on 17,000 test samples of 198 city names in Iran achieved a 65.05% recognition rate. In another paper [35], they presented a novel holistic handwritten Farsi/Arabic word recognition scheme that works in the presence of word rotation and scale change. Word image features are extracted by exploiting the rotation and scale invariance characteristics of the M-band packet wavelet transform performed on polar-transform versions of handwritten Farsi/Arabic word images. The extracted features construct a feature vector for each word image. This vector is employed during the recognition phase by finding similar words based on the least Mahalanobis distance of feature vectors. This scheme is robust against rotation and scaling. Menasri et al. [77] presented a hybrid system based on HMM and neural network classification methods using explicit grapheme segmentation. Each letter-body class is represented by an HMM model and the neural network computes the observation probability distribution.

Experiments using the IFN/ENIT database resulted in an 87% average recognition rate. Dots and diacritics were recognized independently and used as prior knowledge to eliminate and validate letters. Al-Muhtaseb et al. [7] proposed an HMM-based system for off-line recognition of Arabic printed texts. Sixteen features were generated from each vertical sliding strip using overlapping and non-overlapping hierarchical windows. Eight different fonts were used for training and testing and yielded recognition rates around 99%. Dehghan et al. [37] presented a segmentation-free approach for off-line handwritten Farsi/Arabic word recognition. A discrete HMM was used for the recognition process and Kohonen self-organizing maps were used for vector quantization of the feature vectors. A sliding window on the histogram of the chain-code directions was used to generate the feature vectors. The width of the sliding window was fixed to twice the stroke width and divided into five horizontal zones. Experiments carried out on 17,000 test samples of 198 city names in Iran achieved a 65.05% recognition rate.
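Most of the off-line systems surveyed above share the same front end: a narrow window slides across the word image (right to left for Arabic) and a feature vector is computed at each position, producing the observation sequence that is fed to an HMM. The following is a minimal sketch of such an extractor; the window width, step, and the specific features (ink density in a few horizontal zones) are illustrative assumptions and do not reproduce the feature set of any particular system cited above.

```python
import numpy as np

def sliding_window_features(word_img, win_width=4, step=2, n_zones=5):
    """Extract one feature vector per window position, right to left.

    word_img : 2D array, foreground (ink) pixels are 1.
    Returns an array of shape (n_windows, n_zones) where each row holds
    the ink density of the window split into n_zones horizontal bands.
    """
    h, w = word_img.shape
    zone_edges = np.linspace(0, h, n_zones + 1).astype(int)
    features = []
    # Arabic is written right to left, so start the window at the right edge.
    for x_end in range(w, win_width - 1, -step):
        window = word_img[:, x_end - win_width:x_end]
        vec = [window[zone_edges[z]:zone_edges[z + 1], :].mean()
               for z in range(n_zones)]
        features.append(vec)
    return np.asarray(features)

if __name__ == "__main__":
    # A random "word image" just to show the shape of the observation sequence.
    img = (np.random.rand(40, 120) > 0.8).astype(np.uint8)
    obs = sliding_window_features(img)
    print(obs.shape)  # (n_windows, 5), one observation per window for an HMM
```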

3.1.4 Delayed Strokes

In general, previous work has viewed delayed strokes as features that add complexity to on-line handwriting recognition. Five methods have been proposed to recognize words with delayed strokes:

• Delayed strokes were discarded entirely from the handwriting in the preprocessing phase [12].

• Delayed strokes were detected in the preprocessing phase and then used in a post-processing phase [55].

• The end of a word was connected to the delayed strokes with a special connecting stroke [74]. Adding the special stroke, which indicates that the pen was raised, results in a continuous-stroke sequence for the entire handwritten English sentence.

• Delayed strokes were treated as special characters in the alphabet [55], i.e., a word with delayed strokes was given alternative spellings to accommodate different sequences where delayed strokes are drawn in different orders.

• The delayed strokes were projected onto and connected to the body of the word using two lines in different directions, imitating a continuous stroke written without raising the pen, in order to preserve the order of the extracted points of a written word-part together with its additional strokes [25].

Eliminating delayed strokes causes tremendous ambiguity, particularly when the letter body is not written clearly. Furthermore, eliminating delayed strokes

may lead to a similar shape that may represent different letters or a sequence of letters. For example, the letter (Seen) ƒ has a shape similar to that of the three letters JK (b + t + y) (without dots) in some writing styles. The other methods cannot readily be implemented, since Arabic words may contain many delayed strokes. These methods dramatically increase the hypothesis space, since words have to be represented in all their handwriting permutations. For example, the word éJ®J®k (Hqyqyp) 'truth' contains 10 dots, 6 above the word and 4 below it. Connecting the delayed strokes to the end of the word complicates the representation of the word, and removing the delayed strokes (dots) requires handling 3 × 4 × 5 × 4 × 2 = 480 different representations. The idea of using delayed strokes and other global features for lexicon reduction of words or off-line handwritten word-parts has been presented in some papers [84]. For example, Mozaffari et al. [84] used dots to reduce the large number of words to be recognized by eliminating unlikely candidates before recognition. The principle of their technique is the extraction of dots, diacritics, and word-parts from the cursive Arabic word image to describe its shape. In the first stage of lexicon reduction, the number of sub-words in the input word is estimated. Then, during the second stage, the word descriptor, based on the dots and diacritics information, is used while taking into account only the candidates selected in the first stage. The system was tested on the IFN/ENIT database, consisting of 26,459 cursive Arabic word images, and showed a lexicon reduction of 92.5% with an accuracy of 74%.

3.2 Arabic HWR Databases

Research in Arabic text recognition has attracted researchers' interest only recently, compared to Latin and Chinese scripts. Research results and techniques used for other scripts were adapted and improved to meet the needs and challenges of Arabic text recognition. Segmenting cursive Arabic words into characters is a hard task due to the large variety of writing styles and the absence of constraints and consistency. Attempts have been made to recognize isolated forms of Arabic letters, thus avoiding the segmentation process by forcing a non-cursive style of writing [16, 41, 72, 80]. In parallel, segmentation-based methods were developed or adapted and improved [19, 10, 39, 47, 115].


The poor recognition rates, which usually result from unsuccessful segmentation, shifted the focus to the segmentation-free, holistic, approach. In the holistic approach [47, 71, 26, 13], complete words are processed and recognized without segmenting them into characters. Character-based recognition approaches are required to store the various appearances of each character, while holistic approaches have to maintain large databases that store multiple shapes for each word in the lexicon for the training and recognition phases. The holistic approach was initially used for recognition tasks that require a small vocabulary, such as check verification, mail sorting, and keyword searching. Recently, the development of efficient holistic recognition methods that handle large vocabularies has attracted more interest [61, 108]. Such development requires large databases for training and evaluation as well as efficient processing in terms of time and space.

In [124], Wang et al. (2002) present a method to synthesize cursive handwritten words guided by a deformable model. The process concatenates ligature strokes and isolated letters generated from learned models to generate a word trajectory. In [121], Varga et al. (2003) present a method for generating synthetic handwritten text lines from images of text lines of cursive human handwriting. Thinning/thickening and other geometrical transformations were used as a perturbation model to generate the synthetic lines. They used this synthetic data to improve the learning process of HMM-based off-line cursive handwriting recognition. In a more recent paper [122], they present a method for synthesizing cursive handwritten English text lines using templates of characters and the Delta-Lognormal model. To generate a text line, they first concatenate perturbed versions of the characters in the text line based on the given templates. Overlapping strokes and velocity profiles in accordance with the Delta-Lognormal theory were then used to draw the text line.

Large standard databases with large sets of real data are an essential requirement for handwriting recognition research and development. The recognition rates of handwriting recognition systems cannot be meaningfully compared without the availability of such databases. Many databases for handwritten English text recognition, such as UNIPEN, CEDAR, NIST, and IRONOFF, have been developed. In contrast, very few databases were developed for the Arabic script and fewer became publicly available. The IFN/ENIT off-line database of Arabic words was one of the first publicly available databases and became the first standard database for Arabic. The IFN/ENIT database includes 946 Tunisian town/village names and postal codes written by 411 people. A Persian version of the IFN/ENIT was recently released, including city names handwritten in Farsi. The Persian version consists of 7,271 binary images of 1,080 Iranian province and city names, collected from 600 writers. For each image in the database, the ground-truth information includes its ZIP code and the sequence of characters and numbers.

Another well-known database for Arabic handwriting recognition is the CENPARMI Arabic checks database, which was released in 2003 [8] and consists of legal and courtesy amounts on bank checks and isolated handwritten digits. A few standard databases have been developed recently for research on Farsi/Arabic off-line handwriting recognition [2, 85, 114]. Mozaffari et al. (2006) [85] presented a new comprehensive database of isolated off-line handwritten Farsi/Arabic numbers and characters for use in optical character recognition research. It includes gray-scale images of 52,380 characters and 17,740 numerals. Each image was scanned at 300 dpi from Iranian school entrance exam forms from the years 2004-2006. Solimanpour et al. (2006) [114] described an approach toward a standard handwritten Farsi database including isolated digits, letters, numerical strings, legal amounts used on checks, and dates. The ADAB database for on-line Arabic text recognition was published by El-Abed et al. (2009) [2] as part of a competition in on-line Arabic handwritten text recognition at ICDAR 2009. The database consists of 15,158 Arabic words of 937 Tunisian town and village names written by more than 130 writers. In conclusion, researchers have mostly developed their own small datasets, or large databases that are not available to the public [8, 17, 86, 26]. None of these databases were developed to include the entire Arabic lexicon. As a result, no standard comprehensive database (on-line or off-line) for Arabic handwriting text recognition is currently available.

3.3 Word Spotting

Algorithms for spotting words in handwritten manuscripts provide the ability to automatically search for specific words in a given collection of document images, without converting them into their ASCII equivalents. This is done by clustering similar words, based on their general shape within the documents, into different classes in order to generate indexes for efficient searching. Shape-matching algorithms roughly fall into two categories [75]: pixel-based and feature-based matching. Pixel-based matching approaches measure the similarity between two images in the pixel domain using various metrics, such as the Euclidean Distance Map (EDM), XOR difference, or the Sum of Squared Differences (SSD) [93]. In feature-based matching, two images are compared using representative features extracted from the images. Similarity measures, such as DTW and point correspondence, are defined on the feature domain.

The Dynamic Time Warping (DTW) technique has been used and tested in many systems with various sets of features and has shown better results than competing techniques [75]. Manmatha et al. [75] examined several matching techniques and showed that DTW, in general, provides better results. Using a set of 2000 images of Latin words, they reported an average

match rate of 70%, which motivated them to develop algorithms to accelerate the computation of the DTW. Rath and Manmatha [96] preprocessed segmented word images to create sets of one-dimensional features, which were compared using DTW. Experimental results on different datasets from the George Washington collection yielded matching rates between 51.81% and 73.71%. They also analyzed a range of features suitable for matching words using DTW [95].

Rothfeder et al. [101] presented an algorithm that recovers correspondences of points of interest in two word images. These correspondences are used to measure the similarity between word images. They reported correct matching rates of 62.57% and 15.49% using a set of 2372 images of reasonable quality and a set of 3262 images of poor quality, respectively. Srihari et al. [116] presented a system for spotting words in scanned document images in three scripts: Devanagari, Arabic, and Latin. The system retrieved the candidate words from the documents and ranked them based on global word-shape features. They reported better results for printed text than for handwritten text and showed that combining prototype selection and word matching yields better results for handwritten documents. They obtained a correct match rate of 60% for handwritten English and 90% for printed Sanskrit documents.

Srihari et al. [102] used global word-shape features to measure the similarity between the spotted words and a set of prototypes from known writers. They reported results for manually segmented documents, using five writers to provide prototypes and another five for testing. They obtained a 55% correct matching rate and noted that the match rate increases as more writers are used for training. In [117] they presented a design for a search engine for handwritten documents. They indexed documents using global image features, such as stroke width, slant, and word gaps, as well as local features that describe the shapes of characters and words. Image indexing is done automatically using page analysis, page segmentation, line separation, word segmentation, and recognition of characters and words. Rath et al. [94] and [118] extract discrete feature vectors that describe word images and use them to train a probabilistic classifier. They reported an 89% correct matching rate for 4-word queries on a subset of George Washington's manuscripts.

A segmentation-free approach was adopted by Lavrenko et al. [65]. They used the upper-word and projection-profile features to spot word images without segmenting them into individual characters. They showed that this approach is feasible even for noisy documents. Their experimental results show a recognition accuracy of 65%. Another segmentation-free approach for keyword search in historical documents was proposed by Gatos et al. [46]. Their system combines image preprocessing, synthetic data creation, word spotting, and user feedback technologies. A language-independent system for preprocessing and word spotting of historical document images, which requires no line or word segmentation, was presented by Farrahi Moghaddam et al. [81]. In

this system, spotting is performed using the Euclidean distance measure enhanced by rotation and DTW.
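The profile-based matching described above (e.g., the one-dimensional column features of Rath and Manmatha compared with DTW) can be summarized in a short sketch. The two column features and the unconstrained DTW below are simplified, illustrative choices and do not reproduce the exact configuration of any of the cited systems.

```python
import numpy as np

def column_features(word_img):
    """Per-column features of a binarized word image (ink = 1):
    projection profile and upper profile, as in profile-based spotting."""
    h, w = word_img.shape
    proj = word_img.sum(axis=0) / float(h)
    # Row index of the top-most ink pixel per column (h if the column is empty).
    upper = np.where(word_img.any(axis=0),
                     word_img.argmax(axis=0), h) / float(h)
    return np.stack([proj, upper], axis=1)          # shape (w, 2)

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                        # length-normalized distance

if __name__ == "__main__":
    img1 = (np.random.rand(30, 80) > 0.8).astype(np.uint8)
    img2 = (np.random.rand(30, 95) > 0.8).astype(np.uint8)
    print(dtw_distance(column_features(img1), column_features(img2)))
```

In a word-spotting index, this distance would be computed between a query word image and each candidate word image, and the candidates would be ranked by increasing distance.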

Manmatha and Rothfeder [76] describe a novel scale-space algorithm for the automatic segmentation of handwritten documents into words. They clean margins, segment lines, and use an anisotropic Laplacian at several scales to segment lines into words. They reported 17% incorrect matching on 100 handwritten documents from the George Washington corpus of handwritten document images. You et al. [129] presented a hierarchical Chamfer matching scheme as an extension of traditional approaches to detecting edge points, and managed to detect points of interest dynamically. They created a pyramid through a dynamic thresholding scheme to find the best match for the points of interest. The same hierarchical approach was used by Borgefors [29] to match edges by minimizing a generalized distance between them. It is important to note that some of the aforementioned algorithms and feature sets that were developed for Latin scripts, such as the upper-word and projection-profile features of Lavrenko et al. [65], fail to give good results when applied to the Arabic script.

An algorithm for robust machine recognition of keywords embedded in a poorly printed document was presented by Kuo and Agazzi [64]. For each keyword, two statistical models are generated: one represents the actual keyword and the other represents all irrelevant words. They adopted dynamic programming to enable elastic matching using the two models. They created a synthetic database of about 26,000 words and reported a 99% recognition rate for words that share the same font size and 96% for those that do not. Chen et al. [34] developed a font-independent system based on HMMs to spot user-specified keywords in a scanned image. The system extracted potential keywords from the image using a morphology-based preprocessor and then used the external shape and internal structure of the words to produce feature vectors. Duong et al. [38] presented an approach that extracts regions of interest from gray-scale images. The extracted regions are classified into textual and non-textual using geometric and texture features. Farooq et al. [43] presented preprocessing techniques for Arabic handwritten documents to overcome the ineffectiveness of conventional preprocessing for such documents. They described techniques for slant normalization, slope correction, and line and word separation for handwritten Arabic documents. Saabni and El-Sana [107] presented an algorithm for searching Arabic keywords in handwritten documents. In their approach, they used geometric features taken from the contours of the word-parts to generate feature vectors. DTW uses these real-valued feature vectors to measure the similarity between word-parts. Different templates of the searched keywords were synthetically generated and matched against the word-parts within the document image.


3.4 Text Line Extraction

Text-line extraction methods can be divided roughly into three classes: top-down, bottom-up, and hybrid. Top-down approaches partition the document image into regions, often recursively, based on various global aspects of the input image. Bottom-up approaches group basic elements, such as pixels or connected components, to form the words and lines. Hybrid schemes combine top-down and bottom-up procedures to yield better results.

3.4.1 Top-down approaches

Projection Profiles [119, 88] along a predetermined direction are usually used in top-down approaches to determine the paths separating consecutive text-lines. Shapiro et al. [123] applied a Hough transform to determine the predefined direction for the Projection Profile calculation. The Hough transform was used by Likforman-Sulem et al. [68] to generate the best text-line hypotheses in the Hough domain and later to check the validity of these hypotheses in the image domain. He and Downton [53] presented the RXY cuts, which rely on projections along the X and Y axes, resulting in a hierarchical tree structure. Several approaches [31, 128, 53, 131] use Projection Profiles on predefined sub-blocks of the given document image to handle multiple skews. These global methods often fail to segment multi-skew (fluctuating) document images. To handle multiple skews in document images, Bar-Yosef et al. [128] use adaptive local projection profiles, which adapt to the skew of each text-line as it progresses, in an incremental manner. Wong et al. [125] developed the smearing approach to determine the text-lines in binarized printed document images. In this approach, consecutive black pixels along the horizontal direction are smeared; i.e., the white space between them is filled with black pixels if their distance is within a predefined threshold. The bounding boxes of the connected components in the smeared image then enclose the text-lines; a minimal sketch of this step is given at the end of this subsection. This method was adapted to gray-level document images and applied to printed books from the sixteenth century [66]. Shi and Govindaraju [133] determine text-lines by building a fuzzy run-length matrix. An Adaptive Local Connectivity Map (ALCM) was presented in [132] for text-line location and extraction, which can be applied directly to gray-scale images. Thresholding the gray-scale ALCM reveals clear text-line patterns as connected components. Shi et al. [111] presented a text-line extraction method for handwritten documents based on the ALCM. They generate the ALCM using a steerable directional filter and group connected components into location masks for each text-line, which are used to collect the corresponding components (on the original binary document image). Nicolaou and Gatos [89] used local minima tracers to follow the white-most and black-most paths from one side of the image to the other in order to shred the image into text-line areas.
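As referenced above, the following is a minimal sketch of the horizontal smearing (run-length smoothing) step used in the approach of Wong et al.; the gap threshold is an illustrative assumption, and the connected-component step that follows is only indicated in the comments.

```python
import numpy as np

def horizontal_smear(binary_img, max_gap=20):
    """Run-length smearing: fill white gaps shorter than max_gap between
    black pixels on each row, so the characters of a line merge into one blob.

    binary_img : 2D array with ink pixels set to 1 and background to 0.
    """
    out = binary_img.copy()
    for row in out:                      # each row is a view into `out`
        ink = np.flatnonzero(row)        # column indices of ink pixels
        if ink.size < 2:
            continue
        for a, b in zip(ink[:-1], ink[1:]):
            if 1 < b - a <= max_gap:
                row[a:b] = 1             # fill the gap with "black" pixels
    return out
    # Connected components of the returned image (e.g., via
    # scipy.ndimage.label) then approximate the text lines; their bounding
    # boxes enclose each line.
```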


3.4.2 Bottom-up approaches

Various approaches rely on grouping techniques to determine text-lines in document images, applying heuristic rules [44], learning algorithms [130], nearest neighbors [48], and search trees [113]. In contrast to machine-printed document images, simple rules such as nearest-neighbor grouping do not work for handwritten documents: the nearest neighbor often belongs to the next or previous text-line, which necessitates additional rules and quality measures to assess the extracted text-lines. The approaches in this category require the isolation of basic building elements, such as strokes and connected components, and often find it difficult to separate touching components across consecutive text rows.

O'Gorman [48] presented a typical grouping method, whose rules are based on the geometric relationships among the k-nearest neighbors. Kise et al. [62] combine heuristic rules and Voronoi diagrams to merge connected components into text-lines. Nicolas et al. [113] use the artificial-intelligence concept of a production system to search for an optimal alignment of connected components into text-lines. The minimal spanning tree (MST) clustering technique was used in [3, 112] to group components into text-lines. Proximity, similarity, and direction continuity were used to iteratively construct lines by grouping neighboring connected components [44]. Recently, a few methods using level-set techniques for line extraction were presented [127, 32]. Li et al. [127] presented a hybrid approach based on the level-set method for unconstrained handwritten documents. The level-set method is exploited to determine the boundary between neighboring text-lines, after converting the binary image into gray scale using a continuous anisotropic Gaussian kernel. Bukhari et al. [32] presented a level-set-based method for extracting lines from handwritten document images with multiple orientations, touching, and overlapping characters. They used ridges to compute the central line of parts of a text-line on the smoothed image and then applied active contours (snakes) over the ridges.


Chapter 4

Arabic Language Statistics for Efficient Text Recognition

Developers of hardware and software technologies have focused their efforts on improving computation power, increasing memory capacities, improving sensor capabilities while reducing their size, and improving network speed and reliability. However, the field of human-computer interfaces has been less successful in attracting research attention, and as a consequence, little progress has been made in this area. The most common means of human-computer interaction are still the keyboard and the mouse, despite the fact that neither is intuitive to work with. This problem has been exacerbated since the last decade of the 20th century, when small mobile computing devices were introduced. The natural alternatives to typing and point-and-click are speech and handwriting, which are universal communication methods. Handwriting interfaces are easy for literate users and provide a greater degree of privacy than speech. A large body of literature has been devoted to on-line handwriting recognition and significant progress has been made. Nevertheless, handwriting-based human-computer interfaces are not common, and most of the research has focused on Latin and Chinese characters, with much less work on Arabic scripts.

In this chapter, we study various statistical aspects of the Arabic language and show how they can be used to improve the efficiency and recognition rates of holistic, segmentation-free Arabic text recognizers. The presented statistical study also supports the hypothesis that a holistic approach is the appropriate one for Arabic text recognition. Within the context of the Arabic script, a holistic approach performs recognition of a word by recognizing its word-parts in the right order. One would initially think that this approach is not feasible, since the number of all possible word-parts is huge: storing a model for each word-part and searching through these models is not practical in real time. Surprisingly, the number of possible word-parts is not that large, and this number can be further reduced by removing dots and delayed strokes.

In an on-line handwriting recognition system, where loops and other features are easy to detect, it is also possible to further reduce the search space by classifying word-parts according to robust features, such as the number of loops in a word-part. Integrating these results in an on-line recognition system has dramatically reduced the search space, accelerated the recognition time, and improved the recognition rates.

In the rest of this chapter we first briefly overview Arabic characters, followed by our study and its integration with an on-line handwriting recognition system. Finally, we conclude and present directions for future work.

4.1 The Arabic Language Analysis

Theoretically, open-vocabulary recognition of Arabic text using a holistic approach requires a large word (or word-part) dictionary, as it has to include recognition models for each word. Such a large dictionary demands the processing of a large number of candidates and imposes severe limitations on processing efficiency. In this section we analyze different features of the Arabic script, such as loops, additional strokes, ascenders, and descenders, and show how they can be used to reduce the number of processed elements in a given lexicon for recognition and/or training. These analyses are limited to the Arabic language (Arabic lexicon); nevertheless, we believe that similar properties hold for other languages that use the Arabic script.

4.1.1 Additional Strokes and Word-Parts

The Arabic script is printed in hundreds of different fonts and written in numerous handwriting styles. An Arabic word-part can be represented by a graph, which is often linear (a polyline), and a set of delayed strokes. Intuitively, the larger the graph, the larger the variety among different writing styles. Many character segmentation algorithms use baseline estimation and histograms to break a word-part into individual characters. Estimating an exact baseline for an Arabic word is not an easy task for most handwritten scripts, as histograms and baselines do not behave consistently. For example, Figure 4.1 shows different writing styles of the word (Mohamed). The style on the left is easy to segment horizontally using a well-observed baseline and the vertical histogram. However, the style on the right is hard to segment horizontally because the three boxes bounding the first three letters are aligned almost vertically. For these reasons, and based on the results of word-to-character segmentation research, we conclude that it is preferable to avoid segmenting an Arabic word into individual letters. Such a scheme achieves better results in terms of recognition accuracy.


Figure 4.1: The word "Mohamed" in two different writing styles.

In contrast to the problem of character segmentation, segmenting a given word-part into its body and complementary parts is an easier task. For that reason, to recognize a word-part one would detect the body part first, and then use the complementary part to resolve conflicts and determine the recognized word-part. Delayed strokes, especially in handwriting, are quite robust and can be used to reduce the search space to include only the body parts whose complementary part matches that of the processed word-part. In such a scheme, robust features of the body and complementary parts of a given word-part are used to guide the search for a word-part matching the input pattern. Next we discuss these features in detail.

4.1.2 Loops, Ascenders, and Descenders

The upper and lower baselines usually bound most of the graph representing a word-part. As in Latin, ascenders and descenders are defined as the parts of a written graph above the upper or below the lower baseline, respectively. For example, the letters "p" and "h" have one descender and one ascender, respectively, while the letter "n" has neither. Some Arabic letters, such as Ð, ð, and ¬, include a loop and contribute these loops to the graph representing the word-part. In practice, the loops may appear empty or filled, which requires a delicate detection mechanism. Many Arabic letters include loops, ascenders, or descenders, and a word-part may include zero or more loops or ascenders. Descenders usually appear at most once in a word-part, since they occur only in the final forms of some letters. The Arabic letters ( , , , , ) contribute one ascender and the letters ( , , @ È ¼ h h. p, ¨, ¨ , ¼, È) contribute one descender to each word-part in which they participate. The definition of descenders in the Arabic script is not as restrictive as in English, since many different writing styles may introduce other descenders,

such as the letters (P, ð, and P). Each Arabic letter may contribute zero, one, two, or three loops. Figure 4.2 lists the letters that contribute at least one loop to a word-part. The letter ë can contribute zero, one, two, or three loops, depending on the letter form and the writing style. Ascenders and descenders are easy to detect both off-line and on-line, as they are always drawn above the upper baseline or below the lower baseline. Loops, on the other hand, are easier to detect in on-line recognition (as line crossings) and require much more effort in off-line recognition. Filled or degenerated loops are among the most common obstacles to detecting loops off-line. Detecting open loops is a challenging task both on-line and off-line.

Letter(s)          Forms      Loops
ج,خ,ح              Iso,Ini    0,1
ص,ض                All        1
ط,ظ                All        1
ـعـ,ـع,ـغـ,ـغ      Med,Fin    1
ف,ق                All        1
م                  All        1
هـ                 Ini        1,2
ـهـ                Med        0,1,2,3
ـه                 Fin        0,1
ه                  Iso        1
و                  All        1
ل                  Iso,Fin    0,1
(The Common Styles column of the original figure contains handwriting images and is omitted here.)

Figure 4.2: Loops in different letters; ranges indicate optional loops. The rightmost column of the original figure depicts different styles of writing characters with loops, while the first column gives examples of some common problems.

The letters { h, h, p} and the ligature ( B) form an exceptional case (see Figure 4.2). These letters may be written in their initial and isolated forms in many writing styles that form loops, as shown in Figure 4.2. On the other hand, these letters are also commonly written as in their printed form (without a loop). The inconsistency of writing loops in several Arabic letters seriously influences our statistics. In this work, we compute the statistics on the lexicon and not on the written shape of each word. Similar statistics could be computed using the shapes of the word-parts in a given database. In such a case, the database will store multiple representations for those word-parts that include loops, since loops may appear filled or degenerate. Under these circumstances, different shapes of the same word-part can belong to different categories when

querying different numbers of loops. For example, the word-part (ÉîD ) may generate zero, one, two, or even three loops in different writing styles. Therefore, when querying word-parts with zero, one, two, or three loops we get the appropriate shapes of this word-part. This solution enables a non-deterministic treatment of loops, which are frequently found in Arabic writing.

The size and the geometric shape of the different loops are very sensitive features. Nevertheless, they determine the difference between various letters; e.g., the loops in {Ð, ð, ¬ } are small and rounded, while the loops in the letters { , , ,  } are elliptical and wide. The loops in the letters ª and ª have less rounded shapes with clear sharp edges on the sides. These properties are consistent in the printed form and in some handwritten writing styles. In several letters the loop is combined with another feature, which usually reduces ambiguity; e.g., the vertical additional stroke above a loop uniquely determines the letters and . In handwriting, several letters, such as {Ê , º , ½ }, could be written using a double stroke (forward and backward) to avoid lifting the pen. In practice, such writing generates slivers of elliptic loops. Distinguishing these cases from real loops is easy, since these elliptic loops are always narrow and long: one diameter is much larger than the other.

Segmenting a word into individual letters does not ensure sufficient results. Without segmenting into individual characters, the number of different words and word-parts may exceed several hundreds of thousands, which demands traversing a large number of candidates in a holistic approach. Optimization procedures are essential to make a holistic approach practical.

4.2 Arabic Language and Script Properties

In this research we have studied the holistic approach for Arabic text recognition. In a typical segmentation-based text recognizer, an input word is split into individual letters/characters, which are assembled to generate the recognized word. Holistic approaches avoid splitting a word into letters and try to recognize the whole word at once. In Arabic scripts there are two levels of segmentation: segmenting a word into word-parts and segmenting a word-part into letters. Splitting a word into word-parts is not a complicated task

(different word-parts form different connected components). However, splitting a word-part into individual letters is a very complicated process.

A holistic approach for Arabic script can split a word into word-parts and recognize each word-part as one pattern, without splitting it into letters. In such an approach, the processed pattern has to be tested against all the models of word-parts. It is obvious that the processing time, which is critical for on-line systems, depends on the number of tested word-parts. Therefore, in this study we measured the number of valid word-parts and developed several schemes to reduce the number of tested word-parts to a number small enough to enable on-line (interactive) recognition.

The validity of a word-part is language-dependent, but the Arabic script is used in several languages. In this study we use the Arabic language, and we believe that the other languages that use the Arabic script will behave similarly.

4.2.1 Valid Word-Parts

As one would expect, the set of valid word-parts is a small subset of all the possible combinations of letters in the alphabet, Σ*. To compute the valid word-parts, one could test every possible combination against the language dictionary. However, the dictionary usually includes words and not word-parts. In addition, our initial assumption (which was later verified) was that the set of valid word-parts is a tiny fraction of Σ*. For that reason, we scanned Arabic dictionaries, books, and journals on the Internet to extract the valid word-parts. Languages usually develop over the years: new words are introduced and others become unpopular. To cope with these changes we scanned old dictionaries as well as new ones. We used modern dictionaries and Internet journals to capture the modern Arabic vocabulary and the famous dictionary Lessan Al-Arab to capture historical words.

We have collected and analyzed three million words, resulting in around 292,000 distinct words. These words include around 85,000 different word-parts, which are the input for our analysis.

4.2.2 Valid Word-Parts Analysis

Let us define the degree of a word w, degree(w), as the number of word-parts in w. We also define the length of a word-part wp, length(wp), as the number of letters in wp. We first classify words based on their degree, and word-parts based on their length. Table 4.2 shows the distribution of the lengths of the input word-parts, and Figure 4.3 illustrates this distribution graphically. Table 4.1 shows the distribution of the degrees of the input words, which is graphically illustrated in Figure 4.4.

Number of word-parts (degree)   1      2       3      4      5      6     7    8   9
Number of words                 51123  101264  87916  42725  12479  2128  296  61  15

Table 4.1: The distribution of Arabic words based on their degree.

Number of letters (length)   1   2    3     4      5      6      7     8    9
Number of word-parts         34  729  8214  30269  30017  13343  3027  459  51

Table 4.2: The distribution of Arabic word-parts based on their length.

The distribution of words by their degree is shown in Figure 4.4. We can observe that about 190,000 of the distinct words consist of two or three word-parts, and the majority of the rest, about 87,000, have one or four word-parts. Figure 4.3 shows the distribution of word-part lengths; the majority of word-parts have four or five characters. The degree of a word and the position of a given word-part can be used to eliminate candidates from the search space and improve recognition efficiency. The length of a given word-part can be used to estimate the stroke (on-line) or boundary (off-line) length to determine the targeted sub-lexicon [25]. A small sketch of how such degree and length statistics can be gathered from a word list is given below.
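As referenced above, the degree and length statistics can be computed directly from a word list by splitting each word at the letters that do not connect to their successor. The sketch below assumes the standard set of non-connecting Arabic letters and uses a tiny in-memory list as a stand-in for the scanned lexicon.

```python
from collections import Counter

# Letters that never connect to the following letter; a word-part ends after
# any of these (assumption: alef/hamza variants are treated alike).
NON_CONNECTING = set("اأإآدذرزو")

def word_parts(word):
    """Split an Arabic word into its word-parts (connected components)."""
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts

def degree(word):
    return len(word_parts(word))

if __name__ == "__main__":
    words = ["محمد", "مستشفى", "دار"]   # hypothetical stand-in for the lexicon
    degree_hist = Counter(degree(w) for w in words)
    length_hist = Counter(len(p) for w in words for p in word_parts(w))
    print(degree_hist)   # distribution of words by degree (cf. Table 4.1)
    print(length_hist)   # distribution of word-parts by length (cf. Table 4.2)
```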

4.2.3 Reducing Search Space

Even though we were able to reduce the dictionaries from 300,000 words to 86,000 word-parts, these numbers are still too large for state-of-the-art classification methods in terms of the number of models. They are too large to store the trained models and too many to traverse in real time, especially on devices with limited processing power such as PDAs and cellular phones.

Dots   0     1     2     3     4    5    6   7  8
0      2793  2157  2334  1151  486  131  28  3  0
1      3714  1909  2569  911   454  123  31  7  2
2      4446  1963  2802  861   445  77   14  3  0
3      3777  1349  2261  607   314  74   21  2  0
4      2626  720   1497  275   153  24   3   0  0
5      1498  352   730   130   93   13   2   0  0
6      755   105   313   36    12   2    1   0  0
7      262   37    117   16    0    0    0   0  -
8      95    13    47    4     1    0    0   0  0

Table 4.3: The distribution of Arabic word-parts based on the number of dots below and above the main body. The first row indicates the number of dots above a word-part body and the first column indicates the number of dots below.

In previous research, global features such as additional strokes and loops have been used to reduce the number of different characters, but not word-parts. In this work we use three groups of features to rebuild a compact dictionary of Arabic word-parts:

• Additional strokes

• Number of loops

• Number of ascenders and descenders

We chose these global features as they are easy to extract and detect in Arabic scripts of reasonable quality, such as on-line and constrained off-line Arabic scripts. As we mentioned in Subsection 4.1.2, in off-line Arabic handwritten documents the loops are not always empty (unfilled). Table 4.3 shows that even when ignoring loops within a word-part and considering only the dots below and above the word-part, it is possible to classify the processed candidates and significantly reduce the search space. The results in Tables 4.3 and 4.6 do not consider the order and distribution of dots within the different classes. For example, four dots may represent the following sequences: (1,1,1,1), (2,2), (1,3), (3,1), (1,1,2), (1,2,1), and (2,1,1), where 1, 2, and 3 stand for the classes containing one dot, two dots, and three dots, respectively. Based on the knowledge in Table 4.4, it is possible to decrease the search space by reducing the number of processed candidates.


[Bar chart: number of word-parts (y-axis, 0 to 35,000) versus word-part length (x-axis, 10 down to 1).]

Figure 4.3: The distribution of word-parts by their length.

For example, the word-parts (á K, I K , I K, IK) all have three dots above and one dot below, but with different distributions and orders. It is obvious that observing the order and distribution leads to processing fewer candidates. As can be seen in Table 4.6, the largest class includes 1693 candidates (word-parts without dots or loops). Nevertheless, the average search space includes about 350 candidates. Note that 2793 word-parts do not have dots and 11,552 do not include any loop. The ascender and descender distribution is illustrated in Figure 4.5 and is expected to cut the search space (when combined with any set of features) by half. These features (loops, additional strokes, dots) and the statistics regarding them are used to reduce the search space based on the input dataset, which dictates the robustness of each feature. Such a reduction of the search space not only reduces the processing time, but also improves the recognition rate, as demonstrated in Table 4.5. A minimal sketch of this kind of signature-based pruning is given below.
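The sketch below illustrates the pruning idea discussed in this section: word-parts are indexed by a (dots above, dots below, loops) signature, and only the matching bucket is passed to the slower matcher. The index class, and the idea of registering one shape under several loop counts, are illustrative; the actual detectors and matcher are assumed to come from the recognition system.

```python
from collections import defaultdict

def signature(dots_above, dots_below, n_loops):
    """Global-feature key used to prune the word-part lexicon.
    The counts are assumed to come from robust detectors (e.g., on-line data)."""
    return (dots_above, dots_below, n_loops)

class WordPartIndex:
    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, word_part, dots_above, dots_below, n_loops):
        # A word-part whose loops may be drawn or omitted can be registered
        # under several signatures (non-deterministic treatment of loops).
        self.buckets[signature(dots_above, dots_below, n_loops)].append(word_part)

    def candidates(self, dots_above, dots_below, n_loops):
        return self.buckets.get(signature(dots_above, dots_below, n_loops), [])

if __name__ == "__main__":
    index = WordPartIndex()
    index.add("wp1", 3, 1, 0)
    index.add("wp2", 3, 1, 1)
    index.add("wp2", 3, 1, 0)   # same shape indexed under a second loop count
    print(index.candidates(3, 1, 0))  # only these are passed to the matcher
```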

4.3 Discussion

Text recognition algorithms are used to convert graphical information into text. The input to these algorithms varies from well-defined on-line handwriting and clear modern printed text (off-line) to noisy historical documents. The robustness of the different classification features, such as loops and delayed strokes, varies with respect to the input dataset. Loops and delayed strokes are usually very robust in on-line systems, which generate clear curves with well-defined dots.


[Bar chart: number of words (y-axis, 0 to 120,000) versus word degree (x-axis, 10 down to 1).]

Figure 4.4: The distribution of words by their degree.

In historical documents, where loops are filled, it is not recommended to rely on loops to reduce the search space. The developer should be aware of these differences and design the system to handle the various cases based on user-specified parameters. Classifying the entries within the dictionary into classes can easily be implemented as an additional layer on top of the word-part dictionary.

In letter-based recognition, the system is trained using a small training dataset. In holistic approaches the training process is more complicated, in terms of both time consumption and storage. Computationally intensive classifiers, such as HMMs, can hardly deal with 300,000 different models. The 30,000 different models for the different word-parts may be more reasonable for manual training, but still pose a practical challenge. Manually segmenting words into letters in the training phase, similar to the approach of Biadsy et al. [25], simplifies the training but demands manual processing. Separating word-parts into main body and complementary parts dramatically reduces the dictionary size, as all the common main parts, which are shared by several word-parts, have a single entry in the database. In the training phase, it is also possible to train the system to recognize the common body parts using a single model (as they are identical). Such an approach not only reduces the number of models the system is required to store, but also reduces the training time, which is usually spent manually by a human operator. Nevertheless, the system can maintain multiple entries for an entire word-part to cope with situations where some letters may or may not explicitly include loops, depending on the writing style (see Figure 4.2).

Table 4.5 shows the improvement of the recognition rates, which is a direct result of reducing the search space.

3U/3D        3/2,1    2,1/1,1,1   2,1/2,1   3/1,1,1   2,1/1,1,1   1,1,1/1,1,1
Word-Parts   364      1130        31        2         3           0
1U/2D        1/2      1/1,1
Word-Parts   3516     402
2U/1D        1,1/1    2/1
Word-Parts   2283     2897
2U/2D        2/2      1,1/2       2/1,1     1,1/1,1
Word-Parts   2415     1741        368       98

Table 4.4: In each odd row, the first column gives the number of dots above and below a word-part (e.g., 3U/3D stands for three dots above and three dots below) and the remaining columns list the possible orders of these dots. The even row beneath it shows the number of word-parts for each order.

                     Loops : Strokes          Loops : No Strokes
                     Small Set   Large Set    Small Set   Large Set
Response Time        1150        1310         11540       12380
Recognition Rate     82.3        81.5         76.7        75.8

                     No Loops : Strokes       No Loops : No Strokes
                     Small Set   Large Set    Small Set   Large Set
Response Time        14220       14820        112500      116000
Recognition Rate     77.8        76.2         68.2        66.3

Table 4.5: The four possibilities of using/not using additional strokes and/or loops were tested using the same system for Arabic on-line handwriting recognition. The system is based on geometric features and elastic matching techniques. The results show the impact of using the additional strokes and loops on response time and recognition rate.

As can be seen, using both the loop and delayed-stroke distributions yields a 20%-23% improvement over recognition that ignores these two features. Using the delayed-stroke distribution and the loops separately resulted in an improvement of about 15% each. These results show that a holistic approach guarantees better results in recognizing written Arabic words/word-parts, while avoiding segmentation into individual letters.


No loops (rows: dots below, columns: dots above)
D\U   0     1     2     3     4
0     589   746   761   552   242
1     600   623   748   408   192
2     696   597   642   390   151
3     764   546   614   320   128
4     499   298   305   99    42

One loop
D\U   0     1     2     3     4
0     1154  973   1054  472   195
1     1520  851   1130  379   192
2     1693  854   1158  337   191
3     1563  573   908   217   123
4     1062  281   585   116   66

Two loops
D\U   0     1     2     3     4
0     813   382   449   119   43
1     1190  375   673   115   64
2     1424  425   783   118   84
3     1078  194   588   59    50
4     801   117   423   52    37

Three loops
D\U   0     1     2     3     4
0     213   53    68    8     6
1     361   57    112   9     6
2     535   80    204   16    19
3     319   34    139   10    12
4     224   33    87    6     7

Table 4.6: This table includes four matrices that correspond to 0, 1, 2, and 3 loops. In each matrix the cell in column i and row j includes the number of word-parts that have i dots above it and j dots below it.

[Bar chart "One-Descender & Ascenders distribution": number of word-parts (y-axis, 0 to 4,000) versus number of ascenders (x-axis, 10 down to 1).]

Figure 4.5: Distribution of word-parts by the number of ascenders with one descender.

Chapter 5

Comprehensive Synthetic Arabic Database for HWR

The recognition of cursive handwriting is a challenging task because of the huge variance and individuality of personal handwriting. In scripts that support both cursive and non-cursive handwriting, such as Latin scripts, it is possible to restrict recognition to non-cursive handwriting to provide a partial answer. However, such a solution is not possible for scripts in which cursiveness is an inherent part of the writing, such as the Arabic script.

Research in text recognition has distinguished between two main approaches: segmentation-based and segmentation-free. Segmentation-based approaches segment an input word into individual characters, which are then recognized and combined to identify the input word. Segmentation-free approaches recognize the whole word at once, without segmenting it into characters. Recent research [60, 25, 6, 77, 7, 24] has shown that segmentation-based approaches are prone to segmentation errors and cannot provide adequate recognition rates. These observations made the holistic approach the leading and widely accepted technique in handwriting recognition research. However, the holistic approach compares continuous words or word-parts, and for this purpose it is required to maintain large databases, with a recognition model for each word in the lexicon. In addition, training and recognition demand the existence of handwritten shapes for all the words in the lexicon, and generating such a database manually is an expensive and time-consuming task.

Many databases for handwritten text recognition have been developed for Latin scripts, especially English. In contrast, very few databases have been developed for the Arabic script and fewer have become publicly available. Research groups have developed private databases, which rarely become available to the public. In addition, none of these databases was developed to include all the handwritten words or word-parts of the Arabic language. As a result, no standard comprehensive database (on-line or off-line) for Arabic handwriting text recognition is available.

In this chapter we present a new approach for the efficient generation of a synthetic comprehensive database for on-line and off-line Arabic text recognition research. This database includes multiple shapes for each word, representing different handwriting styles naturally written by different writers to capture personal writing styles. In our system we use a novel approach to generate synthetic shapes of any Arabic word using predefined handwriting fonts that represent the various writings of each letter in its different positions. To keep the database compact and reduce redundancy, we used clustering and dimensionality reduction techniques. The compact set still covers the huge variety of writing styles while keeping the size as small as possible to enable affordable processing. Since word-part shapes are produced using basic elements of characters with one-pixel width, we collected, analyzed, and integrated important properties of each word-part into the database (see details in Figure 5.1). These properties include feature points on the stroke, global features of the whole shape, and the skeleton of each shape (which is often required for the off-line representation).

The rest of this chapter is organized as follows: in Section 5.1 we discuss the proposed system in detail. Section 5.2 presents the results and discusses directions for future work.

Shape               L_Num   Human Operator Properties       Automatic Extraction Properties
                            Loop1    Loop2    Loop3         Loop1    Loop2    Loop3
(word-part image)   5       Rnd,Dn   Triple   Rnd,Dn        Rnd,Dn   -        Rnd,Dn
(word-part image)   4       Rnd,Dn   Double   Degen         Rnd,Dn   Double   -
(word-part image)   2       Rnd,Dn   Rnd,Dn   -             Rnd,Dn   Rnd,Dn   -
(word-part image)   4       Rnd,Dn   Double   Rnd,Dn        Rnd,Dn   Double   Rnd,Dn
(word-part image)   2       Degen    Degen    -             -        -        -
(word-part image)   2       Rnd,Dn   Degen    -             Rnd,Dn   -        -

Figure 5.1: This table presents part of the data integrated with each shape in the database: the number of loops and, for each loop, its shape and position. (Loop properties: Rnd = rounded, Dn = down, Triple, Double, Degen = degenerated loop.)

5.1 Our Approach

[Flow diagram: the input word is split into word-parts; letter shapes are taken from the predefined fonts to generate all shape options; PCA, K-means, and hierarchical clustering reduce each list to a final compact list; layout methods then assemble word shapes using global features.]

Figure 5.2: A sample flow diagram for generating a compact set of shapes for a given word and its word-parts.

In this work, we present a novel approach for generating synthetic comprehensive databases for training, testing, and evaluating Arabic text recognition systems. Writers are asked to create predefined sets of handwriting fonts using their own handwriting styles. The shape of a given word ω is generated by concatenating the shapes of its constituent letters in the right order. However, the existence of multiple shapes for each letter and the need to consider all the permutations generate too many shapes for each word. Handling and maintaining such large datasets may exceed the local memory size and require unacceptable processing time. Nevertheless, the generated shapes of a given word ω have some similarities, and by clustering these shapes into groups we have dramatically reduced the number of shapes.

Using our system, we have generated comprehensive databases that include almost the entire Arabic lexicon and can be used for on-line as well as off-line text recognition. A generated database includes additional properties, such as local and global features at the character and word-part level and the stitching points between adjacent characters in word-parts. These properties can be used to experiment with various text recognition algorithms, such as character and word-part recognition, and with word segmentation algorithms.

The flow diagram in Figure 5.2 shows the different stages of the presented system. The system starts by accepting the ASCII code of a given Arabic word. In the next stage, using the predefined fonts, it synthesizes word-part images for the given word. After eliminating redundancy using clustering techniques, it synthesizes the given word using different layout schemes. Our system consists of three main sub-systems: extracting handwriting fonts, generating synthetic word-part shapes, and shape clustering. In the rest of this section we discuss these three components in detail.

5.1.1 Extracting Handwriting Fonts

We have developed a sub-system that guides users to write Arabic words in different styles and manually segment each word into its constituent letters. For each word-part, the system provides tools for manual specification of the demarcation points that separate letter shapes. This sub-system expects multiple appearances of each word in order to capture the different variations of the handwriting.

Our system can also generate handwriting fonts from available small databases. It scans all the words in the database, extracts the occurrences of each letter in the various positions, and generates a new font. Databases that do not include the segmentation of the handwritten words into individual characters first have to go through a manual segmentation phase for each word. Upon the end of this process, the system includes multiple shapes for each character. Typically, a large fraction of the shapes of each letter in a generated font are very similar. Therefore, we apply a hierarchical clustering technique to eliminate redundant samples and retain a compact set of shapes that faithfully represents the variety of handwriting shapes of each letter; a minimal sketch of this step is given below.

Database authors use their own handwriting styles, but they are also required to imitate different common writing styles for ligatures or letters in different positions. For example, the two letters (Ôg) have different writing styles/shapes, and users are advised to write them vertically tiled in order to adjust to the vertical tiling style (see Figure 5.5). The writers are also guided to pay attention to special ligatures, such as (B) and other common pairs of consecutive letters, which may not concatenate naively.
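As referenced above, the redundancy-elimination step can be sketched as follows: each letter sample is resampled to a fixed number of points, the samples are grouped by agglomerative (hierarchical) clustering, and one representative per cluster is kept. The resampling length and the distance threshold are illustrative assumptions, not the parameters used in our system.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def resample(stroke, n=32):
    """Resample a 2D stroke (k x 2 array of points) to n points by arc length."""
    d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(stroke, axis=0), axis=1))]
    t = np.linspace(0, d[-1], n)
    return np.column_stack([np.interp(t, d, stroke[:, 0]),
                            np.interp(t, d, stroke[:, 1])])

def prune_letter_samples(strokes, threshold=5.0):
    """Cluster near-identical samples of one letter and keep one per cluster."""
    X = np.array([resample(s).ravel() for s in strokes])
    labels = fcluster(linkage(X, method="average"), t=threshold,
                      criterion="distance")
    keep = {}
    for lbl, stroke in zip(labels, strokes):
        keep.setdefault(lbl, stroke)          # first sample of each cluster
    return list(keep.values())
```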

46 CHAPTER 5. COMPREHENSIVE SYNTHETIC ARABIC DATABASE FOR HWR 5.1. OUR APPROACH Global features, such as loops, are maintained per letter and per word-part in the database. The deterministic loop properties are extracted directly from the word-part’s text. To consider these properties in a non-deterministic man- ner, we store the probability of detecting a loop algorithmically for each loop within a word-part shape. This probability is determined based on the size and the ratio between the short and the long diameters of the loop (see Fig- ure 5.1 and Figure 5.4). Nevertheless, researchers and developers can access the strokes representing each character and word-part and extract additional information. The algorithms, which generate word-parts and words, use these feature properties to summarize the features for the generated words and in- tegrate them into the database. Features on the stroke, such as end point, split point, or high curvature points, are extracted and added to the database. The generation of an off-line representation uses these features to imitate off-line writing as much as possible. This process for example includes end- point smoothing and thickening of split points. These feature points, the orig- inal stroke, and the concatenation points between letters within the word-part are available in the database. They can be utilized by researchers as a ground truth to test and evaluate thinning and character segmentation algorithms. In our approach, characters or ligatures have different writing styles, and as a result they encounter different number of loops, which require careful treatment to avoid eliminating legitimate candidates in a loop-based candidate filtering. For example, the existence of loop in the letter (ê ) is not consistent even in different printed fonts, the medial form of (ê) may include zero, one, two, or three loops (see Figure 5.3 for details). The letters (k) ,(k ) and (k) can be written with or without a loop. Such inconsistency in the number of. loops complicates the preprocessing and post-processing candidate pruning phase. In our approach, loops are extracted and counted separately for each different letter shape and filtering is applied across the different word shapes (see Figure 5.1). We have adopted such policy to avoid deterministic pruning, which can mistakenly eliminate words with degenerated loops or different writing styles of the same letters, as mentioned previously. As a result, each different shape of the same letter in the same position can contribute different local and global features to a word-part. This data is highly important, as it is used to filter out candidates in a non deterministic manner when considering global features such as loops.

5.1.2 Synthesizing Word-parts and Word Shapes

The synthesizing process uses the designed fonts to generate word-parts represented as 2D vectors (strokes), which form the on-line database. We also use these vectors to generate off-line databases by increasing the width of the lines and adding noise to simulate scanned off-line words.


Figure 5.3: Samples of the shapes generated for the word ( Ñ ê Ó), where the letters ( è) and (Ð) include different numbers of loops. The images in each row are in decreasing order with respect to the number of loops.

The generation of words from word-parts is performed based on a predefined layout scheme, which determines the position of the shapes of word-parts with respect to the word. To represent the different writing styles, these schemes apply different layout methods, such as tiling word-parts horizontally or semi-vertically, with homogeneous, heterogeneous, or zero distances. In the Arabic script, there are only six disconnective letters (ð, P, P, X, @, X), which means that any non-final Arabic word-part has to end with one of these six letters. By observing the different writing styles and their behavior with different starting or ending letters of word-parts, we have noticed that the three letters (ð, P, P) at the end of a word-part may allow or encourage the subsequent word-part to overlap or touch the current word-part within a word. The other three of the six letters prohibit overlapping or touching, unless the subsequent word-part starts with the letters ( ¼, €, €). We have utilized this observation to generate various shapes of a given word by tiling the different word-parts within a word using the different layout schemes (see Figure 5.6).



Figure 5.4: Column (c) shows some examples of low probabilities of loop existence for the letter ( Ð). High probabilities, shown in column (a), correspond to obvious loops which are easy to extract, while shapes with medium probabilities appear in column (b). The size of the suspected loop and the ratio between the diameters are used to calculate these probabilities.

Our system uses the set of generated fonts to construct a writer-independent, open-vocabulary database of word-part shapes. For each word-part ω in the lexicon Σ, the system determines the shape of the letters in ω, while taking into account the position of each letter and ignoring the additional strokes. The shapes representing ω are generated by concatenating the various shapes for each letter in the right order, thus generating all the possible permutations of ω. This simple concatenation is performed by stitching the endpoint of each letter shape to the start point of the following one and smoothing the stitching region. Our current system uses an Arabic word-parts lexicon that includes almost every word-part in the Arabic language – around 48,000 word-parts. It is also capable of generating a database for any given lexicon, such as Farsi or Urdu, that uses the Arabic alphabet. For some languages it may require adding shapes of letters that do not exist in the Arabic language.

Let l_i^p be a vector (v_0, ..., v_{n_i}) of length n_i, representing the i-th shape of the letter l in the position p. To generate one shape for the word-part ω = (l_1, l_2, ..., l_m) of length m, we concatenate the vectors l_{i_1}^{ini}, l_{i_2}^{med}, ..., l_{i_m}^{fin}, each representing one appearance of a letter in a specific form, where ini, med, and fin stand for the initial, middle, and final positions of a letter within a word-part, respectively. The concatenation is performed by joining the endpoint of the vector l_i with the start point of l_{i+1}, while taking into account the appropriate positions of the two letters. In the concatenation process, the points of the vector l_{i+1} are adjusted to be aligned with the previous vector using the Euclidean distance between its start point and the end point of the previous letter's vector.



Letter shapes composing the word: (م) initial, (ح) medial, (م) medial, (د) final.

Figure 5.5: Three samples of synthetically generated shapes for the word ”YÒm×”.

To achieve seamless stitching, we apply a simple smoothing process to the stitching region of each pair of adjacent letter shapes. Obviously, no concatenation is applied for one-letter word-parts. Two-letter word-parts are constructed by concatenating the initial and final vectors of the corresponding letters. As expected, such a scheme for word-part shape generation produces huge sets of shapes for each word-part in the lexicon. For example, for a word-part that contains five letters with eight different shapes for each letter, the method produces a list of 8^5 = 32,768 different shapes. Many of these shapes are similar, as they display only minor differences, thus calling for techniques to reduce redundancy. During the generation process, global and local features are extracted from the font classes. For example, the number of loops in a word-part ω is determined based on the assigned loop count for each letter in ω. These properties are maintained for each shape entry in the database. It is important to notice that the same word-part may have various shapes with different properties, depending on the complexity of its constituent letter shapes; e.g., they may have a different number of loops. Such diversity in word-part representation enables sensitive treatment of various features in a nondeterministic manner, which is essential for holistic-based recognition approaches.
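A minimal Python sketch of this concatenation step is given below, assuming each letter shape is an (n, 2) array of stroke points. The translation-based alignment and the simple moving-average smoothing stand in for the position-aware alignment and smoothing described above.

import numpy as np

def concatenate_letters(letter_strokes, smooth_window=3):
    """Stitch letter strokes (each an (n, 2) array of points) into one
    word-part stroke: translate every letter so that its start point
    coincides with the end point of the previous letter, then apply a
    simple moving-average smoothing around each stitching region."""
    word = letter_strokes[0].astype(float)
    joints = []
    for stroke in letter_strokes[1:]:
        shifted = stroke.astype(float) + (word[-1] - stroke[0])
        joints.append(len(word) - 1)           # index of the stitching point
        word = np.vstack([word, shifted[1:]])  # skip the duplicated point
    # Smooth a small neighbourhood around every joint.
    for j in joints:
        lo, hi = max(1, j - smooth_window), min(len(word) - 1, j + smooth_window)
        for i in range(lo, hi):
            word[i] = (word[i - 1] + word[i] + word[i + 1]) / 3.0
    return word

# Example with two hypothetical letter strokes.
l1 = np.array([[0, 0], [1, 1], [2, 1]])
l2 = np.array([[0, 0], [1, -1], [2, -1]])
print(concatenate_letters([l1, l2]))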

We create the off-line images of the word-parts from the generated on-line word-part shapes – the ordered strokes – by applying a standard dilation process. In general handwriting, the curve near end points is usually smooth and thin due to the pen lift, and the areas around split points and curved strokes are often thicker than the average width of the stroke. Our off-line handwriting generation algorithm determines these properties based on predefined parameters. In the final step, we use two methods to simulate the process of printing and scanning. In the first method, we simulate the printing process using the methods presented in [4], and we simulate the scanning process using different degradation factors. The second method uses a convolution with a Gaussian kernel (see Figure 5.6) to add noise to the generated images.

We generate words from the given lexicon using the generated word-part shapes, based on three different layout schemes (which can be extended to more) that determine different handwriting styles:

• Concatenating the word-parts within a word on the same baseline with a reasonable gap. This gap has been determined by calculating the average gap between different word-parts within the same word in a text collection that includes one hundred full pages of different handwriting styles.

• Enabling selected word-parts, based on their constituent letters, to be aligned vertically or in any other direction, while allowing their bounding boxes to overlap. This is done using the results of the layout schemes discussed earlier.

• Based on the first and last letters of the different word-parts, we enable some selected shapes to touch each other.

Even though these methods do not represent all the different styles, they still cover most of them. Researchers are invited to devise their own layout techniques using our collection of word-parts, or even to ignore the layout and access the word-parts directly.

5.1.3 Dimensionality Reduction and Clustering

The generated representation for each word-part in the lexicon is too large for practical use. Fortunately, we have found that a large fraction of these representations have very few or no differences. Such a high percentage of redundancy among the generated shapes for each word-part, which may include tens of thousands of items, can be reduced dramatically by clustering and dimensionality reduction techniques.

In this step we aim to generate compact sets, defined as the smallest sets that represent the wide variety of shapes for each word-part. We have adopted three techniques to build compact sets: hierarchical clustering, Principal Component Analysis (PCA), and K-Means clustering.

Let S(ω) = {s_i(ω)}_{i=1}^{n} be a set of n vectors, where each vector s_i = (v_1, v_2, ..., v_{n_i}) represents one generated shape for the word-part ω. To enable efficient and accurate processing, we simplify the stroke s_i in a semi-uniform manner. Let us denote the simplified vector s_i by s_i^δ, where δ is the error tolerance used to control the simplification process. We define the feature α_j at the point p_j on a given vector (point sequence) as the angle between the segment p_j p_{j+1} and the following segment. For each point vector s_i^δ we generate a feature vector f_i^δ using the features α_j for 0 ≤ j < n_i. We also use a parameter k to determine the desired cardinality.

We first apply PCA on the covariance matrix of the n vectors f_i and use the m eigenvectors derived from the largest m eigenvalues for dimensionality reduction. The original samples, transformed by the m eigenvectors, are clustered using the K-Means clustering technique. The results are then transformed back to the original vectors and used as the k centroids to extract the representative vectors within each cluster. The result of this third step is a set of k vectors representing the k shapes of our desired compact set. The constants k and m are fixed for each word-part as a percentage of the number of different shapes for each letter and the length of the word-part. In these clustering methods, we adopted the Euclidean distance to measure differences between shapes, which requires applying length normalization to the feature vectors f_i.

We have applied the same clustering technique to the contours of the shapes for the off-line case. The resulting compact sets are very similar to those we obtained using the one-pixel-width strokes. Therefore, we decided to apply the dilation to the clustering results obtained from the strokes, for efficient processing. Holistic approaches using contour or sliding-window techniques can use the original, non-compact sets and apply their own clustering techniques based on contours or the entire image representation.

No lexicon reduction is performed after word generation. Holistic approaches using words as one component can adopt their own technique to reduce the size of the lexicon for each word, if needed. We believe that our layout methods represent various writing styles; nevertheless, additional reduction techniques can be used to obtain different compact sets.
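The Python sketch below outlines this three-step reduction (angle features, PCA, K-Means) using scikit-learn, assuming every stroke is an (n, 2) point array. The resampling to a fixed number of points and the default values of k and m are simplifications; in our system these constants are fixed per word-part as described above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def angle_features(stroke):
    """Turning-angle feature at every interior point of a simplified stroke."""
    d = np.diff(stroke, axis=0)
    angles = np.arctan2(d[:, 1], d[:, 0])
    return np.diff(angles)            # angle between consecutive segments

def compact_set(strokes, k=10, m=8, n_points=64):
    """Build a compact set of k representative shapes for one word-part.

    Every stroke is resampled to a fixed number of points so that the
    feature vectors have equal length, PCA reduces them to m dimensions,
    K-Means groups them, and the sample closest to each centroid is kept."""
    feats = []
    for s in strokes:
        idx = np.linspace(0, len(s) - 1, n_points).astype(int)
        feats.append(angle_features(s[idx]))
    feats = np.asarray(feats)
    reduced = PCA(n_components=min(m, feats.shape[1], len(strokes))).fit_transform(feats)
    km = KMeans(n_clusters=min(k, len(strokes)), n_init=10).fit(reduced)
    representatives = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
        representatives.append(strokes[members[np.argmin(dists)]])
    return representatives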

5.2 Experimental Results

Two Arabic on-line handwriting recognition systems [108, 26] were used to evaluate the quality of the synthetic database generated by our system.



Figure 5.6: Different samples of the shapes generated for writing the word ” I»QÓ”. The first row shows the one-pixel-width shapes, and the second row the results after the dilation process. Results after feature-point smoothing are shown in the third row, and results after imitating the scanning process in the fourth row. The fifth row shows the results of using different layout techniques to combine word-part shapes into Arabic words.

We compared the recognition time and rates using a synthetically generated database and a manually generated one. The following four datasets were used in the evaluation process:

• Manual (Set M): This set includes 40,000 manually generated words, written by five different writers [26].

• Synthetic (Set S): This dataset was generated synthetically from the above-mentioned 40,000 words using predefined fonts created by advised/trained writers.

• SyntheticFromManual (Set MS): This set is a comprehensive synthetic database that was generated based on fonts extracted from the Manual dataset.

• SyntheticGeneral (Set E): This is a comprehensive synthetic database, created using fonts from the manual database in the Manual set and additional fonts generated using our system.

These four sets were used to train the HMM-based system and as a prototype collection for the DTW-based recognition system, with the geometric features described in [108, 26]. Four classes of experiments were designed, corresponding to the following groups of participants:

• Trainers: this group includes eight members who participated in training and evaluating the system.

• Evaluators: this group includes eight evaluators who had never trained the system.

• Trainers-Evaluators: this group consists of two sub-groups of three members each. The first sub-group includes members who trained the system and the second includes members who did not participate in training the system.

• Unsupervised: this group includes members who trained the system and others who did not. These members were required to experiment with words that had not been part of the 40,000 words used in the training process.

The columns Set M, Set S, and Set MS show close results when the evaluators participated in training the system. The similarity among the results obtained using the manually and the synthetically generated databases demonstrates the credibility of the synthetic concatenation process and the effectiveness of the reduction process. Column Set MS shows the ability of the proposed approach to efficiently extend a small database into a comprehensive one. The best results, shown in column Set E, are achieved when the database is extended from a given database and enriched by our approach. This demonstrates the ability of the approach to enhance a given database with new shapes representing a wide spectrum of handwriting styles.

In order to evaluate the efficiency of deterministic and nondeterministic utilization of the global loop feature, three schemes were embedded in the system. In the first scheme, the system did not use the global loop feature, even though this feature was embedded in the feature set extracted from each shape. In the second, the global loop feature was used as a filtering step directly on the text lexicon of Arabic words. In the third scheme we used our approach of applying the loop feature across the shapes of the word-parts and filtered the lexicon of shapes instead of text words. To evaluate the contribution of the integrated data to accuracy and response time, we randomly selected a set of 400 words from the lexicon and generated the synthetic shapes of these words. The results shown in Table 5.1 were obtained using this random set of 400 words, each of which includes at least one loop. The manual representations were generated by eight writers, who were asked to write these 400 words.

Dataset                          Set M     Set S     Set MS    Set E
HMM classifier
  Trainers                       86.24%    88.21%    88.16%    90.09%
  Evaluators                     78.79%    82.78%    84.34%    88.22%
  Trainers-Evaluators            82.19%    84.28%    85.24%    89.12%
  Unsupervised                   79.15%    85.12%    83.14%    88.14%
DTW classifier
  Trainers                       85.84%    89.22%    96.86%    95.90%
  Evaluators                     81.19%    86.21%    86.14%    90.36%
  Trainers-Evaluators            84.31%    87.83%    87.24%    90.54%
  Unsupervised                   80.21%    86.18%    82.21%    89.16%

Table 5.1: Results for the HMM- and DTW-based systems. Results of testing the system using trainers, evaluators, and a combination of both are presented in the first, second, and third rows of each system. The fourth row shows the results of testing the systems on words outside the trained lexicon. The manual and synthetic databases are in the first two columns, respectively, while the third column presents the synthetic database regenerated from the given manual one, and the fourth shows the results of Set MS enriched by the shapes generated from the operator's fonts.

Table 5.2 shows that using the loop feature to filter candidate words from the lexicon reduces the average recognition time by 80%. However, the recognition rates are less encouraging when the loop feature is used in a deterministic manner. In general, the second and third rows in Table 5.2 show that using global features for pruning candidates improves response time. The recognition rates are improved when the global loop feature is used in a nondeterministic manner across different shapes, as seen in the third row. Research on segmentation into characters can use the recorded split points and the global features for training, testing, and evaluation. Thinning algorithms can use the synthetically generated off-line words with the original strokes as skeletons, and feature points such as end, split, and curvature points, for testing and evaluation.


Loops Case                        1 of 5    Recognition Rate   Time
No Loop Used                      85.31%    82.27%             100%
Loops in Lexicon                  86.51%    83.19%             20.12%
Loops integrated with shapes      91.76%    89.68%             21.17%

Table 5.2: The recognition accuracy rates and the time reduction when using the number of loops as a global feature to select the right candidate class. Results were obtained using a set of 1000 words selected to contain many types of loops.

Chapter 6

Hierarchical On-line Arabic Handwriting Recognition

Keyboards and electronic mice may not endure as the prevalent means of human-computer interfacing. Devices such as digital tablets, hand-held computers, and mobile technology provide significant opportunities for alternative interfaces that work in form factors smaller than the traditional keyboard and mouse. In addition, the need for more natural human-computer interfaces becomes ever more important as computer use reaches a larger number of people. Two such natural alternatives to typing are speech and handwriting, which are universal human communication methods. Both are potentially easier human-computer interfaces for new users to learn compared to keyboards. Although a handwriting interface expects users to be literate, it ensures a greater degree of privacy and confidentiality compared to speech.

In Latin and Cyrillic scripts, cursive writing is a handwriting style in which the letters in a word are connected, making a word one single complex stroke. In other scripts, such as Arabic, cursive writing is not a style; it is an inherent part of the script. Whether consecutive letters in a word connect depends on the letters: some do not connect to the following letter and interrupt the continuity of the stroke. As a result, a word in Arabic script is composed of multiple complex strokes.

Automatic handwriting recognition is classified into two categories, off-line and on-line, based on how the data is presented to the system. Off-line handwriting recognition approaches do not require immediate interaction with users: a scanned handwritten or printed text is fed to the system in a digital image format. In a typical on-line handwriting recognition approach, a special stylus is used to write on a digital device, such as a digital tablet. The digitized samples are fed to the system as a sequence of 2D points in real time, thus capturing additional temporal data not present in off-line input.

In this chapter, we present a new online recognition algorithm for handwritten Arabic script (as shown in Figure 6.1). We have adopted the holistic approach to avoid segmenting words into letters.


Figure 6.1: The flow of our system.

Nevertheless, we segment words into connected components, which we call word-parts. We also perform the recognition on the word-part level instead of the whole-word level and ignore the additional strokes. Such an approach dramatically reduces the search space, as many words share common word-parts and some differ only by their additional strokes. To reduce the search space, we apply a series of filters in a hierarchical manner: the earlier filters perform light processing on a large number of candidates, and the later filters perform heavy processing on a small number of candidates. In the first filter, global features and delayed-stroke patterns are used to reduce the set of candidate word-part models. In the second filter, local features are used to guide a dynamic time warping (DTW) classification. The resulting k top-ranked candidates are sent to a shape-context-based classifier, which determines the recognized word-part. In this work we have modified the classic DTW to enable different costs for the different operations and to control their behavior.

6.1 Our Approach

In this section, we discuss the various modules of our online recognizer and its general flow, which is shown in Figure 6.1. Our system accepts an ordered sequence of samples directly from the digitizer. The input sequence then goes through the following stages in order to recognize the corresponding word.

• The input sequence goes through several geometric processing steps to minimize handwriting variations and reduce noise.

• The points on the input sequence are classified into body and complementary parts; then the delayed strokes, which belong to the complementary part, are extracted and classified into points and strokes.

• The global features and delayed-stroke patterns are used to determine the set of candidates, which is usually a small fraction of the entire dataset.

• Local features are extracted from the point sequence that represents the main body part.

• The extracted features are fed to a dynamic time warping (DTW) recognizer, which uses them to determine and rank the trained models (candidates) that match the input sequence.

• The top ranked k candidates are sent to a shape context based classifier that determines the recognized word.

In the following subsections we discuss in detail each of these stages.

6.1.1 Geometric Preprocessing

Most digitizers perform uniform temporal sampling, which often results in oversampling of slow pen-motion regions and undersampling of fast pen-motion regions. This stage performs writing-speed normalization by resampling the point sequences and distributing the points uniformly over the sampled curve. The point sequence (polyline) is then smoothed using a low-pass filter to minimize handwriting variations, reduce noise, and remove imperfections caused by the acquisition devices.

The number of edges/vertices representing a polyline usually influences the number of features used to characterize it, and the running time of most statistical recognizers is affected by the number of features. For that reason, it is desirable to reduce the number of points in the sequence (polyline) while maintaining the shape of the input model.

In this work, we have adopted the Dynamic Time Warping (DTW) statistical recognizer, which tends to produce better results when the edges of the polyline are of similar length. Therefore, our simplification algorithm reduces the number of vertices that represent the polyline, while keeping its edges of almost equal length. We simplify a polyline p = v_0, v_1, ..., v_{n−1} by applying the vertex-removal operator. This operator removes a vertex v_i based on its distance from the segment v_{i−1}v_{i+1} and the distances to its two adjacent vertices.
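The following Python sketch shows one possible realization of the resampling and vertex-removal steps; the spacing and tolerance parameters are illustrative assumptions.

import numpy as np

def resample_uniform(points, step=2.0):
    """Redistribute points uniformly along the polyline (writing-speed
    normalization): points is an (n, 2) array, step is the target spacing."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    dist = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.arange(0.0, dist[-1], step)
    resampled = np.column_stack([
        np.interp(targets, dist, points[:, 0]),
        np.interp(targets, dist, points[:, 1]),
    ])
    return np.vstack([resampled, points[-1]])

def simplify_vertex_removal(points, tol=1.0):
    """Repeatedly remove the vertex whose distance to the segment joining its
    two neighbours is smallest, as long as that distance is below tol."""
    pts = [np.asarray(p, float) for p in points]
    def dist_to_chord(i):
        a, b, p = pts[i - 1], pts[i + 1], pts[i]
        ab = b - a
        t = np.clip(np.dot(p - a, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))
    while len(pts) > 2:
        dists = [dist_to_chord(i) for i in range(1, len(pts) - 1)]
        i_min = int(np.argmin(dists))
        if dists[i_min] >= tol:
            break
        del pts[i_min + 1]
    return np.array(pts)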

6.1.2 Features Extraction

The detection of delayed strokes is performed based on their sequential order, location, and size; in particular, delayed strokes are identified by the size and shape of their bounding box with respect to the word-part.

We extract two types of features from the body part: global and local features. The global features include loops, ascenders, and descenders. The local features characterize the local relation between adjacent or nearby points on the polyline.

The global features are easy to extract in on-line handwriting recognition systems. Loops are detected by inspecting the self-intersections of the curve. Ascenders and descenders are defined with respect to the lower and upper baselines. The existence of these baselines and adherence to a constrained writing style simplify the extraction of reliable ascender and descender features; otherwise, it is hard to rely on these features. In online handwriting, it is easy to define and draw upper and lower baselines and to respect first-grade Arabic writing rules.

The local features are extracted from the point sequence and quantify the relation between neighboring strokes. Let P_s = {p_0, ..., p_{n−1}} denote the input sequence after applying geometric processing. From this sequence we extract the following two features.

• For each point p_i, i > 0, we determine the angle between the segment p_{i−1}p_i and the x-axis (the horizontal line). We will refer to this feature as α(p_i). This feature quantifies the relation between adjacent segments, but does not provide any information concerning the point's environment.

• To quantify the relation between a point and its environment, we extract a semi-global feature, similar to the one introduced by Belongie and Malik [23]. It is defined as the angle between the segment p_{i−1}p_{i+δ} and the x-axis, where δ determines the width of the considered environment. We will refer to this feature as β(p_i, δ), where δ > 2.

The two features are blended linearly using Equation 6.1, where w is a normalized positive weight that controls the blending of the two features α and β.

f(p_i) = (1 − w) · α(p_i) + w · β(p_i, δ)    (6.1)
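A small Python sketch of this feature computation is shown below; the default values of δ and w are placeholders.

import numpy as np

def local_features(points, delta=4, w=0.5):
    """Compute the blended feature f(p_i) = (1 - w) * alpha(p_i) + w * beta(p_i, delta)
    for every point that has both neighbours required by the definition."""
    pts = np.asarray(points, float)
    feats = []
    for i in range(1, len(pts) - delta):
        d1 = pts[i] - pts[i - 1]
        alpha = np.arctan2(d1[1], d1[0])          # angle of segment p_{i-1} p_i
        d2 = pts[i + delta] - pts[i - 1]
        beta = np.arctan2(d2[1], d2[0])           # angle of segment p_{i-1} p_{i+delta}
        feats.append((1.0 - w) * alpha + w * beta)
    return np.array(feats)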

6.1.3 Shape Context

The second recognition phase utilizes the shape context feature, introduced by Belongie and Malik [23]. The shape context scheme considers a set of n points on the contour, C, of the shape. To each point p_i ∈ C it assigns n − 1 vectors, one for each point p_j ∈ (C − p_i). This set of description vectors is very rich; however, it is too detailed. Therefore, the distribution of relative positions is used as a robust, compact, and highly discriminative descriptor. For each point p_i, the scheme defines the shape context to be the coarse histogram of the relative coordinates of the remaining n − 1 points.

We use the shape context feature on the stroke of the body part in the same way it was originally used on closed contours, taking n points sampled uniformly from the given stroke.
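The sketch below computes such a coarse log-polar histogram for every point of a uniformly sampled stroke. The numbers of radial and angular bins and the log-polar binning details follow the common shape-context formulation and are assumptions rather than the exact configuration of our classifier.

import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Coarse log-polar histogram of the relative coordinates of the other
    points, computed for every point of the (uniformly sampled) stroke."""
    pts = np.asarray(points, float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]          # diff[i, j] = p_j - p_i
    r = np.linalg.norm(diff, axis=2)
    theta = np.arctan2(diff[..., 1], diff[..., 0])
    mean_r = r[r > 0].mean()                          # scale normalization
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_r
    descriptors = np.zeros((n, n_r, n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r_bin = np.searchsorted(r_edges, r[i, j]) - 1
            if r_bin < 0 or r_bin >= n_r:
                continue
            t_bin = int((theta[i, j] + np.pi) / (2 * np.pi) * n_theta) % n_theta
            descriptors[i, r_bin, t_bin] += 1
    return descriptors.reshape(n, -1)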

6.1.4 Word-Part Recognition

Matching algorithms are the core of any recognition system. The recognition and classification algorithms rely on matching techniques to determine the similarity between two point sequences. Feature-based techniques extract and compare sets of feature vectors from the two strokes (polylines). In this work we use a feature-based technique, as it provides flexible comparison, which is essential for handling varying handwriting styles.

We avoid segmenting word-parts into letters and consider the continuous word-parts as the basic alphabet of the Arabic language. As a result, the recognition of a written word is performed by recognizing its word-parts in the right order and combining them while consulting the dictionary. For that reason, the basic matching procedure compares word-parts, i.e., computes the match between two word-parts.

6.1.5 Dynamic Time Warping

Dynamic Time Warping (DTW) is an algorithm for measuring the similarity between two polylines which may vary in time or speed. This technique suits matching sequences with nonlinear warping. For one-dimensional sequences, DTW runs in polynomial time and is usually computed by dynamic programming using Equation 6.2.

D(i, j) = min{D(i, j − 1), D(i − 1, j), D(i − 1, j − 1)} + cost    (6.2)

In this research, we have slightly adjusted the classic DTW to include different costs for insertion, deletion, and substitution. In addition, we have adopted an extra cost for consecutive insertions and deletions to avoid introducing long segments that disturb the recognition accuracy. The DTW is computed by taking the minimum of the three options, each including the cost of its operation, as shown in Equation 6.6. We assign different cost functions for deletion, insertion, and substitution based on the change they introduce. In all handwriting, including Arabic, the difference between two point sequences that represent two different words can be very small, i.e., inserting/deleting just a few consecutive elements can change the sequence to represent a different word-part.

The match between the shapes of two word-parts is estimated by computing the feature vectors mentioned in Section 6.1.2. Let S_a and S_b be the sequences of the feature vectors calculated from the two word-parts.

We define cost_ins(i), cost_del(i), and cost_sub(i, j) as the cost of inserting a new element at position i into the sequence S_a, the cost of deleting the element i from the sequence S_a, and the cost of substituting the element i in the sequence S_a by the element j in the sequence S_b, respectively. Equations 6.3, 6.4, and 6.5 define the cost of each operation, where del_i and ins_i are the numbers of consecutive deletion or insertion operations up to point i, respectively.

cost_sub(i, j) = (S_a(i) − S_b(j))^2                      (6.3)
cost_del(i)    = ((S_a(i + 1) − S_a(i)) · ins_i)^2        (6.4)
cost_ins(i)    = ((S_b(i + 1) − S_b(i)) · del_i)^2        (6.5)

In order to embed the influence of consecutive deletions or insertions into the minimization problem of the DTW, we use Equation 6.6 to define the dynamic programming recurrence.

D(i, j) = min{ D(i, j − 1) + cost_ins,
               D(i − 1, j) + cost_del,
               D(i − 1, j − 1) + cost_sub }               (6.6)

As can be seen, this rule penalizes consecutive deletion and insertion operations with a cost that grows quadratically with the number of such consecutive operations. This scheme forces these operations to be spread over the entire fitting process and thus discourages long runs of consecutive deletions or insertions.

Several stages are performed to reach the final recognition of a written word-part. In the first stage, the system filters a class of candidate word-parts from the dictionary using the global features and the complementary part, as explained in Section 6.1.2. In the second stage, a DTW algorithm is applied to measure and score the similarity between the input word-part and each candidate word-part using the extracted local features. In the third stage, the k top-ranked word-parts are selected and compared against the written word-part using shape context features. The closest word-part is reported as the recognized word-part.
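The following Python sketch implements the modified DTW with separate operation costs and run-length penalties for consecutive insertions and deletions. The cost terms only approximate Equations 6.3-6.5 (the segment difference is taken toward the preceding element, and each cost is multiplied by the length of the current run of the same operation); it is meant as an illustration, not the exact implementation.

import numpy as np

def modified_dtw(sa, sb):
    """DTW between two feature sequences with separate insertion, deletion,
    and substitution costs; consecutive insertions/deletions are penalized
    quadratically through the run-length counters."""
    n, m = len(sa), len(sb)
    INF = float("inf")
    D = np.full((n + 1, m + 1), INF)
    dels = np.zeros((n + 1, m + 1))     # consecutive deletions ending at each cell
    inss = np.zeros((n + 1, m + 1))     # consecutive insertions ending at each cell
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best, best_del, best_ins = INF, 0, 0
            if i > 0 and j > 0:                      # substitution
                c = D[i - 1, j - 1] + (sa[i - 1] - sb[j - 1]) ** 2
                if c < best:
                    best, best_del, best_ins = c, 0, 0
            if i > 0:                                # deletion: skip sa[i-1]
                run = dels[i - 1, j] + 1
                step = sa[i - 1] - sa[i - 2] if i > 1 else 0.0
                c = D[i - 1, j] + (step * run) ** 2
                if c < best:
                    best, best_del, best_ins = c, run, 0
            if j > 0:                                # insertion: skip sb[j-1]
                run = inss[i, j - 1] + 1
                step = sb[j - 1] - sb[j - 2] if j > 1 else 0.0
                c = D[i, j - 1] + (step * run) ** 2
                if c < best:
                    best, best_del, best_ins = c, 0, run
            D[i, j], dels[i, j], inss[i, j] = best, best_del, best_ins
    return D[n, m]

# Example with two short hypothetical feature sequences.
print(modified_dtw(np.array([0.1, 0.4, 0.9]), np.array([0.1, 0.5, 0.8, 0.9])))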

6.2 Experimental Results

In this project, we focus on testing the feasibility of the online recognition of Arabic script using the holistic approach within a reasonable response time. We have implemented our system and performed several tests on various datasets using a 2.1 GHz Pentium Dual-Core machine with 1024 MB of RAM.

Figure 6.2: The response time of our system (in milliseconds) as a function of the number of loops (0-3) and the number of additional strokes above and below the word-part (up:down).

User Type            GM.Hit   GM.5   SCM.Hit
Tester1 (Trainer)    88%      98%    90%
Tester2 (Trainer)    83%      96%    89%
Tester3 (Trainer)    85%      95%    87%
Tester4              85%      94%    87%
Tester5              83%      92%    86%
Tester6              86%      94%    88%

Table 6.1: The recognition behavior of the various stages of our system for each tester

The average response time of our unoptimized system for recognizing a written word-part in the open-vocabulary setting was 954 ms, and the longest was close to 2800 ms. We consider this response time reasonable and focused on recognition precision. The graph in Figure 6.2 shows the response time with respect to various configurations.

To evaluate our system, we generated the shapes of the words in the database using a group of 10 writers. Each writer wrote a compact set of Arabic words that includes all the Arabic letters in their different shapes. A semi-automatic system was used to generate, for each writer, the shapes of all the words in the database from the written compact set.

To evaluate the recognition rate, we asked each user to write 100 word-parts retrieved randomly from the database. Six students participated in our experiment, each performing the test 10 times with different sets of random word-parts. Three of the six students had participated in generating the shapes of the word-parts (i.e., trained the system). Such a separation enables evaluating the writer dependency of the system. Table 6.1 summarizes these results. The column GM.Hit reports the recognition rate after the geometric filter; the column GM.5 reports the rate of finding the correct word among the top 5 candidates; and the column SCM.Hit reports the success rate of the shape-context filter using the top 5 candidates.


Chapter 7

Segmentation-Free Online Arabic Handwriting Recognition

Biadsy et al. [25] presented a novel approach for on-line Arabic handwriting recognition based on geometric features and Hidden Markov Models. In this chapter we extend this work by presenting an optimization step which aims to improve speed and accuracy. Additionally, in the experimental results section, we compare the results of the proposed system to others in the field. The system presented in [27, 28, 25] performs the recognition on the continuous word-part level and the training on the letter level. Such a scheme avoids the segmentation of words into individual letters during the recognition process, which is often prone to errors, and replaces the training over a large set (the word-parts) with training over a small set (the letters). The system presents a novel solution to the problem of additional strokes, which accurately handles delayed strokes by first detecting them and then integrating them into the word-part body. The system focuses on word-level recognition of undiacritized (unvocalized) Arabic, and thus no sentence-level context is modeled. Arabic vocalic diacritics are most often omitted in writing and printing and are, therefore, not addressed in this work.

7.1 Recognition Framework

The detection of delayed strokes – dots and short strokes – is performed based on their sequential order, location, and size. Dots are detected based on the size and shape of their bounding box with respect to the word-part. Upon detecting a delayed stroke and distinguishing it from the word-part body, they perform the delayed-stroke projection, which is illustrated in Figure 7.1 (with one letter).

The recognition framework uses discrete HMMs to represent each letter shape. To enhance word recognition, these letter-shape models are embedded in a network that represents a word-part dictionary (see Figure 7.2).



Figure 7.1: (a) The projection of the delayed stroke Z in the letter ¼ (k); (b) the delayed stroke is projected onto the letter body; (c) the newly generated PPS (p_1 to p_53).

The recognition of word-parts is performed without explicit segmentation into letter shapes; instead, the recognition is performed along paths that represent valid word-parts, similar to [54, 74, 90]. To limit the search space, they utilize a dictionary of possible valid words. This ensures better recognition rates compared to systems that can recognize any arbitrary permutation of letters. The Arabic dictionary D is subdivided into a set of sub-dictionaries D_1, D_2, ..., D_n based on the number of word-parts in each word. Sub-dictionary D_k includes all words that consist of k word-parts.

The words in the training data are split into letter-shape samples. In order to avoid improper samples, each letter-shape sample is tested to determine whether it satisfies the predetermined letter-shape well-formedness rules, e.g., the number and placement of dots/strokes above or below the letter body. The Baum-Welch training algorithm is used to determine the HMM parameters, λ = (A, B, π), for each letter-shape model. Before the training process, the initial state distribution π = {π_i} is initialized to π_1 = 1 and π_i = 0 for 1 < i < N (where N is the number of states in the model). The transition probability matrix A = {a_{i,j}} is initialized to a_{i,i} = 0.5 and a_{i,i+1} = 0.5 for i < N, a_{N,N} = 1, and a_{i,j} = 0 otherwise (for j ≠ i and j ≠ i + 1). The observation matrix B is initialized to reflect a uniform distribution. The number of states for each letter-shape model was chosen empirically, based on the geometric complexity of the letter shape, and ranges from 5 to 11. For a tutorial on Hidden Markov Models, please refer to [92].
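A small Python sketch of this initialization is given below; only the initial parameters (π, A, B) are shown, and the Baum-Welch training itself is omitted.

import numpy as np

def init_letter_shape_hmm(n_states, n_symbols=260):
    """Initial HMM parameters (pi, A, B) for one letter-shape model before
    Baum-Welch training, following the initialization described above."""
    pi = np.zeros(n_states)
    pi[0] = 1.0                                   # always start in the first state
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = 0.5                             # self transition
        A[i, i + 1] = 0.5                         # forward transition
    A[n_states - 1, n_states - 1] = 1.0           # absorbing last state
    B = np.full((n_states, n_symbols), 1.0 / n_symbols)   # uniform emissions
    return pi, A, B

# The number of states per letter shape is chosen empirically (5 to 11).
pi, A, B = init_letter_shape_hmm(n_states=7)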



Figure 7.2: A word-part network: each path from the start node to a leaf represents a wp_i, which is formally defined as [(final + medial* + initial) | isolated].

The input to the HMM model is a sequence of discrete values, usually denoted as the observation sequence. Thus, a quantization process is required to convert the three-dimensional feature-vector sequence extracted from a handwritten word-part into a discrete observation sequence. Each observation o_i in an observation sequence is an integer value in [0, ..., 259], represented using 9 bits.

This sharp discretization is necessary to reduce the number of training samples required for on-line Arabic handwriting systems. The lowest 8 bits are used to represent the 3D feature vector – local_angle_i, super_seg_angle_i, and is_loop_i. The local_angle_i and super_seg_angle_i, which are real angle values, are quantized to 16 and 8 directions, respectively (similar to [67]); the feature is_loop_i is a binary value (one bit). The 9th bit is used to mark virtual points; when it is on, the first two bits describe the property of the corresponding virtual point. The observation values [256, ..., 259] (which correspond to the first two bits 00, ..., 11, respectively) are used to classify the virtual points using (a) the position of the delayed stroke (above or below the word-part) and (b) the direction of the virtual segment (up or down). These four observation values are crucial for distinguishing different letter shapes that have the same letter body but differ in the position of their delayed strokes, e.g., K(Teh) and K(Yeh).
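The Python sketch below shows one way to pack the quantized features into the 9-bit observation values; the exact bit layout and the ordering of the virtual-point codes are not specified in the text, so the packing used here is an assumption.

import math

def quantize_observation(local_angle=None, super_seg_angle=None, is_loop=False,
                         virtual=False, stroke_above=False, segment_up=False):
    """Map a 3D feature vector (or a virtual point) to a discrete observation
    in [0, 259]. The bit layout below is an illustrative assumption."""
    if virtual:
        # Values 256..259 encode (stroke position, virtual segment direction).
        return 256 + (2 if stroke_above else 0) + (1 if segment_up else 0)
    # 16 directions for the local angle (4 bits).
    local_bin = int(((local_angle % (2 * math.pi)) / (2 * math.pi)) * 16) % 16
    # 8 directions for the super-segment angle (3 bits).
    super_bin = int(((super_seg_angle % (2 * math.pi)) / (2 * math.pi)) * 8) % 8
    loop_bit = 1 if is_loop else 0
    return (local_bin << 4) | (super_bin << 1) | loop_bit

print(quantize_observation(local_angle=0.3, super_seg_angle=1.2, is_loop=True))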

7.2 Optimization

Time and space complexity play major roles in application efficiency. In interactive applications, such as on-line text recognition, the system response time is obviously crucial. Nevertheless, high recognition rates are the most important aspect of these systems.

Segmenting a Latin word into individual letters is an easy task for non-cursive handwriting and a challenging one for cursive writing. In Arabic scripts, such segmentation is often difficult for both printed and handwritten words. Several reasons make the segmentation of Arabic words more complicated than that of cursive Latin words. In Latin cursive writing, the letters are similar to their isolated equivalents and are usually connected using additional ligatures, which are often not part of the letter body. In addition, Arabic script is not restricted to writing horizontally (along a baseline) and allows some letters to appear in vertical order.

An alternative holistic approach is the segmentation-free scheme. However, such approaches usually require huge databases to store the basic models and large time complexity to search for the right candidates. Each connected component (word-part) is treated as one component to be classified. As a result, a distinct model is constructed, trained, and saved for each word-part. The number of possible word-parts determines the complexity of the system. Fortunately, as we show in Section 4.2, in the Arabic language this number is not as huge as one would think. The number of unique word-parts in the Arabic language is around 86,000, and the majority of these word-parts include four and five letters. These properties are used for further optimizations that include fixing the observation-sequence length and purifying the statistical post-processing phase.

Most words in Arabic, more than 90%, have additional strokes and/or loops. Since these features are determined in the feature extraction step, they are utilized to accelerate the classification process by reducing the search space. Based on the results in Table 4.6, an optimization step is performed as a preprocessing step to reduce the number of models to be tested, using the number, position, and order of the additional strokes. As shown in Table 4.6, determining the additional strokes of a written component representing a word-part reduces the size of the class of candidates to fewer than 500 on average, accelerating the system responses; see Table 7.4.

In our approach, we perform training on the letter level. The models of these letters are combined into a word-part dictionary network, which also represents the models of the word-parts. This network is used to assist and verify word-part recognition and guides the combination of recognized word-parts into words. Such an approach avoids training for all valid word-parts in the language, yet manages to recognize word-parts and words at high rates even though most of them were not part of the samples used to train the system.

68 CHAPTER 7. SEGMENTATION-FREE ONLINE ARABIC HANDWRITING RECOGNITION 7.3. RESULTS AND DISCUSSION lizes the properties of Arabic script.

For each w in Text
    For each word-part wp in w
        above_dots  = CountUpperDots(wp)
        below_dots  = CountLowerDots(wp)
        loops       = CountLoops(wp)
        ReducedDict = ReduceDictionary(above_dots, below_dots, loops)
        Classify(wp, ReducedDict)

To reduce the dictionary search space, we index its entries using the number of loops and the numbers of dots above and below each word-part. Such indexing reduces the search space to fewer than 500 word-parts on average, instead of the entire word-part dictionary. In cases where the loop feature is not consistent, the loop count can be ignored; the average number of candidates then increases roughly threefold, which is still affordable. The calculations we present are based on around 4 million words collected from different books and websites. Obviously, this dataset does not include every word in the Arabic language; nevertheless, we believe it is sufficient to describe the general distribution of Arabic word-parts.
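The following Python sketch shows how such an index can be precomputed so that each query touches only one bucket; the dot- and loop-counting functions are assumed to be provided elsewhere.

from collections import defaultdict

def build_wordpart_index(word_parts, count_upper_dots, count_lower_dots, count_loops):
    """Index word-part models by (upper dots, lower dots, loops) so that the
    classifier only has to scan one small bucket instead of the whole
    dictionary. The counting functions are assumed to be provided elsewhere."""
    index = defaultdict(list)
    for wp in word_parts:
        key = (count_upper_dots(wp), count_lower_dots(wp), count_loops(wp))
        index[key].append(wp)
    return index

def candidates(index, above_dots, below_dots, loops, ignore_loops=False):
    """Return the candidate bucket; optionally ignore the (less reliable) loop count."""
    if not ignore_loops:
        return index.get((above_dots, below_dots, loops), [])
    return [wp for (a, b, _), bucket in index.items() if (a, b) == (above_dots, below_dots)
            for wp in bucket]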

7.3 Results and Discussion

Several tests and experiments were carried out to test and validate the different parts of our system. Four classes of experiments were performed to measure the effect of, and validate, the following parameters: 1) the datasets for training and evaluation, 2) the proposed feature set, 3) the HMM-based classification method, and 4) the optimization process based on additional strokes and loops.

Contrary to the English case, where databases for on-line handwriting have been publicly available for many years, there is no standard reference dataset for training and/or evaluating on-line handwriting recognition systems for Arabic script. We were not able to access any Arabic handwritten corpus that could be used to properly evaluate our approach. Furthermore, it was not possible to obtain any on-line Arabic handwriting recognition system with which to accurately compare our results. Therefore, we constructed our own dataset (the Manual Database) using four trainers. Each of the four trainers was guided to write 800 selected words and mark the boundaries of the letter shapes. The words were selected to cover all Arabic letter shapes with an almost uniform distribution. In the evaluation stage, ten writers (the four trainers and six new writers) were asked to write 280 words not in the training dataset.

The evaluation set included 2,358 words in total.¹ The overlap of trainers participating in the creation of the training and evaluation data is intended to help us evaluate writer dependence as well as writer independence. The trainers and evaluators were asked to write in their own writing style, but to respect the rule that a word-part body should be written in a single continuous stroke followed by a number of delayed strokes. We evaluated our system using five dictionary sizes: 5K, 10K, 20K, 30K, and 40K words selected from the Arabic Treebank [70], twenty random articles from Al-Arabi Magazine, and ten random articles from the website of the news channel Al-Jazeera. The 280 evaluation words were present in all dictionary sizes. The purpose of the various dictionary sizes is to test our system's performance under increasingly ambiguous conditions.

To validate the results obtained with our manual database, we instructed five students to write many shapes of each letter in all the different positions. A novel approach we have developed uses these shapes to synthetically generate a compact set of shapes for writing each word-part in the target dictionary. This approach was used and evaluated in another project and proved able to imitate on-line Arabic handwriting in many different writing styles. Word-part shape generation is performed by concatenating characters in the appropriate positions within a word-part for each word. Since the number of different shapes for each word is very large, this number is reduced using dimensionality reduction and clustering techniques. We used this system to synthetically generate all the manually generated words, which were then used to train and test the system. Table 7.1 shows the recognition rates of our system using our manual database and the synthetic database for training and testing.

Database Size                 5K        10K       20K       30K       40K
Manual Database      WD       98.44%    97.94%    96.86%    95.90%    95.44%
                     WI       98.49%    97.78%    96.54%    95.12%    94.44%
Synthetic Database   WD       92.14%    91.15%    89.61%    88.33%    85.17%
                     WI       91.44%    91.01%    89.64%    88.22%    88.11%

Table 7.1: Writer-dependent (WD) and writer-independent (WI) average word recognition rates for two tests including 2,358 and 6,220 words written by ten writers. Results obtained using the synthetic database for the same words as in the manual database are shown in rows three and four.

To evaluate the effectiveness of the presented set of features, we carried out several experiments, keeping the same system and training data sets but replacing the geometric feature set by a set of sliding-window-based features.

¹ Not all volunteers finished the testing task, and some word samples were omitted because they were incomplete.

Pechwitz and Maergner [91] used sliding-window features to recognize written Arabic words using the IFN/ENIT benchmark database. To adapt the on-line strokes to the off-line case, we thicken the one-pixel-width stroke to a width of three pixels before applying the sliding-window technique. We extract the features from the binary image instead of the gray-level one. The feature vector is the concatenation of the feature vectors extracted from the three sliding windows. Table 7.2 shows that the recognition rates in both cases are similar.

Database Size                  5K        10K       20K       30K       40K
Geometric Features    WD       98.44%    97.94%    96.86%    95.90%    95.44%
                      WI       98.49%    97.78%    96.54%    95.12%    94.44%
Sliding Win Features  WD       91.21%    91.15%    90.61%    88.63%    88.21%
                      WI       92.11%    91.31%    90.22%    90.11%    86.78%

Table 7.2: Writer-dependent (WD) and writer-independent (WI) average word recognition rates using our geometric features and pixel-based features from the sliding window technique.

We designed a system based on the Dynamic Time Warping (DTW) technique for the matching and classification process. This system uses the same feature set extracted from the word-parts to build the collection of prototypes for matching. The words from the training sets were used to build the sets of prototypes, and datasets of feature vectors were saved for each word instead of an HMM model. Table 7.3 shows the results of the two classifiers.

Database Size            5K        10K       20K       30K       40K
HMM classifier   WD      98.44%    97.94%    96.86%    95.90%    95.44%
                 WI      98.49%    97.78%    96.54%    95.12%    94.44%
DTW classifier   WD      91.24%    90.21%    90.12%    89.23%    88.27%
                 WI      96.18%    96.11%    93.32%    90.18%    87.22%

Table 7.3: Results of our system compared with a system using the same database and feature sets but a different classifier, based on the DTW matching technique.

High recognition rates are the most important aspect of our work, even though the system response time is obviously very important as well. In these experiments we intend to show the ability of the suggested optimization to improve recognition rates and reduce the response time. Reducing the response time was not the main task in this experimental test; therefore, results are presented as improvement percentages to validate the effectiveness of the optimization independently of the time efficiency of the system.

Dictionary Size                                    5K         20K        40K
Writer Independent   Time response reduction      -62.11%    -75.31%    -78.45%
                     Recognition improvement      +0.91%     +1.63%     +2.87%
Writer Dependent     Time response reduction      -64.33%    -76.18%    -78.98%
                     Recognition improvement      +1.12%     +2.32%     +3.14%

Table 7.4: The improvement in response time and recognition rates that results from using the dots and loops for optimization.

Table 7.4 shows that using additional strokes combined with a loop counter in preprocessing can reduce the search space and improve response time. This factor becomes more important when the target database is large. Overall, we achieved good results given that we used a relatively small training set. The differences between the writer-independent and writer-dependent recognition rates are less than 2% for all tested dictionary sizes. This implies that the features, model, and delayed-stroke algorithm we introduced are adequate for writer-independent handwriting recognition. The performance degrades as the dictionary size increases. The degradation in word-part recognition is at a lower rate than in word recognition, suggesting that the recognition failures are tied to specific word-parts. Most of the recognition errors arose in word-parts that have similar shapes, such as AK/ H and X/P. Therefore, the current features are not sufficient for adequately distinguishing between such word-parts.


[Graph: word recognition rates (80%-100%) for dictionary sizes 5k, 10k, 20k, 30k, and 40k; series: HMM-GF-MD, DTW-GF-MD, HMM-SF-MD, HMM-GF-SD.]

Figure 7.3: Recognition rates of the compared systems: our proposed system and three other systems, each obtained by changing a single factor. The factors changed are: SD = Synthetic Database, SF = Sliding-Window Features, and DTW = Dynamic Time Warping classifier.


Chapter 8

Language-Independent Text Lines Extraction Using Seam Carving

Text line/row extraction algorithms aim to determine the letters and words along a text-line in a document image. This is an important step often used in various handwriting analysis procedures, such as word spotting, keyword searching, and text recognition. For example, keyword searching often requires the determination of text-lines [117, 97, 96, 34, 63]. In addition, segmenting text blocks into distinct rows is vital for text recognition.

It is easy to determine the text-lines in machine-printed documents, since text-lines are usually parallel and often have the same skew. Density histograms, projection profiles, and the Hough transform are often enough to reveal text-lines in machine-printed document images. On the contrary, determining the text-lines in handwritten historical documents is a challenging task for various reasons. Among these reasons are the variability of skew between different text-lines and within the same text-line; spaces between lines that are narrow and variable; components that may spread over multiple lines or overlap and touch across consecutive lines; and the existence of small components, such as dots and diacritics (e.g., in Arabic script), between consecutive text-lines.

Several text-line extraction methods for handwritten documents have been presented. Most of them group connected components using various distance metrics, heuristics, and adaptive learning rules. Projection profiles, which were initially used to determine text-lines in printed image documents, were modified and adapted to work on sub-blocks and stripes [119, 88, 123, 128, 53].

In this chapter we present a language-independent global method for automatic text-line extraction. The proposed algorithm computes an energy map of the input text block image and determines the seams that pass across text-lines. The crossing seam of a line, l, marks the components that make up the letters and words along l. These seams may not intersect all the components along the text-line, especially vertically disconnected components; e.g., a seam may intersect the body of the letter ”i” and miss the dot. This is handled by locally labeling and grouping the components that form the same letter, word-part, or word.


Figure 8.1: (a) Calculating a signed distance map of a given binary image; (b) calculating the energy map of all the different seams; (c) finding the seam with minimal energy cost; and (d) extracting the components that intersect the minimal-energy seam.

The component collection procedure may require parameter adjustments that may differ slightly from one language to another, and mainly depend on the existence of additional strokes – their expected location and size.

In the rest of this chapter we describe our approach in detail and present some experimental results. Finally, we conclude our work and discuss directions for future work.

8.1 Our Approach

The human ability to separate text blocks into lines is almost language-independent. People usually manage to separate text blocks into lines without actually reading the written text, even in text blocks with multi-skew lines or touching segments. Humans tend to identify lines by collecting basic elements and/or components into groups and then analyzing the shape, size, and location of these groups with respect to the adjacent elements.

The spaces between adjacent lines and the concentration of ink along the lines play a major role in separating text-lines. These observations have motivated most line extraction approaches to search for the path that separates consecutive text-lines with minimal crossings and maximal distance from the script components.

Our approach to separating text blocks into lines was inspired by and built upon the seam carving work [20], which resizes images in a content-aware fashion. The core of our approach is a line extraction procedure (see Figure 8.1), which starts by computing an energy map of the input image based on the signed distance metric (Figure 8.1(a)). It then uses dynamic programming to compute the minimum-energy path, p, that crosses the image from one side to the opposite one, as shown in Figure 8.1(b, c). The path p corresponds to a text line in the document image. Finally, it collects the components along the computed path p, which form the words of that line (Figure 8.1(d)). The line extraction procedure is executed iteratively until no more lines remain. In our current implementation we assume the input document image is binary.

Next we discuss in detail the three main steps of the line extraction procedure: generating an energy map, computing the minimal-energy path, and collecting the components along the path.

8.1.1 Preprocessing

Usually, vertically touching components become connected and form one component whose height is above the average and which spreads over more than one text-line. It is possible to detect over-average-height components before segmenting the text into lines, but determining whether a component vertically stretches over multiple lines requires line estimation and extraction.

We calculate the average height of the connected components and classify them (according to their height) into four categories: additional strokes, ordinary average components, large connected components, and vertically touching components. Additional strokes are identified as the small components; components that include ascenders and/or descenders are classified as large components; and components that are significantly higher than ordinary and large connected components are classified as touching components. The classification is performed by comparing each component to the average height of the components.

The classification is not rigid, i.e., components may switch category after the line extraction. In the preprocessing step, connected components that were labeled as touching components are split vertically in the middle. The list of these components is passed to the post-processing phase, which draws the final decision based on the extracted lines – a suspected touching component may actually be an ordinary large component with an ascender/descender. The small components (additional strokes) are reconsidered with respect to the computed line region to decide their final position.
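The following sketch illustrates one plausible implementation of this height-based classification, assuming a binary image with foreground pixels set to 1; the ratio thresholds are illustrative assumptions, since only a comparison against the average height is specified above.

import numpy as np
from scipy import ndimage

def classify_components(binary, small_ratio=0.5, large_ratio=1.5, touch_ratio=2.5):
    """Label connected components and bin them by height relative to the average.

    `binary` is a 2D array with foreground pixels set to 1. The ratio thresholds
    are illustrative; the text only states that each component's height is
    compared to the average height.
    """
    labels, n = ndimage.label(binary)
    slices = ndimage.find_objects(labels)
    heights = np.array([s[0].stop - s[0].start for s in slices], dtype=float)
    avg_h = heights.mean() if n else 0.0

    categories = {}
    for idx, h in enumerate(heights, start=1):
        if h < small_ratio * avg_h:
            categories[idx] = "additional_stroke"
        elif h <= large_ratio * avg_h:
            categories[idx] = "ordinary"
        elif h <= touch_ratio * avg_h:
            categories[idx] = "large"        # contains ascender and/or descender
        else:
            categories[idx] = "touching"     # suspected multi-line component
    return labels, categories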

8.1.2 Energy function

Avidan and Shamir [20] discussed several operators for calculating the energy function to determine pixels with minimum energy for content-aware resizing. They suggested the gradient operator (see Equation 8.1) to generate the energy map, E(I), for an image I and showed that removing pixels with low energy leads to minimal information loss.

E(I) = \left| \frac{\partial I}{\partial x} \right| + \left| \frac{\partial I}{\partial y} \right|   (8.1)

Typical line extraction approaches seek paths that separate text-lines from each other in a document image, which is usually performed by traversing the "white" regions between the lines or the medial axis of the text (the "black" regions), respectively. The separating paths are perceived as seams, in seam carving terminology, with respect to some energy function. We have found the energy functions presented in the seam carving work inappropriate for text line extraction, mainly because the applications are different.

The search for a separating path (polyline) that lies as far as possible from the document components motivated the adoption of the distance transform for computing the energy map. Local extreme (minima and maxima) points on the energy map determine the nodes of the separating path. This scheme also requires maintaining a range of possible horizontal directions to prevent seams (paths) from jumping across consecutive lines. Even so, the seams often jump across consecutive lines, mainly when the local skew is close to diagonal or when there are large gaps between consecutive components on the same row. It also fails to handle touching components across consecutive lines, since the touching components act as barriers that prevent the progress of the seam along the white region between consecutive lines. To overcome these limitations we search for seams that pass along the medial axis of the text lines.

To search for seams that pass along the medial axis of the text lines, i.e., seams that cross components within the same text line, we use the Signed Distance Transform (SDT) to compute the energy map. In the SDT, pixels lying inside components have negative values and those lying outside have positive values. Following the local minima on that energy map results in seams that pass through components along the same text-line.
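A minimal sketch of an SDT-based energy map, assuming SciPy's Euclidean distance transform as the underlying distance metric (no specific implementation is named above):

import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_energy(binary):
    """Energy map based on the signed distance transform (SDT).

    `binary` is a 2D array with ink pixels set to 1. Pixels inside components
    receive negative distances and background pixels positive ones, so
    minimum-energy seams are attracted to the medial axis of the text lines.
    """
    outside = distance_transform_edt(binary == 0)  # distance to the nearest ink pixel
    inside = distance_transform_edt(binary == 1)   # distance to the nearest background pixel
    return outside - inside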

8.1.3 Seam Generation

We define a horizontal seam in a document image as a polyline that passes from the left side of the image to the right side. Formally, let I be an n × m image; we define a horizontal seam as shown in Equation 8.2, where x is the mapping x : [1 . . . m] → [1 . . . n]. For K = 1 the resulting seam is 8-connected, and for K > 1 the resulting seam may not be connected. Note that seams in content-aware resizing are connected in order to maintain the uniform rectangular grid of the image when removing the seam pixels.

S = \{(x(i), i)\}_{i=1}^{m}, \quad \forall i, \; |x(i) - x(i-1)| \leq K.   (8.2)

Let E(I) be the distance-transform-based energy map; the energy cost, e(s), of a horizontal seam (path) s is defined by Equation 8.3. The minimal cost seam, s_min, is defined as the seam with the lowest cost, i.e., s_min = \min_{\forall s}\{e(s)\}.

e(S) = e\left(\{(x(i), i)\}_{i=1}^{m}\right) = \sum_{i=1}^{m} E(x(i), i)   (8.3)

Dynamic programming is used to compute the minimal cost seam s_min. The algorithm starts by filling a 2D array, SeamMap, which has the same dimensions as the input document image. It initializes the first column of SeamMap to the first column of the energy map; i.e., SeamMap[i, 1] = E(I)[i, 1]. It then proceeds to iteratively compute the rest of the columns from left to right and top-down using Equation 8.4. Elements outside the range of the array SeamMap are excluded from the computation.

SeamMap[i, j] = E(I)[i, j] + \min_{l=-2}^{2}\left(SeamMap[i + l, j - 1]\right)   (8.4)

The resulting array, SeamMap, describes the energy cost of the left-to-right paths, which start at the left side and end at the right side of the image. The algorithm determines the minimal cost path by starting with the minimal cost in the last column and traversing the SeamMap array backward – from right to left.
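A minimal sketch of the dynamic-programming seam computation and the right-to-left backtracking described above, assuming a NumPy energy map and K = 2 as in Equation 8.4:

import numpy as np

def minimal_horizontal_seam(energy, k=2):
    """Compute the minimum-cost left-to-right seam over an energy map.

    Each column j is filled from the previous column, allowing a vertical jump
    of at most `k` rows between consecutive columns (Equation 8.4). Returns the
    row index of the seam for every column.
    """
    n, m = energy.shape
    cost = np.full((n, m), np.inf)
    back = np.zeros((n, m), dtype=int)
    cost[:, 0] = energy[:, 0]

    for j in range(1, m):
        for i in range(n):
            lo, hi = max(0, i - k), min(n, i + k + 1)
            prev = cost[lo:hi, j - 1]
            best = int(np.argmin(prev))
            cost[i, j] = energy[i, j] + prev[best]
            back[i, j] = lo + best

    # Backtrack from the cheapest cell in the last column (right to left).
    seam = np.empty(m, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for j in range(m - 1, 0, -1):
        seam[j - 1] = back[seam[j], j]
    return seam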

8.1.4 Component Collection

The computed minimal cost path (seam) intersects the main components along the medial axis of the text line, but may miss small satellite components, which usually consist of delayed strokes and small components, such as dots and short strokes, off the baseline. In addition, touching components across consecutive lines are treated as one component and assigned to the first intersecting path. Our component collection algorithm manages to handle almost all these cases correctly.

For an input minimal energy seam s = \{(x(i), i)\}_{i=1}^{m}, the component collection algorithm performs three main operations. In the first step it defines an empty component list, cl; it then determines the components that intersect the seam s and adds them to the component list, cl.

The components in cl represent the text row, r_s, spanned by the seam s and are used to determine the upper, u_r, and lower, l_r, boundaries of the text line. We refer to the region between the two boundaries as the row region. The mean and standard deviation of the height of the row region are measured and used to filter touching elements across consecutive lines. The oversized vertical components – those whose height is significantly above the average height – are classified as touching components and split in the middle. Small satellite components that intersect the row region are handled in two different phases. Components whose major fraction (above 50%) falls within a row region are assigned to the text row r_s spanned by the seam s (note that this also includes components that fall entirely within the row region). Finally, the row region is marked as a processed region.

This procedure does not collect small components beyond the row region, since correct assignment requires the existence of the two bounding row regions. For this reason, for each computed seam (except the first one), we determine whether it is adjacent to a marked region (an already processed row region). In such a case, we distribute the unclassified components between the two adjacent row regions based on their distance from the adjacent row regions; i.e., each component is assigned to the closest row region.
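The following sketch illustrates the first collection step under simplifying assumptions: components are given as a labeled image, the row region is taken as the bounding rows of the components hit by the seam, and the 50% overlap rule decides the remaining assignments; touching-component splitting and the cross-seam redistribution are omitted.

import numpy as np
from scipy import ndimage

def collect_row_components(labels, seam, min_overlap=0.5):
    """Assign connected components to the text row spanned by a seam.

    `labels` is a labeled component image and `seam[j]` is the seam row at
    column j. Components hit by the seam define the row region; remaining
    components are assigned when more than `min_overlap` of their height falls
    inside it, mirroring the 50% criterion described above.
    """
    cols = np.arange(len(seam))
    hit = set(int(v) for v in labels[seam, cols] if v != 0)
    if not hit:
        return set(), None

    slices = ndimage.find_objects(labels)
    upper = min(slices[l - 1][0].start for l in hit)   # row-region upper boundary
    lower = max(slices[l - 1][0].stop for l in hit)    # row-region lower boundary

    collected = set(hit)
    for lab, sl in enumerate(slices, start=1):
        if lab in hit or sl is None:
            continue
        top, bottom = sl[0].start, sl[0].stop
        overlap = min(bottom, lower) - max(top, upper)
        if overlap > min_overlap * (bottom - top):
            collected.add(lab)
    return collected, (upper, lower)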

8.2 Experimental Results

Several evaluation methods for line extraction algorithms have been reported in the literature. Some evaluate the results manually, while others use predefined line areas to count misclassified pixels. In [111], connected components were used to count the number of misclassified connected components within the extracted lines. We have adopted this evaluation method to evaluate our algorithm. We have used images of text pages as a test set for the evaluation process. As a ground truth for this set, we have manually added the information about the lines existing in these pages as groups of word-parts (main part and additional strokes). The evaluation results were obtained by counting the number of correctly classified and misclassified components.

We have evaluated our system using 40 Arabic pages from the Juma'a Al-Majid Center (Dubai), ten pages in Chinese, and 40 pages taken from the ICDAR2007 Handwriting Segmentation Contest dataset [45], including English, French, German, and Greek. The images have been selected to have multi-skew and touching lines. The 40 Arabic pages include 853 lines and 24,876 word-parts; the additional 50 pages have 967 lines. Using the Arabic set, only 9 lines were extracted with misclassified components; i.e., 98.9% of the lines were extracted correctly. The number of misclassified word-parts (additional strokes in extreme cases were not considered) was 312, which is 1.25% of the 24,876 word-parts. In a post-processing step, we have used the extracted lines' average height, orientation, and average component size to reclassify components. Around 63% of the misclassified components were reclassified correctly.

All the 86 touching components from consecutive lines in the tested dataset were split correctly in the post-processing step; see Figure 8.3. We had similar results with the other 50 pages: only 12 lines were extracted incorrectly, which corresponds to roughly 98% correct extraction. Generally, misclassification occurs when the extracted seam jumps from one line to its neighbor, which can be easily detected during the line extraction and corrected at a post-processing level.

            1st System   2nd System   Our System
ICDAR2007   98.1%        98.2%        98.6%
Dubai+      97.85%       98.1%        98.9%

Table 8.1: This table presents the results achieved by our system (the third column) compared to the system presented in [89] (the first column) and the system presented in [111] (the second column). The first row compares results of the three systems on a subset of the ICDAR2007 [45] test set, and the second row uses our private collection based on the documents from the Juma'a Al-Majid Center in Dubai.

We have implemented two known systems in order to compare their results to our system. The first system was presented in 2009 by Nicolaou and Gatos [89] and shreds the image into lines using tracers that follow the white-most and black-most paths. In the second system, Shi et al. [111] generate an adaptive local connectivity map (ALCM) using a steerable direction filter and group connected components into location masks for each text line. The three systems (including our system) were tested using the dataset described in this section and have yielded similar results and success rates; see Table 8.1.



Figure 8.2: Random samples from the tested pages: (a) Arabic, (b) Chinese, (c) German and (d) Greek. The extracted lines are shown in different colors.



Figure 8.3: Different documents with fluctuating lines. The components in red are touching components, which were determined during the line extraction process. In (d) we can see the original touching component (1), the initial split in (2), and the desired result in (3).


Chapter 9

Keyword Searching for Arabic Handwritten Documents

The advances in digital scanning and electronic storage have driven the digitization of historical documents for the preservation and analysis of cultural heritage. This process makes important knowledge accessible to the wide public, while protecting historical documents from the aging and deterioration caused by frequent handling. These documents are usually stored as collections of images, which complicates searching through them for a specific word or phrase. To utilize the digital availability of these documents, it is essential to develop an indexing and searching mechanism. Currently, indexes are built manually and the search is performed on the scanned pages one by one. Since this procedure is expensive and very time consuming, automation is desirable. One may consider using off-line handwriting recognition to convert these document images into text files. However, the research on off-line handwritten text recognition has been limited to domains with small vocabularies, such as automatic mail sorting and check processing. Historical documents add another level of complexity resulting from lower quality sources due to various aging and degradation factors, such as faded ink, stained paper, dirt, and yellowing.

This chapter deals with searching for a keyword in historical documents written in Arabic script, which is more complex due to cursiveness and the similarity among letters. The results of off-line Arabic text recognizers are still very limited due to the lack of research in this field compared to the Latin scripts. More than 40 million documents have survived the last ten centuries and fortunately are preserved in different libraries around the world. For the processing of these documents, it is essential to develop efficient searching, indexing, and archiving for Arabic documents.

Keyword searching is designed to give users the ability to search for specific words in a given collection of document images automatically, without converting them into their ASCII equivalents.

Word spotting aims to cluster similar words within documents into different classes in order to generate indexes for efficient search.

In this work we have developed a keyword searching system for historical documents in Arabic. The features we use are extracted from the segment angles and lengths of the word-part's simplified contour. We have experimented with two classifiers – HMM and DTW – using the same set of extracted features. In addition, we have slightly modified the DTW algorithm to include different costs for substitution, insertion, and deletion of segments from the compared sequences. The same preprocessing techniques and similar feature sets were used for the two classifiers. The HMM-based system requires many training samples of the keywords, which are generated manually from the processed documents. The DTW-based system uses a simplified representation of the component's contour to construct the feature vector, which is used to compare the word-parts.

In the rest of this chapter we present our approach followed by experimental results. Finally, we draw some conclusions and suggest directions for future work.

9.1 Our Approach

In this chapter we present a novel keyword searching algorithm for handwritten Arabic documents, including historical Arabic manuscripts of reasonable quality. Our algorithm is based on geometric features, which can be used with any feature-based matching technique, such as DTW and HMM. We assume the input documents can be segmented into words and word-parts whose boundary contours are well defined. Next we overview the various stages of this algorithm.

9.1.1 Component Labeling

Prior to the component labeling procedure we horizontally align the baselines of the rows of the input page. Such alignment is achieved by first computing the page's vertical density histogram and then analyzing its standard deviation to determine the optimal points. We then segment the page into lines and calculate the lower and upper baselines, which are used to extract the various components.

Since we are tracing the contour of each component independently, segmenting the line into words is not necessary. The extracted components are classified into main and secondary based on their size and location with respect to the baseline. We use main component to denote the continuous body of a word-part and secondary component to refer to an additional stroke. Each secondary component is associated with a main component.

A main component with its secondary components represents an Arabic word-part, which will be denoted a Meta Component (see Figure 9.1).

Figure 9.1: Meta Components with different numbers of additional strokes.

9.1.2 Simplification

The pixels on a component's contour form a 2D polygon. However, such a representation includes more vertices than required, which often complicates processing and handling these contours. Therefore, we simplify the contour polygon to work with a small number of representative vertices. In each iteration of the simplification process we remove the vertex with the smallest distance from the line passing through its two adjacent neighbors. The process terminates when an error threshold or a satisfactory number of vertices is reached.

Since we are using two major classification approaches that rely on inherently different classification measures, we generate two simplified versions of each contour polygon. For the HMM classifier there is a need to control the number of vertices (points) fed to it. Therefore, the simplified polygon is refined by adding k vertices from the original polygon, which are distributed nearly uniformly between each two consecutive vertices. The point sequence P = [p_1, p_2, ..., p_n] includes all the vertices on the refined polygon. The size of P is determined based on the characters of the keyword and a predefined table that provides an estimate of the number of points required to describe each character. The DTW requires nearly equal-length edges of the contour polygon, i.e., similar distances between consecutive vertices. Since the geodesic distance between the vertices of the simplified model can vary dramatically, we use the short edges and a predefined tolerance value to subdivide the long edges to satisfy the requirements of the DTW.
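A minimal sketch of this simplification loop; the target vertex count and error threshold are illustrative parameters, not values taken from this work.

import numpy as np

def simplify_contour(points, target_vertices=40, error_threshold=0.5):
    """Iteratively simplify a closed contour polygon.

    At each step the vertex with the smallest distance to the line through its
    two neighbors is removed, until either the target vertex count is reached
    or the smallest such distance exceeds the error threshold.
    """
    pts = [np.asarray(p, dtype=float) for p in points]

    def point_line_distance(p, a, b):
        ab = b - a
        ap = p - a
        denom = np.linalg.norm(ab)
        if denom == 0:
            return np.linalg.norm(ap)
        # Perpendicular distance via the 2D cross product magnitude.
        return abs(ab[0] * ap[1] - ab[1] * ap[0]) / denom

    while len(pts) > target_vertices:
        n = len(pts)
        dists = [point_line_distance(pts[i], pts[i - 1], pts[(i + 1) % n])
                 for i in range(n)]
        i_min = int(np.argmin(dists))
        if dists[i_min] > error_threshold:
            break
        del pts[i_min]
    return np.array(pts)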


Figure 9.2: Horizontal and Vertical Density histograms on top of a simplified word-part.

9.1.3 Feature Extraction

In machine learning, feature vectors are used to generate observation sequences. In this work we have adopted a set of features that provides good recognition rates for on-line Arabic handwriting [26]. In addition, we have developed several features that capture the special structure of the Arabic script. These features capture local, semi-global, and global behaviors.

• The angle α_i, which is the angle between the two vectors (p_{i−1}, p_i) and (p_i, p_{i+1}).

• The length of the vector (p_i, p_{i+1}).

• The angle β_i, which is the angle between the vector (p_i, p_{i+1}) and (p_j, p_{j+1}), where p_j and p_{j+1} are consecutive vertices in the simplified polygon and the vertex p_i was inserted between them by the refining process.

• Loops number: the number of loops found in the component.

In this work we have used different subsets of the mentioned features for the HMM-based and the DTW-based classifiers. Our HMM classifier uses the α and β features. The contour length and number of loops were ignored since they do not behave consistently across handwriting styles. In the DTW classifier we have used the α and length features.
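A small sketch of the two local features used by the DTW classifier (the α angles and segment lengths), computed from a simplified contour polygon given as an array of vertices; the β feature and the loop count are omitted.

import numpy as np

def polygon_features(points):
    """Compute alpha angles and segment lengths for a simplified contour.

    `points` is an (n, 2) array of polygon vertices. alpha[i] is the signed
    angle between the vectors (p[i-1], p[i]) and (p[i], p[i+1]); length[i] is
    the length of the segment (p[i], p[i+1]).
    """
    pts = np.asarray(points, dtype=float)
    prev_vec = pts[1:-1] - pts[:-2]   # vectors (p[i-1], p[i])
    next_vec = pts[2:] - pts[1:-1]    # vectors (p[i], p[i+1])

    dot = (prev_vec * next_vec).sum(axis=1)
    cross = prev_vec[:, 0] * next_vec[:, 1] - prev_vec[:, 1] * next_vec[:, 0]
    alpha = np.arctan2(cross, dot)    # signed turning angle at each interior vertex

    length = np.linalg.norm(pts[1:] - pts[:-1], axis=1)
    return alpha, length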

9.2 Matching

Matching algorithms form the core of any search algorithm.

Keyword search relies on a matching technique to determine the similarity between word images. In general, these matching techniques can be categorized into pixel-based and feature-based approaches. The pixel-based approaches compare pixels or blocks from the two images. The feature-based techniques extract a set of features from the two images and compare them. In this system we use a feature-based technique as it provides flexible comparison, which is essential for handling varying handwriting styles.

In this research we avoid segmenting words into letters and consider the continuous word-part as the basic alphabet of the Arabic language. As a result, the search for a given keyword is performed by searching for its word-parts in the right order. For that reason, the basic matching procedure compares word-parts, i.e., computes the match between two word-parts. We have embedded our feature set into two well known matching schemes: the Hidden Markov Model (HMM) framework and the slightly modified version of the classic DTW presented in Section 6.1.5.

We have manually extracted different occurrences of the word-parts of the keyword from the searched document. The basic postulation is that the extracted occurrences capture the different shapes of each word-part according to the document's writing style. The extracted word-parts are used to produce feature vectors (as explained in Section 9.1.3), which are used to train the HMM system. The number of states is determined by the letters in each word-part of the keyword according to a predefined table. The search for a keyword is performed by searching for its word-parts, which are later combined into words (the keywords). For each processed word-part an observation sequence is generated and fed to the trained HMM system to determine its proximity to each of the keyword's word-parts. This approach is suitable for large documents authored by the same writer, as is the case for many large historical Arabic manuscripts.

Figure 9.3: The deletion of similar segments can lead to a different word-part, which illustrates the need for different costs for insertion, deletion, and substitution.

We have developed four searching schemes using the DTW algorithm.

These schemes differ in the way we generate the keyword from its textual description and the searched document.

In the first scheme we manually extract the keyword's word-parts from the document. Then we search for the extracted word-parts in the input document. Note that extracting the word-parts of the keyword does not necessarily require locating the keyword itself. Extracting multiple shapes of the same keyword has yielded better results. Since word spotting generates the index for all the words in a document, this scheme is appropriate for word spotting without the need to extract word-parts manually.

In the second scheme a human operator generates several versions of the keyword by extracting letter shapes from the document and assembling them into word-parts. Then the generated keyword shapes are used to search the document images.

The third scheme automatically generates multiple shapes of the keyword using predefined fonts and handwritten templates. The generated shapes are used to search the documents for the keyword. The best matches among the located keywords are used to extend the keyword shapes for future searches (within the same session). The process proceeds accumulatively until all the appearances of the keyword are located.

In the fourth scheme a human operator mimics the document handwriting by following the shapes of letters in the input document using a digitizing device, such as a Tablet-PC. Generating multiple samples for the keyword has improved the matching results.

It is important to note that there is no need for a prototype database in the first, second, and fourth schemes. In the third scheme we maintain a dictionary that includes the predefined handwritten templates of word-parts, using a variety of common handwriting styles.

The match between the shapes of two word-parts is estimated by computing the feature vectors, mentioned in Section 9.1.3, for each word-part and using the DTW framework presented in Section 6.1.5.

9.2.1 Pruning

Since matching algorithms are usually very expensive, a pruning step is necessary to avoid comparing word images that are very different from the keyword image. The compared word-parts are normalized according to the average height of the document's word-parts. In this work we perform pruning using the contour properties and the density histograms. The ratio between the width and the contour length of each compared word-part is computed, and a pre-computed ratio is used to prune word-parts with large ratios.

The horizontal and vertical density histograms are computed for the two compared word-parts. We then calculate the sum of the squared differences between the two horizontal density histograms and between the two vertical density histograms, separately. An experimentally determined threshold is used to eliminate irrelevant word-parts (see Figure 9.4).
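A hedged sketch of such a pruning filter; the shape-ratio test uses the foreground perimeter as a stand-in for the contour length, and both thresholds are illustrative, since they are determined experimentally in this work.

import numpy as np

def prune_candidate(template, candidate, ratio_tolerance=0.3, hist_threshold=0.15):
    """Cheap pruning tests applied before the expensive matching step.

    Both inputs are binary word-part images normalized to the same height.
    Returns True when the candidate should be discarded.
    """
    def shape_ratio(img):
        # Width divided by a rough contour-length estimate (boundary pixel count).
        fg = img.astype(bool)
        padded = np.pad(fg, 1)
        interior = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                    & padded[1:-1, :-2] & padded[1:-1, 2:])
        perimeter = max(int((fg & ~interior).sum()), 1)
        return img.shape[1] / perimeter

    if abs(shape_ratio(template) - shape_ratio(candidate)) > ratio_tolerance:
        return True

    def density(img, axis):
        # Normalized density histogram, resampled to a fixed length so that
        # differently sized images are comparable.
        h = img.sum(axis=axis).astype(float)
        h /= max(h.sum(), 1.0)
        return np.interp(np.linspace(0, 1, 64), np.linspace(0, 1, len(h)), h)

    for axis in (0, 1):
        diff = density(template, axis) - density(candidate, axis)
        if np.sum(diff ** 2) > hist_threshold:
            return True
    return False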

Figure 9.4: The columns (c) and (g) show the similarity of the density histograms of the same word-parts.

9.2.2 Rule-based system

The system treats each word-part as a meta component – one main component and its associated secondary components. Recall that the secondary components represent additional strokes associated with the word-part represented by the main component. The shape of an additional stroke can be a dot, a detached vertical segment, or a small curve (usually similar to " s " or " ˜ "). Additional strokes can appear above or below word-parts. The additional dots are associated in groups that include one, two, or three dots. Our rule-based system utilizes the number, shape, and position of the additional strokes to prune irrelevant words and verify the match between two word-parts. In addition, the rule-based system determines whether a located set of word-parts can be combined into one keyword or not. Recall that our four searching schemes (described above) use an accumulative search process, i.e., quality results from one iteration are used as the sample set for the next one. The rule-based system also determines the samples for the next iteration by using the match probabilities of the results to determine their quality.

9.3 Experimental results

We have performed several experiments to test our system. We have used a dataset of 40 pages written in Arabic that includes more than 8000 words and more than 15000 word-parts. These pages are classified into three groups – printed, handwritten, and historical documents – each including five documents.

The printed documents are in different fonts and the handwritten documents were written by different writers. Each of the printed and handwritten documents includes two pages, and each of the historical documents is composed of four pages.

A two-phase process has been used to complete the search task. In the first phase the system recognizes the main bodies of the word-parts and the additional strokes separately. The second phase combines the recognized word-parts and the additional strokes into keywords using the rule-based part of the system. We have run experiments using the four searching schemes. In order to highlight the insufficient-training problem, we have used two training sets of different sizes – small (S) and large (L). It is important to notice that the results we present consider word-parts, since the focus is on the matching algorithm and the geometric features. In addition, the four search schemes use accumulative search.

Figure 9.5: The results from the first (a) and the fourth (b) search schemes; the final results, using the accumulative process, are shown in (c) and (d).

Table 9.1 shows samples of our results. The 5th and 6th columns show that the HMM-based system is highly dependent on the size of the training set.

                      DTW Results             HMM Results
Data       Sc      Small        Large       Small    Large
Printed    1       86(+2)%      88(+6)%     76%      86%
           2       86(+2)%      87(+7)%     72%      83%
           3       88(+0)%      89(+1)%     60%      71%
           4       88(+0)%      90(+2)%     62%      85%
HWR        1       82(+6)%      86(+6)%     70%      76%
           2       78(+8)%      86(+6)%     60%      74%
           3       79(+6)%      84(+7)%     60%      71%
           4       76(+6)%      84(+6)%     41%      56%
History    1       80(+2)%      81(+5)%     61%      66%
           2       80(+0)%      81(+5)%     46%      63%
           3       76(+6)%      79(+8)%     39%      52%
           4       74(+6)%      76(+9)%     34%      44%

Table 9.1: Results of the DTW and HMM classifiers. The improvements achieved by our modification are shown by the numbers inside the parentheses.

It also shows that the more variation we use to train the system, the better the recognition rates we achieve. The 3rd and 4th columns show that in general the DTW provides better results and is less sensitive to the size of the training set. The DTW classifier's performance deteriorates only slightly when using a small training set. It is also able to find close variations of a given word-part better than the HMM. The results are excellent for the handwritten (HWR) and printed documents and very good for the historical documents.

Since it is not always possible to provide enough samples to train a probabilistic classifier, our experimental results suggest that it is better to use DTW rather than HMM for keyword searching in Arabic historical documents.

The CEDARABIC system [102] is a well known system for Arabic word spotting. Our system differs from CEDARABIC in several ways. CEDARABIC spots the entire word as one component, while our system searches for word-parts without additional strokes. As a result, our system deals with a much smaller dictionary that includes only the word-parts without the additional strokes. The CEDARABIC system accepts the spotted words in English characters, which are used to guide the search for the appropriate Arabic prototype. In contrast, our system accepts the search words directly in Arabic (handwritten and printed), which are used to automatically generate the prototypes for searching. Our system relies on local features to perform the DTW-based search, while the CEDARABIC system uses a correlation similarity measure based on global word shape features.


Chapter 10

Word Spotting using Chamfer Distance and Dynamic Time Warping

In this work we concentrate on historical Arabic documents, since this collection is very large and has attracted only a modest amount of research attention. Between the seventh and fifteenth centuries a huge number of documents were written in Arabic on various subjects, ranging from science and philosophy to individuals' diaries. More than seven million titles have survived the years and are currently available in museums, libraries, and private collections around the world. Several projects have been initiated in recent years aimed at digitizing historical Arabic documents – at Al-Azhar University, the Alexandria library, and the Qatar heritage library. These projects demonstrate the importance of and the need for developing efficient and accurate algorithms for indexing and searching within document images. Currently, such indexes are built manually, which is a tedious, expensive, and very time-consuming task. Therefore, automating this task using word spotting and keyword searching algorithms is highly desirable.

In this chapter we introduce a word-spotting algorithm for handwritten documents, including historical Arabic manuscripts, using a novel approach for matching word images. We assume the input for the proposed algorithm is a collection of binary images of handwritten text of reasonable quality. This assumption is not made to boil the problem down to the simple case, but to work in accordance with the fact that a large number of Arabic manuscripts can be converted into the required quality using state-of-the-art binarization techniques. After binarization, the process starts by extracting the connected components and text lines. The components in each line are collected and classified into main and secondary subsets, where the main components describe the continuous part of a word/word-part and the secondary components include delayed strokes, such as dots, diacritics, and additional strokes.

Our current implementation relies only on clustering the images of the main components. The ordering of the word-parts along a line is used to generate words from the identified word-parts.

A slightly modified version of the Chamfer Distance is used to measure the similarity between two slices of images. Generally, we may consider the Chamfer Distance a suitable method for matching images of complete word-parts against each other. However, this approach may fail due to the non-linear behavior that frequently occurs in handwritten scripts. In our approach, we use the Chamfer Distance strengthened by geometric gradient features extracted from the contour polyline. These features are then used to measure similarities between vertical slices subdividing the image of a word-part. This matching measure, applied to these slices, is used as a cost function for a DTW-based process to measure the total similarity between the compared images.

10.1 Our Approach

Word spotting algorithms are usually based on clustering similar images, where the clusters are used to generate indexes of word/word-part images to be used efficiently in future search tasks. In this work, we extract images of Arabic word-parts from text block images and mutually match them against each other. The distance between these word-parts is used to classify them into various clusters, where each cluster represents an Arabic word-part, as shown in Figure 10.1. A human operator then assigns a textual representation to the resulting clusters. Our matching algorithm is based on DTW and a modified Chamfer Distance that includes geometric gradient features. Here we give a detailed description of each phase of the proposed algorithm.

10.1.1 Line Extraction and Component Labeling

To extract lines from the text and label the components in the correct order we use the algorithm presented in Chapter 8, based on the seam carving approach for content-aware image resizing [20]. The algorithm uses the signed distance transform to generate an energy map, where extreme points (minima/maxima) indicate the layout of text lines. Dynamic programming is then used to compute the minimum energy left-to-right paths (seams), which pass along the "middle" of the text lines. Each path intersects a set of components, which determine the extracted text line and estimate its height. Unassigned components that fall within a text line region are added to the component list of that text line. The components between two consecutive text lines are processed when the two lines are extracted. The algorithm assigns components to the closest text line, which is estimated based on the attributes of the extracted lines as well as the sizes and positions of the components.

The resulting images – each representing a word-part – are used in the matching process.


Figure 10.1: This figure depicts the spotting process, starting at the top-left with the binary image and ending at the bottom-left with the clusters of spotted words.

10.1.2 Computing the Similarity Distance

The Chamfer matching technique evaluates the similarity distance between a template image, I_t, and a candidate input image, I_i, by overlaying the edge map of I_i on the Distance Transform Map (DTM) of I_t and measuring the fitness in terms of the DTM values under the matching edge pixels. This distance is usually computed using Equation 10.1, which computes the root mean square average of the DTM values of I_t covered by pixels of the edge map of the input image. The Chamfer matching distance is a simple and effective technique for measuring distances between edges in the two images. However, it does not take into consideration the local behavior of pixels. In the proposed matching algorithm we modify the Chamfer Distance by integrating the difference in the local behavior (neighborhood) between the input edge pixels and the overlaid pixels in the template image (see Figure 10.2).

The idea of the presented approach is to improve the Chamfer Distance to include the difference between the geometric behavior of the compared pixels, in addition to the value of the DTM. Formally, let B_w be a binary image containing the word-part w, and C_w = \{P_i\}_{i=1}^{l} be the contour of the word-part w in B_w, where X(p_i) and Y(p_i) are the x and y coordinates of the pixel P_i in B_w. We assume, without loss of generality, that contours are extracted consistently in a clockwise direction. For an ε > 1 neighborhood of a pixel p_i ∈ C_w, we define α(p_i) to be the angle between the line (p_i, p_{i+ε}) (along the contour) and the x-axis. For each pixel p_j ∈ C_w, where i < j < i + ε, we assign α(p_j) equal to α(p_i). Let DT_w be the DTM of the image B_w and DTC_w be the DTM of the edge model of B_w (the edge model includes only pixels from the contour C_w). To generate the Gradient Edge Map (GEM) we assign to each pixel inside and outside the contour the proper α(p), imitating a dilation process that tracks the closest pixel p that has already been assigned a value (see Figure 10.2). Formally, to generate GEM_w for the given B_w, which has the same size as B_w, we apply Algorithm 1.

Algorithm 1 Generating the Gradient Edge Map of w

for each pixel p_i ∈ C_w do
    GEM_w(X(p_i), Y(p_i)) ← α(p_i)
end for
minval ← 0
do {
    m ← findMinValue in DT_w where m > minval
    for each DT_w(i, j) ≡ m do
        q ← the closest pixel to (i, j) with a value < m
        GEM_w(i, j) ← α(q)
    end for
    minval ← m
} while (GEM_w still has empty cells)

To generate GEM for an input image (w1), we apply the same algorithm by updating only foreground pixels. To calculate the modified Chamfer Distance between two equal-size images Bw1 and Bw2, we generate DTw1, DTw2, GEMw1 and GEMw2. We then overlay Bw1 over DTw2 and sum the values at GEMw2 using Equation 10.1, where k is the number of pixels with foreground values in Bw2, and Vij is defined based on Equation 10.2.

\frac{1}{3}\sqrt{\frac{1}{k}\sum_{i=0}^{n}\sum_{j=0}^{m} V_{ij}^{2}}   (10.1)

V_{ij} = B_{w2}(i, j)\left(DT_{w1}(i, j) + \left(GEM_{w1}(i, j) - GEM_{w2}(i, j)\right)^{2}\right)   (10.2)
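A minimal sketch of the one-directional modified Chamfer distance of Equations 10.1 and 10.2, assuming the gradient edge maps have already been computed with Algorithm 1; the distance transform uses SciPy's Euclidean variant as an illustrative choice, and the exact placement of the squares follows the reconstruction of the equations given above.

import numpy as np
from scipy.ndimage import distance_transform_edt

def modified_chamfer(b_w1, b_w2, gem_w1, gem_w2):
    """One-directional modified Chamfer distance between two equal-size
    binary word-part (or window) images.
    """
    dt_w1 = distance_transform_edt(b_w1 == 0)      # DTM of the template image
    angle_diff = gem_w1 - gem_w2                   # local geometric (angle) difference
    v = b_w2 * (dt_w1 + angle_diff ** 2)           # Equation 10.2
    k = max(int(b_w2.sum()), 1)                    # number of foreground pixels in b_w2
    return np.sqrt((v ** 2).sum() / k) / 3.0       # Equation 10.1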

10.1.3 Matching

Word spotting methods usually rely on a matching algorithm to cluster similar pictorial representations of words. In the proposed system, we use a hybrid scheme that uses the Chamfer Distance and geometric gradient features of pixels to measure the distance between two images. Applying DTW to a series of windows sliding horizontally over the images accommodates the inherent non-linear nature of handwriting. We adopted a holistic approach and avoided segmenting words into letters. The search for a given keyword is performed by determining its word-parts in the right order. Our matching algorithm accepts two binary images, w1 and w2, representing two word-parts, and returns the similarity distance, d(w1, w2), between them. The widths of the two images may vary but they are normalized to the same height h. We define δw to be the width of the sliding window. To compute the similarity distance d(w1, w2) between the two word-parts, we apply the following steps:

1. Compute the Distance Transform DTw1 of the image w1.

2. Compute the Gradient Edge Map GEMw1 and GEMw2 of the word-parts w1 and w2 respectively.

3. Subdivide w1 and GEMw1 into n windows of width δw; i.e., n = width(w1)/δw.

4. Subdivide w2 and GEMw2 into m windows of width δw; i.e., m = width(w2)/δw.

5. Create a matrix D (n × m), where the entry D(i, j) is the similarity distance between the i-th window of w1 and the j-th window of w2 (see details in Section 10.1.2).

6. Apply DTW to the matrix D to find the path with minimum cost from the upper-left entry to the bottom-right one; this is the warping path. The value in the bottom-right entry, D(n, m), normalized by the length of the warping path, is the distance between w1 and w2.

The distance D(w1, w2) = (d(w1, w2) + d(w2, w1))/2 is the similarity distance between the two images w1 and w2.
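A minimal sketch of step 6, assuming the window-to-window cost matrix D has already been filled with the modified Chamfer distances; a standard DTW recurrence is used here, whereas the method above applies a slightly modified variant (Section 6.1.5).

import numpy as np

def window_dtw_distance(cost):
    """DTW over the window-to-window cost matrix D.

    Returns the accumulated cost of the cheapest warping path from D[0, 0] to
    D[n-1, m-1], normalized by the length of that path.
    """
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(acc[i - 1, j])
            if j > 0:
                candidates.append(acc[i, j - 1])
            if i > 0 and j > 0:
                candidates.append(acc[i - 1, j - 1])
            acc[i, j] = cost[i, j] + min(candidates)

    # Recover the warping-path length by walking back along the cheapest steps.
    i, j, steps = n - 1, m - 1, 1
    while i > 0 or j > 0:
        moves = []
        if i > 0:
            moves.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            moves.append((acc[i, j - 1], i, j - 1))
        if i > 0 and j > 0:
            moves.append((acc[i - 1, j - 1], i - 1, j - 1))
        _, i, j = min(moves)
        steps += 1
    return acc[-1, -1] / steps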


Figure 10.2: The first row shows the Gradient Edge Map for the template word-part image of the word-part Ghayr. The second row shows the Gradient Edge Map for the same word-part as an input image.

10.1.4 Dynamic Time Warping

DTW is an algorithm for measuring the similarity between two sequences, which may vary in time or speed. It is suitable for matching sequences with missing information or with non-linear warping, which makes it useful for classifying handwritten words/word-parts. The Chamfer Distance, as described in the previous subsection, has been modified to take the local behavior of pixels into consideration. Another weakness of the Chamfer Distance is its inability to take non-linear behavior (which is frequent in handwriting styles) into consideration. In the presented approach, we discard the idea of using the Chamfer Distance on complete images, and use it only on sliding windows of the same width, as a cost function for a DTW-based method.

For 1D sequences, DTW runs in polynomial time and is usually computed by dynamic programming techniques using Equation 6.2. Here we have converted the 2D images into 1D sequences of n windows of width δw, sliding horizontally along the image from left to right. In our approach, we use non-overlapping sliding windows in both images, which have the same normalized height h. Generally, in this approach we compare the widths W_B1 of B_w1 and W_B2 of B_w2 by computing the ratio R = W_B1/W_B2 between the two images. If R is in the range [0.5, 1.5], then we use the presented approach to measure the distance between w1 and w2; otherwise, we consider them different word-parts.

10.1.5 Clustering Process

Many methods have been used in the literature for clustering elements. Nearest-neighbor-based clustering is one of the simplest known methods; therefore it was the first option to be used in the clustering process. In our process, a newly encountered (candidate) word-part is compared with the already clustered word-parts and added to the cluster that includes the closest word-part (nearest neighbor). If the distance to every clustered word-part exceeds a predefined threshold, the candidate word-part creates a new cluster. During the process, whenever two different clusters become closer (in pairwise distance) than a predefined threshold, they are merged into one cluster. In the next step a human operator approves the results and assigns the text code to each cluster.
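A minimal sketch of this nearest-neighbor clustering with merging; the distance function and both thresholds are assumed to be supplied by the matching stage and tuned experimentally.

def nn_clustering(items, distance, join_threshold, merge_threshold):
    """Nearest-neighbor clustering of word-part images.

    `distance(a, b)` is the symmetric similarity distance D(w1, w2). A new item
    joins the cluster holding its nearest already-clustered item when that
    distance is below `join_threshold`; otherwise it starts a new cluster.
    Clusters whose closest pair falls below `merge_threshold` are merged.
    """
    clusters = []  # list of lists of items
    for item in items:
        best = None
        for ci, cluster in enumerate(clusters):
            d = min(distance(item, member) for member in cluster)
            if best is None or d < best[0]:
                best = (d, ci)
        if best is not None and best[0] <= join_threshold:
            clusters[best[1]].append(item)
        else:
            clusters.append([item])

        # Merge clusters that have drifted close to each other.
        merged = True
        while merged:
            merged = False
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(distance(x, y) for x in clusters[a] for y in clusters[b])
                    if d <= merge_threshold:
                        clusters[a].extend(clusters.pop(b))
                        merged = True
                        break
                if merged:
                    break
    return clusters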

10.2 Experimental results

We have evaluated our system using 100 pages from various documents, with different writing styles, obtained from the Juma'a Al-Majid Center in Dubai. These pages were selected as being of reasonable quality, having been binarized using state-of-the-art techniques. In our experiments, we have concentrated mainly on the quality of the mutual matching results of word-parts when considering the main parts of words (without the additional strokes). We have classified secondary elements – dots, diacritics, and delayed strokes – based on their size and position. The number of different word-parts in all pages analyzed was 29,654. Among them we have selected 60 meaningful word-parts, assigned textual representations, and used these for the performance evaluation. As can be seen in Figure 10.3, similar word-parts with different shapes from the same document were successfully clustered into the same group.


Figure 10.3: Three different clusters of three Arabic word-parts produced by the presented system. The manually assigned word-parts are in red, and the different shapes of the same word-part in each cluster are in black.

The word-part ranking column in Table 10.1 indicates the criterion we have used to classify a word-part to a candidate cluster. In the first row, the value (1) states that the candidate word-parts have been assigned immediately to the closest cluster. The second and third rows (< k) state that if the right cluster was among the k (k is 5 or 10) highest-ranked results, it was considered a successful clustering step.


Word-Part Ranking    Correctly Classified
1                    89.4%
< 5                  95.8%
< 10                 99.3%

Table 10.1: The percentage of correct classifications of the 60 word-parts. The precision is computed manually by dividing the number of correctly clustered word-parts by the total number of clustered word-parts (true + false positives).


Chapter 11

Conclusion and Future Work

Historical document image analysis connects image analysis, pattern recognition, and other fields of computer science with the humanities, such as linguistics and the history of science. As seen in recent years, this field is gaining more interest and attention from both sides. The urgent need for computerized tools for indexing, archiving, and accessing historical documents makes it even more significant. During this research we have focused on analyzing historical documents written in Arabic script. Some of the methods we have developed can easily be used, or adapted to be used, for other languages, but others are unique and specific to the Arabic script. Our developed methods start with preprocessing steps, where we used state-of-the-art techniques for binarization and noise reduction; in the next steps, novel methods for page layout analysis, image matching, and synthetic databases for training and evaluation, and finally two complete systems for word spotting, have been developed. During this research, we have developed two full systems based on Hidden Markov Models and Dynamic Time Warping for recognizing on-line handwritten Arabic, and an additional two full systems based on HMM, DTW, and the Chamfer Distance for keyword searching and spotting in historical Arabic manuscripts.

In this thesis, we start by presenting an efficient approach for generating large datasets of word-parts in order to build a comprehensive database that includes a different shape for each word in the Arabic script. The database is generated using any lexicon that determines the set of words and a set of handwriting fonts that can be generated manually or extracted automatically from a given small dataset of word shapes. The results we have presented show the credibility of the procedure and report an improvement in the recognition rates, which results from the inherent ability of this approach to generate many shapes representing the wide variety of writing styles.

In the next step, we suggest some global features, such as additional strokes, ascenders, descenders, and loops, to be used to reduce the vocabulary size for recognition tasks. In addition, we utilize the existence of common body parts, which are shared by multiple word-parts.

Our experimental results show that performing recognition at the word-part level is practical; it could be used to build a text recognition system using any classification technique and feature set, while keeping the time complexity reasonable.

In the fourth chapter we present an HMM-based system with novel components to provide solutions for most of the inherent difficulties in recognizing Arabic text – letter connectivity, position-dependent letter shaping, and delayed strokes. An evaluation of the system shows that the features and letter models used are adequate for writer-independent handwriting recognition at high rates. Our solution for delayed strokes can also be utilized to recognize texts that include diacritical marks (e.g., French, German, Spanish). A multi-level recognizer for online Arabic handwriting is presented in the next chapter. The multi-level recognition is performed through a series of filters that aim to reduce the search space. At each phase, the number of candidates is reduced. The core of the system is based on modified dynamic time warping, which is followed by a shape context classifier applied to the resulting top k candidates. We have performed several tests on various datasets and received encouraging results.

The last three chapters of this thesis focus on images of Arabic historical documents. In the first of these chapters, we present a language-independent approach for automatic text line extraction. The proposed algorithm computes an energy map of the input text block image and determines the seams that pass across text rows. The crossing seams mark the components that make up the letters and words along the text rows. These seams may not intersect all the components along the text rows, which necessitates assigning (collecting) the unmarked components. The component collection procedure may require parameter adjustments that may differ slightly from one language to another, and mainly depends on the existence of additional strokes – their expected location and size. Our experimental results show that our approach manages to determine the text lines in various documents in different languages with high success rates. In the next chapter, we present keyword searching algorithms for Arabic documents based on HMM, DTW, and geometric features. In the last of these chapters, we present a word-spotting algorithm for historical Arabic documents, which computes the distance between every two word-parts and uses the nearest neighbor approach to distribute the word-parts into disjoint clusters. To compute the distance between two word-parts, our algorithm subdivides each word-part image into equal-sized slices (windows) and uses dynamic time warping to measure the distance between them, while using the modified Chamfer distance as the cost function. Our unoptimized implementation shows promising performance. The experimental results show high correct classification rates using various documents from different periods.

The scope of future work in word spotting includes replacing the component's contour with a representative skeleton that preserves the small features of the Arabic script, such as teeth and closed loops.

We also plan to re-extract the order of the skeleton points and thereby adopt algorithms used for on-line HWR.

For on-line HWR, we plan to increase the system's robustness to handle cases where delayed strokes are written before the completion of a word-part. We also plan to reduce the number of errors using geometric-computation techniques and a more sophisticated post-processing phase. Moreover, we plan to explore sentence-level language modeling to improve word recognition. In both cases (on- and off-line), more consideration will be given in future work to testing and creating new features that preserve the nature of cursive handwriting and can discriminate between similar shapes of different word-parts.


Bibliography

[1] G. Masini, A. Amin, and J. Haton. Recognition of handwritten arabic words and sentences. Proc. of 7th Int. Conference on Pattern Recognition, Canada, pages 1055–1057, 1984.

[2] Haikal El Abed, Monji Kherallah, Volker Märgner, and Adel M. Alimi. Arabic online handwriting recognition competition. In 10th International Conference on Document Analysis and Recognition (ICDAR), page to appear, 2009.

[3] I. S. I. Abuhaiba, S. Datta, and Murray J. J. Holt. Line extraction and stroke ordering of text pages. In ICDAR, page 390, 1995.

[4] V. Märgner and M. Pechwitz. Synthetic data for arabic ocr system development. In Sixth International Conference on Document Analysis and Recognition (ICDAR'01), page 1159, 2001.

[5] Samir Al-Emami and Mike Usher. On-line recognition of handwritten arabic characters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):704–710, 1990.

[6] R. Al-Hajj, C. Mokbel, and L. Likforman-Sulem. Combination of hmm-based classifiers for the recognition of arabic handwritten words. In ICDAR 2007. Ninth International Conference on Document Analysis and Recognition, volume 2, pages 959–963, Sept 2007.

[7] H. A. Al-Muhtaseb, S. A. Mahmoud, and R. S. Qahwaji. Recognition of off-line printed arabic text using hidden markov models. Signal Process., 88(12):2902–2912, 2008.

[8] Y. Al Ohali, M. Cheriet, and C. Suen. Databases for recognition of handwritten arabic cheques. Pattern Recognition, 36(1):111–121, January 2003.

[9] A. T. AL-Taani. An efficient feature extraction algorithm for the recognition of handwritten arabic digits. International Journal of Computational Intelligence, 2(2), 2005.


[10] H. Al-Yousefi and S. Udpa. Recognition of arabic characters. IEEE Trans. Pattern Analysis Machine Intell, 14(8):853–857., 1992.

[11] A. M. Alimi and O. A. Ghorbel. The analysis of error in an on-line recognition system of arabic handwritten characters. In ICDAR '95: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2), page 890, Washington, DC, USA, 1995. IEEE Computer Society.

[12] Adel M. Alimi. An evolutionary neuro-fuzzy approach to recognize on-line arabic handwriting. ICDAR, 00:382, 1997.

[13] Somaya Alma'adeed. Recognition of off-line handwritten arabic words using neural network. In GMAI '06: Proceedings of the Conference on Geometric Modeling and Imaging, pages 141–144, Washington, DC, USA, 2006. IEEE Computer Society.

[14] H. Almuallim and S. Yamaguchi. A method of recognition of arabic cursive handwriting. IEEE Trans. Pattern Analysis Machine Intell, pages 715–722, 1987.

[15] B. Alsallakh and H. Safadi. Arapen: An arabic online handwriting recognition system. In Information and Communication Technologies, 2006. ICTTA '06, volume 1, DOI 10.1109/ICTTA.2006.1684669, pages 1844–1849, April 2006.

[16] S.A. Alshebeili, A.A.F. Nabawi, and S.A. Mahmoud. Arabic character recognition using 1-d slices of the character spectrum. SP, 56(1):59–75, January 1997.

[17] A. Amin. Off-line arabic character recognition: The state of the art. Pattern Recognition, 31(5):517–530, 1998.

[18] A. Amin, A. Kaced, J. Haton, and R. Mohr. Handwritten arabic character recognition by the irac system. In 5th Int. Conf. Pattern Recognition, Miami, FL., pages 729–73, 1980.

[19] A. Amin and J. Mari. Machine recognition and correction of printed arabic text. IEEE Trans. Syst. Man Cybern, 19(5):1300–1306., 1989.

[20] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. ACM Trans. Graph., 26(3):10, 2007.

[21] M. Soleymani Baghshah, S. Bagheri Shouraki, and S. Kasaei. A novel fuzzy approach to recognition of online persian handwriting. In ISDA '05: Proceedings of the 5th International Conference on Intelligent Systems Design and Applications, pages 268–273, Washington, DC, USA, 2005. IEEE Computer Society.

[22] F. Bayadsi, R. Saabni, and J. El-Sana. Segmentation-free online arabic handwriting recognition. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2011.

[23] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:509–522, 2002.

[24] A. Benouareth, A. Ennaji, and M. Sellami. Arabic handwritten word recognition using HMMs with explicit state duration. EURASIP Journal on Advances in Signal Processing, 2008:13, 2008.

[25] F. Biadsy, J. El-Sana, and N. Habash. Online Arabic handwriting recognition using hidden Markov models. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, 2006.

[26] F. Biadsy, R. Saabni, and J. El-Sana. Segmentation-free online Arabic handwriting recognition. International Journal of Pattern Recognition and Artificial Intelligence, to appear, 2011.

[27] Fadi Biadsy. Online Arabic handwriting recognition. M.Sc. thesis, Ben-Gurion University of the Negev, 2005.

[28] Fadi Biadsy, Jihad El-Sana, and Nizar Habash. Online Arabic handwriting recognition using hidden Markov models. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR), France, 2006.

[29] Gunilla Borgefors. Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 10(6):849–865, 1988.

[30] Ali Broumandnia, Jamshid Shanbehzadeh, and M. Nourani. Segmentation of printed Farsi/Arabic words. In AICCSA, pages 761–766, 2007.

[31] E. Bruzzone and M. C. Coffetti. An algorithm for extracting cursive text lines. In ICDAR '99: Proceedings of the Fifth International Conference on Document Analysis and Recognition, page 749. IEEE Computer Society, 1999.

[32] Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel. Script-independent handwritten textlines segmentation using active contours. In ICDAR, pages 446–450, 2009.

[33] B. M. F. Bushofa and M. Spann. Segmentation and recognition of Arabic characters by structural classification. Image and Vision Computing, 15(3):167–179, March 1997.

[34] F. R. Chen, L. D. Wilcox, and D. S. Bloomberg. Word spotting in scanned images using hidden Markov models. In Acoustics, Speech, and Signal Processing (ICASSP-93), 1993 IEEE International Conference on, volume 5, pages 1–4, April 1993.

[35] M. Dehghan, K. Faez, M. Ahmadi, and M. Shridhar. Unconstrained Farsi handwritten word recognition using fuzzy vector quantization and hidden Markov models. Pattern Recognition Letters, 22:209–214, 2001.

[36] M. Dehghan, K. Faez, M. Ahmadi, and M. Shridhar. Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM. Pattern Recognition, 34(5):1057–1065, May 2001.

[37] M. Dehghan, K. Faez, M. Ahmadi, and M. Shridhar. Handwritten Farsi (Arabic) word recognition: a holistic approach using discrete HMM. Pattern Recognition, 34(5):1057–1065, May 2001.

[38] Jean Duong, Myriam Côté, Hubert Emptoz, and Ching Y. Suen. Extraction of text areas in printed document images. In DocEng '01: Proceedings of the 2001 ACM Symposium on Document Engineering, pages 157–165, 2001.

[39] S. El-Emami and M. Usher. On-line recognition of handwritten Arabic characters. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(7):704–710, 1990.

[40] F. El-Khaly and M. Sid-Ahmed. Machine recognition of optically captured machine-printed Arabic text. Pattern Recognition, 23:1207–1214, 1990.

[41] T. El-Sheikh and R. Guindi. Automatic recognition of isolated Arabic characters. Signal Processing, 14(2):177–184, 1988.

[42] Talaat S. El-Sheikh and Ramez M. Guindi. Computer recognition of Arabic cursive scripts. Pattern Recognition, 21(4):293–302, 1988.

[43] Faisal Farooq, Venu Govindaraju, and Michael Perrone. Pre-processing methods for handwritten Arabic documents. In Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), Seoul, Korea, pages 267–271, 2005.

[44] L. Likforman-Sulem and C. Faure. Extracting text lines in handwritten documents by perceptual grouping. In Advances in Handwriting and Drawing: A Multidisciplinary Approach, Europia, Paris, pages 117–135, 1994.

[45] B. Gatos, A. Antonacopoulos, and N. Stamatopoulos. ICDAR2007 handwriting segmentation contest. In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR '07), Curitiba, Brazil, pages 1284–1288, September 2007.

[46] B. Gatos, T. Konidaris, K. Ntzios, I. Pratikakis, and S. J. Perantonis. A segmentation-free approach for keyword search in historical typewritten documents. In ICDAR, 2005.

[47] Andrew Gillies, Erik Erl, John Trenkle, and Steve Schlosser. Arabic text recognition system. In Proceedings of the Symposium on Document Image Understanding Technology, 1999.

[48] L. O'Gorman. The document spectrum for page layout analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 15(11):1162–1173, 1993.

[49] A. Gouda and M. Rashwan. Segmentation of connected Arabic characters using hidden Markov models. In Computational Intelligence for Measurement Systems and Applications, pages 115–119, July 2004.

[50] Venu Govindaraju and Ram K. Krishnamurthy. Holistic handwritten word recognition using temporal features derived from off-line images. Pattern Recogn. Lett., 17(5):537–540, 1996.

[51] Didier Guillevic and Ching Y. Suen. HMM word recognition engine. In Proceedings of the International Conference on Document Analysis and Recognition, page 544, 1997.

[52] Ramin Halavati, Mansour Jamzad, and Mahdieh Soleymani. A novel approach to Persian online handwriting recognition. Transactions on Engineering and Technology, 6, June 2005.

[53] J. He and A. C. Downton. User-assisted archive document image analysis for digital library construction. In ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 498, Washington, DC, USA, 2003. IEEE Computer Society.

[54] J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33(1):133–147, 2000.

[55] J. Hu, S. C. Oh, J. H. Kim, and Y. B. Kwon. Unconstrained handwritten word recognition with interconnected hidden Markov models. In Proceedings of the Third Int. Workshop on Frontiers in Handwriting Recognition, pages 455–560, 1993.

[56] Anil K. Jain and David Maltoni. Handbook of Fingerprint Recognition. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.

[57] K. El-Gowely, O. El-Dessouki, and A. Nazif. Multi-phase recognition of multi-font photoscript Arabic text. In Proc. 10th Int. Conf. on Pattern Recognition, pages 700–702, 1990.

[58] M. S. Khorsheed. Automatic recognition of words in Arabic manuscripts. PhD thesis, University of Cambridge, 2000.

[59] M. S. Khorsheed. Recognising handwritten Arabic manuscripts using a single hidden Markov model. Pattern Recogn. Lett., 24(14):2235–2242, 2003.

[60] M. S. Khorsheed. Offline recognition of omnifont Arabic text using the HMM toolkit (HTK). Pattern Recogn. Lett., 28(12):1563–1571, 2007.

[61] A. L. Koerich, R. Sabourin, and C. Y. Suen. Large vocabulary off-line handwriting recognition: A survey. Pattern Analysis & Applications, 6(2):97–121, June 2003.

[62] Koichi Kise, Akinori Sato, and Motoi Iwata. Segmentation of page images using the area Voronoi diagram. Comput. Vis. Image Underst., 70(3):370–382, 1998.

[63] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G. Viorel Popescu. A line-oriented approach to word spotting in handwritten documents. Pattern Analysis and Applications, 3(2):153–168, June 2000.

[64] S. S. Kuo and O. E. Agazzi. Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models. IEEE Trans. Pattern Anal. Mach. Intell., 16(8):842–848, 1994.

[65] Victor Lavrenko, Toni M. Rath, and R. Manmatha. Holistic word recognition for handwritten historical documents. In DIAL '04: Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04), page 278, Washington, DC, USA, 2004. IEEE Computer Society.

[66] Frank LeBourgeois. Robust multifont OCR system from gray level images. In ICDAR '97: Proceedings of the 4th International Conference on Document Analysis and Recognition, pages 1–5, Washington, DC, USA, 1997. IEEE Computer Society.

[67] J. J. Lee, J. Kim, and J. H. Kim. Data driven design of HMM topology for on-line handwriting recognition. In The 7th International Workshop on Frontiers in Handwriting Recognition, pages 107–121. World Scientific Publishing Company, 2001.

[68] L. Likforman-Sulem, A. Hanimyan, and C. Faure. A Hough based algorithm for extracting text lines in handwritten documents. In Proceedings of the International Conference on Document Analysis and Recognition, volume 2, page 774, 1995.

[69] Liana M. Lorigo and Venu Govindaraju. Offline Arabic handwriting recognition: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:712–724, 2006.

[70] M. Maamouri and A. Bies. Developing an Arabic treebank: methods, guidelines, procedures, and tools, 2004.

[71] S. S. Maddouri and H. Amiri. Combination of local and global vision modelling for Arabic handwritten words recognition. In Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, pages 128–135, 2002.

[72] S. A. Mahmoud. Arabic character recognition using Fourier descriptors and character contour encoding. Pattern Recognition, 27(6):815–824, 1994.

[73] S. Mahmoud. Recognition of writer-independent off-line handwritten Arabic (Indian) numerals using hidden Markov models. Signal Processing, 88(4):844–857, 2008.

[74] J. Makhoul, T. Starner, R. Schwartz, and G. Chou. On-line cursive handwriting recognition using speech recognition methods. In Proceedings of IEEE ICASSP'94, pages V125–V128. IEEE, April 1994.

[75] R. Manmatha and Toni Rath. Indexing handwritten historical documents - recent progress. In Proceedings of the Symposium on Document Image Understanding (SDIUT 03), pages 77–86, 2003.

[76] R. Manmatha and Jamie Rothfeder. A scale space approach for automatically segmenting words from degraded handwritten documents. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(8):1212–1225, 2004.

[77] F. Menasri, N. Vincent, M. Cheriet, and E. Augustin. Shape-based alphabet for off-line Arabic handwriting recognition. In ICDAR '07: Proceedings of the Ninth International Conference on Document Analysis and Recognition, pages 969–973, Washington, DC, USA, 2007. IEEE Computer Society.

[78] N. Mezghani, M. Cheriet, and A. Mitiche. Combination of pruned Kohonen maps for on-line Arabic characters recognition. In ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 900, Washington, DC, USA, 2003. IEEE Computer Society.

[79] N. Mezghani, A. Mitiche, and M. Cheriet. On-line recognition of handwritten Arabic characters using a Kohonen neural network. In Proceedings of the International Workshop on Frontiers in Handwriting Recognition, 2002.

[80] N. Mezghani, A. Mitiche, and M. Cheriet. Bayes classification of online Arabic characters by Gibbs modeling of class conditional densities. IEEE Trans. Pattern Anal. Mach. Intell., 30(7):1121–1131, 2008.

[81] Reza Farrahi Moghaddam, David Rivest-Hénault, and Mohamed Cheriet. Restoration and segmentation of highly degraded characters using a shape-independent level set approach and multi-level classifiers. In ICDAR, pages 828–832, 2009.

[82] Khaled Mostafa, Samir I. Shaheen, Ahmed M. Darwish, and Ibrahim Farag. A novel approach for detecting and correcting segmentation and recognition errors in Arabic OCR systems. In IEA/AIE '99: Proceedings of the 12th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pages 530–539, Secaucus, NJ, USA, 1999. Springer-Verlag New York, Inc.

[83] Deya Motawa, Adnan Amin, and Robert Sabourin. Segmentation of Arabic cursive script. In ICDAR '97: Proceedings of the 4th International Conference on Document Analysis and Recognition, pages 625–628, Washington, DC, USA, 1997. IEEE Computer Society.

[84] Saeed Mozaffari, Karim Faez, Volker Margner, and Haikal El Abed. Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition. Pattern Recognition Letters, 29(6):724–734, 2008.

[85] S. Mozaffari, K. Faez, F. Faradji, M. Ziaratban, and M. Golzan. A comprehensive isolated Farsi/Arabic character database for handwritten OCR research. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, France, pages 385–389, October 2006.

[86] N. Kharma, M. Ahmed, and R. Ward. A new comprehensive database of hand-written Arabic words, numbers and signatures used for OCR testing. In IEEE Canadian Conference on Electrical and Computer Engineering, pages 766–768, 1999.

[87] George Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:38–62, 2000.

[88] G. Nagy, S. C. Seth, and S. D. Stoddard. Document analysis with an expert system. In Proceedings of Pattern Recognition in Practice II, 1985.

[89] Anguelos Nicolaou and Basilis Gatos. Handwritten text line segmentation by shredding text into its lines. In ICDAR '09: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, pages 626–630, Washington, DC, USA, 2009. IEEE Computer Society.

[90] S. C. Oh, J. Y. Ha, and J. H. Kim. Context-dependent search in interconnected hidden Markov model for unconstrained handwriting recognition. Pattern Recogn., 28(11):1693–1704, November 1995.

[91] Mario Pechwitz and Volker Maergner. HMM based approach for handwritten Arabic word recognition using the IFN/ENIT database. In ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 890, Washington, DC, USA, 2003. IEEE Computer Society.

[92] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[93] T. Rath, S. Kane, A. Lehman, Partridge, and R. Manmatha. Indexing for a digital library of George Washington's manuscripts: a study of word matching techniques. CIIR Technical Report, University of Massachusetts Amherst, 2002.

[94] T. Rath, V. Lavrenko, and R. Manmatha. Retrieving historical manuscripts using shape. Technical report, Center for Intelligent Information Retrieval, University of Massachusetts Amherst, 2003.

[95] T. M. Rath and R. Manmatha. Features for word spotting in historical manuscripts. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 218–222, August 2003.

[96] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II-521–II-527, June 2003.

[97] Toni M. Rath, R. Manmatha, and Victor Lavrenko. A search engine for historical manuscript images. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 369–376, 2004.

[98] S. M. Razavi and E. Kabir. A database for online Persian handwriting recognition. In 6th Conference on Intelligent Systems (in Farsi), 2004.

[99] R. G. Casey and E. Lecolinet. Strategies in character segmentation: a survey. In Third International Conference on Document Analysis and Recognition, volume 2, page 1028, 1995.

[100] K. Romeo-Pakker, H. Miled, and Y. Lecourtier. A new approach for Latin/Arabic character segmentation. In Proceedings of the International Conference on Document Analysis and Recognition, volume 2, page 874, 1995.

[101] Jamie L. Rothfeder, Shaolei Feng, and Toni M. Rath. Using corner feature correspondences to rank word images by similarity. Computer Vision and Pattern Recognition Workshop, 3:30, 2003.

[102] S. N. Srihari, H. Srinivasan, P. Babu, and C. Bhole. Handwritten Arabic word spotting using the CEDARABIC document analysis system. In Proc. Symposium on Document Image Understanding (SDIUT 05), College Park, MD, November 2005.

[103] R. Saabni and J. El-Sana. A complete system for indexing Arabic historical documents. Technical report, 2011.

[104] R. Saabni and J. El-Sana. Comprehensive synthetic Arabic database for on/off-line text recognition research. Technical report, 2011.

[105] R. Saabni and J. El-Sana. Word spotting for handwritten documents using chamfer distance and dynamic time warping. In Document Recognition and Retrieval XVIII, San Francisco, CA, USA, Proc. SPIE 7874, 78740J, 2011.

[106] R. Saabni and J. El-Sana. Efficient generation of comprehensive database for Arabic script. In 10th International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, pages 1231–1235, 2009.

[107] Raid Saabni and Jihad El-Sana. Keyword searching for Arabic handwriting. In International Conference on Frontiers in Handwriting Recognition (ICFHR), Montreal, Canada, pages 271–276, August 2008.

[108] Raid Saabni and Jihad El-Sana. Hierarchical on-line Arabic handwriting recognition. In 10th International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, pages 867–871, 2009.

[109] M. Sakkal. The Art of Arabic Calligraphy. Seattle Art Museum Resource Room, display boards, March 1993.

[110] Mehmet Sezgin and Bülent Sankur. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1):146–168, 2004.

[111] Zhixin Shi, Srirangaraj Setlur, and Venu Govindaraju. A steerable directional local profile technique for extraction of handwritten Arabic text lines. In Proceedings of the International Conference on Document Analysis and Recognition, pages 176–180, 2009.

[112] Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson. A fast algorithm for bottom-up document layout analysis. IEEE Trans. Pattern Anal. Mach. Intell., 19(3):273–277, 1997.

[113] S. Nicolas, T. Paquet, and L. Heutte. Text line segmentation in handwritten documents using a production system. In IWFHR '04: Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition, pages 245–250, 2004.

[114] F. Solimanpour, J. Sadri, and C. Y. Suen. Standard databases for recognition of handwritten digits, numerical strings, legal amounts, letters and dates in Farsi language. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR), France, pages 3–7, October 2006.

[115] T. Sari, L. Souici, and M. Sellami. Off-line handwritten Arabic character segmentation algorithm: ACSA. In Eighth International Workshop on Frontiers in Handwriting Recognition, pages 452–457, 2002.

[116] S. Srihari, H. Srinivasan, C. Huang, and S. Shetty. Spotting words in Latin, Devanagari and Arabic scripts. Vivek: Indian Journal of Artificial Intelligence, 16(3):2–9, 2003.

[117] S. N. Srihari, C. Huang, and H. Srinivasan. A search engine for handwritten documents. In Document Recognition and Retrieval XII, San Jose, CA, Society of Photo Instrumentation Engineers (SPIE), pages 66–75, January 2005.

[118] Victor Lavrenko, Toni Rath, and R. Manmatha. A statistical approach to retrieving historical manuscript images. CIIR Technical Report MM-42, 2003.

[119] T. Pavlidis and J. Zhou. Page segmentation by white streams. In 1st Int. Conf. Document Analysis and Recognition (ICDAR), Int. Assoc. Pattern Recognition, pages 945–953, 1991.

[120] Oivind Due Trier, Anil K. Jain, and Torfinn Taxt. Feature extraction methods for character recognition: a survey. Pattern Recognition, 29:641–662, 1994.

[121] Tamás Varga and Horst Bunke. Generation of synthetic training data for an HMM-based handwriting recognition system. In ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, page 618, Washington, DC, USA, 2003. IEEE Computer Society.

[122] Tamás Varga, Daniel Kilchhofer, and Horst Bunke. Template-based synthetic handwriting generation for the training of recognition systems. In Proceedings of the 12th Conference of the International Graphonomics Society, pages 206–211, 2005.

[123] Vladimir Shapiro, Georgi Gluchev, and Vassil Sgurev. Handwritten document image segmentation and analysis. Pattern Recogn. Lett., 14(1):71–78, 1993.

[124] Jue Wang, Chenyu Wu, Ying-Qing Xu, Heung-Yeung Shum, and Liang Ji. Learning-based cursive handwriting synthesis. In IWFHR '02: Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02), page 157, Washington, DC, USA, 2002. IEEE Computer Society.

[125] Kwan Y. Wong, Richard G. Casey, and Friedrich M. Wahl. Document analysis system. IBM Journal of Research and Development, 26(6):647–656, 1982.

[126] www.phoenicia.org. http://phoenicia.org/imgs/evolchar.gif.

[127] Yi Li, Yefeng Zheng, David Doermann, and Stefan Jaeger. Script-independent text line segmentation in freestyle handwritten documents. LAMP-TR-136/CS-TR-4836/UMIACS-TR-2006-51/CFAR-TR-1017, December 2006.

[128] Itay Bar Yosef, Nate Hagbi, Klara Kedem, and Its'hak Dinstein. Line segmentation for degraded handwritten historical documents. In ICDAR, pages 1161–1165, 2009.

[129] J. You, E. Pissaloux, W. P. Zhu, and H. A. Cohen. Efficient image matching: A hierarchical chamfer matching scheme via distributed system. Real-Time Imaging, 1(4):245–259, 1995.

[130] Y. Pu and Z. Shi. A natural learning algorithm based on Hough transform for text lines extraction in handwritten documents. In Proceedings of the Sixth International Workshop on Frontiers in Handwriting Recognition, pages 637–646, 1998.

[131] Abderrazak Zahour, Bruno Taconet, P. Mercy, and Said Ramdane. Arabic hand-written text-line extraction. In ICDAR, pages 281–285, 2001.

[132] Zhixin Shi, Srirangaraj Setlur, and Venu Govindaraju. Text extraction from gray scale historical document images using adaptive local connectivity map. In ICDAR '05: Proceedings of the Eighth International Conference on Document Analysis and Recognition, pages 794–798, Washington, DC, USA, 2005. IEEE Computer Society.

[133] Zhixin Shi and Venu Govindaraju. Line separation for complex document images using fuzzy runlength. In DIAL '04: Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04), page 306, Washington, DC, USA, 2004. IEEE Computer Society.
