LIGATURE RECOGNITION SYSTEM FOR PRINTED URDU SCRIPT USING GENETIC ALGORITHM BASED HIERARCHICAL CLUSTERING

A Thesis Submitted to the Faculty of the Institute of Management Sciences, Peshawar in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

COMPUTER SCIENCE

By NAILA HABIB KHAN

DEPARTMENT OF COMPUTER SCIENCE INSTITUTE OF MANAGEMENT SCIENCES PESHAWAR, PAKISTAN

SESSION 2014-2017

Certificate of Approval

This is to certify that the research work presented in this thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” was conducted by Naila Habib Khan under the supervision of Dr. Awais Adnan, Institute of Management Sciences, Peshawar, Pakistan. No part of this thesis has been submitted anywhere else for any other degree. This thesis is submitted to the Institute of Management Sciences, Peshawar in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the field of Computer Science.

Student Name: Naila Habib Khan Signature: ______

Examination Committee:

a) External Foreign Examiner 1: Dr. Yue Cao, School of Computing and Communications, Lancaster University, UK

Signature: ______

b) External Foreign Examiner 2: Prof. Dr. Ibrahim A. Hameed, Deputy Head of Research and Innovation, Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Norway

Signature: ______

c) External Local Examiner: Dr. Saeeda Naz, Head of Department/Assistant Professor, Govt. Girls Postgraduate College, Abbottabad, Pakistan

Signature: ______

d) Internal Local Examiner: Dr. Imran Ahmed Mughal, Assistant Professor, Institute of Management Sciences, Peshawar, Pakistan

Signature: ______

Supervisor: Dr. Awais Adnan, Assistant Professor, Institute of Management Sciences, Peshawar, Pakistan

Signature: ______

Director: Dr. Muhammad Mohsin Khan, Institute of Management Sciences, Peshawar, Pakistan

Signature: ______

Author’s Declaration

I, Naila Habib Khan, hereby declare that my Ph.D. thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering”, submitted to the Research and Development Department (R&DD), is my own original work. I am aware of the fact that in case my work is found to be plagiarized or not genuine, R&DD has the full authority to cancel my research work and I am liable to penal action.

Naila Habib Khan

July 5, 2019


Plagiarism Undertaking

I solemnly declare that the research work presented in the thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” is solely my research work, with no significant contribution from any other person. Small contributions, wherever taken, have been duly acknowledged, and the complete thesis has been written by me.

I understand the zero-tolerance policy of the HEC and the Institute of Management Sciences, Peshawar, towards plagiarism. Therefore, I, as an author of the above-mentioned thesis, declare that no portion of my thesis has been plagiarized and any material used as a reference has been properly cited.

I understand that if I am found guilty of any form of plagiarism in the above-mentioned thesis even after the award of the Ph.D. degree, the Institute reserves the right to withdraw/revoke my Ph.D. degree, and the HEC and the Institute have the right to publish my name on the HEC/Institute website, on which the names of students who submitted plagiarized theses are placed.

Author’s Signature: ______

Naila Habib Khan

Dedication

This research is dedicated to my beloved parents. They give me strength when I am weak, they never let me fall and hold me up, they see the best that is in me, they are always there for me and stand by me. They have been an inspiration and a blessing for me. I am everything I am because I am loved by them.

Acknowledgments

Firstly, all glory is to Allah Almighty Who blessed me with a strong will and determination to complete this research.

I express my deepest gratitude to my supervisor Dr. Awais Adnan for his kind guidance, constant help and constructive feedback on my research. I especially thank him for his patient support throughout the research phase. I really appreciate his input on my research, as it wouldn’t have been possible without his advice and assistance.

I am also grateful to all the faculty members and colleagues of the Department of Computer Science, Institute of Management Sciences for their support, inspiration and encouragement during my PhD research work. Thank you to my friend, Sadia Basar, for always being there, encouraging me with my work throughout this tough route to a PhD.

Last but not least, enormous thanks to my beloved parents, my dearest sisters, Asma Habib Khan and Nazma Habib Khan, and my dearest brothers, Imran Khan and Asif Khan, whose warm wishes, all-embracing backing, patience and prayers made the completion of this PhD research possible. I owe my gratitude to my elder sister Nazma Habib Khan for her constant advice and motivation. I would also like to thank my family members Iftikhar Anjum and Jane Agna Khan for their immense support.

Abstract

In this dissertation, a method is presented for ligature-based recognition of printed Urdu Nastalique script. The proposed recognition system uses a genetic algorithm based hierarchical clustering approach for the recognition of Urdu ligatures. The overall proposed Urdu ligature recognition system is divided into six phases: pre-processing, segmentation, feature extraction, hierarchical clustering, classification rule generation, and genetic algorithm based optimization and recognition.

In the first phase, the Urdu text line images are read one by one from the dataset. Next, in pre-processing, the images are thresholded and noise is removed. Subsequently, an efficient and effective holistic segmentation algorithm is developed for segmenting the Urdu text lines into their constituent ligatures. The proposed ligature segmentation algorithm is novel in that it is one of the first algorithms that does not use baseline information for ligature segmentation of Urdu script. Next, a unique set of fifteen hand-engineered features is extracted from the segmented ligature images: two geometric features, nine first-order statistical features and four second-order statistical features. To reduce the distribution of data points, the features are hierarchically clustered, and a total of 3645 classification rules are generated using simple IF-THEN statements. Since the rules are at an initial stage, genetic algorithm optimization is used for further refinement of the hierarchical clustering. The proposed genetic algorithm phase consists of population initialization, chromosome encoding, parent selection, crossover, mutation, fitness function and termination stages.

Experiments conducted on the benchmark UPTI dataset for the proposed Urdu Nastalique ligature recognition system yield promising results. The proposed ligature segmentation algorithm achieves an accuracy of 99.86%, whereas the genetic algorithm based hierarchical clustering approach achieves a ligature recognition rate of 96.72%.

Table of Contents

Certificate of Approval ...... ii

Author’s Declaration ...... iv

Plagiarism Undertaking ...... v

Dedication ...... vi

Acknowledgments ...... vii

Abstract ...... viii

List of Figures ...... xiv

List of Tables ...... xvii

List of Abbreviations ...... xviii

Chapter 1. Introduction ...... 1

1.1 Overview...... 1

1.2 Motivation ...... 2

1.3 Problem Statement ...... 2

1.3.1 Problem Description ...... 2

1.4 Goal and Objectives ...... 3

1.5 Research Contributions ...... 3

1.6 Thesis Structure ...... 5

1.7 Summary ...... 6

Chapter 2. Background ...... 7

2.1 History of OCR ...... 7

2.2 Categories of OCR System ...... 9

2.2.1 Input Acquisition Modes ...... 9

2.2.2 Writing Modes...... 10

2.2.3 Font Constraints ...... 11

2.2.4 Script Connectivity ...... 12

2.3 Generic OCR Process ...... 12

2.4 Image Acquisition ...... 13

2.5 Pre-Processing ...... 14

2.5.1 Thresholding ...... 14

2.5.2 Noise Removal ...... 15

2.5.3 Smoothing ...... 15

2.5.4 De-Skewing ...... 15

2.5.5 Thinning ...... 15

2.6 Segmentation ...... 16

2.6.1 Analytical Approach ...... 17

2.6.2 Holistic Approach ...... 18

2.7 Feature Extraction ...... 20

2.7.1 Feature Learning Approach ...... 20

2.7.2 Feature Engineering Approach ...... 21

2.8 Classification and Recognition ...... 22

2.8.1 Traditional Machine Learning ...... 23

2.8.2 Deep Learning ...... 24

2.8.3 Overview of Genetic Algorithm (GA) ...... 25

2.9 Post-Processing ...... 28

2.10 Application Areas of OCR ...... 28

2.10.1 Digital Reformatting ...... 28

2.10.2 Automated Text Translation ...... 29

2.10.3 Text-To-Speech Conversion...... 29

2.10.4 Automatic Number Plate Recognition (ANPR) ...... 29

2.10.5 Static-to-Electronic Media Conversion ...... 29

2.11 Urdu Script Preliminaries ...... 30

2.11.1 Urdu Script, Its History and Relation to Other Cursive Scripts...... 30

2.11.2 Joiner and Non-Joiner Characters ...... 33

2.11.3 Dots and Diacritics ...... 35

2.11.4 Urdu Nastalique OCR Challenges ...... 36

2.12 Summary ...... 39

Chapter 3. Literature Review ...... 40

3.1 Datasets for Urdu OCR ...... 40

3.2 Related Work for Different Categories of Urdu OCR ...... 42

3.3 Related Work for OCR Phases ...... 46

3.3.1 Related Work for Image Acquisition ...... 46

3.3.2 Related Work for Pre-Processing ...... 48

3.3.3 Related Work for Segmentation ...... 51

3.3.4 Related Work for Feature Extraction ...... 54

3.3.5 Related Work for Classification/Recognition ...... 59

3.4 Related Work for Genetic Algorithms ...... 67

3.5 Urdu Digit Recognition Systems ...... 68

3.6 Discussion of Literature Review ...... 70

3.7 Open Problems and Future Directions...... 70

3.8 Summary ...... 71

Chapter 4. Proposed Methodology ...... 72

4.1 System Overview ...... 72

4.2 Pre-Processing ...... 74

4.3 Ligature Segmentation ...... 75

4.3.1 Connected Component Labeling ...... 76

4.3.2 Connected Component Feature Extraction and Separation ...... 76

4.3.3 Connected Component Association ...... 78

4.3.4 Segmented Ligatures ...... 82

4.4 Hand-Engineered Feature Extraction ...... 82

4.4.1 Geometric Features ...... 83

4.4.2 First-Order Statistical Features ...... 84

4.4.3 Second-Order Statistical Features ...... 91

4.5 Hierarchical Clustering ...... 93

4.6 Data Representation Using Classification Rules ...... 99

4.7 Optimization and Recognition ...... 101

4.7.1 Population Initialization ...... 103

4.7.2 Chromosome Encoding ...... 104

4.7.3 Parent Selection ...... 104

4.7.4 Crossover ...... 105

4.7.5 Mutation ...... 108

4.7.6 Fitness Function ...... 109

4.7.7 Survivor Selection ...... 111

4.7.8 Termination ...... 111

4.8 Summary ...... 112

Chapter 5. Experiments and Results ...... 113

5.1 Dataset and Ground-Truth ...... 113

5.2 Ligature Segmentation Results ...... 114

5.2.1 Comparison to Other Ligature Segmentation Algorithms ...... 116

5.3 Feature Extraction Results ...... 117

5.3.1 Space Complexity Comparison to Other Feature Vectors ...... 118

5.3.2 Accuracy and Reliability of the Feature Vector ...... 120

5.4 Hierarchical Clustering and Classification Rules ...... 120

5.5 Genetic Algorithm Results ...... 124

5.5.1 Algorithm Parameters ...... 124

5.5.2 Ligature Recognition Accuracy ...... 125

5.5.3 Convergence ...... 132

5.6 Comparison to Other State-Of-The-Art Classifiers ...... 133

5.7 Theoretical Analysis of Computational Complexity ...... 136

5.8 Comparative Analysis ...... 139

5.9 Shortcomings ...... 141

5.10 Summary ...... 143

Chapter 6. Conclusion and Recommendation ...... 144

6.1 Conclusion ...... 144

6.2 Future Recommendation ...... 145

References ...... 146

Appendix A: Journal Publications ...... 157

Appendix B: Conference Publications ...... 162

List of Figures

Figure 2.1 History of OCR ...... 8
Figure 2.2 Categorization of OCR System ...... 9
Figure 2.3 (a) Online Character Recognition (b) Offline Character Recognition ...... 10
Figure 2.4 Examples of Urdu Isolated Characters, Ligatures and Words ...... 12
Figure 2.5 Generic Optical Character Recognition Process ...... 13
Figure 2.6 (a) Original Image (b) Thresholded Image ...... 15
Figure 2.7 Segmentation Process for a Document Image ...... 16
Figure 2.8 Approaches for Text Segmentation ...... 17
Figure 2.9 Overlapped Ligatures from An Un-Degraded Line Image ‘560’ Taken from UPTI Dataset Given In [33] ...... 18
Figure 2.10 Connected Component Labeling Computed for Un-Degraded Line Image Taken from UPTI Dataset Given In [33] ...... 19
Figure 2.11 Factors Affecting OCR System's Performance ...... 20
Figure 2.12 Basic Terminologies of Genetic Algorithm ...... 27
Figure 2.13 Application Areas of Urdu OCR ...... 28
Figure 2.14 Different Writing Styles for Script [52] ...... 31
Figure 2.15 ...... 32
Figure 2.16 ...... 32
Figure 2.17 Alphabet ...... 33
Figure 2.18 ...... 33
Figure 2.19 Joiner Urdu Characters ...... 34
Figure 2.20 Non-Joiner Urdu Characters ...... 34
Figure 2.21 Characters Associated with Dots ...... 35
Figure 2.22 Some of The Common Aerab Used with Urdu Characters ...... 35
Figure 2.23 Retroflex Consonants and ط Superscript ...... 36
Figure 2.24 Shape of Character ‘Te’ Affected by Its Neighboring Characters ...... 36
Figure 2.25 (a) Diagonality in Urdu (b) Horizontal Baseline in Arabic ...... 37
Figure 2.26 (a) Inter-Ligature Overlapping (b) Intra-Ligature Overlapping ...... 37
Figure 2.27 Placement of Dots at Non-standard Position ...... 38
Figure 2.28 (a) Spacing between Urdu Ligatures (b) Spacing between Arabic Ligatures ...... 38
Figure 2.29 Characters Touching Baseline (Blue), First-Descender Line (Pink) And Second-Descender Line (Green) [59] ...... 39

Figure 3.1 Datasets for Urdu Optical Character Recognition ...... 40
Figure 3.2 Horizontal Word Stretching to Avoid Overlapping [16] ...... 50
Figure 3.3 Contour (Boundary) Extracted for a Ligature [33] ...... 55
Figure 3.4 Digit Training and Testing Sample [109] ...... 69
Figure 4.1 Overview of Proposed Urdu OCR System ...... 74
Figure 4.2 (a) Original Image from UPTI Dataset [33] (b) Thresholded Image Using Otsu's Method ...... 75
Figure 4.3 Block Diagram for Proposed Ligature Segmentation Algorithm ...... 76
Figure 4.4 Vertical Overlap Analysis for Urdu Text-line Images Taken from UPTI Dataset Given in [33] (a) Association Using CXmin (b) Association Using CXmin and CXmax ...... 79
Figure 4.5 Calculating Aspect Ratio for Ligature Image ...... 83
Figure 4.6 Horizontal Projection Profile Computed for a Ligature ...... 85
Figure 4.7 Vertical Projection Profile Computed for a Ligature ...... 85
Figure 4.8 Horizontal Edge Intensity Computed Using the Sobel Method ...... 86
Figure 4.9 Vertical Edge Intensity Computed Using the Sobel Method ...... 87
Figure 4.10 Mean Calculated from Horizontal Histogram of Ligature Image ...... 88
Figure 4.11 Mean Calculated from Vertical Histogram of Ligature Image ...... 88
Figure 4.12 Variance VH Calculated from Horizontal Histogram of Ligature Image ...... 89
Figure 4.13 Variance VV Calculated from Vertical Histogram of Ligature Image ...... 90
Figure 4.14 Kurtosis KH Calculated from Horizontal Histogram of Ligature Image ...... 90
Figure 4.15 Kurtosis KV Calculated from Vertical Histogram of Ligature Image ...... 91
Figure 4.16 Initial Data Points Distribution for Each Feature (F1 to F15) ...... 94
Figure 4.17 Incremental Distribution of Sorted Data Points for Each Feature ...... 97
Figure 4.18 First-order Derivative Distribution of Data Points for Each Feature ...... 98
Figure 4.19 Mean of First-order Derivative Elements for Each Feature ...... 99
Figure 4.20 Specialized Tree Representation ...... 100
Figure 4.21 Block Diagram for Proposed GA Optimization and Recognition ...... 102
Figure 4.22 Chromosome with Permutation Encoding ...... 104
Figure 4.23 Parents Selection for Proposed Genetic Algorithm ...... 105
Figure 4.24 Crossover Operation for Proposed GA ...... 107
Figure 4.25 Mutation Operation for Proposed GA ...... 109
Figure 4.26 Multi-level Column Sorting Process for a Solution ...... 109
Figure 4.27 Survivor Selection Using Elitism ...... 111

Figure 5.1 Ligature Segmentation Results Extracted from Ground-Truth of UPTI Dataset Given In [33] ...... 114
Figure 5.2 (a) Text Line Image Taken from UPTI Dataset Given In [33] (b) Primary Connected Components Extracted Using Proposed Segmentation Algorithm (c) Secondary Connected Components Extracted Using The Proposed Segmentation Algorithm ...... 115
Figure 5.3 Ligatures Segmented Using Proposed Algorithm from Un-Degraded Sentence Image ‘560’ Taken from UPTI Dataset Given In [33] ...... 115
Figure 5.4 Sentence Text Image Taken from [15] and Its Segmented Ligatures Shown In (a) and (b), Respectively, Using The Proposed Algorithm ...... 116
Figure 5.5 Space Complexity Comparison for Proposed Features, Raw Pixel Features As Given in [92] and Autoencoder Features ...... 119
Figure 5.6 Accuracy and Reliability of the Proposed Feature Vectors ...... 120
Figure 5.7 Data Distribution Reduction Using Hierarchical Clustering ...... 121
Figure 5.8 Random Test Data to Evaluate Ligature Recognition Accuracy for Each Chromosome ...... 126
Figure 5.9 Maximum Ligature Recognition Accuracy (%) Achieved for Each Generation ...... 132
Figure 5.10 Convergence Towards a Common Solution ...... 133
Figure 5.11 Computational Complexity of the Proposed Genetic Algorithm ...... 137
Figure 5.12 (a) A Handwritten Urdu Text Image [123] (b) Segmentation Results ...... 142
Figure 5.13 (a) Cursive and Calligraphic Diwani Script [52] (b) Segmentation Results ...... 142

List of Tables

Table 2.1 Genetic Algorithm Terminologies ...... 26
Table 2.2 Comparison of Arabic, Urdu, Persian and Pashto Writing Styles ...... 31
Table 3.1 Summary of Different Categories of Urdu OCR ...... 45
Table 3.2 Summary of Different Image Acquisition Sources Used for Urdu OCR ...... 48
Table 3.3 Summary of Notable Contributions for Different Pre-processing Techniques Used for OCR ...... 51
Table 3.4 Summary of Notable Contributions Employing Explicit and Implicit Segmentation Strategies ...... 52
Table 3.5 Summary of Some Notable Contributions Using Holistic Approach for Text Segmentation ...... 53
Table 3.6 Notable Contributions Using Different Features ...... 58
Table 3.7 Summary of Contributions for Isolated Character Urdu OCR ...... 60
Table 3.8 Summary of Contributions for Cursive Character Urdu OCR ...... 63
Table 3.9 Summary of Contributions for Ligature Based Urdu OCR ...... 66
Table 3.10 Summary of Contributions for Genetic Algorithm Based OCR Systems ...... 68
Table 3.11 Summary of Urdu Numeral Recognition Systems ...... 69
Table 4.1 Feature Vector Selected for Feature Extraction ...... 82
Table 5.1 Results for Proposed Ligature Segmentation Algorithm ...... 115
Table 5.2 Comparison to Other Ligature Segmentation Algorithms ...... 117
Table 5.3 Feature Vector Generated for The First Ten Ligatures Extracted from Image '0.png' Taken from The UPTI Dataset ...... 118
Table 5.4 Space Complexity Comparison in terms of Big-O to Other Feature Vectors ...... 119
Table 5.5 Results for Data Points Distribution Reduction ...... 121
Table 5.6 Parameters for Genetic Algorithm Model ...... 124
Table 5.7 Recognition Accuracy (%) for Each Population ...... 127
Table 5.8 Survivors Selected Using Elitism based on Ligature Recognition Accuracy given in (%) ...... 129
Table 5.9 Computational Complexity Comparison of GA Based Hierarchical Clustering to Other Machine Learning Techniques ...... 139
Table 5.10 Recognition Accuracies for Different Urdu Ligature Recognition Systems ...... 140

List of Abbreviations

ANPR Automatic Number Plate Recognition
APTI Arabic Printed Text Image
BLSTM Bidirectional Long-Short Term Memory
CCL Connected Component Labeling
CLE Centre for Language Engineering
CNN Convolutional Neural Network
CTC Connectionist Temporal Classification
DCT Discrete Cosine Transform
EA Evolutionary Algorithm
EMILLE Enabling Minority Language Engineering
FFNN Feed Forward Neural Network
GA Genetic Algorithm
GBLSTM Gated Bidirectional Long-Short Term Memory
GLCM Gray-Level Co-Occurrence Matrix
GMM Gaussian Mixture Model
GPU Graphical Processing Unit
GSC Gradient, Structural and Concavity
HMM Hidden Markov Model
KL Karhunen Loeve
K-NN K-Nearest Neighbors
LMCA Lettres Mots et Chiffres Arabe
LSTM Long-Short Term Memory
MATLAB Matrix Laboratory
MDLSTM Multi-Dimensional Long Short-Term Memory
MT Machine Translation
NLP Natural Language Processing
NMF Non-Negative Matrix Factorization
NN Neural Network
OCR Optical Character Recognition
PC Personal Computer
PDF Portable Document Format
PDA Personal Digital Assistant
RNN Recurrent Neural Network
SURF Speeded Up Robust Features
SVM Support Vector Machine
UPTI Urdu Printed Text Image


Chapter 1. Introduction

Replication of human functions, such as reading, by machines has been an ancient dream. However, over the past years, the field of machine learning has advanced from a dream into reality. Optical character recognition (OCR) has become one of the most successful applications of artificial intelligence and the pattern recognition field. Optical character recognition deals with the recognition of text obtained by optical means. Over the past few years, numerous commercial optical character recognition applications have appeared in the market, meeting basic user requirements such as digital reformatting, process automation, text entry, automatic cartography, signature identification, automated text-to-speech conversion and ANPR (Automatic Number Plate Recognition).

1.1 Overview

Communication improvement between man and machine has been the leading inspiration for numerous researchers [1]. One fundamental application of Natural Language Processing (NLP) is a text processing system [1, 2]. A text processing system, formally known as Optical Character Recognition, has served numerous benefits in this technology era, including the conversion of century-old literature into a computer understandable format [1]. Over the years, OCR has been predominantly used for producing meaningful outputs from text-based input patterns [3]. Presently, OCR technology has advanced and extended to include several different types of texts and fonts, as well as support for handwritten text recognition. Non-cursive scripts such as German, French and English are comparatively easy to recognize and have therefore seen a lot of research and development for OCR applications [4]. However, text recognition developments still have not excelled for cursive scripts such as Chinese, Korean, Persian, Pashto, Arabic and Urdu. Urdu is the national language of Pakistan, spoken and understood by millions of people across the globe [5, 6]. Hence, the Urdu language and its script hold great significance in Asia and across the Middle East. Nastalique is the standard writing style used for Urdu. The Nastalique calligraphic script [7, 8] is extremely cursive and context-sensitive in nature [6, 9-13]. Urdu script also shares the same level of written complexity with the Arabic, Pashto and Persian scripts [4]. The abundant complexities associated with the Urdu script make it a rarely considered language for OCR [10], limiting its research and development as compared to other scripts [14]. In Urdu, a sentence is composed of three textual components, i.e. words, ligatures and isolated alphabets [15], whereas an English sentence is formed by only two textual components, i.e. words and isolated alphabets (characters). The extra component in Urdu, the ligature, is regarded as a sub-component of a word and can also be considered a sub-word [7]. It is composed of a combination of two or more characters. Due to the extreme cursiveness and overlapping issues, such as inter-ligature overlapping and intra-ligature overlapping, it is extremely challenging to perform segmentation [16]. Higher recognition rates for cursive scripts such as Urdu, Arabic, Pashto, Persian and Sindhi will only be possible when grammatical or contextual information is used. One such idea is to recognize entire words or ligatures from a dictionary without segmenting the text into individual characters.

1.2 Motivation

Cursive text recognition has been an active area of research in the field of computer vision. Urdu informatics, especially OCR, lags behind due to the complexities and segmentation errors associated with its cursive script [17]. This cursive nature produces numerous challenges such as context sensitivity, overlapping, nuqta placement, thickness variation, positioning and diagonality [18]. Recently, efforts have been made to develop an OCR system for Urdu script. However, most of the studies have focused on character based recognition, an analytical approach that requires intensive procedures for character level segmentation. Due to the calligraphic and cursive nature of Urdu script, character based recognition systems are more complex, more challenging and more prone to errors. During the process of character segmentation, the shape of the characters might be deteriorated by segmenting at wrong segmentation points, leading to lower recognition accuracies. One solution to avoid the overhead of character segmentation in Urdu script recognition systems is to use ligatures for recognition. As per the available literature, very few recognition systems exist for ligature level recognition of Urdu script. Proposing an OCR technique that can be used efficiently and successfully to recognize Urdu ligatures will be a valuable addition to Urdu Natural Language Processing.

1.3 Problem Statement

What modifications are needed in OCR so that it can recognize Urdu printed script at ligature level?

1.3.1 Problem Description

At present, most of the Urdu OCR systems use traditional machine learning algorithms, limited vocabularies, isolated characters and complex features. Urdu is written in the extremely cursive Nastalique script, which introduces segmentation issues at character level [15]. Therefore, character level segmentation is a difficult task and generally deteriorates or disfigures the shape of a character, leading to erroneous segmentation. Segmentation may also require more pre-processing, such as skeletonization. In this study, working at ligature level also avoids the challenges associated with character level segmentation. Hence, due to these identified problems, it is required to develop a robust recognition system for connected Urdu script using an algorithm that has low computational overhead.

1.4 Goal and Objectives

The main goal of this research is to develop a robust framework for a printed Urdu Nastalique script Optical Character Recognition (OCR) system that can efficiently recognize documents at ligature level with high accuracy. The goal can be achieved by following the primary objective and its sub-objectives. The primary objective of the proposed research is to extract a distinct set of features from Urdu ligatures and use a state-of-the-art classifier for classification. The sub-objectives of the proposed system are as follows.

1. To use a dataset having about 0.189 million ligatures with more than one million characters for classification, and to use a subset of the data for testing purposes.
2. To apply pre-processing on dataset images and segment text images into ligatures.
3. To identify and extract a unique set of features from Urdu ligatures that are the most applicable, and to develop a feature dataset.
4. To define and apply a machine learning technique on subsets of the feature dataset for the classification and recognition of the ligatures.

1.5 Research Contributions

The main contribution of the proposed research is to explore and experiment with the various processes required for classification and recognition of printed Urdu ligatures. The key contributions of this research are briefly summarized as follows.

• Dataset: Previously, most researchers have opted to work with isolated Urdu characters and very limited datasets. For the proposed research, instead of a mere thousand ligatures, a vast dataset of Urdu ligatures is used to train the classifier. To the best of the author's knowledge, the total number of ligatures used in this research for classification and recognition is one of the highest ever reported.

• Baseline Independent Ligature Segmentation: A connected component labeling (CCL) based ligature segmentation algorithm has been proposed and implemented. Most of the existing segmentation algorithms process the baseline for component separation and/or association. However, there is no single horizontal baseline for Urdu Nastalique script; it comprises multiple sloping baselines due to its diagonal nature. Baseline detection methods are not yet robust and perfect, so using the baseline as an element of a segmentation algorithm is an added complexity. To overcome these issues, the proposed ligature segmentation algorithm does not process the baseline. In comparison to the existing ligature segmentation studies, the proposed algorithm reports one of the highest segmentation accuracies.

• Modified Multi-Level Hierarchical Clustering Using Genetic Algorithm: Both the genetic algorithm and hierarchical clustering are well-defined algorithms and have been reported abundantly in the literature. However, here a modified technique is used that combines the genetic algorithm and hierarchical clustering in such a way that a multi-level sorting approach is used for classification and recognition of the ligatures. Most of the existing techniques are accurate; however, they are slow and character based, i.e. suitable for working on a small number of classes. In comparison, a large number of ligature classes (3645) and a total of 0.189 million ligatures are used in this research. By overcoming the issues of execution speed, this modified approach is in itself one of the major contributions in the Urdu text recognition domain. To the best of the author's knowledge, the proposed approach has achieved a high recognition rate on a benchmark dataset of Urdu text lines.

Research articles have also been extracted from this dissertation. Some of the proposed algorithms, as well as related systems within the same broader field of study, multimedia, have been published as research articles in impact factor international journals and presented at international conferences. The list of published and tentative publications for journals and conferences is given as follows.

List of PhD Publications

Journals
1. Naila Habib Khan and Awais Adnan, “Urdu Optical Character Recognition Systems: Present Contributions and Future Directions,” IEEE Access, vol. 6, Issue 1, pp. 46019-46046, August 2018. (Published, Impact Factor 4.098)
2. Naila Habib Khan, Awais Adnan and Sadia Basar, “Urdu Ligature Recognition Using Multi-Level Agglomerative Hierarchical Clustering,” Cluster Computing, vol. 21, pp. 503-514, March 2018. (Published, Impact Factor 1.601)
3. Naila Habib Khan and Awais Adnan, “Ego-motion Estimation, Concepts, Algorithms and Challenges: An Overview,” Multimedia Tools and Applications, vol. 76, Issue 15, pp. 16581-16603, August 2017. (Published, Impact Factor 1.541)
4. Naila Habib Khan and Awais Adnan, “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering,” International Journal of Computer Vision. (Submitted)

Conference Papers
1. Naila Habib Khan, Awais Adnan and Sadia Basar, “An analysis of off-line and on-line approaches in Urdu character recognition,” in Proceedings of the 15th International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED), Venice, Italy, January 2016, pp. 280-286.
2. Naila Habib Khan, Awais Adnan and Sadia Basar, “Geometric feature extraction from Urdu ligatures,” in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 229-236.
3. Sadia Basar, Awais Adnan, Naila Habib Khan and Shahab Haider, “Color Image Segmentation Using K-Means Classification on RGB Histogram,” in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 257-262.

1.6 Thesis Structure

This chapter identifies the problem statement and specifies the need to develop a ligature based recognition system for printed Urdu script. The organization and structure of the rest of this thesis is as follows. Chapter 2 provides the background knowledge about optical character recognition systems as well as their processes, namely image acquisition, pre-processing, segmentation, feature extraction, classification/recognition and post-processing. Chapter 3 examines the related studies in the field of Urdu OCR; the literature review covers all the stages of an OCR system. Chapter 4 explains the various steps that are required to develop an Urdu ligature recognition system. The first part of the chapter discusses in detail the process of ligature segmentation. The second part of the chapter lists the different geometric and statistical features that are extracted from the segmented ligature images. The third part of the chapter discusses the classification and recognition processes for the proposed OCR system using the genetic algorithm based hierarchical clustering. Chapter 5, experiments and results, first discusses the dataset that has been used to evaluate the proposed ligature recognition system. The results for the major steps, namely segmentation, feature extraction, hierarchical clustering, classification rule generation and the final recognition accuracy computed using the proposed genetic algorithm based hierarchical clustering approach, are provided against the benchmark UPTI dataset. The recognition results are compared to other similar Urdu ligature based recognition systems. Some shortcomings of the proposed ligature recognition system are also discussed. Chapter 6 concludes the thesis by summarizing the overall findings of the proposed research and provides recommendations for the improvement and prospective research and development in the current field of study.

1.7 Summary

In Chapter 1, “Introduction”, an elementary overview has been provided of the existing optical character recognition systems for different languages. The chapter also discusses the motivation and need to develop an Urdu ligature level recognition system. The problem statement, goal and objectives of the proposed research have also been discussed. In the next chapter, in-depth background knowledge is provided for OCR systems, with a primary focus on Urdu OCR, its script and its challenges.


Chapter 2. Background

OCR technology has a rich history, standard procedures and types; it also presents numerous benefits as well as challenges. This chapter gives an in-depth insight into the history of OCR systems, comparing various scripts such as Latin, Arabic and Urdu. The different categories of an OCR system, based on input acquisition modes, writing modes, font constraints and script connectivity, are also discussed. Generally, an OCR system has the following phases: image acquisition, pre-processing, segmentation, feature extraction, classification, recognition and post-processing. All these phases and their sub-phases, if any, are reviewed. Optical character recognition technology holds great significance for computer vision applications, some of which are also explained. Finally, the Urdu language and its alphabet, and its numerous complexities in comparison to the scripts of other OCR systems, are presented.

2.1 History of OCR

The early optical character recognition ideas date back to the technologies that were developed to help the visually impaired. Two famous devices, Tauschek's reading machine and the Fournier Optophone, were developed between 1870 and 1931 to help the blind read [19]. The 1950s saw the invention of Gismo, a machine capable of translating printed text messages into machine codes for computer processing. These devices were also capable of reading text aloud. The device was developed through the efforts of David H. Shepard, a cryptanalyst, and Harvey Cook. Intelligent Machines Research Corporation was the first company to sell these OCR devices. Following this success, the world's first OCR system was developed by David H. Shepard. Standard Oil Company of California used the OCR system for making credit card imprints. Other consumers of the OCR system included Reader's Digest and the telephone company. Between 1954 and 1974, the first portable OCR devices, such as the Optacon, hit the market. These devices were used to scan and digitize postal addresses. Initially, postal number recognition was very weak, but with the advancement of technology it succeeded. OCR technology progressed immensely with the development of passport scanners and price tag scanners during the 1980s. In the late 1980s and early 1990s, some of today's most famous companies in the field of OCR were founded, such as Caere Corporation, Kurzweil Computer Products Inc. and ABBYY. During the period 2000 to 2017, OCR technology developed immensely. Technologies have been introduced that allow online services through web OCR, as well as certain applications enabling real-time

translation of foreign languages on smartphones. Tesseract, a famous OCR engine, was also released by Hewlett-Packard and the University of Nevada, Las Vegas. Different OCR software packages have also been made available online for free by Adobe and Google Drive. Over the past decades, most OCR research and development has been directed toward non-cursive scripts such as Latin. Research for cursive scripts such as Arabic and Urdu started decades later than that for the Latin script (see Figure 2.1).

Figure 2.1 History of OCR

Some of the earliest research towards Arabic OCR can be dated back to 1970, when an OCR was patented to read the basic printed Arabic numerals from a sheet [20]. Nowadays, the technologies and research for Arabic OCR have progressed to more intelligent character recognition systems that support handwritten and cursive scripts [21]. Currently, much commercial software, such as ABBYY, provides support for Arabic. However, the accuracy rates in comparison to Latin text are low. Similarly, OCR for the Urdu script is far behind that of the Latin script as well as the Arabic script. The early systems for Urdu OCR can be traced back to 2003, when a system was developed to recognize basic isolated printed Urdu characters [22]. Over the past decade, research interest in Urdu OCR has increased immensely. However, due to the cursive and context-sensitive nature of its script, it is still lagging behind in printed OCR, and very few developments have been reported for handwritten OCR. Some of the most famous software packages, such as OmniPage, Adobe Acrobat, ABBYY FineReader, Readiris, Power PDF Advanced and Soda PDF, and other commercial OCRs have no or extremely little support for Urdu script. Most of these packages are multilanguage, but

they provide higher accuracies for English; some Asian languages such as Urdu are mostly not well supported because their fonts are missing.

2.2 Categories of OCR System

Typically, optical character recognition systems can be divided into different types based on their characteristics, i.e. input acquisition mode (online or offline), writing mode (handwritten or printed), character connectivity (isolated or cursive) and font constraints (single font or omni font) [23]. The categorization of an OCR system is shown in Figure 2.2.


Figure 2.2 Categorization of OCR System

2.2.1 Input Acquisition Modes

The mode in which input is given to an optical character recognition system can be divided into two types, i.e. online recognition and offline recognition [18, 24, 25]. Online recognition deals with real-time recognition of characters; characters are recognized as the pen movements are received while writing (see Figure 2.3 (a)). Online recognition requires specialized hardware, such as a pen and tablet, to obtain the text input [26, 27]. A concept of digital ink is used, in which a sensor analyzes pen tip movements such as pen up/down. In comparison to offline recognition, it is less complex, since temporal information such as writing order, pen lifts, velocity and speed is readily available. Online OCR systems for the Latin script are available in PDAs, handheld PCs and some of the latest touchscreen mobile phones.


Figure 2.3 (a) Online Character Recognition (b) Offline Character Recognition

Offline recognition deals with recognition of text that has already been converted into a digital image (see Figure 2.3 (b)). The input image is usually the product of scanning through some digital device such as a scanner or a digital camera [18]. Offline character recognition is also sometimes referred to as static recognition [28]. Offline character recognition is a complex process in comparison to online recognition since, in offline recognition, the characters first need to be located. Offline recognition can be associated with both handwritten and printed scripts, whereas online recognition can only be associated with handwritten text.

2.2.2 Writing Modes

A major categorization of OCR systems is based on the mode of text the OCR system will be handling. The form of the text, i.e. printed or handwritten, is known as the mode of text when developing an optical character recognition system [23]. The resources available for OCR can be in different formats, such as typewritten text containing tables, headers, footers, borders, page numbers etc., or handwritten text with variations in writing style across different people. Text recognition may seem like a minor task for human beings; however, for computational machines, it tends to be extremely challenging for both handwritten and printed script. It is challenging for printed text due to the availability of a large number of fonts, while for handwritten text there are numerous possible variations. An optical character recognition system that deals with printed text is sometimes known as a printed character recognition system. Printed text uses different font styles, such as Times New Roman, Arial, Calibri and Courier for the English language. Similarly, for Urdu, the most famous style of writing is Nastalique. Printed character recognition systems are simpler than handwritten character recognition systems. However, printed text may be complex to recognize depending on the quality of the font, the document, and the writing rules of the language under consideration. Printed text can only be offline [29]. An OCR system that deals with handwritten text is sometimes known as a handwritten character recognition system. Handwritten character recognition is an extremely challenging research area in the field of image processing and pattern recognition. Recognition of handwritten text is exceptionally difficult compared to printed/typewritten text. Handwritten text possesses a lot of variation, not only due to the different writing styles of different people but also due to the varying pen movements of the same writer. Even with the latest recognition methods and systems, the recognition of handwritten text remains an extremely challenging task, even for the Latin script. Similarly, for the recognition of Urdu handwritten text, there is still room for a lot of improvement and progress. Few researchers have focused on handwritten text for Urdu optical character recognition. Handwritten character recognition systems can be further divided into two types, i.e. offline handwritten character recognition and online handwritten character recognition [29].

2.2.3 Font Constraints

The geometric features of characters written in one font style may vary to a great extent from characters written in another font. Therefore, the OCR process is highly dependent on the font style. An OCR system that has been developed for one font style may completely fail, or only partially succeed, in recognizing the same text written in another font style. If an OCR system is capable of processing only a single font style, it is known as a single font recognition system. Systems capable of processing and recognizing multiple fonts are called omni-font character recognition systems [30]. Usually, generalized algorithms are used with omni-font systems. To use a different font style, only the training process needs to be performed again. Most of the OCR systems available for Urdu use only the Nastalique calligraphic font. Nastalique is basically a fusion of the Naskh and Taliq writing styles. Mirza Ahmed Jameel, in 1980, computerized 20,000 Nastalique ligatures for the first time, ready to be used in computers. The font was named Noori Nastalique. Over the years, many researchers have created their own versions of the Nastalique calligraphic style, such as Alvi Nastalique, Jameel Noori Nastalique and Faiz Lahori Nastalique. All the Nastalique fonts fulfill the basic characteristics of the Nastalique writing style. However, Nastalique is far more complex than the fonts used for Arabic script [31, 32].

2.2.4 Script Connectivity

Another categorization of OCR systems is based on the use of an isolated or cursive script. Isolated scripts have characters that do not join with each other when written; contrarily, in cursive scripts, neighboring characters in words may join each other and may cause a character to change its shape based on its characteristics and position within the word. An optical character recognition system using a cursive script is sometimes known as intelligent character recognition; if the system operates at word level, it is known as intelligent word recognition. Recognition of cursive text is an active area of research [18]. A new level of complexity is introduced when using cursive scripts with optical character recognition systems. This complexity adds an extra level of segmentation to the recognition process in order to isolate the characters within each word. Due to this added complexity, some languages are adopting segmentation-free approaches. The segmentation-free approach is more commonly known as the holistic approach and attempts to recognize the whole word or sub-word (ligature) without breaking it into constituent characters [6, 33]. Currently, the OCR systems for cursive scripts suffer from segmentation complexities. To achieve higher recognition for general cursive scripts like Urdu, the use of contextual and grammatical information is required. For example, recognizing whole words or sub-words (ligatures) from a dictionary is easier than segmenting individual characters from the text. Words, ligatures and isolated characters of the Urdu script are shown in Figure 2.4.

Figure 2.4 Examples of Urdu Isolated Characters, Ligatures and Words

2.3 Generic OCR Process

Generally, an offline character recognition system may contain a few or all of the following six phases: (1) image acquisition, (2) pre-processing, (3) segmentation, (4) feature extraction, (5) classification and recognition, and (6) post-processing. Figure 2.5 shows the block diagram of a typical character recognition system. The different phases of an OCR system are explained in the sections below.



Figure 2.5 Generic Optical Character Recognition Process

2.4 Image Acquisition

Image acquisition is commonly the first stage of any computer vision system. Image acquisition is the process of acquiring an image in digital form for manipulation by digital computers [24, 34]. There are numerous resources for acquiring images into the computer. The text may be entered into the computer using a tablet and a pen in the online recognition input mode. In offline recognition, the source images can be obtained by scanning printed documents, typewritten documents or handwritten documents, and by capturing a photograph through an attached camera, digital camera or image scanner. Furthermore, for offline recognition, the source images might also be synthetic, i.e. generated without the scanning process. The image can be stored in any specific format, for example JPEG, BMP, PNG etc. Regardless of the source, the quality of the input image plays a vital role in recognition accuracy. If an image has not been acquired properly, then all the later tasks may be affected and the final goal of the recognition system might not be achievable. There are numerous reasons that may affect the overall quality of the input image, such as multiple subsequent copies generated from an original document. Poor printing quality may also make the scanned document noisy. Another major factor that might affect image quality is the font style and its size. Extremely small fonts are more likely to be considered noise and go unrecognized by the OCR system. Punctuation marks, subscripts and superscripts may also introduce complexities in recognition and may be treated as noise in the image if their size is extremely small. The recognition quality may also be affected by the quality of the paper that was used for printing. Heavyweight and smooth papers are relatively easier to process than lightweight and transparent papers. High quality, smooth and noise-free images are more likely to result in better recognition rates. On the other hand, noise affected images are more prone to errors during the recognition process.

2.5 Pre-Processing

Pre-processing involves a series of operations that are carried out on the input image to make it more effective for the later stages of recognition and to improve overall performance [18]. Pre-processing is used to remove any distortions, quality breakdowns or orientation issues that are introduced during the image acquisition phase. A decline in image quality introduces several problems in text analysis. Therefore, the pre-processing phase is extremely significant and plays a huge role in the development of a successful recognition system. There are a number of techniques, such as image thresholding, noise removal, smoothing, de-skewing, skeletonization, image dilation and normalization, that can be used for pre-processing [18]. The selection of techniques depends on the nature and source of the images. The final outcome of the pre-processing phase is a quality image that is suitable for the segmentation phase. Some of the pre-processing techniques that are frequently used in an OCR system are discussed in the sub-sections below.

2.5.1 Thresholding

The process of converting an RGB or grey image to a bi-level image is known as thresholding [5] (see Figure 2.6). It is one of the simplest forms of image segmentation, separating the foreground (actual text) from the background. Thresholding makes the acquired image small, fast and easy to analyze by removing all the unnecessary color information. The acquired image may be in an RGB or indexed format, where each pixel holds certain color information. However, for an OCR system this color information is not needed and therefore must be removed. If the image is converted into a grey scale image, some color information is removed, but the image still contains unnecessary information. The grey scale image is therefore converted into a bi-level image, where each pixel can hold a value of 1 or 0 [35]. Thresholding algorithms can be further divided into two main groups, i.e. global thresholding and local adaptive thresholding. In global thresholding, a single threshold is generated for the entire image. The global threshold is computed by exploiting the grey level intensities of the image histogram. Images that have non-varying backgrounds are considered more feasible for global thresholding, and the implementation of global thresholding is easier than that of local adaptive thresholding. On the other hand, local adaptive thresholding is a far more complex but more intelligent technique compared to global image thresholding. Instead of selecting a single threshold for the entire image, it classifies every single pixel into foreground or background, which works well for images with a varying background. The classification of each pixel is performed by taking into consideration several properties, such as the pixel neighborhood. If the pixel in question is darker than its adjacent neighbors, the pixel is converted into black, and vice-versa. The results of local adaptive thresholding are far more accurate than those of the global thresholding algorithm.

Figure 2.6 (a) Original Image (b) Thresholded Image
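To make the two thresholding families concrete, the following Python sketch binarizes a grey-scale page image with both Otsu's global method and a local adaptive method. It assumes the OpenCV library; the file name, window size and offset are placeholder choices, not values prescribed by this thesis.

    import cv2

    # Read a scanned page as a grey-scale image ("page.png" is a placeholder path).
    grey = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    # Global thresholding: Otsu's method derives one threshold for the entire
    # image from its grey-level histogram.
    _, global_bw = cv2.threshold(grey, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Local adaptive thresholding: each pixel is compared against a weighted
    # mean of its own 35x35 neighborhood, which copes better with varying
    # backgrounds than a single global threshold.
    local_bw = cv2.adaptiveThreshold(grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 35, 10)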

2.5.2 Noise Removal

The acquired images are usually distorted with unwanted elements. The external disturbance that leads to the degradation of an image signal is known as noise. There are many sources of noise, such as bad photocopying or scanning. One of the most common types of noise is salt-and-pepper noise.
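Salt-and-pepper noise in particular is commonly suppressed with a median filter, since the median of a neighborhood discards isolated extreme pixels while preserving character edges. A minimal sketch, again assuming OpenCV and a placeholder file name:

    import cv2

    noisy = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

    # Replace each pixel with the median of its 3x3 neighborhood; isolated
    # white (salt) and black (pepper) specks disappear while strokes survive.
    denoised = cv2.medianBlur(noisy, 3)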

2.5.3 Smoothing

Smoothing is a procedure in which unwanted noise is eliminated from the edges of the image. The morphological operations of erosion and dilation can be used for the purpose of smoothing. Besides erosion and dilation, opening and closing can also be applied for smoothing. The opening morphological operation opens small gaps between objects in an image. The closing morphological operation, on the other hand, works by filling all the small gaps along an object's edges in an image.
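The four morphological operations mentioned above can be sketched as follows, assuming OpenCV, a binarized input image and an illustrative 3x3 structuring element (none of these choices are mandated by the thesis):

    import cv2
    import numpy as np

    bw = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
    kernel = np.ones((3, 3), np.uint8)  # structuring element

    eroded = cv2.erode(bw, kernel)    # shrinks foreground boundaries
    dilated = cv2.dilate(bw, kernel)  # grows foreground boundaries

    # Opening = erosion then dilation: removes small specks and protrusions.
    opened = cv2.morphologyEx(bw, cv2.MORPH_OPEN, kernel)

    # Closing = dilation then erosion: fills small gaps along object edges.
    closed = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)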

2.5.4 De-Skewing

Skewness of a document occurs when the lines of text become tilted. Skewness can be introduced as a result of bad photocopying or scanning, and it leads to numerous problems in segmentation. Hence, the de-skewing process is applied to remove the skewness from an image.
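One common de-skewing recipe (a sketch only, not a method adopted by this thesis) fits a minimum-area rectangle around the foreground pixels of a binarized page and rotates the page by the estimated angle. Note that the angle convention of minAreaRect differs across OpenCV versions; the mapping below follows the classic [-90, 0) convention.

    import cv2
    import numpy as np

    bw = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

    # Coordinates of all foreground (text) pixels.
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)

    # Angle of the tightest rotated rectangle enclosing the text.
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle

    # Rotate the page around its centre to undo the skew.
    h, w = bw.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(bw, M, (w, h), flags=cv2.INTER_NEAREST)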

2.5.5 Thinning

Thinning, also known as skeletonization, is a process of deleting the dark points along the edges of an object in an image. Thinning is performed until the object in the image is reduced to a thin line. The final thinned object is one pixel wide and is henceforth known as the skeleton. Thinning is a very important step of a recognition system and has many advantages. The skeleton of the text can be used to extract features like loops, holes, branch points etc. Thinning also reduces the amount of data to be handled.
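Skeletonization is available off the shelf, for example in scikit-image; a minimal sketch with a placeholder file name:

    import cv2
    from skimage.morphology import skeletonize

    bw = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

    # skeletonize expects a boolean image (True = foreground) and iteratively
    # peels boundary pixels until every stroke is one pixel wide.
    skeleton = skeletonize(bw > 0)

    # The resulting skeleton can then be probed for structural features such
    # as loops, holes, branch points and end points.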

2.6 Segmentation

Dividing a source image into sub-components is known as segmentation. In an OCR system, segmentation is used to locate and isolate the text. The segmentation process for a page image can be divided into three levels, as shown in Figure 2.7: page decomposition, line segmentation and text segmentation.


Figure 2.7 Segmentation Process for a Document Image

The page decomposition is known as level 1 segmentation. Page decomposition refers to the separation of textual components from other elements within the source image. A source image may contain different types of elements such as tables, figures, header, footers etc. The initial step in page decomposition is identifying the different elements within the source image and dividing it into rectangular blocks. Next, each block is given a label such as a table, text, figure etc. Once the text has been extracted, further processing can be performed only on the text portions. The next process after page decomposition is the line (level 2) segmentation. One of the most common methods for line segmentation is the horizontal projection profile. In the horizontal projection profile, the projection value is calculated by summing the pixel values along the horizontal direction of the document image. Hence, each value of the horizontal projection profile is associated with the total number of foreground pixels in that row of the document image. Horizontal projection profile remains a natural choice for line segmentation in document images, however, there are other methods too such as smearing, grouping, stochastic methods and Hough transform. The whitespace between text lines can be easily located by exploring the zero height valleys in the horizontal projection profile. When a textual document has segmented into lines using any of the mentioned methods, the next step is to do character, ligature or word segmentation, known as level 3

Text segmentation for an OCR system can be divided into two main types, i.e. the holistic approach and the analytical approach (see Figure 2.8).

Figure 2.8 Approaches for Text Segmentation (holistic approach; analytical approach with explicit or implicit segmentation)

2.6.1 Analytical Approach

The analytical approach refers to the recognition of text by splitting it into characters. The analytical approach can be further categorized into two main types, i.e. explicit segmentation and implicit segmentation.

Explicit segmentation explicitly divides handwritten or printed text into characters. Great success has been achieved when using the explicit approach for character segmentation [16, 17, 22, 35-37]. However, this method is prone to errors and requires extensive knowledge of a character's start and end points: using this start and end information, the characters are isolated and then recognized. Detection of the start and end points is error-prone due to the size variation, complexity and placement of characters.

In implicit segmentation, the text is segmented into a number of small segments based on the component classes of the alphabet; hence, the implicit approach uses a concept of over-segmentation. Implicit segmentation is also referred to as recognition-based segmentation and has been used successfully in several studies [10, 38-41]; in implicit segmentation, the segmentation and recognition processes of OCR are performed in parallel. Implicit segmentation poses a challenge when deciding the total number of segments, and the numerous techniques available may lead to over-segmentation or under-segmentation. Fewer segments lead to efficient computation, but widely written words will not be covered; more segments are computationally more expensive and increase the number of junk segments that must also be modeled by the OCR recognizer [42].

2.6.2 Holistic Approach

When an OCR system recognizes text at the word or ligature level, it is said to use the holistic approach [43]. Over the years the holistic approach has gained immense popularity because it offers an upfront solution by avoiding any character-level segmentation [15, 33, 43-46]. Document image segmentation is one of the most significant tasks in document recognition (printed and handwritten), and the overall accuracy of an optical character recognition system depends heavily on the correct segmentation of the recognition units (character, ligature or word). As stated earlier, the Urdu Nastalique script is highly cursive and context-sensitive in nature, with many overlapping issues. Therefore, a segmentation algorithm is required that is robust enough to handle the complexities associated with Urdu script. The two methods most popular among researchers for ligature segmentation are projection profile based methods and connected component analysis based segmentation methods.

(a) Projection Based Methods

In document image analysis, a projection profile is the histogram of the total number of foreground pixels in the document image; specifically, it is a one-dimensional representation of a two-dimensional image. The values of the histogram present the density distribution of the written script against the background. Usually, a projection profile method for document segmentation works well with text that is typewritten and non-cursive in nature. Projection profile based methods have two main advantages: first, they do not require binarization; second, they are very robust to noise and other degradations. A projection profile can be horizontal or vertical. The vertical projection profile is one of the best-known methods for ligature segmentation; it is calculated by summing the pixel values along the vertical direction of the document image. The valleys in a vertical projection profile correspond to ligature or word gaps. Furthermore, by taking a vertical projection profile of a ligature or word, it is possible to segment it into individual character forms. Vertical projection profile based segmentation is a complicated process for the Urdu Nastalique script due to its overlapping nature [47]; inter-ligature overlapping may lead to wrong segmentation points when performing ligature segmentation (see Figure 2.9).

Figure 2.9 Overlapped Ligatures from An Un-Degraded Line Image ‘560’ Taken from UPTI Dataset Given In [33]

On the other hand, intra-ligature overlapping leads to wrong segmentation points during character segmentation. The vertical projection profile algorithm for segmenting ligatures from a thresholded text line image is described as follows (a sketch of this procedure is given after the step list):

• Step 1: Generate a histogram for every column of the input text line image.
• Step 2: Find the valleys having zero height.
• Step 3: Take the valleys as segmentation points to extract words and ligatures.
• Step 4: Repeat Step 3 until the end of the text line.
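The sketch below follows the four steps directly, assuming a thresholded line image with foreground pixels equal to 1; columns whose sum is zero are the zero-height valleys taken as segmentation points. As noted above, it fails wherever inter-ligature overlap leaves no zero-height valley.

import numpy as np

def segment_ligatures(line: np.ndarray) -> list[np.ndarray]:
    profile = line.sum(axis=0)                   # Step 1: histogram per column
    pieces, start = [], None
    for col, value in enumerate(profile):        # Step 2: find zero-height valleys
        if value > 0 and start is None:
            start = col
        elif value == 0 and start is not None:
            pieces.append(line[:, start:col])    # Step 3: cut at the valley
            start = None
    if start is not None:                        # Step 4: continue to line end
        pieces.append(line[:, start:])
    return pieces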

(b) Connected Component Labeling (CCL) Based Methods

Vertical projection profile based methods are most appropriate for segmenting text images in which the characters, ligatures or words are well separated along the columns. In image processing, connected component labeling based algorithms can be used efficiently for ligature segmentation, since they solve the problem of inter-ligature overlapping. In an image, a connected component is a set of pixels that forms a connected group. Connected component labeling refers to the identification of all connected components in an image and the assignment of a unique label to each one. Connected component labeling scans an image and groups its pixels into components based on pixel connectivity, which can be 4- or 8-connectivity. The CCL process scans an image from top to bottom and left to right, pixel by pixel, in order to find the connected groups and assign each of them a unique label (see Figure 2.10). Once the connected components have been labeled, they can be extracted from the image. CCL based segmentation methods have great advantages for Arabic-like cursive scripts; however, there are also a few disadvantages associated with them. Recognition complexity is added for Urdu and other Arabic-like cursive scripts because CCL based segmentation methods separate the primary components from their secondary components (dots/diacritics); these primary and secondary components have to be reassembled to preserve the ligature shape. The CCL method for locating and labeling the connected components in a binary image is described below (a sketch follows Figure 2.10):

• Step 1: Search the image for the next unlabeled pixel (p).
• Step 2: Label all the pixels in the connected component containing (p).
• Step 3: Repeat Steps 1 and 2 until all the pixels are labeled.

Figure 2.10 Connected Component Labeling Computed for Un-Degraded Line Image Taken from UPTI Dataset Given In [33]
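A minimal CCL sketch using SciPy is given below; the 8-connectivity structuring element and the masking of each component's bounding box are illustrative choices. Note that, as discussed above, dots and diacritics emerge as separate components and must later be reattached to their primary ligature bodies.

import numpy as np
from scipy.ndimage import label, find_objects

def extract_components(binary: np.ndarray, connectivity: int = 8):
    if connectivity == 8:
        structure = np.ones((3, 3), dtype=int)                   # diagonal neighbors count
    else:
        structure = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-connectivity
    labels, count = label(binary, structure=structure)           # unique label per component
    crops = []
    for i, bbox in enumerate(find_objects(labels), start=1):
        crops.append(labels[bbox] == i)                          # mask out other components
    return crops, count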

2.7 Feature Extraction

When the input to an algorithm is extremely large and redundant for processing, it can be transformed into a reduced set of parameters known as features; collectively, the features are known as a feature vector. A feature itself can simply be defined as "a distinct characteristic or property of an element". After pre-processing and segmentation, a feature extraction technique is required to extract distinct features, followed by classification and an optional post-processing phase when developing an optical character recognition system. The primary goal of the feature extraction phase is to capture the necessary characteristics of all the text elements, i.e. characters or words. Features hold great significance, since they directly affect the efficiency and recognition rate of an OCR system [48]. The total number of features, the quality of the features and the dataset, along with the classification method, all contribute towards an effective OCR system (see Figure 2.11).

Figure 2.11 Factors Affecting OCR System's Performance

Feature extraction is broadly divided into two main categories, i.e. the feature learning approach and the feature engineering approach. If features are automatically identified and extracted, the approach is known as feature learning. When hand-crafted/hand-engineered features are identified and extracted, the approach is known as feature engineering. After feature extraction, there is sometimes a need to obtain a reduced subset of the initial features, a task achieved through the feature selection process. The feature extraction approaches are discussed below.

2.7.1 Feature Learning Approach

Feature learning generates a large number of features that may improve the overall performance of the classification algorithms. Such systems are particularly valuable when specialized features are needed that cannot be created by hand. Feature learning mostly uses machine learning algorithms that train on several layers of features to learn multi-level representations. Feature learning can be supervised or unsupervised. Supervised feature learning deals with labeled input data, e.g. supervised dictionary learning and neural networks; the labeled data allows the system to learn whenever it fails to produce the correct label. Unsupervised feature learning, on the other hand, deals with unlabeled input data, e.g. independent component analysis, matrix factorization, clustering and auto-encoders.

2.7.2 Feature Engineering Approach

When using hand-crafted features, some parameters need to be considered, such as their quality, quantity, usefulness, distinctiveness and effectiveness. There are numerous features associated with each character of the Urdu alphabet. For an optical character recognition system, it is necessary to adopt techniques that achieve maximum recognition using the simplest and fewest features. Hand-crafted features can be classified into three types [18, 48].

(a) Structural Features

Structural features are related to the topological and/or geometric characteristics of characters, such as loops, wedges, start points, end points, branches, crossing points, horizontal lines, vertical lines, number of endpoints and horizontal curves at the top or bottom. Structural features require knowledge about the structure of the character, i.e. about the strokes and the associated dots that make up the character. In the case of Urdu character recognition, structural feature extraction is extremely difficult, since the shape of a character varies according to its neighborhood.

(b) Statistical Features

Statistical features are related to the distribution of pixels in an image. Some popular methods to extract statistical features are zoning, projection profiles, crossings and distances. Statistical features are easy to detect and are less affected by noise and distortion than structural features. They provide low complexity, high speed and, to some extent, font invariance; they may also be used for dimension reduction of the feature set. Statistical features can be further classified into first-order, second-order and higher-order statistical features. First-order features compute properties of individual pixel values only, such as the average and variance. Second-order and higher-order statistical features compute interactions between two or more pixel values occurring at specific locations relative to each other.

Zoning is a popular statistical feature extraction method: the character image is divided into a pre-defined number of zones and a feature is selected from each zone (a sketch of zoning is given at the end of this subsection). The zones may be overlapping or non-overlapping, and the character strokes in the different zones are analyzed; the image is usually divided into 2x2, 3x3 or 4x4 zones. Counting the number of transitions from background to foreground pixels in a character image is known as crossing; the transitions are computed along vertical and horizontal lines. For the distance features, the distance of the first pixel detected from the lower and upper boundaries is calculated along the horizontal lines of the image. For each character image, vertical and horizontal vectors are generated for each pixel in the background, and the total number of times a character stroke is intersected by any of the vectors is used as a feature.

(c) Global Transformation and Series Expansion

Global transformation and series expansion present an image as a continuous signal that contains more information, and its features can be used for classification. Some well-known features of this type are Fourier transforms, the Gabor filter and transform, wavelets, Zernike moments and the Karhunen-Loeve (KL) expansion. A global transformation first transforms the image representation into a form, i.e. a signal, from which relevant features can easily be extracted.
There are numerous ways to represent a signal; one of them, a linear combination of a series of simpler, smaller signals, is known as series expansion. Some of the most common global transform and series expansion features are discussed below. The Fourier transform feature uses the magnitude spectrum of a measurement vector in an n-dimensional Euclidean space as a feature vector. The Fourier transform holds great significance due to its ability to recognize characters that have shifted position, by observing the magnitude spectrum. The Hough transform, on the other hand, is used to find the parametric curves of characters; it is also used as a technique for baseline detection in text documents. The Gabor transform is a special form of the Fourier transform: a windowed Fourier transform is applied to a character image, where the window size is not discrete but stated by a Gaussian function. Wavelets allow the representation of a signal at different levels of resolution and are hence a series expansion technique. Finally, an eigenvector analysis technique known as the Karhunen-Loeve expansion is used to reduce feature dimensionality by creating new features from linear combinations of the original features. Moment features such as Zernike moments are independent of image size, rotation and translation; Zernike moments are invariant descriptors that describe the overall shape of an object in a compact way using only a small subset of values.
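Returning to the zoning feature described above, the sketch below divides a binary character image into a grid and uses the foreground density of each zone as one feature; the 4x4 grid is an illustrative choice.

import numpy as np

def zoning_features(char: np.ndarray, zones: tuple[int, int] = (4, 4)) -> np.ndarray:
    rows, cols = zones
    h, w = char.shape
    features = np.empty(rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Integer grid boundaries; assumes the image is larger than the grid.
            zone = char[r * h // rows:(r + 1) * h // rows,
                        c * w // cols:(c + 1) * w // cols]
            features[r * cols + c] = zone.mean()  # foreground density in [0, 1]
    return features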

2.8 Classification and Recognition

Classification can be defined as a computational process that sorts images into groups/classes according to their similarities. Classification is a significant application in image retrieval, since it simplifies searching through an image dataset to retrieve images with particular visual content. All classification procedures assume that the image under consideration possesses one or more features (e.g. geometric features) and that each of these features belongs to one of several distinct and exclusive classes. There are two types of classification: in supervised classification, the classes are specified in advance by an analyst, whereas in unsupervised classification the data is automatically clustered into sets of prototype classes and the analyst merely specifies the number of desired categories.

Classification algorithms typically employ two phases of processing: training and testing. In the training phase, the characteristics of typical image features are isolated and, based on them, a unique description of each classification category, i.e. training class, is created. In the subsequent testing phase, these feature-space partitions are used to classify image features. Two related paradigms, traditional machine learning and deep learning, have been on the rise in the research community during the past several years. Deep learning is not new, but it has recently gained attention from the research community. Both concepts are explained below, along with the related studies on Urdu optical character recognition systems.

2.8.1 Traditional Machine Learning

Machine learning is a field of artificial intelligence that aims to mimic the intelligent abilities of humans by machines. Machine learning involves the important questions and procedures needed to make machines capable of learning. It is difficult to define learning precisely, since it covers a broad range of processes; in most dictionaries, the phrases used in its definition are "to gain knowledge", "understanding of" and "to gain some skill by study, instruction, or experience". Some of the well-known traditional machine learning techniques used in character recognition systems are the neural network, support vector machine, k-nearest neighbors, Bayesian classification and decision tree classification.

Machine learning can be divided into two major types, although there are other types of machine learning techniques as well, such as reinforcement learning, semi-supervised learning and learning to learn. Supervised learning involves inferring a function from labeled training data; the work primarily concerns pairs of inputs and their desired outputs. Supervised learning algorithms analyze the training data and produce a function; if the function has been inferred correctly, it can then be used to correctly determine future unseen input instances. Supervised learning is concerned with classification or regression, and its primary goal is to enable the computer to learn a classification system that has been created by users. Character and digit recognition systems are famous examples of classification based learning. Classification based learning is applied to any problem where classification is useful and easy to determine; however, classification is not always supervised, as unsupervised learning may also be used for classification problems. Some renowned supervised learning algorithms are neural networks, naive Bayes, nearest neighbor, regression models, Support Vector Machines (SVMs) and decision trees.

Comparatively, in unsupervised learning the major goal is to enable the computer to learn how to do something without telling it how to do it. Unsupervised learning is much harder than supervised learning, since it involves finding structure in unlabeled data. There are two approaches to unsupervised learning: first, clustering, which includes k-means, mixture models and hierarchical clustering; second, feature extraction techniques, e.g. independent component analysis, non-negative matrix factorization and singular value decomposition. Some of the unsupervised learning algorithms are neural network based approaches for meeting a threshold, partition based clustering, hierarchical clustering, probabilistic clustering and Gaussian Mixture Models (GMMs).
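As an illustration of the unsupervised clustering mentioned above, the sketch below runs agglomerative hierarchical clustering with SciPy on toy feature vectors; the random features, the Ward linkage and the four-cluster cut are illustrative assumptions only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy feature vectors, e.g. statistical features of ligature images.
features = np.random.rand(20, 16)

# Agglomerative clustering: each sample starts as its own cluster and
# the closest pair of clusters is merged repeatedly, building a tree.
tree = linkage(features, method='ward')

# Cut the dendrogram into a fixed number of flat clusters.
cluster_labels = fcluster(tree, t=4, criterion='maxclust')
print(cluster_labels)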

2.8.2 Deep Learning

Deep learning is widely accepted as a recently developed but important part of the broader family of machine learning methods, and deep learning architectures are also used for classification tasks. There are six main differences between traditional machine learning and deep learning methods, explained below.

First, the performance of a classification algorithm depends heavily on the features that have been identified and extracted. Traditional machine learning methods use the feature engineering approach to extract features for classification, whereas deep learning automatically discovers and extracts features from the raw pixels. When hand-crafted features are used for classification, the process is known as feature engineering: domain knowledge is put to use for feature creation and extraction, which may be difficult, expensive and time-consuming. Once the features have been identified, they are hand-coded according to the data type and domain; hand-engineered features can be of different types, such as shape, pixel values, position, texture and orientation. Deep learning, on the other hand, automatically extracts high-level features from the data, so the task of finding and extracting unique features for each problem is set aside. For example, the well-known convolutional neural network learns low-level features such as lines and edges from an object in its early layers, then learns medium-level features, and finally learns the high-level representation.

Second, the performance of deep learning increases as the scale of data increases. For smaller datasets, deep learning algorithms do not perform that well; a large amount of data is required by deep learning algorithms to model the data adequately. Traditional machine learning methods, in contrast, perform well on smaller datasets, and their performance is not much affected by larger datasets; hence, traditional machine learning algorithms reign in this scenario.

Third, traditional machine learning algorithms can work well on low-end machines, whereas a deep learning algorithm requires high-end machines to process and classify the data. A GPU (Graphical Processing Unit), capable of carrying out a large number of matrix multiplication operations, is an essential requirement for deep learning.

Fourth, traditional machine learning algorithms usually break a problem down into sub-parts before solving it; there are usually two steps involved, object detection and recognition. Deep learning, on the other hand, solves the problem without sub-dividing it, providing an end-to-end solution.

Fifth, traditional machine learning algorithms take less time to train, from a few seconds to a few hours, whereas deep learning algorithms comparatively take a long time to train due to their large number of parameters. However, traditional machine learning algorithms such as k-NN may take more time at test time.

The sixth and final difference is interpretability. Deep learning architectures give excellent results, with near-human perfection, but it is not possible to understand why a particular accuracy was achieved: the activated nodes are known and can be found mathematically, but the working of the neurons and their modeling strategy remain unrevealed, so the results cannot be interpreted. Traditional machine learning algorithms, however, allow the results to be interpreted easily and give clear rules about what was chosen and why it was chosen.

2.8.3 Overview of Genetic Algorithm (GA)

Nature has always been a rich source of inspiration for mankind. The Genetic Algorithm (GA) is a metaheuristic belonging to the larger class of Evolutionary Algorithms (EA). Genetic algorithms are based on the theory of evolution and were introduced by Holland in 1975 [49]. A genetic algorithm uses a set of techniques, such as selection, inheritance, mutation and recombination, that is highly inspired by evolutionary biology. It is commonly used to provide solutions to optimization and search related problems, in machine learning and in research. Optimization refers to improving something: it deals with finding the input values that give the best output values. Mathematically, it refers to maximizing or minimizing a function in order to produce an optimal solution by varying the input parameters. The basic terminology of a genetic algorithm is given in Table 2.1.

Table 2.1 Genetic Algorithm Terminologies

Terminology: Detail
Population: A subset of all the potential solutions to a given problem.
Chromosome: A single solution to a given problem.
Gene: A single element position within a chromosome.
Allele: The value of a specific gene in a specific chromosome.
Fitness Function: A function that takes a solution as input and returns a measure of the appropriateness of that solution as output.
Genetic Operators: Operators that alter the genetic composition of the offspring.
Genotype: The population in the computation space.
Phenotype: The population in the actual real world.

In a genetic algorithm, generations evolve, and each generation has a subset of a population. The population used in a genetic algorithm is analogous to a population of human beings, except that candidate solutions take the place of the human beings. Each population is a set of candidate solutions known as chromosomes. These candidate solutions are usually represented using 0s and 1s, although other encoding schemes can also be followed. The chromosomes are altered using biological operators such as crossover, mutation and selection.

The most common method for a genetic algorithm is to create a random group of individuals for a given population. These individuals are then evaluated using an evaluation (fitness) function, usually provided by the programmer, which gives each individual a score reflecting its fitness for the given situation. From the population, the fitter individuals are given a higher priority to mate, in accordance with the Darwinian principle of "survival of the fittest". Usually, the top two individuals are selected and reproduction is carried out using crossover, generating one or more offspring; random mutations are then applied to the offspring. The genetic algorithm continues until an acceptable solution has been derived.

The population represented in the computation space is known as the genotype; in the computation space it is easy to manipulate and understand the solutions using a computing system. The phenotype, on the other hand, is the representation in real-world situations. The phenotype and genotype spaces are usually the same for simple problems, but in most cases the spaces are different, and decoding and encoding processes are involved. Encoding is the process of transforming a solution from the phenotype to the genotype space, while decoding is the process of transforming from the genotype to the phenotype space. The terminology of a GA is shown diagrammatically in Figure 2.12.

Figure 2.12 Basic Terminologies of Genetic Algorithm

A genetic algorithm has the capability to give a "good enough" solution "fast enough", and it has various advantages over traditional artificial intelligence: it is more robust and less prone to breaking down due to variations in the inputs. The genetic algorithm also provides significantly better results than other optimization methods such as linear programming, depth-first or breadth-first search and other heuristics. It is widely used in many fields, such as computer-aided molecular design, automotive design, robotics and engineering design. Genetic algorithms also perform much better than plain random local search algorithms, since they exploit historical information as well. A GA can optimize discrete, continuous and multi-objective functions, and it always produces an answer to the problem, one that keeps getting better over time.

However, GAs are not well suited to problems that are extremely simple. The fitness value is calculated repeatedly, which for some problems can be computationally expensive. The implementation of a GA also holds great significance: if the GA is not implemented correctly, it might not converge towards an optimal solution. On the other hand, there is a large number of NP-hard problems for which even the most powerful computer systems take a very long time to find a solution; in such cases, a GA proves to be an extremely effective tool, providing a good solution in a short amount of time.
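The sketch below implements a minimal generational GA over fixed-length bit strings with elitist selection, single-point crossover and bitwise mutation; the population size, the rates and the toy one-max fitness function are illustrative assumptions, not the configuration used later in this thesis.

import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=100, mutation_rate=0.01):
    # Initial population: random bit-string chromosomes.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # fitter chromosomes first
        next_pop = pop[:2]                        # elitism: keep the best two
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(pop[:pop_size // 2], 2)   # selection
            cut = random.randrange(1, length)                # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                         # bitwise mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Toy usage: maximize the number of 1s (the "one-max" problem).
best = genetic_algorithm(fitness=sum)
print(best, sum(best))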

2.9 Post-Processing

Post-processing is the final phase of an optical character recognition process; it involves the tasks that aim to improve or correct the classification/recognition output of the system. The chosen classifier might not produce accurate results for an image, so post-processing may be required. It may include different processes such as grammar correction, spell-checking, text-to-speech conversion and improving the overall recognition rate and output.

2.10 Application Areas of OCR

Urdu script has a rich historical background. Pakistan, India and Bangladesh are a few of the countries where Urdu is widely spoken, understood and written. Due to the popularity of the Urdu language at the verbal and written level, and its massive hardcopy literature, abundant efforts have been directed towards Urdu OCR systems. A fully functional and efficient Urdu OCR system can have applications in various fields. Some of the most renowned and emerging application areas of Urdu OCR are digital reformatting, automated text translation, text-to-speech conversion, Automated Number Plate Recognition (ANPR) and static-to-electronic media conversion (see Figure 2.13). These application areas are discussed in the sub-sections ahead.

Figure 2.13 Application Areas of Urdu OCR (digital reformatting, automated text translation, text-to-speech conversion, automated number plate recognition and static-to-electronic media conversion)

2.10.1 Digital Reformatting

In digital reformatting, original documents are converted into digital form. These digital documents act as surrogates, preserving the originals and eliminating the need to use them. With an automated system like OCR, it is possible to convert physical libraries into digital libraries, and the Internet can then be used to transfer and spread the literature, making it available worldwide. Currently, the Internet is being used as a repository for making textual material available online; this has been successful, but with a few trade-offs. Most of the literature on the Internet is now in the form of images containing text. These images consume a lot of storage space, and transferring such files over the Internet is slow. Digital reformatting will therefore allow the conversion of physical libraries into digital libraries with less time and space consumption.

2.10.2 Automated Text Translation

Automated text translation is a well-known application area of OCR, sometimes also referred to as Machine Translation (MT). Generally, automated text translation software translates text from a source language (e.g. Urdu) to a target language (e.g. English). Nowadays, text translation software is being designed for personal, business and enterprise usage. These tools are extremely useful and let users understand and convert a script in one language into the target language script in real time.

2.10.3 Text-To-Speech Conversion

OCR technology also provides accessibility for low-vision users. This is generally known as text-to-speech conversion and, more technically, as speech synthesis. It involves converting the text recognized by the OCR software into computer-generated speech. This technology allows low-vision or blind people to read books, magazines or any other reading material after scanning it.

2.10.4 Automatic Number Plate Recognition (ANPR)

ANPR is a technology that reads vehicle registration plates using OCR. ANPR requires a fast video camera to capture the image. ANPR technology is used worldwide by law enforcement agencies to keep track of vehicles, for tasks such as vehicle licensing, vehicle registration and electronic toll collection on pay-per-use roads.

2.10.5 Static-to-Electronic Media Conversion

E-media is an emerging application area of OCR technology. Electronic media (e-media) encompasses the use of electronics by the end user to access content; static media (print media), on the contrary, does not involve electronics. Static media such as newspapers can be converted to e-media using OCR technology by recognizing the newspaper headlines. Any handheld device with a camera can be used to take a snapshot of the headlines or to recognize them in real time. Once the headlines are recognized, the same news and its detailed content can be accessed online, for example in video form, on the same handheld device.

2.11 Urdu Script Preliminaries

Urdu is the national language of Pakistan [1]. It is spoken in more than 20 countries by more than 70 million people across the world. It is also widely spoken, to some extent, in countries such as Afghanistan, Bangladesh, India, Malawi, Nepal, Saudi Arabia, the UAE, South Africa, the United Kingdom, Thailand and Zambia, and it is an official language of five Indian states. Cities like Mecca and Medina in Saudi Arabia also use Urdu for informational signage, which projects the significance of the Urdu language in the Muslim world.

2.11.1 Urdu Script, Its History and Relation to Other Cursive Scripts

The Arabic alphabet has influenced several languages, including Persian, Urdu and Pashto [50, 51]. Each of these languages has some dissimilarity in its characters, but they share the same underlying foundation. Urdu has similarities to the Arabic alphabet due to the history it shares with it [26].

The word Urdu is derived from the Turkish word "Ordu", meaning "army" or "camp". The history of the Urdu language is vibrant and vivid. It is believed that Urdu came into existence during the Mughal Empire. After the 11th century, the Persian and Turkish invasions of the subcontinent caused the development of Urdu as a means of communication. During the Mughal Empire, Persian was the official language, Arabic was the language of religion, and Turkish was spoken mostly by the elite and the Sultans; Urdu was therefore highly influenced by these three languages. In the early years it was used simply for communication and was known as "Hindvi". As the years progressed its vocabulary expanded, and several names were associated with it during this period, such as Dehalvi and Zaban-e-Urdu. After independence, Urdu was declared the national language of Pakistan.

Arabic and Urdu are both written in the Perso-Arabic script; therefore, they share similarities at the written level. Arabic and Persian writing styles have a great influence on Urdu script; hence, Urdu uses a modified and extended set of the Arabic and Persian alphabets. Urdu is written in the Nastalique calligraphic style of the Perso-Arabic script. The history of Nastalique dates back to the Islamic conquest of Persia, when the Persian art of calligraphy was adopted by the Iranians. Mir-Ali Heravi Tabrizi, a famous Iranian calligrapher, developed the Nastalique calligraphic style during the 14th century. Nastalique was formed by the combination of two scripts, "Naskh" and "Taliq"; in the early years it was called "Nashtaliq", but later it became more formally known as Nastalique. In South Asia, Persian was the official language of the Mughal Empire; Nastalique emerged during those days and left a great influence on South Asia, including Bangladesh, India and Pakistan. In Bangladesh, Nastalique was widely used before 1971, and in India it is still widely observed.

Nastalique is the standard calligraphic style for writing Urdu in Pakistan. Nastalique is extremely beautiful and more artistic compared to the Naskh writing style of Arabic. There are several calligraphic styles for writing the Arabic script, such as the Naskh, Nastalique, Koufi, Thuluthi, Diwani and Rouq'i styles (see Figure 2.14). Naskh is the most common writing style used for the Arabic, Persian and Pashto scripts [18].

Figure 2.14 Different Writing Styles for Arabic Script [52]

Arabic, Persian, Urdu and Pashto, all four alphabet systems are more or less the same, the only difference is the total number of characters (see Table 2.2). Arabic has the smallest number of characters in its alphabet. Persian uses the Arabic characters along with a greater number of characters. Urdu and Pashto both extend further from the Persian alphabet.

Table 2.2 Comparison of Arabic, Urdu, Persian and Pashto Writing Styles

Characteristic          Urdu            Arabic          Persian         Pashto
Total no. of letters    38              28              32              45
Order of writing        Right to left   Right to left   Right to left   Right to left
Cursive                 Yes             Yes             Yes             Yes
Dots and diacritics     Yes             Yes             Yes             Yes

The Arabic alphabet is also known as an abjad. It is written from right to left and has a total of 28 characters [18] (see Figure 2.15). The Arabic alphabet does not possess distinct upper-case and lower-case forms. Several characters may have a similar appearance, but they are given their own distinction by the use of dots placed above or below their central part. For example, the Arabic letters خ (khā’), ح (ḥā’) and ج (jīm) have the same base shape; they have one dot above, no dot, and one dot below, respectively.


Figure 2.15 Arabic Alphabet

The Persian alphabet and script share many similarities to that of the Arabic script. It is also written right-to-left and is an abjad, meaning the vowels are under-represented in the writing system. The Persian alphabet consisting of 32 characters is shown in Figure 2.16. The Persian script is cursive in nature; hence, the characters change their shape depending on its position: isolated, initial, middle and final of a word.

Figure 2.16 Persian Alphabet

Pashto is the official language of Afghanistan and is also widely spoken in parts of Pakistan. It is used by 50 million people for oral and written communication [53]. There is a total of 45 characters in the Pashto alphabet (see Figure 2.17). The characters of the alphabet may carry 0 to 4 diacritic marks. No significant efforts have yet been devoted to the recognition of the Pashto script.


Figure 2.17 Pashto Alphabet

There is a total of 38 characters in the Urdu alphabet [40]. In Urdu, the text lines are read from top to bottom, whereas the characters are read from right to left. The characters can be grouped into classes based on the similarities of their base forms; the characters in the same class differ only by their dots or retroflex mark. Figure 2.18 shows the shapes of the basic isolated Urdu characters.

Figure 2.18 Urdu Alphabet

2.11.2 Joiner and Non-Joiner Characters

There are two types of characters in Urdu: joiners and non-joiners [31]. Joiner characters are written cursively. The shape of a joiner character changes depending on the neighboring character to which it is connected, as well as on its position within the word. Hence, all connectors in principle have four basic shape forms, i.e. "isolated", "start", "middle" and "end". There are 27 joiner characters in the Urdu alphabet, as shown in Figure 2.19.

Figure 2.19 Joiner Urdu Characters

Alternatively, non-joiner characters are those characters that have no special start or middle forms, because they do not connect to the following characters. Hence, the non-joiner characters take only two basic shape forms, i.e. "isolated" and "end". There are 10 non-joiner characters in the Urdu alphabet, as shown in Figure 2.20.

Figure 2.20 Non-Joiner Urdu Characters

If a word ends with a joiner character, then a space must be inserted after the word; otherwise the word will merge visually with the following word, resulting in an incorrect and meaningless form. However, a word ending with a non-joiner usually takes no space before the next ligature, so the ligatures of the same word may appear to be separate words: text that looks like two words may in fact be two ligatures belonging to the same word. Segmentation of ligatures is therefore extremely complex, since there is no separation or space between the ligatures.

2.11.3 Dots and Diacritics

Urdu characters are surrounded by special marks known as diacritics. The diacritics surround the character's main body and lie above or below it [33]. There are three types of diacritics, i.e. the nuqta (dot), the aerab and the ط superscript. The placement and number of the dot(s) are used to distinguish several characters in the Urdu alphabet. The dot(s) can be placed below or above the associated character and can range from one to a maximum of three in number. In total, 17 characters of the Urdu alphabet are accompanied by dot(s), as shown in Figure 2.21.

Figure 2.21 Characters Associated with Dots

The consonants are represented by characters. Diacritics that serve as vowel marks are also known as aerab. Aerab help in the pronunciation of Urdu characters; they are optional and are written with the Urdu script when there is a need to remove confusion in pronunciation, since they change the sound of the letter. Some of the common aerab are shown in Figure 2.22.

Figure 2.22 Some of The Common Aerab Used with Urdu Characters

There are three characters in Urdu that are known as retroflex consonants (see Figure 2.23). A retroflex consonant is spoken when the tongue has a curled, flat or concave shape. These were not present in the Persian or Arabic alphabet. The retroflex consonants are created by placing the ط superscript on three Urdu characters known as dental consonants. A dental consonant is spoken when the tongue is pressed against the upper teeth.


Figure 2.23 Retroflex Consonants and ط Superscript

2.11.4 Urdu Nastalique OCR Challenges

The extremely cursive and calligraphic nature of Nastalique makes the development of an Urdu OCR system challenging [17]. Nastalique text is written with a pen that has a special nib called a "qat" [16], which results in script with varying stroke width, making handwritten text recognition difficult. Even when developing an OCR for printed Urdu script, there are two approaches, i.e. holistic and analytical, and regardless of the approach used, the Nastalique font poses several challenges in developing a robust OCR system. The key challenges faced during the development of an Urdu OCR system are discussed below.

(a) Context Sensitivity

The Arabic script is cursive; characters are joined together to form ligatures/words. It is written in the Naskh font, and the characters can have two to four basic forms based on their location in the ligature/word: isolated, initial, middle and final [54]. Urdu, on the contrary, is written in the Nastalique font, which is far more complex than this four-shapes phenomenon. In Urdu, the shape of a character is affected not only by its position but also by its neighboring characters; this sensitivity is referred to as context sensitivity [17]. Figure 2.24 shows the context sensitive behavior of the character 'te' from the Urdu alphabet; it can easily be observed that the neighboring characters and the position (start, middle or final) in the ligature affect its shape.

Figure 2.24 Shape of Character ‘Te’ Affected by Its Neighboring Characters

(b) Diagonality

One of the main characteristics of the Nastalique font is that it is written diagonally, from the top right to the bottom left [18, 55]. As new characters are joined to the preceding characters, a slant or slope is introduced into the ligature being written; this feature is known as diagonality. In comparison, English and Arabic are written horizontally along a single baseline. Figure 2.25 (a) shows the diagonal nature of the Urdu script; the horizontal, straight nature of the Arabic script can be seen in Figure 2.25 (b).

Figure 2.25 (a) Diagonality in Urdu (b) Horizontal Baseline in Arabic

(c) Overlapping Problem

Urdu text takes less horizontal space compared to the Arabic Naskh style of writing; therefore, Urdu has extreme ligature overlapping issues [10, 17]. Overall, the overlapping can be divided into two types [52]. First, in intra-ligature overlapping, a character overlaps another character(s) within the same ligature. The second type of overlapping is known as inter-ligature overlapping, in which a character of one ligature overlaps a character(s) of another ligature. Both inter-ligature overlapping and intra-ligature overlapping can be seen in Figure 2.26. Inter-ligature overlapping makes the process of ligature segmentation extremely difficult, while intra-ligature overlapping makes the process of character segmentation challenging [56].

Figure 2.26 (a) Inter-Ligature Overlapping (b) Intra-Ligature Overlapping

(d) Diacritics Placement

The context sensitive and sloping nature of the Nastalique script makes the placement of diacritics difficult, and standard positions cannot be followed (see Figure 2.27). The diacritics of a character are shifted with the addition of every new character, and they may be moved to a nearby position instead of their standard position [52], mainly to avoid any clash or overlapping between the nuqtas. Therefore, identifying the character to which the diacritics belong is difficult due to inter-ligature and intra-ligature overlapping.

Figure 2.27 Placement of Dots at Non-standard Position

(e) Spacing

The Nastalique font consumes very little space and ligatures are written very tightly [56], making it a difficult task to identify the full word [57]. This lack of space between ligatures makes the task of segmentation in Urdu OCR problematic [56]. Figure 2.28 (a) shows Urdu written in its Nastalique calligraphic style, with little space between ligatures, compared to the Arabic Naskh font shown in Figure 2.28 (b).

Figure 2.28 (a) Spacing between Urdu Ligatures (b) Spacing between Arabic Ligatures

(f) Multiple Baselines

A baseline is a horizontal line on which text is written. The baseline is extremely useful for line segmentation and skew detection. However, the Urdu Nastalique script may have more than one virtual baseline, which leads to ambiguities in baseline selection (see Figure 2.29). Also, since the baseline is not fixed and the mark placement rules are not standard, the script is extremely complex [6, 58].

Figure 2.29 Characters Touching Baseline (Blue), First-Descender Line (Pink) And Second-Descender Line (Green) [59]

2.12 Summary

In this chapter, the background of optical character recognition systems has been discussed in detail. A detailed explanation has been provided for its processes, namely image acquisition, pre-processing, segmentation, feature extraction, classification and recognition, and post-processing. Urdu OCR has a large number of potential applications that can be developed and implemented in the future. This chapter has also highlighted the challenges and complexities associated with the Urdu Nastalique script. In the following chapter, the existing Urdu datasets and the OCR methods proposed for the different categories and phases of Urdu OCR will be reviewed.


The Urdu language has worldwide appeal; it is written and understood in numerous countries around the world. However, little progress has been reported for the recognition of its script. This lag in research is mainly due to deficiencies in different areas, such as dictionaries, research funding, equipment, benchmark datasets and other necessary utilities. This chapter gives an across-the-board literature review of the most prominent and relevant studies in the field of Urdu optical character recognition. First, a comprehensive review is given of the existing studies covering the different modes and categories of Urdu optical character recognition. Next, notable contributions in the field of Urdu OCR are given for the different phases, i.e. image acquisition, pre-processing, image segmentation, feature extraction and classification. No remarkable procedures or results have been reported by the research community for the post-processing phase, so it is omitted from this literature review. The Urdu numeral recognition systems are also briefly discussed. The chapter concludes by discussing some open problems and future directions, and a summary section concludes the entire chapter.

3.1 Datasets for Urdu OCR

There has been relatively little research on Urdu OCR because of the paucity of corpora and image datasets. A benchmark dataset is essential for the efficient and robust development of a character recognition system; however, there are not enough datasets available to carry out even basic research and development tasks in Urdu NLP applications such as Urdu OCR [13]. The datasets commonly used for various Urdu NLP tasks are described below and shown in Figure 3.1.

Figure 3.1 Datasets for Urdu Optical Character Recognition (EMILLE, CLE, Ijaz and Hussain, Urooj, UPTI)

EMILLE (Enabling Minority Language Engineering) project [60] that was initiated by Lancaster University in 2003, is the first ever initiative for making Urdu corpus available.

The main objective of the project was to develop a dataset for South Asian languages. Over the years, the dataset has been extended to include more than 96 million words. It encompasses three types of data, i.e. monolingual, parallel and annotated data, and consists of a total of 512,000 spoken Urdu words and 1,640,000 words of Urdu text. Besides Urdu, the EMILLE project also includes thirteen other South Asian languages.

Corpora and associated tools for Urdu text processing have also been developed by the Centre for Language Engineering (CLE) in Pakistan, which conducts research and development on the computational aspects of Pakistan's various languages. The main aim of the center is to enable the public to access information and communicate in their local languages using information and communication technology. The CLE has provided a 19.3 million ligature corpus collected from a wide range of domains, namely sports/games, news, finance, culture/entertainment, consumer information and personal communication; a high-frequency ligature corpus has also been provided by the CLE, as stated in [43]. Another small dataset for Urdu was given by Shafait et al. [59]; it is publicly available, consists of only 25 documents, and has ground truth for both OCR and layout analysis.

A large word dictionary was developed by Ijaz and Hussain [61] in 2007; a total of 50,000 distinct words were collected from documents of different domains, namely finance, sports, culture, entertainment, news, personal communication and consumer information. A dataset was made public by Urooj et al. [62]; it covers the Urdu Nastalique font in various font sizes and contains more than 100,000 words, collected from various domains, namely interviews, press, novels, letters, translations, religion, short stories, sports, science, culture, health care and book reviews.

A synthetic dataset called UPTI (Urdu Printed Text Image Database) was proposed by Sabbour and Shafait [33]; it consists of 10,063 synthetically generated text-line and ligature images. It was developed as an analogy to the APTI (Arabic Printed Text Image) dataset proposed by Slimane et al. [63]. The size of the dataset was increased by applying different degradation procedures. The UPTI dataset contains both a text-line version and a ligature version for measuring the accuracy of a recognition system, and it comprises 12 sets generated by varying four image parameters, namely jitter, elastic elongation, threshold and sensitivity. The text covers different political, social and religious issues and was collected from the Jang newspaper. The dataset also contains ground-truth information for both the text-line version and the ligature version.

3.2 Related Work for Different Categories of Urdu OCR

As mentioned earlier, the modes for an OCR system can be divided into different categories based on input acquisition that can be online or offline, writing mode that can be handwritten or printed, font constraints that may include the font variations handled by a recognition system and finally the script connectivity that may include the segmentation process for character or ligature or may completely avoid the segmentation when developing a recognition system. Compared to online, most of the research has been conducted towards offline isolated and cursive analytical segmentation based character recognition systems. Several authors opted to use offline recognition for development of Urdu OCR systems for isolated characters [64-70]. Shamsher et al. [64] developed a system for printed Urdu script used for recognizing individual Urdu characters using their proposed algorithm. Similarly, a font size independent technique was proposed for Noori Nastalique font of Urdu, the technique was tested on single Urdu character ligatures [65]. The system proposed by Nawaz et al. [66], also aimed for recognition of isolated Urdu characters. The overall system provided the basic pre-recognition techniques such as image pre-processing, segmentation (line and character), and finally the creation of XML files for the training purpose. Some researchers, [69] recommended an offline recognition system but for both handwritten and printed Urdu text, and also highlighted the pre-processing, features extraction and classification of Urdu language text. An OCR named “soft converter”, was presented by [67], that could recognize the isolated characters of Urdu language utilizing a database. Khan et al. [68] developed two databases of images, namely, a ‘TrainDatabase’ and a ‘TestDatabase’ for offline recognition of Urdu text. Likewise, [70] also proposed the representation and recognition for offline printed isolated Urdu characters. Cursive texts are extremely complex to deal with. However, several authors also focused towards developing cursive analytical Urdu text OCR systems [16, 17, 22, 32, 34-36, 71]. Ahmad et al. [36] developed an OCR system that comprised of two main components, i.e. segmentation and classification. Both, segmentation and classification of the compound characters were studied. In another study, Ahmad et al. [16] discussed different characteristics of Urdu script and presented a novel and robust method to recognize printed Urdu script without a lexicon. Likewise, [34] proposed an offline recognition system for Naskh script, it operated on segmented characters and classified it into 33 groups for recognition purposes. In a study done by [32], the differences between Naskh and Nastalique fonts were discussed, an analytical segmentation-based model was proposed for pre-processing and recognition of Urdu script. Pal and Sarkar [22] proposed an OCR

42 system for printed Urdu script. Characters were segmented and recognized on basis of multiple features and achieved promising accuracy. Similarly, [17] and [35] trained an OCR system for on offline printed segmented characters, however, the font utilized has not been mentioned. Relatedly, Iqbal et al. [71], proposed a system that dealt with recognition of Urdu Nastalique font for character based segmentation of printed Urdu names. The recognized characters were converted into Roman Urdu, handling various complexities during the conversion. Likewise, Khan and Nagar [72] proposed a study for printed and handwritten cursive Urdu text written in Nastalique and Naskh font style. Ul-Hasan et al. [26] used deep learning for recognition of printed Urdu text in the Nastalique script. Whereas Patel and Thakkar [38] also used a deep learning approach but for recognition of cursive handwritten documents, both in Arabic and English scripts. The studies in [10] and [41] proposed an implicit segmentation based recognition system for Urdu text lines in the Nastalique script and a hybrid approach combining the convolution and deep learning for classification of cursive Urdu Nastalique script respectively. In another similar study, Naz et al. [39] achieved high accuracy for sentence based implicit recognition of Urdu Nastalique font using human extracted features and deep learning. Due to the added complexity of cursive script specifically for analytical systems, segmenting text at the character level, several authors are shifting towards recognition of words and ligatures [33, 40, 43, 73-84]. Working with ligature based systems seems more feasible since an extra level of character segmentation is omitted that is usually prone to errors. Ahmed et al. [40] developed an offline ligature based Urdu recognition system. Another offline OCR system was proposed by Sabbour and Shafait [33] called Nabocr. The system was trained to recognize both the Urdu Nastalique and the Arabic Naskh font. An alternative multi-script OCR system was proposed by Chanda and Pal [73] for word level recognition of multiple scripts i.e. English, Devanagari and Urdu from a single document. The characteristics of different scripts are taken into consideration for development of the final recognition system. Sattar et al. [74] presented a novel offline segmentation-free approach for Nastalique based Optical Character Recognition called NOCR. In [76], the authors developed a technique which was capable of extracting the Urdu font “Jameel Noori Nastalique” from images and converted it into editable textual ’s. The approach comprised of pre-processing techniques, label connected components, feature extraction, and image comparison. Javed and Hussain [81], focused on the development of a ligature based OCR system that input the main body without the diacritics and recognizes the related ligatures. Similarly, [82] proposed a technique for recognizing font invariant cursive Urdu Nastalique ligatures. Rana and Lehal [83], presented a ligature based OCR

system for Urdu Nastalique script, describing its various challenges and using k-NN and SVM classifiers. In [84], a new method was presented for offline recognition of cursive Urdu text written in the Noori Nastalique style; the system aimed at ligature based identification. Khattak et al. [43] used a segmentation-free and scale-invariant technique for recognition of printed Urdu ligatures in Nastalique font. Systems to recognize printed Nastalique pre-segmented ligatures were presented by [79] and [6]. Similarly, Hussain et al. [75] analyzed and modified the Tesseract engine for the recognition of the offline printed Nastalique calligraphic style of the Urdu language. Similarly, the authors in [77] also developed a technique for recognizing the Jameel Noori Nastalique font from Urdu newspaper clippings, which were converted into editable Unicode text. Mukhtar et al. [78] claim to present the first reported study for handwritten Urdu words; the proposed system was an offline OCR system for Urdu. Similarly, two strategies for improving the classification of printed Urdu ligatures using nearest neighbor were investigated by [80]. Online character recognition systems are less complex than offline recognition systems. A few authors have opted for online Urdu character recognition systems, which use the handwritten writing mode by default. Shahzad et al. [85] proposed an online Urdu character recognition system capable of recognizing isolated, hand-sketched characters. Similarly, an online segmentation-free character recognition system was suggested by Razzak et al. [86]. Khan [87] studied the online recognition of Urdu characters considering only their initial half forms, whereas Jan et al. [88] proposed a system to recognize both Urdu characters and words. In the studies by Husain et al. [11] and Sardar and Wahab [5], online and offline recognition were combined into systems that could operate independently of fonts and scripts. A summary of several notable contributions for the different categories of Urdu OCR systems is given in Table 3.1.

Table 3.1 Summary of Different Categories of Urdu OCR

Input Writing Script Related Work Font Constraints Acquisition Mode Connectivity O Off OOff P H PH NQ NK NQK FI IS CU Ahmed et al. [40] ü ü Sabbour and Shafait [33] ü ü ü ü Husain et al. [11] ü ü ü ü Shahzad et al. [85] ü ü ü Razzak et al. [86] ü ü ü ü Khan [87] ü ü ü ü Jan et al. [88] ü ü ü Megherbi et al. [70] ü ü ü Husain [84] ü ü ü Pal and Sarkar [22] ü ü ü Chanda and Pal [73] ü ü ü ü Javed [35] ü ü ü Shamsher et al. [64] ü ü ü Ahmad et al. [16] ü ü ü ü Sattar et al. [74] ü ü ü Sattar et al. [17] ü ü ü Hussain et al. [34] ü ü ü ü Mukhtar et al. [78] ü ü ü Akram et al. [65] ü ü ü ü Nawaz et al. [66] ü ü ü ü Ahmad et al. [36] ü ü ü Javed et al. [6] ü ü ü ü Tariq et al. [67] ü ü ü Khan and Nagar [72] ü ü ü ü Khan et al. [68] ü ü Hussain et al. [75] ü ü ü ü Ul-Hasan et al. [26] ü ü ü ü Lehal and Rana [79] ü ü ü ü El-Korashy and Shafait ü ü ü ü [80] Javed and Hussain [81] ü ü ü ü Nazir and Javed [82] ü ü ü ü Patel and Thakkar [38] ü ü ü ü Khan and Khan [76] ü ü ü Khan and Khan [77] ü ü ü Khan et al. [69] ü ü ü Rana and Lehal [83] ü ü ü ü Naz et al. [39] ü ü ü ü Hussain and Ali [32] ü ü ü ü Naz et al. [10] ü ü ü ü Naz et al. [41] ü ü ü ü Sardar and Wahab [5] ü ü ü ü Khattak et al. [43] ü ü ü ü Iqbal et al. [71] ü ü ü ü O- Online Off- Offline OOff-Online and Offline P- Printed H-Handwritten PH-Printed and Handwritten NQ- Nastalique NK-Naskh NQK-Nastalique and Naskh FI-Font Independent IS- Isolated CU-Cursive

3.3 Related Work for OCR Phases

The related studies for the different phases of an OCR system are discussed below. The pre-processing, segmentation, feature extraction, classification and recognition phases are thoroughly reviewed.

3.3.1 Related Work for Image Acquisition

Image acquisition is the first phase of any recognition system. Over the years, researchers have used numerous sources to input text: directly scanning text images, generating synthetic images, or using a digitizing tablet. The input method proposed by authors mostly depends on the input acquisition mode, i.e. offline or online. Computer-generated, i.e. synthetic, images have been used by several authors for optical character recognition systems [17, 36, 66, 67, 76]. Text images containing isolated printed Urdu characters were used for training by [66]. Likewise, [67] presented a soft converter that read images in the form of a matrix. Sattar et al. [17] performed a number of experiments on a subset of word images, with the font size kept consistent for all words in the acquired images. In [36], Urdu text line images were given as input and the segmentation process was carried out next. Likewise, [76] read images with the file extensions tiff, jpeg or png. Real-world images are far more complex to deal with, since the images are scanned; during the scanning process, the images might be affected by noise, skew, slant and other degradations. Several authors opted to develop Urdu recognition systems using scanned images as input [6, 22, 32, 35, 73, 77, 81, 82]. Hussain and Ali [32] scanned a total of 25 pages from 22 different books with varying paper quality, print quality and page transparency. Similarly, Pal and Sarkar [22] tested their proposed OCR system on a variety of scanned printed documents; some of the documents were of good quality and printed on clean paper, while others had inferior printing and paper quality, such as children's alphabet books. Javed [35] used scanned images containing Urdu text as input, and it was assumed that the images were handled properly during the scanning process to avoid skew, noise and other distortions. Khan and Khan [77] took newspaper clippings as input images, which were converted into an editable form to be used in the Notepad application. Chanda and Pal [73] digitized the images with an HP scanner at 300 dpi, whereas Nazir and Javed [82] scanned Urdu document pages at a resolution of 200 dpi. In [82], the pages were set to have fixed-size text, with all the letters as well as diacritics distributed among a variety of contexts. The document images were

composed of single as well as multiple lines. Javed et al. [6] extracted three or more samples of each ligature from the text for analysis purposes; the pages were printed in Noori Nastalique font at size 36, scanned at 150 dpi, and later segmented back into ligatures. Similarly, in another study, Javed and Hussain [81] used printed, scanned document images at font size 36. Some authors used a combination of synthetic and real-world images with their OCR systems. Shamsher et al. [64] and Ahmad et al. [16] collected a corpus of two sets of images: computer-generated (synthetic) images and real-world images consisting of scanned Urdu documents; the documents contained no language other than Urdu. Likewise, [65] used manually generated data as well as data scanned from different books and magazines. Rana and Lehal [83] generated synthetic images of the primary components of the ligatures at different font sizes, i.e. 35, 38, 40, 50 and 55, formatted as bold or regular; a total of 1200 primary components were also scanned from books, and the rest of the primary components were generated synthetically. Khan and Nagar [72] proposed a system trained on samples of different-sized documents with text from different writers; all the input data was resized to 250×250 and contained a single character or a single word, and the overall scanning process was expected to be of high quality so that no skew correction would be needed. When developing online character recognition systems, the input source shifts towards equipment that can take input in real time, such as a digitizing tablet. For Urdu, such digitizing tablets have been used for input stroke acquisition [5, 11, 85, 87, 88]. Shahzad et al. [85] captured hand-sketched Urdu characters drawn on a Tablet PC. Correspondingly, [87] also used a pen and tablet combination to collect the data; the data signal was stored as a binary file containing the coordinate information and the pressure point values, reducing the complexity of the recognition system. Jan et al. [88] applied several pre-processing techniques to the input stroke trajectory to remove noise caused by the input device, i.e. the pen tablet, and by hand movements. Similarly, Husain et al. [11] used a digitizing tablet for ligature acquisition; each input stroke represented a ligature in its full form, and the ligature was not broken into characters, avoiding the errors associated with segmentation. Sardar and Wahab [5] proposed a methodology that works for both online and offline recognition, so the input could be an image or could be given through an input device such as a digitizing tablet. The different acquisition methods used by researchers are summarized in Table 3.2.

Table 3.2 Summary of Different Image Acquisition Sources Used for Urdu OCR

Related Work               Synthetic   Scanned   Digitizing Tablet
Husain et al. [11]                                       ü
Shahzad et al. [85]                                      ü
Khan [87]                                                ü
Jan et al. [88]                                          ü
Pal and Sarkar [22]                        ü
Chanda and Pal [73]                        ü
Javed [35]                                 ü
Shamsher et al. [64]           ü           ü
Ahmad et al. [16]              ü           ü
Sattar et al. [17]             ü
Akram et al. [65]              ü           ü
Nawaz et al. [66]              ü
Ahmad et al. [36]              ü
Javed et al. [6]                           ü
Tariq et al. [67]              ü
Khan and Nagar [72]            ü           ü
Javed and Hussain [81]                     ü
Nazir and Javed [82]                       ü
Khan and Khan [76]             ü
Khan and Khan [77]                         ü
Rana and Lehal [83]            ü           ü
Hussain and Ali [32]                       ü
Sardar and Wahab [5]                       ü             ü

3.3.2 Related Work for Pre-Processing

There are a number of techniques that can be used for pre-processing, such as image thresholding, noise removal, smoothing, de-skewing, skeletonization, image dilation and normalization. Megherbi et al. [70] applied filtering, smoothing and thinning algorithms to remove noise from the input images and to reduce the overall variation in the thickness and slant of the different Urdu characters. Nawaz et al. [66] converted the grayscale image into a binary image and removed salt-and-pepper noise from the images; the noise must be removed in order to avoid erroneous classification and recognition. In [11], a smoothing process was used to remove hooks from the data. The data obtained usually contained irregularities such as hooks and erratic handwriting; these hooks occur due to pen up/down inaccuracies by inexperienced users of the tablet. Smoothing of 2 to 3 pixels was applied to remove these issues. Khan et al. [68] removed noise from the images of the 'TrainDatabase' and 'TestDatabase'; salt-and-pepper or speckle noise was usually present in the scanned images. Jan et al. [88] used different filters to remove unwanted data and extract the data of interest; the proposed OCR system also used the Ramer-Douglas-Peucker algorithm to reduce the total number of data points. Khan et al. [69] used a filtering technique for noise removal,

whereas global image thresholding was used for binarization. The acquired images were also normalized, edges were detected, and skew was detected and corrected; for normalization the thinning process was applied, and a correlation approach was used for skew detection and correction. Skew detection and correction were also used by [22], where the Hough transform algorithm was used for skew angle detection and the image was rotated according to the detected angle. The skew estimation method was observed to be font style and size invariant; it could handle documents with angles between +45° and -45° and compute skew angles with a tolerance of ±0.5 degrees. Sattar et al. [17] analyzed the input bitmap image for skew or slant and removed it; several other pre-processing operations were applied to the text images, such as image binarization, noise removal, blur removal, thinning, skeletonization and edge detection. Likewise, [84] used smoothing, skew detection and correction, document decomposition, slant normalization, etc. Tariq et al. [67] performed skew detection and correction on the acquired image; this operation ensured that the image returned to its original position when the user rotated it. As pre-processing, the soft converter also performed an image binarization operation. In a study by Javed and Hussain [81], document images were assumed to be un-skewed and to have little to no noise or distortion; first, the main bodies were extracted from the text lines within the image, then the baseline was detected, and once the main bodies were extracted, skeletonization was performed using the Jang Chin algorithm. In Ahmed et al. [40], pre-processing consisted of skew correction, slant correction, de-noising and finally text normalization. Chanda and Pal [73] used a histogram-based method for thresholding and converting the input images into two-tone binary images; the digitized image was further pre-processed to remove noise pixels and irregularities on the character boundaries, and the images were smoothed to remove noise, since it may lead to undesired effects in the OCR system. Similarly, [34] performed image binarization, de-hooking and image smoothing operations. The RGB input image was first converted to a grey-level image with pixel values between 0 and 255; this grey-level image was then converted to a binary image containing only the two pixel values 0 and 1. Thinning, or skeletonization, was also applied to the binary image to make it one pixel wide and appropriate for feature extraction. Mukhtar et al. [78] and Sardar and Wahab [5] used thresholding algorithms for binarization; [78] also used a moment-based algorithm for slant normalization, whereas [5] used smoothing and noise removal.
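The steps above recur across most of the reviewed systems. The following is a minimal sketch of such a pipeline (binarization, salt-and-pepper noise removal, and Hough-based skew estimation) using OpenCV; it is illustrative only and does not reproduce the exact code of any cited system, and the Hough threshold and rotation sign convention are assumptions that depend on the scans at hand.

```python
import cv2
import numpy as np

def preprocess(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Median filtering suppresses salt-and-pepper noise before thresholding.
    denoised = cv2.medianBlur(gray, 3)
    # Global (Otsu) thresholding converts the grey-level image to binary ink.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Estimate skew from near-horizontal Hough lines (threshold is illustrative).
    lines = cv2.HoughLines(binary, 1, np.pi / 180, threshold=200)
    angle = 0.0
    if lines is not None:
        # theta is measured from the vertical axis; pi/2 means a horizontal line.
        thetas = [t for _, t in lines[:, 0] if abs(t - np.pi / 2) < np.pi / 4]
        if thetas:
            angle = np.degrees(np.median(thetas) - np.pi / 2)
    # Rotate back by the detected skew angle (the sign may need flipping).
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h))
```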

In Iqbal et al. [71], the pre-processing was divided into two phases: binarization and Target Image Detection (TID). The first phase converted the image into binary form, while the second phase removed the unnecessary white pixels surrounding the image. Javed [35] proposed skeletonization and another pre-processing technique to detect the baseline for Urdu script, which can be used to differentiate between the main bodies and the diacritics. The horizontal projection profile was used for baseline estimation: initially, the row with the maximum number of pixels was set as the baseline. However, it was observed that this could cause problems, such as the baseline sometimes being placed towards the top of the line. To avoid this issue, the maximum number of pixels was computed only over the lower half of the line. Also, instead of a single row, a band of 5 to 10 rows was used as the baseline. Likewise, [86] proposed a unique baseline detection rule for handwritten Urdu input written in both the Naskh and Nastalique writing styles. Different pre-processing steps are required to minimize the unnecessary components in handwritten strokes and to cope with a vast variety of writing styles; a locally Minimum Enclosing Rectangle (MER), computed using three primary strokes, was used for baseline detection. In Ul-Hasan et al. [26], each text line image was resized to a fixed height to extract the baseline information. In a study by Nazir and Javed [82], document image binarization was performed as well as baseline detection; the row containing the maximum number of pixels was marked as the baseline after the line segmentation phase. Patel and Thakkar [38] rescaled each text line image to a fixed height to obtain baseline information, which was used for distinguishing characters. To eliminate the possibility of overlapping characters in words and sub-words, [16] stretched the word images horizontally to create space between two connected characters (see Figure 3.2).

Figure 3.2 Horizontal Word Stretching to Avoid Overlapping [16]
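As an illustration of the projection-profile baseline estimation described above, the sketch below follows the strategy reported for Javed [35]: the baseline row is searched for only in the lower half of the line image, and a small band of rows around it is kept. The band width of 5 rows and the function name are illustrative choices, not the cited author's code.

```python
import numpy as np

def estimate_baseline(line_img, band=5):
    """line_img: binary text-line image with ink pixels == 1."""
    profile = line_img.sum(axis=1)          # horizontal projection profile
    half = line_img.shape[0] // 2
    # Searching only the lower half avoids spurious peaks near the line top.
    baseline = int(np.argmax(profile[half:])) + half
    # Return a band of rows around the baseline rather than a single row.
    return range(max(0, baseline - band // 2),
                 min(line_img.shape[0], baseline + band // 2 + 1))
```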

Khan [87] used downsampling and discarding to remove the repeated sample records for each character. Due to the wide variations in the writing speeds of the users, the sample rate is not constant; a tablet, however, has a constant temporal data rate. This leads to a large

number of samples being recorded at specific locations that have already contributed multiple samples, leading to erroneous values for feature extraction. The different pre-processing techniques used by researchers are summarized in Table 3.3.

Table 3.3 Summary of Notable Contributions for Different Pre-processing Techniques Used for OCR

Study B F T S SDC SN NR SK ED SZN Other Megherbi et al. [70] ü ü ü CM* Husain [84] ü ü ü DD* Pal and Sarkar [22] ü Chanda and Pal [73] ü ü Javed [35] ü ü Husain et al. [11] ü Ahmad et al. [16] WS* Sattar et al. [17] ü ü ü ü ü BM* Hussain et al. [34] ü ü ü D* Mukhtar et al. [78] ü ü Nawaz et al. [66] ü ü Razzak et al. [86] ü Sardar and Wahab [5] ü ü ü ü Tariq et al. [67] ü ü Iqbal et al. [71] ü Khan et al. [68] ü Ul-Hasan et al. [26] ü ü Khan [87] ü R* Javed and Hussain [81] ü ü Nazir and Javed [82] ü ü Patel and Thakkar [38] ü ü Khan et al. [69] ü ü ü ü ü Jan et al. [88] ü ü R* Ahmed et al. [40] ü ü ü ü B-Baseline F-Filtering T-Thresholding S-Smoothing SDC-Skew Detection & Correction SN-Slant Normalization NR-Noise Removal SK-Skeletonization ED-Edge Detection SZN-Size Normalization CM*: Component Marking and Framing DD*: Document Decomposition WS*: Word Stretching BM*: Blur Removal/Morphological D*: De-hooking R*: Re-sampling

3.3.3 Related Work for Segmentation

Segmentation is one of the most crucial stages in the optical character recognition process. Though considerable research has been carried out using both the holistic and analytical approaches, segmentation is still not mature. Limited datasets have been used for holistic approaches; to evaluate and compare the existing algorithms and techniques, benchmarks and large datasets should be used. Segmentation-based approaches, including implicit [10, 26, 38-41] as well as explicit [16, 17, 22, 35, 36, 72] segmentation, have been proposed by many researchers. However, over the years researchers have been shifting towards the more realistic option of using a holistic approach. Some of the popular studies for both explicit and implicit segmentation strategies are given in Table 3.4.

Table 3.4 Summary of Notable Contributions Employing Explicit and Implicit Segmentation Strategies

Study                      Explicit Segmentation    Implicit Segmentation
Pal and Sarkar [22]                  ü
Javed [35]                           ü
Ahmad et al. [16]                    ü
Sattar et al. [17]                   ü
Ahmad et al. [36]                    ü
Khan and Nagar [72]                  ü
Patel and Thakkar [38]                                        ü
Naz et al. [39]                                               ü
Naz et al. [10]                                               ü
Ahmed et al. [40]                                             ü
Naz et al. [41]                                               ü
Ul-Hasan et al. [26]                                          ü

Numerous researchers have focused on using the connected component approach for holistic recognition of Urdu text [5, 33, 35, 43-45, 82, 84, 89, 90]. Husain [84] presented a method for recognition of cursive Urdu Nastalique script using connected component analysis. The overall connected component labeling was divided into two steps. First, the secondary components (dots, Tay, and Mad) were separated from the base ligatures; several features (solidity, number of holes, axis ratio, moments, eccentricity, curvature, normalized segment length, and the ratio of the height and width of the bounding box) were analyzed to separate the special ligatures from the base ligatures. Second, the special ligatures were associated with the most feasible neighboring base ligatures. No accuracy was reported for the proposed segmentation. Similarly, [35] treated those connected components that lie on the baseline as main bodies of the ligatures, while the rest were considered dots or diacritics. Sardar and Wahab [5] suggested an algorithm that extracted the connected components by reading text from right to left; all connected components were then analyzed and certain rules were applied to form the final ligature. The proposed algorithm achieved an accuracy of 98.86% for extraction and association. In [33], the baseline position was first extracted using the row of maximum horizontal projection. Next, all connected components were extracted and separated into primary and secondary: all components that did not touch the baseline were considered dots or diacritics, which were then associated with their respective ligatures using the horizontal span of each secondary component on the baseline. Nazir and Javed [82] used the horizontal projection profile for baseline identification; all components lying on the baseline were considered ligatures, and a size threshold was set to separate the ligatures from the diacritics. The vertical projection of the start and end points of the diacritics was studied for ligature association. Whereas, [44] proposed a line and ligature segmentation algorithm for printed Urdu Nastalique text.

For ligature segmentation, connected components were analyzed, extracted and associated. The proposed ligature segmentation algorithm examined height, width, coordinates, centroids and baseline information, leading to 99.80% accuracy. Similarly, [45] segmented text lines into ligatures using primary (main body) and secondary (dots and diacritics) ligatures. Ali et al. [89] proposed a segmentation phase for an OCR system that had two main steps: dividing the text into lines, and further dividing the lines into ligatures. The first step was achieved using the horizontal projection profile and the second using connected component analysis; the system was trained on 23,204 ligatures. In [90], a ligature based OCR system was suggested that first separated the text into primary ligatures and diacritics; once segmented, right-to-left HMMs were used for recognition purposes. The system was tested on 2,017 high-frequency ligatures. Similarly, [43] separated the secondary components from the main body and used HMMs for training and recognition; the system achieved an accuracy of 97.93% for a total of 2,000 ligatures. Projection-profile-based textual image segmentation is also one of the most popular methods among researchers. In 2016, [47] presented a novel segmentation approach for Urdu Nastalique ligature recognition using the projection profile; the proposed method was tested on a total of 300 ligature samples, achieving a segmentation accuracy of 91.3% and a diacritic association accuracy of 78%. Similarly, [73] also used the vertical projection profile for word segmentation, followed by feature extraction. Husain et al. [11] identified a total of 250 base ligatures and 6 secondary stroke ligatures and used a novel algorithm for online ligature segmentation, whereas in [86] a segmentation-free hybrid approach was proposed for online Urdu handwritten script using HMMs and fuzzy logic. Javed and Hussain [81] used the baseline to separate the diacritics from the main bodies and later extracted the main bodies. Some of the notable contributions using the holistic approach to text segmentation are given in Table 3.5.
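The primary/secondary separation and diacritic association strategy recurring in the studies above can be sketched as follows. This is an illustrative composite under assumed conventions (a uint8 binary image with ink as non-zero pixels, and a precomputed baseline band such as the one returned by the earlier baseline sketch), not the exact algorithm of any single cited work.

```python
import cv2

def segment_ligatures(line_img, baseline_rows):
    """line_img: uint8 binary text line (ink == 255); baseline_rows: range of rows."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(line_img, 8)
    primaries, secondaries = [], []
    for i in range(1, n):                       # label 0 is the background
        x, y, w, h, _ = stats[i]
        # Components crossing the baseline band are primary ligature bodies.
        if any(r in baseline_rows for r in range(y, y + h)):
            primaries.append(i)
        else:
            secondaries.append(i)
    # Associate each dot/diacritic with the primary component whose
    # horizontal span overlaps it the most.
    assoc = {p: [p] for p in primaries}
    for s in secondaries:
        sx, sw = stats[s][0], stats[s][2]
        def overlap(p):
            px, pw = stats[p][0], stats[p][2]
            return max(0, min(sx + sw, px + pw) - max(sx, px))
        if primaries:
            assoc[max(primaries, key=overlap)].append(s)
    return assoc, labels
```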

Table 3.5 Summary of Some Notable Contributions Using Holistic Approach for Text Segmentation

Study                       Algorithm                        Accuracy Reported
Husain [84]                 Connected Component Labelling    -
Javed [35]                  Connected Component Labelling    -
Husain et al. [11]          Online ligature recognition      -
Razzak et al. [86]          Online ligature recognition      -
Sardar and Wahab [5]        Connected Component Labelling    98.86%
Sabbour and Shafait [33]    Connected Component Labelling    -
Javed and Hussain [81]      Baseline Information             -
Nazir and Javed [82]        Connected Component Labelling    -
Ahmad et al. [44]           Connected Component Labelling    99.80%
Din et al. [45]             Connected Component Labelling    -
Ali et al. [89]             Connected Component Labelling    95%
Shabbir and Siddiqi [90]    Connected Component Labelling    -
Khattak et al. [43]         Connected Component Labelling    97.93%
Ganai and Koul [47]         Projection Profile               91.3%
Chanda and Pal [73]         Projection Profile               -

3.3.4 Related Work for Feature Extraction

Over the years, researchers have experimented extensively with the feature engineering approach, suggesting handcrafted features for text recognition. More recent feature learning approaches have also been utilized by several researchers, usually with datasets that are extremely large and complex. Ahmad et al. [16] extracted simple topological features from characters, including the width, number of holes, height of the holes and direction of the holes. In [17], a unique set of features was extracted from the Nastalique characters, named the Nastalique feature set: FS = {Height, Thickness, Angle, Rotation}. Hussain et al. [34] extracted more than twenty-five different features, such as height, width, loops, curves, crosses, endpoints, and joints, during the extraction phase; these features were processed for use in the final character recognition. Contrarily, [67] calculated three different features from each character, i.e. the height, width and checksum. The width of the character was calculated by counting the total number of black pixels in the image from left to right, and the height by counting the black pixels from top to bottom; finally, the checksum counted the total number of black pixels making up the character body. Mukhtar et al. [78] extracted structural features, concavity features and GSC features from normalized character images. The structural features were computed from the image by examining the gradient direction, the concavity features captured the image density and the stroke information, and the GSC feature was represented by a 512-bit binary vector. The contour and topological feature detection methods are claimed to be simple yet robust. Chanda and Pal [73] proposed contour and water-reservoir-based features. It was observed that the upper portion of the top reservoir was open for Urdu characters but not for English ones. Similarly, the reservoir was also computed from the right direction and the water flow levels were noted; also, the normal distance was computed between the water flow line and the right side of the bounding box. For Urdu characters this distance was more than three times the stroke width, which did not hold for English. Likewise, [22] extracted topological, contour and water-reservoir-based features from individual characters. The topological features used were holes, the total number and position of holes, the number of different components in the character, and the ratio of hole

height to the character height. Contour features included the different profiles obtained from portions of a character's contour. Numerous water-reservoir-based features were extracted, such as the number, position and height of the reservoirs, the water flow level and direction, and the ratio of reservoir height to component height. Similarly, [33] extracted the contour from each ligature sample and described each ligature shape using a descriptor termed the 'shape context'. To extract the contour, a logical grid was applied to the ligature image; the transition points from black to white and/or from white to black were then taken as the contour points. El-Korashy and Shafait [80] used the shape features given by [33], as shown in Figure 3.3; however, some additional features were added to the feature vector, such as the size and location of the dots, and size features like width, height and aspect ratio.

Figure 3.3 Contour (Boundary) Extracted for a Ligature [33]
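The grid-and-transition contour extraction described for [33] can be sketched in a few lines. The sampling step of 2 pixels is an illustrative parameter; columns can be scanned analogously to rows.

```python
import numpy as np

def contour_points(binary, step=2):
    """binary: 2-D array with ink pixels == 1; returns (x, y) contour points."""
    pts = []
    for y in range(0, binary.shape[0], step):      # sampled rows of the grid
        row = binary[y].astype(np.int8)
        # np.diff is non-zero exactly where the row switches black/white.
        for x in np.flatnonzero(np.diff(row)):
            pts.append((x, y))
    return np.array(pts)
```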

Naz et al. [10] extracted 12 statistical hand-crafted features from the text lines by sliding a window over them. These features were neither language nor script dependent and were extremely simple to compute and work with; they comprised vertical and horizontal edges, foreground pixels, intensity features, projection features and GLCM features. In [39], the statistical features extracted were vertical edge intensities, horizontal edge intensities, foreground distribution, density function, intensity features, mean and variance of horizontal projections, mean and variance of vertical projections, GLCM features, and center of gravity (X and Y). The features were extracted by sliding a window of size 4 × 48 from right to left over the text line. The results showed that the proposed features significantly reduced the labeling errors. Whereas, in [86], a total of twenty-six time-variant structural and statistical features, such as cusps, lines and loops, were extracted from the base strokes based on fuzzy logic.
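A minimal sketch of this kind of sliding-window statistical feature extraction is given below, in the spirit of Naz et al. [10, 39]: a fixed-width window moves right to left over the text line and simple, script-independent statistics are computed per window. Only a subset of the 12 published features is shown, and the window width and foreground threshold are assumptions.

```python
import numpy as np

def sliding_window_features(line_img, win=4):
    """line_img: grayscale text line of shape (H, W)."""
    feats = []
    for x in range(line_img.shape[1] - win, -1, -win):    # right to left
        w = line_img[:, x:x + win]
        fg = (w < 128).astype(float)                      # foreground mask
        feats.append([
            fg.mean(),                                    # foreground density
            w.mean(), w.var(),                            # intensity features
            fg.sum(axis=1).mean(), fg.sum(axis=1).var(),  # horizontal projection
            fg.sum(axis=0).mean(), fg.sum(axis=0).var(),  # vertical projection
            (np.abs(np.diff(fg, axis=1)) > 0).mean(),     # vertical edge density
            (np.abs(np.diff(fg, axis=0)) > 0).mean(),     # horizontal edge density
        ])
    return np.array(feats)
```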

Shamsher et al. [64] extracted simple features from all the characters: once a character was detected, its pixels were copied to a simple matrix and four extreme points were detected. In the first pass the top, right and left extreme points were detected; in the second pass, only the bottom extreme point. Nawaz et al. [66] extracted a chain code from each input image, calculated by scanning the input image and recording the string of on and off pixels. This chain code is used in the recognition phase, where it is matched column by column with the chain code calculated for every character, for class identification. Khan et al. [68] proposed extracting an M×M matrix for each image of the 'TrainDatabase' as well as the 'TestDatabase'. Eigenvectors and eigenvalues were calculated for each image of these databases, and the eigenvector with the highest eigenvalue was selected. In a study by Khan et al. [69], three different feature extraction techniques were used, namely Hu moments, Zernike moments and principal component analysis. Hu moments of order 2 to 9 were used, whereas the Zernike moments were used to calculate the Euclidean distance between the character to be recognized and the training image. Megherbi et al. [70] proposed and defined a unique set of seven fuzzy features to recognize the Urdu characters. Similarly, Javed [35] proposed an OCR system that first applied skeletonization to the main bodies of the objects and then extracted region-based features from them. Handcrafting features can be an extensive task for an extremely diverse dataset; in such a situation, the entire set of image pixels can be taken as features. In the studies [40], [26] and [38], raw pixel values were extracted and used as features, and no other sophisticated features were tested. Similarly, in another study, Naz et al. [41] employed a five-layered CNN model for extracting abstract and generic features from 60,000 handwritten digit images of the MNIST database. The first layer of the CNN extracted information from the raw pixels of the image, such as edge, line and corner information; features were selected from the first convolution layer in the form of convolution kernels (K1-K6). Numerous other features can be extracted from text images. For instance, cross-correlation was used by Sattar et al. [74] for recognition of the character shapes; the character codes were written sequentially into a text file as the characters were found during the recognition phase. In [75], simple height and width features were extracted from the main bodies of the ligatures, and the C4.5 algorithm was used to evaluate the significance of both. The width was found to be more significant, and it was used to divide the training data into four subclasses. Khan and Khan [77] proposed a novel technique that utilized point feature matching with SURF features; point feature matching uses point correspondences between the target and the reference image for detecting objects.

The 100 strongest SURF feature points were found in the reference image and the 300 strongest in the target image. Lehal and Rana [79], however, experimented with different feature extraction techniques, using a combination of DCT, Gabor filters and zoning features. For the Gabor filters, the word image was normalized to 32 × 32 pixels and partitioned into 16 sub-regions of size 8 × 8, and the images were convolved with symmetric and odd-symmetric Gabor filters. The higher-value DCT coefficients were extracted in a zigzag fashion and stored in a feature vector of size 100. For the zoning features, the image was divided into 3×3, 4×4 and 7×7 zones; for each zone, the percentage of black pixels was calculated, and various experiments were performed. Nazir and Javed [82] extracted a feature vector containing the code of the mark, the base, and the diacritics associated with them, whereas [83] used DCT, Gabor, gradient and directional features for the classification of the primary components. Husain [84] extracted features in two stages: first, features were extracted only from the special ligatures (solidity, number of holes, axis ratio, eccentricity, moments, normalized segment length, curvature, and the ratio of the bounding box width and height); in the second stage, twenty new features were added to the feature vector of the base ligature, in order to associate the special ligatures with the base ligatures. In [43], a model was proposed that used features capturing projection, concavity and curvature information; right-to-left sliding windows were used to extract this information from the ligatures and feed it in for training. Javed et al. [6] extracted a diacritic-dependent binary feature vector, set to '0' when a diacritic was detected and to '1' otherwise. Whereas, [85] extracted a feature vector after careful analysis of the Urdu characters: a set of Rubine features from the primary components, i.e. the length of the bounding box diagonal, the angle of the bounding box diagonal, the distance between the first and last point, the cosine and sine of the angle between the first and last point, the total length of the primary stroke, the total angle traversed, the sum of the absolute values of the angle at each point, and the sum of the squared values of those angles. Separate features were also extracted from the secondary strokes, including the number, length and positioning of the secondary strokes and the number of dots. Khan [87] used a multilevel one-dimensional wavelet analysis with the Daubechies wavelet (db2) at level 2 to form the feature vector. In [88], geometric invariant features that were font, scale, rotation and shift invariant were extracted, i.e. cosine angles of the trajectory, discrete Fourier transform of the trajectory, inflection points, self-intersections, convex hull, radial feature, grid (orthogonal and perspective), and retina feature. A feature vector was extracted by [11] from the stroke (x, y) coordinates, chain codes and unique features detected for every stroke. A total of 20 features were extracted

from the base strokes, including start vertical, end vertical, horizontal R2L, horizontal L2R, hedge, curve L2R, curve R2L, loop flag, etc.; 6 features were extracted from the secondary strokes. Sardar and Wahab [5] extracted five features from single ligatures and characters: four features were extracted using three types of sliding windows that computed the ratio between the white and black pixels, and one feature was extracted using the Hu invariant moments. Meanwhile, [72] extracted and created vectors from the binary images, and an SOM model was used to capture the invariant features of Urdu script. Some of the notable contributions using different features are given in Table 3.6.
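Two of the simpler handcrafted descriptors reviewed above can be combined in a short sketch: Hu invariant moments (as used in [5, 69]) and zoning features, i.e. the percentage of ink pixels per zone (as in [79]). The 4×4 zoning reflects one of the grid sizes reported, but the exact combination, the 32×32 normalization and the function name are illustrative.

```python
import cv2
import numpy as np

def hu_and_zoning_features(glyph, zones=4):
    """glyph: uint8 binary ligature/character image with ink pixels == 255."""
    hu = cv2.HuMoments(cv2.moments(glyph)).flatten()      # 7 invariant moments
    norm = cv2.resize(glyph, (32, 32), interpolation=cv2.INTER_NEAREST)
    cell = 32 // zones
    zoning = [float((norm[y:y + cell, x:x + cell] > 0).mean())   # ink ratio
              for y in range(0, 32, cell)
              for x in range(0, 32, cell)]
    return np.concatenate([hu, zoning])
```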

Table 3.6 Notable Contributions Using Different Features

Related Work                 Features
Shamsher et al. [64]         Extreme Points
Ahmad et al. [16]            Topological Features
Sattar et al. [17]           Height, Thickness, Angle, Rotation
Hussain et al. [34]          Height, Width, Loops, Curves, Cross, End Points, and Joints
Nawaz et al. [66]            Chain Code
Tariq et al. [67]            Height, Width and Checksum
Khan et al. [68]             Eigenvector, Eigenvalues and Principal Component Analysis
Khan et al. [69]             Hu Moments, Zernike Moments, Principal Component Analysis
Megherbi et al. [70]         Fuzzy Features
Pal and Sarkar [22]          Topological, Contour, Water Reservoir
Javed [35]                   Region-Based
Ahmed et al. [40]            Raw Pixels
Sabbour and Shafait [33]     Contour
Chanda and Pal [73]          Contour, Water Reservoir
Sattar et al. [74]           Cross-Correlation
Hussain et al. [75]          Height and Width Features
Khan and Khan [77]           SURF Features
Mukhtar et al. [78]          Structural, Concavity, GSC
Lehal and Rana [79]          DCT, Gabor Filters, Zoning Based Features
El-Korashy and Shafait [80]  Contour; Size and Location of the Dots; Size Features (Width, Height and Aspect Ratio)
Nazir and Javed [82]         Code of the Mark, Base and the Diacritics
Husain [84]                  Solidity, Number of Holes, Eccentricity, Moments, Normalized Segment Length, Curvature, Ratio of Bounding Box Width and Height, Axis Ratio
Khattak et al. [43]          Projection, Concavity and Curvature Information
Javed et al. [6]             '0' & '1'
Ul-Hasan et al. [26]         Raw Pixels
Patel and Thakkar [38]       Raw Pixels
Naz et al. [10]              Statistical Features
Naz et al. [41]              Raw Pixels
Naz et al. [39]              Statistical Features
Shahzad et al. [85]          Rubine Features
Razzak et al. [86]           Statistical, Structural
Khan [87]                    Wavelet Analysis
Jan et al. [88]              Geometric Features
Husain et al. [11]           Stroke Coordinates, Chain Code and Unique Features
Sardar and Wahab [5]         Ratio Between White/Black Pixels and Hu Invariant Moment
Khan and Nagar [72]          SOM Model Used to Capture Invariant Features

3.3.5 Related Work for Classification/Recognition

Machine learning and deep learning are closely related, but recently deep learning has gained immense attention in the research community. Traditional machine learning methods such as SVM, k-NN, template matching and decision trees are being overtaken by the less traditional, computationally heavier deep learning methods such as LSTMs, autoencoders, deep belief networks and deep neural networks. Here, the different classifiers are discussed from the perspective of the script connectivity handled by the recognition system, i.e. isolated or cursive (analytical/holistic). Neural networks are extremely complex structures, but in terms of recognition performance they are remarkable. Several authors opted to develop recognition systems for isolated handwritten or printed Urdu characters using neural networks. Shamsher et al. [64] used supervised learning and a feed-forward neural network for training and testing; the prototype was tested on printed Urdu characters and achieved an average accuracy of 98.3%. Likewise, Tariq et al. [67] used a neural network to construct an OCR named Soft Converter; the prototype of the system had an accuracy rate of 97.43%. In [70], a two-stage neural network was used to classify 36 Urdu characters, and a high recognition accuracy was reported; the classifier consisted of a multi-hidden-layer back-propagation neural network architecture for classification of Urdu characters within a subclass. Whereas, Khan [87] used a back-propagation neural network classifier for single-stroke Urdu characters written in their initial half forms; tested on 3600 instances of Urdu letters written in their initial half forms, it achieved an overall accuracy of 91.3%. Contrary to the complex neural network architectures, traditional machine learners like HMM, k-NN, SVM, principal component analysis, decision trees, K-SOM, linear classifiers and template matching have also been reported to give good results. Akram et al. [65] presented an HMM technique tested on single-character Urdu ligatures; the system achieved a recognition rate of 96% for scanned data and 98% for manually generated data. In research by Khan et al. [68], an OCR system for Urdu script using principal component analysis was proposed. The recognition accuracy reported for noise-free images was superior, while for noisy images the recognition rate dropped; overall, 96.2% accuracy was reported for character recognition, and the system was implemented using MATLAB. Khan et al. [69] used a decision tree for classification, specifically the J48 decision tree algorithm; the system was tested on 441 handwritten and machine-written Urdu characters and accomplished a recognition rate of 92.06%. Hussain et al. [34] recognized the segmented characters in two steps, first by categorizing the different shapes

for the same character into 33 classes using an auto-clustering technique, the Kohonen Self-Organizing Map. Next, the 25 extracted features were processed once again for the final recognition; the system was tested on 104 segmented characters and was fully capable of recognizing them. In [78], the OCR system was evaluated on a dataset of about 1300 handwritten words; k-NN and SVM were used for classification, achieving average accuracies of 70% for the top choice and 82% for the top three choices. Likewise, Lehal and Rana [79] used SVM, k-NN and HMM classifiers for training; the proposed system was capable of recognizing 9262 Urdu ligatures with more than 98% recognition accuracy. In [88], a linear support vector machine was used for training and testing, giving a 97% classification accuracy and a very low false rejection rate on the test data. Nawaz et al. [66] implemented a system and tested it on different types of fonts using pattern matching; it achieved an accuracy of 89% for isolated characters at a recognition rate of 15 characters/second. Similarly, [76] applied a template matching technique to different sentences of Urdu script at font size 72; all identified objects were saved as templates on the basis of comparison with other templates already available in the dataset. Whereas, [77] used a point feature matching technique, i.e. SURF, for feature extraction, classification and recognition; the system was tested on 20 newspaper clippings containing 177 objects and achieved a recognition accuracy of 93%. Hussain et al. [75] analyzed the Tesseract engine for the recognition of Nastalique Urdu. The original Tesseract system was reported to have 65.59% accuracy for font size 14 and 65.84% accuracy for font size 16. The modified system outperformed the original and improved the recognition time from an average of 170 milliseconds to 84 milliseconds; trained using 14,750 main body images for font sizes 14 and 16 separately, it achieved an accuracy of 97.87% for font size 14 and 97.71% for font size 16. A summary of the notable contributions for isolated character Urdu OCR is given in Table 3.7.
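The train/test setup common to the classical studies above can be sketched with scikit-learn. The feature matrix X and label vector y are assumed to be given (e.g. produced by one of the feature extraction sketches earlier); the 80/20 split, the RBF kernel and the 10 neighbours (echoing the k-NN configuration reported for [5]) are illustrative choices.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate(X, y):
    # Hold out 20% of the glyph samples for testing.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
    for clf in (KNeighborsClassifier(n_neighbors=10),  # Euclidean distance
                SVC(kernel='rbf')):
        clf.fit(X_tr, y_tr)
        print(type(clf).__name__, clf.score(X_te, y_te))
```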

Table 3.7 Summary of Contributions for Isolated Character Urdu OCR

Study                  Features                        Classifier*         Dataset                       Accuracy
Shamsher et al. [64]   Extreme Points                  FFNN                100 Characters                98.3%
Akram et al. [65]      Extreme Points                  HMM                 Manual Data                   98%
                                                                           Scanned Data                  96%
Nawaz et al. [66]      Chain Code                      PM                  Urdu Characters               89%
Tariq et al. [67]      Height, Width, Checksum         Neural Network      Soft Matching DB              97.43%
                                                                           Hard Matching DB              100%
Khan et al. [68]       Principal Component Analysis    Eigen Space         Train and Test Database       96.2%
Khan et al. [69]       Hu Moments, Zernike Moments,    Decision Tree       441 Characters                92.06%
                       Principal Component Analysis
Megherbi et al. [70]   Fuzzy Features                  BPNN                36 Characters                 --
Hussain et al. [34]    Height, Width, Loop, Joint,     K-SOM               104 Characters                80%
                       Cross, Curves, End Points
Hussain et al. [75]    Height and Width                Character Class     14,750 Images, Font Size 14   97.87%
                                                       Analysis            14,750 Images, Font Size 16   97.71%
Khan and Khan [76]     White Pixel Length              Template Matching   Different Sentences           --
Khan and Khan [77]     SURF                            Point Feature       20 Newspaper Clippings        93%
                                                       Matching
Mukhtar et al. [78]    Structural Features             k-NN and SVM        1300 Handwritten Words        70% and 82%
Lehal and Rana [79]    DCT, Gabor Filters,             k-NN, HMM, SVM      9262 Ligatures                98%
                       Zoning Features
Khan [87]              Wavelet Analysis                BPNN                3600 Letters                  91.30%
Jan et al. [88]        Geometric                       SVM                 Primary Ligature of each      97%
                                                                           Alphabet Class, 128 Samples
Classifier* FFNN: Feed Forward Neural Network; HMM: Hidden Markov Model; PM: Pattern Matching; K-SOM: Kohonen Self-Organizing Map; BPNN: Back Propagation Neural Network; k-NN: k-Nearest Neighbour; SVM: Support Vector Machine

Analytical recognition systems for printed Urdu text using neural networks have obtained phenomenal results on standard Urdu datasets like UPTI. Ahmad et al. [16] used neural networks for training the different forms of a character; the system was tested on synthetic as well as real-world images, obtaining an accuracy of 93.4%. In [36], a feed-forward neural network was used to train 56 different classes of the 41 characters, each having a total of 100 samples; the prototype of the system was implemented in MATLAB and had a reported average accuracy of 70%. Whereas, [40] investigated the performance of recurrent neural networks (RNN) for cursive and non-cursive scripts. Bidirectional Long Short-Term Memory (BLSTM), a variant of RNN, was used for evaluation of Latin as well as Urdu script, with a special Connectionist Temporal Classification (CTC) layer used for sequence alignment. The character recognition accuracy for non-cursive Latin script was 99.17% on the UNLV-ISRI dataset; for cursive Urdu, evaluated on unconstrained datasets, the reported accuracy was 88.94% without position information and 88.79% with position information. Similarly, in [26], a variant of the LSTM, i.e. a bi-directional LSTM, was evaluated on text line images from the UPTI dataset. The system was evaluated for characters both ignoring and considering their shapes, with error rates of 5.15% for the first case and 13.6% for the second. Patel and Thakkar [38] used a multidimensional BLSTM and the ANFIS method. The ANFIS method learned different membership functions and rules from the data; this adaptive network was composed of nodes and directional links, providing a relationship between the inputs and the outputs. It was evaluated on the UPTI dataset, achieving a recognition error rate of 5.4%. Likewise, [10] proposed a system that extracted statistical features and fed them to a multi-dimensional long short-term memory recurrent neural network (MDLSTM RNN) with a connectionist temporal classification (CTC) output layer, used to label the character sequences. The system was evaluated on the standard UPTI dataset of 10,000 lines written in Nastalique font and gave a promising recognition rate of 96.40%. Naz et al. [41] extracted invariant features using a convolutional neural network and then fed these features to a multi-dimensional LSTM (long short-term memory) for learning; experiments carried out on the UPTI dataset achieved an accuracy of 98.12%. In [39], high accuracy was achieved for Urdu Nastalique using statistical features and a multi-dimensional LSTM; the system was evaluated on the Urdu Printed Text Image dataset and achieved a good recognition accuracy of 94.97%. Several other methods have also been used for character-level recognition systems. Sattar et al. [17] proposed a finite-state Nastalique text recognizer with the following two components: a character shape recognizer and a next-state function. During the recognition process, the character codes were stored in text files, which were later concatenated in the sequence in which the ligatures were found. In [32], a segmentation-based technique was used for recognition of Urdu Nastalique text using an HMM classifier. The OCR system was tested on 79,093 instances of 5,249 main body classes and achieved an overall recognition accuracy of 97.11%; the system was also tested on document images extracted from different books and achieved a main body accuracy of 87.44%.
Whereas, [22] implemented a feature-based tree classifier; the system was tested on 3050 printed Urdu characters and numerals and was reported to achieve an accuracy of 97.8%. Javed [35] used the HMM technique to build a ligature-independent OCR system for Urdu Nastalique script. For testing and analysis, different words were extracted from the Nokia dictionary, with three or more samples of each word written in Noori Nastalique font at size 36. The accuracy was calculated separately for each letter; overall, an average accuracy of 95.76% was attained for a total of 2898 characters. Similarly, the system was also analyzed on a total of 1692 ligatures, achieving an accuracy of 92.73%. In [71], template matching was used for identification of the characters within the image; after identification, the Roman Urdu character corresponding to the Urdu Nastalique character in the image was selected. Khan and Nagar [72] proposed an

algorithm based on the Kohonen SOM for recognition of Urdu characters. The training set was composed of 200 samples, while the testing set was composed of 800 samples. A recognition rate of 79.9% was reported for the first choice and about 98.5% for the top three choices. A summary of the notable contributions for cursive character Urdu OCR is given in Table 3.8.
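The BLSTM-with-CTC architecture recurring in the deep-learning studies above ([26, 38, 40]) can be outlined in PyTorch. This is a heavily simplified sketch: real systems add convolutional feature extractors, normalization and careful decoding, and the input height, hidden size and class count used here are illustrative.

```python
import torch.nn as nn

class BLSTMRecognizer(nn.Module):
    def __init__(self, input_height=48, hidden=128, n_classes=100):
        super().__init__()
        # Each pixel column of the text-line image is one time step,
        # so no explicit character segmentation is needed.
        self.rnn = nn.LSTM(input_height, hidden,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_classes + 1)   # +1 for the CTC blank

    def forward(self, x):            # x: (batch, width, height)
        out, _ = self.rnn(x)
        return self.fc(out).log_softmax(-1)

# CTC aligns the per-column predictions with the (shorter) character
# sequence; the blank index matches n_classes above.
ctc_loss = nn.CTCLoss(blank=100)
```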

Table 3.8 Summary of Contributions for Cursive Character Urdu OCR

Study                   Features                    Classifier*          Dataset                         Accuracy
Ahmad et al. [16]       Topological Features        Neural Networks      Synthetic and real-world        93.4%
                                                                         images
Sattar et al. [17]      Height, Thickness,          Finite State Model   --                              --
                        Angle, Rotation
Ahmad et al. [36]       --                          FFNN                 56 Classes of 100 Samples       70%
Hussain and Ali [32]    --                          HMM                  79093 Instances of 5249         97.11%
                                                                         Main Body Classes
                                                                         Document Images from Books      87.44%
Pal and Sarkar [22]     Topological, Contour,       Tree Classifier      3050 Characters                 97.8%
                        Water Reservoir
Javed [35]              Region-Based                HMM                  2898 Characters                 95.76%
                                                                         1692 Ligatures                  92.73%
Iqbal et al. [71]       --                          TM                   --                              --
Ahmed et al. [40]       Raw Pixels                  BLSTM                UNLV-ISRI for Latin;            99.17%;
                                                                         Urdu-Jang and UCOM for Urdu     88.94% and 88.79%
Ul-Hasan et al. [26]    Raw Pixels                  BLSTM                UPTI                            94.85%
Patel and Thakkar [38]  Raw Pixels                  Multi-dimensional    UPTI                            94.6%
                                                    BLSTM and ANFIS
Naz et al. [10]         Statistical Features        MDLSTM               UPTI                            96.40%
Naz et al. [41]         Raw Pixels                  MDLSTM               UPTI                            98.12%
Naz et al. [39]         Statistical Features        MDLSTM               UPTI                            94.97%
Khan and Nagar [72]     SOM-Based Invariant         K-SOM                200 training and 800            79.9% (first choice),
                        Features                                         testing samples                 98.5% (top three choices)
Classifier* FFNN: Feed Forward Neural Network; HMM: Hidden Markov Model; TM: Template Matching; K-SOM: Kohonen Self-Organizing Map; BLSTM: Bidirectional Long Short-Term Memory; MDLSTM: Multi-Dimensional Long Short-Term Memory

Ligature based OCR systems have gained immense popularity recently. Javed and Hussain [81] proposed a Hidden Markov Model with a rule-based post-processor that achieved an accuracy of 92.73% for printed and then scanned Urdu document images at font size 36. In [43], a Hidden Markov Model (HMM) was used to train the system; it was evaluated on 2000 frequently occurring Urdu ligatures and obtained a recognition rate of 97.93%. Similarly, [6] used HMMs for recognition of Urdu ligatures; a total of 3655 ligatures were tested and 3375 were accurately identified, giving an overall accuracy of 92%. Razzak et al. [86] used a hybrid HMM and fuzzy logic classification technique for large and complex data recognition; the proposed OCR system was evaluated on 1800 ligatures and obtained accuracies of 87.6% and 74.1% for Nastalique and Naskh, respectively. Whereas, [45] presented an OCR system that relied heavily on statistical features and employed Hidden Markov Models for classification. A total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Image (UPTI) database were considered in the study, with a separate HMM trained for each ligature; the system gave an overall ligature recognition rate of 92%. Tree-based classifiers have also been tested for word and ligature based recognition systems. Chanda and Pal [73] applied a binary tree classifier to 8011 words, of which 3210 were Urdu, 2738 English and 2063 Devanagari. The overall accuracy of the system was 97.51%, with the highest accuracy, 98.09%, reported for Urdu script; the confusion rate of the proposed system was 0.78%. In [80], nearest neighbour and random forest classifiers were used to evaluate the system. First, spectral hashing was carried out and an accuracy of 81.5% was measured. Second, a random forest classifier was used for identification of different ligature types, such as one-character, two-character, and three-or-more-character ligatures; an accuracy of 98% was achieved using the random forest for single-character ligatures. Likewise, [85] used a weighted linear classifier for training and classification; the proposed concept was integrated into an application for aiding people in learning the Urdu language. Five samples of 38 Urdu characters were used for training the system; an accuracy of 92% was obtained for native Urdu writers, while an accuracy of 73% was reported for non-native Urdu writers. k-NN, SVM and correlation methods have also given high accuracies for text recognition. In [83], SVM and k-NN classifiers were used to train on and recognize 11,000 Urdu ligatures; an overall accuracy of 90.29% was reported for Urdu text images. Sabbour and Shafait [33] evaluated the performance of their system for both Urdu and Arabic script. After segmentation, the unknown ligatures from the dataset were classified in the training phase

using the k-Nearest Neighbour (k-NN) classifier. The performance of the system was 91% for clean Urdu text and 86% for Arabic text. However, [5] used the k-NN algorithm for feature matching, applying the Euclidean distance with 10 nearest neighbors; a total of five features were matched independently by the k-NN. Overall, the system gave a recognition accuracy of 97.12% on different printed and handwritten documents of different fonts and scripts. Whereas, [74] proposed a correlation-based algorithm for printed Nastalique text; experiments were carried out on a small subset of text, and overall the recognition results obtained were very encouraging. Nazir and Javed [82] proposed a methodology for the processing and recognition of diacritic-based Nastalique Urdu script. The system was capable of recognizing invariant cursive text at font size 48; overall, an accuracy of 97.40% was reported for the proposed new correlation-method-based technique for recognition of 6728 main Urdu ligatures. Husain [84] used 200 carefully selected ligatures to train a back-propagation neural network and obtained an accuracy of 100%; a total of 34 features were fed into the neural network, which had 34 inputs, 65 hidden neurons and 45 output neurons. Correspondingly, [11] presented the design of an online Urdu handwriting recognition system that used a BPNN for the classification of every stroke into its respective class. A total of 850 single-character, two-character and three-character ligatures, together forming approximately 50,000 words, were fed to the BPNN for classification; a recognition rate of 93% was reported for base ligatures and 98% for secondary strokes. In [91], a stacked denoising autoencoder was used for automatic feature extraction from the raw pixel values of ligature images. Different stacked denoising autoencoders were trained on 178,573 ligatures belonging to a total of 3732 classes, using the un-degraded (noise-free) UPTI (Urdu Printed Text Image) dataset for training and testing; the overall recognition accuracy of the system was in the range of 93% to 96%. Ahmad et al. [92] proposed a bidirectional long short-term memory (BLSTM) architecture for recognition of Urdu Nastalique sentence images: a gated BLSTM (GBLSTM) model that took raw pixel values as input. The model was trained and tested on the un-degraded version of the UPTI dataset, achieving an accuracy of 96.71%. A summary of the notable contributions for ligature based Urdu OCR is given in Table 3.9.
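The denoising autoencoder layers stacked in [91] can be outlined as follows. This is a compact sketch in PyTorch: the input pixels are randomly masked and the network is trained to reconstruct the clean image, after which the encoders are stacked and a softmax classifier is trained on the top-level code. The layer sizes and noise level are illustrative, not the cited study's configuration.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_pixels=1024, n_hidden=256, noise=0.3):
        super().__init__()
        self.noise = noise
        self.encoder = nn.Sequential(nn.Linear(n_pixels, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_pixels)

    def forward(self, x):                     # x: (batch, n_pixels) in [0, 1]
        # Randomly zero out a fraction of the pixels (the corruption step),
        # then reconstruct the clean input from the corrupted one.
        corrupted = x * (torch.rand_like(x) > self.noise).float()
        return self.decoder(self.encoder(corrupted))
```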

Table 3.9 Summary of Contributions for Ligature Based Urdu OCR

Study                        Features                            Classifier*           Dataset                    Accuracy
Rana and Lehal [83]          DCT, Gabor Filters,                 SVM, k-NN             11,000 Ligatures           90.29%
                             Directional Gradient
Khattak et al. [43]          Projection, Concavity,              HMM                   2,000 Ligatures            97.93%
                             Curvature Information
Shahzad et al. [85]          Rubine Features                     Weighted Linear       5 Samples of 38            92.80%
                                                                 Classifier            Characters
Sabbour and Shafait [33]     Contour                             k-NN                  UPTI 10,000 Ligatures      91%
                                                                                       Arabic 20,000 Ligatures    86%
Chanda and Pal [73]          Contour, Water Reservoir            Binary Tree           3210 Urdu Words            98.09%
                                                                 Classifier
Sattar et al. [74]           --                                  Cross-Correlation     Small Sub-set              --
El-Korashy and Shafait [80]  Contour; Size and Location of the   Nearest Neighbour     Training 20,000            81.5%
                             Dots; Width, Height, Aspect Ratio   Classification;       Testing 18,000
                                                                 Random Forest                                    98%
Javed and Hussain [81]       --                                  HMM                   1692 Ligatures             92.73%
Nazir and Javed [82]         Code of the Mark, Base              Correlation Method    6728 Ligatures             97.40%
                             and the Diacritics
Husain [84]                  Solidity, Number of Holes,          Feed Forward Back     200 Ligatures              100%
                             Eccentricity, Moments, Normalized   Propagation Neural
                             Segment Length, Curvature, Ratio    Network
                             of Bounding Box Width and Height,
                             Axis Ratio
Javed et al. [6]             '0' & '1'                           HMM                   3655 Ligatures             92%
Razzak et al. [86]           Statistical, Structural             HMM, Fuzzy Logic      1800 Ligatures             87.6% (Nastalique)
                                                                                                                  74.1% (Naskh)
Husain et al. [11]           Stroke Coordinates, Chain Code      BPNN                  850 Ligatures              93%
                             and Unique Features
Sardar and Wahab [5]         Ratio Between White/Black           k-NN                  Handwritten and            97.12%
                             Pixels, Hu Invariant Moment                               Printed Text
Ahmad et al. [91]            Stacked Autoencoder Features        SoftMax               UPTI                       93% to 96%
                             from Raw Pixels
Ahmad et al. [92]            Raw Pixels                          GBLSTM                UPTI                       96.71%
Din et al. [45]              Statistical                         HMM                   UPTI 6187 Ligatures        92%
Classifier* HMM: Hidden Markov Model; BPNN: Back Propagation Neural Network; k-NN: k-Nearest Neighbour; SVM: Support Vector Machine; GBLSTM: Gated Bidirectional Long Short-Term Memory

3.4 Related Work for Genetic Algorithms

Bio-inspired genetic algorithms are widely used for feature optimization problems, finding an optimal solution from the problem space in the least time [93, 94]. Recently, the genetic algorithm has gained popularity for pattern recognition tasks such as optical character recognition [95]. The genetic algorithm has numerous other applications and has been successfully used for license plate recognition [96], intrusion detection systems [97, 98] and segmentation [99, 100]. Genetic algorithms have also been used for Arabic optical character recognition systems. The authors in [101] proposed a system for recognizing online cursive Arabic handwriting using a genetic algorithm. A fuzzy neural network was used for recognition, whereas the genetic algorithm was used to select the best combination of characters after recognition. A feature extraction method using an enhanced genetic algorithm was proposed by Abandah and Anssari [102] for Arabic text recognition. It was observed that selecting a subset of features from the character's main body and its secondaries significantly improved the accuracy. High accuracy was reported using the SVM classifier after selecting the most efficient features from the feature vector. A genetic algorithm and visual coding were also used for online handwriting recognition of Arabic script [103]. The main contribution of the research was the conception of an encoding system and a fitness function; the evaluation function was developed using visual indices similarity. Words from the Arabic dataset LMCA (French for Lettres, Mots et Chiffres Arabe), developed by 24 participants from the same laboratory, were used for evaluation. The final results obtained were very promising, indicating that the proposed GA method was very effective. Similarly, [104] obtained a high recognition rate of 95% for isolated Arabic characters. The unknown character was read from a file and numerous operations were performed on it to make the final recognition effective. Likewise, Alimi [105] used the genetic algorithm to select the best combination of characters for online handwriting recognition, while a Beta neuro-fuzzy system was used for recognition of the characters. In [106], an approach was presented for offline Arabic writer identification using a combination of structural and global features. A genetic algorithm was used for feature subset selection, whereas Support Vector Machines (SVM) and a multilayer perceptron (MLP) were used as classifiers. The experiments were carried out on a database of 120 samples, achieving 94% accuracy for the MLP. Dealing with a large number of features in OCR systems is not unusual and may increase the computational load of the recognition process [107]. For reducing the unnecessary and redundant features in the recognition process for Farsi script, a genetic algorithm was employed. Lower computational complexity and enhanced recognition rates

were reported for the optical recognition system. Kala et al. [108] proposed a method where graphs were generated for each of the 26 capital letters of the English alphabet. These graphs were then intermixed to generate new styles using the genetic algorithm. The final recognition was carried out by matching the graph generated from an unknown character image against the graphs generated by the mixing process. An accuracy of 98.44% was achieved for the proposed method. A summary of the notable contributions for genetic algorithm based OCR systems is given in Table 3.10.

Table 3.10 Summary of Contributions for Genetic Algorithm Based OCR Systems

Study | Text | Features | Recognition | Dataset | Accuracy
Alimi [101] | Handwritten Arabic | -- | Optimization: GA; Recognition: Fuzzy Neural Network | 2,000 characters | 89%
Abandah and Anssari [102] | Arabic | Zernike Moments | Optimization: GA; Recognition: SVM | -- | 90%
Kherallah et al. [103] | Arabic | -- | Visual Encoding and Genetic Algorithm | LMCA | --
Abed et al. [104] | Isolated Arabic | Corner Features | Genetic Algorithm | -- | 95%
Alimi [105] | Handwritten Arabic | -- | Optimization: GA; Recognition: Hierarchical Beta Neuro-Fuzzy System (BNFS) | -- | 89.5%
Gazzah and Amara [106] | Handwritten Arabic | Structural and Global Features | Optimization: Genetic Algorithm; Recognition: SVM and MLP | 120 samples | 94%
Soryani and Rafat [107] | Printed Farsi | Characteristic Loci Approach | Genetic Algorithm | 1,080 images | 88.89%
Kala et al. [108] | Handwritten English | Graphs | Genetic Algorithm | 26 alphabets | 98.44%

3.5 Urdu Digit Recognition Systems

A lot of research work has been done on the recognition of numerals in different languages such as English and Chinese. However, very limited research has been carried out on the recognition of Farsi, Arabic and Urdu numerals. Urdu numerals are composed of different curves and line segments and are written using the old Arabic script. The numerals for Urdu are similar to those of the Farsi script; however, the old Arabic numeral form is commonly used for writing instead of the Urdu numeral form. Ansari and Borse [109] proposed an OCR system for handwritten Urdu digits. In the pre-processing stage, different processes were applied to the Urdu digit images, such as grey-level image conversion, image thresholding, median filtering and image normalization (64×64). For feature extraction, different types of Daubechies wavelet transforms and zonal densities from different zones of the images were used. A Back Propagation Neural Network was used for classification of the digits. The proposed system was tested on 200 samples of each digit, a total of 2,150 samples, and achieved a recognition accuracy of 92.07% on average. Sample training and testing digits are shown in Figure 3.4.

Figure 3.4 Digit Training and Testing Sample [109]

In [110], a handwritten Urdu numeral recognition technique was presented based on Zernike invariants and SVM as a classifier. Zernike moment invariants were used for feature extraction; overall, 22 features were extracted from each numeral. A Support Vector Machine (SVM) was used for classification, achieving a success rate of 96.29%. The problem of offline handwritten numerals was addressed by Uddin et al. [111]. A novel approach based on Non-Negative Matrix Factorization (NMF) was proposed in the research. A two-page form was developed to get the input of Urdu numerals from a variety of people. Numerals from the first page of the form were used for training, while numerals from the second page were used for testing. The pre-processing stage involved various steps: noise removal, locating the rectangular boxes, isolating the numerals from the form images, padding the numerals with appropriate margins to preserve their orientation, and resizing the numeral images to 175×175. Overall, the system achieved an accuracy of 86% for nearly 1,600 pages. A summary of the notable contributions for numeral recognition is given in Table 3.11.

Table 3.11 Summary of Urdu Numeral Recognition Systems

Study | Features | Classification | Dataset | Accuracy
Ansari and Borse [109] | Daubechies Wavelet Transforms, Zonal Densities | Back Propagation Neural Network | 2,150 samples | 92.07%
Kaushal et al. [110] | Zernike Moments | SVM | 700 samples | 96%
Uddin et al. [111] | Non-Negative Matrix Factorization (NMF) | L2-Norm | 1,600 pages | 86%

3.6 Discussion of Literature Review

Overviewing and analyzing the previous literature, the following important observations are identified.
• Offline Urdu character recognition systems are extremely complex, and there is a need to develop a fully functional OCR system to deal with the standard Nastalique calligraphic style of written Urdu text. The existing offline OCR systems are mostly capable of recognizing only isolated, non-cursive Urdu text.
• The image acquisition phase is highly dependent on the type of OCR system being used. Since there are very few online recognition systems for Urdu script, very few digitizing tablets have been used for input.
• In the pre-processing phase, thresholding is one of the most frequently used methods.
• Segmentation-free, i.e. holistic, systems are very scarce. Developing such systems may eliminate the segmentation challenges, both explicit and implicit, associated with the cursive script.
• When developing a recognition system to deal with ligatures, it is more appropriate to extract features that are not concerned with the structure of the text, since there are a lot of structural variations.
• Due to the large number of ligatures in Urdu text, it is more appropriate to perform clustering to identify the appropriate number of textual classes.

3.7 Open Problems and Future Directions

The existing literature has been extensively reviewed, leading to the identification of several open problems. The possible solutions to tackle each problem are also briefly discussed in this section. The following open problems have been unveiled for Urdu optical character recognition.
1. Test Datasets: The existing studies for Urdu Nastalique have been trained and tested on small datasets.
2. Labeling: How should characters be labeled when they take different shapes at different positions, depending on their neighboring characters?
3. Segmentation: Urdu script is highly cursive in nature. How can a text recognition system be made capable of performing segmentation without affecting the original text?
4. Feature Extraction: What type of features can be used for effective ligature recognition?
5. Classification: How can it be decided which algorithm works best for ligature based recognition systems?

Most of the existing Urdu Nastalique based recognition systems have been trained and tested on small datasets. Standard datasets need to be created, maintained and used to evaluate all research studies in the future. Labeling of data is an important pre-requisite for supervised learning. It is an extremely intensive job to manually label characters for a large dataset due to the variations of a character's shape within the ligature or word. As a solution, labeling at the ligature level may be adopted. Another challenge usually faced when developing a recognition system is that of text segmentation. Urdu text is written in the Nastalique calligraphic style, which is extremely cursive. This cursiveness introduces several problems when dealing with analytical, segmentation based recognition systems. This problem can be tackled by shifting the focus towards segmentation-free, i.e. ligature based, recognition systems. Automated feature learning is sometimes an extremely time and space consuming process; instead, hand-crafted manual features can be used, since they deliver comparable results with less space consumption. Many of the prevalent studies for character based recognition systems rely heavily on hand-engineered feature extraction methods. In most studies, structural hand-engineered features are extracted that require processing the overall structure of the character and are hence inappropriate for ligature based recognition systems, since ligatures are far more numerous than characters. Instead, geometrical or statistical features can be extracted easily without requiring any knowledge of the structural information of the characters within the ligature.

3.8 Summary

In chapter 3, “Literature Review”, the literature for Urdu optical character recognition has been reviewed thoroughly to identify the knowledge gaps and to outline future research and development in the field. Overall, the literature reviewed has been divided based on the different categories and phases of an OCR system, i.e. image acquisition, pre-processing, segmentation, feature extraction, classification and recognition. Various open problems and challenges have also been addressed. In this exhaustive literature study, it has been found that this research area is relatively new and wide open for research and development. The effective implementation of an Urdu OCR system can lead to numerous applications in the future.


The extreme challenges of the Urdu Nastalique script, such as cursiveness, context sensitivity, diagonality, inter-ligature and intra-ligature overlapping, little horizontal spacing and a large number of diacritics, have delayed technological advancement and development in Urdu text recognition. However, there is a dire need for a recognition system that can convert historical Urdu handwritten/printed textbooks and works of literature into digital format. In this chapter, the proposed methodology for Urdu ligature recognition using genetic algorithm based hierarchical clustering is presented in detail. The different phases of the proposed ligature recognition system, such as pre-processing, ligature segmentation, feature extraction, hierarchical clustering, classification rules, GA optimization and recognition, are addressed and all necessary details are provided accordingly.

4.1 System Overview

Usually, a text document is composed of characters and words; however, for Urdu script there is an added sub-word unit known as the ligature. Word recognition systems are known as intelligent character recognition systems, since whole words are recognized instead of characters. These word level recognition systems are superior to character level recognition systems since they omit the intensive step of character segmentation. Similarly, instead of extracting structural features from segmented characters, simple geometrical and statistical features can be extracted from the ligature images. Currently, most Urdu OCR systems have been developed using analytical recognition, requiring intensive segmentation. Most of the existing studies for ligature recognition using the holistic approach have worked with limited datasets, extracted a large number of features or features that are extremely difficult to process, and have low recognition rates. In this research, a system is presented for the recognition of offline printed Urdu Nastalique script. The proposed system uses a modified genetic algorithm based hierarchical clustering approach for the recognition of Urdu ligatures. The proposed OCR system uses the ground truth data from the UPTI dataset [33] corresponding to each text line image and its subsequent ligature images. The total number of classes generated from the dataset, i.e. 3645, is equivalent to the total number of unique ligatures in the UPTI dataset. In the initial stages, a text line image is read, pre-processed and segmented into ligatures. A holistic segmentation approach is used to segment Urdu text lines into individual ligatures by using a novel Connected Component Labeling (CCL) method. The connected components having fewer than 4 pixels are considered as noise and removed. Some ligature segmentation methods do exist, but most of them cannot deal with the cursive nature of Urdu script and use many heuristics to handle the ligature overlapping issues. The existing methods use baseline information, which adds complexity to the segmentation process; in the proposed system, however, a novel baseline independent ligature segmentation method is used. It only uses vertical overlap analysis for connected component processing. After segmentation, 15 hand-engineered geometric and statistical features are extracted from the segmented ligature images. These features are then concatenated to form the final feature vector for each ligature image. The existing automated features have high space complexity, whereas the hand-engineered structural features are inappropriate for ligatures due to the large variations in ligature shapes. In the proposed system, only a handful of geometric and statistical features, i.e. aspect ratio, compactness, density function, horizontal edge intensity, vertical edge intensity, horizontal mean, vertical mean, horizontal variance, vertical variance, horizontal kurtosis, vertical kurtosis and GLCM features (contrast, correlation, energy, homogeneity), are extracted from the segmented ligature images. Following feature extraction, the data points for each of the features are clustered using a hierarchical clustering algorithm in order to reduce the wide distribution of the data points. Classification rules are used to represent the clustered data points and the available ground-truth classes for each of the ligatures. The classification rules are encoded in the form of IF-THEN statements. Finally, the Genetic Algorithm (GA) is used for optimization, and recognition uses a multi-level sorting approach. The hierarchical clustering is optimized and, consequently, the classification rules are optimized. The recognition accuracy of the proposed ligature recognition system is calculated by comparing the predicted information against the known label information, i.e. the ground-truth. A high-level sketch of this pipeline is given below; the overview of the proposed system is given in Figure 4.1.
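To make the flow concrete, the following is a high-level sketch of the pipeline in Python. The stage functions named here (preprocess_line_image, segment_ligatures, extract_features, assign_clusters, classify) are illustrative stand-ins for the procedures detailed in Sections 4.2 to 4.7; they are assumptions for the sketch, not the thesis implementation.

# Hypothetical top-level driver; each helper is sketched in later sections.
def recognize_line(gray_line_image, rules):
    binary = preprocess_line_image(gray_line_image)   # Section 4.2: Otsu + noise removal
    ligatures = segment_ligatures(binary)             # Section 4.3: CCL-based segmentation
    predictions = []
    for ligature in ligatures:
        fv = extract_features(ligature)               # Section 4.4: 15-element feature vector
        fv = assign_clusters(fv)                      # Section 4.5: map values to cluster ids
        predictions.append(classify(rules, fv))       # Sections 4.6/4.7: IF-THEN rule match
    return predictions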


Figure 4.1 Overview of Proposed Urdu OCR System

4.2 Pre-Processing

In the proposed research, the text line images are read from the dataset and fed to the OCR system. The pre-processing phase is of extreme significance for the later phases of feature extraction and classification. First, in the pre-processing stage, the text line images are converted into bi-level binary images using image thresholding. Next, the text lines are pre-processed and all connected components having fewer than 4 pixels are removed from the text line images. For image thresholding, global thresholding is applied to the text line images using Otsu's method as given in [112]. Otsu's thresholding method iterates through all possible values of the threshold. For each candidate threshold, it considers the pixels on either side, those that fall in the background as well as those that fall in the foreground. The Otsu method is based on four basic steps. First, calculate the histogram and the probabilities of each intensity level. Next, set up the initial class probabilities and initial class means. Third, step through all possible thresholds up to the maximum intensity, updating the class probabilities and class means and computing the inter-class variance at each step. Last, the preferred threshold value corresponds to the maximum value of the inter-class variance. The original line image and its thresholded version are shown in Figure 4.2.
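A minimal sketch of this pre-processing stage is shown below, assuming scikit-image is available; the function name and the dark-ink-on-light-background assumption are illustrative choices, not taken from the thesis code.

import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects

def preprocess_line_image(gray):
    """Binarize a grayscale text-line image with Otsu's method and drop
    connected components smaller than 4 pixels, which are treated as noise."""
    t = threshold_otsu(gray)                 # global Otsu threshold
    binary = gray < t                        # assume dark ink on a light page
    # remove noise components (< 4 pixels), using 8-connectivity
    return remove_small_objects(binary, min_size=4, connectivity=2)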


Figure 4.2 (a) Original Image from UPTI Dataset [33] (b) Thresholded Image Using Otsu's Method

4.3 Ligature Segmentation

Ligature segmentation is one of the most crucial stages in the development of a holistic Optical Character Recognition (OCR) system for Arabic-like cursive scripts. Incorrect ligature segmentation affects the overall accuracy and recognition rate of an OCR system. In this research, a novel algorithm is proposed for ligature segmentation of Urdu Nastalique text using the Connected Component Labeling (CCL) method. The Nastalique font is extremely cursive and context-sensitive in nature and presents great challenges for the development of an efficient OCR system. The proposed algorithm is divided into four main stages. In the first stage, connected components are identified, labeled and extracted from the Urdu text line images. Next, the labeled connected components are divided into primary or secondary components based on height, width and region features. In the third stage, vertical overlap analysis is used to associate the primary and secondary components to their respective ligatures. In the final stage, the ligatures are segmented. The proposed ligature segmentation algorithm uses the Connected Component Labeling (CCL) method and hence overcomes the problems associated with cursive script segmentation. The suggested algorithm divides a text line into its constituent ligatures. The proposed ligature segmentation methodology has four main steps, as shown in Figure 4.3. Each of these steps is covered by an algorithm; the last step, Segmented Ligatures, has no algorithm and just shows the resultant segmented ligatures. The proposed methodology is covered using five algorithms. Algorithm 1 is the main baseline independent ligature segmentation algorithm; it gathers results from all the sub-algorithms. Algorithms 2 and 3 are sub-parts of Algorithm 1 that deal with connected component feature extraction and connected component association, respectively. Next, Algorithms 4 and 5 are sub-parts of Algorithm 3, responsible for primary component association and secondary component association, respectively.


Figure 4.3 Block Diagram for Proposed Ligature Segmentation Algorithm

The main contribution of the proposed segmentation method is a novel CCL based ligature segmentation algorithm that has been tested on a large dataset of Urdu Nastalique script. The suggested algorithm only uses the height, width and upper region information for the separation of connected components into primary and secondary components. The final association of the primary and secondary components is carried out using vertical overlap analysis. Unlike previous CCL methods, the proposed algorithm does not consider the baseline (the virtual horizontal line on which the primary component's last character may rest) information. Overall, the operation of the proposed algorithm is very effective and efficient.

4.3.1 Connected Component Labeling

For a binary image, all connected components are located and labeled. An 8-neighborhood connectivity is used for locating the connected components. The labeling process is done by scanning the image from left to right and top to bottom, labeling the connected components one by one. Line 4 of Algorithm 1 deals with the processes of connected component extraction and labeling using 8-neighborhood connectivity.
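As a sketch, the same labeling can be reproduced with SciPy's ndimage.label, using a full 3 × 3 structuring element for 8-neighborhood connectivity; this is an assumed equivalent, not the thesis implementation.

import numpy as np
from scipy import ndimage

def label_components(binary):
    """Label connected components of a binary image with 8-connectivity."""
    structure = np.ones((3, 3), dtype=int)          # 8-neighborhood
    labels, n = ndimage.label(binary, structure=structure)
    return labels, n                                # labels[i, j] in 0..n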

4.3.2 Connected Component Feature Extraction and Separation

To extract the features, the smallest rectangular bounding box is constructed that is large enough to enclose the target component completely. The bounding box provides four parameters (Bx, By, width, height) for each of the connected components. Bx and By represent the coordinates of the upper left corner of a bounding box. Lines 5 to 7 of Algorithm 1 deal with the process of bounding box creation for all the components. Line 2 of Algorithm 2 deals with the extraction of the bounding box parameters.


From each connected component, five distinct parameters are extracted, i.e. Height, Width, Upper Region, CXmin and CXmax. The connected components are divided into primary and secondary components based on the cut-off height, cut-off width and upper region features. In vertical overlap analysis, the CXmin and CXmax features are used for the association of the primary components and the secondary components to the constituent ligatures.

Algorithm 2 shows the necessary steps taken and the features extracted from each of the connected components. For each connected component, CXmin refers to the column index at Bx where the connected component stroke begins when the binary image is scanned from left to right. CXmax refers to the column index at (Bx + width) where the connected component stroke ends when the image is scanned from left to right.
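The per-component parameters can be sketched as follows, again assuming SciPy; the dictionary layout and function name are illustrative.

from scipy import ndimage

def component_parameters(labels):
    """Bounding box (Bx, By, width, height) plus CXmin/CXmax per component."""
    params = []
    for rows, cols in ndimage.find_objects(labels):
        Bx, By = cols.start, rows.start              # upper-left corner
        width, height = cols.stop - cols.start, rows.stop - rows.start
        params.append({'Bx': Bx, 'By': By, 'width': width, 'height': height,
                       'CXmin': Bx,                  # column where the stroke begins
                       'CXmax': Bx + width - 1})     # column where the stroke ends
    return params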

4.3.3 Connected Component Association

Once the connected components have been identified as primary or secondary components, the next step is the association of these components to their constituent ligatures. The association is based on the well-known understanding of ligature composition: each Urdu ligature can have only one primary and one or more secondary components. Hence, using vertical overlap analysis, only one primary component is taken into consideration, the other components are associated as its secondaries, and afterwards the ligature is segmented. One by one, each of the connected components goes through a set of procedures. For each connected component (primary or secondary), the vertical overlap analysis is based on the vertical projection at CXmin and CXmax for finding the overlap for each connected component. Using vertical projection, intersecting components are found. These intersecting components denote the existence of an overlap. The primary and secondary components are checked vertically at their CXmin and CXmax for overlap by any other connected component. The primary emphasis is on using the CXmin feature of the connected components. If a connected component is vertically overlapped by more than one component, then its CXmax feature is processed too. A different set of procedures is observed for primary and secondary components using the vertical overlap. The vertical overlap analysis using the CXmin feature for the connected components of an Urdu text line image is shown in Figure 4.4 (a), and using both CXmin and CXmax in Figure 4.4 (b). Each connected component is checked vertically to see if it is being overlapped by any other connected component. The CXmax feature is used sparingly and is only processed when a secondary component is overlapped by more than one component. A simplified sketch of this association logic is given below.
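The following sketch condenses Algorithms 3 to 5 into a single interval test and omits the separate primary/secondary procedures, so it should be read as an approximation of the described logic rather than the exact algorithms.

def associate_components(primaries, secondaries):
    """Attach each secondary to the primary whose column span overlaps it."""
    ligatures = {i: [p] for i, p in enumerate(primaries)}
    for s in secondaries:
        # primaries whose span [CXmin, CXmax] covers the secondary's CXmin
        hits = [i for i, p in enumerate(primaries)
                if p['CXmin'] <= s['CXmin'] <= p['CXmax']]
        if len(hits) > 1:                    # ambiguous: also test CXmax
            refined = [i for i in hits
                       if primaries[i]['CXmin'] <= s['CXmax'] <= primaries[i]['CXmax']]
            hits = refined or hits
        if hits:
            ligatures[hits[0]].append(s)     # associate with the matched primary
    return list(ligatures.values())          # one list of components per ligature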


Figure 4.4 Vertical Overlap Analysis for Urdu Text-line Images Taken from UPTI Dataset Given in [33] (a) Association Using CXmin (b) Association Using CXmin and CXmax

Algorithm 3 is used for connected component to ligature association. For a binary image, all of its connected components are processed. The outcome of Algorithm 3 is the set of all ligatures for a binary image. The maximum value of the ligature variable reflects the total number of ligatures present in the binary text line image. As a result, each connected component is associated with a constituent ligature.

Algorithms 4 and 5 are sub-parts of Algorithm 3, and both are used for association. There are minor differences between them: Algorithm 4 carries out a set of processes for a primary component, whereas Algorithm 5 carries out a set of processes if the component is secondary. The main output of Algorithm 4 and Algorithm 5 is a component that has been associated with a ligature. Both Algorithm 4 and Algorithm 5 check each component for vertical overlapping.


4.3.4 Segmented Ligatures

Using the association results generated by the previous algorithms, segmentation is carried out by placing the related connected components into the resultant ligature images.

4.4 Hand-Engineered Feature Extraction

In the proposed research, a total of 15 features are extracted from the segmented ligature images. Only two geometric features, i.e. Aspect Ratio and Compactness, are extracted from the ligature images. The statistical features considered for feature extraction are divided into two main types: first-order and second-order statistical features. A total of nine first-order statistical features are extracted from the ligature images, i.e. the Density Function, Horizontal Edge Intensity, Vertical Edge Intensity, Horizontal Mean, Vertical Mean, Horizontal Variance, Vertical Variance, Horizontal Kurtosis and Vertical Kurtosis. Next, four second-order statistical features are extracted from the ligature images, namely Contrast, Correlation, Energy and Homogeneity. The feature vector extracted from each ligature image is given in Table 4.1.

Table 4.1 Feature Vector Selected for Feature Extraction

F.No. | Feature | Detail
GEOMETRIC FEATURES
F1 | Aspect Ratio | The ratio of the width to the height
F2 | Compactness | A function of the perimeter P and the area A
FIRST-ORDER STATISTICAL FEATURES
F3 | Density Function | The ratio of the pixels of the ligature body to the total pixels in the image
F4 | Horizontal Edge Intensity | Sum of the distribution of horizontal edges (1s) in the image
F5 | Vertical Edge Intensity | Sum of the distribution of vertical edges (1s) in the image
F6 | Horizontal Mean | Mean of the horizontal projection elements
F7 | Vertical Mean | Mean of the vertical projection elements
F8 | Horizontal Variance | Histogram width that measures the deviation from the mean of the horizontal projection elements
F9 | Vertical Variance | Histogram width that measures the deviation from the mean of the vertical projection elements
F10 | Horizontal Kurtosis | Measure of the horizontal projection profile sharpness
F11 | Vertical Kurtosis | Measure of the vertical projection profile sharpness
SECOND-ORDER STATISTICAL FEATURES
F12 | Contrast | Measure of the local variations in the gray-level co-occurrence matrix (GLCM)
F13 | Correlation | Measure of the joint probability of occurrence of the specified pixel pairs
F14 | Energy | Measure of the sum of squared elements in the GLCM
F15 | Homogeneity | Measure of the closeness of the distribution of elements in the GLCM to the GLCM diagonal

4.4.1 Geometric Features

Geometric features, also known as shape-based features, are extremely simple and consist of computing the similarity between images based on their geometric qualities. In the proposed ligature based recognition system, two geometric features, i.e. aspect ratio and compactness, are extracted from the segmented ligature images.
(a) Aspect Ratio
The aspect ratio of any geometric shape measures the ratio of its size in different dimensions. In images, the aspect ratio depends on the relationship of the width to the height of the ligature image. Mathematically, the aspect ratio is most often expressed as two numbers separated by a colon, such as W : H, where the width W and the height H are given in the same unit of length. The aspect ratio for each ligature image is calculated using equation (4.1). The aspect ratio computation for a ligature image is shown in Figure 4.5.

AR = \frac{W}{H} \quad (4.1)

Figure 4.5 Calculating Aspect Ratio for Ligature Image

(b) Compactness Measure
The compactness of an image is given as a function of the perimeter P and the area A. The area is the count of the total number of pixels within a ligature image. The area A of a ligature image is calculated as the product of the height and width of the image (see equation (4.2)).

A = H \times W \quad (4.2)

The perimeter of a ligature image is computed as the total number of pixels surrounding the image boundary. The perimeter P of a ligature image is calculated using equation (4.3).

P = 2(H + W) \quad (4.3)

Mathematically, as given in equation (4.4), the compactness is most often expressed as the ratio of the product of the area A, pi (π) and 4 to the square of the perimeter P.

\text{Compactness} = \frac{4 \pi A}{P^2} \quad (4.4)
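Under the definitions above, the two geometric features reduce to a few lines of code; this sketch assumes a NumPy-style image whose shape gives (H, W).

import math

def geometric_features(ligature_img):
    H, W = ligature_img.shape
    aspect_ratio = W / H                              # eq. (4.1)
    area = H * W                                      # eq. (4.2)
    perimeter = 2 * (H + W)                           # eq. (4.3)
    compactness = 4 * math.pi * area / perimeter**2   # eq. (4.4)
    return aspect_ratio, compactness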

4.4.2 First-Order Statistical Features

Statistical features are used to quantify different properties of an entire image by observing the relationships between the gray-level distributions. The distribution of gray-level pixels in a region is used to determine the texture of an image. Generally, the texture can be coarse or fine, irregular or smooth, homogeneous or inhomogeneous. Based on the number of pixels considered at a time, statistical methods are further classified into first-order or higher-order statistics. The primary difference between the two is that first-order statistical features only estimate properties of individual pixel values, whereas higher-order statistical features estimate properties of two or more pixels based on the spatial interaction between the image pixels. Usually, histogram based approaches are used for the horizontal or vertical pixel distribution in an image; a part of the image or all of it is represented as a histogram. The histogram method provides a simple and concise summary of the statistical information contained in the image. Hence, the histogram contains the first-order statistical information of the image or its parts. The common histogram based features used in this research are the edge intensity, mean, variance and kurtosis. Features are extracted from the projection profiles of the ligature images. For the proposed features, a binary image I(N, M) is assumed, where N is the total number of rows (the height of the image) and M is the total number of columns (the width of the image). The projection profiles are calculated for each ligature image. The horizontal projection profile is the sum of the pixel distribution calculated along the rows of a ligature image, as given in equation (4.5) and shown in Figure 4.6.

P_H[i] = \sum_{j=1}^{M} I(i, j) \quad (4.5)

84

Figure 4.6 Horizontal Projection Profile Computed for a Ligature

Similarly, the vertical projection profile is the sum of the pixel distribution calculated along the columns of a ligature image, as given in equation (4.6) and shown in Figure 4.7.

P_V[j] = \sum_{i=1}^{N} I(i, j) \quad (4.6)

Figure 4.7 Vertical Projection Profile Computed for a Ligature

(a) Density Function
The density function gives the ratio of the total number of pixels covered by the ligature body to the total number of pixels in the image. To find the pixels occupied by a ligature, the number of pixels with intensity 1 in the ligature image is counted. The total number of pixels in an image is calculated as the area of the image, the product of the image height and width, A = N × M, where N represents the height of the image (the total number of rows) and M represents the width of the image (the total number of columns). The density function can be calculated using either equation (4.7) or equation (4.8).

\text{Density} = \frac{\sum_{i=1}^{N} P_H[i]}{N \times M} \quad (4.7)

\text{Density} = \frac{\sum_{j=1}^{M} P_V[j]}{N \times M} \quad (4.8)
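The projection profiles and the density function can be sketched directly from equations (4.5) to (4.8); note that both density forms give the same value, since each sums every ligature pixel exactly once.

import numpy as np

def projections_and_density(I):
    """I is a binary image (ligature pixels = 1) of shape (N, M)."""
    PH = I.sum(axis=1)                    # eq. (4.5): row sums
    PV = I.sum(axis=0)                    # eq. (4.6): column sums
    N, M = I.shape
    density = PH.sum() / (N * M)          # eqs. (4.7)/(4.8)
    return PH, PV, density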

(b) Horizontal Edge Intensity
To find the horizontal edge intensity, first the edges in the ligature image are located using the Sobel edge detection method. The Sobel method can detect edges in the horizontal as well as the vertical direction. It uses two 3 × 3 kernels, one for edge detection in the horizontal direction and one for the vertical direction. The Sobel edge detection method for the horizontal direction returns an image where the edges are represented using 1s and 0s elsewhere (see Figure 4.8). The Sobel edge detection in the horizontal direction is calculated using equation (4.9). The total sum of the distribution of horizontal edges (1s) in the image is used as a feature, computed using equation (4.10).

S_H = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} * I \quad (4.9)

E_H = \sum S_H \quad (4.10)

Figure 4.8 Horizontal Edge Intensity Computed Using the Sobel Method

(c) Vertical Edge Intensity
The Sobel edge detection method for the vertical direction returns an image where the edges are represented using 1s and 0s elsewhere (see Figure 4.9). The Sobel edge detection in the vertical direction is calculated using equation (4.11). The sum of the distribution of vertical edges (1s) in the image is used as a feature, computed using equation (4.12).

S_V = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * I \quad (4.11)

E_V = \sum S_V \quad (4.12)

Figure 4.9 Vertical Edge Intensity Computed Using the Sobel Method
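A sketch of the two edge-intensity features is given below using SciPy's Sobel filter; counting the nonzero filter responses as the "edges represented using 1s" is one plausible reading of the text, so the exact values may differ from the original implementation.

import numpy as np
from scipy import ndimage

def edge_intensities(I):
    img = I.astype(float)
    horizontal_edges = ndimage.sobel(img, axis=1)   # eq. (4.9), kernel S_H
    vertical_edges = ndimage.sobel(img, axis=0)     # eq. (4.11), kernel S_V
    EH = np.count_nonzero(horizontal_edges)         # eq. (4.10)
    EV = np.count_nonzero(vertical_edges)           # eq. (4.12)
    return EH, EV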

(d) Horizontal Mean
The mean provides the average distribution of all pixels in an image and is hence extremely useful for giving a rough idea of the intensity. The horizontal mean returns the mean of the horizontal projection elements of a ligature image, as shown in Figure 4.10. It is calculated using equation (4.13), where N is the total number of rows in the image.

M_H = \frac{1}{N} \sum_{i=1}^{N} P_H[i] \quad (4.13)

Figure 4.10 Mean Calculated from Horizontal Histogram of Ligature Image

(e) Vertical Mean
The vertical mean returns the mean of the vertical projection elements of a ligature image (see Figure 4.11). It is calculated for an image I using equation (4.14), where M is the total number of columns in the image.

M_V = \frac{1}{M} \sum_{j=1}^{M} P_V[j] \quad (4.14)

Figure 4.11 Mean Calculated from Vertical Histogram of Ligature Image

(f) Horizontal Variance
The variance, given by V_H, is used as a measure of the histogram width and calculates the deviation from the mean of the horizontal projection elements computed for a ligature image (see Figure 4.12). Equation (4.15) is used to compute the horizontal variance of a ligature image.

V_H = \frac{1}{N - 1} \sum_{i=1}^{N} |P_H[i] - M_H|^2 \quad (4.15)


Figure 4.12 Variance VH Calculated from Horizontal Histogram of Ligature Image

(g) Vertical Variance
The variance, given by V_V, is used as a measure of the histogram width and calculates the deviation from the mean of the vertical projection elements calculated for a ligature image (see Figure 4.13). Equation (4.16) is used to find the vertical variance of a ligature image.

V_V = \frac{1}{M - 1} \sum_{j=1}^{M} |P_V[j] - M_V|^2 \quad (4.16)


Figure 4.13 Variance Vv Calculated from Vertical Histogram of Ligature Image

(h) Horizontal Kurtosis
Kurtosis is a measure of the histogram sharpness. It describes the shape of the tail of a histogram and how outlier-prone a distribution is. Horizontal kurtosis measures the peakedness of the distribution of the intensity values around the horizontal mean (see Figure 4.14). The horizontal kurtosis of a ligature image is computed using equation (4.17), where \sigma_H is the standard deviation of the horizontal projection elements.

K_H = \frac{\sum_{i=1}^{N} (P_H[i] - M_H)^4}{N \, \sigma_H^4} \quad (4.17)


Figure 4.14 Kurtosis KH Calculated from Horizontal Histogram of Ligature Image

(i) Vertical Kurtosis
Vertical kurtosis measures the peakedness of the distribution of the intensity values around the vertical mean (see Figure 4.15). The vertical kurtosis of a ligature is computed using equation (4.18), where \sigma_V is the standard deviation of the vertical projection elements.

K_V = \frac{\sum_{j=1}^{M} (P_V[j] - M_V)^4}{M \, \sigma_V^4} \quad (4.18)

Figure 4.15 Kurtosis Kv Calculated from Vertical Histogram of Ligature Image
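The mean, variance and kurtosis of a projection profile follow equations (4.13) to (4.18) directly. The sketch below treats sigma as the standard deviation of the profile, which matches the "Std" marker shown in Figures 4.14 and 4.15 but is an assumption about the exact normalization used in the thesis.

import numpy as np

def profile_statistics(P):
    """First-order statistics of a projection profile P (P_H or P_V)."""
    n = len(P)
    mean = P.sum() / n                                   # eqs. (4.13)/(4.14)
    variance = ((P - mean) ** 2).sum() / (n - 1)         # eqs. (4.15)/(4.16)
    sigma = np.sqrt(((P - mean) ** 2).mean())            # population std of P
    kurtosis = ((P - mean) ** 4).sum() / (n * sigma**4)  # eqs. (4.17)/(4.18)
    return mean, variance, kurtosis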

4.4.3 Second-Order Statistical Features

The first-order statistical features only provide information related to the grey-level distribution in a ligature image; they do not provide any information related to the position of the different grey levels within the image. The second-order statistical features consider the relationship among pixels or groups of pixels. The grey-level co-occurrence matrix (GLCM) is one of the most well-known texture based second-order statistical features. It shows how often each grey level occurs at a pixel located relative to another pixel, providing a measure of the variation in intensity at the pixel of interest. Two pixels are considered at a time by the GLCM texture, i.e. the reference pixel and the neighbor pixel. All pixel information can be extracted from the co-occurrence matrix, which measures the second-order statistics of a ligature image. The co-occurrence matrix is a function of two parameters, i.e. the relative distance between the pixels and their relative orientation (horizontal, diagonal, vertical and anti-diagonal, given by 0°, 45°, 90° and 135°, respectively). The features extracted using

the co-occurrence matrix describe the coarseness, texture and smoothness. Contrast, correlation, energy and homogeneity are a few of these features and are the ones used in this research.
(a) Contrast
Contrast is a measure of the local variations in the grey-level co-occurrence matrix. It measures the intensity contrast between a pixel and its neighbor over the entire ligature image. For a constant image, the returned contrast is 0. The contrast for a ligature image is computed using equation (4.19).

\text{Contrast} = \sum_{i,j} (i - j)^2 \, p(i, j) \quad (4.19)

(b) Correlation
The correlation measures the joint probability of occurrence of the specified pixel pairs. It shows how correlated a pixel is to its neighbor over the entire image. The correlation for a ligature image is computed using equation (4.20).

\text{Correlation} = \sum_{i,j} \frac{(i - \mu_i)(j - \mu_j) \, p(i, j)}{\sigma_i \sigma_j} \quad (4.20)

(c) Energy
Energy is also known as uniformity, uniformity of energy or the angular second moment. It measures the sum of the squared elements in the GLCM. For a constant image, the returned energy is 1. The energy for a ligature image is computed using equation (4.21).

\text{Energy} = \sum_{i,j} p(i, j)^2 \quad (4.21)

(d) Homogeneity
Homogeneity measures the closeness of the distribution of the elements in the GLCM to the GLCM diagonal. A homogeneity of 1 is obtained for a diagonal GLCM. The homogeneity for a ligature image is computed using equation (4.22).

\text{Homogeneity} = \sum_{i,j} \frac{p(i, j)}{1 + |i - j|} \quad (4.22)
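The four GLCM features correspond to standard properties implemented in scikit-image (version 0.19 or later naming is assumed). The sketch below uses a binary ligature image (levels {0, 1}) and the 0° orientation at distance 1, which are illustrative parameter choices rather than the thesis configuration.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(I):
    img = I.astype(np.uint8)                        # two grey levels: 0 and 1
    glcm = graycomatrix(img, distances=[1], angles=[0.0],
                        levels=2, symmetric=True, normed=True)
    return [float(graycoprops(glcm, prop)[0, 0])    # eqs. (4.19)-(4.22)
            for prop in ('contrast', 'correlation', 'energy', 'homogeneity')]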

4.5 Hierarchical Clustering

In the raw data, the ligature properties appear sparse and dissimilar in so many ways that efficient searching and organization of the common data points is prevented. The range of the feature values (data points) varies widely, which is problematic for any machine learning and classification method. The abundance of data points for each feature introduces challenges during the machine learning tasks; specifically, the objective functions might not work properly. Likewise, it is challenging to train the learning algorithms in a feasible amount of time, and the classification accuracy of the model may also be degraded. Hence, reducing the distribution of the total data points for each feature becomes an essential task for developing an efficient and robust learning model. In the proposed research, a hierarchical clustering algorithm is utilized for partitioning and reducing the wide distribution of the data points for each feature (F1 to F15). The proposed clustering process reduces the large number of random data points for each feature. Figure 4.16 shows the initial distribution of the feature data points for each of the features extracted from each segmented ligature of the UPTI dataset. It clearly shows that the distribution of the feature data points for each feature is widespread, giving a high distribution of the data.


Figure 4.16 Initial Data Points Distribution for Each Feature (F1 to F15)

The working of the hierarchical clustering algorithm is extremely simple. For each of the features (F1 to F15), the following steps are observed for clustering. First, all of the data points are taken into consideration. Next, the data points are sorted incrementally, from smallest to largest. Afterwards, the first-order derivative is calculated to find the rate of change between the feature's data points. A change greater than zero reflects data points that may belong to a different cluster; if the change is zero, the data points are similar and may belong to the same cluster. The first-order derivative for a feature is the difference between its adjacent elements and is given as [dp(2) − dp(1), dp(3) − dp(2), dp(4) − dp(3), …, dp(N) − dp(N − 1)], where dp stands for a data point and N is the total number of data points. The change for a single data point is calculated using equation (4.23).

\Delta dp_i = dp_{i+1} - dp_i \quad (4.23)

The total number of data points is reduced to N − 1 after the first-order derivative. Following this, the mean is calculated over only the positive (greater than 0) first-order derivative elements. Next, all first-order derivatives having a value greater than this mean are taken into consideration. Subsequently, the second-order derivative is computed on the resultant elements of the first-order derivative to find the segmentation points for clustering, as given in equation (4.24).

\Delta dp_{i+1} - \Delta dp_i < \text{threshold} \quad (4.24)

A focal data point, \Delta dp_f, is selected, and all data points that fall within the threshold of it are assigned to a cluster. When the threshold limit is crossed, a new segmentation point is identified for the next collection of data points and, likewise, a new focal data point \Delta dp_f is selected. The above steps are repeated until all the first-order derivatives satisfying the mean condition have been processed. For the proposed research, a minimum threshold of 30 is set for the total number of data points in each cluster. For each feature, the total number of clusters as well as the minimum and maximum data point value per cluster is taken into consideration for the final clustering. The proposed algorithm for hierarchical feature clustering is given in Algorithm 6, and a condensed sketch follows below.
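The following is a condensed sketch of this clustering step for a single feature, under one reading of the description above; the exact handling of the focal point and the minimum cluster size follows Algorithm 6 in the thesis and may differ in detail.

import numpy as np

def cluster_feature(data_points, threshold=30):
    dp = np.sort(np.asarray(data_points, dtype=float))   # incremental sort
    d1 = np.diff(dp)                        # first-order derivative, eq. (4.23)
    mean_pos = d1[d1 > 0].mean() if (d1 > 0).any() else 0.0
    candidates = np.where(d1 > mean_pos)[0]              # large jumps only
    boundaries, focal = [], None
    for i in candidates:
        # start a new cluster when the change from the focal point
        # exceeds the threshold, per eq. (4.24)
        if focal is None or d1[i] - focal >= threshold:
            boundaries.append(i + 1)
            focal = d1[i]
    clusters = np.split(dp, boundaries)
    # each cluster is summarized by its min/max data point value
    return [(c.min(), c.max()) for c in clusters if len(c)]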


The distribution of the feature data points after incremental sorting can be seen in Figure 4.17; all the graphs show curves sloping upwards from left to right. The graphical distribution of the change, i.e. the first-order derivative for each of the features extracted from the ligatures, is shown in Figure 4.18. The mean of the positive first-order derivatives for each feature is shown in Figure 4.19; the red line in the graphs of Figure 4.19 represents the mean value.

Figure 4.17 Incremental Distribution of Sorted Data Points for Each Feature


Figure 4.18 First-order Derivative Distribution of Data Points for Each Feature


Figure 4.19 Mean of First-order Derivative Elements for Each Feature

4.6 Data Representation Using Classification Rules

The hierarchical clustering results in clustered data points. Using the clustered data points and the given ground-truth classes, the upper and lower threshold limits of each feature for a given class label can be found. The cluster composition of the boundaries represents certain rules that can be used for decision making; since these rules can be used for decision making, they are called classification rules. These classification rules are not hand-coded, however, but are derived from the hierarchical clustering results. There are numerous techniques to represent the cluster composition of the boundaries (upper and lower limits), such as propositional logic, IF-THEN rules, trees and networks [113]. Regardless of the representation technique, the output depends on the total number of ligatures, the total number of features, the upper/lower threshold limits of the data points and the total number of ground-truth classes. Since this research deals with data having more than one dimension (feature), IF-THEN rules appear to be the best representation. Trees, linear lists and network representations have their own limitations and are unfeasible for representing all the clustered output data here, given the page and size limitations of the thesis. A specialized tree representation of the output clustered data is given in Figure 4.20. The IF-THEN rules reveal the lowest space complexity in comparison to other representations such as direct access and the tree representation. If the number of features is constant and n represents the total number of ligatures, then for direct access of the clustered output the time complexity is O(n) and the space complexity is O(30n). For the tree representation given in Figure 4.20, the time complexity is O(15 × 255) if the maximum cluster value is 255 for each feature; however, the space complexity in the case of the tree representation is much higher. The IF-THEN rule representation used in this research has the lowest space complexity of O(n), while its time complexity is O(15n).


Figure 4.20 Specialized Tree Representation

Since the classification rules are best represented using conditional expressions, each rule is encoded as an IF-THEN statement. Each feature corresponds to a condition and the class column corresponds to the conclusion. The rule is expressed in the following form: IF condition THEN conclusion. The IF part of the rule is known as the rule antecedent and the THEN part of the rule is known as the rule consequent. The antecedent part of the rule here consists of 15 attribute tests, and these tests are logically ANDed. The rule consequent is the class, representing the prediction for each ligature. The rules against the hierarchically clustered data for the proposed research are given as: IF (Aspect Ratio = N) AND (Compactness Measure = N) AND (Density Function = N) AND (Horizontal Edge Intensity = N) AND (Vertical Edge Intensity = N) AND (Horizontal Mean = N) AND (Vertical Mean = N) AND (Horizontal Variance = N) AND (Vertical Variance = N) AND (Horizontal Kurtosis = N) AND (Vertical Kurtosis = N) AND (Contrast = N) AND (Correlation = N) AND (Energy = N) AND (Homogeneity = N) THEN (CLASS = (C1 to C3645)), where N represents a value/range for the feature. As stated earlier, the IF-THEN statements are only used for the representation of the clustered output data. Therefore, as given in this IF-THEN representation, both the condition and the conclusion parts are dynamic and adaptive in nature and subject to change for different datasets, features and classes under consideration. The classification rules keep varying on the basis of the optimality of the hierarchical clustering. An illustrative encoding is sketched below.
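The sketch below encodes one rule as a list of 15 [lower, upper] ranges with a class label; this data structure is an assumption for illustration, not the exact representation used in the thesis.

def rule_matches(rule, feature_vector):
    """rule = {'ranges': [(lo, hi), ...15 entries...], 'class': label}"""
    return all(lo <= v <= hi
               for v, (lo, hi) in zip(feature_vector, rule['ranges']))

def classify(rules, feature_vector):
    for rule in rules:                  # antecedent tests are logically ANDed
        if rule_matches(rule, feature_vector):
            return rule['class']        # consequent: predicted class C1..C3645
    return None                         # no rule fired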

4.7 Optimization and Recognition

The recognition of the ligatures is carried out using a Genetic Algorithm (GA). The proposed GA is used for optimization of the hierarchical clustering and hence for improving the classification rules. A GA finds an optimal solution (or solutions) to a given computational problem. For the proposed research, the computational problem is finding the best sequence in which to process the features, thereby improving the hierarchical clustering and, in turn, generating a set of classification rules that recognize the ligatures with maximum accuracy. The proposed genetic algorithm therefore deals with a permutation problem over the features: the permutation decides the order in which the features are processed in each generation. Based on this sequence of features, the hierarchical clustering is optimized and, consequently, the classification rules are modified and improved.

The basic structure of the proposed GA consists of the initial population, chromosome encoding, parent selection, crossover, mutation, fitness function, survivor selection and the termination process (see Figure 4.21).


Figure 4.21 Block Diagram for Proposed GA Optimization and Recognition

In initialization, an initial population is created. Next, the chromosomes within the population are expressed using a representation. In the fitness function, the ligature recognition accuracy is computed for the hierarchical clustering using the feature permutation under consideration. Subsequently, in parent selection, pairs of chromosomes from the population are selected for the further process of reproduction. Reproduction consists of two operations, crossover and mutation. The resultant mutated offspring are again subjected to the fitness function, where the recognition accuracy is computed using the optimized hierarchically clustered dataset. The reproduction operation is applied to all pairs of chromosomes in the population and repeated for all generations. At the end of each generation, except the first, n survivors are selected, where n is equivalent to the initial population size. The ten best chromosomes from among both the parents and the offspring survive and are used as the new population for the next generation. The GA is terminated when the maximum limit of generations (101) is reached. The proposed genetic algorithm is given in Algorithm 7, and a high-level sketch of the loop follows below.
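In the sketch below, fitness() is assumed to run the hierarchical clustering and rule evaluation for a feature permutation and return the recognition accuracy, while crossover() and mutate() are sketched in Sections 4.7.4 and 4.7.5; all three names are stand-ins.

import random

def run_ga(fitness, pop_size=10, genes=15, generations=101):
    # initial population: random permutations of the feature indices 1..15
    population = [random.sample(range(1, genes + 1), genes)
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for i in range(0, pop_size, 2):                 # sequential pairing
            c1, c2 = crossover(population[i], population[i + 1])
            offspring += [mutate(c1), mutate(c2)]
        pool = population + offspring                   # parents + offspring
        pool.sort(key=fitness, reverse=True)            # fittest first
        population = pool[:pop_size]                    # survivor selection
    return population[0]                                # best feature order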

4.7.1 Population Initialization

The genetic algorithm begins with an initial population, which provides the set of candidate solutions in the current generation. In the proposed research, these candidate solutions are termed chromosomes. The population size is not kept very large, since that could slow down the entire GA process; an extremely small population size is also avoided, to ensure a good mating pool. Hence, an optimal population size is decided using trial and error. The population is given as a two-dimensional array of population size by chromosome size. For the proposed research, the population size is 10 and each chromosome has 15 genes (equivalent to the number of features); hence, each population matrix has a size of 10 × 15 in each generation. Random initialization is used to populate the initial population with completely random solutions.

4.7.2 Chromosome Encoding

In the proposed research, permutation encoding is selected for finding the optimal order in which to process the features for optimization and recognition. For each population, a chromosome stores the information about the access order of the features. Every chromosome is represented as a string of numbers (1 to 15), as shown in Figure 4.22, where each number represents a feature from the feature vector. This representation is then used for the mapping from the representation space to the phenotype space in the fitness function, where the optimization of the hierarchical clustering and the ligature recognition accuracy are computed.

Representation: Numbers 1 to 15
Chromosome 1:  6 3 11 7 14 8 5 15 1 2 4 13 9 10 12
Chromosome 2:  7 1 15 13 2 14 6 10 12 11 4 8 3 9 5
Chromosome 3:  2 10 4 5 15 3 8 12 11 7 1 13 14 6 9
Chromosome 4:  10 9 15 6 13 2 1 11 14 3 8 7 4 5 12
Chromosome 5:  8 9 14 12 2 10 3 7 4 1 13 11 5 15 6
Chromosome 6:  2 3 11 1 6 14 5 9 15 8 4 13 12 7 10
Chromosome 7:  9 2 1 6 11 15 13 10 8 14 3 7 4 12 5
Chromosome 8:  14 7 10 8 2 15 6 12 1 3 4 5 11 9 13
Chromosome 9:  3 7 1 15 13 9 14 8 11 4 10 5 6 12 2
Chromosome 10: 11 13 1 5 14 12 15 8 10 9 4 3 7 6 2

Figure 4.22 Chromosome with Permutation Encoding

4.7.3 Parent Selection

Parents are selected from the population for reproduction using the crossover and mutation operations. A pair of chromosomes, the parent chromosomes, is selected from the breeding pool of the population to generate a pair of offspring. The parent chromosome pairs are selected sequentially from the population, i.e. chromosomes (1, 2), (3, 4), (5, 6), (7, 8) and (9, 10) are paired, which maintains good diversity in future generations. Maintaining good diversity leads to good solutions and to the success of the GA. Figure 4.23 shows the process of parent selection for reproduction.

Figure 4.23 Parents Selection for Proposed Genetic Algorithm

4.7.4 Crossover

Single point crossover is applied during the crossover operation. A single crossover point, i.e. gene 8, is selected as the cutoff. Genes 1 to 7 of the parents are initially copied unchanged into the offspring. Genes 8 to 15 are then scanned one by one for both parents, and their alleles are exchanged with each other to generate the offspring. If the exchanged allele is not repeated in the offspring, it is kept directly. If a repetition is observed, the offspring are scanned for the first instance of the repeated allele and the alleles are swapped accordingly, so that each offspring remains a valid permutation. The overall crossover operation is explained in Algorithm 8.


This crossover operation is repeated for all the parents of a given population in a generation. Figure 4.24 shows the crossover process for a single pair of parents. The crossover point selected is gene 8. For each gene, 8 to 15, the crossover process is applied: the alleles are swapped and placed in the offspring, the offspring are then checked for repeating alleles and, if any are found, the first instances of the repeated alleles in both offspring are swapped. The same process is repeated for the next gene from both parents, and for all genes from the crossover point to the end of the chromosome.
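A sketch of this operator is given below, under stated assumptions: permutation chromosomes stored as NumPy arrays and a fixed 0-based crossover index of 7 (gene 8); the function name crossover is illustrative and this is a reading of Algorithm 8 as described above, not the thesis code itself. Running it on the two parents of Figure 4.24 reproduces the final offspring shown there.

    import numpy as np

    CROSSOVER_POINT = 7  # 0-based index: genes 8..15 are exchanged

    def crossover(p1, p2):
        # Exchange alleles gene by gene after the crossover point; any
        # duplicate introduced in an offspring is repaired by swapping the
        # first instances of the duplicated alleles between both offspring.
        o1, o2 = p1.copy(), p2.copy()
        for g in range(CROSSOVER_POINT, len(o1)):
            o1[g], o2[g] = o2[g], o1[g]                  # swap alleles
            if np.count_nonzero(o1 == o1[g]) > 1:        # duplicate created
                i = int(np.flatnonzero(o1 == o1[g])[0])  # first instance in o1
                j = int(np.flatnonzero(o2 == o2[g])[0])  # first instance in o2
                o1[i], o2[j] = o2[j], o1[i]
        return o1, o2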

Crossover point: gene 8

Parent 1: 6 3 11 7 14 8 5 15 1 2 4 13 9 10 12
Parent 2: 7 1 15 13 2 14 6 10 12 11 4 8 3 9 5

Processing gene 8:
After swap:   Offspring 1: 6 3 11 7 14 8 5 10 1 2 4 13 9 10 12 (repeating allele 10)
              Offspring 2: 7 1 15 13 2 14 6 15 12 11 4 8 3 9 5 (repeating allele 15)
After repair: Offspring 1: 6 3 11 7 14 8 5 15 1 2 4 13 9 10 12
              Offspring 2: 7 1 10 13 2 14 6 15 12 11 4 8 3 9 5

Processing gene 9:
After swap:   Offspring 1: 6 3 11 7 14 8 5 15 12 2 4 13 9 10 12 (repeating allele 12)
              Offspring 2: 7 1 10 13 2 14 6 15 1 11 4 8 3 9 5 (repeating allele 1)
After repair: Offspring 1: 6 3 11 7 14 8 5 15 1 2 4 13 9 10 12
              Offspring 2: 7 12 10 13 2 14 6 15 1 11 4 8 3 9 5

Processing gene 10:
After swap:   Offspring 1: 6 3 11 7 14 8 5 15 1 11 4 13 9 10 12 (repeating allele 11)
              Offspring 2: 7 12 10 13 2 14 6 15 1 2 4 8 3 9 5 (repeating allele 2)
After repair: Offspring 1: 6 3 2 7 14 8 5 15 1 11 4 13 9 10 12
              Offspring 2: 7 12 10 13 11 14 6 15 1 2 4 8 3 9 5

Processing gene 11 (no repetition):
              Offspring 1: 6 3 2 7 14 8 5 15 1 11 4 13 9 10 12
              Offspring 2: 7 12 10 13 11 14 6 15 1 2 4 8 3 9 5

Processing gene 12:
After swap:   Offspring 1: 6 3 2 7 14 8 5 15 1 11 4 8 9 10 12 (repeating allele 8)
              Offspring 2: 7 12 10 13 11 14 6 15 1 2 4 13 3 9 5 (repeating allele 13)
After repair: Offspring 1: 6 3 2 7 14 13 5 15 1 11 4 8 9 10 12
              Offspring 2: 7 12 10 8 11 14 6 15 1 2 4 13 3 9 5

Processing gene 13:
After swap:   Offspring 1: 6 3 2 7 14 13 5 15 1 11 4 8 3 10 12 (repeating allele 3)
              Offspring 2: 7 12 10 8 11 14 6 15 1 2 4 13 9 9 5 (repeating allele 9)
After repair: Offspring 1: 6 9 2 7 14 13 5 15 1 11 4 8 3 10 12
              Offspring 2: 7 12 10 8 11 14 6 15 1 2 4 13 3 9 5

Processing gene 14:
After swap:   Offspring 1: 6 9 2 7 14 13 5 15 1 11 4 8 3 9 12 (repeating allele 9)
              Offspring 2: 7 12 10 8 11 14 6 15 1 2 4 13 3 10 5 (repeating allele 10)
After repair: Offspring 1: 6 10 2 7 14 13 5 15 1 11 4 8 3 9 12
              Offspring 2: 7 12 9 8 11 14 6 15 1 2 4 13 3 10 5

Processing gene 15:
After swap:   Offspring 1: 6 10 2 7 14 13 5 15 1 11 4 8 3 9 5 (repeating allele 5)
              Offspring 2: 7 12 9 8 11 14 6 15 1 2 4 13 3 10 12 (repeating allele 12)
After repair: Offspring 1: 6 10 2 7 14 13 12 15 1 11 4 8 3 9 5
              Offspring 2: 7 5 9 8 11 14 6 15 1 2 4 13 3 10 12

Figure 4.24 Crossover Operation for Proposed GA

4.7.5 Mutation

The mutation operation is performed to maintain and introduce diversity in the new population. It alters genes within the offspring generated by the crossover operation. Random swap mutation is used for each chromosome, with a mutation rate of 0.5 considered during the operation. For each crossover offspring in the population, two random genes are selected and their alleles are exchanged with each other. The steps required to perform the mutation process are given in Algorithm 9.

The mutation operation for a population with 10 offspring (chromosomes) generated from the crossover process is shown in Figure 4.25. First, two random genes are selected from an offspring; next, their alleles are exchanged with each other. For offspring 1, gene 3 and gene 8 are selected, and their alleles 2 and 15 are swapped with each other. The same process is repeated for all the offspring in the population. The resultant population is referred to as the mutated population. The fitness function is evaluated against this mutated population, and the mutated population is also used during the survivor selection process.
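A sketch of this swap mutation follows; mutate is an illustrative name and the random-number generator argument is an assumption made for reproducibility. Swapping two positions keeps the chromosome a valid permutation.

    import numpy as np

    def mutate(chromosome, rng=np.random.default_rng()):
        # Pick two distinct gene positions and exchange their alleles;
        # the chromosome remains a permutation of 1..15.
        c = chromosome.copy()
        i, j = rng.choice(len(c), size=2, replace=False)
        c[i], c[j] = c[j], c[i]
        return c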

Offspring 1:  6 10 2 7 14 13 12 15 1 11 4 8 3 9 5   (exchange 2 and 15)  -> Mutated: 6 10 15 7 14 13 12 2 1 11 4 8 3 9 5
Offspring 2:  7 5 9 8 11 14 6 15 1 2 4 13 3 10 12   (exchange 13 and 10) -> Mutated: 7 5 9 8 11 14 6 15 1 2 4 10 3 13 12
Offspring 3:  2 10 14 6 15 13 1 9 11 3 8 7 4 5 12   (exchange 2 and 7)   -> Mutated: 7 10 14 6 15 13 1 9 11 3 8 2 4 5 12
Offspring 4:  10 12 15 5 7 2 8 4 11 3 1 13 14 6 9   (exchange 13 and 14) -> Mutated: 10 12 15 5 7 2 8 4 11 3 1 14 13 6 9
Offspring 5:  1 15 14 5 2 6 3 9 11 8 4 13 12 7 10   (exchange 14 and 4)  -> Mutated: 1 15 4 5 2 6 3 9 11 8 14 13 12 7 10
Offspring 6:  2 3 13 8 10 14 12 9 7 1 4 11 5 15 6   (exchange 3 and 4)   -> Mutated: 2 4 13 8 10 14 12 9 7 1 3 11 5 15 6
Offspring 7:  12 2 8 6 4 15 5 10 1 14 3 7 11 9 13   (exchange 2 and 4)   -> Mutated: 12 4 8 6 2 15 5 10 1 14 3 7 11 9 13
Offspring 8:  11 13 9 1 2 15 6 10 8 14 3 7 4 12 5   (exchange 15 and 12) -> Mutated: 11 13 9 1 2 12 6 10 8 14 3 7 4 15 5
Offspring 9:  5 12 1 15 13 10 14 8 11 9 4 3 7 6 2   (exchange 8 and 3)   -> Mutated: 5 12 1 15 13 10 14 3 11 9 4 8 7 6 2
Offspring 10: 4 13 1 3 14 6 15 8 11 9 10 5 7 12 2   (exchange 15 and 5)  -> Mutated: 4 13 1 3 14 6 5 8 11 9 10 15 7 12 2

Figure 4.25 Mutation Operation for Proposed GA

4.7.6 Fitness Function

The fitness of each chromosome is calculated. Each chromosome is transformed into the phenotype space, where the features are accessed sequentially as per the order of the alleles in the chromosome. Figure 4.26 shows a single chromosome sequence as a solution. The chromosome encoding representation is translated into the phenotype space, taking the sequence of alleles in the chromosome into consideration. According to the chromosome sequence (6, 10, 15, 7, 14, 13, 12, 2, 1, 11, 4, 8, 3, 9, 5), the features in the phenotype space are selected in the same sequence and are multi-level sorted, in ascending order, during the fitness function optimization process. The features are multi-level sorted such that feature 6, the horizontal variance, is sorted first; feature 10, the contrast, is sorted second; feature 15, the homogeneity, is sorted third; and so on.

Solution: 6 10 15 7 14 13 12 2 1 11 4 8 3 9 5

Multi-level sort sequence (sort rank of each feature F1 to F15): 9 8 13 11 15 1 4 12 14 2 10 7 6 5 3

Ligature dataset columns: Lig, F1 to F15, Class

Figure 4.26 Multi-level Column Sorting Process for a Solution

Sorting the features optimizes the hierarchical clustering and generates an optimized set of classification rules for recognition. The fitness of a chromosome is assessed by its ligature recognition accuracy on a sample of test data. The fitness of each chromosome is found using equation (4.25). The steps taken to calculate the fitness of each solution are given in Algorithm 10.

\begin{equation}
\mathit{fitness} = \frac{\sum_{k}\sum_{i \in C_k}\left(\mathit{pred}_i == \mathit{class}_i\right)}{N_{test}} \times 100 \tag{4.25}
\end{equation}

where $C_k$ is the $k$-th cluster of the sorted test data, $\mathit{pred}_i$ is the majority class assigned to that cluster, $\mathit{class}_i$ is the ground-truth class of sample $i$, and $N_{test}$ is the size of the test sample.

The overall working of the fitness function to find the recognition accuracy is simple and efficient. The dataset in the phenotype space is sorted according to the gene sequence of a chromosome from the population. For a sample of test data, the last allele in the sequence is taken and its feature is located in the phenotype space. Next, for this feature, the differences between adjacent elements are computed, and all the positions where the difference is greater than zero are found. The rows lying between two such adjacent difference positions form a cluster and are iteratively scanned for accuracy: the ground-truth class information is looked up for all those rows and the most repeated class is taken as the cluster's prediction. The True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) results are observed for all of these rows. The accuracy of each chromosome sequence in the population is found by evaluating and summing all the cluster accuracies.
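The sketch below ties the multi-level sort and the cluster scan together, under assumptions: X is an n x 15 feature matrix, y holds the ground-truth class labels, and the chromosome is a 1-based feature ordering; the helper name fitness and the use of np.lexsort are illustrative, not the thesis implementation of Algorithm 10. Note that np.lexsort treats its last key as the primary sort key, so the chromosome sequence is reversed before the call.

    import numpy as np

    def fitness(chromosome, X, y):
        cols = np.asarray(chromosome) - 1              # 0-based column indices
        # Multi-level sort: lexsort uses its LAST key as the primary key,
        # so the chromosome's feature sequence is passed in reverse.
        order = np.lexsort(tuple(X[:, c] for c in cols[::-1]))
        ys = y[order]
        last = X[order, cols[-1]]                      # lowest-level feature
        # Cluster boundaries where adjacent values of that feature differ
        # by more than zero, mirroring the description above.
        breaks = np.flatnonzero(np.diff(last) > 0) + 1
        correct = 0
        for run in np.split(ys, breaks):
            vals, counts = np.unique(run, return_counts=True)
            correct += counts.max()                    # majority-class hits
        return 100.0 * correct / len(y)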

4.7.7 Survivor Selection

The survivor selection process determines which chromosomes will survive into the new population used for the next generation. During survivor selection, it is ensured that the best solutions survive into the next generation. In the proposed study, elitism is used to ensure that the best solutions are selected. For a generation g, if P denotes the parents and O the offspring, elitism generates the new population for generation g+1 from the best chromosomes of P and O, replacing the current population. Elitism ensures that the quality of the solutions obtained by a GA will not decrease from one generation to the next. The process of survivor selection using elitism is shown in Figure 4.27: the best 10 chromosomes from the parents and the offspring are selected as the new population for the next generation.

Parents (P):   63.81 56.55 56.55 67.95 72.01 58.08 61.64 87.95 54.97 77.05
Offspring (O): 49.31 92.37 65.41 57.48 53.47 58.18 66.14 64.2 63.22 60.6

P + O sorted best to worst: 92.37 87.95 77.05 72.01 67.95 66.14 65.41 64.2 63.81 63.22 61.64 60.6 58.18 58.08 57.48 56.55 56.55 54.97 53.47 49.31

Elite solutions (best 10): 92.37 87.95 77.05 72.01 67.95 66.14 65.41 64.2 63.81 63.22

New population and fitness:

C1:  7 5 9 8 11 14 6 15 1 2 4 10 3 13 12   (92.37)
C2:  14 7 10 8 2 15 6 12 1 3 4 5 11 9 13   (87.95)
C3:  11 13 1 5 14 12 15 8 10 9 4 3 7 6 2   (77.05)
C4:  8 9 14 12 2 10 3 7 4 1 13 11 5 15 6   (72.01)
C5:  10 9 15 6 13 2 1 11 14 3 8 7 4 5 12   (67.95)
...
C10: 5 12 1 15 13 10 14 8 11 9 4 3 7 6 2   (63.22)

Figure 4.27 Survivor Selection Using Elitism
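A compact sketch of this elitist replacement follows; select_survivors is an illustrative name, and fitness values are assumed to be precomputed for both parents and offspring.

    import numpy as np

    def select_survivors(parents, offspring, parent_fit, offspring_fit):
        # Pool parents and offspring, then keep the 10 fittest chromosomes
        # as the next generation's population (elitism).
        pool = np.vstack([parents, offspring])
        fit = np.concatenate([parent_fit, offspring_fit])
        best = np.argsort(fit)[::-1][:10]   # indices by descending fitness
        return pool[best], fit[best]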

4.7.8 Termination

The entire genetic algorithm process is terminated when the termination condition is met. In the proposed research, the termination condition is met when the maximum limit of generations, i.e. 101, is reached. At the end of each generation, a set of optimized rules is generated; each chromosome represents a set of optimized rules for ligature recognition.

4.8 Summary

This chapter has provided an overview of the proposed ligature recognition system for printed Urdu script. The proposed system comprises six main stages: pre-processing, segmentation, feature extraction, hierarchical clustering, classification rules and genetic algorithm based optimization/recognition. A novel ligature segmentation algorithm has been presented for separating ligatures from text lines. A total of 15 hand-engineered geometric and statistical features have been presented for extraction from the segmented ligature images. The hierarchical clustering algorithm is applied to the feature dataset to reduce the distribution of data points and generate 3645 classification rules. To optimize the hierarchical clustering and perform ligature recognition, a genetic algorithm architecture has been proposed. In the following chapter, the experiments carried out to evaluate the performance of the proposed ligature recognition system are presented.


This chapter presents the results obtained using the proposed methodology for ligature recognition of printed Urdu script. First, it presents the results of the holistic segmentation algorithm proposed for the OCR system. The proposed ligature segmentation algorithm is tested on the UPTI dataset [33], comprising a total of 10,063 text lines, 189274 ligatures and 3645 classes. Subsequently, the feature extraction results are provided and compared, in terms of space complexity, to other recent feature vectors. Next, the results are reported for hierarchical feature clustering, which reduces the feature data point distribution of the 189003 segmented ligatures; a total of 3645 classification rules are generated for the entire 189003-ligature dataset. Finally, a genetic algorithm is used for optimization and recognition of the ligatures. Currently, no studies exist that have used a genetic algorithm based hierarchical clustering approach for recognition of printed Urdu Nastalique ligatures using hand-engineered features. The recognition accuracy of the proposed genetic algorithm is tested on different samples of test data, through different evolving generations of optimization and improvement.

5.1 Dataset and Ground-Truth

To evaluate the performance of the proposed algorithm, un-degraded text line images from the UPTI (Urdu Printed Text Image) dataset developed by Sabbour and Shafait [33] are considered. UPTI is one of the most widely used benchmark datasets for Urdu Nastalique script based OCR systems. The Nastalique script has been selected because it is the standard script used for Urdu and the most widely used Urdu font; however, the proposed method is applicable to other fonts as well, except handwritten and calligraphic scripts such as Dewani (refer to Figure 2.14). The UPTI dataset consists of a total of 10,063 un-degraded text line images and is frequently used by researchers for evaluating the performance of Urdu Nastalique OCR systems. More details about the UPTI dataset are given in Section 3.1. The ground-truth is an important factor for finding the default number of ligatures present in the entire dataset; these ground-truth results can be used by the proposed algorithm to find the recognition accuracy. As per the ligature splits of the UPTI ground-truth, a total of 189274 ligatures and 339344 connected components are identified within the dataset, and these ligatures are well separated into 3645 classes. Other useful information extracted from the ligature splits of the UPTI ground-truth is shown in Figure 5.1.


Figure 5.1 Ligature Segmentation Results Extracted from Ground-Truth of UPTI Dataset Given In [33]

5.2 Ligature Segmentation Results

The proposed ligature segmentation algorithm is evaluated against the benchmark UPTI dataset of un-degraded ligature sentence images. Strong results are observed across the total of 10,063 text line images. The output of the proposed ligature segmentation algorithm is compared to the segmentation splits of the UPTI ground-truth. Using the proposed algorithm, a total of 189003 ligatures are successfully segmented, giving an accuracy of 99.86%. The ligature segmentation accuracy is evaluated by comparing the total number of ligatures extracted from each line image against the total number of ligatures in the ground-truth of that text line. The proposed algorithm successfully extracted 189186 primary components, giving an accuracy of 99.95% for primary component extraction, and 149782 secondary components were correctly extracted, giving an accuracy of 99.80% for secondary component extraction. A total of 149503 secondary components are correctly associated with 189003 primary components with no confusion. The UPTI dataset's un-degraded ligature images contain some incorrectly connected ligatures, dots and diacritics [44], which reduces the accuracy of the proposed algorithm. The above-mentioned results are summarized in Table 5.1.

Table 5.1 Results for Proposed Ligature Segmentation Algorithm

Component              UPTI Dataset Ground-Truth   Proposed Algorithm   Segmentation Accuracy
Segmented Ligatures    189274                      189003               99.86%
Primary Components     189274                      189186               99.95%
Secondary Components   150070                      149782               99.80%

The segmentation results for an image from the UPTI dataset having a total of 43 ligatures as per the ground-truth are shown in Figure 5.2 (a). Using the proposed algorithm, a total of 72 connected components are detected: 43 are primary components (see Figure 5.2 (b)) and the remaining 29 are secondary components (see Figure 5.2 (c)). The proposed algorithm successfully segments 43 ligatures from the text line; the segmented ligatures are shown in Figure 5.3. All 29 secondary components are successfully associated with the 43 primary components with no confusion.

Figure 5.2 (a) Text Line Image Taken from UPTI Dataset Given In [33] (b) Primary Connected Components Extracted Using Proposed Segmentation Algorithm (c) Secondary Connected Components Extracted Using The Proposed Segmentation Algorithm

Figure 5.3 Ligatures Segmented Using Proposed Algorithm from Un-Degraded Sentence Image ‘560’ Taken from UPTI Dataset Given In [33]

To further demonstrate the algorithm's efficiency, it is evaluated on a sentence image taken from [15] (see Figure 5.4). The proposed algorithm successfully segments the ligatures, giving an accuracy of 100% for the displayed text line image. A total of 38 connected components are extracted, out of which 23 are primary components and 15 are secondary components.

Figure 5.4 Sentence Text Image Taken from [15] and Its Segmented Ligatures Shown In (a) and (b), Respectively, Using The Proposed Algorithm

5.2.1 Comparison to Other Ligature Segmentation Algorithms

Development of a holistic OCR system for Arabic-like cursive scripts is highly dependent on accurate ligature segmentation. In recent years, the holistic segmentation approach has gained popularity, since it provides a more straightforward solution in which a whole word or ligature is recognized at once. The proposed algorithm shows superior performance in comparison to the other Connected Components Labeling (CCL) based segmentation methods for Urdu ligatures [15, 44, 58]. All of the previous holistic CCL segmentation studies have used baseline information. Some studies report that the secondary components do not cross or rest on the baseline. However, it has been found that the baseline may not be an accurate measure for primary and secondary component separation and association: several primary components, such as Alif, may not touch the baseline at all, while some secondary components may cross it. Mostly, the baseline is therefore combined with other features to improve its robustness for connected component separation and association. The proposed ligature segmentation algorithm does not take baseline information into consideration, either for separating connected components into primary and secondary components or for component association. Instead, it uses height, width and the upper region for separating connected components into primary and secondary components. Whereas previous work mostly uses horizontal overlap or centroid-to-centroid distance for associating the secondary components with the primary, the proposed research uses vertical overlap analysis for associating connected components into constituent ligatures. The comparison of the proposed ligature segmentation algorithm to other state-of-the-art algorithms is summarized in Table 5.2.

Table 5.2 Comparison to Other Ligature Segmentation Algorithms

Study                    Baseline Information   Correct/Total Connected Components   Correct/Total Ligatures   Segmentation Accuracy
Lehal [15]               Yes                    42441                                23325/23555               99.02%
Ahmad et al. [44]        Yes                    332492/333825                        189232/189584             99.80%
Javed and Hussain [58]   Yes                    -                                    3436/3655                 94%
Proposed Algorithm       No                     338968/339344                        189003/189274             99.86%

The proposed ligature segmentation algorithm's accuracy is close to that reported by the study in [44]. However, the proposed algorithm is preferable since it is baseline independent. The algorithm in [44] used height, width, centroid and baseline information to decide the primary and secondary components; in the proposed segmentation method, only three features, i.e. height, width and the upper region, are extracted to divide the connected components into primary and secondary components. The algorithm in [44] also used height and baseline information to associate the secondary components with their respective primary components.

Comparatively, the proposed algorithm in this study uses the CXmin and CXmax features, without any baseline processing, for the association of connected components into appropriate ligatures.
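As an illustration only, the following sketch shows how such a CXmin/CXmax overlap test might associate components; the dictionary fields and function names are assumptions, since the thesis does not publish its code, and the greedy first-match assignment is a simplification of the thesis rule.

    def overlaps(primary, secondary):
        # A secondary component overlaps a primary component when their
        # column spans (CXmin..CXmax) intersect.
        return not (secondary["CXmax"] < primary["CXmin"] or
                    secondary["CXmin"] > primary["CXmax"])

    def associate(primaries, secondaries):
        # Attach each secondary component to the first primary component
        # whose column span it overlaps.
        groups = {i: [] for i in range(len(primaries))}
        for s in secondaries:
            for i, p in enumerate(primaries):
                if overlaps(p, s):
                    groups[i].append(s)
                    break
        return groups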

5.3 Feature Extraction Results

The proposed hand-engineered features are standard features. Some of these features have been used in the past; however, those studies were directed towards character-based Urdu recognition systems, as discussed in Section 3.3.4. Most character-based Urdu text recognition systems use implicit segmentation, where the image under consideration may contain only an extremely small part of the character in question. The proposed hand-engineered features are novel in that the combination of geometric and statistical (1st order and 2nd order) features has been used for the first time for extraction from connected Urdu script images, i.e. ligatures. Table 5.3 shows the feature vectors extracted from the first ten ligature images of the first line image of the UPTI dataset.

Table 5.3 Feature Vector Generated for the First Ten Ligatures Extracted from Image '0.png' Taken from the UPTI Dataset

F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15

0.7073 0.7624 0.0191 58 62 1.1098 1.5690 4.9384 7.0215 6.9592 3.E+00 5.E+02 1.E-01 1.E-02 1.E-01

0.6585 0.7522 0.0291 65 69 1.5732 2.3889 8.1736 26.0157 4.0918 5.E+00 4.E+02 -7.E-02 8.E-03 2.E-01

0.5488 0.7188 0.0266 27 64 1.1951 2.1778 2.8010 48.6950 1.9305 9.E+00 5.E+02 3.E-01 1.E-02 1.E-01

1.3537 0.7678 0.0820 283 345 9.0976 6.7207 57.2743 56.7304 3.4403 9.E+00 2.E+03 -2.E-02 1.E-03 7.E-02

0.5488 0.7188 0.0266 27 64 1.1951 2.1778 2.8010 48.6950 1.9305 9.E+00 5.E+02 3.E-01 1.E-02 1.E-01

1.1463 0.7818 0.0485 191 174 4.5610 3.9787 31.0148 21.5049 3.6606 3.E+00 9.E+02 -2.E-02 3.E-03 9.E-02

0.7683 0.7720 0.0403 66 95 2.5366 3.3016 10.8937 42.7302 5.7762 2.E+01 5.E+02 1.E-01 5.E-03 1.E-01

0.5488 0.7188 0.0266 27 64 1.1951 2.1778 2.8010 48.6950 1.9305 9.E+00 5.E+02 3.E-01 1.E-02 1.E-01

0.8171 0.7775 0.0683 157 163 4.5732 5.5970 23.7045 77.1230 2.8162 3.E+00 5.E+02 -4.E-02 3.E-03 1.E-01

0.5488 0.7188 0.0266 27 64 1.1951 2.1778 2.8010 48.6950 1.9305 9.E+00 5.E+02 3.E-01 1.E-02 1.E-01

To evaluate the performance of the proposed feature vector, its space complexity is computed for the entire ligature dataset and compared to some other well-known and standard automated features.

5.3.1 Space Complexity Comparison to Other Feature Vectors

The space complexity is an important measure when memory is a scarce resource and huge datasets are involved. The space complexity of the proposed feature vector is computed against some well-known pixel based [92] and autoencoder [91] features for ligature recognition. In [92], all the segmented ligature images were resized to 60 x 60, 80 x 80 and 90 x 90 dimensions, and the raw pixels of these images were considered as the feature vector. In [91], stacked denoising autoencoder based features were generated from raw image pixels; for the space complexity comparison here, however, a non-stacked, non-denoising autoencoder feature vector is used for normalized ligature images of 60 x 60, 80 x 80 and 90 x 90 dimensions. The space complexity computed for each feature vector is based on the total number of bytes consumed in memory by the feature vectors of all 189003 ligatures (see Figure 5.5). The raw pixels clearly consume the most memory, with approximately 5 GB, 9 GB and 11 GB of storage for the 60 x 60, 80 x 80 and 90 x 90 dimensions respectively. The features extracted through an autoencoder have comparatively lower space complexity, at 865 MB, 1154 MB and 1300 MB. The proposed feature vector occupies only 22 MB for all 189003 ligatures and hence has the lowest space complexity of all.
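The raw-pixel and proposed-feature figures can be reproduced with simple arithmetic, assuming each feature value is stored as an 8-byte double-precision float (an assumption; the thesis does not state the storage type):

    ligatures = 189003
    B = 8  # bytes per value, assuming 64-bit floats

    # proposed 15-value feature vector: roughly 22 MB
    print(round(ligatures * 15 * B / 2**20, 1), "MiB")   # ~21.6

    # raw-pixel feature vectors of [92]: roughly 5, 9 and 11 GB
    for dim in (60, 80, 90):
        print(dim, round(ligatures * dim * dim * B / 2**30, 1), "GiB")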

[Figure 5.5: bar chart of the number of bytes (0 to 1.4E+10) consumed by each feature vector: Raw Pixels (60 x 60, 80 x 80, 90 x 90), Autoencoder Features (60 x 60, 80 x 80, 90 x 90) and Proposed Features (15)]

Figure 5.5 Space Complexity Comparison for Proposed Features, Raw Pixel Features As Given in [92] and Autoencoder Features

Based on the amount of memory space consumed by each of the compared feature vectors, the space complexity in terms of Big-O is given in Table 5.4. The proposed hand-engineered feature vector possesses a constant space complexity of O(1), i.e. the amount of memory required per ligature remains the same for all input ligature images. In contrast, the amount of memory required by the raw pixel and autoencoder feature vectors increases linearly with the input ligature image dimension and is given by O(n).

Table 5.4 Space Complexity Comparison in terms of Big-O to Other Feature Vectors

Feature Vector                                              Space Complexity
Proposed Features (15)                                      O(1)
Raw Pixels (60 x 60, 80 x 80, 90 x 90)                      O(n)
Autoencoder Features (60 x 60, 80 x 80, 90 x 90)            O(n)

5.3.2 Accuracy and Reliability of the Feature Vector

The reliability of the feature vector is demonstrated by the fitness function of the genetic algorithm, where the features are multi-level sorted throughout the generations until an optimal solution is reached. The chromosome for which the optimal solution is reached verifies the reliability of the features, since in each chromosome every allele is a number referring to a feature from the feature vector in the phenotype space. The best sequence of features for the genetic algorithm is given after evaluating the performance measure for each chromosome. An accuracy of 96.72% is achieved for a number of chromosomes, as given in Section 5.5.2. The chromosome for which this accuracy is achieved is represented by a string of numbers. The sequence (14, 9, 5, 7, 11, 13, 4, 3, 2, 8, 1, 12, 10, 6, 15), translated from the representation space to the phenotype space, identifies the features from the feature vector that provide the maximum ligature recognition accuracy using the proposed GA based hierarchical clustering (see Figure 5.6).

Chromosome: 14 9 5 7 11 13 4 3 2 8 1 12 10 6 15 (accuracy: 96.72%)

[Figure 5.6 content: each allele above is labelled with its feature name, covering the vertical and horizontal variance, mean and kurtosis features, aspect ratio, compactness, density function, edge intensity, energy, correlation, contrast and homogeneity]

Figure 5.6 Accuracy and Reliability of the Proposed Feature Vectors

5.4 Hierarchical Clustering and Classification Rules

The proposed hierarchical clustering divides the feature data points into groups, i.e. clusters, for the purpose of improved understanding, summarization and decreasing the wide distribution of the data. The hierarchical feature clustering approach significantly reduces the distribution of data points for all the features (F1 to F15) extracted from the ligature images (see Figure 5.7). Figure 5.7 shows the initial distribution of data points before the hierarchical clustering process; the initial distribution indicates the total number of unique data points (clusters) for each feature, while the reduced distribution indicates the clustered data points after the hierarchical feature clustering process.

[Figure 5.7: bar chart of the number of clusters (0 to 60000) for features F1 to F15, comparing the initial distribution/dimensionality with the reduced distribution/dimensionality]

Figure 5.7 Data Distribution Reduction Using Hierarchical Clustering

The results clearly show that the distribution of each feature was reduced using the hierarchical clustering approach. The proposed clustering algorithm does not require the total number of clusters to be known in advance; however, the minimum threshold for a cluster needs to be set, and the minimum number of data points for each cluster is set to 30. The detailed results of the proposed hierarchical clustering for distribution reduction are given in Table 5.5.

Table 5.5 Results for Data Points Distribution Reduction

Feature   Default Clusters   Proposed Algorithm Clusters   Reduction (%)
F1        3860               346                           91.04
F2        2488               164                           93.41
F3        26221              1619                          93.83
F4        516                17                            96.71
F5        486                20                            95.88
F6        16093              1279                          92.05
F7        3854               397                           89.70
F8        28941              1074                          96.29
F9        4471               375                           91.61
F10       28776              1338                          95.35
F11       4442               341                           92.32
F12       56572              592                           98.95
F13       56983              1251                          97.80
F14       1240               70                            94.35
F15       56284              1077                          98.09

Once all the features are clustered, classification rules are used to represent the clustered data points. The classification rules, expressed as IF-THEN statements, provide a compact representation of the output clustered data. Using the classification rules, each ligature can be associated with a class (1 to 3645). The total number of rules equals the number of classes in the ground-truth information of the UPTI dataset, i.e. 3645. The rules for some of the classes (the first seven and the last seven) are given below.

RULE 1: IF (15 ≤ F1 ≤ 121) AND (16 ≤ F2 ≤ 162) AND (1 ≤ F3 ≤ 58) AND (3 ≤ F4 ≤ 5) AND (F5 = 2) AND (1 ≤ F6 ≤ 64) AND (F7 = 1) AND (34 ≤ F8 ≤ 91) AND (F9 = 6) AND (1153 ≤ F10 ≤ 1338) AND (F11 = 120) AND (41 ≤ F12 ≤ 168) AND (3 ≤ F13 ≤ 1110) AND (F14 = 70) AND (183 ≤ F15 ≤ 1022) THEN Class=1 (ء)

RULE 2: IF (13 ≤ F1 ≤ 217) AND (14 ≤ F2 ≤ 164) AND (71 ≤ F3 ≤ 1239) AND (1 ≤ F4 ≤ 8) AND (2 ≤ F5 ≤ 10) AND (32 ≤ F6 ≤ 963) AND (6 ≤ F7 ≤ 309) AND (11 ≤ F8 ≤ 317) AND (142 ≤ F9 ≤ 355) AND (26 ≤ F10 ≤ 1175) AND (80 ≤ F11 ≤ 333) AND (26 ≤ F12 ≤ 229) AND (30 ≤ F13 ≤ 1167) AND (2 ≤ F14 ≤ 68) AND (77 ≤ F15 ≤ 1059) THEN Class=2 (آ)

RULE 3: IF (50 ≤ F1 ≤ 66) AND (72 ≤ F2 ≤ 105) AND (173 ≤ F3 ≤ 208) AND (F4 = 6) AND (F5 = 10) AND (136 ≤ F6 ≤ 154) AND (F7 = 16) AND (F8 = 34) AND (F9 = 161) AND (48 ≤ F10 ≤ 67) AND (F11 = 310) AND (F12 = 146) AND (419 ≤ F13 ≤ 805) AND (F14 = 64) AND (436 ≤ F15 ≤ 978) THEN Class=3 (أ)

RULE 4: IF (19 ≤ F1 ≤ 117) AND (22 ≤ F2 ≤ 160) AND (64 ≤ F3 ≤ 760) AND (5 ≤ F4 ≤ 8) AND (2 ≤ F5 ≤ 10) AND (69 ≤ F6 ≤ 375) AND (9 ≤ F7 ≤ 59) AND (95 ≤ F8 ≤ 131) AND (73 ≤ F9 ≤ 146) AND (68 ≤ F10 ≤ 708) AND (94 ≤ F11 ≤ 143) AND (60 ≤ F12 ≤ 161) AND (180 ≤ F13 ≤ 1102) AND (49 ≤ F14 ≤ 65) AND (249 ≤ F15 ≤ 1022) THEN Class=4 (ؤ)

RULE 5: IF (1 ≤ F1 ≤ 308) AND (1 ≤ F2 ≤ 164) AND (20 ≤ F3 ≤ 1092) AND (1 ≤ F4 ≤ 8) AND (2 ≤ F5 ≤ 10) AND (2 ≤ F6 ≤ 903) AND (6 ≤ F7 ≤ 210) AND (1 ≤ F8 ≤ 403) AND (9 ≤ F9 ≤ 210) AND (1 ≤ F10 ≤ 1303) AND (33 ≤ F11 ≤ 323) AND (22 ≤ F12 ≤ 482) AND (23 ≤ F13 ≤ 1235) AND (2 ≤ F14 ≤ 68) AND (8 ≤ F15 ≤ 1052) THEN Class=5 (ا)

RULE 6: IF (113 ≤ F1 ≤ 275) AND (144 ≤ F2 ≤ 164) AND (376 ≤ F3 ≤ 1030) AND (F4 = 8) AND (F5 = 10) AND (442 ≤ F6 ≤ 760) AND (F7 = 103) AND (552 ≤ F8 ≤ 790) AND (F9 = 46) AND (808 ≤ F10 ≤ 1315) AND (F11 = 36) AND (146 ≤ F12 ≤ 187) AND (97 ≤ F13 ≤ 1166) AND (F14 = 2) AND (56 ≤ F15 ≤ 761) THEN Class=6 (ب)

RULE 7: IF (71 ≤ F1 ≤ 85) AND (F2 = 133) AND (274 ≤ F3 ≤ 720) AND (5 ≤ F4 ≤ 8) AND (4 ≤ F5 ≤ 10) AND (177 ≤ F6 ≤ 377) AND (21 ≤ F7 ≤ 122) AND (156 ≤ F8 ≤ 275) AND (90 ≤ F9 ≤ 201) AND (301 ≤ F10 ≤ 450) AND (70 ≤ F11 ≤ 79) AND (142 ≤ F12 ≤ 146) AND (655 ≤ F13 ≤ 934) AND (22 ≤ F14 ≤ 63) AND (421 ≤ F15 ≤ 986) THEN Class=7 (ة)

RULE 3639: IF (F1 = 340) AND (F2 = 69) AND (F3 = 1214) AND (F4 = 9) AND (F5 = 11) AND (F6 = 1241) AND (F7 = 261) AND (F8 = 916) AND (F9 = 131) AND (F10 = 103) AND (F11 = 117) AND (F12 = 586) AND (F13 = 667) AND (F14 = 2) AND (F15 = 254) THEN Class=3639 ( ﻮﻨﯿﻤﺴﭩﯿﺑ )

RULE 3640: IF (F1 = 336) AND (F2 = 81) AND (F3 = 853) AND (F4 = 17) AND (F5 = 20) AND (F6 = 1242) AND (F7 = 276) AND (F8 = 575) AND (F9 = 89) AND (F10 = 73) AND (F11 = 46) AND (F12 = 592) AND (F13 = 666) AND (F14 = 2) AND (F15 = 88) THEN Class=3640 ( ﭩﺳ ﯿ ﺲﮑﭩﺴﭩ )

RULE 3641: IF (F1 = 344) AND (F2 = 44) AND (F3 = 1461) AND (F4 = 14) AND (F5 = 18) AND (F6 = 1276) AND (F7 = 318) AND (F8 = 699) AND (F9 = 173) AND (F10 = 97) AND (F11 = 64) AND (F12 = 589) AND (F13 = 666) AND (F14 = 2) AND (F15 = 153) THEN Class=3641 ( ﻮﯿﻨﯿﻄﺴﻠﻓ )

RULE 3642: IF (340 ≤ F1 ≤ 346) AND (17 ≤ F2 ≤ 68) AND (1375 ≤ F3 ≤ 1572) AND (F4 = 17) AND (F5 = 20) AND (1275 ≤ F6 ≤ 1279) AND (F7 = 339) AND (1065 ≤ F8 ≤ 1067) AND (F9 = 127) AND (75 ≤ F10 ≤ 82) AND (F11 = 59) AND (590 ≤ F12 ≤ 592) AND (648 ≤ F13 ≤ 666) AND (F14 = 2) AND (84 ≤ F15 ≤ 210) THEN Class=3642 ( ﭧﻨﻨﯿﭩﻔﯿﻟ )

RULE 3643: IF (F1 = 329) AND (F2 = 97) AND (F3 = 1156) AND (F4 = 17) AND (F5 = 20) AND (F6 = 1275) AND (F7 = 372) AND (F8 = 638) AND (F9 = 122) AND (F10 = 72) AND (F11 = 27) AND (F12 = 592) AND (F13 = 664) AND (F14 = 1) AND (F15 = 84) THEN Class=3643 ( ﻦﺸﯿﮑﯿﻔﯿﻧ )

RULE 3644: IF (337 ≤ F1 ≤ 339) AND (70 ≤ F2 ≤ 74) AND (1240 ≤ F3 ≤ 1254) AND (F4 = 17) AND (F5 = 20) AND (F6 = 1278) AND (F7 = 374) AND (587 ≤ F8 ≤ 591) AND (F9 = 122) AND (79 ≤ F10 ≤ 80) AND (F11 = 27) AND (F12 = 592) AND (666 ≤ F13 ≤ 668) AND (F14 = 1) AND (89 ≤ F15 ≤ 159) THEN Class=3644 ( ﻦﺸﯿﮑﯿﻔﯿﭨ )

RULE 3645: IF (F1 = 346) AND (3 ≤ F2 ≤ 16) AND (1095 ≤ F3 ≤ 1336) AND (F4 = 17) AND (F5 = 20) AND (1277 ≤ F6 ≤ 1279) AND (F7 = 284) AND (970 ≤ F8 ≤ 1006) AND (F9 = 90) AND (83 ≤ F10 ≤ 86) AND (F11 = 109) AND (F12 = 592) AND (665 ≤ F13 ≤ 669) AND (F14 = 2) AND (72 ≤ F15 ≤ 92) THEN Class=3645 ( ﭩﺳ ﯿ ﺒ ﭧﻨﻤﺸﻠ )
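To illustrate how such rules can be evaluated, the sketch below encodes a rule as per-feature intervals, where a point condition (F = v) becomes the interval (v, v); the rule excerpt and the names RULE_3 and matches are illustrative, not the thesis data structures.

    # Illustrative excerpt of RULE 3 as per-feature [low, high] intervals;
    # the remaining features follow the same pattern.
    RULE_3 = {"F1": (50, 66), "F2": (72, 105), "F3": (173, 208),
              "F4": (6, 6), "F5": (10, 10)}

    def matches(rule, features):
        # The IF part of an IF-THEN rule: every feature value must fall
        # inside its interval for the rule's class to be assigned.
        return all(lo <= features[name] <= hi
                   for name, (lo, hi) in rule.items())

    print(matches(RULE_3, {"F1": 60, "F2": 90, "F3": 200,
                           "F4": 6, "F5": 10}))  # True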

5.5 Genetic Algorithm Results

The proposed Urdu ligature based recognition system uses the genetic algorithm for optimization and recognition. Over the course of several generations, the hierarchical clustering is optimized and, thereby, the classification rules are optimized. Existing deep learning based Urdu recognition systems are extremely complex and require high computational power as well as special hardware such as a GPU (Graphical Processing Unit). In contrast, the proposed ligature recognition system is very simple compared to existing deep learning models, and it attains a high ligature recognition rate within a small amount of time. The proposed model is further described as follows. The main idea of the proposed genetic algorithm is derived from natural evolution: numerous chromosomes are generated over multiple generations, through which optimization and recognition are performed. Each chromosome represents a sequence in which to access the features of the feature vector for processing (multi-level sorting) within the hierarchically clustered ligature dataset; each allele represents a feature of the ligature, and the gene location represents the position of that feature in the access sequence. At the end of each generation, the hierarchical clustering is optimized and hence the rules for classification are optimized. The ligature recognition accuracy is found for each chromosome in the population, over a number of generations. Randomly sized samples of test data are used for evaluating the recognition accuracy through the fitness function of the genetic algorithm; the fitness of a chromosome is assessed by its ligature recognition accuracy on a sample of test data.

5.5.1 Algorithm Parameters

Several parameters are taken into consideration when developing the proposed genetic algorithm for optimization and recognition (see Table 5.6).

Table 5.6 Parameters for Genetic Algorithm Model

Parameter                        Value(s)
Number of Generations            101
Population Per Generation        1
Population Size (Chromosomes)    10
Chromosome Size (Genes)          15
Allele Scale                     1 to 15
Chromosome Encoding              Permutation
Parent Selection                 Sequential Order
Crossover Type                   One Point
Mutation Rate                    0.5
Fitness Measure                  Recognition Accuracy
Survivor Selection               Elitism (Parents + Offspring)
Termination Criteria             Maximum Limit of Generations

The ligature recognition accuracy is tested for 101 generations. Each generation has one population, and each population has a total of 10 chromosomes. The encoding of each chromosome is the permutation order in which to access the features for multi-level sorting. After the initial population is generated and its fitness evaluated, pairs of parents are selected sequentially from the population. The one-point crossover method, using gene 8 as the crossover point, is applied to each parent pair, resulting in two offspring. The offspring generated by crossover are then mutated using a mutation rate of 0.5. The fitness function consists of finding the recognition accuracy of each mutated chromosome generated by the mutation process. Before termination, the survivors are selected by finding the 10 best chromosomes among the parents and the offspring using elitism. The proposed ligature recognition genetic algorithm is tested for a total of 101 generations and is terminated when the maximum limit of 101 generations is reached.
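Putting the parameters of Table 5.6 together, a minimal driver loop might look as follows; it reuses the init_population, crossover, mutate, fitness and select_survivors sketches given in Chapter 4 and is an illustration of the described flow, not the thesis code.

    import numpy as np

    def run_ga(X, y, generations=101, rng=np.random.default_rng(0)):
        # Generation 1: random population and its fitness.
        pop = init_population(rng)
        fit = np.array([fitness(c, X, y) for c in pop])
        for _ in range(generations - 1):
            offspring = []
            for i in range(0, len(pop), 2):        # sequential parent pairing
                o1, o2 = crossover(pop[i], pop[i + 1])
                offspring += [mutate(o1, rng), mutate(o2, rng)]
            offspring = np.array(offspring)
            off_fit = np.array([fitness(c, X, y) for c in offspring])
            # Elitism: the best 10 of parents + offspring survive.
            pop, fit = select_survivors(pop, offspring, fit, off_fit)
        return pop[0], fit[0]   # best feature ordering and its accuracy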

5.5.2 Ligature Recognition Accuracy

For each chromosome, the ligature recognition accuracy is calculated after multi-level sorting of the entire ligature dataset and, hence, optimization of the classification rules. Fitness evaluation, i.e. the recognition accuracy of each chromosome, is tested on all the ligatures for which the difference between consecutive cluster data points of the lowest-level feature is greater than zero. Over the generations, the test samples generated for each chromosome are of varying sizes: a total of 1010 random test samples are generated, one for each of the 1010 chromosomes across all the generations. The smallest test sample contains 162 ligatures, while the largest contains 30755 ligatures. The random ligature test datasets used for all the generations are shown in Figure 5.8.


Figure 5.8 Random Test Data to Evaluate Ligature Recognition Accuracy for Each Chromosome

The proposed ligature recognition system performs well from the outset: in the first generation, i.e. the initial population, a maximum ligature recognition accuracy of 87.95% is already reported for one chromosome. The recognition results on the test ligature samples are evaluated for all chromosomes in the population across the generations. The detailed recognition accuracies of all the chromosomes of each population, for all 101 generations, are given in Table 5.7.

Table 5.7 Recognition Accuracy (%) for Each Population

G   C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
1   63.81 56.55 56.55 67.95 72.01 58.08 61.64 87.95 54.97 77.05
2   49.31 92.37 65.41 57.48 53.47 58.18 66.14 64.2 63.22 60.6
3   84.76 92.18 57.91 64.96 57.94 54.8 53.99 58.32 55.09 76.7
4   76.27 92.5 73.98 77.44 74.98 62.47 61.47 65.21 54.8 73.63
5   91.22 92.37 70.25 92.18 77.44 84.76 74.98 66.67 63.97 92.92
6   92.36 92.18 73.75 68.8 85.15 92.09 70.24 92.25 84.76 84.82
7   91.03 92.18 92.13 92.37 92.93 65.6 92.18 92.51 92.18 92.16
8   92.92 81.49 92.21 92.18 92.37 92.37 92.37 92.37 92.18 92.38
9   92.92 92.93 92.37 94.47 92.12 92.51 94.32 92.36 92.37 92.37
10  84.9 81.04 92.93 92.93 92.92 92.85 73.08 85.13 92.22 92.18
11  94.22 80.39 92.93 92.93 92.93 92.93 94.91 62.88 92.92 92.92
12  96.2 89.67 94.45 94.48 92.93 92.93 92.93 92.02 92.93 92.93
13  96.02 96.2 94.17 94.48 94.34 94.45 92.21 94.48 92.93 92.93
14  73.21 96.2 94.91 96.02 94.48 86.59 94.17 94.48 76.2 93.91
15  94.47 96.2 96.02 94.52 94.91 94.69 94.49 94.91 94.48 58.82
16  96.19 96.2 96.2 96.2 96.02 96.02 94.91 96.02 94.58 94.91
17  88.72 96.2 96.2 96.2 96.2 96.2 94.93 96.2 96.02 96.02
18  96.2 96.19 96.2 96.2 96.2 96.2 96.2 94.91 96.2 96.2
19  93.39 84.85 80.78 96.2 96.2 96.2 96.2 96.2 96.2 93.07
20  96.2 75.12 96.2 96.2 94.74 96.2 96.2 96.2 96.2 96.2
21  94.11 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
22  96.2 96.2 96.2 93.99 94.66 96.2 62.15 96.2 96.2 94.66
23  96.2 96.2 96.2 96.21 96.64 96.2 96.2 96.2 96.29 94.36
24  96.2 96.64 96.1 96.21 93.13 96.2 96.21 96.2 96.2 96.2
25  96.64 96.64 96.29 96.1 58.76 96.21 62.15 96.2 96.2 96.21
26  96.34 96.64 96.64 96.64 94.03 96.29 96.21 96.21 96.21 96.2
27  96.64 96.64 96.65 96.64 96.48 96.64 96.34 94.06 96 96.26
28  96.64 96.65 96.64 96.64 59.99 96.51 94.22 96.64 96.64 96.64
29  96.65 96.5 96.64 94.53 96.64 96.64 96.64 96.64 96.65 96.64
30  96.65 96.65 87.97 96.65 96.64 94.94 96.65 96.64 96.64 96.64
31  96.65 96.65 75.37 96.52 96.64 96.65 96.65 96.65 96.64 96.64
32  96.64 81.44 96.65 96.43 94.05 96.65 96.65 96.65 96.65 96.65
33  96.64 96.65 96.65 96.65 96.65 96.65 96.65 96.65 88.01 96.65
34  80.23 94.97 86 96.65 96.65 96.65 96.65 80.23 96.65 96.65
35  96.65 96.65 96.65 96.66 96.65 96.65 96.62 61.02 96.5 96.65
36  94.7 96.66 94.68 96.66 96.66 96.65 96.65 96.65 96.65 96.65
37  81.44 96.66 96.66 96.43 96.66 96.65 96.65 96.66 96.65 96.65
38  96.66 96.51 96.66 96.66 94.68 96.66 96.66 96.66 96.64 96.66
39  96.34 96.66 96.66 96.66 96.66 96.65 96.66 80.23 96.65 96.66
40  96.66 85.96 96.66 96.65 96.66 96.66 96.66 96.66 96.66 96.64
41  96.66 96.66 96.66 96.66 96.66 96.66 96.65 96.66 96.66 96.66

42  96.62 96.66 96.66 94.71 96.54 96.66 96.24 96.66 96.66 94.61
43  96.66 96.65 96.66 96.66 81.31 96.66 82.78 96.66 96.66 96.66
44  96.66 96.66 96.64 94.19 96.54 96.66 96.62 96.66 96.66 96.66
45  96.66 96.66 96.32 96.65 96.42 96.66 96.62 96.66 87.99 96.66
46  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.65 96.65 96.66
47  96.66 96.66 96.66 96.66 96.65 96.66 96.66 96.66 96.66 94.68
48  96.66 96.66 96.66 96.51 96.44 96.66 96.66 96.66 96.65 96.43
49  96.66 96.66 80.23 96.66 96.66 82.95 96.65 96.66 96.66 96.66
50  96.66 96.66 86.72 96.66 96.66 96.66 96.66 96.66 96.66 96.66
51  96.66 96.66 96.65 96.66 96.66 96.51 96.66 96.66 96.48 94.71
52  96.55 96.66 96.66 96.65 96.65 96.66 96.66 96.66 96.35 96.66
53  96.66 80.24 96.66 94.7 96.66 96.65 96.66 96.66 96.66 96.66
54  96.66 96.48 96.66 96.66 94.72 88 96.66 96.66 94.97 96.65
55  96.66 96.66 96.66 96.66 85.96 96.66 96.66 73.63 96.66 94.97
56  96.66 96.66 96.66 96.66 96.42 96.66 96.66 96.66 96.68 96.66
57  94.94 73.63 96.66 96.66 96.66 96.66 96.42 96.71 96.66 94.96
58  96.71 96.68 96.66 96.66 96.42 96.66 96.66 96.66 96.42 86.72
59  62.3 96.71 96.68 96.65 73.45 96.48 96.66 94.74 96.5 94.76
60  83.17 96.71 94.99 96.68 96.68 96.68 96.66 96.35 96.66 96.66
61  83.75 96.71 96.71 76.78 96.68 96.68 88.51 96.68 96.68 96.68
62  96.71 94.74 96.71 96.71 96.71 96.67 96.68 96.68 96.68 80.33
63  83.3 96.71 96.24 96.71 89.2 96.71 96.71 96.71 96.71 96.71
64  76.78 96.69 96.71 96.71 96.64 96.56 74.64 96.71 96.71 96.55
65  96.67 83.17 96.71 96.71 96.71 96.7 96.27 96.67 96.71 96.71
66  96.71 96.71 93.06 96.6 96.71 96.71 96.7 96.57 96.72 96.7
67  96.72 89.49 74.63 96.71 85.21 96.71 96.71 96.71 96.59 96.71
68  96.72 96.67 96.71 96.71 82.61 96.71 96.72 96.71 96.14 96.71
69  96.71 96.71 92.78 96.72 96.71 96.71 96.71 96.71 75.2 96.71
70  96.71 96.72 74.67 96.28 96.72 96.72 96.71 96.71 96.17 96.61
71  96.72 96.42 96.72 96.72 96.72 96.72 96.72 94.59 96.15 96.71
72  96.72 96.72 96.71 96.72 92.82 96.72 96.72 96.72 96.72 84.4
73  96.17 96.72 96.72 96.72 96.72 96.72 96.72 96.72 81.3 96.72
74  96.72 93.92 96.72 96.17 93.06 96.72 74.67 96.71 96.53 96.72
75  96.18 96.72 96.72 96.72 96.72 89.33 96.72 93.06 71.86 96.72
76  96.72 96.68 96.72 96.68 96.72 96.17 96.72 96.72 77.23 96.72
77  96.72 96.72 96.72 96.72 96.72 96.72 96.72 66.32 82.61 96.72
78  96.72 96.5 96.72 96.72 96.72 96.72 96.53 96.72 94.68 96.72
79  96.72 96.72 96.72 96.71 96.72 56 96.72 96.72 96.72 82.85
80  96.72 96.72 96.72 96.7 84.39 76.78 96.65 83.75 96.71 96.72
81  96.45 62.3 96.72 96.72 96.72 96.72 96.72 62.3 96.72 96.44
82  96.72 96.72 96.72 96.72 96.72 96.7 96.72 96.7 96.72 96.72
83  96.72 96.72 96.7 96.72 83.27 96.72 96.72 96.72 96.7 75.2
84  96.72 96.72 96.64 96.72 96.72 96.58 96.72 96.67 96.72 96.7
85  96.72 96.72 96.72 96.72 96.72 89.45 71.86 96.45 66.32 93.92
86  96.72 89.49 96.72 83.15 96.72 85.21 96.72 88.5 96.72 85.21

87  96.72 96.72 96.71 96.72 96.6 86.28 96.72 96.72 96.72 96.72
88  96.17 96.72 96.72 96.72 96.71 89.73 96.72 96.7 96.72 96.72
89  96.72 96.72 96.72 96.72 96.72 96.72 89.49 96.72 76.78 96.72
90  96.42 96.72 96.72 89.45 96.72 96.72 96.72 96.6 74.67 96.72
91  96.72 96.72 96.72 94.68 96.72 96.72 96.72 96.72 96.72 96.72
92  96.72 85.21 96.72 96.72 88.5 96.72 83.27 96.72 93.57 96.72
93  96.72 92.84 96.72 96.7 96.72 96.72 96.72 96.72 96.5 96.72
94  96.14 96.14 96.72 88.39 96.72 96.7 96.72 96.72 96.72 96.65
95  96.72 96.64 96.72 74.67 96.72 84.39 96.72 96.72 96.17 96.72
96  83.75 96.72 96.7 96.72 96.72 96.72 96.72 96.72 96.72 96.72
97  82.85 96.72 96.72 96.72 96.72 96.42 96.72 96.72 96.72 96.72
98  96.72 96.17 96.64 88.39 96.72 96.72 96.59 96.72 96.72 96.72
99  96.72 96.72 96.72 96.72 96.72 96.72 73.53 96.72 96.65 96.72
100 96.72 96.45 96.6 96.72 96.72 96.72 96.72 96.72 96.72 96.17
101 96.72 96.72 96.72 73.53 93.06 96.5 96.67 86.28 96.65 96.7

Survivors are selected at the end of each generation using the fitness (accuracy in %) of the chromosomes, parents as well as offspring. A total of 10 survivors are selected at the end of each generation using the elitism strategy. The results of survivor selection are given in Table 5.8. Note that survivors are generated for a total of 100 generations, since the first generation does not go through the survivor selection process. The results of survivor selection are promising, since the fitter individuals are never discarded and are kept for the next generation.

Table 5.8 Survivors Selected Using Elitism Based on Ligature Recognition Accuracy (%)

G   S1    S2    S3    S4    S5    S6    S7    S8    S9    S10
2   92.37 87.95 77.05 72.01 67.95 66.14 65.41 64.2 63.81 63.22
3   92.37 92.18 87.95 84.76 77.05 76.7 72.01 67.95 66.14 65.41
4   92.5 92.37 92.18 87.95 84.76 77.44 77.05 76.7 76.27 74.98
5   92.92 92.5 92.37 92.37 92.18 92.18 91.22 87.95 84.76 84.76
6   92.92 92.5 92.37 92.37 92.36 92.25 92.18 92.18 92.18 92.09
7   92.93 92.92 92.51 92.5 92.37 92.37 92.37 92.36 92.25 92.18
8   92.93 92.92 92.92 92.51 92.5 92.38 92.37 92.37 92.37 92.37
9   94.47 94.32 92.93 92.93 92.92 92.92 92.92 92.51 92.51 92.5
10  94.47 94.32 92.93 92.93 92.93 92.93 92.92 92.92 92.92 92.92
11  94.91 94.47 94.32 94.22 92.93 92.93 92.93 92.93 92.93 92.93
12  96.2 94.91 94.48 94.47 94.45 94.32 94.22 92.93 92.93 92.93
13  96.2 96.2 96.02 94.91 94.48 94.48 94.48 94.47 94.45 94.45
14  96.2 96.2 96.2 96.02 96.02 94.91 94.91 94.48 94.48 94.48
15  96.2 96.2 96.2 96.2 96.02 96.02 96.02 94.91 94.91 94.91
16  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.19 96.02 96.02

17  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
18  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
19  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
20  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
21  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
22  96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2 96.2
23  96.64 96.29 96.21 96.2 96.2 96.2 96.2 96.2 96.2 96.2
24  96.64 96.64 96.29 96.21 96.21 96.21 96.2 96.2 96.2 96.2
25  96.64 96.64 96.64 96.64 96.29 96.29 96.21 96.21 96.21 96.21
26  96.64 96.64 96.64 96.64 96.64 96.64 96.64 96.34 96.29 96.29
27  96.65 96.64 96.64 96.64 96.64 96.64 96.64 96.64 96.64 96.64
28  96.65 96.65 96.64 96.64 96.64 96.64 96.64 96.64 96.64 96.64
29  96.65 96.65 96.65 96.65 96.64 96.64 96.64 96.64 96.64 96.64
30  96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.64 96.64
31  96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65
32  96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65
33  96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65
34  96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65
35  96.66 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65 96.65
36  96.66 96.66 96.66 96.66 96.65 96.65 96.65 96.65 96.65 96.65
37  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.65 96.65
38  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
39  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
40  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
41  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
42  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
43  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
44  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
45  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
46  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
47  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
48  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
49  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
50  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
51  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
52  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
53  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
54  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
55  96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
56  96.68 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
57  96.71 96.68 96.66 96.66 96.66 96.66 96.66 96.66 96.66 96.66
58  96.71 96.71 96.68 96.68 96.66 96.66 96.66 96.66 96.66 96.66
59  96.71 96.71 96.71 96.68 96.68 96.68 96.66 96.66 96.66 96.66
60  96.71 96.71 96.71 96.71 96.68 96.68 96.68 96.68 96.68 96.68
61  96.71 96.71 96.71 96.71 96.71 96.71 96.68 96.68 96.68 96.68

62  96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
63  96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
64  96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
65  96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
66  96.72 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
67  96.72 96.72 96.71 96.71 96.71 96.71 96.71 96.71 96.71 96.71
68  96.72 96.72 96.72 96.72 96.71 96.71 96.71 96.71 96.71 96.71
69  96.72 96.72 96.72 96.72 96.72 96.71 96.71 96.71 96.71 96.71
70  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.71 96.71
71  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
72  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
73  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
74  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
75  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
76  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
77  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
78  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
79  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
80  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
81  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
82  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
83  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
84  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
85  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
86  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
87  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
88  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
89  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
90  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
91  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
92  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
93  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
94  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
95  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
96  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
97  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
98  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
99  96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
100 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72
101 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72 96.72

The maximum ligature recognition accuracy achieved at the end of each generation is given in Figure 5.9. For the first generation, a maximum recognition accuracy of 87.95% is achieved on a sample of test data. Over the course of the remaining 100 generations, the hierarchical clustering is optimized and hence the recognition accuracy improves. The maximum ligature recognition rate of 96.72% is first obtained in the 66th generation for the highest-ranked survivor (see Figure 5.9). The accuracy at the end of each generation shows that the best chromosomes are always retained, and therefore the recognition rate never decreases.

[Figure 5.9: line plot of recognition accuracy (82 to 98%) against the number of generations (1 to 101)]

Figure 5.9 Maximum Ligature Recognition Accuracy (%) Achieved for Each Generation

5.5.3 Convergence

The best solutions are those that share a common ancestor and whose fitness is nearly identical, both to each other and to the high-fitness solutions of previous generations. The solutions generated by the proposed genetic algorithm based hierarchical clustering approach for Urdu ligature recognition also stabilize after a time and converge towards a common fitness value (see Figure 5.10). The chromosomes in a population become increasingly similar with each generation and hence converge to a common solution: from the 77th generation onwards, the solutions (chromosomes) converge towards a fitness value of 96.72%. Moreover, the proposed system does not suffer from premature convergence, as variability is maintained in the population over a number of generations.

[Figure 5.10: fitness values of the population's chromosomes plotted over generations 10 to 100, converging to a common value]

Figure 5.10 Convergence Towards a Common Solution

5.6 Comparison to Other State-Of-The-Art Classifiers

The primary incentive for using a GA based hierarchical clustering algorithm instead of other state-of-the-art classifiers for ligature recognition is discussed in detail in this section. A theoretical comparison is provided against classifiers such as SVM, HMM, neural networks, random forests, deep learning methods and other clustering algorithms, substantiated by the previous literature, the experiments carried out in this thesis, the size of the dataset (189003 ligatures) and the number of classes (3645).

(a) SVM

SVM is a popular binary classifier; however, a few studies, such as Rana and Lehal [83], have also used it for ligature recognition. SVM involves dense computations and takes long execution times for a large number of classes [83]. The previous studies using SVM on ligatures recommended separating the ligatures into primary and secondary components before the final classification, in order to avoid long execution times and reduce the total number of classes. Further, the primary components need to be subdivided into sub-classes, and instead of a single SVM, multiple small SVMs have to be implemented for the smaller numbers of primary component sub-classes. Similarly, the secondary components need to be divided into classes and classified separately using an SVM. Even with the sub-classes and multiple SVMs, poor execution times have been reported for ligature recognition in the past [83]. In the proposed research, whole ligatures are considered for classification; instead of being separated into primary and secondary components, the components are allocated and recombined during the segmentation process. Using whole ligatures appears to be the more accurate choice for ligature recognition, since the separate classification of primary and secondary components seems incomplete without a proper allocation of the recognized components back into the whole ligature form. Furthermore, in this research, the actual number of ligature classes (3645) extracted from the UPTI dataset is considered, without any sub-class divisions. A single SVM cannot process and classify such a large number of classes, and a multi-class SVM would require training multiple SVM classifiers and combining their outputs (the combinatorial growth is sketched at the end of this section).

(b) HMM

Numerous studies have used HMM for ligature classification and recognition [43, 81, 83, 86, 114]. All of these studies report separating the ligature's main body from the diacritics; features are extracted separately from the main body and the diacritics for successful classification. Similarly, the primary components in a ligature might not be trainable for a higher number of states [79]. In [81], the failure of the HMM-based recognizer on variational ligature images was reported. In [114], each connected body was treated as a separate HMM. Regardless, the HMM technique fails to recognize several ligatures due to the similarities between ligature shapes, which result in identical state transition probabilities [114]. The primary reason for separating ligatures into components is to reduce the total number of ligature classes, which is usually enormously high. However, when ligature components are classified separately, the results seem biased, since whole ligatures have not been considered. As reported in the literature, an HMM approach for the total number of ligature classes used in the proposed research would result in a large number of states. A dataset of 189003 ligatures with 3645 classes has been used; if each class is treated as a separate HMM, as was done in [114], the result would be 3645 HMMs.

(c) Neural Network

One of the biggest problems faced while classifying ligatures is the large number of classes. Neural networks have been used for the recognition of Urdu ligatures in the past, but only with small datasets and few classes. The ligatures are divided into separate components and the diacritics are treated as separate ligatures [11, 84]. Once separated, the components are classified separately using a neural network architecture [11]. Almost all of the reported studies on ligature recognition require this component separation to reduce the total number of classes. However, as already stated, in the proposed research the ligature classes are considered in their original form, so the number of classes is comparatively very large. A neural network is impractical here: a multilayer neural network would require at least 15 neurons at the input layer, while the number of output layer neurons depends on the number of classes, which is very high in our case. This results in a very complex neural network architecture with very low performance for a data volume as large as the one used in the proposed research. During the experiments, a neural network was also tried for comparison, but for such large data its execution time is extremely high. For less data it provides quick results, but those results are not acceptable, because the number of classes is too high and smaller data does not cover all the classes under consideration.

(d) Random Forest

Random forest is a state-of-the-art classifier; however, it can handle only up to a few hundred classes [80]. The random forest by itself is incapable of dealing with a large number of classes and therefore usually needs to be combined with other techniques [80]. In a study by Solé et al. [115], different experiments showed that the random forest is unfeasible for large-scale multi-class (1000 classes) classification. In comparison, the proposed research has not merely more than 100 but more than 1000 classes of ligatures.

(e) Deep Learning Methods

Deep learning methods have gained immense popularity in recent years. However, deep learning methods such as CNN, RNN and LSTM are feature learning models. CNN and LSTM have been used for ligature recognition as feature extractors and classifiers [91, 92, 116]. In our research, hand-engineered features have been used, and the motivation for selecting the proposed feature vector over other recent feature vectors is discussed in Section 5.3. Since the proposed ligature recognition system already uses hand-crafted features, the use of deep learning models for evaluation is not considered.

(f) Clustering Algorithms

Clustering algorithms are unsupervised learning techniques used to find structures in a set of unlabeled data. If clustering algorithms were used, they would separate the ligatures into groups based on the extracted hand-engineered features. The proposed ligature recognition system, in contrast, deals with labeled data: the 189003 ligatures used in the proposed research are already well separated into 3645 classes using the ground-truth information of the UPTI dataset. Therefore, the use of clustering algorithms is not required, since the data under consideration is already labeled.
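To make the scale of the multi-class SVM decomposition concrete, the short sketch below (illustrative only, not part of the thesis implementation) counts the binary SVMs required by the two standard decomposition schemes for the 3645 classes used in this research.

```python
# A minimal sketch showing why a multi-class SVM is impractical for the
# 3645 ligature classes considered in this research.
k = 3645  # number of ligature classes in the UPTI-derived dataset

one_vs_rest = k                  # one binary SVM trained per class
one_vs_one = k * (k - 1) // 2    # one binary SVM trained per pair of classes

print(f"one-vs-rest binary SVMs: {one_vs_rest:,}")  # 3,645
print(f"one-vs-one binary SVMs:  {one_vs_one:,}")   # 6,641,190
```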

5.7 Theoretical Analysis of Computational Complexity

In this section, the computational complexity of the proposed GA based hierarchical clustering algorithm is compared to some state-of-the-art machine learning algorithms. As described in the problem description, Section 1.3.1, the main requirement is to develop a ligature recognition system that has low computational complexity. Generally, the computational complexity of any algorithm depends on the total number of samples n used with it. In the proposed research, a large sample of ligatures has been used. The primary motive for using such a large sample of ligatures is to obtain a recognition rate that is more reliable. Another reason for using a large dataset is that in the Urdu language the total number of ligatures N_lig is far greater than the total number of characters N_char, as summarized in equation (5.1). Hence, the total number of ligature samples n used for developing a robust ligature recognition system must be nearly equal to the total number of existing ligatures, as given in equation (5.2).

N_lig ≫ N_char (5.1)

n ≈ N_lig (5.2)

Having established the purpose of using a large number of ligatures, i.e. 189003 ligatures from the UPTI dataset, the proposed approach of genetic algorithm based hierarchical clustering is compared to other standard machine learning techniques, such as SVM, decision trees and random forest, in terms of its computational complexity. The computational complexity of the traditional machine learning techniques, i.e. decision trees, random forest and SVM, is taken from the studies discussed in [117, 118], [119] and [120-122], respectively, whereas the computational complexity of the proposed learning technique, i.e. genetic algorithm based hierarchical clustering, is derived here in terms of Big-O notation. To find the computational complexity of the genetic algorithm based hierarchical clustering, two separate computational complexities are computed: first the computational complexity of the hierarchical clustering, and then the computational complexity of the proposed genetic algorithm. The proposed hierarchical clustering uses f features for n ligature samples, giving an initial computational complexity as stated in equation (5.3).

O(nf) (5.3)

However, clusters c are associated with each of the n samples and f features. The relationship between n, f and c is given in equation (5.4).

c₁nf + c₂nf + c₃nf + ⋯ + cₖnf (5.4)

Equation (5.4) can be further refined and written as,

(c₁ + c₂ + c₃ + ⋯ + cₖ)nf (5.5)

The sum (c₁ + c₂ + ⋯ + cₖ) covers all the clusters and can therefore be written collectively as c. Subsequently, equation (5.5) becomes,

cnf (5.6)

The total number of clusters as well as the number of features are constant and can be replaced by a single constant k, so cnf reduces to kn. Hence, the computational complexity of hierarchical clustering depends entirely on n. Dropping the constant, the computational complexity of the proposed hierarchical clustering algorithm is given in terms of Big-O notation as,

O(n) (5.7)
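The linearity in n can also be seen operationally: once the cluster boundaries for every feature are fixed, assigning a ligature amounts to one constant-time range lookup per feature. The sketch below is an illustration under assumed names, not the thesis code; `boundaries` is a hypothetical list of precomputed, sorted cluster cut-points per feature, whose sizes do not grow with n.

```python
import bisect

def assign_clusters(samples, boundaries):
    """Assign each sample's feature values to precomputed cluster ranges.

    samples    : list of n feature vectors (each of fixed length f)
    boundaries : per-feature sorted lists of cluster cut-points, fixed in
                 advance, so their sizes are constants with respect to n
    """
    assignments = []
    for vector in samples:                    # n iterations
        clusters = tuple(
            bisect.bisect_right(cuts, value)  # constant-size search w.r.t. n
            for value, cuts in zip(vector, boundaries)
        )
        assignments.append(clusters)
    return assignments                        # total work grows linearly with n
```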


Figure 5.11 Computational Complexity of the Proposed Genetic Algorithm

The computational complexity of a genetic algorithm depends on the generation size g, the population size p and the chromosome size s (see Figure 5.11). Hence, the computational complexity for a genetic algorithm can be given using equation (5.8).

O(gps) (5.8)

For the given genetic algorithm, the chromosome size s is 15 and the population size p is 10. Since both s and p are constants, equation (5.8) can be written as,

O(150g) (5.9)

The numeric value 150 is a constant and can therefore be replaced with a constant k, such that the computational complexity becomes,

O(kg) (5.10)

Big-O analysis is concerned with the growth rate of a function and ignores constants, so k is dropped. Conclusively, the computational complexity of the genetic algorithm depends on the number of generations. The growth of the function is logarithmic in nature: over a certain number of generations the solutions converge towards a common fitness value, as shown in Figure 5.11. Thus the number of generations required grows logarithmically with the number of samples, while every generation evaluates the fitness over all samples. Consequently, the computational complexity of the genetic algorithm is given in equation (5.11), where n refers to the total number of samples/observations.

O(n log n) (5.11)

The computational complexity of the genetic algorithm based hierarchical clustering can be computed by summing the complexities of the genetic algorithm and the hierarchical clustering, as given in equations (5.11) and (5.7).

O(n log n) + O(n) (5.12)

The computational complexity of the genetic algorithm based hierarchical clustering, which simplifies to O(n log n) since the linear term in equation (5.12) is dominated by the log-linear term, is compared to other machine learning techniques in Table 5.9. The comparison has been carried out by analyzing datasets of varying sizes, including the dataset size of 189003 used for the proposed ligature recognition system. The computational time analysis assumes a 1 GHz computing system. It can be observed that the amount of time taken by the different machine learning techniques and the proposed algorithm is almost identical for smaller datasets. However, for larger datasets, the proposed algorithm outperforms the standard machine learning techniques in terms of computational time. SVM, one of the most renowned machine learning techniques, would take a total of 79 days to train on the dataset of 189003 ligatures used in this research, whereas the GA based hierarchical clustering approach takes computation time only on the order of a millisecond (about 10⁶ ns). It is also evident from the computational complexities listed in Table 5.9 that the proposed algorithm has the lowest computational complexity. Likewise, the proposed algorithm used for ligature classification and recognition is highly adaptive and dynamic in nature, since the rules generated are not fixed or static: the structure of the rules is fixed, but the values and logical operations (AND/OR) are dynamic and change according to the dataset. Overall, the proposed GA based hierarchical clustering algorithm adapts as the data changes. The rules are used to explain the clustered ranges against each feature; an illustrative example of such a rule is sketched below.
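For illustration, a rule of the kind described above could take the following shape. The feature names, cluster indices, operators and class label here are hypothetical placeholders, not rules actually generated by the system.

```python
# Hypothetical illustration of one generated classification rule: a fixed
# structure (a combination of clustered feature ranges mapping to one of the
# 3645 classes) whose values and AND/OR operators are data-dependent.
def rule_0042(f):
    """f maps feature names to their cluster indices after hierarchical
    clustering; returns True when the ligature belongs to class 42."""
    return (
        f["aspect_ratio"] == 3
        and f["compactness"] == 7
        and (f["horizontal_mean"] == 2 or f["vertical_mean"] == 5)
        and f["glcm_energy"] == 1
    )
```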

Table 5.9 Computational Complexity Comparison of GA Based Hierarchical Clustering to Other Machine Learning Techniques

Training time taken for dataset size n (assuming a 1 GHz system):

| Algorithm | Complexity (Order of Magnitude) | n = 10 | n = 10³ | n = 10⁶ | n = 10⁸ | n = 189003 |
|---|---|---|---|---|---|---|
| Decision Tree | O(n²) | 100 ns | 1,000,000 ns | 10³ sec | 116 Days | 36 sec |
| Random Forest | O(n²) | 100 ns | 1,000,000 ns | 10³ sec | 116 Days | 36 sec |
| SVM | O(n³) | 1000 ns | 1 sec | 12 Days | 31,688,738 Years | 79 Days |
| Proposed Algorithm (GA based Hierarchical Clustering) | O(n log n) | 10 ns | 3000 ns | 500,000 ns | 800,000,000 ns | 1,002,960 ns |

ns – Nanosecond; O(n²) – Quadratic; O(n³) – Cubic; O(n log n) – Log Linear
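The table's estimates follow a simple model: an operation count taken from the complexity's order of magnitude, divided by the 10⁹ operations per second of the assumed 1 GHz machine. A rough sketch of that calculation is given below; it is an approximation for illustration only, and it assumes a base-10 logarithm for the log-linear case, which matches most but not all of the tabulated entries.

```python
import math

CLOCK_HZ = 1e9  # assumed 1 GHz computing system, i.e. 1e9 operations/second

def train_time_seconds(n, complexity):
    """Estimate training time from an order-of-magnitude operation count."""
    ops = {
        "quadratic":  n ** 2,             # decision tree, random forest
        "cubic":      n ** 3,             # SVM
        "log_linear": n * math.log10(n),  # proposed GA based clustering
    }[complexity]
    return ops / CLOCK_HZ

# SVM on the 189003-ligature dataset: ~6.8e6 s, i.e. roughly 79 days
print(train_time_seconds(189003, "cubic") / 86400)  # ≈ 78 (days)
# Decision tree on the same dataset: roughly 36 seconds
print(train_time_seconds(189003, "quadratic"))      # ≈ 35.7 (seconds)
```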

5.8 Comparative Analysis

In this section, a comparative analysis is provided for different ligature recognition systems, following the practice of [33, 41, 43, 92]. Table 5.10 lists the recognition accuracies of existing ligature recognition systems alongside the proposed system.

The dataset (UPTI) used for evaluating the performance of the proposed system has also been used by the studies in [10], [38], [39], [41], [33], [26], [91] and [92]. Sabbour and Shafait [33] extracted simple contour-based features from the segmented ligature images and used k-NN for classification; with a dataset of 10,000 ligatures they achieved an accuracy of 91%. Similarly, Din et al. [45] extracted simple statistical features and used HMM for classification and recognition; for a small dataset of 6187 ligatures and 1525 classes, a recognition rate of 92% was reported. Ahmad et al. [91] extracted stacked autoencoder features from the raw pixels of 178573 segmented ligature images, arranged into 3732 classes. In another study, Ahmad et al. [92] achieved an accuracy of 96.71% for 187039 ligature images divided into 3732 unique classes, with 127180 ligatures used for training and 29935 for testing. The dataset size used in both of these studies is smaller than that of the proposed OCR system, and both Ahmad et al. [91] and Ahmad et al. [92] used features with a higher space complexity than the proposed hand-engineered features, as discussed in Section 5.3.1. The accuracies of [43], [73], [80], [82] and [84] are higher than that of the proposed study; however, those authors used extremely small datasets, with a very small number of classes, for training and testing the classifiers. The proposed system uses a genetic algorithm based hierarchical clustering algorithm for Urdu Nastalique ligature recognition, employing hand-engineered (geometric and statistical) features and following a holistic approach. The recognition accuracy of the proposed ligature recognition system is 96.72% for 3645 classes and about 189k training samples. The proposed ligature recognition system works extremely well for such a large number of classes and can be executed on low-end computing devices, since it has low computational complexity in comparison to other machine learning algorithms, as discussed in Section 5.7.

Table 5.10 Recognition Accuracies for Different Urdu Ligature Recognition Systems

| Study | Features | Classifier | Dataset | Number of Ligatures | Accuracy |
|---|---|---|---|---|---|
| Rana and Lehal [83] | DCT, Gabor Filters, Directional Gradient | SVM, k-NN | Author Generated | 11,000 | 90.29% |
| Khattak et al. [43] | Projection, Concavity, Curvature Information | HMM | Author Generated | 2,000 | 97.93% |
| Shahzad et al. [85] | Rubine Features | Weighted Linear Classifier | Author Generated | 5 Samples of 38 Characters | 92.80% |
| Chanda and Pal [73] | Contour, Water Reservoir | Binary Tree Classifier | Author Generated | 3210 | 98.09% |
| El-Korashy and Shafait [80] | Contour, Size and Location of the Dots; Width, Height and Aspect Ratio | Nearest Neighbour Classification; Random Forest | Author Generated | Training: 20,000; Testing: 18,000 | 81.5%; 98% |
| Javed and Hussain [81] | -- | HMM | Author Generated | 1692 Ligatures | 92.73% |
| Nazir and Javed [82] | Code of the Mark, Base and the Diacritics | Correlation Method | Author Generated | 6728 Ligatures | 97.40% |
| Husain [84] | Solidity, Number of Holes, Eccentricity, Moments, Normalized Segment Length, Curvature, Ratio of Bounding Box Width and Height, Axis Ratio | Feed Forward Back Propagation Neural Network | Author Generated | 200 Ligatures | 100% |
| Javed et al. [6] | '0' & '1' | HMM | Author Generated | 3655 Ligatures | 92% |
| Razzak et al. [86] | Statistical, Structural | HMM, Fuzzy Logic | Author Generated | 1800 Ligatures | 87.6% (Nastalique), 74.1% (Naskh) |
| Husain et al. [11] | Unique Features | BPNN | Author Generated | 850 Ligatures | 93% |
| Sabbour and Shafait [33] | Contour | k-NN | UPTI | 10,000 | 91% |
| Ahmad et al. [91] | Stacked Denoising Auto-encoder Features | SoftMax | UPTI | Training: 178,573; Testing: 60,000 | 93% to 96% |
| Din et al. [45] | Statistical | HMM | UPTI | 6187 | 92% |
| Ahmad et al. [92] | Raw Pixels | GBLSTM | UPTI | Training: 127,180; Testing: 29,935 | 96.71% |
| Proposed Ligature Recognition System | Geometric and Statistical Features | GA Based Hierarchical Clustering | UPTI | Training: 189,003; Testing: Variable | 96.72% |

5.9 Shortcomings

Regardless of the high accuracy, low computational complexity and the ability to process a large number of classes (3645), the proposed ligature recognition system also has several shortcomings. The ligature recognition system cannot be used for handwritten Urdu text; this is, in fact, an inadequacy of its segmentation algorithm. Dissimilar writing styles and varying pen movements make handwritten text extremely complex. Figure 5.12 shows the segmented ligatures for a handwritten Urdu text image: instead of six ligatures, only five have been generated. Since our system uses whole ligature images, only two ligatures are considered to be correctly segmented.


Figure 5.12 (a) A Handwritten Urdu Text Image [123] (b) Segmentation Results

Extremely calligraphic and overlapping scripts such as Diwani cannot be used with the proposed system (see Figure 5.13). Such highly calligraphic scripts, even in printed form, are far more variational than the standard scripts and therefore lead to segmentation errors. Similarly, these calligraphic scripts are extremely overlapping in nature, making it difficult to determine correct segmentation points. It can be clearly seen in Figure 5.13 that the segmentation stage of the ligature recognition system did not segment the printed text written in the Diwani script correctly.


Figure 5.13 (a) Cursive and Calligraphic Diwani Script [52] (b) Segmentation Results

Additionally, the proposed system uses a genetic algorithm for optimization and recognition. Due to the use of the genetic algorithm, the proposed ligature recognition system may produce varying optimal solutions, i.e. the recognition rate may change for the same input data. The proposed ligature recognition system sometimes faces premature convergence, where a recognition rate is reached too early in the generations and no further increase in the recognition rate is observed for the remaining generations. Therefore, the genetic parameters, such as the mutation rate, need to be tuned in such a way that variation is maintained through the generations and the average recognition rate of the population keeps increasing. A minimal sketch of such an adaptive scheme follows.
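The sketch below illustrates one common way to implement the tuning described above: raising the mutation rate whenever the population's fitness stagnates. All names and thresholds are hypothetical; this is not the thesis implementation, whose GA operators are described in the methodology chapter.

```python
import random

def adaptive_mutation_rate(rate, history, floor=0.01, ceiling=0.30,
                           window=5, epsilon=1e-4):
    """Increase the mutation rate when the best fitness has stagnated for
    `window` generations; otherwise decay it back towards `floor`.

    rate    : current mutation rate
    history : list of best-fitness values, one per generation so far
    """
    if len(history) > window and history[-1] - history[-window] < epsilon:
        return min(ceiling, rate * 1.5)  # stagnation: inject more variation
    return max(floor, rate * 0.9)        # progress: relax towards the floor

def mutate(chromosome, rate):
    """Flip each gene of a binary chromosome independently with prob. `rate`."""
    return [1 - g if random.random() < rate else g for g in chromosome]
```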

5.10 Summary

In Chapter 5, "Experiments and Results", the results for the different phases of the proposed ligature recognition system for printed Urdu script using genetic algorithm based hierarchical clustering have been presented. First, the results generated from the ground-truth of the UPTI dataset were provided: as per the UPTI ground-truth, a total of 189274 ligatures and 339344 connected components have been identified within the dataset. The ligature segmentation algorithm achieved a segmentation accuracy of 99.86%, which is one of the highest ligature segmentation accuracies reported. Next, the feature extraction results and comparisons in terms of space complexity were reported. In the subsequent sections, the results of the hierarchical clustering used to reduce the distribution of the feature vector data points were presented. The proposed ligature recognition system was compared to other state-of-the-art classifiers in terms of their ability to handle the large number of classes used in this research, and the computational complexity of the proposed GA based hierarchical clustering algorithm was provided. The proposed ligature recognition system achieved a ligature recognition rate of 96.72%, which was compared against the recognition accuracies of other prevalent ligature-based studies. The chapter concluded by discussing some shortcomings and weaknesses of the proposed ligature recognition system. In the next chapter, the conclusion and future directions for research in the proposed field of study are discussed.


Extensive research has been carried out in the field of OCR, but most of it has been directed towards non-cursive languages. Hence, there is a dire need for the development of optical character recognition systems for cursive scripts, especially for Urdu. Most of the existing studies have focused on isolated or analytical Urdu text recognition systems. This research work focuses on developing a recognition system for ligature-based offline printed Urdu Nastalique script. In this chapter, the research is concluded and future directions are provided for prospective research in this field.

6.1 Conclusion

In the proposed research, a holistic segmentation based recognition system is presented for Urdu text written in the cursive and calligraphic Nastalique style. The holistic recognition system treats the ligature as the basic unit of recognition. The overall recognition system comprises hierarchical clustering and a genetic algorithm for optimization and recognition. A holistic segmentation algorithm, using novel vertical overlap information, is used for the segmentation of text into ligatures. The proposed ligature segmentation algorithm uses height, width and region features to separate connected components into primary and secondary components; vertical overlap analysis is then used to associate the connected components into their constituent ligatures (a sketch of this association is given at the end of this section). The proposed algorithm successfully segments a total of 189003 ligatures, giving an accuracy of 99.86%, which is higher than that of the existing Urdu Nastalique ligature segmentation algorithms. A total of 149503 secondary components are correctly associated with 189003 primary components with no confusion. Next, novel geometric as well as statistical (first-order and second-order) features are extracted from the segmented ligature images. The feature vector comprises a total of 15 features: aspect ratio, compactness, density function, horizontal edge intensity, vertical edge intensity, horizontal mean, vertical mean, horizontal variance, vertical variance, horizontal kurtosis, vertical kurtosis and the GLCM based features of contrast, correlation, energy and homogeneity. To reduce the distribution of the data points for each of the features, hierarchical clustering is applied. After the hierarchical clustering, classification rules are generated, where each ligature belongs to a unique class (1 to 3645). The clustered features, along with the ground-truth information, are then given to the genetic algorithm for optimization of the hierarchical clustering and recognition of the ligatures. The proposed genetic algorithm is evaluated on the segmented ligatures having no confusion, i.e. 189003 ligatures.

The overall recognition rate of the proposed ligature recognition system is 96.72%, which is one of the highest ligature recognition accuracies reported for the dataset.
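As a rough illustration of the vertical overlap analysis mentioned above, the sketch below associates each secondary component (e.g. a diacritic) with the primary component whose horizontal extent it overlaps the most. The bounding boxes and component lists are assumed inputs, and the actual algorithm additionally uses height, width and region features, so this is only a simplified sketch, not the thesis implementation.

```python
def horizontal_overlap(a, b):
    """Width of the overlap between two boxes a, b given as (x0, y0, x1, y1)."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]))

def associate(primaries, secondaries):
    """Attach every secondary component to the primary component whose
    column span (vertical strip of the text line) it overlaps the most."""
    ligatures = {i: [p] for i, p in enumerate(primaries)}
    for s in secondaries:
        overlaps = [horizontal_overlap(s, p) for p in primaries]
        best = max(range(len(primaries)), key=lambda i: overlaps[i])
        ligatures[best].append(s)  # primary + its diacritics = one ligature
    return ligatures
```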

6.2 Future Recommendation

The proposed Urdu ligature recognition system has given a very high accuracy of 96.72% compared to other existing methods. Currently, the system has been used for ligature recognition; in the future, the same genetic algorithm based hierarchical clustering methodology can be modified for the recognition of Urdu characters. Prospective researchers may also investigate the use of other hand-engineered or automated features for the recognition of Urdu text using genetic algorithm based hierarchical clustering. Handwriting recognition is extremely complex due to the variations in writing styles; the proposed recognition system can also be modified and updated for the complex and challenging recognition of handwritten Urdu text. A final post-processing stage can be added to deal with complex situations like grammar correction and spelling detection and correction. In the future, the proposed ligature segmentation algorithm can also be extended and modified for other cursive handwritten and printed scripts such as Arabic, Pashto, Panjabi, Seraiki, Kurdish, Persian, Sindhi, Kashmiri, Ottoman Turkish and Malay. Further, deep and transfer learning methods can be explored for ligature based recognition systems and compared to the results of the proposed system. The given system has been evaluated using the UPTI dataset; in the future, the CLE dataset can be used for experiments and evaluation purposes.

[1] N. H. Khan, A. Adnan, and S. Basar, "Urdu ligature recognition using multi-level agglomerative hierarchical clustering," Cluster Computing, May 2017, doi: 10.1007/s10586-017-0916-2.
[2] N. Y. Habash, "Introduction to Arabic natural language processing," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1-187, 2010.
[3] N. N. Kharma and R. K. Ward, "Character recognition systems for the non-expert," IEEE Canadian Review, vol. 33, pp. 5-8, 1999.
[4] R. Ahmad, S. Naz, M. Z. Afzal, S. H. Amin, and T. Breuel, "Robust optical recognition of cursive Pashto script using scale, rotation and location invariant approach," Plos One, vol. 10, no. 9, pp. 1-16, 2015.
[5] S. Sardar and A. Wahab, "Optical character recognition system for Urdu," in International Conference on Information and Emerging Technologies (ICIET), June 2010, pp. 1-5: IEEE.
[6] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S. Jamil, and H. Moin, "Segmentation free nastalique urdu ocr," World Academy of Science, Engineering and Technology, vol. 46, pp. 456-461, 2010.
[7] S. A. Sattar, S. Haque, M. K. Pathan, and Q. Gee, "Implementation challenges for character recognition," in Wireless Networks, Information Processing and Systems, ser. Communications in Computer and Information Science, 2009, vol. 20, pp. 279-285: Springer, Berlin, Heidelberg.
[8] M. A. U. Rehman, "A new scale invariant optimized chain code for Nastaliq character representation," in Proceedings of 2nd International Conference on Computer Modeling and Simulation (ICCMS), 2010, vol. 4, pp. 400-403: IEEE.
[9] P. Choudhary and N. Nain, "A four-tier annotated Urdu handwritten text image dataset for multidisciplinary research on Urdu Script," ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 15, no. 4, pp. 1-23, 2016.
[10] S. Naz, A. I. Umar, R. Ahmad, S. B. Ahmed, S. H. Shirazi, I. Siddiqi, and M. I. Razzak, "Offline cursive Urdu-Nastaliq script recognition using multidimensional recurrent neural networks," Neurocomputing, vol. 177, pp. 228-241, 2016.

[11] S. A. Husain, A. Sajjad, and F. Anwar, "Online Urdu character recognition system," in Proceedings of IAPR Conference on Machine Vision Applications (MVA), Tokyo, Japan, 2007, pp. 98-101.
[12] M. W. Sagheer, C. L. He, N. Nobile, and C. Y. Suen, "Holistic Urdu handwritten word recognition using support vector machine," in Proceedings of 20th International Conference on Pattern Recognition (ICPR), 2010, pp. 1900-1903: IEEE.
[13] A. Daud, W. Khan, and D. Che, "Urdu language processing: a survey," Artificial Intelligence Review, vol. 47, no. 3, pp. 279-311, 2017.
[14] D. N. Hakro and A. Z. Talib, "Printed text image database for Sindhi OCR," ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 15, no. 4, pp. 1-18, 2016.
[15] G. S. Lehal, "Ligature segmentation for Urdu OCR," in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), August 2013, pp. 1130-1134: IEEE.
[16] Z. Ahmad, J. K. Orakzai, I. Shamsher, and A. Adnan, "Urdu Nastaleeq optical character recognition," in Proceedings of World Academy of Science, Engineering and Technology, 2007, vol. 26, pp. 249-252.
[17] S. A. Sattar, S.-u. Haque, and M. K. Pathan, "A finite state model for urdu nastalique optical character recognition," International Journal of Computer Science and Network Security (IJCSNS), vol. 9, no. 9, p. 116, September 2009.
[18] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, S. A. Madani, and S. U. Khan, "The optical character recognition of Urdu-like cursive scripts," Pattern Recognition, vol. 47, no. 3, pp. 1229-1248, 2014.
[19] H. F. Schantz, History of OCR, Optical Character Recognition. Vermont, United States: Recognition Technologies Users Association, 1982.
[20] W. J. Bijleveld and A. J. Van De Toorn, "Process and apparatus for producing and reading arabic numbers on a record sheet," United States Patent 3527927, September 8, 1970.
[21] M. Rabi, M. Amrouch, and Z. Mahani, "Recognition of cursive Arabic handwritten text using embedded training based on hidden markov models," International Journal of Pattern Recognition and Artificial Intelligence, vol. 32, no. 01, pp. 1-19, 2018.

[22] U. Pal and A. Sarkar, "Recognition of printed Urdu script," in Proceedings of Seventh International Conference on Document Analysis and Recognition (ICDAR), August 2003, pp. 1183-1187: IEEE.
[23] D. A. Satti, "Offline Urdu Nastaliq OCR for printed text using analytical approach," Master's thesis, Quaid-i-Azam University, Islamabad, Pakistan, 2013.
[24] M. S. Khorsheed, "Off-line Arabic character recognition - A review," Pattern Analysis & Applications, vol. 5, no. 1, pp. 31-45, 2002.
[25] L. M. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 712-724, 2006.
[26] A. Ul-Hasan, S. B. Ahmed, F. Rashid, F. Shafait, and T. M. Breuel, "Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks," in Proceedings of 12th International Conference on Document Analysis and Recognition (ICDAR), 2013, pp. 1061-1065: IEEE.
[27] M. Kumar, M. Jindal, and R. Sharma, "Review on OCR for Handwritten Indian Scripts Character Recognition," in Advances in Digital Image Processing and Information Technology: Springer, 2011, pp. 268-276.
[28] J. Ortega-Garcia, J. Bigun, D. Reynolds, and J. Gonzalez-Rodriguez, "Authentication gets personal with biometrics," IEEE Signal Processing Magazine, vol. 21, no. 2, pp. 50-62, 2004.
[29] N. Islam, Z. Islam, and N. Noor, "A Survey on Optical Character Recognition System," Journal of Information & Communication Technology - JICT, vol. 10, no. 2, pp. 1-4, 2016.
[30] R. Mehran, H. Pirsiavash, and F. Razzazi, "A front-end OCR for omni-font Persian/Arabic cursive printed documents," in Proceedings of Digital Image Computing: Techniques and Applications (DICTA), 2005, pp. 56-56: IEEE.
[31] S. Naz, A. I. Umar, S. H. Shirazi, S. B. Ahmed, M. I. Razzak, and I. Siddiqi, "Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey," Education and Information Technologies, vol. 21, no. 5, pp. 1225-1241, 2016.
[32] S. Hussain and S. Ali, "Nastalique segmentation-based approach for Urdu OCR," International Journal on Document Analysis and Recognition (IJDAR), vol. 18, no. 4, pp. 357-374, 2015.

[33] N. Sabbour and F. Shafait, "A segmentation-free approach to Arabic and Urdu OCR," in Document Recognition and Retrieval XX, 2013, vol. 8658, p. 86580N: International Society for Optics and Photonics.
[34] S. A. Hussain, S. Zaman, and M. Ayub, "A self organizing map based Urdu Nasakh character recognition," in International Conference on Emerging Technologies (ICET), 2009, pp. 267-273: IEEE.
[35] S. T. Javed, "Investigation into a segmentation based OCR for the Nastaleeq writing system," Master's thesis, National University of Computer & Emerging Sciences, Lahore, Pakistan, 2007.
[36] Z. Ahmad, J. K. Orakzai, and I. Shamsher, "Urdu compound character recognition using feed forward neural networks," in Proceedings of the 2nd International Conference on Computer Science and Information Technology (ICCSIT), 2009, pp. 457-462: IEEE.
[37] S. Mir, S. Zaman, and M. W. Anwar, "Printed Urdu Nastalique Script Recognition Using Analytical Approach," in 13th International Conference on Frontiers of Information Technology (FIT), 2015, pp. 334-340: IEEE.
[38] R. Patel and M. Thakkar, "Handwritten Nastaleeq script recognition with BLSTM-CTC and ANFIS method," International Journal of Computer Trends and Technology, vol. 11, no. 3, pp. 131-136, 2014.
[39] S. Naz, A. I. Umar, R. Ahmad, S. B. Ahmed, S. H. Shirazi, and M. I. Razzak, "Urdu Nasta'liq text recognition system based on multi-dimensional recurrent neural network and statistical features," Neural Computing and Applications, vol. 28, no. 2, pp. 219-231, 2017.
[40] S. B. Ahmed, S. Naz, M. I. Razzak, S. F. Rashid, M. Z. Afzal, and T. M. Breuel, "Evaluation of cursive and non-cursive scripts using recurrent neural networks," Neural Computing and Applications, vol. 27, no. 3, pp. 603-613, 2016.
[41] S. Naz, A. I. Umar, R. Ahmad, I. Siddiqi, S. B. Ahmed, M. I. Razzak, and F. Shafait, "Urdu Nastaliq recognition using convolutional–recursive deep learning," Neurocomputing, vol. 243, pp. 80-87, 2017.
[42] A. Rehman, D. Mohamad, and G. Sulong, "Implicit vs explicit based script segmentation and recognition: A performance comparison on benchmark database," Int. J. Open Problems Compt. Math, vol. 2, no. 3, pp. 352-364, 2009.
[43] I. U. Khattak, I. Siddiqi, S. Khalid, and C. Djeddi, "Recognition of Urdu ligatures - a holistic approach," in 13th International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 71-75: IEEE.

[44] I. Ahmad, X. Wang, R. Li, M. Ahmed, and R. Ullah, "Line and ligature segmentation of Urdu Nastaleeq text," IEEE Access, vol. 5, pp. 10924-10940, 2017.
[45] I. U. Din, I. Siddiqi, S. Khalid, and T. Azam, "Segmentation-free optical character recognition for printed Urdu text," EURASIP Journal on Image and Video Processing, no. 62, pp. 1-18, 2017, doi: 10.1186/s13640-017-0208-z.
[46] G. Kaur, S. Singh, and A. Kumar, "Urdu ligature recognition techniques - A review," in International Conference on Intelligent Communication and Computational Techniques (ICCT), 2017, pp. 285-291: IEEE.
[47] A. F. Ganai and A. Koul, "Projection profile based ligature segmentation of Nastaleeq Urdu OCR," in 4th International Symposium on Computational and Business Intelligence (ISCBI), 2016, pp. 170-175: IEEE.
[48] M. S. M. El-Mahallawy, "A large scale HMM-based omni font-written OCR system for cursive scripts," Ph.D. dissertation, Faculty of Engineering, Cairo University, Giza, Egypt, 2008.
[49] J. H. Holland, "Genetic algorithms," Scientific American, vol. 267, no. 1, pp. 66-73, 1992.
[50] R. Ahmad, S. Naz, M. Z. Afzal, S. F. Rashid, M. Liwicki, and A. Dengel, "The impact of visual similarities of Arabic-like scripts regarding learning in an OCR system," in 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, vol. 7, pp. 15-19: IEEE.
[51] S. Naz, A. I. Umar, R. Ahmed, M. I. Razzak, S. F. Rashid, and F. Shafait, "Urdu Nasta'liq text recognition using implicit segmentation based on multi-dimensional long short term memory neural networks," SpringerPlus, vol. 5, no. 1, pp. 1-16, 2016.
[52] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, and H. Akbar, "Arabic script based language character recognition: Nasta'liq vs Naskh analysis," in World Congress on Computer and Information Technology (WCCIT), 2013, pp. 1-7: IEEE.
[53] R. Ahmad, M. Z. Afzal, S. F. Rashid, M. Liwicki, T. Breuel, and A. Dengel, "KPTI: Katib's Pashto Text Imagebase and Deep Learning Benchmark," in 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 2016, pp. 453-458: IEEE.
[54] A. Wali and S. Hussain, "Context sensitive shape-substitution in nastaliq writing system: Analysis and formulation," in Sobh T. (ed.), Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, pp. 53-58, 2007.

[55] S. Hussain, "Complexity of Asian writing systems: A case study of Nafees Nasta'leeq for Urdu," in Proceedings of the 12th AMIC Annual Conference on e-Worlds: Governments, Business and Civil Society, Asian Media Information Center, Singapore, 2003, pp. 1-10.
[56] D. A. Satti and K. Saleem, "Complexities and implementation challenges in offline urdu Nastaliq OCR," in Proceedings of the Conference on Language & Technology, 2012, pp. 85-91.
[57] M. Akram and S. Hussain, "Word segmentation for urdu OCR system," in Proceedings of the 8th Workshop on Asian Language Resources, Beijing, China, 2010, pp. 88-94.
[58] S. T. Javed and S. Hussain, "Improving Nastalique specific pre-recognition process for Urdu OCR," in 13th International Multitopic Conference (INMIC), 2009, pp. 1-6: IEEE.
[59] F. Shafait, D. Keysers, and T. M. Breuel, "Layout analysis of Urdu document images," in 10th International Multitopic Conference (INMIC), 2006, pp. 293-298: IEEE.
[60] P. Baker, A. Hardie, T. McEnery, H. Cunningham, and R. J. Gaizauskas, "EMILLE, A 67-million word corpus of indic languages: Data collection, mark-up and harmonisation," in LREC, 2002, pp. 819-825.
[61] M. Ijaz and S. Hussain, "Corpus based Urdu lexicon development," in Proceedings of Conference on Language Technology (CLT), University of Peshawar, Pakistan, 2007, vol. 73, pp. 1-12.
[62] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, and R. Parveen, "CLE Urdu digest corpus," in Language & Technology, 2012, vol. 47, pp. 47-53.
[63] F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, and J. Hennebert, "A new Arabic printed text image database and evaluation protocols," in 10th International Conference on Document Analysis and Recognition (ICDAR), 2009, pp. 946-950: IEEE.
[64] I. Shamsher, Z. Ahmad, J. K. Orakzai, and A. Adnan, "OCR for printed Urdu script using feed forward neural network," in Proceedings of World Academy of Science, Engineering and Technology, 2007, vol. 23, pp. 172-175.
[65] Q. U. A. Akram, S. Hussain, and Z. Habib, "Font size independent OCR for Noori Nastaleeq," in Proceedings of Graduate Colloquium on Computer Sciences (GCCS), Lahore, Pakistan, 2010, vol. 1: Department of Computer Science, FAST-NU.

[66] T. Nawaz, S. A. H. S. Naqvi, H. u. Rehman, and A. Faiz, "Optical character recognition system for Urdu (Naskh font) using pattern matching technique," International Journal of Image Processing (IJIP), vol. 3, no. 3, pp. 92-104, 2009.
[67] J. Tariq, U. Nauman, and M. U. Naru, "Softconverter: A novel approach to construct OCR for printed Urdu isolated characters," in 2nd International Conference on Computer Engineering and Technology (ICCET), 2010, vol. 3, pp. 495-498: IEEE.
[68] K. Khan, R. Ullah, N. A. Khan, and K. Naveed, "Urdu character recognition using principal component analysis," International Journal of Computer Applications, vol. 60, no. 11, pp. 1-4, 2012.
[69] K. Khan, R. U. Khan, A. Alkhalifah, and N. Ahmad, "Urdu text classification using decision trees," in 12th International Conference on High-Capacity Optical Networks and Enabling/Emerging Technologies (HONET), 2015, pp. 1-4: IEEE.
[70] D. B. Megherbi, S. Lodhi, and A. Boulenouar, "A two-stage neural network based technique for Urdu characters two-dimensional shape representation, classification, and recognition," in Applications and Science of Computational Intelligence IV, 2001, vol. 4390, pp. 84-96: International Society for Optics and Photonics.
[71] F. Iqbal, A. Latif, N. Kanwal, and T. Altaf, "Conversion of Urdu nastaliq to roman Urdu using OCR," in 4th International Conference on Interaction Sciences (ICIS), 2011, pp. 19-22: IEEE.
[72] Y. Khan and C. Nagar, "Recognize handwritten Urdu script using kohenen som algorithm," International Journal of Ocean System Engineering, vol. 2, no. 1, pp. 57-61, 2012.
[73] S. Chanda and U. Pal, "English, Devanagari and Urdu text identification," in Proceedings of International Conference on Document Analysis and Recognition, 2005, pp. 538-545.
[74] S. A. Sattar, S. Haque, and M. K. Pathan, "Nastaliq optical character recognition," in Proceedings of the 46th Annual Southeast Regional Conference on XX, 2008, pp. 329-331: ACM.
[75] S. Hussain, A. Niazi, U. Anjum, and F. Irfan, "Adapting Tesseract for complex scripts: an example for Urdu Nastalique," in 11th IAPR International Workshop on Document Analysis Systems (DAS), 2014, pp. 191-195: IEEE.
[76] E. R. Q. Khan and E. W. Q. Khan, "Urdu optical character recognition technique for Jameel Noori Nastaleeq script," Journal of Independent Studies and Research, vol. 13, no. 1, pp. 81-86, 2015.

[77] W. Q. Khan and R. Q. Khan, "Urdu optical character recognition technique using point feature matching; a generic approach," in International Conference on Information and Communication Technologies (ICICT), 2015, pp. 1-7: IEEE.
[78] O. Mukhtar, S. Setlur, and V. Govindaraju, "Experiments on Urdu text recognition," in Guide to OCR for Indic Scripts, London: Springer, 2009, pp. 163-171.
[79] G. S. Lehal and A. Rana, "Recognition of Nastalique Urdu ligatures," in Proceedings of the 4th International Workshop on Multilingual OCR, Washington, D.C., USA, 2013, pp. 1-5: ACM.
[80] A. El-Korashy and F. Shafait, "Search space reduction for holistic ligature recognition in Urdu Nastalique script," in 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, 2013, pp. 1125-1129: IEEE.
[81] S. T. Javed and S. Hussain, "Segmentation based Urdu Nastalique OCR," in Iberoamerican Congress on Pattern Recognition, Berlin, Heidelberg, 2013, vol. 8259, pp. 41-49: Springer.
[82] S. Nazir and A. Javed, "Diacritics recognition based Urdu Nastalique OCR system," Nucleus, vol. 51, no. 3, pp. 361-367, 2014.
[83] A. Rana and G. S. Lehal, "Offline Urdu OCR using ligature based segmentation for Nastaliq Script," Indian Journal of Science and Technology, vol. 8, no. 35, pp. 1-9, 2015.
[84] S. A. Husain, "A multi-tier holistic approach for Urdu Nastaliq recognition," in Proceedings of the 6th International Multitopic Conference (INMIC), 2002, pp. 528-532: IEEE.
[85] N. Shahzad, B. Paulson, and T. Hammond, "Urdu Qaeda: recognition system for isolated Urdu characters," in Proceedings of the IUI Workshop on Sketch Recognition, Sanibel Island, Florida, 2009.
[86] M. I. Razzak, F. Anwar, S. A. Husain, A. Belaid, and M. Sher, "HMM and fuzzy logic: a hybrid approach for online Urdu script-based languages' character recognition," Knowledge-Based Systems, vol. 23, no. 8, pp. 914-923, 2010.
[87] K. U. Khan, "Online urdu handwritten character recognition: Initial half form single stroke characters," in 12th International Conference on Frontiers of Information Technology (FIT), 2014, pp. 292-297: IEEE.
[88] Z. Jan, M. Shabir, M. Khan, A. Ali, and M. Muzammal, "Online Urdu handwriting recognition system using geometric invariant features," Nucleus, vol. 53, no. 2, pp. 89-98, 2016.

[89] T. Ali, T. Ahmad, and M. Imran, "UOCR: A ligature based approach for an Urdu OCR system," in 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 388-394: IEEE.
[90] S. Shabbir and I. Siddiqi, "Optical character recognition system for Urdu words in Nastaliq font," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 7, no. 5, pp. 567-576, 2016.
[91] I. Ahmad, X. Wang, R. Li, and S. Rasheed, "Offline Urdu Nastaleeq optical character recognition based on stacked denoising autoencoder," China Communications, vol. 14, no. 1, pp. 146-157, 2017.
[92] I. Ahmad, X. Wang, Y. hao Mao, G. Liu, H. Ahmad, and R. Ullah, "Ligature based Urdu Nastaleeq sentence recognition using gated bidirectional long short term memory," Cluster Computing, 2017, doi: 10.1007/s10586-017-0990-5.
[93] A. Onan and S. Korukoğlu, "A feature selection model based on genetic rank aggregation for text sentiment classification," Journal of Information Science, vol. 43, no. 1, pp. 25-38, 2017.
[94] S. Sural and P. Das, "A genetic algorithm for feature selection in a neuro-fuzzy OCR system," in Proceedings of Sixth International Conference on Document Analysis and Recognition (ICDAR), Seattle, WA, USA, 2001, pp. 987-991: IEEE.
[95] S. K. Pal and P. P. Wang, Genetic Algorithms for Pattern Recognition, 1st ed. Boca Raton: CRC Press, 2017.
[96] J. Tarigan, R. Diedan, and Y. Suryana, "Plate recognition using backpropagation neural network and genetic algorithm," Procedia Computer Science, vol. 116, pp. 365-372, 2017.
[97] M. Middlemiss and G. Dick, "Feature Selection of Intrusion Detection Data Using A Hybrid Genetic Algorithm/KNN Approach," in Design and Application of Hybrid Intelligent Systems: IOS Press, 2003, pp. 519-527.
[98] M. S. Hoque, M. A. Mukit, and M. A. N. Bikas, "An implementation of intrusion detection system using genetic algorithm," International Journal of Network Security & Its Applications (IJNSA), vol. 4, no. 2, pp. 109-120, 2012.
[99] T. Saba, A. Rehman, and G. Sulong, "Non-linear segmentation of touched roman characters based on genetic algorithm," International Journal on Computer Science and Engineering (IJCSE), vol. 2, no. 6, pp. 2167-2172, 2010.
[100] X. Wei, S. Ma, and Y. Jin, "Segmentation of connected Chinese characters based on genetic algorithm," in Proceedings of Eighth International Conference on Document Analysis and Recognition, Seoul, South Korea, 2005, pp. 645-649: IEEE.

[101] A. M. Alimi, "An evolutionary neuro-fuzzy approach to recognize on-line Arabic handwriting," in Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, 1997, vol. 1, pp. 382-386: IEEE.
[102] G. Abandah and N. Anssari, "Novel moment features extraction for recognizing handwritten Arabic letters," Journal of Computer Science, vol. 5, no. 3, pp. 226-232, 2009.
[103] M. Kherallah, F. Bouri, and A. M. Alimi, "On-line Arabic handwriting recognition system based on visual encoding and genetic algorithm," Engineering Applications of Artificial Intelligence, vol. 22, no. 1, pp. 153-170, 2009.
[104] M. A. Abed, A. N. Ismail, and Z. M. Hazi, "Pattern recognition using genetic algorithm," International Journal of Computer and Electrical Engineering, vol. 2, no. 3, pp. 583-588, 2010.
[105] A. M. Alimi, "Evolutionary computation for the recognition of on-line cursive handwriting," IETE Journal of Research, vol. 48, no. 5, pp. 385-396, 2002.
[106] S. Gazzah and N. B. Amara, "Neural networks and support vector machines classifiers for writer identification using Arabic script," International Arab Journal of Information Technology (IAJIT), vol. 5, no. 1, pp. 92-101, 2008.
[107] M. Soryani and N. Rafat, "Application of genetic algorithms to feature subset selection in a Farsi OCR," in Proceedings of World Academy of Science, Engineering and Technology, 2006, vol. 18, pp. 113-116.
[108] R. Kala, H. Vazirani, A. Shukla, and R. Tiwari, "Offline handwriting recognition using genetic algorithm," International Journal of Computer Science Issues (IJCSI), vol. 7, no. 2, pp. 16-25, 2010.
[109] I. A. Ansari and D. R. Borse, "Automatic recognition of offline handwritten Urdu digits in unconstrained environment using daubechies wavelet transforms," IOSR Journal of Engineering (IOSRJEN), vol. 3, no. 9, pp. 50-56, 2013.
[110] D. S. Kaushal, Y. Khan, and D. S. Varma, "Handwritten Urdu character recognition using zernike MI's feature extraction and support vector machine classifier," International Journal of Research, vol. 1, no. 7, pp. 1084-1089, 2014.
[111] S. Uddin, M. Sarim, A. B. Shaikh, and S. K. Raffat, "Offline Urdu numeral recognition using non-negative matrix factorization," Research Journal of Recent Sciences, vol. 3, no. 11, pp. 98-102, 2014.
[112] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.

[113] H. Liu, A. Gegov, and M. Cocea, "Representation of classification rules," in Rule Based Systems for Big Data: Springer, 2016, pp. 51-62.
[114] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S. Jamil, and H. Moin, "Segmentation Free Nastalique Urdu OCR," World Academy of Science, Engineering and Technology, vol. 70, October 2010.
[115] X. Solé, A. Ramisa, and C. Torras, "Evaluation of Random Forests on large-scale classification problems using a Bag-of-Visual-Words representation," in CCIA, 2014, pp. 273-276.
[116] N. Javed, S. Shabbir, I. Siddiqi, and K. Khurshid, "Classification of Urdu Ligatures using Convolutional Neural Networks - A novel approach," in International Conference on Frontiers of Information Technology (FIT), 2017, pp. 93-97: IEEE.
[117] T. Oates and D. Jensen, "The effects of training set size on decision tree complexity," in Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA, 1997, pp. 254-262.
[118] J. K. Martin and D. Hirschberg, "On the complexity of learning decision trees," in Proceedings of the Fourth International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, 1996, pp. 112-115.
[119] G. Louppe, "Understanding random forests: From theory to practice," Ph.D. dissertation, Department of Electrical Engineering & Computer Science, 2014.
[120] A. Abdiansah and R. Wardoyo, "Time complexity analysis of support vector machines (SVM) in LibSVM," International Journal of Computer and Application, pp. 1-7, 2015.
[121] R. K. Sevakula, M. Suhail, and N. K. Verma, "Fast data sampling for large scale support vector machines," in 2015 IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (WCI), 2015, pp. 1-6: IEEE.
[122] M. Blachnik, "Reducing time complexity of svm model by lvq data compression," in International Conference on Artificial Intelligence and Soft Computing, 2015, pp. 687-695: Springer.
[123] A. Abidi, A. Jamil, I. Siddiqi, and K. Khurshid, "Word spotting based retrieval of Urdu handwritten documents," in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2012, pp. 331-336: IEEE.

Total Impact Factor: 7.24

1. Naila Habib Khan and Awais Adnan, "Urdu Optical Character Recognition Systems: Present Contributions and Future Directions," IEEE Access, Volume 6, Issue 1, pp. 46019-46046, August 2018. (Published, Impact Factor 4.098)
2. Naila Habib Khan, Awais Adnan and Sadia Basar, "Urdu Ligature Recognition Using Multi-Level Agglomerative Hierarchical Clustering," Cluster Computing, vol. 21, pp. 503-514, March 2018. (Published, Impact Factor 1.601)
3. Naila Habib Khan and Awais Adnan, "Ego-motion Estimation, Concepts, Algorithms and Challenges: An Overview," Multimedia Tools and Applications, vol. 76, Issue 15, pp. 16581-16603, August 2017. (Published, Impact Factor 1.541)
4. Naila Habib Khan and Awais Adnan, "Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering," International Journal of Computer Vision. (Submitted)

Journal Name: IEEE Access
Publisher: IEEE
Manuscript Status: Published
Volume: 6
Issue: 1
Pages: 46019-46046
Month and Year: August 2018
Impact Factor: 4.098

Urdu Optical Character Recognition Systems: Present Contributions and Future Directions

Naila Habib Khan¹ and Awais Adnan¹

¹Department of Computer Science, Institute of Management Sciences, Peshawar, Khyber Pakhtunkhwa, Pakistan

Abstract: This research article gives an across-the-board comprehensive review and survey of the most prominent studies in the field of Urdu Optical Character Recognition (OCR). This study introduces the OCR technology and presents a historical review of the OCR systems, providing comparisons between the English, Arabic and Urdu systems. Detailed background and literature have also been provided for Urdu script, discussing the script’s past, OCR categories and phases. The research paper further reports all state-of-the-art studies for different phases, namely, image acquisition, pre-processing, segmentation, feature extraction, classification/recognition and post-processing for an Urdu OCR system. In the segmentation section, the analytical and holistic approaches for Urdu text have been emphasized. In the feature extraction section, a comparison has been provided between the feature learning and feature engineering approaches. Deep learning and traditional machine learning approaches have been discussed. The Urdu numeral recognition systems have also been deliberated concisely. The research paper concludes by identifying some open problems and suggesting some future directions.

Index Terms: Cursive; Optical Character Recognition; Urdu Text Recognition.

Journal Name: Cluster Computing
Publisher: Springer
Manuscript Status: Published
DOI: 10.1007/s10586-017-0916-2
Month and Year: March 2018
Impact Factor: 1.601

Urdu Ligature Recognition Using Multi-Level Agglomerative Hierarchical Clustering

Naila Habib Khan¹, Awais Adnan¹ and Sadia Basar²

¹Department of Computer Science, Institute of Management Sciences, Peshawar, Khyber Pakhtunkhwa, Pakistan
²Department of Information Technology, Hazara University, Hazara, Khyber Pakhtunkhwa, Pakistan

Abstract: Optical character recognition (OCR) system holds great significance in human- machine interaction. OCR has been the subject of intensive research especially for Latin, Chinese and Japanese script. Comparatively, little work has been done for Urdu OCR, due to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system which aims at ligature level recognition of Urdu text. This ligature based recognition approach overcomes the character level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses a semi-supervised multi-level clustering for categorization of the ligatures. Classification is performed using four machine learning techniques i.e. decision trees, linear discriminant analysis, naïve Bayes and k-nearest neighbor (K-NN). The system was implemented and the results show 62, 61, 73 and 90% accuracy for decision tree, linear discriminant analysis, naïve Bayes and K-NN respectively.

Keywords: Agglomerative; Clustering; Classification; OCR; Urdu.

Journal Name: Multimedia Tools and Applications
Publisher: Springer
Manuscript Status: Published
Volume: 76
Issue: 15
Pages: 16581-16603
Month and Year: August 2017
Impact Factor: 1.541

Ego-Motion Estimation Concepts, Algorithms and Challenges: An Overview

Naila Habib Khan¹ and Awais Adnan¹

¹Department of Computer Science, Institute of Management Sciences, Peshawar, Khyber Pakhtunkhwa, Pakistan

Abstract: Ego-motion technology holds great significance for computer vision applications, robotics, augmented reality and visual simultaneous localization and mapping. This paper is a study of ego-motion estimation basic concepts, equipment, algorithms, challenges and its real-world applications. First, we provide an overview for motion estimation in general with special focus on ego-motion estimation and its basic concepts. For ego-motion estimation it’s necessary to understand the notion of independent moving objects, focus of expansion, motion field, and optical flow. Vital algorithms that are used for ego-motion estimation are critically discussed in the following section of the paper. Various camera setups and their potential weakness and strength are also studied in context of ego-motion estimation. We also briefly specify some ego-motion applications used in the real world. We conclude the paper by discussing some open problems, provide some future directions and finally summarize the entire paper in the conclusions.

Keywords: Camera motion; Ego-motion; Motion estimation; Visual odometry.

Journal Name: International Journal of Computer Vision
Publisher: Springer
Manuscript Status: Submitted

Urdu Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering

Naila Habib Khan¹ and Awais Adnan¹

¹Department of Computer Science, Institute of Management Sciences, Peshawar, Khyber Pakhtunkhwa, Pakistan

Abstract: Cursive text recognition has been an active area of research in the field of computer vision. This paper presents a novel method for recognition of printed Urdu Nastalique script using ligatures. The proposed Urdu ligature recognition system uses a genetic algorithm based hierarchical clustering approach for the recognition of ligatures. The system has been divided into six phases, (1) pre-processing, (2) segmentation, (3) feature extraction, (4) hierarchical clustering, (5) classification rules and (6) genetic algorithm optimization and recognition. Urdu text line images are read one by one from the dataset and are thresholded into the foreground and background, subsequently, the text lines are segmented into ligatures. Next, fifteen renowned hand-engineered geometric and statistical features are extracted from the ligature images. For reduction of data points wide distribution, each of the features is hierarchically clustered. A total of 3645 classification rules, using simple conditional statements are used for the representation of the clustered data. Next, a genetic algorithm is used for further optimization of the hierarchical clustering and the final recognition of Urdu ligatures. Experiments conducted on benchmark UPTI dataset for the proposed Urdu Nastalique ligature recognition system yields promising results. In comparison to the prevailing ligature recognition systems, the proposed system achieves one of the highest recognition rates i.e. 96.72%.

Keywords: Genetic Algorithm; Hierarchical Clustering; Ligature Recognition; Urdu OCR.

1. Naila Habib Khan, Awais Adnan and Sadia Basar, "An analysis of off-line and on-line approaches in Urdu character recognition," in Proceedings of the 15th International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED), Venice, Italy, January 2016, pp. 280-286.
2. Naila Habib Khan, Awais Adnan and Sadia Basar, "Geometric feature extraction from Urdu ligatures," in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 229-236.
3. Sadia Basar, Awais Adnan, Naila Habib Khan and Shahab Haider, "Color Image Segmentation Using K-Means Classification on RGB Histogram," in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 257-262.

Conference: Proceedings of the 15th International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED '16)
Publisher: WSEAS
Pages: 280-286
Month and Year: January 2016
Location: Venice, Italy

An Analysis of Off-line and On-line Approaches in Urdu Character Recognition

Naila Habib Khan¹, Awais Adnan¹ and Sadia Basar¹

¹Department of Computer Science, Institute of Management Sciences, 1-A, Sector E-5, Phase VII, Hayatabad, Peshawar, Pakistan

Abstract: In this research article a detailed analysis has been proposed for various offline and online character recognition systems for Urdu script from year 2002 to 2012. This analysis is based on the Methodology, Text Type, Font, Recognition Level, Sample and Accuracy Level achieved by each individual Urdu script recognition system. This paper attempts to incorporate various aspects of offline and online character recognition systems to provide wide exposure to this research topic with special emphasis on Urdu Script. Generally, character recognition is the capability of a computer system to comprehend printed or handwritten text from different sources like documents, books, reports, photographs or directly from digital touch screens. In Offline Character Recognition system, an image is sensed by a scanner having printed text. When using any digital device in real time for example a touch-screen or a digital pen, it is referred to as Online Character Recognition.

Keywords: Online; Offline; OCR; Urdu.


Conference: Recent Advances in Telecommunications, Informatics and Educational Technologies
Publisher: WSEAS
Pages: 229-236
Month and Year: December 2014
Location: Istanbul, Turkey

Geometric Feature Extraction from Urdu Ligatures

Naila Habib Khan¹, Awais Adnan¹ and Sadia Basar¹

¹Department of Computer Science, Institute of Management Sciences, 1-A, Sector E-5, Phase VII, Hayatabad, Peshawar, Pakistan

Abstract: This research aims at the extraction of geometric features from Urdu ligatures. Though structural features are robust, its extraction and analysis is exceptionally complex and time- consuming task. The extraction and analysis is uncomplicated in case of the geometric features. Geometric features are language, script and font independent. There are twelve significant geometric features extracted from the ligature images. Specifically, these twelve features are the height, width, aspect ratio, density function, perimeter, area, perimeter to area ratio, horizontal projection profile, vertical projection profile, start point, end point and the slope between start and end point.

Keywords: Features; Geometric; Ligature; Structural; Urdu.

Conference: Recent Advances in Telecommunications, Informatics and Educational Technologies
Publisher: WSEAS
Pages: 257-262
Month and Year: December 2014
Location: Istanbul, Turkey

Color Image Segmentation Using K-Means Classification on RGB Histogram

Sadia Basar¹, Awais Adnan¹, Naila Habib Khan¹ and Shahab Haider¹

¹Department of Computer Science, Institute of Management Sciences, 1-A, Sector E-5, Phase VII, Hayatabad, Peshawar, Pakistan

Abstract: The paper presents the approach of Color Image Segmentation Using k-means Classification on RGB Histogram. The k-means is an iterative and an unsupervised method. The existing algorithms are accurate, but missing the locality information and required high-speed computerized machines to run the segmentation algorithms. The proposed method is content-aware and feature extraction method, which is able to run on low-end computerized machines, simple approach, required low quality streaming, efficient and used for security purpose. It has the capability to highlight the boundary and the object. The proposed approach used the unsupervised clustering technique in the paper in order to detect the image feature extraction, color and region identification. The proposed technique has solved the missing of the locality information problem and presents the image in distinct colors and clearly identify the objects of the image. At first, the image is read and then it is adjusting in a standard size. In another step, the pixels are divided into different clusters based on their color, texture and region, then cluster values are calculated by using the k-means clustering algorithm. If there is no pixel remaining, in another phase all the clusters are combined and finally the image is presented in the form of segments such as segmented image.

Keywords: Image; Digital image; Image segmentation; Clustering; K-means algorithm.
