LIGATURE RECOGNITION SYSTEM FOR PRINTED URDU SCRIPT USING GENETIC ALGORITHM BASED HIERARCHICAL CLUSTERING
A Thesis Submitted to the Faculty of the Institute of Management Sciences, Peshawar in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
COMPUTER SCIENCE
By NAILA HABIB KHAN
DEPARTMENT OF COMPUTER SCIENCE INSTITUTE OF MANAGEMENT SCIENCES PESHAWAR, PAKISTAN
SESSION 2014-2017

This is to certify that the research work presented in this thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” was conducted by Naila Habib Khan under the supervision of Dr. Awais Adnan, Institute of Management Sciences, Peshawar, Pakistan. No part of this thesis has been submitted anywhere else for any other degree. This thesis is submitted to the Institute of Management Sciences, Peshawar in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the field of Computer Science.
Student Name: Naila Habib Khan Signature: ______
Examination Committee:
a) External Foreign Examiner 1: Dr. Yue Cao, School of Computing and Communications, Lancaster University, UK
Signature: ______
b) External Foreign Examiner 2: Prof. Dr. Ibrahim A. Hameed, Deputy Head of Research and Innovation, Department of ICT and Natural Sciences, Norwegian University of Science and Technology, Norway
Signature: ______
c) External Local Examiner: Dr. Saeeda Naz, Head of Department/Assistant Professor, Govt. Girls Postgraduate College, Abbottabad, Pakistan
Signature: ______
d) Internal Local Examiner: Dr. Imran Ahmed Mughal, Assistant Professor, Institute of Management Sciences, Peshawar, Pakistan
Signature: ______
Supervisor: Dr. Awais Adnan, Assistant Professor, Institute of Management Sciences, Peshawar, Pakistan
Signature: ______
Director: Dr. Muhammad Mohsin Khan, Institute of Management Sciences, Peshawar, Pakistan
Signature: ______
I, Naila Habib Khan, hereby declare that my Ph.D. thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering”, submitted by me to the Research and Development Department (R&DD), is my own original work. I am aware of the fact that in case my work is found to be plagiarized or not genuine, R&DD has the full authority to cancel my research work and I will be liable to penal action.
Naila Habib Khan
July 5, 2019
I solemnly declare that the research work presented in the thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” is solely my research work, with no significant contribution from any other person. Small contributions, wherever taken, have been duly acknowledged, and the complete thesis has been written by me.
I understand the zero-tolerance policy of the HEC and the Institute of Management Sciences, Peshawar, towards plagiarism. Therefore, I, as the author of the above-mentioned thesis, declare that no portion of my thesis has been plagiarized and that any material used as a reference has been properly cited.
I understand that if I am found guilty of any form of plagiarism in the above-mentioned thesis even after the award of the Ph.D. degree, the Institute reserves the right to withdraw/revoke my Ph.D. degree, and the HEC and the Institute have the right to publish my name on the HEC/Institute website, on which the names of students who submitted plagiarized theses are placed.
Author’s Signature: ______
Naila Habib Khan
This research is dedicated to my beloved parents. They give me strength when I am weak; they never let me fall and hold me up; they see the best that is in me; they are always there for me and stand by me. They have been an inspiration and a blessing for me. I am everything I am because I am loved by them.
Firstly, all glory is to Allah Almighty Who blessed me with a strong will and determination to complete this research.
I express my deepest gratitude to my supervisor Dr. Awais Adnan for his kind guidance, constant help and constructive feedback on my research. I especially thank him for his patient support throughout the research phase. I really appreciate his input on my research, as it wouldn’t have been possible without his advice and assistance.
I am also grateful to all the faculty members and colleagues of the Department of Computer Science, Institute of Management Sciences for their support, inspiration and encouragement during my PhD research work. Thank you to my friend, Sadia Basar, for always being there and encouraging me with my work throughout this tough route to the PhD.
Last but not least, enormous thanks to my beloved parents, my dearest sisters, Asma Habib Khan and Nazma Habib Khan, and my dearest brothers, Imran Khan and Asif Khan, whose warm wishes, all-embracing backing, patience and prayers made the completion of this PhD research possible. I owe my gratitude to my elder sister Nazma Habib Khan for her constant advice and motivation. I would also like to thank my family members Iftikhar Anjum and Jane Agna Khan for their immense support.
In this dissertation, a method is presented for ligature-based recognition of printed Urdu Nastalique script. The proposed recognition system uses a genetic algorithm based hierarchical clustering approach for the recognition of Urdu ligatures. The overall proposed Urdu ligature recognition system is divided into six phases: pre-processing, segmentation, feature extraction, hierarchical clustering, classification rule generation, and genetic algorithm based optimization and recognition.
In the first phase, the Urdu text line images are read one by one from the dataset. Next, in pre-processing, the images are thresholded and noise is removed. Subsequently, an efficient and effective holistic segmentation algorithm is developed for segmenting the Urdu text lines into their constituent ligatures. The proposed ligature segmentation algorithm is novel, since it is one of the first algorithms that does not use baseline information for ligature segmentation of Urdu script. Next, a unique set of fifteen hand-engineered features is extracted from the segmented ligature images: two geometric features, nine first-order statistical features and four second-order statistical features. To reduce the distribution of data points, the features are hierarchically clustered, and a total of 3645 classification rules are generated using simple IF-THEN statements. Since these rules are only an initial stage, genetic algorithm optimization is used for further refinement of the hierarchical clustering. The proposed genetic algorithm phase consists of population initialization, chromosome encoding, parent selection, crossover, mutation, fitness function and termination stages.
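The genetic algorithm stages listed above follow the standard evolutionary loop of initialization, parent selection, crossover, mutation, fitness evaluation and elitist survival. The following is a rough, hypothetical sketch of that loop only, using a toy bit-counting fitness and binary encoding rather than the thesis's actual rule-based fitness and permutation encoding:

```python
import random

def fitness(chrom):
    # Toy fitness: count of 1-bits (OneMax). The thesis instead scores a
    # chromosome by its ligature recognition accuracy over test data.
    return sum(chrom)

def tournament(pop, k=3):
    # Parent selection: the fittest of k randomly sampled individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    # Single-point crossover between two parents.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chrom, rate=0.05):
    # Bit-flip mutation applied gene by gene.
    return [1 - g if random.random() < rate else g for g in chrom]

def run_ga(length=10, pop_size=20, generations=30):
    # Population initialization with random chromosomes.
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(pop, key=fitness)  # elitism: best survivor is kept
        children = [mutate(crossover(tournament(pop), tournament(pop)))
                    for _ in range(pop_size - 1)]
        pop = [elite] + children       # termination here is a fixed budget
    return max(pop, key=fitness)
```

In the proposed system the chromosome instead carries a permutation encoding over the clustered features, and fitness is measured as recognition accuracy, but the control flow of the loop is the same.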
Experiments conducted on the benchmark UPTI dataset for the proposed Urdu Nastalique ligature recognition system yield promising results. The proposed ligature segmentation algorithm achieves an accuracy of 99.86%, whereas the genetic algorithm based hierarchical clustering approach achieves a ligature recognition rate of 96.72%.
Table of Contents
Certificate of Approval ...... ii
Author’s Declaration ...... iv
Plagiarism Undertaking ...... v
Dedication ...... vi
Acknowledgments ...... vii
Abstract...... viii
List of Figures ...... xiv
List of Tables ...... xvii
List of Abbreviations ...... xviii
Chapter 1. Introduction ...... 1
1.1 Overview...... 1
1.2 Motivation ...... 2
1.3 Problem Statement ...... 2
1.3.1 Problem Description ...... 2
1.4 Goal and Objectives ...... 3
1.5 Research Contributions ...... 3
1.6 Thesis Structure ...... 5
1.7 Summary ...... 6
Chapter 2. Background ...... 7
2.1 History of OCR ...... 7
2.2 Categories of OCR System ...... 9
2.2.1 Input Acquisition Modes ...... 9
2.2.2 Writing Modes...... 10
2.2.3 Font Constraints ...... 11
2.2.4 Script Connectivity ...... 12
2.3 Generic OCR Process ...... 12
2.4 Image Acquisition ...... 13
2.5 Pre-Processing ...... 14
2.5.1 Thresholding ...... 14
2.5.2 Noise Removal ...... 15
2.5.3 Smoothing ...... 15
2.5.4 De-Skewing ...... 15
2.5.5 Thinning ...... 15
2.6 Segmentation ...... 16
2.6.1 Analytical Approach ...... 17
2.6.2 Holistic Approach ...... 18
2.7 Feature Extraction ...... 20
2.7.1 Feature Learning Approach ...... 20
2.7.2 Feature Engineering Approach ...... 21
2.8 Classification and Recognition ...... 22
2.8.1 Traditional Machine Learning ...... 23
2.8.2 Deep Learning ...... 24
2.8.3 Overview of Genetic Algorithm (GA) ...... 25
2.9 Post-Processing ...... 28
2.10 Application Areas of OCR ...... 28
2.10.1 Digital Reformatting ...... 28
2.10.2 Automated Text Translation ...... 29
2.10.3 Text-To-Speech Conversion...... 29
2.10.4 Automatic Number Plate Recognition (ANPR) ...... 29
2.10.5 Static-to-Electronic Media Conversion ...... 29
2.11 Urdu Script Preliminaries ...... 30
2.11.1 Urdu Script, Its History and Relation to Other Cursive Scripts...... 30
2.11.2 Joiner and Non-Joiner Characters ...... 33
2.11.3 Dots and Diacritics ...... 35
2.11.4 Urdu Nastalique OCR Challenges ...... 36
2.12 Summary ...... 39
Chapter 3. Literature Review ...... 40
3.1 Datasets for Urdu OCR ...... 40
3.2 Related Work for Different Categories of Urdu OCR ...... 42
3.3 Related Work for OCR Phases ...... 46
3.3.1 Related Work for Image Acquisition ...... 46
3.3.2 Related Work for Pre-Processing ...... 48
3.3.3 Related Work for Segmentation ...... 51
3.3.4 Related Work for Feature Extraction ...... 54
3.3.5 Related Work for Classification/Recognition ...... 59
3.4 Related Work for Genetic Algorithms ...... 67
3.5 Urdu Digit Recognition Systems ...... 68
3.6 Discussion of Literature Review ...... 70
3.7 Open Problems and Future Directions...... 70
3.8 Summary ...... 71
Chapter 4. Proposed Methodology ...... 72
4.1 System Overview ...... 72
4.2 Pre-Processing ...... 74
4.3 Ligature Segmentation ...... 75
4.3.1 Connected Component Labeling ...... 76
4.3.2 Connected Component Feature Extraction and Separation ...... 76
4.3.3 Connected Component Association ...... 78
4.3.4 Segmented Ligatures ...... 82
4.4 Hand-Engineered Feature Extraction ...... 82
4.4.1 Geometric Features ...... 83
4.4.2 First-Order Statistical Features ...... 84
4.4.3 Second-Order Statistical Features ...... 91
4.5 Hierarchical Clustering ...... 93
4.6 Data Representation Using Classification Rules ...... 99
4.7 Optimization and Recognition ...... 101
4.7.1 Population Initialization ...... 103
4.7.2 Chromosome Encoding ...... 104
4.7.3 Parent Selection ...... 104
4.7.4 Crossover ...... 105
4.7.5 Mutation ...... 108
4.7.6 Fitness Function ...... 109
4.7.7 Survivor Selection ...... 111
4.7.8 Termination ...... 111
4.8 Summary ...... 112
Chapter 5. Experiments and Results ...... 113
5.1 Dataset and Ground-Truth ...... 113
5.2 Ligature Segmentation Results ...... 114
5.2.1 Comparison to Other Ligature Segmentation Algorithms ...... 116
5.3 Feature Extraction Results ...... 117
5.3.1 Space Complexity Comparison to Other Feature Vectors ...... 118
5.3.2 Accuracy and Reliability of the Feature Vector ...... 120
5.4 Hierarchical Clustering and Classification Rules ...... 120
5.5 Genetic Algorithm Results ...... 124
5.5.1 Algorithm Parameters ...... 124
5.5.2 Ligature Recognition Accuracy ...... 125
5.5.3 Convergence ...... 132
5.6 Comparison to Other State-Of-The-Art Classifiers ...... 133
5.7 Theoretical Analysis of Computational Complexity ...... 136
5.8 Comparative Analysis ...... 139
5.9 Shortcomings ...... 141
5.10 Summary ...... 143
Chapter 6. Conclusion and Recommendation ...... 144
6.1 Conclusion ...... 144
6.2 Future Recommendation ...... 145
References ...... 146
Appendix A: Journal Publications ...... 157
Appendix B: Conference Publications ...... 162
Figure 2.1 History of OCR ...... 8
Figure 2.2 Categorization of OCR System ...... 9
Figure 2.3 (a) Online Character Recognition (b) Offline Character Recognition ...... 10
Figure 2.4 Examples of Urdu Isolated Characters, Ligatures and Words ...... 12
Figure 2.5 Generic Optical Character Recognition Process ...... 13
Figure 2.6 (a) Original Image (b) Thresholded Image ...... 15
Figure 2.7 Segmentation Process for a Document Image ...... 16
Figure 2.8 Approaches for Text Segmentation ...... 17
Figure 2.9 Overlapped Ligatures from An Un-Degraded Line Image ‘560’ Taken from UPTI Dataset Given In [33] ...... 18
Figure 2.10 Connected Component Labeling Computed for Un-Degraded Line Image Taken from UPTI Dataset Given In [33] ...... 19
Figure 2.11 Factors Affecting OCR System's Performance ...... 20
Figure 2.12 Basic Terminologies of Genetic Algorithm ...... 27
Figure 2.13 Application Areas of Urdu OCR ...... 28
Figure 2.14 Different Writing Styles for Arabic Script [52] ...... 31
Figure 2.15 Arabic Alphabet ...... 32
Figure 2.16 Persian Alphabet ...... 32
Figure 2.17 Pashto Alphabet ...... 33
Figure 2.18 Urdu Alphabet ...... 33
Figure 2.19 Joiner Urdu Characters ...... 34
Figure 2.20 Non-Joiner Urdu Characters ...... 34
Figure 2.21 Characters Associated with Dots ...... 35
Figure 2.22 Some of The Common Aerab Used with Urdu Characters ...... 35
Figure 2.23 Retroflex Consonants and ط Superscript ...... 36
Figure 2.24 Shape of Character ‘Te’ Affected by Its Neighboring Characters ...... 36
Figure 2.25 (a) Diagonality in Urdu (b) Horizontal Baseline in Arabic ...... 37
Figure 2.26 (a) Inter-Ligature Overlapping (b) Intra-Ligature Overlapping ...... 37
Figure 2.27 Placement of Dots at Non-standard Position ...... 38
Figure 2.28 (a) Spacing between Urdu Ligatures (b) Spacing between Arabic Ligatures ...... 38
Figure 2.29 Characters Touching Baseline (Blue), First-Descender Line (Pink) And Second-Descender Line (Green) [59] ...... 39
Figure 3.1 Datasets for Urdu Optical Character Recognition ...... 40
Figure 3.2 Horizontal Word Stretching to Avoid Overlapping [16] ...... 50
Figure 3.3 Contour (Boundary) Extracted for a Ligature [33] ...... 55
Figure 3.4 Digit Training and Testing Sample [109] ...... 69
Figure 4.1 Overview of Proposed Urdu OCR System ...... 74
Figure 4.2 (a) Original Image from UPTI Dataset [33] (b) Thresholded Image Using Otsu's Method ...... 75
Figure 4.3 Block Diagram for Proposed Ligature Segmentation Algorithm ...... 76
Figure 4.4 Vertical Overlap Analysis for Urdu Text-line Images Taken from UPTI Dataset Given in [33] (a) Association Using CXmin (b) Association Using CXmin and CXmax ...... 79
Figure 4.5 Calculating Aspect Ratio for Ligature Image ...... 83
Figure 4.6 Horizontal Projection Profile Computed for a Ligature ...... 85
Figure 4.7 Vertical Projection Profile Computed for a Ligature ...... 85
Figure 4.8 Horizontal Edge Intensity Computed Using the Sobel Method ...... 86
Figure 4.9 Vertical Edge Intensity Computed Using the Sobel Method ...... 87
Figure 4.10 Mean Calculated from Horizontal Histogram of Ligature Image ...... 88
Figure 4.11 Mean Calculated from Vertical Histogram of Ligature Image ...... 88
Figure 4.12 Variance VH Calculated from Horizontal Histogram of Ligature Image ...... 89
Figure 4.13 Variance VV Calculated from Vertical Histogram of Ligature Image ...... 90
Figure 4.14 Kurtosis KH Calculated from Horizontal Histogram of Ligature Image ...... 90
Figure 4.15 Kurtosis KV Calculated from Vertical Histogram of Ligature Image ...... 91
Figure 4.16 Initial Data Points Distribution for Each Feature (F1 to F15) ...... 94
Figure 4.17 Incremental Distribution of Sorted Data Points for Each Feature ...... 97
Figure 4.18 First-order Derivative Distribution of Data Points for Each Feature ...... 98
Figure 4.19 Mean of First-order Derivative Elements for Each Feature ...... 99
Figure 4.20 Specialized Tree Representation ...... 100
Figure 4.21 Block Diagram for Proposed GA Optimization and Recognition ...... 102
Figure 4.22 Chromosome with Permutation Encoding ...... 104
Figure 4.23 Parents Selection for Proposed Genetic Algorithm ...... 105
Figure 4.24 Crossover Operation for Proposed GA ...... 107
Figure 4.25 Mutation Operation for Proposed GA ...... 109
Figure 4.26 Multi-level Column Sorting Process for a Solution ...... 109
Figure 4.27 Survivor Selection Using Elitism ...... 111
Figure 5.1 Ligature Segmentation Results Extracted from Ground-Truth of UPTI Dataset Given In [33] ...... 114
Figure 5.2 (a) Text Line Image Taken from UPTI Dataset Given In [33] (b) Primary Connected Components Extracted Using Proposed Segmentation Algorithm (c) Secondary Connected Components Extracted Using The Proposed Segmentation Algorithm ...... 115
Figure 5.3 Ligatures Segmented Using Proposed Algorithm from Un-Degraded Sentence Image ‘560’ Taken from UPTI Dataset Given In [33] ...... 115
Figure 5.4 Sentence Text Image Taken from [15] and Its Segmented Ligatures Shown In (a) and (b), Respectively, Using The Proposed Algorithm ...... 116
Figure 5.5 Space Complexity Comparison for Proposed Features, Raw Pixel Features As Given in [92] and Autoencoder Features ...... 119
Figure 5.6 Accuracy and Reliability of the Proposed Feature Vectors ...... 120
Figure 5.7 Data Distribution Reduction Using Hierarchical Clustering ...... 121
Figure 5.8 Random Test Data to Evaluate Ligature Recognition Accuracy for Each Chromosome ...... 126
Figure 5.9 Maximum Ligature Recognition Accuracy (%) Achieved for Each Generation ...... 132
Figure 5.10 Convergence Towards a Common Solution ...... 133
Figure 5.11 Computational Complexity of the Proposed Genetic Algorithm ...... 137
Figure 5.12 (a) A Handwritten Urdu Text Image [123] (b) Segmentation Results ...... 142
Figure 5.13 (a) Cursive and Calligraphic Diwani Script [52] (b) Segmentation Results ...... 142
Table 2.1 Genetic Algorithm Terminologies ...... 26
Table 2.2 Comparison of Arabic, Urdu, Persian and Pashto Writing Styles ...... 31
Table 3.1 Summary of Different Categories of Urdu OCR ...... 45
Table 3.2 Summary of Different Image Acquisition Sources Used for Urdu OCR ...... 48
Table 3.3 Summary of Notable Contributions for Different Pre-processing Techniques Used for OCR ...... 51
Table 3.4 Summary of Notable Contributions Employing Explicit and Implicit Segmentation Strategies ...... 52
Table 3.5 Summary of Some Notable Contributions Using Holistic Approach for Text Segmentation ...... 53
Table 3.6 Notable Contributions Using Different Features ...... 58
Table 3.7 Summary of Contributions for Isolated Character Urdu OCR ...... 60
Table 3.8 Summary of Contributions for Cursive Character Urdu OCR ...... 63
Table 3.9 Summary of Contributions for Ligature Based Urdu OCR ...... 66
Table 3.10 Summary of Contributions for Genetic Algorithm Based OCR Systems ...... 68
Table 3.11 Summary of Urdu Numeral Recognition Systems ...... 69
Table 4.1 Feature Vector Selected for Feature Extraction ...... 82
Table 5.1 Results for Proposed Ligature Segmentation Algorithm ...... 115
Table 5.2 Comparison to Other Ligature Segmentation Algorithms ...... 117
Table 5.3 Feature Vector Generated for The First Ten Ligatures Extracted from Image '0.png' Taken from The UPTI Dataset ...... 118
Table 5.4 Space Complexity Comparison in terms of Big-O to Other Feature Vectors ...... 119
Table 5.5 Results for Data Points Distribution Reduction ...... 121
Table 5.6 Parameters for Genetic Algorithm Model ...... 124
Table 5.7 Recognition Accuracy (%) for Each Population ...... 127
Table 5.8 Survivors Selected Using Elitism based on Ligature Recognition Accuracy given in (%) ...... 129
Table 5.9 Computational Complexity Comparison of GA Based Hierarchical Clustering to Other Machine Learning Techniques ...... 139
Table 5.10 Recognition Accuracies for Different Urdu Ligature Recognition Systems ...... 140
ANPR: Automatic Number Plate Recognition
APTI: Arabic Printed Text Image
BLSTM: Bidirectional Long Short-Term Memory
CCL: Connected Component Labeling
CLE: Centre for Language Engineering
CNN: Convolutional Neural Network
CTC: Connectionist Temporal Classification
DCT: Discrete Cosine Transform
EA: Evolutionary Algorithm
EMILLE: Enabling Minority Language Engineering
FFNN: Feed Forward Neural Network
GA: Genetic Algorithm
GBLSTM: Gated Bidirectional Long Short-Term Memory
GLCM: Gray-Level Co-Occurrence Matrix
GMM: Gaussian Mixture Model
GPU: Graphical Processing Unit
GSC: Gradient, Structural and Concavity
HMM: Hidden Markov Model
KL: Karhunen Loeve
K-NN: K-Nearest Neighbors
LMCA: Lettres Mots et Chiffres Arabe
LSTM: Long Short-Term Memory
MATLAB: Matrix Laboratory
MDLSTM: Multi-Dimensional Long Short-Term Memory
MT: Machine Translation
NLP: Natural Language Processing
NMF: Non-Negative Matrix Factorization
NN: Neural Network
OCR: Optical Character Recognition
PC: Personal Computer
PDF: Portable Document Format
PDA: Personal Digital Assistant
RNN: Recurrent Neural Network
SURF: Speeded Up Robust Features
SVM: Support Vector Machine
UPTI: Urdu Printed Text Image
Replication of human functions by machines, such as reading, has been an ancient dream. Over the past years, however, the field of machine learning has turned this dream into reality. Optical character recognition (OCR) has become one of the most successful applications of artificial intelligence and pattern recognition. Optical character recognition deals with the recognition of text obtained by optical means. Over the past few years, numerous commercial optical character recognition applications have appeared in the market, meeting the basic requirements of users such as digital reformatting, process automation, text entry, automatic cartography, signature identification, automated text-to-speech conversion and Automatic Number Plate Recognition (ANPR).

1.1 Overview
Improving communication between man and machine has been the leading inspiration for numerous researchers [1]. One fundamental application of Natural Language Processing (NLP) is a text processing system [1, 2]. A text processing system, formally known as Optical Character Recognition, has provided numerous benefits in this technology era, including the conversion of century-old literature into a computer-understandable format [1]. Over the years, OCR has predominantly been used for producing meaningful outputs from text-based input patterns [3]. Presently, OCR technology has advanced and extended to include several different types of texts and fonts, as well as support for handwritten text recognition. Non-cursive scripts such as German, French and English are comparatively easy to recognize and have therefore seen a lot of research and development for OCR applications [4]. However, text recognition developments have still not excelled for cursive scripts such as Chinese, Korean, Persian, Pashto, Arabic and Urdu. Urdu is the national language of Pakistan, spoken and understood by millions of people across the globe [5, 6]. Hence, the Urdu language and its script hold great significance in Asia and across the Middle East. Nastalique is the standard writing style used for Urdu. The Nastalique calligraphic script [7, 8] is extremely cursive and context-sensitive in nature [6, 9-13]. Urdu script also shares the same level of written complexity with the Arabic, Pashto and Persian scripts [4]. The abundant complexities associated with Urdu script mean that it has rarely been considered for OCR [10], limiting its research and development as compared to the Latin script [14]. In Urdu, a sentence is composed of three textual components, i.e., words, ligatures and isolated alphabets [15], whereas an English sentence is formed by only two textual components, i.e., words and isolated alphabets (characters).
The extra component in Urdu, the ligature, is regarded as a sub-component of a word and can also be considered a sub-word [7]. It is composed of a combination of two or more characters. Due to the extreme cursiveness and overlapping issues, such as inter-ligature and intra-ligature overlapping, segmentation is extremely challenging [16]. Higher recognition rates for cursive scripts such as Urdu, Arabic, Pashto, Persian and Sindhi will only be possible when grammatical or contextual information is used. One such idea is to recognize entire words or ligatures from a dictionary without segmenting the text into individual characters.
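To make the holistic idea concrete, a connected component labeling pass can lift each connected blob of ink out of a binarized image as one candidate unit, with no character-level cuts at all. The sketch below is a minimal, illustrative 4-connected flood fill on a toy binary grid, not the thesis's algorithm; a real Urdu segmenter would additionally have to associate dots and diacritics with their parent ligature body:

```python
def label_components(grid):
    """4-connected component labeling on a binary grid (list of lists).

    Each connected group of 1-pixels receives a distinct positive label,
    mimicking how a ligature is lifted out of a text line as one unit.
    Background pixels keep the label 0.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and labels[r][c] == 0:
                count += 1
                stack = [(r, c)]
                while stack:  # iterative flood fill, no recursion limit
                    y, x = stack.pop()
                    if (0 <= y < rows and 0 <= x < cols
                            and grid[y][x] == 1 and labels[y][x] == 0):
                        labels[y][x] = count
                        stack += [(y + 1, x), (y - 1, x),
                                  (y, x + 1), (y, x - 1)]
    return labels, count
```

For example, a grid containing two separate blobs of ink yields two labels, each blob corresponding to one ligature candidate for holistic recognition.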
1.2 Motivation
Cursive text recognition has been an active area of research in the field of computer vision. Urdu informatics, especially OCR, lags behind due to the complexities and segmentation errors associated with its cursive script [17]. This cursive nature produces numerous challenges such as context sensitivity, overlapping, nuqta placement, thickness variation, positioning and diagonality [18]. Recently, efforts have been made to develop OCR systems for Urdu script. However, most of the studies have focused on character-based recognition (the analytical approach), which requires intensive procedures for character-level segmentation. Due to the calligraphic and cursive nature of Urdu script, character-based recognition systems are more complex, more challenging and more prone to errors. During character segmentation, the shape of a character may be deteriorated by cutting at wrong segmentation points, leading to lower recognition accuracies. One solution that avoids the overhead of character segmentation in Urdu script recognition systems is to use ligatures for recognition. As per the available literature, very few recognition systems exist for ligature-level recognition of Urdu script. Proposing an OCR technique that can be used efficiently and successfully to recognize Urdu ligatures will be a valuable addition to Urdu Natural Language Processing.
1.3 Problem Statement
What modifications are needed in OCR so that it can recognize Urdu printed script at ligature level?
1.3.1 Problem Description
At present, most Urdu OCR systems use traditional machine learning algorithms, limited vocabularies, isolated characters and complex features. Urdu is written in the extremely cursive Nastalique script, which introduces segmentation issues at the character level [15]. Character-level segmentation is therefore a difficult task that generally deteriorates or disfigures the shape of a character, leading to erroneous segmentation. It may also require additional pre-processing such as skeletonization. In this study, working at the ligature level avoids the challenges associated with character-level segmentation. Hence, due to these identified problems, a robust recognition system for connected Urdu script is required, using an algorithm with low computational overhead.
1.4 Goal and Objectives
The main goal of this research is to develop a robust framework for a printed Urdu Nastalique script Optical Character Recognition (OCR) system that can efficiently recognize documents at the ligature level with high accuracy. The goal can be achieved by following the primary objective and its sub-objectives. The primary objective of the proposed research is to extract a distinct set of features from Urdu ligatures and use a state-of-the-art classifier for classification. The sub-objectives of the proposed system are as follows.
1. To use a dataset having about 0.189 million ligatures with more than one million characters for classification, and to use a subset of the data for testing purposes.
2. To apply pre-processing on the dataset images and segment the text images into ligatures.
3. To identify and extract a unique set of the most applicable features from Urdu ligatures and develop a feature dataset.
4. To define and apply a machine learning technique on subsets of the feature dataset for the classification and recognition of the ligatures.
1.5 Research Contributions
The main contribution of the proposed research is to explore and experiment with the various processes required for the classification and recognition of printed Urdu ligatures. The key contributions of this proposed research are briefly summarized as follows. • Dataset: Previously, most researchers have opted to work with isolated Urdu characters and very limited datasets. For the proposed research, instead of using a mere thousand ligatures, a vast dataset of Urdu ligatures is used to train the classifier. To the best of the author's knowledge, the total number of ligatures used in this research for classification and recognition is one of the highest ever reported. • Baseline Independent Ligature Segmentation: A connected component labeling (CCL) based ligature segmentation algorithm has been proposed and implemented. Most of the existing segmentation algorithms process the baseline for component separation and/or association. However, there is no single horizontal baseline for Urdu Nastalique script, which instead comprises multiple sloping baselines
due to its diagonal nature. Baseline detection methods are thus not yet robust, so using the baseline as an element of a segmentation algorithm adds complexity. To overcome the issues with baseline detection, the proposed ligature segmentation algorithm does not process the baseline. In comparison to the existing ligature segmentation studies, the proposed algorithm reports one of the highest segmentation accuracies. • Modified Multi-Level Hierarchical Clustering Using Genetic Algorithm: Both the genetic algorithm and hierarchical clustering are well-defined algorithms and have been reported abundantly in the literature. However, here a modified technique is used that combines the genetic algorithm and hierarchical clustering in such a way that a multi-level sorting approach is used for the classification and recognition of the ligatures. Most of the existing techniques are accurate but slow and character-based, i.e., suitable for working on a small number of classes. In comparison, a large number of ligature classes (3645) and a total of 0.189 million ligatures are used in this research. In overcoming the issue of execution speed, this modified approach is in itself one of the major contributions in the Urdu text recognition domain. To the best of the author's knowledge, the proposed approach has achieved a high recognition rate on a benchmark dataset of Urdu text lines. Research articles have also been extracted from this dissertation. Some of the proposed algorithms, as well as related systems within the same field of study, multimedia, have been published as research articles in impact factor international journals and presented at international conferences. The list of published and tentative publications for journals and conferences is given as follows.
List of PhD Publications
Journals
1.
Naila Habib Khan and Awais Adnan, “Urdu Optical Character Recognition Systems: Present Contributions and Future Directions,” IEEE Access, vol. 6, issue 1, pp. 46019-46046, August 2018. (Published, Impact Factor 4.098)
2. Naila Habib Khan, Awais Adnan and Sadia Basar, “Urdu Ligature Recognition Using Multi-Level Agglomerative Hierarchical Clustering,” Cluster Computing, vol. 21, pp. 503–514, March 2018. (Published, Impact Factor 1.601)
3. Naila Habib Khan and Awais Adnan, “Ego-motion Estimation, Concepts, Algorithms and Challenges: An Overview,” Multimedia Tools and Applications, vol. 76, issue 15, pp. 16581–16603, August 2017. (Published, Impact Factor 1.541)
4. Naila Habib Khan and Awais Adnan, “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering,” International Journal of Computer Vision. (Submitted)
Conference Papers
1. Naila Habib Khan, Awais Adnan and Sadia Basar, “An analysis of off-line and on-line approaches in Urdu character recognition,” in Proceedings of the 15th International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED), Venice, Italy, January 2016, pp. 280-286.
2. Naila Habib Khan, Awais Adnan and Sadia Basar, “Geometric feature extraction from Urdu ligatures,” in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 229-236.
3. Sadia Basar, Awais Adnan, Naila Habib Khan and Shahab Haider, “Color Image Segmentation Using K-Means Classification on RGB Histogram,” in Recent Advances in Telecommunications, Informatics and Educational Technologies, Istanbul, Turkey, December 2014, pp. 257-262.
1.6 Thesis Structure
This chapter identifies the problem statement and specifies the need to develop a ligature-based recognition system for printed Urdu script. The rest of this thesis is organized as follows. Chapter 2 provides background knowledge about optical character recognition systems and their processes, namely image acquisition, pre-processing, segmentation, feature extraction, classification/recognition, and post-processing. Chapter 3 examines related studies in the field of Urdu OCR; the literature review covers all stages of an OCR system. Chapter 4 explains the steps required to develop an Urdu ligature recognition system: the first part of the chapter discusses the ligature segmentation process in detail, the second part lists the geometrical and statistical features extracted from the segmented ligature images, and the third part discusses classification and recognition for the proposed OCR system using genetic algorithm based hierarchical clustering. Chapter 5, experiments and results, first discusses the dataset used to evaluate the proposed ligature recognition system. Results for the major steps, including segmentation, feature extraction, hierarchical clustering, classification rule generation, and the final recognition accuracy computed using the proposed genetic algorithm based hierarchical clustering approach, are provided against the benchmark UPTI dataset. The recognition results are compared to other similar Urdu ligature-based recognition systems, and some shortcomings of the proposed ligature recognition system are also discussed. Chapter 6 concludes the thesis by summarizing the overall findings of the proposed research and provides future recommendations for research in the current field of study.
These recommendations address both the improvement of the proposed system and prospective research and development.
1.7 Summary
In Chapter 1, "Introduction", an elementary overview was provided of existing optical character recognition systems for different languages. The chapter also discussed the motivation and need to develop an Urdu ligature-level recognition system, along with the problem statement, goal, and objectives of the proposed research. The next chapter provides in-depth background knowledge of OCR systems, with a primary focus on Urdu OCR, its scripts, and its challenges.
OCR technology has a rich history, standard procedures, and several types; it presents numerous benefits as well as challenges. This chapter gives an in-depth insight into the history of OCR systems, comparing various scripts such as Latin, Arabic, and Urdu. Different categories of OCR systems, covering input acquisition modes, writing modes, font constraints, and script connectivity, are also discussed. Generally, an OCR system has the following phases: image acquisition, pre-processing, segmentation, feature extraction, classification, recognition, and post-processing. All these phases, and their sub-phases where present, are reviewed. Optical character recognition holds great significance for computer vision applications, some of which are also explained. Finally, the Urdu language and its alphabet are introduced, along with the numerous complexities of Urdu OCR in comparison to OCR systems for other scripts.
2.1 History of OCR
Early optical character recognition ideas date back to technologies developed to help the visually impaired. Two famous devices, Tauschek's reading machine and the Fournier Optophone, were developed between 1870 and 1931 to help the blind read [19]. In the 1950s, Gismo was invented, a machine capable of translating printed text messages into machine codes for computer processing; such devices were also capable of reading text aloud. The device was developed through the efforts of David H. Shepard, a cryptanalyst, and Harvey Cook. Intelligent Machines Research Corporation was the first company to sell these OCR devices. Following this success, the world's first OCR system was developed by David H. Shepard. Standard Oil Company of California used the OCR system for making credit card imprints; other consumers included Reader's Digest and the Telephone Company. Between 1954 and 1974, the first portable OCR devices, such as the Optacon, hit the market. These devices were used to scan and digitize postal addresses; initially, postal number recognition was very weak, but it succeeded as the technology advanced. OCR technology progressed immensely during the 1980s with the development of passport scanners and price tag scanners. In the late 1980s and early 1990s, some of today's most famous OCR companies were founded, such as Caere Corporation, Kurzweil Computer Products Inc. and ABBYY. From 2000 to 2017, OCR technology developed immensely: online services became available through web OCR, and applications enabling real-time translation of foreign languages on smartphones were developed. Tesseract, a famous OCR engine, was also published by Hewlett-Packard and the University of Nevada, Las Vegas. Different OCR software packages have also been made available online for free by Adobe and Google Drive. Over the past decades, most OCR research and development has been directed toward non-cursive scripts such as Latin; research on cursive scripts such as Arabic and Urdu started decades later (see Figure 2.1).
Figure 2.1 History of OCR
Some of the earliest research on Arabic OCR dates back to 1970, when an OCR was patented to read basic printed Arabic numerals from a sheet [20]. Nowadays, Arabic OCR technologies and research have progressed to more intelligent character recognition systems that support handwritten and cursive scripts [21]. Currently, much commercial software, such as ABBYY, provides support for the Arabic script; however, the accuracy rates are low in comparison to Latin text. Similarly, OCR for Urdu script lags far behind that for both Latin and Arabic scripts. The early systems for Urdu OCR can be traced back to 2003, when a system was developed to recognize basic isolated printed Urdu characters [22]. Over the past decade, research interest in Urdu OCR has increased immensely. However, due to the cursive and context-sensitive nature of its script, it still lags behind in printed OCR, and very few developments have been reported for handwritten OCR. Some of the most famous software packages, such as OmniPage, Adobe Acrobat, ABBYY FineReader, Readiris, Power PDF Advanced, and Soda PDF, and other commercial OCRs have no or extremely little support for the Urdu script. Most of these packages are multilanguage, but they provide higher accuracies for English; some Asian languages such as Urdu are mostly not well supported because their fonts are missing.
2.2 Categories of OCR System
Typically, optical character recognition systems can be divided into different types based on their characteristics, i.e., input acquisition mode (online or offline), writing mode (handwritten or printed), character connectivity (isolated or cursive), and font constraints (single font or omni-font) [23]. The categorization of OCR systems is shown in Figure 2.2.
Figure 2.2 Categorization of OCR System
2.2.1 Input Acquisition Modes
The mode in which input is given to an optical character recognition system can be divided into two types, i.e., online recognition and offline recognition [18, 24, 25]. Online recognition deals with real-time recognition of characters: characters are recognized from the movements of the pen as they are received during writing (see Figure 2.3 (a)). Online recognition requires specialized hardware, such as a pen and tablet, to obtain the text input [26, 27]. A concept of digital ink is used, in which a sensor analyzes pen-tip movements such as pen up/down. In comparison to offline recognition, it is less complex, since temporal information such as writing order, pen lifts, velocity, and speed is readily available. Online OCR systems for Latin script are available in PDAs, handheld PCs, and some of the latest touchscreen mobile phones.
Figure 2.3 (a) Online Character Recognition (b) Offline Character Recognition
Offline recognition deals with recognition of text that has already been converted into a digital image (see Figure 2.3 (b)). The input image is usually the product of scanning through some digital device such as a scanner or a digital camera [18]. Offline character recognition is also sometimes referred to as static recognition [28]. Offline character recognition is a more complex process than online recognition, since in the offline case the characters first need to be located. Offline recognition can be associated with both handwritten and printed scripts, while online recognition can only be associated with handwritten text.
2.2.2 Writing Modes
A major categorization of OCR systems is based on the mode of text the system will handle. The form of text, i.e., printed or handwritten, is known as the mode of text when developing an optical character recognition system [23]. The resources available for OCR can be in different formats, such as typewritten text containing tables, headers, footers, borders, page numbers, etc., or handwritten text exhibiting the writing-style variations of different people. Text recognition may seem like a minor task for human beings; for computational machines, however, it tends to be extremely challenging for both handwritten and printed script. It is challenging for printed text due to the availability of a large number of fonts, while for handwritten text there are numerous possible variations. An optical character recognition system that deals with printed text is sometimes known as a printed character recognition system. Printed text uses different font styles: for the English language, Times New Roman, Arial, Calibri, Courier, etc. are used, while for Urdu the most famous style of writing is Nastalique. Printed character recognition systems are simpler than handwritten character recognition systems. However, printed text may still be complex to recognize depending on the quality of the font, the document, and the writing rules of the language under consideration. Printed text can only be offline [29]. An OCR system that deals with handwritten text is sometimes known as a handwritten character recognition system. Handwritten character recognition is an extremely challenging research area in the fields of image processing and pattern recognition. Recognition of handwritten text is exceptionally difficult compared to printed/typewritten text. Handwritten text possesses many variations, not only due to the different writing styles of different people but also due to the varying pen movements of the same writer. Even with the latest recognition methods and systems, the recognition of handwritten text remains extremely challenging, even for Latin script. Similarly, for the recognition of Urdu handwritten text, there is still room for much improvement and progress; few researchers have focused on handwritten text for Urdu optical character recognition. Handwritten character recognition systems can be further divided into two types, i.e., offline handwritten character recognition and online handwritten character recognition [29].
2.2.3 Font Constraints
The geometric features of characters written in one font style may vary greatly from those of characters written in another font. The OCR process is therefore highly dependent on font style: an OCR system developed for one font style may completely fail, or only partially succeed, in recognizing the same text written in another font style. An OCR system capable of processing only a single font style is known as a single-font recognition system, while systems capable of processing and recognizing multiple fonts are called omni-font character recognition systems [30]. Usually, generalized algorithms are used with omni-font systems; to support a different font style, only the training process needs to be performed again. Most of the OCR systems available for Urdu use only the Nastalique calligraphic font. Nastalique is basically a fusion of the Naskh and Taliq writing styles. In 1980, Mirza Ahmed Jameel computerized 20,000 Nastalique ligatures for the first time, ready to be used in computers; the font was named Noori Nastalique. Over the years, many researchers have created their own versions of the Nastalique calligraphic style, such as Alvi Nastalique, Jameel Noori Nastalique, and Faiz Lahori Nastalique. All of these fonts fulfill the basic characteristics of the Nastalique writing style. However, Nastalique is far more complex than the fonts used for Arabic script [31, 32].
2.2.4 Script Connectivity
Another categorization of OCR systems is based on the use of an isolated or cursive script. Isolated scripts have characters that do not join each other when written; in cursive scripts, by contrast, the neighboring characters in a word may join, and a character may change its shape depending on its characteristics and position within the word. An optical character recognition system for cursive script is sometimes known as intelligent character recognition; if the system operates at word level, it is known as intelligent word recognition. Recognition of cursive text is an active area of research [18]. A new level of complexity is introduced when using cursive scripts with optical character recognition systems: an extra level of segmentation is required in the recognition process to isolate the characters within each word. Due to this added complexity, some languages are adopting segmentation-free approaches. The segmentation-free approach, more commonly known as the holistic approach, attempts to recognize a whole word or sub-word (ligature) without breaking it into constituent characters [6, 33]. Currently, OCR systems for cursive scripts suffer from segmentation complexities. To achieve higher recognition for general cursive scripts like Urdu, the use of contextual and grammatical information is required; for example, recognizing whole words or sub-words (ligatures) from a dictionary is easier than segmenting individual characters from the text. Words, ligatures, and isolated characters of the Urdu script are shown in Figure 2.4.
Figure 2.4 Examples of Urdu Isolated Characters, Ligatures and Words
2.3 Generic OCR Process
Generally, an offline character recognition system may contain some or all of six phases: (1) image acquisition, (2) pre-processing, (3) segmentation, (4) feature extraction, (5) classification and recognition, and (6) post-processing. Figure 2.5 shows the block diagram of a typical character recognition system. The different phases of an OCR system are explained in the sections below.
Figure 2.5 Generic Optical Character Recognition Process
2.4 Image Acquisition
Image acquisition is commonly the first stage of any computer vision system. It is the process of acquiring an image in digital form for manipulation by digital computers [24, 34]. There are numerous ways of acquiring images into the computer. In the online recognition input mode, text may be entered using a tablet and a pen. In offline recognition, the source images can be obtained by scanning printed, typewritten, or handwritten documents, or by capturing a photograph through an attached camera, digital camera, or image scanner. Offline source images might also be synthetic, i.e., generated without a scanning process. The image can be stored in any specific format, for example JPEG, BMP, or PNG. Regardless of the source, the quality of the input image plays a vital role in recognition accuracy: if an image has not been acquired properly, all later tasks may be affected and the final goal of the recognition system might not be achievable. There are numerous factors that may affect the overall quality of the input image, such as multiple successive copies generated from an original document. Poor printing quality may also make the scanned document noisy. Another major factor affecting image quality is the font style and its size: extremely small fonts are more likely to be treated as noise and go unrecognized by the OCR system. Punctuation marks, subscripts, and superscripts may also introduce recognition complexities and be treated as noise in the image if their size is extremely small. Recognition quality may also be affected by the quality of the paper used for printing: heavyweight, smooth papers are relatively easier to process than lightweight, transparent papers. High-quality, smooth, noise-free images are more likely to result in better recognition rates.
On the other hand, noise affected images are more prone to errors during the recognition process.
2.5 Pre-Processing
Pre-processing involves a series of operations carried out on the input image to make it more effective for the later stages of recognition and to improve overall performance [18]. Pre-processing removes distortions, quality degradation, and orientation issues introduced during the image acquisition phase. A decline in image quality introduces several problems in text analysis; the pre-processing phase is therefore extremely significant and plays a huge role in the development of a successful recognition system. A number of techniques, such as image thresholding, noise removal, smoothing, de-skewing, skeletonization, image dilation, and normalization, can be used for pre-processing [18]. The selection of techniques depends on the nature and source of the images. The final outcome of the pre-processing phase is a quality image suitable for the segmentation phase. Some of the pre-processing techniques frequently used in an OCR system are discussed in the subsections below.
2.5.1 Thresholding
The process of converting an RGB or grey-level image to a bi-level image is known as thresholding [5] (see Figure 2.6). It is one of the simplest forms of image segmentation, separating the foreground (actual text) from the background. Thresholding makes the acquired image smaller, faster, and easier to analyze by removing all unnecessary color information. The acquired image may be in RGB or indexed format, where each pixel holds certain color information; for an OCR system, this color information is not needed and must therefore be removed. Converting the image to grey scale removes some color information, but the image still contains unnecessary information, so the grey scale image is further converted into a bi-level image, where each pixel holds a value of 1 or 0 [35]. Thresholding algorithms can be divided into two main groups, i.e., global thresholding and local adaptive thresholding. In global thresholding, a single threshold is generated for the entire image by exploiting the grey-level intensities of the image histogram. Images with non-varying backgrounds are considered more feasible for global thresholding, and its implementation is easier than that of local adaptive thresholding. Local adaptive thresholding, on the other hand, is a far more complex but more intelligent technique. Instead of selecting a single threshold for the entire image, it classifies every single pixel as foreground or background, which works well for images with varying backgrounds. The classification of each pixel is performed by taking into consideration several properties, such as the pixel neighborhood: if the pixel in question is darker than its adjacent neighbors, it is converted to black, and vice versa. The results of local adaptive thresholding are far more accurate than those of the global thresholding algorithm.
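As a concrete illustration of global thresholding, the following is a minimal Python/NumPy sketch of Otsu's method, a standard histogram-based technique; it is not the implementation used in this thesis, and the function names are the author's own.

```python
import numpy as np

def otsu_threshold(gray):
    """Global thresholding: choose the grey level that maximizes the
    between-class variance of the image histogram (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # class-0 probability w(t)
    mu = np.cumsum(prob * np.arange(256))      # class-0 cumulative mean
    mu_total = mu[-1]                          # global mean grey level
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def binarize(gray, t):
    """Map dark foreground (text) to 1 and light background to 0."""
    return (gray <= t).astype(np.uint8)
```

The single threshold returned by `otsu_threshold` is applied to every pixel, which is exactly why the method struggles on images with varying backgrounds, where local adaptive thresholding is preferable.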
Figure 2.6 (a) Original Image (b) Thresholded Image
2.5.2 Noise Removal
The acquired images are usually distorted by unwanted elements. External disturbance that degrades an image signal is known as noise. There are many sources of noise, such as bad photocopying or scanning. One of the most common types of noise is salt-and-pepper noise.
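Salt-and-pepper noise is commonly removed with a median filter. The sketch below is an illustrative pure-NumPy example (not code from this thesis) that replaces each pixel with the median of its 3x3 neighbourhood, which eliminates isolated bright or dark outliers:

```python
import numpy as np

def median_filter3(img):
    """Remove salt-and-pepper noise with a 3x3 median filter;
    image borders are handled by padding with the edge values."""
    padded = np.pad(img, 1, mode="edge")
    # stack the nine shifted views forming each pixel's neighbourhood
    windows = np.stack([
        padded[i:i + img.shape[0], j:j + img.shape[1]]
        for i in range(3) for j in range(3)
    ])
    return np.median(windows, axis=0).astype(img.dtype)
```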
2.5.3 Smoothing
Smoothing is a procedure in which unwanted noise is eliminated from the edges of objects in the image. The morphological operations of dilation and erosion can be used for smoothing. In addition to erosion, opening and closing can also be applied: the opening operation removes small specks and thin protrusions along an object's boundary, while the closing operation fills the small gaps between an object's edges in an image.
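The four morphological operations just described can be sketched for binary images in a few lines of NumPy. This is an illustrative example with a fixed 3x3 structuring element and edge padding, not the thesis implementation:

```python
import numpy as np

def _shifts(padded, shape):
    """The nine 3x3-neighbourhood views of every pixel."""
    return np.stack([padded[i:i + shape[0], j:j + shape[1]]
                     for i in range(3) for j in range(3)])

def dilate(img):
    """3x3 binary dilation: a pixel becomes 1 if any neighbour is 1."""
    return _shifts(np.pad(img, 1, mode="edge"), img.shape).max(axis=0)

def erode(img):
    """3x3 binary erosion: a pixel stays 1 only if all neighbours are 1."""
    return _shifts(np.pad(img, 1, mode="edge"), img.shape).min(axis=0)

def opening(img):
    """Erosion then dilation: removes small specks and protrusions."""
    return dilate(erode(img))

def closing(img):
    """Dilation then erosion: fills small gaps and holes."""
    return erode(dilate(img))
```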
2.5.4 De-Skewing
A document is skewed when its lines of text are tilted. Skewness can be introduced by bad photocopying or scanning and leads to numerous problems in segmentation. Hence, a de-skewing process is applied to remove the skewness from the image.
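The passage above does not fix a particular de-skewing method. One common strategy (shown here as a hypothetical NumPy sketch, not necessarily the one used in this thesis) tries a range of rotation angles and picks the one whose horizontal projection profile has maximum variance, since text lines aligned with the image rows produce the sharpest profile peaks:

```python
import numpy as np

def rotate_nn(img, angle_deg):
    """Nearest-neighbour rotation of a binary image about its centre."""
    theta = np.deg2rad(angle_deg)
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: each output pixel samples its source pixel
    sy = (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta) + cy
    sx = (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta) + cx
    sy = np.round(sy).astype(int)
    sx = np.round(sx).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def estimate_skew(img, max_angle=5.0, step=0.5):
    """Pick the candidate angle maximizing the variance of the
    horizontal projection profile (rows aligned with text lines)."""
    angles = np.arange(-max_angle, max_angle + step, step)
    scores = [rotate_nn(img, a).sum(axis=1).var() for a in angles]
    return float(angles[int(np.argmax(scores))])
```

De-skewing then amounts to rotating the image back by the estimated angle before segmentation.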
2.5.5 Thinning
Thinning, also known as skeletonization, is the process of deleting dark points along the edges of an object in an image. Thinning is performed until the object is reduced to a thin line; the final thinned object is one pixel wide and is henceforth known as the skeleton. Thinning is a very important step in a recognition system and has many advantages: the skeleton of the text can be used to extract features like loops, holes, and branch points, and thinning also reduces the amount of data to be handled.
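A classic iterative thinning method is the Zhang-Suen algorithm, which repeatedly peels removable boundary pixels in two alternating sub-passes until only a one-pixel-wide skeleton remains. The sketch below is an illustrative NumPy version, not the thesis implementation:

```python
import numpy as np

def zhang_suen_thin(img):
    """Zhang-Suen thinning of a 0/1 binary image."""
    img = img.copy().astype(np.uint8)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            p = np.pad(img, 1)
            # the 8 neighbours of every pixel, clockwise from north
            P2 = p[0:-2, 1:-1]; P3 = p[0:-2, 2:]; P4 = p[1:-1, 2:]
            P5 = p[2:, 2:];     P6 = p[2:, 1:-1]; P7 = p[2:, 0:-2]
            P8 = p[1:-1, 0:-2]; P9 = p[0:-2, 0:-2]
            nbrs = [P2, P3, P4, P5, P6, P7, P8, P9]
            B = sum(n.astype(int) for n in nbrs)   # ink neighbours
            circle = nbrs + [P2]
            # A = number of 0 -> 1 transitions around the circle
            A = sum(((circle[k] == 0) & (circle[k + 1] == 1)).astype(int)
                    for k in range(8))
            if step == 0:
                cond = (P2 * P4 * P6 == 0) & (P4 * P6 * P8 == 0)
            else:
                cond = (P2 * P4 * P8 == 0) & (P2 * P6 * P8 == 0)
            remove = (img == 1) & (B >= 2) & (B <= 6) & (A == 1) & cond
            if remove.any():
                img[remove] = 0
                changed = True
    return img
```

The conditions on B and A preserve endpoints and connectivity, which is what makes the resulting skeleton usable for extracting loops, branch points, and endpoints.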
2.6 Segmentation
Dividing a source image into sub-components is known as segmentation. In an OCR system, segmentation is used to locate and isolate the text. The segmentation process for a page image can be divided into three levels, as shown in Figure 2.7.
Figure 2.7 Segmentation Process for a Document Image
Page decomposition is known as level 1 segmentation. It refers to the separation of textual components from other elements within the source image, which may contain different types of elements such as tables, figures, headers, and footers. The initial step in page decomposition is identifying the different elements within the source image and dividing it into rectangular blocks; next, each block is given a label such as table, text, or figure. Once the text has been extracted, further processing is performed only on the text portions. The next process after page decomposition is line (level 2) segmentation. One of the most common methods for line segmentation is the horizontal projection profile, in which the projection value is calculated by summing the pixel values along the horizontal direction of the document image. Each value of the horizontal projection profile is thus associated with the total number of foreground pixels in that row of the document image. The horizontal projection profile remains a natural choice for line segmentation in document images; however, there are other methods too, such as smearing, grouping, stochastic methods, and the Hough transform. The whitespace between text lines can be easily located by exploring the zero-height valleys in the horizontal projection profile. When a textual document has been segmented into lines using any of the mentioned methods, the next step is character, ligature, or word segmentation, known as level 3 segmentation. Text segmentation for an OCR system can be divided into two main types, i.e., the holistic approach and the analytical approach (see Figure 2.8).
Figure 2.8 Approaches for Text Segmentation
2.6.1 Analytical Approach
The analytical approach refers to the recognition of text by splitting it into characters. The analytical approach can be further categorized into two main types, i.e., explicit segmentation and implicit segmentation. Explicit segmentation explicitly divides handwritten or printed text into characters, and great success has been achieved using the explicit approach for character segmentation [16, 17, 22, 35-37]. However, this method is prone to errors and requires extensive knowledge of each character's start and end points: using this information, the characters are recognized and isolated, after which the segmentation procedure is completed. Detection of the start and end points is error-prone due to size variation, complexity, and placement of characters. In implicit segmentation, the text is segmented into a number of small segments based on the component classes of the alphabet; hence, the implicit approach uses a concept of over-segmentation. Implicit segmentation is also referred to as recognition-based segmentation and has been used successfully in several studies [10, 38-41]; segmentation and recognition are performed in parallel. Implicit segmentation can be challenging when deciding the total number of segments, and the numerous techniques for performing it may lead to over-segmentation or under-segmentation. Fewer segments lead to efficient computation, but widely written words will not be covered; more segments are computationally more expensive, increasing the number of junk segments that may also be modeled by the OCR recognizer [42].
2.6.2 Holistic Approach
When an OCR system recognizes text at the word or ligature level, it is said to use the holistic approach [43]. Over the years, the holistic approach has gained immense popularity as an upfront solution that avoids character-level segmentation [15, 33, 43-46]. Document image segmentation is one of the most significant tasks in document recognition (printed and handwritten): the overall accuracy of an optical character recognition system depends immensely on the correct segmentation of the recognition units (characters, ligatures, or words). As stated earlier, the Urdu Nastalique script is highly cursive and context-sensitive in nature, with many overlapping issues; a segmentation algorithm is therefore required that is robust enough to handle the complexities associated with Urdu script. Two of the most popular methods for ligature segmentation are projection profile based methods and connected component analysis based segmentation methods.

(a) Projection-Based Methods

In document image analysis, a projection profile is the histogram of the total number of foreground pixels in the document image; specifically, it is a one-dimensional representation of a two-dimensional image. The values of the histogram present the density distribution of the written script against the background. Usually, a projection profile method for document segmentation works well with text that is typewritten and non-cursive in nature. There are two main advantages of projection profile based methods: first, they do not require binarization; second, they are very robust to noise and other degradations. A projection profile can be horizontal or vertical. The vertical projection profile is one of the most well-known methods for ligature segmentation; it is calculated by summing the pixel values along the vertical direction of the document image.
The valleys in a vertical projection profile correspond to ligature or word gaps. Furthermore, by taking the vertical projection profile of a ligature or word, it is possible to segment it into individual character forms. Vertical projection profile based segmentation is a complicated process for the Urdu Nastalique script due to its overlapping nature [47]: inter-ligature overlapping may lead to wrong segmentation points when performing ligature segmentation (see Figure 2.9).
Figure 2.9 Overlapped Ligatures from An Un-Degraded Line Image ‘560’ Taken from UPTI Dataset Given In [33]
On the other hand, intra-ligature overlapping leads to wrong segmentation points during character segmentation. The vertical projection profile algorithm for segmenting ligatures from a thresholded text line image is described as follows:
• Step 1: Generate a histogram for every column of the input text line image.
• Step 2: Find valleys having zero height.
• Step 3: Take the valleys as segmentation points to extract words and ligatures.
• Step 4: Repeat Step 3 until the end of the text line.
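The steps above can be sketched in a few lines of NumPy. This is an illustrative example (function names are the author's own, not from this thesis) that cuts a binary text-line image at the zero-height valleys of its vertical projection profile:

```python
import numpy as np

def segment_ligatures(line_img):
    """Segment a thresholded text-line image (1 = ink) into ligature/word
    sub-images at zero-height valleys of the vertical projection profile."""
    profile = line_img.sum(axis=0)     # ink pixels per column (Step 1)
    ink = profile > 0                  # zero-height valleys are False (Step 2)
    segments, start = [], None
    for col, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = col                # a segment begins
        elif not has_ink and start is not None:
            segments.append(line_img[:, start:col])   # Step 3
            start = None
    if start is not None:              # segment running to the line's edge
        segments.append(line_img[:, start:])
    return segments
```

Note that, exactly as the text warns, two ligatures whose bounding boxes overlap in the horizontal direction produce no zero-height valley between them and would be returned as a single segment.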
(b) Connected Component Labeling (CCL) Based Methods

Vertical projection profile based methods are more appropriate for segmenting text images in which the characters, ligatures, or words are well separated by columns. In image processing, connected component labeling based algorithms can be used efficiently for ligature segmentation, since they solve the problem of inter-ligature overlapping. In an image, a connected component refers to a set of pixels that forms a connected group; connected component labeling refers to identifying all connected components in an image and assigning each one a unique label. Connected component labeling scans an image and groups its pixels into components based on pixel connectivity, which can be 4-connectivity or 8-connectivity. The CCL process scans an image from top to bottom and left to right, pixel by pixel, to find the connected groups and assign each of them a unique label (see Figure 2.10); once the connected components have been labeled, they can be extracted from the image. CCL-based segmentation methods have great advantages for Arabic-like cursive scripts; however, there are also a few disadvantages. For Urdu and other Arabic-like cursive scripts, recognition complexity is added because CCL-based segmentation methods separate the primary components from their secondary components (dots/diacritics); these primary and secondary components have to be reassembled to preserve the ligature shape. The CCL method for locating and labeling the connected components in a binary image is described below:
• Step 1: Search the image for the next unlabeled foreground pixel (p).
• Step 2: Label all the pixels in the connected component containing (p).
• Step 3: Repeat Steps 1 and 2 until all pixels are labeled.
Figure 2.10 Connected Component Labeling Computed for Un-Degraded Line Image Taken from UPTI Dataset Given In [33]
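The three CCL steps above can be sketched with a breadth-first flood fill. The following is an illustrative Python example (not the thesis implementation); real systems typically use the faster two-pass union-find variant, but the result is the same labeling:

```python
import numpy as np
from collections import deque

def label_components(img, connectivity=8):
    """Label the 4- or 8-connected foreground components (1 = ink)
    of a binary image; returns the label map and the component count."""
    labels = np.zeros(img.shape, dtype=int)
    if connectivity == 8:
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    rows, cols = img.shape
    next_label = 0
    for r in range(rows):                       # top-to-bottom,
        for c in range(cols):                   # left-to-right scan (Step 1)
            if img[r, c] and not labels[r, c]:
                next_label += 1
                labels[r, c] = next_label
                queue = deque([(r, c)])
                while queue:                    # flood the component (Step 2)
                    y, x = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny, nx] and not labels[ny, nx]):
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label
```

For Urdu, each dot or diacritic forms its own component, which is why a reassembly step is needed to attach secondary components back to their primary ligature body.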
2.7 Feature Extraction
When the input to an algorithm is extremely large and redundant, it can be transformed into a reduced set of parameters known as features; collectively, the features are known as a feature vector. A feature can simply be defined as "a distinct characteristic or property of an element". After pre-processing and segmentation, a feature extraction technique is required to extract distinct features, followed by classification and an optional post-processing phase when developing an optical character recognition system. The primary goal of the feature extraction phase is to capture the necessary characteristics of all the text elements, i.e., characters or words. Features hold great significance, since they directly affect the efficiency and recognition rate of an OCR system [48]. The total number of features, the quality of features, and the dataset, along with the classification method, are said to contribute toward an effective OCR system (see Figure 2.11).
Figure 2.11 Factors Affecting OCR System's Performance
Feature extraction is broadly divided into two main categories: the feature learning approach and the feature engineering approach. If features are automatically identified and extracted, it is known as the feature learning approach. When hand-crafted/hand-engineered features are identified and extracted, it is known as the feature engineering approach. After feature extraction, there is sometimes a need to obtain a reduced subset of the initial features, a task achieved through the feature selection process. The feature extraction approaches are discussed below.
2.7.1 Feature Learning Approach
Feature learning generates a large number of features that may improve the overall performance of the classification algorithms. Such systems are particularly valuable when specialized features are needed that cannot be created by hand. Mostly, feature learning uses unsupervised machine learning algorithms that train on several layers of features to learn multi-level representations. Feature learning can be supervised or
unsupervised. Supervised feature learning deals with labeled input data, e.g. supervised dictionary learning and neural networks. The labeled data allows the system to learn when it fails to produce the correct label. In contrast, unsupervised feature learning deals with unlabeled input data, e.g. independent component analysis, matrix factorization, clustering and auto-encoders.
2.7.2 Feature Engineering Approach
When using hand-crafted features, some parameters need to be considered, such as their quality, quantity, usefulness, distinctiveness and effectiveness. There are numerous features associated with each character in the Urdu alphabet. For an optical character recognition system, it is necessary to find techniques that achieve maximum recognition using the simplest and fewest features. The hand-crafted features can be classified into three types [18, 48]. (a) Structural Features Structural features are related to the topological and/or geometric characteristics of characters, such as loops, wedges, start points, end points, branches, crossing points, horizontal lines, vertical lines, number of endpoints, horizontal curves at the top or bottom, etc. Structural features require knowledge about the structure of the character, i.e. about the strokes and the associated dots that make up the character. In the case of Urdu character recognition, structural feature extraction is extremely difficult, since the shape of a character varies according to its neighborhood. (b) Statistical Features Statistical features are related to the distribution of pixels in an image. Statistical features are easy to detect and are less affected by noise and distortion than structural features. They provide low complexity and high speed to some extent, and also provide some level of font invariance. Statistical features may also be used for dimension reduction of the feature set. There are different methods to extract statistical features, such as zoning, crossing, distances and projections. Statistical features can be further classified into first-order, second-order and higher-order statistical features. First-order features compute properties of individual pixel values only, such as the average and variance.
Second-order and higher-order statistical features compute interactions between two or more pixel values occurring at specific locations relative to each other. Zoning is a popular statistical feature extraction method. The character image is divided into a pre-defined number of zones and from each zone a feature is selected. The zones may be overlapping or non-overlapping, and the character strokes in the different zones are analyzed. The image is usually divided into 2x2, 3x3, 4x4, etc. zones. Counting the number of transitions from background to foreground pixels in a character image is known as crossing; the transitions are computed along the vertical and the horizontal lines. For distance features, the distance of the first foreground pixel from the lower and upper boundaries is calculated along the horizontal lines of the image. For each character image, vertical and horizontal vectors are generated for each pixel in the background, and the total number of times a character stroke is intersected by any of the vectors is used as a feature. (c) Global Transformation and Series Expansion Global transformation and series expansion methods represent an image as continuous signals that contain more information, and features derived from these signals can be used for classification. Some well-known features of this type are Fourier transforms, the Gabor filter and transform, wavelets, Zernike moments and the Karhunen-Loeve (KL) expansion. A global transformation first transforms the image representation into a form, i.e. a signal, from which relevant features can easily be extracted. There are numerous ways to represent a signal, such as a linear combination of a series of simpler, smaller signals, known as series expansion. Some of the most common global transform and series expansion features are discussed below. The Fourier transform feature uses the magnitude spectrum of a measurement vector in an n-dimensional Euclidean space as a feature vector.
The Fourier transform holds great significance due to its ability to recognize characters that have shifted their position, by observing the magnitude spectrum. The Hough transform, on the other hand, is used to find the parametric curves of characters; it is also used as a technique for baseline detection in text documents. The Gabor transform is a special form of the Fourier transform, in which a windowed Fourier transform is applied to a character image. The window size is not discrete and is specified by a Gaussian function. Wavelets allow the representation of a signal at different levels of resolution and are hence a series expansion technique. Finally, an eigenvector analysis technique, also known as the Karhunen-Loeve expansion, is used to reduce feature dimensionality by creating new features from linear combinations of the original features. Moment features such as Zernike moments are independent of image size, rotation and translation. Zernike moments are invariant descriptors for an image: the overall shape of an object is described in a compact way using only a small subset of values.
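The zoning method described above can be sketched in a few lines. The following illustrative function (name and toy image are hypothetical) divides a binary character image into a grid of zones and returns the foreground-pixel density of each zone as the feature vector:

```python
def zoning_features(image, zones=(2, 2)):
    """Split a binary character image into a zones[0] x zones[1] grid
    and return the foreground-pixel density of each zone."""
    rows, cols = len(image), len(image[0])
    zr, zc = zones
    h, w = rows // zr, cols // zc  # zone height and width
    features = []
    for i in range(zr):
        for j in range(zc):
            # Collect the pixels of one non-overlapping zone.
            block = [image[y][x]
                     for y in range(i * h, (i + 1) * h)
                     for x in range(j * w, (j + 1) * w)]
            features.append(sum(block) / len(block))
    return features

# Toy 4x4 character image with strokes in two diagonal zones.
img = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(zoning_features(img))  # [1.0, 0.0, 0.0, 1.0]
```

Because each zone contributes a single density value, a 3x3 or 4x4 grid yields a compact 9- or 16-element feature vector regardless of the image size, which is part of the appeal of statistical features noted above.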
2.8 Classification and Recognition
Classification can be defined as a computational process that sorts images into groups/classes according to their similarities. Classification is a significant application of image retrieval; it simplifies searching through an image dataset to retrieve those images with particular visual content. All classification procedures assume that the image under consideration possesses one or more features (e.g. geometric) and that each of these features belongs to one of several distinct and exclusive classes. There are two types of classification. In supervised classification, the classes are specified in advance by an analyst. In unsupervised classification, the data is automatically clustered into sets of prototype classes, where the analyst merely specifies the number of desired categories. Classification algorithms typically employ two phases of processing: training and testing. In the training phase, the characteristics of typical image features are isolated and, based on them, a unique description of each classification category, i.e. training class, is created. In the subsequent testing phase, these feature-space partitions are used to classify image features. Two related paradigms, traditional machine learning and deep learning, have been on the rise in the research community during the past several years. Deep learning is not new, but it has recently gained considerable attention from the research community. Both concepts are explained below, along with related studies on Urdu optical character recognition systems.
2.8.1 Traditional Machine Learning
Machine learning is a field of artificial intelligence that aims to mimic the intelligent abilities of humans in machines. Machine learning involves the important questions and procedures needed to make machines capable of learning. It is difficult to define learning precisely, since it covers a broad range of processes; in most dictionaries, the phrases used for its definition are “to gain knowledge”, “understanding of” and “to gain some skill by study, instruction, or experience”. Some well-known traditional machine learning techniques used in character recognition systems are Neural Networks, Support Vector Machines, K-Nearest Neighbors, Bayesian classification and decision tree classification. Machine learning can be divided into two major types, although other types of machine learning techniques are also available, such as reinforcement learning, semi-supervised learning and learning to learn. Supervised learning involves inferring a function from labeled training data; it works primarily with pairs of inputs and their desired outputs. Supervised learning algorithms analyze the training data and produce a function. If the function has been inferred correctly, it can then be used to correctly determine future unseen input instances. Supervised learning is concerned with classification or regression. The primary goal is to enable the computer to learn a classification system that has been created by users. Character and digit recognition systems are famous examples of classification based learning. Classification based learning is applied to any problem where classification is useful and easy to determine. However, classification is not always supervised; unsupervised learning may also be used for classification problems. Some renowned supervised learning algorithms are neural networks, naive Bayes, nearest neighbor, regression models, Support Vector Machines (SVMs) and decision trees.
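One of the supervised algorithms listed above, nearest neighbor, is simple enough to sketch directly. The following is a minimal k-nearest-neighbors classifier; the 2-D feature vectors and the class labels ("alif", "bay") are toy examples invented for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training samples, using squared Euclidean distance."""
    dists = sorted(
        ((sum((a - b) ** 2 for a, b in zip(x, query)), label)
         for x, label in train)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Labeled training pairs: (feature vector, class label).
train = [((0.0, 0.1), "alif"), ((0.2, 0.0), "alif"),
         ((0.9, 1.0), "bay"), ((1.0, 0.8), "bay")]
print(knn_predict(train, (0.1, 0.2)))  # alif
```

Note the trade-off mentioned later in this chapter: k-NN requires no training phase at all, but every prediction scans the whole training set, which is why its test time can be high.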
Comparatively, in unsupervised learning the major goal is to enable the computer to learn how to do something without telling it how to do it. Unsupervised learning is much harder than supervised learning. It involves finding structure in unlabeled data. There are two approaches to unsupervised learning: first, clustering, which includes k-means, mixture models and hierarchical clustering; second, feature extraction techniques, e.g. independent component analysis, non-negative matrix factorization and singular value decomposition. Some unsupervised learning algorithms are threshold-based neural network approaches, partition-based clustering, hierarchical clustering, probabilistic clustering and Gaussian Mixture Models (GMM).
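The k-means clustering mentioned above can be sketched as follows. This is a minimal pure-Python version; the random initialization, iteration count and toy data points are illustrative choices:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Partition `points` into k clusters by alternating assignment
    to the nearest centroid and centroid re-estimation."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, clusters

# Two tight groups of 2-D points; k-means should separate them.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
```

No labels are used anywhere: the structure (two groups) is discovered from the data alone, which is exactly the sense in which clustering is unsupervised.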
2.8.2 Deep Learning
Deep learning is widely accepted as a recently developed but important part of the broader family of machine learning methods. Deep learning architectures are also used for performing classification tasks. There are six main differences between traditional machine learning and deep learning methods, each of which is explained below. First, the performance of a classification algorithm depends heavily on the features that have been identified and extracted. Traditional machine learning methods use the feature engineering approach to extract features for classification, whereas deep learning automatically discovers and extracts features from the raw pixels. When hand-crafted features are used for classification, it is known as feature engineering. In feature engineering, domain knowledge is put to use for feature creation and extraction. Feature engineering may prove to be difficult, expensive and time-consuming in terms of knowledge. Once the features have been identified, they are hand-coded as per the data type and domain. Hand-engineered features can be of different types, such as shape, pixel values, position, texture and orientation. Deep learning, on the other hand, automatically extracts high-level features from the data; hence, the task of finding and extracting unique features for each problem is greatly reduced. For example, the well-known Convolutional Neural Network learns low-level features such as lines and edges from an object in its early layers, then learns medium-level features, and finally learns the high-level representation. Second, the performance of deep learning increases as the scale of the data increases. For smaller datasets, deep learning algorithms do not perform as well, since a large amount of data is required by deep learning algorithms to model the data well.
In contrast, traditional machine learning methods perform well on smaller datasets, but their performance does not improve substantially on larger datasets; hence, traditional machine learning algorithms are preferable when data is scarce. Third, traditional machine learning algorithms can work well on low-end machines, whereas a deep learning algorithm requires high-end machines to process and classify the data. A GPU (Graphics Processing Unit), capable of carrying out a large number of matrix multiplication operations, is an essential requirement for deep learning. Fourth, traditional machine learning algorithms usually break a problem down into sub-parts before solving it; there are usually two steps involved, object detection and recognition. Deep learning, on the other hand, solves the problem without sub-dividing it, providing an end-to-end solution. The fifth difference is that traditional machine learning algorithms take less time to train, from a few seconds to perhaps a few hours, whereas deep learning algorithms comparatively take a long time to train due to their large number of parameters. However, traditional machine learning algorithms such as K-NN may take more time at test time. The final, sixth difference is interpretability. Deep learning architectures give excellent results, with near-human perfection, but it is often not possible to understand why a particular accuracy was achieved. The information about which nodes were activated is known and can be found mathematically, but the workings of the neurons and their modeling strategy remain unrevealed; hence, the results cannot be interpreted. Traditional machine learning algorithms, however, allow the results to be interpreted easily and also give clear rules about what was chosen and why.
2.8.3 Overview of Genetic Algorithm (GA)
Nature has always been a huge source of inspiration for mankind. The Genetic Algorithm (GA) is a metaheuristic algorithm belonging to the larger class of Evolutionary Algorithms (EAs). Genetic algorithms are based on the theory of evolution and were introduced by Holland in 1975 [49]. A genetic algorithm uses a set of techniques, such as selection, inheritance, mutation and recombination, that is highly inspired by evolutionary biology. It is commonly used to provide solutions to optimization and search related problems, in machine learning and in research. Optimization refers to improving something: it deals with finding the input values that give the best output values. Mathematically, it refers to maximizing or minimizing a function in order to produce an optimal solution by varying the input parameters. The basic terminologies for a genetic algorithm are given in Table 2.1.
Table 2.1 Genetic Algorithm Terminologies
Terminology: Detail
Population: A subset of all the potential solutions to a given problem.
Chromosome: A single solution to a given problem.
Gene: A single element position of a chromosome.
Allele: The value of a specific gene in a specific chromosome.
Fitness Function: A function that takes a solution as input and returns a measure of the appropriateness of that solution as output.
Genetic Operators: Operators that alter the genetic composition of the offspring.
Genotype: The population in the computation space.
Phenotype: The population in the actual real world.
In a genetic algorithm, there is an evolution of generations, and each generation has a population. The population used in a genetic algorithm is analogous to a population of human beings; in a GA, however, candidate solutions take the place of individuals. Each population is a set of candidate solutions known as chromosomes. These candidate solutions are usually represented using 0s and 1s, although other encoding schemes can also be followed. The chromosomes can be altered using biological operators such as crossover, mutation and selection. The most common method for a genetic algorithm is to create a random group of individuals as the initial population. These individuals are then evaluated using an evaluation (fitness) function, usually supplied by the programmer. The individuals are given a score that reflects their fitness in a given situation. From the population, the individuals that are fitter are given a higher priority to mate, in accordance with the Darwinian theory of the “survival of the fittest”. Usually, the top two individuals are selected and reproduction is carried out using crossover, generating one or more offspring. Random mutations are then carried out on the offspring. The genetic algorithm continues until an acceptable solution has been derived. The population represented in the computation space is known as the genotype. In the computation space, it is easy to manipulate and understand the solutions using a computing system. The phenotype, on the other hand, represents the population in real-world situations. The phenotype and genotype are usually the same for simple problems, but in most cases the spaces are different, involving decoding and encoding processes. Encoding is the process of transforming a solution from the phenotype to the genotype space, while decoding is
the process of transforming from the genotype to the phenotype space. The terminologies for a GA are shown diagrammatically in Figure 2.12.
Figure 2.12 Basic Terminologies of Genetic Algorithm
A genetic algorithm has the capability to give a “good enough” solution “fast enough”. The GA has various advantages compared to traditional artificial intelligence techniques: it is more robust and less prone to break down due to variations in the inputs. The genetic algorithm also provides much better and more significant results than other optimization methods such as linear programming, depth-first or breadth-first search and other heuristics. It is widely used in many fields, such as computer-aided molecular design, automotive design, robotics and engineering design. Genetic algorithms also perform much better than plain random local search algorithms, since they exploit historical information as well. A GA can optimize discrete, continuous and multi-objective functions. It always gets an answer to the problem, and that answer keeps getting better over time. However, GAs are not well suited for problems that are extremely simple. The fitness value is calculated repeatedly, so for some problems a GA might be computationally expensive. The implementation of a GA holds great significance: if the GA is not implemented correctly, it might not converge towards an optimal solution. There are a large number of problems that are NP-hard; even some of the most powerful
computer systems take a long time to resolve such problems. In such cases, a GA proves to be an extremely effective tool, providing a solution in a short amount of time.
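The GA loop described in this section (selection of the fitter individuals, crossover, random mutation, repeated until an acceptable solution emerges) can be sketched as follows. The OneMax fitness function and all parameter values below are illustrative choices, not taken from the thesis:

```python
import random

def genetic_algorithm(fitness, n_bits=10, pop_size=20,
                      generations=50, p_mut=0.05, seed=1):
    """Minimal GA: tournament selection, one-point crossover,
    bit-flip mutation on binary chromosomes."""
    rng = random.Random(seed)
    # Initial population: random 0/1 chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            # Tournament selection: the fitter of two random individuals.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            # One-point crossover combines two parents.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small probability per gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# OneMax: fitness is simply the number of 1-genes; the optimum
# chromosome is all ones.
best = genetic_algorithm(fitness=sum)
print(sum(best))
```

Here each chromosome is the genotype (a bit string); for OneMax the phenotype is the same bit string, so no decoding step is needed, matching the "simple problem" case noted above.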
2.9 Post-Processing
Post-processing is the final phase of an optical character recognition process; it involves tasks that aim to improve the correctness of the system's classification/recognition. The chosen classifier might not produce accurate results for an image, hence post-processing might be required. It may include different processes such as grammar correction, spell-checking, text-to-speech conversion and improving the overall recognition rate and output.
2.10 Application Areas of OCR
Urdu script has a rich historical background. Pakistan, India and Bangladesh are a few of the countries where Urdu is widely spoken, understood and written. Due to the popularity of the Urdu language at both the verbal and written levels, and its massive hardcopy literature, abundant efforts have been directed towards Urdu OCR systems. A fully functional and efficient Urdu OCR system can have abundant applications in various fields. Some of the most renowned and emerging application areas of Urdu OCR are digital reformatting, automated text translation, text-to-speech conversion, Automated Number Plate Recognition (ANPR) and static-to-electronic media conversion (see Figure 2.13). These application areas are discussed in the sub-sections ahead.
Figure 2.13 Application Areas of Urdu OCR
2.10.1 Digital Reformatting
In digital reformatting, original documents are converted into digital form. These digital documents act as surrogates, preserving the originals and eliminating the need to use them. With an automated system, like OCR, it is possible to convert physical libraries into digital libraries. The Internet can then be used to transfer and spread the literature, making it available worldwide. Currently, the Internet is being used as a repository for making textual material available online; it has been successful, but with a few trade-offs. Most of the literature on the internet now is in the form of images containing text. These images consume a lot of storage space, and transferring the files from one place to another over the internet is slow. Hence, digital reformatting will allow the conversion of physical libraries to digital libraries with less time and space consumption.
2.10.2 Automated Text Translation
Automated text translation is a famous application area of OCR, sometimes also referred to as "Machine Translation (MT)". Generally, automated text translation software translates text from a source language (e.g. Urdu) to a target language (e.g. English). Nowadays, text translation software is being designed for personal, business as well as enterprise usage. Such software is extremely useful, letting users understand and convert a source-language script into the target-language script in real time.
2.10.3 Text-To-Speech Conversion
OCR technology also provides accessibility for low-vision users. This is generally known as "text-to-speech conversion" and more technically as "speech synthesis". It involves converting the text recognized by OCR software into computer-generated speech. This technology allows low-vision or blind people to read books, magazines or any other reading material after scanning it.
2.10.4 Automatic Number Plate Recognition (ANPR)
ANPR is a technology that reads vehicle registration plates using OCR. ANPR requires a fast video camera to capture the image. ANPR technology is used worldwide by law enforcement agencies to keep track of vehicles, for tasks such as vehicle licensing, vehicle registration and electronic toll collection on pay-per-use roads.
2.10.5 Static-to-Electronic Media Conversion
E-media is an emerging application area of OCR technology. Electronic media (E-media) encompasses the use of electronics by the end user to access content; static media (print media), on the contrary, does not involve any use of electronics. Static media such as newspapers can be converted to E-media using OCR technology, by recognizing the newspaper headlines. Any handheld device with a camera can be used to take a snap of the headlines or recognize the headlines in real time. Once the headlines are recognized, the same news and its detailed content can be accessed online in video form on the same handheld device.
2.11 Urdu Script Preliminaries
Urdu is the national language of Pakistan [1]. It is spoken in more than 20 countries by more than 70 million people across the world. It is also widely spoken to some extent in countries like Afghanistan, Bangladesh, India, Malawi, Nepal, Saudi Arabia, the UAE, South Africa, the United Kingdom, Thailand and Zambia, and it is an official language of five Indian states. Cities like Mecca and Medina in Saudi Arabia also use Urdu for informational signage; this projects the significance of the Urdu language in the Muslim world.
2.11.1 Urdu Script, Its History and Relation to Other Cursive Scripts
The Arabic alphabet has influenced several languages including Persian, Urdu and Pashto [50, 51]. Each of the mentioned languages has some dissimilarities in its characters, but they share the same underlying foundation. Urdu has similarities to the Arabic alphabet due to the history it shares with it [26]. The word Urdu is derived from the Turkish word “Ordu”, meaning “army” or “camp”. The history of the Urdu language is vibrant and vivid. It is believed that Urdu came into existence during the Mughal Empire. After the 11th century, the Persian and Turkish invasions of the subcontinent caused the development of Urdu as a means of communication. During the Mughal Empire, Persian was the official language, Arabic was the language of religion, and Turkish was spoken mostly by the high-profile elite and the Sultans; therefore, Urdu was highly influenced by these three languages. During the early years it was used just for communication and was known as “Hindvi”. As the years progressed its vocabulary expanded, and several names were associated with it during this period, like Dehalvi and Zaban-e-Urdu. After independence, Urdu was declared the national language of Pakistan. Arabic and Urdu are both written in the Perso-Arabic script; therefore, they share similarities at the written level. The Arabic and Persian writing styles have a great influence on Urdu script; hence, Urdu uses a modified and extended set of the Arabic and Persian alphabets. Urdu is written in the Nastalique calligraphic style of the Perso-Arabic script. The history of Nastalique dates back to the Islamic conquest of Persia, when the Persian art of calligraphy was adopted by the Iranians. Mir-Ali Heravi Tabrizi, a famous Iranian calligrapher, developed the Nastalique calligraphic style during the 14th century. Nastalique was formed by the combination of two scripts, “Naskh” and “Taliq”. In the early years it was called “Nashtaliq”, but later on it became more formally known as Nastalique.
In South Asia, Persian was the official language of the Mughal Empire. Nastalique emerged during these days and left a great influence on South Asia including Bangladesh, India and Pakistan. In Bangladesh, Nastalique was greatly used before 1971. In India, Nastalique is still observed widely.
Nastalique is the standard calligraphic style for writing in Pakistan. Nastalique is extremely beautiful and more artistic compared to the Naskh writing style of Arabic. There are several calligraphic styles for writing Arabic script, such as the Naskh, Nastalique, Koufi, Thuluthi, Diwani and Rouq'i styles (see Figure 2.14). Naskh is the most common writing style used for the Arabic, Persian and Pashto scripts [18].
Figure 2.14 Different Writing Styles for Arabic Script [52]
The Arabic, Persian, Urdu and Pashto alphabet systems are all more or less the same; the only difference is the total number of characters (see Table 2.2). Arabic has the smallest number of characters in its alphabet. Persian uses the Arabic characters along with additional characters, and Urdu and Pashto both extend further from the Persian alphabet.
Table 2.2 Comparison of Arabic, Urdu, Persian and Pashto Writing Styles
Characteristic: Urdu / Arabic / Persian / Pashto
Total no. of letters: 38 / 28 / 32 / 45
Order of writing: Right to left / Right to left / Right to left / Right to left
Cursive: Yes / Yes / Yes / Yes
Dots and diacritics: Yes / Yes / Yes / Yes
The Arabic alphabet is also known as an abjad. It is written from right to left and has a total of 28 characters [18] (see Figure 2.15). The Arabic alphabet does not possess distinct upper-case and lower-case forms. Several characters have a similar appearance but are given their own distinction by the use of dots placed above or below their central part. For example, the Arabic letters خ (khā’), ح (ḥā’) and ج (jīm) share the same base shape, distinguished by one dot above, no dot and one dot below, respectively.
Figure 2.15 Arabic Alphabet
The Persian alphabet and script share many similarities with the Arabic script. Persian is also written right to left and is an abjad, meaning that vowels are under-represented in the writing system. The Persian alphabet, consisting of 32 characters, is shown in Figure 2.16. The Persian script is cursive in nature; hence, the characters change their shape depending on their position in a word: isolated, initial, middle or final.
Figure 2.16 Persian Alphabet
Pashto is the official language of Afghanistan and is also widely spoken in the Khyber Pakhtunkhwa province of Pakistan. It is used by 50 million people for oral and written communication [53]. There is a total of 45 characters in the Pashto alphabet (see Figure 2.17). The characters in the alphabet may have 0 to 4 diacritic marks. There have been no significant efforts devoted to the recognition of the Pashto script.
Figure 2.17 Pashto Alphabet
There is a total of 38 characters in the Urdu alphabet [40]. In Urdu, the text lines are read from top to bottom, whereas the characters are read from right to left. The characters can be grouped into classes based on the similarities of their base forms; the characters in the same class differ only by their dots or retroflex mark. The shapes of the basic isolated Urdu characters are shown in Figure 2.18.
Figure 2.18 Urdu Alphabet
2.11.2 Joiner and Non-Joiner Characters
There are two types of characters in Urdu: joiners and non-joiners [31]. Joiner characters are written cursively. The shape of a joiner character changes depending on the neighboring character to which it is connected, as well as on its position within the word. Hence, all
connectors in principle have four basic shape forms, i.e. “isolated”, “start”, “middle” and “end”. There are 27 joiner characters in the Urdu alphabet, as shown in Figure 2.19.
Figure 2.19 Joiner Urdu Characters
Alternatively, non-joiner characters are those characters that have no special start or middle forms, because they do not connect to the following character. Hence, the non-joiner characters take only two basic shape forms, i.e. “isolated” and “end”. There is a total of 10 non-joiner characters in the Urdu alphabet, as shown in Figure 2.20.
Figure 2.20 Non-Joiner Urdu Characters
If a word ends with a joiner character, then a space must be inserted after the word; otherwise the current word will merge with the following word, resulting in a visually incorrect and meaningless word. For example, two words “ ” without a space will become “ ”, making it incorrect. Another example considers two words “ ”; without space separation between the joiner words they will become “ ”, making it completely visually meaningless. However, words ending with a non-joiner usually do not have a space separating them from the next word, and are therefore ligatures within the same word. For example, the text seems like two words but there is no space between them. However, these are actually ” م“