Naila Habib Khan
LIGATURE RECOGNITION SYSTEM FOR PRINTED URDU SCRIPT USING GENETIC ALGORITHM BASED HIERARCHICAL CLUSTERING

A Thesis Submitted to the Faculty of the Institute of Management Sciences, Peshawar in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY
COMPUTER SCIENCE

By
NAILA HABIB KHAN

DEPARTMENT OF COMPUTER SCIENCE
INSTITUTE OF MANAGEMENT SCIENCES
PESHAWAR, PAKISTAN
SESSION 2014-2017

This is to certify that the research work presented in this thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” was conducted by Naila Habib Khan under the supervision of Dr. Awais Adnan, Institute of Management Sciences, Peshawar, Pakistan. No part of this thesis has been submitted anywhere else for any other degree. This thesis is submitted to the Institute of Management Sciences, Peshawar in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the field of Computer Science.

Student Name: Naila Habib Khan
Signature: ___________________________

Examination Committee:

a) External Foreign Examiner 1: Dr. Yue Cao
School of Computing and Communications, Lancaster University, UK
Signature: ___________________________

b) External Foreign Examiner 2: Prof. Dr. Ibrahim A. Hameed
Deputy Head of Research and Innovation
Department of ICT and Natural Sciences, Norwegian University of Science and Technology, UK
Signature: ___________________________

c) External Local Examiner: Dr. Saeeda Naz
Head of Department/ Assistant Professor
Govt. Girls Postgraduate College, Abbotabad, Pakistan
Signature: ___________________________

d) Internal Local Examiner: Dr. Imran Ahmed Mughal
Assistant Professor
Institute of Management Sciences, Peshawar, Pakistan
Signature: ___________________________

Supervisor: Dr. Awais Adnan
Assistant Professor
Institute of Management Sciences, Peshawar, Pakistan
Signature: ___________________________

Director: Dr.
Muhammad Mohsin Khan
Institute of Management Sciences, Peshawar, Pakistan
Signature: ___________________________

I, Naila Habib Khan, hereby declare that my Ph.D. thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering”, submitted by me to the Research and Development Department (R&DD), is my own original work. I am aware that if my work is found to be plagiarized or not genuine, R&DD has the full authority to cancel my research work and I am liable to penal action.

Naila Habib Khan
July 5, 2019

I solemnly declare that the research work presented in the thesis entitled “Ligature Recognition System for Printed Urdu Script Using Genetic Algorithm Based Hierarchical Clustering” is solely my research work, with no significant contribution from any other person. Small contributions, wherever taken, have been duly acknowledged, and the complete thesis has been written by me.

I understand the zero-tolerance policy of the HEC and the Institute of Management Sciences, Peshawar, towards plagiarism. Therefore, I, as the author of the above-mentioned thesis, declare that no portion of my thesis has been plagiarized and that any material used as a reference has been properly cited.

I understand that if I am found guilty of any form of plagiarism in the above-mentioned thesis, even after the award of the Ph.D. degree, the Institute reserves the right to withdraw or revoke my Ph.D. degree, and that the HEC and the Institute have the right to publish my name on the HEC/Institute website listing the names of students who submitted a plagiarized thesis.

Author’s Signature: _______________
Naila Habib Khan

This research is dedicated to my beloved parents. They give me strength when I am weak; they never let me fall and hold me up; they see the best that is in me; they are always there for me and stand by me. They have been an inspiration and a blessing for me.
I am everything I am because I am loved by them.

Firstly, all glory is to Allah Almighty, Who blessed me with a strong will and the determination to complete this research.

I express my deepest gratitude to my supervisor, Dr. Awais Adnan, for his kind guidance, constant help and constructive feedback on my research. I especially thank him for his patient support throughout the research phase. I really appreciate his input on my research, which would not have been possible without his advice and assistance.

I am also grateful to all the faculty members and colleagues of the Department of Computer Science, Institute of Management Sciences, for their support, inspiration and encouragement during my PhD research work.

Thank you to my friend, Sadia Basar, for always being there and encouraging me in my work throughout this tough route to the PhD.

Last but not least, enormous thanks to my beloved parents, my dearest sisters, Asma Habib Khan and Nazma Habib Khan, and my dearest brothers, Imran Khan and Asif Khan, whose warm wishes, all-embracing backing, patience and prayers made the completion of this PhD research possible. I owe my gratitude to my elder sister Nazma Habib Khan for her constant advice and motivation. I would also like to thank my family members Iftikhar Anjum and Jane Agna Khan for their immense support.

In this dissertation, a method is presented for ligature-based recognition of printed Urdu Nastalique script. The proposed recognition system uses a genetic algorithm based hierarchical clustering approach for the recognition of Urdu ligatures. The overall proposed Urdu ligature recognition system has been divided into six phases: pre-processing, segmentation, feature extraction, hierarchical clustering, classification rules, genetic algorithm optimization and recognition. In the first phase, the Urdu text line images are read one by one from the dataset. Next, in pre-processing, the images are thresholded and noise is removed.
Subsequently, a holistic algorithm is developed for segmenting the Urdu text lines into their constituent ligatures. The proposed ligature segmentation algorithm is novel in that it is one of the first algorithms that does not use baseline information for ligature segmentation of Urdu script. Next, a set of fifteen hand-engineered features is extracted from the segmented ligature images. Of these fifteen features, two are geometric features, nine are first-order statistical features and four are second-order statistical features. To reduce the distribution of data points, the features are hierarchically clustered, and a total of 3645 classification rules are generated using simple IF-THEN statements. Since the rules are at an initial stage, genetic algorithm optimization is used to further refine the hierarchical clustering. The proposed genetic algorithm phase consists of population initialization, chromosome encoding, parent selection, crossover, mutation, fitness function and termination stages. Experiments conducted on the benchmark UPTI dataset for the proposed Urdu Nastalique ligature recognition system yield promising results. The proposed ligature segmentation algorithm achieves an accuracy of 99.86%, whereas the genetic algorithm based hierarchical clustering approach achieves a ligature recognition rate of 96.72%.

Table of Contents

Certificate of Approval
Author’s Declaration
Plagiarism Undertaking
Dedication
Acknowledgments
Abstract
List of Figures
List of Tables
List of Abbreviations
Chapter 1. Introduction
  1.1 Overview
  1.2 Motivation
  1.3 Problem Statement
    1.3.1 Problem Description
  1.4 Goal and Objectives
  1.5 Research Contributions
  1.6 Thesis Structure
  1.7 Summary
Chapter 2. Background
  2.1 History of OCR