Offline Urdu Nastaliq OCR for Printed Text Using Analytical Approach
Offline Urdu Nastaliq OCR for Printed Text using Analytical Approach

By Danish Altaf Satti

Supervised by Dr. Khalid Saleem

A Dissertation Submitted in Partial Fulfillment for the Degree of MASTER OF PHILOSOPHY IN COMPUTER SCIENCE

Department of Computer Science
Quaid-i-Azam University
Islamabad, Pakistan
January, 2013

CERTIFICATE

A THESIS SUBMITTED IN THE PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF PHILOSOPHY

We accept this dissertation as conforming to the required standards.

Dr. Khalid Saleem (Supervisor)
Prof. Dr. M. Afzal Bhatti (Chairman)

Declaration

I hereby declare that this dissertation is the presentation of my original research work. Wherever contributions of others are involved, every effort is made to indicate this clearly with due reference to the literature and acknowledgement of collaborative research and discussions. This work was done under the guidance of Prof. Dr. Khalid Saleem, Department of Computer Sciences, Quaid-i-Azam University, Islamabad.

Date: January 31, 2013
Danish Altaf Satti

Abstract

As technology advances, machines are gaining more and more human-like intelligence. Although they have no intelligence of their own, advancements in Artificial Intelligence (AI) techniques are helping machines catch up quickly.
Machines have already surpassed the human brain in terms of computational speed and memory; however, they still lack the ability to process and interpret information, understand it, recognize, and make decisions on their own. The field of Pattern Recognition (PR) deals with improving the ability of machines to recognize patterns and identify objects. Its sub-field, Optical Character Recognition (OCR), deals exclusively with patterns that are printed or written text. A lot of progress has been made in the recognition of Latin, Chinese and Japanese scripts. In the recent past, recognition of Arabic text has attracted a lot of interest from the research community, and significant advancements were made in the recognition of both printed and handwritten text. As far as Urdu is concerned, the work is almost non-existent compared to the languages cited above. One of the reasons is the extremely complex nature of the Nastaliq style, which is the standard for printed Urdu text. In this dissertation we discuss those complexities in detail and propose a novel analytical approach for recognizing printed Urdu Nastaliq text. The proposed technique is effective in recognizing cursive scripts like Urdu and is font-size invariant. Finally, we evaluated the proposed technique on a dataset of 2347 primary Urdu Nastaliq ligatures and found that our system is able to recognize Nastaliq ligatures with an accuracy of 97.10%.

Acknowledgment

First of all I would like to thank Allah Almighty, the Most Merciful, for giving me the opportunity, courage and strength to complete this dissertation. I want to express my gratitude to my supervisor Dr. Khalid Saleem, who has been a constant source of motivation for me. His endeavors, struggle, guidance and supervision will always remain in my memory through thick and thin. It has been an honor to have him as my supervisor.
I am grateful for all his contributions of time, ideas, and knowledge, which made my research experience productive, exciting and a milestone in my life. The joy and enthusiasm he has for research were contagious and motivational for me, even during tough times in my research. Without his guidance, motivation, and endless support it would have been impossible to persevere through obscure situations and hurdles. His willingness to encourage and guide me contributed tremendously to my research. Above all, I would like to acknowledge the patience and care my supervisor showed at every moment of my research work.

I am very thankful to my teachers Prof. Dr. M. Afzal Bhatti (Chairman), Dr. Onaiza Maqbool, Dr. Shuaib Karim and Dr. Mubashir, who gave me immeasurable and productive knowledge that will remain with me throughout my life. I would like to express my gratitude to my class fellows Muzzamil Ali Khan, Mudassar Ali Khan, Ali Masood, Johar Ali, Qadeem Khan, Zainab Malik, Unsa Masood Satti and Farzana Gul, who inspired me to strive towards my goal. I will never forget the feedback of my seniors Mohsin Ansari, Danish Saif Talpur, Lateef Talpur and Imran Malik, and their helpful guidance along the way.

Last, but not least, I thank my parents and family for their unflagging love and support; this dissertation would not have been possible without them. I express gratitude to my loving and caring siblings, who have been a constant source of care, concern and strength, and I appreciate their generosity and understanding.

Thank you
Danish Altaf Satti

Dedicated to My Parents, Sisters & Dadi Jan and Late Dada Jan

Contents

Abstract
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 What is Optical Character Recognition and where does it stand
  1.2 Applications Of OCR
  1.3 Types of Optical Character Recognition Systems
    1.3.1 Off-line versus On-line
    1.3.2 Printed versus Handwritten
    1.3.3 Isolated versus Cursive
    1.3.4 Single font versus Omni-font
  1.4 The Urdu OCR
    1.4.1 The need of Urdu OCR (Motivation)
    1.4.2 Characteristics of Urdu Script
    1.4.3 Nastaliq Style of Writing
  1.5 The General OCR Process
    1.5.1 Image Acquisition
    1.5.2 Pre-processing
      1.5.2.1 Binarization (Thresholding)
      1.5.2.2 Noise Removal
      1.5.2.3 Smoothing
      1.5.2.4 De-skewing
      1.5.2.5 Secondary Components Extraction
      1.5.2.6 Baseline Detection
      1.5.2.7 Thinning (Skeletonization)
    1.5.3 Segmentation
    1.5.4 Feature Extraction
    1.5.5 Classification & Recognition
  1.6 Holistic VS Analytical Approach
  1.7 Contribution of this Dissertation
  1.8 Dissertation Outline

2 Background & Related Work
  2.1 Image Acquisition
  2.2 Noise Removal, Smoothing and Skew Detection
  2.3 Binarization (Thresholding)
    2.3.1 Global Thresholding
      2.3.1.1 Otsu
      2.3.1.2 ISODATA
    2.3.2 Local Adaptive Thresholding
      2.3.2.1 Bernsen
      2.3.2.2 White and Rohrer
      2.3.2.3 Yanowitz and Bruckstein
      2.3.2.4 Niblack
      2.3.2.5 Sauvola and Pietikäinen
      2.3.2.6 Nick
    2.3.3 Evaluation of Thresholding Algorithms
  2.4 Thinning (Skeletonization)
    2.4.1 Types of Thinning Algorithms
    2.4.2 Challenges in Thinning for OCR
    2.4.3 Hilditch's Thinning Algorithm
    2.4.4 Zhang-Suen Thinning Algorithm
    2.4.5 Huang et al. Thinning Algorithm
  2.5 Segmentation
    2.5.1 Page Decomposition
    2.5.2 Line Segmentation
    2.5.3 Ligature/Sub-word Segmentation
    2.5.4 Character Segmentation
  2.6 Feature Extraction
    2.6.1 Template Matching and Correlation
    2.6.2 Statistical Features
      2.6.2.1 Moments
      2.6.2.2 Zoning
      2.6.2.3 Crossings
    2.6.3 Structural Features
  2.7 Classification & Recognition
    2.7.1 Decision-theoretic based classification
      2.7.1.1 Statistical Classifiers
      2.7.1.2 Artificial Neural Networks
    2.7.2 Structural Methods
  2.8 Related Work in Off-line Urdu OCR
    2.8.1 Shamsher et al. Approach
    2.8.2 Hussain et al. Approach
    2.8.3 Sattar Approach
    2.8.4 Sardar and Wahab Approach
    2.8.5 Javed Approach
    2.8.6 Akram et al. Approach
    2.8.7 Iftikhar Approach
    2.8.8 Comparison of Related Work in Urdu OCR
  2.9 Chapter Summary

3 Complexities and Implementation Challenges
  3.1 Complexities