Recognition and Retrieval from Document Image Collections

Recognition and Retrieval from Document Image Collections Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering by Million Meshesha 200299004 [email protected] International Institute of Information Technology Hyderabad 500 032, India http://www.iiit.ac.in August 2008 INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY Hyderabad, India CERTIFICATE This is to certify the thesis entitled "RECOGNITION AND RETRIEVAL FROM DOCUMENT IMAGE COLLECTIONS" submitted by Million Meshesha, to the In- ternational Institute of Information Technology Hyderabad, India, for the award of the degree of Doctor of Philosophy in Computer Science and Engineering is a record of bonafide research work carried out by him under my supervision, during the period (2002 - 2007). The results embodied in this thesis have not been submitted for any other degree or diploma. Date Dr. C. V. Jawahar i ii Acknowledgment I am deeply indebted to my advisor Dr. C. V. Jawahar for his dedication, encouragement, motivation and also for his inspiring guidance and supervision throughout my thesis work. I am also greatly indebted to Dr. P. J. Narayanan (PJN) for his con- cern, encouragement and advise. My sincere thanks also forwarded to Dr. Anoop M. Namboodiri for his critical comments on my journal articles and conference papers. I would also like to thank document understanding research groups at Centre for Visual Information Technology (CVIT), who had made great contribution by sharing ideas, comments and materials. My dearest thanks goes to MNSSK Pavan Kumar, KS Sesh Kumar, A. Balasubramanian, K. Pramod Sakar, Suryakant Patidar, N. V. Neeba and Anand Kumar for their invaluable suggestions and kindness to help me in any way possible. I give my warmest gratitude to Satyanarayana (CVIT) for his unreserved support related to administrative matters. I am thankful to my friends Kula Kekeba, Ermias Abebe and Sebsibe H/Mariam for sharing their ideas, Ethio news and resources. My thanks also goes to Addis Ababa University (AAU) for granting me study leave with the necessary benefits, without which I could not have been able to join my Ph. D. study here in IIITH, India. I thank Ministry of Education (MOI) for granting me scholarship, and Ethiopian Embassy, India for actively serving as a bridge in matters between me and the other end (AAU and MOI). I also extend my sincere thanks to faculty members and administrative staffs of Faculty of Informatics (AAU). My greatest gratitude is extended to my family. I forward special thanks to my brother Bahiru Eyassu and his family who cares, advises and plays a critical role in the other side of my life in Ethiopia. My friends in Ethiopia are the engine during my study, especially Hamere Getachew, Daniel Teklu, Enchalew Yifru, Getaw Shumiye, Yonas Bahiru, Yared Girma, Tesfaye Gebeyehu and many more. Last but not least, I would like to thank all CVIT and IIITH community (faculties, students and administrative staffs) who have a significant contribution in one way or another during my research work. iii iv Abstract The present growth of digitization of books and manuscripts demands an immediate solution to access them electronically. This will enable the archived valuable materials to be searchable and usable by users in order to achieve their objectives. This requires research in the area of document image understanding, specifically in the area of document image recognition as well as document image retrieval. In the last three decades significant advancement is made in the recognition of documents written in Latin-based scripts. There are many excellent attempts in building robust document analysis systems in industry, academia and research labs. Intelligent recognition systems are commercially available for use for certain scripts. However, there is only limited research effort for the recognition of indigenous scripts of African and Indian languages. In addition, diversity of archived printed documents poses many challenges to document analysis and understanding. Hence in this work, we explore novel approaches for understanding and accessing the content of document image collections that vary in quality and printing. In Africa around 2,500 languages are spoken. Some of these languages have their own indigenous scripts in which there is a bulk of printed documents available in the various institutions. Digitization of these documents enables us to harness already available language technologies to local information needs and developments. We present an OCR for converting digitized documents in Amharic language. Amharic is the official language of Ethiopia. Extensive literature survey reveals that this is the first attempt that reports the challenges toward the recognition of indigenous African scripts and a possible solution for Amharic script. Research in the recognition of Amharic script faces major challenges due to (i) the use of large number of characters in the writing and (ii) the existence of large set of visually similar characters. Here we extract a set of optimal discriminant features to train the classifier. Recognition results are presented on real-life degraded documents such as books, magazines and newspapers to demonstrate the performance of the recognizer. The present OCRs are typically designed to work on a single page at a time. We v vi argue that the recognition scheme for a collection (like a book) could be considerably different from that designed for isolated pages. The motivation here is therefore to exploit the entire available information (during the recognition process), which is not effectively used earlier for enhancing the performance of the recognizer. To this end, we propose self adaptable OCR framework for the recognition of document image collections. This approach enables the recognizer to learn incrementally and adapt to document image collections for performance improvement. We employ learning procedures to capture the relevant information available online, and feed it back to update the knowledge of the system. Experimental results show the effectiveness of our design for improving the performance of the recognizer on-the-fly, thereby adapting to a specific collection. For indigenous scripts of African and Indian languages there is no robust OCR available. Designing such a system is also a long-term process for accessing the archived document images. Hence we explore the application of word spotting approach for retrieval of printed document images without explicit recognition. To this end, we propose an effective word image matching scheme that achieves high performance in presence of script variability, printing variations, degradations and word-form variations. A novel partial matching algorithm is designed for morphological matching of word form variants in a language. We employ a feature extraction scheme that ex- tracts local features by scanning vertical strips of the word image. These features are then combined based on their discriminatory potential. We present detailed experimental results of the proposed approach on English, Amharic and Hindi documents. Searching and retrieval from document image collections is challenging because of the scalability issues and computational time. We design an efficient indexing scheme to speed up searching for relevant document images. We identify the word set by clustering them into different groups based on their similarities. Each of these clusters are equivalent to a variation in printing, morphology, and quality. This is achieved by mapping IR principles (that are widely used in text processing) for relevance ranking. Once we cluster list of index terms that define the content of the document, they are indexed using inverted data structure. This data structure also provides scope for incremental clustering in a dynamic environment. The indexing scheme enables effective search and retrieval in image-domain that is comparable with text search engines. We demonstrate the application of the indexing process with the help of experimental results. The proliferation of document images at large scale demands a solution that is effective to access document images at collection level. In this work, we investigate machine learning and information retrieval approaches that suits this demand. Ma- vii chine learning schemes are effectively used to redesign existing approach to OCR development. The recognizer is enabled to learn from its experience and improve its performance over time on document image collections. Information retrieval (IR) principles are mapped to construct an indexing scheme for efficient content-based search and retrieval from document image collections. Existing matching scheme is redesigned to undertake morphological matching in the image domain. Performance evaluation using datasets from different languages shows the effectiveness of our approaches. Extension works are recommended that need further consideration in the future to further the state-of-the-art in document image recognition and document image retrieval. viii Contents 1 Introduction 1 1.1 Background . 1 1.2 Motivation and Problem Setting . 3 1.3 Recognition and Retrieval from Document Images . 5 1.4 Objectives of the Study . 6 1.5 Significance of the Work . 7 1.6 Major Contributions . 8 1.7 Scope of the Work . 8 1.8 Organization of the Thesis . 9 2 State of the Art 11 2.1 Digital Library . 11 2.1.1 Current Status . 12 2.1.2 Digital Library of India . 13 2.1.3 Challenges . 13 2.2 Recognition of Document Images . 15 2.2.1 Overview of OCR Design . 16 2.2.2 Commercial and Free OCR Systems . 18 2.2.3 OCRs Development by Research Labs and Academia . 19 2.2.4 OCR Systems for Oriental Languages . 21 2.2.5 OCRs for Indian Scripts . 24 2.2.6 OCRs for African Indigenous Scripts . 26 2.2.7 Holistic Word Recognition . 26 2.3 Retrieval from Document Images . 27 2.3.1 Word Spotting for Handwritten Documents Retrieval . 28 2.3.2 Word Spotting for Printed Documents Retrieval . 30 2.3.3 Indexing Document Images . 32 2.4 Discussion . 37 ix x CONTENTS 2.5 Summary and Comments .

Load more