Document Image Retrieval with Improvements in Database Quality
Total Page:16
File Type:pdf, Size:1020Kb
DOCUMENT IMAGE HANNU RETRIEVAL WITH KAUNISKANGAS IMPROVEMENTS IN DATABASE Department of Electrical Engineering QUALITY OULU 1999 HANNU KAUNISKANGAS DOCUMENT IMAGE RETRIEVAL WITH IMPROVEMENTS IN DATABASE QUALITY Academic Dissertation to be presented with the assent of the Faculty of Technology, University of Oulu, for public discussion in Raahensali (Auditorium L 10), Linnanmaa, on August 17th, 1999, at 12 noon. OULUN YLIOPISTO, OULU 1999 Copyright © 1999 Oulu University Library, 1999 Manuscript received 21.6.1999 Accepted 23.6.1999 Communicated by Doctor Omid E. Kia Professor Pasi Koikkalainen ISBN 951-42-5313-2 (URL: http://herkules.oulu.fi/isbn9514253132/) ALSO AVAILABLE IN PRINTED FORMAT ISBN 951-42-5313-2 ISSN 0355-3213 (URL: http://herkules.oulu.fi/issn03553213/) OULU UNIVERSITY LIBRARY OULU 1999 Kauniskangas, Hannu:Document image retrieval with improvements in database quality Infotech Oulu and Department of Electrical Engineering, University of Oulu, P.O.Box 4500, FIN-90401 Oulu, Finland 1999 Oulu, Finland (Received 21 June, 1999) Abstract Modern technology has made it possible to produce, process, transmit and store digital images efficiently. Consequently, the amount of visual information is increasing at an accelerating rate in many diverse application areas. To fully exploit this new content-based image retrieval techniques are required. Document image retrieval systems can be utilized in many organizations which are using document image databases extensively. This thesis presents document image retrieval techniques and new approaches to improve database content. The goal of the thesis is to develop a functional retrieval system and to demonstrate that better retrieval results can be achieved with the proposed database generation methods. Retrieval system architecture, a document data model, and tools for querying document image databases are introduced. The retrieval framework presented allows users to interactively define, construct and combine queries using document or image properties: physical (structural), semantic, textual and visual image content. A technique for combining primitive features like color, shape and texture into composite features is presented. A novel search base reduction technique which uses structural and content properties of documents is proposed for speeding up the query process. A new model for database generation within the image retrieval system is presented. An approach for automated document image defect detection and management is presented to build high quality and retrievable database objects. In image database population, image feature profiles and their attributes are manipulated automatically to better match with query requirements determined by the available query methods, the application environment and the user. Experiments were performed with multiple image databases containing over one thousand images. They comprised a range of document and scene images from different categories, properties and condition. The results show that better recall and accuracy for retrieval is achieved with the proposed optimization techniques. The search base reduction technique results in a considerable speed-up in overall query processing. The constructed document image retrieval system performs well in different retrieval scenarios and provides a consistent basis for algorithm development. The proposed modular system structure and interfaces facilitate its usage in a wide variety of document image retrieval applications. Keywords: content-based retrieval, database population optimization, image database Acknowledgements This work was carried out in the MediaTeam Oulu and Machine Vision and Media Processing Unit at the Department of Electrical Engineering of the University of Oulu, Finland during the years 1995-1999. I would like to thank Professor Matti Pietikäinen, the head of the group, for allowing me to work in his laboratory and providing me with the excellent facilities to complete this thesis. I would also like to express my gratitude to Professor Jaakko Sauvola for his contribution and enthusiastic attitude. I am grateful to Dr. Doermann for fruitful collaboration and opportunities to visit the University of Maryland. I am grateful to all members of the MediaTeam and the Machine Vision and Media Processing Unit for creating a pleasant working environment. Professor Pasi Koikkalainen from the University of Jyväskylä and Dr. Omid Kia from the National Institute of Standards and Technology are acknowledged for reviewing and commenting on the thesis. Their constructive criticism improved the quality of the manuscript considerably. Many thanks also to Dr. Timo Ojala for reading and commenting on the thesis. The following institutions are gratefully acknowledged for their important financial support: the Graduate School in Electronics, Telecommunications and Automation, the Academy of Finland, and the Technology Development Center of Finland. I am deeply grateful to my mother Ritva and father Pentti for their love and care over the years. My brother Jukka and sister Eija deserve warm thanks for their unconditional support. Most of all, I want to thank my dear wife Jaana for her patience and understanding. Oulu, June 19, 1999 Hannu Kauniskangas Abbreviations Abbreviations CBIR content-based image retrieval DAS document analysis system DTM distributed test management FSBR functional search base reduction GUI graphical user interface IDIR intelligent document image retrieval IIR intelligent image retrieval IR information retrieval NNC neural network classifier OCR optical character recognition ORDBMS object-relational database management system SBR search base reduction SQL structured query language STORM automated document image cleaning system List of original papers I Doermann D, Sauvola J, Kauniskangas H, Shin C, Pietikäinen M & Rosenfeld A (1997) The development of a general framework for intelligent document image retrieval. A book chapter in Document Analysis Systems II, Series in Machine Perception and Artificial Intelligence, World Scientific, 433-460. II Kauniskangas H, Sauvola J, Pietikäinen M & Doermann D (1997) Content-based image retrieval using composite features. Proc. 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland, 1: 35-42. III Sauvola J, Doermann D, Kauniskangas H, Shin C, Koivusaari M & Pietikäinen M (1997) Graphical tools and techniques for querying document image databases. Proc. First Brazilian Symposium on Advances in Document Image Analysis, Curitiba, Brazil, 213-224. IV Kauniskangas H & Pietikäinen M (1996) Development support for content-based image retrieval systems. Proc. Multimedia Storage and Archiving Systems, Boston, Massachusetts, USA, SPIE Vol. 2916, 142-149. V Sauvola J & Kauniskangas H (1999) Active Multimedia Documents. To appear in Multimedia Tools and Applications. VI Kauniskangas H & Sauvola J (1998) An automated defect management for document images. Proc. 14th International Conference on Pattern Recognition, Brisbane, Australia, 1288-1294. VII Kauniskangas H, Sauvola J & Pietikäinen M (1999) Optimization techniques for document image retrieval. Proc. 11th Scandinavian Conference on Image Analysis, Kangerlussuaq, Greenland, 673-682. VIII Sauvola J, Doermann D, Kauniskangas H & Pietikäinen M (1997) Techniques for the automated testing of document analysis algorithms. Proc. First Brazilian Symposium on Advances in Document Image Analysis, Curitiba, Brazil, 201-212. Contents Abstract Acknowledgements Abbreviations List of original papers Contents 1. Introduction ......................................................................................................... 13 1.1. Background...............................................................................................13 1.2. The scope and contributions of the thesis.................................................15 1.3. Summary of the publications - the role of the author ...............................17 2. Information models for retrieval ......................................................................... 21 2.1. Modeling of scene images ........................................................................21 2.2. Modeling of document images .................................................................23 2.3. Our approach.............................................................................................30 2.4. Discussion.................................................................................................32 3. Image retrieval systems....................................................................................... 34 3.1. Content-based retrieval .............................................................................34 3.2. Scene image retrieval................................................................................35 3.2.1 Scene image database population..................................................37 3.2.2 Scene image query techniques.......................................................41 3.3. Document image retrieval.........................................................................44 3.3.1 Document image database population...........................................45 3.3.2 Document image query techniques ...............................................48 3.4. Our approach.............................................................................................50 3.4.1 Application development...............................................................50 3.4.2 Document image retrieval .............................................................50