TEXT LINE EXTRACTION USING SEAM CARVING

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirement for the Degree

Master of Science

Christopher Stoll

May, 2015

TEXT LINE EXTRACTION USING SEAM CARVING

Christopher Stoll

Thesis

Approved:

______________________________
Advisor
Dr. Zhong-Hui Duan

______________________________
Faculty Reader
Dr. Chien-Chung Chan

______________________________
Faculty Reader
Dr. Yingcai Xiao

______________________________
Department Chair
Dr. Timothy Norfolk

Accepted:

______________________________
Dean of the College
Dr. Chand Midha

______________________________
Interim Dean of the Graduate School
Dr. Rex Ramsier

______________________________
Date

ABSTRACT

Optical character recognition (OCR) is a well researched area of computer science; presently there are numerous commercial and open source applications which can perform OCR, albeit with varying levels of accuracy. Since the process of performing OCR requires extracting text from images, it should follow that text line extraction is also a well researched area. Indeed, there are many methods to extract text from images of scanned documents, a process known in that field as document analysis and recognition. However, the existing text extraction techniques were largely devised to feed existing character recognition techniques. Since this work was originally conceived from the perspective of computer vision and pattern recognition, a new approach seemed necessary to meet the new objectives.

Out of that need an apparently novel approach to text extraction was devised which relies upon the central idea behind seam carving. Text images are examined for seams, but rather than removing the lowest energy seams, they are evaluated to determine where text is located within the image. The approach can be run iteratively, alternating the direction of seam evaluation, to recognize increasingly specific areas of the image. The ultimate goal is to create an algorithm which can provide data to a new type of recognition algorithm, one that can understand information at higher levels of the document structure and possibly use information gained from those higher levels (words or phrases) to increase the accuracy of identifying data in the lower levels (characters).

This paper explores the existing methods used for text extraction, touching upon existing OCR techniques, and then describes a novel technique for information extraction based upon seam carving. Modifications needed to adapt the seam carving process to the new problem domain are explained. Then, two output methods, direct area detection and information masking, are described. Finally, potential modifications to the technique, which could make it suitable for use in other domains such as general computer vision or bioinformatics, are discussed.

ACKNOWLEDGEMENTS

First, I would like to thank Dr. Zhong-Hui Duan for making this work possible. Her guidance on this project was invaluable and, perhaps more importantly, I would have never even seen the possibilities without the knowledge I gained from Dr. Duan’s instruction. I would also like to thank my committee members, Dr. Chien-Chung Chan and Dr. Yingcai Xiao, for the time they took reviewing my work and providing valuable feedback.

Additionally, I would like to thank Dr. Kathy J. Liszka, who helped an unlikely candidate get a chance to succeed in a Computer Science program. I would also like to acknowledge the contributions of Dr. Michael L. Collard and Dr. Timothy W. O’Neil, who influenced the methodologies and technical implementations used in this project.

Finally, I would like to thank my wife Heather, my boys, my family, and my friends. Without their support, occasional diversions, and encouragement I would have never been able to see this project through to completion.

Namaste.


TABLE OF CONTENTS

LIST OF FIGURES

LIST OF EQUATIONS

CHAPTER

I. INTRODUCTION

II. EXISTING APPROACHES
Text Extraction Methods
Run-length Smearing
X-Y Cut
Docstrum
Whitespace Analysis
Voronoi
Text Extraction Software
Cuneiform
OCRopus
Tesseract
OCRFeeder

III. PROPOSED APPROACH
Seam Carving
Preprocessing Functions
Energy Functions
Simple Gradient
Sobel Edge Detection
Laplacian of Gaussians
Difference of Gaussians
Difference of Gaussians with Sobel
Seam Traversal
Text Extraction Steps
Direct Area Detection
Information Masking
Method Comparison
Overcoming Limitations
Improved Skew Handling
Complicated Layouts
Extracting Finer Details
Asymptotic Analysis
Related Work
Future Work

IV. CONCLUSIONS

REFERENCES

APPENDICES
APPENDIX A: PROGRAM MAKEFILE
APPENDIX B: PROGRAM MAIN — SC.C
APPENDIX C: SEAM CARVING FUNCTIONS — LIBSEAMCARVE.C
APPENDIX D: PNG IMPORT FUNCTIONS — LIBPNGHELPER.C
APPENDIX E: IMAGE RESIZE FUNCTIONS — LIBRESIZE.C
APPENDIX F: IMAGE BINARIZATION — LIBBINARIZATION.C
APPENDIX G: IMAGE ENERGY FUNCTIONS — LIBENERGIES.C
APPENDIX H: PIXEL DATA STRUCTURE — PIXEL.H
APPENDIX I: WINDOW DATA STRUCTURE — WINDOW.H
APPENDIX J: UTILITIES — LIBMINMAX.C

LIST OF FIGURES

Figure 1.1: Run-length Smearing Example
Figure 1.2: X-Y Cut Example 1
Figure 1.3: X-Y Cut Example 2
Figure 1.4: Docstrum Example
Figure 1.5: Whitespace Analysis Example
Figure 2.1: Seam Deviation Example
Figure 2.2: Edge Profiles
Figure 2.3: Example Gaussian Filter Kernel
Figure 2.4: Energy Function Examples
Figure 2.5: Seam “Shadow” Example
Figure 2.6: Seam Path Example
Figure 2.7: Direct Area Detection
Figure 2.8: Information Masking
Figure 2.9: Comparing Direct Area Detection and Information Masking
Figure 2.10: Skewed Text Comparison
Figure 2.11: Seams Identified in Previously Identified Areas
Figure 2.12: Asymptotic Analysis
Figure 2.13: Image Overview Extraction
Figure 2.14: Experimental Application Toward Protein Differentiation

LIST OF EQUATIONS

Equation 2.1: Formal Seam Definition and Example
Equation 2.2: Simple Gradient Formula
Equation 2.3: Formal Seam Pixel Value Definition
Equation 2.4: Formal Seam Pixel Value Definition New
Equation 2.5: Formal Definition of Net Deviation

CHAPTER I

INTRODUCTION

While researching the possible application of the eigenface technique [4, 5] towards optical character recognition (OCR), it was discovered that the technique only worked well when the candidate characters (the characters to be recognized) were scaled and centered exactly like the characters used for training.

The original eigenface technique works well due to the fact that candidate images (faces of unknown people) can be easily scaled and rotated by using features which every face has — eyes, noses, and mouths. Unlike faces, characters do not share fixed points of reference that can be used to scale and rotate them into a standard form; a robust text line extraction method is required.

When simple text extraction approaches, such as X-Y Cut or run-length smearing, are used for segmentation, additional post-processing steps and heuristics are required to get a clean, standardized version of each character.

When more sophisticated text extraction approaches are used, fewer post-processing steps are required, but the approaches are inherently more complex and still require properly processed inputs. With the ultimate goal of principal component analysis based character recognition in mind, it seems worthwhile to consider novel approaches to text line extraction which are uniquely suited to that purpose.

Since the ostensible benefits of this proposed text line extraction approach are largely premised upon the end goal of developing a novel approach to character recognition, it seems appropriate to briefly discuss the possible merits of investigating a novel approach to character recognition. Optical character recognition is a well researched area, and commercial applications which can accurately provide this functionality are widely available. However, the existing applications are focused on a single task; their usual goal is to take a scanned document and turn it into more useful textual data. The goal of principal component analysis based character recognition, which will hopefully be enabled by the novel text extraction methods described here, is to provide higher levels of understanding. Rather than identifying characters and then forming words based upon statistics, it will identify known words directly and only delve into character recognition when a word is not recognized. Instead of “sounding out” every word (the technique taught to children who are learning to read English whereby each letter is pronounced in hopes of identifying the word based upon its sound), only unrecognized words will be examined for individual characters.

These objectives further support the necessity of a new approach to text extraction, one which can extract lines, phrases, and words as easily as it can individual characters.

Given the desired outcomes, characteristics of an ideal algorithm were considered. In general terms, the ideal text extraction approach would be robust, yet not overly complicated; it would be an elegant solution. Ideally, a single algorithm could be implemented which could extract information from an image at different levels (lines, words, or characters). However, given the potential diversity of the source images, finding a unified approach seemed unlikely, at least until a seam carving based approach [6] was considered.

It is thus proposed that a seam carving approach be used to extract text from scanned documents. With an iterative seam carving approach it should be possible to segment the image, split out the lines of text, and split out the characters from the detected lines. The early iterations identify the text blocks; the tops and bottoms of the characters are identified during the horizontal line splitting step, and the left and right sides of the characters are identified during the vertical character splitting step. Given the precise boundaries of the candidate characters, they can be projected onto a vector with a preset number of dimensions, and that vector can be used for image recognition. If this approach were to be viable it could further eliminate the need for specific deskewing, connected component analysis, and other preprocessing algorithms. The orientation of the relevant seams could be determined either at the line level or at the character level.

This paper will begin by exploring existing text extraction methods. In addition to examining published text extraction techniques, methods used in open source OCR programs will also be reviewed. Then, after providing an explanation of seam carving which pays special attention to the energy functions, a novel approach to text line extraction based upon seam carving will be described. Modifications to the seam carving technique, which are necessary to apply seam carving to text line recognition, will be discussed. Then, two variations on the approach, direct area detection and information masking, will be described. Direct area detection attempts to directly locate the boundaries of text lines, while information masking provides a mask under which information is located. Finally, future work and possible alternative applications will be explored.

CHAPTER II

EXISTING APPROACHES

Before we go into more details regarding the proposed implementation, existing approaches to the problem should first be discussed. Popular text extraction and image segmentation algorithms will first be examined, then the segmentation portions of popular OCR programs will be examined. In order to find the most successful techniques for extracting text from images, the most successful OCR programs will be considered; they must, necessarily, use an effective method for extracting text and isolating characters. Since commercial OCR programs are closed-source and cannot be examined, only open-source projects will be considered.

Text Extraction Methods

There are five image segmentation algorithms that appear frequently in academic papers: run-length smearing (sometimes referred to as the Run-Length Smoothing Algorithm or RLSA), X-Y Cut (sometimes called XY-Cut), Docstrum, whitespace analysis, and Voronoi. A brief explanation of each will be given.

Run-length Smearing

The run-length smearing approach [7] is a relatively old and simplistic approach to image segmentation for text recognition. The algorithm works by smearing a binarized version of the image vertically and horizontally so that individual components (characters, words, lines, paragraphs) are merged into black blobs (Figure 1.1). The original paper by Wong, Casey, and Wahl describes a smoothing method whereby gaps between black pixels are closed on a row-by-row and column-by-column basis when the gap is less than a tolerance level (which is relative to the length of the row or column). The vertical and horizontal smearing are AND-ed together and then either smoothed or blurred. The resulting black areas are analyzed for height, rectangular eccentricity, percentage of coverage, and the average width of the black portions of the area. The data for each block is entered into a table and heuristics are applied to identify areas which likely contain text. The method used by Wong et al. to extract text characters from the areas involves splitting the likely text lines and performing pattern matching; an alphabet is built based upon which shapes are seen.
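As an illustrative sketch of the row-wise gap closing just described (binary pixels, with 1 for black; the names and threshold handling are assumptions, not Wong et al.’s exact code):

#include <stdint.h>

/* Close white gaps shorter than the threshold between black pixels (1s),
   merging nearby components into blobs, one row at a time. */
static void smearRow(uint8_t *row, int width, int threshold)
{
    int lastBlack = -1;
    for (int x = 0; x < width; x++) {
        if (row[x] == 1) {
            if (lastBlack >= 0 && (x - lastBlack - 1) < threshold) {
                for (int i = lastBlack + 1; i < x; i++) {
                    row[i] = 1;    /* fill the gap with black */
                }
            }
            lastBlack = x;
        }
    }
}

The same routine run over columns, ANDed with the row result, produces the blobs described above.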

The run-length smearing algorithm requires that the image first be binarized and deskewed, but due to the text mask analysis it does not need to have non-text items removed. The method for isolating individual characters is based upon matching other patterns within the document and using a trial-and-error approach to character splitting. The pattern matching technique described by Wong et al. is very similar to the concept of using the eigenface technique; it even feeds back into the segmentation algorithm. Their technique can also be used to remove non-text items from the images.

Figure 1.1: Run-length Smearing Example (Kevin Laven, University of Toronto)

Figure 1.2: X-Y Cut Example 1 (Joost van Beusekom, Technische Universitat Kaiserslautern)

X-Y Cut

The X-Y Cut algorithm [8, 16] takes vertical and horizontal histograms of pixels in the scanned text image; the outputs of this process are sometimes referred to as projection profile cuts. The method described by Nagy and Seth [8] creates histograms which simply show the number of pixels in a given row or column within the image, while the method described by Ha, Haralick, and Phillips [16] creates histograms which show the number of bounding areas in a given row or column. Minimums in the histogram, which represent white areas in the document, are then identified (Figure 1.2). The approach of Nagy and Seth just requires a binarized and deskewed image, but the approach of Ha et al. also requires that a connected component analysis be performed to identify the bounding areas (Figure 1.3). For simple one-column documents the image can be directly split from here, but in real-world situations where the layout is more complicated the document must be traversed in kD-tree fashion (specifically, as an X-Y tree). This algorithm also requires that binarization and deskewing be performed in the preprocessing stages.
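A sketch of the pixel-counting variant of those projection profiles (row-major binary image; the names are illustrative):

#include <stdint.h>
#include <string.h>

/* Count black pixels (1s) in every row and column of a binarized image;
   minima in these histograms mark the whitespace where cuts can be made. */
static void projectionProfiles(const uint8_t *img, int width, int height,
                               int *rowHist, int *colHist)
{
    memset(rowHist, 0, (size_t)height * sizeof(int));
    memset(colHist, 0, (size_t)width * sizeof(int));
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (img[y * width + x] == 1) {
                rowHist[y]++;
                colHist[x]++;
            }
        }
    }
}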

Figure 1.3: X-Y Cut Example 2. (a) the original image (b) placement of cuts (c) zones subdivided by the recursive X-Y cut (d) X-Y tree of the page layout structure of the document image shown in (a) (Ha)

Docstrum

The Docstrum [9] algorithm is more interesting than either run-length smearing or X-Y Cut. The name Docstrum is derived from Document Spectrum; its author, O’Gorman, was inspired by ideas from the area of spectrum analysis.

The algorithm performs a k-nearest neighbor analysis of the components found during a bottom-up connected component analysis. This yields a graph of connected ‘connected components’; the center points of the ‘connected components’ are themselves connected to the k nearest center points of other ‘connected components’ (Figure 1.4). These connections define the boundaries of low-level areas (words or lines) which contain text.

O’Gorman describes how the connections which are found can be used to correct for document defects such as skew. For skew, a histogram of the connection angles is used. A similar approach, using a histogram of connection lengths, is used to determine character spacing.

This approach does not require a deskewed image, but it does require binarization and a bottom-up connected component analysis. Noise reduction by the application of the kFill filter [17], which removes small speckles, is also recommended. Due to the nature of the algorithm, sections of very different text sizes should be processed separately; this is usually done in preprocessing.

Docstrum also has difficulties identifying document items that are above the line level (blocks or areas). The biggest drawback to this approach is that k-nearest neighbor must be performed on each possible character, which takes O(n²) time (where n is the number of candidate characters). In practice, computation time is reduced by limiting the range in which to search and only increasing it when k connections are not found.

Figure 1.4: Docstrum Example. (a) original image (b) k-nearest neighbor of connected components (c) final boundaries (O’Gorman)

Whitespace Analysis

The insight behind whitespace analysis is that “white space is a generic layout delimiter” [10]. This means that a document’s background, since it is normally uniformly white, can more readily be identified than its foreground. These approaches focus on the geometry of documents which conform to a Manhattan or Taxi-Cab page layout, which is essentially when a page contains isolated text blocks (like city blocks in Manhattan, New York).

There are various whitespace analysis techniques. A modern method was described by Baird [10], which was influenced by the work of Nagy et al. [18], but most subsequent work cites enhanced versions created by Breuel [11, 12].

Like Docstrum, whitespace analysis requires connected component analysis. The algorithm attempts to identify maximum areas of whitespace, called “covers” (Figure 1.5). These areas are sorted and merged into larger areas which can be used to define the boundaries of the text containing areas.

Figure 1.5: Whitespace Analysis Example (a) searching the connected components for “covers” (b) identified whitespace areas (c) identified whitespace areas grouped, showing search grid (Breuel)

Voronoi

The Voronoi algorithm is a very interesting approach based upon the mathematical concept of Voronoi tessellations [13]. The algorithm was conceived to deal with the problem of non-Manhattan layouts, where documents may have regions which are not laid out with perfectly horizontal and vertical edges. Pages that have a non-Manhattan layout cannot be deskewed with normal methods, so this approach is more flexible. The algorithm requires that points be found within connected components by using an edge following algorithm. These Voronoi points cannot be used to directly create the Voronoi diagram; steps must be taken to remove superfluous edges and create Voronoi areas that contain interesting portions of the document.

Text Extraction Software

Algorithms which give optimal quality results may not be asymptotically optimal, may simply not perform fast enough on currently available hardware, or may have other problems which make them impractical in real world applications. So, the known methods need to be evaluated for applicability. Rather than implementing each algorithm and performing extensive analysis, present OCR systems will be examined to see which image segmentation algorithms are presently in use. Since proprietary OCR programs cannot be examined, the study was necessarily limited to open-source software offerings.

Currently the most predominant open-source OCR programs are GOCR, Ocrad, Cuneiform, OCRopus, Tesseract, and OCRFeeder. GOCR performs very poorly on anything but the most regular text, so it was excluded. Ocrad has better recognition abilities than GOCR, but its performance is still very poor; the main developer even states that GOCR uses rudimentary “ad hoc” algorithms, so it was excluded. Cuneiform, which was once a commercial product, has trouble recognizing the type-written characters in the example files, but it successfully identifies areas of text which should be examined, so it will be evaluated.

OCRopus is a set of Python scripts which perform various OCR functions, but it is well documented and is developed by Professor Breuel, so it will be considered. Tesseract appears to be the most widely used open-source OCR program. Finally, there is OCRFeeder. This is not really an OCR program; rather, it is a front-end for all of the previously mentioned programs, but it does perform text extraction, so it will also be considered.

Cuneiform

Cuneiform was developed prior to 1993 by the Russian company Cognitive Technologies. The application was made freeware in 2007, and the main algorithm kernel was released as open-source in 2008. However, the codebase for Cuneiform was large and the few comments which it contained were often in Russian. Due to these difficulties it was not clear precisely how its text extraction was completed.

OCRopus

OCRopus is a modular system, and there are different options available for segmentation. The default method is ocropus-gpageseg, so that is the one which will be considered. The optional method ocropus-prast performs a more complex geometric layout analysis, and ocropus-ridg performs a more experimental ridge based algorithm. The default segmentation method begins its image segmentation process by removing any horizontal black lines that may otherwise be confused for text in later stages. It then attempts to identify columns by finding vertical white or black lines. Next it calculates the max norm of the Gaussian filtered image in order to find the tops and bottoms of text lines; it is essentially looking at vertical gradients. The tops and bottoms are used to establish baselines and heights, and these “seeds” are used to mark text lines.

This process inherently requires that images be deskewed prior to being passed to the algorithm.

Tesseract

Tesseract is an OCR application which was developed by Hewlett Packard between 1985 and 1998; it was open-sourced in 2005 and has been sponsored by Google since 2006. The image segmentation portion of the program uses an algorithm developed in 1994 by Ray Smith of Hewlett Packard. The algorithm, which was designed to work even with a skewed document, starts with a connected component analysis. The outlines from the connected component analysis are joined into blobs, the blobs are filtered by height, and the relevant blobs that remain are organized into lines. Through this process the skew of each line can be determined.

OCRFeeder

OCRFeeder was developed in 2008 by Joaquim Rocha for his Computer Science Master’s thesis. This program does not actually perform OCR itself; rather, it operates as a front-end for programs such as Tesseract and Cuneiform. However, OCRFeeder has its own segmentation engine, which often improves the OCR results given by the recognition program, perhaps because it has a superior segmentation implementation. OCRFeeder appears to use a variation of the whitespace algorithms.

CHAPTER III

PROPOSED APPROACH

Having examined existing approaches described in academic works and used in real-world applications, a new approach based upon seam carving will be described. Though originally conceived independently, the seam carving approach to text extraction has a lot in common with whitespace analysis. The main goal of the seam carving approach is to use the whitespace in a document to identify where the interesting information lies; rather than looking for rectangular covers which are joined into larger areas, the seam carving approach attempts to find whitespace seams directly.

Consider a blank document, or a white image. With a properly constructed seam tracing function, one that attempts to maintain straight lines, all horizontal seams would go straight across the image. If a single letter or word is added to the center of the document, as seams are examined, from the top down or from left to right, they will begin to deviate as they get closer to the letter (Figure 2.1). The deviation should reach its maximum around the center of the word, then switch direction, and begin to deviate less.

Figure 2.1: Seam Deviation Example. (a) seams in a blank document (b) seams in a document which contains a word (c) seam deviation amounts in a document which contains a word

Seen in this way, just as physical matter in space causes distortions in space-time, information in a document causes distortions in the document space.

With the seam carving approach to text segmentation, the distortions made by the information are used to identify the boundaries of information within the document. The seam carving technique needs only slight modification to support this new use.

Seam Carving

Seam Carving was originally described by Avidan and Shamir as a technique for content aware image resizing [6]. Their insight was that it could be possible to resize images through the removal of unimportant information from the image rather than through cropping or scaling. The use of seams, which travel a jagged path completely through the image, allows each row or column to remain the same width as the less important pixels are removed.

Let I be an n ✕ m image:

s^x = \{s^x_i\}_{i=1}^{n} = \{(x(i), i)\}_{i=1}^{n}, \text{ s.t. } \forall i, |x(i) - x(i-1)| \le 1, \text{ where } x : [1, \ldots, n] \to [1, \ldots, m] \quad (2.1a)

s^y = \{s^y_j\}_{j=1}^{m} = \{(j, y(j))\}_{j=1}^{m}, \text{ s.t. } \forall j, |y(j) - y(j-1)| \le 1, \text{ where } y : [1, \ldots, m] \to [1, \ldots, n] \quad (2.1b)

Equation 2.1: Formal Seam Definition and Example
The formal definition of a vertical seam (2.1a), a horizontal seam (2.1b), and a visual example of one vertical and one horizontal seam (Avidan and Shamir)

Preprocessing Functions

The original seam carving paper does not mention explicit preprocessing steps, and they may not be appropriate for image retargeting, but they are needed when the approach is applied to text extraction. First, color images are converted to greyscale; color information is less valuable for text extraction. The conversion of color images to greyscale may at first seem a simple task, but in fact there are a few different ways to accomplish it. Before settling upon the use of average pixel intensity, where the red, green, and blue color channels are averaged, many other approaches were examined. Experiments were run to test the performance of using: the HSV hexcone model, where the maximum of the color channels is selected; different luma and luminance conversion ratios, such as the popular (0.299 ✕ Red + 0.587 ✕ Green + 0.114 ✕ Blue); and even Euclidean distance.
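As a sketch (the pixel structure here is hypothetical; the project’s actual layout is defined in PIXEL.H, Appendix H), the two conversions compare as follows:

#include <stdint.h>

/* Hypothetical RGB pixel; the project's real structure is in pixel.h. */
struct rgbPixel { uint8_t r, g, b; };

/* Average pixel intensity, the conversion settled upon for this project. */
static uint8_t greyAverage(struct rgbPixel p)
{
    return (uint8_t)(((unsigned)p.r + p.g + p.b) / 3);
}

/* Luma-weighted conversion, one of the alternatives that was examined. */
static uint8_t greyLuma(struct rgbPixel p)
{
    return (uint8_t)(0.299 * p.r + 0.587 * p.g + 0.114 * p.b + 0.5);
}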

In addition to converting color images to greyscale, different contrast enhancement techniques were experimented with. Otsu binarization [19] was used, but it did not always yield the best results. In many situations passing each pixel through a simple cosine function yielded the best results. In common photo editing applications this is known as curve adjustment; here a cosine function was used for the shape of the curve.
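A minimal sketch of such a cosine-shaped curve for 8-bit greyscale values (the exact curve used by the project may differ):

#include <math.h>
#include <stdint.h>

#define CURVE_PI 3.14159265358979323846

/* Contrast curve: maps 0 to 0 and 255 to 255 while pushing mid-tones
   apart, the "curve adjustment" familiar from photo editors. */
static uint8_t cosineCurve(uint8_t value)
{
    double x = value / 255.0;                    /* normalize to [0, 1] */
    double y = (1.0 - cos(x * CURVE_PI)) / 2.0;  /* S-shaped curve */
    return (uint8_t)(y * 255.0 + 0.5);
}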

Energy Functions

Since the goal of seam carving is to remove pixels which will not be noticed, Avidan and Shamir considered the images’ “energies.” Each pixel in an image has an energy value which represents how similar it is to the pixels around it. There are various methods for calculating energies; since text is normally contrasted against its background, almost any method should work.

Edge detectors should also work well for high contrast text, so they will also be explored. The energy functions are defined as filter kernels and convolved against the target image.
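A sketch of applying one such kernel at a single pixel (row-major greyscale image, border pixels clamped; the names are illustrative rather than the project’s actual API):

#include <stdint.h>

/* Convolve a square filter kernel of size k (odd) against the pixel at
   (x, y); coordinates outside the image are clamped to the border. */
static double convolveAt(const uint8_t *img, int width, int height,
                         int x, int y, const double *kernel, int k)
{
    int r = k / 2;
    double sum = 0.0;
    for (int ky = -r; ky <= r; ky++) {
        for (int kx = -r; kx <= r; kx++) {
            int px = x + kx;
            int py = y + ky;
            if (px < 0) px = 0;
            if (px >= width) px = width - 1;
            if (py < 0) py = 0;
            if (py >= height) py = height - 1;
            sum += img[py * width + px] * kernel[(ky + r) * k + (kx + r)];
        }
    }
    return sum;
}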

Simple Gradient

The main energy function used by Avidan and Shamir was a simple gradient magnitude function. They calculate the absolute value of the change in the x direction and the y direction. Those two values are summed for the final energy value (Equation 2.2). This is a very simple approximation of the first-order derivative of the pixel in two dimensions.

e_1(I) = \left| \frac{\partial}{\partial x} I \right| + \left| \frac{\partial}{\partial y} I \right|

Equation 2.2: Simple Gradient Formula (Avidan and Shamir)
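A sketch of Equation 2.2 using neighbor differences, clamped at the image border (illustrative names; the project’s energy functions live in LIBENERGIES.C, Appendix G):

#include <stdint.h>
#include <stdlib.h>

/* e1 energy: |dI/dx| + |dI/dy| approximated with neighboring pixels. */
static int simpleGradient(const uint8_t *img, int width, int height,
                          int x, int y)
{
    int xr = (x + 1 < width) ? x + 1 : x;    /* clamp at right edge */
    int yd = (y + 1 < height) ? y + 1 : y;   /* clamp at bottom edge */
    int dx = abs((int)img[y * width + xr] - (int)img[y * width + x]);
    int dy = abs((int)img[yd * width + x] - (int)img[y * width + x]);
    return dx + dy;
}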

Sobel Edge Detection

The Sobel operator [20] provides another way to approximate the two partial derivatives of the image. Defined in 1968 by Irwin Sobel, it provides the norm of the gradient vector for each pixel, and it works well for detecting edges in images which contain text.
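The two 3✕3 Sobel kernels, and the gradient norm computed from them, can be sketched as follows (interior pixels only, for brevity; names are illustrative):

#include <math.h>
#include <stdint.h>

static const int SOBEL_X[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int SOBEL_Y[3][3] = { {-1, -2, -1}, { 0, 0, 0}, { 1, 2, 1} };

/* Norm of the gradient vector at an interior pixel (x, y). */
static double sobelEnergy(const uint8_t *img, int width, int x, int y)
{
    int gx = 0;
    int gy = 0;
    for (int j = -1; j <= 1; j++) {
        for (int i = -1; i <= 1; i++) {
            int v = img[(y + j) * width + (x + i)];
            gx += SOBEL_X[j + 1][i + 1] * v;
            gy += SOBEL_Y[j + 1][i + 1] * v;
        }
    }
    return sqrt((double)gx * gx + (double)gy * gy);
}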

Figure 2.2: Edge Profiles. The intensity profile of one line in an image, the same profile after noise is removed, the first derivative, and the second derivative of the image slice (Concise Computer Vision)

Laplacian of Gaussians

A Laplacian operator (∇·∇, ∇², ∆) approximates the second-order derivative value of a pixel, and it can find areas of rapid change, which signify edges in an image (Figure 2.2). Second derivatives are even more susceptible to noise than first derivatives. To overcome this, a Gaussian filter is first applied to smooth out noise in the image. A Gaussian filter is a local convolution with a filter kernel (Figure 2.3) which is derived from a 2D Gauss function; a 2D Gauss function is the product of two 1D Gauss functions. The Laplacian of Gaussians, sometimes called the Mexican hat function, is the result of applying the Laplacian operator to a Gaussian blurred image.

1   4   7   4   1
4  16  26  16   4
7  26  41  26   7
4  16  26  16   4
1   4   7   4   1

Figure 2.3: Example Gaussian Filter Kernel
This filter kernel, for a Gauss function with a standard deviation of 1, would be applied to each pixel and then divided by 273 (Klette).
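A sketch of building such a kernel directly from the 2D Gauss function (normalizing by the summed weights plays the role of the divide-by-273 above; names are illustrative):

#include <math.h>

/* Fill a (2r+1) x (2r+1) Gaussian kernel and normalize its weights. */
static void buildGaussKernel(double *kernel, int r, double sigma)
{
    int size = 2 * r + 1;
    double sum = 0.0;
    for (int y = -r; y <= r; y++) {
        for (int x = -r; x <= r; x++) {
            double g = exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            kernel[(y + r) * size + (x + r)] = g;
            sum += g;
        }
    }
    for (int i = 0; i < size * size; i++) {
        kernel[i] /= sum;    /* weights now sum to one */
    }
}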

Difference of Gaussians

The difference of Gaussians operator finds the difference between two blurred copies of the original image, each blurred by a different amount. More precisely, two Gaussian kernels with differing standard deviations are applied to the source image and the difference of the two results is recorded. Difference of Gaussians is sometimes substituted as an approximation to the Laplacian of Gaussians.
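As a sketch, the operator reduces to two blurs and a subtraction; gaussianBlur() below is an assumed helper standing in for a convolution with kernels like the one in Figure 2.3:

#include <stdint.h>
#include <stdlib.h>

/* Assumed helper: writes a Gaussian-blurred copy of src into dst. */
void gaussianBlur(const uint8_t *src, uint8_t *dst,
                  int width, int height, double sigma);

/* Difference of Gaussians: subtract a widely blurred copy of the image
   from a narrowly blurred one. */
static void differenceOfGaussians(const uint8_t *src, int *dst,
                                  int width, int height,
                                  double sigmaNarrow, double sigmaWide)
{
    uint8_t *a = malloc((size_t)width * height);
    uint8_t *b = malloc((size_t)width * height);
    gaussianBlur(src, a, width, height, sigmaNarrow);
    gaussianBlur(src, b, width, height, sigmaWide);
    for (int i = 0; i < width * height; i++) {
        dst[i] = (int)a[i] - (int)b[i];
    }
    free(a);
    free(b);
}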

Difference of Gaussians with Sobel

Due to the nature of Gaussian kernels, the difference of Gaussians method suppresses high-frequency information and thus operates like a band-pass filter. This means that random noise and finer image patterns are less likely to be reported as edges. When information obtained from a difference of Gaussians operator is combined with information obtained from a Sobel filter, high-frequency areas can be suppressed while strong edges are preserved. A new filter kernel could be mathematically derived to perform this combined approach in one step, but that was beyond the scope of this project.

Figure 2.4: Energy Function Examples (a) Difference of Gaussians (b) Sobel edge detection (c) Simple Gradient (d) Difference of Gaussians and Sobel (e) Laplacian of Gaussians (f) original

Seam Traversal

Given the results of an energy function, seams must next be calculated for the image. For image retargeting the image can be treated as a graph and seams can be removed iteratively, but Avidan and Shamir chose to take a dynamic programming approach.

For vertical seams the dynamic programming matrix is filled in a top-to-bottom, left-to-right manner; a seam value is calculated for each pixel by adding the minimum of the three pixels above it in its 8-connected neighborhood (above left, directly above, above right) to the current pixel’s energy value.

M(i, j) = e(i, j) + min( M(i−1, j−1), M(i−1, j), M(i−1, j+1) )

Equation 2.3: Formal Seam Pixel Value Definition (Avidan and Shamir)
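A sketch of that fill for vertical seams over a row-major energy map (illustrative names; the project’s implementation is in LIBSEAMCARVE.C, Appendix C):

/* Fill the seam matrix M per Equation 2.3, row by row from the top. */
static void fillSeamMatrix(const int *e, int *M, int width, int height)
{
    for (int x = 0; x < width; x++) {
        M[x] = e[x];                           /* first row: energy only */
    }
    for (int y = 1; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int best = M[(y - 1) * width + x];          /* directly above */
            if (x > 0 && M[(y - 1) * width + x - 1] < best) {
                best = M[(y - 1) * width + x - 1];      /* above left */
            }
            if (x + 1 < width && M[(y - 1) * width + x + 1] < best) {
                best = M[(y - 1) * width + x + 1];      /* above right */
            }
            M[y * width + x] = e[y * width + x] + best;
        }
    }
}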

For image retargeting, when analyzing vertical seams, the final row of pixels can be checked for minimums once the dynamic programming matrix is filled. The minimum values represent the starting points for seams which can be removed. From these points backtracking is used to remove pixels.

Figure 2.5: Seam “Shadow” Example. Original (left), seam values (center), seam values with decrement (right)

To apply seam carving towards text extraction the seam traversal procedure is modified. Firstly, with text images, seam values create “shadows” which can obscure information following large features (Figure 2.5). This effect could be due to the dynamic programming approach’s tendency to propagate errors; it may be possible to replace dynamic programming with a greedy approach, but that has been left as a topic for future research. For now a decrement will be added to reduce the effect of shadows.

The dynamic programming matrix is filled as before, except that each pixel’s resulting seam value is reduced by some value k if it is greater than zero (Equation 2.4). This is a key modification; without this decrement seams would never travel between text lines, due to how seam values are carried across the image. A k value of 1 was experimentally determined to yield satisfactory results.

M(i, j) = e(i, j) + min( M(i−1, j−1), M(i−1, j), M(i−1, j+1) ) − k

Equation 2.4: Formal Seam Pixel Value Definition New (Stoll)
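Relative to the plain fill, the change is a single step applied to each freshly computed seam value; a minimal helper (with k = 1, per the experiments above) might read:

/* Equation 2.4: reduce a seam value by k when it is above zero, so that
   "shadows" cast by large features decay across blank document space. */
static int decayedSeamValue(int energy, int minNeighbor, int k)
{
    int value = energy + minNeighbor;
    if (value > 0) {
        value -= k;
    }
    return value;
}

In the fill loop sketched earlier, this helper would replace the plain sum of energy and minimum neighbor.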

Text Extraction Steps

Additional, unique steps are required to use seam carving for text line extraction. Rather than examining the last column or row for minimum values, each pixel in the terminal row or column is examined. In fact, finding the seams of minimum value does not generally work with text images. For most images containing text the last row or column will contain a large number of minimum values since the background of most documents is some uniform color. So, backtracking must be performed for each of the pixels in the terminal row or column.

The process of backtracking must also be modified to support the new problem domain. For image retargeting the actual seam path matters little, whereas for text line extraction it is expected to proceed in a straight line since text is normally written in lines. So, in addition to considering the values of the next step when backtracking, the overall deviation from straight is also considered.

Figure 2.6: Seam Path Example. Possible (orange) and probable (red) seam paths are shown for a standard seam carving approach (left); top (red) and bottom (blue) seam paths are shown for backtracking with added deviation constraints (right).

If a seam has deviated from vertical or horizontal then the algorithm will force the seam back in that direction. When there is more than one possible backtracking move with the same value, the algorithm has discretion to choose which path to take. For the purposes of this text extraction approach, the algorithm must take the center path whenever possible, unless there is a net deviation. When there is a deviation, the implementation must take the choice that minimizes the deviation. This results in seams which minimize both the deviation and the energy of the path. From here there are two distinct approaches to identifying the distortions in the image seams.

Let I be an n ✕ m image and s be a seam:

t^x = \sum_{i=1}^{n} ( s(i) - s(1) ) \quad (2.5a)

t^y = \sum_{j=1}^{m} ( s(j) - s(1) ) \quad (2.5b)

Equation 2.5: Formal Definition of Net Deviation
Net deviation (t) for a vertical seam (2.5a) and for a horizontal seam (2.5b)
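Sketches of the two mechanisms just described, for horizontal seams (illustrative names; the project’s traversal code is in LIBSEAMCARVE.C, Appendix C):

/* One backtracking move: up, straight, and down are the three candidate
   seam values; drift is the current row minus the seam's starting row.
   Returns -1 (up), 0 (straight), or +1 (down). */
static int chooseStep(int up, int straight, int down, int drift)
{
    int min = straight;
    if (up < min) min = up;
    if (down < min) min = down;
    if (straight == min) {
        return 0;                     /* take the center path if possible */
    }
    if (up == min && down == min) {
        return (drift > 0) ? -1 : 1;  /* tie: move back toward straight */
    }
    return (up == min) ? -1 : 1;
}

/* Equation 2.5: net deviation t of a traced horizontal seam, where
   path[j] is the row the seam occupies at column j and length is the
   number of columns the seam crosses. */
static long netDeviation(const int *path, int length)
{
    long t = 0;
    for (int j = 0; j < length; j++) {
        t += path[j] - path[0];
    }
    return t;
}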

Direct Area Detection

The first way to identify distortions in the seams is to check for seams which have the most net deviation from straight. As a horizontal seam traverses across the image, each column’s deviation from the starting row is recorded (Equation 2.5). The sum of these deviations is the net deviation. When a seam’s net deviation changes sign it represents the bottom of an area, and the seam prior to it represents the top of an area. Long straight runs at the beginning and end are trimmed off.
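A sketch of that sign-change scan over consecutive seams (netDev[i] is the net deviation, per Equation 2.5, of the seam starting at row i; names are illustrative):

#include <stdio.h>

/* A sign change between consecutive seams marks an area: the prior seam
   is its top boundary and the current seam is its bottom boundary. */
static void findAreas(const long *netDev, int seams)
{
    for (int i = 1; i < seams; i++) {
        if ((netDev[i - 1] < 0 && netDev[i] > 0) ||
            (netDev[i - 1] > 0 && netDev[i] < 0)) {
            printf("area found: top seam %d, bottom seam %d\n", i - 1, i);
        }
    }
}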

Figure 2.7: Direct Area Detection. The tops of found areas are marked with red lines and the bottoms of found areas are marked with blue lines.

Information Masking

Another approach to identifying distortions in the seams is to look for pixels where seams overlap. As seams deviate around information they will tend to go through the pixels just outside the information’s upper and lower boundaries. If a matrix is created to keep track of how many seams pass through each pixel, then the matrix cells nearest to information will have higher values and the matrix cells where information resides will have zero values.
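A sketch of that counting matrix for horizontal seams (path[j] again gives the row a traced seam occupies at column j; names are illustrative):

/* Record one traced seam in the mask: every pixel the seam passes
   through has its counter incremented. Cells still at zero after all
   seams are traced are where the information resides. */
static void accumulateSeam(unsigned *mask, int width, const int *path)
{
    for (int j = 0; j < width; j++) {
        mask[path[j] * width + j]++;
    }
}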

Figure 2.8: Information Masking. Information is black since no seams went through those areas; the area outside of information is lightest since many seams deviated through those areas.

Method Comparison

Direct area detection and information masking both work reasonably well when the document is simple, but information masking works much better on more complicated document layouts. In the example (Figure 2.9) there are three columns of irregularly laid out words; the difference in character size between the columns prevents lines from being properly identified. In the cases where lines are identified correctly (such as “Charles Dewitt - H. E. McClure”), they are actually in different columns and should not be treated as single lines. Another issue with direct area detection is that words near the edge are sometimes not properly detected. The problem appears to be that the leading and trailing lines are trimmed prematurely due to the lack of seam run length. Finally, direct area detection, without further enhancements, begins to fail faster on documents which are skewed; this is especially problematic since handling skew is a strength of the seam carving approach.

Figure 2.9: Comparing Direct Area Detection and Information Masking. Direct area detection (above) and information masking (below) on multiple columns

Information masking is superior to direct area detection when the target text is skewed. However, when text skew reaches a certain point, even information masking, under the current implementation, begins to degrade in performance. As can be seen from the example (Figure 2.10), direct area detection clearly begins to break down at two degrees of skew, and it stops working completely by the time the text is skewed by eight degrees. Information masking, since it does not rely upon heuristics, continues to work at eight degrees of skew (Figure 2.10 d), but the seam markers at the beginning and ending of lines become stretched.

Figure 2.10: Skewed Text Comparison. Two degree skew processed by direct area detection (a) and by information masking (b); eight degree skew by direct area detection (c) and by information masking (d)

Overcoming Limitations

In order to be a viable solution the shortcomings mentioned in the above sections should be overcome; otherwise there is little reason to use this technique over the established techniques. The problem of skew is minor; it has been shown, through the information masking output, that the raw seam carving technique can identify text that is skewed. It is only the heuristics for direct area detection that need to be improved. And, if all else fails, a dedicated deskewing algorithm can be implemented as a preprocessing step.

The problem of handling more complicated text layouts, however, is more significant. Again, the raw seam carving technique, as shown through the information masking approach, can identify text regardless of its position, but keeping the text logically grouped is, as it always has been, more problematic.

Improved Skew Handling

In order to improve the handling of skewed text the approach could be modified so that it does not attempt to force seams to travel at precisely 90 or 180 degrees. Currently the algorithm encourages seams to end in the same row or column in which they started. A better method would be to identify deviation trends; if the seam is steadily deviating more, then it could be assumed that the information it is tracing is skewed. Rather than attempting to force seams to travel at exactly 90 or 180 degrees, the algorithm could attempt to force the seam to travel along the trend angle.

Since direct area detection performs so poorly at detecting starting and ending seams when the text is skewed, that portion should also be improved. Currently it simply evaluates the magnitude of the net deviation; once it changes sign, the top and bottom seams are presumed to have been found, but this is not necessarily true with skewed text. Checking for a sign change in magnitude should account for the overall angle of seams within a certain range. This was not implemented as a part of this project, but it is a goal of future research.

In all cases the heuristics used should be improved through more thorough data analysis, which in turn requires improved parameterization of the algorithm. The values used for the heuristics are the result of manually examining the algorithm’s performance over a very small data set. The use of a larger and more varied set of example images should yield better heuristic values.

Complicated Layouts

There are various possible techniques to overcome the problems of locating text within columns and other more complicated layouts. Most of these possible techniques are based upon methods described in other text extraction approaches.

The first option is to run the seam carving algorithm both vertically and horizontally, then use a heuristic to determine which direction is more appropriate to split in first. One possible heuristic is total resistance. The idea is that, as each of the seams is explored, when they are forced to deviate from a straight path they are experiencing resistance. For direct area detection the net deviation for each seam is calculated, so it would be trivial to sum these quantities. The summed net deviations for horizontal and vertical seam traversal could then be compared. The direction with less resistance, or a lower sum of net deviations, is the direction which should be explored first. This technique helps when identifying an information mask, but it is still very crude. It tends to favor moving horizontally (for left-to-right or right-to-left texts) and does not help improve direct area detection.

Another approach to higher level segmentation would be to take a kD-tree or X-Y cut type of approach. Instead of initially tracing all the seams, the minimum seams would be found, as is done with typical seam carving. The difference comes in how “minimum seam” is defined. Rather than checking the terminal value of the seam, this approach would need to find the seam with the minimum net deviation. Since there are likely to be multiple seams with the same minimum net deviation, this approach would need to find the center seam of the largest grouping of minimum net deviation seams. The identified seam would be used to split the image and the process would be run again on each of the halves. This was not implemented as a part of this project, but it is a goal of future research.

Extracting Finer Details

Setting aside this approach’s current limitations, more of its benefits will be described. Given a non-skewed document which contains simple content with a Manhattan layout, it is possible to iteratively run the same seam carving process on identified sub-areas. If the first pass of the algorithm identified sentences, then the next pass of the algorithm will identify phrases, words, or letters. The technique is simply run against the areas identified in the first step.

Figure 2.11: Seams Identified in Previously Identified Areas. The first line (left) and second line (right) identified in the document from Figure 2.7

Asymptotic Analysis

The first step in the seam carving process is to read in an image and create an appropriate data structure. The example program performs brightness, contrast, and energy calculations as distinct steps, but this is done to allow for flexibility during testing; a production program would combine these into a single step. The Difference of Gaussians could be performed in one pass given the appropriate data structure. So, assuming that the image has a width of n and a height of m, this process will take O(nm).

The next step is to fill the seam matrix, which requires iterating over the image again, taking O(nm). Finally, all the seams must be traversed. Considering horizontal seam evaluation, each of the m starting pixels is considered as a starting point, and the seam must cross the entire width n of the image, so this process takes another O(nm). The runtime performance for a single pass is 3 ✕ O(nm), and 5 ✕ O(nm) when both horizontal and vertical seams are evaluated. So, this approach performs in O(nm) time, where n is image width and m is image height. This conclusion is supported by empirical analysis (Figure 2.12).

Figure 2.12: Asymptotic Analysis. Actual run-time performance data; both directions, average of 20 runs

It should be noted that although vertical and horizontal seam analysis both perform in essentially O(n²) time, real world performance can be drastically different due to algorithm implementation and hardware limitations. The image data structure uses row-major order, and due to how memory is cached and moved to the processor, cutting horizontal seams can take twice as long to accomplish on some hardware.

Related Work

Seam carving has previously been used in the area of handwriting recognition; however, this approach takes a slightly different perspective on the problem. Saabni and El-Sana describe a method for applying seam carving to handwritten text line extraction [14]. When they surveyed the research on text line extraction for handwriting recognition they also found that most of the techniques relied upon some sort of connected component analysis, so they devised a novel technique based upon seam carving.

The algorithm described by Saabni and El-Sana calculates the image’s or document’s energy in such a way (using a signed distance transformation) that the selected seam paths go through the lines of handwritten text rather than attempting to identify the whitespace. Once a text line’s center line is found, it is expanded vertically to identify the full height of the line.

One major advantage of their approach is that small components, such as the dot above an “i,” can be included in the row based upon what the algorithm learns about row heights. The new technique described in this paper can be susceptible to leaving out such small components, especially when processing images which have not first been subdivided. In practice, for single lines of text, the technique described here automatically includes small pieces of information due to how the heuristics are set to find maximum deviation.

Another benefit of Saabni and El-Sana’s technique is that, since it is tracking lines of text, it can handle multi-skew and lines that actually touch each other. For the technique described in this paper, even if a moving average were used to determine line skew, it would still have difficulty handling multi-skew. The technique described here can, however, cope with touching lines at least as well as the method described by Saabni and El-Sana.

Asi, Saabni, and El-Sana subsequently demonstrated that an enhancement to Saabni and El-Sana’s approach could identify the precise boundaries between text lines [25]. Their method’s ability to do this is entirely dependent upon its ability to find the text lines’ medial paths, which in turn relies upon single columns of handwritten information. Whereas their work is likely to perform better at identifying lines of free-form handwritten text, the method described in this paper is likely to excel on typed or printed documents. The method described here is intended to run recursively, so there is some expectation that the information it finds will be subsequently subdivided, whereas their approach is intended to identify information using a single pass.

Future Work

The existing implementation of seam carving for text line extraction performs as well as other notable text extraction methods. Like them, it has trouble with skewed text and more complicated layouts, traditional problems in this area of research. Fortunately, there are some techniques to overcome these shortcomings. The first improvement should be in the area of handling skewed text. The technique should consider some sort of moving average of the seam angles, but more research will need to be done in this area.

Another area of future work is the implementation of a kD-tree or X-Y cut approach to higher level document segmentation. This will help define the structure of the document and improve actual text extraction. Since a dynamic programming approach has been taken, and due to the nature of seams within text documents, it should be possible to reuse seam data as the document is subdivided, thus reducing computation time. A check may be required to ensure that the border pixels of the sub-area all contain zero weights. Also, a more precise base condition would need to be defined to limit the number of subdivision attempts.

An alternative approach to higher level document segmentation using seam carving would be to use the idea of “zooming,” or otherwise considering the level of detail (LOD). This idea is based upon how people process information within their view; people can get an overview of what they see or they can understand fine details within it, but those are generally two distinct steps. To simulate this process the candidate image could have its size reduced and its contents blurred to decrease the definition of finer details. Then, when seam carving is run, it will not find sentences, for example, but rather it will find paragraphs or columns.

Figure 2.13: Image Overview Extraction. Experimental results of running the seam carving approach on a reduced size image; the lines found (grey/black bands) are higher level structures such as paragraphs

The approach of using LOD could bring multiple benefits. First, since the seam carving approach runs in polynomial time, a reduction in image size would greatly increase run-time performance. However, the dynamic programming matrix would have to be recreated at each zoom level, which could negate any realized gains. If this modification were combined with the kD-tree modification, it would make identification of higher level divisions easier. The idea of “zooming out” also presents opportunities for early recognition, or at least for setting a context for subsequent recognition steps. It may be possible to run the low resolution areas of interest through an algorithm which quickly detects basic shapes, or one that uses textons [15] to detect textures. If this approach were viable it would have implications outside of the realm of OCR; it could be used in the field of bioinformatics or for general computer vision.

Existing implementations of this approach have been experimentally applied to bioinformatics. There are techniques being developed to identify carcinomas based upon examining differences in protein profiles; presently the differentiation of the protein profiles is done by manual inspection [21]. Enhanced versions of the seam carving approach may allow for algorithmic differentiation of protein profiles.

Figure 2.14: Experimental Application Toward Protein Differentiation. Identification of interesting areas within two protein samples; storing the identified regions in a kD-tree structure may allow for automated comparison

Successful application to bioinformatics may further lead to applications toward general computer vision. To see how this approach could be used for general computer vision, consider an image which contains a stop sign. If the algorithm identifies the sign as being interesting in one of the initial iterations, and if the octagonal shape could be detected, then it may not even be necessary to continue processing the area — if a person sees a stop sign out of the corner of their eye they do not need to read the word “STOP” to understand the meaning of it. For most adults, seeing a red octagon is enough to understand that the symbol means stop. Also, if “STOP” is being recognized but some of the letters are illegible, the fact that the word being recognized is within an octagon would induce the algorithm to respond that the word is “STOP.”

CHAPTER IV

CONCLUSIONS

A promising new approach to text extraction, based upon seam carving, has been described. The seam carving approach brings many of the same benefits as whitespace analysis techniques: it operates largely parameter free, it does not make assumptions about document layout, and it works regardless of text direction or page orientation. And though it also has many of the same limitations as whitespace analysis, there are some promising methods unique to this approach which can help overcome those shortcomings and allow it to be used outside of the original problem domain.

Applying the seam carving approach to text line extraction reinforced that the original technique was designed for images and not text documents. When seam values are calculated, the original technique causes “shadows” to be cast which can block out smaller pieces of information. The shadows represent errors which, unlike in images, easily get propagated through the mostly blank space of a text document. The shadows were reduced by adding decay to the seam value function. Their presence also suggests that it might be wise to switch from dynamic programming to a greedy approach when applying seam carving towards text line extraction.

The approach described by Avidan and Shamir also relies upon the ability to find minimal terminal seam values from which to backtrack. For text documents the terminal value of all seams is likely to be zero. This is due to the amount of whitespace, space without information, in text documents. Images have a higher information density and are not normally padded with headspace. Modifying the technique so that every seam terminus is examined is necessary to apply seam carving to text line extraction.

The original seam carving technique further allowed for indeterminate seam paths (determined by the implementation). For resizing images it is probably beneficial to have some randomness in the seam path, but that works against text line extraction. Text normally flows in straight lines, and it makes sense to consider that when looking to extract text lines. To apply seam carving to text extraction the concept of net deviation was introduced; in addition to minimizing the seam path cost the net deviation of the seam is also minimized.

These minor innovations have enabled the seam carving algorithm to be applied to the task of text line extraction; the result builds upon existing techniques, yet represents a truly novel approach. The potential benefits of perfecting this approach could reach beyond optical character recognition, and it is thus worthy of further research.


REFERENCES

1. S. Prince, “Computer Vision: Models, Learning, and Inference,” 2014.

2. R. Klette, “Concise Computer Vision: An Introduction into Theory and Algorithms,” 2014.

3. S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach,” 2010.

4. L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human faces,” Journal of the Optical Society of America A, Volume 4, Issue 3, pp. 519-524, 1987.

5. M. Turk and A. Pentland, “Face recognition using eigenfaces,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591, 1991.

6. S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” ACM Transactions on Graphics (TOG), Volume 26, Number 3, 2007.

7. K. Y. Wong, R. G. Casey, and F. M. Wahl, “Document Analysis System,” IBM Journal of Research and Development, Volume 26, Issue 6, pp. 647-656, 1982.

8. G. Nagy and S. Seth, “Hierarchical representation of optically scanned documents,” International Conference on Pattern Recognition - ICPR, 1984.

9. L. O’Gorman, “The Document Spectrum for Page Layout Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 15, Issue 11, pp. 1162-1173, 1993.

10. H. S. Baird, “Background Structure In Document Images,” In Advances in Structural and Syntactic Pattern Recognition, pp. 17-34, 1992.

11. T. M. Breuel, “Two Geometric Algorithms for Layout Analysis,” Document Analysis Systems, 2002.

12. T. M. Breuel, “Robust least square baseline finding using a branch and bound algorithm,” Document Recognition and Retrieval, SPIE, pp. 20-27, 2002.

13. K. Kise, A. Sato, and M. Iwata, “Segmentation of Page Images Using the Area Voronoi Diagram,” Computer Vision and Image Understanding, Volume 70, Number 3, pp. 370-382, 1998.

14. R. Saabni and J. El-Sana, “Language-Independent Text Lines Extraction Using Seam Carving,” 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 563-568, 2011.

15. B. Julesz, “Textons, the Elements of Texture Perception, and their Interactions,” Nature, Volume 290, pp. 91–97, 1981.

16. J. Ha, R. M. Haralick, and I. T. Phillips, “Recursive X-Y Cut using Bounding Boxes of Connected Components,” Third International Conference on Document Analysis and Recognition, pp. 952-955, 1995.

17. L. O’Gorman, “Image and document processing techniques for the RightPages electronic library system,” International Conference on Pattern Recognition, pp. 260-263, 1992.

18. G. Nagy, J. Kanai, M. Krishnamoorthy, M. Thomas, and M. Viswanathan, “Two Complementary Techniques for Digitized Document Analysis,” ACM Conference on Document Processing Systems, 1988.

19. N. Otsu, “A threshold selection method from gray-level histograms.” IEEE Transactions on Systems, Man, and Cybernetics, Volume 9, Issue 1, pp. 62–66, 1979.

20. I. Sobel, “An Isotropic 3×3 Image Gradient Operator,” 1969.

21. T. Shi, F. Dong, L. S. Liou, Z. H. Duan, A. C. Novick, and J. A. DiDonato, “Differential Protein Profiling in Renal-Cell Carcinoma,” Molecular Carcinogenesis, Volume 40, pp. 47-61, 2004.

22. B. Kernighan and D. Ritchie, “The C Programming Language,” 2nd Edition, 1988.

23. Medical Article Records Groundtruth (MARG), National Library of Medicine, http://marg.nlm.nih.gov, 2005.

24. S. Ferilli, F. Leuzzi, F. Rotella, and F. Esposito, “A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation,” Italy, 2012.

25. A. Asi, R. Saabni and J. El-Sana, “Text Line Segmentation for Gray Scale Historical Document Images,” International Workshop on Historical Document Imaging and Processing (HIP), 2011.

APPENDICES

APPENDIX A

PROGRAM MAKEFILE

# note: recipe lines are tab-indented
CC = cc
CFLAGS = -I/usr/local/include/libpng16 -L/usr/local/lib -lpng16
CFLAGS_FULL = -Weverything -I/usr/local/include/libpng16 -L/usr/local/lib -lpng16

default: all

.PHONY: all sc clean

all: sc test

sc:
	mkdir -p ./bin/
	${CC} ${CFLAGS} -o ./bin/sc ./src/sc.c

test:
	./bin/sc -b 0 -c 0 -d 6 -e 0 ./tst/RightsOfManB-001degree.png ./tst/out_ROMB-001.png
	./bin/sc -b 0 -c 0 -d 6 -e 0 ./tst/RightsOfManB-002degree.png ./tst/out_ROMB-002.png

clean:
	-rm ./bin/*

APPENDIX B

PROGRAM MAIN — SC.C

/**
 * sc.c
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */

/* The angle-bracket header names were lost from the original listing; the
   headers below are assumed, since the code uses printf, exit, and getopt. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "libpngHelper.c"
#include "libSeamCarve.c"
#include "libResize.c"

#define PROGRAM_NAME "Experiments with Seam Carving"
#define PROGRAM_VERS "0.4"
#define PROGRAM_COPY "Copyright 2014-2015, Christopher Stoll"

static void carve(char *sourceFile, char *resultFile, int forceBrt, int forceClr,
        int forceDir, int forceEdge, int forceGauss, int verbose)
{
    int *imageVector;
    int imageWidth = 0;
    int imageHeight = 0;
    int imageDepth = 0;

    imageVector = readPNGFile(sourceFile, &imageWidth, &imageHeight, &imageDepth, verbose);
    if (!imageVector || !imageWidth || !imageHeight) {
        fprintf(stderr, "Error loading PNG image.\n");
        exit(1);
    }

    int *newImageVector;
    newImageVector = seamCarve(imageVector, imageWidth, imageHeight, imageDepth,
            forceBrt, forceClr, forceDir, forceEdge, forceGauss);
    write_png_file(newImageVector, imageWidth, imageHeight, resultFile);
}

int main(int argc, char const *argv[])
{
    char **argumentVector = (char**)argv;

    char *bvalue = 0;
    char *cvalue = 0;
    char *dvalue = 0;
    char *evalue = 0;
    char *gvalue = 0;
    int forceBrt = 0;
    int forceClr = 0;
    int forceDir = 0;
    int forceEdge = 0;
    int forceGauss = 0;
    int verboseFlag = 0;
    char *sourceFile = 0;
    char *resultFile = 0;

    int c;
    opterr = 0;
    while ((c = getopt(argc, argumentVector, "b:c:d:e:g:v")) != -1) {
        switch (c) {
        case 'b':
            bvalue = optarg;
            forceBrt = (int)bvalue[0] - 48;
            break;
        case 'c':
            cvalue = optarg;
            forceClr = (int)cvalue[0] - 48;
            break;
        case 'd':
            dvalue = optarg;
            forceDir = (int)dvalue[0] - 48;
            break;
        case 'e':
            evalue = optarg;
            forceEdge = (int)evalue[0] - 48;
            break;
        case 'g':
            gvalue = optarg;
            forceGauss = (int)gvalue[0] - 48;
            break;
        case 'v':
            verboseFlag = 1;
            break;
        case '?':
            printf(PROGRAM_NAME " v" PROGRAM_VERS "\n");
            printf(PROGRAM_COPY "\n\n");
            printf("usage: sc [-b 0-8] [-c 0-5] [-d 0-9] [-e 0-8] [-g 0-3] [-v] source_PNG_file result_PNG_file \n");
            printf(" \n");
            printf(" Brightness Calculation Method \n");
            printf(" '-b 0' Average Intensity / Brightness (default) \n");
            printf(" '-b 1' HSV hexcone (Max Channel) \n");
            printf(" '-b 2' Luma luminance - sRGB / BT.709 \n");
            printf(" '-b 3' Luma luminance - NTSC / BT.601 \n");
            printf(" '-b 4' Relative luminance \n");
            printf(" '-b 5' HSP? \n");
            printf(" '-b 6' Euclidian distance (generally poor results) \n");
            printf(" '-b 7' Estimated relative luminance \n");
            printf(" '-b 8' Estimated luma luminance - NTSC / BT.601 \n");
            printf(" \n");
            printf(" Contrast Adjustments \n");
            printf(" '-c 0' none (default) \n");
            printf(" '-c 1' use cosine adjusted brightness \n");
            printf(" '-c 2' use double-pass cosine adjusted brightness \n");
            printf(" '-c 3' use triple-pass cosine adjusted brightness \n");
            printf(" '-c 4' use quadruple-pass cosine adjusted brightness \n");
            printf(" '-c 5' use Otsu binarization (before Gaussian blurring) \n");
            printf(" \n");
            printf(" Force Seam Direction (or other output) \n");
            printf(" '-d 0' automatically selected (default) \n");
            printf(" '-d 1' force horizontal direction seams \n");
            printf(" '-d 2' force vertical direction seams \n");
            printf(" '-d 3' force both direction seams \n");
            printf(" '-d 4' output brightness values \n");
            printf(" '-d 5' output energy values \n");
            printf(" '-d 6' output seam values (horizontal) \n");
            printf(" '-d 7' output seam values (vertical) \n");
            printf(" '-d 8' output seams (horizontal) \n");
            printf(" '-d 9' output seams (vertical) \n");
            printf(" '-d a' output areas (horizontal) \n");
            printf(" '-d b' output areas (vertical) \n");
            printf(" \n");
            printf(" Energy Calculation Method \n");
            printf(" '-e 0' use Difference of Gaussian (default) \n");
            printf(" '-e 1' use Laplacian of Gaussian (sigma=8) \n");
            printf(" '-e 2' use Laplacian of Gaussian (sigma=4) \n");
            printf(" '-e 3' use Laplacian of Gaussian (sigma=2) \n");
            printf(" '-e 4' use Sobel \n");
            printf(" '-e 5' use LoG Simple \n");
            printf(" '-e 6' use Simple Gradient \n");
            printf(" '-e 7' use DoG + Sobel \n");
            printf(" '-e 8' use LoG (sigma=8) AND Sobel \n");
            printf(" \n");
            printf(" Pre-processing Options \n");
            printf(" '-g 0' none (default) \n");
            printf(" '-g 1' pre-Gaussian blur (sigma=2) \n");
            printf(" '-g 2' pre-Gaussian blur (sigma=4) \n");
            printf(" '-g 3' pre-Gaussian blur (sigma=8) \n");
            return 1;
        default:
            fprintf(stderr, "Unexpected argument character code: %c (0x%04x)\n", (char)c, c);
        }
    }

    int index;
    // Look at unnamed arguments to get source and result file names
    for (index = optind; index < argc; index++) {
        if (!sourceFile) {
            sourceFile = (char*)argv[index];
        } else if (!resultFile) {
            resultFile = (char*)argv[index];
        } else {
            fprintf(stderr, "Argument ignored: %s\n", argv[index]);
        }
    }

    // Make sure we have source and result files
    if (!sourceFile) {
        fprintf(stderr, "Required argument missing: source_file\n");
        return 1;
    } else if (!resultFile) {
        fprintf(stderr, "Required argument missing: result_file\n");
        return 1;
    }

    // Go ahead if the source file exists
    if (access(sourceFile, R_OK) != -1) {
        carve(sourceFile, resultFile, forceBrt, forceClr, forceDir, forceEdge,
                forceGauss, verboseFlag);
    } else {
        fprintf(stderr, "Error reading file %s\n", sourceFile);
        return 1;
    }
    return 0;
}

APPENDIX C

SEAM CARVING FUNCTIONS — LIBSEAMCARVE.C

/**
 * libSeamCarve.c
 * Masters Thesis Work
 * Christopher Stoll, 2015 -- v3, major refactoring
 * FD0D 1A69 8AD9 F05A 3052 707A 0D01 AA8F 51B2 E3EA
 */

#ifndef LIBSEAMCARVE_C
#define LIBSEAMCARVE_C

/* The angle-bracket header names were lost from the original listing; the
   headers below are assumed, since the code uses INT_MAX, sqrt, and printf. */
#include <limits.h>
#include <math.h>
#include <stdio.h>

#include "pixel.h"
#include "window.h"
#include "libWrappers.c"
#include "libBinarization.c"
#include "libEnergies.c"
#include "libMinMax.c"

#define SEAM_TRACE_INCREMENT 16
#define THRESHHOLD_SOBEL 96
#define THRESHHOLD_USECOUNT 64
#define PI 3.14159265359
#define DEFAULT_CLIP_AREA_BOUND 1

/*
 * Trace all the seams
 * The least significant pixels will be traced multiple times and have a higher value (whiter)
 * The most significant pixels will not be traced at all and have a value of zero (black)
 */
static void findSeams(struct pixel *imageVector, struct window *imageWindow,
        int direction, int findAreas)
{
    int directionVertical = 0;
    int directionHorizontal = 1;
    if ((direction != directionVertical) && (direction != directionHorizontal)) {
        return;
    }

    int loopBeg = 0; // where the outer loop begins
    int loopEnd = 0; // where the outer loop ends
    int loopInc = 0; // the increment of the outer loop
    int loopInBeg = 0;
    int loopInEnd = 0;
    int loopInInc = 0;
    int seamLength = 0;
    int nextPixelR = 0; // next pixel to the right
    int nextPixelC = 0; // next pixel to the center
    int nextPixelL = 0; // next pixel to the left
    int currentMin = 0; // the minimum of nextPixelR, nextPixelC, and nextPixelL
    int countGoR = 0; // how many times the seam diverged upward
    int countGoL = 0; // how many times the seam diverged downward
    int nextPixelDistR = 0; // memory distance to the next pixel to the right
    int nextPixelDistC = 0; // memory distance to the next pixel to the center
    int nextPixelDistL = 0; // memory distance to the next pixel to the left

    // loop conditions depend upon the direction
    if (direction == directionVertical) {
        loopBeg = imageWindow->lastPixel - 1;
        loopEnd = imageWindow->lastPixel - 1 - imageWindow->xLength;
        loopInc = imageWindow->xStep * -1;
        // also set the next pixel distances
        nextPixelDistC = imageWindow->fullWidth;
        nextPixelDistR = nextPixelDistC - 1;
        nextPixelDistL = nextPixelDistC + 1;
        loopInBeg = imageWindow->yTerminus - 1;
        loopInEnd = imageWindow->yOrigin;
        loopInInc = imageWindow->xStep;
        seamLength = imageWindow->yLength;
    } else {
        loopBeg = imageWindow->firstPixel + imageWindow->xLength - 1;
        loopEnd = imageWindow->lastPixel;
        loopInc = imageWindow->yStep;
        // also set the next pixel distances
        nextPixelDistC = imageWindow->xStep;
        nextPixelDistR = imageWindow->fullWidth + nextPixelDistC;
        nextPixelDistL = (imageWindow->fullWidth - nextPixelDistC) * -1;
        loopInBeg = imageWindow->xTerminus;
        loopInEnd = imageWindow->xOrigin;
        loopInInc = imageWindow->xStep;
        seamLength = imageWindow->xLength;
    }

    // v5 experiments (based upon v2)
    int totalDeviation = 0;
    int totalDeviationL = 0;
    int totalDeviationR = 0;
    int lastTotalDeviationL = 0;
    int lastTotalDeviationR = 0;
    int seamPointer = 0;
    int seamBegan = 0;
    int *lastSeam = (int*)xmalloc((unsigned long)seamLength * sizeof(int));
    int *currentSeam = (int*)xmalloc((unsigned long)seamLength * sizeof(int));
    int deviationMin = imageWindow->fullWidth / 25;
    int deviationTol = imageWindow->fullWidth / 200;
    int clipAreaBound = DEFAULT_CLIP_AREA_BOUND;
    int straightDone = 0;
    int straightStart = 0;

    int k = loopBeg;
    int loopFinished = 0;
    int minValueLocation = 0;
    // for every pixel in the right-most or bottom-most column of the image
    while (!loopFinished) {
        // process seams with the lowest weights
        // start from the left-most column
        minValueLocation = k;
        countGoR = 0;
        countGoL = 0;

        // v5 experiments (based upon v2)
        if (findAreas) {
            totalDeviation = 0;
            totalDeviationL = 0;
            totalDeviationR = 0;
            seamPointer = 0;
            currentSeam[seamPointer] = minValueLocation;
        }

        // move right-to-left or bottom-to-top across/up the image
        for (int j = loopInBeg; j > loopInEnd; j -= loopInInc) {
            // THIS IS THE CRUCIAL PART
            if (direction == directionVertical) {
                if (imageVector[minValueLocation].usecountV < (255 - SEAM_TRACE_INCREMENT)) {
                    imageVector[minValueLocation].usecountV += SEAM_TRACE_INCREMENT;
                }
            } else {
                if (imageVector[minValueLocation].usecountH < (255 - SEAM_TRACE_INCREMENT)) {
                    imageVector[minValueLocation].usecountH += SEAM_TRACE_INCREMENT;
                }
            }

            // get the possible next pixels
            if ((minValueLocation - nextPixelDistR) > 0) {
                if (direction == directionVertical) {
                    nextPixelR = imageVector[minValueLocation - nextPixelDistR].seamvalV;
                } else {
                    nextPixelR = imageVector[minValueLocation - nextPixelDistR].seamvalH;
                }
            } else {
                nextPixelR = INT_MAX;
            }

            if (direction == directionVertical) {
                nextPixelC = imageVector[minValueLocation - nextPixelDistC].seamvalV;
            } else {
                nextPixelC = imageVector[minValueLocation - nextPixelDistC].seamvalH;
            }

            if ((minValueLocation - nextPixelDistL) < loopEnd) {
                if (direction == directionVertical) {
                    nextPixelL = imageVector[minValueLocation - nextPixelDistL].seamvalV;
                } else {
                    nextPixelL = imageVector[minValueLocation - nextPixelDistL].seamvalH;
                }
            } else {
                nextPixelL = INT_MAX;
            }

            // use the minimum of the possible pixels
            currentMin = min3(nextPixelR, nextPixelC, nextPixelL);

            // attempt to make the seam go back down if it was forced up and vice versa
            // the goal is to end on the same line which the seam started on, this
            // minimizes crazy diagonal seams which cut out important information
            if (countGoR == countGoL) {
                if (currentMin == nextPixelC) {
                    minValueLocation -= nextPixelDistC;
                } else if (currentMin == nextPixelR) {
                    minValueLocation -= nextPixelDistR;
                    ++countGoR;
                    ++totalDeviation;
                } else if (currentMin == nextPixelL) {
                    minValueLocation -= nextPixelDistL;
                    ++countGoL;
                    --totalDeviation;
                }
            } else if (countGoR > countGoL) {
                if (currentMin == nextPixelL) {
                    minValueLocation -= nextPixelDistL;
                    ++countGoL;
                    --totalDeviation;
                } else if (currentMin == nextPixelC) {
                    minValueLocation -= nextPixelDistC;
                } else if (currentMin == nextPixelR) {
                    minValueLocation -= nextPixelDistR;
                    ++countGoR;
                    ++totalDeviation;
                }
            } else if (countGoR < countGoL) {
                if (currentMin == nextPixelR) {
                    minValueLocation -= nextPixelDistR;
                    ++countGoR;
                    ++totalDeviation;
                } else if (currentMin == nextPixelC) {
                    minValueLocation -= nextPixelDistC;
                } else if (currentMin == nextPixelL) {
                    minValueLocation -= nextPixelDistL;
                    ++countGoL;
                    --totalDeviation;
                }
            }

            // v5 experiments (based upon v2)
            if (findAreas) {
                if (totalDeviation > 0) {
                    ++totalDeviationR;
                } else if (totalDeviation < 0) {
                    ++totalDeviationL;
                }

                ++seamPointer;
                currentSeam[seamPointer] = minValueLocation;
            }
        }

        // v5 experiments (based upon v2)
        if (findAreas) {
            // only consider seams with persistent deviations
            if (totalDeviationL || totalDeviationR) {
                // persistently going left (bottom of an area)
                if (totalDeviationL > totalDeviationR) {
                    // we already have the top of an area
                    if (seamBegan) {
                        // present deviation (plus tolerance) is less than last deviation amount
                        // and the last deviation is greater than the minimum required deviation
                        if (((totalDeviationL + deviationTol) < lastTotalDeviationL) &&
                                (lastTotalDeviationL > deviationMin)) {
                            seamBegan = 0;

                            straightDone = 0;
                            for (int i = 0; i < seamLength; ++i) {
                                if (direction == directionVertical) {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->yStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[currentSeam[i]].areaBoundaryV = 3;
                                    }
                                } else {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->xStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[currentSeam[i]].areaBoundaryH = 3;
                                    }
                                }
                            }

                            // remove final straight edge
                            if (clipAreaBound && straightStart) {
                                for (int i = straightStart; i < seamLength; ++i) {
                                    if (direction == directionVertical) {
                                        imageVector[currentSeam[i]].areaBoundaryV = 0;
                                    } else {
                                        imageVector[currentSeam[i]].areaBoundaryH = 0;
                                    }
                                }
                            }
                        }
                    // we don't have the top of an area yet
                    } else {
                        // present deviation (plus tolerance) is less than last deviation amount
                        // and the last deviation is greater than the minimum required deviation
                        if (((totalDeviationR + deviationTol) < lastTotalDeviationR) &&
                                (lastTotalDeviationR > deviationMin)) {
                            seamBegan = 1;

                            straightDone = 0;
                            for (int i = 0; i < seamLength; ++i) {
                                if (direction == directionVertical) {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->yStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[lastSeam[i]].areaBoundaryV = 1;
                                    }
                                } else {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->xStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[lastSeam[i]].areaBoundaryH = 1;
                                    }
                                }
                            }

                            // remove final straight edge
                            if (clipAreaBound && straightStart) {
                                for (int i = straightStart; i < seamLength; ++i) {
                                    if (direction == directionVertical) {
                                        imageVector[lastSeam[i]].areaBoundaryV = 0;
                                    } else {
                                        imageVector[lastSeam[i]].areaBoundaryH = 0;
                                    }
                                }
                            }
                        }
                    }
                // persistently going right (top of an area)
                // totalDeviationL <= totalDeviationR
                } else {
                    // only if a top has not yet been found (without a matching bottom)
                    if (!seamBegan) {
                        // present deviation (plus tolerance) is less than last deviation amount
                        // and the last deviation is greater than the minimum required deviation
                        if (((totalDeviationR + deviationTol) < lastTotalDeviationR) &&
                                (lastTotalDeviationR > deviationMin)) {
                            seamBegan = 1;

                            straightDone = 0;
                            for (int i = 0; i < seamLength; ++i) {
                                if (direction == directionVertical) {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->yStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[lastSeam[i]].areaBoundaryV = 1;
                                    }
                                } else {
                                    if (clipAreaBound) {
                                        if ((i > 0) &&
                                                ((currentSeam[i-1] - currentSeam[i]) != imageWindow->xStep)) {
                                            straightDone = 1;
                                            straightStart = 0;
                                        } else {
                                            if (straightDone && !straightStart) {
                                                straightStart = i;
                                            }
                                        }
                                    } else {
                                        straightDone = 1;
                                    }

                                    if (straightDone) {
                                        imageVector[lastSeam[i]].areaBoundaryH = 1;
                                    }
                                }
                            }

                            // remove final straight edge
                            if (clipAreaBound && straightStart) {
                                for (int i = straightStart; i < seamLength; ++i) {
                                    if (direction == directionVertical) {
                                        imageVector[lastSeam[i]].areaBoundaryV = 0;
                                    } else {
                                        imageVector[lastSeam[i]].areaBoundaryH = 0;
                                    }
                                }
                            }
                        }
                    }
                }
            }
            // and you thought LISP was bad

            lastTotalDeviationL = totalDeviationL;
            lastTotalDeviationR = totalDeviationR;
            for (int i = 0; i < seamLength; ++i) {
                lastSeam[i] = currentSeam[i];
            }
        }

        k += loopInc;
        if (direction == directionVertical) {
            if (k <= loopEnd) {
                loopFinished = 1;
            }
        } else {
            if (k >= loopEnd) {
                loopFinished = 1;
            }
        }
    }
    free(lastSeam);
    free(currentSeam);
}

static void setPixelPathVertical(struct pixel *imageVector, struct window *imageWindow,
        int currentPixel, int currentCol)
{
    int pixelAbove = 0;
    int aboveL = 0;
    int aboveC = 0;
    int aboveR = 0;
    int newValue = 0;

    pixelAbove = currentPixel - imageWindow->yStep;
    // avoid falling off the left end
    if (currentCol > 0) {
        // avoid falling off the right end
        if (currentCol < imageWindow->xLength) {
            aboveL = imageVector[pixelAbove - imageWindow->xStep].seamvalV;
            aboveC = imageVector[pixelAbove].seamvalV;
            aboveR = imageVector[pixelAbove + imageWindow->xStep].seamvalV;
            newValue = min3(aboveL, aboveC, aboveR);
        } else {
            aboveL = imageVector[pixelAbove - imageWindow->xStep].seamvalV;
            aboveC = imageVector[pixelAbove].seamvalV;
            aboveR = INT_MAX;
            newValue = min(aboveL, aboveC);
        }
    } else {
        aboveL = INT_MAX;
        aboveC = imageVector[pixelAbove].seamvalV;
        aboveR = imageVector[pixelAbove + imageWindow->xStep].seamvalV;
        newValue = min(aboveC, aboveR);
    }
    imageVector[currentPixel].seamvalV += newValue;
    //
    // This (below) is kinda a big deal
    //
    if (imageVector[currentPixel].seamvalV > 0) {
        imageVector[currentPixel].seamvalV -= 1;
    }
}

static int fillSeamMatrixVertical(struct pixel *imageVector, struct window *imageWindow)
{
    int result = 0;
    int currentPixel = 0;
    // do not process the first row, start with j=1
    for (int y = (imageWindow->yOrigin + 1);
            y < imageWindow->yTerminus; y += imageWindow->xStep) {
        for (int x = imageWindow->xOrigin;
                x < imageWindow->xTerminus; x += imageWindow->xStep) {
            currentPixel = (y * imageWindow->fullWidth) + x;
            setPixelPathVertical(imageVector, imageWindow, currentPixel, x);

            if (imageVector[currentPixel].seamvalV != 0) {
                ++result;
            }
        }
    }
    return result;
}

static void findSeamsVertical(struct pixel *imageVector,
        struct window *imageWindow, int findAreas)
{
    findSeams(imageVector, imageWindow, 0, findAreas);
}

static void setPixelPathHorizontal(struct pixel *imageVector,
        struct window *imageWindow, int currentPixel, int currentCol)
{
    // avoid falling off the right
    if (currentCol < imageWindow->xLength) {
        int pixelLeft = 0;
        int leftT = 0;
        int leftM = 0;
        int leftB = 0;
        int newValue = 0;

        pixelLeft = currentPixel - imageWindow->xStep;
        // avoid falling off the top
        if (currentPixel > imageWindow->xLength) {
            // avoid falling off the bottom
            if (currentPixel < (imageWindow->pixelCount - imageWindow->xLength)) {
                leftT = imageVector[pixelLeft - imageWindow->yStep].seamvalH;
                leftM = imageVector[pixelLeft].seamvalH;
                leftB = imageVector[pixelLeft + imageWindow->yStep].seamvalH;
                newValue = min3(leftT, leftM, leftB);
            } else {
                leftT = imageVector[pixelLeft - imageWindow->yStep].seamvalH;
                leftM = imageVector[pixelLeft].seamvalH;
                leftB = INT_MAX;
                newValue = min(leftT, leftM);
            }
        } else {
            leftT = INT_MAX;
            leftM = imageVector[pixelLeft].seamvalH;
            leftB = imageVector[pixelLeft + imageWindow->yStep].seamvalH;
            newValue = min(leftM, leftB);
        }
        imageVector[currentPixel].seamvalH += newValue;
        //
        // This (below) is kinda a big deal
        //
        if (imageVector[currentPixel].seamvalH > 0) {
            imageVector[currentPixel].seamvalH -= 1;
        }
    }
}

static int fillSeamMatrixHorizontal(struct pixel *imageVector,
        struct window *imageWindow)
{
    int result = 0;
    int currentPixel = 0;
    // do not process the first row, start with j=1
    // must be in reverse order from vertical seam,
    // calculate columns as we move across (top down, left to right)
    for (int x = imageWindow->xOrigin;
            x < imageWindow->xTerminus; x += imageWindow->xStep) {
        for (int y = (imageWindow->yOrigin + 1); y < imageWindow->yTerminus; y += 1) {
            currentPixel = (y * imageWindow->fullWidth) + x;
            setPixelPathHorizontal(imageVector, imageWindow, currentPixel, x);

            if (imageVector[currentPixel].seamvalH != 0) {
                ++result;
            }
        }
    }
    return result;
}

static void findSeamsHorizontal(struct pixel *imageVector, struct window *imageWindow,
        int findAreas)
{
    findSeams(imageVector, imageWindow, 1, findAreas);
}

/*
 * The main function
 */
static int *seamCarve(int *imageVector, int imageWidth, int imageHeight, int imageDepth,
        int brightnessMode, int contrastMode, int forceDirection, int forceEdge, int preGauss)
{
    struct pixel *workingImage =
        (struct pixel*)xmalloc((unsigned long)imageWidth *
        (unsigned long)imageHeight * sizeof(struct pixel));
    int *resultImage =
        (int*)xmalloc((unsigned long)imageWidth *
        (unsigned long)imageHeight * (unsigned long)imageDepth * sizeof(int));

    int invertOutput = 0;

    int inputPixel = 0;
    int outputPixel = 0;
    int currentPixel = 0;
    int currentBrightness = 0;
    // fill initial data structures
    for (int j = 0; j < imageHeight; ++j) {
        for (int i = 0; i < imageWidth; ++i) {
            currentPixel = (j * imageWidth) + i;
            inputPixel = currentPixel * imageDepth;

            struct pixel newPixel;
            newPixel.r = imageVector[inputPixel];
            newPixel.g = imageVector[inputPixel+1];
            newPixel.b = imageVector[inputPixel+2];
            newPixel.a = imageVector[inputPixel+3];

            if (brightnessMode == 0) {
                // Average Intensity / Brightness
                newPixel.bright =
                    ((imageVector[inputPixel] + imageVector[inputPixel+1] +
                    imageVector[inputPixel+2]) / 3);
            } else if (brightnessMode == 1) {
                // HSV hexcone
                newPixel.bright =
                    max3(imageVector[inputPixel], imageVector[inputPixel+1],
                    imageVector[inputPixel+2]);
            } else if (brightnessMode == 2) {
                // Luma luminance -- sRGB / BT.709
                newPixel.bright =
                    (imageVector[inputPixel] * 0.21) +
                    (imageVector[inputPixel+1] * 0.72) +
                    (imageVector[inputPixel+2] * 0.07);
            } else if (brightnessMode == 3) {
                // Luma luminance -- NTSC / BT.601 (Digital CCIR601)
                newPixel.bright =
                    (imageVector[inputPixel] * 0.299) +
                    (imageVector[inputPixel+1] * 0.587) +
                    (imageVector[inputPixel+2] * 0.114);
            } else if (brightnessMode == 4) {
                // Relative luminance (Photometric/digital ITU-R)
                newPixel.bright =
                    (imageVector[inputPixel] * 0.2126) +
                    (imageVector[inputPixel+1] * 0.7152) +
                    (imageVector[inputPixel+2] * 0.0722);
            } else if (brightnessMode == 5) {
                // HSP?
                newPixel.bright = sqrt(
                    (imageVector[inputPixel] * imageVector[inputPixel] * 0.299) +
                    (imageVector[inputPixel+1] * imageVector[inputPixel+1] * 0.587) +
                    (imageVector[inputPixel+2] * imageVector[inputPixel+2] * 0.114));
            } else if (brightnessMode == 6) {
                // Euclidean distance
                newPixel.bright = pow(
                    (imageVector[inputPixel] * imageVector[inputPixel]) +
                    (imageVector[inputPixel+1] * imageVector[inputPixel+1]) +
                    (imageVector[inputPixel+2] * imageVector[inputPixel+2]), 0.33333);
            } else if (brightnessMode == 7) {
                // Fast ITU-R
                newPixel.bright =
                    (imageVector[inputPixel] * 0.33) +
                    (imageVector[inputPixel+1] * 0.5) +
                    (imageVector[inputPixel+2] * 0.16);
            } else if (brightnessMode == 8) {
                // Fast BT.601
                newPixel.bright =
                    (imageVector[inputPixel] * 0.375) +
                    (imageVector[inputPixel+1] * 0.5) +
                    (imageVector[inputPixel+2] * 0.125);
            }

            newPixel.energy = 0;
            newPixel.seamvalH = 0;
            newPixel.seamvalV = 0;
            newPixel.usecountH = 0;
            newPixel.usecountV = 0;
            newPixel.areaBoundaryH = 0;
            newPixel.areaBoundaryV = 0;
            workingImage[currentPixel] = newPixel;

            resultImage[inputPixel] = 0;
        }
    }

    // binarize the image as/if requested
    if (contrastMode == 1) {
        // not really binarized, but brightness passed through a cosine function
        // this increases differentiation between light and dark pixels
        int currentBrightness = 0;
        double currentRadians = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;

                currentBrightness = workingImage[currentPixel].bright;
                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);
                workingImage[currentPixel].bright = currentBrightness;
            }
        }
    } else if (contrastMode == 2) {
        int currentBrightness = 0;
        double currentRadians = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;

                currentBrightness = workingImage[currentPixel].bright;
                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                workingImage[currentPixel].bright = currentBrightness;
            }
        }
    } else if (contrastMode == 3) {
        int currentBrightness = 0;
        double currentRadians = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;

                currentBrightness = workingImage[currentPixel].bright;
                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                workingImage[currentPixel].bright = currentBrightness;
            }
        }
    } else if (contrastMode == 4) {
        int currentBrightness = 0;
        double currentRadians = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;

                currentBrightness = workingImage[currentPixel].bright;
                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                currentRadians = ((double)currentBrightness / 255.0) * PI;
                currentBrightness = (int)(((1.0 - cos(currentRadians)) / 2.0) * 255.0);

                workingImage[currentPixel].bright = currentBrightness;
            }
        }
    } else if (contrastMode == 5) {
        // Do you think anyone would ever read code in the appendix of a thesis?
        int bins[256] = {0}; // histogram bins must start at zero
        int currentBrightness = 0;

        // get histogram
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                currentBrightness = workingImage[currentPixel].bright;
                bins[currentBrightness] += 1;
            }
        }

        int threshold = otsuBinarization(bins, (imageWidth * imageHeight));

        // apply threshold
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                currentBrightness = workingImage[currentPixel].bright;
                if (currentBrightness > threshold) {
                    workingImage[currentPixel].bright = 255;
                } else {
                    workingImage[currentPixel].bright = 0;
                }
            }
        }
    }

    int gaussA = 0;
    int gaussB = 0;
    int tmpDoG = 0;
    int tmpSobel = 0;

    // get energy values using the prescribed method
    for (int j = 0; j < imageHeight; ++j) {
        for (int i = 0; i < imageWidth; ++i) {
            currentPixel = (j * imageWidth) + i;

            if (preGauss == 1) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 20);
            } else if (preGauss == 2) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 40);
            } else if (preGauss == 3) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 80);
            }

            if (forceEdge == 0) {
                gaussA = getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 14);
                gaussB = getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 16);
                workingImage[currentPixel].energy = getPixelEnergyDoG(gaussA, gaussB);
            } else if (forceEdge == 1) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 80);
                workingImage[currentPixel].energy = sqrt(
                    getPixelEnergyLaplacian(workingImage, imageWidth, imageHeight,
                    currentPixel));
            } else if (forceEdge == 2) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 40);
                workingImage[currentPixel].energy = sqrt(
                    getPixelEnergyLaplacian(workingImage,
                    imageWidth, imageHeight, currentPixel));
            } else if (forceEdge == 3) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 20);
                workingImage[currentPixel].energy = sqrt(
                    getPixelEnergyLaplacian(workingImage,
                    imageWidth, imageHeight, currentPixel));
            } else if (forceEdge == 4) {
                workingImage[currentPixel].energy =
                    getPixelEnergySobel(workingImage, imageWidth, imageHeight, currentPixel);
            } else if (forceEdge == 5) {
                workingImage[currentPixel].energy =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 9999);
            } else if (forceEdge == 6) {
                workingImage[currentPixel].energy =
                    getPixelEnergySimple(workingImage, imageWidth, imageHeight,
                    currentPixel, 1);
            } else if (forceEdge == 7) {
                gaussA = getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 14);
                gaussB = getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 16);

                tmpDoG = getPixelEnergyDoG(gaussA, gaussB);
                tmpSobel = getPixelEnergySobel(workingImage, imageWidth,
                    imageHeight, currentPixel);

                workingImage[currentPixel].energy = (tmpDoG + ((tmpSobel / 80) * 20)) / 2;
            } else if (forceEdge == 8) {
                workingImage[currentPixel].bright =
                    getPixelGaussian(workingImage, imageWidth, imageHeight, 1,
                    currentPixel, 80);
                tmpDoG = sqrt(
                    getPixelEnergyLaplacian(workingImage, imageWidth,
                    imageHeight, currentPixel));
                tmpSobel = getPixelEnergySobel(workingImage, imageWidth,
                    imageHeight, currentPixel);

                if (tmpDoG && tmpSobel) {
                    workingImage[currentPixel].energy = (tmpDoG + (tmpSobel / 24));
                } else {
                    workingImage[currentPixel].energy = 0;
                }
            }
            workingImage[currentPixel].seamvalH = workingImage[currentPixel].energy;
            workingImage[currentPixel].seamvalV = workingImage[currentPixel].energy;
        }
    }

    /**
     * Any Artificial Intelligence (AI) should know Kant and the Buddha, at least
     * intuitively, at a minimum. AI should have a moral philosophy or ethic
     */
    struct window *currentWindow =
        newWindow(0, 0, imageWidth, imageHeight, imageWidth, imageHeight);

    // find seams in the prescribed direction
    int resultDirection = forceDirection;
    if (forceDirection == 0) {
        int horizontalSeamCost = fillSeamMatrixHorizontal(workingImage, currentWindow);
        int verticalSeamCost = fillSeamMatrixVertical(workingImage, currentWindow);

        findSeamsHorizontal(workingImage, currentWindow, 0);
        findSeamsVertical(workingImage, currentWindow, 0);

        if (horizontalSeamCost < verticalSeamCost) {
            printf("Horizontal \n");
            resultDirection = 1;
        } else {
            printf("Vertical \n");
            resultDirection = 2;
        }
    } else if ((forceDirection == 1) || (forceDirection == 6) || (forceDirection == 8) ||
            (forceDirection == 49)) {
        fillSeamMatrixHorizontal(workingImage, currentWindow);
        if (forceDirection == 49) {
            findSeamsHorizontal(workingImage, currentWindow, 1);
        } else {
            findSeamsHorizontal(workingImage, currentWindow, 0);
        }
    } else if ((forceDirection == 2) || (forceDirection == 7) || (forceDirection == 9) ||
            (forceDirection == 50)) {
        fillSeamMatrixVertical(workingImage, currentWindow);
        if (forceDirection == 50) {
            findSeamsVertical(workingImage, currentWindow, 1);
        } else {
            findSeamsVertical(workingImage, currentWindow, 0);
        }
    } else if ((forceDirection == 4) || (forceDirection == 5)) {
        // pass
    } else {
        fillSeamMatrixHorizontal(workingImage, currentWindow);
        fillSeamMatrixVertical(workingImage, currentWindow);
        findSeamsHorizontal(workingImage, currentWindow, 0);
        findSeamsVertical(workingImage, currentWindow, 0);
    }

    // prepare results for output
    if (resultDirection == 1) {
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                if (workingImage[currentPixel].usecountH > THRESHHOLD_USECOUNT) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] = workingImage[currentPixel].r / 2;
                    resultImage[outputPixel+1] = workingImage[currentPixel].g / 2;
                    resultImage[outputPixel+2] = workingImage[currentPixel].b / 2;
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 2) {
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                if (workingImage[currentPixel].usecountV > THRESHHOLD_USECOUNT) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] = workingImage[currentPixel].r / 2;
                    resultImage[outputPixel+1] = workingImage[currentPixel].g / 2;
                    resultImage[outputPixel+2] = workingImage[currentPixel].b / 2;
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 3) {
        int currentUseCount = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                currentUseCount =
                    workingImage[currentPixel].usecountH +
                    workingImage[currentPixel].usecountV;
                if (currentUseCount > THRESHHOLD_USECOUNT) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] = workingImage[currentPixel].r / 2;
                    resultImage[outputPixel+1] = workingImage[currentPixel].g / 2;
                    resultImage[outputPixel+2] = workingImage[currentPixel].b / 2;
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 4) {
        int currentUseCount = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                resultImage[outputPixel] = min(max(workingImage[currentPixel].bright, 0), 255);
                resultImage[outputPixel+1] = min(max(workingImage[currentPixel].bright, 0), 255);
                resultImage[outputPixel+2] = min(max(workingImage[currentPixel].bright, 0), 255);
                resultImage[outputPixel+3] = 255;
            }
        }
    } else if (resultDirection == 5) {
        int energyScale = 16;
        int currentUseCount = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                resultImage[outputPixel] =
                    min(max((workingImage[currentPixel].energy * energyScale), 0), 255);
                resultImage[outputPixel+1] =
                    min(max((workingImage[currentPixel].energy * energyScale), 0), 255);
                resultImage[outputPixel+2] =
                    min(max((workingImage[currentPixel].energy * energyScale), 0), 255);
                resultImage[outputPixel+3] = 255;
            }
        }
    } else if (resultDirection == 6) {
        int seamValueScale = 16;
        int currentUseCount = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                currentUseCount = workingImage[currentPixel].usecountH;
                if (currentUseCount > 256) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] =
                        min(max((workingImage[currentPixel].seamvalH * seamValueScale), 0), 255);
                    resultImage[outputPixel+1] =
                        min(max((workingImage[currentPixel].seamvalH * seamValueScale), 0), 255);
                    resultImage[outputPixel+2] =
                        min(max((workingImage[currentPixel].seamvalH * seamValueScale), 0), 255);
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 7) {
        int seamValueScale = 16;
        int currentUseCount = 0;
        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                currentUseCount = workingImage[currentPixel].usecountV;
                if (currentUseCount > 256) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] =
                        min(max((workingImage[currentPixel].seamvalV * seamValueScale), 0), 255);
                    resultImage[outputPixel+1] =
                        min(max((workingImage[currentPixel].seamvalV * seamValueScale), 0), 255);
                    resultImage[outputPixel+2] =
                        min(max((workingImage[currentPixel].seamvalV * seamValueScale), 0), 255);
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 8) {
        int seamValueScale = 4;
        int currentUseCount = 0;

        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                currentUseCount = workingImage[currentPixel].usecountH;
                if (currentUseCount > 256) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    if (invertOutput) {
                        resultImage[outputPixel] =
                            255 - min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+1] =
                            255 - min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+2] =
                            255 - min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+3] = 255;
                    } else {
                        resultImage[outputPixel] =
                            min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+1] =
                            min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+2] =
                            min(max((workingImage[currentPixel].usecountH), 0), 255);
                        resultImage[outputPixel+3] = 255;
                    }
                }
            }
        }
    } else if (resultDirection == 9) {
        int seamValueScale = 4;
        int currentUseCount = 0;

        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                currentUseCount = workingImage[currentPixel].usecountV;
                if (currentUseCount > 256) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else {
                    if (invertOutput) {
                        resultImage[outputPixel] =
                            255 - min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+1] =
                            255 - min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+2] =
                            255 - min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+3] = 255;
                    } else {
                        resultImage[outputPixel] =
                            min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+1] =
                            min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+2] =
                            min(max((workingImage[currentPixel].usecountV), 0), 255);
                        resultImage[outputPixel+3] = 255;
                    }
                }
            }
        }
    } else if (resultDirection == 49) {
        int seamValueScale = 4;
        int currentUseCount = 0;

        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                if (workingImage[currentPixel].areaBoundaryH == 1) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else if (workingImage[currentPixel].areaBoundaryH == 2) {
                    resultImage[outputPixel] = 0;
                    resultImage[outputPixel+1] = 255;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else if (workingImage[currentPixel].areaBoundaryH == 3) {
                    resultImage[outputPixel] = 0;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 255;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] = workingImage[currentPixel].r;
                    resultImage[outputPixel+1] = workingImage[currentPixel].g;
                    resultImage[outputPixel+2] = workingImage[currentPixel].b;
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    } else if (resultDirection == 50) {
        int seamValueScale = 4;
        int currentUseCount = 0;

        for (int j = 0; j < imageHeight; ++j) {
            for (int i = 0; i < imageWidth; ++i) {
                currentPixel = (j * imageWidth) + i;
                outputPixel = currentPixel * imageDepth;

                if (workingImage[currentPixel].areaBoundaryV == 1) {
                    resultImage[outputPixel] = 255;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else if (workingImage[currentPixel].areaBoundaryV == 2) {
                    resultImage[outputPixel] = 0;
                    resultImage[outputPixel+1] = 255;
                    resultImage[outputPixel+2] = 0;
                    resultImage[outputPixel+3] = 255;
                } else if (workingImage[currentPixel].areaBoundaryV == 3) {
                    resultImage[outputPixel] = 0;
                    resultImage[outputPixel+1] = 0;
                    resultImage[outputPixel+2] = 255;
                    resultImage[outputPixel+3] = 255;
                } else {
                    resultImage[outputPixel] = workingImage[currentPixel].r;
                    resultImage[outputPixel+1] = workingImage[currentPixel].g;
                    resultImage[outputPixel+2] = workingImage[currentPixel].b;
                    resultImage[outputPixel+3] = 255;
                }
            }
        }
    }
    return resultImage;
}

#endif

APPENDIX D

PNG IMPORT FUNCTIONS — LIBPNGHELPER.C

/**
 * libpngHelper.c
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */
#ifndef LIBPNGHELPER_C
#define LIBPNGHELPER_C

/* The angle-bracket header names were lost from the original listing; the
   headers below are assumed from what the code requires (stdio, stdlib,
   setjmp, and libpng). */
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <png.h>

#define PNG_BYTES_TO_CHECK 4

/**
 * Uses libpng to read in a PNG file
 * @param filename The name of the PNG file to read in
 * @param imageWidth Reference to an integer which will store the image's width
 * @param imageHeight Reference to an integer which will store the image's height
 * @param imageDepth Reference to an integer which will store the image's pixel depth
 * @param verbosity Whether or not to print error messages to stderr
 * @return Returns an array of integers representing the image pixels
 */
static int *readPNGFile(char *filename, int *imageWidth, int *imageHeight, int *imageDepth, int verbosity)
{
    FILE *pngfile = fopen(filename, "rb");
    if (!pngfile) {
        if (verbosity > 0) {
            fprintf(stderr, "Error opening file: %s\n", filename);
        }
        return NULL;
    }

    char header[PNG_BYTES_TO_CHECK];
    fread(header, 1, PNG_BYTES_TO_CHECK, pngfile);
    int is_png = !png_sig_cmp((png_bytep)header, 0, PNG_BYTES_TO_CHECK);
    if (!is_png) {
        if (verbosity > 0) {
            fprintf(stderr, "Invalid PNG file: %s\n", filename);
        }
        return NULL;
    }

    png_structp png = png_create_read_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    if (!png) {
        return NULL;
    }

    png_infop info = png_create_info_struct(png);
    if (!info) {
        return NULL;
    }

    if (setjmp(png_jmpbuf(png))) {
        return NULL;
    }

    png_init_io(png, pngfile);
    png_set_sig_bytes(png, PNG_BYTES_TO_CHECK);
    png_read_info(png, info);

    int width = (int)png_get_image_width(png, info);
    int height = (int)png_get_image_height(png, info);
    png_byte color_type = png_get_color_type(png, info);
    png_byte bit_depth = png_get_bit_depth(png, info);

    // Read any color_type into 8bit depth, RGBA format.
    // See http://www.libpng.org/pub/png/libpng-manual.txt
    if (bit_depth == 16)
        png_set_strip_16(png);
    if (color_type == PNG_COLOR_TYPE_PALETTE)
        png_set_palette_to_rgb(png);

    // PNG_COLOR_TYPE_GRAY_ALPHA is always 8 or 16bit depth.
    if (color_type == PNG_COLOR_TYPE_GRAY && bit_depth < 8)
        png_set_expand_gray_1_2_4_to_8(png);
    if (png_get_valid(png, info, PNG_INFO_tRNS))
        png_set_tRNS_to_alpha(png);

    // These color_type don't have an alpha channel then fill it with 0xff.
    if (color_type == PNG_COLOR_TYPE_RGB ||
            color_type == PNG_COLOR_TYPE_GRAY ||
            color_type == PNG_COLOR_TYPE_PALETTE)
        png_set_filler(png, 0xFF, PNG_FILLER_AFTER);
    if (color_type == PNG_COLOR_TYPE_GRAY ||
            color_type == PNG_COLOR_TYPE_GRAY_ALPHA)
        png_set_gray_to_rgb(png);
    png_read_update_info(png, info);

    png_bytep *row_pointers;
    row_pointers = (png_bytep*)malloc((unsigned long)height * sizeof(png_bytep));
    for (int y = 0; y < height; y++) {
        row_pointers[y] = (png_byte*)malloc(png_get_rowbytes(png, info));
    }
    png_read_image(png, row_pointers);
    fclose(pngfile);

    int pixelDepth = 4;
    int *imagePixels =
        (int*)malloc((unsigned long)height *
        (unsigned long)width * (unsigned long)pixelDepth * sizeof(int));
    int n = 0;
    int rPixel = 0;
    int bPixel = 0;
    int gPixel = 0;
    int aPixel = 0;
    double greyPixel = 0.0;
    double radianShift = 3.14159265;
    double radianRange = 3.14159265;
    double radianPixel = 0;
    double scaledPixel = 0;
    for (int y = 0; y < height; ++y) {
        png_bytep row = row_pointers[y];
        for (int x = 0; x < width; ++x) {
            png_bytep pixel = &(row[x * 4]);
            rPixel = (int)pixel[0];
            gPixel = (int)pixel[1];
            bPixel = (int)pixel[2];
            aPixel = (int)pixel[3];

            n = ((y * width) + x) * pixelDepth;

            imagePixels[n] = rPixel;
            imagePixels[n+1] = gPixel;
            imagePixels[n+2] = bPixel;
            imagePixels[n+3] = aPixel;
        }
    }

    // release png memory
    for (int y = 0; y < height; y++) {
        free(row_pointers[y]);
    }
    free(row_pointers);

    *imageWidth = width;
    *imageHeight = height;
    *imageDepth = pixelDepth;
    return imagePixels;
}

/**
 * Three marks of existence, characteristics shared by all sentient beings
 * 1. sabbe saṅkhāra aniccā — all saṅkhāras (conditioned things) are impermanent
 * 2. sabbe saṅkhāra dukkhā — all saṅkhāras are unsatisfactory (suffering)
 * 3. sabbe dhammā anattā — all dhammas (conditioned/unconditioned things) are not self
 * ☸
 */
static void write_png_file(int *imageVector, int width, int height, char *filename)
{
    FILE *fp = fopen(filename, "wb");
    if (!fp) abort();

    png_structp png = png_create_write_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    if (!png) abort();

    png_infop info = png_create_info_struct(png);
    if (!info) abort();

    if (setjmp(png_jmpbuf(png))) abort();

    png_init_io(png, fp);

    // Output is 8bit depth, RGBA format.
    png_set_IHDR(
        png,
        info,
        (png_uint_32)width,
        (png_uint_32)height,
        8,
        PNG_COLOR_TYPE_RGBA,
        PNG_INTERLACE_NONE,
        PNG_COMPRESSION_TYPE_DEFAULT,
        PNG_FILTER_TYPE_DEFAULT
    );
    png_write_info(png, info);

    png_bytep *row_pointers;
    row_pointers = (png_bytep*)malloc((unsigned long)height * sizeof(png_bytep));
    for (int y = 0; y < height; y++) {
        row_pointers[y] = (png_byte*)malloc(png_get_rowbytes(png, info));
    }

    int i = 0;
    for (int y = 0; y < height; y++) {
        for (int z = 0; z < width; z++) {
            row_pointers[y][z*4+0] = (png_byte)imageVector[i];
            row_pointers[y][z*4+1] = (png_byte)imageVector[i+1];
            row_pointers[y][z*4+2] = (png_byte)imageVector[i+2];
            row_pointers[y][z*4+3] = (png_byte)imageVector[i+3];
            i += 4;
        }
    }

    png_write_image(png, row_pointers);
    png_write_end(png, NULL);

    for (int y = 0; y < height; y++) {
        free(row_pointers[y]);
    }
    free(row_pointers);

    fclose(fp);
}

#endif

APPENDIX E

IMAGE RESIZE FUNCTIONS — LIBRESIZE.C

/**
 * libResize.c
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */
#ifndef LIBRESIZE_C
#define LIBRESIZE_C

double linearInterpolation(double s, double e, double t)
{
    return s + (e - s) * t;
}

double bilinearInterpolation(double c00, double c10, double c01, double c11,
        double tx, double ty)
{
    return linearInterpolation(linearInterpolation(c00, c10, tx),
        linearInterpolation(c01, c11, tx), ty);
}

static void scaleBilinearBW(int *srcImgVector, int srcImgWidth, int srcImgHeight,
        int *dstImgVector, int dstImgWidth, int dstImgHeight)
{
    const double dblSrcW = (double)srcImgWidth;
    const double dblSrcH = (double)srcImgHeight;
    const double dblDstW = (double)dstImgWidth;
    const double dblDstH = (double)dstImgHeight;
    const double scaleX = dblSrcW / dblDstW;
    const double scaleY = dblSrcH / dblDstH;

    double dblX = 0;
    double dblY = 0;

    double gx = 0;
    double gy = 0;
    int gxi = 0;
    int gyi = 0;

    int location00 = 0;
    int location10 = 0;
    int location01 = 0;
    int location11 = 0;

    double pixel00 = 0;
    double pixel10 = 0;
    double pixel01 = 0;
    double pixel11 = 0;

    double result = 0;
    int currentPixel = 0;

    for (int y = 0; y < dstImgHeight; ++y) {
        for (int x = 0; x < dstImgWidth; ++x) {
            dblX = (double)x;
            dblY = (double)y;

            gx = dblX * scaleX;
            gy = dblY * scaleY;
            gxi = (int)gx;
            gyi = (int)gy;

            location00 = (gyi * srcImgWidth) + gxi;
            location10 = (gyi * srcImgWidth) + gxi + 1;
            location01 = ((gyi + 1) * srcImgWidth) + gxi;
            location11 = ((gyi + 1) * srcImgWidth) + gxi + 1;

            pixel00 = srcImgVector[location00];
            pixel10 = srcImgVector[location10];
            pixel01 = srcImgVector[location01];
            pixel11 = srcImgVector[location11];

            result = bilinearInterpolation(pixel00, pixel10, pixel01, pixel11,
                (gx - gxi), (gy - gyi));

            currentPixel = (y * dstImgWidth) + x;
            dstImgVector[currentPixel] = result;
        }
    }
}

static void resize(int *srcImgVector, int srcImgWidth, int srcImgHeight,
        int *dstImgVector, int dstImgWidth, int dstImgHeight)
{
    scaleBilinearBW(srcImgVector, srcImgWidth, srcImgHeight,
        dstImgVector, dstImgWidth, dstImgHeight);
}

static int getScaledSize(int srcSize, double scalePercentage)
{
    double srcSizeDbl = (double)srcSize;
    double resultSizeDbl = srcSizeDbl * scalePercentage;
    return (int)resultSizeDbl;
}

static void scale(int *srcImgVector, int srcImgWidth, int srcImgHeight,
        double scalePercentage, int *dstImgVector)
{
    int dstImgWidth = getScaledSize(srcImgWidth, scalePercentage);
    int dstImgHeight = getScaledSize(srcImgHeight, scalePercentage);
    scaleBilinearBW(srcImgVector, srcImgWidth, srcImgHeight,
        dstImgVector, dstImgWidth, dstImgHeight);
}

#endif

APPENDIX F

IMAGE BINARIZATION — LIBBINARIZATION.C

/**
 * libBinarization.c
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */
#ifndef LIBBINARIZATION_C
#define LIBBINARIZATION_C

static int otsuBinarization(int *histogram, int pixelCount)
{
    int sum = 0;
    for (int i = 1; i < 256; ++i) {
        sum += i * histogram[i];
    }

    /**
     * Immanuel Kant, The Categorical Imperative
     * 1. Act only according to that maxim whereby you can at the same time will
     *    that it should become a universal law.
     * 2. Act in such a way that you treat humanity, whether in your own person or
     *    in the person of any other, never merely as a means to an end, but always
     *    at the same time as an end.
     * 3. Therefore, every rational being must so act as if he were through his
     *    maxim always a legislating member in the universal kingdom of ends.
     */

    int sumB = 0;
    int wB = 0;
    int wF = 0;
    int mB;
    int mF;
    double max = 0.0;
    double between = 0.0;
    double threshold1 = 0.0;
    double threshold2 = 0.0;

    for (int i = 0; i < 256; ++i) {
        wB += histogram[i];

        if (wB) {
            wF = pixelCount - wB;

            if (wF == 0) {
                break;
            }

            sumB += i * histogram[i];

            mB = sumB / wB;
            mF = (sum - sumB) / wF;
            between = wB * wF * pow(mB - mF, 2);

            if (between >= max) {
                threshold1 = i;
                if (between > max) {
                    threshold2 = i;
                }
                max = between;
            }
        }
    }
    return (threshold1 + threshold2) / 2.0;
}

#endif

APPENDIX G

IMAGE ENERGY FUNCTIONS — LIBENERGIES.C

/** * libEnergies.c * Masters Thesis Work * Christopher Stoll, 2014 */ #ifndef LIBENERGIES_C #define LIBENERGIES_C

#include "pixel.h" #include "libMinMax.c"

// Simple energy function, basically a gradient magnitude calculation static int getPixelEnergySimple(struct pixel *imageVector, int imageWidth, \ int imageHeight, int currentPixel, int gradientSize) { // We can pull from two pixels above instead of summing one above and one below int pixelAbove = 0; if (currentPixel > (imageWidth * gradientSize)) { pixelAbove = currentPixel - (imageWidth * gradientSize); }

int yDif = 0; if (imageVector[pixelAbove].bright > imageVector[currentPixel].bright) { yDif = imageVector[pixelAbove].bright - imageVector[currentPixel].bright; } else { yDif = imageVector[currentPixel].bright - imageVector[pixelAbove].bright; }

int pixelLeft = 0; pixelLeft = currentPixel - gradientSize; if (pixelLeft < 0) { pixelLeft = 0; }

int pixelCol = currentPixel % imageWidth; int xDif = 0; if (pixelCol > 0) { if (imageVector[pixelLeft].bright > imageVector[currentPixel].bright) { xDif = imageVector[pixelLeft].bright - imageVector[currentPixel].bright; } else { xDif = imageVector[currentPixel].bright - imageVector[pixelLeft].bright; } } return min((yDif + xDif), 255); } static int getPixelEnergySobel(struct pixel *imageVector, int imageWidth, \ int imageHeight, int currentPixel) { int pixelDepth = 1; int imageByteWidth = imageWidth * pixelDepth; int currentCol = currentPixel % imageByteWidth;

82 int p1, p2, p3, p4, p5, p6, p7, p8, p9;

// get pixel locations within the image array // image border pixels have undefined (zero) energy if ((currentPixel > imageByteWidth) && (currentPixel < (imageByteWidth * (imageHeight - 1))) && (currentCol > 0) && (currentCol < (imageByteWidth - pixelDepth))) { p1 = currentPixel - imageByteWidth - pixelDepth; p2 = currentPixel - imageByteWidth; p3 = currentPixel - imageByteWidth + pixelDepth; p4 = currentPixel - pixelDepth; p5 = currentPixel; p6 = currentPixel + pixelDepth; p7 = currentPixel + imageByteWidth - pixelDepth; p8 = currentPixel + imageByteWidth; p9 = currentPixel + imageByteWidth + pixelDepth; } else { // TODO: consider attempting to evaluate border pixels return 0;//33; // zero and INT_MAX are significant, so return 1 }

//
// Declaration of the Rights of Man and of the Citizen
// (Déclaration des droits de l'homme et du citoyen)
//
// Article I - Men (and women) are born and remain free and equal in rights.
//     Social distinctions can be founded only on the common good.
// Article II - The goal of any political association is the conservation of
//     the natural and imprescriptible rights of man. These rights are
//     liberty, property, safety and resistance against oppression.
// Article III - The principle of any sovereignty resides essentially in the Nation.
//     No body, no individual can exert authority which does not emanate
//     expressly from it.
// Article IV - Liberty consists of doing anything which does not harm others:
//     thus, the exercise of the natural rights of each man has only those
//     borders which assure other members of the society the enjoyment of
//     these same rights. These borders can be determined only by the law.
// Article V - The law has the right to forbid only actions harmful to society.
//     Anything which is not forbidden by the law cannot be impeded, and
//     no one can be constrained to do what it does not order.
// Article VI - The law is the expression of the general will. All the citizens have
//     the right of contributing personally or through their representatives
//     to its formation. It must be the same for all, either that it
//     protects, or that it punishes. All the citizens, being equal in its
//     eyes, are equally admissible to all public dignities, places and
//     employments, according to their capacity and without distinction
//     other than that of their virtues and of their talents.
// Article VII - No man can be accused, arrested nor detained but in the cases
//     determined by the law, and according to the forms which it has
//     prescribed. Those who solicit, dispatch, carry out or cause to be
//     carried out arbitrary orders, must be punished; but any citizen
//     called or seized under the terms of the law must obey at once; he
//     renders himself culpable by resistance.
// Article VIII - The law should establish only penalties that are strictly and
//     evidently necessary, and no one can be punished but under a law
//     established and promulgated before the offense and legally applied.
//…

    // get the pixel values from the image array
    int p1val = imageVector[p1].bright;
    int p2val = imageVector[p2].bright;
    int p3val = imageVector[p3].bright;
    int p4val = imageVector[p4].bright;
    int p5val = imageVector[p5].bright;
    int p6val = imageVector[p6].bright;
    int p7val = imageVector[p7].bright;
    int p8val = imageVector[p8].bright;
    int p9val = imageVector[p9].bright;

    // apply the Sobel filter
    int sobelX = (p3val + (p6val + p6val) + p9val -
                  p1val - (p4val + p4val) - p7val);
    int sobelY = (p1val + (p2val + p2val) + p3val -
                  p7val - (p8val + p8val) - p9val);

    // bounded gradient magnitude
    return min(max((int)(sqrt((sobelX * sobelX) + (sobelY * sobelY)) / 2), 0), 255);
}

static int getPixelEnergyLaplacian(struct pixel *imageVector, int imageWidth, \
    int imageHeight, int currentPixel) {
    int pixelDepth = 1;
    int imageByteWidth = imageWidth * pixelDepth;
    int currentCol = currentPixel % imageByteWidth;
    int p1, p2, p3, p4, p5, p6, p7, p8, p9;

    // get pixel locations within the image array
    // image border pixels have undefined (zero) energy
    if ((currentPixel > imageByteWidth) &&
        (currentPixel < (imageByteWidth * (imageHeight - 1))) &&
        (currentCol > 0) &&
        (currentCol < (imageByteWidth - pixelDepth))) {
        p1 = currentPixel - imageByteWidth - pixelDepth;
        p2 = currentPixel - imageByteWidth;
        p3 = currentPixel - imageByteWidth + pixelDepth;
        p4 = currentPixel - pixelDepth;
        p5 = currentPixel;
        p6 = currentPixel + pixelDepth;
        p7 = currentPixel + imageByteWidth - pixelDepth;
        p8 = currentPixel + imageByteWidth;
        p9 = currentPixel + imageByteWidth + pixelDepth;
    } else {
        // TODO: consider attempting to evaluate border pixels
        return 0;
    }

//…
// Article IX - Any man being presumed innocent until he is declared culpable, if it
//     is judged indispensable to arrest him, any rigor which would not be
//     necessary for the securing of his person must be severely reprimanded
//     by the law.
// Article X - No one may be disturbed for his opinions, even religious ones,
//     provided that their manifestation does not trouble the public order
//     established by the law.
// Article XI - The free communication of thoughts and of opinions is one of the most
//     precious rights of man: any citizen thus may speak, write, print
//     freely, except to respond to the abuse of this liberty, in the cases
//     determined by the law.
// Article XII - The guarantee of the rights of man and of the citizen necessitates a
//     public force: this force is thus instituted for the advantage of all
//     and not for the particular utility of those in whom it is trusted.
// Article XIII - For the maintenance of the public force and for the expenditures of
//     administration, a common contribution is indispensable; it must be
//     equally distributed between all the citizens, according to their
//     ability to pay.
//…

    // get the pixel values from the image array
    int p1val = imageVector[p1].bright;
    int p2val = imageVector[p2].bright;
    int p3val = imageVector[p3].bright;
    int p4val = imageVector[p4].bright;
    int p5val = imageVector[p5].bright;
    int p6val = imageVector[p6].bright;
    int p7val = imageVector[p7].bright;
    int p8val = imageVector[p8].bright;
    int p9val = imageVector[p9].bright;

    // apply the Laplacian filter (4-connected kernel)
    int laplace = (4 * p5val) - p1val - p4val - p6val - p8val;
    return min(max(laplace, 0), 255);
}

static int getPixelGaussian(struct pixel *imageVector, int imageWidth, \
    int imageHeight, int pixelDepth, int currentPixel, int sigma) {
    int imageByteWidth = imageWidth * pixelDepth;
    int points[25];
    double pointValues[25];

    // locations of the 5x5 neighborhood, row by row
    points[0] = currentPixel - imageByteWidth - imageByteWidth - pixelDepth - pixelDepth;
    points[1] = currentPixel - imageByteWidth - imageByteWidth - pixelDepth;
    points[2] = currentPixel - imageByteWidth - imageByteWidth;
    points[3] = currentPixel - imageByteWidth - imageByteWidth + pixelDepth;
    points[4] = currentPixel - imageByteWidth - imageByteWidth + pixelDepth + pixelDepth;
    points[5] = currentPixel - imageByteWidth - pixelDepth - pixelDepth;
    points[6] = currentPixel - imageByteWidth - pixelDepth;
    points[7] = currentPixel - imageByteWidth;
    points[8] = currentPixel - imageByteWidth + pixelDepth;
    points[9] = currentPixel - imageByteWidth + pixelDepth + pixelDepth;
    points[10] = currentPixel - pixelDepth - pixelDepth;
    points[11] = currentPixel - pixelDepth;
    points[12] = currentPixel;
    points[13] = currentPixel + pixelDepth;
    points[14] = currentPixel + pixelDepth + pixelDepth;
    points[15] = currentPixel + imageByteWidth - pixelDepth - pixelDepth;
    points[16] = currentPixel + imageByteWidth - pixelDepth;
    points[17] = currentPixel + imageByteWidth;
    points[18] = currentPixel + imageByteWidth + pixelDepth;
    points[19] = currentPixel + imageByteWidth + pixelDepth + pixelDepth;
    points[20] = currentPixel + imageByteWidth + imageByteWidth - pixelDepth - pixelDepth;
    points[21] = currentPixel + imageByteWidth + imageByteWidth - pixelDepth;
    points[22] = currentPixel + imageByteWidth + imageByteWidth;
    points[23] = currentPixel + imageByteWidth + imageByteWidth + pixelDepth;
    points[24] = currentPixel + imageByteWidth + imageByteWidth + pixelDepth + pixelDepth;

    // clamp out-of-range neighbor locations to the image bounds
    // (border pixels reuse in-bounds neighbors rather than being skipped)
    for (int i = 0; i < 25; ++i) {
        if (points[i] < 0) {
            points[i] = 0;
        } else if (points[i] >= (imageHeight * imageWidth * pixelDepth)) {
            points[i] = (imageHeight * imageWidth * pixelDepth) - 1;
        }
    }

//…
// Article XIV - Each citizen has the right to ascertain, by himself or through his
//     representatives, the need for a public tax, to consent to it freely,
//     to know the uses to which it is put, and of determining the
//     proportion, basis, collection, and duration.
// Article XV - The society has the right of requesting account from any public agent
//     of its administration.
// Article XVI - Any society in which the guarantee of rights is not assured, nor the
//     separation of powers determined, has no Constitution.
// Article XVII - Property being an inviolable and sacred right, no one can be deprived
//     of private usage, if it is not when the public necessity, legally
//     noted, evidently requires it, and under the condition of a just and
//     prior indemnity.
//

    // get the pixel values from the image array
    pointValues[0] = (double)imageVector[points[0]].bright;
    pointValues[1] = (double)imageVector[points[1]].bright;
    pointValues[2] = (double)imageVector[points[2]].bright;
    pointValues[3] = (double)imageVector[points[3]].bright;
    pointValues[4] = (double)imageVector[points[4]].bright;
    pointValues[5] = (double)imageVector[points[5]].bright;
    pointValues[6] = (double)imageVector[points[6]].bright;
    pointValues[7] = (double)imageVector[points[7]].bright;
    pointValues[8] = (double)imageVector[points[8]].bright;
    pointValues[9] = (double)imageVector[points[9]].bright;
    pointValues[10] = (double)imageVector[points[10]].bright;
    pointValues[11] = (double)imageVector[points[11]].bright;
    pointValues[12] = (double)imageVector[points[12]].bright;
    pointValues[13] = (double)imageVector[points[13]].bright;
    pointValues[14] = (double)imageVector[points[14]].bright;
    pointValues[15] = (double)imageVector[points[15]].bright;
    pointValues[16] = (double)imageVector[points[16]].bright;
    pointValues[17] = (double)imageVector[points[17]].bright;
    pointValues[18] = (double)imageVector[points[18]].bright;
    pointValues[19] = (double)imageVector[points[19]].bright;
    pointValues[20] = (double)imageVector[points[20]].bright;
    pointValues[21] = (double)imageVector[points[21]].bright;
    pointValues[22] = (double)imageVector[points[22]].bright;
    pointValues[23] = (double)imageVector[points[23]].bright;
    pointValues[24] = (double)imageVector[points[24]].bright;

    double gaussL1 = 0.0;
    double gaussL2 = 0.0;
    double gaussL3 = 0.0;
    double gaussL4 = 0.0;
    double gaussL5 = 0.0;
    double gaussAll = 0.0;
    double gaussDvsr = 1.0;
    double weights[25];

    // scale courtesy: http://dev.theomader.com/gaussian-kernel-calculator/
    // only the six unique values of the symmetric 5x5 kernel are set here;
    // the remaining entries are filled in by symmetry below
    if (sigma == 9999) {
        // LoG -- Laplacian of Gaussian
        weights[0] = 0;
        weights[1] = 0;
        weights[2] = -1;
        weights[6] = -1;
        weights[7] = -2;
        weights[12] = 16;
    } else if (sigma == 80) {
        // scaling factor / standard deviation / sigma = 8.0
        weights[0] = 0.038764;
        weights[1] = 0.039682;
        weights[2] = 0.039993;
        weights[6] = 0.040622;
        weights[7] = 0.040940;
        weights[12] = 0.041261;
    } else if (sigma == 40) {
        // scaling factor / standard deviation / sigma = 4.0
        weights[0] = 0.035228;
        weights[1] = 0.038671;
        weights[2] = 0.039892;
        weights[6] = 0.042452;
        weights[7] = 0.043792;
        weights[12] = 0.045175;
    } else if (sigma == 20) {
        // scaling factor / standard deviation / sigma = 2.0
        weights[0] = 0.023528;
        weights[1] = 0.033969;
        weights[2] = 0.038393;
        weights[6] = 0.049045;
        weights[7] = 0.055432;
        weights[12] = 0.062651;
    } else if (sigma == 16) {
        // scaling factor / standard deviation / sigma = 1.6
        weights[0] = 0.017056;
        weights[1] = 0.030076;
        weights[2] = 0.036334;
        weights[6] = 0.053035;
        weights[7] = 0.064071;
        weights[12] = 0.077404;
    } else if (sigma == 14) {
        // scaling factor / standard deviation / sigma = 1.4
        gaussDvsr = 159;
        weights[0] = 2;
        weights[1] = 4;
        weights[2] = 5;
        weights[6] = 9;
        weights[7] = 12;
        weights[12] = 15;
    } else if (sigma == 12) {
        // scaling factor / standard deviation / sigma = 1.2
        weights[0] = 0.008173;
        weights[1] = 0.021861;
        weights[2] = 0.030337;
        weights[6] = 0.058473;
        weights[7] = 0.081144;
        weights[12] = 0.112606;
    } else if (sigma == 10) {
        // scaling factor / standard deviation / sigma = 1
        gaussDvsr = 273;
        weights[0] = 1;
        weights[1] = 4;
        weights[2] = 7;
        weights[6] = 16;
        weights[7] = 26;
        weights[12] = 41;
    } else {
        weights[0] = 1;
        weights[1] = 2;
        weights[2] = 4;
        weights[6] = 8;
        weights[7] = 16;
        weights[12] = 32;
    }
    // line 1 has 2 duplicated values
    weights[3] = weights[1];
    weights[4] = weights[0];
    // line 2 has 3 duplicated values
    weights[5] = weights[1];
    weights[8] = weights[6];
    weights[9] = weights[5];
    // line 3 has 4 duplicated values
    weights[10] = weights[2];
    weights[11] = weights[7];
    weights[13] = weights[11];
    weights[14] = weights[10];
    // line 4 is the same as line 2
    weights[15] = weights[5];
    weights[16] = weights[6];
    weights[17] = weights[7];
    weights[18] = weights[8];
    weights[19] = weights[9];
    // line 5 is the same as line 1
    weights[20] = weights[0];
    weights[21] = weights[1];
    weights[22] = weights[2];
    weights[23] = weights[3];
    weights[24] = weights[4];

    // weighted sums, one kernel row at a time
    gaussL1 = \
        (weights[0] * pointValues[0]) + \
        (weights[1] * pointValues[1]) + \
        (weights[2] * pointValues[2]) + \
        (weights[3] * pointValues[3]) + \
        (weights[4] * pointValues[4]);
    gaussL2 = \
        (weights[5] * pointValues[5]) + \
        (weights[6] * pointValues[6]) + \
        (weights[7] * pointValues[7]) + \
        (weights[8] * pointValues[8]) + \
        (weights[9] * pointValues[9]);
    gaussL3 = \
        (weights[10] * pointValues[10]) + \
        (weights[11] * pointValues[11]) + \
        (weights[12] * pointValues[12]) + \
        (weights[13] * pointValues[13]) + \
        (weights[14] * pointValues[14]);
    gaussL4 = \
        (weights[15] * pointValues[15]) + \
        (weights[16] * pointValues[16]) + \
        (weights[17] * pointValues[17]) + \
        (weights[18] * pointValues[18]) + \
        (weights[19] * pointValues[19]);

    gaussL5 = \
        (weights[20] * pointValues[20]) + \
        (weights[21] * pointValues[21]) + \
        (weights[22] * pointValues[22]) + \
        (weights[23] * pointValues[23]) + \
        (weights[24] * pointValues[24]);

    gaussAll = (gaussL1 + gaussL2 + gaussL3 + gaussL4 + gaussL5) / gaussDvsr;
    return min(max((int)gaussAll, 0), 255);
}

// Difference of Gaussians: absolute difference of two Gaussian responses
static int getPixelEnergyDoG(int gaussianValue1, int gaussianValue2) {
    int greyPixel = 0;
    if (gaussianValue1 > gaussianValue2) {
        greyPixel = (gaussianValue1 - gaussianValue2);
    } else {
        greyPixel = (gaussianValue2 - gaussianValue1);
    }
    return min(max(greyPixel, 0), 255);
}

#endif
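As a usage illustration, the sketch below fills each pixel's energy field with the Sobel energy. The driver loop and the choice of the Sobel function over the other kernels are assumptions made for this example, not something the library above prescribes.

#include "pixel.h"
#include "libEnergies.c"

// Sketch: compute an energy map for the whole image with the Sobel
// function (the loop and the function choice are illustrative only)
static void fillEnergyMap(struct pixel *imageVector, int imageWidth, int imageHeight) {
    for (int p = 0; p < (imageWidth * imageHeight); ++p) {
        imageVector[p].energy = getPixelEnergySobel(imageVector, imageWidth, imageHeight, p);
    }
}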

APPENDIX H

PIXEL DATA STRUCTURE — PIXEL.H

/**
 * pixel.h
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */
#ifndef PIXEL_H
#define PIXEL_H

struct pixel {
    // color channels
    int r;
    int g;
    int b;
    int a;
    // brightness and computed energy
    int bright;
    int energy;
    // cumulative seam values, horizontal and vertical
    int seamvalH;
    int seamvalV;
    // number of times the pixel was used by a seam
    int usecountH;
    int usecountV;
    // flags marking detected area boundaries
    int areaBoundaryH;
    int areaBoundaryV;
};

#endif
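A minimal sketch of how a pixel might be initialized from RGBA components follows; deriving bright as a plain channel average is an assumed convention for this example, not necessarily the one used elsewhere in the thesis code.

#include "pixel.h"

// Sketch: initialize a pixel from RGBA components; the brightness
// here is a simple channel average (an assumed convention)
static struct pixel makePixel(int r, int g, int b, int a) {
    struct pixel p = {0};
    p.r = r;
    p.g = g;
    p.b = b;
    p.a = a;
    p.bright = (r + g + b) / 3;
    return p;
}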

APPENDIX I

WINDOW DATA STRUCTURE — WINDOW.H

/**
 * window.h
 * Masters Thesis Work: OCR
 * Christopher Stoll, 2015
 */
#ifndef WINDOW_H
#define WINDOW_H

#include "libWrappers.c" struct window { int fullWidth; int fullHeight; int xOrigin; int yOrigin; int xLength; int yLength; int xTerminus; int yTerminus; int xStep; int yStep; int firstPixel; int lastPixel; int pixelCount; }; struct window *newWindow(int x, int y, int width, int height, \ int fullWidth, int fullHeight) { struct window *newWindow = (struct window*)xmalloc(sizeof(struct window));

    newWindow->fullWidth = fullWidth;
    newWindow->fullHeight = fullHeight;
    newWindow->xOrigin = x;
    newWindow->yOrigin = y;
    newWindow->xLength = width;
    newWindow->yLength = height;
    newWindow->xTerminus = x + width;
    newWindow->yTerminus = y + height;
    newWindow->xStep = 1;
    newWindow->yStep = fullWidth;
    newWindow->firstPixel = (newWindow->yOrigin * fullWidth) + newWindow->xOrigin;
    newWindow->lastPixel = (newWindow->yTerminus * fullWidth) + newWindow->xTerminus;
    newWindow->pixelCount = width * height;

    if (newWindow->xTerminus > newWindow->fullWidth) {
        printf("TODO: Handle this error -- xTerminus > fullWidth\n");
    }
    if (newWindow->yTerminus > newWindow->fullHeight) {
        printf("TODO: Handle this error -- yTerminus > fullHeight\n");
    }

    return newWindow;
}

void freeWindow(struct window *thisWindow) {
    if (thisWindow) {
        free(thisWindow);
    }
}

#endif
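The sketch below illustrates the intended traversal pattern: a window is opened over a subregion of a 640 by 480 image and the pixels it covers are visited row by row. The dimensions and the loop body are placeholders chosen for this example.

#include "pixel.h"
#include "window.h"

// Sketch: open a 100x50 window at (10, 20) in a 640x480 image and
// visit each pixel it covers (dimensions are placeholders)
static void walkWindow(struct pixel *imageVector) {
    struct window *w = newWindow(10, 20, 100, 50, 640, 480);
    for (int y = w->yOrigin; y < w->yTerminus; ++y) {
        for (int x = w->xOrigin; x < w->xTerminus; ++x) {
            int pixelIndex = (y * w->fullWidth) + x;
            imageVector[pixelIndex].energy = 0; // placeholder work on the windowed pixel
        }
    }
    freeWindow(w);
}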

APPENDIX J

UTILITIES — LIBMINMAX.C

/**
 * libMinMax.c
 * Masters Thesis Work
 * Christopher Stoll, 2014
 */
#ifndef LIBMINMAX_C
#define LIBMINMAX_C

static inline int max(int a, int b) {
    if (a > b) {
        return a;
    } else {
        return b;
    }
}

static inline int max3(int a, int b, int c) {
    if (a > b) {
        if (a > c) {
            return a;
        } else {
            return c;
        }
    } else {
        if (b > c) {
            return b;
        } else {
            return c;
        }
    }
}

static inline int min(int a, int b) {
    if (a < b) {
        return a;
    } else {
        return b;
    }
}

static inline int min3(int a, int b, int c) {
    if (a < b) {
        if (a < c) {
            return a;
        } else {
            return c;
        }
    } else {
        if (b < c) {
            return b;
        } else {
            return c;
        }
    }
}

#endif
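These helpers exist chiefly to serve the dynamic-programming step of seam carving, where each pixel extends the cheapest of its three upper neighbors. A one-line illustration of that relaxation step follows; the function and its argument names are hypothetical, not part of the thesis code.

#include "libMinMax.c"

// Hypothetical sketch of the seam-carving relaxation step min3 serves:
// a pixel's cumulative seam value extends its cheapest upper neighbor
static int seamStep(int pixelEnergy, int upLeft, int up, int upRight) {
    return pixelEnergy + min3(upLeft, up, upRight);
}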

/* https://keybase.io/stollcri/key.asc -----BEGIN PGP PUBLIC KEY BLOCK----- Version: Keybase OpenPGP v1.0.5 Comment: https://keybase.io/stollcri xsFNBFQjA+EBEACVzyTTITuDWvfeyb9+/KGHSC5cmmgjXYqWGpJf7046k7Hy/8Dv BHslcOeaaHQfDc5+qBB3R1J6NmX/C2+eRg6+jZ/SUSe1Hb+tknHuYtuK94m8/JTd Mx5et5bIFRbCHM0oW9XUHgP4s0GQKtVTL9kDkbMF1Z7SVlbY7ImIvmtIdmIOt+Ua ReWmnqwGEC9jY8Yppp+FZe9HZ0TYN5cGwaXuqVO44YtZEm1Pd0ApXqOoklc7tm+B iNlazFhffZv6X1/9tuhJs6UPGZG74zD3hc5e67u3pN0NuIJ/kcwy5iFchuuumEoF 4wv42JMa0jjkHc5Py0RYLwM056hQaRWItniUfX/a9ahDNmr2kHlToKXTXuO2pTNX 7qnIgkIy89dqVJMoN7KJLZ9THvwbxHq2SSrfejk6E9IvwrqG1278B5viu7aIYvNA 43d3fcqtsMTe3bvczKcSRdNfuHNg/a3VFoW3LEK6+OhCW2Q/U5EcoIxE5oFdgdWt N20VgVYlDYEmbYDVujD3gGLrMaYBWf1rR2QnWAgp9nvyZgrNYXejvv1oxe0jeoen keVsiUQYEEwUIdNB1DMaz6SgJoEeh12Bx9/eTKQzOoZMb0xAIc72oclCIKg+XeOd WsAuPGCLWC+nBO22H+4NAf9yPxbBsT88GW+l7na9JBp9jENcWsQNwuWhLQARAQAB zSlrZXliYXNlLmlvL3N0b2xsY3JpIDxzdG9sbGNyaUBrZXliYXNlLmlvPsLBbQQT AQoAFwUCVCMD4QIbLwMLCQcDFQoIAh4BAheAAAoJEA0Bqo9RsuPq3DUQAIJvfm6o 7arweMSr+dIv0TwQDU9ZsTGCsJ14+B75cROFWN5O4yuD+QxTVZdzcWOKDhENrfMx jgl04yMJU0vuwXMDml+ZTbmm18ihGcq0O5JXoDLWfYx0SJjJXz4NHEXuylYOaj7s DnoyQxT99A0vNTlG0DdnVv3WRYhYtYm4Dxe1TXT4jvgKR/QmsC51Ib0ogkxT9L8E rKLY6+KDytY1L1/eQ78W0fRz/6ojTKJ1ic9QPdtO/a/fQlUKG+FwIAuWBM25gPSI yYomnTFRfV9QvuXlaHocrLbXmTE9rU4v4V12C5zGjWw/uA5pQWCYGWKK9iMbAlql n+aSqMGH7yNHJ2wYC0jnfNSABRNtj5SqfPgxlDxlJdK0552n5e1h90wH3LFmexwE X8KMaXQLbujOH2+HhYWfB+zdFDcmw/rmks5VLX+yU619N9od7PILM+VfG4ptmyG4 KchqXhoyiDjZ1YlW+qU4w1FAM2d7bpb2M9vddWRm6s3BvAcJZBIs0DrRU7QNOcYv ePRsiYfpFIfK7VpPgMIS1gAwMxUmiBlpURIS7NfhAQJXZcXoIyQYG18+YpSkw4wN Pm3D3UPTf35ZtErFyqtSUsi7Z8lciFKIbXDvg5ImCYcSdBujTtZzB7obfJQc6L9v TWKW5nbfSeL68WQi0aS4hvSD7g1C5I5hFARVzsBNBFQjA+EBCAClpHdaE2TMMpqC HrjmYWAUD9wiklNzUtIElz4MduF3b0y5Z1rxL1Dg/+89UVZvpjYZqM+1d/jUPfjD 659Ln/2MoEBIv3hs7wCo7yBe75XbTsci2qhSPKjoTMOmTp18Lh3H+eOyanQqClPh rT3sbkIuEc1zKYtuYjQnL5dgG01TeorKFDH7vr/0E5XLxaV604n7ewKiEiqQnvQE pBnGGTBvsUDpoHWbXQrdG1QAYP5wug3NA020vHwCdicDWO8YzO5nMLgz1UkG9yCr uD6CyqY08aaXukZq+MsxY4kWepkR+bT9v37BqNPrZEopbBR7Yc4P6gPOdUzS9bD4 LVgN0B2dABEBAAHCwoQEGAEKAA8FAlQjA+EFCQ8JnAACGwIBKQkQDQGqj1Gy4+rA XSAEGQEKAAYFAlQjA+EACgkQnatmXo9ruwtehgf/YKj+EWHLIuCW8D7AerCqtQAz NBuXb2OgWs+O+k0WC+BaGXgztuOkPpcuvzBuOIv15cMdcRdOjd1pBqbuxeMxSYro 9XeYicBxyt419Ugmtk9GLfKkbX3gqi5/g+u/XHn2xtt39j9yTDlr2UgEJVBZE3zs VSwbFXz48b+MOok9rx7yZ83umkPWpSdzOgiP9LOxAgUHOA8+XJ1J61Cbrklbtd3H q3ceCZT2rsh1EKNApsFrxoIYRFfevlMh82rf4NADTyCfMNrzdvKXdys6JVzrXzqR OXueww3zDI45vB83EIxA2WDhH/5jJiNzLxGpfRE30O9aFeGQCYXtc+hsBlQ69rJ1 EACPeXzJxqRJUei7VXdcrICI8fGyVglJb0roA+PPWSbnJKsudgb/wYvbYNqJ48z6 xiv+Gn4W02b8k+jm3E+gYwhQtPy5eR6IdlWL+mjGYynwPgzp2Xz4hoY2/H87GVlu rSdx+L+Hvz2wi5tSlkThsF6RJZGTYyH9d5J2jCc9cfjb4FMW7K+xLM1QKGCTXX2r f5/JsL+YLIue0kL/7nNh7sOq4KwUJCEbd+veMyRX1zSR8fIbyjC1UvLYLh5mXU5a FkJuCAtj1m5LK2eohrBCxy1c69z7skf6jhm4gK64cipAI6ZKvMVMrFoinHW4qsLF Gnt86ffpBrS0wlwKDhZhCe3EOC5kUhA1OYKbBOwmRf2CbRWH6onfZjRQvf4llGq5 Ec8ybrKVXVCTvtxq+hrZw3m2uZ80JU/MH9a2HlLK8EY/9BrMe0hDmdqrrsE8g0Jz s2nULSIPjabLbIP7gacILvDKeR+iSWg+6EY+Ie4d3D8XeIeQxggDnpwlhT/0/NQJ RJMrSzpnetnHiudqmCT60/sWbNcx6uU38PcbaMIG0rP1uck6DE0S5lOb2m642Bjl icfxNXksD0OCx9NRkiM3X5tIEeaNipuYnp0cLSzDdFMhrj/tm77wV/ZHNO5rCLSR CPZWlWuv963RYr7mWvsHSWfVjeuCXqhAqA+FTSWtre7kB87ATQRUIwPhAQgA2uUr JucGJT15trmPAjC6HyD5KCqD2VZWEXFNNKiyPHksl/b1n65z/GgxNsxzHcTJmiEt sAGs8Ry1wLZ+hIhA1/MabsK/lSVSR6evsMCXT9cuBqU+MLYBhKrMEAT7FPo8+bX/ ApVsMFJmpkr+iaAIaAk9m56y1nIxMDk+q0/OinmM4mYf5ILRLlvQC4zBwsjMT4pM RopXJH+4flbphVYn1qZ8gBpFb25nKDE+s+DoxY27AKWd+sE+xv8Y2Glg8FHRl/os oZpTZnwS/DkHK5WDva7qNksqkVacYv4ZQ9ydRfOzSp9NObeT50zdlVMxrF7//2jN 
+Qrc6Af0vc8cEPE4nwARAQABwsKEBBgBCgAPBQJUIwPhBQkPCZwAAhsMASkJEA0B qo9RsuPqwF0gBBkBCgAGBQJUIwPhAAoJEMuYorwYXBG1jCMH/jeS8ir/qFngn1Jt EgKRd2KYdKEoYnJDKRjWLk1KilwpnQC/t2D4blBwSs0bKBtUjc40zQVtKiTwXjlA pYPQv+CSIGI+qizk27zkevEmRfK7+dFMWM5bZteSeKBA94VLB76QC28bg4Z0Fs2Y NCCedNUfr7MHK6I0LkBGNONEOaDPnXXVtnLPw/DhSLcpFRH3vvzPQs7h94cUv/qs z5AJG/xtMkURt02c9taZW+Qi7H9BpyMQu6UqB/IfoY3L+nojdCkoKJw4+6yK7/4X OjrJmXZuYYCbOkKvpSlS1KSrhVCK+2rOhsdXglRIkBdmA4dnMrPqa7W0ac7fSKOv gM/w5njD/A//X8WxcipDkWnAR+nLMB/v2lWtt57b8vwy1ExCqDof/QCHRdT5QPTL Do550uPhLYAbFmaPybF6/IyvMPaZfdNcKQbsH6nLvRVVNjpl6t6xIErV7sQImSFg RRt2A7sLX3hP66x+ZsnVlGwuKRCSCnydpL4jIiVUgTYozG4HmwhorfFKdw2g9gqX bUGQfQJoRmVjuDwJm5QyMGTh0s7x78u0aOrr1lBZZJIsjjF4c0JdrnR9xnKJy0sm C78znmYq1ljKUuU7LN9GAuIKHyMaW0UVwAADn7zxKCk8lfVGMlyYnhQqfZf7Skeh jA1Mw7U0A6B0b1fPKUxHJh3sAywJNu5qlEugHZJ7Ykj2x5ikFKOI5Q0qCMoiwQkR 4SLW0CDfOhH+ZoFQS2aI55JRyPpe3qeX8/uHJQHn3VM0R3VcKjHBvcI6H9RnYkgp KPFQfHD529evb6QUMh7gA9+rt901LBzOaiL98oXMW0q2h423HTiZJ9YBHZS6kme4 KxgqhaeI5k6Q7Gost5RZ7soZ/AIKpAXFA/TjYecv9+u+VGHE/E2W/dG51LaG+wPy iWDGG/VWbeqdDirPuWr9QuwFB6x2bsaY6wd216SUR7cl93xvPSG84RIAX4tjGKKB vhGjP7FHm8LGGzZc8AliVEjNe8Aw5Ev8kWmI0dJec0HAxzycTgNom/w= =Cz7t -----END PGP PUBLIC KEY BLOCK----- */
