Mathematical Expression Detection and Segmentation in Document Images
Jacob R. Bruce

Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Master of Science
In
Computer Engineering

A. Lynn Abbott, Committee Chair
Jianhua J. Xuan, Co-Chair
Michael S. Hsiao

February 12, 2014
Blacksburg, VA

Keywords: document layout analysis, optical character recognition, mathematical expression detection and segmentation, document image, type-specific layout analysis

Abstract

Various document layout analysis techniques are employed in order to enhance the accuracy of optical character recognition (OCR) in document images. Type-specific document layout analysis involves localizing and segmenting specific zones in an image so that they may be recognized by specialized OCR modules. Zones of interest include titles, headers/footers, paragraphs, images, mathematical expressions, chemical equations, musical notations, tables, and circuit diagrams, among others. False positive/negative detections, oversegmentations, and undersegmentations made during the detection and segmentation stage will confuse a specialized OCR system and thus may result in garbled, incoherent output. In this work, a mathematical expression detection and segmentation (MEDS) module is implemented and then thoroughly evaluated. The module is fully integrated with the open source OCR software, Tesseract, and is designed to function as a component of it. Evaluation is carried out on freely available public domain images so that future and existing techniques may be objectively compared.

Acknowledgments

I would like to thank Dr. Abbott for his helpful insights.
I'd also like to thank my lab-mates Sherin Aly, Sherin Ghannam, Ahmed Ibrahim, Mahmoud Sobhy, and Amira Youssef for keeping me company, teaching me a thing or two about their rich culture, and letting me try some great foods. I'd also like to thank the Virginia Tech Assistive Technology director, William Holbach, and coordinator, Hal Bracket, for their feedback and encouragement in the early stages of this project. Last but not least, I'd like to express my gratitude for my family and their unconditional support.

Table of contents

1 Introduction
  1.1 Enhancing Information Accessibility
  1.2 Introduction to OCR and Document Analysis: A Brief History
  1.3 Google Books Initiative
  1.4 Contributions of this Thesis
  1.5 Organization of Thesis
2 Literature Review
  2.1 The Beginnings of OCR
    2.1.1 Fixed-font
    2.1.2 Omnifont
  2.2 Pattern Recognition Techniques in OCR
    2.2.1 Text Line Finding
    2.2.2 Character Feature Extraction
    2.2.3 Character Classification
    2.2.4 Detection of Merged or Broken Characters
    2.2.5 Word Recognition and Linguistic Analysis
  2.3 Document Layout Analysis Techniques
    2.3.1 Introduction to Document Layout Analysis
    2.3.2 Preprocessing
    2.3.3 Document Structure Analysis
3 Method
  3.1 Introduction
    3.1.1 Purpose
    3.1.2 Problem Statement and Project Scope
    3.1.3 Definitions and Acronyms
    3.1.4 Tesseract Document Layout Analysis Framework Overview
    3.1.5 Overview
  3.2 System Overview
  3.3 System Architecture
  3.4 Component Design
    3.4.1 Groundtruth Dataset Generation
    3.4.2 MEDS Module
    3.4.3 Evaluation Module
4 Experimental Results
  4.1 Detector Parameter Selection and Training
  4.2 Final Evaluation
5 Conclusion and Future Work
Bibliography

List of Figures

Figure 1: This illustration, taken from a 1933 patent entitled “Statistical Machine” (P. W. Handel, "Statistical Machine," United States Patent Office, 1,915,993, Jun. 27, 1933), depicts one of the earliest OCR devices ever invented.
Used under fair use, 2014.
Figure 2: This excerpt from a 1901 New York Times article was optically recognized by ABBYY FineReader 8. The “New York Times” heading at the very top, the “Furniture and Home Furnishings” label embedded in the illustration, and the layout of the three columns at the bottom right were all incorrectly recognized by the commercial system. Contributor: Bob Stein (uploaded to http://archive.org), "New York Times August September 1901 Collection," Internet: http://archive.org/download/NewYorkTimesAugSept1901Collection/New_York_Times_August_September_1901_Part_7_text.pdf, Date Accessed: 2013. Used under fair use, 2014.
Figure 3: The above text includes excerpts from two different pages taken from a scan of an old textbook. On the left is a table followed by a paragraph of text, while on the right is a larger table. These images were extracted from a PDF which was digitized by Google and then made available in the public domain at http://www.archive.org. R. Griffin, Statistics. London: Macmillan and Co., 1913, pp. 121-122. Used under fair use, 2014.
Figure 4: Above is the text from Figure 3 after having been labeled by Tesseract's table detection software. The text within the blue rectangles was identified as belonging to a table while the text within red rectangles was not. The green rectangle should encompass the entire table figure. As can be seen, there are both false negatives and false positives.
Figure 5: Example of OCR results on text excerpt. On the left is an example of text that was scanned at 300 dpi from a calculus textbook. To the right is the output generated by the leading open source OCR engine, Tesseract.
Figure 6: An image of RCA's 1949 OCR system: M. Martin, "Reading Machine Speaks Out Loud," Popular Science, vol. 154, no. 2, Feb 1949, pp. 125-127. Used under fair use, 2014. The system was discontinued prior to completion due to its high costs.
Figure 7: The IBM 1401 System (Optical Character Reader not shown here). From left to right, the punch card reader/writer, mainframe, printer, and magnetic tape units. The IBM 1401 Demo Lab and Restoration Project Computer History Museum, "IBM 1401 Archive Pics," Internet: http://ibm-1401.info/IBM1401_ArchivePics.html, Date Accessed: 2014. Used under fair use, 2014.