INFORMATION EXTRACTION IN AN OPTICAL

CHARACTER RECOGNITION CONTEXT

by

Ramón Pereda

Bachelor of Science in Mathematics
Bachelor of Science in Computer Science
University of New Orleans
1991

Master of Science
University of Arizona
1993

Master of Science
National Technological University
1997

Master of Science
University of Texas
1999

A dissertation submitted in partial fulfillment
of the requirements for the

Doctor of Philosophy in Computer Science
School of Computer Science
Howard R. Hughes College of Engineering

Graduate College
University of Nevada, Las Vegas
May 2011

Copyright by Ramón Pereda 2011 All Rights Reserved

THE GRADUATE COLLEGE

We recommend the dissertation prepared under our supervision by

Ramón Pereda entitled

Information Extraction in an Optical Character Recognition Context

be accepted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science College of Engineering

Kazem Taghva, Committee Chair

Thomas Nartker, Committee Member

Laxmi Gewali, Committee Member

Ajoy Datta, Committee Member

Ashok Singh, Graduate Faculty Representative

Ronald Smith, Ph. D., Vice President for Research and Graduate Studies and Dean of the Graduate College

May 2011

ABSTRACT

Information Extraction in an Optical Character Recognition Context

by

Ramón Pereda

Dr. Kazem Taghva, Examination Committee Chair
Professor of Computer Science
University of Nevada, Las Vegas

In this dissertation, we investigate the effectiveness of information extraction in the presence of Optical Character Recognition (OCR) errors. It is well known that OCR errors have little effect on general retrieval tasks. This is mainly due to the redundancy of information in textual documents. Our work shows that the information extraction task is significantly influenced by OCR errors. Intuitively, this is because extraction algorithms rely on a small window of text surrounding the objects to be extracted.

We show that extraction methodologies based on Hidden Markov Models are not robust enough to deal with extraction in this noisy environment. We also show that both precise shallow parsing and fuzzy shallow parsing can be used to increase recall at the price of a significant drop in precision.

Most of our experimental work deals with the extraction of dates of birth and the extraction of postal addresses. Both of these specific extractions are part of general methods for the identification of privacy information in textual documents. Privacy information is particularly important when large collections of documents are posted on the Web.

TABLE OF CONTENTS

ABSTRACT...... iii

LIST OF TABLES...... vi

LIST OF FIGURES...... vii

CHAPTER 1 INTRODUCTION...... 1

CHAPTER 2 INFORMATION EXTRACTION AND RETRIEVAL...... 2
   Information Retrieval ...... 2
   Information Extraction ...... 3

CHAPTER 3 SCANNING, OCR, AND ELECTRONIC CONVERSION ...... 6

CHAPTER 4 EFFECTS OF OCR ERRORS ON RETRIEVAL ...... 10
   Introduction ...... 10
   Effects of OCR Errors on Boolean IR ...... 10
   Effects of OCR Errors on Probabilistic IR ...... 13
   Effects of OCR Errors on Vector Space Model IR ...... 15

CHAPTER 5 METHODS FOR INFORMATION EXTRACTION USING HIDDEN MARKOV MODELS ...... 21
   Introduction ...... 21
   Applications for Address Extraction ...... 22
   Hidden Markov Models ...... 23
   Training Hidden Markov Models ...... 26

CHAPTER 6 EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING HIDDEN MARKOV MODELS ...... 31
   OCR Related Difficulties ...... 32
   HMM Experiments ...... 34

CHAPTER 7 METHODS FOR INFORMATION EXTRACTION USING PRECISE SHALLOW PARSING ...... 37
   Introduction ...... 37
   Extracting Date of Birth Information ...... 37
   Finding Extraction Patterns ...... 40

CHAPTER 8 EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING PRECISE SHALLOW PARSING ...... 45
   Blind Testing of DOB Extractor Program ...... 45
   Precision and Recall Results ...... 46
   Making the Patterns Robust to OCR Errors ...... 47

CHAPTER 9 METHODS FOR INFORMATION EXTRACTION USING FUZZY SHALLOW PARSING ...... 50
   Introduction ...... 50
   Extraction of Date of Birth Information ...... 51
   Fuzzy Matching and High Recall ...... 52

CHAPTER 10 EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING FUZZY SHALLOW PARSING ...... 57
   Collecting Documents with DOB Information ...... 57
   Blind Testing on Both Clean and OCRed Text ...... 59
   Precision and Recall Results ...... 60

CHAPTER 11 SUMMARY AND CONCLUSION ...... 63
   HMM Information Extraction ...... 63
   IE using Precise Shallow Parsing ...... 64
   IE using Fuzzy Shallow Parsing ...... 65
   OCR Effect on IE ...... 68
   Conclusion ...... 69

BIBLIOGRAPHY ...... 70

VITA ...... 73

LIST OF TABLES

Table 4.1 Precision at standard recall points ...... 14
Table 4.2 Options for a weights ...... 18
Table 4.3 Options for b weights ...... 18
Table 4.4 Options for c weights ...... 18
Table 4.5 Average Precision for Correct and OCR Collections ...... 19
Table 5.1 Types of Shrinkages Possible For Each State ...... 28
Table 6.1 Precision, Recall, and F1 Score Results ...... 36
Table 8.1 Precision and Recall Results on 1075 Clean Text Documents ...... 46
Table 8.2 Precision and Recall Results on 1075 OCRed Documents ...... 46
Table 10.1 Precision and Recall Results on 50 Unseen Clean Text Documents ...... 60
Table 10.2 Precision and Recall Results on 50 Unseen OCRed Text Documents ...... 60

LIST OF FIGURES

Figure 3.1 Deteriorated printed page ...... 7
Figure 3.2 Poorly scanned page ...... 8
Figure 3.3 Typical scanned page ...... 9
Figure 4.1 Boolean Model: % Recall by Query and by Document ...... 12
Figure 4.2 Precision-Recall curve at 10 standard recall points ...... 15
Figure 4.3 Average Precision for Correct (c) and OCR (o) collections ...... 19
Figure 4.4 % Change in Average Precision for Correct-OCR ...... 20
Figure 5.1 Hidden Markov Model for Addresses ...... 24
Figure 5.2 HMM Hidden Symbols used for Address Extraction ...... 25
Figure 6.1 Tagging of HQZ.19880629.6150 ...... 31
Figure 6.2 Address in HQZ.19890203.1143 ...... 33
Figure 6.3 Resulting OCR of HQZ.19890203.1143 ...... 33
Figure 6.4 Address in HQZ.19880629.6150 ...... 34
Figure 7.1 Hits versus Span between name and birth date ...... 43
Figure 11.2 Precision Recall Plot of OCR Effects ...... 67

CHAPTER 1

INTRODUCTION

This dissertation presents an evaluation of information retrieval (IR) systems and information extraction (IE) systems in the presence of OCR errors. While IR systems not designed with OCR errors in mind perform robustly, IE systems do not. The IE systems need to take additional measures to counteract the OCR errors, and they do so with some measured success. Chapter 2 gives an overview of IR and IE. The major technical steps in creating OCRed documents are explained in Chapter 3. Chapter 4 summarizes key research results of the Information Science Research Institute (ISRI) regarding the effects of OCR on unmodified IR systems. The methods and OCR effects of Hidden Markov Models are explained and experimental results described in chapters 5 and 6. In chapters 7 and 8, the methods and OCR effects using Precise Shallow Parsing are presented. Chapters 9 and 10 do the same as the previous two, but for Fuzzy Shallow Parsing. Chapter 11 gives a graphical summary of the IE results and draws some conclusions.

CHAPTER 2

INFORMATION RETRIEVAL AND EXTRACTION

Information Retrieval

Information Retrieval (IR) is predominantly the science of searching for text, though it sometimes includes audio recordings, images, and video. The focus of this research is on text. IR has overlapping usage with data retrieval, document retrieval, and text retrieval. Baeza-Yates and Ribeiro-Neto (1999) claim that "despite its maturity, until recently, IR was seen as a narrow area of interest mainly to librarians and information experts". The Web brought IR to the forefront.

IR systems deal with natural language text, which is not always well-formed and is often semantically ambiguous. Small errors in an IR system are likely to go unnoticed and be forgiven. The user of an IR system is trying to get information about a subject or topic. That intent is expressed in a query. The IR system must return documents that are relevant to the query. Often the list of potential documents is so large that an exhaustive review is not practical; so, the list of returned documents must be ranked in order of relevance.

Preprocessing of the documents is a central theme in most IR systems. A data structure called an index is typically built from the documents to speed up searching. If we want to find all documents that contain a phrase or word, we could sequentially search all the documents, but an index allows for a much quicker search. A process called stemming is used to reduce words to their grammatical roots. For example, "fishing", "fished", "fish", and "fisher" all reduce to the root word "fish". Extremely common words are not distinctive enough for searching collections of any significant size. These common words are called stopwords. Examples are "a", "the", "should", "would", "but", and "because". These words are often removed from consideration during the search process. Other times, they might be allowed for full phrase searching, as in "to be or not to be". Identifying the key phrases in a document is far from a solved problem but is central to many IR systems. These are but a few of the many document processing steps that make up a successful IR system.
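As a concrete illustration of these preprocessing steps, the sketch below builds a tiny inverted index with stopword removal and a crude suffix-stripping stemmer. The stopword list, the suffix rules, and the sample documents are illustrative assumptions, not the stemmer or index used by any IR system discussed in this dissertation.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "should", "would", "but", "because", "to", "be", "or", "not"}

def crude_stem(word):
    # Toy suffix stripping: "fishing", "fished", "fisher" -> "fish".
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    # Map each stemmed, non-stopword term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOPWORDS:
                continue
            index[crude_stem(token)].add(doc_id)
    return index

docs = {1: "The fisher fished all day.", 2: "Fishing boats returned to port."}
index = build_index(docs)
print(sorted(index["fish"]))  # -> [1, 2]
```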

Information Extraction

Information extraction (IE) structures the text within documents into semantically useful information. Human language is marvelously complex, and so typical IE systems make incremental improvements towards understanding the information within documents. The simplest piece of information is an attribute-value pair, such as the company-name label and a particular company. An example attribute-value pair would be company-name("Joe's Bike Shop"). Named-entity extraction is the technical term for identifying these phrases in text with a useful label for the instance. The next level of abstraction is a relation, such as wrote(author, book). Wrote is the name of the relation, and author and book are the fields. This is conceptually the same relation notion used in a relational database, though relations might be stored using a completely different type of database. Particular instances of a relation are sometimes thought of as facts. So, wrote("Charles Darwin", "Origin of Species") is a fact that might be extracted from an encyclopedia entry. Some information can be readily synthesized given a collection of facts. For example, given the attribute values for key-sentence, these could be ordered and used as a crude but useful summary of a document. The relation friend(person1, person2) could be used to generate a social graph of a group of people. There are many useful and sophisticated models for building systems from relations. However, extracting relation instances remains the hard problem of IE.
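To make this data model concrete, the fragment below represents attribute-value pairs and relation instances as simple Python structures; the field names and the example facts are illustrative and not drawn from any system described in this dissertation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeValue:
    attribute: str
    value: str

@dataclass(frozen=True)
class Relation:
    name: str
    fields: tuple  # e.g. (author, book)

# A named entity extracted from text.
company = AttributeValue("company-name", "Joe's Bike Shop")

# A relation instance ("fact") extracted from an encyclopedia entry.
fact = Relation("wrote", ("Charles Darwin", "Origin of Species"))

print(company)
print(fact)
```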

Documents are largely invisible to many information systems. Consider viewing online classified ads as a large collection of small documents. Reading 50 ads might take half an hour. The useful attributes and relations are buried inside of complex but not random text. Compare that task to the task of online shopping at Amazon.com, or better yet Zappos.com. Shopping with a database is a much more efficient task. In seconds, you can find out if there is a size nine, extra wide, low-profile running shoe. Detailed searching inside of classified ads, such as those on eBay or Craigslist, is not as scalable a task.

Tim Berners-Lee, an inventor of the world-wide web, is leading a project called the Semantic Web (Bizer, Heath, & Berners-Lee, 2009). The Web started out as a web of documents. These documents are intended for human consumption. The Web of Data is a growing part of the Web that has billions of assertions, or bits of data, linked together. These links are intended to be reliable and machine readable. The Semantic Web project is establishing best practices for publishing and connecting data on the web. IR systems, a.k.a. search engines, will adapt to the evolving web. IE systems will automatically sift out the data of documents, allowing the data and not just the document to be published.

CHAPTER 3

SCANNING, OCR, AND DIGITIZING

The process of converting a physical printed item into an electronic document is commonly referred to as digitizing. This process is more commonly done in reverse, in the form of printing an electronic document. Conceptually, digitizing can be divided into three stages: page scanning, block zoning, and text recognition.

Scanning means taking a special digital picture of a sheet of paper. This picture is taken with an optical scanner, which senses variations in light intensity that are used to construct a bit-map image. This image is often black and white, with 1's representing black pixels and 0's representing white pixels.

Block zoning is the process of dividing up the page into different logical areas (blocks), which are typically rectangular. These blocks must be ordered to preserve the anticipated reading order. As a simple example, if the page is formatted in two columns, the page must be divided into two blocks, with the left one ordered before the right one for English text.

Text recognition is the process of converting text blocks into computer-encoded characters, no different than if they had been hand-typed into an electronic document. Typically, the portion of the image identifying a single character is individually segmented and translated into its computer-encoded character equivalent. For state-of-the-art OCR systems, the accurate recognition of English Latin-script text from high quality scanning is comparable to what humans can do. The state of the art has not advanced much in the last decade, but that art is also somewhat of a black art in that the best software applications are commercial and closed to outside technical inspection. The best open source solution for studying detailed OCR systems is perhaps Tesseract (2010), which was given to Google by ISRI.

Many errors that are generated by OCR systems can be traced back to low quality scanning or deteriorated printed materials. Sometimes the original physical paper is no longer available to re-do the scanning. Below are two examples of poor quality scans and one example of a reasonable scan.

Figure 3.1 Deteriorated printed page. Notice the broken characters.


Figure 3.2 Poorly scanned page. Notice the page is not aligned and the left side of the page is bent back.


Figure 3.3 Typical scanned page.

CHAPTER 4

EFFECTS OF OCR ERRORS ON RETRIEVAL

Introduction

In this chapter, we review three information retrieval (IR) systems working with OCRed document collections. The metrics used to compare the result set for the OCRed collections with the result set for the corrected collections are different for each IR system. Part of this is due to the fact that the probabilistic and vector space model systems produce a ranking of the results; the boolean system produces an unordered result set. The most detailed level of granularity is used in the metrics for the probabilistic system, which is tested on three document collections of varying OCR quality. For the vector space model, only a summarized and condensed version of the results is presented. However, there were ten variations of the vector space model's key parameters used in the study. All three of these studies are covered in the paper by Taghva, Borsack, and Condit (1996b).

Effects Of OCR Errors On Boolean IR

A document collection from the Department of Energy containing 204 documents was given to the researchers (Taghva, Borsack, Condit, & Erva, 1994). These scientific documents contained formulas, graphs, photos, and maps. There were sixteen major subject areas. A level of 99.8% character accuracy was reported for the collection.

The collection was loaded into the BASIS Plus information retrieval system (Delphi Consulting Group, 1991). It was originally developed by Information Dimensions Inc. of Dublin, Ohio, which was acquired by Open Text Corporation in 1998. Boolean logic with an inverted file model has been (Salton, 1989), and still is, the most widely used technology, as seen in the Google search engine (Brin & Page, 1998). Querying this type of information retrieval system is so popular it has become an English verb: googling.

There were 205 unique search terms for the seventy-one queries. The average number of terms per query was five. If we count the number of queries that return exactly the same result set of documents, we have 63. That means 88.8% of the queries return identical results.

If we take a closer look at the result sets and count the number of correct documents retrieved across all the queries, 617 out of a possible 632, the results are even closer. That means 97.6% of the documents are retrieved. We consider a recall of 95% or better statistically the same as the 100% recall for the clean text.


Figure 4.1 Boolean Model: recall by query and by document

All the statistical studies in this dissertation assume a level of significance (α) of 5%. Choosing the level of significance is somewhat arbitrary, but for many IR and IE studies, a level of 5% is the conventional choice. A more formal approach would involve measuring the signal-to-noise ratio and considering sample size to determine the most precise level of confidence.

Effects Of OCR Errors On Probabilistic IR

Another document collection from the Department of Energy, this time containing 674 documents, was given to the researchers. These documents were recognized by three different OCR devices that varied in their level of character accuracy: the highest accuracy (98.14%), an intermediate accuracy (97.06%), and the lowest level of accuracy (94.63%). These three document collections were separately loaded into a probabilistic information retrieval system. This system, called INQUERY, was developed at the University of Massachusetts (Callan, Croft, and Harding, 1992).

The probabilistic approach uses inference networks to model query and document content based on concepts. Roughly, concepts correspond to terms. Concept associations are based on document-term frequency and collection-term frequency. The formula below is used to assign concept c to document d:

P(c | d) = α + (1 − α) · (tf / max_tf) · (log(N / n) / log N)

where

N ≡ collection size
n ≡ number of documents in which concept c occurs
tf ≡ frequency of concept c in document d
max_tf ≡ maximum concept frequency in document d

The parameter α serves to bias the initial belief of term relevance and can be adjusted by the user to fit the collection. For this study, the default value α = 0.4 was used.
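A minimal sketch of the belief computation described above; the function name is ours, and the formula is the reconstructed default belief with bias α = 0.4, so it should be read as an approximation of INQUERY's scoring rather than its actual implementation.

```python
import math

def belief(tf, max_tf, df, collection_size, alpha=0.4):
    """Belief that a document is about a concept, per the formula above.

    tf: frequency of the concept in the document
    max_tf: maximum concept frequency in the document
    df: number of documents in which the concept occurs
    collection_size: number of documents in the collection
    """
    tf_part = tf / max_tf
    idf_part = math.log(collection_size / df) / math.log(collection_size)
    return alpha + (1.0 - alpha) * tf_part * idf_part

# Example: a concept occurring 3 times in a document whose most frequent
# concept occurs 10 times, appearing in 50 of 674 documents.
print(round(belief(3, 10, 50, 674), 3))
```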

Each query produced a ranked list of documents for both the correct data and the OCR data. We compute the precision at 10 standard recall points. For example, to compute the precision at recall level 10%, we see how far down the ranked result set we need to go to achieve 10% recall. Suppose there are a total of 100 relevant documents, and the top 20 documents contain 10 relevant documents. Then the top 20 documents have a 10% recall and a 50% precision, because only half of the documents are relevant. The table below shows the precision and recall results for the four document collections.

Table 4.1 Precision at standard recall points

Recall     Correct Data   OCR Best   OCR Middle   OCR Worst
10         53.5           53.1       52.9         54.0
20         46.0           44.8       45.7         47.6
30         38.6           38.9       40.8         39.7
40         33.8           33.8       35.3         35.6
50         30.4           30.5       31.0         31.9
60         23.3           24.6       24.6         24.7
70         17.8           17.3       17.5         18.5
80         14.7           13.5       13.2         14.1
90         12.7           11.1       10.8         11.6
100        11.8           10.5       10.2         10.6
Average    28.3           27.8       28.2         28.8
% Change   —              -1.6       -0.2          2.0
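The sketch below shows one way to compute precision at a fixed recall level from a ranked result list, following the worked example above (10 relevant documents among the top 20, out of 100 relevant overall, gives 10% recall at 50% precision). The function name and the toy ranking are our own.

```python
def precision_at_recall(ranked_relevance, total_relevant, recall_level):
    """Walk down the ranking until `recall_level` of the relevant documents
    have been seen, then return precision at that cutoff.

    ranked_relevance: list of booleans, True if the i-th ranked doc is relevant
    total_relevant:   number of relevant documents in the whole collection
    recall_level:     e.g. 0.10 for the 10% recall point
    """
    needed = recall_level * total_relevant
    found = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            found += 1
        if found >= needed:
            return found / rank
    return 0.0  # recall level never reached

# Toy ranking: 10 relevant docs in the top 20, 100 relevant docs overall.
ranking = ([False, True] * 10) + [False] * 80
print(precision_at_recall(ranking, total_relevant=100, recall_level=0.10))  # 0.5
```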


Figure 4.2 Precision-Recall curve at 10 standard recall points

All four curves are nearly identical, and the differences are so small as to be considered insignificant from an information retrieval perspective. If the percent change in average precision between the correct data and any of the OCR collections had been greater than 5%, then we would have considered the difference to be significant.

Effects Of OCR Errors On Vector Space Model IR

Another document collection from the Department of Energy, this time containing 674 documents, was given to the researchers. Two versions of this single collection were generated: one an OCR-generated collection with an 80-90% level of character accuracy, and another that was a manually corrected version of these same documents. Roughly the same set of queries as was used in the study of the probabilistic IR system was used for this study.

In the Vector Space Model, every document (or query) is represented by a corresponding vector

D = (d_1, d_2, ..., d_t)

where each weight d_i is non-zero if the corresponding term appears in the document (or query). An inner product can provide a scalar value representing the similarity between two documents. The SMART IR system uses a three-part weighting scheme, with each weight having a document and a query component (Salton & Buckley, 1988):

1. Term Frequency Component: considers term frequency with respect to a single document.

2. Collection Frequency Component: considers term frequency with respect to the entire collection.

3. Vector Normalization Component: treats long and short documents with some equality.

The overall dot product between a query vector Q = (q_1, ..., q_t) and a document vector D = (d_1, ..., d_t) is

similarity(Q, D) = Σ_{k=1}^{t} q_k · d_k

where each q_k and d_k is the product of the three weight components described above.

Before explaining the different options for each of the three weights, some definitions are needed.

• term frequency (tf): the total number of occurrences of a term in a given document
• collection frequency (cf): the total number of occurrences of a term in the collection
• document frequency (df): the total number of documents containing a given term

Below are three tables with the options for each of the three components of the weighting scheme. Within the SMART system, the user is free to choose any of these combinations, but some have proven to be superior for collections with particular idiosyncrasies (Salton & Buckley, 1988). With four options for the first component, four for the second, and three for the third, there are 4 × 4 × 3 = 48 possible weighting combinations for a single vector, and 48 × 48 when the document and query schemes are chosen independently.

A carefully selected ten of those combinations were chosen for this study.

Unfortunately, the history of the selection criteria is not readily available.

The SMART system has a concise notation for labeling a particular weighting scheme: xxx.xxx. The three characters to the left of the period represent the weighting combination used for the document, and the three to the right, for the query. For example, atc.atn applies augmented normalized term frequency (the first a), inverse document frequency (the first t), and cosine normalization (the c) to the document vector, and augmented normalized term frequency (the second a), inverse document frequency (the second t), and no normalization (the n) to the query vector.
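Below is a rough sketch of how one xxx.xxx scheme, atc, could be applied to a document's term frequencies, assuming the standard Salton–Buckley definitions of augmented term frequency, inverse document frequency, and cosine normalization; the helper names are ours and the exact formulas used in the original study may differ.

```python
import math

def atc_weights(term_freqs, doc_freqs, num_docs):
    """Augmented tf (a), inverse document frequency (t), cosine normalization (c).

    term_freqs: {term: frequency in this document}
    doc_freqs:  {term: number of documents containing the term}
    num_docs:   size of the collection
    """
    max_tf = max(term_freqs.values())
    weights = {}
    for term, tf in term_freqs.items():
        aug_tf = 0.5 + 0.5 * tf / max_tf              # 'a' component
        idf = math.log(num_docs / doc_freqs[term])    # 't' component
        weights[term] = aug_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # 'c' component
    return {term: w / norm for term, w in weights.items()}

doc = {"reactor": 5, "waste": 2, "storage": 1}
dfs = {"reactor": 30, "waste": 120, "storage": 200}
print(atc_weights(doc, dfs, num_docs=674))
```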

Table 4.2 Options for a weights

Abbreviation   Description            Formula
N              none                   1
A              augmented normalized   0.5 + 0.5 · tf / max_tf
L              natural logarithm      ln(tf) + 1
S              square                 tf²

Table 4.3 Options for b weights

Abbreviation   Description             Formula
N              none                    1
T              inverse doc frequency   log(N / df)
P              probabilistic           log((N − df) / df)
S              square                  log(N / df)²

Table 4.4 Options for c weights

Abbreviation   Description   Formula
N              none          1
T              sum           1 / Σ w_i
C              cosine        1 / sqrt(Σ w_i²)

Table 4.5 Average Precision for Correct and OCR Collections

Weighting   Correct Data   OCR Data   % Difference
nnn.atn     0.2457         0.2497      1.63
ann.atn     0.3222         0.3308      2.67
lnn.atn     0.3311         0.3421      3.32
snn.atn     0.2053         0.2006     -2.29
ntn.atn     0.2898         0.2881     -0.59
npn.atn     0.3110         0.3153      1.38
nsn.atn     0.3305         0.3269     -1.09
nns.atn     0.2913         0.2873     -1.37
nnc.atn     0.3500         0.3381     -3.40
atc.atn     0.2228         0.2235      0.31


Figure 4.3 Average Precision for Correct (c) and OCR (o) Collections


Figure 4.4 % Change in Average Precision for Correct-OCR

The order of the different schemes has no special meaning. If any of the bars had been above or below the ±5% marks, then we would have considered the differences significant.

CHAPTER 5

METHODS FOR INFORMATION EXTRACTION

USING HIDDEN MARKOV MODELS

Introduction

In this chapter we describe the development and effectiveness of a Hidden Markov Model (HMM) based extraction program to identify potentially private addresses. We found that HMMs provide high recall and precision for this task. However, we also discovered that optical character recognition (OCR) errors degraded the performance of the program. This last result is somewhat surprising given that document retrieval has been shown, on average, not to be so affected (Taghva, Coombs, Pereda, & Nartker, 2005). We also discovered that for our task, the smoothing technique called shrinkage did not improve our results.

We will first describe the problem our HMM is designed to solve in the next section. We discuss previous attempts to extract information such as addresses using HMMs. Presented thereafter is a standard definition of HMMs and the details of our implementation of the address-finding HMM. We describe how the HMM was trained. We next discuss some problems resulting from errors in the OCR processing of training documents. Finally, we describe the construction and outcomes of experiments measuring the effectiveness of several variations on our address-finding HMM for various types of documents.

Applications for Address Extraction

The United States Department of Justice (1974) established a code of fair information practices that governs the collection, maintenance, use, and dissemination of personally identifiable information about individuals. According to the Privacy Act of 1974, the United States Government cannot make public private information about American citizens. However, to comply with the Freedom of Information Act, the Federal Government must make many of its documents and records available to the public. These documents, therefore, must be reviewed to determine if they contain private information. Since the number of documents to be reviewed is very large, and human review is costly and time-consuming, it is clearly advantageous to seek automatic methods for identifying likely private information in documents.

For certain types of private information, such as social security numbers, a simple regular expression matching technique proved effective for the majority of cases. However, the prospects of such a solution for identifying addresses seemed unlikely as we reviewed the data and noted the wide variation in address formats. In addition, many documents had been OCRed and some noise had been introduced into the target addresses. Since Hidden Markov Models have been applied to problems similar to ours with some success, we decided to explore the use of an HMM to identify personal addresses (Bikel, Miller, & Weischedel, 1997; Leek, 1997).

Our task is to identify a set of documents such that the likelihood of them containing private information is extremely low. In other words, we want to locate the "easy" examples of non-private documents so that they can be released in accordance with the Freedom of Information Act.

The documents that we place in the "private" category will very likely still include documents without private information. However, these more difficult cases will be subject to a manual review by human experts. Our automatic process aims to reduce the number of documents requiring human review but does not eliminate the need for such a review completely.

Thus, as we will see, the worst mistake for us is a false negative: a document containing private information in which our HMM finds none. We can accept, within reason, a higher rate of false positives if doing so decreases false negatives. We can accept losses in precision to boost recall.

Hidden Markov Models

A Hidden Markov Model is a finite state automaton with probabilistic transitions and symbol emissions. An HMM consists of:

• A set of states S.

• An emission vocabulary V.

• For each state, a probability distribution over emission symbols, where the probability that a state s emits symbol w is given by P(w | s).

• For each state, a probability distribution over the set of possible outgoing transitions. The probability of moving from state s to state s′ is given by P(s′ | s).

• Finally, a subset of the states are considered start states, and each of these is associated with an "initial" probability that the model begins in that state.

Figure 5.1 Hidden Markov Model for Addresses

Following the model of Freitag and McCallum (1999), our HMM has four types of states. Background states represent symbols that are not part of an address nor part of the "context" of one. We defined the "context" to include the four terms in the document prior to and the four terms subsequent to the address. The prior four terms correspond to what we call the four prefix states of an address, and the following four terms in the context are four suffix states. Target states represent the parts of the address. Figure 5.1 provides an illustration of the structure of our HMM. More specifically, figure 5.1 illustrates the structure of our "clean" HMM, as discussed further in chapter 6.


Figure 5.2 HMM Hidden Symbols used for Address Extraction

Figure 5.2 displays how we defined the emission vocabulary for the HMM. As in Bikel (1997), vocabulary identification had an ordering, which is reflected in the list in our figure. This means that a term is first scanned to determine if it ends in a comma. If so, the comma is removed and the comma itself is represented in the HMM with the distinct symbol comma. Next, the remainder of the term is checked to determine if it is a five or nine digit number with an optional hyphen after the fifth digit. In such a case the term is represented in the HMM with the symbol ziplike. The symbol colon is only added if a term ends in a colon prior to the removal of a final comma. If it is not such a number, the term is checked to see if it is phonelike, and so on.

Terms were placed in this taxonomy in two ways. Either a straightforward regular expression match was used, as in the case of ziplike and phonelike, or a term was matched against a database of known objects. For example, a very short list of city names most relevant to the documents was generated. This list took into consideration that most of the documents concerned government employees and contractors from government offices and businesses dealing with nuclear power.
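A simplified sketch of the ordered taxonomizing step: each term is tested against a sequence of rules (comma, ziplike, colon, phonelike, a small city list) and mapped to the first matching symbol, falling back to the raw token. The regular expressions, the symbol set, and the tiny city list are illustrative assumptions rather than the exact rules used in the dissertation's system.

```python
import re

CITIES = {"las vegas", "reno", "carson city"}  # illustrative, not the project's list

ZIPLIKE = re.compile(r"^\d{5}(-?\d{4})?$")
PHONELIKE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

def taxonomize(term):
    """Map a raw term to the emission symbols it contributes, in rule order."""
    symbols = []
    if term.endswith(","):
        term = term[:-1]
        trailing = ["comma"]                  # trailing comma becomes its own symbol
    else:
        trailing = []
    if ZIPLIKE.match(term):
        symbols.append("ziplike")
    elif term.endswith(":"):
        symbols.append("colon")
    elif PHONELIKE.match(term):
        symbols.append("phonelike")
    elif term.lower() in CITIES:
        symbols.append("city")
    else:
        symbols.append(term.lower())          # fall back to the literal token
    return symbols + trailing

print(taxonomize("89154-4019,"))  # ['ziplike', 'comma']
print(taxonomize("Reno"))         # ['city']
```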

After a document is converted to symbols by this taxonomizing process, addresses are discovered by applying the well-known Viterbi algorithm to the resulting string of symbols. As explained in Rabiner (1993), the Viterbi algorithm uses dynamic programming techniques to determine the most likely state sequence for the document. Of interest to us are those subsequences consisting of target states, since they represent the most likely addresses.
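The following is a compact, generic Viterbi decoder over log probabilities, included only to make the decoding step concrete; the dictionary-based parameterization and the toy two-state model are our own and are not the dissertation's implementation.

```python
import math

def viterbi(symbols, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a sequence of emission symbols.

    start_p[s], trans_p[s][s2], emit_p[s][sym] are probabilities; unseen
    combinations are treated as probability zero.
    """
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log prob of best path ending in s at time t, previous state)
    best = [{s: (logp(start_p.get(s, 0)) + logp(emit_p[s].get(symbols[0], 0)), None)
             for s in states}]
    for sym in symbols[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + logp(trans_p[p].get(s, 0))) for p in states),
                key=lambda x: x[1])
            row[s] = (score + logp(emit_p[s].get(sym, 0)), prev)
        best.append(row)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

states = ["background", "target"]
start = {"background": 0.9, "target": 0.1}
trans = {"background": {"background": 0.8, "target": 0.2},
         "target": {"background": 0.3, "target": 0.7}}
emit = {"background": {"word": 0.9, "ziplike": 0.1},
        "target": {"word": 0.3, "ziplike": 0.7}}
print(viterbi(["word", "ziplike", "ziplike"], states, start, trans, emit))
```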

Training Hidden Markov Models

Training for the HMM was based on tagged documents which were marked by human experts with the Information Science Research Institute's (ISRI) META-Marker tool (Taghva, 2004). The topology of transitions as illustrated in figure 5.1 was predetermined, except for the lengths of the target sequences, which were determined by the marked data. Figure 5.1 is somewhat simplified for aesthetic purposes. Transitions from suffix states to any of the prefix states were allowed, and a final target state could transition to a prefix state if addresses appeared close together in the training data.

Transition probabilities were estimated by Maximum Likelihood:

P(s′ | s) = (number of transitions from state s to state s′) / (total number of transitions out of state s)

Initial probabilities representing the beginning of the document were set by hand. The initial probability was divided uniformly over the background state, all prefix states, and the first target state.

The emission probability table is estimated with Maximum Likelihood supplemented by smoothing. Smoothing is required because Maximum Likelihood estimation will sometimes assign a zero probability to unseen emission-state combinations.

Prior to smoothing, emission probabilities are estimated by:

P_ml(w | s) = (number of times symbol w is emitted at state s) / (total number of symbols emitted by state s)

We used absolute discounting to smooth emission probabilities. Absolute discounting consists of subtracting a small amount of probability d from all symbols assigned a non-zero probability at a state s. Probability is then distributed equally over symbols given zero probability by the Maximum Likelihood estimate.

If v is the number of symbols assigned non-zero probability at a state and T is the total number of symbols, emission probabilities are calculated by

P(w | s) = P_ml(w | s) − d            if P_ml(w | s) > 0
P(w | s) = d · v / (T − v)            otherwise

Freitag (1999) reports that there is no known solution for determining the optimal value for d. We followed Borkar (2001) in choosing d as a function of the total number of symbols emitted by state s, i.e., the denominator of P_ml(w | s).
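A small sketch of the absolute-discounting step, under the assumption that the discount d is given; the vocabulary and counts are toy values, and the redistribution follows the reconstruction above (mass d · v spread evenly over unseen symbols).

```python
def smooth_emissions(counts, vocabulary, d):
    """Absolute discounting of a state's emission distribution.

    counts:     {symbol: times the state emitted it} (maximum-likelihood counts)
    vocabulary: all symbols the HMM can emit
    d:          small probability mass subtracted from each seen symbol
    """
    total = sum(counts.values())
    seen = {w for w, c in counts.items() if c > 0}
    unseen = set(vocabulary) - seen
    probs = {}
    for w in vocabulary:
        if w in seen:
            probs[w] = counts[w] / total - d
        else:
            # Redistribute the collected mass d * |seen| over the unseen symbols.
            probs[w] = d * len(seen) / len(unseen)
    return probs

vocab = ["ziplike", "phonelike", "comma", "city", "word"]
probs = smooth_emissions({"ziplike": 6, "comma": 3, "word": 1}, vocab, d=0.01)
print(probs, round(sum(probs.values()), 6))  # distribution still sums to 1.0
```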

In addition to absolute discounting, Freitag and McCallum (1999) propose applying a more elaborate smoothing technique called shrinkage. According to this approach, various alternative distributions are defined for the emission probability P(w | s) of each state s. The default distribution is absolute discounting. Further, each probability can "shrink" to its parent distribution. For example, the target parent is estimated by dividing the number of times a symbol is emitted by any target state by the total number of symbols emitted by any target state. The non-target ancestor shrinks to the common ancestor state of all background, prefix, and suffix states. The context grandparent shrinks prefix and suffix states to a state common to all prefixes and suffixes. Finally, each distribution can shrink to the uniform distribution.

Table 5.1 Types of Shrinkages Possible For Each State

state type    default   parent   context grandparent   non-target ancestor   uniform
background    ✓         ✓                              ✓                     ✓
target        ✓         ✓                                                    ✓
suffix        ✓         ✓        ✓                     ✓                     ✓
prefix        ✓         ✓        ✓                     ✓                     ✓

For each symbol-state combination, the final emission probability P(w | s) is determined by a weighted sum of the estimates from the table above. For example, a given suffix state will have a default probability, a suffix parent, a context grandparent, a non-target ancestor, and the uniform distribution probabilities associated with it, and each of these will be assigned a weight λ_i. We thus define

P(w | s) = λ_1 P_1(w | s) + λ_2 P_2(w | s) + ... + λ_k P_k(w | s),   with Σ_{i=1}^{k} λ_i = 1.

In the case of a suffix state, for example, P_1(w | s), ..., P_5(w | s) would represent its default, suffix parent, context grandparent, non-target ancestor, and uniform probabilities. The value k is the number of alternative probabilities for a given type of state. For a suffix state, k = 5.

Optimal values for the λ weights are discovered using Expectation Maximization. Following Freitag (1999), we initialize the λ's to 1/k. Holding out each symbol in turn in a hold-out set H_j, we perform the following two steps until the values of all λ's no longer change:

E-step: Determine how much each of the alternative distributions predicts the symbols from the held-out set for each state s:

β_i = Σ_{w ∈ H_j} [ λ_i P_i(w | s) / Σ_{m=1}^{k} λ_m P_m(w | s) ]

M-step: Derive new λ's by normalizing the β's:

λ_i = β_i / Σ_{m=1}^{k} β_m
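A sketch of the iteration just described for one state: given the k alternative distributions and a held-out set of symbols, the E-step accumulates how much each distribution predicts the held-out symbols and the M-step renormalizes. The data structures and the stopping test are our own simplifications.

```python
def estimate_lambdas(distributions, held_out, iterations=100, tol=1e-9):
    """EM for the shrinkage mixture weights of one state.

    distributions: list of dicts, each mapping symbol -> probability
    held_out:      list of held-out symbols emitted by the state
    """
    k = len(distributions)
    lambdas = [1.0 / k] * k                      # initialize to 1/k
    for _ in range(iterations):
        # E-step: expected share of each distribution for the held-out symbols.
        betas = [0.0] * k
        for w in held_out:
            mix = sum(l * d.get(w, 0.0) for l, d in zip(lambdas, distributions))
            if mix == 0.0:
                continue
            for i in range(k):
                betas[i] += lambdas[i] * distributions[i].get(w, 0.0) / mix
        # M-step: new lambdas by normalizing the betas.
        total = sum(betas)
        new = [b / total for b in betas] if total else lambdas
        if max(abs(a - b) for a, b in zip(new, lambdas)) < tol:
            return new
        lambdas = new
    return lambdas

default = {"ziplike": 0.7, "word": 0.3}
uniform = {"ziplike": 0.5, "word": 0.5}
print(estimate_lambdas([default, uniform], ["ziplike", "ziplike", "word"]))
```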

Finally, the output of our HMM consists of sequences of strings which are likely addresses. In fact, the HMM will sometimes discover non-addresses, and the HMM has no concept of a "private" address. For these reasons, filters had to be introduced to remove obvious non-addresses and known "public" addresses.

CHAPTER 6

EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING

HIDDEN MARKOV MODELS

OCR Related Difficulties

Several difficulties occurred in OCR documents which affected the training for the HMM. Roughly, they can be divided into three categories: text ordering not matching human visual ordering, information loss during OCR, and handwritten addresses.

Figure 6.1 Tagging of HQZ.19880629.6150

The three major difficulties are detailed below.

While the META-marker tool allowed experts to mark zones on the document image which were derived from OCR information, the ordering of the text in the OCR output did not always correspond to the ordering inferred by the human expert viewing the image. Figure 6.2 shows a redacted private address in the context of handwritten text. Figure 6.1 reveals how the OCRed text was tagged. (The figures are presented out of order due to formatting requirements of this document. Conceptually, the processing proceeds from image, to OCR, and to tagging, which corresponds to figures 6.2, 6.3, and 6.1.) Prefix-tagged terms were separated from target-tagged terms by background-tagged terms, and the suffix-tagged terms were taken out of order. In fact, the term tagged as "suffix3" was placed in the midst of background-tagged terms, and a single address was tagged as two distinct addresses. Obviously, there is only a tenuous correspondence between the tagged address and the "ideal" topology of figure 5.1. "X" is used in figure 6.1 to redact private information.


Figure 6.2 Address in HQZ.19890203.1143

Figure 6.3 Resulting OCR of HQZ.19890203.1143

In some documents, addresses viewable in the document images were either partially or completely lost in the OCR process. Useful context information would occasionally be lost as well. Figure 6.3 and its corresponding OCR in figure 6.1 provide an example of lost information.

In some documents, handwritten addresses appeared which were not recognized by the OCR. This can be seen in figure 6.4. Several documents were simply removed from our training and testing sets because recognizing handwriting appears to us to require a different approach entirely.


Figure 6.4 Address in HQZ.19880629.6150

HMM Experiments

The first two problems noted in the previous section bring up the question of whether the tagging should be "cleaned up" to reflect our preconceived ideas about how the HMM should be structured, or whether it should be left "as is" to reflect the reality of the data.

We decided to compare the performance of an HMM trained on tagged documents which contained “incorrect” tags with an HMM trained only on documents where the tags fit our assumed topology illustrated in figure 5.1.

Human experts identified 251 documents and emails containing addresses subject to the Privacy Act in a large collection of documents from various government agencies. Addresses in 187 of these documents were tagged with the META-marker tool. Documents with handwritten addresses were removed from the 187, and 89 were randomly selected for training what we call the default HMM. Some of these 89 documents contain "incorrect" tagging due to the OCR difficulties described in the previous section.

A second training set was derived from the first by removing all examples which contained "incorrect" tagging. The HMM based on this training set was called clean.

A test set of 614 documents and emails was constructed from (1) a random sample from the 251 documents containing private addresses, with hand-written examples removed, (2) a random sample of emails from a large (more than 20,000) collection of government emails, and (3) a random sample from 2,000 government documents.

Three experiments were performed. In the first, the default HMM was run on the test set. In the second, the clean HMM was run against the same data. Finally, the emission table for the clean HMM was optimized using shrinkage.

The standard performance measures of precision, recall, and F1 were calculated for each experiment. Let TP be the number of true positives, that is, the number of documents which both experts and the HMM agreed contain private addresses. Let FN be the number of false negatives, i.e., the number of documents which experts said contain privacy, but the HMM marked as not having privacy. We then define recall as

recall = TP / (TP + FN)

Letting FP signify the number of false positives, i.e., those documents which the HMM marked as containing privacy but which experts decided do not, precision is defined as

precision = TP / (TP + FP)

The harmonic mean of precision and recall is called the F1 measure, defined as:

F1 = 2 · precision · recall / (precision + recall)

The results of our experiments are presented below.

Table 6.1 Precision, Recall, and F1 Score Results

EXPERIMENT   TP    FP   FN   TN    PRECISION   RECALL   F1
DEFAULT      104   4    9    495   0.963       0.920    0.941
CLEAN        108   16   5    483   0.871       0.956    0.911
SHRINKAGE    108   28   5    471   0.794       0.956    0.867

The value TN represents the number of true negatives, which are documents which both the experts and the HMM agreed did not contain private addresses.

CHAPTER 7

METHODS FOR INFORMATION EXTRACTION USING

PRECISE SHALLOW PARSING

Introduction

Federal, state, and local governments maintain records containing very specific personal information about individuals. Hospitals also have records of this sort of information. Personally Identifiable Information (PII), as it is called, refers to any information that identifies or can be used to identify, contact, or locate the person to whom such information pertains. This includes information that is used in a way that is personally identifiable, including linking it with identifiable information from other sources, or from which other personally identifiable information can easily be derived, including, but not limited to, name, address, phone number, fax number, email address, financial profiles, patients' records, social security number, and credit card information.

Among many projects dealing with patients' records, the National Institutes of Health (NIH) is interested in applying automatic text-processing methods to the vast amount of text. This text must not include any identifiers, for example name, postcode, date of birth, or National Health Service (NHS) number. Typically the identifiers must be coded so that the information cannot directly identify an individual, but a 'key' is available to re-link data with identifying information. Identification of PII is necessary to avoid accidental release of un-coded data. Knowing that the number of records that contain PII is substantial and that human review is costly, it is clearly advantageous to seek automatic methods for identifying PII.

In this chapter, we describe the development and effectiveness of a pattern-based extraction program to identify potentially private date of birth information. We found that pattern-based extraction provides high recall and precision for this task. However, we also discovered that optical character recognition (OCR) errors degraded the performance of the program. We also discovered some techniques for making our extraction program resistant to OCR errors.

In the following sections, we will first describe the problem our information extraction program is designed to solve, and then we describe the detailed design of our solution.

Extracting Date Of Birth Information

The Department of Energy (DOE) is publishing many once-private documents concerning the development of the Yucca Mountain Project. For most documents, nobody knew if they contained private information. The DOE is required to publish this information in order to comply with the Freedom of Information Act. However, these documents must be "scrubbed", or redacted, of personal information before publishing. This is in accordance with the Privacy Act of 1974, which states that the United States Government cannot make public private information about American citizens. Since the number of pages to be reviewed is in the millions, human review is costly and time consuming. We are developing automatic methods to identify the documents that do not contain privacy information with very high probability. These non-private documents will be published without further human review. The remaining "potentially private" documents would be sent on for human review. For the document collection from the DOE, the amount of human review would be reduced roughly by a factor of ten, estimated from the ratio of private to non-private documents and the precision of our extraction system. There will likely be many potentially private documents that will turn out not to be private after careful human review.

Our method of evaluation for finding date of birth will mirror the standard method in most information extraction experiments. The abbreviation DOB stands for "date of birth". We use:

recall = (number of correctly extracted DOBs) / (total number of DOBs in the collection)

and

precision = (number of correctly extracted DOBs) / (total number of extracted DOBs)

We also look at the results at a more coarse grain by computing recall and precision on a per document basis, as shown in the equations below:

document recall = (number of documents correctly identified as containing a DOB) / (number of documents containing a DOB)

and

document precision = (number of documents correctly identified as containing a DOB) / (number of documents identified as containing a DOB)

When it is deemed that further human review is necessary, the reviewer will examine the entire document, not just simple windows of text with potential DOB information. Precision and recall on a per document basis more closely model the effectiveness of our human-and-machine review process.

Finding Extraction Patterns

The first problem we tackled was that of extracting all the dates, which come in many different formats. A crude UNIX grep(1) of lines with digits was performed to reduce the amount of text under review for finding all formats. It quickly became evident that many of the dates in the documents were much more recent than the birth dates, such as the dating of a letter. By filtering the dates to those with years before 1990, we reduced the number of dates to consider by three orders of magnitude and removed no known birth dates from the training data. Next, the text to the left and right of each date was trimmed and each occurrence of a digit was replaced by the single letter d. What remained were roughly nine patterns covering all the observed date formats. Below is an example of each pattern:

• 1968-3-15
• January 15, 1968
• 3/15/1968
• 3-15-1968
• 4th of July, 1968
• 15 September, 1968
• 09-18-21
• 09/18/21
• 15-JUN-53
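The sketch below shows the digit-normalization trick used to discover these patterns, plus a regular expression covering a few of the formats; the regex is deliberately partial and illustrative, not the 500-character expression mentioned in chapter 8.

```python
import re

# Replace every digit with 'd' so that distinct dates collapse to one pattern.
def normalize_digits(text):
    return re.sub(r"\d", "d", text)

print(normalize_digits("3/15/1968"))      # d/dd/dddd
print(normalize_digits("09-18-21"))       # dd-dd-dd

# A partial date matcher covering a few of the observed formats.
DATE = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"            # 3/15/1968 or 09-18-21
    r"|\d{4}-\d{1,2}-\d{1,2}"                      # 1968-3-15
    r"|(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s*\d{4})\b",  # January 15, 1968
    re.IGNORECASE)

print(bool(DATE.search("He was born on January 15, 1968 in Nevada.")))  # True
```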

While testing data was being prepared with ground truth information, we decided to mine the web using Google for extraction patterns. We searched for "George Washington" and pulled out snippets that contained his birthday as well as his name.

Below is a sample list:

• George Washington (born February 22, 1732),
• George Washington Tuesday, February 22;
• George Washington's Birthday is February 22.
• GEORGE WASHINGTON, b. February 22,
• George Washington was actually born February 22, 1732
• I learned about George Washington. He was born February 22, 1732.
• George Washington ~Date of Birth~ February 22, 1732
• George Washington, the first president of the United States, was born on February 22, 1732.

Each occurrence of George Washington was replaced by aName and each occurrence of a date was replaced by aDate. The resulting lines were sorted by frequency of occurrence. Below is a sample of the top extraction patterns:

• aName was born on aDate
• aName's Birthday aDate
• aName ( aDate )
• aName, born on aDate
• aName, (aDate-aDate2)
• aName, born aDate
• aName: aDate
• aName's aDate birthday
• aName was born aDate
• On aDate, aName was born.
• aName's Birthday (on aDate)

We then proceeded to implement the following algorithm to make use of the extraction patterns. First, we extracted a window of text centered about each date. Dates after 1990 were discarded as not being possible birth dates of employees. The size of the window was estimated to be 20 words to the left and right. Using the Microsoft Live search engine, we tracked the number of hits for each phrase with no more than 40 words between George Washington and his birth date. This search engine's proximity feature does allow the terms name and birth date to occur in either order. The chart below shows that the number of hits levels off after about 20 words.

Note that the data is slightly noisy. If it were completely accurate, the number of hits would be strictly non-decreasing, and it is not. Approximate searching can be kludged together using Google's asterisk feature inside of quoted strings. For example, "George Washington * * * * February 22, 1732" should return the documents where the two terms are separated by exactly four words. Unfortunately, this feature used to work but now returns no results.

Figure 7.1 Hits versus Span between name and birth date

Spans of about 20 words are the maximum between a name and a birth date that refer to each other. If you search for windows longer than 20 words, you will find that your search crosses sentence boundaries. We later simplified the window size calculations to be 300 characters to the left and right of the date. For each window, we attempted to match an extraction pattern. If we found a match, we stopped trying other extraction patterns and reported a found birth date. We found that many of the missed extractions were based on failed matches of personal names. Correctly matching personal names proved to be just as difficult a task as finding birth dates, if not more so.

At this point we decided to relax the personal-name matching requirement. After all, for redaction purposes only the date needs to be blotted out, not the name. We then also relaxed the extraction patterns to simply require that the window contain at least one of the following keywords: 'D.O.B.', 'DOB', a word starting with 'birth', or the word 'born'.
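A minimal sketch of this relaxed extraction rule: find each date, take a 300-character window on each side, and report a hit if the window contains one of the trigger keywords. The simplified date regex and the function names are our own.

```python
import re

DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")  # simplified date matcher
KEYWORDS = re.compile(r"D\.O\.B\.|\bDOB\b|\bbirth\w*|\bborn\b", re.IGNORECASE)

def find_dob_candidates(text, window=300):
    """Report (date, window) pairs where a trigger keyword appears near a date."""
    hits = []
    for m in DATE.finditer(text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window]
        if KEYWORDS.search(context):
            hits.append((m.group(), context))
    return hits

sample = "Employee John Q. Public, born 3/15/1968, joined the project in 1992."
for date, _ in find_dob_candidates(sample):
    print(date)  # 3/15/1968
```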

CHAPTER 8

EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING

PRECISE SHALLOW PARSING

Blind Testing of DOB Extractor Program

The extractor program was developed in two successive versions. For the first version, we built an extractor using web documents for well known people and their birth dates. Still, these documents do not provide examples of all the possible text patterns for expressing DOB that we will see in the Yucca Mountain data. The focus here was making sure we did not miss any birth dates; in other words, no false negatives. For the second version, we added to the extractor program after studying a sample of 31 documents and emails containing Privacy Act relevant date of birth information. The focus here was relaxing the constraints of the extraction patterns without introducing any false positives. Human experts, different from the developers of the program, made the relevancy judgments. All the data was actual business data from the Yucca Mountain Project delivered to the Information Science Research Institute. A much larger and unseen data set consisting of 1075 documents was used for the blind testing.

Precision and Recall Results

Table 8.1 Precision and Recall Results on 1075 Clean Text Documents

              Per-Document   Per-Hit
tp            67             215
fp            3              32
fn            9              91
tn            997            997
precision     95.71%         87.04%
recall        88.16%         70.26%

Table 8.2 Precision and Recall Results on 1075 OCRed Documents

              Per-Document   Per-Hit
tp            64             180
fp            3              1173
fn            12             128
tn            997            997
precision     95.52%         13.30%
recall        84.21%         58.44%

The per-hit precision dropped in OCRed documents due to the misrecognition of a table of dates as a table of birth dates. Tables of dates are typically recognized as "all or none", meaning the extraction works the same for each date. Tables are somewhat of an oddity that should probably be tackled as a separate problem. Fortunately, they make up only 12% of the dates in the documents, so we can still get useful results without much custom tailoring to tables.

Making the Patterns Robust to OCR Errors

About one-third of the extraction errors on the OCR collection were due to misrecognized words, such as "date of birth" being OCRed into "duty of birth", and "BIRTH" being OCRed into "9RTH". If we consider the top 100 of all OCR confusions, and apply just the top one or two confusions to each keyword, we typically only generate one or two additional keywords. This makes sense given that our keywords are rather short, at most eight characters. If we add these OCR-generated words to the keyword list, the extraction process would survive the OCR errors that occur inside of keywords.
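A sketch of generating OCR-degraded keyword variants from a small confusion table; the confusion pairs shown ('i'→'l', 'o'→'0', 'B'→'9', 'rn'→'m') are illustrative examples of common substitutions, not a measured top-100 list.

```python
from itertools import combinations

# Illustrative OCR confusions: (correct substring, commonly misrecognized as).
CONFUSIONS = [("i", "l"), ("o", "0"), ("B", "9"), ("rn", "m")]

def ocr_variants(keyword, max_substitutions=2):
    """Generate keyword variants with up to `max_substitutions` confusions applied."""
    variants = {keyword}
    applicable = [(a, b) for a, b in CONFUSIONS if a in keyword]
    for r in range(1, max_substitutions + 1):
        for combo in combinations(applicable, r):
            variant = keyword
            for a, b in combo:
                variant = variant.replace(a, b)
            variants.add(variant)
    return variants

print(sorted(ocr_variants("birth")))   # includes 'blrth'
print(sorted(ocr_variants("born")))    # includes 'bom'
print(sorted(ocr_variants("BIRTH")))   # includes '9IRTH'
```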

Another third of the extraction errors occurred when a date was not recognized at all because it had at least one OCR error. Applying the same recovery technique as above would not work because of the very large number of strings that can represent slightly OCR-corrupted dates. We identified the dates using a very large (500 characters long) regular expression that allowed for dozens of different formats. We can relax the conditions on this regular expression using what is known as approximate (fuzzy) regexp. Fuzzy regexp was first implemented by Wu and Manber (1992) in the agrep program. Laurikari (2009) has released a free implementation called TRE which allows for regular expressions of any length and more detailed specification of the fuzziness. The fuzziness is specified using an edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match on the regular expression. For example, the string "bom" can be transformed into "born" using one deletion of the m and two insertions, one for r and another for n. Assuming equal weighting, we can specify an edit distance of three in order to allow for this one OCR error during matches.
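A sketch of fuzzy keyword matching by edit distance, written in plain Python so it does not depend on agrep or TRE; the routine is the textbook Levenshtein computation, under which substitutions count as a single edit, so "bom" is within two edits of "born".

```python
def edit_distance(a, b):
    """Standard Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def fuzzy_contains(text, keyword, max_edits):
    """True if some word in `text` is within `max_edits` of `keyword`."""
    return any(edit_distance(word.lower(), keyword) <= max_edits
               for word in text.split())

print(edit_distance("bom", "born"))                                       # 2
print(fuzzy_contains("Employee was bom 3/15/1968", "born", max_edits=2))  # True
```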

The last third of the OCR errors were due to zoning distortions. This is partially due to tables. Consider a department's employee list in two-column format with field labels like name and date of birth. On the left we have actual names and on the right specific dates. If the OCR engine zones the list into separate columns, the window of text near "date of birth" would now be many characters away, and perhaps outside of the 300 character limit. If we increase the window size to that of a typical whole page, we would avoid the zoning distortions. However, we would also increase the likelihood of finding other non-birthdate dates in the window. To what extent this would increase false positives would have to be measured. In all likelihood, it would cause almost any document with a table of dates and names to be labeled as containing date of birth information. Zoning problems have been previously identified as significantly degrading information extraction of privacy data in general, not just date of birth information (Taghva, Beckley, & Coombs, 2006).

Another approach to handling degraded OCR searching is described by Harding, Croft, and Weir (1997). This is a specialized form of query expansion based on using 2-grams, 3-grams, and 5-grams. Using only partial words inside of extraction rules represented by the n-grams would potentially increase our false positives. So for example, we could make "bor" one of the keywords needed near a date to trigger an extraction. Similar to what Harding did, we can corrupt whole keywords using character recognition confusions typical of OCR systems, including the insertion and deletion of spaces in words or phrases. Using n-grams would be ideal in the most severely corrupted text where there might be more than one corruption within a window. However, this technique would not work for extracting nonspecific dates in a very large number of formats. The number of actual strings represented by corrupted possible dates is very large. Estimating 50 different formats, 365 different values, and 20 different OCR variants gives nearly half a million strings. Searching for this many strings in parallel requires a very different implementation than those focused on a few patterns with a few errors. This approach is still very feasible. In another unrelated project, we have implemented and performance tested an Aho-Corasick (1975) automaton in Java on a million strings, each about 20 characters in length, and found the pattern matching to be I/O bound.
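
For readers who want a feel for how a dictionary of this size can be searched in a single pass, the sketch below uses the third-party pyahocorasick package as a stand-in; the package choice and the variant strings are our own illustrative assumptions, not the Java implementation mentioned above.

    import ahocorasick  # third-party package: pip install pyahocorasick

    # A few hypothetical OCR variants of one date; a real dictionary would hold
    # hundreds of thousands of such strings.
    variants = ["Feb. 11, 1947", "Feb. ll, 1947", "Feb, 11. 1947", "Feb. 11,1947"]

    automaton = ahocorasick.Automaton()
    for pattern in variants:
        automaton.add_word(pattern, pattern)   # value to report on a match
    automaton.make_automaton()                 # build the failure links once

    text = "he was born -on Feb. ll, 1947 in a small town"
    for end_index, matched in automaton.iter(text):
        print(matched, "ends at offset", end_index)

Because the automaton scans the text once regardless of dictionary size, its cost is dominated by reading the documents, which matches the I/O-bound behavior observed above.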

CHAPTER 9

METHODS FOR INFORMATION EXTRACTION USING

FUZZY SHALLOW PARSING

Introduction

Federal agencies, state governments, and local agencies are regularly making records available to the public via the Internet or other media. All these records must be scanned for Personally Identifiable Information (PII) prior to public release. The Licensing Support Network (LSN) is a document repository built by the Nuclear Regulatory Commission (2010) for public review. The LSN is a large document repository of over forty million pages that will provide information for the proceedings for licensing Yucca Mountain's nuclear waste repository. A typical example of PII is the occurrence of individuals together with their dates of birth. Other examples are individuals' addresses, phone numbers, fax numbers, email addresses, and credit card information.

Manual review for PII of large collections is not only expensive but also impractical. Knowing that the number of documents to be reviewed for PII is substantial and human review is costly, it is clearly advantageous to seek automatic methods for identifying PII in documents. Since most records are retained in paper form, an automatic solution begins with conversion to an electronic format using OCR. Subsequently, the process would then apply various IE and/or categorization techniques to identify records with PII. Finally, these documents must be redacted prior to public release.

In this chapter, we describe the development and effectiveness of a fuzzy pattern-based extraction program for the identification of Date of Birth PII (Pereda & Taghva, 2011). We found that fuzzy extraction provides high recall for this task; however, the high recall comes at the expense of a low precision.

We will first describe the problem our information extraction (IE) program is designed to solve in this chapter. We then describe the detailed design of our solution. The next chapter explains the process of collecting the clean and OCRed text. This is followed by the blind testing along with the actual precision and recall numbers.

Extraction of Date of Birth Information

Many of the Personally Identifiable Information (PII) patterns are of the form E1 R E2, where E1 and E2 are entities and R is the relationship between these entities. In the Date of Birth example, the relationship R establishes that the entity E2 is the date of birth for the entity E1. The study of patterns of this sort can lead to a partial solution to the PII problem. There are many known techniques for extracting patterns of this form. The most popular methods are a combination of Hidden Markov Models (HMMs), language pattern processing, and ad-hoc algorithms (Taghva, Coombs, Pereda, & Nartker, 2005; Brin, 1998; Harding, 1997). Most of the extraction techniques rely on tokens such as "born" or "birth" to establish the relationship. These tokens could have been corrupted during the conversion process using OCR. Consequently, one has to modify the process to overcome some of the difficulties associated with the OCR errors.

Our method of evaluation for finding date of birth will mirror the standard method in most information extraction experiments. The abbreviation DOB stands for "date of birth". We use:

recall = tp / (tp + fn)

and

precision = tp / (tp + fp),

where tp, fp, and fn denote the counts of true positives, false positives, and false negatives, respectively.

Hits were measured on a line-by-line basis. We chose our document collection so it would contain a high concentration of DOB information. In practice, when filtering DOB information it will be much sparser; there will be fewer DOB hits per line and per document. All but two of the fifty test documents, or 96%, contained DOB information. The density of DOB information for LSN documents will typically be less than 10% of the documents (Taghva, Coombs, Pereda, & Nartker, 2005).
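
The line-by-line bookkeeping behind these measures reduces to a few counts. The following is a small sketch of the computation (our own illustration); the counts in the usage line are those later reported for pattern 1 on the unseen clean text documents in Table 10.1.

    def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
        """Precision and recall from raw line-by-line hit counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Pattern 1 counts on the 50 unseen clean text documents (Table 10.1).
    p, r = precision_recall(tp=37, fp=16, fn=0)
    print(f"precision: {p:.2%}  recall: {r:.2%}")  # precision: 69.81%  recall: 100.00%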

Fuzzy Matching and High Recall

The objective of this experiment was to achieve a high recall of over 95% with a reasonable precision of around 30%. We constructed two fuzzy patterns using agrep (Wu & Manber, 1992; Laurikari, 2009) to identify dates and birth indicators. Agrep stands for approximate regular expression parser. The first pattern is a dash with a number on the left and right, each within 40 characters of the dash. The intention is to capture patterns of the following forms, where the first three are from the clean documents and the second three are from the OCR documents:

1. 1870 -1924
2. (c. 581 - November, 644)
3. (ca. 570/571 Mecca [مَكَهْ]/[مَكَةَ] – June 8, 632)
4. t3lt4ltsl @a. 5701571 Mecca[eS. llli&'] - June 8,632
5. Lnuis Pasteur (l 822- I 895)
6. (t9zg - 1953)

The second pattern is a number with the token "birth" or "born" within 100 characters of the number. The matched tokens can be as much as an edit distance of 2 away from "birth" or "born"; so "bom" is an edit distance of 2 away from "born". The intention is to capture patterns of the form:

1. born: Jan 01, 1751
2. he was born -on Feb. 11, 1947
3. Born: August l6, 1769
4. Mendel was bom in Hyncice, Moravia on22 Julyll22in what
5. tsirth: c1400 in Mainz, Germany
6. b, 18 November 1787; d. 10 JulY 1851

Here are the two patterns:

Pattern 1, clean and OCR documents:

[0-9]+.{0,40}[-].{0,40}[0-9]+

Date ranges surrounded with parentheses are a subset of the date ranges without parentheses. For this reason, we dropped the parentheses from this pattern to cover both cases. Both cases occur about equally often for birth dates. The precision is higher when the parentheses are present. Because our main priority for this study is high recall, this is an acceptable simplifying assumption.

Pattern 1 is used on both the clean and OCR documents. For pattern 2a, the date and keyword can occur in either order. For pattern 2b, the date must immediately follow the "b.".

For pattern 2, clean documents:

Pattern 2a: [0-9]+ AND \b(birth|born)\b

Pattern 2b: \b[.] [0-9]+

For pattern 2, OCR documents:

Pattern 2a: [0-9]+ AND \b(birth|born){-1+2#2~2}\b

Pattern 2b: \b[.,] [0-9]+

The patterns are specified using POSIX-compliant regular expressions. A string of digits is specified as [0-9]+. A comma or period is specified as '[.,]'. These two characters were found to be OCR-confused for one another in the training data. The '\b' denotes a word boundary. All matching was set to be case-insensitive.
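
As a concrete illustration of how the clean-text patterns can be applied line by line, here is a minimal sketch using Python's standard re module. The line-by-line driver and the example lines are ours; this is not the original agrep-based implementation.

    import re

    # Pattern 1: two digit strings separated by a dash, each within 40
    # characters of the dash (taken verbatim from above).
    pattern1 = re.compile(r"[0-9]+.{0,40}[-].{0,40}[0-9]+")

    # Pattern 2a on clean text: a digit string AND an exact "birth"/"born"
    # keyword on the same line (the AND is implemented as two searches).
    digits = re.compile(r"[0-9]+")
    keyword = re.compile(r"\b(birth|born)\b", re.IGNORECASE)

    for line in ["1870 -1924", "born: Jan 01, 1751", "no dates here"]:
        hit1 = bool(pattern1.search(line))
        hit2 = bool(digits.search(line)) and bool(keyword.search(line))
        print(f"{line!r}: pattern 1 = {hit1}, pattern 2a = {hit2}")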

Appending an approx-setting to a subpattern sets the approximate matching settings for that subpattern. Limits for the number of errors are specified. The count limits were used to set limits for the number of insertions (+), deletions (-), substitutions (#), and the total number of errors (~). For the approx-setting {-1+2#2~2}, the matching allows at most one deletion, at most two insertions, and at most two substitutions. The total number of errors allowed is two. Speaking more broadly, an edit distance of 2 is a rough approximation of the settings used here.

Consider the misrecognized 'bom' for 'born'. Deleting the 'r' and replacing the 'n' with 'm' can transform one word into the other. Because there were two edit operations to repair the error, the edit distance is 2, which is within the limit of 2 for this approx-setting. So '(born){-1+2#2~2}' will match 'bom'. Born is a very short keyword, so we can't allow more errors without exponentially increasing the number of possible matches, many of which would lead to false positives. These settings were arrived at through experimentation with matching against word lists generated from the training data. Using the same approx-setting, 'Birth' will also match 'tsirth' because one word can be transformed into the other with one deletion and one substitution. This was an actual error that was not seen in the training data, but was in the blind testing data.
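
An approximate-matching equivalent of the OCR form of pattern 2a can be sketched with the third-party Python regex module, whose fuzzy syntax {i<=,d<=,s<=,e<=} plays a role similar to TRE's {-1+2#2~2} approx-setting. This translation is our own assumption and may not reproduce TRE's matching semantics exactly.

    import regex  # third-party module: pip install regex

    # Fuzzy keyword: at most 2 insertions, 1 deletion, 2 substitutions, and
    # 2 errors total, mirroring the approx-setting {-1+2#2~2} described above.
    keyword = regex.compile(r"\b(?:birth|born){i<=2,d<=1,s<=2,e<=2}\b",
                            regex.IGNORECASE)
    digits = regex.compile(r"[0-9]+")

    line = "Mendel was bom in Hyncice, Moravia on22 Julyll22in what"
    if keyword.search(line) and digits.search(line):
        print("possible date-of-birth line:", line)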

In some traditional information extraction, dates are matched using a very complex pattern tuned to a large collection of clean data. This was the approach used in (Pereda & Taghva, 2010). Producing and tuning this extraction pattern is very specific to the data set and takes much time to fine tune. Someone doing information extraction on OCRed dates would quickly give up using this kind of pattern because dates can have such a wide variety of OCR errors compared to a fixed string of characters for clean dates.

For this study, the assumption that a date always contains at least one number provides a practical matching heuristic. It is conceivable that the numbers were spelled out, but that never occurred once in the training data.

This heuristic is very useful but not very innovative by itself. When used in combination with approximate regular expressions broken into parts that allow for errors, it is an important piece of the puzzle for IE in an OCR context.

CHAPTER 10

EFFECTS OF OCR ERRORS ON INFORMATION EXTRACTION USING

FUZZY SHALLOW PARSING

Collecting Documents With DOB Information

For this study, documents on the web about people were needed. A popular list of the one hundred most influential people in history is detailed in Hart (1992). Each person's name was googled and the first hit was considered as a candidate document. If the URL had a domain name previously seen, then that URL was disqualified and the next document in the search query results was considered. This allowed a more random sampling of sites from the web. If we didn't do this disqualification process, websites like Wikipedia would account for many of the documents, and they all have too similar a writing style and formatting.

The first page of each document was printed on a laser printer and scanned using a scanner, model CanoScan LiDE 200 from Canon. The software package ScanSoft OmniPage SE OCR, provided with the scanner, was used for converting the scanned images into text files. The default scan resolution of 300 dots per inch with grey scale was used for the OCR software. These settings, software, and hardware generated a much higher error rate than state-of-the-art OCR systems. Regardless, the errors are representative of many OCR collections, in part because the initial image quality is sometimes very poor and leads to a high error rate even with state-of-the-art OCR software with optimal settings. While the error rate is much higher for the CanoScan system than for a state-of-the-art system, the specific individual errors are of the same nature. At the end of this process, one hundred single-page OCRed documents were created as text files.

The generation of the clean text documents was a little less straightforward. The files were saved as text using the web browser Firefox's save-as-text feature. Some files were in UTF-8, ASCII, and ISO-8859 file formats.

Some files had illegal byte sequences, which were manually removed. In some cases where the save-as-text failed, the text was saved using cut and paste from the browser into a text editor. Not all of the HTML was removed from the text by Firefox. In particular, there are vestiges left over from hyperlinks and embedded pictures. It is readily apparent when a date of birth match happened on one of these lines, which happens less than five percent of the time. These matches were not considered matches during the statistics collection of precision and recall.

The line terminators were a mix of Windows, Unix, and Mac formats. These were all normalized, which made viewing and verification efforts easier.

Files were then manually trimmed to the first page, and parts of columns not visible on the first page were removed. In some files, the HTML had no line breaks for the text, and line breaks were added. The line breaks were not matched up with the OCR text line-for-line, as this would take much manual data entry work.

All date of birth matches in both the OCR and clean text were ensured to occur on a single line. This was done by manually editing about five percent of the matches. In a more commercial implementation, as done in (Taghva, Coombs, Pereda, & Nartker, 2005), windows of text centered about key phrases would be used to isolate potential matches rather than examining the text a line at a time. Brin (1998) used a similar simplification of ignoring multi-line matches in his relation extraction study. No matches occurred across three or more lines in the original text, so at most a single newline was removed.

Because a dash was used as a key matching character, all instances of dashes were normalized to the ASCII short dash. Even within the same document before normalization, dashes were encoded with different Unicode sequences, yet they might look nearly identical when visually displayed.
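
The normalization steps described above (line terminators and dash variants) amount to a few substitutions. The sketch below is our own illustration; in particular, the set of Unicode dash code points handled is an assumption, not an exhaustive list from the original preprocessing.

    import re

    # Dash-like code points commonly found in web pages (illustrative only):
    # en dash, em dash, minus sign, figure dash, horizontal bar.
    DASHES = "\u2013\u2014\u2212\u2012\u2015"

    def normalize(text: str) -> str:
        # Normalize Windows (\r\n) and old Mac (\r) line terminators to Unix (\n).
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        # Map every dash variant to the ASCII hyphen used by the matching patterns.
        return re.sub("[" + DASHES + "]", "-", text)

    print(normalize("Louis Pasteur (1822\u20131895)\r\n"))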

Blind Testing on Both Clean and OCRed Text

The DOB extractor was developed in two successive versions. For the first version, we built an extractor using the first 50 clean text documents. Because these web pages were carefully sampled from the entire Web, they represent a broad range of examples of all the possible text patterns for expressing DOB. The focus here was making sure we did not miss any birth dates, in other words no false negatives. For the second version, we added approximate matching limits for keyword-based extraction patterns. The focus here was relaxing the constraints of the extraction patterns without introducing so many false positives that the precision drops below 30%. Again, only the first 50 documents were used for developing these extraction patterns and arriving at approximate matching limits.

Once the different extraction patterns were developed, they were not allowed to be modified during blind testing. The testing was done on the unseen last 50 clean documents and then on the last 50 OCRed documents. All the relevancy judgments were made on a line-by-line basis. There were roughly 10K lines of clean text and 3K lines of OCR text across the 50 unseen documents. This difference in the number of lines is due to the bloating of the text when using Firefox's save-as-text feature; the whitespace resembles the HTML source more than it resembles the whitespace of the rendered HTML.

Precision And Recall Results

Table 10.1 Precision and Recall Results on 50 Unseen Clean Text Documents

            Pattern 1    Pattern 2
tp                 37           41
fp                 16            1
fn                  0            0
tn              10442        10453
precision      69.81%       97.62%
recall        100.00%      100.00%

Table 10.2 Precision and Recall Results on 50 Unseen OCR Text Documents

            Pattern 1    Pattern 2
tp                 35           36
fp                 29          109
fn                  0            3
tn               3109         3025
precision      54.69%       24.83%
recall        100.00%       92.31%

For pattern 1, the recall remains 100% in the presence of OCR errors. The precision dropped 15.12%. The dash in the date range is never misrecognized in the training or unseen documents. For this pattern, there was never a date that was so poorly recognized that there was not even a single digit. A string of digits makes a good proxy for a date when combined with just a few more details for filtering out many of the false positives. We used exactly the same pattern on the clean text and the OCR text. The precision is below 70% even on clean text.

For pattern 2, the recall drops 7.69% in an OCR context. The precision dropped 72.79%. On the clean text we use exact matching of keywords. On the OCR text, we use approximate string matching. We don't exactly predict the OCR errors; we just know that most of the OCR-corrupted words will fall near the clean version of the word. We measure the distance between OCRed and clean words using the Levenshtein distance, also known as the edit distance.

There are three false negatives in three separate documents that prevented a 100% recall on pattern 2 in the OCR text documents. The first one is attributed to a very small HTML font that caused the space character to be unrecognized.

Clean Text: Jean-Jacques Rousseau was born to Isaac Rousseau and Suzanne Bernard in Geneva on June 28, 1712
OCR Text: JacquesRousseaurvasborntolsaacRousseauandSuzauneBernardinGenevaon.lune23. 1712.

The second false negative was due to an unrecognized date because not a single digit was recognized correctly.

Clean Text: Vladimir Lenin was born Vladimir Ilich Ulyanov on April 10, 1870
OCR Text: Vladimir l,enin was born Vladimir Ilich Ulyanov on April ro, rBTo

Both of the errors above could just as easily have affected pattern 1, because both patterns 1 and 2 use the same date recognition strategy. The last error occurred because the pattern in "b. at Tauresium in Illrium, May 11, 483;", where the date does not immediately follow the "b.", was not seen in the training data. A larger training data set would have captured this extraction pattern.

Given that this is a relatively small document collection, only changes in precision or recall greater than 5% should be considered significant. Only the recall for pattern 1 remained unchanged with the introduction of OCR errors. There is an inverse relationship between recall and precision: a high recall is being maintained at the cost of a lower precision.

CHAPTER 11

SUMMARY AND CONCLUSION

HMM Information Extraction

Our testing indicates that shrinkage does not significantly improve an HMM's performance for this task. In fact, there was some degradation.

More interesting is the relative “success” of the default HMM. This may suggest that for OCR texts the structure of HMMs should be more flexible than that of the Freitag-McCallum HMM (figure 5.1). For example, perhaps there should be an additional “error state” accessible to all states with a low transition probability that would recognize addresses with “out of order” elements such as in the tagged data of figure 6.1.

For our task, however, the "clean" HMM is preferable to the others. As discussed at the start of chapters 5, 7, and 9, correctly finding documents with private information is, to a certain degree, more important than avoiding mistakenly marking non-private documents as private. In other words, recall is, to a certain degree, more important than precision. Human review can make up for the lack of precision without too heavy a cost.

Although Taghva, Borsack, and Condit (1996a, 1996b) show that OCR errors do not affect the average effectiveness of information retrieval, some studies have indicated that noisy data can degrade information extraction tasks such as text summarization, as reported by Jing, Lopresti, and Shih (2003). Our test results indicate that noisy data will also degrade tasks such as searching a text for segments belonging to some specific category, such as addresses.

There seems to be some "common sense" support for this conclusion. When retrieving OCR documents, a search engine will still have a large proportion of the text available to it due to the high accuracy of OCR engines. However, information extraction requires that specific, localized patterns be found. If the very objects being sought have been lost to the OCR process, extraction will fail. We plan to further study the effects of OCR noise by comparing the performance of HMMs on clean and noisy text collections in various IE tasks.

We also hope to further investigate the variation of certain parameters of HMMs. For example, Freitag suggests a version of absolute discounting different from the one proposed by Borkar, Deshmukh, & Sarawagi (2001), which we used. We would like to determine in later experiments which is superior.

IE using Precise Shallow Parsing

In our applications, the extractors look to establish a relationship between two entities, namely a person and a date. A date in isolation is not private information. Nor does it become private if it is identified as a birth date with an identifier such as date of birth. A date becomes private information only when it is correctly associated with a person. The date-of-birth identification is an example of relational extraction of the form E1 R E2, where Ei and R denote an entity and its relation, respectively. Other examples of relational extraction are authors and their books, as outlined in Brin (1998), and acronyms and their definitions, as in Taghva and Gilbreth (1999).

There are many known techniques for extracting relational information from text. The most popular methods are a combination of Hidden Markov Models, dictionary lookup, natural language pattern processing, and ad-hoc algorithms (Taghva, Borsack, & Condit, 1994, 1999; Taghva, Coombs, Pereda, & Nartker, 2005). Our approach is an example of an ad-hoc algorithm. Our work implies that identification of documents containing PII can be done using IE techniques for date of birth information on OCRed document collections. The success rate is comparable to that of a clean text document collection. With a remaining 5% not recalled, there is still room for improvement. The recall can be improved at the cost of decreased precision by allowing for recognition of page-sized tables, and by considering larger training document sets. On a per-hit basis, the OCR errors reduce recall to nearly 50% and precision to nearly 10%, making it not very useful. This is a very challenging problem if one wishes to perform an automatic redaction process.

IE using Fuzzy Shallow Parsing

Again, the objective of this experiment was to achieve a high recall of over 95% with a reasonable precision of around 30%. For pattern 1, this was achieved with 54.69% precision and 100.00% recall. For pattern 2, this was nearly achieved with a 24.83% precision and 92.31% recall. We were within 5.17% of this goal for precision and 2.69% for recall. With some more experimentation with the approx-settings, there is some possibility of improving the precision of pattern 2. Two of the three recall errors, or false negatives, could just as easily have occurred during the testing of pattern 1. These errors were due to very noisy OCR results where dates are unrecognizable when relying on digit characters. This need was not foreseen in the training document set. The date extraction pattern could be improved by expanding it to include dates without numbers but with the names of months.

The degree of OCR corruption in the unseen OCR documents was extreme. For example, in the date "April ro, rBTo" there are 6 errors in the span of just 14 characters. In a state-of-the-art OCR document repository, such as the LSN, the density of errors per token would be less than half that error rate. In these situations, fuzzy information extraction will encounter fewer unrecoverable errors, and prove more robust than shown here.

In future studies, we will develop a systematic approach to setting the edit-distance error counts for a given pattern. The length of the patterns, the makeup of the words within the document set, and the degree of OCR corruption within the collection will play a critical role in determining the optimal error counts. Characterizing the attainable levels of precision and recall will guide projects doing IE in an OCR context.

OCR Effects on IE

[Figure: precision (y-axis, 0 to 100) plotted against recall (x-axis, 0 to 100), with arrows for the Hidden Markov extractor measured per document, precise shallow parsing measured per document and per hit, and fuzzy shallow parsing patterns 1 and 2 measured per hit.]

Figure 11.2 Precision Recall Plot of OCR Effects

The above plot summarizes three studies on information extraction in an OCR context. The starting point of each arrow marks the precision and recall on clean text; the end point marks the precision and recall after OCR errors. The first study, in red, is on the IE system based on Hidden Markov Models. There is a small decrease in precision but also, surprisingly, a small increase in recall. Theoretically, perhaps the HMM generalized better with the noisy OCR training data. This was a complex system and did not easily lend itself to detailed internal analysis of the errors. This could also be a case of cargo cult science (Feynman, 1974 & 1985) where the expected results appear in the data. The blue arrows represent precise shallow parsing. The small blue arrow shows a small decrease in recall after the introduction of OCR errors. The large blue arrow shows a large decrease in precision and a moderate decrease in recall. Here the precision and recall are properly measured per hit, rather than per document. Both green arrows measure fuzzy shallow parsing IE per hit. Pattern 1 matches date of birth patterns with a date range; pattern 2 matches date of birth patterns with a keyword. In one case, we see no decrease in recall and in the other only a slight decrease in recall. In one case, there is a slight decrease in precision and in the other a moderate decrease. Roughly, for the shallow parsing techniques, it seems we can trade off precision for recall when adapting IE systems to OCR errors.

The code for the HMM extractor was written as part of a team of researchers. The code for the shallow parsing programs was written by this author and was, of course, certainly his best effort. The precise shallow parsing extractor was evaluated independently of the author. The results of the three studies are intended to give an empirical gauge of the quality of information extraction results on a large variety of documents that have been OCRed.

Conclusions

We've shown that formal extraction methodologies based on Hidden Markov Models are not robust enough to deal with extraction in an OCR environment. We have also shown that both precise shallow parsing and fuzzy shallow parsing can be used to increase the recall at the price of a significant drop in precision. These results are in stark contrast to IR systems, which function well without modification in an OCR environment.

The Hidden Markov Model system was built by a team that had as many as ten employees. The team included two dedicated data entry workers, a half dozen programmers, two professors, and more. The system was the emperor's new software, like his clothes. The HMM could have been tested more objectively. The shallow parsing system was tested in a very adversarial manner because it was not as sophisticated, and it fared well. Now, one can dare say that the emperor has no clothes. The hidden structure of this particular HMM turned out to be directly observable in the linear sequence of lexical tokens. The shallow parsing techniques were much simpler solutions that worked: they solved a real problem on real data. Sometimes problem solving and appearing sophisticated are competing goals. Cargo cult science is an easy mistake to make with Hidden Markov Models.

BIBLIOGRAPHY

Aho, A. V. & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340.

Baeza-Yates, R. A., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Bikel, D., Miller, S., & Weischedel, R. (1997). Nymble: a high-performance learning name-finder. Proceedings of Applied Natural Language Processing-97, 194–201.

Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data – The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 1–22.

Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. ACM SIGMOD 2001, 175–186.

Brin, S. (1998). Extracting patterns and relations from the world wide web. WebDB at 6th International Conference on Extending Database Technology, EDBT'98, 172–183. New York: Springer-Verlag.

Brin, S. & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7), 107–117.

Callan, P., Croft, B., & Harding, M. (1992). The INQUERY retrieval system. Proceedings of the 3rd International Conference on Database and Expert Systems Applications, 78–83.

Delphi Consulting Group. (1991). Text Retrieval Systems, A Market and Technology Assessment, Vol. 1-3.

Feynman, R. P. (1974). Cargo Cult Science. Engineering and Science, 37(7), June 1974, 10–13. www.lhup.edu/~DSIMANEK/cargocul.htm.

Feynman, R. P., Leighton, R., & Hutchings, E. (1985). Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character. W. W. Norton, 338–346.

Freitag, D. & McCallum, A. (1999). Information extraction with HMMs and shrinkage. Proceedings of the AAAI-99 Workshop on Information Extraction, 31–36.

Harding, M., Croft, W. B., & Weir, C. (1997). Probabilistic retrieval of OCR degraded text using N-grams. European Conference on Digital Libraries, 345–359.

Hart, M. H. (1992). The 100: A Ranking of the Most Influential Persons in History. Carol Publishing Group/Citadel Press, New York, vii–x.

Leek, T. (1997). Information extraction using hidden Markov models. Master's thesis, UC San Diego.

Laurikari, V. (2009). TRE – a lightweight, robust, and efficient POSIX compliant regexp matching library with approximate (fuzzy) matching. Retrieved from http://www.laurikari.net/tre/.

Jing, H., Lopresti, D., & Shih, C. (2003). Summarizing noisy documents. Proceedings of SDIUT 2003, 111–119.

Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1), 31–88.

Nuclear Regulatory Commission. (2010). The Licensing Support Network – a document repository. Retrieved from http://www.lsnnet.gov/.

Open Text. (1999). Open Text introduces livelink activator for basis. http://www.opentext.com/news/pr.html?id=1055.

Tesseract. (2010). Tesseract (software), an open-source OCR system. Retrieved from http://en.wikipedia.org/wiki/Tesseract_(software).

Rabiner, L. & Juang, B. (1993). Fundamentals of Speech Recognition. Prentice Hall.

Salton, G. (1989). Automatic Text Processing. Reading, MA: Addison-Wesley.

Salton, G. & Buckley, C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24, 513–523.

Pereda, R. & Taghva, K. (2010). Date of birth extraction using precise shallow parsing. Proceedings of the IS&T/SPIE 2010 International Symposium on Electronic Imaging Science and Technology, 7534–05.

Pereda, R. & Taghva, K. (2011). Fuzzy Information Extraction on OCR Text. Eighth International Conference on Information Technology: New Generations, in press.

Taghva, K., Cartright, M., & Borsack, J. (2004). An efficient tool for data preparation. Proceedings of Information Systems: Next Generation.

Taghva, K., Beckley, R., & Coombs, J. (2006). The effects of OCR error on the extraction of private information. 7th IAPR Workshop on Document Analysis Systems (DAS 2006), volume 3872 of Lecture Notes in Computer Science, 348–357.

Taghva, K., Borsack, J., & Condit, A. (1994). Results of applying probabilistic IR to OCR text. ACM SIGIR Conference on Research and Development in Information Retrieval, 202–211.

Taghva, K., Borsack, J., & Condit, A. (1999). The effectiveness of thesauri-aided retrieval. Proceedings of the IS&T/SPIE 1999 International Symposium on Electronic Imaging Science and Technology, 202–211.

Taghva, K., Borsack, J., & Condit, A. (1996a). Effects of OCR errors on ranking and feedback using the vector space model. Information Processing and Management, 32(3), 317–327.

Taghva, K., Borsack, J., & Condit, A. (1996b). Evaluation of model-based retrieval effectiveness with OCR text. ACM Transactions on Information Systems, 14, 64–93.

Taghva, K., Borsack, J., Condit, A., & Erva, S. (1994). The effects of noisy data on text retrieval. Journal of the American Society for Information Science, 45, 50–58.

Taghva, K., Coombs, J., Pereda, R., & Nartker, T. (2005). Address extraction using hidden Markov models. Proceedings of the IS&T/SPIE 2005 International Symposium on Electronic Imaging Science and Technology, 119–126.

Taghva, K. & Gilbreth, J. (1999). Finding acronyms and their definitions. International Journal on Document Analysis and Recognition, 1(4), 191–198.

United States Department of Justice. (1974). The Privacy Act of 1974. Retrieved from http://www.justice.gov/opcl/privacyact1974.htm.

Wu, S. & Manber, U. (1992). Agrep – a fast approximate pattern matching tool. Proceedings of the USENIX Technical Conference, 153–162.

VITA

Graduate College University of Nevada, Las Vegas

Ray Pereda Email: raypereda gmail.com

Degrees: Bachelor of Science, Mathematics, 1991 Bachelor of Science, Computer Science, Department Honors, 1991 Minors in Philosophy and Linguistics University of New Orleans, University Honors

Master of Science, Computer Science, 1993 University of Arizona

Master of Science, Complex Adaptive Systems, 1997 National Technological University

Master of Science, Computer Science, 1999 University of Texas

Dissertation Title: Information Extraction in an Optical Character Recognition Context

Dissertation Committee: Chairperson, Kazem Taghva, Ph.D. Committee Member, Thomas Nartker, Ph.D. Committee Member, Laxmi Gewali, Ph.D. Committee Member, Ajoy Datta, Ph.D. Committee Member, Ashok Singh, Ph.D.
