Ocrspell: an Interactive Spelling Correction System for OCR Errors in Text
Total Page:16
File Type:pdf, Size:1020Kb
UNLV Retrospective Theses & Dissertations 1-1-1996 OCRspell: An interactive spelling correction system for OCR errors in text Eric Stofsky University of Nevada, Las Vegas Follow this and additional works at: https://digitalscholarship.unlv.edu/rtds Repository Citation Stofsky, Eric, "OCRspell: An interactive spelling correction system for OCR errors in text" (1996). UNLV Retrospective Theses & Dissertations. 3222. http://dx.doi.org/10.25669/t1t6-9psi This Thesis is protected by copyright and/or related rights. It has been brought to you by Digital Scholarship@UNLV with permission from the rights-holder(s). You are free to use this Thesis in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/ or on the work itself. This Thesis has been accepted for inclusion in UNLV Retrospective Theses & Dissertations by an authorized administrator of Digital Scholarship@UNLV. For more information, please contact [email protected]. INFORMATION TO USERS This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter &ce, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely afreet reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand comer and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6” x 9” black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order. UMI A Bell & Howell Information Company 300 North Zeeb Road, Ann Arbor MI 48106-1346 USA 313/761-4700 800/521-0600 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. OCRSpell: An Interactive Spelling Correction System for OCR Errors in Text by Eric Stofsky A thesis submitted in partial fulfillment of the requirements for the degree of Master o f Science in Computer Science Department of Computer Science University of Nevada, Las Vegas August 1996 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. UMI Number: 1381045 UMI Microform 1381045 Copyright 1996, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. UMI 300 North Zeeb Road Ann Arbor, Ml 48103 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The thesis of Eric Stofsky for the degree of Master of Science in Computer Science is approved. Chairperson, Kazem Taghva, Ph.D .IféÀ^yy Examining Committee Member, Thomas A. Nartker, Ph.D. Examining Committee Member, John T. Minor, Ph.D. GraduateA- Faculty Representative,_____________ Shahram Latifi, Ph.D. Graduate Dein, Ronald W. Smith, Ph.D. University of Nevada, Las Vegas August, 1996 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Abstract In this thesis we describe a spelling correction system designed specifically for OCR (Optical Character Recognition) generated text that selects candidate words through the use of information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is presented as well. Ill Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table of Contents Abstract ...............................................................................................iii Acknowledgments ............................................................................... v 1 Introduction ..................................................................................... 1 2 Preliminaries ..................................................................................... 4 2.1 Background ..................................................................................... 4 2.2 Influences ....................................................................................... 6 2.3 Effects of OCR Generated Text on IR Systems .............................7 2.4 Implementation............................................................................... 9 3 Parsing OCR Generated Text ...........................................................11 4 Organization of the Lexicon.............................................................. 20 5 Design .................................................................................................22 5.1 System Design .............................................................................. 22 5.2 Algorithms and Heuristics U sed ....................................................23 5.3 Performance Issues ........................................................................ 32 6 Features ............................................................................................ 37 6.1 Simplicity ...................................................................................... 37 6.2 Extendibility .................................................................................. 38 6.3 Flexibility ...................................................................................... 38 7 OCRSpell Trainer ............................................................................ 39 8 Evaluation ........................................................................................ 42 8.1 OCRSpell Test 1 42 8.2 OCRSpell Test 2 ............................................................................ 45 8.3 Test Results Overview .................................................................. 46 9 Conclusion and Future Work .......................................................... 48 Bibliography ........................................................................................ 49 IV Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Acknowledgments I would like to thank Dr. Kazem Taghva and the rest of the Text Retrieval group at ISRI. I would also like to thank Julie Borsack for her help in the development of the OCRSpell system. Her suggestions for potential OCRSpell options and subsequent testing of the system was invaluable. Also, Jeff Gilbreth aided in the implementation of the confusion generator, and Andrew Bagdanov proofread the original draft of this thesis. I am also very grateful to Dr. Thomas Nartker, Dr. John Minor, and Dr. Shahram Latifi for serving on my graduate committee. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 1 Introduction Research into algorithmic techniques for detecting and correcting spelling errors in text has a long, robust history in computer science. As an amalgamation of the traditional fields of artificial intelligence, pattern recognition, string matching, computational linguis tics, and others, this fundamental problem in information science has been studied from the early 1960’s to the present [12]. As other technologies matured, this major area of research has become more important than ever. Everything from text retrieval to speech recognition relies on efficient and reliable text correction and approximation. While research in the area of correcting words in text encompasses a wide array of fields, in this thesis we report on OCRSpell, a system which integrates many techniques for correcting errors induced by an OCR (optical character recognition) device. This system is fundamentally different from many of the common spelling correction applications which are prevalent today. Traditional text correction is performed by isolating a word boundary, checking the word against a collection of commonly misspelled words, and performing a simple four step procedure: insertion, deletion, substitution, and transposition of all the characters in the string [14]. While the “corrective engine” in this approach may seem overly simplistic, it works quite well for standard applications. In fact, Damerau [6] reported that 80% of all misspellings can be 1 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. corrected by the above approach. However, this