<<

International Journal on Document Analysis and Recognition manuscript No. (will be inserted by the editor)

Optical Character Recognition Errors and Their Effects on Natural Language Processing

Daniel Lopresti

Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, USA

Received December 19, 2008 / Revised August 23, 2009

Abstract. Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.

Key words: Performance evaluation – Optical character recognition – Sentence boundary detection – Tokenization – Part-of-speech tagging

1 Introduction

Despite decades of research and the existence of established commercial products, the output from optical character recognition (OCR) processes often contains errors. The more highly degraded the input, the greater the error rate. Since such systems can form the first stage in a pipeline where later stages are designed to support sophisticated information extraction and exploitation applications, it is important to understand the effects of recognition errors on downstream text analysis routines. Are all recognition errors equal in impact, or are some worse than others? Can the performance of each stage be optimized in isolation, or must the end-to-end system be considered? What are the most serious forms of degradation a document can suffer in the context of natural language processing? In balancing the tradeoff between the risk of over- and under-segmenting characters during OCR, where should the line be drawn to maximize overall performance? The answers to these questions should influence the way we design and build document analysis systems.

Researchers have already begun studying problems relating to processing text data from noisy sources. To date, this work has focused predominantly on errors that arise during speech recognition. For example, Palmer and Ostendorf describe an approach for improving named entity extraction by explicitly modeling speech recognition errors through the use of statistics annotated with confidence scores [18]. The inaugural Workshop on Analytics for Noisy Unstructured Text Data [23] and its followup workshops [24,25] have featured papers examining the problem of noise from a variety of perspectives, with most emphasizing issues that are inherent in written and spoken language.

There has been less work, however, in the case of noise induced by optical character recognition. Early papers by Taghva, Borsack, and Condit show that moderate error rates have little impact on the effectiveness of traditional information retrieval measures [21], but this conclusion is tied to certain assumptions about the IR model ("bag of words"), the OCR error rate (not too high), and the length of the documents (not too short). Miller et al. study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output [17], although speech is their primary interest. They found that, having trained their system on both clean and noisy input material, performance degraded linearly as a function of the error rate. Farooq and Al-Onaizan proposed an approach for improving the output of machine translation when presented with OCR'ed input by modeling the error correction process itself as a translation problem [5].

A paper by Jing, Lopresti, and Shih studied the problem of summarizing textual documents that had undergone optical character recognition and hence suffered from typical OCR errors [10]. From the standpoint of performance evaluation, this work employed a variety of indirect measures: for example, comparing the total number of sentences returned by sentence boundary detection for clean and noisy versions of the same input text, or counting the number of incomplete parse trees generated by a part-of-speech tagger.

Fig. 1. Propagation of OCR errors through NLP stages (the "error cascade").

In two later papers [12,13], we turned to the question of performance evaluation for text analysis pipelines, proposing a paradigm based on the hierarchical application of approximate string matching techniques. This flexible yet mathematically rigorous approach both quantifies the performance of a given processing stage and explicitly identifies the errors it has made. Also presented were the results of pilot studies in which small sets of documents (tens of pages) were OCR'ed and then piped through standard routines for sentence boundary detection, tokenization, and part-of-speech tagging, demonstrating the utility of the approach.

In the present paper, we employ this same evaluation paradigm, but using a much larger and more realistic dataset totaling over 3,000 scanned pages, which we are also making available to the community to foster work in this area [14]. We study the impact of several real-world degradations on optical character recognition and the NLP processes that follow it, and plot later-stage performance as a function of the input OCR accuracy. We conclude by outlining possible topics for future research.

2 Stages in Text Analysis

In this section, we describe the prototypical stages that are common to many text analysis systems, discuss some of the problems that can arise, and then list the specific packages we use in our work.
The stages, in order, are: (1) optical character recognition, (2) sentence boundary detection, (3) tokenization, and (4) part-of-speech tagging. These basic procedures are of interest because they form the basis for more sophisticated natural language applications, including named entity identification and extraction, topic detection and clustering, and summarization. In addition, the problem of identifying tabular structures that should not be parsed as sentential text is also discussed as a pre-processing step.

A brief synopsis of each stage and its potential problem areas is listed in Table 1. The interactions between errors that arise during OCR and later stages can be complex. Several common scenarios are depicted in Fig. 1. It is easy to imagine a single error propagating through the pipeline and inducing a corresponding error at each of the later steps in the process (Case 1 in the figure). However, in the best case, an OCR error could have no impact whatsoever on any of the later stages (Case 2); for example, tag and tap are both verbs, so the sentence boundary, tokenization, and part-of-speech tagging would remain unchanged if g were misrecognized as p. On the other hand, misrecognizing a comma (,) as a period (.) creates a new sentence boundary, but might not affect the stages after this (Case 3). More intriguing are latent errors which have no effect on one stage, but reappear later in the processing pipeline (Cases 4 and 5). OCR errors which change the tokenization or part-of-speech tagging while leaving sentence boundaries unchanged fall in this category (e.g., faulty word segmentations that insert or delete whitespace characters). Finally, a single OCR error can induce multiple errors in a later stage, its impact mushrooming to neighboring tokens (Case 6).

In selecting implementations of the above stages to test, we choose to employ freely available open source software rather than proprietary, commercial packages. From the standpoint of our work, we require behavior that is representative, not necessarily "best-in-class." For sufficiently noisy inputs, the same methodology and conclusions are likely to apply no matter what algorithm is used. Comparing different techniques for realizing a given stage to determine which is most robust in the presence of OCR errors would make an interesting topic for future research.

2.1 Optical character recognition

The first stage of the pipeline is optical character recognition, the conversion of the scanned input image from bitmap format to encoded text.

Table 1. Text analysis stages: functions and potential problems.

Processing Stage                 Intended Function                                  Potential Problem(s)
Optical character recognition    Transcribe input bitmap into encoded text          Current OCR is "brittle"; errors made
                                 (hopefully accurately).                            early on propagate to later stages.
Sentence boundary detection      Break input into sentence-sized units, one per     Missing or spurious sentence boundaries
                                 text line.                                         due to OCR errors on punctuation.
Tokenization                     Break each sentence into word (or word-like)       Missing or spurious tokens due to OCR
                                 tokens delimited by white space.                   errors on whitespace and punctuation.
Part-of-speech tagging           Take tokenized text and attach a label to each     Bad PoS tags due to failed tokenization
                                 token indicating its part-of-speech.               or OCR errors that alter orthographies.

Fig. 3. OCR output for the image from Fig. 2.

Fig. 2. Example of a portion of a dark photocopy.

Optical character recognition performs quite well on clean inputs in a known font. It rapidly deteriorates in the case of degraded documents, complex layouts, and/or unusual fonts. In certain situations, OCR will introduce many errors involving punctuation characters, which has an impact on later-stage processing.

For our OCR stage, we selected the Tesseract open source software package [22]. The latest version at the time of our tests was 2.03. Since we are presenting it with relatively simple text layouts, having to contend with complex documents is not a concern in our experiments. The performance of Tesseract on the inputs we tested is likely to be similar to the performance of a better-quality OCR package on noisier inputs of the same type. Fig. 2 shows a portion of a dark photocopy page used in our studies, while Fig. 3 shows the OCR output from Tesseract. Note that the darkening and smearing of character shapes, barely visible to the human eye, lead to various forms of substitution errors (e.g., l → i, h → l1, rn → m) as well as space deletion (e.g., of the → ofthe) and insertion (e.g., project → pro ject) errors.

2.2 Sentence boundary detection

Procedures for sentence boundary detection use a variety of syntactic and semantic cues in order to break the input text into sentence-sized units, one per line (i.e., each unit is terminated by a standard end-of-line delimiter such as the Unix newline character). The sentence boundary detector we used in our tests is the MXTERMINATOR package by Reynar and Ratnaparkhi [20]. An example of its output for a "clean" (error-free) text fragment consisting of two sentences is shown in Fig. 4(b).¹

2.3 Tokenization

Tokenization takes the input text, which has been divided into one sentence per line, and breaks it into individual tokens delimited by white space. These largely correspond to word-like units or isolated punctuation symbols. In our studies, we used the Penn Treebank tokenizer [16]. As noted in the documentation, its operation can be summarized as follows: (1) most punctuation is split from adjoining words, (2) double quotes are changed to doubled single forward- and backward-quotes, and (3) verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. Sample output for the tokenization routine is shown in Fig. 4(c).
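For illustration, the three rules above can be approximated in a few lines of Python. This is only a sketch of the behavior; the actual tokenizer is the sed script of [16], and the regular expressions below are our own simplification of it:

```python
import re

def ptb_tokenize(sentence: str) -> list[str]:
    """Rough approximation of the Penn Treebank tokenizer's main rules.

    Illustrative only; not the actual sed script of [16].
    """
    s = sentence
    # Rule 2: double quotes become doubled single forward/backward quotes.
    s = re.sub(r'"(?=\S)', '`` ', s)                 # opening quote
    s = re.sub(r'"', " ''", s)                       # closing quote
    # Rule 3: split verb contractions and the Anglo-Saxon genitive.
    s = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", s, flags=re.I)
    s = re.sub(r"(\w)n't\b", r"\1 n't", s, flags=re.I)
    # Rule 1: split most punctuation from adjoining words; the period is
    # split only in sentence-final position so abbreviations survive.
    s = re.sub(r"([,;:?!])", r" \1 ", s)
    s = re.sub(r"\.\s*$", " .", s)
    return s.split()
```

Applied to the kind of OCR damage discussed later (Fig. 9), rule (1) is what turns the misrecognized rece;ving into the three tokens rece, ;, and ving.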

2.4 Part-of-speech tagging

Part-of-speech tagging takes the tokenized text as input and tags each token according to its part of speech. We used Ratnaparkhi's part-of-speech tagger MXPOST [19], which produced a total of 42 different part-of-speech tags for our data.

¹ Due to line-length limitations in the figure, we indicate continuations in the case of longer sentences through the use of the backslash character.

(a) Input text.

National Health Enhancement Systems Inc said it is offering a new health evaluation system to its line of fitness assessment programs. The company said the program, called The Health Test, will be available in 60 days.

(b) Sentence boundary detection output.

National Health Enhancement Systems Inc said it is offering a new health \
evaluation system to its line of fitness assessment programs.
The company said the program, called The Health Test, will be available in 60 days.

(c) Tokenization output.

National Health Enhancement Systems Inc said it is offering a new health \
evaluation system to its line of fitness assessment programs .
The company said the program , called The Health Test , will be available in 60 days .

(d) Part-of-speech tagging output.

National_NNP Health_NNP Enhancement_NNP Systems_NNPS Inc_NNP said_VBD it_PRP \
is_VBZ offering_VBG a_DT new_JJ health_NN evaluation_NN system_NN to_TO its_PRP$ \
line_NN of_IN fitness_NN assessment_NN programs_NNS ._.
The_DT company_NN said_VBD the_DT program_NN ,_, called_VBD The_DT Health_NNP \
Test_NNP ,_, will_MD be_VB available_JJ in_IN 60_CD days_NNS ._.

Fig. 4. Example output from the text analysis stages.

The example in Fig. 4(d) illustrates another key point which will be discussed later: the evaluations we conduct in this work are relativistic. That is, there is no universal ground-truth; rather, we compare the performance of the various text analysis stages on clean and noisy versions of the same input documents. An "error" is considered to have occurred when the two sets of results differ. There may in fact already be errors present, even for clean inputs. For example, the first two words in the noun phrase "fitness assessment programs" should be labeled as adjectives (JJ), not as nouns (NN).
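Under this relativistic view, a part-of-speech "error" is simply a disagreement between the two runs on corresponding tokens. The following minimal sketch assumes the token correspondence has already been computed by the alignment procedure of Sect. 3 and padded with None where a token has no counterpart; that input format is our assumption, not something specified by the paradigm itself:

```python
def pos_disagreements(clean, noisy):
    """Count relativistic PoS errors: aligned token pairs whose tags differ.

    clean, noisy: equal-length lists of (token, tag) pairs already placed in
    correspondence by an upstream alignment step (assumed); None marks a
    token with no counterpart in the other version.
    """
    errors = []
    for c, n in zip(clean, noisy):
        if c is None or n is None:      # spurious or missing token
            errors.append((c, n))
        elif c[1] != n[1]:              # same slot, different tag
            errors.append((c, n))
    return errors
```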

2.5 Table spotting in text

As a practical matter, the NLP routines we have described are intended for application to sentential text. However, some collections, including the Reuters-21578 news corpus [11], contain samples that are primarily tabular. Attempting to parse such documents could introduce misleading results in the sorts of studies we have in mind. An example is shown in Fig. 5.

Fig. 5. Example of a tabular document from the Reuters-21578 news corpus.

Our past work on medium-independent table detection [8,9] can be applied to identify pages containing tables so that they can be held out from the dataset. This paradigm consists of a high-level framework that formulates table detection as an optimization problem, along with specific table quality measures that can be tuned for a given application and/or the input medium. We assume that the input is a single document page segmentable into individual, non-overlapping text lines. This assumption is not too restrictive since multi-column input documents can first be segmented into individual columns before running our table detection algorithm.

When run on Split-000 of the Reuters dataset with an aggressive threshold, our approach to table spotting exhibited a recall of 89% (tabular documents that were correctly excluded from the dataset) and an estimated precision of 100% (documents included in the dataset that were indeed non-tabular). We note there are other reasons why a given text might not be parsable; for example, in the Reuters corpus there are boilerplate reports of changes in commodity prices that, while not tabular, are not sentential either. Still, the net result of this pre-processing step is to yield a subset more appropriate to our purposes. The dataset we have made available to the community reflects these refinements [14].

3 An Evaluation Paradigm

Performance evaluation for text analysis of noisy inputs presents some serious hurdles. The approach we described in our earlier papers makes use of approximate string matching to align two linear streams of text, one representing the OCR output and the other representing the ground-truth [12,13]. Due to the markup conventions employed by sentence boundary detection, tokenization, and part-of-speech tagging, this task is significantly more complex than computing the basic alignments used for assessing raw OCR accuracy [3,4]. The nature of the problem is depicted in Fig. 6, where there are four sentences detected in the original text and nine in the associated OCR output from a dark photocopy. Numerous spurious tokens are present as a result of noise on the input page. The challenge, then, is to determine the proper correspondence between purported sentences, tokens, and part-of-speech tags so that errors may be identified and attributed to their root causes.

Fig. 7. Hierarchical alignment.

Despite these additional complexities, we can build on the same paradigm used in OCR error analysis, employing an optimization framework that likewise can be solved using dynamic programming.
We begin by letting S = s_1 s_2 ... s_m be the source document (the ground-truth), T = t_1 t_2 ... t_n be the target document (the OCR output), and defining dist1_{i,j} to be the distance between the first i symbols of S and the first j symbols of T. The boundary conditions are:

\[
\begin{aligned}
dist1_{0,0} &= 0 \\
dist1_{i,0} &= dist1_{i-1,0} + c1_{del}(s_i) \\
dist1_{0,j} &= dist1_{0,j-1} + c1_{ins}(t_j)
\end{aligned}
\tag{1}
\]

and the main dynamic programming recurrence is:

\[
dist1_{i,j} = \min
\begin{cases}
dist1_{i-1,j} + c1_{del}(s_i) \\
dist1_{i,j-1} + c1_{ins}(t_j) \\
dist1_{i-1,j-1} + c1_{sub}(s_i, t_j)
\end{cases}
\tag{2}
\]

for 1 ≤ i ≤ m, 1 ≤ j ≤ n. Here deletions, insertions, and mismatches are charged positive costs, and exact matches are charged negative costs. The computation builds a matrix of distance values working from the upper left corner (dist1_{0,0}) to the lower right (dist1_{m,n}). By maintaining the decision(s) used to obtain the minimum in each step, it becomes possible to backtrack the computation and obtain, in essence, an explanation of the errors that arose in processing the input. This information is used in analyzing the performance of the procedure under study.
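Eqns. 1 and 2 map directly onto the familiar dynamic programming table. The following minimal sketch records the minimizing decisions so the alignment can be recovered; the flat cost values are illustrative placeholders (in general the costs may depend on the symbols involved), and all names are ours:

```python
def align_chars(S, T, c_del=1.0, c_ins=1.0, c_sub=1.0, c_match=-1.0):
    """Level-1 alignment (Eqns. 1-2): weighted edit distance with backtrack.

    Returns (distance, ops), where ops is the recovered decision sequence,
    i.e. the "explanation" of the errors: ('del'|'ins'|'sub'|'match', i, j).
    """
    m, n = len(S), len(T)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):              # boundary conditions (Eqn. 1)
        dist[i][0], back[i][0] = dist[i - 1][0] + c_del, 'del'
    for j in range(1, n + 1):
        dist[0][j], back[0][j] = dist[0][j - 1] + c_ins, 'ins'
    for i in range(1, m + 1):              # main recurrence (Eqn. 2)
        for j in range(1, n + 1):
            same = S[i - 1] == T[j - 1]
            dist[i][j], back[i][j] = min(
                (dist[i - 1][j] + c_del, 'del'),
                (dist[i][j - 1] + c_ins, 'ins'),
                (dist[i - 1][j - 1] + (c_match if same else c_sub),
                 'match' if same else 'sub'))
    ops, i, j = [], m, n                   # backtrack from the lower right
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append((op, i, j))
        if op == 'del':
            i -= 1
        elif op == 'ins':
            j -= 1
        else:
            i, j = i - 1, j - 1
    return dist[m][n], list(reversed(ops))
```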

To generalize these ideas to later stages of text processing, consider the output of those stages and the errors that might arise. Tokenization, for example, might fail to recognize a token boundary, thereby combining two tokens into one (a "merge"), or break a token into two or more pieces (a "split"). Similar errors may arise in sentence boundary detection.

In the paradigm we have developed, we adopt a three-level hierarchy. At the highest level, sentences (or purported sentences) are matched, allowing for missed or spurious sentence boundaries. The basic entity in this case is a sentence string, and the costs of deleting, inserting, substituting, splitting, or merging sentence strings are defined recursively in terms of the next level of the hierarchy, which is tokens. As with the sentence level, tokens can be split or merged. Comparison of tokens is defined in terms of the lowest level of the hierarchy, which is the basic approximate string matching model we began this section with (Eqns. 1 and 2).

In terms of dynamic programming, at the token level, the algorithm becomes:

\[
dist2_{i,j} = \min
\begin{cases}
dist2_{i-1,j} + c2_{del}(s_i) \\
dist2_{i,j-1} + c2_{ins}(t_j) \\
\min_{1 \le k' \le k,\ 1 \le l' \le l} \left[\, dist2_{i-k',j-l'} + c2_{sub_{k':l'}}(s_{i-k'+1 \ldots i},\ t_{j-l'+1 \ldots j}) \,\right]
\end{cases}
\tag{3}
\]

where the inputs are assumed to be sentences and c2_{del}, c2_{ins}, and c2_{sub} are now the costs of deleting, inserting, and substituting whole tokens, respectively, which can be naturally defined in terms of the first-level computation.

Lastly, at the highest level, the input is a whole page and the basic editing entities are sentences. For the recurrence, we have:

\[
dist3_{i,j} = \min
\begin{cases}
dist3_{i-1,j} + c3_{del}(s_i) \\
dist3_{i,j-1} + c3_{ins}(t_j) \\
\min_{1 \le k' \le k,\ 1 \le l' \le l} \left[\, dist3_{i-k',j-l'} + c3_{sub_{k':l'}}(s_{i-k'+1 \ldots i},\ t_{j-l'+1 \ldots j}) \,\right]
\end{cases}
\tag{4}
\]

with costs defined in terms of the second-level computation.

By executing this hierarchical dynamic programming from the top down, given an input page for the OCR results as processed through the text analysis pipeline and another page for the corresponding ground-truth, we can determine an optimal alignment between purported sentences, which is defined in terms of an optimal alignment between individual tokens in the sentences, which is in turn defined in terms of an optimal alignment between each possible pairing of tokens (including the possibilities that tokens are deleted, inserted, split, or merged). Once an alignment is constructed using the orthography of the input text strings, we may compare the part-of-speech tags assigned to corresponding tokens to study the impact of OCR errors on that process as well. This paradigm is depicted in Fig. 7.

From a pragmatic standpoint, the optimization process can require a substantial amount of CPU time depending on the length of the input documents (we have observed runtimes of several minutes for pages containing ±1,000 characters). There are, however, well-known techniques for speeding up dynamic programming (e.g., so-called "beam search") which have little or no effect on the optimality of the results for the cases of interest.
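The second and third levels share this structure; the extra inner minimization lets a block of source tokens match a block of target tokens, which is how splits and merges are captured. Below is a sketch of the token level (Eqn. 3) reusing align_chars from the previous sketch; bounding the block size by a small window K and charging a flat gap cost for whole-token insertions and deletions are our simplifications:

```python
def align_tokens(S, T, K=3, gap=1.0):
    """Level-2 alignment (Eqn. 3) over token sequences S and T.

    Block substitution cost comes from the level-1 routine, as in the
    paper; the window K and the flat gap cost are illustrative choices.
    """
    m, n = len(S), len(T)
    INF = float('inf')
    dist = [[INF] * (n + 1) for _ in range(m + 1)]
    dist[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0:                                  # delete token s_i
                best = min(best, dist[i - 1][j] + gap)
            if j > 0:                                  # insert token t_j
                best = min(best, dist[i][j - 1] + gap)
            for k in range(1, min(K, i) + 1):          # k' source tokens vs.
                for l in range(1, min(K, j) + 1):      # l' target tokens
                    cost, _ = align_chars(' '.join(S[i - k:i]),
                                          ' '.join(T[j - l:j]))
                    best = min(best, dist[i - k][j - l] + cost)
            dist[i][j] = best
    return dist[m][n]
```

Restricting k' and l' to a small window keeps the computation tractable, and the beam search mentioned above would further prune the cells explored.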

Fig. 6. NLP results for an original text (left) and OCR output from a dark photocopy (right).

4 Experimental Evaluation

In this section, we describe the data we used in our experiments, including the steps we took in preparing it. As has already been noted, we employ a relativistic analysis in this work. The text analysis pipeline, shown in Fig. 8, is run twice for each input document: once for a clean (electronic) version and once for the output from an OCR process. The results of the two runs are then compared using the techniques of the previous section. Both because the evaluation integrates all three stages in a single optimization step, and because there is no need to laboriously construct a manual ground-truth for each part of the computation, it is possible to run much larger test sets than would otherwise be feasible. This substantial benefit is offset by the risk that some errors might be misclassified, since the "truth" is not 100% trustworthy. Since our focus is on studying how OCR errors impact later stages and not on measuring absolute performance, such an approach seems justified.

4.1 Data Preparation

As suggested previously, our baseline dataset is derived from "Split-000" of the Reuters-21578 news corpus [11]. After filtering out articles that consist primarily of tabular data, we formatted each of the remaining documents as a single page typeset in Times-Roman 12-point font. In doing so, we discarded articles that were either too long to fit on a page or too short to provide a good test case (fewer than 50 words).

Of the 925 articles in the original set, 661 remained after these various criteria were applied. These pages were then printed on a Ricoh Aficio digital photocopier and scanned back in using the same machine at a resolution of 300 dpi. One set of pages was scanned as-is, another two sets were first photocopied through one and two generations with the contrast set to the darkest possible setting, and two more sets were similarly photocopied through one and two generations at the lightest possible setting before scanning. This resulted in a test set totaling 3,305 pages. We then ran the resulting bitmap images through the Tesseract OCR package. Examples of a region of a scanned page image and the associated OCR output were shown in Figs. 2 and 3.

Basic OCR accuracy can be judged using a single level of dynamic programming, i.e., Eqns. 1 and 2, as described elsewhere [4]. These results are presented in Table 2, broken down by dataset. As in the information retrieval domain, precision and recall are used here to reflect two different aspects of system performance. The former is the fraction of reported entities that are true, while the latter is the fraction of true entities that are reported. Note that the OCR accuracy is quite high, but performance deteriorates for the degraded documents. It is also instructive to consider separately the impact on punctuation symbols and whitespace; these results are also shown in the table. Punctuation symbols in particular are badly impacted, with a large number of false alarms (low precision), especially in the case of the Dark2 dataset, where fewer than 80% of the reports are true. This phenomenon has serious implications for sentence boundary detection and later stages of text processing.

We then ran sentence boundary detection, tokenization, and part-of-speech tagging on both the original (ground-truth) news stories and the versions that had been OCR'ed, comparing the results using the paradigm described earlier. This allowed us both to quantify performance and to determine the optimal alignments between sequences, and hence identify the actual errors that had arisen.
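To make the two-pass protocol concrete, here is a toy end-to-end sketch. detect_sentences and pos_tag are deliberately crude stand-ins for MXTERMINATOR [20] and MXPOST [19], not those tools' actual interfaces, and ptb_tokenize is the sketch from Sect. 2.3:

```python
import re

def detect_sentences(text):
    """Toy stand-in for MXTERMINATOR: split after sentence-final . ! or ?"""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def pos_tag(tokens):
    """Toy stand-in for MXPOST: tag every token NN."""
    return [(tok, 'NN') for tok in tokens]

def run_pipeline(text):
    """One pass of the three NLP stages over one version of a document."""
    sentences = detect_sentences(text)              # one sentence per unit
    tokens = [ptb_tokenize(s) for s in sentences]   # sketch from Sect. 2.3
    return sentences, tokens, [pos_tag(t) for t in tokens]

def relativistic_comparison(clean_text, noisy_text):
    """Run the pipeline on both versions; differences are the "errors".

    The paper pairs the two runs via the hierarchical alignment of Sect. 3;
    this sketch reports only the grossest count discrepancies.
    """
    truth, test = run_pipeline(clean_text), run_pipeline(noisy_text)
    return {'sentence_diff': len(test[0]) - len(truth[0]),
            'token_diff': sum(map(len, test[1])) - sum(map(len, truth[1]))}
```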
One set of pages was scanned as- is, another two sets were first photocopied through one An example of a relatively straightforward alignment and two generations with the contrast set to the darkest produced by our evaluation procedure is shown in Fig. 9. Daniel Lopresti: Optical Character Recognition Errors and Their Effects on Natural Language Processing 7

Fig. 8. Relativistic analysis.

Ground-Truth:
The_DT company_NN said_VBD it_PRP was_VBD receiving_VBG no_DT proceeds_NNS from_IN the_DT offering_NN ._.

OCR Output:
The_DT company_NN said_VBD it_PRP was_VBD rece_NN ;_: ving_VBG no_DT proceeds_NNS from_IN the_DT offering_NN ._.

Fig. 9. Example of an alignment displaying impact of a single substitution error.

Table 2. Average OCR performance relative to ground-truth.

          All Symbols                Punctuation                Whitespace
          Prec.   Recall  Overall    Prec.   Recall  Overall    Prec.   Recall  Overall
Clean     0.995   0.997   0.997      0.981   0.996   0.988      0.995   0.999   0.997
Dark1     0.989   0.996   0.994      0.937   0.992   0.963      0.980   0.998   0.989
Dark2     0.966   0.990   0.981      0.797   0.972   0.874      0.929   0.988   0.958
Light1    0.995   0.997   0.997      0.977   0.994   0.986      0.993   0.999   0.996
Light2    0.994   0.997   0.997      0.971   0.989   0.981      0.992   0.999   0.996
Overall   0.988   0.995   0.993      0.933   0.989   0.958      0.978   0.997   0.987
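The precision and recall columns of Table 2 can be tallied directly from the level-1 alignment operations; a minimal sketch building on the align_chars routine sketched in Sect. 3 follows. The "Overall" column combines the two measures, but since its exact definition is not reproduced here, only the base measures are computed:

```python
def precision_recall(ops, truth_len, output_len):
    """Precision and recall from level-1 alignment operations.

    ops: the (op, i, j) list returned by align_chars. Precision is the
    fraction of reported (OCR output) symbols that are true; recall is
    the fraction of true (ground-truth) symbols that are reported.
    """
    matches = sum(1 for op, _, _ in ops if op == 'match')
    return (matches / output_len if output_len else 0.0,
            matches / truth_len if truth_len else 0.0)

# e.g., the space-deletion error of Figs. 2-3 ("of the" -> "ofthe"):
_, ops = align_chars("of the", "ofthe")
prec, rec = precision_recall(ops, truth_len=6, output_len=5)  # 1.00, 0.83
```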

This displays the effect of a single-character substitution error (i being misrecognized as ;). The result is three tokens where before there was only one. Not unexpectedly, two of the three tokens have inappropriate part-of-speech labels. In this instance, the OCR error impacts two text analysis stages and is, for the most part, localized; other errors can have effects that cascade through the pipeline, becoming amplified at each stage.

A tabulation of the dynamic programming results for the three text processing stages appears in Table 3. While most of the data is processed with relatively high accuracy, the computed rates are generally lower than the input OCR accuracies. Note, for example, that the overall OCR accuracy across all symbol classes is 99.3%, whereas the sentence boundary detection, tokenization, and part-of-speech tagging accuracies are 94.7%, 97.9%, and 96.6%, respectively. Recall also that these measures are relative to the exact same procedures run on the original (error-free) text. This illustrates the cascading effect of OCR errors in the text analysis pipeline. The dark photocopied documents show particularly poor results, undoubtedly because of the large number of spurious punctuation symbols they introduce. Here we see that character recognition errors can have, at times, a relatively large impact on one or more of the downstream NLP stages. Sentence boundary detection appears particularly susceptible in the worst case.

In addition to analyzing accuracy rates, it is also instructive to consider counts of the average number of errors that occur on each page. This data is presented in Table 5, broken down by error type for each of the NLP pipeline stages. While always altering the input text, in the best case an OCR error might result in no mistakes in sentence boundary detection, tokenization, or part-of-speech tagging: those procedures could be agnostic (or robust) to the error in question. Here, however, we can see that on average, each page contains a number of induced tokenization and part-of-speech tagging errors, and sentence boundary detection errors also occur with some regularity.

Errors that arise in later stages of processing may be due to the original OCR error, or to an error it induced in an earlier pipeline stage. Whatever the cause, this error cascade is an important artifact of pipelined text analysis systems. In Figs. 10-12, we plot the accuracy for each of the three NLP stages as a function of the input OCR accuracy for all 3,305 documents in our dataset. In viewing the charts, note that the x-axis (OCR accuracy) ranges from 90% to 100%, whereas the range for the y-axis is from 0% to 100%.

Table 3. Average NLP performance relative to ground-truth.

          Sentence Boundaries        Tokenization               Part-of-Speech Tagging
          Prec.   Recall  Overall    Prec.   Recall  Overall    Prec.   Recall  Overall
Clean     0.978   0.995   0.985      0.994   0.997   0.995      0.988   0.991   0.989
Dark1     0.918   0.988   0.946      0.977   0.987   0.982      0.964   0.976   0.970
Dark2     0.782   0.963   0.850      0.919   0.946   0.932      0.885   0.917   0.900
Light1    0.971   0.994   0.981      0.992   0.996   0.994      0.985   0.989   0.987
Light2    0.967   0.984   0.972      0.990   0.994   0.992      0.983   0.987   0.985
Overall   0.923   0.985   0.947      0.974   0.984   0.979      0.961   0.972   0.966

Accuracy of the NLP stages is nearly always uniformly lower than the OCR accuracy, sometimes substantially so.

4.3 Impact of OCR Errors

Because the paradigm we have described can identify and track individual OCR errors as a result of the string alignments constructed during the optimization of Eq. 4, we can begin to study which errors are more severe with respect to their downstream impact. Further analyzing a subset of the "worst-case" documents where OCR accuracy greatly exceeds NLP accuracy (recall the plots of Figs. 10-12), we identify OCR errors that have a disproportionate effect; Table 4 lists some of these.

Table 4. A few select OCR errors and their impact on downstream NLP stages.

NLP Category        Induced by ...      Count   Rate
EOS insertion       . insertion         288     0.941
                    sp insertion        214     0.332
EOS deletion        . substitution      21      0.568
                    . deletion          12      0.462
Token insertion     sp insertion        637     0.988
                    ' insertion         245     0.961
                    . insertion         292     0.941
                    insertion           214     0.918
Token deletion      sp deletion         112     1.000
                    sp substitution     65      0.890
POS substitution    d substitution      34      1.000
                    v substitution      18      1.000
                    0 substitution      14      1.000
                    p substitution      12      1.000
                    , substitution      46      0.979
                    . substitution      35      0.946
                    1 substitution      30      0.909
                    l substitution      49      0.860
                    . insertion         220     0.719

We see, for example, that period insertions induced 288 spurious sentence boundaries, and when this particular OCR error arose, it had this effect 94.1% of the time. On the other hand, period deletions occurred less frequently (at least in this dataset), and are much less likely to induce the deletion of a sentence boundary. Note also that relatively common OCR substitution errors nearly always lead to a change in the part-of-speech tag for a token.
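Given the alignments, producing the Count and Rate columns of Table 4 is a matter of bookkeeping. In the sketch below, the pairing of each OCR-level error with the NLP-level error (if any) it induced is assumed to have been extracted from the hierarchical alignment; the pair format is our assumption:

```python
from collections import Counter

def tabulate_induced_errors(paired_events):
    """Tally Table 4-style statistics from (ocr_error, nlp_error) pairs.

    paired_events: (ocr_error, nlp_error_or_None) pairs, one per OCR error,
    assumed extracted from the hierarchical alignment. Returns a mapping
    {(nlp_category, ocr_error): (count, rate)}, where rate is the fraction
    of occurrences of that OCR error that induced that NLP error.
    """
    pairs = list(paired_events)
    occurrences = Counter(ocr for ocr, _ in pairs)
    induced = Counter((nlp, ocr) for ocr, nlp in pairs if nlp is not None)
    return {(nlp, ocr): (cnt, cnt / occurrences[ocr])
            for (nlp, ocr), cnt in induced.items()}
```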
5 Conclusions

In this paper, we considered a text analysis pipeline consisting of four stages: optical character recognition, sentence boundary detection, tokenization, and part-of-speech tagging. Using a formal algorithmic model for evaluating the performance of multi-stage processes, we presented experimental results examining the impact of representative OCR errors on later stages in the pipeline. While most such errors are localized, in the worst case some have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system. Studies such as this provide a basis for the development of more robust text analysis techniques, as well as guidance for tuning OCR systems to achieve optimal performance when embedded in larger applications.

Since errors propagate from one stage of the pipeline to the next, sentence boundary detection algorithms that work reliably for noisy documents are clearly important. Similarly, the majority of errors that occurred in our study are tokenization or part-of-speech tagging errors, which would feed into additional text processing routines in a real system, contributing to a further error cascade. One possible approach for attempting to address this issue would be to retrain existing systems on such documents to make them more tolerant of noise. This line of attack would be analogous to techniques now being developed to improve NLP performance on informal and/or ungrammatical text [6]. However, this is likely to be effective only when noise levels are relatively low. Much work needs to be done to develop robust methods that can handle documents with high noise levels.

In the context of document summarization, a potential downstream application, we have previously noted that the quality of text analysis is directly tied to the level of noise in a document [10]. Summaries are not seriously impacted in the presence of minor errors, but as errors increase, the results may range from being difficult to read, to incomprehensible. Here it would be useful to develop methods for assessing noise levels in an input image without requiring access to the ground-truth. Such measurements could be incorporated into text analysis algorithms for the purpose of segmenting out problematic regions of the page for special processing (or even avoiding them entirely), thereby improving overall readability.

Fig. 10. Sentence boundary detection accuracy as a function of OCR accuracy.

Fig. 11. Tokenization accuracy as a function of OCR accuracy.

Table 5. Average NLP error counts per document.

          Sentence Boundaries     Tokenization            Part-of-Speech Tagging
          Missed    Added         Missed    Added         Mismatches
Clean     0.0       0.1           0.5       1.0           0.8
Dark1     0.1       0.5           2.2       3.8           1.5
Dark2     0.2       1.6           8.7       12.7          4.3
Light1    0.0       0.2           0.9       1.4           0.9
Light2    0.1       0.2           1.0       1.7           0.9
Overall   0.1       0.5           2.7       4.1           1.7

Fig. 12. Part-of-speech tagging accuracy as a function of OCR accuracy.

Past work on attempting to quantify document image quality for predicting OCR accuracy [1,2,7] addresses a related problem, but one which exhibits some notable differences. Establishing a robust index that measures whether a given section of text can be processed reliably is one possible approach.

We also observe that our current study, while employing real data generated from a large collection of scanned documents, is still limited in that the page layouts, textual content, and image degradations are all somewhat simplistic. This raises interesting questions for future research concerning the interactions between OCR errors that might occur in close proximity, as well as higher-level document analysis errors that can impact larger regions of the page. Are such errors further amplified downstream? Is the cumulative effect more additive or multiplicative? The answers to questions such as these will prove important as we seek to build more sophisticated systems capable of handling real-world document processing tasks for inputs that range both widely in content and quality.

Finally, we conclude by noting that datasets designed for studying problems such as the ones described in this paper can be an invaluable resource to the international research community. Hence, we are making our large collection of scanned pages and the associated ground-truth and intermediate analyses available online [14].

6 Acknowledgments

We gratefully acknowledge support from the National Science Foundation under Award CNS-0430178 and a DARPA IPTO grant administered by BBN Technologies. An earlier version of this paper was presented at the 2008 Workshop on Analytics for Noisy Unstructured Text Data [15].

References

1. L. R. Blando, J. Kanai, and T. A. Nartker. Prediction of OCR accuracy using simple image features. In Proceedings of the Third International Conference on Document Analysis and Recognition, pages 319–322, Montréal, Canada, August 1995.
2. M. Cannon, J. Hochberg, and P. Kelly. Quality assessment and restoration of typewritten document images. Technical Report LA-UR 99-1233, Los Alamos National Laboratory, 1999.
3. J. Esakov, D. P. Lopresti, and J. S. Sandberg. Classification and distribution of optical character recognition errors. In Proceedings of Document Recognition I (IS&T/SPIE Electronic Imaging), volume 2181, pages 204–216, San Jose, CA, February 1994.
4. J. Esakov, D. P. Lopresti, J. S. Sandberg, and J. Zhou. Issues in automatic OCR error classification. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pages 401–412, April 1994.
5. F. Farooq and Y. Al-Onaizan. Effect of degraded input on statistical machine translation. In Proceedings of the Symposium on Document Image Understanding Technology, pages 103–109, November 2005.
6. J. Foster. Treebanks gone bad: Generating a treebank of ungrammatical English. In Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, January 2007.
7. V. Govindaraju and S. N. Srihari. Assessment of image quality to predict readability of documents. In Proceedings of Document Recognition III (IS&T/SPIE Electronic Imaging), volume 2660, pages 333–342, San Jose, CA, January 1996.
8. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong. Medium-independent table detection. In D. P. Lopresti and J.
Zhou, editors, Proceedings of Document Recognition and Retrieval VII (IS&T/SPIE Electronic Imaging), volume 3967, pages 291–302, San Jose, CA, January 2000.

9. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong. Evaluating the performance of table processing algorithms. International Journal on Document Analysis and Recognition, 4(3):140–153, March 2002.
10. H. Jing, D. Lopresti, and C. Shih. Summarizing noisy documents. In Proceedings of the Symposium on Document Image Understanding Technology, pages 111–119, April 2003.
11. D. D. Lewis. Reuters-21578 Test Collection, Distribution 1.0, May 2008. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
12. D. Lopresti. Performance evaluation for text processing of noisy inputs. In Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), pages 759–763, Santa Fe, NM, March 2005.
13. D. Lopresti. Measuring the impact of character recognition errors on downstream text analysis. In Proceedings of Document Recognition and Retrieval XV (IS&T/SPIE Electronic Imaging), volume 6815, pages 0G.01–0G.11, San Jose, CA, January 2008.
14. D. Lopresti. Noisy OCR text dataset, May 2008. http://www.cse.lehigh.edu/~lopresti/noisytext.html.
15. D. Lopresti. Optical character recognition errors and their effects on natural language processing. In Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, pages 9–16, Singapore, July 2008.
16. R. MacIntyre. Penn Treebank tokenizer (sed script source code), 1995. http://www.cis.upenn.edu/~treebank/tokenizer.sed.
17. D. Miller, S. Boisen, R. Schwartz, R. Stone, and R. Weischedel. Named entity extraction from noisy input: Speech and OCR. In Proceedings of the 6th Applied Natural Language Processing Conference, pages 316–324, Seattle, WA, 2000.
18. D. D. Palmer and M. Ostendorf. Improving information extraction by modeling errors in speech recognizer output. In J. Allan, editor, Proceedings of the First International Conference on Human Language Technology Research, 2001.
19. A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proceedings of the Empirical Methods in Natural Language Processing Conference, May 1996. ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz.
20. J. C. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, March-April 1997. ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz.
21. K. Taghva, J. Borsack, and A. Condit. Effects of OCR errors on ranking and feedback using the vector space model. Information Processing and Management, 32(3):317–327, 1996.
22. Tesseract open source OCR engine, May 2008. http://code.google.com/p/tesseract-ocr/.
23. Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, January 2007. http://research.ihost.com/and2007/.
24. Second Workshop on Analytics for Noisy Unstructured Text Data, Singapore, July 2008. http://and2008workshop.googlepages.com/.
25. Third Workshop on Analytics for Noisy Unstructured Text Data, Barcelona, Spain, July 2009. http://and2009workshop.googlepages.com/.