Multi-Modular Domain-Tailored OCR Post-Correction
Sarah Schulz and Jonas Kuhn
Institute for Natural Language Processing (IMS), University of Stuttgart

Abstract

One of the main obstacles for many Digital Humanities projects is low data availability. Texts have to be digitized in an expensive and time-consuming process, in which Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using OCR post-correction as an example, we show the adaptation of a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We show that the combination of different approaches, such as Statistical Machine Translation and spell checking, with the help of a ranking mechanism improves tremendously over the individual approaches. Since we consider the accessibility of the resulting tool a crucial part of Digital Humanities collaborations, we describe the workflow we suggest for efficient text recognition and subsequent automatic and manual post-correction.

1 Introduction

Humanities are no longer just the realm of scholars turning pages of thick books. As the worlds of humanists and computer scientists begin to intertwine, new methods to revisit known ground emerge and options to widen the scope of research questions become available. Moreover, the nature of the language encountered in such research attracts the attention of the NLP community (Kao and Jurafsky (2015), Milli and Bamman (2016)). Yet, the basic requirement for the successful implementation of such projects often poses a stumbling block: large digital corpora comprising the textual material of interest are rare. Archives and individual scholars are in the process of improving this situation by applying Optical Character Recognition (OCR) to the physical resources. In the Google Books1 project, books are being digitized on a large scale. But even though collections of literary texts like Project Gutenberg2 exist, these collections often lack the texts of interest to a specific question. As an example, we describe the compilation of a corpus of adaptations of Goethe's Sorrows of the young Werther, which allows for the analysis of character networks throughout the publishing history of this work.

The success of OCR is highly dependent on the quality of the printed source text. Recognition errors, in turn, impact the results of computer-aided research (Strange et al., 2014). Especially for older books set in hard-to-read fonts and printed on stained paper, the output of OCR systems is not good enough to serve as a basis for Digital Humanities (DH) research. It needs to be post-corrected in a time-consuming and cost-intensive process.

We describe how we support and facilitate the manual post-correction process with the help of informed automatic post-correction. To account for the problem of relative data sparsity, we illustrate how a generic architecture agnostic to a specific domain can be adjusted to text specificities such as genre and font characteristics by including just small amounts of domain-specific data. We suggest a system architecture (cf. Figure 1) with trainable modules which joins general and specific problem solving as required in many applications. We show that the combination of modules via a ranking algorithm yields results far above the performance of single approaches.
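To make the idea of module combination concrete, the following is a minimal sketch, not the authors' implementation: the module functions (spell_check, smt_general, smt_specific) and the scoring model are hypothetical placeholders standing in for the trainable components, and the decision step simply picks the highest-scoring candidate.

```python
from typing import Callable, List

def spell_check(token: str) -> List[str]:
    # Placeholder: a real module would return spell-checker suggestions.
    return [token]

def smt_general(token: str) -> List[str]:
    # Placeholder: character-level SMT trained on general data.
    return [token]

def smt_specific(token: str) -> List[str]:
    # Placeholder: character-level SMT trained on small amounts of domain-specific data.
    return [token]

def score(original: str, candidate: str) -> float:
    # Placeholder decision model, e.g. a ranking classifier or language model score.
    return 1.0 if candidate == original else 0.5

MODULES: List[Callable[[str], List[str]]] = [spell_check, smt_general, smt_specific]

def correct_token(token: str) -> str:
    """Pool suggestions from all modules (plus the original token)
    and let the decision step pick the highest-scoring candidate."""
    candidates = {token}
    for module in MODULES:
        candidates.update(module(token))
    return max(candidates, key=lambda c: score(token, c))

if __name__ == "__main__":
    print(correct_token("Werthcr"))  # with real modules, ideally "Werther"
```

The point of such a design is that each module can be trained or configured independently (general vs. domain-specific), while the decision step arbitrates among their competing suggestions.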
tation of such projects often poses a stumbling 2http://www.gutenberg.org, 14.04.2017. 2716 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2716–2726 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics OCR text preprocessing preprocessed text specific vocab general statistical specific statistical original compound split spell check machine translation machine translation correction suggestions decision module corrected text Figure 1: Multi-modular OCR post-correction system. We discuss the point of departure for our research search relating to the data could be found. in Section 2 and introduce the data we base our OCR post-correction is applied in a diversity of system on in Section 4. In Section 5, we illustrate fields in order to compile high-quality datasets. the most common errors and motivate our multi- This is not merely reflected in the homogeneity of modular, partly customized architecture. Section 6 techniques but in the metric of evaluation as well. gives an overview of techniques included in our While accuracy has been widely used as evalu- system and the ranking algorithm. In Section 7, ation measure in OCR post-correction research, we discuss results, the limitations of automatic Reynaert (2008a) advocates the use of precision post-correction and the influence the amount of and recall in order to improve transparency in eval- training data takes on the performance of such a uations. Dependent on the paradigm of the applied system. Finally, Section 8 describes a way to effi- technique even evaluation measures like BLEU ciently integrate the results of our research into a score can be found (cf. Afli et al. (2016)). digitization work-flow as we see the easy accessi- Since shared tasks are a good opportunity to estab- bility of computer aid as a central point in Digital lish certain standards and facilitate the compara- Humanities collaborations. bility of techniques, the Competition on Post-OCR Text Correction3 organized in the context of IC- 2 Related work DAR 2017 could mark a milestone for more uni- There are two obvious ways to automatically im- fied OCR post-correction research efforts. prove quality of digitized text: optimization of Regarding techniques used for OCR post- OCR systems or automatic post-correction. Com- correction, there are two main trends to be monly, OCR utilizes just basic linguistic knowl- mentioned: statistical approaches utilizing error edge like character set of a language or reading distributions inferred from training data and lex- direction. The focus lies on the image recognition ical approaches oriented towards the comparison aspect which is often done with artificial neural of source words to a canonical form. Combina- networks (cf. Graves et al. (2009), Desai (2010)). tions of the two approaches are also available. Post-correction is focused on the correction of er- Techniques residing in this statistical domain rors in the linguistic context. It thus allows for the have the advantage that they can model specific purposeful inclusion of knowledge about the text distributions of the target domain if training data at hand, e.g. genre-specific vocabulary. Neverthe- is available. Tong and Evans (1996) approach less, post-correction has predominantly been tack- post-correction as a statistical language modeling led OCR system agnostic as outlined below. As problem, taking context into account. 
Lexical approaches, in turn, compare source words to canonical forms listed in a lexicon, which can, for instance, be based on contents gathered from all over the web.

The human component as the final authority has been mentioned in some of these projects. Visual support of the post-correction process has been emphasized by e.g. Vobl et al. (2014), who describe a system of iterative post-correction of OCRed historical text which is evaluated in an application-oriented way. They present the human corrector with an alignment of image and OCRed text and make batch correction of the same error in the entire document possible. They can show that the time needed by human correctors decreases considerably.

3 Evaluation metrics

We describe and evaluate our data by means of word error rate (WER) and character error rate (CER). These error rates are commonly used metrics in speech recognition and machine translation evaluation and can also be referred to as length-normalized edit distance. They quantify the number of operations, namely the number of insertions, deletions and substitutions, that are needed to transform the suggested string into the manually corrected string, and are computed as follows:
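A standard formulation consistent with this description (the exact notation used by the authors may differ) counts substitutions S, deletions D and insertions I against the manually corrected reference, normalized by its length N in words for WER and in characters for CER:

\mathrm{WER} = \frac{S + D + I}{N_{\text{words}}}, \qquad \mathrm{CER} = \frac{S + D + I}{N_{\text{chars}}}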