Image and Text Correction Using Language Models

Ido Kissos, School of Computer Science, Tel Aviv University, Ramat Aviv, Israel
Nachum Dershowitz, School of Computer Science, Tel Aviv University, Ramat Aviv, Israel

Abstract—We report on experiments with the use of learned classifiers for improving OCR accuracy and generating word-correction candidates. The method involves the simultaneous application of several image and text correction models, followed by a performance evaluation that enables the selection of the most efficient image-processing model for each image document and of the most likely corrections for each word. It relies on a ground-truth corpus, comprising image documents and their transcription, plus an in-domain corpus used to build the language model. It is applicable to any language with simple segmentation rules, and performs well on morphologically-rich languages. Experiments with an Arabic newspaper corpus showed a 50% reduction in word error rate, with per-document image enhancement a major contributor.

I. INTRODUCTION

Low quality printing, poor scanning, and physical deterioration reduce the usefulness of many modern publicly-available digital documentary resources (books, journals, newspaper articles, etc.). Institutions are converting document images into machine-readable text via Optical Character Recognition (OCR), enabling a realistic way of exploring vast document corpora with automated tools, such as indexing for textual search. But when the images are of poor quality, the OCR task becomes notoriously difficult. Consequently, it is impossible to directly employ the obtained results for subsequent tasks, like text retrieval, without costly manual editing. For example, although contemporary OCR engines claim 97% word accuracy for Arabic, accuracy on datasets with low-resolution images or infrequent character classes can drop below 80%.

Our proposed OCR-correction technique consists of an image pre-processing and text post-correction pipeline, based on a composite machine-learning classification. The technique wraps the core OCR engine and is in practice agnostic to it. Image correction applies a small set of image enhancement algorithms to copies of the document images, which serve as input for the OCR engine. The enhancements include image scaling, binarization methods and parameter thresholds, and were chosen for their experimental accuracy gain. The potential gain of every algorithm is evaluated as the sum of its positive accuracy improvements, relying on a learned classifier for the selection of improved OCR text over a baseline OCR text, that is, the output of the image with the OCR engine's default pre-processing. Such a classifier was trained on a ground-truth set, relying on recognition-confidence statistics and language-model features to output an accuracy prediction.

Text post-correction applies a lexical spellchecker, and potentially corrects single-error misspellings and a certain class of double-error misspellings, which are the major source of inaccurate recognition in most OCR use-cases. It takes into consideration several valuable word features, each giving additional information for a possible spelling correction. It comprises two consecutive stages: (a) word expansion based on a confusion matrix, and (b) word selection by a regression model based on word features. The confusion matrix and regression model are built from a transcribed set of images, while the word features rely on a language model built from a large textual dataset. The first stage generates correction candidates, ensuring high recall for a given word, while the second assures word-level precision by selecting the most probable word for a given position. Relying on features extracted from pre-existing knowledge, such as unigram and bigram document frequencies extracted from electronic dictionaries, as well as OCR metrics, such as the recognition confidence and the confusion matrix, we accomplished a significant improvement in text accuracy.

The two correction methods—image enhancement and text correction—implement equivalent methodologies: both begin by promoting recall, generating many correction candidates in the hope that some may improve the baseline result, and afterwards use prior knowledge and context to gain precision, namely by selecting the best candidate. Initially, our research focused on the text method; it was its success that pushed us to implement a similar methodology for image processing, which in the end turned out to give the larger gain in accuracy.

We report on experiments with applying the methodology to Arabic, using test data from the "Arabic Press Archive" of the Moshe Dayan Center at Tel Aviv University. There are a number of open-source and commercial OCR systems trained for Arabic [1]; we used NovoDynamics NovoVerus commercial version 4, one of the leading OCR engines for Arabic scripts. For evaluation purposes we use the Word Error Rate (WER) measure, or its complement to 1, named OCR accuracy, which is suited to subsequent applications of the OCR output, such as information retrieval. Our correction method performs effectively, reducing faulty words at a rate of 50% on this dataset, which is an 8% absolute improvement in accuracy. The overall results showed negligible false-positive errors, namely the method rarely rejects correct OCR words in favor of an erroneous correction, which is a major concern for spellcheckers. An analysis of classifier performance shows that bigram features have the highest impact on its accuracy, suggesting that the method is mainly context reliant.
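As a concrete illustration of the measure, the sketch below computes WER and its complement under the simplifying assumption that the OCR tokens are already aligned one-to-one with the ground-truth tokens; the paper does not specify its alignment or error-counting details, so this is only an approximation.

```python
def wer(ocr_tokens, truth_tokens):
    """Fraction of word positions where the OCR token differs from the ground truth.

    Assumes the two token sequences are already aligned; a real evaluation would
    typically align them first with a word-level edit distance.
    """
    if not truth_tokens:
        return 0.0
    errors = sum(1 for o, t in zip(ocr_tokens, truth_tokens) if o != t)
    errors += abs(len(ocr_tokens) - len(truth_tokens))  # unmatched tail counts as errors
    return errors / len(truth_tokens)

def ocr_accuracy(ocr_tokens, truth_tokens):
    """OCR accuracy is defined here as the complement of WER."""
    return 1.0 - wer(ocr_tokens, truth_tokens)
```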
Section II presents the image enhancement methodology and Section III the text correction methodology. The main section, Section IV, presents and discusses the experimental results. It is followed by a brief discussion.

II. IMAGE ENHANCEMENT

The ability to consistently select the best image processing for every image document leans on the capability to reliably predict its performance, namely, its OCR text accuracy. To facilitate this task, the prediction can be based on the extracted text of each image rather than on the image itself, suggesting that an a posteriori selection of the image processing algorithm could outperform the common a priori one.

The enhancement method requires one to move from a single-pass OCR engine, in which every document is processed once—and for which OCR engines are optimized—to multi-pass OCR. The latter enables an accuracy-performance trade-off, promoting better OCR results at the cost of CPU resources, which is often a reasonable call for digitization projects. Having several output texts for a single image document, we can rank them and choose the one predicted to be most accurate for that specific image.

The multi-pass architecture is built as a pipeline in which each module applies a family of dependent algorithms, for example binarization methods and thresholds, while the sequential modules are independent of one another. After every module, an evaluation sequence extracts the text of the document image and predicts its accuracy, then feeds the processed image to the next module. This implementation avoids applying an unfeasible number of image processing sets, namely every possible combination of algorithms. Assuming independence between the modules, their application order has only a small significance.
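A minimal sketch of this multi-pass loop, under simplifying assumptions: `run_ocr`, `enhance` and `predict_accuracy` are caller-supplied stand-ins for the wrapped OCR engine, the image-processing step and the learned accuracy regression described below; the module and configuration names mirror Table I, but the data structure and function names are ours.

```python
# One module per family of dependent algorithms; configurations mirror Table I.
MODULES = {
    "scale":    ["default", "bicubic:1.5", "bicubic:2.25"],
    "binarize": ["default", "sauvola:170", "sauvola:230"],
    "denoise":  ["none", "mild", "median"],
}

def multi_pass_ocr(image, run_ocr, enhance, predict_accuracy, modules=MODULES):
    """Greedy multi-pass enhancement.

    run_ocr(image) -> text, enhance(image, module, config) -> image, and
    predict_accuracy(text) -> float are supplied by the caller. Each module
    generates its candidates from the image selected so far, keeps the candidate
    whose OCR text the regression scores highest, and feeds it to the next module.
    """
    base_image, base_text = image, run_ocr(image)
    base_score = predict_accuracy(base_text)
    for module, configs in modules.items():
        # All of this module's candidates start from the image selected so far.
        candidates = [enhance(base_image, module, cfg) for cfg in configs]
        for candidate in candidates:
            text = run_ocr(candidate)
            score = predict_accuracy(text)
            if score > base_score:          # otherwise keep the current baseline
                base_image, base_text, base_score = candidate, text, score
    return base_text
```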
Each module comprises two stages: (1) Enhancement candidate generation – every algorithm in the set renders a processed image that serves as an input to the OCR engine. (2) Candidate evaluation – for each candidate, language model features and confidence statistics are extracted from its OCR text output. These are used to rank the candidates according to their likely correctness; the highest-ranked candidate is selected as the input image for the subsequent module, or as the text for post-correction if it is the last module.

Candidates are generated based on a set of image processing algorithms and thresholds that we found to have a positive effect on the corpus' OCR accuracy. Finding this set required an evaluation of the potential gain of every algorithm. We benchmarked the performance of a large set of algorithms that are commonly used for OCR pre-processing, tuned their parameters, and chose several configurations of each algorithm type i that produced the highest gains, which we denote by the set X_i. In order to comply with the goal of improving average accuracy, finding X_i required solving an optimization problem that can be formulated as follows:

\arg\max_{X_i} \sum_{\mathrm{doc} \in \text{training set}} \mathrm{accuracy}(\mathrm{doc} \mid X_i)

subject to

X_i \subseteq \{\mathrm{Scale}, \mathrm{Binarize}, \mathrm{Denoise}\} \times \text{Algorithm Configurations}, \quad |X_i| \le 3

The limitation to a set of size 3 is imposed for calculation purposes and justified by empirical gain bounds. The approximation of X_i is obtained by trial and error, stopping when the accuracy improvements become negligible.
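The paper approximates this optimum by trial and error; the sketch below assumes a greedy variant that repeatedly adds the configuration with the largest marginal gain in average training-set accuracy until the gain becomes negligible or the size bound is reached. The callable `accuracy_with` is a stand-in for re-running OCR on the training documents with a given configuration set; it is not defined in the paper.

```python
def select_algorithm_set(configurations, accuracy_with, max_size=3, min_gain=0.001):
    """Greedy approximation of argmax_X sum_doc accuracy(doc | X), with |X| <= max_size.

    `configurations` is a list of candidate algorithm/threshold settings;
    `accuracy_with(selected)` must return the average training-set accuracy when
    the selected configurations are made available to the pipeline.
    """
    selected = []
    current = accuracy_with(selected)
    while len(selected) < max_size:
        gains = {cfg: accuracy_with(selected + [cfg]) - current
                 for cfg in configurations if cfg not in selected}
        if not gains:
            break
        best_cfg, best_gain = max(gains.items(), key=lambda kv: kv[1])
        if best_gain < min_gain:       # stop once improvements become negligible
            break
        selected.append(best_cfg)
        current += best_gain
    return selected
```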
Each family of dependent methods or thresholds is implemented in a separate module, resulting in a total of three modules. The following algorithm types and thresholds were applied: (a) Bicubic and K-Nearest-Neighbors scaling methods, with scale factors from 1 to 3 in steps of 0.25; (b) Sauvola and threshold-based binarization algorithms, with thresholds varying from 100 to 250 in steps of 25; (c) image denoising methods, consisting of the three filters enabled in NovoVerus: Mild, Median and None. Based on this close-to-optimal algorithm set, the candidate generation module applies the set to the input image and extracts its text for the evaluation phase.

The evaluation stage assesses the textual output of each image candidate in every module. It is based on a learned linear regression that ranks the candidates according to their expected accuracy. This score does not have to be a normalized accuracy; it acts as a language-model score, assessing which textual output is more probable to occur. As in a typical machine-learning setting, we extract features and train a regression upon them. The regression relies on language features, namely bigram occurrences, as well as on a confidence metric. The latter is calculated from the character-level confidence given by the OCR engine, aggregated to a document-level statistic by averaging over words, namely

\sigma_{\mathrm{doc}} = \frac{\sum_{\mathrm{word} \in \mathrm{doc}} \min_{\mathrm{char} \in \mathrm{word}} \mathrm{conf}_{\mathrm{char}}}{|\mathrm{doc}|_w}
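A direct implementation of this statistic, assuming per-character confidences are available from the engine; the data layout and names are ours.

```python
def document_confidence(words):
    """sigma_doc: the per-word minimum character confidence, averaged over the document.

    `words` is a list of words, each given as a list of (character, confidence)
    pairs as reported by the OCR engine.
    """
    if not words:
        return 0.0
    per_word_minima = [min(conf for _, conf in word) for word in words if word]
    return sum(per_word_minima) / len(per_word_minima)

# Example: a two-word document with per-character confidences.
doc = [[("a", 0.98), ("b", 0.91)], [("c", 0.60), ("d", 0.99)]]
print(document_confidence(doc))  # (0.91 + 0.60) / 2 = 0.755
```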
We train the accuracy-prediction regression on the labeled set, for which we know the real accuracy, and try different feature representations and models to achieve good results. Every image is scored independently of all other images, including its own enhancement candidates. The loss function of the regression model is assessed by the accuracy lost through faulty ranking, which can be formulated as follows:

\arg\min_{\hat{f}} L(f, \hat{f}) = f - \hat{f}

where

f_{\mathrm{doc}} = \max_{\text{image-candidate}} \mathrm{accuracy}(\mathrm{doc} \mid \text{image-candidate}), \quad \hat{f}_{\mathrm{doc}} = \mathrm{accuracy}(\mathrm{doc})

III. TEXT CORRECTION

The OCR error model is vital in suggesting and evaluating candidates. At the heart of the error model is candidate generation for correction, based on a confusion matrix giving conditional probabilities of character-segment edits. The possible error corrections include those within a primitive Levenshtein distance of 1,¹ as well as spacing (word segmentation) errors. We focus the discussion on the word level at a certain position in the text, which is obtained by a standard tokenization of the OCR output text.

The error correction methodology comprises three stages: (1) Correction candidate generation – the original word is expanded by means of a confusion matrix and a dictionary lookup, forming altogether a correction-candidates vector. (2) Candidate evaluation – based on its extracted features, each candidate is scored according to its likely correctness at this position and ranked among the other candidates. (3) Word classification – the most probable word is selected between the original word and the highest-ranked correction candidate.

Correction candidates are generated based on the observed OCR error model, represented by a weighted confusion matrix. This model was built by aligning the ground-truth documents to their respective OCR texts at the word level. For example, Figure 1 shows an excerpt of this representation, which is, as expected, strongly affected by the characters' graphical resemblance. The candidate generation is rule-based: every character segment in a word is looked up in the confusion matrix and replaced by a possible segment correction.

Fig. 1. Excerpt of the confusion matrix for the character raa.

¹ Based on a modified Levenshtein distance, in which further primitive edit operations (character merge and split) are used (also known as 2:1 and 1:2 alignments).
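A simplified sketch of this rule-based expansion, assuming the confusion matrix is stored as a map from an observed character segment to its observed corrections with their counts; only segment substitutions of one or two characters are shown (covering 1:1 edits, 2:1 merges and 1:2 splits), whereas the paper also handles spacing errors. The dictionary filter corresponds to the cleaning step described in the next stage; all names are ours.

```python
def generate_candidates(word, confusion_matrix, dictionary):
    """Expand an OCR word into correction candidates via the confusion matrix.

    `confusion_matrix` maps an observed segment (1-2 characters here) to a dict
    {replacement segment: count}; `dictionary` is the set of valid words used to
    discard non-word candidates before ranking.
    """
    candidates = {}
    for start in range(len(word)):
        for length in (1, 2):                       # segments of one or two characters
            segment = word[start:start + length]
            for replacement, weight in confusion_matrix.get(segment, {}).items():
                candidate = word[:start] + replacement + word[start + length:]
                if candidate in dictionary:
                    candidates[candidate] = max(candidates.get(candidate, 0), weight)
    return candidates                               # candidate -> confusion weight

# Toy Latin-script example (the paper's matrix is over Arabic characters).
matrix = {"rn": {"m": 12}, "l": {"i": 7}}
print(generate_candidates("corn", matrix, {"corn", "com"}))   # {'com': 12}
```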
The candidate evaluation stage produces an ordered vector of correction candidates for each word. This stage does not take into account the original OCR word, which has different features and is considered at a later stage. As a preliminary step, the input vector is cleaned of all non-dictionary words. As the dictionary is based on a large corpus, this procedure has only a negligible deleterious effect, while discarding a considerable number of irrelevant candidates, hence facilitating scoring. Next, the word score is calculated by a trained regression model using the word's features as input: (a) Confusion weight – the weight attribute of the corruption-correction pair in the confusion matrix, which is the number of occurrences of this pair calculated by the noisy channel on the training set. (b) Unigram frequency – the unigram document frequency, providing a thematic-domain and language feature independent of adjacent words or document context. (c) Backward/Forward bigram frequency – the maximal document frequency of the bigram formed by a correction candidate and any candidate at the preceding/following position. This feature is valuable as it captures an intersection of language model and domain context, but is non-existent for many of the bigrams. No subsequent normalization procedure had to be applied in order to linearize the feature effect for later linear regression modeling; in other words, the confusion weight behaves linearly, as do the term-frequency features, which proportionally promote frequent corrections relative to their appearance in a similar corpus.

The regression was trained on the set of erroneous OCR words, comprising words whose corrections the candidate generator is expected to generate. We used the training words to generate their correction-candidate vectors with their extracted features, with the single correct candidate marked with a positive label. Appending these vectors creates a large training set used to fit a regression model that attributes a continuous score to every candidate. The model is used to score correction candidates and to sort them in descending order.
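A sketch of how the per-candidate feature vector might be assembled and ranked. The feature names follow the list above, but the frequency tables, the use of the adjacent OCR words instead of all candidates at adjacent positions, and the choice of scikit-learn's LinearRegression are our simplifications.

```python
from sklearn.linear_model import LinearRegression

def candidate_features(candidate, confusion_weight, prev_word, next_word,
                       unigram_freq, bigram_freq):
    """One feature vector per candidate: confusion weight, unigram document
    frequency, and backward/forward bigram document frequencies."""
    return [
        confusion_weight,
        unigram_freq.get(candidate, 0),
        bigram_freq.get((prev_word, candidate), 0),   # backward bigram
        bigram_freq.get((candidate, next_word), 0),   # forward bigram
    ]

def rank_candidates(candidates, prev_word, next_word, unigram_freq, bigram_freq, model):
    """Score every candidate with the trained regression and sort in descending order."""
    scored = [(model.predict([candidate_features(c, w, prev_word, next_word,
                                                 unigram_freq, bigram_freq)])[0], c)
              for c, w in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Training: every generated candidate of every erroneous training word becomes one
# row; the single correct candidate per word is labelled 1, all others 0.
X_train = [[12, 350, 40, 25], [3, 900, 0, 0], [5, 10, 2, 1]]
y_train = [1, 0, 0]
model = LinearRegression().fit(X_train, y_train)
```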
Subsequently, we train a classifier to decide whether the OCR word should be replaced with its highest-ranked correction candidate. Such a replacement is made in case the candidate is more likely than the original to be the correct word at this position. We refer to the OCR word and its highest-ranked correction candidate as a "correction pair". The detection of an erroneous OCR word in the case of real-word errors, also referred to as false friends, is a difficult task. Such cases are frequent in Arabic due to its morphological richness. Prior work [2] suggests a shallow language model to handle them.

Our classifier relies on the OCR confidence of the word, as well as on the correction pair's proportional language features, derived from those used by the previous classifier, for example the forward-bigram proportion. The proportion representation forms comparative features with a linear sense. A simple smoothing method was used to handle null occurrences.

The correction decision is made by a model trained on the entire corpus, except for words whose correction was not generated. Pairs with an erroneous OCR word and a correct candidate are marked with a positive label, indicating that these cases are suitable for replacement.
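A sketch of the replacement decision, assuming a logistic-regression classifier over correction-pair features; the proportional features with add-one smoothing are our reading of the description above, and all names and numbers are illustrative, not taken from the paper.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(ocr_confidence, ocr_word_feats, candidate_feats, smooth=1.0):
    """Correction-pair features: the OCR word's recognition confidence plus smoothed
    proportions of the candidate's language features (e.g. forward-bigram frequency)
    to the OCR word's, so that null occurrences do not divide by zero."""
    proportions = [(c + smooth) / (o + smooth)
                   for c, o in zip(candidate_feats, ocr_word_feats)]
    return [ocr_confidence] + proportions

# Illustrative training pairs: label 1 means the OCR word was erroneous and its
# top-ranked candidate was the correct spelling (replace); label 0 means keep.
X_train = [
    pair_features(0.45, [2, 0, 0], [310, 55, 40]),
    pair_features(0.97, [420, 80, 66], [3, 0, 0]),
]
y_train = [1, 0]
decision_model = LogisticRegression().fit(X_train, y_train)

def final_word(ocr_word, top_candidate, features):
    """Replace the OCR word only when the classifier accepts the correction pair."""
    return top_candidate if decision_model.predict([features])[0] == 1 else ocr_word
```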

IV. TESTING THE MODEL

The model was trained on 211 image articles, scanned from the Al-Hayat newspaper of 1994. The set includes the articles' ground-truth transcription and the OCR outputs from NovoDynamics NovoVerus. Another 50 labeled articles were set aside as a test set for evaluation, adding up to a total of 22,000 words.² The language model was trained on the large in-domain corpus of the Arabic Gigaword.

² The entire dataset is publicly available for research purposes at https://github.com/idoki/ocr_correction.

A. Image Enhancement

The non-sequential accuracy gains for image enhancement are summarized by algorithm type in Table I, adding up to a total of 5.7% absolute average accuracy improvement when the algorithms are applied sequentially. The accuracy changes caused by the different algorithms have a large variance, induced by several articles with a considerable improvement, while the accuracy of the rest of the corpus remains unaffected or decreases. This phenomenon can be seen in Figure 2. The major achievement of the method is its ability to exploit the potential gain of each algorithm without suffering much from any potential loss. This can be attributed to the regression function, which efficiently predicts the relative accuracy of the different candidates, as demonstrated by its loss-function performance in Table II. This strong result can be attributed to the relative easiness of the task, as it relies solely on the differences between the images induced by the different algorithms, which generally affect only a small subset of words, whose better extraction yields better features.

Fig. 2. Accuracy difference ordered per document, for bicubic scaling (threshold 1.5). As a single algorithm, it had the greatest average positive gain (4.4% in accuracy), namely almost a 30% WER decrease.

TABLE I
ALGORITHM SET SELECTION WITH AVERAGE ACCURACY GAIN

Module   | Algorithm [:Threshold]                | Avg. Gain
Scale    | Default:1, Bicubic:1.5, Bicubic:2.25  | 5.3%
Binarize | Default, Sauvola:170, Sauvola:230     | 1.6%
Denoise  | Default:none, Mild, Median            | 0.4%

TABLE II
PERFORMANCE OF THE IMAGE CANDIDATE CLASSIFIER (SELECTION OF BEST CANDIDATE)

0-1 Loss            | 4/50
Avg. Weighted Loss  | 0.4%

B. Text Correction

The text-correction phase takes the OCR text of the optimal image as input; this does not imply any obligation that the first stage take place before the lexical correction, but rather serves to present the results in a standardized way.

The method is a pipeline of three subsequent modules that can be seen as a funnel, narrowing down from correction-candidate generation, through the ranking of these candidates, to the classifier that decides on the replacement of the original word with the highest-ranked correction candidate. These stages are therefore evaluated independently, allowing a ceiling analysis for ongoing improvements.

An analysis of the error-type distribution on the test set demonstrates that 60% of the erroneous words were retrieved in their correct spelling in the correction-candidate generation process. The non-corrected words either did not belong to the primitive 1-Levenshtein misspellings, or their correction instance did not occur in the training set. This fair result suggests that the OCR errors belong to a wider error set than the one trained on, and can be attributed to random text variability, such as noise or deterioration, or to the existence of low graphical resemblance between large sets of characters. This result sets an upper bound on the correction efficiency, which would be reached only if the subsequent correction tasks, namely ranking and the correction decision, were fully efficient. Improving it could be achieved by enlarging the training set or by generating candidates with additional logic.

The score for each candidate is attributed by a trained logistic regression, yielding the results shown in Figure 3. Calculated over the words that have a valid candidate, the best model is able to find the proper correction within the top 5 proposed candidates for 90% of the words, and as the highest-ranked candidate for 64% of the words. Improving this result may be achieved with a better model, such as a non-linear one, or by expanding the training set in order to strengthen the confusion-weight feature. Another way to overcome this caveat is to take into account more than the top candidate and cancel the next phase; the text output would then contain multiple words at the same position, complying with the goal of improving retrievability of image documents.

Fig. 3. Average recall on erroneous words as a function of the number of correction candidates.
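The recall curve of Figure 3 can be reproduced with a helper of the following kind, assuming access to each erroneous word's ranked candidate list and its ground-truth spelling; this is our illustration, not code from the paper.

```python
def recall_at_k(ranked_candidates, truth_words, k=5):
    """Fraction of erroneous OCR words whose correct spelling appears among the
    top-k ranked correction candidates (the measure plotted in Figure 3)."""
    if not truth_words:
        return 0.0
    hits = sum(1 for cands, truth in zip(ranked_candidates, truth_words)
               if truth in cands[:k])
    return hits / len(truth_words)
```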

Table III reports the decision-model performance over all words in the text. The critical factor at this stage is the false-positive rate, namely rejecting a correct OCR word in favor of its correction candidate: as most OCR words are correct, such rejections would significantly harm the reliability of the method. Therefore, the trained model gives preference to reducing the false-positive rate over reducing the false-negative rate. The main reason for this good result, implying an efficient classification model, is the bigram proportion feature. In case a left or right bigram exists, as occurs in the vast majority of cases for correct words thanks to the large corpus on which our language model is based, the respective feature has a high impact on the classifier and generally leads to a correct decision.

TABLE III
PERFORMANCE OF THE DECISION MODEL FOR WORD CORRECTION

                 | OCR word correct | OCR word incorrect
Reject OCR word  | 2%               | 94%
Accept OCR word  | 98%              | 6%

C. Overall Results

An overall representation of the results over the test set is shown in Figure 4. The baseline OCR text WER on the test collection is 16.5% on average; applying image enhancement reduces it to 10.4%, while applying the text correction on top of that results in a 7.9% WER, reducing the overall error by over 50%. This is a considerable improvement, given that improvement becomes harder as the WER gets lower. This relative accuracy improvement suggests that the method belongs among the state-of-the-art algorithms for OCR correction. A further examination of the uncorrected errors demonstrates that most originate from deteriorated zones or significant inaccuracies in OCR recognition. The rigorous implementation of the image enhancement and lexical correction methods, shown in [3] and other works, such as [4], and especially their combination by machine-learning techniques, brings most of the additive improvement gains suggested in these works.

Fig. 4. Accuracy over the test set.

Image enhancement improves accuracy almost twice as much as lexical correction. That can be explained by the fact that improving the input is generally better than correcting the output, as information is added to the process and exploited in the subsequent OCR tasks. The overall results of the image enhancement indicate that the algorithm families are rather dependent, since the overall accuracy gain is not very close to the sum of the individual modules' gains: 5.7% vs. 7.3%. Nevertheless, as most of the gain comes from articles with an exceptional improvement rather than from an average improvement across the corpus, this accuracy figure should be taken with care, as it is sensitive to the data.

The ceiling analysis for the lexical correction clearly designates the correction-candidate generation as the weak link, due to the fact that it does not handle misspellings outside the primitive 1-Levenshtein set, as well as its relatively low generalization for specific error types, such as spacing errors, missing in total more than 35% of the true candidates in the generation process. Adding other correction methods to the current noisy-channel one, training-based as well as unsupervised methods, would greatly improve the overall process. The ranker could also be improved by working on its accuracy for the highest-ranked candidates, as for 35% of the erroneous words the correction is among the top 5 ranked candidates but does not make it to the top candidate, which is the only one passed on to the subsequent correction decision. The correction decision maker is effective; with its large training set and indicative features, one can expect similar results for different datasets.

V. CONCLUSIONS

This work examined the use of machine-learning techniques for improving OCR accuracy, using a combination of features to enhance an image for OCR and to correct misspelled OCR words. The relative independence of the features, issuing from the language model, the OCR model and the document context, enables a reliable spelling model that can be trained for many languages and domains. The results of the experiment on Arabic OCR text show an improvement in accuracy for every additional feature, implying the superiority of our multi-feature approach over the traditional single-feature approaches that most spelling correction techniques rely on. We can infer from the significance of the bigram features that contextual word completion is a reliable method for a machine, as it is for the human eye. Lastly, we would like to emphasize the similarity between image enhancement and text correction. Even though the two are considered unrelated domains, viz. vision and language, this work and its results demonstrate the mutual significance of the two, and above all the ability to apply a similar correction methodology to both.

The strength of this method is its ability to "squeeze out" the best performance of any out-of-the-box OCR engine. Although new OCR methods based on deep-learning techniques are emerging and starting to be commercialized, taking the standard OCR engines' accuracy a step further, these results compare with state-of-the-art techniques, and a combination of this method with these recent techniques may bring even further improvement.
REFERENCES

[1] V. Märgner and H. El Abed, Eds., Guide to OCR for Arabic Scripts. London: Springer, 2012.
[2] U. Reffle, A. Gotscharek, C. Ringlstetter, and K. Schulz, "Successfully detecting and correcting false friends using channel profiles," International Journal on Document Analysis and Recognition (IJDAR), vol. 12, no. 3, pp. 165–174, 2009.
[3] K. Kukich, "Techniques for automatically correcting words in text," ACM Comput. Surv., vol. 24, no. 4, pp. 377–439, Dec. 1992.
[4] M. Tillenius, "Efficient generation and ranking of spelling error corrections," NADA, Report TRITA-NA-E9621, 1996.