
Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Mika Koistinen, Kimmo Kettunen and Tuula Pääkkönen
National Library of Finland, The Centre for Preservation and Digitisation
[email protected]

Abstract

In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur. Second we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images produced with different image preprocessing methods. Against the commercial ABBYY FineReader toolkit our method achieves a 27.48% (FineReader 7 or 8) and 9.16% (FineReader 11) improvement on word level.

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents

1 Introduction

These days newspapers are published in digital format (born digital). However, historical documents were born before the digital age and need to be scanned first and then extracted as text from the images by using optical character recognition (OCR). Currently the National Library of Finland (NLF) has over 10 million scanned historical newspaper and journal pages, and there is a need to improve the OCR quality, because it affects the usability and information retrieval accuracy for individual users, researchers and companies using the Digi newspaper collection.¹

NLF's current document collection is discussed in more detail in Kettunen et al. (2016). The OCR quality of a historical document collection is usually on the level of 70-80% word accuracy. Tanner et al. (2009) estimated that the British 19th century newspaper collection has 78% word accuracy. According to Järvelin et al. (2015) improved OCR quality would improve the usability and information retrieval results for the Digi collection. Evershed and Fitch (2014) show that an over 10% improvement in OCR word accuracy could improve search recall by over 9 percentage points on OCRed historical English documents. According to Lopresti (2009) better OCR quality would improve the possibilities to utilize other text analysis methods, for example sentence boundary detection, tokenization, and part-of-speech tagging. Also other methods such as named entity recognition, topic modeling and machine translation could benefit from better quality input data.

There are many existing commercial and open source OCR tools available. Open source tools are attractive choices, because they are free and open for improvement. For example Tesseract² is an open source OCR engine that is combined with the Leptonica image processing library. It has models for over 100 languages. Some of the open source OCR tools, e.g. Tesseract, Cuneiform, and OCRopus, are discussed in more detail in (Smitha et al., 2016; Smith, 2007). Of the commercial tools, ABBYY FineReader is one of the best known.

In our work, we develop a Tesseract Finnish Fraktur model using an existing German Fraktur model³ as a starting point. We compare the resulting model with the commercial ABBYY FineReader toolkit. Previously, the OCR quality of Tesseract and ABBYY FineReader has been compared by Heliński et al. (2012) on Polish historical documents. In their experiments, Tesseract outperformed FineReader on good quality pages containing Fraktur fonts, while FineReader performed better on Antiqua fonts and bad quality pages.

In addition to developing a new Finnish Fraktur model, we study the effect on the final OCR accuracy of various preprocessing methods employed to improve the image quality.

¹ digi.kansalliskirjasto.fi
² https://code.google.com/p/tesseract-ocr/
³ https://github.com/paalberti/tesseract-dan-fraktur

Previously these kinds of methods have been shown to yield improvements in the image and/or OCR quality in several works (Smitha et al., 2016; Ganchimeg, 2015; El Harraj and Raissouni, 2015; Wolf et al., 2002; Sauvola and Pietikäinen, 1999).

The rest of the paper is organized as follows. In Section 2, we discuss challenges and methods related to scanned newspaper image quality improvement and Tesseract model teaching. The developed method and its evaluation are then discussed in Sections 3 and 4, respectively. Finally, we present conclusions on the work in Section 5.

2 Challenges

OCRing Finnish historical documents is difficult mainly because of the varying quality of the newspaper images and the lack of a model for Finnish Fraktur. The speed of the OCR algorithm is also important when there is a need to OCR a collection containing millions of documents. Scanned historical document images have noise such as scratches, tears, ink spreading, low contrast, low brightness, skewing etc. Smitha et al. (2016) present that document image quality can be improved by binarization, noise removal, deskewing, and foreground detection. Image processing methods are briefly explained next as background for improving the scanned image quality.

2.1 Improving the image quality by image processing methods

Digital images can be processed either by sliding a rectangular window through the image and modifying its pixel values inside the window (local), or the whole image can be processed at once (global).

Binarization is an image processing method that turns a grayscale image into a binary image. The pixels in the image are transferred to either black (0) or white (255) by a threshold value. Binarization methods according to Sezgin and Sankur (2004) are based on histogram shape-based methods, clustering-based methods such as Otsu (1979), entropy-based, object attribute-based and spatial methods, and local methods such as Niblack (1986) and Sauvola and Pietikäinen (1999). The Tesseract toolkit uses Otsu's algorithm for binarization as a default, which is not always the best method for degraded documents.

Wolf et al. (2002) developed an image binarization algorithm with much better recall and slightly better precision. The method is based on the Niblack and Sauvola algorithms. Niblack's algorithm slides a rectangular window through the image. The threshold T for its center pixel is calculated using the mean (m) and standard deviation (s) of the values inside the window:

T = m + k · s,     (1)

where k is a constant set to 0.2. This method unfortunately tends to create noise in areas where there is no text. To avoid such noise, Sauvola's method includes a hypothesis on the gray values of text and background pixels, modifying the formula into

T = m · (1 − k · (1 − s/R)),     (2)

where R, the dynamic range of the standard deviation, is set to 128. However, this hypothesis does not hold for every image, such as degraded documents with variable contrast. Therefore Wolf et al. changed the formula to normalize the contrast and the mean gray level of the image:

T = m − k · (1 − s/R) · (m − M),     (3)

where R is the maximum standard deviation over all of the windows and M is the minimum gray level of the image.

Smoothing or blurring is a method to attenuate the most frequent image pixels. It is typically used to reduce image noise. Gonzales and Woods (2002, p. 119-125) present two smoothing windows (Figure 1), where the former takes the plain average of the gray levels and the latter a weighted average (the window being a rough approximation of a Gaussian function). These windows can be slid through the image, and the output is a smoothed image.

Figure 1: Smoothing windows

Ganchimeg (2015) presents the history of document preprocessing methods for noise removal and binarization. These methods can be based on thresholding, fuzzy methods, histograms, morphology, genetic algorithms, and marginal noise removal.
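As a concrete illustration of the local thresholds in Equations 1 and 2 above, the sketch below computes Niblack and Sauvola binarization with per-pixel window statistics. It is a minimal sketch assuming NumPy and SciPy; the window size is an arbitrary choice here, not a value from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(gray, window=25):
    """Local mean and standard deviation over a square sliding window."""
    gray = gray.astype(np.float64)
    mean = uniform_filter(gray, size=window)
    sq_mean = uniform_filter(gray * gray, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    return mean, std

def niblack_binarize(gray, window=25, k=0.2):
    """Equation 1: T = m + k * s, evaluated for every pixel."""
    m, s = local_stats(gray, window)
    threshold = m + k * s
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

def sauvola_binarize(gray, window=25, k=0.2, R=128.0):
    """Equation 2: T = m * (1 - k * (1 - s / R))."""
    m, s = local_stats(gray, window)
    threshold = m * (1.0 - k * (1.0 - s / R))
    return np.where(gray > threshold, 255, 0).astype(np.uint8)
```

The WolfJolion variant (Equation 3) additionally needs the global minimum gray level M and the maximum standard deviation over all windows, both of which can be derived from the same local statistics.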

The filtering techniques Gaussian, Sharpen, and Mean are compared, and it is noted that Gaussian blur is best for creating clear and smooth images. Droettboom (2003) mentions the problem of broken and touching characters in recognizing older documents, and proposes a broken character connection algorithm using k-nearest neighbours, which is able to find and join 91% of the broken characters. The original, blurred and some binarized versions of four different images can be seen below in Figure 2.

Figure 2: a) Grayscale (left), b) Gaussian blur, c) Niblack, d) Sauvola, e) WolfJolion, f) Howe

An image histogram presents the image grayscale values (x) between 0-255 and their frequencies (y) over all of the image pixels, where value 0 is black and 255 is white. It is a common way to visualize the image contents. According to Krutsch and Tenorio (2011) histogram equalization methods include histogram expansion, local area histogram equalization, cumulative histogram equalization, par sectioning and odd sectioning. These methods try to improve the dynamic range of the image. High dynamic range means the same as high contrast; typically images with low contrast are visually very dull and grayish.

Linear normalization is also called contrast stretching or histogram expansion. Equation 4 below shows a linear transformation function, which is utilized to map the original image's pixel values to a broader range:

y = y_max · (x − x_min) / (x_max − x_min)     (4)

Histogram equalization is another, more advanced method for enhancing image contrast. It differs from linear transformation by using statistical probability distributions instead of linear functions. The image pixel probability density function (PDF) and cumulative density function (CDF) are utilized to calculate the new image gray levels. However, global methods had problems with varying intensities inside the image. Thus Adaptive Histogram Equalization (AHE) was developed. It partitions the image, usually into 8 x 8 windows, and a sub-histogram is used to equalize each window separately; still, AHE created problematic artifacts inside regions that contained noise but no text.

Pizer et al. (1990) developed Contrast Limited Adaptive Histogram Equalization (CLAHE), which limits the error in AHE by limiting the slope of the transform function: it clips the top part of the histogram (by a predefined clip limit value) before calculating the CDF for each of the images in the sliding window (see Figure 3). The clipping reduces the over-amplification AHE had. The resulting image depends on the window size and the clip limit, which are selected by the user.

Figure 3: CLAHE histogram clipping from Stanhope (2016)

Below in Figure 4 the results of the contrast improvement methods are shown. The CLAHE histogram has clearly been equalized more than that of the linearly normalized image.

Figure 4: Original, Linear Normalized and Contrast Limited Adaptive Histogram Equalized image and their histograms
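The contrast enhancement methods above can be reproduced with standard OpenCV calls, as in the short sketch below: linear normalization (Equation 4), global histogram equalization, and CLAHE with an 8 x 8 tile grid. The file name and the clip limit value are placeholders, not parameters from the paper.

```python
import cv2

# Read a scanned page as a grayscale image (path is a placeholder).
gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Linear normalization (contrast stretching), Equation 4:
# pixel values are mapped linearly onto the full 0-255 range.
stretched = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)

# Global histogram equalization based on the image CDF.
equalized = cv2.equalizeHist(gray)

# Contrast Limited Adaptive Histogram Equalization (CLAHE):
# per-tile equalization with the histogram clipped at clipLimit.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_img = clahe.apply(gray)

cv2.imwrite("page_clahe.png", clahe_img)
```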

Skewing can also be a problem, and it can be solved by skew detection and deskewing. According to Parashar and Sogi (2012), common methods for skew detection are the Hough transform, projection profile and principal component analysis. Skew can be single or multiple: in multiple skew, different parts of the document are skewed at different angles, while in single skew the image has been skewed to one angle only. We are dealing with multiple skew documents. More information about deskewing can be found in (Smith, 1995; Makkar and Singh, 2012). Our method does not use deskewing so far, but it could be included to possibly improve the accuracy.

Multiple other image processing methods exist, too. Image processing methods are widely researched for the use of historical document processing. For example, the annual DIBCO competition at the International Conference on Document Analysis and Recognition offers new methods for historical document image processing (Pratikakis et al., 2013; Ntirogiannis et al., 2014; Howe, 2013).

3 Method

We created a model for Finnish Fraktur character recognition for Tesseract. After that we ran Tesseract OCR with different image preprocessing methods and chose the best result by comparing the average confidence measures of the documents. We used the hOCR format⁸, an open standard for presenting OCR results, which has a confidence value for each word given by the used OCR tool. The average of Tesseract's word confidences in a document is used in this method.

Creation of the Finnish Fraktur model was started by using the German Fraktur model as a baseline. The Fraktur model was iteratively improved: the characters that had the most errors were corrected in the training data boxes (letters and two letter combinations). Tesseract is then run 1 to N times with the developed Finnish Fraktur model and the already existing Finnish Antiqua model⁹ in dual model mode, where it selects the best choice from the Fraktur and Antiqua results.

The images passed into Tesseract OCR on each run are either unprocessed or processed by some image processing method. In the case of one run, that run is selected as the final document; in the case of 2-N runs, the final document is selected by the highest document confidence measure value. Our proposed solution for preprocessing and OCRing using the obtained Finnish Fraktur model can be seen below in Figure 5. Tesseract has built-in Otsu binarization, so "no processing" actually means running Otsu's algorithm before OCR; Otsu is also run inside Tesseract before OCR for all the other methods.

Figure 5: Proposed OCR method (original document image → no processing / image processing 1 / image processing 2 → Tesseract OCR → keep the best document confidence result)

4 Experiments

4.1 Data

For training the Finnish Fraktur model we used 72 images. The images contained 34 newspaper images from our newspaper collection and 38 synthetically created images with texts in various Finnish Fraktur fonts. The documents were manually boxed by using a box editor¹⁰. The training data contained 184,852 characters and two-character combinations. The testing data was a proofread set of 462 scanned newspaper pages. It contained 530,579 word tokens and 4,175,710 characters. 434 pages were Fraktur, and the rest partly Fraktur and/or Antiqua pages.

⁸ https://kba.github.io/hocr-spec/1.2/
⁹ https://github.com/tesseract-ocr/langdata/tree/master/fin
¹⁰ https://github.com/zdenop/qt-box-editor
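Before turning to the evaluation measures, the sketch below illustrates the selection step of Figure 5: each candidate preprocessing of a page is OCRed and the variant with the highest average word confidence is kept. It is only a sketch under several assumptions: it uses pytesseract's image_to_data confidences rather than the hOCR x_wconf values used in the paper, the language code fin_frak for the custom Fraktur model is hypothetical, and the preprocessing variants shown are a subset of those evaluated.

```python
import cv2
import pytesseract
from pytesseract import Output

def avg_word_confidence(image, lang="fin_frak+fin"):
    """OCR one image and return the mean word confidence reported by Tesseract."""
    data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0

def ocr_best_variant(path):
    """Run OCR on several preprocessed variants and keep the most confident one."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    variants = {
        "original": gray,                           # Tesseract applies Otsu internally
        "blur": cv2.GaussianBlur(gray, (5, 5), 0),  # Gaussian smoothing
        "clahe": clahe.apply(gray),                 # contrast enhancement
    }
    scores = {name: avg_word_confidence(img) for name, img in variants.items()}
    best = max(scores, key=scores.get)
    text = pytesseract.image_to_string(variants[best], lang="fin_frak+fin")
    return best, text
```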

4.2 Accuracy measure

As accuracy measures we use the word error rate (WER) and the character error rate (CER)¹¹. They are defined in Equations 5 and 6:

CER = (i + s + d) / n,     (5)

where n is the total number of characters in the reference text, and i, s and d are the minimal numbers of insertions, substitutions and deletions on the character level needed to obtain the reference text.

WER = (i_w + s_w + d_w) / n_w,     (6)

where n_w is the total number of words in the reference text, and i_w, s_w and d_w are the minimal numbers of insertions, substitutions and deletions on the word level needed to obtain the reference text. Smaller WER and CER mean better quality for words and characters, respectively. For the WER results the Impact Centre's accuracy evaluation tool developed by Carrasco (2014) was utilized in this research. The Fraktur model CER results were computed by utilizing the Information Science Research Institute's OCR accuracy tool developed by Rice and Nartker (1996).

4.3 Results

WER results for the different runs of our method can be seen below in Table 1. Results of ABBYY FineReader 11 and FineReader 7 or 8 are shown as a baseline; our method with different image processing methods and an Oracle are given as comparison. In our method, Tesseract was run with two models (Fraktur and Antiqua), where Tesseract selects the better one as a result. The Oracle is the best result of the combined 2-4 images based on the resulting WER.

Method                                    Result   Oracle
ABBYY FineReader 11                        20.94    N/A
ABBYY FineReader 7 or 8                    26.23    N/A
Tesseract (fi frak mk41+fin)
  original (otsu)                          23.32    N/A
  gaussian blur                            24.24    N/A
  sauvola                                  39.49    N/A
  wolf                                     22.67    N/A
  clahe                                    29.19    N/A
  l.norm                                   23.36    N/A
  clahe+wolf                               32.67    N/A
  l.norm+wolf                              22.76    N/A
Combined 2
  l.norm+wolf, clahe+wolf                  20.30    19.42
  clahe+wolf, orig                         20.40    19.30
  clahe+wolf, blur                         20.58    19.44
  clahe+wolf, wolf                         19.98    19.19
Combined 3
  clahe+wolf, wolf, blur                   19.58    17.53
  l.norm+wolf, clahe+wolf, blur            19.68    17.57
  l.norm+wolf, clahe, orig                 19.69    17.99
  l.norm, clahe+wolf, wolf                 19.14    17.69
Combined 4
  l.norm, clahe+wolf, orig, blur           19.11    17.61
  l.norm, clahe+wolf, orig, wolf           19.32    16.94
  l.norm+wolf, clahe+wolf, orig, blur      19.41    16.94
  l.norm+wolf, clahe+wolf, orig, wolf      19.02    17.50

Table 1: WER results, the best of each 1-4 runs in bold (smaller is better)

CER results for the Fraktur model on the original images can be seen below in Figures 6 and 7. The figures present the most frequent character errors and their correctness percentages for the Fraktur model, respectively.

Figure 6: Most frequent character errors

Figure 7: Correctness percentage

4.4 Execution times

Tesseract was run on IT Centre for Science (CSC)¹² machines with an 8-core Intel 2.6 GHz CPU and 2 GB RAM in a Red Hat Enterprise Linux 7 (64-bit) virtual environment, in batches of 8 documents with GNU Parallel, developed by Tange (2011). ABBYY FineReader 11 was run on a 1-core Intel 2.6 GHz CPU with 4 GB RAM in a Windows Server (32-bit) virtual environment. Decoding speeds for Tesseract and ABBYY FineReader were 722,323 tokens/hour and 30,626 tokens/hour, respectively. Re-OCRing millions of pages is feasible using multiple CSC machines in parallel.

¹¹ http://www.digitisation.eu/training/succeed-training-materials/ocr-evaluation/ocrevaluation/measuring-ocr-quality/computing-error-rates/
¹² https://www.csc.fi
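To make Equations 5 and 6 concrete, the sketch below computes CER and WER with a standard Levenshtein alignment (insertions, substitutions and deletions against the reference). It is only an illustration; the reported numbers were produced with the Rice and Nartker (1996) and Carrasco (2014) tools mentioned above.

```python
def edit_distance(ref, hyp):
    """Minimal number of insertions, substitutions and deletions (Levenshtein)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    d = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        d[r][0] = r
    for c in range(cols):
        d[0][c] = c
    for r in range(1, rows):
        for c in range(1, cols):
            cost = 0 if ref[r - 1] == hyp[c - 1] else 1
            d[r][c] = min(d[r - 1][c] + 1,        # deletion
                          d[r][c - 1] + 1,        # insertion
                          d[r - 1][c - 1] + cost) # substitution
    return d[rows - 1][cols - 1]

def cer(reference, hypothesis):
    """Equation 5: character-level errors divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Equation 6: word-level errors divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

print(wer("sanomalehti on vanha", "sanomalehti an vanha"))  # 1 error / 3 words
```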

5 Conclusions

We have described a method for improving the Tesseract OCR toolkit for Finnish historical documents. In the tests, Tesseract was run in dual model mode using the created Finnish Fraktur model and the Finnish Antiqua model, selecting the best document by its confidence value with different 2-4 image combinations. This method has clearly outperformed the ABBYY FineReader results.

The best result was achieved by combining four preprocessing variants (Linear Normalization + WolfJolion, Contrast Limited Adaptive Histogram Equalization + WolfJolion, the original image, and WolfJolion), which improves the word level quality of the OCR by 1.91 percentage points (9.16%) against the best FineReader 11 result and by 7.21 percentage points (27.48%) against FineReader 7 and 8. It is good that the FineReader 11 results have clearly improved over the earlier FineReader results. However, our current whole collection has been run only on FineReader 7 and 8, and FineReader 11 is not feasible to run for the whole collection due to the current licencing policy. Therefore, our method would correct about 84.6 million words (27.48%) more in the current 1.06 million page Finnish newspaper collection (containing Finnish language). However, it can still be improved: the method is 2.08 percentage points away from the optimal Oracle result (16.94).

The character accuracy results for the Fraktur model show that the characters u, m and w have under 80 percent correctness. These letters overlap with letters such as n and i. It seems, however, that if the accuracy for one of them is increased, the accuracy of the others decreases. Possibly also the letter ä could be improved, though it has similar overlap with the letters a and å. Of the 20 most frequent errors only five are under 80% correct. Still, the Fraktur model could be developed further, possibly to recognize also bold letters.

Tesseract's document confidence value can be used to find the weak quality documents for further processing. However, it is not a perfect measure when comparing and/or combining results from other models. The confidence measure could possibly be improved by integrating it with a morphological tool that checks words after OCR and then weights the confidence measure for each word. The image quality is one of the most important factors in the recognition accuracy, so further research with image processing algorithms should continue. In addition to utilizing the confidence measure value, methods to determine the noise level in the image could possibly be utilized to choose only bad quality images for further preprocessing.

Acknowledgments

This research was supported by the European Regional Development Fund (ERDF), the University of Helsinki, Mikkeli University Consortium and the City of Mikkeli (Project Digitalia).

References

R. C. Carrasco. 2014. An open-source OCR evaluation tool. In DATeCH '14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pages 179–184.

M. Droettboom. 2003. Correcting broken characters in the recognition of historical documents. In JCDL '03 Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, pages 364–366.

A. El Harraj and N. Raissouni. 2015. OCR accuracy improvement on document images through a novel pre-processing approach. In Signal & Image Processing: An International Journal (SIPIJ), volume 6, pages 114–133.

J. Evershed and K. Fitch. 2014. Correcting Noisy OCR: Context beats Confusion. In DATeCH '14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pages 45–51.

G. Ganchimeg. 2015. History document image background noise and removal methods. In International Journal of Knowledge Content Development & Technology, volume 5, pages 11–24.

R. C. Gonzales and R. E. Woods. 2002. Digital Image Processing. Prentice-Hall.

M. Heliński, M. Kmieciak, and T. Parkoła. 2012. Report on the comparison of Tesseract and ABBYY FineReader OCR engines. Technical report, Poznań Supercomputing and Networking Center, Poland.

N. Howe. 2013. Document Binarization with Automatic Parameter Tuning. In International Journal on Document Analysis and Recognition, volume 16, pages 247–258.

A. Järvelin, H. Keskustalo, E. Sormunen, M. Saastamoinen, and K. Kettunen. 2015. Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. In Journal of the Association for Information Science and Technology, volume 67.

K. Kettunen, T. Pääkkönen, and M. Koistinen. 2016. Between diachrony and synchrony: evaluation of lexical quality of a digitized historical Finnish newspaper collection with morphological analyzers. In Baltic HLT 2016, volume 289, pages 122–129.

R. Krutsch and D. Tenorio. 2011. Histogram Equalization, Application Note. Technical report.

D. Lopresti. 2009. Optical character recognition errors and their effects on natural language processing. In International Journal on Document Analysis and Recognition, volume 12, pages 141–151.

N. Makkar and S. Singh. 2012. A Brief Tour to Various Skew Detection and Correction Techniques. In International Journal for Science and Emerging Technologies with Latest Trend, volume 4, pages 54–58.

W. Niblack. 1986. An Introduction to Image Processing. Prentice-Hall, Englewood Cliffs, NJ.

K. Ntirogiannis, B. Gatos, and I. Pratikakis. 2014. ICFHR2014 Competition on Handwritten Document Image Binarization (H-DIBCO 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 809–813.

N. Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. In IEEE Transactions on Systems, Man and Cybernetics, volume SMC-9, pages 62–66.

S. Parashar and S. Sogi. 2012. Finding skewness and deskewing scanned document. 3(4):1619–1924.

S. M. Pizer, R. E. Johnston, J. P. Ericksen, B. C. Yankaskas, and K. E. Muller. 1990. Contrast Limited Histogram Equalization: Speed and Effectiveness.

I. Pratikakis, B. Gatos, and K. Ntirogiannis. 2013. ICDAR 2013 Document Image Binarization Contest (DIBCO 2013). In 2013 12th International Conference on Document Analysis and Recognition, pages 1471–1476.

S. V. Rice and T. A. Nartker. 1996. The ISRI Analytic Tools for OCR Evaluation, Version 5.1. Technical report, Information Science Research Institute (ISRI).

J. Sauvola and M. Pietikäinen. 1999. Adaptive Document Image Binarization. In Pattern Recognition, volume 33, pages 225–236.

M. Sezgin and B. Sankur. 2004. Survey over image thresholding techniques and quantitative performance evaluation.

R. Smith. 1995. A Simple and Efficient Skew Detection Algorithm via Text Row Accumulation. In Proceedings 3rd ICDAR '95, IEEE, pages 1145–1148.

R. Smith. 2007. An Overview of the Tesseract OCR Engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), IEEE, pages 629–633.

M. L. Smitha, P. J. Antony, and D. N. Sachin. 2016. Document Image Analysis Using ImageMagick and Tesseract-OCR. In International Advanced Research Journal in Science, Engineering and Technology (IARJSET), volume 3, pages 108–112.

T. Stanhope. 2016. Applications of Low-Cost Computer Vision for Agricultural Implement Feedback and Control.

O. Tange. 2011. GNU Parallel - The Command-Line Power Tool. In The USENIX Magazine, pages 42–47.

S. Tanner, T. Muñoz, and P. Hemy Ros. 2009. Measuring Mass Text Digitization Quality and Usefulness. Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive. 15(7/8).

C. Wolf, J. Jolion, and F. Chassaing. 2002. Text Localization, Enhancement and Binarization in Multimedia Documents. In Proceedings of the International Conference on Pattern Recognition (ICPR), volume 4, pages 1037–1040, Quebec City, Canada.
