
Glyph Extraction from Historic Document Images

Lothar Meyer-Lerbs, Arne Schuldt, Björn Gottfried
[email protected], [email protected], [email protected]
Center for Computing and Communication Technologies (TZI)
University of Bremen, Am Fallturm 1, 28359 Bremen, Germany

ABSTRACT

This paper is about the reproduction of ancient texts with vectorised fonts. While for OCR only recognition rates count, a reproduction process does not necessarily require the recognition of characters. Our system aims at extracting all characters from printed historic documents without employing knowledge of the language, font, or writing system. It searches for the best prototypes and creates a document-specific font from these glyphs. To reach this goal, many common OCR preprocessing steps are no longer adequate. We describe the necessary changes to our system, which deals particularly with documents typeset in Fraktur. On the one hand, algorithms are described that extract glyphs accurately for the purpose of precise reproduction. On the other hand, classification results of extracted Fraktur glyphs are presented for different shape descriptors.

Categories and Subject Descriptors

I.4.3 [Image Processing and Computer Vision]: Enhancement; I.5.3 [Pattern Recognition]: Clustering

General Terms

Algorithms, Experimentation

Keywords

Image enhancement, glyph extraction, document-specific font, glyph shape, glyph classification

1. INTRODUCTION

Old books display a huge variety of degradations, layouts, and typographic styles. They use nowadays unfamiliar ligatures, abbreviations, and special printer marks typeset in fonts that spur the field of paleography to decipher the ancient writing — we can hardly hope for OCR of these documents that, in addition, have all kinds of spelling variations that can not be found in current dictionaries.

That is why we adopt the approach of [11] and focus on the extraction of glyphs from historic printed document images. These will be turned into vectorised fonts, using a document-specific encoding. The vectorisation will be based on grey-scale prototype glyphs, clustered to favor new prototypes over possible mixing of different glyphs. The best glyphs will be averaged into prototypes which replace all degraded versions. This results in improved compression [3] and sets it apart from OCR.

The paper is organised as follows. Section 2 discusses common preprocessing steps and describes our changes to extract better glyphs. Section 3 presents first classification results of the clustering of binarised glyphs to select suitable features. Conclusions are drawn in Section 4.

2. GLYPH EXTRACTION VS. OCR

We wish to extract document-specific fonts from historic document images. These fonts need size and style information as well as subtle details to be successful. In contrast to the basic goal of OCR — identification of all glyphs — which would allow the construction of fonts usable with current computer systems, our prototype system ‘Venod’ will assign codepoints to prototype glyphs from the Unicode ‘private use area’ and encode the generated fonts with unidentified glyphs. This allows the reflowing of text and high-speed text searches from examples – in essence a fast form of spotting.

Legible and beautiful fonts require the extraction of many significant details from document images. Therefore, simple binarisation does not suffice. Following the binarisation introduced by [1], we compare and select appropriate preprocessing steps. Their binarisation proceeds from an input grey-scale source image via

1. a Wiener filter to a denoised grey-scale image I,

2. adaptive thresholding by Sauvola’s local binarisation to a black and white segmentation image S,

3. interpolation of the non-text part to a grey-scale background surface B,
4. final thresholding by examining pixel contrast to a b/w image T, which might be upsampled during its creation and is then called U here,

5. postprocessing by shrink and swell filtering to the final binarised (upsampled) image, called F here.

We follow these steps one by one and compare possibilities and changes on the way; a simplified sketch of the whole chain is given below.
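As an illustration, the chain can be sketched in a few lines of Python. This is a hedged sketch only: it assumes NumPy, SciPy and scikit-image, and the background surface (step 3) and contrast threshold (step 4) are crude stand-ins rather than the actual procedures of [1].

```python
import numpy as np
from scipy.signal import wiener
from scipy.ndimage import grey_dilation, binary_erosion, binary_dilation, zoom
from skimage.filters import threshold_sauvola

def binarisation_chain(src, window=41, k=0.2, upsample=2):
    # 1. Wiener filter -> denoised grey-scale image I
    I = wiener(src.astype(float), mysize=3)
    # 2. Sauvola's locally adaptive threshold -> segmentation image S (True = ink)
    S = I < threshold_sauvola(I, window_size=window, k=k)
    # 3. background surface B: spread the surrounding background over text regions
    #    (grey-scale dilation as a simple stand-in for the interpolation in [1])
    B = np.where(S, grey_dilation(np.where(S, 0.0, I), size=(window, window)), I)
    # 4. final threshold on the contrast between B and I -> image T, upsampled to U
    d = B - I
    T = d > 0.5 * d[S].mean()
    U = zoom(T.astype(float), upsample, order=1) > 0.5
    # 5. shrink and swell postprocessing -> final binarised image F
    F = binary_dilation(binary_erosion(U))
    return F
```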

[Figure 1 row labels: Original; Wiener 3 × 3; Wiener 5 × 5; Diffusion 5 × 5; Sigma 5 × 5 (∆=20, K=2).]

[Figure 2 row labels: Otsu; Sauvola k = 0.2; Dilated(Sauvola(k = 0.2, Wiener)); Sauvola k = 0.35; Enlarged CCs(Sauvola(k = 0.35, Sigma)).]
Figure 1: Noise filtering results after processing the original image; each row presents the results of a different noise filter and its window size.

Figure 2: Binarised versions (top four rows); bottom row: enlarged (up to distance 2) connected components of S, cut from I – our intended glyph vectorisation source.

2.1 Denoising

In their approach, [1] always use a Wiener filter [6, 8] as the first preprocessing step to denoise the input image. This will introduce Gaussian blur and diminish font details. Instead, edge-preserving noise filters [5] like the Sigma filter by [7] or the anisotropic diffusion filters introduced by [10] promise better results. Since diffusion filters introduce Gaussian blur at some iteration, our experiments lead us to use just four iterations with decreasing parameters k = 32, 24, 16, 8, corresponding to the gradient differences influencing the filter.

The denoising results shown in Fig. 1 recommend the Sigma filter for our purpose. It keeps all glyph edges and eliminates the background noise.
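For illustration, here is a minimal sketch of Lee's sigma filter [7] with the 5 × 5 window and ∆ = 20 shown in Fig. 1; the K fallback of the original filter (K = 2 in Fig. 1) is omitted, so this is a simplified variant rather than the exact filter.

```python
import numpy as np

def sigma_filter(img, window=5, delta=20):
    # replace each pixel by the mean of those neighbours in the window whose
    # intensity lies within +/- delta of the centre value (edge preserving)
    img = img.astype(float)
    pad = window // 2
    padded = np.pad(img, pad, mode='reflect')
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            patch = padded[y:y + window, x:x + window]
            close = np.abs(patch - img[y, x]) <= delta  # centre pixel always qualifies
            out[y, x] = patch[close].mean()
    return out
```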
2.2 Preliminary Binarisation

A first estimation of glyph positions is given by a rough binarisation. Therefore, several algorithms and their parameterisations are compared; Fig. 2 shows some examples. The results indicate that global methods like Otsu's [9] have significant problems with noisy, unevenly illuminated images and need to be replaced by locally adaptive methods like the one by [13], which can be implemented to run fast with any window size [15]. The window size should include more than one glyph and was set to 40 × 40 in the experiments presented here.

2.3 Background Estimation

The approach by [1] interpolates the gaps left after removing the foreground pixels by averaging a rectangular background area, skipping foreground pixels along the way. This tends to leave a dark one-pixel halo around filled areas. We found out experimentally that by dilating the foreground image with a 3 × 3 black structure element, most of these artefacts can be removed (Fig. 4).

Since dilation will connect previously unconnected components and their later separation is prone to errors, we devised an alternative: find all 8-connected components in S, label each one and create a Euclidean Distance Map (EDM) [12] around them. Thresholding this map at distance 2 allows us to enlarge every component by up to two pixels unless it overlaps other components; see the bottom row of Fig. 2 (a sketch of this enlargement is given below). To test a different background filling method, we chose bilinear interpolation to fill the gaps. Our results indicate that the filling procedure is less important than halo removal (bottom rows of Fig. 4).

In [4], glyph edges are modeled with an exponential decay function that usually decays within two pixels from the edge. As our results show (Fig. 4 and bottom row of Fig. 2), all relevant boundary information is preserved. Using this enlarged component mask for every component, we have also found grey-scale glyph components with sufficiently many border pixels for grey-scale vectorisation.
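A minimal sketch of this enlargement, assuming scipy.ndimage: every background pixel within the distance threshold is assigned to its nearest component, which keeps the grown regions from overlapping.

```python
import numpy as np
from scipy import ndimage

def enlarge_components(S, max_dist=2):
    # S: boolean segmentation image (True = ink)
    # 8-connected component labels of S
    labels, _ = ndimage.label(S, structure=np.ones((3, 3)))
    # Euclidean distance of every pixel to the nearest foreground pixel,
    # plus the index of that nearest pixel (used to inherit its label)
    dist, (iy, ix) = ndimage.distance_transform_edt(~S, return_indices=True)
    grown = labels[iy, ix]        # each pixel takes the label of its nearest component
    grown[dist > max_dist] = 0    # but only within max_dist pixels of the original mask
    return grown                  # label image of enlarged, non-overlapping components
```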

2.4 Final Thresholding

Most grey-scale images will contain enough information for an upsampling in both directions by a factor of 2; see the top row of Fig. 3. Depending on the intended use, this step might improve later results or OCR. Since we intend to cluster the grey-scale glyphs, upsampling should be done after the prototypes have been found.

2.5 Postprocessing

A shrink filter to eliminate noise from the background is employed by [1]. With the suggested parameters, it would remove some isolated foreground pixels, but our experiments show no effect on the given test material. Then, a first swell filter, supposed to ‘fill possible breaks, gaps or holes in the foreground’, and a second one ‘used to improve the quality of the character strokes’, is applied by [1].

From our results (see the bottom rows of Fig. 3), we conclude that no swell filtering should be applied when glyphs need to be extracted. The entire postprocessing step is therefore superfluous for our application.
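To make the discussion concrete, here is a rough, window-based reading of shrink and swell filtering; the actual filters of [1] use specific window sizes and pixel-count thresholds, so the parameters below are illustrative assumptions only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def shrink(mask, window=3, min_fraction=0.3):
    # mask: boolean image (True = ink); remove foreground pixels with too few
    # foreground neighbours, i.e. isolated background noise
    density = uniform_filter(mask.astype(float), size=window)
    return mask & (density >= min_fraction)

def swell(mask, window=3, min_fraction=0.5):
    # add background pixels surrounded by enough foreground (fills breaks and holes)
    density = uniform_filter(mask.astype(float), size=window)
    return mask | (density >= min_fraction)
```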

[Figure 3 column labels: Sauvola k = 0.2; Sauvola k = 0.35; Sauvola k = 0.2 + dilate.]

Figure 3: Binarised upsampled difference to the interpolated background, and two swell filter applications for the final result. Top row: upsampled binarisation without postprocessing, U; middle row: first swell filter applied; bottom row: second swell filter applied, F. First column based on the original process, second column with a differing Sauvola binarisation, and third column with a dilation added after the initial binarisation. The top right result shows the fewest broken characters.

[Figure 4 row labels: Rectangular average; Dilated and rectangular average; Enlarged CCs bilinear.]

Figure 4: Interpolated background images B. Top row: I without S, interpolated from a rectangular area, leaving a halo; middle row: using a dilated version of S, eliminating the halo; bottom row: I without the enlarged connected components from S after bilinear interpolation, also halo-free.

3. GLYPH CLASSIFICATION

While many different classification methods exist, a fundamental constraint is efficiency. This is particularly important when thinking about mass digitisation workflows of which the presented classification would be an essential part. Suitable features for classifying binarised glyphs should therefore enable fast comparisons. This is hardly possible by defining complex templates that describe shapes in a sophisticated and detailed way. By contrast, the most compact features characterise shapes by means of single numeric values. Textbook examples include the compactness of a glyph, its radius ratio, aspect ratio, and convexity. Comparisons based on such features lead to a constant runtime complexity, since they describe shapes independently of the number of components, which might either be contour points or all points contained within a shape. It is therefore worthwhile to investigate whether such features lead to sufficient classification precision. In addition, the Hu moments [2] are another established and effective method; they are equally compact as the aforementioned features. What all those features have in common is that they are invariant with respect to translation, scale, and rotation.
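The following sketch computes such compact descriptors for a single binarised glyph with scikit-image; the radius ratio is approximated here from the largest inscribed and circumscribed radii, which is our own simplification rather than a definition taken from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.measure import label, regionprops

def glyph_features(mask):
    # mask: boolean image of one glyph (True = ink)
    props = regionprops(label(mask))[0]
    compactness = 4 * np.pi * props.area / props.perimeter ** 2
    minr, minc, maxr, maxc = props.bbox
    aspect_ratio = (maxc - minc) / (maxr - minr)
    convexity = props.solidity                      # area / convex hull area
    r_in = distance_transform_edt(mask).max()       # largest inscribed radius
    cy, cx = props.centroid
    ys, xs = np.nonzero(mask)
    r_out = np.hypot(ys - cy, xs - cx).max()        # circumscribed radius
    radius_ratio = r_in / r_out
    # four numeric values plus the seven Hu moments [2]
    return np.hstack([[compactness, aspect_ratio, convexity, radius_ratio],
                      props.moments_hu])
```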
3.1 The Scope Histogram

In order to improve the performance of those established features, further shape characteristics of the same runtime complexity are investigated that would complement the other features. This is guaranteed by introducing a new system of shape properties as follows. Instead of using contour points, shapes are approximated by straight segments, which frequently represent a shape much more compactly, since many glyphs contain straight segments: just look at the letters of the present text as well as those contained in the figures. Then, shapes are described with respect to single glyph segments: the shape of a glyph extends over a specific range of each segment. This range is referred to as a segment's scope and can be succinctly described as being left of a segment, right of it, on top, below, and by some further directions which can be combined into many different scopes. Counting their frequencies for each segment of a glyph, the scope histogram [14] is obtained. It describes the shape of a glyph in a significantly different way than conventional methods; this is the reason why it should be analysed whether classification results can be improved when adding scope histograms to Hu moments and the other features.
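As a toy illustration of the idea only — not the scope histogram of [14], whose direction system is considerably richer — the sketch below approximates the contour by straight segments and counts on which side of each segment the remaining segments lie.

```python
import numpy as np
from skimage.measure import approximate_polygon, find_contours

def toy_scope_histogram(mask, tolerance=1.5):
    # longest outer contour, approximated by straight segments
    contour = max(find_contours(mask.astype(float), 0.5), key=len)
    poly = approximate_polygon(contour, tolerance)
    mids = 0.5 * (poly[:-1] + poly[1:])              # segment midpoints
    hist = np.zeros(4)                               # left / right / ahead / behind
    for i in range(len(poly) - 1):
        a, b = poly[i], poly[i + 1]
        d = b - a
        others = np.delete(mids, i, axis=0) - a
        cross = d[0] * others[:, 1] - d[1] * others[:, 0]   # which side of the segment
        proj = others @ d                                    # position along the segment
        hist += [np.sum(cross > 0), np.sum(cross < 0),
                 np.sum(proj > d @ d), np.sum(proj < 0)]
    return hist / max(hist.sum(), 1)                 # frequencies of the four relations
```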

3.2 Evaluation Experiment and Results

In order to systematically evaluate the different features, a set of twenty glyph classes has been compiled as ground truth, each class containing twenty instances, making a total of 400 glyphs. The examples are taken from one and the same document page for which a specific font would be generated; this is the reason why this experiment is confined to twenty classes — not all characters are available on that page with a significant number of exemplars for the present evaluation; in the final application, accordingly more classes would be generated for font generation, depending on the present characters. The glyphs differ in particular regarding specific indentations in their contours. Therefore, for each glyph the outer contour has been extracted, describing the curvature of the outer contour and neglecting interior shape information.

A recall-and-precision evaluation would explain how the precision degrades with regard to the correct recall. This provides fundamental information about the performance of the features and is thus fundamental for learning how they behave within the process of document-individual font generation.
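A minimal sketch of such a retrieval-style evaluation, assuming a feature matrix X (one row per glyph) and class labels y: each glyph queries all others, ranked by Euclidean feature distance, and precision is averaged at fixed recall levels. The protocol actually used for Fig. 5 may differ in detail.

```python
import numpy as np

def recall_precision(X, y, recall_levels=np.linspace(0.05, 1.0, 20)):
    curves = []
    for q in range(len(X)):
        dist = np.linalg.norm(X - X[q], axis=1)
        order = np.argsort(dist)[1:]                # ranked results, query itself excluded
        relevant = (y[order] == y[q])               # assumes every class has > 1 instance
        hits = np.cumsum(relevant)
        recall = hits / relevant.sum()
        precision = hits / np.arange(1, len(order) + 1)
        # precision at the first rank reaching each recall level
        curves.append([precision[np.argmax(recall >= r)] for r in recall_levels])
    return recall_levels, np.mean(curves, axis=0)
```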
[Figure 5 plot: precision (%) over recall (%) for three feature combinations: numeric features; Hu moments + numeric features; scope histogram + Hu moments + numeric features.]

Figure 5: Recall-and-precision graphs for glyph retrieval with combinations of numeric features, Hu moments, and the scope histogram.

Measuring the performance of all single numeric shape features together, namely the compactness, radius ratio, aspect ratio, and shape convexity, the lower graph in Fig. 5 shows their recall-and-precision behaviour. The middle graph shows how the performance is improved when additionally considering the Hu moments. Moments obviously add information to the other features. This is not surprising when looking, for example, at the aspect ratio in comparison to how moments are defined for the entire set of points contained in a glyph.

A remarkable result is that the performance can still be improved by employing the scope histogram. Its definition takes into account how parts of the glyph surround other parts of that same glyph, capturing more directly the relevant shape information contained in the outer contour. This is clearly different from what the Hu moments represent, explaining the results. To conclude from this experiment, glyphs can actually be categorised with high precision even with compact shape descriptors. Due to their low computational complexity, these features are thus well suited for large-scale categorisation.

Further improvements are expected by considering the orientation variance of letters, which is neglected by all of the features used in the current classification. Additionally, we aim at looking at the interior contours of holes, entailing the consideration of more distinctive glyph properties which are, in our evaluation, solely taken into account by Hu's approach.

4. CONCLUSIONS

Our experiments show that, to extract accurate glyph prototypes from printed document images, even the binarisation proposed by [1] must be enhanced. Binarised pixels that are not eight-connected cannot be discarded, since they might represent broken parts of a glyph. Enlarging the binarised regions, up to a distance of two pixels, will include most information of the character edges and might be used as the basis for better binarisation or to cut out grey-scale glyph areas ready for font clustering. First results of the latter indicate that shape features can be employed which enable fast comparisons for huge amounts of glyphs in voluminous documents. The scope histogram has been applied for the first time to the domain of glyphs, showing that qualitative features do in fact improve conventional methods.

Our future work will include an interactive system to optimise character clusters, matching of existing glyphs to separate touching character groups, and the production of usable fonts, which will result in OCR of most of the texts.

5. ACKNOWLEDGMENTS

This work was partially funded by the DFG (Deutsche Forschungsgemeinschaft), project Venod (HE 989/10-1). The Staats- und Universitätsbibliothek Bremen and the Bayerische Staatsbibliothek in Munich provided many fine examples of historic printed document images.

6. REFERENCES

[1] B. Gatos, I. Pratikakis, and S. J. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39(3):317–327, 2006.
[2] M.-K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179–187, Feb. 1962.
[3] S. Inglis and I. H. Witten. Compression-based template matching. In DCC 1994, pages 106–115, Snowbird, UT, USA, 1994. IEEE Computer Society.
[4] T. Kanungo, R. M. Haralick, and I. Phillips. Global and local document degradation models. In ICDAR 1993, pages 730–734, Tsukuba, Japan, 1993. IEEE Computer Society.
[5] S. Kröner and G. Ramponi. Edge preserving noise smoothing with an optimized cubic filter. In COST-254 Workshop, pages 3–6, Ljubljana, Slovenia, 1998.
[6] J.-S. Lee. Digital image enhancement and noise filtering by use of local statistics. IEEE PAMI, 2(2):165–168, Mar. 1980.
[7] J.-S. Lee. Digital image smoothing and the sigma filter. Computer Vision, Graphics, and Image Processing, 24(2):255–269, 1983.
[8] J. S. Lim. Two-Dimensional Signal and Image Processing. Prentice Hall, Englewood Cliffs, NJ, USA, 1990.
[9] N. Otsu. A threshold selection method from gray-level histograms. IEEE SMC, 9(1):62–66, Jan. 1979.
[10] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE PAMI, 12(7):629–639, July 1990.
[11] S. Pletschacher. A self-adaptive method for extraction of document-specific alphabets. In ICDAR 2009, pages 656–660, Barcelona, Spain, 2009. IEEE Computer Society.
[12] J. C. Russ. The Image Processing Handbook. CRC Press, Boca Raton, FL, USA, 2nd edition, 1995.
[13] J. Sauvola and M. Pietikäinen. Adaptive document image binarization. Pattern Recognition, 33(2):225–236, 2000.
[14] A. Schuldt, B. Gottfried, and O. Herzog. Towards the visualisation of shape features: The scope histogram. In KI 2006, pages 289–301, Bremen, Germany, 2006. Springer-Verlag.
[15] F. Shafait, D. Keysers, and T. M. Breuel. Efficient implementation of local adaptive thresholding techniques using integral images. In DRR 2008, San Jose, CA, USA, 2008. SPIE.