Font Clustering and Classification in Document Images *

FONT CLUSTERING AND CLASSIFICATION IN DOCUMENT IMAGES * Serdar Öztürk1, Bülent Sankur1, A. Toygar Abak2 1 %R÷D]LoL 8QLYHUVLW\ 'HSDUWPHQW RI (OHFWULFDO(OHFWURQLF (QJLQHHULQJ Bebek, Istanbul, TURKEY {oztuserd, sankur}@boun.edu.tr 2 Marmara Research Center, Information Technologies Research Institute Gebze, Kocaeli, TURKEY [email protected] ABSTRACT 2 PRELIMINARIES: FONTS AND TYPEFACES Clustering and identification of fonts in document images A font can be modeled by the following attributes[2, 9]: impacts on the performance of optical character recognition The font family like “Times”, “Helvetica” and (OCR). Therefore font features and their clustering “Courier”. tendency are investigated. Font clustering is implemented The size expressed in typographic points. both from shape similarity and from OCR performance The weight of the font, having one of the following points of view. A font recognition algorithm is developed to values: light, normal or bold. identify the font group with which a given text was created. The slope indicating the orientation of the letters main strokes. A font could be roman or italic. 1 INTRODUCTION The spacing mode specifying the pitch of the characters. The increasing diversity of font styles is a factor limiting The fonts we have considered belong to 28 different the performance of general-purpose OCR systems. In fact font families and together with weight and slope varieties new computer-generated fonts are being added to the their total number amounts to 65 as listed in the Appendix. existing fonts in the printing industry, their total running The fonts, are referred to in the paper via their associated into hundreds. The detection of font style of a text is an ID number. obvious way to improve the performance of optical We assume that the font attributes of spacing mode and recognition algorithms. size have been normalized as follows: The spacing mode We investigate the clustering behavior of fonts and uniformized via the bounding boxes of characters; the size propose a method for font/font class identification. One is normalized via bilinear interpolation to a standard size of application of font clustering is the design of a more robust 32 x 32. Thus the font variables that remain are font family, OCR system. The underlying scheme here is that the slope and weight. For each font we considered the 68 character recognition system will consist of a font cluster different characters, consisting of the lowercase letters, identifier unit followed by an OCR unit specific to each uppercase letters and numerals. A total of 68 x 65 = 4420 font group. A second application is the identification of character images of different fonts were prepared using a individual fonts when reproduction of a document in its computer graphics program. original appearance is desired. In addition font recognition There are several possible feature sets to be extracted aids in the recognition of logical structures of documents. from characters for recognition and clustering purposes [5]. Finally it is also conjectured that for many of these tasks it In this paper we tested the following features: bitmaps, may suffice to find out the font cluster to which a specific DCT coefficients, eigencharacters and Fourier descriptors. text belongs, so that the cluster centroid can be used as the representative font. The goals of the study can be itemized 3 FEATURES of CHARACTERS AND FONTS as follows: To find the minimal set of clusters that provides 3.1 Character Bitmaps adequate OCR performance across all fonts. To determine the representative font for each font The 1024-dimensional binary pattern of a character is cluster indicated by the lexicographic ordering of its 32 x 32 To develop an algorithm to detect the font type or bitmap. Any jitter, due to the finite resolution grid or cluster in a document. interpolation, can be mitigated by considering four other There have been few studies on font recognition. templates in one pixel: up, down, right, left shift positions. Notably R. Morris [10] has considered classification of The maximum of the match scores is considered for digital typefaces using spectral signatures. Khoubyari and classification. Hall in [1] have considered a scheme based on the identification function words e.g., “the”, “and”. Zramdini 3.2 Fourier Descriptors [5] and Ingold [2, 9] consider a font identification approach based on the projection profiles. Our contribution is in the The cosinusoidal and sinusoidal components of the x and y investigation of the clustering behavior of fonts and it's – projections of character chain codes have been used. For n impact on the OCR performance. * This study was supported by Bogazici University Research Fund under project "Low Level Image Processing for OCR" number of harmonics the total feature set contains 4 x n interfont distances. We have considered the mean distance, features for each character. the centroid distance, the nearest-neighbor distance and finally the mean Hamming distance. For example the mean 3.3 Eigencharacters distance was defined as: N N 1 rs D (cluster ,cluster ) = ∑∑D( font , font ) The subspace features often prove to be more efficient in mean r s p q Nr Ns p==11q the classification task and for dimension m = 2 the For features other than bitmaps, (Fourier descriptors, DCT visualization of the multivariate data scatter becomes coefficients and eigenfeatures), the above font and font useful. The subspace features for a character are obtained cluster distance definitions remain valid if Euclidean by projecting the original features on the subspace defined instead of Hamming distance is used, i.e.: by the m principal eigenvectors [7]. The large dimensional m = − 2 (e.g., 1024) eigenvalue problem is solved by a method H (c pk ,c qk ) ∑(c pk (i) cqk (i)) described in [4]. i=1 th if cpk denotes the k character feature vector in font p. 3.4 DCT Features 4.2 Cluster Centroids Features can be extracted by considering the low order DCT transform coefficients of the 32 x 32 character block. Using interfont distances the centroidal font can be defined Alternatively one can consider the juxtaposition of the as the font having the smallest sum of feature distances to selected DCT coefficients of the sub-blocks of the character all other fonts. We compared three centroid definitions, images. For example for sub-block size 8, if one selects the that is font feature centroids, juxtaposition of individual first 4 coefficients of each block DCT, one obtains a 4 x (32 character centroids, and averaging-thresholding of character / 8)2 = 64 element feature vector for each character. bitmaps. The first two methods proved to be equally good choices. 3.5 Font Features 4.3 Clustering Algorithms Font feature vectors are constructed by serial juxtaposition of the character feature vectors in some order. Thus if the We have experimented with three clustering algorithms, bold characters denote symbolically the respective feature namely, the linkage algorithm, the mean-square distance vectors, the font feature vector appears as φ = [a … z A ... based algorithm and finally with the Ward's algorithm. Z 0 … 9]T The dimensionality of the font feature vector In the linkage algorithm the font pairs having minimum adds up then to d = K x m, where K is the number of average distance in the font proximity matrix are merged characters and m is the number of features per character. and new clusters are formed iteratively. However as clusters become sizeable, the single linkage method suffers from 4 CLUSTERING OF FONTS false merges and the disparateness in the groups increases. In the second method the mean square Hamming We investigate in this section the distance functions of distance between clusters defined as: 1 character feature vectors and comparative evaluation of the K N N 1 r s 2 clustering methods. D(cluster ,cluster ) = r s ∑[D (c ,c )]2 r s + mean k k Nr N s K k=1 4.1 Distance Functions where Nr and Ns denote the respective font cluster populations. An illustrative output of this method is shown in Fig. 1. Thirdly a sample result of the Ward's algorithm is Let cpk(i,j) represent the bitmap at coordinates (i, j) of the kth character in the pth font style. The weighted Hamming shown in Fig. 2. Recall that Ward’s algorithm aims to distance between any two character images k created in minimize the within-cluster variation and this action is fonts p and q respectively is calculated as follows: equivalent to maximizing the between-cluster variation [3]. 32 The organization and clear separation of the 10 font clusters = ⊕ H(cpk ,cqk ) ∑[w(i, j)[cpk (i, j) cqk (i, j) ] using Ward's algorithm are shown in Fig. 3. i,j where the weights w(i,j) from 1 to 4 are assigned proportional to the number of neighbors (weight 1 also for isolated pixels). An F x F proximity matrix can be constituted for each character, averaged, F being the number of fonts considered. Averaging font-to-font proximity matrices over each of the K characters, we obtain the general F x F "font proximity" matrix, where the (p, q) cell has: K = D( font p , font q ) ∑ P(chark )H (chark / font p ,chark / fontq ) Figure 1 Clustering with average mean square Hamming distance k =1 using bitmap features; the cluster centroids are shown on the right. K = ∑ P(ck )H (c pk ,cqk ) k=1 P being the character occurrence probability. Finally the distance between clusters can be defined in terms of cross-performance table proves the importance of font recognition. Table 1: Correct recognition percentages with template based recognition on fonts different than the text font (only the first three font results are shown) Template Template Template Template font #1 Font #2 Font #3 font #4 Figure 2: Clustering with Ward’s algorithm using 20 principal components of the bitmap features Text font #1 89.6 55.6 71.8 78.6 Text font #2 64.4 98.6 20.6 58.2 Text font #3 69.0 39.2 100.0 77.2 Text font #4 74.4 76.2 83.4 100.0 Text font #5 93.2 52.0 94.8 81.4 Text font #6 64.6 88.0 74.8 91.6 Text font #7 84.6 64.2 64.4 81.2 Text font #8 64.4 71.6 37.2 76.2 Text font #9 59.4 34.6 58.8 73.8 Text font #10 40.6 70.4 41.0 64.4 Similarly we considered the intra-cluster variation of the OCR performance.

Load more