Classifying Fonts and Calligraphy Styles Using Complex Wavelet Transform
Total Page:16
File Type:pdf, Size:1020Kb
SIViP (2015) 9 (Suppl 1):S225–S234 DOI 10.1007/s11760-015-0795-z ORIGINAL PAPER Classifying fonts and calligraphy styles using complex wavelet transform Alican Bozkurt1 · Pinar Duygulu2 · A. Enis Cetin3 Received: 2 April 2015 / Revised: 15 June 2015 / Accepted: 20 June 2015 / Published online: 23 July 2015 © Springer-Verlag London 2015 Abstract Recognizing fonts has become an important task 1 Introduction in document analysis, due to the increasing number of avail- able digital documents in different fonts and emphases. A The term font generally refers to a document’s typeface, generic font recognition system independent of language, such as Arial. Each font can have variations such as bold script and content is desirable for processing various types of or italic to emphasize the text, which are called emphases. documents. At the same time, categorizing calligraphy styles Font recognition, which is the process of classifying dif- in handwritten manuscripts is important for paleographic ferent forms of letters, is an important issue in document analysis, but has not been studied sufficiently in the litera- analysis especially in multi-font documents [20,32]. In ture. We address the font recognition problem as analysis and addition to its advantages in capturing document layout, categorization of textures. We extract features using complex font recognition may also help to increase the perfor- wavelet transform and use support vector machines for clas- mance of optical character recognition (OCR) systems by sification. Extensive experimental evaluations on different reducing the variability of shape and size of the charac- datasets in four languages and comparisons with state-of-the- ters. art studies show that our proposed method achieves higher For printed fonts in languages that use the Latin alpha- recognition accuracy while being computationally simpler. bet, the main challenge is to recognize fonts in “noisy” Furthermore, on a new dataset generated from Ottoman man- documents, that is, those containing many artifacts. When uscripts, we show that the proposed method can also be used we consider languages with cursive scripts, such as Arabic, for categorizing Ottoman calligraphy with high accuracy. the change in character shape with location (isolated, initial, medial or final) and dots (diacritics) above or below the letters Keywords Font recognition · Ottoman calligraphy · Dual cause further difficulties in character segmentation. tree complex wavelet transform · SVM · Latin · Arabic · Handwritten documents add an extra component to the Chinese analysis because of writing style. Classifying handwriting styles continues to be a challenging yet important problem Electronic supplementary material The online version of this for paleographic analysis [3]. In recent studies of Hebrew, article (doi:10.1007/s11760-015-0795-z) contains supplementary Chinese and Arabic calligraphy, researchers used characters material, which is available to authorized users. as the basic elements to extract features [6,35,38]. However, B Alican Bozkurt these methods heavily rely on preprocessing steps and are [email protected] prone to error. Another example which we focus, the style of Ottoman 1 Department of Electrical and Computer Engineering, Northeastern University, 360 Huntington avenue, Boston, calligraphy, is the artistic handwriting style of Ottoman Turk- MA 02115, USA ish. Different styles were used in different documents, such 2 Department of Computer Science, Hacettepe University, as books, letters, etc. [29]. Scholars around the world want to Beytepe Campus, 06800 Ankara, Turkey access efficiently and effectively to Ottoman archives, which 3 Department of Electrical and Electronics Engineering, contain millions of documents. Classifying Ottoman callig- Bilkent University, 06800 Bilkent, Ankara, Turkey raphy styles would be an important step in categorizing large 123 S226 SIViP (2015) 9 (Suppl 1):S225–S234 numbers of documents in archives, as it would assist further scanned in high resolution. Global features refer to infor- processing for retrieval, browsing and transliteration. mation extracted from entire words, lines or pages, and are The Ottoman alphabet is similar to Farsi and Arabic. mostly texture based [2,4,10,20,32]. Hence, to recognize multi-font printed texts in Ottoman [25], Zhu et al. addressed the font recognition problem as a tex- existing methods of Arabic and Farsi font recognition can ture identification issue and used multichannel Gabor filters be utilized. However, due to the late adoption of printing to extract features. Prior to feature extraction, the authors technology in the Ottoman Empire, a high percentage of doc- normalized the documents to create uniform blocks of text. uments are handwritten. Documents in Ottoman calligraphy Experimental results were reported on computer-generated are very challenging compared to their printed counterparts, images with 24 Chinese (six typefaces and four emphases) with intra-class variances much higher than what is found and 32 English (eight typefaces and four emphases) fonts. in printed documents. Some documents are hundreds of Pepper and salt noise was added to generate artificial noise. years old and the non-optimal storage conditions of historic Gabor filters were also used in [30] for feature extraction, and Ottoman archives have resulted in highly degraded manu- support vector machines (SVM) for classification. Experi- scripts. So as not to damage the binding, books are scanned ments were carried out on six typefaces with four emphases with their pages only partially open, introducing non-uniform on English documents. Similarly, Ma and Doermann used lighting in images. Gabor filters not for font but for script identification at word We propose a simple but effective method for recognizing level. The authors used three different classifiers: SVM, k- printed fonts independent of language or alphabet, and extend nearest neighbor and the Gaussian mixture model (GMM), it to classifying handwritten calligraphy styles in Ottoman to identify the scripts in four different bilingual dictionaries manuscripts. We present a new method based on analyzing (Arabic-English, Chinese-English, Hindi-English, Korean- textural features extracted from text blocks, which there- English). The method was also used to classify Arial and fore does not require complicated preprocessing steps such Times New Roman fonts [23]. Aviles-Cruz et al. used high- as connected component analysis or segmentation. While order statistical moments to characterize the textures, and Gabor filters are commonly used in the literature for texture Bayes classifier for classification. Similar to [37], they exper- analysis, they are computationally costly [20]. Alternatively, imented on Spanish documents digitally generated with eight we propose using complex wavelet transform (CWT), which fonts and four emphases. They also tested the effects of is not only more efficient but also achieves better recognition Gaussian random noise in [5]. rates compared to other methods. Khosravi and Kabir approached the Farsi font recognition Unlike most existing studies focusing on a single lan- problem, and instead of using Gabor filter-based method, pro- guage, we experiment on many previously studied printed posed a gradient-based approach to reduce the computational fonts in four languages: English, Farsi, Arabic and Chi- complexity. They combined Sobel and Roberts operators to nese. Our method also yields high accuracy in categorizing extract the gradients and used AdaBoost for classification. Ottoman calligraphy styles on a newly generated dataset con- In [31], Sobel-Robert operator-based features were com- sisting of various samples of different handwritten Ottoman bined with wavelet-based features. Another method, based calligraphy styles. on matching interest points is proposed in [36]. For Arabic font recognition, method based on fractal geometry is proposed in [8,9], where the authors generated a 2 Related work dataset consisting of printed documents in ten typefaces. Sli- mane et al. proposed a method for recognizing fonts and sizes One of the first systems of optical font recognition was [39], in Arabic word images at an ultra-low resolution. They use in which global typographical features were extracted and GMM to model the likelihoods of large numbers of features classified by a multivariate Bayesian classifier. The authors extracted from gray level and binary images [32]. Bataineh extracted eight global features from connected components. et al. considered statistical analysis of edge pixel behavior in They experimented on ten typefaces in seven sizes and four binary images for feature extraction from Arabic calligraphic emphases in printed and scanned English documents. Ozturk scripts. They experimented on Kufi, Diwani, Persian, Roqaa, et al. proposed a cluster based approach for printed fonts and Thuluth and Naskh styles [7]. exploited recognition performance for quality analysis [26]. For classifying calligraphy styles, Yosef et al. presented Feature extraction methods for font recognition in the liter- a writer-identification method based on extracting geometric ature are divided into two basic approaches: local and global. parameters from three letters in Hebrew calligraphy docu- Local features, usually refer to the typographical informa- ments, followed by dimension reduction [35]. Azmi et al. tion gained from parts of individual letters, and are utilized proposed using triangle blocks for classifying Arabic callig- in [15,33].