Evaluation of the SVM Based Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books
Manami Fukuo1, Yurie Enomoto1†, Naoko Yoshii1, Masami Takata1, Tsukasa Kimesawa2, and Kazuki Joe1
1 Dept. of Advanced Information & Computer Sciences, Nara Women’s University, Nara, Japan
2 Digital Library Division, National Diet Library, Kyoto, Japan
† Currently works for Fujitsu Limited.

Abstract - The National Diet Library in Japan provides a web-based digital archive of early-modern printed books as images. To make better use of the digital archive, the book images should be converted to text data. In this paper, we evaluate the SVM based multi-fonts Kanji character recognition method for early-modern Japanese printed books. Using several sets of Kanji characters clipped from different publishers’ books, we obtain a recognition rate of more than 92% for 256 kinds of Kanji characters. This result indicates that our recognition method, which uses the PDC (Peripheral Direction Contributivity) feature of given Kanji character images for learning and recognition with an SVM, is effective for multi-fonts Kanji character recognition in early-modern Japanese printed books.

Keywords: character recognition; SVM; digital archiving

1 Introduction

The National Diet Library (NDL) [1] in Japan keeps about 390,000 books dating from the Meiji and Taisyo eras (1868-1926). The books cover a broad range of fields including philosophy, literature, history, technology, and the natural sciences. Most of them are out of print and are valuable scholarly materials.

In general, library books that readers handle directly are exposed to too many risks, such as aging, wear, and loss caused by people, to be opened to the public. To solve this problem, the NDL started a project called “The Digital Library from the Meiji Era” [2,3]. In the project, early-modern printed books are recorded on microfiches page by page. The microfiches are converted into digital images and published on the project Web site. Converting the books into digital images has enabled the contents of the valuable books to be opened to the public while the original books are preserved in good condition. The NDL provides about 148,000 volumes with 101,000 titles from its collection in the digital library. Users can view the digital images of the books at any time and from anywhere with an Internet connection.

Information such as the titles and author names of the books in the digital library is given as text data, while the main body is image data. There are no functions for generating text data from the image data, so full-text search is not supported yet. To make these valuable early-modern printed books more accessible, their main body should be given as text data, too. As described above, the number of target books is so large that automatic conversion is required.

If the conversion targets were general text images, they could easily be converted into text data with optical character recognition (OCR) software. However, most early-modern printed books contain old Japanese characters that are no longer used. Moreover, the fonts used in early-modern printed books vary by publisher and year of publication. Existing OCR software cannot recognize the correct characters under these conditions.

To solve these problems, we proposed a multi-fonts Japanese character recognition method for early-modern printed books [4]. However, we selected just ten kinds of (Japanese) Kanji characters, which are commonly and frequently used, for the recognition experiment. Thus, the effectiveness of the proposed method was not clearly shown. In this paper, we extend the kinds of Kanji characters for several recognition experiments to validate our method. To perform several recognition experiments, we need an enormous number of character images with various font sets from early-modern books by various publishers. Therefore, we need to perform automatic character clipping from images. Then, we extract the features of the character images using a feature extraction method for handwritten Kanji character recognition, because of the wide variety of font sets and the heavy noise. Finally, the Support Vector Machine (SVM), which is one of the promising recognition methods, is used for the recognition of the feature vectors.

The rest of this paper is organized as follows. In section II, we present the overview of our multi-fonts Kanji character recognition method. The evaluation method to validate the effectiveness of our method is explained in section III. In section IV, we describe evaluation experiments and discuss the experimental results.

2 The Recognition Method

In this section, we give the overview of our multi-fonts Kanji character recognition method presented in [4].

For the character recognition of early-modern printed books, we think we should not use the typical OCR processes, because each publisher might have used different font sets in those days. The flow of our recognition method is shown below.

a) An image containing a character is given as training data.
b) Several pre-processes such as binarization, normalization and noise reduction are applied to the given image.
c) The PDC (Peripheral Direction Contributivity) feature [5] is extracted from the pre-processed image, and each feature vector for the given image is labeled with the category according to the kind of the character included in the given image.
d) A dictionary is generated by an SVM with the training data extracted in c). The dictionary is used for the recognition of test data.

Once character images are given at a), several pre-processes are applied to each image. Every pixel in an image must be black or white when PDC features are extracted, so binarization is applied to the images. In normalization, margins are removed from the images so that each character is scaled to an equivalent size. Noise reduction should also be applied, since the printing and archiving quality of early-modern printed books is mostly poor. Without noise reduction, noise in the images would be recognized as character strokes and extracted as unsuitable PDC features.

PDC feature vectors are calculated with category labels at c) and used for the learning phase of the SVM at d). The training data for SVM learning are a set of PDC feature vectors with their labels, while the test data for SVM classification are a set of PDC feature vectors without labels. All character images are converted to training or test data by processes a) to c). Half of the whole data are randomly selected from each category as training data, and the others are used as test data. The SVM learns separating hyperplanes in the PDC feature vector space from the training data. The trained SVM can then classify test data according to the separating hyperplanes.

In our previous research [4], the SVM recognition experiments were performed for ten kinds of Kanji characters. Furthermore, we compared the experimental result of the SVM with that of a neural network (NN). The SVM achieved a recognition rate of 97.8%, while the NN achieved 77.6%. We confirmed that our SVM based method was more suitable for learning and classification of PDC features than the NN based method. However, there are 6,349 kinds of Kanji characters, which the Japanese Standards Association selects as the Japanese character code [6]. In addition, the structure of Kanji characters is hierarchical: there are many similar Kanji structures, such as radical indices.

In this paper, we extend the kinds of Kanji characters up to 256 kinds for several recognition experiments and validate the effectiveness of our method. In [4], all character images for the recognition experiment were clipped from early-modern books by hand. Because of the large number of kinds of Kanji characters in this experiment, we also implement automatic Kanji character clipping for early-modern books.

3 Automatic Kanji Character Clipping

The flow of automatic Kanji character clipping to extract each character image for the recognition experiments is shown below.

1) A page of the book images is given as an object image and divided into right and left parts.
2) Several pre-processes such as binarization, noise reduction and affine transform are applied to each divided half part of the object.
3) Layout analysis is applied to each part.
4) Character strokes are clipped within the layout analysis results.
5) Labeling is performed for each clipped character stroke so that each label is mapped to a character image.

Several pre-processes are executed on each part image. Binarization is applied to the image since the distribution of the vertically projected pixel values is used for extracting character domains. Noise in the image might be recognized as a false character domain, so noise reduction by a median filter is performed. Furthermore, distortion correction by an affine transformation is also applied, since the distortion introduced by capturing has a great influence on layout analysis. With noise reduction and distortion correction, layout analysis can therefore be performed more correctly. A threshold is calculated from the distribution of the vertically projected pixel values, and layout analysis is executed. The auto-correlation function by Sondhi [7] is applied to the distribution of the vertically projected pixel values to calculate stroke spacing. Finally, labeling is performed on each character stroke, which is clipped based on the stroke spacing. Each labeled part of connected black dots is recognized as a character domain to clip a character from the original book images.

In this paper, we use the character images clipped by the automatic character clipping.
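
As a rough illustration of the projection and auto-correlation steps in the clipping flow, the following minimal Python sketch computes a vertical projection profile from a binarized image and estimates a dominant spacing from a plain auto-correlation. It is only a sketch under stated assumptions: the function names, the `min_lag` parameter, and the toy image are hypothetical, and the auto-correlation here is the textbook mean-removed form, not necessarily the variant described in [7].

```python
def projection_profile(image):
    """Count black pixels (value 1) in each column of a binary image.

    `image` is assumed to be a list of equal-length rows of 0/1 values.
    """
    return [sum(column) for column in zip(*image)]


def autocorrelation(values):
    """Mean-removed auto-correlation of a sequence for lags 0..n-1."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag))
        for lag in range(n)
    ]


def estimate_spacing(profile, min_lag=2):
    """Return the lag with the largest auto-correlation.

    Only lags from `min_lag` up to half the profile length are searched
    (an assumption made here to skip the trivial peak at lag 0).
    """
    ac = autocorrelation(profile)
    candidate_lags = range(min_lag, len(profile) // 2 + 1)
    return max(candidate_lags, key=lambda lag: ac[lag])


# Toy binarized image: 3 rows, 20 columns, black stripes repeating every 5 columns.
image = [[1 if col % 5 < 3 else 0 for col in range(20)] for _ in range(3)]
profile = projection_profile(image)
spacing = estimate_spacing(profile)
print(profile)
print(spacing)
```

For this toy input the profile repeats with a period of five columns, and the estimated spacing is 5. Real page images would need the binarization, median filtering, and distortion correction described above before such a profile becomes meaningful.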
Figure 1 shows some examples of clipped character images. Each character image is clipped from nine early-modern books with various publication years and publishers, so that we get character images with various font sets. This means that we get nine kinds of PDC feature vector sets for each Kanji character.

Table 1 The list of nine early-modern works

Book Number    Title    Author    Publisher    Publication Year