Evaluation of the SVM based Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books

Manami Fukuo 1, Yurie Enomoto 1†, Naoko Yoshii 1, Masami Takata 1, Tsukasa Kimesawa 2, and Kazuki Joe 1

1 Dept. of Advanced Information & Computer Sciences, Nara Women's University, Nara, Japan
2 Division, Library, Kyoto, Japan

† Currently works for Fujitsu Limited.

Abstract - The National Diet Library in Japan provides a web based digital archive of early-modern printed books as images. To make better use of the digital archive, the images should be converted to text data. In this paper, we evaluate the SVM based multi-fonts Kanji character recognition method for early-modern Japanese printed books. Using several sets of Kanji characters clipped from different publishers' books, we obtain a recognition rate of more than 92% for 256 kinds of Kanji characters. This shows that our recognition method, which uses the PDC (Peripheral Direction Contributivity) feature of given Kanji character images for learning and recognition with an SVM, is effective for multi-font Kanji character recognition in early-modern Japanese printed books.

Keywords: character recognition; SVM; digital archiving

1 Introduction

The National Diet Library (NDL) [1] in Japan keeps about 390,000 books dating from the Meiji and Taisyo eras (1868-1926). The books cover a broad range including philosophy, literature, history, technology, natural science, etc. Most of them are out of print and are scholarly valuable materials.

Generally, library books that are read by hand face too many risks of aging, wearing, and loss by man-made causes to be opened to the public. To solve this problem, the NDL started a project called "The Digital Library from the Meiji Era" [2,3]. In the project, early-modern printed books are recorded on microfiches page by page. The microfiches are converted into digital images and opened at the project Web site. Converting the books into digital images enabled the contents of the valuable books to be opened to the public while the original books are preserved in good condition. The NDL provides about 148,000 volumes with 101,000 titles in the digital library from their collection. Users can see the digital images of the books whenever and wherever they have an Internet connection.

The information in the digital library, including titles and author names of the books, is given as text data while the main body is image data. There are no functions for generating text data from image data, so full-text search is not supported yet. To make the early-modern printed and valuable books more accessible, their main body should be given as text data, too. As described above, the number of target books is so large that automatic conversion is required.

If the conversion targets were general text images, they could be converted into text data easily with some optical character recognition (OCR) software. However, most early-modern printed books contain old Japanese characters that are not used now. Moreover, the fonts used in early-modern printed books vary by publisher and year of publication. Existing OCR software cannot recognize characters correctly under these conditions.

To solve these problems, we proposed a multi-fonts Japanese character recognition method for early-modern printed books [4]. However, we selected just ten kinds of (Japanese) Kanji characters, which are commonly and frequently used, for the recognition experiment. Thus, the effectivity of the proposed method was not clearly shown. In this paper, we extend the kinds of Kanji characters for several recognition experiments to validate our method. To perform several recognition experiments, we need an enormous number of character images with various font sets from early-modern books by various publishers. Therefore, we perform automatic character clipping from the book images. Then, we extract the features of the character images using a feature extraction method designed for handwritten Kanji character recognition because of the wide variety of font sets and the heavy noise. Finally, the Support Vector Machine (SVM), which is one of the promising recognition methods, is used for the recognition of the feature vectors.

The rest of this paper is organized as follows. In section 2, we present the overview of our multi-fonts Kanji character recognition method. The evaluation method to validate the effectivity of our method is explained in section 3. In section 4, we describe the evaluation experiments and discuss the experimental results.

2 The Recognition Method

In this section, we give an overview of our multi-fonts Kanji character recognition method presented in [4]. For the character recognition of early-modern printed books, we think we should not use the typical OCR processes because each publisher might have used different font sets in those days. The flow of our recognition method is shown below.

a) An image containing a character is given as training data.

b) Several pre-processes such as binarization, normalization and noise reduction are applied to the given image.

c) The PDC (Peripheral Direction Contributivity) feature [5] is extracted from the pre-processed image, and each feature vector for the given image is labeled with the category according to the kind of the character included in the given image.

d) A dictionary is generated by an SVM with the training data extracted in c). The dictionary is used for the recognition of test data.

Given the character images at a), several pre-processes are applied to each image. Any pixel in the images must be black or white when PDC features are extracted, so binarization is applied to the images. In normalization, margins are removed from the images so that each character is scaled to an equivalent size. Noise reduction should also be applied since the printing and archiving quality of early-modern printed books is mostly poor. Without noise reduction, noises in the images would be recognized as character strokes and extracted as unsuitable PDC features.

PDC feature vectors are calculated with category labels at c), and used for the learning phase of the SVM at d). A training datum for SVM learning is a set of PDC feature vectors with its label, while a test datum for SVM classification is a set of PDC feature vectors without a label. All character images are converted to training or test data by processes a) to c). Half of the whole data are randomly selected from each category as training data, and the others are used as test data. An SVM learns separating hyper-planes in the PDC feature vector space with the training data. The trained SVM can classify test data according to the separating hyper-planes.

In our previous research [4], the SVM recognition experiments were performed with ten kinds of Kanji characters. Furthermore, we compared the experimental result of the SVM with a neural network (NN). The recognition rate of the SVM was 97.8% while that of the NN was 77.6%. We confirmed that our SVM based method was more suitable for learning and classification of PDC features than the NN based method. However, there are 6,349 kinds of Kanji characters which the Japanese Standards Association selects as the Japanese character code [6]. In addition, the structure of Kanji characters is hierarchical: there are a lot of similar Kanji structures such as radical indices.

In this paper, we extend the kinds of Kanji characters up to 256 kinds for several recognition experiments and validate the effectivity of our method. In [4], all character images for the recognition experiment were clipped from early-modern books by hand. Because of the large number of kinds of Kanji characters for this experiment, we also implement automatic Kanji character clipping for early-modern books. Illustrative sketches of the pre-processing in b) and the feature extraction in c) are given below.
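The following is a minimal sketch of the pre-processing in step b), assuming a NumPy/SciPy stack. The binarization threshold, filter size, and output size are illustrative assumptions of this sketch rather than a description of our exact implementation (section 4.1 reports a 3 by 3 median filter and 128 by 128 pixel output, which the defaults here follow).

```python
import numpy as np
from scipy.ndimage import median_filter, zoom

def preprocess(gray, out_size=128, threshold=128):
    """Binarize, denoise, trim margins, and scale a grayscale character image.

    gray: 2-D uint8 array (0-255), darker pixels are ink.
    Returns a binary (0/1) array of shape (out_size, out_size).
    """
    # Binarization: ink pixels become 1, background 0 (fixed threshold here;
    # the actual threshold selection is an assumption of this sketch).
    binary = (gray < threshold).astype(np.uint8)

    # Noise reduction with a 3x3 median filter.
    binary = median_filter(binary, size=3)

    # Normalization: trim white margins so the character fills the frame.
    rows = np.any(binary, axis=1)
    cols = np.any(binary, axis=0)
    if not rows.any():                      # blank image: nothing to clip
        return np.zeros((out_size, out_size), dtype=np.uint8)
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    clipped = binary[top:bottom + 1, left:right + 1]

    # Scale to a fixed square so every character has an equivalent size.
    zy = out_size / clipped.shape[0]
    zx = out_size / clipped.shape[1]
    scaled = zoom(clipped.astype(float), (zy, zx), order=1) > 0.5
    out = np.zeros((out_size, out_size), dtype=np.uint8)
    h = min(out_size, scaled.shape[0])
    w = min(out_size, scaled.shape[1])
    out[:h, :w] = scaled[:h, :w]            # pad or crop rounding differences
    return out
```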

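For step c), the sketch below computes a simplified direction-contributivity style feature: scanning inward from each of the four sides, it takes the first ink pixel on each scan line, measures the ink run lengths in eight directions from that point, normalizes them, and averages them over a few zones per side. The exact PDC definition of [5] (including higher-order peripheral points and its particular normalization) is more elaborate; this is only an illustrative approximation, and the zone count and direction handling are assumptions.

```python
import numpy as np

# Eight chain-code directions (dy, dx), measured in the coordinate frame of
# whichever rotated/flipped view is being scanned (a simplification).
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def run_lengths(img, y, x):
    """Length of the consecutive ink run from (y, x) in each of 8 directions."""
    h, w = img.shape
    lengths = np.zeros(len(DIRS))
    for d, (dy, dx) in enumerate(DIRS):
        cy, cx = y, x
        while 0 <= cy < h and 0 <= cx < w and img[cy, cx]:
            lengths[d] += 1
            cy, cx = cy + dy, cx + dx
    return lengths

def pdc_like_feature(img, zones=4):
    """Simplified peripheral direction-contributivity style feature.

    img: binary (0/1) 2-D array, 1 = ink.
    Returns a vector of length 4 sides * zones * 8 directions.
    """
    feats = []
    # Four views so that scanning each row left to right corresponds to
    # scanning the original image from the left, right, top, and bottom.
    for view in (img, img[:, ::-1], img.T, img.T[:, ::-1]):
        h, _ = view.shape
        zone_sum = np.zeros((zones, len(DIRS)))
        zone_cnt = np.zeros(zones)
        for y in range(h):
            xs = np.flatnonzero(view[y])
            if xs.size == 0:
                continue                    # no ink on this scan line
            lengths = run_lengths(view, y, xs[0])
            norm = np.linalg.norm(lengths)
            if norm > 0:
                z = min(y * zones // h, zones - 1)
                zone_sum[z] += lengths / norm   # direction contributivity
                zone_cnt[z] += 1
        zone_cnt[zone_cnt == 0] = 1
        feats.append(zone_sum / zone_cnt[:, None])
    return np.concatenate([f.ravel() for f in feats])
```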
3 Automatic Kanji Character Clipping

The flow of the automatic Kanji character clipping used to extract each character image for the recognition experiments is shown below; a simplified code sketch follows the description.

1) A page of the book images is given as an object image and divided into right and left parts.

2) Several pre-processes such as binarization, noise reduction and affine transform are applied to each divided half of the object image.

3) Layout analysis is applied to each part.

4) Character strokes are clipped according to the layout analysis results.

5) Labeling is performed for each clipped character stroke so that each label is mapped to a character image.

Several pre-processes are executed on each part image. Binarization is applied to the image since the distribution of the vertically projected pixel values is used for extracting character domains. Noises in the image might be recognized as false character domains, so noise reduction by a median filter is performed. Furthermore, distortion correction by an affine transformation is also applied, since the distortion introduced by capturing has a great influence on the layout analysis. Therefore, the layout analysis is performed more correctly with noise reduction and distortion correction. The threshold is calculated from the distribution of the vertically projected pixel values, and layout analysis is executed. The auto-correlation function by Sondhi [7] is applied to the distribution of the vertically projected pixel values to calculate the stroke spacing. Finally, labeling is performed for each character stroke, which is clipped based on the stroke spacing. Each labeled part of connected black dots is recognized as a character domain to clip a character from the original book images.
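Below is a minimal sketch of the layout analysis and labeling, assuming a binarized page half as input. It uses a vertical projection profile to find text columns and connected-component labeling to obtain candidate character domains; the column-splitting threshold and minimum component size are assumptions, and the Sondhi auto-correlation based spacing estimation and the affine distortion correction are omitted.

```python
import numpy as np
from scipy.ndimage import label, find_objects, median_filter

def clip_characters(binary_half, min_pixels=30):
    """Rough character clipping from one binarized page half (1 = ink).

    Returns a list of (row_slice, col_slice) boxes, one per candidate character.
    """
    img = median_filter(binary_half, size=3)           # noise reduction

    # Layout analysis: the vertical projection separates text columns.
    profile = img.sum(axis=0)
    threshold = 0.05 * profile.max() if profile.max() else 0
    in_column = profile > threshold

    boxes = []
    start = None
    for x, flag in enumerate(np.append(in_column, False)):
        if flag and start is None:
            start = x                                   # a column begins
        elif not flag and start is not None:
            column = img[:, start:x]
            # Labeling: each connected ink component is a character domain.
            labels, n = label(column)
            for sl in find_objects(labels):
                if column[sl].sum() >= min_pixels:      # drop tiny noise blobs
                    boxes.append((sl[0], slice(start + sl[1].start,
                                               start + sl[1].stop)))
            start = None
    return boxes
```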
In this paper, we use the character images obtained by the automatic character clipping. Figure 1 shows some examples of clipped character images. Each character image is clipped from one of nine early-modern books with various publication years and publishers, so that we get character images with various font sets. It means we get nine kinds of PDC feature vector sets for each Kanji character. A PDC feature vector is calculated from each of the nine character images in Fig. 1, for example. Furthermore, we calculate the average of each dimension to obtain the standard deviation of the PDC feature vectors. As a result, the standard deviation is 11.53, which seems to be a sufficiently small value. That is, the fluctuation among the characters is sufficiently small. Therefore, we use nine character images clipped from different books for each character. Four character images for each character are randomly selected as training samples, and the others are used as test samples.
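As a hypothetical illustration of this fluctuation check, the snippet below stacks the nine PDC vectors obtained for one character and reports the mean of the per-dimension standard deviations; the variable names and the way the per-dimension values are aggregated into a single number are assumptions of this sketch.

```python
import numpy as np

# vectors_per_book: list of nine PDC feature vectors (one per book) for a
# single Kanji character, e.g. produced by pdc_like_feature() above.
def fluctuation(vectors_per_book):
    stacked = np.vstack(vectors_per_book)   # shape: (9, feature_dim)
    per_dim_std = stacked.std(axis=0)       # spread of each dimension
    return per_dim_std.mean()               # single fluctuation measure
```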

Figure 1 Clipped Kanji character images from different books (book1 to book9)

Table 1 The list of nine early-modern works

Book Number | Title | Author | Publisher | Publication Year
1 | L'Incident de Sakai | Mori Ougai | Suzuki Miekichi | 1914
2 | Like that | Mori Ougai | Momiyama bookstores | 1914
3 | The Boat on the Takase River | Mori Ougai | Shunyodo Publishing Co., Ltd. | 1918
4 | I Am a Cat | Natsume Soseki | OKURA Publishing Co., Ltd. | 1905-1907
5 | London Tower | Natsume Soseki | Senshokan | 1915
6 | Tobacco and the Devil | Akutagawa Ryunosuke | SHINCHOSHA Publishing Co., Ltd. | 1922
7 | Strange Reunion | Akutagawa Ryunosuke | KINSEIDO Publishing Co., Ltd. | 1922
8 | Returning a Favor | Akutagawa Ryunosuke | Jiritsusha | 1923
9 | One Day in the Life of Oishi Kuranosuke | Akutagawa Ryunosuke | Bungei shunjyu Ltd. | 1926

4 Experiments

4.1 Experiment Method

We explain the data used in this experiment. To confirm the effectivity of our method, character images with various font sets should be collected from early-modern books of various ages. In the early-modern ages, different publishers used different font sets in general. Even the same publisher might use different font sets in different ages. Therefore, we need various character images for each kind of character, clipped from as many different font sets as possible. The question is how to get a sufficient number of samples for each target character. We use the Aozora Bunko [8] to check which books contain a target character.

The Aozora Bunko is a Web based online collection of literary works in Japanese. Users can download the works as text data. The number of works offered by the Aozora Bunko is about 7,200. All of them are also included in NDL's "Digital Library from the Meiji Era", so the works in the Aozora Bunko can be used as a subset of the works in the digital library. By using the text data of the works in the Aozora Bunko, the character occurrence frequency can be calculated easily. The data set for the experiment is collected from text data in the Aozora Bunko, and the image data corresponding to the data set are obtained from the digital library from the Meiji era as shown below.

Table 1 shows the list of nine early-modern works used for this experiment. Firstly, we examine all Kanji characters used in the nine early-modern works with the text data from the Aozora Bunko to find the Kanji characters that are used in all nine works in common. We find there are 262 kinds of commonly used Kanji characters. We randomly select several sets from the 262 kinds of Kanji characters, and clip the corresponding Kanji character images from the nine book images of the digital library. In this experiment, we use Kanji character sets of 16, 32, 64, 128, and 256 kinds to examine the recognition rate for each set. A minimal sketch of this selection from the Aozora Bunko text data is shown below.
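The following sketch shows how the commonly used Kanji characters could be identified from the Aozora Bunko text files; the file names, the CJK code-point range test, and the use of fixed random subsets are assumptions of this sketch rather than details of our implementation.

```python
import random

def kanji_set(path):
    """Return the set of Kanji characters appearing in one text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # CJK Unified Ideographs block, used as a rough test for Kanji.
    return {ch for ch in text if 0x4E00 <= ord(ch) <= 0x9FFF}

# Hypothetical file names for the nine Aozora Bunko texts.
paths = [f"aozora/book{i}.txt" for i in range(1, 10)]

common = set.intersection(*(kanji_set(p) for p in paths))
print(len(common))          # 262 kinds in our experiment

# Randomly drawn target sets of increasing size.
random.seed(0)
targets = {n: random.sample(sorted(common), n) for n in (16, 32, 64, 128, 256)}
```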

The original book images are monochrome with 256 gray scales. Several pre-processes are applied to each Kanji character image. The image data are binarized in the first pre-process for PDC feature extraction. The second pre-process is noise reduction by a median filter; a median filter with a 3 by 3 mask is applied to the images. The normalization of size and position is performed by trimming the margins to clip the Kanji character image area. To fit the Kanji character images precisely into a square of 128 by 128 pixels, centering and an affine transformation are applied. By these pre-processes, all Kanji character image data are converted to binary images of 128 by 128 pixels.

PDC feature vectors are generated from the pre-processed images and used for the learning phase of the SVM. Five samples are randomly selected for each Kanji character and used as training samples. The other samples are used for the recognition phase as test samples. We adopt an SVM as the recognition method.

4.2 Results

SVMs require careful parameter choices for their learning processes in general. In this experiment, we determined the parameters by grid-search and 5-fold cross-validation for the SVM. We use LIBSVM [9] for the experiments in this paper. The Radial Basis Function (RBF) is used as the kernel function of the SVM, so two parameters have to be chosen: the kernel parameter γ and the penalty parameter C. The numbers of Kanji character kinds are 16, 32, 64, 128, and 256, and the range of the two parameters is determined by grid-search. Each element of the PDC feature vectors is rescaled to the range [-1, 1]. The parameters for the five experiments are as follows: for 16, 32, 64, 128, and 256 kinds, (γ, C) = (2^-13, 2^5), (2^-9, 2^3), (2^-9, 2^3), (2^-11, 2^5), and (2^-13, 2^7), respectively. Note that these parameters are chosen just by a simple grid-search; since the experiments in this paper are preliminary, we do not optimize the parameters further.
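A minimal sketch of this parameter search is shown below, using scikit-learn's SVC (which wraps LIBSVM). The grid ranges are chosen to bracket the exponents reported above, but the data-loading helper and the exact grid are assumptions of this sketch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X: PDC feature vectors already rescaled to [-1, 1]; y: character labels.
# load_pdc_dataset() is a hypothetical helper returning such arrays.
X, y = load_pdc_dataset()

# Five of the nine samples per character for training, the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5 / 9, stratify=y, random_state=0)

param_grid = {
    "C": [2.0 ** e for e in range(1, 9, 2)],         # 2^1 ... 2^7
    "gamma": [2.0 ** e for e in range(-15, -6, 2)],  # 2^-15 ... 2^-7
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold CV
search.fit(X_train, y_train)

print("best (gamma, C):", search.best_params_["gamma"], search.best_params_["C"])
print("recognition rate:", search.score(X_test, y_test))
```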
Table 2 shows the experimental results. As is easily expected, the recognition rate becomes lower as the number of Kanji character kinds increases. However, the recognition rate always stays above 92%. Several examples of miss-recognized samples are shown in Tab. 3, Tab. 4 and Tab. 5.

Table 2 The number of errors and the recognition rates

The number of Kanji characters | Correct/Test samples | Errors | Recognition rate [%]
16 | 64/64 | 0 | 100
32 | 124/128 | 4 | 96.875
64 | 248/256 | 8 | 96.875
128 | 477/512 | 35 | 93.164
256 | 949/1024 | 75 | 92.676

4.3 Discussion

We analyzed the recognition results in detail and found three cases of miss-recognized Kanji characters. The three cases are listed below.

i) The miss-recognized Kanji characters have the same radical indices.

ii) The miss-recognized Kanji characters do not have the same radical indices but have similar structures.

iii) The miss-recognized Kanji characters do not have any similar structures.

Table 3 Miss-recognized characters with the same radical indices (the pre-processed image column of the original table is not reproduced)

Case | Correct character | Recognized character
Case1 | 違 | 遠
Case2 | 渡 | 沈
Case3 | 感 | 思
Case4 | 聞 | 間
Case5 | 過 | 通
Case6 | 連 | 遠
Case7 | 側 | 何
Case8 | 聞 |
Case9 | 問 | 間
Case10 | 後 | 微

Table 4 Miss-recognized characters with similar structures (the pre-processed image column of the original table is not reproduced)

Case | Correct character | Recognized character
Case11 | 置 |
Case12 | 五 | 左
Case13 | 右 | 左
Case14 | 眞 | 兵
Case15 | 場 | 現
Case16 | 床 | 成
Case17 | 申 | 出
Case18 | 時 | 持
Case19 | 幾 |
Case20 | 国 | 自
Case21 | 快 | 色
Case22 | 同 | 問
Case23 | 者 | 着
Case24 | 白 | 自

Table 5 Other miss-recognized characters (the pre-processed image column of the original table is not reproduced)

Case | Correct character | Recognized character
Case25 | 深 |
Case26 | 深 |
Case27 | 空 | 草
Case28 | 張 | 紙
Case29 | 得 |
Case30 | 寝 | 無
Case31 | 微 |
Case32 | 深 | 抵
Case33 | 無 | 成
Case34 | 結 | 着

Table 3 shows some examples of miss-recognized Kanji characters with the same radical indices, and Table 4 shows some examples of miss-recognized Kanji characters with similar structures. The recognition errors of cases 1-4 in Tab. 3 and cases 11-17 in Tab. 4 are caused by character strokes being removed in the pre-processing phase. Similar features of Kanji characters may be emphasized by the lack of character strokes. In cases 1-4, the similar features are the radical indices. In cases 11-17, the similar features are horizontal and vertical strokes and the slants of character strokes. These false features may lead the SVM to miss-recognition.

In cases 5-8 and cases 18-20, the pre-processes seem to eliminate complex structures of the character strokes. In particular, cases 6, 8, 19 and 20 are heavily smeared images. The recognition error of case 9 is caused by noises in the image that are not removed by the pre-processes. In case 10 and cases 21-24, the possibility that the recognition errors are caused by noises is low since the pre-processed images have relatively few noises. In each of these cases, the correct and miss-recognized characters have common features: horizontal and vertical character strokes and their slants. These common features may lead the SVM to false recognition.

Table 5 shows some examples of miss-recognition among Kanji characters that do not have any similar structures. The reason for the miss-recognition in cases 25-28 is the same as for cases 11-17. The reason for the miss-recognition in cases 29-34 is the same as for cases 5-8 and cases 18-20. Thus, noises prevent the margins of the original images from being clipped correctly, or they are recognized as parts of character strokes. Therefore, the calculation of the PDC feature vectors is greatly affected by noises in the original Kanji character images.

Table 6 reflects our discussion. It shows the number of miss-recognitions by title. Titles 2, 5, 7, 8 and 9 have a lot of miss-recognitions. Figure 2 shows examples of Kanji character images clipped from titles 2, 5, 7, 8 and 9. The Kanji character images in Fig. 2 are broken because of poor printing and archiving quality.

Table 6 Miss-recognitions by title

Book Number | Title | 32 kinds | 64 kinds | 128 kinds | 256 kinds
1 | L'Incident de Sakai | 0 | 0 | 0 | 1
2 | Like that | 1 | 1 | 7 | 12
3 | The Boat on the Takase River | 0 | 0 | 1 | 2
4 | I Am a Cat | 0 | 0 | 0 | 0
5 | London Tower | 2 | 1 | 6 | 10
6 | Tobacco and the Devil | 0 | 1 | 4 | 12
7 | Strange Reunion | 0 | 0 | 5 | 10
8 | Returning a Favor | 0 | 2 | 7 | 14
9 | One Day in the Life of Oishi Kuranosuke | 1 | 3 | 5 | 11

Figure 2 Examples of miss-recognized clipped Kanji characters (from books 2, 5, 7, 8 and 9)

Furthermore, Tab. 7 shows the relation between the number of Kanji character kinds and the number of miss-recognitions for cases i), ii) and iii). As the number of Kanji character kinds increases, the ratio of cases i) and ii) to the total increases. Therefore, miss-recognitions caused by similar structures increase as the number of target Kanji characters increases in the experiments.

Table 7 Miss-recognitions by case

The number of Kanji characters | Case (i) | Case (ii) | Case (iii)
32 | 0 | 2 | 2
64 | 0 | 3 | 5
128 | 5 | 15 | 12
256 | 13 | 27 | 28

Therefore, the reasons for miss-recognition are considered to be as follows. Firstly, noises sometimes have a bad effect on the PDC feature extraction. Secondly, there is the similarity of Kanji characters: radical indices, horizontal and vertical character strokes, and their slants.

5 Conclusions

In this paper, we evaluated the SVM based multi-fonts Kanji character recognition method for early-modern Japanese printed books. To evaluate our recognition method, we used 262 kinds of Kanji characters, which are commonly used in nine early-modern titles from different publishers found in "The Digital Library from the Meiji Era". We applied automatic character clipping to the nine titles to clip Kanji character images. To extract the features of the Kanji characters, the PDC feature was calculated from the pre-processed character images. For effective experiments, we generated five sets (16, 32, 64, 128, and 256 kinds) of Kanji characters. We selected each set from the 262 kinds of Kanji characters at random, and clipped the Kanji character images from the nine different book images of NDL's digital library. To recognize the Kanji character images based on the extracted PDC features, the feature vectors were given to an SVM for learning. When the targets were 16, 32, 64, 128, and 256 kinds of Kanji characters, the recognition rates were 100%, 96.875%, 96.875%, 93.164%, and 92.676%, respectively. Each recognition rate for the training samples was 100%.

We showed that our SVM based Kanji character recognition method can recognize printed Kanji characters clipped from early-modern Japanese printed books. However, two causes of miss-recognition were observed. Firstly, miss-recognitions are caused by noises: Kanji character images clipped from early-modern printed books are often broken or smeared because most of the early-modern printed books have been ill-preserved. The second cause is the similarity of Kanji characters: radical indices, horizontal and vertical character strokes, and their slants.

To extend the method to the full set of Japanese Kanji characters, we think some kind of hierarchical structure will be required for the learning data. In that case, the question is how the roughly 6,400 Japanese Kanji characters should be divided into a hierarchical structure, which is still an open problem. Another point of improvement is the pre-processing. We should improve the noise reduction so that it can handle the heavy noises caused by poor printing quality on degraded paper.

Acknowledgment

This work is partially supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology of Japan (MEXT).

References

[1] National Diet Library: www.ndl.go.jp/en/index.html

[2] Digital Library from the Meiji Era: kindai.ndl.go.jp/ (in Japanese)

[3] Pamphlet of the Digital Library from the Meiji Era: kindai.ndl.go.jp/information/kindai(eng)

[4] C. Ishikawa, N. Ashida, Y. Enomoto, M. Takata, T. Kimesawa, and K. Joe, "Recognition of Multi-Fonts Character in Early-Modern Printed Books," Int'l Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'09), 2009, pp.728-734.

[5] N. Hagita, S. Naito and I. Masuda, "Handprinted Chinese Characters Recognition by Peripheral Direction Contributivity Feature," IEICE Transactions, Vol. J66-D, No. 10, pp.1185-1192, 1983. (in Japanese)

[6] Japanese Standards Association: www.jsa.or.jp/default_english.asp

[7] M. M. Sondhi, "New Methods of Pitch Extraction," IEEE Transactions on Audio and Electroacoustics, Vol. AU-16, No. 2, June 1968, pp.262-266.

[8] Aozora Bunko: www.aozora.gr.jp/ (in Japanese)

[9] V. Vapnik, "The Nature of Statistical Learning Theory," Springer-Verlag, 1995.