A Multi-Fonts Kanji Character Recognition Method for Early-Modern Japanese Printed Books with Ruby Characters
Total Page:16
File Type:pdf, Size:1020Kb
A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters Taeka Awazu1, Manami Fukuo1, Masami Takata2 and Kazuki Joe2 1Graduate School of Humanities and Sciences, Nara Women's University, Nara, Japan 2Academic Group of Information and Computer Sciences, Nara Women's University, Nara, Japan Keywords: Character Recognition, Character Clipping, Genetic Programing, Early-modern Japanese Printed Books. Abstract: The web site of National Diet Library in Japan provides a lot of early-modern (AD1868-1945) Japanese printed books to the public, but full-text search is essentially impossible. In order to perform advanced search for historical literatures, the automatic textualization of the images is required. However, the ruby system, which is peculiar to Japanese books, gives a serious obstacle against the textualization. When we apply existing OCRs to early-modern Japanese printed books, the recognition rate is extremely low. To solve this problem, we have already proposed a multi-font Kanji character recognition method using the PDC feature and an SVM. In this paper, we propose a ruby character removal method for early-modern Japanese printed books using genetic programming, and evaluate our multi-fonts Kanji character recognition method with 1,000 types of early-modern Japanese printed Kanji characters. 1 INTRODUCTION project of automatic text extraction for eDigital Li- brary from the Meiji Eraf. In extracting text data from The National Diet Library (NDL) in Japan keeps image data of early-modern Japanese printed books, about 320,000 books dating from the Meiji era to the when existing OCRs are applied to the image data, first half of Showa era (AD1868-1945). These books the recognition rates are too low to be practical. We cover a broad range of books including philosophies, have reported that the recognition of the printing type literatures, histories, technologies, natural sciences, segmented from the early-modern Japanese printed etc. Many books are out of print although they are books is possible using the method of handwritten scientifically precious data. Therefore, in the NDL, li- Kanji character recognition (C. Ishikawa and Joe, brary materials have been and will be handed down to 2009). In early-modern Japanese printed books, it is the future generationsas cultural assets by digitalizing naturally inferred that used for each publisher adopts the archives of the materials for wider and easier use. a different typography. However, even if within the The electronic library service is provided as eDigi- same publisher, it has also been reported that ty- tal Library from the Meiji Eraf. At the NDL Web pography differs by age (M. Fukuo and Joe, 2012). site, user can search the materials by setting up de- For these reasons, we use the method of handwrit- tailed items, such as title, author, publisher, and pub- ten Kanji character recognition for text extraction of lication year. However, since the text of early-modern early-modern Japanese printed books. Japanese printed books is exhibited as image, full-text In order to achieve automatic text extraction from search is essentially impossible. In order to perform image data of such old books, we have to automat- the full-text search, it is required extracting texts from ically clip each character for its recognition. How- images. Although there are considerably many early- ever, the erubyf system, which is peculiar to Japanese modern Japanese printed books that are scientifically books, gives a serious obstacle against the clipping. precious, the text extraction of hundreds of thousands The ruby system has been developed for low edu- of books is impossible in budget if it is performed cated people to teach the way to read difficult Kanji by hand. There is no existing research on character characters. While there are six thousands of@Kanji recognition for early-modern Japanese printed books. characters used in Japan, all Japanese do not com- Based on the above backgrounds, we enlist co- pletely learn them. Low educated people can read operation from the NDL to have started the research only hundreds of Kanji characters. At the same Awazu T., Fukuo M., Takata M. and Joe K.. 637 A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters. DOI: 10.5220/0004825306370645 In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM-2014), pages 637-645 ISBN: 978-989-758-018-5 Copyright c 2014 SCITEPRESS (Science and Technology Publications, Lda.) ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods time, Japanese characters include phonetic charac- rent usual documents. We use 1,000 kinds of Kanji ters called Katakana and Hiragana distinct from Kanji characters from the JIS level-1 Kanji character set for characters, and all Japanese can read the phonetic our experiments. characters. The ruby system gives the way of pro- The rest of the paper is organized as follows. nouncing difficult Kanji characters by locating small Section 2 gives the description about fonts and the phonetic characters on the right side of each Kanji ruby system found in early-modern Japanese printed character. books. In section 3 we introduce existing methods It is well known that failure of character clip- for character clipping and their application to the ruby ping due to existence of ruby characters reduces the character removal. Our method of the ruby character character recognition rate for conventional OCRs. removal using genetic programming and experiment Since the early-modern Japanese printed books are results are presented in section 4. Section 5 gives the not given standard typography such as current books, description about character clipping, off-line hand- the character recognition rate of conventional OCRs written Kanji character recognition, and recognition for the old books is extremely low in general. As experiments. far as we know, any ruby removal methods for early- modern Japanese printed books have not been studied. As for existing methods to remove ruby characters 2 RUBYANDFONTOF from current books with standard typography, there are two main methods (N. Stamatopoulos and Gatos, EARLY-MODERN JAPANESE 2009) (Fletcher and Kasturi, 1988): (1) Separating PRINTED BOOKS ruby characters linearly using density histogram and (2) separating ruby characters using circumscription Japanese letter consists of three kinds of characters: rectangles. Both methods assume the standard ty- Hiragana, Katakana, and Kanji. The first two charac- pography for the target books. Otherwise, a lot of ter sets are syllabaries and include about 70 phonetic parts of a main Kanji characterwould be stronglycon- characters while the last character set is an ideogram nected with the corresponding ruby characters, and and include more than six thousands characters. Sen- valleys of the histogram do not appear clearly. There- tences in Japanese books are usually printed vertically fore, good recognition results cannot be obtained in using the above three character sets. Ruby is a sys- the case of (1). In the case of (2), strongly connected tem for low educated Japanese to support pronounc- ruby characters to a main Kanji character would make ing difficult Kanji characters written in the books by it very difficult to construct a rectangle just for the locating small Hiragana or Katakana characters at the main Kanji character, and ruby character removing right side of difficult Kanji characters. We call the becomes very difficult. small phonetic characters and the (big) Kanji char- In this paper, we propose a ruby character removal acters Ruby characters and parent characters, respec- method for early-modern Japanese printed books. We tively. Most Japanese books adopt the ruby system, classify these books into several categories to gener- and currently most books are generated in a desktop ate ruby character removal filters for each category publishing system, where the standard of fonts and using the genetic programming (Koza, 1992). We use ruby is defined by JIS. Early-modernJapanese printed the following two methods for the classification. One books are published from the Meiji era (1868-1912) is the classification using the information of publish- to the middle of the Showa era (1926-1980) with ers and publication ages added to the books. The typographical printing. Since there are no standard other is the classification using the feature acquired typographies by the middle of the Showa era, vari- from booksf images. In order to investigate the differ- ous fonts and ruby systems are used in early-modern ence of these methods, we perform comparative ex- Japanese printed books. Figure 1 shows one of the periments for ruby character removal. current fonts and two fonts in early-modern express- ing the same Kanji. When a ruby system is used in a Furthermore, in order to absorb various fonts of early-modern Japanese printed books, we perform recognition experiments with 1,000 kinds of Kanji characters from the old books using an off-line hand- written Kanji character recognition method with the Figure 1: A current font and two old fonts expressing the PDC feature (N. Hagita and Masuda, 1983). By same Kanji. Japanese Industrial Standards (JIS), 2,965 Kanji char- acters that are frequently used are defined as the JIS book, ruby characters are located at the right side of level-1 Kanji character set that includes most of cur- the corresponding parent Kanji character so that the 638 AMulti-fontsKanjiCharacterRecognitionMethodforEarly-modernJapanesePrintedBookswithRubyCharacters outer frames of the ruby characters contact with the ting at turning points of the cross direction histogram. outer frame of the parent Kanji character. Figure 2 Figure 4 shows the black pixel projection histogram shows an example of ruby character location. of a Kanji character. Figure 2: An example of the positions of ruby. The ruby characters found in early-modern Figure 4: Black pixel projection histogram. Japanese printed books tend to be located to their par- Most Kanji characters are constructed with several ent Kanji characters with extremely close.