Perspectives for the Historical Information Retrieval with Digitized Japanese Classical Manuscripts

The 20th Annual Conference of the Japanese Society for Arti¯cial Intelligence, 2006 2P6 Perspectives for the Historical Information Retrieval with Digitized Japanese Classical Manuscripts Hari Raghavacharya¤1, Rafal Rzepka¤2, Kenji Araki¤2 and Toshimasa Miyazawa¤1 ¤1Graduate School of Letters, Hokkaido University ¤2Graduate School of Information Science and Technology, Hokkaido University This paper outlines a system to retrieve and convert text matter to digital form from old manuscripts written in Classical Kana Characters.The system consists of ¯ve stages 1.Preparation of sample kana handwriting 2.Primary classi¯cation into sub categories 3.The Segmentation of Target Classical Manuscript 4.Feature Extraction 5.Recog- nition.The detailed aspect of preliminary stages like sample collection and primary classi¯cation will entail greater accuracy in recognition of a wide range of Kana manuscripts. 1. Introduction 2. Kana writing system Character recognition has come long way since it was The modern Japanese writing system consists of three el- conceived as an idea. Greater accuracy has become possi- ements, Hiragana and Katakana syllabary and the Chinese ble in extracting printed matter to a text form. Its poten- ideograms called Kanji. It is typical for a sentence to have tial is recognized widely for digitalizing the contents of text all the three elements, although the usage of Katakana is printed in an earlier era. However, handwriting recognition restricted to proper nouns of foreign origin. remains in its infancy despite its huge promise. There are Kana writing system is derived from Chinese ideograms several systems in existence for handwriting recognition of that were used in early Nara period as phonetic symbols Latin based languages[1][2]. Some intensive research is seen for Japanese syllabic sounds. These were called as Manyo- in CJK version of handwriting recognition as well[3][4][5]. gana and had no meaning apart from the sound association. There are no existing systems for recognition of classical Handwritten Manyogana were abbreviated and simpli¯ed Japanese Kana. There is a large corpus of manuscripts in actual usage, which gave rise to a new system of writ- written in Kana that includes some very famous works. ing called Sogana. This Sogana or Kana system of writing Most of the famous works have been rendered into modern remained popular throughout Japanese history till it was Japanese by going over entire manuscripts letter by letter. replaced by Hiragana syllabary during Meiji period. In This process has covered a minuscule amount of texts and Kana system, Kana characters derived from multiple Chi- is not likely to take a vast majority of others unless some nese ideograms coexisted until Hiragana syllabary ¯xed one tool is made available to make the work simpler. Kana for one syllable. However, for a manuscript of a pre- In this paper, we propose a recognition system for classi- ceding era, there are no ¯xed number of Kana syllabary but cal Kana characters. The proposed process can be divided the most commonly used Kana characters` number could be into 3 stages. The ¯rst stage covers collection of sample estimated around 200. handwriting spread over a few centuries. The second stage covers conversion of these samples into binaries and clas- 3. Ideas for Digital Transcription of sifying them according to their sizes as a part of primary Classical Kana Manuscripts classi¯cation. In the ¯nal stage, an algorithm is used to recognize an unknown character as one of the primary clas- 3.1 A typical Manuscript si¯cation groups. Classical manuscripts have problems such as deteriorat- The proposed system is being worked on as an interdisci- ing paper, discoloration and worm holes. We use pho- plinary joint project. tographs of the original manuscript scanned at 300dpi, This paper is organized in following sections. Section 2 distortions and noise are removed using thresholding deals with the background of Kana system of writing. Sec- techniques[6]. Some manuscripts are written on papers tion 3 introduces the proposed system while section 4 deals with exquisite patterns which when photographed show with the possibilities of the proposed system. We present as prominently as the text written with brush,making the our conclusions in section 5 and discuss future work. reading of a particular portion inordinately di±cult. The Contact: Language Media Laboratory, Research Group of background of such a manuscript image needs to be con- Information Media Science and Technology, Division verted to white to make the text legible. Most of the com- of Media and Network Technologies, Graduate School monly available manuscripts, however, are on plain form of of Information Science and Technology, Hokkaido Uni- paper. There are some existing techniques for handling the versity, Kita-ku Kita 14 Nishi 9, 060-0814 Sapporo, discoloration and worm hole problems[7]. Japan. TEL: (+81)(11)706-6535, FAX: (+81)(11)706- 6277, fhari,kabura,[email protected] 1 The 20th Annual Conference of the Japanese Society for Arti¯cial Intelligence, 2006 3.2 Problems in Digital Transcription of Clas- pre-classi¯ed groups can be determined based on the four sical Kana Manuscripts parameters. Most of the classical texts have annotations and revi- sions and reviews scribbled alongside the main text mat- (1) The Kana character touches only the Central Refer- ter.Corrections by the author exist side by side with notes ence line of a reviewer which makes it extremely di±cult to identify (1) The Kana character touches or protrudes all three ref- the original text. erence lines A classical manuscript composed in Kana does not mean that it is entirely written in just one system of syllabary. (1) The Kana character protrudes RRL but does not touch There are frequent occurrences of Chinese ideograms or the LRL Kanji which require a separate handling apart from Kana when it comes to digital transcription of the documents. (1) The Kana character protrudes LRL but does not touch RRL 3.3 The proposed method The proposed method consists of ¯ve di®erent stages. 3.34 The Segmentation of Target Classical ManuscriptThe The ¯rst one is Pre-processing that includes preparation system of Kana writing seen in classical works is entirely of sample handwriting of di®erent people spread over a few vertical. Since all of them are composed with brush, there centuries. A primary classi¯cation of samples into select is a tendency to write without lifting the brush till the ink sub categories based on average occurrence of a letter with runs dry. This results in a set of Kana characters that are reference to provided reference lines. connected without any demarcation among individual let- The next stage deals with the segmentation of the Target ters. This causes some problems in segmenting the text text followed by feature extraction. The ¯nal stage is recog- along the traditional methods. nition. To overcome this problem, it is proposed that all joined 3.31 Preparation of sample Kana handwritingIn order to Kana characters are treated as one unit apart from the sep- make this experimental system work on a wide range of arate Kana characters which constitute a single unit. Each Kana texts, a corpus of Kana handwriting samples will be such unit is referred to an N-gram database to check the ex- built. Beginning with sample handwriting from the early istence of a particular combination of characters to suit the Kana literary texts, a wide range of handwriting samples context or alternate segmenting positions are considered till will be collected spread over a few centuries. All forms and the appropriate combination is rendered[10]. variations of one Kana will be categorized into one Kana 3.35 Feature ExtractionOnce the Target Text is seg- class mented, feature vector is extracted for each of the seg- 3.32 Digital Transcription of Classical Kana Manuscripts mented area. The pixel density is calculated in each zone Each sample will be scanned at a speci¯ed resolution and where each zone is identical to the zone size in the sample segmented.In the experimental stage the resolution of 300 images already prepared. dpi is considered suitable. Segmentation size of each sam- 3.36 RecognitionRecognition of Kana characters is made ple is proposed at 32x32 pixels. The samples thus rendered by an algorithm to identify similarity.The positon of Kana will be stored as 8 bit gray-scale images. Each image is character with reference to the Reference lines determine then further broken down to 4x4 pixel zones for which the which subcategory of Kana the unknown character belongs pixel density is calculated. An initial feature vector extrac- to. Inside the sub category, to which Kana class the un- tion captures the pixel density of each zone for each of the known character belongs to is determined on the basis of images. most number of matches. The three best matches are then 3.33 Primary classi¯cation into sub categoriesThree verti- processed using N-gram database to identify the most ap- cal reference lines parallel to each other are drawn for every propriate character. column of Kana writing. The position of each Kana is then calculated with reference to these lines. For each column 4. Possibilities of Kana, the position of central reference line is estimated upon the measurement of the width of the ¯rst Kana char- The proposed system can contribute to accurate digi- acter of that column, so that it occurs precisely at the cen- talization of classical Japanese manuscripts.This in turn ter. The Right and Left reference lines are then estimated opens up new vistas for Japanologists and Japanese after the calculation of the average character width of that Philologists. A classical manuscript in Kana is useful particular column.

Perspectives for the Historical Information Retrieval with Digitized Japanese Classical Manuscripts

The Study of Old Documents of Hokkaido and Kuril Ainu

On Translation a Short Introduction to the Japanese Language Will

Title Classical Japanese in Linguistic and Cross-Cultural Perspective Sub

Source-Based Translation and Foreignization: a Japanese Case

A Comparative Analysis of the Simplification of Chinese Characters in Japan and China

KANA Response Live Organization Administration Tool Guide

Arxiv:1812.01718V1 [Cs.CV] 3 Dec 2018

Japanese Major and Minor Revised: 06/2021

DBT and Japanese Braille

A Discovery in the History of Research on Japanese Kana Orthography: Ishizuka Tatsumaro's Kanazukai Oku No Yamamichi

The Formation of “New” Kana Written Horizontally

Discriminative Method for Japanese Kana-Kanji Input Method