Perspectives for the Historical Information Retrieval with Digitized Japanese Classical Manuscripts

The 20th Annual Conference of the Japanese Society for Artificial Intelligence, 2006
2P6

Hari Raghavacharya*1, Rafal Rzepka*2, Kenji Araki*2 and Toshimasa Miyazawa*1
*1 Graduate School of Letters, Hokkaido University
*2 Graduate School of Information Science and Technology, Hokkaido University

Contact: Language Media Laboratory, Research Group of Information Media Science and Technology, Division of Media and Network Technologies, Graduate School of Information Science and Technology, Hokkaido University, Kita-ku Kita 14 Nishi 9, 060-0814 Sapporo, Japan. TEL: (+81)(11)706-6535, FAX: (+81)(11)706-6277

This paper outlines a system to retrieve and convert text matter to digital form from old manuscripts written in classical Kana characters. The system consists of five stages: (1) preparation of sample Kana handwriting, (2) primary classification into subcategories, (3) segmentation of the target classical manuscript, (4) feature extraction, and (5) recognition. Careful attention to the preliminary stages, such as sample collection and primary classification, is expected to yield greater accuracy in recognizing a wide range of Kana manuscripts.

1. Introduction

Character recognition has come a long way since it was conceived as an idea. Greater accuracy has become possible in extracting printed matter to a text form. Its potential is widely recognized for digitizing the contents of text printed in an earlier era. However, handwriting recognition remains in its infancy despite its huge promise. There are several systems in existence for handwriting recognition of Latin-based languages [1][2]. Some intensive research is seen in CJK handwriting recognition as well [3][4][5].

There are no existing systems for recognition of classical Japanese Kana. There is a large corpus of manuscripts written in Kana that includes some very famous works. Most of the famous works have been rendered into modern Japanese by going over entire manuscripts letter by letter. This process has covered only a minuscule share of the texts and is unlikely to reach the vast majority of the others unless some tool is made available to make the work simpler.

In this paper, we propose a recognition system for classical Kana characters. The proposed process can be divided into three stages. The first stage covers collection of sample handwriting spread over a few centuries. The second stage covers conversion of these samples into binary images and classifying them according to their sizes as a part of primary classification. In the final stage, an algorithm is used to recognize an unknown character as one of the primary classification groups.

The proposed system is being worked on as an interdisciplinary joint project.

This paper is organized in the following sections. Section 2 deals with the background of the Kana system of writing. Section 3 introduces the proposed system, while Section 4 deals with the possibilities of the proposed system. We present our conclusions in Section 5 and discuss future work.

2. Kana writing system

The modern Japanese writing system consists of three elements: the Hiragana and Katakana syllabaries and the Chinese ideograms called Kanji. It is typical for a sentence to have all three elements, although Katakana is used mainly for words and proper nouns of foreign origin.

The Kana writing system is derived from Chinese ideograms that were used in the early Nara period as phonetic symbols for Japanese syllabic sounds. These were called Manyogana and had no meaning apart from the sound association. Handwritten Manyogana were abbreviated and simplified in actual usage, which gave rise to a new system of writing called Sogana. This Sogana or Kana system of writing remained popular throughout Japanese history until it was replaced by the Hiragana syllabary during the Meiji period. In the Kana system, Kana characters derived from multiple Chinese ideograms coexisted until the Hiragana syllabary fixed one Kana for one syllable. For a manuscript of a preceding era, however, there is no fixed number of Kana characters, but the number of commonly used Kana characters can be estimated at around 200.

3. Ideas for Digital Transcription of Classical Kana Manuscripts

3.1 A typical Manuscript

Classical manuscripts have problems such as deteriorating paper, discoloration and worm holes. We use photographs of the original manuscript scanned at 300 dpi; distortions and noise are removed using thresholding techniques [6]. Some manuscripts are written on papers with exquisite patterns which, when photographed, show as prominently as the text written with brush, making the reading of a particular portion inordinately difficult. The background of such a manuscript image needs to be converted to white to make the text legible. Most of the commonly available manuscripts, however, are written on plain paper. There are some existing techniques for handling the discoloration and worm hole problems [7].

3.2 Problems in Digital Transcription of Classical Kana Manuscripts

Most of the classical texts have annotations, revisions and reviews scribbled alongside the main text matter. Corrections by the author exist side by side with notes of a reviewer, which makes it extremely difficult to identify the original text.

A classical manuscript composed in Kana is not necessarily written entirely in one system of syllabary. There are frequent occurrences of Chinese ideograms, or Kanji, which require handling separate from Kana when it comes to digital transcription of the documents.

3.3 The proposed method

The proposed method consists of five different stages. The first one is pre-processing, which includes preparation of sample handwriting of different people spread over a few centuries. This is followed by a primary classification of the samples into select subcategories, based on where a letter typically occurs relative to the provided reference lines. The next stage deals with the segmentation of the target text, followed by feature extraction. The final stage is recognition.

3.3.1 Preparation of sample Kana handwriting

In order to make this experimental system work on a wide range of Kana texts, a corpus of Kana handwriting samples will be built. Beginning with sample handwriting from the early Kana literary texts, a wide range of handwriting samples will be collected spread over a few centuries. All forms and variations of one Kana will be categorized into one Kana class.

3.3.2 Digital Transcription of Classical Kana Manuscripts

Each sample will be scanned at a specified resolution and segmented. In the experimental stage, a resolution of 300 dpi is considered suitable. The segmentation size of each sample is proposed at 32x32 pixels. The samples thus rendered will be stored as 8-bit gray-scale images. Each image is then further broken down into 4x4 pixel zones, for which the pixel density is calculated. An initial feature vector extraction captures the pixel density of each zone for each of the images.

3.3.3 Primary classification into sub categories

Three vertical reference lines parallel to each other are drawn for every column of Kana writing. The position of each Kana is then calculated with reference to these lines. For each column of Kana, the position of the central reference line is estimated from the measured width of the first Kana character of that column, so that the line falls precisely at its center. The Right and Left reference lines are then estimated after the calculation of the average character width of that particular column.

[...] pre-classified groups can be determined based on the four parameters.

(1) The Kana character touches only the Central Reference Line
(2) The Kana character touches or protrudes beyond all three reference lines
(3) The Kana character protrudes beyond the Right Reference Line (RRL) but does not touch the Left Reference Line (LRL)
(4) The Kana character protrudes beyond the LRL but does not touch the RRL

3.3.4 The Segmentation of Target Classical Manuscript

The system of Kana writing seen in classical works is entirely vertical. Since all of them are composed with brush, there is a tendency to write without lifting the brush until the ink runs dry. This results in a set of Kana characters that are connected without any demarcation among individual letters. This causes some problems in segmenting the text with traditional methods.

To overcome this problem, it is proposed that each run of joined Kana characters be treated as one unit, while a separately written Kana character constitutes a unit by itself. Each such unit is checked against an N-gram database for the existence of a combination of characters that suits the context; otherwise, alternate segmenting positions are considered until an appropriate combination is found [10].

3.3.5 Feature Extraction

Once the target text is segmented, a feature vector is extracted for each of the segmented areas. The pixel density is calculated in each zone, where the zone size is identical to that used for the sample images already prepared.

3.3.6 Recognition

Recognition of Kana characters is made by an algorithm that identifies similarity. The position of the Kana character with reference to the reference lines determines which subcategory of Kana the unknown character belongs to. Within the subcategory, the Kana class to which the unknown character belongs is determined on the basis of the greatest number of matches. The three best matches are then processed using the N-gram database to identify the most appropriate character.

4. Possibilities

The proposed system can contribute to accurate digitization of classical Japanese manuscripts. This in turn opens up new vistas for Japanologists and Japanese philologists. A classical manuscript in Kana is useful [...]
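
The zone-based pixel density features and the reference-line pre-classification described in the proposed method can be sketched as follows. This is only an illustrative sketch, not the authors' implementation. The function names, the ink threshold of 128, and the geometric tests for "touches" and "protrudes" are assumptions made here. The sketch reads "4x4 pixel zones" as zones of 4 by 4 pixels, which gives 64 values for a 32x32 sample; the other possible reading, a 4-by-4 grid of 8x8-pixel zones, corresponds to zone=8.

```python
from typing import List, Optional, Sequence

IMAGE_SIZE = 32      # sample size proposed in the paper (32x32 pixels)
ZONE_SIZE = 4        # assumed reading of "4x4 pixel zones": 4 by 4 pixels each
INK_THRESHOLD = 128  # assumption: 8-bit gray levels below this count as ink


def zone_density_features(
    image: Sequence[Sequence[int]],
    zone: int = ZONE_SIZE,
    ink_threshold: int = INK_THRESHOLD,
) -> List[float]:
    """Return the fraction of ink pixels in each zone, scanning row by row."""
    height = len(image)
    width = len(image[0])
    if height % zone or width % zone:
        raise ValueError("image dimensions must be multiples of the zone size")
    features = []
    for top in range(0, height, zone):
        for left in range(0, width, zone):
            ink = sum(
                1
                for y in range(top, top + zone)
                for x in range(left, left + zone)
                if image[y][x] < ink_threshold
            )
            features.append(ink / (zone * zone))
    return features


def reference_line_subcategory(
    x_min: int, x_max: int, lrl: int, crl: int, rrl: int
) -> Optional[int]:
    """Map a character's horizontal extent to one of the four groups.

    x_min and x_max are the leftmost and rightmost ink columns of the
    character; lrl < crl < rrl are the x positions of the Left, Central
    and Right reference lines. Returns 1 to 4, or None if no rule applies.
    The comparisons below are an assumed interpretation of the four rules.
    """
    reaches_left = x_min <= lrl
    reaches_right = x_max >= rrl
    touches_center = x_min <= crl <= x_max
    if reaches_left and reaches_right:
        return 2  # touches or protrudes beyond all three lines
    if reaches_right:
        return 3  # protrudes beyond the RRL, does not touch the LRL
    if reaches_left:
        return 4  # protrudes beyond the LRL, does not touch the RRL
    if touches_center:
        return 1  # touches only the Central Reference Line
    return None


if __name__ == "__main__":
    # Hypothetical sample: a blank page with one fully inked 4x4 zone.
    sample = [[255] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
    for y in range(ZONE_SIZE):
        for x in range(ZONE_SIZE):
            sample[y][x] = 0
    print(len(zone_density_features(sample)))
    print(reference_line_subcategory(12, 20, lrl=8, crl=16, rrl=24))
```

Using the same zone size for the sample images and for the segmented target text keeps the two feature vectors directly comparable, which is what the feature extraction stage requires.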