Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature

Teruaki Oka†   Mamoru Komachi†   Toshinobu Ogiso‡   Yuji Matsumoto†
† Nara Institute of Science and Technology
‡ National Institute for Japanese Language and Linguistics

Abstract

Since the present-day Japanese use of the voiced consonant mark was established only in the Meiji Era, modern Japanese literary text written in the Meiji Era often lacks compulsory voiced consonant marks. This degrades the performance of morphological analyzers that use an ordinary dictionary. In this paper, we propose an approach to the automatic labeling of voiced consonant marks in modern literary Japanese. We formulate the task as a binary classification problem. Our pointwise prediction method uses as its feature set only surface information about the surrounding character strings. As a consequence, a training corpus is easy to obtain and maintain, because we can exploit a partially annotated corpus for learning. We compared our proposed method, used as a preprocessing step for morphological analysis, with a dictionary-based approach, and confirmed that pointwise prediction outperforms the dictionary-based approach by a large margin.

1 Introduction

Recently, corpus-based approaches have been successfully adopted in the field of Japanese linguistics. However, the central part of the field has been occupied by historical research using ancient materials, for which fundamental annotations are often not yet available.

Despite the limited annotated corpora, researchers have developed several morphological analysis dictionaries for past stages of Japanese. The National Institute for Japanese Language and Linguistics has created Kindai-bungo UniDic,[1] a morphological analysis dictionary for modern Japanese literary text,[2] which achieves high performance on existing electronic text (e.g. Aozora-bunko, an online digital library of freely available books drawn mainly from out-of-copyright materials).

However, the performance of morphological analyzers using such a dictionary deteriorates if the text is not normalized, because these dictionaries often lack orthographic variations such as Okuri-gana,[3] the accompanying characters that follow Kanji stems in Japanese written words. This is problematic because not all historical texts are manually normalized in orthography, and annotating them by hand is time-consuming. This is one of the major issues in applying NLP tools to Japanese linguistics, because historical materials often contain a wide variety of orthographic variations.

For example, consider the voiced consonant mark. Any Hiragana or Katakana character (together called Kana characters) represents either a consonant onset (k, s, t, n, h, m, y, r, w) with a vowel nucleus (a, i, u, e, o), or the vowel alone (except for the nasal coda N). Furthermore, in current orthography a character alone cannot represent a syllable beginning with a voiced consonant (g, z, d, b); such syllables are spelled with a Kana character plus a voiced consonant mark (゛) at its upper right (see Figure 1). Confusingly, however, writing the character without the mark to represent a voiced syllable was not ungrammatical until the Meiji Era, because current Japanese orthography itself dates back to the Meiji Era. Consequently, modern Japanese literary text written in the Meiji Era often lacks compulsory voiced consonant marks. The mark was used only when the author deemed it necessary for disambiguation, and it was often omitted when the voiced pronunciation could be inferred from context.

Figure 1: Spelling syllables that begin with the voiced consonants g, z, d and b with Hiragana characters and a voiced consonant mark.

ga が = か (ka) + ゛   za ざ = さ (sa) + ゛   da だ = た (ta) + ゛   ba ば = は (ha) + ゛
gi ぎ = き (ki) + ゛   zi じ = し (si) + ゛   di ぢ = ち (ti) + ゛   bi び = ひ (hi) + ゛
gu ぐ = く (ku) + ゛   zu ず = す (su) + ゛   du づ = つ (tu) + ゛   bu ぶ = ふ (hu) + ゛
ge げ = け (ke) + ゛   ze ぜ = せ (se) + ゛   de で = て (te) + ゛   be べ = へ (he) + ゛
go ご = こ (ko) + ゛   zo ぞ = そ (so) + ゛   do ど = と (to) + ゛   bo ぼ = ほ (ho) + ゛

Figure 2 shows characters that lack the voiced consonant mark even though we would expect them to be marked. Hereafter, we call such characters "unmarked characters," and we call the characters to which the voiced consonant mark can be attached "ambiguous characters."

Figure 2: Example sentence including unmarked characters, excerpted from "The tide of Hiroshima": Katsuichi Noguchi, Taiyo, No.2, p.64 (1925). In the original figure, wavy-underlined characters are ambiguous characters and gray-boxed characters are unmarked characters.

今や廣島は其名大に内外國に顯はれ苟も時事を談するものは同地の形勢如何を知らんと欲せさるはあらす
(Today, the fame of "Hiroshima" has been broadly known in and outside Japan, and if you talk about current affairs, you want to know how the place has been established.)

In Table 1, we present statistics on voiced consonants in the "Kokumin-no-tomo" corpus, which we use for our evaluation. As the table shows, 12% of the ambiguous characters are actually voiced but left unmarked. Moreover, only 44% of the voiced characters carry the voiced consonant mark, showing how inconsistently the mark is used in this corpus.

Table 1: Contingency table of observed frequencies of characters and voicedness.

            Voiced   Voiceless
Marked       3,124           0
Ambiguous    4,032      29,332

In modern Japanese literary text, unmarked characters are not the only orthographic variation. However, unmarked characters appear frequently in the text and can easily be annotated by hand, so we can obtain natural texts for evaluating our method at low cost (in fact, annotating our above-mentioned test corpus took only a few weeks). We therefore decided to begin with attaching the voiced consonant mark to unmarked characters as a starting point for normalizing orthographic variations.

Basically, Kindai-bungo UniDic is created for fully normalized sentences that do not include unmarked characters; thus, if a target sentence includes unmarked characters, performance can degrade considerably. There are two major approaches to handling this problem: a dictionary-based approach and a classification-based approach.

First, the dictionary-based approach creates a dictionary that contains both the original spellings and the modified variants without the mark. For example, Kindai-bungo UniDic includes both the entry "ず (zu)" and the unmarked variant "す (zu)" for frequent words such as the auxiliary verb "ず (zu)". This allows morphological analysis algorithms to learn the weights of both entries together from a corpus annotated with part-of-speech tags, and to select the appropriate entry during decoding.

Second, the classification-based approach employs a corpus annotated with unmarked characters to learn a classifier that attaches the voiced consonant mark to unmarked characters. Unlike the dictionary-based approach, the classification-based approach requires neither part-of-speech-tagged nor tokenized corpora. Since it is easier for human annotators to annotate unmarked characters than word boundaries and part-of-speech tags, we can obtain a large-scale annotated corpus at low cost.

Therefore, in this paper, we propose a classification-based approach to the automatic labeling of voiced consonant marks as a preprocessing step for morphological analysis of modern Japanese literary language. We formulate the task of labeling voiced consonant marks as a binary classification problem. Our method uses as its feature set only surface information about the surrounding character strings, with pointwise prediction, whose training data are available at low cost. We use an online learning method to learn the large spelling variation from massive datasets rapidly and accurately; thus, we can improve performance simply by increasing the amount of training data. In addition, we perform clustering of Kanji, which are abundant in the training data, and employ class n-grams to address the data-sparseness problem. We compared our classification-based approach with the dictionary-based approach and show that the classification-based method outperforms the dictionary-based method, especially in an out-of-domain setting.

… and part-of-speech tags. Thus, a word segmentation model learned from such a limited amount of data is unreliable. Unlike Nagata's method, our classification-based method does not rely on word segmentation and can exploit low-level annotation such as the voiced consonant mark, which is available quite easily. In addition, Nagata performed clustering of characters to smooth the confusion probabilities among characters and narrow down correction candidates. We also perform clustering, on Kanji, to address the data-sparseness problem. Though Nagata uses character shape for clustering, we instead use the neighboring characters of each Kanji character. The intuition behind this is that whether …

Footnotes:
[1] http://www2.ninjal.ac.jp/lrc/index.php?UniDic
[2] In historical linguistics, the phrase "modern Japanese" refers, in a broad sense, to the language from 1600 to the present. However, most Japanese people take the phrase to mean the Meiji and Taisho Eras; we also use it in this narrower sense.
[3] In Japanese, both Kana (phonographic characters) and Kanji (ideographic characters) are used together. Generally, a conjugated form is ambiguous given only the preceding Kanji characters; however, a character's pronunciation can also be written with Kana characters. Thus, the last few syllables of the pronunciation are hung out (Okuri) as Kana (gana) characters to disambiguate the form. Although the number of Okuri-gana is now fixed for each Kanji character, it was not fixed in the Meiji Era.

Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 292–300, Chiang Mai, Thailand, November 8–13, 2011. © 2011 AFNLP
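The Kana-plus-mark composition in Figure 1 is mirrored directly in Unicode: the spacing mark ゛ is U+309B, its combining counterpart is U+3099, and NFC normalization composes a base Kana with U+3099 into the precomposed voiced letter. A minimal sketch (not from the paper):

```python
import unicodedata

def add_voiced_mark(kana: str) -> str:
    """Attach the combining voiced sound mark (U+3099) to a base Kana.

    NFC normalization canonically composes e.g. か + U+3099 into が.
    """
    return unicodedata.normalize("NFC", kana + "\u3099")

# A few rows of the Figure 1 mapping, reproduced via Unicode composition:
for base, voiced in [("か", "が"), ("し", "じ"), ("つ", "づ"), ("は", "ば")]:
    assert add_voiced_mark(base) == voiced
```

The same composition works for Katakana, since U+3099 is shared by both scripts.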
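The 12% and 44% figures quoted in the text follow directly from the raw counts in Table 1; a quick sanity check:

```python
# Observed frequencies from Table 1.
marked_voiced = 3124       # voiced characters carrying the mark
unmarked_voiced = 4032     # voiced but unmarked ("ambiguous", voiced)
unmarked_voiceless = 29332 # unmarked and genuinely voiceless

ambiguous = unmarked_voiced + unmarked_voiceless  # all characters without a mark
voiced = marked_voiced + unmarked_voiced          # all voiced characters

print(round(100 * unmarked_voiced / ambiguous))  # -> 12 (% of ambiguous chars that are voiced)
print(round(100 * marked_voiced / voiced))       # -> 44 (% of voiced chars that are marked)
```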
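For the dictionary-based approach, the unmarked variant of an entry (e.g. す for ず) can be derived mechanically by decomposing the string and dropping the combining voiced sound mark. The sketch below uses Unicode NFD/NFC normalization; the paper does not say how the UniDic variants were actually generated, so this is only one plausible mechanism:

```python
import unicodedata

def unmarked_variant(entry: str) -> str:
    """Derive the spelling variant without voiced consonant marks
    (e.g. ず -> す) by canonical decomposition and removal of U+3099."""
    decomposed = unicodedata.normalize("NFD", entry)
    return unicodedata.normalize("NFC", decomposed.replace("\u3099", ""))

assert unmarked_variant("ず") == "す"
```

Applying this to every dictionary entry containing a voiced Kana would yield the paired original/unmarked entries described above.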
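For the classification-based approach, pointwise prediction means that each ambiguous character is classified independently, from surface context alone, so even partially annotated sentences yield usable training instances. The paper's exact feature templates are not given in this excerpt; the sketch below uses hypothetical character uni- and bi-gram window features as an illustration:

```python
def window_features(text: str, i: int, n: int = 2) -> list:
    """Surface features for deciding whether text[i] should carry the
    voiced consonant mark: character uni-/bi-grams in a +-n window.
    Window size and n-gram orders are illustrative, not the paper's."""
    feats = []
    for off in range(-n, n + 1):  # character unigrams around position i
        j = i + off
        ch = text[j] if 0 <= j < len(text) else "<pad>"
        feats.append(f"uni[{off}]={ch}")
    for off in range(-n, n):      # character bigrams around position i
        a = text[i + off] if 0 <= i + off < len(text) else "<pad>"
        b = text[i + off + 1] if 0 <= i + off + 1 < len(text) else "<pad>"
        feats.append(f"bi[{off}]={a}{b}")
    return feats
```

Each annotated ambiguous character then becomes one (features, voiced/voiceless) training pair for a binary classifier, regardless of whether its neighbors are annotated.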