Substring Frequency Features for Segmentation of Japanese Katakana Words with Unlabeled Corpora


Substring Frequency Features for Segmentation of Japanese Katakana Words with Unlabeled Corpora

Yoshinari Fujinuma* (Computer Science, University of Colorado, Boulder, CO)
Alvin C. Grissom II (Mathematics and Computer Science, Ursinus College, Collegeville, PA)

Abstract

Word segmentation is crucial in natural language processing tasks for unsegmented languages. In Japanese, many out-of-vocabulary words appear in the phonetic syllabary katakana, making segmentation more difficult due to the lack of clues found in mixed script settings. In this paper, we propose a straightforward approach based on a variant of tf-idf and apply it to the problem of word segmentation in Japanese. Even though our method uses only an unlabeled corpus, experimental results show that it achieves performance comparable to existing methods that use manually labeled corpora. Furthermore, it improves performance of simple word segmentation models trained on a manually labeled corpus.

1 Introduction

In languages without explicit segmentation, word segmentation is a crucial step of natural language processing tasks. In Japanese, this problem is less severe than in Chinese because of the existence of three different scripts: hiragana, katakana, and kanji, which are Chinese characters.1 However, katakana words are known to degrade word segmentation performance because of out-of-vocabulary (OOV) words which do not appear in manually segmented corpora (Nakazawa et al., 2005; Kaji and Kitsuregawa, 2011). Creation of new words is common in Japanese; around 20% of the katakana words in newspaper articles are OOV words (Breen, 2009). For example, some katakana compound loanwords are not transliterated but rather "Japanized" (e.g., ガソリンスタンド gasorinsutando "gasoline stand", meaning "gas station" in English) or abbreviated (e.g., スマートフォンケース sumātofonkēsu ("smartphone case"), which is abbreviated as スマホケース sumahokēsu). Abbreviations may also undergo phonetic and corresponding orthographic changes, as in the case of スマートフォン sumātofon ("smartphone"), which, in the abbreviated term, shortens the long vowel ā to a and replaces フォ fo with ホ ho. This change is then propagated to compound words, such as スマホケース sumahokēsu ("smartphone case"). Word segmentation of compound words is important for improving results in downstream tasks, such as information retrieval (Braschler and Ripplinger, 2004; Alfonseca et al., 2008), machine translation (Koehn and Knight, 2003), and information extraction from microblogs (Bansal et al., 2015).

Hagiwara and Sekine (2013) incorporated an English corpus by projecting Japanese transliterations to words from an English corpus; however, loanwords that are not transliterated (such as sumaho for "smartphone") cannot be segmented by the use of an English corpus alone. We investigate a more efficient use of a Japanese corpus by incorporating a variant of the well-known tf-idf weighting scheme (Salton and Buckley, 1988), which we refer to as term frequency-inverse substring frequency, or tf-issf.2 The core idea of our approach is to assign scores based on the likelihood that a given katakana substring is a word token, using only statistics from an unlabeled corpus.

* Part of the work was done while the first author was at Amazon Japan.
1 Hiragana and katakana are the two distinct character sets representing the same Japanese sounds. These two character sets are used for different purposes, with katakana typically used for transliterations of loanwords. Kanji are typically used for nouns.
2 Our code is available at https://www.github.com/akkikiki/katakana_segmentation.
74 — Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 74–79, Taipei, Taiwan, November 27 – December 1, 2017. © 2017 AFNLP

Our contributions are as follows:

1. We show that a word segmentation model using tf-issf alone outperforms a previously proposed frequency-based method and that it produces comparable results to various Japanese tokenizers.

2. We show that tf-issf improves the F1-score of word segmentation models trained on manually labeled data by more than 5%.

2 Proposed Method

In this section, we describe the katakana word segmentation problem and our approach to it.

2.1 Term Frequency-Inverse Substring Frequency

Let S be a sequence of katakana characters, Y be the set of all possible segmentations, y ∈ Y be a possible segmentation, and y_i be a substring of y. Then, for instance, y = y_1 y_2 ... y_n is a possible word segmentation of S with n segments.

We now propose a method to segment katakana OOV words. Our approach, term frequency-inverse substring frequency (tf-issf), is a variant of the tf-idf weighting method, which computes a score for each candidate segmentation. We calculate the score of a katakana string y_i with

    tf-issf(y_i) = tf(y_i) / sf(y_i),    (1)

where tf(y_i) is the number of occurrences of y_i as a katakana term in a corpus and sf(y_i) is the number of unique katakana terms that have y_i as a substring. We regard consecutive katakana characters as a single katakana term when computing tf-issf.

We compute the product of tf-issf scores over a string to score the segmentation

    Score(y) = ∏_{y_i ∈ y} tf-issf(y_i),    (2)

and choose the optimal segmentation y* with

    y* = arg max_{y ∈ Y} Score(y | S).    (3)

Intuitively, if a string appears frequently as a word substring, we treat it as a meaningless sequence.3 While substrings of consecutive katakana can, in principle, be a meaningful character n-gram, this rarely occurs, and tf-issf successfully penalizes the score of such sequences of characters.

Figure 1 shows an example of a word lattice for the loan compound word "smartphone case" with the desired segmentation path in bold. When building a lattice, we only create a node for a substring that appears as a term in the unlabeled corpus and does not start with a small katakana letter4 or a prolonged sound mark "ー", as such characters are rarely the first character in a word. Including terms of consecutive katakana characters from an unlabeled corpus reduces the number of OOV words.

2.2 Structured Perceptron with tf-issf

To incorporate manually labeled data and to compare with other supervised Japanese tokenizers, we use the structured perceptron (Collins, 2002). This model represents the score as

    Score(y) = w · φ(y),    (4)

where φ(y) is a feature function and w is a weight vector. Features used in the structured perceptron are shown in Table 1. We use the surface-level features used by Kaji and Kitsuregawa (2011) and decode with the Viterbi algorithm. We incorporate tf-issf into the structured perceptron as the initial feature weight for unigrams instead of initializing the weight vector to 0.5 Specifically, we use log(tf-issf(y_i) + 1) for the initial weights to avoid overemphasizing the tf-issf value (Kaji and Kitsuregawa, 2011). In this way, we can directly adjust tf-issf values using a manually labeled corpus. Unlike the approach of Xiao et al. (2002), which uses tf-idf to resolve segmentation ambiguities in Chinese, we regard each katakana term as one document to compute its inverse document (substring) frequency.

Table 1: Features used for the structured perceptron.

    ID | Notation      | Feature
    1  | y_i           | unigram
    2  | y_{i-1}, y_i  | bigram
    3  | length(y_i)   | num. of characters in y_i

Figure 1: An example lattice for a katakana word segmentation. We use the Viterbi algorithm to find the optimal segmentation from the beginning of the string (BOS) to the end (EOS), shown by the bold edges and nodes. Only those katakana substrings which exist in the training corpus as words are considered. This example produces the correct segmentation, スマホ / ケース sumaho / kēsu ("smartphone case").

3 A typical example is a single-character substring. However, it is possible for single-character substrings to be word tokens.
4 Such letters are ァ a, ィ i, ゥ u, ェ e, ォ o, ヵ ka, ヶ ke, ッ tsu, ャ ya, ュ yu, ョ yo, and ヮ wa.
5 We also attempt to incorporate tf-issf as an individual feature, but this does not improve the segmentation results.

3 Experiments

We now describe our experiments. We run our proposed method under two different settings: 1) using only an unlabeled corpus (UNLABEL), and 2) using both an unlabeled corpus and a labeled corpus (BOTH). For the first experiment, we establish a baseline result using an approach proposed by Nakazawa et al. (2005) and compare this with using tf-issf alone. We conduct an experiment in the second setting to compare with other supervised approaches, including Japanese tokenizers.

3.1 Dataset

We compute the tf-issf value for each katakana substring using all of 2015 Japanese Wikipedia as an unlabeled training corpus. This consists of 1,937,006 unique katakana terms. Following Hagiwara and Sekine (2013), we [...] data. We filter out duplicate hashtags because the Twitter Streaming API collects a set of sample tweets that are biased compared with the overall tweet stream (Morstatter et al., 2013). We follow the BCCWJ annotation guidelines (Ogura et al., 2011) to conduct the annotation.7

3.2 Baselines

We follow previous work and use a frequency-based method as the baseline (Nakazawa et al., 2005; Kaji and Kitsuregawa, 2011):

    y* = arg max_{y ∈ Y} (∏_{i=1}^{n} tf(y_i))^{1/n} / C^{(1/(l̄ + α))^N},    (5)

where l̄ is the average length of all segmented substrings. Following Nakazawa et al. (2005) and Kaji and Kitsuregawa (2011), we set the hyperparameters to C = 2500, N = 4, and α = 0.7.
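The tf-issf scoring and lattice search described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' released code: the toy term counts are invented stand-ins for the katakana term statistics the paper derives from 2015 Japanese Wikipedia, and the Viterbi search over the lattice is written as a simple left-to-right dynamic program in which, as in the paper, a node exists only for substrings that occur as corpus terms.

```python
import math
from collections import Counter

# Toy stand-in for unlabeled-corpus statistics: counts of consecutive
# katakana runs ("terms"). These numbers are invented for illustration.
term_counts = Counter({
    "スマホ": 50,       # sumaho
    "ケース": 40,       # kēsu
    "スマホケース": 3,  # sumahokēsu, the unsegmented compound
})

def sf(s: str) -> int:
    """Substring frequency: number of unique terms containing s (Eq. 1)."""
    return sum(1 for term in term_counts if s in term)

def tf_issf(s: str) -> float:
    """tf-issf(s) = tf(s) / sf(s) (Eq. 1)."""
    return term_counts[s] / sf(s)

def segment(s: str):
    """Best segmentation under the product of tf-issf scores (Eqs. 2-3).

    best[j] holds (score, segments) for the best segmentation of s[:j];
    only substrings seen as corpus terms become lattice nodes.
    """
    best = [None] * (len(s) + 1)
    best[0] = (1.0, [])
    for j in range(1, len(s) + 1):
        for i in range(j):
            piece = s[i:j]
            if best[i] is None or piece not in term_counts:
                continue
            score = best[i][0] * tf_issf(piece)
            if best[j] is None or score > best[j][0]:
                best[j] = (score, best[i][1] + [piece])
    return best[len(s)]

def initial_unigram_weight(s: str) -> float:
    """log(tf-issf + 1): the initial unigram weight used in Section 2.2."""
    return math.log(tf_issf(s) + 1)

score, segments = segment("スマホケース")
print(segments)  # → ['スマホ', 'ケース']
```

With these toy counts, the unsegmented compound scores tf-issf = 3/1 = 3.0, while the split scores 25.0 × 20.0 = 500.0, so the dynamic program recovers the correct segmentation from Figure 1.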