Enhancing Word Representations for Korean with Hanja

Total Page:16

File Type:pdf, Size:1020Kb

Enhancing Word Representations for Korean with Hanja Don’t Just Scratch the Surface: Enhancing Word Representations for Korean with Hanja Kang Min Yoo∗, Taeuk Kim∗ and Sang-goo Lee Department of Computer Science and Engineering Seoul National University, Seoul, Korea fkangminyoo,taeuk,[email protected] Abstract We propose a simple yet effective approach for improving Korean word representations using additional linguistic annotation (i.e. Hanja). We employ cross-lingual transfer learning in training word representations by leveraging the fact that Hanja is closely related to Chi- nese. We evaluate the intrinsic quality of rep- resentations learned through our approach us- ing the word analogy and similarity tests. In Figure 1: An example of a Korean word showing its addition, we demonstrate their effectiveness form and multi-level meanings. The Sino-Korean word KR on several downstream tasks, including a novel consists of Hangul phonograms ( ) and Hanja lo- HJ Korean news headline generation task. gograms ( ). Although annotation of Hanja is op- tional, it offers deeper insight into the word meaning due to its association with the Chinese characters (CN). 1 Introduction There is a strong connection between the Korean Information Skip-Gram), for capturing the seman- and Chinese languages due to cultural and histori- tics of Hanja and subword structures of Korean cal reasons (Lee and Ramsey, 2011). Specifically, and introducing them into the vector space. Note a set of logograms with very similar forms to the that it is also quite intuitive for native Koreans 1 Chinese characters, called Hanja , served in the to resolve the ambiguity of (Sino-)Korean words past as the only medium for written Korean until with the aid of Hanja. We conjecture this heuris- Hangul, the Korean alphabet, and Jamo came into tic is attributed to the fact that Hanja characters existence in 1443 (Sohn, 2001). Considering this are logograms, each of which contains more lex- etymological background, a substantial portion of ical meaning when compared against its counter- Korean words are classified as Sino-Korean, a set part, the Hangul character, which is a phonogram. of Korean words that had originated from Chinese Accordingly, in this work, we focus on exploiting and can be expressed in both Hanja and Hangul the rich extra information from Hanja as a means (Taylor, 1997), with the latter becoming common- of constructing better Korean word embeddings. place in modern Korean. Furthermore, our approach enables us to empir- Based on these facts, we assume the introduc- ically investigate the potential of character-level tion of Hanja-level information will aid in ground- cross-lingual knowledge transfer with a case study ing better representation for the meaning of Ko- of Chinese and Hanja. rean words (e.g. see Fig.1). To validate our To sum up, our contributions are threefold: hypothesis, we propose a simple yet effective ap- proach to training Korean word representations, • We introduce a novel way of training Korean named Hanja-level SISG (Hanja-level Subword word embeddings with an expanded Korean alphabet vocabulary using Hanja in addition ∗Equal contribution. 1Hanja and traditional Chinese characters are very similar to the existing Korean characters, Hangul. but not completely identical. Some differences in strokes and language-exclusive characters exist. • We also explore the possibility of character- 3528 Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3528–3533, Hong Kong, China, November 3–7, 2019. c 2019 Association for Computational Linguistics level knowledge transfer between two lan- However, due to the deconstructible structure guages by initializing Hanja embeddings of Korean characters and the agglutinative na- with Chinese character embeddings before ture of the language, the model proposed by Bo- training skip-gram. janowski et al.(2017) comes short in capturing sub-character and inter-character information spe- • We prove the effectiveness of our method in cific to Korean understanding. As a remedy, the two ways, one of which is intrinsic evaluation model proposed by Park et al.(2018) introduces (the word analogy and similarity tests), and jamo-level n-grams g(j), whose vectors can be as- the other being two downstream tasks includ- sociated with the target context as well. Given that ing a newly proposed Korean news headline a Korean word w can be split into a set of jamo generation task. (j) (j) n (j) (j)o n-grams Gw ⊂ G = g1 ; : : : ; gJ , where 2 Model Description G(j) is the set of all possible jamo n-grams in the corpus, the updated scoring function for jamo- 2.1 Skip-Gram (SG) level skip-gram s(j) is thus Given a corpus as a sequence of words (w1; w2; : : : ; wT ), the goal of a skip-gram model (j) X (Mikolov et al., 2013) is to maximize the log- s (w; c) = s (w; c) + zjvc: (4) (j) probabilities of context words given a target word j2Gw wi: 2.3 Hanja-level SISG2 X X log p (wcjwi); (1) Another semantic level in Korean lies in Hanja i2f1;:::;T g c2C(i) logograms. Hanja characters can be mapped to where C returns a set of context word indices given Hangul phonograms, hence Hanja annotations are a word index. The usual choice for training the sufficient but not necessary in conveying mean- parameterized probability function p (wcjw) is to ings (Fig.1). As each Hanja character is associ- reframe the problem as a set of binary classifica- ated with semantics, their presence could provide tion tasks of the target context word c and other meaningful aid in capturing word-level semantics negative samples chosen from the entire dictionary in the vector space. (negative sampling). The objective is thus We propose incorporating Hanja n-grams in the learning of word representations by allowing the X scoring function to associate each Hanja n-gram L (s (wi; wc)) + L (−s (wi; wj)) ; (2) with the target context word. Concretely, given j2N (i;c) that a Korean word w contains a set of Hanja se- quences H = (h ; : : : ; h ) (e.g. for the where L (x) = log (1 + e−x), the binary logis- w w;1 w;Hw example in Fig.1 h = “>會” and h = “型”) tic loss. N returns a set of negative sample in- w;1 w2 and that I is the set of Hanja n-grams for a Hanja dices given target word and context indices. In the h sequence h (e.g. Ihw;2 = f “<boh>型”, “型”, vanilla SG model, the scoring function s is the dot (h) “型<eoh>”g), all Hanja n-grams Gw for word product between the two word vectors: vw and vc. w are the union of Hanja n-grams of Hanja se- 2.2 Subword Information SG (SISG) and S quences present in w: h2Hw Ih. The length of Jamo-level SISG Hanja n-grams in Ih is a hyperparameter. Then In Bojanowski et al.(2017), the scoring func- the new score function for Hanja-level SISG is tion incorporates subword information. Specifi- cally, given all possible n-gram character sets G = (h) (j) X s (w; c) = s (w; c) + zhvc (5) fg1; : : : ; gGg in the corpus and that the n-gram set h2G(h) of a particular word w is Gw ⊂ G, the scoring w function is thus the sum of dot products between where z is the n-gram vector for a Hanja se- all n-gram vectors z and the context vector v : h c quence h. X 2 s (w; c) = zgvc: (3) The code and models are publicly available online at https://github.com/kaniblu/hanja-sisg g2Gw 3529 Analogy Similarity 2.4 Character-level Knowledge Transfer Method Sem. Syn. All Pr. Sp. Since Hanja characters essentially share deep SISG(cj)y 0.47 0.38 0.43 - 0.67 roots with Chinese characters, many of them can SG 0.42 0.49 0.45 0.60 0.62 SISG(c) 0.45 0.59 0.52 0.62 0.61 be mapped by one-to-one correspondence. By SISG(cj) 0.39 0.48 0.44 0.66 0.67 converting Korean characters into simplified Chi- SISG(cj)z 0.41 0.48 0.45 0.65 0.67 nese characters or vice versa, it is possible to uti- SISG(cjh3) 0.34 0.45 0.39 0.63 0.63 lize pre-trained Chinese embeddings in the pro- SISG(cjh4) 0.34 0.45 0.40 0.62 0.61 SISG(cjhr) 0.35 0.46 0.40 0.65 0.64 cess of training Hanja-level SISG. In our case, we y reported by Park et al.(2018) propose leveraging advances in Chinese word em- z pre-trained embeddings provided by the authors of beddings from the Chinese community by initial- Park et al.(2018) run with our evaluation script izing Hanja n-grams vectors zh with the state-of- the-art Chinese embeddings (Li et al., 2018) and Table 1: Word analogy and similarity evaluation re- training with the score function in Equation5. sults. For word analogy test, the table shows cosine distances averaged over semantic and syntactic word pairs (lower is better). For word similarity test, we 3 Experiments report on Pearson and Spearman correlations between 3.1 Word Representation Learning predictions and ground truth scores. We observe that our approach (SISG(cjh)) shows significant improve- 3.1.1 Korean Corpus ments in the analogy test but some deterioration in the For learning word representations, we utilize the similarity test. Note that the evaluation results reported by the original authors (y) are largely different from corpus prepared by Park et al.(2018) with small our results.
Recommended publications
  • Stochastic Model of Stroke Order Variation
    2009 10th International Conference on Document Analysis and Recognition Stochastic Model of Stroke Order Variation Yoshinori Katayama, Seiichi Uchida, and Hiroaki Sakoe Faculty of Information Science and Electrical Engineering, Kyushu University, 819-0395, Japan fyosinori,[email protected] Abstract “ ¡ ” under an unnatural stroke correspondence which max- imizes their similarity. Note that the correct stroke order of A stochastic model of stroke order variation is proposed “ ” is (“—” ! “j' ! “=” ! “n” ! “–”) and that of “ ¡ ” and applied to the stroke-order free on-line Kanji character is (“—” ! “–” ! “j' ! “=” ! “n” ). Thus if we allow any recognition. The proposed model is a hidden Markov model stroke order variation, those two characters become almost (HMM) with a special topology to represent all stroke order identical. variations. A sequence of state transitions from the initial One possible remedy to suppress the misrecognitions is state to the final state of the model represents one stroke to penalize unnatural i.e., rare stroke order on optimizing order and provides a probability of the stroke order. The the stroke correspondence. In fact, there are popular stroke distribution of the stroke order probability can be trained orders (including the standard stroke order) and there are automatically by using an EM algorithm from a training rare stroke orders. If we penalize the situation that “ ¡ ” is set of on-line character patterns. Experimental results on matched to an input pattern with its very rare stroke order large-scale test patterns showed that the proposed model of (“—” ! “j' ! “=” ! “n” ! “–”), we can avoid the could represent actual stroke order variations appropriately misrecognition of “ ” as “ ¡ .” and improve recognition accuracy by penalizing incorrect For this purpose, a stochastic model of stroke order vari- stroke orders.
    [Show full text]
  • Neural Substrates of Hanja (Logogram) and Hangul (Phonogram) Character Readings by Functional Magnetic Resonance Imaging
    ORIGINAL ARTICLE Neuroscience http://dx.doi.org/10.3346/jkms.2014.29.10.1416 • J Korean Med Sci 2014; 29: 1416-1424 Neural Substrates of Hanja (Logogram) and Hangul (Phonogram) Character Readings by Functional Magnetic Resonance Imaging Zang-Hee Cho,1 Nambeom Kim,1 The two basic scripts of the Korean writing system, Hanja (the logography of the traditional Sungbong Bae,2 Je-Geun Chi,1 Korean character) and Hangul (the more newer Korean alphabet), have been used together Chan-Woong Park,1 Seiji Ogawa,1,3 since the 14th century. While Hanja character has its own morphemic base, Hangul being and Young-Bo Kim1 purely phonemic without morphemic base. These two, therefore, have substantially different outcomes as a language as well as different neural responses. Based on these 1Neuroscience Research Institute, Gachon University, Incheon, Korea; 2Department of linguistic differences between Hanja and Hangul, we have launched two studies; first was Psychology, Yeungnam University, Kyongsan, Korea; to find differences in cortical activation when it is stimulated by Hanja and Hangul reading 3Kansei Fukushi Research Institute, Tohoku Fukushi to support the much discussed dual-route hypothesis of logographic and phonological University, Sendai, Japan routes in the brain by fMRI (Experiment 1). The second objective was to evaluate how Received: 14 February 2014 Hanja and Hangul affect comprehension, therefore, recognition memory, specifically the Accepted: 5 July 2014 effects of semantic transparency and morphemic clarity on memory consolidation and then related cortical activations, using functional magnetic resonance imaging (fMRI) Address for Correspondence: (Experiment 2). The first fMRI experiment indicated relatively large areas of the brain are Young-Bo Kim, MD Department of Neuroscience and Neurosurgery, Gachon activated by Hanja reading compared to Hangul reading.
    [Show full text]
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
    [Show full text]
  • Recognition of Online Handwritten Gurmukhi Strokes Using Support Vector Machine a Thesis
    Recognition of Online Handwritten Gurmukhi Strokes using Support Vector Machine A Thesis Submitted in partial fulfillment of the requirements for the award of the degree of Master of Technology Submitted by Rahul Agrawal (Roll No. 601003022) Under the supervision of Dr. R. K. Sharma Professor School of Mathematics and Computer Applications Thapar University Patiala School of Mathematics and Computer Applications Thapar University Patiala – 147004 (Punjab), INDIA June 2012 (i) ABSTRACT Pen-based interfaces are becoming more and more popular and play an important role in human-computer interaction. This popularity of such interfaces has created interest of lot of researchers in online handwriting recognition. Online handwriting recognition contains both temporal stroke information and spatial shape information. Online handwriting recognition systems are expected to exhibit better performance than offline handwriting recognition systems. Our research work presented in this thesis is to recognize strokes written in Gurmukhi script using Support Vector Machine (SVM). The system developed here is a writer independent system. First chapter of this thesis report consist of a brief introduction to handwriting recognition system and some basic differences between offline and online handwriting systems. It also includes various issues that one can face during development during online handwriting recognition systems. A brief introduction about Gurmukhi script has also been given in this chapter In the last section detailed literature survey starting from the 1979 has also been given. Second chapter gives detailed information about stroke capturing, preprocessing of stroke and feature extraction. These phases are considered to be backbone of any online handwriting recognition system. Recognition techniques that have been used in this study are discussed in chapter three.
    [Show full text]
  • Nushu and the Writing of Religious
    NOSHU AND RELIGIOUS CULTURE IN CHINA "MY MOTHER WATCHED OVER AN EMPTY HOUSE AND',WAS SEPARATED FROM THE HEA VENL Y FEMALE": NUSHU AND THE WRITING OF RELIGIOUS CULTURE IN CHINA By STEPHANIE BALK WILL, B.A. (H.Hons.) A Thesis Submitted to the School of Graduate Studies in Partial FulfiHment ofthe Requirements for the Degree Master of Arts McMaster University © Copyright by Stephanie Balkwill, August 2006 MASTER OF ARTS (2006) McMaster University (Religious Studies) Hamilton, Ontario TITLE: "My Mother Watched Over an Empty House and was Separated From the Heavenly Female": Nushu and the Writing IOf Religious Culture in China. AUTHOR: Stephanie Balkwill, B.A. (H.Hons.) (University of Regina) SUPERVISOR: Dr. James Benn NUMBER OF PAGES: v, 120 ii ]rll[.' ' ABSTRACT Niishu, or "Women's Script" is a system of writing indigenous to a small group of village women in iJiangyong County, Hunan Province, China. Used exclusively by and for these women, the script was developed in order to write down their oral traditions that may have included songs, prayers, stories and biographies. However, since being discovered by Chinese and Western researchers, nushu has been rapidly brought out of this Chinese village locale. At present, the script has become an object of fascination for diverse audiences all over the world. It has been both the topic of popular media presentations and publications as well as the topic of major academic research projects published in Engli$h, German, Chinese and Japanese. Resultantly, niishu has played host to a number of mo~ern explanations and interpretations - all of which attempt to explain I the "how" and the "why" of an exclusively female script developed by supposedly illiterate women.
    [Show full text]
  • Sinitic Language and Script in East Asia: Past and Present
    SINO-PLATONIC PAPERS Number 264 December, 2016 Sinitic Language and Script in East Asia: Past and Present edited by Victor H. Mair Victor H. Mair, Editor Sino-Platonic Papers Department of East Asian Languages and Civilizations University of Pennsylvania Philadelphia, PA 19104-6305 USA [email protected] www.sino-platonic.org SINO-PLATONIC PAPERS FOUNDED 1986 Editor-in-Chief VICTOR H. MAIR Associate Editors PAULA ROBERTS MARK SWOFFORD ISSN 2157-9679 (print) 2157-9687 (online) SINO-PLATONIC PAPERS is an occasional series dedicated to making available to specialists and the interested public the results of research that, because of its unconventional or controversial nature, might otherwise go unpublished. The editor-in-chief actively encourages younger, not yet well established, scholars and independent authors to submit manuscripts for consideration. Contributions in any of the major scholarly languages of the world, including romanized modern standard Mandarin (MSM) and Japanese, are acceptable. In special circumstances, papers written in one of the Sinitic topolects (fangyan) may be considered for publication. Although the chief focus of Sino-Platonic Papers is on the intercultural relations of China with other peoples, challenging and creative studies on a wide variety of philological subjects will be entertained. This series is not the place for safe, sober, and stodgy presentations. Sino- Platonic Papers prefers lively work that, while taking reasonable risks to advance the field, capitalizes on brilliant new insights into the development of civilization. Submissions are regularly sent out to be refereed, and extensive editorial suggestions for revision may be offered. Sino-Platonic Papers emphasizes substance over form.
    [Show full text]
  • + Natali A, Professor of Cartqraphy, the Hebreu Uhiversity of -Msalem, Israel DICTIONARY of Toponymfc TERLMINO~OGY Wtaibynafiail~
    United Nations Group of E%perts OR Working Paper 4eographicalNames No. 61 Eighteenth Session Geneva, u-23 August1996 Item7 of the E%ovisfonal Agenda REPORTSOF THE WORKINGGROUPS + Natali a, Professor of Cartqraphy, The Hebreu UhiVersity of -msalem, Israel DICTIONARY OF TOPONYMfC TERLMINO~OGY WtaIbyNafiaIl~- . PART I:RaLsx vbim 3.0 upi8elfuiyl9!J6 . 001 . 002 003 004 oo!l 006 007 . ooa 009 010 . ol3 014 015 sequala~esfocJphabedcsaipt. 016 putting into dphabetic order. see dso Kqucna ruIt!% Qphabctk 017 Rtlpreat8Ii00, e.g. ia 8 computer, wflich employs ooc only numm ds but also fetters. Ia a wider sense. aIso anploying punauatiocl tnarksmd-SymboIs. 018 Persod name. Esamples: Alfredi ‘Ali. 019 022 023 biliaw 024 02s seecIass.f- 026 GrqbicsymboIusedurunitiawrIdu~morespedficaty,r ppbic symbol in 1 non-dphabedc writiog ryste.n& Exmlptes: Chinese ct, , thong; Ambaric u , ha: Japaoese Hiragana Q) , no. 027 -.modiGed Wnprehauive term for cheater. simplified aad character, varIaoL 031 CbmJnyol 032 CISS, featm? 033 cQdedrepfwltatiul 034 035 036 037' 038 039 040 041 042 047 caavasion alphabet 048 ConMQo table* 049 0nevahte0frpointinlhisgr8ti~ . -.- w%idofplaaecoordiaarurnm;aingoftwosetsofsnpight~ -* rtcight8ngfIertoeachotkrodwithap8ltKliuofl8qthonbo&. rupenmposedonr(chieflytopogtaphtc)map.see8lsouTM gz 051 see axxdimtes. rectangufar. 052 A stahle form of speech, deriyed from a pbfgin, which has became the sole a ptincipal language of 8 qxech comtnunity. Example: Haitian awle (derived from Fresh). ‘053 adllRaIfeatlue see feature, allhlral. 054 055 * 056 057 Ac&uioaofsoftwamrcqkdfocusingrdgRaIdatabmem rstoauMe~osctlto~thisdatabase. 058 ckalog of defItitioas of lbe contmuofadigitaldatabase.~ud- hlg data element cefw labels. f0mw.s. internal refm codMndtextemty,~well~their-p,. 059 see&tadichlq. 060 DeMptioa of 8 basic unit of -Lkatifiile md defiile informatioa tooccqyrspecEcdataf!eldinrcomputernxaxtLExampk Pateofmtifii~ofluwtby~namaturhority’.
    [Show full text]
  • Chinese Script Generation Panel Document
    Chinese Script Generation Panel Document Proposal for the Generation Panel for the Chinese Script Label Generation Ruleset for the Root Zone 1. General Information Chinese script is the logograms used in the writing of Chinese and some other Asian languages. They are called Hanzi in Chinese, Kanji in Japanese and Hanja in Korean. Since the Hanzi unification in the Qin dynasty (221-207 B.C.), the most important change in the Chinese Hanzi occurred in the middle of the 20th century when more than two thousand Simplified characters were introduced as official forms in Mainland China. As a result, the Chinese language has two writing systems: Simplified Chinese (SC) and Traditional Chinese (TC). Both systems are expressed using different subsets under the Unicode definition of the same Han script. The two writing systems use SC and TC respectively while sharing a large common “unchanged” Hanzi subset that occupies around 60% in contemporary use. The common “unchanged” Hanzi subset enables a simplified Chinese user to understand texts written in traditional Chinese with little difficulty and vice versa. The Hanzi in SC and TC have the same meaning and the same pronunciation and are typical variants. The Japanese kanji were adopted for recording the Japanese language from the 5th century AD. Chinese words borrowed into Japanese could be written with Chinese characters, while Japanese words could be written using the character for a Chinese word of similar meaning. Finally, in Japanese, all three scripts (kanji, and the hiragana and katakana syllabaries) are used as main scripts. The Chinese script spread to Korea together with Buddhism from the 2nd century BC to the 5th century AD.
    [Show full text]
  • A Comparative Analysis of the Simplification of Chinese Characters in Japan and China
    CONTRASTING APPROACHES TO CHINESE CHARACTER REFORM: A COMPARATIVE ANALYSIS OF THE SIMPLIFICATION OF CHINESE CHARACTERS IN JAPAN AND CHINA A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN ASIAN STUDIES AUGUST 2012 By Kei Imafuku Thesis Committee: Alexander Vovin, Chairperson Robert Huey Dina Rudolph Yoshimi ACKNOWLEDGEMENTS I would like to express deep gratitude to Alexander Vovin, Robert Huey, and Dina R. Yoshimi for their Japanese and Chinese expertise and kind encouragement throughout the writing of this thesis. Their guidance, as well as the support of the Center for Japanese Studies, School of Pacific and Asian Studies, and the East-West Center, has been invaluable. i ABSTRACT Due to the complexity and number of Chinese characters used in Chinese and Japanese, some characters were the target of simplification reforms. However, Japanese and Chinese simplifications frequently differed, resulting in the existence of multiple forms of the same character being used in different places. This study investigates the differences between the Japanese and Chinese simplifications and the effects of the simplification techniques implemented by each side. The more conservative Japanese simplifications were achieved by instating simpler historical character variants while the more radical Chinese simplifications were achieved primarily through the use of whole cursive script forms and phonetic simplification techniques. These techniques, however, have been criticized for their detrimental effects on character recognition, semantic and phonetic clarity, and consistency – issues less present with the Japanese approach. By comparing the Japanese and Chinese simplification techniques, this study seeks to determine the characteristics of more effective, less controversial Chinese character simplifications.
    [Show full text]
  • Proposal for a Korean Script Root Zone LGR 1 General Information
    (internal doc. #: klgp220_101f_proposal_korean_lgr-25jan18-en_v103.doc) Proposal for a Korean Script Root Zone LGR LGR Version 1.0 Date: 2018-01-25 Document version: 1.03 Authors: Korean Script Generation Panel 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document below: • proposal-korean-lgr-25jan18-en.xml Labels for testing can be found in the accompanying text document below: • korean-test-labels-25jan18-en.txt In Section 3, we will see the background on Korean script (Hangul + Hanja) and principal language using it, i.e., Korean language. The overall development process and methodology will be reviewed in Section 4. The repertoire and variant groups in K-LGR will be discussed in Sections 5 and 6, respectively. In Section 7, Whole Label Evaluation Rules (WLE) will be described and then contributors for K-LGR are shown in Section 8. Several appendices are included with separate files. proposal-korean-lgr-25jan18-en 1 / 73 1/17 2 Script for which the LGR is proposed ISO 15924 Code: Kore ISO 15924 Key Number: 287 (= 286 + 500) ISO 15924 English Name: Korean (alias for Hangul + Han) Native name of the script: 한글 + 한자 Maximal Starting Repertoire (MSR) version: MSR-2 [241] Note.
    [Show full text]
  • The Japanese Writing Systems, Script Reforms and the Eradication of the Kanji Writing System: Native Speakers’ Views Lovisa Österman
    The Japanese writing systems, script reforms and the eradication of the Kanji writing system: native speakers’ views Lovisa Österman Lund University, Centre for Languages and Literature Bachelor’s Thesis Japanese B.A. Course (JAPK11 Spring term 2018) Supervisor: Shinichiro Ishihara Abstract This study aims to deduce what Japanese native speakers think of the Japanese writing systems, and in particular what native speakers’ opinions are concerning Kanji, the logographic writing system which consists of Chinese characters. The Japanese written language has something that most languages do not; namely a total of ​ ​ three writing systems. First, there is the Kana writing system, which consists of the two syllabaries: Hiragana and Katakana. The two syllabaries essentially figure the same way, but are used for different purposes. Secondly, there is the Rōmaji writing system, which is Japanese written using latin letters. And finally, there is the Kanji writing system. Learning this is often at first an exhausting task, because not only must one learn the two phonematic writing systems (Hiragana and Katakana), but to be able to properly read and write in Japanese, one should also learn how to read and write a great amount of logographic signs; namely the Kanji. For example, to be able to read and understand books or newspaper without using any aiding tools such as dictionaries, one would need to have learned the 2136 Jōyō Kanji (regular-use Chinese characters). With the twentieth century’s progress in technology, comparing with twenty years ago, in this day and age one could probably theoretically get by alright without knowing how to write Kanji by hand, seeing as we are writing less and less by hand and more by technological devices.
    [Show full text]
  • Areal Script Form Patterns with Chinese Characteristics James Myers
    Areal script form patterns with Chinese characteristics James Myers National Chung Cheng University http://personal.ccu.edu.tw/~lngmyers/ To appear in Written Language & Literacy This study was made possible through a grant from Taiwan’s Ministry of Science and Technology (103-2410-H-194-119-MY3). Iwano Mariko helped with katakana and Minju Kim with hangul, while Tsung-Ying Chen, Daniel Harbour, Sven Osterkamp, two anonymous reviewers and the special issue editors provided all sorts of useful suggestions as well. Abstract It has often been claimed that writing systems have formal grammars structurally analogous to those of spoken and signed phonology. This paper demonstrates one consequence of this analogy for Chinese script and the writing systems that it has influenced: as with phonology, areal script patterns include the borrowing of formal regularities, not just of formal elements or interpretive functions. Whether particular formal Chinese script regularities were borrowed, modified, or ignored also turns out not to depend on functional typology (in morphemic/syllabic Tangut script, moraic Japanese katakana, and featural/phonemic/syllabic Korean hangul) but on the benefits of making the borrowing system visually distinct from Chinese, the relative productivity of the regularities within Chinese character grammar, and the level at which the borrowing takes place. Keywords: Chinese characters, Tangut script, Japanese katakana, Korean hangul, writing system grammar, script outer form, areal patterns 1. Areal phonological patterns and areal script patterns Sinoform writing systems look Chinese, even when they are functionally quite different. The visual traits of non-logographic Japanese katakana and Korean hangul are nontrivially like those of logographic Chinese script, as are those of the logographic but structurally unique script of Tangut, an extinct Tibeto-Burman language of what is now north-central China.
    [Show full text]