2 Hangul Jamo Auxiliary Canonical Decomposition Mappings
Total Page:16
Recommended publications
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
Assessment of Options for Handling Full Unicode Character Encodings in MARC21: A Study for the Library of Congress. Part 1: New Scripts. Jack Cain, Senior Consultant, Trylus Computing, Toronto. 1 Purpose. This assessment studies the issues and makes recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire”. An encoding scheme is the methodology by which characters are assigned the codes that represent them in computer memory. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding scheme. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, and “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42, and 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8. “MARC8” is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode, which is a multi-byte-per-character code set, the MARC8 encoding scheme is principally made up of multiple one-byte tables in which each character is encoded using a single 8-bit byte. (It also includes the EACC set, which actually uses a fixed length of 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
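The distinction between a repertoire and an encoding scheme in the excerpt above can be illustrated with a minimal Python sketch. The helper names `code_points_hex` and `encoded_length` are hypothetical and chosen only for this illustration; UTF-8 stands in for a multi-byte Unicode encoding because Python's standard library ships no MARC8 codec.

```python
def code_points_hex(text: str) -> list[str]:
    """Return the code of each character as an uppercase hexadecimal string."""
    return [format(ord(ch), "02X") for ch in text]


def encoded_length(ch: str, encoding: str) -> int:
    """Return how many bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


# "A", "B", "C" are in the ASCII repertoire and are assigned 41, 42, 43 (hex).
print(code_points_hex("ABC"))         # ['41', '42', '43']

# In ASCII every character takes exactly one byte.
print(encoded_length("A", "ascii"))   # 1

# A Hangul syllable is outside the ASCII repertoire; UTF-8 uses three bytes for it.
print(encoded_length("한", "utf-8"))   # 3
```

The same character can therefore belong to several repertoires while its byte representation depends entirely on the encoding scheme chosen.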
Proposal for a Korean Script Root Zone LGR
(internal doc. #: klgp220_101f_proposal_korean_lgr-25jan18-en_v103.doc) Proposal for a Korean Script Root Zone LGR. LGR Version 1.0. Date: 2018-01-25. Document version: 1.03. Authors: Korean Script Generation Panel. 1 General Information/Overview/Abstract. The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used, and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document: proposal-korean-lgr-25jan18-en.xml. Labels for testing can be found in the accompanying text document: korean-test-labels-25jan18-en.txt. Section 3 gives the background on the Korean script (Hangul + Hanja) and the principal language using it, i.e., Korean. The overall development process and methodology are reviewed in Section 4. The repertoire and variant groups in K-LGR are discussed in Sections 5 and 6, respectively. Section 7 describes the Whole Label Evaluation Rules (WLE), and Section 8 lists the contributors to K-LGR. Several appendices are included as separate files. 2 Script for which the LGR is proposed. ISO 15924 Code: Kore. ISO 15924 Key Number: 287 (= 286 + 500). ISO 15924 English Name: Korean (alias for Hangul + Han). Native name of the script: 한글 + 한자. Maximal Starting Repertoire (MSR) version: MSR-2 [241]. Note.
UC Santa Barbara Previously Published Works
UC Santa Barbara Previously Published Works. Title: Scriptworlds. Permalink: https://escholarship.org/uc/item/41d2f9kq. Author: Park, Sowon. Publication Date: 2021-06-28. Peer reviewed. Scriptworlds, by Sowon Park. Script and world. Thinking about literature in relation to ‘world’ tends to invite the broad sweep. But the widened scope needn’t take in only the majestic, as conjured up in lofty concepts such as planetarity or world-historical totality. Sometimes the scale of world can produce a multifocal optic that helps us pay renewed attention to the literary commonplace. Taking script, the most basic component of literature, as one of the lenses through which to view the world literary landscape, this chapter will examine if, and how, ideas about writing can influence both our close and distant reading. Script is something that usually escapes notice under traditional classifications of literature, which organize subjects along linguistic borders, whether as English, French, or Spanish, or as Anglophone, Francophone, or Hispanophone studies. The construction of literatures by discrete language categories supports a widespread view that differences between languages are not just a matter of different phonemes, lexemes and grammars but of distinct ways of conceiving, apprehending and relating to the world. This was a view that was developed into a political stance in Europe in the nineteenth century. It posited a language as an embodiment of cultural or national distinctiveness, and a literature written in a national language as the sovereign expression of a particular worldview. One of the facets this stance obscured was the script in which the texts were written.
Suggestions for the ISO/IEC 14651 CTT Part for Hangul
SC22/WG20 N891R; ISO/IEC JTC 1/SC2/WG2 N2405R; L2/01-469 (formerly L2/01-405). Universal Multiple-Octet Coded Character Set. International Organization for Standardization / Organisation internationale de normalisation. Title: Ordering rules for Hangul. Source: Kent Karlsson. Date: 2001-11-29. Status: Expert Contribution. Document Type: Working Group Document. Action: For consideration by the UTC, JTC 1/SC 2/WG 2’s ad hoc on Korean, and JTC 1/SC 22/WG 20. 1 Introduction. The Hangul script as such is very elegantly designed. However, its incarnation in 10646/Unicode is far from elegant. This paper is about restoring the elegance of Hangul, as much as it can be restored, for the process of string ordering. 1.1 Hangul syllables. Many Hangul syllables have a character of their own in the range AC00-D7A3. They each have a canonical decomposition into two (choseong, jungseong) or three (choseong, jungseong, jongseong) Hangul Jamo characters in the ranges 1100-1112, 1161-1175, and 11A8-11C2. The choseong are leading consonants, one of which is mute. The jungseong are vowels. The jongseong are trailing consonants. A Hangul Jamo character is either a letter or a letter cluster. The Hangul syllable characters alone can represent most modern Hangul words. They cannot represent historic Hangul words (Middle Korean), nor modern or future Hangul words using syllables that are not preallocated. However, all Hangul words can elegantly be represented by sequences of single-letter Hangul Jamo characters plus an optional tone mark. 1.2 Single-letter and cluster Hangul Jamo characters. Cluster Hangul Jamo characters represent either clusters of two or three consonants, or clusters of two or three vowels.
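The canonical decomposition described in the excerpt above can be sketched in a few lines of Python. The excerpt gives only the code point ranges; the index arithmetic below follows the Hangul syllable decomposition algorithm of the Unicode Standard, and the function name `decompose_syllable` and the constant names are assumptions made for this illustration.

```python
import unicodedata

# Base code points and counts; T_BASE is one below the first jongseong (11A8)
# so that a trailing index of 0 means "no trailing consonant".
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 syllables, AC00-D7A3


def decompose_syllable(ch: str) -> str:
    """Decompose one precomposed Hangul syllable into two or three jamo.

    Characters outside AC00-D7A3 are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    leading = L_BASE + s_index // (V_COUNT * T_COUNT)           # choseong
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT  # jungseong
    trailing = T_BASE + s_index % T_COUNT                        # jongseong, if any
    jamo = chr(leading) + chr(vowel)
    if trailing != T_BASE:
        jamo += chr(trailing)
    return jamo


# U+D55C (한) decomposes into choseong 1112, jungseong 1161, jongseong 11AB.
print([format(ord(c), "04X") for c in decompose_syllable("한")])
# ['1112', '1161', '11AB']

# The arithmetic agrees with the standard library's NFD normalization.
print(decompose_syllable("한") == unicodedata.normalize("NFD", "한"))  # True
```

The counts 19, 21, and 27 (28 minus the empty slot) match the three jamo ranges quoted in the excerpt: 1100-1112, 1161-1175, and 11A8-11C2.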
Jamo Pair Encoding: Subcharacter Representation-Based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3490–3497, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC. Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization. Sangwhan Moon†‡, Naoaki Okazaki†. †Tokyo Institute of Technology, ‡Odd Concepts Inc. Abstract: In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model. Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation. 1. Background. With the introduction of large-scale language model pre-training in the domain of natural language processing, the domain has seen significant advances in the performance of downstream tasks using transfer learning on pre-trained models (Howard and Ruder, 2018; Devlin et al., 2018) when compared to conventional per-task models. As a part of this trend, it has also become common to perform this form of … BPE. Roughly, the minimum size of the subword vocabulary can be approximated as |V| ≈ 2|V_c|, where V is the minimal subword vocabulary and V_c is the character-level vocabulary. Since languages such as Japanese require at least 2000 characters to express everyday text, in a multilingual training setup, one must make a tradeoff. One can reduce the average surface of each subword for these character-vocabulary-intensive languages, or increase the vocabulary size.
A Study of Writing
oi.uchicago.edu. A STUDY OF WRITING. Revised Edition. I. J. Gelb. Phoenix Books, The University of Chicago Press. This book is also available in a clothbound edition from The University of Chicago Press. To the Mokstads. The University of Chicago Press, Chicago & London. The University of Toronto Press, Toronto 5, Canada. Copyright 1952 in the International Copyright Union. All rights reserved. Published 1952. Second Edition 1963. First Phoenix Impression 1963. Printed in the United States of America. PREFACE. The book contains twelve chapters, but it can be broken up structurally into five parts. First, the place of writing among the various systems of human intercommunication is discussed. This is followed by four chapters devoted to the descriptive and comparative treatment of the various types of writing in the world. The sixth chapter deals with the evolution of writing from the earliest stages of picture writing to a full alphabet. The next four chapters deal with general problems, such as the future of writing and the relationship of writing to speech, art, and religion. Of the two final chapters, one contains the first attempt to establish a full terminology of writing, the other an extensive bibliography. The aim of this study is to lay a foundation for a new science of writing which might be called grammatology. While the general histories of writing treat individual writings mainly from a descriptive-historical point of view, the new science attempts to establish general principles governing the use and evolution of writing on a comparative-typological basis.
ONIX for Books Codelists Issue 40
ONIX for Books Codelists Issue 40. 23 January 2018. DOI: 10.4400/akjh. All ONIX standards and documentation, including this document, are copyright materials, made available free of charge for general use. A full license agreement (DOI: 10.4400/nwgj) that governs their use is available on the EDItEUR website. All ONIX users should note that this is the fourth issue of the ONIX codelists that does not include support for codelists used only with ONIX version 2.1. Of course, ONIX 2.1 remains fully usable, using Issue 36 of the codelists or earlier. Issue 36 continues to be available via the archive section of the EDItEUR website (http://www.editeur.org/15/Archived-Previous-Releases). These codelists are also available within a multilingual online browser at https://ns.editeur.org/onix. Codelists are revised quarterly. Layout of codelists. This document contains ONIX for Books codelists Issue 40, intended primarily for use with ONIX 3.0. The codelists are arranged in a single table for reference and printing. They may also be used as controlled vocabularies, independent of ONIX. This document does not differentiate explicitly between codelists for ONIX 3.0 and those that are used with earlier releases, but lists used only with earlier releases have been removed. For details of which code list to use with which data element in each version of ONIX, please consult the main Specification for the appropriate release. Occasionally, a handful of codes within a particular list are defined as either deprecated, or not valid for use in a particular version of ONIX or with a particular data element.
Chapter One: A Connection of Brains
CHAPTER ONE: A Connection of Brains. CHAPTER OBJECTIVES. This first chapter is designed to pique your interest in speech and language as processes within the broader process we call communication. We define and describe these processes, and we consider how speech and language interact to produce a form of communication unique to humans. We also consider how a speaker’s thoughts are conveyed to a listener’s brain through a series of communication transformations known collectively as the speech chain. Specifically, this chapter is designed to facilitate understanding of the following topics:
• Definitions of communication, language, and speech
• Speech and language as separate but related processes
• Design features of the human communication system
• The speech chain that connects the speaker’s thoughts to the listener’s understanding of those thoughts
The idea for Born to Talk was cultivated long before the first word of the original manuscript was written, and it was probably a good thing that there was a period of latency between the concept and the product. During that latency, I observed language development firsthand in my two daughters, Yvonne and Carmen. I learned more about the power and wonder of language in observing them than I have in all the books and all the journal articles I have read over the course of my career because I witnessed their processes of discovery. I watched and listened as they made connections between the world in which they were growing up and the words and language forms that spilled out of them.
Cover Page the Handle
Cover Page. The handle http://hdl.handle.net/1887/42753 holds various files of this Leiden University dissertation. Author: Moezel, K.V.J. van der. Title: Of marks and meaning: a palaeographic, semiotic-cognitive, and comparative analysis of the identity marks from Deir el-Medina. Issue Date: 2016-09-07. INTRODUCTION. DEFINING VISUAL COMMUNICATION. We tend to think of communication in terms of speech and writing: speech as the oral expression of language and writing as its graphic expression. This thought can be traced back to ancient Greece, particularly to a statement written by Aristotle around 350 BC: ‘Spoken words are the symbols of mental experience and written words are the symbols of spoken words’. This statement was understood in the sense that spoken signs are the key to language systems, while written signs are merely their representations, serving only to their needs. Even though Aristotle had not meant to say this, writing thus came to be considered written speech: spoken language recorded by marks the basic function of which is phonoptic. Western scholars throughout the Middle Ages and the Early Modern Period held the view of writing as a visible surrogate or substitute for speech. This view, called ‘the surrogational model’ by Harris, was prevailing especially in the 18th and 19th centuries. The French missionary De Brébeuf, for instance, in translating the poem Pharsalia by Lucan, spoke of ‘cet art ingénieux – de peindre la parole et de parler aux yeux’. In similar fashion Voltaire’s famous quote reads ‘l’écriture est la peinture de la voix : plus elle est ressemblante, meilleure elle est.’ The Irish poet Trench in 1855 spoke of ‘representing sounds by written signs, of reproducing for the eye that which existed at first only for the ear’; and ‘The intention of the written word’ is ‘to represent to the eye with as much accuracy as possible the spoken word.’ The surrogational view was still popular in much of the 20th century.
AKARA the Quest for Perfect Form
AKARA: The Quest for Perfect Form. 20th Nov. 1988 - 15th Feb. 1989. 5, Dr. R. P. Road, IGNCA. ‘For every form, He has been the ideal, His form, visible everywhere.’ Rig Veda VI.47.18. From the mind, emanates AKARA. A considerable volume of work has been done in the areas of content of the written word: its semantic structure, its linguistic interconnections, its etymology and grammar. But the spiritual and philosophical aspects and the physical manifestation of the letter (akshara) need to be examined. The letter needs to be viewed as an organic whole. The visual power and the inner strength of otherwise innocuous-looking letters (aksharas) should be felt, experienced, realised. Our core concern is to probe into the inner processes. We explore the emergence of sound (nada), of speech, and of writing, which gives form (akara) to language. This leads to ideal form -- the ultimate objective of calligraphy. We examine the universality of using the written symbol for meditation, ideation, cognition and communication at various levels, from the mundane to the sacred. Such an attempt would help to create an awareness of the metaphysical (adhyatmic), aesthetic (saundarya), structural (rachana), spatial (akasha) and technical (upayojana) considerations of the aksharas. The psyche dimensions of Akara deal with aspects both inherently metaphysical as well as intellectual conceptual, simultaneously encompassing the world view and inquiring into the mind’s eye-view of what emerges as an aesthetic experience: the written symbol which begins as the vibration in every breath drawn from the primal energy-source, the nada, that imbues the calligrapher’s mind and guides his hand to express beauty in dot and stroke, shape and form.
Proposal for a Korean Script Root Zone LGR
Proposal for a Korean Script Root Zone LGR. LGR Version: K_LGR_v2.3. Date: 2021-05-01. Document version: K_LGR_v23_20210501. Authors: Korean Script Generation Panel. 1 General Information/Overview/Abstract. The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used, and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document: proposal-korean-lgr-01may21-en.xml. Labels for testing can be found in the accompanying text document: korean-test-labels-01may21-en.txt. Section 3 gives the background on the Korean script (Hangul + Hanja) and the principal language using it, i.e., Korean. The overall development process and methodology are reviewed in Section 4. The repertoire and variant sets in K-LGR are discussed in Sections 5 and 6, respectively. Section 7 describes the Whole Label Evaluation Rules (WLE), and Section 8 lists the contributors to K-LGR. Several appendices are included as separate files. 2 Script for which the LGR is proposed. ISO 15924 Code: Kore. ISO 15924 Key Number: 287 (= 286 + 500). ISO 15924 English Name: Korean (alias for Hangul + Han). Native name of the script: 한글 + 한자. Maximal Starting Repertoire (MSR) version: MSR-4 [241]. Note. ‘Korean script’ usually means ‘Hangeul’ or ‘Hangul’. However, in the context of the Korean LGR, Korean script is a union of Hangul and Hanja.
Poorman's Hangul Jamo Input Method
Poorman’s Hangul Jamo Input Method. pmhanguljamo.sty. Kangsoo Kim, 20 Sep 2021, version 0.3.6. Contents: 1 Introduction; 2 Usage (2.1 Loading the package; 2.2 Commands and Environment Provided; 2.3 Setting up in your Preamble); 3 Transliteration Rule of This Package (3.1 Tone Marks and Syllable Separator; 3.2 Consonants; 3.3 Vowels; 3.4 Compatibility Jamos); 4 Proper Fonts; 5 Examples (5.1 Modern Hangul; 5.2 pre-1933 Hangul); 6 The RRK Input Method: An Alternative Way (6.1 Transliteration Rule of RRK; 6.2 Example of RRK method); 7 Further Information; 8 Acknowledgement. 1 Introduction. This LaTeX package provides a Hangul transliteration input method, which allows typesetting Korean letters (Hangul) with the help of proper fonts. (Footnote: Hangul is the Korean alphabet used to write the Korean language. In both South and North Korea, the standard writing system uses Hangul.) The name comes from “Poorman’s Hangul Jamo Input Method.” It is mainly for people who have a system without a Korean keyboard IM but want to typeset Hangul in their documents. Not only modern Hangul but also so-called “Old Hangul” characters, which use lost letters such as ‘Arae-A’ (ㆍ), ‘Yet Ieung’ (ㆁ), or ‘Pan-Sios’ (ㅿ), can be typeset. XeLaTeX or LuaLaTeX is required. The legacy pdfTeX is not supported. Korean language support packages such as xetexko and luatexko (in the ko.TeX bundle), or the polyglossia package with Korean support, are recommended, but typesetting Hangul without them is no problem with this package.