Suggestions for the ISO/IEC 14651 CTT Part for Hangul
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Vietnamese Accent
Vietnamese English Erik Singer Vietnamese is spoken by about 86 million people, which makes it the 17th largest language community in the world. It is part of the Austro-Asiatic language family, and is by far the most widely spoken of these languages. It has borrowed a large portion of its vocabulary from Chinese, thanks to an early period of Chinese domination, but it is otherwise linguistically unrelated. It is generally described as having three dialects: Hanoi in the North, Ho Chi Minh in the South, and Hue in the center. The three dialects are mostly mutually intelligible, though Hue is said to be difficult for speakers of the other two dialects to understand. The northern speech…is marked by sharpness, or choppiness, with greater attention to the precise distinction of tones. The southern speech, in addition to certain uniform differences from northern speech in the pronunciation of consonants, does not distinguish between the hoi and nga tones; and, it is felt by some to sound more laconic and musical. The speech of the Center, on the other hand, is often described as being heavy because of its emphasis on low tones.1 Vietnamese uses the Latin alphabet, with additional diacritics to indicate tones. Vietnam itself is the world’s 13th most populous country, and the 8th most populous in Asia. It became independent from Imperial China in 938 BCE. Since 2000, it has been one of the fastest-growing economies in the world. Oral Posture Oral or vocal tract posture is the characteristic pattern of muscular engagement and relaxation inherent to a given language or accent. -
Consonant Cluster Acquisition by L2 Thai Speakers
English Language Teaching; Vol. 10, No. 7; 2017 ISSN 1916-4742 E-ISSN 1916-4750 Published by Canadian Center of Science and Education Consonant Cluster Acquisition by L2 Thai Speakers Apichai Rungruang1 1 Faculty of Humanities, Naresuan University, Phitsanulok, Thailand Correspondence: Apichai Rungruang, Faculty of Humanities, Naresuan University, Phitsanulok, Thailand. E-mail: [email protected] Received: April 29, 2017 Accepted: June 10, 2017 Online Published: June 13, 2017 doi: 10.5539/elt.v10n7p216 URL: http://doi.org/10.5539/elt.v10n7p216 Abstract Attempts to account for consonant cluster acquisition are always made into two aspects. One is transfer of the first language (L1), and another is markedness effects on the developmental processes in second language acquisition. This study has continued these attempts by finding out how well Thai university students were able to perceive English onset and coda clusters when they were second year and fourth year students. This paper also aims to investigate Thai speakers’ opinions about their listening and speaking skills, and whether their course subjects enhanced their performance. To fulfil the first objective, a pretest and posttest were launched to measure how the 34 Thai participants were able to identify 40 onset and 120 coda clusters at different periods of time. The statistical findings show that even though their overall scores in the fourth year were higher than those in the second year, there was no statistically significant difference in both major types of clusters [t = -1.29; p value >0.05 in onsets; t = -0.28; p value >0.05 in codas]. The Thai participants performed slightly better in onset (84% / 86%) than in coda (70% / 71%). -
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only. -
Proposal for a Korean Script Root Zone LGR 1 General Information
(internal doc. #: klgp220_101f_proposal_korean_lgr-25jan18-en_v103.doc) Proposal for a Korean Script Root Zone LGR LGR Version 1.0 Date: 2018-01-25 Document version: 1.03 Authors: Korean Script Generation Panel 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document below: • proposal-korean-lgr-25jan18-en.xml Labels for testing can be found in the accompanying text document below: • korean-test-labels-25jan18-en.txt In Section 3, we will see the background on Korean script (Hangul + Hanja) and principal language using it, i.e., Korean language. The overall development process and methodology will be reviewed in Section 4. The repertoire and variant groups in K-LGR will be discussed in Sections 5 and 6, respectively. In Section 7, Whole Label Evaluation Rules (WLE) will be described and then contributors for K-LGR are shown in Section 8. Several appendices are included with separate files. proposal-korean-lgr-25jan18-en 1 / 73 1/17 2 Script for which the LGR is proposed ISO 15924 Code: Kore ISO 15924 Key Number: 287 (= 286 + 500) ISO 15924 English Name: Korean (alias for Hangul + Han) Native name of the script: 한글 + 한자 Maximal Starting Repertoire (MSR) version: MSR-2 [241] Note. -
On the Theoretical Implications of Cypriot Greek Initial Geminates
<LINK "mul-n*">"mul-r16">"mul-r8">"mul-r19">"mul-r14">"mul-r27">"mul-r7">"mul-r6">"mul-r17">"mul-r2">"mul-r9">"mul-r24"> <TARGET "mul" DOCINFO AUTHOR "Jennifer S. Muller"TITLE "On the theoretical implications of Cypriot Greek initial geminates"SUBJECT "JGL, Volume 3"KEYWORDS "geminates, representation, phonology, Cypriot Greek"SIZE HEIGHT "220"WIDTH "150"VOFFSET "4"> On the theoretical implications of Cypriot Greek initial geminates* Jennifer S. Muller The Ohio State University Cypriot Greek contrasts singleton and geminate consonants in word-initial position. These segments are of particular interest to phonologists since two divergent representational frameworks, moraic theory (Hayes 1989) and timing-based frameworks, including CV or X-slot theory (Clements and Keyser 1983, Levin 1985), account for the behavior of initial geminates in substantially different ways. The investigation of geminates in Cypriot Greek allows these differences to be explored. As will be demonstrated in a formal analysis of the facts, the patterning of geminates in Cypriot is best accounted for by assuming that the segments are dominated by abstract timing units such as X- or C-slots, rather than by a unit of prosodic weight such as the mora. Keywords: geminates, representation, phonology, Cypriot Greek 1. Introduction Cypriot Greek is of particular interest, not only because it is one of the few varieties of Modern Greek maintaining a consonant length contrast, but more importantly because it exhibits this contrast in word-initial position: péfti ‘Thursday’ vs. ppéfti ‘he falls’.Although word-initial geminates are less common than their word-medial counterparts, they are attested in dozens of the world’s languages in addition to Cypriot Greek. -
Mechanisms of Vowel Epenthesis in Consonant Clusters: an EMA Study Seiya Funatsu, Masako Fujimoto
Mechanisms of vowel epenthesis in consonant clusters: an EMA study Seiya Funatsu, Masako Fujimoto To cite this version: Seiya Funatsu, Masako Fujimoto. Mechanisms of vowel epenthesis in consonant clusters: an EMA study. Acoustics 2012, Apr 2012, Nantes, France. hal-00810613 HAL Id: hal-00810613 https://hal.archives-ouvertes.fr/hal-00810613 Submitted on 23 Apr 2012 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Proceedings of the Acoustics 2012 Nantes Conference 23-27 April 2012, Nantes, France Mechanisms of vowel epenthesis in consonant clusters: an EMA study S. Funatsua and M. Fujimotob aPrefectural University of Hiroshima, 1-1-71 Ujinahigashi Minami-ku, 734-8558 Hiroshima, Japan bNational Institute for Japanese Language and Linguistics, 10-2 Midori-machi, 190-8561 Tachikawa, Japan [email protected] 341 23-27 April 2012, Nantes, France Proceedings of the Acoustics 2012 Nantes Conference The mechanisms of vowel epenthesis in consonant clusters were investigated using an electromagnetic articulograph (EMA). The target languages were Japanese and German. Japanese does not allow consonant clusters, while German does. Two Japanese speakers and two German speakers participated in this experiment. For Japanese speakers, normalized tongue tip displacements from the first consonant to the second consonant in clusters (/bn/, /pn/) were significantly larger than those of German speakers (p<0.001). -
Jamo Pair Encoding: Subcharacter Representation-Based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3490–3497 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization Sangwhan Moonyz, Naoaki Okazakiy Tokyo Institute of Technologyy, Odd Concepts Inc.z, [email protected], [email protected] Abstract In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model. Keywords: tokenization, vocabulary compaction, sub-character representations, out-of-vocabulary mitigation 1. Background BPE. Roughly, the minimum size of the subword vocab- ulary can be approximated as jV j ≈ 2jV j, where V is the With the introduction of large-scale language model pre- c minimal subword vocabulary, and V is the character level training in the domain of natural language processing, the c vocabulary. domain has seen significant advances in the performance Since languages such as Japanese require at least 2000 char- of downstream tasks using transfer learning on pre-trained acters to express everyday text, in a multilingual training models (Howard and Ruder, 2018; Devlin et al., 2018) when setup, one must make a tradeoff. One can reduce the av- compared to conventional per-task models. As a part of this erage surface of each subword for these character vocabu- trend, it has also become common to perform this form of lary intensive languages, or increase the vocabulary size. -
2 Hangul Jamo Auxiliary Canonical Decomposition Mappings
DRAFT Unicode technical note NN Auxiliary character decompositions for supporting Hangul Kent Karlsson 2006-09-24 1 Introduction The Hangul script is very elegantly designed. There are just a small number of letters (28, plus a small number of variant letters introduced later, but the latter have fallen out of use) and even a featural design philosophy for the shapes of the letters. However, the incarnation of Hangul as characters in ISO/IEC 10646 and Unicode is not so elegant. In particular, there are many Hangul characters that are not needed, for precomposed letter clusters as well as precomposed syllable characters. The precomposed syllables have arithmetically specified canonical decompositions into Hangul jamos (conjoining Hangul letters). But unfortunately the letter cluster Hangul jamos do not have canonical decompositions to their constituent letters, which they should have had. This leads to multiple representations for exactly the same sequence of letters. There is not even any compatibility-like distinction; i.e. no (intended) font difference, no (intended) width difference, no (intended) ligaturing difference of any kind. They have even lost the compatibility decompositions that they had in Unicode 2.0. There are also some problems with the Hangul compatibility letters, and their proper compatibility decompositions to Hangul jamo characters. Just following their compatibility decompositions in UnicodeData.txt does not give any useful results in any setting. In this paper and its two associated datafiles these problems are addressed. Note that no changes to the standard Unicode normal forms (NFD, NFC, NFKD, and NFKC) are proposed, since these normal forms are stable for already allocated characters. -
Proposal for a Korean Script Root Zone LGR
Proposal for a Korean Script Root Zone LGR LGR Version K_LGR_v2.3 Date: 2021-05-01 Document version: K_LGR_v23_20210501 Authors: Korean script Generation Panel 1 General Information/ Overview/ Abstract The purpose of this document is to give an overview of the proposed Korean Script LGR in the XML format and the rationale behind the design decisions taken. It includes a discussion of relevant features of the script, the communities or languages using it, the process and methodology used and information on the contributors. The formal specification of the LGR can be found in the accompanying XML document below: • proposal-korean-lgr-01may21-en.xml Labels for testing can be found in the accompanying text document below: • korean-test-labels-01may21-en.txt In Section 3, we will see the background on Korean script (Hangul + Hanja) and principal language using it, i.e., Korean language. The overall development process and methodology will be reviewed in Section 4. The repertoire and variant sets in K-LGR will be discussed in Sections 5 and 6, respectively. In Section 7, Whole Label Evaluation Rules (WLE) will be described and then contributors for K-LGR are shown in Section 8. Several appendices are included with separate files. 2 Script for which the LGR is proposed ISO 15924 Code: Kore proposal_korean_lgr_v23_20210201 1/20 ISO 15924 Key Number: 287 (= 286 + 500) ISO 15924 English Name: Korean (alias for Hangul + Han) Native name of the script: 한글 + 한자 Maximal Starting Repertoire (MSR) version: MSR-4 [241] Note. 'Korean script' usually means 'Hangeul' or 'Hangul'. However, in the context of the Korean LGR, Korean script is a union of Hangul and Hanja. -
Poorman's Hangul Jamo Input Method
Poorman’s Hangul Jamo Input Method pmhanguljamo.sty Kangsoo Kim 20 Sep 2021 version 0.3.6 Contents 1 Introduction 1 2 Usage 2 2.1 Loading the package .......................... 2 2.2 Commands and Environment Provided ................. 2 2.3 Setting up in your Preamble ....................... 3 3 Transliteration Rule of This Package 4 3.1 Tone Marks and Syllable Serapator ................... 4 3.2 Consonants ............................... 4 3.3 Vowels .................................. 5 3.4 Compatibility Jamos .......................... 6 4 Proper Fonts 6 5 Examples 7 5.1 Modern Hangul ............................. 7 5.2 pre-1933 Hangul ............................ 8 6 The RRK Input Method: An Alternative Way 8 6.1 Transliteration Rule of RRK ...................... 9 6.2 Example of RRK method ........................ 10 7 Further Information 11 8 Acknowledgement 11 1 Introduction 1 This LATEX package provides Hangul transliteration input method, which allows to typeset Korean Letters (Hangul) with the help of proper fonts. The name comes from “Poorman’s Hangul Jamo Input Method.” It is mainly for the people who have a system without Korean keyboard IM, but want to typeset Hangul in their document. Not only 1Hangul is the Korean alphabet to write the Korean language. In both South and North Korea, the standard writing system uses Hangul. 1 modern Hangul, but so-colled “Old Hangul” characters that uses the lost letters such as ‘Arae-A’(ㆍ), ‘Yet Ieung’(ㆁ) or ‘Pan-Sios’(ㅿ) etc. can also be typeset. X LE ATEX or LuaLATEX is required. The legacy pdfTEX is not supported. The Korean Language supporting packages such as xetexko and luatexko (in the ko.TEX bundle) or polyglossia package with Korean support are recommended, but without them typeset- ting Hangul is of no problem with this package pmhanguljamo. -
3. Pronunciation & Phonetic Letters Guide
หน้า ๒๙ : Nâa 29 PRONUNCIATION GUIDE It is important to read this section thoroughly if your aim is to learn to speak Thai using phonetic letters or transliteration. Skip this only if you can read Thai script. What are Phonetic Letters? There are many different practices for learning to speak a language without learning the scripts of the language. Romanisation is the conversion of writing from a different writing system to the Roman (Latin) script, or a system for doing so. Methods of romanisation include transliteration, for representing written text, and transcription, for representing the spoken word, and combinations of both. Phonetic transcription (also known as Phonetic Script or Phonetic Notation) is the visual representation of speech sounds (or phones). The most common type of phonetic transcription use a phonetic alphabet such as the International Phonetic Alphabet (IPA). Transliteration is the practice of converting a text from one script into another by writing or printing (a letter or word) using the closest corresponding letters of a different alphabet or language. Royal Thai Phonetic Alphabet (RPA) is a system used as a standard for transcribing Thai words, names of a person, place, thing, into English. Romanisation In Thai is called ทับศัพท์ : Túb~Sùb or ทับเสียง : Túb~Sĕarng and colloquially called ภาษาคาราโอเกะ : Paa-săa Kaa-raa-o-ge’ = Karaoke language. (ทับ : Túb = place on top of, ศัพท์ : Sùb = vocabulary, เสียง : Sĕang = sound, คาราโอเกะ : Kaa- raa-o-ge’ = karaoke) At Thai Style, we have developed our own system using the English alphabet as phonetic letters. This is because we would like to compare the sounds in English and Thai by using the English alphabet, like if you are a native English speaker learning French which uses the same alphabet but some letters are pronounced with completely different sounds. -
UC Berkeley Dissertations, Department of Linguistics
UC Berkeley Dissertations, Department of Linguistics Title The Consonant System of Middle-Old Tibetan and the Tonogenesis of Tibetan Permalink https://escholarship.org/uc/item/18g449sj Author Zhang, Lian Publication Date 1987 eScholarship.org Powered by the California Digital Library University of California The Consonant System of Middle-Old Tibetan and the Tonogenesis of Tibetan By Lian Sheng Zhang Graduate of Central Institute for Minority Nationalities, People's Republic of China, 1966 C.Phil. (University of California) 1986 DISSERTATION Submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Linguistics in the GRADUATE DIVISION OF THE UNIVERSITY OF CALIFORNIA, BERKELEY Approved: Chairman Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. THE CONSONANT SYSTEM OF MIDDLE-OLD TIBETAN AND THE TONOGENESIS OF TIBETAN Copyright © 1987 LIAN SHENG ZHANG Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. ABSTRACT Lian Sheng Zhang This study not only tries to reconstruct the consonant system of Middle-Old Tibetan, but also provides proof for my proposition that the period of Middle-Old Tibetan (from the middle of the 7th century to the second half of the 9th century, A.D.) was the time when the original Tibetan voiced consonants were devoiced and tonogenesis occurred. I have used three kinds of source materials: 1. extant old Tibetan documents and books, as well as wooden slips and bronze or stone tablets dating from the 7th century to the 9th century; 2. 7th to 9th century transcribed (translated or transliterated) materials between Tibetan and other languages, especially Chinese transcriptions of Tibetan documents and Tibetan transcriptions of Chinese documents; 3- linguistic data from various dialects of Modern Tibetan.