Chapter 5. Characters: Typology and Page Encoding 1
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names
ST/ESA/STAT/SER.M/87 Department of Economic and Social Affairs Statistics Division Technical reference manual for the standardization of geographical names United Nations Group of Experts on Geographical Names United Nations New York, 2007 The Department of Economic and Social Affairs of the United Nations Secretariat is a vital interface between global policies in the economic, social and environmental spheres and national action. The Department works in three main interlinked areas: (i) it compiles, generates and analyses a wide range of economic, social and environmental data and information on which Member States of the United Nations draw to review common problems and to take stock of policy options; (ii) it facilitates the negotiations of Member States in many intergovernmental bodies on joint courses of action to address ongoing or emerging global challenges; and (iii) it advises interested Governments on the ways and means of translating policy frameworks developed in United Nations conferences and summits into programmes at the country level and, through technical assistance, helps build national capacities. NOTE The designations employed and the presentation of material in the present publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The term “country” as used in the text of this publication also refers, as appropriate, to territories or areas. Symbols of United Nations documents are composed of capital letters combined with figures. ST/ESA/STAT/SER.M/87 UNITED NATIONS PUBLICATION Sales No. -
Ligature Modeling for Recognition of Characters Written in 3D Space Dae Hwan Kim, Jin Hyung Kim
Ligature Modeling for Recognition of Characters Written in 3D Space Dae Hwan Kim, Jin Hyung Kim To cite this version: Dae Hwan Kim, Jin Hyung Kim. Ligature Modeling for Recognition of Characters Written in 3D Space. Tenth International Workshop on Frontiers in Handwriting Recognition, Université de Rennes 1, Oct 2006, La Baule (France). inria-00105116 HAL Id: inria-00105116 https://hal.inria.fr/inria-00105116 Submitted on 10 Oct 2006 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Ligature Modeling for Recognition of Characters Written in 3D Space Dae Hwan Kim Jin Hyung Kim Artificial Intelligence and Artificial Intelligence and Pattern Recognition Lab. Pattern Recognition Lab. KAIST, Daejeon, KAIST, Daejeon, South Korea South Korea [email protected] [email protected] Abstract defined shape of character while it showed high recognition performance. Moreover when a user writes In this work, we propose a 3D space handwriting multiple stroke character such as ‘4’, the user has to recognition system by combining 2D space handwriting write a new shape which is predefined in a uni-stroke models and 3D space ligature models based on that the and which he/she has never seen. -
ISO Basic Latin Alphabet
ISO basic Latin alphabet The ISO basic Latin alphabet is a Latin-script alphabet and consists of two sets of 26 letters, codified in[1] various national and international standards and used widely in international communication. The two sets contain the following 26 letters each:[1][2] ISO basic Latin alphabet Uppercase Latin A B C D E F G H I J K L M N O P Q R S T U V W X Y Z alphabet Lowercase Latin a b c d e f g h i j k l m n o p q r s t u v w x y z alphabet Contents History Terminology Name for Unicode block that contains all letters Names for the two subsets Names for the letters Timeline for encoding standards Timeline for widely used computer codes supporting the alphabet Representation Usage Alphabets containing the same set of letters Column numbering See also References History By the 1960s it became apparent to thecomputer and telecommunications industries in the First World that a non-proprietary method of encoding characters was needed. The International Organization for Standardization (ISO) encapsulated the Latin script in their (ISO/IEC 646) 7-bit character-encoding standard. To achieve widespread acceptance, this encapsulation was based on popular usage. The standard was based on the already published American Standard Code for Information Interchange, better known as ASCII, which included in the character set the 26 × 2 letters of the English alphabet. Later standards issued by the ISO, for example ISO/IEC 8859 (8-bit character encoding) and ISO/IEC 10646 (Unicode Latin), have continued to define the 26 × 2 letters of the English alphabet as the basic Latin script with extensions to handle other letters in other languages.[1] Terminology Name for Unicode block that contains all letters The Unicode block that contains the alphabet is called "C0 Controls and Basic Latin". -
The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2017 The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles Moran, Steven ; Cysouw, Michael DOI: https://doi.org/10.5281/zenodo.290662 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-135400 Monograph The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran & Michael Cysouw Change dedication in localmetadata.tex Preface This text is meant as a practical guide for linguists, and programmers, whowork with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Al- phabet is often not met without frustration by users. Nevertheless, thetwo standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world’s languages is in author- or document-specific orthogra- phies. -
Unicode and Code Page Support
Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of Contents 1 Unicode and Code Page Support .................................................................................... 1 2 Introduction ..................................................................................................................... 3 About Code Pages and Unicode ................................................................................ 4 About Unicode and Code Page Support in Natural .................................................. 5 ICU on Mainframe Platforms ..................................................................................... 6 3 Unicode and Code Page Support in the Natural Programming Language .................... 7 Natural Data Format U for Unicode-Based Data ....................................................... 8 Statements .................................................................................................................. 9 Logical -
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only. -
Combining Diacritical Marks Range: 0300–036F the Unicode Standard
Combining Diacritical Marks Range: 0300–036F The Unicode Standard, Version 4.0 This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 4.0. Characters in this chart that are new for The Unicode Standard, Version 4.0 are shown in conjunction with any existing characters. For ease of reference, the new characters have been highlighted in the chart grid and in the names list. This file will not be updated with errata, or when additional characters are assigned to the Unicode Standard. See http://www.unicode.org/charts for access to a complete list of the latest character charts. Disclaimer These charts are provided as the on-line reference to the character contents of the Unicode Standard, Version 4.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this excerpt file, please consult the appropriate sections of The Unicode Standard, Version 4.0 (ISBN 0-321-18578-1), as well as Unicode Standard Annexes #9, #11, #14, #15, #24 and #29, the other Unicode Technical Reports and the Unicode Character Database, which are available on-line. See http://www.unicode.org/Public/UNIDATA/UCD.html and http://www.unicode.org/unicode/reports A thorough understanding of the information contained in these additional sources is required for a successful implementation. Fonts The shapes of the reference glyphs used in these code charts are not prescriptive. Considerable variation is to be expected in actual fonts. -
Threshold Digraphs
Volume 119 (2014) http://dx.doi.org/10.6028/jres.119.007 Journal of Research of the National Institute of Standards and Technology Threshold Digraphs Brian Cloteaux1, M. Drew LaMar2, Elizabeth Moseman1, and James Shook1 1National Institute of Standards and Technology, Gaithersburg, MD 20899 2The College of William and Mary, Williamsburg, VA 23185 [email protected] [email protected] [email protected] [email protected] A digraph whose degree sequence has a unique vertex labeled realization is called threshold. In this paper we present several characterizations of threshold digraphs and their degree sequences, and show these characterizations to be equivalent. Using this result, we obtain a new, short proof of the Fulkerson-Chen theorem on degree sequences of general digraphs. Key words: digraph realizations; Fulkerson-Chen theorem; threshold digraphs. Accepted: May 6, 2014 Published: May 20, 2014 http://dx.doi.org/10.6028/jres.119.007 1. Introduction Creating models of real-world networks, such as social and biological interactions, is a central task for understanding and measuring the behavior of these networks. A usual first step in this type of model creation is to construct a digraph with a given degree sequence. We examine the extreme case of digraph construction where for a given degree sequence there is exactly one digraph that can be created. What follows is a brief introduction to the notation used in the paper. For notation not otherwise defined, see Diestel [1]. We let G=( VE , ) be a digraph where E is a set of ordered pairs called arcs. If (,vw )∈ E, then we say w is an out-neighbor of v and v is an in-neighbor of w. -
Gerard Manley Hopkins' Diacritics: a Corpus Based Study
Gerard Manley Hopkins’ Diacritics: A Corpus Based Study by Claire Moore-Cantwell This is my difficulty, what marks to use and when to use them: they are so much needed, and yet so objectionable.1 ~Hopkins 1. Introduction In a letter to his friend Robert Bridges, Hopkins once wrote: “... my apparent licences are counterbalanced, and more, by my strictness. In fact all English verse, except Milton’s, almost, offends me as ‘licentious’. Remember this.”2 The typical view held by modern critics can be seen in James Wimsatt’s 2006 volume, as he begins his discussion of sprung rhythm by saying, “For Hopkins the chief advantage of sprung rhythm lies in its bringing verse rhythms closer to natural speech rhythms than traditional verse systems usually allow.”3 In a later chapter, he also states that “[Hopkins’] stress indicators mark ‘actual stress’ which is both metrical and sense stress, part of linguistic meaning broadly understood to include feeling.” In his 1989 article, Sprung Rhythm, Kiparsky asks the question “Wherein lies [sprung rhythm’s] unique strictness?” In answer to this question, he proposes a system of syllable quantity coupled with a set of metrical rules by which, he claims, all of Hopkins’ verse is metrical, but other conceivable lines are not. This paper is an outgrowth of a larger project (Hayes & Moore-Cantwell in progress) in which Kiparsky’s claims are being analyzed in greater detail. In particular, we believe that Kiparsky’s system overgenerates, allowing too many different possible scansions for each line for it to be entirely falsifiable. The goal of the project is to tighten Kiparsky’s system by taking into account the gradience that can be found in metrical well-formedness, so that while many different scansion of a line may be 1 Letter to Bridges dated 1 April 1885. -
BBJ1RBR '(]~ F'o £.A2WB IJ.Boibiollb, •
801 80 mua1. , .. DDl&IOI o• OOLU•••• . I I:~ I BBJ1RBR '(]~ f'O £.A2WB IJ.BOIBIOllB, • •• • I r. • • Not Current- 1803 , 0 TIUI t ohn Tock r, q., of o ton, be i an k ep hi office at th eat of the o nbt practi , ither a an attorn y or a 0 &&all ontinue to b clerk of the ame. hat (until furth r orde ) it hall be om or cou ello to practi e in th· uch for thr e ea pa t in th uprem p cti ly b long, and that their pri ate r to be fair. &. t coun llors hall not pract• e a ........ ello , in h · court. 6. , That they hall re pecti el take the ------, do olemnly ear, that I ill or coun ellor of the court) uprightly, and upport the o titution of the nited &. t (unle and until it hall othe · e of · court hall be in the name of the That he cou ello and attorne , hall ake either an oath, or, in proper cribe b the rule of this court on that r•••, 1790, iz.: '' , , do olemnly be) that will demean m elf, as an attor .....· oo rt uprightl , an according to la , and that tion of he nited ta •'' hi f ju ice, in a er to the motion of the info him and the bar, that thi court of king' bench, and of chancery, in Eng ice of th• court; and that they will, tio therein circumstances may ren- x_...vu•• Not Current- 1803 0 '191, ebraary 4. -
QCMUQ@QALB-2015 Shared Task: Combining Character Level MT and Error-Tolerant Finite-State Recognition for Arabic Spelling Correction
QCMUQ@QALB-2015 Shared Task: Combining Character level MT and Error-tolerant Finite-State Recognition for Arabic Spelling Correction Houda Bouamor1, Hassan Sajjad2, Nadir Durrani2 and Kemal Oflazer1 1Carnegie Mellon University in Qatar [email protected], [email protected] 2Qatar Computing Research Institute fhsajjad,[email protected] Abstract sion, in this task participants are asked to imple- We describe the CMU-Q and QCRI’s joint ment a system that takes as input MSA (Modern efforts in building a spelling correction Standard Arabic) text with various spelling errors system for Arabic in the QALB 2015 and automatically correct it. In this year’s edition, Shared Task. Our system is based on a participants are asked to test their systems on two hybrid pipeline that combines rule-based text genres: (i) news corpus (mainly newswire ex- linguistic techniques with statistical meth- tracted from Aljazeera); (ii) a corpus of sentences ods using language modeling and machine written by learners of Arabic as a Second Lan- translation, as well as an error-tolerant guage (ASL). Texts produced by learners of ASL finite-state automata method. Our sys- generally contain a number of spelling errors. The tem outperforms the baseline system and main problem faced by them is using Arabic with achieves better correction quality with an vocabulary and grammar rules that are different F1-measure of 68.42% on the 2014 testset from their native language. and 44.02 % on the L2 Dev. In this paper, we describe our Arabic spelling correction system. Our system is based on a 1 Introduction hybrid pipeline which combines rule-based tech- With the increased usage of computers in the pro- niques with statistical methods using language cessing of various languages comes the need for modeling and machine translation, as well as an correcting errors introduced at different stages. -
A Collection of Mildly Interesting Facts About the Little Symbols We Communicate With
Ty p o g raph i c Factettes A collection of mildly interesting facts about the little symbols we communicate with. Helvetica The horizontal bars of a letter are almost always thinner than the vertical bars. Minion The font size is approximately the measurement from the lowest appearance of any letter to the highest. Most of the time. Seventy-two points equals one inch. Fridge256 point Cochin most of 50the point Zaphino time Letters with rounded bottoms don’t sit on the baseline, but slightly below it. Visually, they would appear too high if they rested on the same base as the squared letters. liceAdobe Caslon Bold UNITED KINGDOM UNITED STATES LOLITA LOLITA In Ancient Rome, scribes would abbreviate et (the latin word for and) into one letter. We still use that abbreviation, called the ampersand. The et is still very visible in some italic ampersands. The word ampersand comes from and-per-se-and. Strange. Adobe Garamond Regular Adobe Garamond Italic Trump Mediaval Italic Helvetica Light hat two letters ss w it cam gue e f can rom u . I Yo t h d. as n b ha e rt en ho a s ro n u e n t d it r fo w r s h a u n w ) d r e e m d a s n o r f e y t e t a e r b s , a b s u d t e d e e n m t i a ( n l d o b s o m a y r S e - d t w A i e t h h t t , h d e n a a s d r v e e p n t m a o f e e h m t e a k i i l .