Your Data Will Go On: Practice for Character Data Migration Edwin (You) Xie, SAS Institute Inc
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
LG Programmer’S Reference Manual
LG Programmer’s Reference Manual Line Matrix Series Printers Trademark Acknowledgements ANSI is a registered trademark of American National Standards Institute, Inc. Code V is a trademark of Quality Micro Systems. Chatillon is a trademark of John Chatillon & Sons, Inc. Ethernet is a trademark of Xerox Corporation. IBM is a registered trademark of International Business Machines Corporation. IGP is a registered trademark of Printronix, LLC. Intelligent Printer Data Stream and IPDS are trademarks of International Business Machines Corporation. LinePrinter Plus is a registered trademark of Printronix, LLC. MS-DOS is a registered trademark of Microsoft Corporation. PC-DOS is a trademark of International Business Machines Corporation. PGL is a registered trademark of Printronix, LLC. PrintNet is a registered trademark of Printronix, LLC. Printronix is a registered trademark of Printronix, LLC. PSA is a trademark of Printronix, LLC. QMS is a registered trademark of Quality Micro Systems. RibbonMinder is a trademark of Printronix, LLC. Torx is a registered trademark of Camcar/Textron Inc. Utica is a registered trademark of Cooper Power Tools. Printronix, LLC. makes no representations or warranties of any kind regarding this material, including, but not limited to, implied warranties of merchantability and fitness for a particular purpose. Printronix, LLC. shall not be held responsible for errors contained herein or any omissions from this material or for any damages, whether direct, indirect, incidental or consequential, in connection with the furnishing, distribution, performance or use of this material. The information in this manual is subject to change without notice. This document contains proprietary information protected by copyright. No part of this document may be reproduced, copied, translated or incorporated in any other material in any form or by any means, whether manual, graphic, electronic, mechanical or otherwise, without the prior written consent of Printronix, LLC. -
Control Characters in ASCII and Unicode
Control characters in ASCII and Unicode Tens of odd control characters appear in ASCII charts. The same characters have found their way to Unicode as well. CR, LF, ESC, CAN... what are all these codes for? Should I care about them? This is an in-depth look into control characters in ASCII and its descendants, including Unicode, ANSI and ISO standards. When ASCII first appeared in the 1960s, control characters were an essential part of the new character set. Since then, many new character sets and standards have been published. Computing is not the same either. What happened to the control characters? Are they still used and if yes, for what? This article looks back at the history of character sets while keeping an eye on modern use. The information is based on a number of standards released by ANSI, ISO, ECMA and The Unicode Consortium, as well as industry practice. In many cases, the standards define one use for a character, but common practice is different. Some characters are used contrary to the standards. In addition, certain characters were originally defined in an ambiguous or loose way, which has resulted in confusion in their use. Contents Groups of control characters Control characters in standards o ASCII control characters o C1 control characters o ISO 8859 special characters NBSP and SHY o Control characters in Unicode Control characters in modern applications Character list o ASCII o C1 o ISO 8859 Categories Translations Character index Sources This article starts by looking at the history of control characters in standards. We then move to modern times. -
Information, Characters, Unicode
Information, Characters, Unicode Unicode © 24 August 2021 1 / 107 Hidden Moral Small mistakes can be catastrophic! Style Care about every character of your program. Tip: printf Care about every character in the program’s output. (Be reasonably tolerant and defensive about the input. “Fail early” and clearly.) Unicode © 24 August 2021 2 / 107 Imperative Thou shalt care about every Ěaracter in your program. Unicode © 24 August 2021 3 / 107 Imperative Thou shalt know every Ěaracter in the input. Thou shalt care about every Ěaracter in your output. Unicode © 24 August 2021 4 / 107 Information – Characters In modern computing, natural-language text is very important information. (“number-crunching” is less important.) Characters of text are represented in several different ways and a known character encoding is necessary to exchange text information. For many years an important encoding standard for characters has been US ASCII–a 7-bit encoding. Since 7 does not divide 32, the ubiquitous word size of computers, 8-bit encodings are more common. Very common is ISO 8859-1 aka “Latin-1,” and other 8-bit encodings of characters sets for languages other than English. Currently, a very large multi-lingual character repertoire known as Unicode is gaining importance. Unicode © 24 August 2021 5 / 107 US ASCII (7-bit), or bottom half of Latin1 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SS SI DLE DC1 DC2 DC3 DC4 NAK SYN ETP CAN EM SUB ESC FS GS RS US !"#$%&’()*+,-./ 0123456789:;<=>? @ABCDEFGHIJKLMNO PQRSTUVWXYZ[\]^_ `abcdefghijklmno pqrstuvwxyz{|}~ DEL Unicode Character Sets © 24 August 2021 6 / 107 Control Characters Notice that the first twos rows are filled with so-called control characters. -
Chapter 6, Writing Systems and Punctuation
The Unicode® Standard Version 13.0 – Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2020 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version 13.0. Includes index. ISBN 978-1-936213-26-9 (http://www.unicode.org/versions/Unicode13.0.0/) 1. -
The Unicode Standard, Version 3.0, Issued by the Unicode Consor- Tium and Published by Addison-Wesley
The Unicode Standard Version 3.0 The Unicode Consortium ADDISON–WESLEY An Imprint of Addison Wesley Longman, Inc. Reading, Massachusetts · Harlow, England · Menlo Park, California Berkeley, California · Don Mills, Ontario · Sydney Bonn · Amsterdam · Tokyo · Mexico City Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If these files have been purchased on computer-readable media, the sole remedy for any claim will be exchange of defective media within ninety days of receipt. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. -
Technical Study Universal Multiple-Octet Coded Character Set Coexistence & Migration
Technical Study Universal Multiple-Octet Coded Character Set Coexistence & Migration NIC CH A E L T S T U D Y [This page intentionally left blank] X/Open Technical Study Universal Multiple-Octet Coded Character Set Coexistence and Migration X/Open Company Ltd. February 1994, X/Open Company Limited All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owners. X/Open Technical Study Universal Multiple-Octet Coded Character Set Coexistence and Migration ISBN: 1-85912-031-8 X/Open Document Number: E401 Published by X/Open Company Ltd., U.K. Any comments relating to the material contained in this document may be submitted to X/Open at: X/Open Company Limited Apex Plaza Forbury Road Reading Berkshire, RG1 1AX United Kingdom or by Electronic Mail to: [email protected] ii X/Open Technical Study (1994) Contents Chapter 1 Introduction............................................................................................... 1 1.1 Background.................................................................................................. 2 1.2 Terminology................................................................................................. 2 Chapter 2 Overview..................................................................................................... 3 2.1 Codesets....................................................................................................... -
ASCII, Unicode, and in Between Douglas A. Kerr Issue 3 June 15, 2010 ABSTRACT the Standard Coded Character Set ASCII Was Formall
ASCII, Unicode, and In Between Douglas A. Kerr Issue 3 June 15, 2010 ABSTRACT The standard coded character set ASCII was formally standardized in 1963 and, in its “complete” form, in 1967. It is a 7-bit character set, including 95 graphic characters. As personal computers emerged, they typically had an 8-bit architecture. To exploit that, they typically came to use character sets that were “8-bit extensions of ASCII”, providing numerous additional graphic characters. One family of these became common in computers using the MS-DOS operating system. Another similar but distinct family later became common in computers operating with the Windows operating system. In the late 1980s, a true second-generation coded character set, called Unicode, was developed, and was standardized in 1991. Its structure provides an ultimate capacity of about a million graphic characters, catering thoroughly to the needs of hundreds of languages and many specialized and parochial uses. Various encoding techniques are used to represent characters from this set in 8- and 16-bit computer environments. In this article, we will describe and discuss these coded character sets, their development, their status, the relationships between them, and practical implications of their use. Information is included on the entry, in Windows computer systems, of characters not directly accessible from the keyboard. BEFORE ASCII As of 1960, the interchange of data within and between information communication and processing systems was made difficult by the lack of a uniform coded character set. Information communication in the traditional “teletypewriter” sense generally used a 5-bit coded character set, best described as the “Murray code” (although as a result of historical misunderstandings it is often called the “Baudot code”). -
Font Collection
z/OS Version 2 Release 3 Font Collection IBM GA32-1048-30 Note Before using this information and the product it supports, read the information in “Notices” on page 137. This edition applies to Version 2 Release 3 of z/OS (5650-ZOS) and to all subsequent releases and modifications until otherwise indicated in new editions. Last updated: 2019-02-16 © Copyright International Business Machines Corporation 2002, 2017. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents List of Figures........................................................................................................ v List of Tables........................................................................................................vii About this publication...........................................................................................ix Who needs to read this publication............................................................................................................ ix How this publication is organized............................................................................................................... ix Related information......................................................................................................................................x How to send your comments to IBM.......................................................................xi If you have a technical problem..................................................................................................................xi -
Federal Information Processing Standards Publication: Code For
ANSI X3.4-1977 X3.4-1977 This standard has been adopted for Federal Government use. Details concerning its use within the Federal Government are contained in Federal Information Processing Standards Publication 1-2, Code for Infor¬ mation Interchange, Its Representations, Subsets, and Extensions. For a complete list of the publications available in the Federal Information Pro¬ cessing Standards Series, write to the Standards Processing Coordinator (ADP), Institute for Computer Sciences and Technology, National Bureau of Standards, Gaithersburg, MD 20899. ANSI® X3.4-1977 Revision of X3.4-1968 American National Standard Code for Information Interchange Secretariat Computer and Business Equipment Manufacturers Association Approved June 9, 1977 American National Standards Institute, Inc American An American National Standard implies a consensus of those substantially concerned with its scope and provisions. An American National Standard is intended as a guide to aid the manu¬ National facturer, the consumer, and the general public. The existence of an American National Stan¬ dard does not in any respect preclude anyone, whether he has approved the standard or not, Standard from manufacturing, marketing, purchasing, or using products, processes, or procedures not conforming to the standard. American National Standards are subject to periodic review and users are cautioned to obtain the latest editions. The American National Standards Institute does not develop standards and will in no circum¬ stances give an interpretation of any American National Standard. Moreover, no person shall have the right or authority to issue an interpretation of an American National Standard in the name of the American National Standards Institute. CAUTION NOTICE: This American National Standard may be revised or withdrawn at any time. -
Font Summary Overview 1
InfoPrint Font Collection Font Summary Overview 1 Font concepts 2 Version 3.7 AFP Fonts 3 AFP Outline Fonts 4 AFP Classic OpenType Fonts 5 AFP Asian Classic OpenType Fonts 6 WorldType Fonts 7 AFP Raster Fonts 8 Code pages and extended code pages 9 For information not in this manual, refer to the Help System in your product. Read this manual carefully and keep it handy for future reference. TABLE OF CONTENTS Introduction Important............................................................................................................................................ 3 Cautions regarding this guide............................................................................................................. 3 Publications for this product ................................................................................................................ 3 How to read the documentation ......................................................................................................... 3 Before using InfoPrint Font Collection.................................................................................................. 3 Related publications ........................................................................................................................... 4 Symbols.............................................................................................................................................. 4 Abbreviations .................................................................................................................................... -
ISO/IEC International Standard ISO/IEC 10646
ISO/IEC International Standard ISO/IEC 10646 Final Committee Draft Information technology – Universal Coded Character Set (UCS) Technologie de l’information – Jeu universel de caractères codés (JUC) Second edition, 2010 ISO/IEC 10646:2010 (E) Final Committee Draft (FCD) PDF disclaimer This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area. Adobe is a trademark of Adobe Systems Incorporated. Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF- creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below. © ISO/IEC 2010 All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the ad- dress below or ISO's member body in the country of the requester. ISO copyright office Case postale 56 • CH-1211 Geneva 20 Tel. + 41 22 749 01 11 Fax + 41 22 749 09 47 E-mail [email protected] Web www.iso.ch Printed in Switzerland 2 © ISO/IEC 2010 – All rights reserved ISO/IEC 10646:2010 (E) Final Committee Draft (FCD) CONTENTS Foreword................................................................................................................................................. -
Japanese Language: an Introduction for Language Technology Professionals and Researchers
[Translating and the Computer 24: proceedings of the…International Conference…21-22 November 2002, London (Aslib, 2002)] Japanese Language: An Introduction for Language Technology Professionals and Researchers Carl Kay Independent Consultant 15-22, Onoharanishi 3-chome Minoh-shi, Osaka Japan 562-0032 [email protected] David Fine Internationalization Consultant Philadelphia, PA United States [email protected] Background and Synopsis It is the authors' experience, based on a combined four decades in the translation and localization business, that most people involved in the language business lack knowledge about the basic features of the Japanese language and how it is the same or different from major Western languages. This is quite natural given the many ways that Japan and Japanese are remote presences to most people outside Japan. Many barriers make Japanese inaccessible to Westerners: a strange and complex written language (and related challenges in computer processing of text); cultural differences; limited opportunities to study the language (and the time such study requires); geographic and time zone gaps, and Japan's shallow interaction with the West compared to exchanges among Western nations. Still, even in a recession Japan remains one of the major economies of the world. Its information technology market still dwarfs that of the rest of Asia. In software localization Japanese is often a "Tier One" language meaning a target of the first round of localization aimed at the biggest markets not addressable with English product. Japan's government and companies remain major funders of research in machine translation and other language technology. Yet the above-mentioned barriers make it hard for Westerners in the language technology field to handle contact with Japanese as comfortably and effectively as they do with many other languages.