Ascii Control Codes

Total Page:16

File Type:pdf, Size:1020Kb

Ascii Control Codes 270 TUGboat, Volume 29 (2008), No. 2 Character encoding Since a computer organizes its bits in 8-bit bytes, and ASCII only codified the codes under 128, this left Victor Eijkhout the codes with the high bit set (`extended ASCII') Have you ever wondered what goes on between the undefined, and different manufacturers of computer `A' you hit on your keyboard, the `A' stored in your equipment came up with their own way of filling file, and the `A' that comes out of your printer? them in. These standards were called `code pages', Why does that letter still come out of the printer and IBM gave a standard numbering to them. For in- if the file is printed by your friend in Egypt who stance, code page 437 is the MS-DOS code page with doesn't use the letter `A'? Maybe you know that `A' accented characters for most European languages, is character 65 (decimal) in ASCII; if you put it on 862 is DOS in Israel, and 737 is DOS for Greek. a web page, and it's visited by someone in Japan, Here is cp437: why don't they get character number 65 in the Kanji alphabet? Do you remember the DOS days when your Mac owning colleague would send you a file and what were supposed to be accented characters would turn into smiley faces? Have you ever pasted text from MS-Word into Emacs, and Emacs wanted to save the document as UTF-8? Just what is that about? All this, and more, will be explained in this article. 1 History in one byte Somewhere in the depths of prehistory, people in the Western world agreed on a standard for character codes under 127, ASCII, the American Standard Code for Information Interchange. This standard declares that the letter `A' is character number 65 decimal (41 in hexadecimal), so if your file contains the bit pattern for 65 (which is 01000001), it will produce an `A' when sent to the printer. MacRoman: ASCII has some nice properties, some of which were lacking in another encoding scheme, EBCDIC (which was used almost exclusively by IBM): • All letters are consecutive, making a test `is this a letter' easy to perform. • Uppercase and lowercase letters are at a distance of 32; this means that the Shift key on your keyboard simply toggles the sixth bit in the pattern of whatever key you are holding down. and Microsoft cp-1252: • The first 32 codes, everything below the space character, as well as position 127, are `unprint- able', and can be used for such purposes as terminal cursor control. The ISO 646 standard codified 7-bit ASCII, but it left certain character positions (or `code points') open for national variation. For instance, British usage put a pound sign ($) in the position of the dollar. The ASCII character set was originally accepted as ANSI X3.4 in 1968. ANSI is displayed in table1. More code pages are displayed in [5]. TUGboat, Volume 29 (2008), No. 2 271 dec CHAR ASCII CONTROL CODES hex oct b7 0 0 0 0 1 1 1 1 b6 0 0 1 1 0 0 1 1 b5 0 1 0 1 0 1 0 1 BITS SYMBOLS CONTROL UPPERCASE LOWERCASE b4 b3 b2 b1 NUMBERS 0 16 32 48 64 80 96 112 0 0 0 0 NUL DLE SP 0 @ P ` p 0 0 10 20 20 40 30 60 40 100 50 120 60 140 70 160 1 17 33 49 65 81 97 113 0 0 0 1 SOH DC1 ! 1 A Q a q 1 1 11 21 21 41 31 61 41 101 51 121 61 141 71 161 2 18 34 50 66 82 98 114 0 0 1 0 STX DC2 " 2 B R b r 2 2 12 22 22 42 32 62 42 102 52 122 62 142 72 162 3 19 35 51 67 83 99 115 0 0 1 1 ETX DC3 # 3 C S c s 3 3 13 23 23 43 33 63 43 103 53 123 63 143 73 163 4 20 36 52 68 84 100 116 0 1 0 0 EOT DC4 $ 4 D T d t 4 4 14 24 24 44 34 64 44 104 54 124 64 144 74 164 5 21 37 53 69 85 101 117 0 1 0 1 ENQ NAK % 5 E U e u 5 5 15 25 25 45 35 65 45 105 55 125 65 145 75 165 6 22 38 54 70 86 102 118 0 1 1 0 ACK SYN & 6 F V f v 6 6 16 26 26 46 36 66 46 106 56 126 66 146 76 166 7 23 39 55 71 87 103 119 0 1 1 1 BEL ETB ' 7 G W g w 7 7 17 27 27 47 37 67 47 107 57 127 67 147 77 167 8 24 40 56 72 88 104 120 1 0 0 0 BS CAN ( 8 H X h x 8 10 18 30 28 50 38 70 48 110 58 130 68 150 78 170 9 25 41 57 73 89 105 121 1 0 0 1 HT EM ) 9 I Y i y 9 11 19 31 29 51 39 71 49 111 59 131 69 151 79 171 10 26 42 58 74 90 106 122 1 0 1 0 LF SUB * : J Z j z A 12 1A 32 2A 52 3A 72 4A 112 5A 132 6A 152 7A 172 11 27 43 59 75 91 107 123 1 0 1 1 VT ESC + ; K [ k f B 13 1B 33 2B 53 3B 73 4B 113 5B 133 6B 153 7B 173 12 28 44 60 76 92 108 124 1 1 0 0 FF FS , < L n l j C 14 1C 34 2C 54 3C 74 4C 114 5C 134 6C 154 7C 174 13 29 45 61 77 93 109 125 1 1 0 1 CR GS − = M ] m g D 15 1D 35 2D 55 3D 75 4D 115 5D 135 6D 155 7D 175 14 30 46 62 78 94 110 126 1 1 1 0 SO RS . > N ^ n ~ E 16 1E 36 2E 56 3E 76 4E 116 5E 136 6E 156 7E 176 15 31 47 63 79 95 111 127 1 1 1 1 SI US / ? O _ o DEL F 17 1F 37 2F 57 3F 77 4F 117 5F 137 6F 157 7F 177 Table 1: The ASCII table The international variants were standardized as ISO 646-DE (German), 646-DK (Danish), et cetera. Originally, the dollar sign could still be replaced by the currency symbol, but after a 1991 revision the dollar is now the only possibility. The different code pages were ultimately stan- dardized as ISO 8859, with such popular code pages as 8859-1 (`Latin 1') for western European: 272 TUGboat, Volume 29 (2008), No. 2 8859-2 for eastern European, and 8859-5 for Cyrillic: There used to be a drive towards unambiguous abstract character names across repertoires and encodings, but Unicode ended this, as it provides (or aims to provide) more or less a complete list of every character on earth. CEF Character Encoding Form: a mapping from a set of non-negative integers that are elements of a CCS to a set of sequences of particular code These ISO standards explicitly left the first 32 units. A `code unit' is an integer of a specific extended positions undefined. binary width, for instance 8 or 16 bits. A CEF Reading material: The history of ASCII out of then maps the code points of a coded character telegraph codes [1]; a history, paying attention to mul- set into sequences of code points, and these tilingual use [4]; Bob Bemer, the `father of ASCII'[2]; sequences can be of different lengths inside one a detailed discussion of ISO 8859, Latin-1 [11]. code page. For instance ASCII uses a single 7-bit unit; UTF-8 uses one to four 8-bit units. We 2 Character sets and encodings will discuss the UTF encodings below. As you can tell from the introduction, there is quite CES Character Encoding Scheme: a reversible trans- a bit of confusion possible between characters and formation from a set of sequences of code units representations or encodings. Let us clear up the (from one or more CEFs to a serialized sequence concepts a little. of bytes. In single-byte cases such as ASCII and Informally, the term `character set' (also `char- UTF-8 this mapping is trivial. With the two- acter code' or `code') used to mean something like byte scheme UCS-2 there is a single `byte order `a table of bytes, each with a character shape'. With mark', after which the code units are trivially only the English alphabet to deal with that is a good mapped to bytes. On the other hand, ISO 2022, enough definition. These days, much more general which uses escape sequences to switch between cases are handled, mapping one octet into several different encodings, is a complicated CES. characters, or several octets into one character. The definition has changed accordingly: Additionally, there are the concepts of A charset is a method of converting a se- CM Character Map: a mapping from sequences of quence of octets into a sequence of characters. members of an abstract character repertoire to This conversion may also optionally produce serialized sequences of bytes bridging all four additional control information such as direc- levels in a single operation.
Recommended publications
  • Cumberland Tech Ref.Book
    Forms Printer 258x/259x Technical Reference DRAFT document - Monday, August 11, 2008 1:59 pm Please note that this is a DRAFT document. More information will be added and a final version will be released at a later date. August 2008 www.lexmark.com Lexmark and Lexmark with diamond design are trademarks of Lexmark International, Inc., registered in the United States and/or other countries. © 2008 Lexmark International, Inc. All rights reserved. 740 West New Circle Road Lexington, Kentucky 40550 Draft document Edition: August 2008 The following paragraph does not apply to any country where such provisions are inconsistent with local law: LEXMARK INTERNATIONAL, INC., PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you. This publication could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in later editions. Improvements or changes in the products or the programs described may be made at any time. Comments about this publication may be addressed to Lexmark International, Inc., Department F95/032-2, 740 West New Circle Road, Lexington, Kentucky 40550, U.S.A. In the United Kingdom and Eire, send to Lexmark International Ltd., Marketing and Services Department, Westhorpe House, Westhorpe, Marlow Bucks SL7 3RQ. Lexmark may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
    [Show full text]
  • ST.36 Page: 3.36.1
    HANDBOOK ON INDUSTRIAL PROPERTY INFORMATION AND DOCUMENTATION Ref.: Standards – ST.36 page: 3.36.1 STANDARD ST.36 Version 1.2 RECOMMENDATION FOR THE PROCESSING OF PATENT INFORMATION USING XML (EXTENSIBLE MARKUP LANGUAGE) Revision adopted by ST.36 Task Force of the Standards and Documentation Working Group (SDWG) on November 23, 2007 TABLE OF CONTENTS INTRODUCTION ............................................................................................................................................................ 2 DEFINITIONS ................................................................................................................................................................. 3 SCOPE OF THE STANDARD ........................................................................................................................................ 3 REQUIREMENTS OF THE STANDARD........................................................................................................................ 4 General ......................................................................................................................................................................... 4 Characters .................................................................................................................................................................... 5 Naming international common elements....................................................................................................................... 6 Naming office-specific elements
    [Show full text]
  • Hieroglyphs for the Information Age: Images As a Replacement for Characters for Languages Not Written in the Latin-1 Alphabet Akira Hasegawa
    Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 5-1-1999 Hieroglyphs for the information age: Images as a replacement for characters for languages not written in the Latin-1 alphabet Akira Hasegawa Follow this and additional works at: http://scholarworks.rit.edu/theses Recommended Citation Hasegawa, Akira, "Hieroglyphs for the information age: Images as a replacement for characters for languages not written in the Latin-1 alphabet" (1999). Thesis. Rochester Institute of Technology. Accessed from This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected]. Hieroglyphs for the Information Age: Images as a Replacement for Characters for Languages not Written in the Latin- 1 Alphabet by Akira Hasegawa A thesis project submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Printing Management and Sciences in the College of Imaging Arts and Sciences of the Rochester Institute ofTechnology May, 1999 Thesis Advisor: Professor Frank Romano School of Printing Management and Sciences Rochester Institute ofTechnology Rochester, New York Certificate ofApproval Master's Thesis This is to certify that the Master's Thesis of Akira Hasegawa With a major in Graphic Arts Publishing has been approved by the Thesis Committee as satisfactory for the thesis requirement for the Master ofScience degree at the convocation of May 1999 Thesis Committee: Frank Romano Thesis Advisor Marie Freckleton Gr:lduate Program Coordinator C.
    [Show full text]
  • Base64 Character Encoding and Decoding Modeling
    Base64 Character Encoding and Decoding Modeling Isnar Sumartono1, Andysah Putera Utama Siahaan2, Arpan3 Faculty of Computer Science,Universitas Pembangunan Panca Budi Jl. Jend. Gatot Subroto Km. 4,5 Sei Sikambing, 20122, Medan, Sumatera Utara, Indonesia Abstract: Security is crucial to maintaining the confidentiality of the information. Secure information is the information should not be known to the unreliable person, especially information concerning the state and the government. This information is often transmitted using a public network. If the data is not secured in advance, would be easily intercepted and the contents of the information known by the people who stole it. The method used to secure data is to use a cryptographic system by changing plaintext into ciphertext. Base64 algorithm is one of the encryption processes that is ideal for use in data transmission. Ciphertext obtained is the arrangement of the characters that have been tabulated. These tables have been designed to facilitate the delivery of data during transmission. By applying this algorithm, errors would be avoided, and security would also be ensured. Keywords: Base64, Security, Cryptography, Encoding I. INTRODUCTION Security and confidentiality is one important aspect of an information system [9][10]. The information sent is expected to be well received only by those who have the right. Information will be useless if at the time of transmission intercepted or hijacked by an unauthorized person [7]. The public network is one that is prone to be intercepted or hijacked [1][2]. From time to time the data transmission technology has developed so rapidly. Security is necessary for an organization or company as to maintain the integrity of the data and information on the company.
    [Show full text]
  • IBM Db2 High Performance Unload for Z/OS User's Guide
    5.1 IBM Db2 High Performance Unload for z/OS User's Guide IBM SC19-3777-03 Note: Before using this information and the product it supports, read the "Notices" topic at the end of this information. Subsequent editions of this PDF will not be delivered in IBM Publications Center. Always download the latest edition from the Db2 Tools Product Documentation page. This edition applies to Version 5 Release 1 of Db2 High Performance Unload for z/OS (product number 5655-AA1) and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright IBM® Corporation 1999, 2021; Copyright Infotel 1999, 2021. All Rights Reserved. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. © Copyright International Business Machines Corporation . US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents About this information......................................................................................... vii Chapter 1. Db2 High Performance Unload overview................................................ 1 What does Db2 HPU do?..............................................................................................................................1 Db2 HPU benefits.........................................................................................................................................1 Db2 HPU process and components.............................................................................................................2
    [Show full text]
  • PCL PC-8, Code Page 437 Page 1 of 5 PCL PC-8, Code Page 437
    PCL PC-8, Code Page 437 Page 1 of 5 PCL PC-8, Code Page 437 PCL Symbol Set: 10U Unicode glyph correspondence tables. Contact:[email protected] http://pcl.to -- -- -- -- $90 U00C9 Ê Uppercase e acute $21 U0021 Ë Exclamation $91 U00E6 Ì Lowercase ae diphthong $22 U0022 Í Neutral double quote $92 U00C6 Î Uppercase ae diphthong $23 U0023 Ï Number $93 U00F4 & Lowercase o circumflex $24 U0024 ' Dollar $94 U00F6 ( Lowercase o dieresis $25 U0025 ) Per cent $95 U00F2 * Lowercase o grave $26 U0026 + Ampersand $96 U00FB , Lowercase u circumflex $27 U0027 - Neutral single quote $97 U00F9 . Lowercase u grave $28 U0028 / Left parenthesis $98 U00FF 0 Lowercase y dieresis $29 U0029 1 Right parenthesis $99 U00D6 2 Uppercase o dieresis $2A U002A 3 Asterisk $9A U00DC 4 Uppercase u dieresis $2B U002B 5 Plus $9B U00A2 6 Cent sign $2C U002C 7 Comma, decimal separator $9C U00A3 8 Pound sterling $2D U002D 9 Hyphen $9D U00A5 : Yen sign $2E U002E ; Period, full stop $9E U20A7 < Pesetas $2F U002F = Solidus, slash $9F U0192 > Florin sign $30 U0030 ? Numeral zero $A0 U00E1 ê Lowercase a acute $31 U0031 A Numeral one $A1 U00ED B Lowercase i acute $32 U0032 C Numeral two $A2 U00F3 D Lowercase o acute $33 U0033 E Numeral three $A3 U00FA F Lowercase u acute $34 U0034 G Numeral four $A4 U00F1 H Lowercase n tilde $35 U0035 I Numeral five $A5 U00D1 J Uppercase n tilde $36 U0036 K Numeral six $A6 U00AA L Female ordinal (a) http://www.pclviewer.com (c) RedTitan Technology 2005 PCL PC-8, Code Page 437 Page 2 of 5 $37 U0037 M Numeral seven $A7 U00BA N Male ordinal (o) $38 U0038
    [Show full text]
  • Oral History of Captain Grace Hopper
    Oral History of Captain Grace Hopper Interviewed by: Angeline Pantages Recorded: December, 1980 Naval Data Automation Command, Maryland CHM Reference number: X5142.2009 © 1980 Computer History Museum Table of Contents BACKGROUND HISTORY ...........................................................................................................3 1943-1949: MARK I, II, AND III COMPUTERS AT HARVARD....................................................6 1949-1964: ECKERT AND MAUCHLY, UNIVAC, AND THE ONE-PASS COMPILER ................7 The Need for User-Friendly Languages ..................................................................................10 DEMANDS FOR THE FUTURE..................................................................................................12 Application Processors, Database Machines, Distributed Processing ....................................12 Demand for Programmers and System Analysts ....................................................................14 The Value and Cost of Information..........................................................................................14 The Navy’s Dilemma: Micros and Software Creation..............................................................15 The Murray Siblings: Brilliant Communicators.........................................................................18 Common Sense and Distributed Computing ...........................................................................19 BACK TO 1943-1949: HOWARD AIKEN....................................................................................21
    [Show full text]
  • Unicode and Code Page Support
    Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of Contents 1 Unicode and Code Page Support .................................................................................... 1 2 Introduction ..................................................................................................................... 3 About Code Pages and Unicode ................................................................................ 4 About Unicode and Code Page Support in Natural .................................................. 5 ICU on Mainframe Platforms ..................................................................................... 6 3 Unicode and Code Page Support in the Natural Programming Language .................... 7 Natural Data Format U for Unicode-Based Data ....................................................... 8 Statements .................................................................................................................. 9 Logical
    [Show full text]
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
    [Show full text]
  • Verity Locale Configuration Guide V5.0 for Peoplesoft
    Verity® Locale Configuration Guide V 5.0 for PeopleSoft® November 15, 2003 Original Part Number DM0619 Verity, Incorporated 894 Ross Drive Sunnyvale, California 94089 (408) 541-1500 Verity Benelux BV Coltbaan 31 3439 NG Nieuwegein The Netherlands Copyright 2003 Verity, Inc. All rights reserved. No part of this publication may be reproduced, transmitted, stored in a retrieval system, nor translated into any human or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual or otherwise, without the prior written permission of the copyright owner, Verity, Inc., 894 Ross Drive, Sunnyvale, California 94089. The copyrighted software that accompanies this manual is licensed to the End User for use only in strict accordance with the End User License Agreement, which the Licensee should read carefully before commencing use of the software. Verity®, Ultraseek®, TOPIC®, KeyView®, and Knowledge Organizer® are registered trademarks of Verity, Inc. in the United States and other countries. The Verity logo, Verity Portal One™, and Verity® Profiler™ are trademarks of Verity, Inc. Sun, Sun Microsystems, the Sun logo, Sun Workstation, Sun Operating Environment, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. Xerces XML Parser Copyright 1999-2000 The Apache Software Foundation. All rights reserved. Microsoft is a registered trademark, and MS-DOS, Windows, Windows 95, Windows NT, and other Microsoft products referenced herein are trademarks of Microsoft Corporation. IBM is a registered trademark of International Business Machines Corporation. The American Heritage® Concise Dictionary, Third Edition Copyright 1994 by Houghton Mifflin Company. Electronic version licensed from Lernout & Hauspie Speech Products N.V.
    [Show full text]
  • Regular Expressions in Jlex to Define a Token in Jlex, the User to Associates a Regular Expression with Commands Coded in Java
    Regular Expressions in JLex To define a token in JLex, the user to associates a regular expression with commands coded in Java. When input characters that match a regular expression are read, the corresponding Java code is executed. As a user of JLex you don’t need to tell it how to match tokens; you need only say what you want done when a particular token is matched. Tokens like white space are deleted simply by having their associated command not return anything. Scanning continues until a command with a return in it is executed. The simplest form of regular expression is a single string that matches exactly itself. © CS 536 Spring 2005 109 For example, if {return new Token(sym.If);} If you wish, you can quote the string representing the reserved word ("if"), but since the string contains no delimiters or operators, quoting it is unnecessary. For a regular expression operator, like +, quoting is necessary: "+" {return new Token(sym.Plus);} © CS 536 Spring 2005 110 Character Classes Our specification of the reserved word if, as shown earlier, is incomplete. We don’t (yet) handle upper or mixed- case. To extend our definition, we’ll use a very useful feature of Lex and JLex— character classes. Characters often naturally fall into classes, with all characters in a class treated identically in a token definition. In our definition of identifiers all letters form a class since any of them can be used to form an identifier. Similarly, in a number, any of the ten digit characters can be used. © CS 536 Spring 2005 111 Character classes are delimited by [ and ]; individual characters are listed without any quotation or separators.
    [Show full text]
  • SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting
    SAS 9.3 UTF-8 Encoding Support and Related Issue Troubleshooting Jason (Jianduan) Liang SAS certified: Platform Administrator, Advanced Programmer for SAS 9 Agenda Introduction UTF-8 and other encodings SAS options for encoding and configuration Other Considerations for UTF-8 data Encoding issues troubleshooting techniques (tips) Introduction What is UTF-8? . A character encoding capable of encoding all possible characters Why UTF-8? . Dominant encoding of the www (86.5%) SAS system options for encoding . Encoding – instructs SAS how to read, process and store data . Locale - instructs SAS how to present or display currency, date and time, set timezone values UTF-8 and other Encodings ASSCII (American Standard Code for Information Interchange) . 7-bit . 128 - character set . Examples (code point-char-hex): 32-Space-20; 63-?-3F; 64-@-40; 65-A-41 UTF-8 and other Encodings ISO 8859-1 (Latin-1) for Western European languages Windows-1252 (Latin-1) for Western European languages . 8-bit (1 byte, 256 character set) . Identical to asscii for the first 128 chars . Extended ascii chars examples: . 155-£-A3; 161- ©-A9 . SAS option encoding value: wlatin1 (latin1) UTF-8 and other Encodings UTF-8 and other Encodings Problems . Only covers English and Western Europe languages, ISO-8859-2, …15 . Multiple encoding is required to support national languages . Same character encoded differently, same code point represents different chars Unicode . Unicode – assign a unique code/number to every possible character of all languages . Examples of unicode points: o U+0020 – Space U+0041 – A o U+00A9 - © U+C3BF - ÿ UTF-8 and other Encodings UTF-8 .
    [Show full text]