 
TUGboat, Volume 29 (2008), No. 2, p. 270

Character encoding

Victor Eijkhout

Have you ever wondered what goes on between the `A' you hit on your keyboard, the `A' stored in your file, and the `A' that comes out of your printer? Why does that letter still come out of the printer if the file is printed by your friend in Egypt who doesn't use the letter `A'? Maybe you know that `A' is character 65 (decimal) in ASCII; if you put it on a web page, and it's visited by someone in Japan, why don't they get character number 65 in the Kanji alphabet? Do you remember the DOS days when your Mac-owning colleague would send you a file and what were supposed to be accented characters would turn into smiley faces? Have you ever pasted text from MS-Word into Emacs, and Emacs wanted to save the document as UTF-8? Just what is that about? All this, and more, will be explained in this article.

1 History in one byte

Somewhere in the depths of prehistory, people in the Western world agreed on a standard for character codes under 128, ASCII, the American Standard Code for Information Interchange. This standard declares that the letter `A' is character number 65 decimal (41 in hexadecimal), so if your file contains the bit pattern for 65 (which is 01000001), it will produce an `A' when sent to the printer.

ASCII has some nice properties, some of which were lacking in another encoding scheme, EBCDIC (which was used almost exclusively by IBM):

• All letters are consecutive, making a test `is this a letter' easy to perform.
• Uppercase and lowercase letters are at a distance of 32; this means that the Shift key on your keyboard simply toggles the sixth bit (the one with value 32) in the pattern of whatever key you are holding down.
• The first 32 codes, everything below the space character, as well as position 127, are `unprintable', and can be used for such purposes as terminal cursor control.

The ISO 646 standard codified 7-bit ASCII, but it left certain character positions (or `code points') open for national variation. For instance, British usage put a pound sign (£) in the position of the dollar. The ASCII character set was originally accepted as ANSI X3.4 in 1968. It is displayed in table 1.

Since a computer organizes its bits in 8-bit bytes, and ASCII only codified the codes under 128, this left the codes with the high bit set (`extended ASCII') undefined, and different manufacturers of computer equipment came up with their own way of filling them in. These standards were called `code pages', and IBM gave a standard numbering to them. For instance, code page 437 is the MS-DOS code page with accented characters for most European languages, 862 is DOS in Israel, and 737 is DOS for Greek.

Here is cp437:

[figure: cp437 code chart]

MacRoman:

[figure: MacRoman code chart]

and Microsoft cp-1252:

[figure: cp-1252 code chart]

More code pages are displayed in [5].

b7b6b5   000             001             010             011             100             101             110             111
         CONTROL                         SYMBOLS, NUMBERS                UPPERCASE                       LOWERCASE
b4b3b2b1 dec hx oct chr  dec hx oct chr  dec hx oct chr  dec hx oct chr  dec hx oct chr  dec hx oct chr  dec hx oct chr  dec hx oct chr
0000       0  0   0 NUL   16 10  20 DLE   32 20  40 SP    48 30  60 0     64 40 100 @     80 50 120 P     96 60 140 `    112 70 160 p
0001       1  1   1 SOH   17 11  21 DC1   33 21  41 !     49 31  61 1     65 41 101 A     81 51 121 Q     97 61 141 a    113 71 161 q
0010       2  2   2 STX   18 12  22 DC2   34 22  42 "     50 32  62 2     66 42 102 B     82 52 122 R     98 62 142 b    114 72 162 r
0011       3  3   3 ETX   19 13  23 DC3   35 23  43 #     51 33  63 3     67 43 103 C     83 53 123 S     99 63 143 c    115 73 163 s
0100       4  4   4 EOT   20 14  24 DC4   36 24  44 $     52 34  64 4     68 44 104 D     84 54 124 T    100 64 144 d    116 74 164 t
0101       5  5   5 ENQ   21 15  25 NAK   37 25  45 %     53 35  65 5     69 45 105 E     85 55 125 U    101 65 145 e    117 75 165 u
0110       6  6   6 ACK   22 16  26 SYN   38 26  46 &     54 36  66 6     70 46 106 F     86 56 126 V    102 66 146 f    118 76 166 v
0111       7  7   7 BEL   23 17  27 ETB   39 27  47 '     55 37  67 7     71 47 107 G     87 57 127 W    103 67 147 g    119 77 167 w
1000       8  8  10 BS    24 18  30 CAN   40 28  50 (     56 38  70 8     72 48 110 H     88 58 130 X    104 68 150 h    120 78 170 x
1001       9  9  11 HT    25 19  31 EM    41 29  51 )     57 39  71 9     73 49 111 I     89 59 131 Y    105 69 151 i    121 79 171 y
1010      10  A  12 LF    26 1A  32 SUB   42 2A  52 *     58 3A  72 :     74 4A 112 J     90 5A 132 Z    106 6A 152 j    122 7A 172 z
1011      11  B  13 VT    27 1B  33 ESC   43 2B  53 +     59 3B  73 ;     75 4B 113 K     91 5B 133 [    107 6B 153 k    123 7B 173 {
1100      12  C  14 FF    28 1C  34 FS    44 2C  54 ,     60 3C  74 <     76 4C 114 L     92 5C 134 \    108 6C 154 l    124 7C 174 |
1101      13  D  15 CR    29 1D  35 GS    45 2D  55 -     61 3D  75 =     77 4D 115 M     93 5D 135 ]    109 6D 155 m    125 7D 175 }
1110      14  E  16 SO    30 1E  36 RS    46 2E  56 .     62 3E  76 >     78 4E 116 N     94 5E 136 ^    110 6E 156 n    126 7E 176 ~
1111      15  F  17 SI    31 1F  37 US    47 2F  57 /     63 3F  77 ?     79 4F 117 O     95 5F 137 _    111 6F 157 o    127 7F 177 DEL

Table 1: The ASCII table

The international variants were standardized as ISO 646-DE (German), 646-DK (Danish), et cetera. Originally, the dollar sign could still be replaced by the currency symbol, but after a 1991 revision the dollar is now the only possibility.

The different code pages were ultimately standardized as ISO 8859, with such popular code pages as 8859-1 (`Latin 1') for western European:

[figure: ISO 8859-1 code chart]

8859-2 for eastern European,

[figure: ISO 8859-2 code chart]

and 8859-5 for Cyrillic:

[figure: ISO 8859-5 code chart]

These ISO standards explicitly left the first 32 extended positions undefined.

Reading material: The history of ASCII out of telegraph codes [1]; a history, paying attention to multilingual use [4]; Bob Bemer, the `father of ASCII' [2]; a detailed discussion of ISO 8859, Latin-1 [11].

2 Character sets and encodings

As you can tell from the introduction, there is quite a bit of confusion possible between characters and representations or encodings. Let us clear up the concepts a little.

Informally, the term `character set' (also `character code' or `code') used to mean something like `a table of bytes, each with a character shape'. With only the English alphabet to deal with that is a good enough definition. These days, much more general cases are handled, mapping one octet into several characters, or several octets into one character. The definition has changed accordingly:

    A charset is a method of converting a sequence of octets into a sequence of characters. This conversion may also optionally produce additional control information such as direc-

[...]

There used to be a drive towards unambiguous abstract character names across repertoires and encodings, but Unicode ended this, as it provides (or aims to provide) more or less a complete list of every character on earth.

CEF Character Encoding Form: a mapping from a set of non-negative integers that are elements of a CCS to a set of sequences of particular code units. A `code unit' is an integer of a specific binary width, for instance 8 or 16 bits. A CEF then maps the code points of a coded character set into sequences of code units, and these sequences can be of different lengths inside one code page. For instance ASCII uses a single 7-bit unit; UTF-8 uses one to four 8-bit units. We will discuss the UTF encodings below.

CES Character Encoding Scheme: a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes. In single-byte cases such as ASCII and UTF-8 this mapping is trivial. With the two-byte scheme UCS-2 there is a single `byte order mark', after which the code units are trivially mapped to bytes. On the other hand, ISO 2022, which uses escape sequences to switch between different encodings, is a complicated CES.

Additionally, there are the concepts of

CM Character Map: a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation.
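The difference between an encoding form (code points to code units) and an encoding scheme (code units to bytes) can be made concrete with a small Python sketch. This is only an illustration, not part of the original article: the helper names `utf8_code_units`, `utf16_code_units`, and `serialize_16bit` are hypothetical, and UTF-16 stands in for a two-byte scheme (for characters below U+10000 its code units coincide with those of UCS-2).

```python
def utf8_code_units(ch):
    """Encoding form: the 8-bit code units UTF-8 assigns to one character."""
    return list(ch.encode("utf-8"))


def utf16_code_units(ch):
    """Encoding form: the 16-bit code units UTF-16 assigns to one character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def serialize_16bit(units, byteorder):
    """Encoding scheme: write 16-bit code units as bytes in a chosen byte order.

    byteorder is "big" or "little"; with 8-bit units this step would be trivial.
    """
    return b"".join(u.to_bytes(2, byteorder) for u in units)


if __name__ == "__main__":
    # 'A', e-acute, euro sign, and a character outside the 16-bit range
    for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
        print("U+%04X" % ord(ch),
              [hex(u) for u in utf8_code_units(ch)],
              [hex(u) for u in utf16_code_units(ch)])

    # The same code units give different byte sequences depending on byte
    # order; a leading byte order mark (U+FEFF) tells the reader which one.
    units = [0xFEFF] + utf16_code_units("A")
    print(serialize_16bit(units, "big"))
    print(serialize_16bit(units, "little"))
```

The first loop shows sequences of different lengths coming out of one encoding form (one to four units for UTF-8, one or two for UTF-16), while the last two lines show why a scheme with multi-byte units needs a byte order convention at all.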
