Other Encodings

11/13/2019
ASCII, Unicode, BCD and EBCDIC

ASCII
• American Standard Code for Information Interchange
• Representation of printable (and related) characters as bit patterns
• Basic ASCII is a 7-bit code
  - 8th bit often used for parity (primitive error checking)
• Extended ASCII: ISO8859-1, CP437, etc.
  - extensions of 7-bit ASCII to 8 bits
  - include graphics symbols, European characters
  - not consistent with one another
  - all include 7-bit ASCII as the first 128 characters

ASCII tables - see montcs.bloomu.edu/Information/Encoding/ASCII-EBCDIC.html

Range              Hexadecimal range   Usage                                   Examples
First 32 values    0x00 - 0x1f         Control characters                      Ctrl-D, '\n', Escape
Second 32 values   0x20 - 0x3f         Punctuation, digits                     (, ), 0..9, =
Third 32 values    0x40 - 0x5f         Uppercase letters                       A..Z, [, ], @
Fourth 32 values   0x60 - 0x7f         Lowercase letters                       a..z, {, }, ~
Last 128 values    0x80 - 0xff         Extended ASCII: various non-English     ¶, ü, ┝, ┤
                                       characters, "ASCII graphics"

[Figure: ISO8859-1, a.k.a. Latin-1]
[Figure: CodePage 437, the IBM Character Set]

Unicode family
• Multi-byte successor to ASCII
  - Represents alphabets by "code points"
  - UTF-8, other encodings of code points
    » 1 byte for ASCII codes
    » expands to 2, 3, or 4 bytes for other character sets
• Support for many languages and scripts
  - Greek
  - Cyrillic
  - Arabic
  - Mandarin
  - Sanskrit
  - Kanji

partial Unicode code table - see montcs.bloomu.edu/Information/Encodings/unicode.html

Unicode and Emojis
• Different representations for code points
• More than 1700 emojis currently defined

UTF-8 Encodings
• Unicode currently defines code points U+0000 through U+10ffff
  - somewhat over 1 million code points in 17 planes
• UTF-8 uses up to four bytes to represent these code points
  - 5-, 6-byte encodings unneeded

Unicode and UTF-8 - A Few Example Alphabets

Character Set         Code Points          First Byte      Second Byte?    Third Byte?     Fourth Byte?
Basic Latin (ASCII)   U+0000 - U+007f      0x00 - 0x7f
Latin-1               U+0080 - U+00ff      0xc2 - 0xc3     0x80 - 0xbf
Latin Extended-A      U+0100 - U+017f      0xc4 - 0xc5     0x80 - 0xbf
Latin Extended-B      U+0180 - U+024f      0xc6 - 0xc9     0x80 - 0x8f
…                     U+0250 - U+036f      0xc9 - 0xcd     0x90 - 0xaf
Greek and Coptic      U+0370 - U+03ff      0xcd - 0xcf     0xb0 - 0xbf
…                     U+0400 - U+07ff      0xd0 - 0xdf     0x80 - 0xbf
Samaritan             U+0800 - U+083f      0xe0            0xa0            0x80 - 0xbf
…                     U+0840 - U+10ffff    0xe0 - …        0xa1 - …        0x80 - …        ???

Each row gives the first and last encoded byte sequence of its block: the low bound of every column belongs to the first code point and the high bound to the last. Latin-1 begins at 0xc2 rather than 0xc0 because 0xc0 and 0xc1 never occur in valid UTF-8.

BCD
Binary Coded Decimal
• BCD - scheme for encoding 10 decimal digits
  - Bit patterns 1010 - 1111 unused
• Two BCD digits per byte
  - examples:
    13 = 0001 0011
    64 = 0110 0100
  - 00-99 range is less than unsigned binary range of 0-255
• Hardware support more complicated
  - Addition, subtraction are full of "special cases" requiring additional circuitry

Bit Pattern   BCD   Unsigned Binary
0000          0     0
0001          1     1
0010          2     2
0011          3     3
0100          4     4
0101          5     5
0110          6     6
0111          7     7
1000          8     8
1001          9     9
1010          -     10
1011          -     11
1100          -     12
1101          -     13
1110          -     14
1111          -     15

[Figure: Using BCD numbers: a 1979 HP-9845 interfaces with a 1967 HP voltmeter]

EBCDIC
• Extended Binary Coded Decimal Interchange Code
  - BCD is embedded within it
• Alternative to ASCII
  - 8 bits
• Created for IBM mainframes
  - suited to 80-column punched cards
  - support for business applications
• Country-specific versions were not mutually consistent
• Little used today

EBCDIC table - see montcs.bloomu.edu/Information/Encoding/ASCII-EBCDIC.html

[Figure: a punched card showing EBCDIC]
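Worked example: the UTF-8 byte ranges and the two-digits-per-byte BCD packing from the slides above can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names utf8_encode, pack_bcd and unpack_bcd are hypothetical and do not come from the slides, and in real code Python's built-in str.encode("utf-8") already does the UTF-8 work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def pack_bcd(value: int) -> int:
    """Pack a number in 0..99 into one byte, one decimal digit per 4-bit nibble."""
    if not 0 <= value <= 99:
        raise ValueError("one packed BCD byte holds only 00..99")
    tens, ones = divmod(value, 10)
    return (tens << 4) | ones


def unpack_bcd(byte: int) -> int:
    """Recover the decimal value from one packed BCD byte."""
    tens, ones = byte >> 4, byte & 0x0F
    if tens > 9 or ones > 9:
        raise ValueError("bit patterns 1010..1111 are unused in BCD")
    return tens * 10 + ones


if __name__ == "__main__":
    # First code point of each UTF-8 length class, plus the last code point.
    for cp in (0x41, 0x80, 0x800, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
    # The two BCD examples from the slide, shown as bit patterns.
    for n in (13, 64):
        print(f"{n} -> {pack_bcd(n):08b}")
```

Writing the branches by code point size makes the table's boundaries visible: U+0080 is the first code point that needs two bytes and U+0800 the first that needs three. The BCD helpers show why a byte holds only 00-99: each nibble is limited to the ten patterns 0000 through 1001.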
