11/13/2019
Other Encodings
ASCII, Unicode, BCD and EBCDIC
ASCII
• American Standard Code for Information Interchange
• Representation of printable (and related) characters as bit patterns
• Basic ASCII is a 7-bit code
  - the 8th bit was often used for parity (primitive error checking)
• Extended ASCII: ISO 8859-1, CP437, etc.
  - extensions of 7-bit ASCII to 8 bits
  - include graphics symbols, European characters
  - not consistent with one another
  - all include 7-bit ASCII as the first 128 characters
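The parity idea can be sketched in a few lines of Python. This is a minimal illustration assuming even parity with the parity bit stored in bit 7; real links could use odd parity instead, and the function names `add_even_parity` and `parity_ok` are made up for this example.

```python
def add_even_parity(code: int) -> int:
    """Return an 8-bit value: a 7-bit ASCII code plus an even-parity bit in bit 7."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity = ones % 2            # 1 when the 7-bit code has an odd number of 1 bits
    return (parity << 7) | code


def parity_ok(byte: int) -> bool:
    """True when the whole byte contains an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


a = add_even_parity(ord("A"))    # 'A' = 0x41 = 100 0001, two 1 bits
c = add_even_parity(ord("C"))    # 'C' = 0x43 = 100 0011, three 1 bits
print(f"{a:#04x} {c:#04x}")      # 0x41 0xc3

# Flipping any single bit is detected; flipping two bits would not be.
print(parity_ok(c), parity_ok(c ^ 0x04))   # True False
```

A single parity bit detects any odd number of flipped bits but cannot say which bit changed, which is why the slide calls it primitive.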
ASCII tables – see montcs.bloomu.edu/Information/Encoding/ASCII-EBCDIC.html
Range | Hexadecimal range | Usage | Examples
First 32 values | 0x00 - 0x1f | Control characters | Ctrl-D, ‘\n’, Escape
Second 32 values | 0x20 - 0x3f | Punctuation, digits | (, ), 0..9, =
Third 32 values | 0x40 - 0x5f | Uppercase letters | A..Z, [, ], @
Fourth 32 values | 0x60 - 0x7f | Lowercase letters | a..z, {, }, ~
Last 128 values | 0x80 - 0xff | Extended ASCII: various non-English characters, “ASCII graphics” | ¶, ü, ┝, ┤
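Because each of the first four groups holds exactly 32 values, the group of a 7-bit code can be read from its top two bits. The sketch below assumes the coarse labels of the table above; the helper name `ascii_group` is hypothetical, and the labels are approximate (for example, 0x7f is the DEL control character even though it falls in the fourth group).

```python
GROUPS = [
    "control characters",     # 0x00 - 0x1f
    "punctuation, digits",    # 0x20 - 0x3f
    "uppercase letters",      # 0x40 - 0x5f
    "lowercase letters",      # 0x60 - 0x7f
]


def ascii_group(byte: int) -> str:
    """Name the table row that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte >= 0x80:
        return "extended (depends on the code page)"
    return GROUPS[byte >> 5]  # top two bits of the 7-bit code select the group


for ch in ["\n", "7", "A", "z"]:
    print(f"{ord(ch):#04x} {ascii_group(ord(ch))}")

# Upper- and lowercase letters sit 0x20 apart, so they differ in a single bit.
print(hex(ord("a") ^ ord("A")))   # 0x20
```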
CodePage 437 – the IBM Character Set
Unicode family
• Multi-byte successor to ASCII
  - Represents the characters of each alphabet by "code points"
  - UTF-8, other encodings of code points
    » 1 byte for ASCII codes
    » expands to 2, 3, or 4 bytes for other character sets
• Support for many languages
  - Greek
  - Cyrillic
  - Arabic
  - Mandarin
  - Sanskrit
  - Kanji
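To see how the UTF-8 length grows with the character set, Python's built-in `str.encode` can be applied to one sample character from several scripts. The sample characters are chosen here only for illustration and are not from the slides.

```python
samples = {
    "Latin A": "A",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Arabic alef": "\u0627",
    "Devanagari ka": "\u0915",
    "CJK ideograph": "\u4e2d",
}

for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"{name:14} U+{ord(ch):04X}  {len(encoded)} byte(s): {hex_bytes}")
```

The ASCII letter takes one byte, the Greek, Cyrillic and Arabic letters take two, and the Devanagari and CJK characters take three.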
partial Unicode code table – see montcs.bloomu.edu/Information/Encodings/unicode.html
Unicode and Emojis
• Different representations for code points
• More than 1700 emojis currently defined
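Emojis are ordinary code points that lie above U+ffff. As a small sketch (the choice of U+1F600, GRINNING FACE, is just an example), the same code point can be written out in the three common Unicode encoding forms:

```python
emoji = "\U0001F600"   # GRINNING FACE
print(f"code point: U+{ord(emoji):X}")

for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    data = emoji.encode(codec)
    print(f"{codec:10} {len(data)} bytes: {data.hex()}")
```

UTF-8 needs four bytes (f0 9f 98 80), UTF-16 needs a surrogate pair (d83d de00), and UTF-32 stores the code point directly (0001f600).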
UTF-8 Encodings
• Unicode currently defines code points U+0000 through U+10ffff
  - somewhat over 1 million code points (1,114,112) in 17 planes of 65,536 each
• UTF-8 uses up to four bytes to represent these code points
  - 5- and 6-byte encodings are unneeded
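The byte count follows from how many bits the code point needs. A hand-written encoder makes the bit layout visible; this is a teaching sketch (the name `utf8_encode` is made up here), and in practice Python's own `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by explicit bit manipulation."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against the built-in encoder at the boundaries of each length.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

Four bytes carry 21 payload bits, which is enough for every value up to U+10ffff, so longer sequences are never required.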
Unicode and UTF-8 – A Few Example Alphabets
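Character Set | Code Points | First Byte | Second Byte? | Third Byte? | Fourth Byte?
Basic Latin (ASCII) | U+0000 – U+007f | 0x00 – 0x7f | | |
Latin-1 | U+0080 – U+00ff | 0xc2 – 0xc3 | 0x80 – 0xbf | |
Latin Extended-A | U+0100 – U+017f | 0xc4 – 0xc5 | 0x80 – 0xbf | |
Latin Extended-B | U+0180 – U+024f | 0xc6 – 0xc9 | 0x80 – 0x8f | |
… | U+0250 – U+036f | 0xc9 – 0xcd | 0x90 – 0xaf | |
Greek and Coptic | U+0370 – U+03ff | 0xcd – 0xcf | 0xb0 – 0xbf | |
… | U+0400 – U+07ff | 0xd0 – 0xdf | 0x80 – 0xbf | |
Samaritan | U+0800 – U+083f | 0xe0 | 0xa0 | 0x80 – 0xbf |
… | U+0840 – U+10ffff | 0xe0 – … | 0xa1 – … | 0x80 – … | ???
Each byte column gives the byte of the first and of the last code point in the row. The first bytes 0xc0 and 0xc1 never occur in valid UTF-8, so Latin-1 starts at 0xc2.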
Binary Coded Decimal
• BCD – scheme for encoding 10 decimal digits
  - Bit patterns 1010 – 1111 unused
• Two BCD digits per byte
  - examples: 13 = 0001 0011, 64 = 0110 0100
  - 00-99 range is less than unsigned binary range of 0-255
• Hardware support more complicated
  - Addition, subtraction are full of “special cases” requiring additional circuitry
Bit Pattern | BCD | Unsigned Binary
0000 | 0 | 0
0001 | 1 | 1
0010 | 2 | 2
0011 | 3 | 3
0100 | 4 | 4
0101 | 5 | 5
0110 | 6 | 6
0111 | 7 | 7
1000 | 8 | 8
1001 | 9 | 9
1010 | - | 10
1011 | - | 11
1100 | - | 12
1101 | - | 13
1110 | - | 14
1111 | - | 15
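Packing two BCD digits into a byte, and the "special case" that addition needs, can be modelled in Python. This is a software sketch of the usual add-6 correction, not a description of any particular circuit; all function names are hypothetical.

```python
def to_packed_bcd(n: int) -> int:
    """Pack a value 0..99 into one byte, tens digit in the high nibble."""
    if not 0 <= n <= 99:
        raise ValueError("one packed BCD byte holds 00-99 only")
    tens, ones = divmod(n, 10)
    return (tens << 4) | ones


def from_packed_bcd(b: int) -> int:
    """Unpack one BCD byte, rejecting the unused patterns 1010-1111."""
    hi, lo = b >> 4, b & 0x0F
    if hi > 9 or lo > 9:
        raise ValueError("nibble is not a BCD digit")
    return hi * 10 + lo


def _add_digit(x: int, y: int, carry_in: int):
    """Add two BCD digits; return (result digit, carry out)."""
    s = x + y + carry_in
    if s > 9:
        s += 6               # skip over the six unused bit patterns
    return s & 0x0F, s >> 4


def bcd_add(a: int, b: int):
    """Add two packed BCD bytes; return (packed BCD sum, carry out)."""
    lo, carry = _add_digit(a & 0x0F, b & 0x0F, 0)
    hi, carry = _add_digit(a >> 4, b >> 4, carry)
    return (hi << 4) | lo, carry


print(f"{to_packed_bcd(13):08b} {to_packed_bcd(64):08b}")   # 00010011 01100100
print(bcd_add(0x13, 0x64))   # (119, 0): 119 is 0x77, BCD for 77
print(bcd_add(0x58, 0x64))   # (34, 1): 0x22 with carry, i.e. 122
```

For 58 + 64, plain binary addition of the two bytes gives 0xbc, which is not valid BCD; the per-digit correction is the extra work the hardware has to do.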
Using BCD numbers: a 1979 HP-9845 interfaces with a 1967 HP voltmeter
EBCDIC
• Extended Binary Coded Decimal Interchange Code
  - BCD is embedded within it
• Alternative to ASCII
  - 8 bits
• Created for IBM mainframes
  - suited to 80-column punched cards
  - support for business applications
• Country-specific versions were not mutually consistent
• Little used today
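Python still ships several EBCDIC codecs, which makes these points easy to check. The sketch below assumes the `cp037` (US/Canada) and `cp500` (international) code pages as two representative variants; the slides do not name specific code pages.

```python
# Digits 0-9 occupy 0xf0-0xf9, so the low nibble of each is its BCD value.
digits = "0123456789".encode("cp037")
print(digits.hex())
print([b & 0x0F for b in digits])

# Unlike ASCII, the letters are not contiguous: there is a gap between I and J.
i_code = "I".encode("cp037")[0]
j_code = "J".encode("cp037")[0]
print(hex(i_code), hex(j_code))   # 0xc9 0xd1

# The same character can land on different bytes in different variants.
for codec in ("cp037", "cp500"):
    print(codec, "[".encode(codec).hex())
```

Letters and digits agree between the two code pages, but punctuation such as the square bracket does not, which is one concrete form of the inconsistency noted above.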
EBCDIC table – see montcs.bloomu.edu/Information/Encoding/ASCII-EBCDIC.html
a punched card showing EBCDIC