06-11337 Introduction to Computer Science The University of Birmingham Autumn Semester 2002 School of Computer Science

October 21, 2002 - Achim Jung and Uday Reddy Handout 4: Character representation

1. History. The representation of characters in binary form has a much longer history than that of numbers. It came up in connection with telegraphs, which were invented around 1800. A well-known example of such a representation is the Morse Alphabet from 1867. Computers, on the other hand, were primarily developed for numerical calculations, and that is why the early machines offered only integer and floating point arithmetic (some offered only one of the two). The observation that a computer can also process non-numerical data appeared only in the 1950's. This is the reason why many concepts in the study of character representation refer to data transmission rather than data processing. Also, many developments took place in companies such as AT&T.

2. ASCII. Most computers in the English speaking world today operate with a 7-bit representation, known as ASCII, or American Standard Code for Information Interchange. It has been standardised by the American National Standards Institute. See below for a table. If you study the character descriptions then besides the usual everyday characters you find things like End Of Transmission and Negative Acknowledgement, which point to the origins of the code in data transmission. In data processing some of these special symbols have assumed a different or second interpretation. I do not expect you to memorise this table. What I want you to remember is that the letters A–Z, a–z, and the digits 0–9 form contiguous segments, separated from each other by other symbols. It is also useful to know that UNIX separates lines by the single character 'ˆJ', whereas PCs use the character combination 'ˆMˆJ'.
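The contiguity of these segments is easy to verify from within a program. The following Java sketch (the class name is just for illustration) prints the numeric codes of a few characters:

```java
public class AsciiSegments {
    public static void main(String[] args) {
        // Casting a char to int reveals its numeric code.
        System.out.println((int) 'A');   // 65
        System.out.println((int) 'Z');   // 90 = 65 + 25, so A-Z is contiguous
        System.out.println((int) 'a');   // 97
        System.out.println((int) '0');   // 48
        // Because '0'..'9' are contiguous, a digit's value is a subtraction:
        char digit = '7';
        System.out.println(digit - '0'); // 7
    }
}
```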

3. ASCII and the keyboard. Only for some characters is it obvious how to enter them into a text using the keyboard. The 32 characters from octal 00 to octal 37 are called control characters. They are generated by holding down the control key and pressing the corresponding character key. For many programs they will have a special meaning, hence the name "control characters". Other characters in the table are not accessible through the keyboard directly. Emacs allows you to enter any ASCII character in a file, by first typing 'ˆQ', then the octal code (leading zero not necessary), and finishing with 'return'. Give it a try!

4. Latin-1. The basic unit of computer memory is the byte, which consists of 8 bits. For representing an ASCII character we only need 7 bits, and hence the highest bit is always 0 in an ASCII character representation. A number of possibilities for the remaining 128 bit-patterns have been standardised by ISO (the "International Organization for Standardization"). The one you will find on our UNIX-machines is called Latin-1. It contains numerous special characters from various European languages, such as 'Ü', 'â', etc. Go to

http://czyborra.com/charsets/iso8859.html

for a table. Our local UNIX systems understand Latin-1 codes. (For Emacs to display Latin-1 characters you may need to include the line (standard-display-european 1) in your ".emacs" file.) PCs also use ASCII but the extension to eight bits is different from Latin-1.

5. Unicode. The 256 different patterns that we can store in a byte are clearly not enough to cover the world's different alphabets. Various efforts to define a two-byte code have been combined since 1991 in the Unicode consortium (http://www.unicode.org). Work is ongoing but a large number of character systems have been codified already. See http://charts.unicode.org for a list of tables. Observe that the patterns from 0x0000 to 0x00FF are identical with ASCII and Latin-1.

6. ASCII, Unicode and Java. Java is one of the first languages which fully embraces Unicode. Indeed, in the first stage of compilation, every input character of your program is translated into the corresponding Unicode value. This allows programmers from other cultures to use identifier names built up from characters of their own languages. However, the keywords of the language, such as class, int, etc., are fixed. A declaration of the form

char c;

reserves two bytes of memory. In a char-cell we can store any of the 65,536 different Unicode characters (actually, not all these possibilities are yet defined). Unicode is a nice idea but we are still surrounded by ASCII and Latin-1 in this part of the world. For example, keyboards can't possibly try to cover all of this code. So while entering an ASCII character into a char-cell is easily effected by a statement like

c = ’A’;

we have to have a "Unicode to ASCII" translation for other values. The basic idea for such a translation is to use escape sequences, where a special symbol (which in Java is always '\') indicates that a certain number of following characters are describing some other character rather than denoting themselves. The following table lists the traditional escape sequences

which are understood by C, C++, UNIX, and, of course, Java.

esc seq   description       esc seq   description
\n        newline           \r        return
\t        tab               \v        vertical tab
\'        quote             \"        double quote
\\        backslash         \b        backspace
\f        formfeed          \a        bell
\?        question mark     \0        null character
\ooo      octal byte        \xhh      hexadecimal byte

(Note how we recover the '\' itself!) Java adds to these the Unicode escapes of the format "\uhhhh", where each h stands for a hexadecimal digit. Unicode escapes can appear anywhere in a program, not just within the definition of a character or a string; try it!

7. Computing with characters. A little bit of arithmetic is permitted on char-values. This is useful for text processing. For example, if the char-cell c holds the character 'a', then c-32 will represent 'A'. The same is true for every other lowercase letter of the ASCII alphabet and its Latin-1 extension. Thus we can use arithmetic to switch from lowercase to uppercase letters.

8. Text compression. Universal standardised codes are extremely important for the smooth cooperation of computers across networks and for the portability of programs. In the remainder of these notes, however, I want to look at a slightly different scenario.
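The escape sequences and the character arithmetic of point 7 can be tried out in a few lines of Java (a minimal sketch; the class name is illustrative):

```java
public class CaseSwitch {
    public static void main(String[] args) {
        char c = 'a';
        // 'a' is 32 above 'A' in ASCII, so subtracting 32 gives the uppercase letter.
        // The arithmetic yields an int, hence the cast back to char.
        char upper = (char) (c - 32);
        System.out.println(upper);             // A
        // Escape sequences and Unicode escapes denote characters too:
        char newline = '\n';                   // the UNIX line separator ^J
        char capitalA = '\u0041';              // Unicode escape for 'A'
        System.out.println(capitalA == upper); // true
    }
}
```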

Suppose we are asked to develop a code for a certain source of information. We assume that we know the alphabet A of symbols that the source is using. Now, in such a situation it may well be that the set A has fewer than 256 elements and so the use of 8 bits for the representation of each symbol could be wasteful. As an example, consider English text. It will consist of 52 upper- and lowercase characters plus some 10 further symbols. In this situation a code of 6 bits per character suffices.

9. The source code theorem. Suppose now that we also know the relative frequency with which we will encounter each character from the information source. We can then do even better by adapting the length of the code to the frequency of the character occurring. An example of such a strategy is the Morse Alphabet. One might wonder whether a particular code is optimal for the given application. Surprisingly, there is a precise answer to this question. It was

developed by Claude Shannon in 1948.

Suppose the relative frequency with which the symbol a ∈ A appears is p_a. (So the sum of all p_a will be 1.) Shannon calls the value

    I(a) = log_2 (1 / p_a)

the information content of the symbol a. He then defines

    H = sum over a ∈ A of p_a · I(a)

which he calls the entropy of the data source. This is his source code theorem:

There is a coding of the symbols in A in a binary alphabet such that the average number of bits per symbol is

arbitrarily close to the entropy of the source. No coding can have a bit rate smaller than the entropy.

The average number of bits is calculated almost like the entropy. Suppose the coding of symbol a ∈ A takes l(a) many bits. Then the average length is

    L = sum over a ∈ A of p_a · l(a)
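Both formulas are easy to evaluate mechanically. The following Java sketch computes the entropy and the average code length for a small four-symbol source (the frequencies and code lengths are those of the Huffman example in point 10 below):

```java
public class Entropy {
    public static void main(String[] args) {
        double[] p   = { .5625, .1875, .1875, .0625 }; // relative frequencies p_a
        int[]    len = { 1, 2, 3, 3 };                 // code lengths l(a) for codes 0, 10, 110, 111

        double entropy = 0, average = 0;
        for (int i = 0; i < p.length; i++) {
            // I(a) = log_2(1/p_a); Java's Math has no log_2, so divide by log 2
            double info = Math.log(1 / p[i]) / Math.log(2);
            entropy += p[i] * info;       // H = sum of p_a * I(a)
            average += p[i] * len[i];     // L = sum of p_a * l(a)
        }
        System.out.println(entropy);      // about 1.62
        System.out.println(average);      // 1.6875
    }
}
```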

10. Huffman code. Four years after Shannon published his theorem, David Albert Huffman came up with a very simple practical code which is optimal among all codes which code every character separately. His method is best explained by an example:

symbol   frequency   code
a        .5625       0
b        .1875       10
c        .1875       110
d        .0625       111

In a first stage one constructs a binary tree by repeatedly grouping together the two entries with smallest frequency (assigning the sum to the new root), until only one node remains (which must carry the frequency 1). In this example, c and d are grouped first (sum .25), then b joins them (sum .4375), and finally a (sum 1.0). Each path in this tree is labelled by assigning a left branch with 1 and a right branch with 0. The code for each symbol is obtained by collecting all labels along the path from the root to the symbol.

In this particular example, one gets an average bit rate of 1.6875 bits per symbol, which is a lot better than the naive code with 2 bits per character. The entropy of the source is about 1.62, so it would be hard (and probably not worthwhile) to improve on this coding. It is worthwhile to note that every sequence of 0s and 1s can be seen as a sequence of Huffman coded characters (except that at the end we may have some digits left over).

11. Improving on the source code theorem. The second part in Shannon's theorem assumes that there is no correlation between successive characters in the stream of information produced by the source. This is not true in English texts, for example, where certain letter combinations are much more likely than others. There are refined statements and codes for such situations.
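The tree construction described in point 10 can be sketched in Java with a priority queue (class and method names are illustrative; when two frequencies tie, the order of merging, and hence the exact codes, may differ from the table above, but the set of code lengths and the average bit rate come out the same):

```java
import java.util.*;

public class Huffman {
    // A node of the Huffman tree: either a leaf carrying a symbol,
    // or an internal node whose frequency is the sum of its children's.
    static class Node implements Comparable<Node> {
        final double freq;
        final Character symbol;  // null for internal nodes
        final Node left, right;
        Node(double f, Character s) { freq = f; symbol = s; left = null; right = null; }
        Node(Node l, Node r) { freq = l.freq + r.freq; symbol = null; left = l; right = r; }
        public int compareTo(Node o) { return Double.compare(freq, o.freq); }
    }

    // Repeatedly merge the two nodes of smallest frequency until one remains.
    static Node build(Map<Character, Double> freqs) {
        PriorityQueue<Node> q = new PriorityQueue<>();
        for (Map.Entry<Character, Double> e : freqs.entrySet())
            q.add(new Node(e.getValue(), e.getKey()));
        while (q.size() > 1)
            q.add(new Node(q.poll(), q.poll()));
        return q.poll();
    }

    // Collect each symbol's code along the path from the root:
    // left branches are labelled 1, right branches 0, as in the handout.
    static void codes(Node n, String prefix, Map<Character, String> out) {
        if (n.symbol != null) {
            out.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // single-symbol edge case
            return;
        }
        codes(n.left, prefix + "1", out);
        codes(n.right, prefix + "0", out);
    }

    public static void main(String[] args) {
        Map<Character, Double> freqs = new HashMap<>();
        freqs.put('a', .5625); freqs.put('b', .1875);
        freqs.put('c', .1875); freqs.put('d', .0625);
        Map<Character, String> code = new TreeMap<>();
        codes(build(freqs), "", code);
        double average = 0;
        for (Map.Entry<Character, String> e : code.entrySet())
            average += freqs.get(e.getKey()) * e.getValue().length();
        System.out.println(code + "  average bits: " + average);
    }
}
```

Running it on the four-symbol source of point 10 reproduces an average of 1.6875 bits per symbol.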

Oct  Dec  Hex   Char  Description                      Oct   Dec  Hex   Char  Description
000    0  0x00        NUL, Null character              0100   64  0x40  @     commercial at
001    1  0x01  ˆA    SOH, Start Of Header             0101   65  0x41  A
002    2  0x02  ˆB    STX, Start Of Text               0102   66  0x42  B
003    3  0x03  ˆC    ETX, End Of Text                 0103   67  0x43  C
004    4  0x04  ˆD    EOT, End Of Transmission         0104   68  0x44  D
005    5  0x05  ˆE    ENQ, ENQuire                     0105   69  0x45  E
006    6  0x06  ˆF    ACK, ACKnowledge                 0106   70  0x46  F
007    7  0x07  ˆG    BEL, Bell                        0107   71  0x47  G
010    8  0x08  ˆH    BS, BackSpace                    0110   72  0x48  H
011    9  0x09  ˆI    HT, Horizontal Tab               0111   73  0x49  I
012   10  0x0A  ˆJ    LF, Line Feed, newline           0112   74  0x4A  J
013   11  0x0B  ˆK    VT, Vertical Tab                 0113   75  0x4B  K
014   12  0x0C  ˆL    FF, Form Feed                    0114   76  0x4C  L
015   13  0x0D  ˆM    CR, Carriage Return              0115   77  0x4D  M
016   14  0x0E  ˆN    SO, Shift Out                    0116   78  0x4E  N
017   15  0x0F  ˆO    SI, Shift In                     0117   79  0x4F  O
020   16  0x10  ˆP    DLE, Data Link Escape            0120   80  0x50  P
021   17  0x11  ˆQ    DC1, XON, Device Control 1       0121   81  0x51  Q
022   18  0x12  ˆR    DC2, Device Control 2            0122   82  0x52  R
023   19  0x13  ˆS    DC3, XOFF, Device Control 3      0123   83  0x53  S
024   20  0x14  ˆT    DC4, Device Control 4            0124   84  0x54  T
025   21  0x15  ˆU    NAK, Negative AcKnowledgement    0125   85  0x55  U
026   22  0x16  ˆV    SYN, SYNchronous idle            0126   86  0x56  V
027   23  0x17  ˆW    ETB, End Transmission Block      0127   87  0x57  W
030   24  0x18  ˆX    CAN, CANcel                      0130   88  0x58  X
031   25  0x19  ˆY    EM, End of Medium                0131   89  0x59  Y
032   26  0x1A  ˆZ    SUB, SUBstitute                  0132   90  0x5A  Z
033   27  0x1B        ESC, ESCape                      0133   91  0x5B  [     open square bracket
034   28  0x1C        FS, File Separator               0134   92  0x5C  \     backslash
035   29  0x1D        GS, Group Separator              0135   93  0x5D  ]     close square bracket
036   30  0x1E        RS, Record Separator             0136   94  0x5E  ˆ     caret
037   31  0x1F        US, Unit Separator               0137   95  0x5F  _     underscore
040   32  0x20        space                            0140   96  0x60  `     back quote
041   33  0x21  !     exclamation mark                 0141   97  0x61  a
042   34  0x22  "     double quote                     0142   98  0x62  b
043   35  0x23  #     hash                             0143   99  0x63  c
044   36  0x24  $     dollar                           0144  100  0x64  d
045   37  0x25  %     percent                          0145  101  0x65  e
046   38  0x26  &     ampersand                        0146  102  0x66  f
047   39  0x27  '     quote                            0147  103  0x67  g
050   40  0x28  (     open parenthesis                 0150  104  0x68  h
051   41  0x29  )     close parenthesis                0151  105  0x69  i
052   42  0x2A  *     asterisk                         0152  106  0x6A  j
053   43  0x2B  +     plus                             0153  107  0x6B  k
054   44  0x2C  ,     comma                            0154  108  0x6C  l
055   45  0x2D  -     minus                            0155  109  0x6D  m
056   46  0x2E  .     full stop                        0156  110  0x6E  n
057   47  0x2F  /     oblique stroke                   0157  111  0x6F  o
060   48  0x30  0                                      0160  112  0x70  p
061   49  0x31  1                                      0161  113  0x71  q
062   50  0x32  2                                      0162  114  0x72  r
063   51  0x33  3                                      0163  115  0x73  s
064   52  0x34  4                                      0164  116  0x74  t
065   53  0x35  5                                      0165  117  0x75  u
066   54  0x36  6                                      0166  118  0x76  v
067   55  0x37  7                                      0167  119  0x77  w
070   56  0x38  8                                      0170  120  0x78  x
071   57  0x39  9                                      0171  121  0x79  y
072   58  0x3A  :     colon                            0172  122  0x7A  z
073   59  0x3B  ;     semicolon                        0173  123  0x7B  {     open curly bracket
074   60  0x3C  <     less than                        0174  124  0x7C  |     vertical bar
075   61  0x3D  =     equals                           0175  125  0x7D  }     close curly bracket
076   62  0x3E  >     greater than                     0176  126  0x7E  ~     tilde
077   63  0x3F  ?     question mark                    0177  127  0x7F        DEL, DELete