06-11337 Introduction to Computer Science The University of Birmingham Autumn Semester 2002 School of Computer Science

October 21, 2002 - Achim Jung and Uday Reddy Handout 4: Character representation

1. History. The representation of characters in binary form has a much longer history than that of numbers. It came up in connection with telegraphs, which were invented around 1800. A well-known example of such a representation is the Morse Alphabet from 1867. Computers, on the other hand, were primarily developed for numerical calculations, and that is why the early machines offered only integer and floating point arithmetic (some offered only one of the two). The observation that a computer can also process non-numerical data appeared only in the 1950's. This is the reason why many concepts in the study of character representation refer to data transmission rather than data processing. Also, many developments took place in companies such as AT&T.

2. ASCII. Most computers in the English speaking world today operate with a 7-bit representation, known as ASCII, or American Standard Code for Information Interchange. It has been standardised by the American National Standards Institute. See below for a table. If you study the character descriptions then besides the usual everyday characters you find things like End Of Transmission and Negative Acknowledgement, which point to the origins of the code in data transmission. In data processing some of these special symbols have assumed a different or second interpretation. I do not expect you to memorise this table. What I want you to remember is that the letters A–Z, a–z, and the digits 0–9 form contiguous segments, separated from each other by other symbols. It is also useful to know that UNIX separates lines by the single character 'ˆJ', whereas PCs use the character combination 'ˆMˆJ'.
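The contiguity of these segments is easy to verify from within a program. The following Java sketch (the class name is just for illustration) prints the numeric codes of a few characters:

```java
public class AsciiSegments {
    public static void main(String[] args) {
        // Casting a char to int reveals its numeric code.
        System.out.println((int) 'A');   // 65
        System.out.println((int) 'Z');   // 90 = 65 + 25, so A-Z is contiguous
        System.out.println((int) 'a');   // 97
        System.out.println((int) '0');   // 48
        // Because '0'..'9' are contiguous, a digit's value is a subtraction:
        char digit = '7';
        System.out.println(digit - '0'); // 7
    }
}
```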

3. ASCII and the keyboard. Only for some characters is it obvious how to enter them into a text using the keyboard. The 32 characters from octal 00 to octal 37 are called control characters. They are generated by holding down the control key and pressing the corresponding character key. For many programs they will have a special meaning, hence the name "control characters". Other characters in the table are not accessible through the keyboard directly. Emacs allows you to enter any ASCII character in a file, by first typing 'ˆQ', then the octal code (leading zero not necessary), and finishing with 'return'. Give it a try!

4. Latin-1. The basic unit of computer memory is the byte, which consists of 8 bits. For representing an ASCII character we only need 7 bits, and hence the highest bit is always 0 in an ASCII character representation. A number of possibilities for the remaining 128 bit-patterns have been standardised by ISO (the "International Organization for Standardization"). The one you will find on our UNIX-machines is called Latin-1. It contains numerous special characters from various European languages, such as 'Ü', 'â', etc. Go to

http://czyborra.com/charsets/iso8859.html

for a table. Our local UNIX systems understand Latin-1 codes. (For Emacs to display Latin-1 characters you may need to include the line (standard-display-european 1) in your ".emacs" file.) PCs also use ASCII but the extension to eight bits is different from Latin-1.

5. Unicode. The 256 different patterns that we can store in a byte are clearly not enough to cover the world's different alphabets. Various efforts to define a two-byte code have been combined since 1991 in the Unicode consortium (http://www.unicode.org). Work is ongoing but a large number of character systems have been codified already. See http://charts.unicode.org for a list of tables. Observe that the patterns from 0x0000 to 0x00FF are identical with ASCII and Latin-1.

6. ASCII, Unicode and Java. Java is one of the first languages which fully embraces Unicode. Indeed, in the first stage of compilation, every input character of your program is translated into the corresponding Unicode value. This allows programmers from other cultures to use identifier names built up from characters of their own languages. However, the keywords of the language, such as class, int, etc., are fixed. A declaration of the form

char c;

reserves two bytes of memory. In a char-cell we can store any of the 65,536 different Unicode characters (actually, not all these possibilities are yet defined). Unicode is a nice idea but we are still surrounded by ASCII and Latin-1 in this part of the world. For example, keyboards can't possibly try to cover all of this code. So while entering an ASCII character into a char-cell is easily effected by a statement like

c = ’A’;

we have to have a "Unicode to ASCII" translation for other values. The basic idea for such a translation is to use escape sequences, where a special symbol (which in Java is always '\') indicates that a certain number of following characters are describing some other character rather than denoting themselves. The following table lists the traditional escape sequences

which are understood by C, C++, UNIX, and, of course, Java.

esc seq   description       esc seq   description
\n        newline           \r        return
\t        tab               \v        vertical tab
\'        quote             \"        double quote
\\        backslash         \b        backspace
\f        formfeed          \a        bell
\?        question mark     \0        null character
\ooo      octal byte        \xhh      hexadecimal byte

(Note how we recover the '\' itself!) Java adds to these the Unicode escapes of the format "\uhhhh", where each h stands for a hexadecimal digit. Unicode escapes can appear anywhere in a program, not just within the definition of a character or a string; try it!

7. Computing with characters. A little bit of arithmetic is permitted on char-values. This is useful for text processing. For example, if the char-cell c holds the character 'a', then c-32 will represent 'A'. The same is true for every other lowercase letter of the ASCII alphabet and its Latin-1 extension. Thus we can use arithmetic to switch from lowercase to uppercase letters.

8. Text compression. Universal standardised codes are extremely important for the smooth cooperation of computers across networks and for the portability of programs. In the remainder of these notes, however, I want to look at a slightly different scenario.
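The escape sequences and the character arithmetic of point 7 can be tried out in a few lines of Java (a minimal sketch; the class name is illustrative):

```java
public class CaseSwitch {
    public static void main(String[] args) {
        char c = 'a';
        // 'a' is 32 above 'A' in ASCII, so subtracting 32 gives the uppercase letter.
        // The arithmetic yields an int, hence the cast back to char.
        char upper = (char) (c - 32);
        System.out.println(upper);             // A
        // Escape sequences and Unicode escapes denote characters too:
        char newline = '\n';                   // the UNIX line separator ^J
        char capitalA = '\u0041';              // Unicode escape for 'A'
        System.out.println(capitalA == upper); // true
    }
}
```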

Suppose we are asked to develop a code for a certain source of information. We assume that we know the alphabet A of symbols that the source is using. Now, in such a situation it may well be that the set A has fewer than 256 elements and so the use of 8 bits for the representation of each symbol could be wasteful. As an example, consider English text. It will consist of 52 upper- and lowercase characters plus some 10 further symbols. In this situation a code of 6 bits per character suffices.

9. The source code theorem. Suppose now that we also know the relative frequency with which we will encounter each character from the information source. We can then do even better by adapting the length of the code to the frequency of the character occurring. An example of such a strategy is the Morse Alphabet. One might wonder whether a particular code is optimal for the given application. Surprisingly, there is a precise answer to this question. It was

developed by Claude Shannon in 1948.

Suppose the relative frequency with which the symbol a ∈ A appears is p_a. (So the sum of all p_a will be 1.) Shannon calls the value

    I(a) = log_2 (1 / p_a)

the information content of the symbol a. He then defines

    H = sum over a ∈ A of p_a · I(a)

which he calls the entropy of the data source. This is his source code theorem:

There is a coding of the symbols in A in a binary alphabet such that the average number of bits per symbol is

arbitrarily close to the entropy of the source. No coding can have a bit rate smaller than the entropy.

The average number of bits is calculated almost like the entropy. Suppose the coding of symbol a ∈ A takes l(a) many bits. Then the average length is

    L = sum over a ∈ A of p_a · l(a)
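Both formulas are easy to evaluate mechanically. The following Java sketch computes the entropy and the average code length for a small four-symbol source (the frequencies and code lengths are those of the Huffman example in point 10 below):

```java
public class Entropy {
    public static void main(String[] args) {
        double[] p   = { .5625, .1875, .1875, .0625 }; // relative frequencies p_a
        int[]    len = { 1, 2, 3, 3 };                 // code lengths l(a) for codes 0, 10, 110, 111

        double entropy = 0, average = 0;
        for (int i = 0; i < p.length; i++) {
            // I(a) = log_2(1/p_a); Java's Math has no log_2, so divide by log 2
            double info = Math.log(1 / p[i]) / Math.log(2);
            entropy += p[i] * info;       // H = sum of p_a * I(a)
            average += p[i] * len[i];     // L = sum of p_a * l(a)
        }
        System.out.println(entropy);      // about 1.62
        System.out.println(average);      // 1.6875
    }
}
```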

10. Huffman code. Four years after Shannon published his theorem, David Albert Huffman came up with a very simple practical code which is optimal among all codes which code every character separately. His method is best explained by an example:

symbol   frequency   code
a        .5625       0
b        .1875       10
c        .1875       110
d        .0625       111

In a first stage one constructs a binary tree by repeatedly grouping together the two entries with smallest frequency (assigning the sum to the new root), until only one node remains (which must carry the frequency 1). In this example, c and d are grouped first (sum .25), then b joins them (sum .4375), and finally a (sum 1.0). Each path in this tree is labelled by assigning a left branch with 1 and a right branch with 0. The code for each symbol is obtained by collecting all labels along the path from the root to the symbol.

In this particular example, one gets an average bit rate of 1.6875 bits per symbol, which is a lot better than the naive code with 2 bits per character. The entropy of the source is about 1.62, so it would be hard (and probably not worthwhile) to improve on this coding. It is worthwhile to note that every sequence of 0s and 1s can be seen as a sequence of Huffman coded characters (except that at the end we may have some digits left over).

11. Improving on the source code theorem. The second part in Shannon's theorem assumes that there is no correlation between successive characters in the stream of information produced by the source. This is not true in English texts, for example, where certain letter combinations are much more likely than others. There are refined statements and codes for such situations.
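The tree construction described in point 10 can be sketched in Java with a priority queue (class and method names are illustrative; when two frequencies tie, the order of merging, and hence the exact codes, may differ from the table above, but the set of code lengths and the average bit rate come out the same):

```java
import java.util.*;

public class Huffman {
    // A node of the Huffman tree: either a leaf carrying a symbol,
    // or an internal node whose frequency is the sum of its children's.
    static class Node implements Comparable<Node> {
        final double freq;
        final Character symbol;  // null for internal nodes
        final Node left, right;
        Node(double f, Character s) { freq = f; symbol = s; left = null; right = null; }
        Node(Node l, Node r) { freq = l.freq + r.freq; symbol = null; left = l; right = r; }
        public int compareTo(Node o) { return Double.compare(freq, o.freq); }
    }

    // Repeatedly merge the two nodes of smallest frequency until one remains.
    static Node build(Map<Character, Double> freqs) {
        PriorityQueue<Node> q = new PriorityQueue<>();
        for (Map.Entry<Character, Double> e : freqs.entrySet())
            q.add(new Node(e.getValue(), e.getKey()));
        while (q.size() > 1)
            q.add(new Node(q.poll(), q.poll()));
        return q.poll();
    }

    // Collect each symbol's code along the path from the root:
    // left branches are labelled 1, right branches 0, as in the handout.
    static void codes(Node n, String prefix, Map<Character, String> out) {
        if (n.symbol != null) {
            out.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // single-symbol edge case
            return;
        }
        codes(n.left, prefix + "1", out);
        codes(n.right, prefix + "0", out);
    }

    public static void main(String[] args) {
        Map<Character, Double> freqs = new HashMap<>();
        freqs.put('a', .5625); freqs.put('b', .1875);
        freqs.put('c', .1875); freqs.put('d', .0625);
        Map<Character, String> code = new TreeMap<>();
        codes(build(freqs), "", code);
        double average = 0;
        for (Map.Entry<Character, String> e : code.entrySet())
            average += freqs.get(e.getKey()) * e.getValue().length();
        System.out.println(code + "  average bits: " + average);
    }
}
```

Running it on the four-symbol source of point 10 reproduces an average of 1.6875 bits per symbol.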

Oct  Dec  Hex   Char  Description                      Oct   Dec  Hex   Char  Description
000    0  0x00        NUL, Null character              0100   64  0x40  @     commercial at
001    1  0x01  ˆA    SOH, Start Of Header             0101   65  0x41  A
002    2  0x02  ˆB    STX, Start Of Text               0102   66  0x42  B
003    3  0x03  ˆC    ETX, End Of Text                 0103   67  0x43  C
004    4  0x04  ˆD    EOT, End Of Transmission         0104   68  0x44  D
005    5  0x05  ˆE    ENQ, ENQuire                     0105   69  0x45  E
006    6  0x06  ˆF    ACK, ACKnowledge                 0106   70  0x46  F
007    7  0x07  ˆG    BEL, Bell                        0107   71  0x47  G
010    8  0x08  ˆH    BS, BackSpace                    0110   72  0x48  H
011    9  0x09  ˆI    HT, Horizontal Tab               0111   73  0x49  I
012   10  0x0A  ˆJ    LF, Line Feed, newline           0112   74  0x4A  J
013   11  0x0B  ˆK    VT, Vertical Tab                 0113   75  0x4B  K
014   12  0x0C  ˆL    FF, Form Feed                    0114   76  0x4C  L
015   13  0x0D  ˆM    CR, Carriage Return              0115   77  0x4D  M
016   14  0x0E  ˆN    SO, Shift Out                    0116   78  0x4E  N
017   15  0x0F  ˆO    SI, Shift In                     0117   79  0x4F  O
020   16  0x10  ˆP    DLE, Data Link Escape            0120   80  0x50  P
021   17  0x11  ˆQ    DC1, XON, Device Control 1       0121   81  0x51  Q
022   18  0x12  ˆR    DC2, Device Control 2            0122   82  0x52  R
023   19  0x13  ˆS    DC3, XOFF, Device Control 3      0123   83  0x53  S
024   20  0x14  ˆT    DC4, Device Control 4            0124   84  0x54  T
025   21  0x15  ˆU    NAK, Negative AcKnowledgement    0125   85  0x55  U
026   22  0x16  ˆV    SYN, SYNchronous idle            0126   86  0x56  V
027   23  0x17  ˆW    ETB, End Transmission Block      0127   87  0x57  W
030   24  0x18  ˆX    CAN, CANcel                      0130   88  0x58  X
031   25  0x19  ˆY    EM, End of Medium                0131   89  0x59  Y
032   26  0x1A  ˆZ    SUB, SUBstitute                  0132   90  0x5A  Z
033   27  0x1B        ESC, ESCape                      0133   91  0x5B  [     open square bracket
034   28  0x1C        FS, File Separator               0134   92  0x5C  \     backslash
035   29  0x1D        GS, Group Separator              0135   93  0x5D  ]     close square bracket
036   30  0x1E        RS, Record Separator             0136   94  0x5E  ˆ     caret
037   31  0x1F        US, Unit Separator               0137   95  0x5F  _     underscore
040   32  0x20        space                            0140   96  0x60  `     back quote
041   33  0x21  !     exclamation mark                 0141   97  0x61  a
042   34  0x22  "     double quote                     0142   98  0x62  b
043   35  0x23  #     hash                             0143   99  0x63  c
044   36  0x24  $     dollar                           0144  100  0x64  d
045   37  0x25  %     percent                          0145  101  0x65  e
046   38  0x26  &     ampersand                        0146  102  0x66  f
047   39  0x27  '     quote                            0147  103  0x67  g
050   40  0x28  (     open parenthesis                 0150  104  0x68  h
051   41  0x29  )     close parenthesis                0151  105  0x69  i
052   42  0x2A  *     asterisk                         0152  106  0x6A  j
053   43  0x2B  +     plus                             0153  107  0x6B  k
054   44  0x2C  ,     comma                            0154  108  0x6C  l
055   45  0x2D  -     minus                            0155  109  0x6D  m
056   46  0x2E  .     full stop                        0156  110  0x6E  n
057   47  0x2F  /     oblique stroke                   0157  111  0x6F  o
060   48  0x30  0                                      0160  112  0x70  p
061   49  0x31  1                                      0161  113  0x71  q
062   50  0x32  2                                      0162  114  0x72  r
063   51  0x33  3                                      0163  115  0x73  s
064   52  0x34  4                                      0164  116  0x74  t
065   53  0x35  5                                      0165  117  0x75  u
066   54  0x36  6                                      0166  118  0x76  v
067   55  0x37  7                                      0167  119  0x77  w
070   56  0x38  8                                      0170  120  0x78  x
071   57  0x39  9                                      0171  121  0x79  y
072   58  0x3A  :     colon                            0172  122  0x7A  z
073   59  0x3B  ;     semicolon                        0173  123  0x7B  {     open curly bracket
074   60  0x3C  <     less than                        0174  124  0x7C  |     vertical bar
075   61  0x3D  =     equals                           0175  125  0x7D  }     close curly bracket
076   62  0x3E  >     greater than                     0176  126  0x7E  ~     tilde
077   63  0x3F  ?     question mark                    0177  127  0x7F        DEL, DELete