Quick viewing(Text Mode)

Character Set Encoding Hood of Such Corruption

Character Set Encoding Hood of Such Corruption

TUGboat, Volume 11 (1990), No. 2 171

Electronic mail is subject to set trans- Philology lations, and these are often inconsistent, particularly if the mail has passed through Bitnet nodes or IBM mainframes; multiple code pages increase the likeli- Character Set Encoding hood of such corruption. Nelson H.. Beebe Switching Between Character Sets Introduction Should it become necessary to switch code pages in the middle of a document (.. for a business in The by Janusz . Bieri [3] which follows this Sweden to address a to a customer in Turkey), paper complements an earlier one by Yannis Hara- some mechanism must be provided to do so. The lambous [5] on the subject of support for larger char- IS0 8859 encodings of &bit characters define escape acter sets. sequences that permit changing code pages. Because this is an area of international interest For multi-byte character sets, the situation is in the community, it seemed worthwhile more complex. JISCII 17, 81, the Japanese Industrial to review some of the issues. in order to provide Standard Code for Information Interchange, is a 14- background for those readers who are not actively bit character set defined on a 94 x 94 grid addressed following the subject. by two 7-bit characters, using characters in the range There are currently at least two IS0 groups that 33.. .126, but biased downward by 32 so that the are actively engaged in the standardization of char- rows and columns are numbered from 1 to 94. The acter set encoding. They are identified here by the Chinese GB-2312 and Korean KS 5601 standards reference numbers of the standards on which they also use a 94 x 94 grid. are working, IS0 8859 and IS0 10646. The JISCII character set includes special sym- The IS0 8859 group deals with ASCII and bols and , the printable ISO/ASCII EBCDIC character set issues and with standardiza- character set, Cyrillic, Greek, the Japanese syl- tion of 8-bit character sets. The IS0 10646 group labic ( and ), followed deals with multi-byte character set issues. by Level 1 (2965 Chinese characters commonly I have been following the IS0 8859 work for used in Japanese), and Level 2 kanji (3388 lesser- more than two years, and recently joined the IS0 used Chinese characters). JISCII does not include 10646 discussions. Based on that experience, it the ISO/ASCII control characters or European al- seems clear that the problem is much more difficult phabetic extensions, nor does the ISO/ASCII subset than most people realize. occupy consecutive positions. Both groups have active electronic mailing lists; There are at least three ways of encoding doc- the end of this article has information on how to uments in JISCII: subscribe to them. 0 16-bit characters as &bit byte pairs;

256 Does Not Suffice 0 7-bit ISO/ASCII with shift-in and shift-out es- There is a need for more than 256 characters to sup- cape sequences to enter and leave 16-bit char- port even just those written in the acter sections; . As long as people (and computers) in- 7-bit ISO/ASCII where character pairs whose sist on using 8-bit characters, this gives rise to the first member has the high-order (8th) bit set problem of multiple 'code pages'. are taken to be JISCII. Text encoded according to one code must The last two are more compact than the first, but be accompanied by separate information stating suffer from what may be called the substring prob- what code page is to be used. This is difficult lem. in attribute-free file systems such as WNIX and PC Because these two involve a mixture of 8-bit DOS, since there is no guaranteed way to keep that and 16-bit characters, extraction of a valid substring information with a text file. Embedding of attribute requires examination of surrounding context. In the headers in the file itself is unacceptable. method, it may be necessary to scan back to Few electronic mail systems support the spec- the start to determine whether there is a preceding ification of a code page in the message header, al- escape sequence. In the third method, if the first though the mail headers are sufficiently ex- character in the substring does not have its high tensible that such support could be easily added. bit set, one need only examine a single preceding TUGboat, Volume 11 (1990), No. 2 character to find out whether the first character is ness characters, ideographs, and symbols to a normal one, or the second half of a pair. be encoded for storage and communication Since string searching and substring extraction between computers. However, each uses a are among the commonest operations performed on different approach to making the inevitable text by a computer, these are very serious draw- tradeoffs. backs. In my view, seems short-sighted, and too There is also the problem of determining string small. An 18-bit set would probably suffice, so : is the the number of characters, or maybe 36-bit machines like our venerable DEC-20, the number of memory cells used to hold the string? and the UNIVAC 1100 series, will someday be rein- Which one is needed depends on the application. carnated! What is more likely, though, is that falling Use of a 16-bit representation eliminates these memory prices will make 32-bit characters practica- problems for JISCII, and could as well for a character ble. set that supported all those derived from the . Inadequate Display Support However, when Chinese is included, about There is a serious problem of character display. How 50,000 more characters are needed [4]. There are is a person to read a document that requires charac- also differences in characters used in the People's ters unavailable on the terminal or printing device? Republic of China (due to simplifications instituted This becomes particularly relevant as we enter an after 1949) and those in the Republic of China era of international electronic mail and document (Province of Taiwan). exchange. When the 2800 syllabic characters of Korean While personal computers and workstations are are thrown in [4], plus the 900 or so letter increasingly offering support for multiple character variants of classical Arabic [9, 10,2], and the dozens sets, much remains to be done before the display of systems used in India, it seems that even problem can be eliminated. a 16-bit set of up to 65536 characters may be insuffi- cient to cover the world's major languages. Because Impact on Programming Languages speakers of Chinese, Indian languages, and Arabic If character sets are enlarged, computer program- account for more than half of the world's population, ming languages must be modified. these languages cannot be ignored. rn 3.0 added only one bit to the character set Overlapping with the work on IS0 10646 is an encoding, and is riddled with assumptions that 256 effort to develop a comprehensive 16-bit character is the size of the character set. These assumptions set called Unicode; some Unicode traffic was origi- are of course introduced in the interests of compact- nally broadcast to the IS0 10646 list, but that prac- ness, so that T@ can run on small machines. With tice was discontinued while this article was in prepa- effort, some of these could be eliminated, but prob- ration. Subscription details are given in the last sec- ably not all of them; doing so would introduce in- tion below. compatibilities, and thus lose the right to the name The IS0 10646 list review contains the following rn. : With the exception of the ANSI C standard [I], As of March, 1990, two coding schemes have adopted in December 1989, existing programming emerged. The International Organization languages (or at least their compilers) assume 7-bit for Standardization (ISO) Subcommittee 2, or 8-bit characters; the last machines using only 6- Working Group 2 (SC2/WG2) has developed bit characters were retired in the early 1980s. the IS0 10646 Multi-Octet Code. It is now ANSI C provides support for 'wide' characters; a "draft proposed" standard (two levels re- wide strings take the form " . . . It and wide charac- moved from being an international standard). ter constants are written as L' . . '. Hexadecimal es- The IS0 working group has been working on cape sequences, \xhhh. . . are introduced; they may this project for the last 6 years and it has have any number of hexadecimal digits. The under- been subject to unusually wide review for a lying representation of wide character strings may proposed standard. The other draft stan- use one or more bytes per character, allowing room dard is the result of the work of a consor- for future expansion. Shift-in and shift-out repre- tium of U.S. companies, mostly from the west sentations are permitted. However, ANSI C states coast. It is called Unicode. Both of these that a byte with all bits zero shall be interpreted draft standards enable the world's communi- as a null character (and therefore, a C string ter- cation (newspapers and magazines) and busi- minator), independent of the shift state, and a byte TUGboat, Volume 11 (1990), No. 2 173 with all bits zero may not occur in the second or In Japanese and Chinese, the order of ideo- subsequent bytes of a multibyte character. Also, a graphic characters is determined by the authors of comment, string literal, or character constant shall each . Many base the order on begin and end in the shift state, and shall the 214 fundamental 'radicals' (character part build- consist of a sequence of valid multibyte characters. ing blocks); the dictionary is ordered by groups of characters having the same radical, and within each Impact on Collating Sequences group, by increasing numbers of strokes, and by pro- Any assignment of characters to a numerical code nunciation. However, some characters have more introduces collating sequence problems. than one radical, many have the same pronuncia- For example, the Danish and Norwegian alpha- tion (in Japanese, 5500 kanji have only 336 different bets are A... Z, ffi, 0, A, while Swedish reverses the sounds [6]), and pronunciations may vary with di- order of the last three and uses umlauts: fi, 0, A. alects (Chinese has dozens of dialects that are mutu- Note that these Scandinavian accented letters ally incomprehensible, but share a common writing are considered separate letters; this differs from system). Dictionaries from the People's Republic French and German, which alphabetize such letters of China can also be found with ordering according without regard to accents. Danish, Norwegian, and to the Pinyin representation in the Latin alphabet, Swedish also occasionally use acute accents on the that is, according to Mandarin pronunciation. letters 'e' and 'o', for disambiguation of homonyms, JISCII has yet another assignment of Chinese and for a few foreign ; these accents are ig- (kanji) characters into two levels according to fre- nored in alphabetization. quency of use. In the JIS Level 1 kanji, order is ac- With the reform of 1948, Denmark cording to dictionary and pronunciation order [ll, ceased to capitalize , introduced the new letter p. 68); subgroupings are mostly according to stroke A in place of the old Aa, and moved it from the front count. with exceptions. In JIS Level 2 kanji, or- of the alphabet to the end. Under the reform, Aa is der is according to radical and stroke count. These collated as if it were spelled A, so some people moved difficulties have traditionally discouraged the use of from the front of the telephone book to the back. indexes in Japanese books, and also seriously impact When Aa occurred in proper names, the owners were filing of information in computers and offices. permitted to retain the old form, so both continue For a readable account, see the chapter Practi- to exist: Aarhus University is in Wrhus, Denmark, cal Consequences of a Large Character Set in . Mar- and both the University and city listings are found shall Unger's book [ll]. at the end of the telephone book. Thus, in many languages, and even in English, In German, 8 ('scharfes s' or 'es-zet') capitalizes sorting according to a collating sequence is a dif- to (or rarely, SZ), does not occur as an initial ficult problem, and cannot easily be letter, and is alphabetized as 'ss'. changed by a computer program. This has impor- In Spanish, 'ch' is treated as a single letter tant ramifications for BIB^, Ibw,and MakeIn- falling between c and , '11' is treated as a letter be- dex. In some BIB^ styles, article are lower- tween 1 and m, and fi is treated as a letter between cased, and some I4m styles convert titles to upper- n and o. case letters; in both, the result is a disaster if the Although several languages in Eastern Europe happens to be German. and the Soviet Union employ the Cyrillic alphabet, there are variations between countries in both order, Internationalization of Software and the exact letters used. The reforms introduced The chapters on Native Language Support and Reg- after the Russian revolution in 1917 removed some ular Expressions in [12] describe the changes that letters from the alphabet, but scholars of pre-1917 must be made to the C run- libraries, and to literature still require them. A good treatment was many UNIX utilities, when extended character sets given by David Birnbaum in a posting of 30-Nov- are used. 1989 to the IS0 8859 list. For example, international software cannot con- The New York Stock Exchange listings are al- tain embedded character strings; these must be ways by corporate , yet is moved into separate external files that can be cus- according to company name; IBM is listed as if tomized for each language. In addition, output for- it were spelled 'International Business Machines'. mat strings must be extended syntactically to per- Telephone books in some areas move the Macdon- mit reordering of output tokens (cf. English "The alds and the McKays in front of other names begin- White House" and French "La Maison Blanche"). ning with M. TUGboat, Volume 11 (1990), No. 2

Summary and Conclusions Letter case is ignored in LISTSERV commands. Character coding is a very complex issue, and de- Your e-mail return address is automatically ex- spite the vigorous discussions on the IS0 8859 list, tracted from your mail message. The I do not see a solution on the horizon. Because it is used to annotate the mailing list, which can be uses a different character set (EBCDIC) than every- retrieved with a message like REVIEW IS010646, in one else, IBM will be affected more by character set case you would like to know your correspondents by issues than other vendors; its conservatism, and his- other than cryptic e-mail addresses. The REVIEW torical slowness to respond to the demands of the command also provides a summary of the purpose market and its users, also suggests that solutions of the discussions. will not soon be forthcoming. All list traffic is archived; a message with the In my view, the advent of support for 8-bit char- text INDEX IS010646 will retrieve an index for that acters in TJ$ 3.0 will for some time hinder, rather list, and a following message with the text GET than help, document portability. There is a conflict IS010646 f iletype will fetch a particular file. For between the desire for ease of use and readability more details on LISTSERV, send a message with the of the input file on the part of the author or typist text INFO GENINTRO. who enters it by, say, striking the 0 key on a Danish To get on the Unicode list, send a message re- keyboard, and the co-author in Britain who cannot questing inclusion to Glenn Wright: display the 0 on the screen, and may have no idea [email protected] what character was intended. Authors who to the 7-bit ISO/ASCII char- References acter set and with some labor. enter \OC) instead of American National Stacdards Institute, 1430 using the 0 key, will promote document portability. Broadway, New York, N. Y., 10018. Amerz- Alternatively, translation filters will be needed, can Natzonal Standard Programmzng Language but it may not be possible to base them entirely on C, ANSI X3-159.1989, December 14 1989. simple text substitutions, at least in the 7-bit to 8- Joseph D. Becker. Arabic Processing. bit direction, since \O cannot be substituted if it is Communzcatzons of the Assoczatzon for Com- the initial part of another control sequence. Also. putzng Machanery, 30(7):600-610, July 1987. in TEX 3.0, hyphenation opportunities will be lost Janusz S. Bien. On Standards for Computer if accented characters encoded as single &bit values Modern Extensions. TUGboat, 11(2):175- are replaced by control sequences. 183. June 1990. ~IBTEX,IP', and MakeIndex will require re- visions in the future for S. Duncan, T. Mukaii, and S. Kuno. A Com- puter Graphics System for Non-Alphabetic Or- support of 8-bit character sets, thographies. Computer Studzes zn the Human- more flexible provision for specification of - ztzes, 2(3):113-132, October 1969. ing order, Yannis Haralambous. and Latin Alphabet suppression of capitalization changes. Languages. TUGboat, 10(3):342-345, Novem- Even itself may need modifications, since ber 1989. the xchro character translation array is initialized A. . Hershey. for Computers. early in the program to values which depend upon Technical Report TR-2101, U. S. Naval Weap- the local character set, and no provision is made for ons Laboratory, Dahlgren, Virginia 22448, Au- switching code pages dynamically. gust 1967. Joining the Mailing Lists Japanese Standards Association, 1-24, Akasaka 4 Chome, Minato-ku, Tokyo, 107 Japan. To subscribe to the IS0 8859 or IS0 10646 mailing Japanese Industrzal Standard JIS C 6626-1978 lists, send an e-mail message to the server Code of the Japanese Graphzcs Character Set [email protected] for Informatzon Interchange, 1978. with the body text (LISTSERV ignores the e-mail Sub- Japanese Standards Association, 1-24, Akasaka ject: ) 4 Chome, Minato-ku, Tokyo, 107 Japan. SUBSCRIBE IS08859 Japanese Industrzal Standard JIS C 6234-1983 24-dots Matrzx Character Patterns for Dot or Prznters, 1983. SUBSCRIBE IS010646 TUGboat, Volume 11 (1990), No. 2

G.A. Kubba. The Impact of Computers on Ara- to Computer Modern -I strongly support the bic Writing, Character Processing, and Teach- principal idea, and I pursue it in the present paper. ing. Information Processing, 80:961-965, 1980. To organize the discussion in a systematic way, I Pierre Mackay. Problem Scripts. will use the notions - borrowed from [2]-of text Byte, 11(2):201-218, February 1986. encoding, typing and rendering. J. Marshall Unger. The Fiflh Generation 2 Text encoding Fallacy- Why Japan is Betting its Future on Artificial Intelligence. , In the context of w,encoding means the character 1987. sets of the fonts in question and their layouts. In the present section I will focus my attention on the X/Open Company, Ltd. X/Open Portability character sets, as the layouts should be influenced, Guide, Supplementary Definitions, volume 3. among others, by typing considerations. Prentice-Hall. 1989. In an attempt to obtain a general idea about the use of the latin alphabet worldwide, I looked up the o Nelson H.F. Beebe only relevant reference work I am aware of, namely Center for Scientific Computing and Department of Languages Identificatzon Guzde [7] (hereafter LIG). Mathematics Apart from the latin scripts used in the Soviet Union South Physics Building and later replaced by Cyrillic ones, it lists 82 lan- University of Utah guages using the latin alphabet with additional let- Salt Lake City, UT 84112 ters (I preserve the original spelling): USA Albanian, Aymara, Basque. Breton, Bui, Tel: (801) 581-5254 Catalan, Choctaw, Chuana, Cree, Czech, Internet: BeebeQscience .utah.edu Danish, Delaware, Dutch, Eskimo, Espe- ranto, Estonian, Ewe, Faroese (also spelled Faroeish), Fiji, Finnish, French, Frisian, Fulbe, German, Guarani, Hausa, Hun- garian, Icelandic, Irish, Italian, Javanese, Juang, Kasubian, Kurdish, Lahu, Lahuli, - Latin, Lettish, Lingala, Lithuanian, Lisu, On Standards Luba, Madura. Miao, Malagash, Malay, for Computer Modern Font Extensions Mandingo, Minankabaw, Mohawk, Mossi, Janusz S. Bien Navaho, Norwegian, Occidental, Ojibway (also spelled Ojibwe), Polish, Portuguese, Abstract Quechua, Rhaeto-Romanic (Ladin, Ro- Haralambous' proposal to standardize the unused mansh), Rumanian, Samoan, Seneca, Serbo- part of Computer Modern fonts is discussed, and Croatian. Sioux, Slovak, Slovene, Spanish, some modifications and extensions suggested. The Suto, Sundanese, Swahili. Swedish, Tagalog, idea is pursued by designing the extended CM font Turkish, Uolio, Vietnamese, Volapiik, Welsh, layout, and an example is given for one of its possible Wolof, Y, Yoruba, Zulu. uses. This list includes some languages and dialects 1 Introduction with no script at all, for which the information sup- plied concerns more or less standard transcription. In my note [4] I advocated an old (115, p. 461, 16, For most of them this fact is noted explicitly, but p. 451) but rarely used idea to place national letters the exception of Kasubian (usually recognized as a (actually, the Polish ones, but the generalization is dialect of Polish) suggests that this is not always obvious) in the unused part of Computer Modern the case. I noticed some inconsistencies in the nu- fonts, i.e. as the characters with the codes higher merous indexes to the book, but only one omission than 127; this approach allows the handling of na- (described later) in the proper text. Of course, it is tional languages in a way upward compatible with difficult for me to judge the reliability of the work the standard (American) English TEX. A similar as a whole. proposal was made independently by Yannis Hara- The number of additional letters in the latin lambous [8], who states also that the use of non- alphabets listed in LIG - including some variants English letters of latin alphabets should be coordi- of shape but excluding upper case letters - is 176. nated, resulting in a single widely used extension