Quick viewing(Text Mode)

The Problem with Unicode

The Problem with Unicode

THE PROFESSION

lie plain text messages such as I have to deal with every day: letters, e-mail, The Problem handwritten notes. Plain text of this kind, being mostly brief and personal, never mixes systems. Toward the spectrum’ other end lie formal with documents, beautifully typeset and replete with tables, indexes, and illus- Neville Holmes, University of Tasmania trations. Rarely do writing systems mix in such documents. Somewhere in the middle lie HTML documents. Traditionally, we use markup to pro- duce all but the simplest documents.

ne reader took my “Seven Great Blunders of the Com- puting World” column (Com- Unicode is a success, but O puter, July 2002, pp. 112, 110-111) as generally offen- would another approach sive to the memory of modern com- have fared even better? puting’s great pioneers (Letters, Oct. 2002, pp. 9-10). Others questioned if more significant blunders than those I selected exist (Letters, Nov. 2002, pp. 8-9). But only one group, Unicode ing system that “provides a unique This system of coding in situ instruc- supporters, strongly denied a particu- number for every character, no matter tions to a compositor, human or pro- lar alleged blunder. what the platform, no matter what the grammed, prescribes detailed aspects A relatively inconclusive e-mail program, no matter what the lan- of the document’s final form. The typ- exchange led me to offer this column to guage” (www.unicode.org/unicode/ ical modern markup coding defines a Unicode’s most vocal supporter for a standard/WhatIsUnicode.html). That hierarchy of detail by specification or considered one-round debate on the not all characters have unique numbers default. For example, the coding first issue, which he accepted. After he failed is one of Smith’s complaints. A little specifies a font, within the scope of that to respond to the half column I sent him further on, the text states that the specification it specifies a size, then supporting my claim, I suspected that “Unicode Standard has been adopted within the size it specifies a form: the Unicode people had withdrawn by … industry leaders,” that it “is roman, italic, or small capitals. from the debate when they realized that required by modern standards,” and The markup is encoded in plain text, by blunder I did not mean failure. that it “is supported in many operat- as is the text it modifies. This text is However, a recently arrived issue of ing systems, all modern browsers, and coded within a single — Vector, the British APL Association’s many other products.” properly so for simplicity’s sake. The quarterly journal, led off with an Even a brief study of the online resulting document almost never needs Adrian Smith editorial commenting on material, impressive in both amount more than one writing system. Unicode (www.vector.org.uk/v193/ and detail, confirms that Unicode The writing system specification ed193.htm). This piece recalled Uni- clearly is an admirable success. But all properly belongs at a level above the code’s basic problem, which particu- this does not avert my claim that typographical. Font classes such as larly afflicts technical symbolism. Thus Unicode is a blunder. A different typewriter, serif, and sans serif have as provoked, I will now expand my case approach would have worked much little meaning in the Arab writing sys- in the very faint hope that a much bet- better for encoding text, documents, tem as diwani, kufic, and thuluth have ter approach to digital implementation and writing systems. in the Latin writing system. The Arab of the world’s writing systems might yet allography has no relationship to be adopted. TEXT AND DOCUMENTS Korean hanguel syllabary, and both Text encoding and document encod- starkly differ from the Latin writing UNICODE ing differ, although the two cover a system with its two cases and spaces What is Unicode? The official spectrum of written- uses. At between words. Unicode site states that it is an encod- the most populous end of the spectrum Continued on page 114

116 Computer The Profession Continued from page 116

various ways. For example, an accent- Table 1. Possible eight-bit coding system for the Latin writing system. ing combiner would usually place a Binary Hexadecimal Class punctuation mark over a of the alphabet. Traditional symbols such as 000x xxxx 00-1f Modifiers @#$£¥% are overlays, and even the & 0010 xxxx 20-2f Combiners originated as an E ligated to a sub- 0011 xxxx 30-3f Punctuation scripted t after the Latin et, and thus 0100 xxxx 40-4f General symbols would not need a basic code. The com- 0101 xxxx 50-5f Arithmetic symbols biners generalize the kerning first used 0110 xxxx 60-6f Italic digits in the 16th century to accommodate 0111 xxxx 70-7f Roman digits the Greek writing system, as Figure 1 100x xxxx 80-9f Italic smalls shows, as well as the coding similarly 101x xxxx a0-bf Italic capitals used in TeX to effect accents. 110x xxxx c0-df Roman smalls The generative capability of this 111x xxxx e0-ff Roman capitals approach provides for complex use of accents as in Vietnamese and for the stable generation of new translitera- Documents are best marked up in a to culture. For example, German treats tions and symbols, thanks to typogra- single writing system, with any mixing ä as though it were a, while Finnish phy’s ability to provide esthetically of writing systems specified through treats the two as distinct. English treats pleasing forms of newly popular com- markup directly or, better, by using rh as two letters, while Welsh treats pound symbols such as the euro. macrodefinitions or specifying an them as one. Thus, the placement of inclusion. symbols within alphabets should be Other basic letters chosen to support transliteration. Punctuation codes allow for very THE LATIN WRITING SYSTEM By ignoring traditional alphabetic simple symbols, useful in combination, By putting all writing systems and lan- sequences, other alphabetic writing such as accents. These symbols, which guages together, Unicode becomes much systems—particularly the Greek and should include the blank and some rul- too complex and unstable. A far better Cyrillic—can probably be accommo- ings, also allow versatility in modifica- approach would be to provide a stan- dated at the font level so that accept- tion, particularly when doubled, as in “ dard for each writing system, with each able direct transliteration can be and =. The general and arithmetic sym- standard providing the system’s specific achieved across those systems. This bol codes provide basic shapes chosen graphical characteristics. The Latin sys- would permit a single Cyrillic-Greek- for their usefulness in combinations, as tem of writing can, for example, be com- Latin coding standard rather than most of the familiar symbols can con- pletely and effectively encompassed in three separate standards. veniently be generated as compounds. an eight-bit coding system. Table 1 sug- The codes for the two sets of digits gests this system’s nature. Modifiers and combiners provide for 10 numerals as well as for This approach treats the writing sys- The modifiers and combiners pro- all the signs needed to represent deci- tem as a generative graphical structure vide for the numerous symbolic varia- mal values, such as a negative sign and from which basic symbols can be tions and distinctions needed both for decimal point. This allows compressed selected, modified, or combined outside different and for specialists four-bit coding for numbers in special any particular font. Specific fonts can such as mathematicians and phoneti- numeric applications. then vary the details of plain, modified, cists. Modifiers affect a single symbol, or combined forms in their own ways. combiners affect two symbols, and the Keyboards No attempt should be made to pro- affected symbols can themselves be Given that different languages use vide compatibility with ASCII or basic, modified, or combined. the Latin writing system differently, EBCDIC, especially with those Modifiers can shrink or expand, using this coding system would require extremely awkward control characters thicken or thin, raise or lower, rotate different keyboards and keyboard dri- intended for use in telegraphy. The or reflect, double or treble, with many vers, with some single keystrokes gen- sooner we phase out these obsolescent of these permutations offering hori- erating several bytes of code. In systems the better. zontal, vertical, or other variations. A particular, because the alphabets will In addition, no attempt should be horizontal reflector would let ([{< be provide roman and italic i and j with- made to implement any particular col- generated from )]}>, for example. out the superposed dot, tapping the i lating sequences. Not only are these Combiners can juxtapose, ligate, or and j keys will generate a compound complex, they also differ from culture overlay the two symbols they affect in code. Coding these letters without their

114 Computer The Profession

customary dots is not only much more general—accommodating the full Turkish alphabet and allowing accent- ing in a more general way—but also provides a range of useful ligatures such as those needed for some stan- dard phonetic symbols. Being able to mix roman and italic forms in plain text without recourse to markup will benefit expressiveness greatly. Better still, it can be easily introduced on any keyboard by pro- viding an extra shift key. Keyboards should also allow for the keying in of modifiers and combiners in their own right and not just to generate ad hoc symbols. English would benefit particularly from the reintroduction of Figure 1. A piece of Greek type—the vowel alpha—kerned for use with separate accents, accents on borrowed words to preserve shown here in combination with a smooth breathing. [From A New Introduction to Bibliog- distinctive pronunciations. Not only raphy, Philip Gaskell, 1972. Reprinted by permission of Oxford University Press.] could words like café, résumé, and zoöl- ogy be confidently spoken, but foreign In contrast, using graphical synthesis, separate writing system encodings names as well. For example, in addition the alternative that I propose could could be made possible and not manda- to fostering better pronunciation, accommodate the entire Latin writing tory by including them within Unicode adding tone marks to Chinese names system within an eight-bit codespace. as subsets alongside existing subsets, would also show some long-overdue In the unlikely event the profession giving us the best of both worlds. I cultural respect and sensitivity. accepts my argument unanimously, Existing keyboards could be driven replacing Unicode with separate stan- Neville Holmes is an honorary research to provide this generality, but adoption dards for each writing system would associate at the University of Tasma- of the kind of coding system I’ve still be quite impractical, particularly nia’s School of Computing. Contact described would lead to keyboards and as Unicode is still in the process of him at [email protected]. drivers that let the system’s full capabil- replacing older systems. However, if Details of citations in this essay, links ities be evoked simply. Among these even a majority of computing profes- to further material, and possibly fur- capabilities would be alternative cod- sionals consider my argument merely ther discussion of issues are at www. ings for some symbols, for example by persuasive, an orderly transition to comp.utas.edu.au/users/nholmes/prfsn. reflection of a symmetric symbol. In contrast to Unicode, these codings would be obvious given the small code- space. At the font level, such alterna- tives would allow fine distinctions. REACH HIGHER Notice that, if Greek and Cyrillic fonts Advancing in the IEEE Computer Society use the same coding scheme as Latin can elevate your standing in the profession. fonts, letter forms symmetric in one Application to Senior-grade membership family may not be symmetric in another. recognizes ✔ ten years or more of professional expertise learly, Unicode is far too com- Nomination to Fellow-grade membership recognizes plex. Partly, this stems from its ✔ C attempt to accommodate all the exemplary accomplishments in computer engineering world’s languages and their writing systems in one gigantic codespace. GIVE YOUR CAREER A BOOST I UPGRADE YOUR MEMBERSHIP Further, Unicode does not take full advantage of the systematic graphical computer.org/join/grades.htm features of the various writing systems.

June 2003 115