The Problem with Unicode

THE PROFESSION (Computer, pp. 116, 114)

Neville Holmes, University of Tasmania

Unicode is a success, but would another approach have fared even better?

One reader took my “Seven Great Blunders of the Computing World” column (Computer, July 2002, pp. 112, 110-111) as generally offensive to the memory of modern computing’s great pioneers (Letters, Oct. 2002, pp. 9-10). Others questioned if more significant blunders than those I selected exist (Letters, Nov. 2002, pp. 8-9). But only one group, Unicode supporters, strongly denied a particular alleged blunder.

A relatively inconclusive e-mail exchange led me to offer this column to Unicode’s most vocal supporter for a considered one-round debate on the issue, which he accepted. After he failed to respond to the half column I sent him supporting my claim, I suspected that the Unicode people had withdrawn from the debate when they realized that by blunder I did not mean failure.

However, a recently arrived issue of Vector, the British APL Association’s quarterly journal, led off with an Adrian Smith editorial commenting on Unicode (www.vector.org.uk/v193/ed193.htm). This piece recalled Unicode’s basic problem, which particularly afflicts technical symbolism. Thus provoked, I will now expand my case in the very faint hope that a much better approach to digital implementation of the world’s writing systems might yet be adopted.

UNICODE

What is Unicode? The official Unicode site states that it is an encoding system that “provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language” (www.unicode.org/unicode/standard/WhatIsUnicode.html). That not all characters have unique numbers is one of Smith’s complaints.

A little further on, the text states that the “Unicode Standard has been adopted by … industry leaders,” that it “is required by modern standards,” and that it “is supported in many operating systems, all modern browsers, and many other products.”

Even a brief study of the online material, impressive in both amount and detail, confirms that Unicode clearly is an admirable success. But all this does not avert my claim that Unicode is a blunder. A different approach would have worked much better for encoding text, documents, and writing systems.

TEXT AND DOCUMENTS

Text encoding and document encoding differ, although the two cover a spectrum of written-language uses. At the most populous end of the spectrum lie plain text messages such as I have to deal with every day: letters, e-mail, handwritten notes. Plain text of this kind, being mostly brief and personal, never mixes writing systems.

Toward the spectrum’s other end lie formal documents, beautifully typeset and replete with tables, indexes, and illustrations. Rarely do writing systems mix in such documents. Somewhere in the middle lie HTML documents.

Traditionally, we use markup to produce all but the simplest documents. This system of coding in situ instructions to a compositor, human or programmed, prescribes detailed aspects of the document’s final form. The typical modern markup coding defines a hierarchy of detail by specification or default. For example, the coding first specifies a font, within the scope of that specification it specifies a size, then within the size it specifies a form: roman, italic, or small capitals.

The markup is encoded in plain text, as is the text it modifies. This text is coded within a single writing system, properly so for simplicity’s sake. The resulting document almost never needs more than one writing system.

The writing system specification properly belongs at a level above the typographical. Font classes such as typewriter, serif, and sans serif have as little meaning in the Arab writing system as diwani, kufic, and thuluth have in the Latin writing system. The Arab allography has no relationship to Korean hanguel syllabary, and both starkly differ from the Latin writing system with its two cases and spaces between words.

Documents are best marked up in a single writing system, with any mixing of writing systems specified through markup directly or, better, by using macrodefinitions or specifying an inclusion.

THE LATIN WRITING SYSTEM

By putting all writing systems and languages together, Unicode becomes much too complex and unstable. A far better approach would be to provide a standard for each writing system, with each standard providing the system’s specific graphical characteristics. The Latin system of writing can, for example, be completely and effectively encompassed in an eight-bit coding system. Table 1 suggests this system’s nature.

Table 1. Possible eight-bit coding system for the Latin writing system.

Binary       Hexadecimal   Class
000x xxxx    00-1f         Modifiers
0010 xxxx    20-2f         Combiners
0011 xxxx    30-3f         Punctuation
0100 xxxx    40-4f         General symbols
0101 xxxx    50-5f         Arithmetic symbols
0110 xxxx    60-6f         Italic digits
0111 xxxx    70-7f         Roman digits
100x xxxx    80-9f         Italic smalls
101x xxxx    a0-bf         Italic capitals
110x xxxx    c0-df         Roman smalls
111x xxxx    e0-ff         Roman capitals

This approach treats the writing system as a generative graphical structure from which basic symbols can be selected, modified, or combined outside any particular font. Specific fonts can then vary the details of plain, modified, or combined forms in their own ways.

No attempt should be made to provide compatibility with ASCII or EBCDIC, especially with those extremely awkward control characters intended for use in telegraphy. The sooner we phase out these obsolescent systems the better.

In addition, no attempt should be made to implement any particular collating sequences. Not only are these complex, they also differ from culture to culture. For example, German treats ä as though it were a, while Finnish treats the two as distinct. English treats rh as two letters, while Welsh treats them as one. Thus, the placement of symbols within alphabets should be chosen to support transliteration.

By ignoring traditional alphabetic sequences, other alphabetic writing systems, particularly the Greek and Cyrillic, can probably be accommodated at the font level so that acceptable direct transliteration can be achieved across those systems. This would permit a single Cyrillic-Greek-Latin coding standard rather than three separate standards.

Modifiers and combiners

The modifiers and combiners provide for the numerous symbolic variations and distinctions needed both for different languages and for specialists such as mathematicians and phoneticists. Modifiers affect a single symbol, combiners affect two symbols, and the affected symbols can themselves be basic, modified, or combined.

Modifiers can shrink or expand, thicken or thin, raise or lower, rotate or reflect, double or treble, with many of these permutations offering horizontal, vertical, or other variations. A horizontal reflector would let ([{< be generated from )]}>, for example.

Combiners can juxtapose, ligate, or overlay the two symbols they affect in various ways. For example, an accenting combiner would usually place a punctuation mark over a letter of the alphabet. Traditional symbols such as @#$£¥% are overlays, and even the & originated as an E ligated to a subscripted t after the Latin et, and thus would not need a basic code. The combiners generalize the kerning first used in the 16th century to accommodate the Greek writing system, as Figure 1 shows, as well as the coding similarly used in TeX to effect accents.

The generative capability of this approach provides for complex use of accents as in Vietnamese and for the stable generation of new transliterations and symbols, thanks to typography’s ability to provide esthetically pleasing forms of newly popular compound symbols such as the euro.

Other basic letters

Punctuation codes allow for very simple symbols, useful in combination, such as accents. These symbols, which should include the blank and some rulings, also allow versatility in modification, particularly when doubled, as in “ and =. The general and arithmetic symbol codes provide basic shapes chosen for their usefulness in combinations, as most of the familiar symbols can conveniently be generated as compounds.

The codes for the two sets of digits provide for 10 numerals as well as for all the signs needed to represent decimal values, such as a negative sign and decimal point. This allows compressed four-bit coding for numbers in special numeric applications.

Keyboards

Given that different languages use the Latin writing system differently, using this coding system would require different keyboards and keyboard drivers, with some single keystrokes generating several bytes of code. In particular, because the alphabets will provide roman and italic i and j without the superposed dot, tapping the i and j keys will generate a compound code. Coding these letters without their customary dots is not only much more general (accommodating the full Turkish alphabet and allowing accenting in a more general way) but also provides a range of useful ligatures such as those needed for some standard phonetic symbols.

Being able to mix roman and italic forms in plain text without recourse to markup will benefit expressiveness greatly. Better still, it can be easily introduced on any keyboard by providing an extra shift key.

Keyboards should also allow for the keying in of modifiers and combiners in their own right and not just to generate ad hoc symbols. English would benefit particularly from the reintroduction of accents on borrowed words to preserve

Figure 1. A piece of Greek type, the vowel alpha, kerned for use with separate accents, shown here in combination with a smooth breathing.
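
To make the layout in Table 1 concrete, here is a minimal Python sketch that maps an eight-bit code to its class and, for letter codes, reads the style and case from the top bits. The names TABLE_1, code_class, and letter_slot are hypothetical. Treating the low five bits as a "slot" is also an assumption, since the column does not say which letter would occupy which position.

```python
# Class ranges copied from Table 1 (inclusive bounds).
# All names here are illustrative, not part of any published standard.
TABLE_1 = [
    (0x00, 0x1F, "Modifiers"),
    (0x20, 0x2F, "Combiners"),
    (0x30, 0x3F, "Punctuation"),
    (0x40, 0x4F, "General symbols"),
    (0x50, 0x5F, "Arithmetic symbols"),
    (0x60, 0x6F, "Italic digits"),
    (0x70, 0x7F, "Roman digits"),
    (0x80, 0x9F, "Italic smalls"),
    (0xA0, 0xBF, "Italic capitals"),
    (0xC0, 0xDF, "Roman smalls"),
    (0xE0, 0xFF, "Roman capitals"),
]


def code_class(code: int) -> str:
    """Return the Table 1 class name for an eight-bit code."""
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in eight bits")
    for low, high, name in TABLE_1:
        if low <= code <= high:
            return name
    raise AssertionError("Table 1 ranges should cover every eight-bit code")


def letter_slot(code: int) -> tuple[str, str, int]:
    """Split a letter code (0x80-0xFF) into style, case, and slot.

    In Table 1 the bit patterns 100x, 101x, 110x, 111x mean that
    bit 0x40 selects roman over italic and bit 0x20 selects capitals
    over smalls. The remaining five bits are treated here as an
    unspecified slot number (an assumption of this sketch).
    """
    if not 0x80 <= code <= 0xFF:
        raise ValueError("not a letter code in Table 1")
    style = "roman" if code & 0x40 else "italic"
    case = "capital" if code & 0x20 else "small"
    return style, case, code & 0x1F
```

Under this reading, switching a letter between roman and italic, or between small and capital, is a single-bit change, which fits the column's suggestion that an extra shift key could select italic forms.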
