Omega — Why Bother with Unicode?

Omega — why Bother with Unicode? Robin Fairbairns University of Cambridge Computer Laboratory Pembroke Street Cambridge CB23QG UK Email: [email protected] Abstract Yannis Haralambous’ and John Plaice’s Omega system employs Unicode as its coding system. This short note (which previously appeared in the UKTUG magazine Baskerville volume 5, number 3) considers the rationale behind Unicode itself and behind its adoption for Omega. Introduction and written by Yannis Haralambous (Lille) and John Plaice (Université Laval, Montréal). It follows on As almost all T X users who ‘listen to the networks’ E quite naturally from Yannis’ work on exotic lan- at all will know, the Francophone T X users’ group, E guages (see, among many examples, Haralambous, GUTenberg, arranged a meeting in March at CERN 1990; 1991; 1993; 1994), which have always seemed (Geneva) to ‘launch’ Ω. The UKTUG responded to to me to be bedevilled by problems of text encoding. GUTenberg’s plea for support to enable T Xusers E Simply, Ω (the program) is able to read scripts from impoverished countries to attend, by making that are encoded in Unicode (or in some other code the first disbursement from the UKTUG’s newly- that is readily transformable to Unicode), and then established Cathy Booth fund. The meeting was to process them in the same way that T Xdoes. well attended, with representatives from both East- E Parallel work has defined formats for fonts and other ern and Western Europe (including me; I also at- necessary files to deal with the demands arising from tended with UKTUG money), and one representa- AFONT Unicode input, and upgraded versions of MET , tive from Australia (though he is presently resident the virtual font utilities, and so on, have been writ- in Europe, too). ten. Ω itself is based on the normal Web2C distribu- The speakers at the meeting were Michel Gooss- tion that is at the base of most modern Unix imple- ens (the president of GUTenberg1), as host for GUT- mentations, and of at least one of the PC versions enberg and as an expert on background to, and the that is freely available. use of Unicode), and Yannis Haralambous and John Plaice, Ω’s two developers. Why Unicode? The meeting can be accounted a success; all that attended enjoyed themselves, and also learnt a There are something between 3000 and 6000 lan- lot. guages in use in the world, for which a writing sys- This paper is a minor revision of an article I tem exists. (The set of languages is shrinking all the wrote for the UKTUG magazine Baskerville,vol- time as the deadening effect of cultural intrusion, ume 5, number 3, and outlines some of my views of primarily through the electronic media, overwhelms (and support for) the choices that Haralambous and the desire to support existing cultures to the extent Plaice have made. I will consider the arguments for of teaching their language to the young.) The distri- using Unicode as a foundation for future text pro- bution of languages is by no means even throughout cessing, in particular (of course) TEX-related pro- the globe (Michel showed us a map), and there are cessing. many that have not been and will presumably now never be formally recorded. What is Ω? When we come to writing systems, we find almost every variation imaginable in use somewhere Ω (Haralambous and Plaice, 1994) is an extension in the world. The Latin-like system (written left of T X and related programs that has been designed E to right with modest numbers of diacritics simply 1 And now (at the time of writing) president-elect of TUG arranged) has very wide penetration, not least be- itself. cause so many languages were first written down by TUGboat, Volume 16 (1995), No. 3 — Proceedings of the 1995 Annual Meeting 325 Robin Fairbairns Western European missionaries or other explorers. some of the characters that ASCII does specify are Languages such as Vietnamese are classified as ‘com- left “for national variation” in ISO 646; ASCII itself plex Latin-like’, with ≥ 2 diacritics per character; an then became the USA national variation of ISO 646. artificial example of the same effect is IPA (the In- An example of national variation is defined for the ternational Phonetic Alphabet) which has sub- and UK, which specifies that the code point that holds super-scripts and joining marks. Languages such as ‘#’inASCII should hold a pounds sign ($). There Hebrew and Arabic are written right to left, and con- are versions for various Nordic languages that in- stitute another class. Then there are the multiple- clude characters such as æ or ˚a in place of braces, a ligature writing systems typified by the Indic lan- version for French with acute, grave and circumflex- guages such as Devanagari (of which we had a fasci- accented letters, one for German that offers letters nating exposition at the 1993 UKTUG Easter meet- with umlauts and ‘sharp s’ (ß). ing on ‘non-American’ languages, from Dominik Wu- There were various attempts at mechanisms to jastyk), and finally the syllabic scripts (such as Ko- assign different character sets for use by those who rean Hangul and Japanese Hiragana and Katakana), need to use characters from several different sets (for and the ideographic scripts (Chinese and Japanese example, someone writing a Swedish-English Dictio- Kanji). nary); an example is ISO 2022, which defines escape Encodings are needed for computer operations sequences for such switches. These efforts proved on language of any sort. There are differences be- impractical (at least they seemed so to me), and 8- tween the coded representation and the written (or bit developments of ISO 646 arose, with the ability printed) representation. Everyone who’s read about (comfortably) to express more than one language. TEX at all will know about ligatures (the CM fonts, Thus were born the ISO 8859 character sets. and most PostScript fonts, implement ligatures so The commonest of these (at least in the ken of most that, for example, ‘fl’ typed appears as ‘fl’ printed). English speakers) is ISO Latin-1 (ISO 8859-1, that More significantly, almost all adults in Western cul- is, part one of the multi-part standard), which was tures write ‘joined-up’, which is in itself application designed for use by Western Europeans. As well of a form of ligature. All these ligatures are for pre- as the ‘basic ASCII set’ in the first 128 characters, sentation, not for information, and so it is unrea- it has diphthongs and vowels appropriate to most sonable for them to be represented in a character Western European languages. Oddly, it omits the set. Other ligatures, however, form real characters œ dipthong that French uses, and (perhaps less sur- in some languages (examples are æ in Danish and prisingly2) it omits some of the accent forms used by Norwegian, and œ in French). Welsh. ISO 8859 didn’t stop with part 1, though; Each encoding represents a ‘character set’ that there are variants that accommodate Cyrillic (for is to be used by the computer; the history of how Russian, Serbian, and several other languages of the these character sets have developed is long and sorry old Soviet Union), Arabic, Hebrew, and so on. indeed. In the ‘dark ages’ (in fact, as recently as the This is all well and good, but it doesn’t an- early 1960s, when I started computing), every make swer the needs of a writer preparing multilingual of computer system had its own character code, many documents, except in the case that the multiple lan- of them based on the 5-bit teleprinter codes used in guages are accommodated in the same part of ISO telex printers. Eventually, the rather more sophisti- 8859: it will happen some of the time, but most ‘in- cated teletypes appeared, which used seven bits of teresting’ combinations will require switches of char- an eight-bit code; this 7-bit codification was stan- acter set whenever the language changes. dardised as ASCII (the American Standard Code for So ISO (by this time, jointly with IEC) started Information Interchange), which was (in the area of development of an all-encompassing character set, to application it was designed for) an excellent code. It be numbered ISO/IEC 10646 (the difference of 10 000 had all the properties needed for many of the signif- is no accident). ISO/IEC 10646 was to accommodate icant development of computers in the 1960s, but it every possible language in the world by the simple had one serious flaw: it was not able to encode dia- expedient of allowing 32-bit characters. Of course, critics, which are used in almost every language (but no-one can comprehend a 32-bit character set, and which your all-American information interchanger so the set was to be structured, as a hypercube of would seldom have a need for). different repertoires; the (0, 0, 0, ∗) repertoire would To regularise the resulting mess, ISO adopted the ASCII standard as the basis for an international 7-bit character set, ISO 646. ISO 646 is identical to 2 Given that Wales would have been represented by the ASCII in the code points that it specifies; however, BSI in the standardisation process. 326 TUGboat, Volume 16 (1995), No. 3 — Proceedings of the 1995 Annual Meeting Omega — why Bother with Unicode? be the same ISO-Latin 1, but all the other sets could world’’, which would be transparently converted be accomodated, too. to “hello world”. Thus, the two grave accents and Independently, Apple and Microsoft got together the two single quotes constitute ‘programming’.

Omega — Why Bother with Unicode?

Unicode and Code Page Support

DE-Tex-FAQ (Vers. 72

Miktex Manual Revision 2.0 (Miktex 2.0) December 2000

Introduction to Unicode

The TEX Live Guide TEX Live 2012

The Textgreek Package∗

Makor: Typesetting Hebrew with Omega

The Mathspec Package Font Selection for Mathematics with Xǝlatex Version 0.2B

Section 18.1, Han

Math Symbol Tables

Unicode Cree Syllabics for Windows and Macintosh

I18n, M17n, Unicode, and All That