Unicode as a Standard Framework for Syllabics and Other Special Characters
CHRIS HARVEY
University of Manitoba

Papers of the 35th Algonquian Conference, ed. H.C. Wolfart (Winnipeg: University of Manitoba, 2004), pp. 125-136.

Although the computer has promised greater flexibility for those wishing to design, type and publish multilingual documents, many who have tried to do so have encountered such problems as:

- how do I get character x to appear on my screen?
- where can I find a font that contains all of the characters required?
- how do I access those characters?
- how can I stop my old files from showing up as garbage on someone else's computer (or my own new one)?

Problems of this sort can be magnified exponentially for those who work full-time in minority languages that use uncommon scripts or characters, and who need to develop entire corpora of educational materials, dictionaries, newsletters, books and web pages. The discussion that follows looks into some answers to these questions a propos of North American Native languages, and will show that it is possible to take advantage of the computer's vaunted flexibility to produce documents in such languages, from the humblest of emails to professionally designed works of typographic art.

THE COMPUTER AS TYPEWRITER

Much as early printers emulated handwritten calligraphic manuscripts in their type design and layout, the computer word-processor has typically been understood and used as an electronic typewriter. The goal was a printed copy, on standard-size paper. As time went on, laser and ink-jet printers introduced more and better-quality font options, but the idea remained the same: the print-out of the document is the final copy to be read.

As font technology progressed, it went in an entirely Western European direction. For linguists and native speakers working in non-Western orthographies, it often proved impossible to find the necessary letters in the ASCII or ANSI character sets.1 Following the typewriter mentality, writers could only adapt to the unchanging hardware, for example by amending or abandoning previous orthographies in favour of a system that was more ASCII-friendly.2 Hand-written diacritics over printed letters, or characters like schwa and eth pencilled in by hand, presumably document cases where getting the computer to print or type the character correctly was more trouble than it was worth. Another fallback strategy - using word-processor formatting techniques3 - produces characters that cannot be sorted, and whose formatting is lost when copying text from one machine to another. As an alternative, each small community of linguists or speakers - or in many cases, each individual - could modify a pre-existing font to suit their immediate needs.4 In many cases, these modifications were done in graphics programs which are not built to design fonts, again limiting their transferability.

Several additional problems arise from this process. First, every personalised font has its own mapping, so that if a language requires a given special letter, font x may map it onto 9 while font y uses it to replace a. Consequently, a word-processor document typed using font x will display the wrong characters on a computer using either font y or no special font at all. This incompatibility also plagues those who, on opening a file they themselves typed a few years back, discover that they have not installed their old fonts on their new computer and are left with an illegible mess (I speak from personal experience).
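The mapping clash can be made concrete with a short simulation. This is a minimal Python sketch, assuming two entirely hypothetical "font hack" mappings (font x and font y are not real fonts, and š is simply a stand-in for any special letter): the same stored byte shows the reader different characters depending on which font happens to be installed.

```python
# A minimal simulation of the legacy "font hack" problem.
# FONT_X and FONT_Y are hypothetical mappings, invented for illustration:
# each custom font redraws an ordinary ASCII code as a special letter.
FONT_X = {0x39: "š"}  # font x draws the special letter over the "9" key
FONT_Y = {0x61: "š"}  # font y draws the same letter over the "a" key

def render(data: bytes, font: dict) -> str:
    """What the reader sees: the font's replacement glyph if any, else plain ASCII."""
    return "".join(font.get(b, chr(b)) for b in data)

text = b"9"                  # typed and saved on a machine using font x
print(render(text, FONT_X))  # š  - what the author intended
print(render(text, FONT_Y))  # 9  - the same file viewed with font y
print(render(text, {}))      # 9  - viewed with no special font at all
```

Nothing in the file itself records which mapping was intended, which is exactly why such documents degrade into an illegible mess once the original font is lost.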
Second, new characters often replace letters which appear on standard keyboard keys (for simpler access), making it necessary to switch fonts every time one switches languages. This is a formatting nightmare, since reformatting and organising documents in multiple fonts is extremely time-consuming and error-prone. Third, personalized fonts are typically language- or dialect-specific, so that an Eastern Cree syllabics font, for example, may or may not include characters particular to Western Cree syllabics (fig. 1). Orthographic standards in some languages change slightly (like a dialect chain) from community to community, meaning the font may only be useful to a small group of people.

Figure 1. Some finals in Cree and Ojibwe syllabics: the Eastern and Western final forms of the consonants p, t, k, c, m, n, s, š and y.

These are some of the ways in which users of non-Western orthographies were hampered by the language attitudes of software designers in the not so distant past. With no consensus on ASCII font mapping, the result of exchanging documents electronically was predictably fraught, and a printout was essentially the only reliable way of sharing documents. Thus the computer became an expensive typewriter, just slightly more versatile than the old IBM Selectric.

1. ASCII and ANSI code pages are the standard lists of available characters, numbered up to 256 (including a substantial number of command codes, such as delete, tab, etc.). The most recent operating systems today employ Unicode, which provides for approximately 65,000 characters.
2. One example of this adaptation is provided by Ancient Greek dictionary websites, which use an ASCII alphabet to reproduce the Greek.
3. Such as the Kwak'wala underline accent, as in a̱, ḵ, g̱, etc., designed specifically for the North American typewriter (Galois 1994). How does one, for emphasis, underline an underline?
4. Modifications for personal use are permitted; wider distribution is not: since electronic fonts are intellectual property, copyright law prohibits widespread modification and distribution.

TOWARDS A DIGITAL MENTALITY

The current world of email, web pages, dictionary databases, interlinear software and electronic archives has pushed the demands on the computer far beyond those of a text-printing machine. Of all the font solutions listed above, only one - modifying the orthography of the language to fit ASCII - seems compatible with the needs of the modern computer user. This solution has not been at all acceptable for languages with long literary histories such as Greek, Russian, Korean, Bengali, etc.; it should similarly be ruled out for those Native languages of North America that use uncommon orthography. This is especially true for the Algonquian languages that use syllabics, where a wholesale switch to a roman orthography would create a literary schism between older speakers and the newly educated, along with disposing of a part of cultural identity (Poser 2000:2). However, until a few years ago, there was no other practical choice.

Many web pages and databases require special fonts. In the past - and this is still a factor today - these were typically either modified ASCII or ANSI fonts, or non-roman orthographies squeezed into the space left over at the bottom range of ASCII. Limitations include non-transferability, obsolescence and platform specificity (Mac or Windows).
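The code-page limitation described in note 1, and the Mac-versus-Windows specificity just mentioned, are easy to demonstrate: 8-bit systems assigned the byte values 128-255 differently, so the very same byte decodes to different characters on the two platforms. A small Python illustration, using the standard cp1252 (Windows ANSI) and mac_roman codecs:

```python
# One byte, two platforms: 8-bit code pages agree only on the ASCII range
# (0-127), so any font or file relying on the upper range is not portable.
raw = b"\xe9"  # a single byte above the 7-bit ASCII range

print(raw.decode("cp1252"))     # Windows (ANSI) code page: é
print(raw.decode("mac_roman"))  # classic Mac OS code page: È
print(raw.decode("ascii", errors="replace"))  # plain ASCII: no such character
```

Unicode removes exactly this ambiguity by assigning each character a single number that is valid on every platform.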
If - as they say - the average web user has the attention span of a five-year-old, then in much less time than it takes to download and install a special font, that user will be surfing elsewhere. One response to this font problem is the .pdf file, which allows the browser to view a page, fully formatted and in any font, without downloading anything - except of course the browser plug-in. In the end, the .pdf file is little more than a scan of a print copy, and harkens back to the computer as a typewriter. These files are not interactive, nor can they be transmitted to other users for modification. Web sites using modified ASCII or ANSI fonts can look great - many Inuktitut sites use such fonts - but web-patience will probably not tolerate a page full of AcAuE, which is what faces those who have not downloaded and installed the correct font.

Current versions of Windows and MacOS are Unicode-compliant. Unicode is not a new font or font technology. Rather, it is a mapping system (or numbered table) in which each distinct character (approximately 65,000 characters, at present) is given a number unique to itself. This number is permanent and unchanging, regardless of hardware or software. Unicode numbers are presented in hexadecimal (base sixteen),5 so the numbers look somewhat awkward, but most people will never have to see the encoding anyway. For example, the schwa character (ə) has the coding 0259, and the syllabics character ᒷ (mwa) is 14B7. To make this vast character set manageable, Unicode is divided into sections (ranges), one for each script (a representative selection of characters is included in the Appendix).

5. Hexadecimal numbers use the digits 0-9 and the letters A-F (representing the numbers ten through fifteen).

Although Unicode is often purported to cover all of the world's languages (though not by Unicode developers), some bias towards the world's majority languages nevertheless remains. All characters used in French and Hungarian (for example) have been uniquely encoded, whereas some characters for Guarani and Navajo (to name but two) need to be composed of two or more characters. A significant number of languages use glyphs unavailable in Unicode, and others still - such as Javanese, Balinese, Batak and Mayan hieroglyphics - use scripts that are entirely absent. Through the introduction of standard character mapping, Unicode is intended to provide the same set of rules for all computer users, so that any time, anywhere, someone can look at a web page or read an email in any language without special downloads or fonts.6 At the time of this writing, most people are still using software which does not take full advantage of Unicode, so the legibility of transferred documents containing Unicode characters is still limited to those using compatible software.
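The code points just cited, and the kind of two-character composition Navajo requires, can be verified in any Unicode-aware environment. A short Python sketch (the particular Navajo vowel chosen here is my own illustrative example, not one given in the paper):

```python
import unicodedata

# The two hexadecimal code points cited in the text, written as \u escapes.
schwa = "\u0259"  # ə, LATIN SMALL LETTER SCHWA
mwa = "\u14B7"    # ᒷ, CANADIAN SYLLABICS WEST-CREE MWA

for ch in (schwa, mwa):
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")

# Where no single code point exists, a letter is composed from a base
# character plus combining marks. Illustrative case: the Navajo high-tone
# nasalized vowel a-ogonek-acute has no precomposed form in Unicode.
navajo_vowel = "\u0105\u0301"           # ą + combining acute accent
print(navajo_vowel, len(navajo_vowel))  # renders as one letter, but is two code points
```

Because both pieces of the composed letter are standard code points, it can be sorted, copied and transmitted reliably, which is precisely what the home-made font hacks described above could not guarantee.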