Unicode as a Standard Framework for Syllabics and Other Special Characters

CHRIS HARVEY University of Manitoba

Although the computer has promised greater flexibility for those wishing to design, type and publish multilingual documents, many who have tried to do so have encountered such problems as:
- how do I get character x to appear on my screen?
- where can I find a font that contains all of the characters required?
- how do I access those characters?
- how can I stop my old files from showing up as garbage on someone else's computer (or my own new one)?
Problems of this sort can be magnified exponentially for those who work full-time in minority languages that use uncommon scripts or characters, and need to develop entire corpora of educational materials, dictionaries, newsletters, books and web pages. The discussion that follows looks into some answers to these questions à propos of North American Native languages and will show that it is possible to take advantage of the computer's vaunted flexibility to produce documents in such languages, from the humblest of emails to professionally designed works of typographic art.

THE COMPUTER AS TYPEWRITER

Much as early printers emulated handwritten calligraphic manuscripts in their type design and layout, the computer word-processor has been typically understood and used as an electronic typewriter. The goal was a printed copy, on standard-size paper. As time went on, laser and ink-jet printers introduced more and better-quality font options, but the idea remained the same: the print-out of the document is the final copy to be read. As font technology progressed, it went in an entirely Western European direction. For linguists and native speakers working in non-Western orthographies, it often proved impossible to find the necessary letters in the ASCII or ANSI character sets.1

Papers of the 35th Algonquian Conference, ed. H.C. Wolfart (Winnipeg: University of Manitoba, 2004), pp. 125-136.

Following the typewriter mentality, writers could only adapt to the unchanging hardware, for example, by amending or abandoning previous orthographies in favour of a system that was more ASCII-friendly.2 Hand-written diacritics over printed letters, or characters like schwa and eth pencilled in by hand, presumably document cases where getting the computer to print or type the character correctly was more trouble than it was worth. Another fallback strategy - using word-processor formatting techniques3 - produces characters that cannot be sorted, and whose formatting is lost when copying text from one machine to another. As an alternative, each small community of linguists or speakers - or in many cases, each individual - could modify a pre-existing font to suit their immediate needs.4 In many cases, these modifications were done in graphics programs which are not built to design fonts, again limiting their transferability.

Several additional problems arise from this process. First, every personalised font has its own mapping, so that if a language requires a particular special letter, font x might map it onto the 9 while font y uses it to replace a. Consequently, a word-processor document typed using font x will display the wrong characters on a computer using either font y or no special font at all. This incompatibility also plagues those who, in opening a file they themselves typed a few years back, discover that they have not installed their old fonts on their new computer, and are left with an illegible mess (I speak from personal experience). Second, new characters often replace letters which appear on standard keyboard keys (for simpler access), making it necessary to switch fonts every time one switches languages. This is a formatting nightmare, since reformatting and organising documents in

multiple fonts is extremely time-consuming and error-prone. Third, personalized fonts are typically language- or dialect-specific, so that an Eastern syllabics font, for example, may or may not include characters particular to Western usage (fig. 1). Orthographic standards in some languages change slightly from community to community (like a dialect chain), meaning the font may only be useful to a small group of people.

1. ASCII and ANSI code pages are the standard lists of available characters, numbered up to 256 (including a substantial number of command codes, such as delete, tab, etc.). The most recent operating systems today employ Unicode, which provides for approximately 65,000 characters.
2. One example of this adaptation is provided in Ancient Greek dictionary websites, which use an ASCII transliteration to reproduce the Greek.
3. Such as the Kwak'wala underline accent, as in a̱, ḵ, g̱, etc., designed specifically for the North American typewriter (Galois 1994). How does one, for emphasis, underline an underline?
4. Modifications for personal use are permitted, wider distribution is not: since electronic fonts are intellectual property, copyright law prohibits widespread modification and distribution.

Figure 1. Some finals in Cree and Ojibwe syllabics

Final consonant:  p  t  k  c  m  n  s  š  y
Eastern:          ᑉ  ᑦ  ᒃ  ᒡ  ᒻ  ᓐ  ᔅ  ᔥ  ᔾ
Western:          ᑊ  ᐟ  ᐠ  ᐨ  ᒼ  ᐣ  ᐢ  ᐡ  ᐩ

These are some of the ways in which users of non-Western orthographies were hampered by the language attitudes of software designers in the not so distant past. With no consensus on ASCII font mapping, the result of exchanging documents electronically was predictably fraught, and a printout was essentially the only reliable way of sharing documents. Thus the computer became an expensive typewriter, just slightly more versatile than the old IBM Selectric.

TOWARDS A DIGITAL MENTALITY

The current world of email, web pages, dictionary databases, interlinear software and electronic archives has pushed the demands on the computer far beyond those of a text-printing machine. Of all the font solutions listed above, only one - modifying the orthography of the language to fit ASCII - seems compatible with the needs of the modern computer user. This solution has not been at all acceptable for languages with long literary histories such as Greek, Russian, Korean, Bengali, etc.; it should similarly be ruled out for those Native languages of North America that use uncommon orthographies. This is especially true for the Algonquian languages that use syllabics, where a wholesale switch to a roman orthography would create a literary schism between older speakers and the newly educated, along with disposing of a part of cultural identity (Poser 2000:2). However, until a few years ago, there was no other practical choice.

Many web pages and databases require special fonts. In the past - and this is still a factor today - these were typically either modified ASCII or ANSI fonts, or non-roman orthographies squeezed into space left over at the bottom range of ASCII. Limitations include non-transferability, obsolescence and platform specificity (Mac or Windows). If - as they say - the average web user has the attention span of a five-year-old, in much less time than it takes to download and install a special font, that user will be surfing elsewhere.

One response to this font problem is the .pdf file, which allows the browser to view a page, fully formatted and in any font, without downloading anything - except of course the browser plug-in. In the end, the .pdf file is little more than a scan of a print copy and harkens back to the computer as a typewriter. These files are not interactive, nor can they be transmitted to other users for modification. Web sites using modified ASCII or ANSI fonts can look great - many Inuktitut sites use such fonts - but web-patience will probably not tolerate a page full of AcAuE, which is what faces those who have not downloaded and installed the correct font.

Current versions of Windows and MacOS are Unicode-compliant. Unicode is not a new font or font technology. Rather, it is a mapping system (or numbered table) in which each distinct character (approximately 65,000 characters, at present) is given a number, unique to itself. This number is permanent and unchanging, regardless of hardware or software. Unicode numbers are presented as hexadecimal (base sixteen),5 so the numbers look somewhat awkward, but most people will never have to see the encoding anyway. For example, the schwa character (ə) has the coding 0259, and the syllabic ᒷ (mwa) is 14B7. To make this vast character set manageable, Unicode is divided into sections (ranges), one for each script (a representative selection of characters is included in the Appendix).
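The permanence of this mapping can be checked directly in any Unicode-aware programming language. The short sketch below (Python, used here purely for illustration) looks up the code points and official character names for the two examples just given, schwa (0259) and the mwa syllabic (14B7):

```python
import unicodedata

# Every Unicode character has one permanent number (a code point),
# conventionally written in hexadecimal.
for ch in ["\u0259", "\u14B7"]:
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")

# The mapping is independent of any font: ord() recovers the same
# number on every platform and in every application.
print(ord("ə") == 0x0259)   # True
```

Whatever font eventually draws these characters, the numbers stored in a document never change.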
Often purported to cover all of the world's languages (though not by Unicode developers), some bias towards the world's majority languages nevertheless remains. All characters used in French and Hungarian (for example) have been uniquely encoded, whereas some characters for Guarani and Navajo (to name but two) need to be composed of two or more characters. A significant number of languages use glyphs unavailable in Unicode, and others still - such as Javanese, Batak and Mayan hieroglyphics - use scripts that are entirely absent. Through the introduction of standard character mapping, Unicode is intended to provide the same set of rules for all computer users, so that any time, anywhere, someone can look at a web page or read an email in any language without special downloads or fonts.6 At the time of this writing, most people are still using software which does not take full advantage of Unicode, so the legibility of transferred documents containing Unicode characters is still limited to those using compatible software.

5. Hexadecimal numbers use the digits 0-9 and the letters A-F (representing the numbers ten through fifteen).
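The contrast drawn above, between characters encoded as single units and those that must be assembled from a base letter plus combining diacritics, can be sketched as follows (Python; the nasalized high-tone vowel used is an illustrative choice, not an example from the paper):

```python
import unicodedata

# French é exists as one precomposed character...
precomposed = "\u00E9"          # é, LATIN SMALL LETTER E WITH ACUTE
# ...and can also be written as e + combining acute accent.
combined = "e\u0301"

# Normalization form NFC folds the two-character sequence into the
# single precomposed character, so both spellings compare equal.
print(unicodedata.normalize("NFC", combined) == precomposed)   # True

# A vowel such as ą́ (a + ogonek + acute) has no single precomposed
# character: even after NFC it remains a base letter plus a mark.
vowel = "a\u0328\u0301"
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", vowel)])
# ['U+0105', 'U+0301']
```

Languages whose orthographies fall on the "composed" side of this divide depend on font support for combining marks, which (as footnote 12 below discusses) was unreliable at the time.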

TYPING A NON-ROMAN ORTHOGRAPHY ON AN ENGLISH KEYBOARD

The big web browsers are now Unicode-friendly, so looking at other people's work does not have to be a big problem any longer. Creating documents for minority languages using common word-processors or desktop publishers is a much bigger issue. The vast majority of commercially available (physical) computer keyboards (at least in North America) use the standard Qwerty layout, and keys are not easily replaced. No word-processor to my knowledge permits users to directly key in Unicode characters that do not belong to the active operating-system language. Hexadecimal coding renders the previous method of memorising number sequences (such as é = alt-130) impracticable. Without specific keyboard software, one is relegated to time-consuming and awkward "search-and-replace" or "insert character" methods.

ROMAN ORTHOGRAPHY

As suggested above, languages which require only a few special characters can be accommodated by modifying a font, replacing a few standard ASCII characters with those from the language's orthography. For example, to write Meskwaki, only three new characters are required: ⟨č⟩, ⟨š⟩, and the long-vowel mid-dot ⟨·⟩.7 In a modified font, these characters

might, for example, replace the c-, z-, and semi-colon characters, respectively. This solution raises many of the same concerns discussed above, such as non-transferability (the new characters would be available only to those who shared a copy of the modified font), and the need to switch fonts every time a non-Native word (or, at least, a c, z or semi-colon) appears in a text.

Unicode, however, demands that a character have only one mapping, one number. Keyboard software is used to conveniently access the required Unicode glyphs. As in the previous example, the c-, z-, and semi-colon keys could still be used to type the characters ⟨č⟩, ⟨š⟩ and ⟨·⟩, but the unambiguous Unicode values for these characters would be stored as part of the document. The result is illustrated in fig. 2, where the Meskwaki sentence (from Dahlstrom 2003:4) was typed with a specially designed software keyboard8 and printed in three different Unicode fonts (Arial, Times New Roman, and Tahoma), without change to the characters. If an ASCII-based font were used instead, each of the non-standard characters would revert to its original English keyboard value.

Figure 2. A Meskwaki example in three different Unicode fonts
a·kwi na·hkači ni·na nešihka ota·hi·nemiya·nini ki·na e·ye·ki (Arial)
a·kwi na·hkači ni·na nešihka ota·hi·nemiya·nini ki·na e·ye·ki (Times New Roman)
a·kwi na·hkači ni·na nešihka ota·hi·nemiya·nini ki·na e·ye·ki (Tahoma)

6. Of course, this assumes that standard fonts are expanded to include a fuller range of characters. There is no single font that contains all 65,000+ Unicode characters. A font containing the requisite characters must still be available for screen display or printing, but, with standardized character mapping properly implemented, any font that contains those characters may be used.
7. Meskwaki orthography used here is that of Dahlstrom (2003).
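The keyboard-software behaviour just described, where ordinary keys emit unambiguous Unicode values rather than font-dependent substitutes, can be approximated with a simple transliteration table. This is a toy sketch in Python, assuming the c/z/semicolon key assignments suggested above; real keyboard software such as Keyman works at the operating-system level:

```python
# Hypothetical Meskwaki key layer: each QWERTY key emits a genuine
# Unicode character, not a glyph borrowed from another code point.
MESKWAKI_KEYS = str.maketrans({
    "c": "\u010D",   # č  LATIN SMALL LETTER C WITH CARON
    "z": "\u0161",   # š  LATIN SMALL LETTER S WITH CARON
    ";": "\u00B7",   # ·  MIDDLE DOT (long-vowel mark)
})

# Keystrokes as struck on the physical keyboard:
print("cz;".translate(MESKWAKI_KEYS))   # čš·
```

Because the output characters carry their own Unicode numbers, the resulting text survives a change of font, unlike the modified-font approach.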

This comparatively basic example can be extended to more orthographically complex situations, such as the historical records Rhodes (2004) presents for Skugog Mississaga. Some of the non-standard characters include macrons (ā), breves (ă) and others. Combining some diacritics is also possible. Each of these characters has been given a unique Unicode number, and will display in any (properly encoded) font that contains them.

8. This example was typed on a beta version of "English Extended", a software keyboard which should be available at the author's website (languagegeek.com) in the near future. Keyboards for specific languages (Meskwaki) are also downloadable. The software keyboards discussed here were all created using the keyboard design software available as a free download at www.tavultesoft.com.

SYLLABICS

Syllabics offers a range of issues relating to keyboards and keyboard design unique to North American Native languages. Without special software - such as Tavultesoft's Keyman - it is impractical to type in syllabics even if the font is ASCII.

Unicode has been fairly generous with regard to Native languages using syllabics, although only the character sets for Inuktitut and Oji-Cree are completely covered. Fig. 3 illustrates characters missing from Unicode's syllabics range that prevent the typing of some Algonquian languages without an additional extended font.9

Figure 3. Characters for Algonquian languages not in Unicode (glyphs not reproducible here)
Moose Cree: ring-diacritic syllabics; one final
Ojibway: i-series finals
Blackfoot: w-onset

For other dialects of Cree or Ojibwe, using Unicode character values means that the same character string - for example, ᑭᓀᐦᐃᔮᐏᐏᓂᓇᐤ ᓀᐦᐃᔭᐍᐏᐣ, the syllabic version of kinêhiyâwiwininaw nêhiyawêwin / The Cree language is our identity (Wolfart & Ahenakew 1993) - will print properly in any Unicode font that includes these characters. This sort of transferability is simply impossible using ASCII fonts.
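This transferability claim can be made concrete: a Unicode string is stored as one and the same sequence of bytes everywhere, regardless of which font eventually draws it. A brief check in Python, using the mwa syllabic cited earlier at 14B7:

```python
syllabic = "\u14B7"   # ᒷ mwa, the code point cited in the text

# UTF-8 is a standard byte encoding of Unicode code points.
# These three bytes are identical on every platform, in every font.
print(syllabic.encode("utf-8"))   # b'\xe1\x92\xb7'

# Decoding the bytes anywhere recovers exactly the same character.
assert syllabic.encode("utf-8").decode("utf-8") == syllabic
```

With an ASCII font hack, by contrast, the stored byte would be whatever English character the glyph happened to replace, and the text would be unrecoverable without that exact font.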

KEYBOARD DESIGN AND LAYOUT

At this point, it should be clear that, for most public purposes, Unicode fonts and special keyboard software are necessary for reading and typing in any language with unique characters, especially one which uses a syllabics-based orthography. The matter of how the keyboard ought to be designed is contentious. Emails I have received at languagegeek.com indicate a number of disagreements regarding keyboard design. People seem wedded to their accustomed layouts, and no matter how unreasonable the design, they prefer their own. The situation is exactly analogous

to that of a proficient Qwerty typist trying to use the Dvorak keyboard, no mean feat. Here again we see the strength of standard character mapping and flexible keyboard layout. Although a specially designed (software) keyboard layout is required for input, so long as special glyphs are stored in the document with unique values, other end-users can (at least hypothetically) work with those documents using whatever software keyboard they prefer.

9. There are even more omissions for Dene languages. We can only hope the Unicode Consortium will add these characters in the near future.

UNICODE LIMITATIONS

This paper has so far praised the utility of Unicode, and it is most certainly the future of representing Native languages on the web. However, there are a number of drawbacks which must be taken into account in switching software and documents to Unicode.

At the moment of writing, there is not a wide range of software available which takes advantage of Unicode. Those using WordPerfect, PageMaker or Windows 9X/ME are out of luck. Although some software claims to be Unicode-compatible, attempts at keyboard entry of characters often result in printed garbage. Even Microsoft Word has problems: introducing characters from many Unicode ranges causes Word to automatically switch languages into Chinese.10

As mentioned above, Unicode lacks many roman-orthography and syllabics glyphs required for correctly typing Native languages. By definition, Unicode mapping is carved in stone, so there is no way to change any of the indices. This makes sense because, warts and all, the system must be backwardly compatible and consistent, or the days of non-transferability would quickly return. Missing characters may be added to the list via a formal procedure at the Unicode Consortium.11

Sorting is another issue that Unicode can complicate. Recall that the Unicode order is fixed, so that "D" always follows "C" in the default sort order. Some older software permitted the modification of sort order, such

that a few changes in a text file gave the user some degree of control over what is now governed by an operating-system locale. I have learned that Microsoft will not be opening up its OS to locale additions, so setting up an entire system to work in a minority language is not going to be possible with their software. A quick review of the charts (cf. Appendix) shows that the characters are certainly not in an alphabetical order acceptable to all languages - an impossibility in any case. Homemade ASCII fonts were malleable enough for the designer to build the correct sorting order into the font, but this is no longer possible. I am currently designing a sorting program for the languages with which I am working, and hope to have a solution soon. Finally, the gaps in the Unicode character set need to be filled with uniquely numbered characters, lest the old issue of font incompatibility be re-introduced.12

10. Thus, in producing this paper, typing characters in the "Spacing Modifier Letters" range - mostly diacritics - almost always switched my default font to Chinese. Since this range is not included in the Chinese font, the resulting characters are empty boxes. I have reported this to Microsoft and several web fora, and can only hope Microsoft, Corel, Adobe, etc. update their software for the next release.
11. I hope to undertake this procedure to amend the syllabics range, as I have already done for four letters of Saanich-Salish (SENĆOŦEN).
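A sorting program of the kind mentioned can be sketched in a few lines: instead of relying on the fixed code-point order or an operating-system locale, a key function ranks characters by a language-specific alphabet. The ordering below is a hypothetical example for a roman Cree orthography, not a published standard:

```python
# Hypothetical sort order: â directly after a, î after i, ô after o.
ALPHABET = "aâcêhiîkmnoôpstwy"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def cree_key(word):
    # Characters outside the alphabet sort after all known ones.
    return [RANK.get(ch, len(ALPHABET)) for ch in word]

words = ["sîsîp", "atim", "âmow"]
print(sorted(words))                  # code-point order: â sorts after s
print(sorted(words, key=cree_key))    # ['atim', 'âmow', 'sîsîp']
```

The default sort puts âmow last because U+00E2 numerically follows every unaccented letter; the custom key restores the order a reader of the orthography would expect.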

LINGUISTICALLY-SENSITIVE SOFTWARE USES Some may say, "My language doesn't use any unusual letters, why should I care about any of this?" Regardless, the Native languages that do use different orthographies should be respected. In bibliographies, Russian and Greek titles and authors are often written in the appropriate orthogra­ phy, why not the same for syllabics? Broader availability of syllabic char­ acters through Unicode will make this kind of use much more feasible. There is a trend for official place names to revert back to their original appellations in the Native language, such as: Deninu Kfu (Fort Resolu­ tion) in the NWT, and Asb ( (Iqaluit, formerly Frobisher Bay) in Nunavut. There is no longer a technical reason why these place names cannot be used instead of, or alongside, the English or French name. In virtually all journals and other works in linguistics, examples from

12. The Unicode Consortium suggests that certain non-European accented characters be composed from a base glyph and a combining diacritical mark. Unfortunately, in current font technology, there is no way for the non-spacing accent to differentiate between majuscule and minuscule letters. Consequently these composed characters either end up with the diacritic too high above the minuscule or superimposed onto the majuscule. Also, composed glyphs often do not space properly in relation to surrounding characters. For my own use, and that of others who wish to download it, I have developed the font Aboriginal Serif, attempting to include - as single, pre-composed characters - all characters required for North American Native languages. This font may be accessed at languagegeek.com.

European languages are almost always presented in their standard orthography. So a French example from a linguistics paper would be given as Le garçon aime le chien, not in its phonetic equivalent; yet many linguistics papers do not treat minority languages the same way, replacing the official orthography (where it exists) with IPA or some other linguistic transcription. More generalized use of Unicode will enable writers to include the standard orthography in their examples, if they choose, without fear that some characters will not display properly.

Modern computer technology can help to level the playing field for minority languages the world over by making even uncommon characters available to all. Communities of speakers and linguists can then decide for themselves how and where to take advantage of this technology. Whether it uses a complete non-Western script, or just a few uncommon glyphs, giving an Aboriginal language a presence on the web can provide an active means to use the language on a daily basis, help ease communications between distant speakers, and promote awareness and research in the language. It is also a psychological boost for the community of speakers, showing for all to see that their language can thrive, on equal terms with the world's more dominant languages: a language that is viable on computers and the internet is, by extension, a viable means of communication in the modern world.

REFERENCES

Ahenakew, Freda, & H.C. Wolfart, eds. & trs. 1993. kinêhiyâwiwininaw nêhiyawêwin / ᑭᓀᐦᐃᔮᐏᐏᓂᓇᐤ ᓀᐦᐃᔭᐍᐏᐣ / The Cree language is our identity: The La Ronge lectures of Sarah Whitecalf. Publications of the Algonquian Text Society. Winnipeg: University of Manitoba Press.
Canadian Aboriginal Syllabics Encoding Committee. www.vaxxine.com/vermeulen/default.eht
Dahlstrom, Amy. 2003. Sentence-focus in Meskwaki. Paper read at the 35th Algonquian Conference, London, Ontario.
Galois, Robert. 1994. Kwakwaka'wakw settlements, 1775-1920: A geographical analysis and gazetteer, with contributions by Jay Powell and Gloria Cranmer Webster (on behalf of the U'mista Cultural Centre, Alert Bay, British Columbia). Vancouver: University of British Columbia Press.
Miller, Wick. 1972. Newe natekwinappeh: Shoshoni stories and dictionary. University of Utah Anthropological Papers 94.
Poser, William J. 2000. Yinka Dene Language Institute Technical Report 1, Vanderhoof, British Columbia. http://www.ydli.org/products/techreps.htm
Poser, William J. 2003. Dʌlk'wahke: The first Carrier writing system. MS.

Rhodes, Richard A. 2004. Alexander Francis Chamberlain and The Language of the Mississaga Indians of Skugog. Papers of the 35th Algonquian Conference, ed. H.C. Wolfart, pp. 363-372. Winnipeg: University of Manitoba.
Sugarhead, Cecilia. 1996. Ninoontaan / I can hear it: Ojibwe stories from Lansdowne House. Edited, translated and with a glossary by John O'Meara. Algonquian and Iroquoian Linguistics Memoir 14.

APPENDICES:

Unicode charts

Following are several examples of parts of Unicode ranges. Each range consists of the alphanumeric and other characters of one orthography and is displayed in table format. The four-digit number below each character is the unique encoding number.

1400 Unified Canadian Aboriginal Syllabics 14DF

[Chart excerpt not reproduced: rows of syllabic characters for columns 140-14D, each character shown above its four-digit Unicode number.]
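Charts of this kind can be regenerated from the character database itself. The sketch below (Python) prints the first cells of the Unified Canadian Aboriginal Syllabics range, pairing each character with the four-digit number that appears beneath it in the printed chart:

```python
import unicodedata

# First sixteen cells of the UCAS range (U+1400-U+140F).
for cp in range(0x1400, 0x1410):
    ch = chr(cp)
    print(f"{ch}  {cp:04X}  {unicodedata.name(ch, '(unnamed)')}")
```

Whether the characters themselves display correctly depends, as always, on having a font that covers the range; the numbers and names are font-independent.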

0250 IPA Extensions 02AF

[Chart excerpt not reproduced: IPA characters for columns 025-02A, each character shown above its four-digit Unicode number.]

Keyboard mappings

Languages that require characters that are beyond the range of the ordinary QWERTY keyboard may require their own distinct keyboard from which the appropriate Unicode characters can be accessed. Cheyenne is one such example.

Cheyenne keyboard: [layout image not reproduced]

The Moose Cree Unicode keyboard shown here is based on the assumption that syllabic characters are made of component parts, and facilitates typing in syllabics by QWERTY typists. For example, to generate Unicode character 1441 (pwa), one must first type the p-final (located on the p key of the standard QWERTY layout), then the corresponding vowel (QWERTY a), then the w-dot (QWERTY w).

Moose Cree Unicode keyboard

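The compositional input method described above, where a sequence of simple keystrokes produces a single syllabic character, can be modelled as a lookup from keystroke sequences to code points. The fragment below is a minimal sketch in Python using the mwa character the paper cites at 14B7; the actual Moose Cree keyboard's tables are far larger and are implemented in keyboard software, not application code:

```python
# Hypothetical fragment of a compositional syllabics keyboard:
# final + vowel + w-dot keystrokes combine into one character.
COMPOSE = {
    ("m", "a", "w"): "\u14B7",   # ᒷ mwa, the code point cited in the text
}

def compose(keys):
    # Emit the composed syllabic if the sequence is known;
    # otherwise pass the keystrokes through unchanged.
    return COMPOSE.get(tuple(keys), "".join(keys))

print(f"U+{ord(compose(['m', 'a', 'w'])):04X}")   # U+14B7
```

The essential point is that however the keystrokes are grouped, what is stored in the document is the single, unambiguous Unicode value.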