Frequently Asked Questions and Selected Resources on Cyrillic Multilingual Computing

Frequently Asked Questions and Selected Resources on Cyrillic Multilingual Computing Kevin S. Hawkins SUMMARY. The persistence of multiple standards, both de jure and de facto, for handling text on the computer is perhaps the most perplexing problem facing those who wish to use a computer in more than one language, or even to provide access to data across more than one operating system and application platform. Using Cyrillic as an example, this article attempts to answer questions commonly asked by non-expert computer users who wish to work in more than one language. [Article copies available for a fee from The Haworth Document Delivery Service: 1-800-HAWORTH. E-mail address: <[email protected]> Website: <http://www.HaworthPress.com> © 2005 by The Haworth Press, Inc. All rights reserved.] Kevin S. Hawkins, MS, is Electronic Publishing Librarian, Scholarly Publishing Office, University Library, University of Michigan. Address correspondence to: Kevin S. Hawkins, 300 Hatcher Graduate Library North, 920 North University Avenue, Ann Arbor, MI 48109-1205 USA (E-mail: kshawkin@ umich.edu). The author wishes to thank David Dubin and the other members of the GSLIS Re- search Writing Group at the University of Illinois at Urbana-Champaign for their com- prehensive advice, and Troy Williams, Benjamin Rifkin, Michael Brewer, David Dubin, Tatiana Poliakevitch, Irina Roskin, Peter Houtzagers, and Uladzimir Katkouski for their specific corrections and suggestions. [Haworth co-indexing entry note]: “Frequently Asked Questions and Selected Resources on Cyrillic Multilingual Computing.” Hawkins, Kevin S. Co-published simultaneously in Slavic & East European Infor- mation Resources (The Haworth Information Press, an imprint of The Haworth Press, Inc.) Vol. 6, No. 2/3, 2005, pp. 3-21; and: Virtual Slavica: Digital Libraries, Digital Archives (ed: Michael Neubert) The Haworth Information Press, an imprint of The Haworth Press, Inc., 2005, pp. 3-21. Single or multiple copies of this article are available for a fee from The Haworth Document Delivery Service [1-800-HAWORTH, 9:00 a.m. - 5:00 p.m. (EST). E-mail address: [email protected]]. Available online at http://www.haworthpress.com/web/SEEIR © 2005 by The Haworth Press, Inc. All rights reserved. doi:10.1300/J167v06n02_02 3 4 VIRTUAL SLAVICA: DIGITAL LIBRARIES, DIGITAL ARCHIVES KEYWORDS. Character encoding, character sets, keyboard layouts, multilingual computing, fonts, computers, Cyrillic INTRODUCTION Nearly everyone who has used a computer has experienced what are often called “font problems” or “encoding problems”: letters or punctuation marks outside of the very basic ones used in English, which show up on screen or in print as something other than what was originally in- tended. While in some cases, such as jumbled punctuation marks, the reader can “read around” these characters, when using non-Latin writing systems it becomes impossible to use a document without decod- ing it. While a scrambled document can sometimes be decoded simply by using a different character encoding scheme (character encoding or character set) or a different font, the solution is often not so simple. The problem might involve settings in software used to create, copy, view, or deliver the file, or a combination of these. Information providers, in- cluding librarians, have the responsibility to use settings in their own software properly and make choices that will guarantee wider access to their materials. Furthermore, while settings in the user’s software are out of the control of information providers, steps can be taken to mini- mize the likelihood that these settings will interfere with properly read- ing the document. We have so many different standards, both de jure (official) and de facto (unofficial), because multilingual computing, like most technolo- gies, caught on faster than a standard could be agreed upon, especially since dominant commercial and political interests tend to oppose standardization initially in order to preserve market share, sphere of influ- ence, or ideology. The proliferation of conflicting, inferior standards is the result of a history we have not yet succeeded in fully escaping, but all stakeholders, even dominant ones, now agree that it is in everyone’s interest to escape this history. It is so complicated to store and enter characters used in the writing systems of natural languages because there are many software layers and levels of abstraction involved, all of which are easily confused with one another. Terminology is inconsistent, with some terms having common usage but a different technical meaning. I will address common points of confusion in multilingual computing, using Cyrillic as an example. However, everything that I say ap- Kevin S. Hawkins 5 plies to those using non-U.S. computers, those writing in languages in the Latin alphabet other than English, and those using other scripts (writing systems)1 on the computer besides Cyrillic. I gloss over distinc- tions when I believe that they are insignificant to users of contemporary operating systems and software: my loose use of terminology, guaran- teed to offend experts in text processing, is designed to make this very complex topic accessible to the lay audience. I hope that my conversa- tional style and use of the frequently-asked-questions (FAQ) format, while unorthodox for a scholarly journal, will make this very difficult topic easier to understand. Following the FAQ is a short bibliography of resources on multilingual computing. IMPORTANT POINTS OF CLARIFICATION There is one terminological and ontological distinction that must be explained up front: the difference between a character and a glyph. Characters are basic, abstract units in a writing system and are not to be confused with their particular appearance in any font (called a glyph).2 The Latin character B is not the same as the Cyrillic character V, despite appearances, and the Latin character M is not the same as the Cyrillic character M even though the latter pair are used for essentially the same sound. A character is still the same character no matter what glyph you use: whether you write it in a serif font or a sans serif font, or whether you write it in italics or boldface or large or small. Storing text as characters rather than glyphs enables searching within and across files, where the particulars of appearance are rarely significant.3 Characters can be correlated when helpful in searching–for example, capital and lowercase letters are correlated at the operating system level (and in many programs), whereas it is not helpful to correlate Latin M and Cy- rillic M. Which languages are considered for the purposes of character sets to use the same writing system has the potential to be a contentious issue, but such decisions are increasingly made by experts who strive to balance linguistic and pragmatic needs. French B and English B are considered the same character, as are Russian G and Ukrainian H. That is, the notion of character is writing-system dependent, not language-dependent. Another common point of confusion among computer users is over whether it is possible in creating a digital document of any sort to guarantee that the reader will see it exactly as the creator does. While the ex- act appearance of a document can be preserved in PDF format,4 it is 6 VIRTUAL SLAVICA: DIGITAL LIBRARIES, DIGITAL ARCHIVES impossible to preserve it in HTML and even word processing formats since many of these details change as you view the file. There are non-standard and deprecated practices in HTML for specifying and even embedding fonts, but there is no way to be sure that the document will show up exactly as you see it when viewed by another person, who may have a different operating system, collection of fonts, web browser, and browser settings. The appearance of word processor documents is subject to the user’s settings in the word processor, the fonts available, and the printer driver. So you should never create a document whose proper appearance depends on text appearing at a certain location on the page or appearing in a certain font or size. FREQUENTLY ASKED QUESTIONS 1. I want to read text written in Cyrillic. Do I need to download a special font? What if I know I can view text in Cyrillic already but the website claims I need a special font? Do I really? Until the mid-1990s special fonts were needed to work in “exotic” languages on major operating systems distributed in the U.S. Since then all major operating systems come with some built-in or freely available capacity for displaying documents in Cyrillic. At first this involved add- ing optional components–especially fonts and keyboard layouts–to work in exotic languages. These fonts often had “Cyrillic” or “Cyr” ap- pended to the name but otherwise matched pre-installed fonts in name and appearance. Since then, due to standardization on Unicode,5 any given font usually has Cyrillic “built in” to it, so separate fonts for different scripts are no longer needed and one web page can now contain characters from more than one non-Latin script. In today’s operating systems the fonts and keyboard layouts are installed by default for many languages and need only be activated, or at least they are included on the installation disks. However, sometimes websites providing textual resources in Cyrillic will claim that a certain font is necessary to view the text properly. While this is never strictly true, there are three reasons why it can be practically true. First, if the resource was created using a custom font with a non-standard font encoding,6 such as those that allow you to type in Cyrillic by changing fonts but not changing keyboard layouts, you will need to use some font with the same font encoding, and chances are Kevin S.

Frequently Asked Questions and Selected Resources on Cyrillic Multilingual Computing

Sample Chapter 3

I18n, M17n, Unicode, and All That

How to Access Alternate Glyphs

Multiple-Policy Character Annotation Based on CHISE

XML in the Development of Component Systems

Replacement Draft for TEI P5/CH

Unicode Introduction Ken Zook October 20, 2008 Contents 1 Unicode Introduction

Luatex Glyph Chart for Stix2math.Otf Some Background Information: Unicode Ranges

Data-Centric XML

The World in 30 Minutes

Ios Internationalization: the Complete Guide Addison-Wesley Mobile Programming Series

Unicodestandard the UNICODE CONSORTIUM .4Vaddison