UNRESTRICTED MULTILINGUAL SUPPORT BUILT ON

HISTORICAL OPERATING SYSTEM

A thesis submitted in partial fulfilment of the

requirements for the award of the degree

Master of Science

(18 Credit project)

from

UNIVERSITY OF NEW SOUTH WALES

by

Zheng-Yu JU

School of Computer Science & Engineering

February 1996

CERTIFICATE OF ORIGINALITY

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to substantial extent has been accepted for the award of any other degree or diploma of a university or other institute of higher learning, except where due acknowledgement is made in the text.

I also declare that the intellectual content of this thesis is the product of my own work, even though I may have received assistance from others on style, presentation and language expression.

ABSTRACT

The Unicode standard encoding makes developing unrestricted multilingual applications possible. In this thesis we explain the current situation and discuss the encoding standard of the future, and its ASCII-compatible variant, in detail. We also explain how the current implementation of the X Window System supports internationalization and describe how this windowing system can be extended, using the Unicode standard encoding, to provide unrestricted multilingual support based on a traditional operating system.

ACKNOWLEDGMENTS

Firstly, I would like to express my gratitude and special thanks to my supervisor Associate Professor John Lions for his guidance, support and valuable suggestions in the research and development of this thesis.

Secondly, I would like to sincerely thank my supervisor Dr John Zic for his guidance, encouragement and in depth review of my thesis. His comments have significantly improved it. I would also like to thank him for his patience and time on occasions beyond normal hours.

Thirdly, I would like to thank Mr. Raymond, and CSG officers for the help they have generously provided.

I would also like to acknowledge and thank the academic and support staff of the School of Computer Science and Engineering, University of New South Wales, who have created one of the most pleasant study and research environments in computing that I have encountered.

Finally, thanks to my wife for her full support and patience, my mother-in-law for taking care of my newborn daughter, and other family members for their support.

TABLE OF CONTENTS

Chapter 1 Introduction

1.1 Encoding methods

1.2 Objective of the thesis

1.3 Overview of the thesis

Chapter 2 Character Sets

2.1 Introduction

2.2 Current Situation

2.2.1 ASCII, ISO 646, NRCS

2.2.2 ISO 8859

2.2.3 Han (Chinese/Japanese/Korean) characters

2.2.4 Character set switching

2.3 Towards a universal coded character set

2.3.1 How many characters are there?

2.3.2 Writing Systems

2.3.3 What is a Character?

2.3.4 Character and Glyph

2.3.5 Keysym

2.4 Standard character sets of the future

2.4.1 ISO/IEC DIS 10646

2.4.2 Unicode Standard

2.4.3 Goal of the Unicode

2.4.4 Conformance

2.4.5 Coverage

2.4.6 Unification of Unicode and 10646

2.4.7 Code structure of the 10646

2.4.8 Unicode Standard Codepoints Assignment

2.5 UTF-8

2.6 UTF-7

Chapter 3 Internationalization and Multilingual Support in X

3.1 Introduction

3.2 Background

3.3 Internationalisation In X

3.3.1 Internationalisation with ANSI C

3.3.2 Text Representation

3.3.3 ISO 8859-1 and Other Encodings

3.3.4 Multi-byte and Wide-Character Strings

3.3.5 Locale Management

3.3.6 Internationalised Text Output

3.3.7 String Encoding for Internationalisation

3.3.8 Internationalised Interclient Communication

3.3.9 Localisation of Resource Database

3.4 Multilingual support in X

3.4.2 Extending X

3.4.3 Font

3.4.3.1 Homogenised

3.4.3.2 Harmonised typeface

3.4.3.3 One big font vs. many little fonts

3.4.4 Multilingual Text Output

3.4.5 Multilingual Interclient Communication

3.4.6 Multilingual Resource Database

Chapter 4 Text Input

4.1 Introduction

4.2 Background

4.3

4.4 Architecture

4.4.1 Client/Server Model vs. Library Model

4.5 Connect to Input Method

4.6 Input Context

4.6.1 Input Context Focus Management

4.6.2 Preedit and Status Area Geometry Management

4.6.3 Preedit and Status Callbacks

4.6.4 Getting Composed Input

4.6.5 Event Handling

4.6.5.1 BackEnd Method

4.6.5.2 FrontEnd Method

4.7 Layering of IM

4.8 Multilingual Text Input

4.8.1 Achieving the Goal

4.9 Application Programming

4.9.1 Programming based on Xlib or higher level toolkits

4.10 Miscellaneous

Chapter 5 Validation and Conclusion

5.1 Introduction

5.2 What have been achieved

5.3 Future research directions

5.4 Problem cannot be solved

5.5 Conclusion

References

Character Set Standards

Language and Writing System

Asian Language Input

Internationalisation

X Window System

Appendix A

Important Data Structures

1 INTRODUCTION

1.1 Encoding methods

Modern computer systems have their origins in the United States and Great Britain; it is certainly natural, then, that the early systems supported a single language: English.

About 45 years after the invention of the first electronic computer, hardware is so advanced that even a desktop personal computer has enough power to support multilingual systems or applications. But developing such systems or applications on top of existing character encoding standards is very difficult and may never truly succeed.

Today, most localised or internationalised systems support a restricted number of languages. In order to represent various restricted collections of languages, different character sets have been developed according to the language chosen. See Figure 1-1. This has resulted in an expanding profusion of character sets, each limited to one or a small number of languages. It is possible to choose any character set for representing English. However, other languages have far fewer choices. Some languages do not even have a character set that adequately represents them.

The problem becomes more complex when a number of languages are to be supported at once; if one wished to support English, French, and Arabic at the same time, then none of the character sets can represent all three at once.

Adopting a universal character encoding scheme such as Unicode will allow representation of any of the world's written characters. Unicode is discussed in a later chapter.

Figure 1-1 Choose a Character Set (character sets shown: ASCII, EBCDIC, JISX208, GB2312, PC CP850, ISO 8859-1, ASMO449, ISO 8859-6)

The conventional approach to representing text of various scripts in most systems is to switch among character sets, for example, to switch from Western Europe (ISO 8859-1) to Hebrew (ISO 8859-8). These characters are encoded in 8-bit bytes: the code values 0 through 127 (decimal) have the same encoding as the ASCII standard, and the values 128 through 255 are assigned to "extended characters" that vary by character set. Figure 1-2 illustrates two different character sets. One character set can represent only one language or a small class of languages, so the user or application often must switch character sets to switch languages.

Figure 1-2 8-bit Character Sets (two panels, the ISO 8859-1 character set and the ISO 8859-8 Hebrew character set, each consisting of ASCII followed by extended characters)
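The split between a shared ASCII half and a set-specific extended half can be checked with a minimal Python sketch. It assumes only Python's built-in `iso8859_1` and `iso8859_8` codecs; the two sample bytes are chosen for illustration and are not taken from the figure.

```python
# Decode the same two bytes under two different 8-bit character sets.
data = bytes([0x41, 0xE0])  # 0x41 is in the ASCII half, 0xE0 in the extended half

as_latin1 = data.decode("iso8859_1")   # Western Europe
as_hebrew = data.decode("iso8859_8")   # Hebrew

# The ASCII half is identical in both sets; the extended half is not.
same_ascii = as_latin1[0] == as_hebrew[0]
same_extended = as_latin1[1] == as_hebrew[1]
print(repr(as_latin1), repr(as_hebrew), same_ascii, same_extended)
```

The byte value alone therefore does not identify a character; the reader must also know which character set is in force.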

East Asian languages (Chinese, Japanese and Korean) are represented with "double byte" character sets, such as GB2312 [ken 93]. These are still represented internally as strings of bytes, with two bytes per character.

Japanese text [ken 93], for example, commonly uses (at least within the computer industry) words written in the Latin alphabet along with phonetic characters from the katakana and hiragana alphabets and ideographic characters. As a result, the application developer must parse strings of mixed single-byte and double-byte characters (DBCS), such as ASCII and Shift JIS, by detecting the "lead byte" and "trailing byte" of each DBCS character to determine character boundaries.
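The lead-byte test can be sketched as follows. This is a minimal illustration, not code from the thesis: the function names are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the commonly documented Shift JIS ranges, stated here as an assumption.

```python
def is_lead_byte(b):
    """True if byte value b starts a two-byte Shift JIS character (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_characters(data):
    """Split a Shift JIS byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        width = 2 if is_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


# A Latin letter, a hiragana character and a katakana character.
sample = "X\u3042\u30a2".encode("shift_jis")
pieces = split_characters(sample)
print([p.hex() for p in pieces])
```

The trailing byte of the katakana character here is 0x41, the same value as ASCII "A", which is why a program cannot find character boundaries by looking at single bytes in isolation.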

These approaches were adequate when software was basically mono-lingual.

Today these means of supporting various languages are viewed by many to be inadequate for the following reasons.

1. The mixing of character sets and of character widths complicates processing, making development take more time. For example, application developers working with Japanese text can no longer use a simple increment or decrement operation to move a pointer to the next or previous character, and they need to find out which character set each character is encoded from: katakana, hiragana or kanji?

2. Performance often suffers due to character set or font switching, and double-byte processing. For example, Han ideograph fonts are very large compared to ASCII fonts: an 8x16 bitmap font needs 21320 bytes of memory to hold it, while a 16x16 (gb2312) bitmap font needs 725480 bytes. A common method is to load the font (or most of it) from disk every time it is needed, and disk operations are very slow compared to memory operations. Also, in gb2312 encoding, the most significant bit of each byte of a two-byte character needs to be set to zero before the character can be used for text rendering in some systems (the X Window System, for example).

3. Both the system and the applications are monolingual or at most bilingual. For example, if the system or application supports only gb2312, then it can only support English and Chinese.
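The bit-clearing step mentioned in item 2 can be sketched in a few lines of Python. The sketch assumes the text is stored in the form produced by Python's `gb2312` codec, in which both bytes of a character have the most significant bit set; the helper name `clear_high_bits` is hypothetical.

```python
def clear_high_bits(pair):
    """Clear the most significant bit of each byte of a two-byte GB2312 character."""
    return bytes(b & 0x7F for b in pair)


stored = "\u554a".encode("gb2312")       # one Chinese character as stored in a file
glyph_index = clear_high_bits(stored)    # 7-bit values usable as a font index
print(stored.hex(), glyph_index.hex())
```

The operation is cheap for one character, but it has to be repeated for every character of every string that is drawn.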

In this project, we used the Unicode encoding standard to address the problem. Unicode is the existing standard that supports the largest number of scripts in a plain text format. This has an enormous impact on building truly internationalised software [Ngair 94]. In particular, a Unicode program automatically supports a large number of useful scripts and therefore requires minimal localisation effort to customise it for different countries.

Moreover, using a single Unicode coding scheme will greatly enhance the interoperability of software in terms of data transfer. Because Unicode supports most of the useful scripts in a plain text format, it eliminates the need to support multiple coding schemes and fonts (the character set or font switching discussed in the previous paragraphs) in a multilingual program. This greatly simplifies the design of a multilingual system.

In contrast with the existing model of internationalisation/localisation (I18N/L10N), which is based on the concept of locale, Unicode software depends less on operating system support. A Unicode program therefore need not wait for the operating system to support a language before it can support that language, and each new release of the program needs less testing in different language environments. A further problem for the locale style of creating internationalised software is the absence of a single accepted standard for both PC and workstation platforms. This means that Unicode software potentially has far fewer portability problems across different platforms.

There are operating systems that already use the Unicode encoding standard: one is Plan 9 and the other is Windows NT. Plan 9 [Pike 90] was designed and implemented by Rob Pike and Ken Thompson of AT&T Bell Laboratories. It is a distributed operating system that uses an ASCII-compatible variant of the Unicode standard encoding, UTF-8 (a file-system-safe UCS Transformation Format). This system addresses the problem of representing multilingual text at all levels of an operating system, from the file system and kernel through the applications and up to the window system and display.

Since UTF-8 is the only format for text in Plan 9, the interface to the operating system had to be converted to UTF-8. Text strings cross the interface in several places: command arguments, file names, user names, error messages, and miscellaneous minor places such as commands to the I/O system. Library routines provided for the conversion made these changes straightforward.
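The two properties that make UTF-8 suitable at this interface, ASCII compatibility and file-system safety, can be checked with a short Python sketch. It uses only Python's built-in `utf-8` codec; the sample path is invented for illustration.

```python
text = "plan9/\u00e9/\u65e5\u672c"      # ASCII, an accented letter, two Han characters
encoded = text.encode("utf-8")

# ASCII characters keep their single-byte ASCII values.
ascii_part = "plan9/".encode("utf-8")

# Every byte of a multi-byte sequence has its high bit set, so byte values
# such as "/" (0x2F) or NUL never occur inside a non-ASCII character.
multibyte = "\u00e9\u65e5\u672c".encode("utf-8")
high_bit_everywhere = all(b >= 0x80 for b in multibyte)

print(encoded, high_bit_everywhere)
```

Code that splits a path on "/" or stops at a NUL byte therefore keeps working unchanged on UTF-8 file names.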

Plan 9 has a two-level font structure, which simultaneously breaks the huge Unicode repertoire into manageable components and provides a unifying architecture for assembling fonts from disjoint pieces. It also promotes sharing; for example, a user can load one set of Japanese characters but dozens of fonts for the Latin-1 characters. Customization is also easy. English-speaking users who do not need Japanese characters but may want to read an on-line Oxford English Dictionary can assemble a custom font with the Latin-1 characters and the International Phonetic Alphabet (IPA).

Plan 9 has implemented a simple input method, but this method is unsatisfactory when working in a non-English language, especially for ideographic languages such as Chinese or Japanese. Right-to-left text such as Hebrew or Arabic has not been addressed in Plan 9.
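A two-level font of this kind can be pictured as a table that maps code point ranges to subfonts. The following Python sketch is only an illustration of the idea: the table layout, the subfont names and the `find_subfont` helper are hypothetical and do not reproduce the Plan 9 font file format.

```python
# Top level: each entry covers a contiguous range of code points and names
# the subfont (second level) that holds the glyphs for that range.
FONT = [
    (0x0000, 0x00FF, "latin1.subfont"),
    (0x0250, 0x02AF, "ipa.subfont"),
    (0x3040, 0x309F, "hiragana.subfont"),
]


def find_subfont(font, code_point):
    """Return (subfont name, glyph offset within it), or None if not covered."""
    for low, high, name in font:
        if low <= code_point <= high:
            return name, code_point - low
    return None


print(find_subfont(FONT, ord("A")))
print(find_subfont(FONT, 0x0259))
print(find_subfont(FONT, 0x4E2D))
```

Swapping one entry of the table replaces the glyphs for one range only, which is what allows many Latin-1 fonts to share a single set of Japanese subfonts.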

Windows NT [William 94], like Plan 9, has addressed the problem of representing multilingual text at all levels of an operating system, from the file system and kernel through the application and up to the window system and display.

Unlike Plan 9, Windows NT also supports non-Unicode applications, by converting text strings from various international, national or industry standards to Unicode (not UTF) for internal processing and converting them back to the encoding that the application originally used. Communication between processes that use different encoding standards is likewise converted to Unicode first (this is handled automatically and internally).

Windows NT has been localised. Among the localised versions, the Japanese and Chinese versions have input methods that address the problem that Plan 9 has not been able to address.

Windows NT uses a single font instead of breaking it into subfonts. This improves performance and reduces complexity but requires a large amount of memory to hold the font bitmaps.
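The convert-in, convert-out pattern described here can be sketched with Python's built-in codecs. This illustrates the general technique of using Unicode as a pivot; it is not the Windows NT interface, and the helper name `via_unicode` is hypothetical.

```python
def via_unicode(data, source, target):
    """Convert bytes between two legacy encodings using Unicode as the pivot."""
    return data.decode(source).encode(target)


legacy = "\u4e2d\u6587".encode("gb2312")    # text as a GB2312 application holds it
internal = legacy.decode("gb2312")          # Unicode form used for processing
returned = internal.encode("gb2312")        # handed back in the caller's encoding

# A second process that works in Big5 receives the same text in its own encoding.
for_other_process = via_unicode(legacy, "gb2312", "big5")
print(returned == legacy, for_other_process.decode("big5") == internal)
```

With Unicode as the pivot, each encoding needs one converter to and from Unicode rather than one converter per pair of encodings.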

Windows NT not only implements the Unicode encoding but also addresses the backwards compatibility problem equally well (Plan 9 does not address this at all, since it does not need to). Like Windows NT, this thesis attempts to address the problem, that is, to provide a multilingual development environment while keeping backwards compatibility, on platforms (mainly Unix) other than that operating system.

After a carefully conducted survey, we found that we could achieve this goal by implementing the Unicode Standard in the X Window System.

The primary advantage of X is that it already supports 16-bit characters, and it is independent of character encoding.
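Xlib represents a 16-bit character as an `XChar2b` structure with two one-byte fields, `byte1` and `byte2`, which is the form taken by `XDrawString16`. The Python sketch below shows how a 16-bit code value splits into such a pair; the helper name `to_char2b` is hypothetical, and treating Unicode values this way is an assumption made for illustration rather than a description of the thesis implementation.

```python
def to_char2b(code_point):
    """Split a 16-bit code value into (byte1, byte2), most significant byte first."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    return code_point >> 8, code_point & 0xFF


# A Latin letter, a Hebrew letter and a Han character.
pairs = [to_char2b(ord(ch)) for ch in "A\u05d0\u4e2d"]
print(pairs)
```

Because the pair is only an index into a font, the same mechanism serves any 16-bit encoding, which is what makes X independent of the character encoding.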

Another advantage of X is that it has been internationalised since Release 5. Compared against the Universal Language Support (ULS) Technical Reference [John 94], X already implements most of the functions required for ULS, although the naming convention is different. The architecture of X allows for further expansion.

The third advantage of X is its near-universal adoption by all the major companies in the computer industry. You can get an X implementation for just about any UNIX workstation or PC.

In the X environment there is also a rich set of high-level development toolkits, Motif for example. These toolkits provide the programmer with an application framework. In short, the application programmer does not need to know the underlying mechanism, that is, how the system supports multiple languages and how to use the library functions to build a multilingual application; the only thing he or she needs to do is add a procedure call at the beginning of the program to internationalise the application. The text processing unit (the text widget in the Motif toolkit, for example) is capable of processing languages other than English (in practice, applications developed directly on Xlib are rare). This relieves programmers of tasks that they are unable, or find very difficult, to take care of (for lack of cultural background, for example).

1.2 Objective of the thesis

The aim is to investigate and demonstrate how a multilingual programming environment can be built on top of historical operating systems while keeping backwards compatibility, so as to protect previous investment and to provide a smooth transition from the various encoding standards towards a uniform Unicode standard encoding.

The objectives of the thesis are:

• to develop or modify various encoding converter functions.

• to develop or modify Application Programming Interfaces (APIs).

• to develop or modify font sets.

• to modify Input Method presentation layer.

• to integrate new or modified functions, APIs, fonts and Input Method into the X Window System Release 6 implementation.

1.3 Overview of this thesis

Chapter 2 presents the most common current standard character sets and then discusses the Unicode standard encoding in detail.

Chapter 3 presents the internationalisation issues in the X Window System and then discusses how to extend it so that the X Window System can provide unrestricted multilingual support while remaining backwards compatible.

Chapter 4 introduces the Input Methods in the current implementation of X, discusses the issues that prevent these Input Methods from being used in an unrestricted multilingual environment, and presents a solution to the problem.

Chapter 5 covers the validation of the results. It also discusses the problems left unsolved in this project, and finally gives the conclusion of the thesis.

The appendix provides the important data structures, which may help future users of this system. These structures cannot be found in any Xlib Reference Manual.

The source code is included on the attached floppy disk.

2 CHARACTER SETS

2.1 Introduction

There currently exist dozens of standard character sets (and many widely used sets that are not recognized standards, such as the IBM PC code pages), each dealing with a particular set of languages or applications. Researchers in fields that require additional characters have developed private encodings. The result is a profusion of incompatible standards. To ensure free circulation and easy exchange of information, it is essential that an internationally agreed standard character code be developed with the power to represent all human language. In this chapter, we briefly present the most common current standard character sets and then discuss the Unicode standard encoding in some detail.

2.2 Current Situation

2.2.1 ASCII, ISO 646, NRCS

The American Standard Code for Information Interchange (ASCII) is the only character set that just about everyone (except IBM mainframes) agrees on, and the only character set that can be safely transmitted everywhere on the Internet - or almost. Since its 7 bits encode only 128 characters, of which 33 are reserved (mostly wasted) for device control, the coding of even small numbers of diacritically marked characters for European languages can be done only by replacing some of the less used characters of ASCII. ISO 646 [ISO 646], the international standard for 7-bit character sets, defines an International Reference Version (IRV), which is ASCII, and twelve positions in which it is permitted to place alternate characters to create National Replacement Character Sets (NRCS). This system is inadequate and rarely used except in Scandinavia. Given a choice between having all the braces in a C program turn into accented letters, and writing French without its accents, most people have preferred the latter.
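The effect on a C program can be shown with a small Python sketch. The replacement table below is given as recalled for the French national variant of ISO 646 and should be treated as illustrative rather than authoritative; the name `FRENCH_NRCS` is hypothetical.

```python
# Code positions that ASCII uses for punctuation are reused for French
# letters and symbols (illustrative table, see the note above).
FRENCH_NRCS = str.maketrans({
    "#": "\u00a3", "@": "\u00e0", "[": "\u00b0", "\\": "\u00e7", "]": "\u00a7",
    "{": "\u00e9", "|": "\u00f9", "}": "\u00e8", "~": "\u00a8",
})

c_source = "int main() { return a[0] | 1; }"
as_displayed = c_source.translate(FRENCH_NRCS)
print(as_displayed)
```

The bytes of the program are unchanged; only their interpretation on a terminal using the national variant differs.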

2.2.2 ISO 8859 [ISO 8859]

Since characters are transmitted and stored on most computers in 8-bit bytes (octets), the obvious solution to the need for more codepoints is to make use of the unused 8th bit. In the absence of an appropriate standard, though, several manufacturers independently assigned meanings to the 128 new codepoints. Thus we now have to live with IBM PC code (in several versions), Apple Macintosh code, Hewlett-Packard Roman8, Adobe Standard Encoding, DEC Multinational Character Set, and others. Unix systems and Internet services were very slow to adopt 8-bit characters, preferring to strip the 8th bit and work exclusively with 7-bit ASCII.

An international standard for an 8-bit code capable of representing virtually every European Latin-script language (ISO 6937 [ISO 6937]) was developed quite early, but never received much favor among system implementors. This was because diacritical marks were encoded separately from the letters they modified; thus an accented letter required two codes rather than one. The costs of such a system were usually considered to outweigh its benefits. In order to handle all the European Latin-script languages with a single 8-bit code for each diacritic/base-letter combination, as well as the other major alphabetic scripts (Greek, Cyrillic, Hebrew, and Arabic), it was necessary to produce a series of several standard character sets: ISO 8859. The standard currently has nine parts, with a tenth awaiting approval. All of them contain ASCII in their first 128 positions.

• Western Europe: 8859-1 (Latin 1) [ISO 8859]

• Eastern Europe: 8859-2 (Latin 2)

• Esperanto, Maltese, and Turkish: 8859-3 (Latin 3)

• Nordic: 8859-4 (Latin 4). Latin 3 and Latin 4 were poorly conceived, have been little used, and will likely be withdrawn from the standard.

• Cyrillic: 8859-5

• Arabic: 8859-6

• Greek: 8859-7

• Hebrew: 8859-8

• Turkish: 8859-9 (Latin 5) was introduced to handle Turkish in a less baroque manner than Latin 3; it is identical to Latin 1 with the substitution of Turkish for Icelandic letters.

• Nordic: 8859-10 (Latin 6) will cover the Nordic languages more adequately than Latin 4.

There is also a set of supplementary characters designed for use in conjunction with any of the Latin sets. By switching between one's native Latin set and the supplementary set, it should be possible to encode any Latin-script language covered by ISO 8859.

Most parts of ISO 8859 have been well accepted and widely implemented. ISO 8859 suffers from the same problem as ISO 646, but on a different scale: it represents several sets of characters within the same limited encoding space. Although it is no longer necessary to choose between a program and French accents, it is still not possible to exchange files between Eastern and Western Europe without character set problems. Similarly, it is not possible to exchange files between PCs, Macintoshes, and Latin 1 machines without transcoding - and we still can't send any of these character sets reliably by Internet mail.
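The need for transcoding between PCs, Macintoshes and Latin 1 machines can be seen by encoding one accented word three ways. This minimal sketch assumes Python's built-in `cp850`, `mac_roman` and `iso8859_1` codecs as stand-ins for the three families; the dictionary labels are for illustration only.

```python
word = "caf\u00e9"
codecs_by_platform = {"PC": "cp850", "Macintosh": "mac_roman", "Latin 1": "iso8859_1"}
encoded = {label: word.encode(codec) for label, codec in codecs_by_platform.items()}

# Reading Macintosh bytes as if they were Latin 1 loses the accented letter.
misread = encoded["Macintosh"].decode("iso8859_1")

for label, data in encoded.items():
    print(label, data.hex())
print(misread == word)
```

The three ASCII letters are identical everywhere, while the accented letter has a different code on each platform.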

2.2.3 Han (Chinese/Japanese/Korean) characters [Huang 89]

Languages written with non-alphabetic scripts introduce entirely new problems. The most important are Chinese, Japanese, and Korean [Katzner 86], each of which uses a native phonetic script of 40-50 signs to supplement a vocabulary of many thousands of ideograms (Han characters). Although all three languages' ideograms are of common origin, they have developed independently and diverged. In addition, the People's Republic of China has introduced "simplified" [Ramsey 87] characters that differ from those still used in Taiwan, Hong Kong, and elsewhere. National standards (Simplified Chinese GB2312, Japanese Kanji JIS X 0208, Japanese Kanji supplementary JIS X 0212 and Korean (Hangul and Hanja) KS C 5601) exist for encoding each language using 2-octet (16-bit) or 3-octet characters.

2.2.4 Character set switching

A standard method of announcing which character set is being used and of switching among character sets has been created (ISO 2022 [ISO 2022]). It is therefore possible to mix different NRCS, parts of ISO 8859, or any other of the many special-purpose character sets that have been registered (according to ISO 2375), without danger of misinterpretation. In practice, however, ISO 2022 is particularly confusing and awkward to implement and has had limited use. (DEC VT200 terminals and X Window System Compound Text are examples of partial implementations.) Moreover, it does not make any attempt to provide a unified and consistent repertoire of characters; it simply allows many varied character sets to coexist - as long as they have been registered. None of the manufacturers' private character sets are registered.
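Character set switching in the ISO 2022 style can be observed with Python's built-in `iso2022_jp` codec, which is one profile of ISO 2022 and is used here purely as an accessible example. The variable names are for illustration.

```python
mixed = "A\u3042"                     # a Latin letter followed by a hiragana character
stream = mixed.encode("iso2022_jp")

ESC = b"\x1b"
to_jisx0208 = ESC + b"$B"   # escape sequence that switches to JIS X 0208
to_ascii = ESC + b"(B"      # escape sequence that switches back to ASCII

print(stream)
```

The two-byte text becomes a nine-byte stream, and the bytes that encode the hiragana character are ordinary 7-bit values whose meaning depends entirely on the escape sequence that precedes them.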

2.3 Towards a universal coded character set

The only way out of the character set morass is to define a new standard character set offering a unified and consistent repertoire capable of representing all the major languages of the world. The demand for native-language interfaces in more and more languages, and the need to exchange information on a worldwide scale, have made such a universal character set a commercial necessity. Dealing with all the world's languages and scripts at once is not an easy task, and requires addressing many issues that previous character sets have been able to avoid.

2.3.1 How many characters are there?

The first issue that must be resolved, and not the least controversial, is the approximate number of characters to be encoded and the number of bits required to offer this number of codepoints. Computer architecture places major constraints on this choice. We no longer have DECsystem-10s and 20s with 18- and 36-bit words; the only convenient units for most processors are 8, 16, and 32 bits. It has been estimated that 18 bits would be sufficient for just about everything anyone would ever want to encode as a character. Therefore a choice must be made between rejecting some possible characters to make everything fit compactly in 16 bits, or wasting storage space with 32-bit characters. The most important reduction of the total number of characters, inevitable for a universal 16-bit code, is obtained by encoding Han characters common to Chinese, Japanese, and Korean once only, despite differences in meaning or variations in form. This reduction is referred to as Han Unification [Unicode 92].
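The arithmetic behind this choice is brief enough to work through. The Python sketch below counts the code points available at 16 and 18 bits and compares the storage needed for a short string at 16 and 32 bits per character; the helper name `bits_needed` and the sample string are hypothetical.

```python
def bits_needed(n_characters):
    """Smallest number of bits that provides at least n_characters code points."""
    return max(1, (n_characters - 1).bit_length())


code_points_16 = 2 ** 16    # capacity of a 16-bit code
code_points_18 = 2 ** 18    # capacity of the estimated 18 bits

sample = "universal"
bytes_16 = len(sample.encode("utf-16-be"))   # two bytes per character
bytes_32 = len(sample.encode("utf-32-be"))   # four bytes per character

print(code_points_16, code_points_18, bytes_16, bytes_32)
```

An 18-bit repertoire is four times larger than a 16-bit code can hold, while moving to 32 bits doubles the storage of every string; this is the tradeoff that Han Unification is meant to ease.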

2.3.2 Writing systems

The writing system [Coulmas 89] of most Latin-script languages is very simple: characters are aligned horizontally, left-to-right, without overlapping or changing direction; the only non-linear elements that intervene are diacritical marks placed above some letters. Even then, the number of possible diacritic/base-letter combinations in any one language is usually small enough that it is easy (and often preferable) to use a separate code for each precomposed diacritically marked letter. In general, however, writing systems are not as simple.

Certain Latin-based [Comrie 87] writing systems are already more complex. Vietnamese often requires two diacritical marks on a single letter, one of which is a tone mark. Standard phonetic script combines superscript and subscript as well as marks that apply to more than one letter at once. It is not even practical to enumerate all the possible diacritic/base-letter combinations in standard phonetic script, since it is designed as an open-ended system in which new combinations may be invented as required.

Arabic and Hebrew [Gaur 85] are written from right to left, but numbers and insertions in left-to-right scripts are written by changing direction within a line of text. Both scripts denote only consonants with full letters; vowels are (optionally) written as points over or under the consonants. Arabic comes from a calligraphic, rather than typographic, tradition, in which letters have initial, medial, final, and standalone forms; Hebrew also has a few of these positional variants, and even Greek retains one or two.

Other complex scripts include those derived from the script of ancient Sanskrit, which use an involved system of ligatures, and the Korean Hangul alphabet [Huang 89], in which alphabetic symbols are combined into syllabic blocks.
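The Vietnamese case can be made concrete with Python's standard `unicodedata` module, which reflects current Unicode data rather than the edition available when this thesis was written. The sketch decomposes one precomposed letter into its base letter and two combining marks.

```python
import unicodedata

precomposed = "\u1ec7"   # e with circumflex and dot below (the dot is a tone mark)
decomposed = unicodedata.normalize("NFD", precomposed)
names = [unicodedata.name(ch) for ch in decomposed]
recomposed = unicodedata.normalize("NFC", decomposed)

print(names)
```

One displayed letter thus corresponds either to one precomposed code or to a base letter followed by two independent diacritics.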

2.3.3 What is a character?

The different operations performed on text in a computer - including input, rendering (display), searching, and sorting - have different preferences for the way in which the text is encoded. Rendering would be simpler if presentation forms such as the ligatures "fi" and "fl" were encoded explicitly, but this would complicate the input process and make correct searching and sorting difficult. This is a trivial example, but issues of this nature abound, especially in more complex writing systems. The method of encoding [Clews 88] must be a tradeoff between the requirements of different types of processing. The characters of a character set are the elements required by the chosen encoding.

In this context the term "character" has a particular meaning which overlaps partially with conventional uses of the word. Characters are not simply abstract shapes (typographic characters), nor do they necessarily correspond exactly to the elements of any one writing system.

In Latin script, the question of what is and is not a character arises mainly with diacritical marks and ligatures. Diacritics must be encoded as independent characters for applications such as standard phonetic script and are best encoded that way for the occasional diaeresis or stress marks used in English. But it would be very inconvenient to insist on such an encoding for Turkish, in which diacritically marked letters have their own separate positions in the alphabet. Thus independent diacritics must be included in the character set, but should not be used in certain applications. Ligatures such as fi and fl probably should not be considered characters, because they are significant only for rendering and can be derived automatically. The æ ligature, on the other hand, is part of the alphabet in Norwegian and Danish. Use of the œ ligature in French is not absolutely required but is considered good typographic practice; it cannot be determined automatically and therefore must be encoded explicitly [Clews 88]. But when sorting or searching it should be treated as if it were the individual letters o and e.
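The different status of these ligatures is visible in present-day Unicode data, which can be inspected with Python's standard `unicodedata` module. This is offered as an illustration using current data, on the assumption that it reflects the same design decision discussed here.

```python
import unicodedata

fi_ligature = "\ufb01"   # the fi ligature, kept only for compatibility
ae_letter = "\u00e6"     # the ae letter of Norwegian and Danish
oe_letter = "\u0153"     # the oe ligature used in French

# Compatibility normalization replaces pure presentation forms
# and leaves genuine characters alone.
folded = {ch: unicodedata.normalize("NFKC", ch)
          for ch in (fi_ligature, ae_letter, oe_letter)}
print(folded)
```

The fi ligature folds back to the two letters f and i, whereas the other two remain single characters, so treating œ as "oe" for sorting or searching has to be done by the collation rules rather than by the encoding.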

Positional variants in scripts such as Arabic are presentation forms that should not be encoded separately. In Hebrew and Greek, however, the handful of variant forms traditionally have separate codes. For each writing system there are situations where it cannot be clearly decided just what is or is not a character.
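Where positional variants of an Arabic letter do exist as separate codes in present-day Unicode, they are marked as presentation forms of the single underlying letter. The Python sketch below uses the standard `unicodedata` module to show this for the letter beh; it illustrates current Unicode data and is not drawn from the thesis.

```python
import unicodedata

beh = "\u0628"                                          # the Arabic letter beh
presentation_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

# Each presentation form carries a tag naming its position in the word.
tags = [unicodedata.decomposition(ch).split()[0] for ch in presentation_forms]

# All four forms fold back to the one character used for encoding text.
all_fold_to_beh = all(unicodedata.normalize("NFKC", ch) == beh
                      for ch in presentation_forms)
print(tags, all_fold_to_beh)
```

Text is encoded with the single letter, and choosing among the four shapes is left to the rendering process.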

Existing character sets evidently have a large influence on the design of the

24 universal character set. which must be able to represent all text that could be encoded previously. Some of the mistakes of the past must be retained for the sake of compatibility. but should be avoided in the future. For example, the inadequacies of typewriters and of ASCII have accustomed people to ignore the distinction between the hyphen, dash and minus sign or between opening and closing quotation marks. The hyphen-minus and neutral vertical quotation mark must continue to exist, but their use should be discouraged once the correct distinct characters exist. The universal character set can allow text to be encoded more precisely and more richly than before, and facilitate improved methods of processing text.

It is not expected that every device or piece of software that supports the universal character set should be capable of handling the requirements of all writing systems. However, it is essential that the character set itself contain all the elements required for every writing system (and only as many non­ essential characters as are imposed by convenience or backward compatibility). Not all writing systems have been previously considered for processing by computer, while in other cases more than one competing encoding scheme exists, careful and well informed choices must be made.

2.3.4 Character and Glyph

The distinction between the notions of "character" and "glyph" is fundamental in recent work with character sets. Informally, a character is a unit of information used to encode text, whereas a glyph is a shape (a homogeneous set of which constitutes a font) used to render text. The rendering process includes a mapping (not necessarily one-to-one) from characters to glyphs. A familiar example of such a mapping is the encoding vector in a PostScript font.

This distinction leads to two principles for the design of a character set: first, that variations in form (multiple glyphs) required for high-quality rendering of text should not be encoded with separate characters if their meaning is the same; and second, that even if two candidates for encoding are visually identical (such as capital D with stroke (Croat, Lapp) and capital eth (Icelandic)), and thus can be rendered with a single glyph, they must nevertheless be encoded separately if their meanings differ. Similarly, the script to which a character belongs is significant: Latin capital A and Greek capital alpha are distinct characters despite their shared form.

As usual, compromise, compatibility, and convenience blur the distinction and make the principles ambiguous. In the areas of mathematical symbols and diacritical marks, it is impractical to associate characters with distinct uses of each shape, since the uses are varied and changing; instead, the shapes (glyphs) have to be used as the basis for characters. A borderline example is the case of the diaeresis or umlaut diacritical mark. Here the two meanings of the symbol are clearly defined and it can be useful for some applications to distinguish between them. In general, however, making such distinctions with diacritical marks is more trouble than it is worth.

2.3.5 Keysym [Adrian 93]

It should also be kept in mind that the symbols used for keyboard input are not necessarily characters either. An input method is used to convert sequences of keystrokes into characters. At the simplest level, the input method consists merely of the interpretation of shift, control, and alternate keys held down in conjunction with another key. Other functions of an input method include dead-key handling, compose-character processing, and input of Han characters by typing a phonetic representation and disambiguating by choosing from a menu. The keysyms of the X Window System do not, therefore, constitute a character set and cannot in general be used directly as characters.

2.4 Standard character sets of the future

2.4.1 ISO/IEC DIS 10646 [ISO/IEC DIS 10646]

An international standards committee began work on a draft international standard (ISO DIS 10646) for a universal coded character set (UCS) several years ago. The DIS (Draft International Standard) was a four-octet (32-bit) code, where each octet was limited to values which would represent printable characters in ISO 8859. This limitation would have made the code easier to transmit and process by obsolete means, but eliminated a huge number of codepoints, to the extent that no two-octet subset of the code could offer a minimal encoding of all the major languages. No was considered. In an attempt to reduce the costs of storing and transmitting four octets for each character, various compaction forms were defined, of length 1, 2 or 3 octets or variable-length. The committee did not have the means to do adequate research in many areas, with the result that the draft submitted to international balloting in 1991 still had many serious problems. It was not adopted as a standard.

2.4.2 Unicode Character Encoding Standard

The Unicode [Unicode 92] [Unicode 93] character encoding standard is a fixed-width, uniform text and character encoding scheme. It includes characters from the world's scripts, as well as technical symbols in common use. The Unicode standard is modeled on the ASCII character set. The Unicode Consortium adopted a 16-bit architecture which extends the benefits of ASCII to multilingual text. Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility.

2.4.3 Goal of the Unicode

The primary goal of the Unicode project was to remedy serious problems common to most multilingual computer programs: overloading of the font mechanism when encoding characters, and use of multiple, inconsistent character codes caused by conflicting national character standards. Few national standards allowed for special purpose characters, such as proprietary or typographical characters. The ASCII character set and its extensions, although widely used and accepted as standard in most computing systems, are limited to 256 characters. ASCII is therefore inadequate in an increasingly complex global computing environment.

Designers of the Unicode standard envisioned a uniform method of character identification that would be more efficient and flexible than current encoding systems. Their system would be complete enough to satisfy the needs of technical and multilingual computing, as well as text publishing. The main goals were to eliminate the special case systems and complex application codes currently in use in many character encoding standards, and to make a larger range of characters available in order to meet the requirements of professional quality typesetting and desktop publishing internationally.

2.4.4 Conformance

An application may be considered to conform to the Unicode standard if it makes use of independent fixed-width 16-bit characters and uses Unicode code points to represent Unicode-defined characters. Code conversion from other standards to Unicode will be considered conformant if the matching table produces accurate conversions in both directions.

2.4.5 Coverage

The Unicode Standard contains over 28,000 characters from the world's scripts. These characters are more than sufficient for modern communication, as well as for classical forms of languages such as Greek, Hebrew, Latin, Pali, Sanskrit and literary Chinese. Over 20,000 unique characters defined by national and industry standards of China, Japan, Korea, and Taiwan are included. The Unicode standard also includes math operators, technical symbols, and dingbats. To define the content of the Unicode standard, the Unicode Technical Committee relied primarily on existing standards. Many characters have been included solely because they are part of an existing standard in widespread use, despite the fact that they violate the general principles of the Unicode standard in some instances.

The Unicode standard includes the character content of all major International Standards approved and published before December 31, 1990, in particular, the ISO International Register of Character Sets, and the ISO 6937 and ISO 8859 families of standards, as well as SGML (ISO 8879 [ISO 8879]). Characters from other standards have also been included, specifically, from bibliographic standards used in libraries (Roman ANSI Z39.47-1985 [ANSI Z39.47], and East Asian ANSI Z39.64-1990 [ANSI Z39.64]), and from important national standards:

• India: ISCII 1988 [ISCII]

• China: GB2312-1980 [GB2312]

• Japan: JIS X 0208-1990 [JIS X 0208] and JIS X 0212-1990 [JIS X 0212]

• Taiwan: CNS 11643-1986 [CNS 11643]

Also included are characters from certain draft standards (such as Glagolitic, Old Cyrillic and Romanian Cyrillic for bibliographic information interchange, ISO DIS 6861.2 [ISO DIS 6861.2]), and from various industry standards in common use. Another source of characters is the numerous papers and national bodies' contributions to the ISO SC2/WG2 committee on character encoding.

The Unicode standard does not encode rare, obsolete, idiosyncratic, personal, novel, rarely exchanged or private-use characters, nor does it encode logos or graphics. Artificial entities, whose sole function is to serve transiently in the input of text, are also excluded from the Unicode standard.

2.4.6 Unification of Unicode and 10646

During 1991, after the failure of ISO DIS 10646, and given the stated intentions of many companies to produce products using Unicode despite its not being an officially sanctioned standard, members of the Unicode Consortium met over a period of several months with representatives from the ISO to pursue a single international character encoding standard. Both bodies recognized that developing a single, universal character code would be beneficial. Meetings in October of 1991 finally resulted in mutually acceptable changes to both Unicode and the ISO DIS 10646 which merged their combined repertoire into a single numerical character encoding.

A new ISO DIS 10646 [ISO/IEC DIS 10646-1:1993(E)] reflecting the result of this merger effort was distributed for international ballot in January 1992; it was approved by ISO/IEC as an International Standard (ISO/IEC 10646) in June 1992 and published in May 1993. The Unicode [Unicode 93] Standard, Version 1.1 is the newest version of the Unicode Standard. Unicode 1.1 includes the changes and additions that were made to Unicode 1.0 in the process of alignment with the international character encoding standard, ISO/IEC 10646-1 (the first part of ISO/IEC 10646).

2.4.7 Code structure of the 10646

ISO/IEC 10646 defines two alternative forms of encoding:

• A 32-bit encoding conceptually divided into some 32,000 Planes (a Plane is the basic code-space of ISO/IEC 10646 and contains space for 65,536 characters).

• A 16-bit encoding encompassing Plane Zero. The first Plane (Plane 00 of Group 00) is the Basic Multilingual Plane (BMP); Unicode 1.1 matches the BMP.

Figures 2-1 and 2-2 outline the code structure of ISO/IEC 10646.

The 32-bit form is referred to as "UCS-4" (Universal Coded Character Set containing four bytes) and the 16-bit form is referred to as "UCS-2" (Universal Coded Character Set containing two bytes).

Figure 2-1 ISO/IEC 10646: Code Architecture (128 groups, Group 000 to Group 127, of 256 planes each; Plane 00 of Group 000 is the BMP)

Figure 2-2 A Plane in ISO/IEC 10646 (a 16-bit plane made up of rows and cells)

The character code values of ISO/IEC 10646-1 (UCS-2) and Unicode version 1.1 have been made precisely the same. Since ISO/IEC 10646 does not at present encode any characters outside of the BMP, the result is that the character repertoires and encoding assignments of the Unicode Standard and ISO/IEC 10646 are identical. The character "A", LATIN CAPITAL LETTER A, for instance, has the unchanging numerical value 41 hexadecimal. This value may be extended by any quantity of leading zeros to serve in the context of the following fixed-length encoding standards:

Bits  Standard       Binary                               Hex  Dec  Char
7     ASCII          1000001                              41   65   A
8     ISO 8859-1     01000001                             41   65   A
16    Unicode        00000000 01000001                    41   65   A
32    10646 (UCS-4)  00000000 00000000 00000000 01000001  41   65   A

This design eliminates the problem of disparate code values in all systems that use any of the above-named standards.
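The zero-extension described above can be checked with a short sketch. Python is used here purely for illustration; the names CODE_A and zero_extended are hypothetical and do not come from any of the standards.

```python
# Illustrative sketch: the code value of "A" zero-extended to each width.
CODE_A = 0x41  # LATIN CAPITAL LETTER A


def zero_extended(value, bits):
    """Return value as a binary string padded with leading zeros to `bits` digits."""
    return format(value, "0{}b".format(bits))


for bits, standard in [(7, "ASCII"), (8, "ISO 8859-1"),
                       (16, "Unicode"), (32, "10646 (UCS-4)")]:
    # The numeric value never changes; only the number of leading zeros does.
    print(bits, standard, zero_extended(CODE_A, bits),
          format(CODE_A, "X"), CODE_A, chr(CODE_A))
```

Each width yields the same integer, 65, which is why text in any of these encodings agrees on the value of "A".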

2.4.8 Unicode Standard Codepoints Assignment

Unicode (Unicode will be used as a synonym for BMP, ISO/IEC 10646-1 and UCS-2 hereafter) uses a flat, linear 16-bit encoding space for assigning codes to characters. This allows for a fully dense encoding encompassing up to 65,536 character codes. Unlike some multibyte character encodings, Unicode does not reserve the values of either of the single bytes that make up a two-byte code. Each Unicode character code should be treated as an integral 16-bit unsigned value in the range [0 .. 65,535]. In this space, two particular code values, namely 0xFFFE and 0xFFFF, are explicitly not given character values. These code values serve special purposes.

The Unicode encoding space may be divided into four general linear sections, or zones, which follow one another. The first zone, or alphabetic zone (A-zone), contains all general alphabetic, punctuation, and symbol characters. The ideographic zone (I-zone) which immediately follows contains the Han ideographic characters. Following this are the open zone (O-zone), reserved for future open use, and the restricted zone (R-zone), restricted to private and compatibility characters. Figure 2-3 illustrates the layout of the Unicode codepoints.

Basic Multilingual Plane

Zone    Range        Contents
A-zone  0000 - 1FFF  General (including the C0 and C1 controls)
        2000 - 2FFF  Symbol
        3000 - 3DFF  CJK Phonetic
I-zone  4E00 - 9FFF  CJK Unified Ideographs
O-zone  A000 - DFFF  Future Expansion
R-zone  E000 - FFFF  Restricted Use (FFFE and FFFF are not characters)

Figure 2-3 Unicode Layout
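The zone layout can be expressed as a small lookup, sketched below in Python. The boundaries follow Figure 2-3; extending the A-zone up to 4DFF so that it meets the I-zone is an assumption of this sketch, and the names ZONES and zone_of are hypothetical.

```python
# Zone boundaries after Figure 2-3. Treating the A-zone as running up to
# 0x4DFF (so that the I-zone "immediately follows") is an assumption.
ZONES = [
    (0x0000, 0x4DFF, "A-zone"),
    (0x4E00, 0x9FFF, "I-zone"),
    (0xA000, 0xDFFF, "O-zone"),
    (0xE000, 0xFFFF, "R-zone"),
]
NON_CHARACTERS = (0xFFFE, 0xFFFF)  # explicitly not given character values


def zone_of(code):
    """Return the zone name for a 16-bit code value."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit code value")
    if code in NON_CHARACTERS:
        return "not a character"
    for low, high, name in ZONES:
        if low <= code <= high:
            return name


print(zone_of(0x0041))  # Latin capital A
print(zone_of(0x4E00))  # first CJK unified ideograph
```

Because the zones are contiguous and ordered, a linear scan over four ranges is sufficient; no table of individual characters is needed.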

Unicode assigns just over half of its code space (52.1%) to character elements. This includes the Chinese, Japanese and Korean Unified Ideographic Characters (also known as UniHan (Unified Han Character Set)) encoding. Nearly two-thirds of the assigned encoding space is taken by Han script elements. The reserved space includes 6,400 code cells which are reserved for private use, 64 code cells assigned to the C0 control codes (range [0 .. 31]; the new line and carriage return control codes are in this range, for example) and the C1 control codes (range [128 .. 159], not yet used), and 2 code cells (U+FFFE and U+FFFF) which are reserved for special purposes. Figure 2-4 illustrates the general scripts layout.

General Scripts Layout

0000  Latin
0200  Latin, IPA, Modifiers, Diacritics, Greek
0400  Cyrillic, Armenian, Hebrew
0600  Arabic
0800  Devanagari, Bengali
0A00  Gurmukhi, Gujarati, Oriya, Tamil
0C00  Telugu, Kannada, Malayalam
0E00  Thai, Lao
1000  Georgian
1200  Unassigned
1E00  Latin & Greek Extensions
2000

Figure 2-4 General Scripts Layout

The first portion of the A-zone is the general scripts space. It currently contains 19 separate scripts which are encoded in separate character blocks (some scripts use more than one block). In addition, this area contains a character block which represents general purpose non-spacing diacritic marks. Left to right European scripts are encoded first, followed by the right to left scripts of the Middle East, and, finally, Indic and Central Asian scripts. Considerable space remains at the end of this encoding area for further expansion and for adding new scripts, some of which has already been used for extensions to the Latin and Greek scripts.

In addition to primary symbols, such as letters, diacritic marks, and other non-spacing marks, the general scripts space also includes a number of characters which represent control codes, numbers, punctuation, and symbols. These secondary symbols are encoded in this area for two reasons: (1) for ASCII and ISO 8859 compatibility; and (2) because certain symbols are confined to specific scripts. ASCII and ISO 8859 compatibility characters are limited to the range U+0000 to U+00FF; script specific secondary symbols are limited to the character block(s) which encode particular scripts. Figure 2-5 illustrates the Latin Blocks.

0000  ASCII
0080  ISO 8859-1 Latin 1
0100  European Latin
0180  Extended Latin
01FF

Figure 2-5 Latin Blocks

Following the Latin 1 block are the European Latin and Extended Latin character blocks; another Extended Latin block is located near the end of the General Scripts Space. These blocks contain symbols contained in other important standards and forms which extend the Latin script to write non-European languages.

Unicode version 1.1 implements the Basic Multilingual Plane as the following subsets (Figure 2-6 illustrates the detailed allocation of these subsets):

• Basic Latin (ASCII)
• Latin-1
• Latin Extended A
• Latin Extended B
• Latin Extended Additional
• International Phonetic Alphabet Extensions
• Spacing Modifier Letters
• Combining Diacritical Marks
• Basic Greek, Greek Symbols & Coptic
• Greek Extended
• Cyrillic
• Armenian
• Basic Hebrew & Hebrew Extended
• Basic Arabic, Arabic Extended
• Arabic Presentation Forms
• Arabic Presentation Forms-A
• Arabic Presentation Forms-B
• Devanagari
• Bengali
• Gujarati
• Oriya
• Tamil
• Telugu
• Kannada
• Malayalam
• Thai
• Lao
• Basic Georgian & Georgian Extended
• Hangul Jamo
• Hangul Compatibility Jamo
• Hangul, Hangul Supplementary-A & Hangul Supplementary-B
• Currency Symbols
• Combining Symbols
• Mathematical Operators
• Optical Character Recognition
• Dingbats
• Katakana
• Chinese/Japanese/Korean Unified Ideographs

Figure 2-6 Allocation of the BMP in ISO/IEC 10646-1:1993, parts 1 and 2 (block allocation by row, rows 00 to 2F in part 1 and rows 30 to FF in part 2; for example, Latin Extended-A is in the range 0100 to 017F)

2.5 UTF-8

UTF-8 is a File System Safe Universal Character Set Transformation Format (also known as UTF-2 and FSS-UTF) published by X/Open Company. It does not, however, form part of the Unicode Standard Version 1.1 [Unicode 93].

UTF-8 was proposed for the handling of Unicode by existing operating systems and utilities (tools-and-pipes model of text processing embodied by the Unix system, for example). It is an intermediate step towards full Unicode support and it provides a common and compatible encoding during this transition stage.

UTF-8 has the following properties:

• Compatible with existing file systems: existing file systems disallow the null byte and the ASCII slash character as a part of the file name.

• Compatible with existing programs: the existing model for multibyte processing is that ASCII does not occur anywhere in a multibyte encoding. There should be no ASCII code values for any part of a transformation format representation of a character that was not in the ASCII character set in the Unicode representation of the character.

• Ease of conversion from/to Unicode.

• The first byte indicates the number of bytes to follow in a multibyte sequence.

• The transformation format is not extravagant in terms of the number of bytes used for encoding.

• It is possible to find the start of a character efficiently starting from an arbitrary location in a byte stream.

The Unicode transformation format encodes Unicode values in the range [0 .. 0x7FFFFFFF] using multibyte characters of lengths [1 .. 6] bytes. For all encodings of more than one byte, the initial byte determines the number of bytes used and the high-order bit in each byte is set. Every byte that does not start 10xxxxxx is the start of a Unicode character sequence.
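The resynchronisation property can be sketched in a few lines of Python. The helper names is_continuation and start_of_character are hypothetical, and the sketch assumes the input is well-formed.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def start_of_character(data, index):
    """Step backwards from `index` to the first byte of the enclosing sequence."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# "a", e with acute accent (two bytes), and a Han ideograph (three bytes).
data = "a\u00e9\u4e2d".encode("utf-8")
print(start_of_character(data, 5))  # index 5 is the last byte of the ideograph
```

At most five backward steps are ever needed, since no sequence in this format is longer than six bytes.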

Bits  Hex Min   Hex Max   Byte Sequence in Binary
7     00000000  0000007F  0vvvvvvv
11    00000080  000007FF  110vvvvv 10vvvvvv
16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The Unicode value is just the concatenation of the v bits in the multibyte encoding. When there are multiple ways to encode a value, for example U+0000, only the shortest encoding is legal.
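The table translates fairly directly into an encoder. The following Python sketch is illustrative only; RANGES and encode_fss_utf are hypothetical names. It implements the six-row table above, so it accepts values that Python's own "utf-8" codec, which follows a later and narrower definition, would reject.

```python
# One entry per table row: (largest value, bytes used, marker bits of first byte).
RANGES = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_fss_utf(value):
    """Encode one value in [0 .. 0x7FFFFFFF] using the shortest legal form."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside [0 .. 0x7FFFFFFF]")
    # Picking the first row that fits guarantees the shortest encoding.
    for limit, length, marker in RANGES:
        if value <= limit:
            break
    if length == 1:
        return bytes([value])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # 10vvvvvv, lowest six bits first
        value >>= 6
    # The remaining high bits go into the first byte, after the marker.
    return bytes([marker | value] + tail[::-1])


print(encode_fss_utf(0x41))
print(encode_fss_utf(0x4E2D).hex())
```

For values in the 16-bit Unicode range the result agrees with Python's built-in encoder, which gives a convenient cross-check.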

2.6 UTF-7

UTF-7 [Goldsmith 94] is a Mail-Safe Transformation Format of Unicode. FSS-UTF uses octets in the range decimal 128 through 255 to encode Unicode characters outside the ASCII range. Thus, in the context of mail, those octets must themselves be encoded. This requires putting text through two successive encoding processes, and leads to a significant expansion of characters outside the ASCII range.

To overcome this disadvantage, UTF-7 encodes Unicode characters as ASCII, together with shift sequences to encode characters outside that range. For this purpose, one of the characters in the ASCII repertoire is reserved for use as a shift character. UTF-7 also contains provisions for encoding characters within ASCII in a way that all mail systems can accommodate.

UTF-7 should normally be used only in the context of 7-bit transports, such as mail and news. In other contexts, straight Unicode or UTF-8 (FSS-UTF) should be used.
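The difference between the two transformation formats can be seen with Python's built-in "utf-7" and "utf-8" codecs, used here only as a convenient stand-in; they implement later revisions of both formats than the ones cited in this chapter. The variable names are illustrative.

```python
# "A", NOT IDENTICAL TO, GREEK CAPITAL LETTER ALPHA, "."
text = "A\u2262\u0391."

mail_safe = text.encode("utf-7")  # ASCII only, with a shifted run for non-ASCII
file_safe = text.encode("utf-8")  # uses octets in the range 128 through 255

print(mail_safe)
print(all(octet < 0x80 for octet in mail_safe))  # every octet is 7-bit
print(any(octet >= 0x80 for octet in file_safe))  # some octets need the 8th bit
```

The UTF-7 form can pass through a 7-bit mail transport unchanged, whereas the UTF-8 form would first need a second, transport-level encoding.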

3 INTERNATIONALISATION AND MULTILINGUAL SUPPORT IN X

3.1 Introduction

Based on current systems, four different classes of language support are currently possible [Glenn 94]. The first, English-only, is the most common software produced today. A slightly different version is to support other language environments, but to do so only one language at a time; for example, a Chinese-only system. This class of support can be called variable monolingual, since it supports a varying number of monolingual environments.

Most current systems which claim some level of localisation or internationalisation fall into a third class, which might be called restricted multilingual. In these cases, more than one language is represented, where English is usually one of the languages. The restriction applies to the number of language environments that are supported; usually only a very small number are supported.

The fourth class of language support is unrestricted multilingual. This type of system is truly transparent to language and provides support for all languages. A system or application of this type might not necessarily support some particular language; however, it will have the capability of doing so.

Figure 3-1 shows different classes of language support.

Figure 3-1 Evolving Language Support

Today, most localised or internationalised systems support a restricted number of languages. X Window System Release 6 [James 94] is one of the systems based on the locale mechanism, and so falls in the third class.

In this chapter, we discuss internationalisation issues in the X Window System and then discuss how to extend it so that the X Window System can provide unrestricted multilingual support with backwards compatibility.

In the following chapters, we use internationalisation as a synonym for restricted multilingual and simply use multilingual to replace unrestricted multilingual. The Xlib Reference Manual [James 94] should be used as the reference for the details of the functions mentioned here.

3.2 Background

There have been numerous research versions of X. Version 10, Release 4 (popularly known as X10.4), which was released in 1986, became the basis for several commercial products. Development of most X10.4 products was curtailed, however, when it became apparent that Version 11 would not be compatible with it. Version 11, Release 1 became available in September 1987, Release 2 in March 1988, Release 3 in February 1989, Release 4 in January 1990, Release 5 in August 1991, and Release 6 in May 1994.

Version 11 is a complete window programming package. It offers much more flexibility than X Version 10 in the areas of supported display features, window manager styles, and support for multiple screens, provides better performance, and is also fully extensible.

Although X is fundamentally defined by a network protocol, most application programmers do not want to think about bits, bytes, and message formats. Therefore, X has an interface library. This library provides a familiar procedural interface that masks the details of the protocol encoding and transport interactions and automatically handles the buffering of requests for efficient transport to the server. The library also provides various utility functions that are not directly related to the protocol but are nevertheless important in building applications. The exact interface for this library differs for each programming language. Xlib is the library for the C programming language. Figure 3-2 shows a block diagram of a complete X environment.

Each X server controls one or more screens, a keyboard, and a pointing device.

In X, many facilities that are built into other window systems are provided by client libraries. Toolkits (providing menus, scroll bars, dialogue boxes, and so on), higher-level graphics libraries, and management systems can all be implemented on top of the X library. Although the X library provides the foundation, the expectation is that applications will be written using these higher-level facilities in conjunction with the facilities of the X library, rather than solely on the "bare bones" of the X library.

Figure 3-2 X Window System block diagram (applications, a window manager and a terminal emulator, built on toolkits and the X library, communicate over the network through the X network protocol with X servers, each of which drives a keyboard and one or more screens through a device library)

3.3 Internationalisation In X

An internationalised application must display all text in the user's native or preferred language [James 94]. This includes prompts, error messages, and text displayed by buttons, menus, and other widgets. The obvious approach to this sort of internationalisation is to remove all strings that will be displayed from the source code of the application and put them instead in a file that will be read in when the application starts up. Then it is a relatively simple matter to translate the file of strings to other languages and have the application read the appropriate one at startup. Many X applications that use the X resource manager to provide an "app-default" file are already internationalised in this way, though some still have non-internationalised error messages.

An internationalised application must display times, dates, numbers, etc. in the format that the user is accustomed to. Where an American user sees a date in the form month/day/year, an English user should see day/month/year, and a German user should see day.month.year. The definition of "alphabetical order" is a similar customary usage that varies from country to country. In Spain, for example, the string "ch" is treated as a single letter that comes after "c". So while the strings "Chile" and "Colombia" are in alphabetical order for an American user, they are out of order for a Spanish user. These and related problems of local customs are resolved with the ANSI-C setlocale mechanism. Calling this function causes the ANSI-C library to read a database of localisation information. Other functions in the C library (such as printf for displaying numbers and strcoll for comparing strings) use the information in this database so that they can behave correctly in the current locale. The X11R6 internationalisation mechanisms are built upon this setlocale mechanism.
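The collation example can be made concrete with a small sketch. It is written in Python and uses a hand-made sort key, traditional_spanish_key (a hypothetical name), as a stand-in for a real collation table; it does not call setlocale or strcoll, since the Spanish locale may not be installed on a given system.

```python
def traditional_spanish_key(word):
    """Sort key in which the digraph "ch" is one letter placed after "c"."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(("c", 1))  # ranks just after a plain "c"
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


countries = ["Colombia", "Chile", "Cuba"]
print(sorted(countries))                               # plain code order
print(sorted(countries, key=traditional_spanish_key))  # "ch" after "c"
```

A locale-aware library hides this kind of rule inside its collation data, so the application only has to ask for a comparison in the current locale.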

An internationalised program must be capable of displaying all the characters used in the user's language, and must allow the user to generate all these characters as input. For terminal-based applications, this can be thought of as a hardware issue: a French user's terminal must be capable of displaying the accented characters used in French, and there must be some way to generate those characters from the keyboard. With X and bitmapped displays, character display is not a problem - simply a matter of finding the required font or fonts. For languages like Chinese, fonts with many characters are required, but X supports 16-bit fonts, which are large enough for almost all languages. Keyboard input for Chinese and other ideographic Asian languages is another matter, however. When there are more characters in a language than there are keys on a keyboard, some sort of "input method" is required for converting multiple keystrokes into a single character. Ideographic languages require complex input methods, and often there is more than one standard method for a language. An internationalised application must support any input method chosen by the user. X11R6 provides this capability, which is described in the next chapter.

An internationalised program must operate regardless of the encoding of characters in the user's language. A program (or operating system) that ignores or truncates the eighth bit of every character won't work in Europe, because the accented characters used in many European languages are represented with numbers greater than 127. An application that assumes that every character is 8 bits long won't work in Japan, where there are many thousands of ideographic characters. Furthermore, common Japanese usage intermixes 16-bit Japanese characters with 8-bit Latin characters, so it is not even safe to assume that characters are of a uniform width. X11R6 makes an extension to the setlocale model: an internationalised X application reads a localisation file at startup that contains information about the text encoding used in the locale. This information allows X to correctly parse strings into characters and figure out how to display them. Figure 3-3 illustrates the internationalisation framework in X.
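The mixed-width problem can be illustrated with the EUC-JP encoding, taken here from Python's standard "euc_jp" codec. The helper split_euc_jp is a hypothetical name and a deliberate simplification: it only distinguishes one-byte ASCII from two-byte characters and ignores the other forms the encoding allows.

```python
text = "X11\u65e5\u672c\u8a9e"  # "X11" followed by three kanji
encoded = text.encode("euc_jp")


def split_euc_jp(data):
    """Split EUC-JP bytes into per-character byte strings.

    Simplified: a byte below 0x80 is a one-byte ASCII character, and any
    other byte is assumed to begin a two-byte character.
    """
    characters = []
    i = 0
    while i < len(data):
        width = 1 if data[i] < 0x80 else 2
        characters.append(data[i:i + width])
        i += width
    return characters


print(len(text), len(encoded))                       # characters versus bytes
print([len(c) for c in split_euc_jp(encoded)])       # width of each character
```

A program that steps through such a string one byte at a time would cut the two-byte characters in half, which is why the encoding information from the locale is needed before a string can be parsed.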

Figure 3-3 Internationalisation framework (the application uses the ANSI/MSE API and the Xlib API; beneath these are the locale library or C library, the input method and the output method, which rely on the locale service API, the X locale object and the localisation database with its font set information for rendering, character set information for conversion, and conversion functions)

• The input method (IM) forms a virtual keyboard to allow users to input characters in a character set which is larger than a physical keyboard can handle (for example, the character set GB2312-80 has more than 6,000 characters, so it is impossible to assign a key to each character).

• The output method (OM) is responsible for text rendering.

3.3.1 Internationalisation with ANSI-C [Plauger 92]

Clearly it is not feasible to write an application that has special case code for the formatting customs of every country in the world. A simpler approach is to use a library that reads a customising database at startup time. This database would contain the , the decimal separator symbol, abbreviations for the days of the week and names of the months in the local language, the collation sequence of the alphabet, etc. This is the approach taken by the ANSI-C library.

The first step in any internationalised application is to establish the locale - to cause the localisation database to be read in. This is done with the C library function setlocale. It takes two arguments: a locale category and the locale name. The locale name specifies the database that should be used to localise the program, and the locale category specifies which behaviours (for example, the collation sequence of the alphabet or the formatting of times and dates) of the program should be changed. Passing the empty string as the locale name will cause setlocale to get the name of the locale from the operating system environment variable named LANG. This allows the application writer to leave the choice of locale to the end user of the application. There is no standard format for locale names, but they often have the form:

language[_territory[.codeset]]

So the locale "Fr" might be used in France, while "En_GB" might specify English as used in Great Britain, and "En_US" English as used in the U.S.

The codeset field can be used to specify the encoding (i.e., the mapping between numbers and characters) to be used for all strings in the application when there is not a single default encoding used for the language in the territory. The locale "ja_JP.ujis" is an example - "ujis" is the name of one of the encodings in common use for Japanese. The name of the default locale is simply "C". This locale is familiar to American computer users and all C programmers.
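The locale name format described above can be illustrated with a small parser. This is only a sketch: the type locale_parts and the function parse_locale_name are hypothetical names introduced for illustration and are not part of ANSI-C or Xlib, and since locale names have no standard format the parser is deliberately lenient (it accepts a codeset without a territory).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three fields of a locale name. */
struct locale_parts {
    char language[32];
    char territory[32];
    char codeset[32];
};

/* Copy the bytes in [begin, end) into dst, truncating if necessary. */
static void copy_span(char *dst, size_t dst_size,
                      const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);

    if (n >= dst_size)
        n = dst_size - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Split "language[_territory[.codeset]]" into its fields.
 * Missing fields become empty strings.
 * Returns 0 on success, -1 if the language field is empty. */
int parse_locale_name(const char *name, struct locale_parts *out)
{
    const char *end, *dot, *underscore;

    assert(name != NULL && out != NULL);

    end = name + strlen(name);
    dot = strchr(name, '.');
    if (dot == NULL)
        dot = end;

    /* Only an underscore before the codeset separates the territory. */
    underscore = memchr(name, '_', (size_t)(dot - name));
    if (underscore == NULL)
        underscore = dot;

    if (underscore == name)
        return -1;

    copy_span(out->language, sizeof out->language, name, underscore);
    copy_span(out->territory, sizeof out->territory,
              underscore < dot ? underscore + 1 : dot, dot);
    copy_span(out->codeset, sizeof out->codeset,
              dot < end ? dot + 1 : end, end);
    return 0;
}
```

Applied to "ja_JP.ujis", the sketch yields the language "ja", the territory "JP" and the codeset "ujis"; applied to "C", only the language field is filled.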

The category LC_ALL instructs setlocale to set all internationalisation behaviour defined by ANSI-C to operate in the given locale. The locale may also be specified for each category individually. The standard categories (other, non-standard categories may also be defined) and the aspects of program behaviour that they control are listed below:

• LC_COLLATE: This category defines the collation sequence used by the ANSI-C library functions strcoll and strxfrm, which are used to order strings alphabetically.

• LC_CTYPE: This category defines the behaviour of the character classification and case conversion macros (such as isspace and tolower) defined in the header file <ctype.h>. Different languages classify characters differently. Not all characters have upper case equivalents, for example, and characters with codes between 128 and 255, which are non-printing in ASCII, are important alphabetic characters in many European languages.

• LC_MONETARY: This category does not affect the behaviour of any C library functions. The problem of formatting monetary quantities was deemed too intricate for any standard library function, so the library simply provides a way for an application to look up any of the localised parameters it needs to do its own formatting of monetary quantities. The ANSI-C function localeconv returns a pointer to a structure of type lconv that contains the parameters (such as decimal separator, currency symbol, and flags that indicate whether the currency symbol should appear before or after positive and negative quantities, etc.) needed for numeric and monetary formatting in the current locale.

• LC_NUMERIC: This category affects the decimal separator used by printf (and its variants), scanf (and its variants), gcvt (and related functions), strtod and atof. It also affects the values in the lconv structure returned by localeconv.

• LC_TIME: This category affects the behaviour of the time and date formatting functions strftime and strptime. It defines such things as the names of the days of the week and their standard abbreviations in the language of the locale.
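As a small illustration of the lookup style described for localeconv, the sketch below formats an amount by hand using the decimal separator of the current locale. The function name format_amount and its argument layout are assumptions made for this example; only localeconv, the decimal_point member of struct lconv and snprintf come from the C library.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Write "units<separator>hundredths" into buf, where the separator is
 * the decimal_point of the current LC_NUMERIC locale.
 * Returns the value of snprintf (the length of the formatted text). */
int format_amount(char *buf, size_t size, long units, int hundredths)
{
    const struct lconv *lc = localeconv();

    assert(buf != NULL && hundredths >= 0 && hundredths < 100);
    return snprintf(buf, size, "%ld%s%02d",
                    units, lc->decimal_point, hundredths);
}
```

An application would normally call setlocale(LC_ALL, "") once at startup so that the separator follows the user's environment; in the default "C" locale the separator is a full stop.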

3.3.2 Text Representation

Remember that characters displayed by a computer are represented by numbers. The correspondence between numbers and characters (on most American computers) is defined by the ASCII encoding. There is nothing special about ASCII except that it is one of the most firmly established standards of the computer world. Text composed in one encoding (ASCII, for example) and displayed in another (perhaps EBCDIC, still used by IBM mainframes) will be nonsense because the number-to-character mappings of the encodings are not the same.

We've been using the term encoding rather loosely. Before we consider text representation any further, some definitions are appropriate (see chapter 2 for character and font graph). A character set [Adrian 93] is simply a set of characters; there are no numbers associated with those characters. An encoding is a mapping between numbers and the characters of a character set. The term codeset is sometimes used as a synonym for encoding. A charset (not the same as character set) is an encoding in which all characters have the same number of bits. ASCII is a 7-bit encoding, for example, and is therefore a charset. Figure 3-4 diagrams the relationship between character sets, characters, fonts, and glyphs.

The last two fields of an X font name specify a charset [BDF 94]. By definition, the index of a font glyph in the font is the same as the encoding of the corresponding character in the charset. When the encoding of a locale is a charset, this obviously simplifies matters a great deal: text in the locale can be displayed using glyphs from a single font, and the character encoding can be used directly as the index of the corresponding font glyphs.

[Figure 3-4: a character set (A, B, C, D, ...) plus an encoding (A = 65, B = 66, C = 67, D = 68, E = 69, F = 70, G = 71) gives a charset; the encoding of each character is the index of its glyph in the font.]

Figure 3-4 Character sets, encodings, charsets, fonts, and glyphs

In current implementations of X11R6, not all languages can be represented with a single charset (see chapter 2 for the detailed discussion). Japanese text, for example, commonly requires Japanese ideographic characters, Japanese phonetic characters, and Latin characters. Each of these character sets has its own standard fixed-width encoding: the ideographic charset is 16 bits wide while the phonetic and Latin charsets are 8 bits wide. Full Japanese text display requires a font for each charset, and Japanese text representation requires a "super-encoding" that combines each of the component encodings.

3.3.3 ISO8859-1 and Other Encodings

Most X fonts on the system have the charset "iso8859-1" (also known as "Latin-1"). Because there are fewer than 256 characters, all characters are 8 bits long, and there are no special shift sequences that modify the interpretation of characters. Because there are no shift sequences, it is possible to use the encoding of all Latin-1 characters directly as font indices.

Since ISO8859-1 contains a superset of the ASCII characters, its characters are a uniform 8 bits and its strings do not contain embedded shift states, most programs originally written for ASCII can, in conjunction with the ANSI-C internationalisation facilities, easily be ported for use in most Western European countries (see also chapter 1 for the discussion of the iso8859 family of encodings).

It is not so simple once we try to go beyond Western Europe and Latin-based alphabets. Japanese text [ken 93], for example, commonly uses (at least within the computer industry) words written in the Latin alphabet along with phonetic characters from the katakana and hiragana alphabets and ideographic kanji characters. This is done with shift sequences: bytes embedded in the running text which control the character set in which the following characters will be interpreted.

Compound Text [ISO 2022] is another text representation that is used in X applications. Compound Text strings identify their encoding using embedded escape sequences (they can also have multiple sub-strings with multiple encodings) and are locale-independent. The Compound Text representation was standardised as part of X11R4 for use as a text interchange format for interclient communication [David 94]. It is often used to encode text properties and for the transfer of text via selections, and is not intended for text representation internal to an application.

3.3.4 Multi-byte and Wide-Character Strings

Strings in encodings that contain shift sequences and characters of non-uniform width can be stored in standard null-terminated arrays of characters, but can be difficult to work with in this form: the number of characters in a string cannot be assumed to be equal to the number of bytes, and it is not possible to iterate through the characters in a string by simply incrementing a pointer. On the other hand, such strings can usefully be passed to standard C functions like strcat and strcpy and, assuming a terminal that understands the encoding, functions like printf work correctly with them.

As an alternative to these multi-byte strings, ANSI-C defines a wide-character type, wchar_t, in which each character has a fixed size and occupies one array element in the string. (The wchar_t is 2 bytes on some systems, 4 bytes on others, and may be 1 byte on systems that support nothing but the default C locale.) ANSI-C defines functions to convert between multi-byte and wide-character strings: mblen, mbstowcs, mbtowc, wcstombs, and wctomb. Multi-byte strings are usually more compact than wide-character strings, but wide-character strings are easier to work with.
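The difference between counting bytes and counting characters can be sketched with the ANSI-C function mblen. The helper name count_characters is an assumption made for this example; the result depends on the LC_CTYPE locale in force when it is called.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the characters (not the bytes) in a multi-byte string, using
 * the encoding of the current LC_CTYPE locale.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t count_characters(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    (void)mblen(NULL, 0);               /* reset any shift state */

    while (*s != '\0') {
        int n = mblen(s, MB_CUR_MAX);   /* bytes in the next character */

        if (n <= 0)
            return (size_t)-1;
        s += n;
        count++;
    }
    return count;
}
```

In the default "C" locale every character is one byte, so the count equals strlen; in a multi-byte locale the two values generally differ, which is exactly why pointer arithmetic on such strings is unsafe.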

Unfortunately the ANSI-C library does not provide adequate functions or conventions for sophisticated internationalised text manipulation. Xlib provides its own equivalent functions with extended functionality.

3.3.5 Locale Management

An internationalised X application begins in the same way as an ANSI-C terminal-based internationalised program: with a call to setlocale. An X program, however, requires two additional steps.

Immediately after calling setlocale, an application should call XSupportsLocale() to determine if the Xlib implementation supports the current locale. This function takes no arguments and returns a Boolean.

After verifying that the locale is supported, an application should call XSetLocaleModifiers(). A "locale modifier" can be thought of as an extension to the name of a locale; it specifies more information about the desired localised behaviour of an application. X11R6 as shipped by the X Consortium recognises one locale modifier, used to specify the input method to be used for internationalised text input for the locale.

3.3.6 Internationalised Text Output

X11R6 bases its new text output routines on a new Xlib abstraction, the XFontSet. An XFontSet is bound to the locale in which it is created, and contains all the fonts needed to display text in that locale, or all the independent charsets used in the encoding of that locale. Technical Japanese text, for example, often mixes Latin with Japanese characters, so for a Japanese locale, fonts might be required with the charsets jisx0208.1983-0 for Kanji ideographic characters, jisx0201.1976-0 for phonetic characters, and iso8859-1 for Latin characters.

Drawing internationalised text in X11R6 is conceptually very similar to drawing text in X11R4 - there are routines that allow you to query font metrics, measure strings, and draw strings. The X11R6 functions use an XFontSet rather than an XFontStruct or a font specified in a graphics context. The drawing and measuring routines interpret text in the encoding of the locale of the fontset, and correctly map wide or multi-byte characters to the corresponding font glyph (or glyphs).

3.3.7 String Encoding for Internationalisation

Perhaps the most fundamental concern of internationalisation is the encoding of strings. Because X is a networked window system, an X client must communicate with the X server, usually with a window manager, sometimes with a session manager, and often with other clients through the X selection mechanism [David 94] (which is used to implement copy-and-paste).

X11 defines the X Portable Character Set as a set of basic characters that must exist in all locales supported by Xlib. Those characters are:

a..z A..Z 0..9

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

<space>, <tab>, and <newline>

X11 also defines the Host Portable Character Encoding as the encoding for that character set. The encoding itself is not defined; the only requirement is that the same encoding is used for all locales on a given host machine. A string in the Host Portable Character Encoding is understood to contain only characters from the X Portable Character Set. The Latin Portable Character Encoding is the characters of the X Portable Character Set encoded as a subset of the Latin-1 encoding. In practice, however, it is likely that all systems will simply use an encoding which is a superset of ASCII (with the possible exception of mainframes that use EBCDIC), and therefore all characters in the X Portable Character Set will share a single, standard (ASCII) encoding.

3.3.8 Internationalised Interclient Communication

When writing an internationalised application it is not safe to assume that all interclient communication with text properties will be done with Latin-1 or ASCII strings. X11R6 provides some functions that do not make this assumption. The first is a convenience routine for communication with window managers. XmbSetWMProperties() is a function very similar to XSetWMProperties(), except that the window_name and icon_name arguments are multi-byte strings (rather than XTextProperty pointers) in the encoding of the locale. If these strings can be converted to the STRING encoding (Latin-1 plus newline and tab), then their corresponding WM_NAME and WM_ICON_NAME properties are created with type STRING. If this conversion cannot be performed, the strings are converted to Compound Text (this conversion can always be done, by the definition of Compound Text), and the properties are created with type COMPOUND_TEXT.

3.3.9 Localisation of Resource Database

X resources are a useful way to allow the localisation of strings: rather than hardcoding its strings, a client can look them all up by name from a locale-dependent resource file. However, the problem is that although resource values can be localised, and may contain text in the encoding of the locale, resource names must still be hardcoded into the application. This situation is unfortunate, but there is a way around it by using Unicode encoding and an additional translation database. See section 3.4.6 for details.

3.4 Multilingual support in X

The X Window System makes a real contribution to internationalisation. It has defined locales for most of the commonly used languages. However, multilingual support is not provided by the X Window System. This problem cannot be solved by combining different character sets, which is the basis on which the X Window System has been built. Unicode makes it possible to overcome this major problem.

3.4.2 Extending X

One possible approach is to follow the convention of the X Window System: define a Unicode locale, add all the functions needed for this locale and keep the API (Application Programming Interface) unchanged. With this approach, all the existing internationalisation features and functionality in Xlib discussed in the previous section are preserved; consequently full compatibility is achieved, which is crucial in the process of migration to Unicode. This approach should be considered an intermediate step towards full support of Unicode. Figure 3-5 shows how a locale fits into the system.

Another possible approach is to make the text input and output functions locale independent. The locale then affects Xlib only in its encoding of resource files and values, instead of affecting the encoding and processing of input method text, the encoding of resource files and values, the encoding and imaging of text strings, and the encoding and decoding of inter-client text communication.

[Figure 3-5: existing locales such as ko_KR.eucKR, ja_JP.eucJP, bg_BG.ISO8859-5, zh_CN.eucCN (the locale for Chinese) and en_GB.ISO8859-1. To add Unicode: define a Unicode locale, add new functions for this locale, and plug it in.]

Figure 3-5 Adding a new locale

This second approach should be used in a future major release of X, which should have only one encoding: Unicode. It will significantly reduce the complexity of the structure of the system, consequently reduce the number of lines of code, and be easier to maintain.

The disadvantage is that it will be very difficult to maintain compatibility with applications that were developed under the current implementation of X.

In our research implementation we take the first approach, for the reasons we have just discussed.

3.4.3 Font

Unicode is a character encoding standard, not a glyph standard. There is no requirement that a glyph mapped to the code point of a Unicode character should have a particular design, nor that glyphs in one Unicode subset share design features with those of another.

3.4.3.1 Homogenised typeface [Charles 94]

An easy way to develop a Unicode font is to arbitrarily assemble disparate fonts, say a Latin font of one design, a Greek of another, a Cyrillic of a third, math operators of a fourth, and so on, all of somewhat different styles, weights, widths, proportions, and shapes. Two different Unicode fonts assembled in this fashion could have identical Unicode character sets but very different glyphs. By converting all the existing font sets in X, we can have a set of homogenised typeface Unicode fonts. This can be easily achieved by assigning a Unicode encoding value to each font graph in a BDF (Bitmap Distribution Format) [BDF 94] file. The iso8859-1 fonts do not even need to be converted, since the iso8859-1 encoding is compatible with Unicode. What we need are aliases (names used in the Unicode environment) for these fonts. The following example shows how a Hebrew font could be converted to a Unicode sub-font; the font file is in BDF.

    STARTFONT 2.1
    COMMENT $XConsortium: heb6x13.bdf,v 1.3 94/04/02 16:26:51 gildea Exp$
    FONT -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO8859-8
    SIZE 13 78 78
    FONTBOUNDINGBOX 6 13 0 -3
    STARTPROPERTIES 19
    FONTNAME_REGISTRY ""
    FOUNDRY "Misc"
    FAMILY_NAME "Fixed"
    WEIGHT_NAME "Medium"
    SLANT "R"
    SETWIDTH_NAME "SemiCondensed"
    ADD_STYLE_NAME ""
    PIXEL_SIZE 13
    POINT_SIZE 120
    RESOLUTION_X 75
    RESOLUTION_Y 75
    SPACING "C"
    AVERAGE_WIDTH 60
    CHARSET_REGISTRY "ISO8859"
    CHARSET_ENCODING "8"
    DEFAULT_CHAR 0
    FONT_DESCENT 3
    FONT_ASCENT 10
    COPYRIGHT "Copyright (c) 1991 by Joseph Friedman."
    ENDPROPERTIES
    CHARS 186
    STARTCHAR ascii000
    ENCODING 0
    SWIDTH 461 0
    DWIDTH 6 0
    BBX 6 13 0 -3
    BITMAP
    ...

Notes on the listing: the registry name (CHARSET_REGISTRY and CHARSET_ENCODING) could be changed to UCS2-HEBREW, the proposed name for a Unicode subset font according to the naming convention in X. Characters in the ASCII range are ignored.

Converting from codepoint 250 (iso8859-8) to codepoint 05EA (Unicode):

    Before (iso8859-8)        After (Unicode)

    STARTCHAR taw             STARTCHAR taw
    ENCODING 250              ENCODING 05EA
    SWIDTH 461 0              SWIDTH 461 0
    DWIDTH 6 0                DWIDTH 6 0
    BBX 6 13 0 -3             BBX 6 13 0 -3
    BITMAP                    BITMAP
    ...                       ...
    ENDCHAR                   ENDCHAR
    ENDFONT                   ENDFONT

The new code point is shown here in hexadecimal; since BDF ENCODING values are decimal integers, an actual file would carry 1514.

This could be done with a conversion program. Developing such a utility program is not difficult.
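The core of such a conversion utility is the code point mapping. The sketch below covers only the two ranges that matter for the example above: the ASCII range, which is identical in both encodings, and the Hebrew letters 0xE0 to 0xFA of iso8859-8, which map to U+05D0 to U+05EA. The function name and the UCS2_UNMAPPED sentinel are assumptions made for this illustration, and the remaining iso8859-8 symbols are deliberately left unmapped here.

```c
#include <assert.h>

/* Sentinel for code points this sketch does not map (hypothetical). */
#define UCS2_UNMAPPED 0xFFFFu

/* Map an iso8859-8 code point to its Unicode (UCS-2) code point. */
unsigned int iso8859_8_to_ucs2(unsigned int code)
{
    assert(code <= 0xFF);

    if (code < 0x80)
        return code;                    /* ASCII range: unchanged */
    if (code >= 0xE0 && code <= 0xFA)
        return code - 0xE0 + 0x05D0;    /* Hebrew letters alef .. taw */
    return UCS2_UNMAPPED;               /* not handled by this sketch */
}
```

A converter would read each ENCODING line of the BDF file, pass the value through a function of this kind, and write the result back; for the character taw, code point 250 becomes 0x05EA (1514 decimal).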

A problem with this design is that the typographic features of documents, programming windows, screen displays, and other text-based images will not be preserved when transported between systems using different fonts. Any typographic element could change: type style, text line-endings, page breaks, document length, window sizes, and so on.

3.4.3.2 Harmonised typeface [Charles 94]

"Harmonisation" means that the basic weights and alignments of disparate alphabets are regularised and tuned to work together, so that their inessential differences are minimised, but their essential, meaningful differences preserved. Within a harmonised font, when text changes from Latin to Cyrillic, or from Greek to Hebrew, or when mathematical expressions or other symbols are introduced into text, the visual size, weight, and rhythm of the characters will not appear to change, will not jar or distract the reader, but the basic character shapes should nevertheless be distinctive and immediately recognisable.

This solves the problem of the homogenised typeface design. However, developing such a font will take years, because it is so large (more than 28,000 characters, with more to come) and must be originated by a single designer or by a close collaboration between a few designers. As a result we have fewer fonts to choose from (only Uni16 and Uni24 are available for use in the X Window System; in fact these fonts are not well designed and are intended only for research purposes).

3.4.3.3 One big font vs. many little fonts

The advantages of using many little fonts are economical memory management and greater font loading speed, since a complete Unicode character set is implemented as many small subfonts from which characters can be accessed independently [Pike 93]. However, one big font does not need the extra code that selects a subfont.

3.4.4 Multilingual Text Output

The simplest way to display a piece of plain Unicode encoded text is to use the locale neutral API functions XDrawString16(), XDrawImageString16(), or XDrawText16() with one big font.

The problem is that it has to use one fixed, big font, even though a user who processes only three languages (for example English, French and German) may need only a small font set. Also, applications that have already been internationalised will not work without modification, especially those built on high-level toolkits (Motif widgets), since they use different API functions. For example, the internationalised counterpart of XDrawString16() is XmbDrawString() or XwcDrawString().

Another possible approach is to use the internationalised text drawing API functions, adding new conversion functions mbstocs() and wcstocs() (see the source code LcUtf.c implemented by the X Consortium) under these API functions to support multilingual text drawing. Figure 3-6 shows the structure of this extension.

[Figure 3-6, Part 1 (structure diagram): beneath the text functions' API, setlocale() selects a locale (Chinese locale, French locale, ...) and the font is set; a decision (branch labelled "Yes") leads to the I18N/L10N or multilingual neutral 8-bit text functions and 16-bit text functions.]

Part 1 Structure Diagram

Figure 3-6 Structure of the extension

    Type                          Field

    XDestroyOCProc                destroy
    XSetOCValuesProc              set_values
    XGetOCValuesProc              get_values
    XmbTextEscapementProc         mb_escapement
    XmbTextExtentsProc            mb_extents
    XmbTextPerCharExtentsProc     mb_extents_per_char
    XmbDrawStringProc             mb_draw_string
    XmbDrawImageStringProc        mb_draw_image_string
    XwcTextEscapementProc         wc_escapement
    XwcTextExtentsProc            wc_extents
    XwcTextPerCharExtentsProc     wc_extents_per_char
    XwcDrawStringProc             wc_draw_string
    XwcDrawImageStringProc        wc_draw_image_string

[The diagram links these methods to the text functions' API, and also shows the XlcConv converters set up by setlocale(), an int field and the FontSet.]

Part 2 Related Data Structure

Figure 3-6 Structure of the extension

In this approach, the function setlocale() sets up all the proper converting functions for the selected locale from the conversion function list, which is initialised during startup. For example:

In file lcInit.c, add the Unicode loader _XlcUtfLoader to the loader list:

    #ifdef USE_UTF_LOADER
        _XlcAddLoader(_XlcUtfLoader, XlcHead);
    #endif

After setlocale() is called, it in turn calls _XlcUtfLoader(), which should be provided by the Xlib implementor like the loader of any other locale. _XlcUtfLoader() then loads all the conversion functions into the conversion function list.

Finally the setlocale() function finds the proper conversion functions in the list and puts the function addresses into the structure XOCGenericPart.

    XLCd
    _XlcUtfLoader(name)
        char *name;
    {
        XLCd lcd;

        lcd = _XlcCreateLC(name, _XlcGenericMethods);
        if (lcd == (XLCd) NULL)
            return lcd;

        if ((_XlcCompareISOLatin1(XLC_PUBLIC_PART(lcd)->codeset, "utf"))) {
            _XlcDestroyLC(lcd);
            return (XLCd) NULL;
        }

        _XlcSetConverter(lcd, XlcNMultiByte, lcd, XlcNCharSet, open_mbstocs);
        _XlcSetConverter(lcd, XlcNWideChar, lcd, XlcNCharSet, open_wcstocs);
        _XlcSetConverter(lcd, XlcNMultiByte, lcd, XlcNWideChar, open_utftowcs);
        _XlcSetConverter(lcd, XlcNWideChar, lcd, XlcNMultiByte, open_wcstoutf);
        _XlcSetConverter(lcd, XlcNMultiByte, lcd, XlcNChar, open_mbtocs);
        _XlcSetConverter(lcd, XlcNCharSet, lcd, XlcNMultiByte, open_cstombs);
        _XlcSetConverter(lcd, XlcNCharSet, lcd, XlcNWideChar, open_cstowcs);

        return lcd;
    }

The converting function converts Unicode encoded text into the charsets of the font set. Figure 3-7 illustrates the character rendering and its locale database in this approach.

[Figure 3-7, Part 1: the Unicode character U+6697 is converted to 1637 in JIS X 0208-1990, to 1621 in GB2312-80 and to 6862 in KS C 5601-1992; the differences between the glyphs of the three charsets are marked.]

Figure 3-7 Part 1: Character Rendering

Which font set (character set) the text is converted to depends on the order of these font sets in the Locale Database [Yoshio 94]. In the locale database (see Figure 3-7 Part 2 below), the Japanese font comes first, the Chinese font comes second and the Korean font comes last.

When the text combines, for example, words in the Latin alphabet, phonetic characters from the katakana and hiragana alphabets, and ideographic kanji, three font sets are required.

The advantage of this approach is that the existing code remains untouched (all the toolkits built on top of Xlib are 100% compatible) and any existing resources can be used directly (fonts, for example, do not need to be converted to Unicode).

    # XLocale Database Sample for en_US.utf

    # XLC_FONTSET category
    XLC_FONTSET
    # fs0 class (7 bit ASCII)
    fs0 {
        charset ISO8859-1:GL
        font ISO8859-1:GL
    }
    # fs1 class (Kanji)
    fs1 {
        charset JISX0208.1983-0:GL
        font JISX0208.1983-0:GL
    }
    # fs2 class (Chinese Han Character)
    fs2 {
        charset GB2312.1980-0:GL
        font GB2312.1980-0:GL
    }
    # fs3 class (Korean Character)
    fs3 {
        charset KSC5601.1987-0:GL
        font KSC5601.1987-0:GL
    }
    END XLC_FONTSET

Each fsN class names the fontset and charset needed for rendering one group of characters: fs0 for ASCII characters, fs1 for Japanese characters, fs2 for Chinese characters and fs3 for Korean characters.

Figure 3-7 Part 2: Locale Database

One of the main disadvantages is that adding a new writing system is difficult, because this approach still employs various encoding schemes and corresponding fonts, which have been proven unsuitable for multilingual applications. Another disadvantage is that when text combines, for example, Japanese ideographic characters and simplified Chinese ideographic characters, the user needs to specify the Japanese and Chinese font sets such that the displayed characters have the same typeface or font style. Simply letting the system choose the font set may produce an undesirable side effect.

The modified approach is to use a Unicode font. We can use either a single big font or many small subset fonts. The advantages of the modified approach are:

1. Reduced coding: converting functions are not required.

2. Reduced memory usage: mapping tables are not needed.

3. A new writing system can be added by simply providing a font.

4. All ideographic characters are in one set, the CJK Unified Ideographs, so there is no typeface problem.

Figure 3-8 illustrates the internal mechanism of this approach.

[Figure 3-8: beneath the text functions' API, setlocale() selects a locale (Unicode locale, Chinese locale, ..., French locale) and the font is set; a decision (branches labelled "No" and "Yes") leads to the I18N/L10N or multilingual neutral 8-bit text functions and 16-bit text functions.]

Figure 3-8 Internal Mechanism of the second approach

The following pseudo code shows clearly that no conversion is needed (see the function in file lcUtf.c for the details). All the code in the function is used to work out which subfont is needed for rendering the text, or to return immediately if a big font is used.

    static int
    my_ucstocs(conv, from, from_left, to, to_left, args, num_args)
        XlcConv conv;
        XPointer *from;
        int *from_left;
        char **to;
        int *to_left;
        XPointer *args;
        int num_args;
    {
        if (not initialized) {
            find the boundaries of the subfonts
                (starting code point and ending code point);
            add them to the list, from small code points to large code points;
            set initialized;
        }

        if (empty string)
            return 0;

        while (haven't found) {
            find which subfont the code point is in;
            for (not in the current subfont)
                try the next subfont;
            if (not in the last)
                return -1;
            else {
                return the ID of the subfont;
            }
        }
    }
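The subfont search in the pseudo code above can be sketched in plain C. This is an illustration under stated assumptions, not the code in lcUtf.c: the table subfonts, the type subfont_range and the function find_subfont are hypothetical, the name "ucs-hebrew" is invented for the example, and the boundaries are simply the Unicode block ranges for Latin-1, Hebrew and the CJK Unified Ideographs.

```c
#include <assert.h>
#include <stddef.h>

/* One subfont covers a contiguous range of Unicode code points. */
struct subfont_range {
    unsigned int first;     /* starting code point */
    unsigned int last;      /* ending code point */
    const char *name;       /* font name, for illustration only */
};

/* Hypothetical subfont list, sorted from small to large code points. */
static const struct subfont_range subfonts[] = {
    { 0x0000, 0x00FF, "ucs-latin1" },
    { 0x0590, 0x05FF, "ucs-hebrew" },
    { 0x4E00, 0x9FFF, "ucs-cjkunifiedideographs" },
};

/* Return the index (ID) of the subfont containing code_point,
 * or -1 if no subfont in the list covers it. */
int find_subfont(unsigned int code_point)
{
    size_t n = sizeof subfonts / sizeof subfonts[0];
    size_t i;

    for (i = 0; i < n; i++) {
        assert(subfonts[i].first <= subfonts[i].last);
        if (code_point < subfonts[i].first)
            break;          /* list is sorted: we are in a gap */
        if (code_point <= subfonts[i].last)
            return (int)i;
    }
    return -1;
}
```

No mapping table is consulted: the code point itself is already the glyph index, so the only work is choosing the subfont. With one big font the search disappears altogether.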

We also added one more function to switch between the X Consortium approach and our modified approach by inspecting the Locale Database. A flag is set if we use a Unicode charset and fontset whose names begin with "UCS", and the right function is chosen accordingly. For example (see also Figure 3-9):

    static int
    ucstocs(conv, from, from_left, to, to_left, args, num_args)
        XlcConv conv;
        XPointer *from;
        int *from_left;
        char **to;
        int *to_left;
        XPointer *args;
        int num_args;
    {
        if (use UCS fonts) {
            return my_ucstocs(conv, from, from_left, to, to_left, args, num_args);
        } else {
            return orig_ucstocs(conv, from, from_left, to, to_left, args, num_args);
        }
    }
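The test behind the "use UCS fonts" flag can be as simple as a prefix comparison on the charset name read from the Locale Database. The function name uses_ucs_fonts is an assumption made for this sketch; how the name is actually obtained from the database is not shown.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the charset name from the Locale Database begins with
 * "UCS" (modified approach), 0 otherwise (X Consortium approach). */
int uses_ucs_fonts(const char *charset_name)
{
    assert(charset_name != NULL);
    return strncmp(charset_name, "UCS", 3) == 0;
}
```

With the entries shown in the locale database figures, "UCS-2:GLGR" selects the modified function and "ISO8859-1:GL" selects the original one.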

    #
    # XLC_FONTSET category
    #
    XLC_FONTSET
    # fs0 class (UCS-2 families)
    fs0 {
        charset UCS-2:GLGR
        font ucs-latin1
    }
    # fs1 class (UCS-2 families)
    fs1 {
        charset UCS-2:GLGR
        font ucs-cjkunifiedideographs
    }
    END XLC_FONTSET

    #
    # XLC_XLOCALE category
    #
    XLC_XLOCALE
    encoding_name "UTF"
    mb_cur_max 3
    state_depend_encoding False

    # cs0 class
    cs0 {
        side GLGR:Default
        length 2
        ct_encoding UCS-2:GLGR
    }
    # cs1 class
    cs1 {
        side GLGR
        length 2
        ct_encoding UCS-2:GLGR
    }
    END XLC_XLOCALE

The XLC_FONTSET category lists the Unicode charset and its sub-fontsets, ranging from 0000 to FFFF.

Figure 3-9 Modified Locale Database

With this approach full backwards compatibility is also achieved; however, existing fonts, except iso8859-1 fonts, need to be converted to Unicode fonts.

3.4.5 Multilingual Interclient Communication

Unicode standard encoding is the right replacement for the Compound Text encoding, since Unicode has been designed for multilingual applications. Firstly, it is easily understood and easily implemented, since no sophisticated escape sequences are needed. Secondly, it is much more efficient, because the 16-bit format can be used directly without any translation. Lastly, the encoding of any particular locale can always be converted to Unicode without loss of information.

In our experimental implementation, we added a new ATOM* "UNICODE" to the XTextProperty, so that the application programmer can use it directly for interclient communication. Every existing ATOM type remains in the system for compatibility.

The UTF-8 [Unicode 93] format can also be used for this purpose. Its advantages are that it is compatible with historical file systems and with ASCII, is independent of byte ordering, and is suitable for networked systems. However, conversion from and to Unicode is needed, so it is less efficient than using the Unicode encoding directly.
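The conversion cost mentioned above is small. As a sketch, one 16-bit Unicode code point can be encoded as UTF-8 with a few shifts and masks; the function name ucs2_to_utf8 is an assumption made for this example, and the sketch makes no special provision for code points reserved as surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one 16-bit Unicode code point as UTF-8.
 * out must have room for 3 bytes. Returns the number of bytes written. */
size_t ucs2_to_utf8(unsigned int c, unsigned char *out)
{
    assert(out != NULL && c <= 0xFFFF);

    if (c < 0x80) {                     /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                    /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

ASCII text passes through unchanged, which is what makes the format compatible with historical file systems, and a 16-bit code point never needs more than three bytes.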

Figure 3-10 shows how the selection mechanism for interclient communication works. The selection mechanism is used to implement "Cut and Paste" or "Drag and Drop". When text is involved, the text is converted from and to Unicode instead of Compound Text.

[Figure 3-10: Process A (in Locale A) and Process B (in Locale B, with an insertion point) exchange text through the selection mechanism; text in Locale A is converted to Unicode or UTF-8 and then to Locale B.]

Figure 3-10 Interclient communication - Selection

3.4.6 Multilingual Resource Database

It is now possible for resource database files to be written in Unicode (UTF-8 format). However, resource names should still be in the X Portable Character Set (compatible with UTF-8), but this is something beyond the issue of encoding.

* ATOM An atom is a unique numeric ID corresponding to a string name. Atoms are used to identify properties, types, and selections in order to avoid the overhead of passing arbitrary length property name strings.

By using Unicode encoding, we can encode resource names in the range outside ASCII, but resource names must still be hardcoded into the application. For example, a Chinese user who wishes to customise the behaviour of an application written by a Japanese programmer will have to specify values for resources that are named in Japanese; the names are mnemonic to the Japanese programmer but meaningless to the Chinese (or American) user, and the Unicode encoding by itself gives no help in this situation. However, Unicode encoding (UTF-8 format) simplifies the matter greatly. By employing a translation mechanism between the resource database and the localised resource database files, only a dictionary with entries in multiple languages is needed (this is only possible when using Unicode, since the dictionary itself requires multilingual support). Resource names do not have to be restricted to the ASCII range, but all existing resource files written in ASCII remain compatible. Figure 3-10 shows the relationship between the resource database and the resource database files.

[Figure: a localised resource file, written entirely in UTF-8, is loaded and passed through a translation step that consults the Dictionary. Localised resource names and values are mapped to entries such as "*Background: grey" and "xterm*background: skyblue" in the X Resource Database.]

Figure 3-10 Relationship between resource database and resource database files

In this approach, both resource names and resource values can be localised. Although we only showed translation from Chinese to English in this example, the approach can be applied to any language. The dictionary itself could be created without much difficulty.
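The translation step can be sketched in C as a small lookup in which each dictionary entry maps a localised resource name (stored as UTF-8) to the name hardcoded in the application. The type DictEntry, the function translate_resource_name and the two table entries are assumptions made for this illustration; they are not part of Xlib or of our implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One dictionary entry: a localised resource name (UTF-8) and the
 * name that is hardcoded in the application. Hypothetical type. */
typedef struct {
    const char *localised;   /* e.g. a Chinese name, UTF-8 encoded */
    const char *canonical;   /* the name the application looks up  */
} DictEntry;

/* Illustrative dictionary; the entries are assumptions for this sketch. */
static const DictEntry dictionary[] = {
    { "\xE8\x83\x8C\xE6\x99\xAF", "background" },   /* U+80CC U+666F */
    { "\xE5\x89\x8D\xE6\x99\xAF", "foreground" },   /* U+524D U+666F */
};

/* Map a localised resource name to the hardcoded one. */
const char *translate_resource_name(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof dictionary / sizeof dictionary[0]; i++) {
        if (strcmp(name, dictionary[i].localised) == 0)
            return dictionary[i].canonical;
    }
    return name;   /* names without an entry (e.g. plain ASCII) are kept */
}
```

Names that have no dictionary entry pass through unchanged, which is why existing ASCII resource files would continue to work under such a scheme.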

This feature has not been implemented in our experimental implementation, since we believe it should be implemented in the toolkits. Recall that we are trying not to break the principle which guided the designers of the X Window System: do not add new functionality unless an implementor cannot complete a real application without it [Robert 92].

4 TEXT INPUT

4.1 Introduction

The internationalisation in the X Window System Version 11, Release 5 (X11R5) [Robert 92] provides a common API which application developers can use to create portable internationalised programs and to adapt them to the requirements of different native languages, local customs, and character string encodings (this is called "localisation"). As one of its internationalisation mechanisms, X11R5 defines a functional interface for internationalised text input, called XIM (X Input Method).

The protocol used to interface Input Method Servers (IM Servers) with the Input Method libraries (IM libraries) to which applications are linked was not addressed in X11R5. This led application developers to depend on vendor-specific input methods, and made it more difficult for developers to create portable applications. Fortunately this problem has been resolved in X11R6 [Masahiko 94]. However, as we mentioned in the text output section, X11R6 provides only a restricted multilingual programming environment. It has not addressed the problem that an X Client has to be in the same locale as the IM Server. The X Client cannot, for example, connect to both SJXA (a Japanese IM Server shipped by the X Consortium as a contribution) and a French Input Method (implemented in Xlib). This means that if the X Client is running in the Unicode locale, it cannot connect to any existing Input Method (mostly for European languages) or to SJXA, the Japanese IM Server. This makes the implementation of the Unicode locale incomplete and leaves us with only a monolingual environment, because it can accept only American English input.

This situation will not improve while the restriction remains: because of the size and complexity of these input methods, and because of how widely they vary from one language or locale to another, they are usually implemented as separate processes.

In this chapter, we provide a solution to this problem by removing the restriction in the current implementation, so that a multilingual X Client running in the Unicode locale will be able to connect to any Input Method, whether it already exists in the system or is added later, and whatever locale it runs in.

4.2 Background

Text input is much simpler for some languages than for others. English, for instance, uses an alphabet of a manageable size, and input consists of pressing the corresponding key on a keyboard, perhaps in combination with a shift key for capital letters or special characters.

Some languages have larger alphabets, or modifiers such as accents, which require the addition of special key combinations in order to enter text. These input methods [Adrian 93] may require "dead keys" or "compose keys" which, when followed by different combinations of keystrokes, generate different characters.

Text input for ideographic languages [Low 88] is considerably more complex. In these languages, characters represent actual objects rather than the phonetic sounds used in pronouncing a word, and the number of characters in these languages may continue to grow. In Japanese [Ken 93], for instance, most text input methods involve entering characters in a phonetic alphabet, after which the input method searches a dictionary for possible ideographic equivalents (of which there may be many). The input method then presents the candidate characters for the user to choose from.

In Japanese (see Figure 4-1), either Kana (phonetic symbols) or roman letters are typed and then a region is selected for conversion to Kanji. Several Kanji characters may have the same phonetic representation. If that is the case with the string entered, a menu of characters is presented and the user must choose the appropriate one. If no choice is necessary or a preference has been established, the input method does the substitution directly.
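The conversion step can be pictured as a dictionary lookup from a phonetic reading to its candidate characters. The following minimal C sketch assumes a tiny hand-written table. The types, the function names and the table contents are illustrative only; a real IM Server relies on far larger dictionaries and on grammatical analysis.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CANDIDATES 4

/* One dictionary entry: a romanised reading and its candidate characters. */
typedef struct {
    const char *reading;
    size_t count;
    const char *candidates[MAX_CANDIDATES];   /* UTF-8 strings */
} ReadingEntry;

/* Illustrative table; a real dictionary holds many thousands of entries. */
static const ReadingEntry readings[] = {
    { "hashi", 3, { "\xE6\xA9\x8B",      /* U+6A4B, bridge     */
                    "\xE7\xAE\xB8",      /* U+7BB8, chopsticks */
                    "\xE7\xAB\xAF" } },  /* U+7AEF, edge       */
    { "yama",  1, { "\xE5\xB1\xB1" } },  /* U+5C71, mountain   */
};

static const ReadingEntry *find_reading(const char *reading)
{
    size_t i;

    for (i = 0; i < sizeof readings / sizeof readings[0]; i++)
        if (strcmp(readings[i].reading, reading) == 0)
            return &readings[i];
    return NULL;
}

/* Number of candidates the user would be offered for a reading. */
size_t candidate_count(const char *reading)
{
    const ReadingEntry *e;

    assert(reading != NULL);
    e = find_reading(reading);
    return e ? e->count : 0;
}

/* Returns the string to commit. With exactly one candidate the
 * substitution is direct; with several, choice is the index picked
 * from the menu. Unknown readings and out-of-range choices leave the
 * input unchanged. */
const char *convert_reading(const char *reading, size_t choice)
{
    const ReadingEntry *e;

    assert(reading != NULL);
    e = find_reading(reading);
    if (e == NULL)
        return reading;
    if (e->count == 1)
        return e->candidates[0];
    return choice < e->count ? e->candidates[choice] : reading;
}
```

A reading such as "hashi" yields three candidates and therefore needs a menu, whereas "yama" has a single candidate and is substituted directly.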

These complicated input methods must present state information (Status Area), text entry and edit space (Preedit Area), and menu/choice presentations (Auxiliary Area). They may require one or more areas in which to show the feedback of the actual keystrokes, to propose disambiguation to the user, to list dictionaries, and so on. The input method areas of concern are as follows:

• Status Area:
The status area is a logical extension of the LEDs that exist on the physical keyboard. It is a window that is intended to present the internal state of the input method that is critical to the user. The status area may consist of text data, bitmaps, or some combination of the two.

• Preedit Area:
The preedit area displays the intermediate text for those languages that compose text before the client handles the data.

Much of the communication between the IM library and the IM Server involves managing these IM areas. Because of the size and complexity of these input methods, and because of how widely they vary from one language or locale to another, they are usually implemented as separate processes which can serve many client processes on the same computer or network.

[Figure: an input window showing the Status, Preedit and Auxiliary areas.]

Figure 4-1

4.3 Input Method

X11 internationalisation support includes the following four types of input method:

• on-the-spot:
The client application is directed by the IM Server to display all pre-edit data at the site of text insertion. The client registers callbacks which are invoked by the input method during pre-editing.

• off-the-spot:
The client application provides display windows for the pre-edit data to the input method, which displays into them directly. See Figure 4-2.

Figure 4-2 Off-the-spot style

• over-the-spot:
The input method displays pre-edit data in a window which it brings up directly over the text insertion position. See Figure 4-3.

Figure 4-3 Over-the-spot style

• root-window:
The input method displays all pre-edit data in a separate area of the screen, in a window specific to the input method. See Figure 4-4.

Figure 4-4 Root-window style

Client applications must choose from the available input methods supported by the IM Server and provide the display areas and callbacks required by the input method.

4.4 Architecture

Within the X Window System environment, the following two typical architectural models can be used to implement an input method.

• Client/Server model:
A separate process, the IM Server, processes input and handles pre-editing, converting, and committing. The IM library within the application, acting as a client to the IM Server, simply receives the committed string from the IM Server.

• Library model:
All input is handled by the IM library within the application. Event processing is contained within the IM library, and a separate IM Server process may not be required.

Figure 4-5 diagrams several possible connections between a client and its input method.

4.4.1 Client/Server Model vs. Library Model

Most languages which need complex pre-editing, such as Asian languages, are implemented using the Client/Server IM model. Other languages which need only dead-key or compose-key processing, such as European languages, are implemented using the Library model.

[Figure: X Clients "B" and "C" (locale y) use a complex Asian IM and connect to a single (back-end) input server. The input server connects with a translation server which performs dictionary lookup and which can also serve other non-X clients on the host or network. X Client "A" (locale x) uses a simple European IM built into Xlib.]

Figure 4-5 Possible connections between a client and its input method.

4.5 Connect to Input Method

When the client connects to or disconnects from the IM Server, an open or close operation occurs between the client and the IM Server.

The IM can be specified at the time of XOpenIM() by setting the locale of the client and a locale modifier. Since the IM remembers the locale at the time of creation, XOpenIM() can be called multiple times (with the settings for the locale and the locale modifier changed) to support multiple languages (note that this is not the same as multilingual support). Figure 4-6 diagrams the Client/Server model of connection.

[Figure: within one X Client, instance 1 (or text field 1) in Locale A is connected to IM Server J, and instance 2 (or text field 2) in Locale B is connected to IM Server K; all communicate through the X Server.]

Figure 4-6 Client/Server model connection

4.6 Input Context

Just as the X server can display multiple windows for a single client, an input method can serve multiple text input fields, and Xlib provides a value called the "Input Context" (IC) to manage each individual input field. The function XCreateIC() creates a new input context in an input method, and returns an opaque handle of type XIC.

Like the Window or GC types, XIC has a number of attributes which can be set. These attributes control the interaction style for input done under that context, the regions to be used for the Preedit and Status areas, the XFontSet with which the text should be drawn, and so on.

A text editor that supports multiple editing windows within a single top-level window could choose to create one IC for each editing window, or to share a single IC among all such windows. In the first case, each window would have different Preedit and Status areas, and each could be in a different intermediate state of pre-editing. In the second case, there would be a single Preedit area and a single Status area shared by all editing windows, and the application would probably reset the state of the IC each time the input focus moved from one window to another. An IC can be destroyed using XDestroyIC().

4.6.1 Input Context Focus Management

Since there is only one keyboard associated with an X display, X allows only one window to have the input focus at a time. For the same reason, only one input context (per application) can have the focus at a time. If an application has multiple text entry windows using multiple input contexts, that application will have to call XSetICFocus() every time the input focus changes. An application that shares a single IC among multiple text entry windows will have to set the Focus Window attribute of that IC each time the focus changes.

4.6.2 Preedit and Status Area Geometry Management

Depending on the interaction style, an input method may require screen space to display preedit and status information. The application is responsible for providing these areas, except in the on-the-spot style, where the application displays the information itself. Geometry management and geometry negotiation are handled through attributes of each input context and a "geometry callback" function [Joel 94].

4.6.3 Preedit and Status Callbacks

When using the on-the-spot interaction style, the IM asks the application to display the preedit and status information on its behalf. This is more complicated for the application, but because the application has finer control over the positioning of the information, it allows the appearance of a seamless interface with the IM. The IM makes requests of the application through a series of callback functions specified as attributes of the IC.

4.6.4 Getting Composed Input

When the application gets a KeyPress event, it will normally pass that event to XmbLookupString() or XwcLookupString(), which return multi-byte or wide-character strings in the codeset of the locale. Because it may take multiple keystrokes to enter a single character of text, these functions may return a status code that indicates that no composed input is ready.
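The "no composed input yet" status can be illustrated with a toy dead-key composer written in plain C. This is only a model of the behaviour and does not call Xlib: the enumeration values mirror the idea behind XLookupNone and XLookupChars, while the type ComposeState, the function compose_key and the choice of the apostrophe as a dead acute key are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified status codes, modelled on XLookupNone and XLookupChars. */
typedef enum { LOOKUP_NONE, LOOKUP_CHARS } LookupStatus;

typedef struct {
    int acute_pending;   /* non-zero after the dead key has been pressed */
} ComposeState;

#define DEAD_ACUTE '\''  /* assumption: the apostrophe acts as dead acute */

/* Latin-1 code for a vowel with an acute accent, or the base itself. */
static unsigned int with_acute(char base)
{
    switch (base) {
    case 'a': return 0x00E1;
    case 'e': return 0x00E9;
    case 'i': return 0x00ED;
    case 'o': return 0x00F3;
    case 'u': return 0x00FA;
    default:  return (unsigned char)base;   /* no accented form exists */
    }
}

/* Feed one keystroke; *out is set only when LOOKUP_CHARS is returned. */
LookupStatus compose_key(ComposeState *st, char key, unsigned int *out)
{
    assert(st != NULL && out != NULL);

    if (!st->acute_pending && key == DEAD_ACUTE) {
        st->acute_pending = 1;      /* wait for the base character */
        return LOOKUP_NONE;
    }
    *out = st->acute_pending ? with_acute(key) : (unsigned char)key;
    st->acute_pending = 0;
    return LOOKUP_CHARS;
}
```

The first keystroke of a two-key sequence produces no character, so an application has to check the status before using the returned text.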

4.6.5 Event Handling

In order for an input method to perform pre-editing of input, it must have access to all KeyPress events; XFilterEvent() provides the hook that makes this possible. This function must be called from within the application's event loop before the event is processed any further, so that the filter is invoked and the IM has a chance to examine the event. If XFilterEvent() returns True, the input method has consumed the event and the application should discard it.

Existing input methods support either the FrontEnd method, the BackEnd method, or both. The BackEnd method is the default method in the X11R6 implementation.

The difference between the FrontEnd and BackEnd methods is in how events are delivered to the IM Server. See Figure 4-7.

[Figure: two diagrams, each showing the X Client with its IM library, the IM Server and the X Server. One is labelled "BackEnd Method (default)" and the other "FrontEnd Method (Extension)".]

Figure 4-7 The Flow of Events

4.6.5.1 BackEnd Method

In the BackEnd method, client window input events are always delivered to the IM library, which then passes them to the IM Server. Events are handled serially in the order delivered, and therefore there is no synchronisation problem between the IM library and the IM Server. Using this method, the IM library forwards all KeyPress and KeyRelease events to the IM Server and synchronises with the IM Server.

4.6.5.2 FrontEnd Method

In the FrontEnd method, client window input events are delivered by the X server directly to both the IM Server and the IM library. This method therefore provides much better interactive performance while pre-editing (particularly when the IM Server is running locally on the user's workstation and the client application is running on another workstation over a relatively slow network).

However, the FrontEnd method may have synchronisation problems between the key events handled in the IM Server and other events handled in the client, and these problems could cause the loss or duplication of key events. For this reason, the BackEnd method is the default method supported in X11R6. Figure 4-8 diagrams the path a character follows between being typed on the keyboard and being displayed on the screen in the BackEnd model.

[Figure: a hardware signal from the keyboard reaches the X Server, which sends a KeyPress event to the client window. In the X Client's event loop, XNextEvent() fetches the event and XFilterEvent() lets the input method (XIM/XIC) filter it; other events are handled as usual. XwcLookupString() then returns a KeySym or composed text, and when the status is XLookupChars the client draws the composed text with XwcDrawString().]

Figure 4-8 How a keystroke becomes a displayed character

4.7 Layering of IM

The Xlib XIM implementation in X11R6 is divided into three layers: a protocol layer, an interface layer and a transport layer. The purpose of this layering is to make the protocol independent of the transport implementation. The function of each layer is as follows:

• The protocol layer:
implements the overall function of XIM and calls the interface layer functions when it needs to communicate with the IM Server.

• The interface layer:
separates the implementation of the transport layer from the protocol layer; in other words, it provides implementation-independent hooks for the transport layer functions.

• The transport layer:
handles the actual data communication with the IM Server. This is done by a set of functions named transporters.

This approach makes various communication channels usable, such as the X protocol, TCP/IP, DECnet, STREAM, etc., and provides the information needed for adding a new transport layer. See Figure 4-9.
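One way to picture the interface layer is as a table of function pointers which the protocol layer calls without knowing which transport lies behind it. The sketch below is plain C and is not taken from the Xlib source: the Transport structure, the loopback transporter and protocol_round_trip are hypothetical names chosen for illustration, and the loopback transporter merely stores and returns the bytes it is given.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Interface layer: the hooks every transporter must provide. */
typedef struct Transport {
    const char *name;
    int (*send_bytes)(struct Transport *t, const void *data, size_t len);
    int (*recv_bytes)(struct Transport *t, void *buf, size_t cap);
    unsigned char buffer[64];   /* storage used by the loopback transporter */
    size_t used;
} Transport;

/* Transport layer: a loopback transporter that keeps what it is given. */
static int loopback_send(Transport *t, const void *data, size_t len)
{
    if (len > sizeof t->buffer)
        return -1;
    memcpy(t->buffer, data, len);
    t->used = len;
    return (int)len;
}

static int loopback_recv(Transport *t, void *buf, size_t cap)
{
    size_t n = t->used < cap ? t->used : cap;

    memcpy(buf, t->buffer, n);
    t->used = 0;
    return (int)n;
}

void transport_init_loopback(Transport *t)
{
    assert(t != NULL);
    t->name = "loopback";
    t->send_bytes = loopback_send;
    t->recv_bytes = loopback_recv;
    t->used = 0;
}

/* Protocol layer: talks to the server through the hooks only.
 * Returns the length of the reply, or -1 on failure. */
int protocol_round_trip(Transport *t, const char *request,
                        char *reply, size_t cap)
{
    int n;

    assert(t != NULL && request != NULL && reply != NULL && cap > 0);
    if (t->send_bytes(t, request, strlen(request)) < 0)
        return -1;
    n = t->recv_bytes(t, reply, cap - 1);
    if (n < 0)
        return -1;
    reply[n] = '\0';
    return n;
}
```

Adding a new channel in this model means writing another pair of send and receive functions and an initialiser; the protocol function is not touched.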

[Figure: the X Client (Xlib) and the IM Server each contain, from top to bottom, the Application Layer, the Presentation Layer, the Protocol Layer, the Interface Layer and the Transport Layer. The presentation layers convert between the current locale encoding and Compound Text encoding. The transport layers are joined by one of several channels: X protocol, TCP/IP, DECnet or STREAM.]

Figure 4-9 Layering of IM

4.8 Multilingual Text Input

As we mentioned in the Introduction, the main hurdle we need to overcome is the restriction that the client has to be in the same locale as the IM Server.

One approach to resolving this problem is to develop a multilingual IM Server which runs in the Unicode environment and is able to input text for all languages, or can be extended to do so in the future. Figure 4-10 illustrates the unified input method block diagram.

The advantage of this approach is that there is no need to modify existing code to migrate from an internationalised application to a multilingual application. Additionally, no extra effort is needed for the application programmer to write a new multilingual application, since no new API function has been added. The downside is that it is difficult to develop such an IM Server, because it is unlikely that a single algorithm will suit all the writing systems in the world. The only way is to integrate all the algorithms together in one IM Server, which consequently makes such an IM very complex. Also, the existing internationalised applications which run in different locales will not be able to connect to this multilingual Server.

[Figure: a multilingual X Client connected, through the X Server, to a single multilingual IM Server (which includes the simple IM in Xlib).]

Figure 4-10 Unified input method architecture

Another approach is to remove the restriction in the current implementation of the X Window System, so that a client running in one locale is able to connect to an IM Server or Input Method developed for a different locale. See Figure 4-11.

[Figure: internationalised X Client "A" and IM Server "J" run in Locale A. Multilingual X Clients "B" and "C" and multilingual IM Server "K" run in the Unicode locale. One-way conversions from locale A to Unicode and from locale m to Unicode link the components. An internationalised X Client "m" runs in Locale m; the figure assumes that there is no IM for locale m, but that multilingual IM Server "K" supports locale m.]

Figure 4-11 Input method in Multilingual environment

This approach is better because it can make use of all the existing input methods as well as IMs developed for the Unicode locale, and because future expansion is easier and more flexible than with the first approach. The disadvantage is that new code needs to be added to the application in order to connect to more than one input method. For example, XOpenIM() will be called more than once, and some management code is likely to be needed too.

4.8.1 Achieving the Goal

We have mentioned that a connection to an input method is opened with a call to XOpenIM(). The current implementation of the function takes as arguments the Display, an XrmDatabase, and a resource name and resource class of type char *.

    XOpenIM(Display *display, XrmDatabase db, char *res_name, char *res_class)

    display     Specifies the connection to the X server.
    db          Specifies a pointer to the resource database.
    res_name    Specifies the full resource name of the application.
    res_class   Specifies the full class name of the application.

It also uses the current locale and locale modifiers as implicit arguments. The locale determines the default input method that XOpenIM() will connect to, as well as the encoding of the strings which will be returned by XmbLookupString() and XwcLookupString(). See Figure 4-12.

[Figure: the X Client (Xlib) and the IM Server each consist of an Application Layer, a Presentation Layer and communication layers, joined by a communication channel which carries Compound Text encoding. On the client side the presentation layer converts to the encoding of the locale which is bound to the IM; on the server side it converts from the encoding of the locale which is bound to the IM.]

Figure 4-12 Input method structure before the modification

By examining the source code of the current implementation, we found that separating the locale to which the IM is bound from the locale in which the returned strings are encoded is achievable without substantial modification of the source code, while keeping full compatibility.

We added two new API functions which do not take implicit arguments. The first is _XSetLocaleModifiers():

    _XSetLocaleModifiers(char *locale_name, char *modifiers)

locale_name specifies the locale which the IM supports; normally it will be the locale in which the IM or IM Server runs. If NULL is passed as the first argument, the call is equivalent to XSetLocaleModifiers(), which takes the current locale as an implicit argument.

    char *
    _XSetLocaleModifiers(locale_name, modifiers)
        char *locale_name;
        char *modifiers;
    {
        XLCd lcd;
        char *user_mods;

        /* added: fall back to the current locale when no name is given */
        if ((locale_name == NULL) || (*locale_name == '\0'))
            lcd = _XlcCurrentLC();
        else                                    /* added */
            lcd = _XOpenLC(locale_name);

        if (!lcd)
            return (char *) NULL;
        if (!modifiers)
            return lcd->core->modifiers;
        user_mods = getenv("XMODIFIERS");
        modifiers = (*lcd->methods->map_modifiers) (lcd, user_mods,
                                                    (char *)modifiers);
        if (modifiers) {
            if (lcd->core->modifiers)
                Xfree(lcd->core->modifiers);
            lcd->core->modifiers = (char *)modifiers;
        }
        return (char *)modifiers;
    }

The second is _XOpenIM():

    _XOpenIM(char *locale_name, Display *display, XrmDatabase db, char *res_name, char *res_class)

locale_name specifies the locale which the IM supports; normally it will be the locale in which the IM or IM Server runs. If NULL is passed as the first argument, the call is equivalent to XOpenIM(), which takes the current locale as an implicit argument.

Finally, we need to modify the function _XimGetEncoding() in the presentation layer. This function is executed to set the converter after XOpenIM() is called.

    Private Bool
    _XimGetEncoding(im, buf, name, name_len, detail, detail_len)
        Xim      im;
        CARD16  *buf;
        char    *name;
        int      name_len;
        char    *detail;
        int      detail_len;
    {
        XLCd     lcd = im->core.lcd;
        CARD16   category = buf[0];
        CARD16   idx = buf[1];
        int      len;
        XlcConv  ctom_conv;
        XlcConv  ctow_conv;
        XLCd     cur_lcd;                       /* added */

        cur_lcd = _XlcCurrentLC();              /* added */

        if (idx == (CARD16)XIM_Default_Encoding_IDX) { /* XXX */
            /*
             * The unmodified code opened a converter from compound text
             * to multibyte text in the same locale that is bound to the IM:
             *
             * if (!(ctom_conv = _XlcOpenConverter(lcd, XlcNCompoundText,
             *                                     lcd, XlcNMultiByte)))
             */

            /*
             * The modified code opens a converter from compound text to
             * multibyte text between two different locales when
             * lcd != cur_lcd. Otherwise it behaves exactly like the
             * unmodified code shown above.
             */
            if (!(ctom_conv = _XlcOpenConverter(lcd, XlcNCompoundText,
                                                cur_lcd, XlcNMultiByte)))
                return False;
        }
        /* ... */
    }

With these modifications in place, we can not only connect to an IM which runs in a different locale, but can also place the two new functions anywhere in the application code, as long as _XSetLocaleModifiers() is called before _XOpenIM(). Input methods can therefore be opened or closed dynamically instead of only in the initialisation part of the code. See Figure 4-13.

[Figure: the same structure as Figure 4-12, with the X Client (Xlib) and the IM Server joined by a communication channel carrying Compound Text encoding. On the client side the presentation layer now converts to the encoding of a locale which is not bound to the IM; on the server side it still converts from the encoding of the locale which is bound to the IM.]

Figure 4-13 Input Method Structure after the modification

4.9 Application Programming

In this section we discuss one issue related to application programming: how to make use of the new functions to convert an internationalised application into a multilingual application.

4.9.1 Programming based on Xlib or higher level toolkits

Programming based only on Xlib is the simplest case. All that application implementers need to do is to use _XSetLocaleModifiers() and _XOpenIM() instead of XSetLocaleModifiers() and XOpenIM(). They can be called as many times as needed, anywhere in the program, to connect to different IMs or IM Servers, as long as the order is preserved. When programming with higher level toolkits, the two functions XtSetLanguageProc() and XtAppMainLoop() cannot be used in the application; the implementers need to write their own replacements. The first is simple, because this function is a combination of setlocale(), XSupportsLocale() and XSetLocaleModifiers(); calling them separately and replacing the last one with the new function solves the problem.

XtAppMainLoop() is also a convenience procedure, which contains two functions: XtAppNextEvent() and XtDispatchEvent(). This is not enough for multilingual applications. The code for the event loop will be something like the following:

    while (1) {
        XNextEvent(dpy, &event);           /* get the next event */
        if (XFilterEvent(&event, None))    /* let the IM examine the event */
            continue;
        switch (event.type) {
        case KeyPress:                     /* the event is a key press */
            /* Use the internationalised function to get the result from
             * the IM. By switching between different ICs, we switch
             * between the different input methods. */
            len = XwcLookupString(ic, &event.xkey, buffer, buf_len,
                                  &keysym, &status);
            ...                            /* do some error checking */
            switch (status) {
            case XLookupNone:              /* no composed key */
                break;
            case XLookupKeySym:
            case XLookupBoth:
                ...                        /* do something */
                break;
            case XLookupChars:
                ...                        /* do something else */
                break;
            }
            break;
        case ...:                          /* other events */
        }
    } /* while */

The complexity is application dependent. The loop could be integrated into a convenience procedure to reduce the amount and complexity of the code.

4.10 Miscellaneous

Please note that it does not make sense for, say, a client running in a French locale to connect to a Chinese input method, just as it does not make sense to convert text encoded in a French locale to text encoded in a Chinese locale, although the functions are capable of doing so.

In this implementation we have not attempted to develop a globalised IM Server. However, we made some modifications to the Japanese IM Server SJXA, which is shipped as a contribution, so that it can now run in the Unicode locale; see Figure 4-14.

Figure 4-14 SJXA runs in Unicode locale in Japanese input mode

It could also be extended to support other languages such as Chinese and Korean. See Figure 4-15.

Figure 4-15 SJXA runs in Unicode locale in Chinese input mode

We feel that the problem of how to load the resource files dynamically could be the main obstacle that remains unsolved; it is outside the scope of this thesis.

5 VALIDATION AND CONCLUSION

5.1 Introduction

In this chapter we attempt to assess and validate the results obtained in the development and implementation of the research version of the X Window System, and to conclude that a multilingual programming environment has been achieved.

5.2 What has been achieved

• We have added new conversion functions to the existing system and modified existing ones, and the encoding conversion structure used in interclient communication and other areas has been fundamentally changed. It is generally possible to create lossless two-way conversion routines between Unicode and another encoding scheme. The new structure is superior to the old one, which was based on Compound Text encoding (ISO 2022): it is easier to add a new conversion function, because only a two-way conversion is needed; it is more efficient, because no escape sequences need to be checked and translated; and it is more manageable (see Figure 5-2, and imagine there are five more encodings; in reality, the number is much larger than that).

[Figure: Encodings A to F are each linked to the Unicode encoding by a two-way conversion; adding another encoding, Encoding G, requires only one more two-way conversion.]

Figure 5-1 New conversion structure

• Localisation is greatly simplified, because no conversion is needed; all that is basically needed is a directory in which to store the localised resource files and locale database files.

[Figure: Encodings A to F, with a new encoding, Encoding G, being added.]

Figure 5-3 Old conversion structure

• Text drawing performance is also improved, because (if a big font is used) there is no need to find out which charset and font to use for the rendering. For example, to display a piece of text written in Japanese, three charsets are normally needed in the old system.

• Choosing an Input Method is more flexible: an application can connect to an Input Method developed for a different locale if it needs to.

• There are more functions for application programmers to choose from compared with the ANSI-C library. Where equivalent functions exist in both libraries, the function in Xlib generally has more functionality than the one in the ANSI-C library.
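The first point above, that each encoding needs only one two-way conversion to and from Unicode, can be sketched in C as follows. Each encoding supplies a pair of functions, and any source-to-destination conversion is composed from them. The Codec structure, the typedef UCS2 and the function convert_via_unicode are hypothetical names used for illustration only. ASCII and ISO 8859-1 are used because their mappings to Unicode are the identity on their respective ranges.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short UCS2;   /* 16-bit Unicode value; hypothetical name */

/* The two-way conversion that every encoding contributes. */
typedef struct {
    const char *name;
    int (*to_unicode)(unsigned char byte, UCS2 *out);    /* 1 on success */
    int (*from_unicode)(UCS2 ch, unsigned char *out);    /* 1 on success */
} Codec;

static int ascii_to_unicode(unsigned char b, UCS2 *out)
{
    if (b > 0x7F)
        return 0;
    *out = b;
    return 1;
}

static int ascii_from_unicode(UCS2 ch, unsigned char *out)
{
    if (ch > 0x7F)
        return 0;
    *out = (unsigned char)ch;
    return 1;
}

static int latin1_to_unicode(unsigned char b, UCS2 *out)
{
    *out = b;                  /* ISO 8859-1 maps directly to U+0000..U+00FF */
    return 1;
}

static int latin1_from_unicode(UCS2 ch, unsigned char *out)
{
    if (ch > 0xFF)
        return 0;
    *out = (unsigned char)ch;
    return 1;
}

const Codec ascii_codec  = { "ASCII", ascii_to_unicode, ascii_from_unicode };
const Codec latin1_codec = { "ISO8859-1", latin1_to_unicode, latin1_from_unicode };

/* Convert n single-byte characters from src to dst through Unicode.
 * Returns how many characters were converted; conversion stops at the
 * first character the destination encoding cannot represent. */
size_t convert_via_unicode(const Codec *src, const Codec *dst,
                           const unsigned char *in, size_t n,
                           unsigned char *out)
{
    size_t i;
    UCS2 ch;

    assert(src != NULL && dst != NULL && in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (!src->to_unicode(in[i], &ch) || !dst->from_unicode(ch, &out[i]))
            break;
    }
    return i;
}
```

Under this arrangement, supporting one more encoding means adding one Codec; no existing codec has to change.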

5.3 Future research directions

This experimental implementation conforms only to Unicode level 1 (precomposed characters only). Level 2 (restricted combining characters) and level 3 (unrestricted combining characters) have not been implemented. Also, only left-to-right writing systems have been implemented; right-to-left writing systems have not been considered. The main reason is that I do not have sufficient knowledge of many different languages and their writing systems.

• The X Window System and ANSI-C are out of synchronisation. The internationalisation part, and now the multilingual part, of X are built on top of the ANSI-C locale mechanism. The problem is that ANSI-C supports fewer locales than X does. Although X has developed its own functions to overcome this problem, information such as currency symbols and monetary formatting (LC_MONETARY), the decimal-point character and numeric formatting (LC_NUMERIC), and time and date formatting (LC_TIME) is not covered by Xlib; X still depends on ANSI-C to provide this information. The problem can be solved only if X is self-contained.

5.4 Problems that cannot be solved

There are problems we cannot solve by building a multilingual environment on top of historical operating systems:

• Messages generated by the operating system cannot be localised. They remain in English, or in whatever other language the system uses, because they are all based on a monolingual encoding scheme. The exceptions are Plan 9 and Windows NT, which already use Unicode encoding.

5.5 Conclusion

In this project, we have successfully integrated Unicode into the most widely adopted windowing system, the X Window System. Powered by a 16-bit universal character encoding, the X Window System is now able to provide application programmers with an unrestricted multilingual programming environment. It is also fully backwards compatible: all existing standard encoding schemes which have been widely used are still supported. This proves that an unrestricted multilingual environment can be provided without the support of the underlying operating system (full localisation cannot be achieved, however). It provides an intermediate means of migrating from various encoding schemes to a uniform Unicode encoding without loss of any previous investment.

APPENDIX: Important Data Structures

XIMMethodsRec, *XIMMethods
    Status (*close)(XIM)
    char* (*set_values)(XIM, XIMArg*)
    char* (*get_values)(XIM, XIMArg*)
    XIC (*create_ic)(XIM, XIMArg*)
    int (*ctstombs)(XIM, char*, int, char*, int, Status*)
    int (*ctstowcs)(XIM, char*, int, wchar_t*, int, Status*)

XIMRec, *XIM
    XIMMethods methods
    XIMCoreRec core
    XIMPrivateRec private

XIMCoreRec, *XIMCore
    XLCd lcd
    XIC ic_chain
    Display* display
    XrmDatabase rdb
    char* res_name
    char* res_class
    XIMValuesList* im_values_list
    XIMValuesList* ic_values_list
    XIMStyles* styles
    XIMCallback destroy_callback
    char* im_name
    XIMResourceList im_resources
    unsigned int im_num_resources
    XIMResourceList ic_resources
    unsigned int ic_num_resources
    Bool visible_position

XIMPrivateRec
    XimLocalPrivateRec local
    XimProtoPrivateRec proto

XimLocalPrivateRec
    XIC current_ic
    DefTree* top
    XlcConv ctom_conv
    XlcConv ctowc_conv

XimProtoPrivateRec
    Window im_window
    XIMID imid
    CARD16 [illegible]
    [illegible entry]
    CARD32* im_onkeylist
    CARD32* im_offkeylist
    BITMASK32 flag
    BITMASK32 registed_filter_event
    EVENTMASK forward_event_mask
    EVENTMASK synchronous_event_mask
    XimProtoIntrRec* intrproto
    XIMResourceList im_inner_resources
    unsigned int im_num_inner_resources
    XIMResourceList ic_inner_resources
    unsigned int ic_num_inner_resources
    char* hold_data
    int hold_data_len
    char* locale_name
    CARD16 protocol_major_version
    CARD16 protocol_minor_version
    XrmQuark* saved_imvalues
    int num_saved_imvalues
    XlcConv ctom_conv
    XlcConv ctow_conv
    XimTransConnectProc connect
    XimTransShutdownProc shutdown
    XimTransWriteProc write
    XimTransReadProc read
    XimTransFlushProc flush
    XimTransRegDispatcher register_dispatcher
    XimTransCallDispatcher call_dispatcher
    XPointer spec

Input Method Data Structures
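
The records above follow a common pattern: each object begins with a pointer to a table of function pointers, and every public operation is dispatched through that table. The following is a minimal sketch of the pattern only. All names beginning with Demo or demo_ are hypothetical and are not Xlib identifiers, and the two operations shown are simplified stand-ins for the real method set.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoIMRec *DemoIM;  /* hypothetical stand-in for XIM */

/* Method table: one function pointer per operation, as in XIMMethodsRec. */
typedef struct {
    int (*close)(DemoIM im);
    const char *(*get_name)(DemoIM im);
} DemoIMMethodsRec, *DemoIMMethods;

/* Object record: a pointer to the method table, followed by instance data. */
struct DemoIMRec {
    DemoIMMethods methods;
    const char *im_name;
    int is_open;
};

/* One concrete implementation of the two operations. */
static int local_close(DemoIM im)
{
    im->is_open = 0;
    return 1;
}

static const char *local_get_name(DemoIM im)
{
    return im->im_name;
}

/* A single table shared by every object of this (hypothetical) kind. */
static DemoIMMethodsRec local_methods = { local_close, local_get_name };

/* Initialise an object so that its calls dispatch to the local methods. */
void demo_open_local(DemoIM im, const char *name)
{
    assert(im != NULL);
    im->methods = &local_methods;
    im->im_name = name;
    im->is_open = 1;
}

/* Public entry points go through the table, never to a fixed function. */
int demo_close(DemoIM im)
{
    assert(im != NULL && im->methods != NULL);
    return im->methods->close(im);
}

const char *demo_name(DemoIM im)
{
    assert(im != NULL && im->methods != NULL);
    return im->methods->get_name(im);
}
```

Because callers only ever use the entry points, a different implementation (for example one that talks to an input server over a protocol) can be substituted by installing a different table, without changing any calling code.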

XICRec, *XIC
    XICMethods methods
    XICCoreRec core

XICCoreRec, *XICCore
    XIM im
    XIC next
    Window client_window
    XIMStyle input_style
    Window focus_window
    unsigned long filter_events
    XIMCallback geometry_callback
    char* res_name
    char* res_class
    XIMCallback destroy_callback
    XIMCallback string_conversion_callback
    XIMStringConversionText string_conversion
    XIMResetState reset_state
    XIMHotKeyTriggers* hotkey
    XIMHotKeyState hotkey_state
    ICPreeditAttributes preedit_attr
    ICStatusAttributes status_attr

XICMethodsRec, *XICMethods
    void (*destroy)(XIC)
    void (*set_focus)(XIC)
    void (*unset_focus)(XIC)
    char* (*set_values)(XIC, XIMArg*)
    char* (*get_values)(XIC, XIMArg*)
    char* (*mb_reset)(XIC)
    wchar_t* (*wc_reset)(XIC)
    int (*mb_lookup_string)(XIC, XKeyEvent*, char*, int, KeySym*, Status*)
    int (*wc_lookup_string)(XIC, XKeyEvent*, wchar_t*, int, KeySym*, Status*)

Input Context Data Structures

CharSet Data Structure and FontSet Data Structure

137 XOCMethods Re c

XCloseOMP roc dose

XSetOMValuesProc sel _values

XGetOMValues Proc get _vakles

XCreateOCProc cre ale_oc

XOMRec IIJr XOCGenericRec ·xoM • XOCGene ric X MMeth ods methods XOM CoreRe c co re XO MGeneric Part XLCd led en Display' display

XrmDataba se rdb

char' res_name

char' res_class

XOC oc_list

XlcResourcellst resources

lnt num_resource

XOIICharStlllsl ,.quked_c:flarsel X MOrientation OMDa taRe c ~======-­ rlentati on Hsi (arra y) XOCCoreRec OMDalaRe c

·xoccore OMDalaRec

OMDalaRec

OMDataRec

Output Method Data Structures

XOCMethodsRec
    XDestroyOCProc destroy
    XSetOCValuesProc set_values
    XGetOCValuesProc get_values
    XmbTextEscapementProc mb_escapement
    XmbTextExtentsProc mb_extents
    XmbTextPerCharExtentsProc mb_extents_per_char
    XmbDrawStringProc mb_draw_string
    XmbDrawImageStringProc mb_draw_image_string
    XwcTextEscapementProc wc_escapement
    XwcTextExtentsProc wc_extents
    XwcTextPerCharExtentsProc wc_extents_per_char
    XwcDrawStringProc wc_draw_string
    XwcDrawImageStringProc wc_draw_image_string

XOCRec, *XOC
XOCGenericRec, *XOCGeneric
    XOCMethods methods
    XOCCoreRec core

XOCCoreRec, *XOCCore
    char* res_class

Output Context (FontSet) Data Structures

XLCdPublicMethodsPart

XLCdPublicMethodsRec, *XLCdPublicMethods
    XLCdMethodsRec
    XLCdPublicMethodsPart

XLCdMethodsRec, *XLCdMethods

XLCdRec, *XLCd
    XLCdMethods methods
    XLCdCore core
    XPointer opaque

Core and public fields
    XLCdGenericPart gen
    char* language
    char* territory
    char* codeset
    char* encoding_name
    int mb_cur_max
    Bool is_state_depend
    char* default_string
    XPointer xlocale_db

Locale Data Structures
