<<

Unicode Support in the Solaris™ 7 Operating Environment

Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303 1 (800) 786.7638 =1.512.434.1511 Copyright 1998 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, California 94303 -1100 .S.A. All rights reserved.

This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.

Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd.

Sun, Sun Microsystems, the Sun logo, and Solaris are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. Information subject change without notice.

The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and otherwise comply with Sun’s written license agreements.

RESTRICTED RIGHTS: Use, duplication, or disclosure by the U.S. Government is subject to restrictions of FAR 52.227-14(g)(2)(6/87) and FAR 52.227-19(6/87), or DFAR 252.227-7015(b)(6/95) and DFAR 227.7202-3(a).

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON- INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

Copyright 1998 Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, Californie 94303-4900 Etats-Unis. Tous droits réservés.

Ce produit ou document est protégé par un copyright et distribué avec des licences qui en restreignent l’utilisation, la copie, la distribution, et la décompilation. Aucune partie de ce produit ou document peut être reproduite sous aucune forme, par quelque moyen que ce soit, sans l’autorisation préalable et écrite de Sun et de ses bailleurs de licence, s’il y en a. Le logiciel détenu par des tiers, et qui comprend la technologie relative aux polices de caractères, est protégé par un copyright et licencié par des fournisseurs de Sun.

Des parties de ce produit pourront être dérivées des systèmes Berkeley BSD licenciés par l’Université de Californie. UNIX est une marque déposée aux Etats-Unis et dans d’autres pays et licenciée exclusivement par X/Open Company, Ltd.

Sun, Sun Microsystems, le logo Sun, et Solaris sont des marques de fabrique ou des marques déposées, ou marques de service, de Sun Microsystems, Inc. aux Etats-Unis et dans d’autres pays. Toutes les marques SPARC sont utilisées sous licence et sont des marques de fabrique ou des marques déposées de SPARC International, Inc. aux Etats-Unis et dans d’autres pays. Les produits portant les marques SPARC sont basés sur une architecture développée par Sun Microsystems, Inc.

L’interface d’utilisation graphique OPEN LOOK et Sun™ a été développée par Sun Microsystems, Inc. pour ses utilisateurs et licenciés. Sun reconnaît les efforts de pionniers de Xerox pour la recherche et le développement du concept des interfaces d’utilisation visuelle ou graphique pour l’industrie de l’informatique. Sun détient une licence non exclusive de Xerox sur l’interface d’utilisation graphique Xerox, cette licence couvrant également les licenciés de Sun qui mettent en place l’interface d’utilisation graphique OPEN LOOK et qui en outre conforment aux licences écrites de Sun.

CETTE PUBLICATION EST FOURNIE “EN L’ETAT" ET AUCUNE GARANTIE, EXPRESSE OU IMPLICITE, ’EST ACCORDEE, Y COMPRIS DES GARANTIES CONCERNANT LA VALEUR MARCHANDE, L’APTITUDE DE LA PUBLICATION A REPONDRE A UNE UTILISATION PARTICULIERE, OU LE FAIT QU’ELLE NE SOIT PAS CONTREFAISANTE DE PRODUIT DE TIERS. CE DENI DE GARANTIE NE S’APPLIQUERAIT PAS, DANS LA MESURE OU IL SERAIT TENU JURIDIQUEMENT NUL ET NON AVENU.

Please Recycle Contents

Preface ...... 1 and Multilingual Computing ...... 2 Multilingual Computing ...... 3 Multilanguage Environment ...... 3 Multiscript Environment ...... 4 Multilingual Environment ...... 4 Software Internationalization ...... 5 Internationalization Framework ...... 6 Supporting the Unicode Standard ...... 7 Benefits of Unicode ...... 7 The Unicode Standard ...... 8 About Unicode ...... 8 Coded Representations of Unicode ...... 9 Unicode in the Solaris Operating Environment ...... 12 Unicode UTF-8 en_US.UTF-8 Locale ...... 12 Codeset Conversion ...... 14 European Unicode Locales ...... 15 Asian Unicode Locales ...... 16 Font Resources under Unicode ...... 17 Unicode Technical Considerations ...... 18 Internationalized Applications with Unicode ...... 18 Unicode Application Interfaces ...... 19 Font Resources ...... 19 Setting Resource Definitions ...... 20 Additional References ...... 22 Appendix A: Codeset Conversions ...... 23 Preface

This paper is intended for software developers interested in support for the Unicode standard in the Solaris™ 7 operating environment. It discusses the following topics: ■ An overview of multilingual computing, and how Unicode and the internationalization framework in the Solaris 7 operating environment work together to achieve this aim ■ The Unicode standard and support for it within the Solaris operating environment ■ Unicode in the Solaris 7 Operating Environment ■ How developers can add Unicode support to their applications ■ Codeset conversions (described in Appendix A)

1 CHAPTER 1

Unicode and Multilingual Computing

It is not a new idea that today’s global economy demands global computing solutions. Instant communications and the free flow of information across continents — and across computer platforms — characterize the way the world has been doing business for some time. The widespread use of the Internet and the arrival of electronic commerce (-commerce) together offer companies and individuals a new set of horizons to explore and master. In the global audience, there are always people and businesses at work — 24 hours of the day, 7 days a week. global computing can only grow.

What is new is the increasing demand of users for a computing environment that is in harmony with their own cultural and linguistic requirements. Users want applications and file formats that they can share with colleagues and customers an ocean away, application interfaces in their own language, and time and date displays that they understand at a glance. Essentially, users want to write and speak at the keyboard in the same way that they always write and speak.

Sun Microsystems addresses these needs at various levels, bringing together the components that make possible a truly multilingual computing environment. It begins with the internationalization framework in the Solaris operating environment. Developers have different ways to internationalize their applications to meet the requirements of specific cultural regions. This framework continues by incorporating the Unicode encoding standard, a standard that provides users and developers with a universal codeset. Unicode is well-suited to applications such as multilingual databases, electronic commerce, and government research and reference. The Solaris 7 operating environment supports multilingual computing with multiple character sets and multiple cultural attributes.

2 Multilingual Computing

The concept “multilingual” in practice takes different forms. It is important to distinguish among the following types of environments: ■ Multilanguage ■ Multiscript ■ Multilingual

The movement from multilanguage to multiscript to multilingual implies an increasing level of complexity in the underlying operating environment.

Multilanguage Environment

A multilanguage environment means that a locale supports one writing system, or script, and one set of cultural attributes.

In this environment, an application inherits all the language and cultural attributes of the current locale, with text manipulated according to the language rules of this current locale. Because the locale is limited to supporting one writing script and one set of cultural attributes, the application is also limited to creating documents containing text in one script.

Thus, in a multilanguage environment, the user must launch a separate instance of an application in different locales for the application to take advantage of differing language and cultural attributes.

For example, if a user is using the English-based operating environment and wishes to create a document containing , the user must first set up the Chinese locale and then launch the application to begin creating Chinese content. If the user then wishes to enter Russian text, the Russian locale must be set up and another instance of the application launched — this time within the Russian locale — for the user to generate Russian text. In this environment, the Chinese and Russian text cannot be mixed and the user must alternate between locales to create text of different scripts.

Multilingual Computing 3 Multiscript Environment

A multiscript environment means that a locale may support more than one script, but the locale is still associated with only one set of cultural attributes.

In this environment, an application can create a document with text in multiple scripts. However, the application must tag or otherwise mark each separate run of text of the same script (script run) to apply the appropriate language attributes for proper input and display.

Note – In a Unicode-enabled locale, tagging script runs is not necessary because all language attributes are inherent in the Unicode codeset.

The multiscript environment supports text written in multiple scripts but is still limited to one set of cultural attributes. This means, for example, that text is sorted according to the sorting rules of the current locale.

To expand on the Chinese example from the multilanguage section, in a multiscript environment, rather than create two separate documents, the user can now create one multiscript document containing both Chinese and Russian text. The cultural attributes of the active locale still apply. Therefore, if the user is in the Chinese locale, the Chinese sorting rules (for example) will be applied to the mixed script text.

Multilingual Environment

A multilingual environment means that a locale can support multiple scripts and multiple cultural attributes. In this environment, an application can have the ability to transparently make use of both the language and cultural attributes of the locale within a single locale. In this case, an application can create a document in multiple scripts, and because the application has access to multiple cultural attributes, it has greater control over how text is manipulated. For example, a document containing text in multiple scripts can sort text according to its script rather than the sort order of the current locale. In a Unicode-enabled locale, the application can eliminate the step of tagging script runs.

To Extend the Chinese multiscript document example further to a multilingual environment, you are no longer constrained to applying only sorting rules of the active Chinese locale. In a multilingual environment, you can apply the sorting rules of the Chinese locale to the Chinese portion of our multiscript text, and then call upon the Russian sorting rules to apply to the Russian portion of our multiscript text.

4 Unicode Support in the Solaris 7 Operating Environment The multilingual environment brings you closest to the ideal of multilingual computing. An application can make use of locale data from any number of locales, while at the same time allowing the user to easily manipulate text in a variety of scripts. Every user can communicate and work in his or her own language and still understand and be understood by other users anywhere in the world.

Software Internationalization

Sun Microsystems defines the following levels at which an application can support a customer’s international needs: ■ Internationalization ■ Localization

Software internationalization is the process of designing and implementing software to transparently manage different cultural and linguistic conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation.

Software localization is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements.

The Solaris operating environment is an example of a product that supports both internationalization and localization. The Solaris operating environment is a single internationalized binary that is localized into various languages such as French, Japanese, and Chinese, and supports the associated cultural and language conventions of each language.

When properly designed, applications can easily accommodate a localized interface without extensive modification. One suggestion for creating easy-to-localize software is to first internationalize the software and then encapsulate the language- and cultural-specific elements in a locale-specific database or file. This greatly simplifies the localization process, should a developer choose to localize in the future.

At a minimum, Sun Microsystems strongly encourages developers to internationalize their software. In this way, their applications can run on any localized version of the Solaris operating environment. As a result, such an application can easily manage the user's language and cultural preferences.

Software Internationalization 5 Internationalization Framework

A major aspect of developing a properly internationalized application is to separate language and culturally-specific information from the rest of the application code. The internationalization framework in the Solaris operating environment uses the following concepts to fulfill this aim: ■ Locales ■ Localizable interface ■ Codeset independence

A locale is a set of language and cultural data that is dynamically loaded into memory at runtime. Users can set the cultural aspects of their local work environment by setting specific variables in a locale. These settings are then applied to the operating system and to subsequent application launches.

The Solaris operating environment includes APIs for developers to access directly the cultural data of the current locale. For example, an application will not need to encode the currency for a particular region. By calling the appropriate system API, the locale returns the symbol associated with the current currency symbol the user has specified. Applications can run in any locale without having special knowledge of the cultural or language information associated with the locale.

Creating a localizable interface means accounting for the variations that take place when an interface is translated into another language. In addition, the Solaris operating environment provides messaging APIs and utilities that can be used to collect, generate, and use messages in applications.

Codeset independence means designing applications that do not make assumptions about the underlying codeset. For example, text-handling routines should not define in advance the size of the character codeset being manipulated.

For more information about the internationalization framework, see the Sun Microsystems white paper, “Asian Language Support in the Solaris 7 Operating Environment”. For more about designing applications to make use of Unicode, see Chapter 3, “Unicode Technical Considerations.”

6 Unicode Support in the Solaris 7 Operating Environment Supporting the Unicode Standard

Unicode, or Universal Codeset, is a universal scheme developed and promoted by the , a non-profit organization that includes Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters used on computers today.

Using this one codeset enables applications to support text from multiple scripts in the same documents without elaborate marking of text runs. At the same time, applications need to treat Unicode as just another codeset — that is, apply the principle of codeset independence to Unicode as well.

Unicode locales in the Solaris operating environment are called the same way, and function the same way, as all other locales. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.

Benefits of Unicode

Support for the Unicode standard provides many benefits to application developers. These benefits include: ■ Global source and binary ■ Support for mixed-script computing environments ■ Reduced time-to-market for localized products ■ Expanded market access ■ Improved cross-platform data interoperability through a common codeset ■ Space-efficient encoding scheme for data storage

Unicode is a building block that designers and engineers can use to create truly global applications. By making use of one flat codeset, end-users can exchange data more freely without relying on elaborate code conversions to make characters comprehensible.

In adopting the internationalization framework in the Solaris operating environment, Unicode can be thought of as “just another codeset”. By following the concepts of codeset-independent design, applications will be able to handle different codesets without the need for extensive code rework to support specific languages.

Supporting the Unicode Standard 7 CHAPTER 2

The Unicode Standard

About Unicode

On most computer systems supporting writing systems such as Roman or Cyrillic, user input (usually via keypresses) is converted into character codes that are stored in memory. These stored character codes are then converted into glyphs of a particular font before being passed to the application for display and printing.

Each locale has one or more codesets associated with it. A codeset includes the coded representation of characters used in a particular language. Codesets may span one-byte (for alphabetic languages) or two or more bytes for ideographic languages. Each codeset assigns its own code-point values to each character, without any inherent relationship to other codesets. That is, a code-point value that represents the letter ‘a’ in a Roman character set will represent another character entirely in the Cyrillic or Arabic system, and may not represent anything in an ideographic system.

In Unicode, every character, symbol, and ideograph has its own unique character code. As a result, there is no overlap or confusion between the code-point values of different codesets. There is, in fact, no need to define multiple codesets because each character code used in each writing system finds its place in the Unicode scheme. Unicode includes not only characters of the world’s languages, but also includes characters such as publishing characters, mathematical and technical symbols, and punctuation characters.

Unicode version 2.1 contains alphabetic characters from languages including Anglo-Saxon, Russian, Arabic, Greek, Hebrew, Thai, and Sanskrit. It also contains ideographic characters in the unified Han subset defined by national and industry standards for , Japan, Korea, and .

8 Coded Representations of Unicode

In recent years, the Unicode Consortium and other related organizations have developed different formats for representing and storing a Unicode codeset. The ISO/IEC International Standard 10646-1 (commonly referred to as 10646) defines the Universal Multiple-Octet Coded Character Set (UCS) format for representing characters from all the significant world languages in multibyte format. The 10646 specification contains the following basic forms for representing characters: ■ Universal Coded Character Set-2 (UCS-2) – Also known as Basic Multilingual Plane (BMP). Characters encoded in two bytes on a single plane. ■ Universal Coded Character Set-4 (UCS-4) – Characters encoded in four bytes on multiple planes and multiple groups. ■ UCS Transformation Format 16-bit form (UTF-16) – Extended variant of UCS-2 with characters encoded in 2-4 bytes. ■ UCS Transformation Format 8-bit form (UTF-8) – A transformation format using characters encoded in 1-6 bytes.

UCS-2 defines a 64K coding space, or BMP, for representing character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane.

UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet the group number (00-7F) to which the plane belongs. In total, there are 256 planes occurring 127 times.

Figure 1 illustrates the UCS-2 and UCS-4 coding schemes.

UCS-4 Group octet Plane octet Row octet Cell octet

UCS-2 Row octet Cell octet

FIGURE 1 UCS-2 and UCS-4 coding schemes

Coded Representations of Unicode 9 In addition to the UCS forms of 10646, the Unicode standard also defines another form called UTF, or UCS Transformation Format. One version of UTF is an extended UCS-2 encoding form designed to include characters from outside the BMP 64K coding space. This form was first called UCS-2E (extended UCS-2), but is now known as UTF-16 (UCS Transformation Format 16-bit form).

The UTF-16 form translates a range of UCS-4 codes into a two-octet encoded string. It does this by reserving an area of codes in the BMP coding space for mapping to and from 16 planes of group 00 of UCS-4. Each plane is assigned a certain set of code positions in the two-octet UCS-2 scheme. Specifically, Planes 01 to 0E (14 planes, or 14 x 65,536 = 917,504 characters) are reserved for standard encodings and Planes 0F and 10 (2 planes, or 2 x 65,536 = 131,072 characters) are reserved for private use.

Although UCS-4 and UTF-16 provide comprehensive ways of representing several character sets, they do not preserve the byte values for ASCII characters. All UNIX® systems, because they are based on an ASCII kernel, reserve certain character codes for / operations, such as the null character as a string terminator, the slash (/) character as a path name separator, and the DEL and SPACE control characters. To circumvent this problem, another version of UTF was devised, called FSS-UTF (File System Safe-UTF), or now commonly known as UTF-8.

UTF-8 is an encoding scheme that maps the entire UCS-4 character set to a series of single-octet and multi-octet strings. In this scheme, the most significant bit is 0 for ASCII characters, and 1 for all other characters. The ASCII character range is contained in a single-byte encoding, and all other characters in a range from 2 up to 6-byte encoding. Table 1 describes the UTF-8 encoding scheme.

Bits Hex Min Hex Max UTF-8 Binary Encoding

7 00000000 0000007F 0xxxxxxx

11 00000080 000007FF 110xxxxx 10xxxxxx

16 00000800 0000FFFF 1110xxxx 10xxxxxx 10xxxxxx

21 00010000 001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

26 00200000 03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

31 04000000 7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Note – The UTF-8 scheme does not use any ASCII byte values in its 2- to 6-byte sequences, yet ASCII values remain 8-bit within the new byte structure. Thus, UTF-8 is compatible with all legacy file systems and other systems that parse for the ASCII byte, while UCS-2/UTF-16 and UCS-4 are not compatible with ASCII.

10 Unicode Support in the Solaris 7 Operating Environment This is important because it allows applications that support Unicode to use existing data in ASCII format without applying a conversion utility. In addition, there is support within the Internet community for adopting UTF-8 as the encoding standard for the Internet.

In addition to its backward compatibility with 7-bit ASCII, UTF-8 is a space-efficient encoding scheme when the encoded data needs only one-byte or less (as for English and other Roman character-based writing systems). Because UTF-8 stores one-byte data as one byte rather than, for example, the two bytes required by UTF-16, this can significantly decrease the storage space required to hold large blocks of international data.

Because of its flexibility, and compatibility with ASCII and the UNIX operating environment, the Unicode support on the UTF-8 format is used in the Solaris operating environment. UTF-8 provides developers with a format that is compatible with existing internationalized environments and provides an easy path for Internet and legacy data interoperability. As a file system safe format, UTF-8 supports one- byte unit I/O operations and can represent the Unicode formats UCS-2 and UCS-4. And UTF-8 fits well within the XPG internationalization framework.

Coded Representations of Unicode 11 CHAPTER 3

Unicode in the Solaris Operating Environment

The Solaris operating environment provides support for Unicode through the use of several Unicode locales. The Unicode locales provide a framework for developing applications with multiple-script support for any country or region of the world. Solaris also benefits from the advantages that Unicode brings to creating global applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available for use with the Unicode locales without modification.

All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and the regional data related to this base language and its cultural conventions. This includes local formatting rules, text messages, help messages, and other related files. Each locale also supports several other scripts for input, display, code conversion, and printing.

Unicode UTF-8 en_US.UTF-8 Locale

en_US.UTF-8 is the flagship Unicode locale in the Solaris 7 operating environment. The en_US.UTF-8 locale is an American English-based locale with multiscript processing support for characters of many different languages. Features of all Unicode locales include enhanced input modes and input mode switching, support for MIME character sets in DtMail, expanded iconv code conversions, and a PostScript print filter.

12 The en_US.UTF-8 locale supports multiscript processing, with eight input modes: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, and for the Asian languages, the Unicode hexadecimal code input method and the Table lookup input method. The end user can input characters from any combination of these scripts and from the entire Unicode coding space.

End users switch between input modes using the Compose key or a Control key sequence. The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode hexadecimal code input method lets the user generate Unicode characters by typing in the hexadecimal equivalents. The Table lookup input method is the easiest method for non-native speakers to input characters of a foreign language. This method provides a lookup window on the desktop for choosing a script and then selecting characters of the script from the available look- up table.

The UTF-8 locales provide a print utility for printing text files. This utility can print flat text files written in UTF-8 using bitmap fonts available on the system. Because the output from the utility is standard PostScript, users can send the output to any PostScript printer.

The en_US.UTF-8 locale supports various MIME character sets in DtMail, including Latin, Greek, Cyrillic, Complex Text Layout, and Asian character sets. With this support, users can send and receive e-mail messages encoded in MIME character sets from almost any region of the world (Figure 2). DtMail decodes received e-mail messages by recognizing the MIME character set in use and the content transfer encoding provided in the message. The user sending a message specifies the MIME character set to use for the recipient mail user agent.

Unicode UTF-8 en_US.UTF-8 Locale 13 FIGURE 1 Using DtMail with multiple character sets

Codeset Conversion

The en_US.UTF-8 locale supports code conversions among the major codesets of several countries. Figure 3 illustrates the available codeset conversions between UTF-8 and other codesets under the Solaris operating environment.

FIGURE 2 UTF-8 Codeset Conversions

14 Unicode Support in the Solaris 7 Operating Environment Users can perform codeset conversions via a new graphical Sdtconvtool utility or via the iconv(1) command. The Sdtconvtool (Figure 4) detects all possible iconv code conversions available on the user’s system and presents them in an easy-to-use format.

FIGURE 3 Graphical Sdtconvtool for converting between codesets

Developers can use the iconv(3) function to access the same functionality. This includes conversions to and from UTF-8 and many ISO-standard codesets: UCS-2, UTF-16, UCS-4, UTF-7, KO18-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), , Johap, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, and so on.

See Appendix A on page 24 for a detailed listing of the supported code conversions.

European Unicode Locales

The Solaris 7 operating environment provides five European Unicode locales that offer the same level of support as en_US.UTF-8, with modifications to regional definitions in keeping with the requirements of each language and cultural area.

European Unicode Locales 15 The five other UTF-8 locales are fr.UTF-8 (French), de.UTF-8 (German), it.UTF-8 (Italian), es.UTF-8 (Spanish) and sv.UTF-8 (Swedish). Each of these locales contains the same feature set as en_US.UTF-8, with regional definitions for national currency, date and time, numerical notation, and translated text messages. These locales also support the new euro monetary symbol and related conventions.

FIGURE 4 Euro symbol glyph

Asian Unicode Locales

The Solaris 7 operating environment for the Asian language environment includes two UTF-8 locales: ja_JP.UTF-8 for the Japanese Solaris 7 operating environment and .UTF-8 for the Korean Solaris 7 operating environment. These locales provide the same scope of support as en_US.UTF-8 and the European Unicode locales.

The ja_JP.UTF-8 locale for in the Solaris operating environment contains all regional definitions and the same input methods as the other Japanese locales. The Japanese language version of the Solaris 7 operating environment provides enhanced code conversions for UTF-8, new Japanese UCS-2/ UTF-8 PostScript fonts, and support for Unicode codesets in the User-Defined Character editor.

The ko_KR-UTF-8. locale for Korean language in the Solaris operating environment contains all regional definitions and the same input methods as other Korean locales. It also contains expanded code conversion support and a Hanja Tool for dictionary editing.

16 Unicode Support in the Solaris 7 Operating Environment Font Resources Under Unicode

Unicode version 2.1 supports a total of 38,887 alphabetic characters from various regions of the world, with over 20,000 of them ideographic characters for the Chinese, Japanese, and Korean languages. The font resources for representing these characters, however, do not always operate in a one-to-one relationship. That is, there are some Unicode code points that have different, multiple glyphs associated with it. This is to ensure that these specific code points can be rendered correctly based upon the various contexts in which they may appear. This is especially the case with the Asian languages, where, for example, the Unified han (Chinese/ Japanese/Korean ideographs) glyphs are written and displayed differently among Simplified Chinese, Traditional Chinese, Japanese , and Korean hanja ideographs.

To manage these difficulties, the Solaris operating environment contains an output method that combines existing fonts to form a Unicode font set, instead of providing a single Unicode font. The Solaris 7 operating environment supports the following range of scripts: ■ Western/Eastern/Northern European scripts ■ Greek, Turkish, Cyrillic ■ Simplified Chinese, Traditional Chinese, Japanese, Korean ■ Arabic, Hebrew, Thai

For European scripts, there is one-to-one direct mapping between Unicode characters and corresponding glyphs. For text from Complex Text Layout languages (Arabic, Hebrew, Thai), the Solaris layout engines pre-process the text (that is, perform right-to-left swapping, contextual analysis, and so on) before rendering the associated glyphs. For Asian characters, the Solaris operating environment output methods provide dynamic remapping of the font and glyph index according to the locale definition. Each locale contains a font table with mapping mechanisms that specify which font and glyph to use for each character code. The mechanism remaps the Unicode code point values to existing Chinese, Japanese, and Korean fonts and glyph index pairs. A locale administrator can define the priority among fonts, that is, which font is searched first. For example, the mechanism may search the Simplified Chinese fonts first, then, if it cannot find the appropriate glyph, search the Traditional Chinese fonts, and so on.

Font Resources Under Unicode 17 CHAPTER 4

Unicode Technical Considerations

Internationalized Applications with Unicode

The use of the Unicode codeset enables developers to write applications that support multiple scripts simultaneously. Depending on the Unicode locale being used, the user will have access to one or several additional scripts, along with the base language script, for input, display, and printing. This way, network environments with distributed applications can provide individual users access to different language environments simultaneously.

By itself, using Unicode in an application does not mean that the application is fully internationalized. If, for example, an application customizes data handling for Unicode directly, then it needs to provide codeset converters as wrappers to support a codeset other than Unicode.

This approach is direct Unicode localization — not internationalization. With direct localization, developers may localize applications in a manner that duplicates or conflicts with the localization provided by the operating system. In addition, an application may assume that all characters are represented in two-octet cells, which conflicts with UTF-8.

To properly internationalize an application, use the following guidelines:

1. Avoid direct access with Unicode. This is a task of the platform’s internationalization framework.

2. Use the POSIX model for multibyte and wide-character interfaces. See the following section, “Unicode Application Interfaces.”

18 3. Only call the APIs that the internationalization framework provides for language and cultural-specific operations. All POSIX, X11, Motif, and CDE interfaces are available to Unicode locales.

4. Remain code-set independent.

Unicode Application Interfaces

When internationalizing applications for Unicode, it is recommended that developers use the POSIX or X Window model. These models define two sets of interfaces — for both multibyte and wide character — without specifying the encoding methods to use.

Standard multibyte codesets contain characters of varying widths, from one to several bytes, to represent a character. Characters are represented in a minimal amount of storage space in order to use the fewest number of bytes possible. Because multibyte codesets contain characters of varying widths, they are not convenient to process by standard functions.

The Unicode codeset provides the necessary format for both multibyte and wide- character representation. In the Solaris Unicode locales, multibyte interfaces use the UTF-8 representation of the character sets and wide-character interfaces use the UCS-4 representation.

Font Resources

Properly internationalized applications require only a few changes to run properly in the Solaris Unicode locales. One change that is required is to set the proper resource definitions for font sets (FontSet) or font list (XmFontList) in the application’s resource file.

The en_US.UTF-8 locale supports the following set of font character sets as the FontSet:

■ ISO 8859-1 (Latin-1)

■ ISO 8859-2 (Latin-2)

■ ISO 8859-4 (Latin-4)

■ ISO 8859-5 (Latin/Cyrillic)

■ ISO 8859-7 (Latin/Greek)

Unicode Application Interfaces 19 ■ ISO 8859-9 (Latin-5)

■ ISO 8859-15 (Latin-9)

■ BIG5 (Traditional Chinese)

■ GB 2312-1980 (Simplified Chinese)

■ JIS X0201-1976, JIS X0208-1983 (Japanese)

■ KS C 5601-1992 Annex 3 (Korean)

■ ISO 8859-6 based one (Arabic)

■ ISO 8859-8 (Hebrew)

■ TIS 620-2533 based one (Thai)

Setting Resource Definitions

To create a font set for an application, the resource definition should contain the complete set of fonts supported by the Unicode locale. For example:

fs = XCreateFontSet(display, "-dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3, -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0", &missing_ptr, &missing_count, &def_string);

20 Unicode Support in the Solaris 7 Operating Environment The XmFontList resource definition of an application should also include all fonts for every character set supported by the locale. For example:

! ! This is an example XmNFontList definition for en_US.UTF-8 locale: *fontList: -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-1; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-2; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-4; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-5; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-6; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-7; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-8; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-9; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-iso8859-15; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-big5-1; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-gb2312.1980-0; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0201.1976-0; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-jisx0208.1983-0; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-ksc5601.1992-3; -dt-interface system-medium-r-normal-s*utf*-*-*-*-*-*-*-*-tis620.2533-0:

Setting Resource Definitions 21 CHAPTER 5

Additional References

Internationalization Books

The Unicode Standard, Version 2.0 ISBN: 0-201-48345-9

Solaris Internationalization Guide for Developers Solaris Documentation

Creating Worldwide Software (2nd Edition) Bill Tuthill and David Smallberg ISBN: 0-13-494493-3 Sun Microsystems Press and Prentice Hall

Programming for the World – A Guide to Internationalization Sandra Martin O'Donnell ISBN: 0-13-722190-8 Prentice Hall

22 Web Site URLs

Unicode Consortium http://www.unicode.org/

Sun Microsystems, Inc. http://www.sun.com/

Sun White Papers http://www.sun.com/software/whitepapers.html

Sun Online Documentation http://docs.sun.com/

Sun Developer Programs http://www.sun.com/developers

Web Site URLs 23 Appendix A: Codeset Conversions

The following table provides a detailed listing of the supported code conversions: From Code To Code Description

646 UTF-8 ISO 646 (US-ASCII) to UTF-8 UTF-8 646 UTF-8 to ISO 646 (US-ASCII) UTF-8 8859-1 UTF-8 to ISO 8859-1 UTF-8 8859-2 UTF-8 to ISO 8859-2 UTF-8 8859-3 UTF-8 to ISO 8859-3 UTF-8 8859-4 UTF-8 to ISO 8859-4 UTF-8 8859-5 UTF-8 to ISO 8859-5 (Cyrillic) UTF-8 8859-6 UTF-8 to ISO 8859-6 (Arabic) UTF-8 8859-7 UTF-8 to ISO 8859-7 (Greek) UTF-8 8859-8 UTF-8 to ISO 8859-8 (Hebrew) UTF-8 8859-9 UTF-8 to ISO 8859-9 UTF-8 8859-10 UTF-8 to ISO 8859-10 UTF-8 8859-15 UTF-8 to ISO 8859-15 8859-1 UTF-8 ISO 8859-1 to UTF-8 8859-2 UTF-8 ISO 8859-2 to UTF-8 8859-3 UTF-8 ISO 8859-3 to UTF-8 8859-4 UTF-8 ISO 8859-4 to UTF-8 8859-5 UTF-8 ISO 8859-5 (Cyrillic) to UTF-8 8859-6 UTF-8 ISO 8859-6 (Arabic) to UTF-8 8859-7 UTF-8 ISO 8859-7 (Greek) to UTF-8 8859-8 UTF-8 ISO 8859-8 (Hebrew) to UTF-8 8859-9 UTF-8 ISO 8859-9 to UTF-8 8859-10 UTF-8 ISO 8859-10 to UTF-8 8859-15 UTF-8 ISO 8859-15 to UTF-8 UTF-8 KOI8-R UTF-8 to KOI8-R (Cyrillic) KOI8-R UTF-8 KOI8-R (Cyrillic) to UTF-8 UTF-8 UCS-2 UTF-8 to UCS-2 UCS-2 UTF-8 UCS-2 to UTF-8 UTF-8 UCS-4 UTF-8 to UCS-4 UCS-4 UTF-8 UCS-4 to UTF-8

24 UTF-8 UTF-7 UTF-8 to UTF-7 UTF-7 UTF-8 UTF-7 to UTF-8 UTF-8 UTF-16 UTF-8 to UTF-16 UTF-16 UTF-8 UTF-16 to UTF-8 UTF-8 eucJP UTF-8 to Japanese EUC (JIS X0201-1976, JIS X0208-1983 and JIS X0212-1990) UTF-8 PCK UTF-8 to Japanese PC Kanji (a.k.a. SJIS) UTF-8 ISO-2022-JP UTF-8 to Japanese MIME charset UTF-8-Java eucJP UTF-8-Java to Japanese EUC (JIS X0201-1976, JIS X0208-1983 and JIS X0212-1990) UTF-8-Java PCK UTF-8-Java to Japanese PC Kanji (a.k.a. SJIS) UTF-8-Java ISO-2022-JP.RFC1468 UTF-8-Java to Japanese MIME charset (one way conversion) eucJP UTF-8 Japanese EUC to UTF-8 PCK UTF-8 Japanese PC Kanji (a.k.a. SJIS) to UTF-8 ISO-2022-JP UTF-8 Japanese MIME charset to UTF-8 UTF-8 ko_KR-euc UTF-8 to Korean EUC (KS C 5636 and KS C 5601-1987) UTF-8 ko_KR-johap UTF-8 to Korean Johap (of KS C 5601-1987) UTF-8 ko_KR-johap92 UTF-8 to Korean Johap (of KS C 5601-1992) UTF-8 ko_KR-iso2022-7 UTF-8 to Korean MIME charset (ISO-2022-KR) UTF-8 ko_KR-cp933 UTF-8 to IBM MBCS CP933 ko_KR-euc UTF-8 Korean EUC to UTF-8 ko_KR-johap UTF-8 Korean Johap (of KS C 5601-1987) to UTF-8 ko_KR-johap92 UTF-8 Korean Johap (of KS C 5601-1992) to UTF-8 ko_KR-iso2022-7 UTF-8 Korean MIME charset (ISO-2022-KR) to UTF-8 ko_KR-cp933 UTF-8 IBM MBCS CP933 to UTF-8 UTF-8 gb2312 UTF-8 to Simplified Chinese EUC (GB 1988-1980 and GB 2312-1980) UTF-8 iso2022 UTF-8 to Simplified Chinese MIME charset (ISO-2022-CN) UTF-8 GBK UTF-8 to Simplified Chinese GBK gb2312 UTF-8 Traditional Chinese EUC to UTF-8 iso2022 UTF-8 Simplified Chinese MIME charset (ISO-2022-CN) to UTF-8 GBK UTF-8 Simplified Chinese GBK to UTF-8 UTF-8 zh_TW-euc UTF-8 to Traditional Chinese EUC (CNS 11643-1992) UTF-8 zh_TW-big5 UTF-8 to Traditional Chinese Big5 UTF-8 zh_TW-iso2022-7 UTF-8 to Traditional Chinese MIME charset (ISO-2022-TW) UTF-8 zh_TW-cp937 UTF-8 to IBM MBCS CP937 zh_TW-euc UTF-8 Traditional Chinese EUC to UTF-8 zh_TW-big5 UTF-8 Traditional Chinese Big5 to UTF-8 zh_TW-iso2022-7 UTF-8 Traditional Chinese MIME charset (ISO-2022-TW) to UTF-8 zh_TW-cp937 UTF-8 IBM MBCS CP937 to UTF-8

25 Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303

1 (800) 786.7638 +1.512.434.1511 http://www.sun.com/software/

October 1998