Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference

USENIX Association
Boston, Massachusetts, USA, June 25–30, 2001
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649, FAX: 1 510 548 5738, Email: [email protected], WWW: http://www.usenix.org

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Citrus project: true multilingual support for BSD operating systems

Jun-ichiro itojun Hagino <[email protected]>
Research Laboratory, Internet Initiative Japan Inc.
http://citrus.bsdclub.org/index.html

Abstract

The Citrus project aims to implement a complete multilingual programming environment for BSD-based operating systems. The goals include:

• an ISO C/SUS V2-compatible multilingual programming environment (locale support),
• a multi-script framework, which decouples the C API from the actual external/internal encoding,
• gettext and POSIX NLS catalogs.

All of our source code is, and will be, distributed under a BSD-like license.

This paper concentrates on the multi-script framework, which is unique and central to our approach. Most other free software implementations either support only Unicode in their multilingual library, or convert the external representation into a Unicode internal representation and lose significant information about the external representation. We believe a Unicode-only approach is not future-proof, and is not the right way to handle multilingual text. We support multiple different encodings in an ISO C/SUS V2-compatible library, and have made our library code (as well as user programs) future-proof.

1. Motivation, and multi-script framework

Compared to vendor UN*X operating systems and GNU libc, BSD-based operating systems have been a bit behind in multilingual programming support. This may be because of lack of manpower, difference in interest, or other reasons. Because of the lack of multilingual support, we are seeing increasing pain in porting applications from/to BSD-based operating systems. For example, GNOME, GTK+ and other window manager software rely heavily upon the presence of multilingual libraries. We definitely need a multilingual support library for BSD-based operating systems that is based on the ISO C/SUS V2 standards.

The ISO C/SUS V2 standard uses two different encodings for multilingual support: an internal and an external encoding. We use the term "internal encoding" to mean a multilingual encoding system used inside C programs (for example, inside variables for manipulation), normally declared as an array of wchar_t. In other papers the term "process code" is also used. "External encoding" means a multilingual encoding system used outside of programs (for example, in files or the screen output stream). It is also called "file code". The ISO C/SUS V2 libraries supply conversion functions between the external and internal encoding. Here we call these functions the "encoding engine".

Many existing ISO C/SUS V2 libraries, especially freely available ones like GNU libc, assume that the internal encoding is Unicode-based. More specifically, they assume UCS4 as the internal encoding. They also advocate UTF-8 [Yergeau, 1998] as the external encoding. The encoding engine is hardcoded for the internal UCS4 encoding. External encodings other than UTF-8 are supported by conversion functions in the encoding engine, which converts external encodings to UCS4, and vice versa. To implement correct multilingual support, we believe it is very important not to hardcode any encoding, including Unicode. The Citrus project library does not hardcode any existing encoding. In the next section we discuss more fully why Unicode is not enough.

In order to make fewer assumptions about multilingual encodings, we have designed our libraries to support multiple external encodings, and multiple encoding engines and internal encodings. In other words, each internal encoding has its own encoding engine. We call this concept a "multi-script framework." To support multiple encodings, we simply need to supply the appropriate encoding engines. We also use dynamic loading to load encoding engines, so that we can add more engines on the fly.

[Figure: user programs (inside programs) call ISO C/SUS V2 functions; the encoding engine converts between files/char devices (external encoding) and the wchar_t stream (internal encoding).]

2. Why Unicode is not enough

You may still be wondering why we need to support a multi-script framework instead of hardcoded Unicode support. Below we discuss why Unicode is not sufficient, and why hardcoding of encodings is not preferable.

We have specifically chosen NOT to hardcode Unicode in our library suite. Since we started using computers for text processing, we have experienced many transitions in character set encodings. Here we list our history in chronological order:

• First, we had a total chaos of vendor-defined character set encodings, with 6-bit, 7-bit or 8-bit encodings.
• ASCII-only 7-bit encodings (and EBCDIC in some places).
• Hardcoded 8-bit encodings. For example, in many of the European countries, ASCII + ISO-8859-1 (so-called Latin-1) has been used. In Japan, JIS X0201, which gives us an ASCII variant and Katakana, was widely used.
• Stateful ISO-2022 character encodings, with explicit character set designations.

ISO-2022 character encodings allow us to encode text in multiple languages into a single plaintext stream, and are a good foundation for truly internationalized plaintext handling. For example, by using the X11 ctext encoding (which is a subset of the ISO-2022 encoding), we can mix Korean, Chinese, Taiwanese, and Japanese text in a single plaintext stream.

Separately from the above direction, there were a number of regional encodings (encodings that support only a single language within a single plaintext stream), including EUC (like euc-kr), MS-Kanji (Shift JIS), and Unicode. We list Unicode under "regional encodings" because we cannot mix Chinese/Taiwanese/Japanese text in a plaintext; we need to annotate the plaintext with font designations to do so.

Unicode cannot handle characters from multiple Asian regions in a plaintext at the same time, because it maps multiple characters from different Asian regions onto the same Unicode codepoint. This is called "han unification". It was introduced to reduce the number of codepoints used in Unicode (to fit characters into the UCS2 16-bit region). While some of the unified characters are indeed the same across Asian regions, others have totally different glyphs and meanings in different Asian regions [KUBOTA,]. Suppose that we had assigned the same codepoint to O and to O with umlaut. Some people would not notice the difference; however, this would become a significant problem for people who use O with umlaut in their language (German, for example). Also note that, if we convert Asian multilingual text into Unicode and then convert it back to another multilingual encoding, the conversion will not be able to preserve the information contained in the original multilingual text, due to the han unification.

NOTE: there are proposals to perform language tagging [Whistler, 1999] in Unicode; however, language tagging jeopardizes one of the very important aspects of Unicode, the uniform 32-bit wide-char representation, and we do not consider it to be useful.

Every time we change from one encoding system to another, the transition is very painful. In fact, there still are applications that are not 8-bit clean. Even for Unicode, there are implementations that assume 16-bit UCS2, and they need to migrate to 32-bit UCS4. From our experience, it makes no sense any more to pick a single encoding to rely on. We do not advocate the use of ISO-2022, either. We believe that no encoding should be hardcoded, since to do so means that we will have to bear painful transitions over and over again.

Here is another reason for us to avoid hardcoded encodings, and to avoid hardcoded Unicode support. There has been a widely deployed user base which uses non-Unicode multilingual text, including big5, euc-kr, KOI8-R, MS-Kanji and

3. Gory details

Under our implementation based on the multi-script framework, we can switch the external encoding, internal encoding and encoding engine as we wish, and we support multiple encodings/engines.

The external encoding is normally a stateful or stateless multibyte encoding, like ISO-2022-JP [Murai, 1993] or UTF-8. An octet stream is used for multilingual representation inside files. Many of the existing external encodings are variable-length; one letter is represented as one octet, two octets, or more octets. When we read an external encoding representation into a C program, we normally use an array of char to hold it. The internal encoding is a stateless, fixed-bitwidth encoding. It is defined internally by the library, depending on the encoding engine currently in use. We use a type called "wchar_t" to hold it. At this moment wchar_t is a 32-bit integer (int32_t).

When the external encoding is 8-bit (like Latin-1), we can just typecast the external encoding (char) into the internal encoding (wchar_t). When the external encoding is UTF-8, the natural choice for the internal encoding is UCS4. RFC2279 [Yergeau, 1998] defines a standard conversion between them. When the external encoding is ISO-2022, we use a compressed representation of the ISO-2022 stream as the internal encoding.

With setlocale(3), we pick a pair of internal and external encodings, and an encoding engine. A programmer can convert between the internal encoding and the external encoding using ISO C/SUS V2-compatible library calls, like mbstowcs(3).
