USENIX Association

Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference

Boston, Massachusetts, USA June 25Ð30, 2001

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association All Rights Reserved For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. Citrus project: true multilingual support for BSD operating systems Jun-ichiro itojun Hagino Research Laboratory, Internet Initiative Japan Inc. http://citrus.bsdclub.org/index.html

Abstract The ISO /SUS V2 standard uses two dif- Citrus project aims to implement a com- ferent encodings for multilingual support: an plete multilingual programming environment for internal and an external encoding. We use the BSD-based operating systems. The goals include: term "internal encoding" to mean a multilingual encoding system used inside C programs (like • ISO C/SUS V2-compatible multilingual pro- inside variables for manipulation), normally gramming environment (locale support), declared as an array of wchar_t. On other papers • multi-script framework, which decouples C the term "process code" is also used. "External API and actual external/internal encoding, encoding" means a multilingual encoding system • gettext and POSIX NLS catalog, used outside of programs (like files or screen out- put stream). It is also called as "file code". The All of our source code is, and will be dis- ISO C/SUS V2 libraries will supply conversion tributed under a BSD-like license. functions between external and internal encoding. The paper concentrates onto the multi- Here we call the functions "encoding engine". script framework, which is unique and central to Many of existing ISO C/SUS V2 libraries, our approach. Most other free software imple- especially freely available ones like GNU libc, mentations support only Unicode in their multi- assume that the internal encoding is Unicode- lingual library, or converts external representation based. More specifically, they assume UCS4 as into Unicode internal representation and loses sig- the internal encoding. They also advocate UTF-8 nificant information on the external representa- [Yergeau, 1998] as the external encoding. The tion. We believe a Unicode-only approach is not encoding engine is hardcoded for internal UCS4 future-proven, and is not the right way to handle encoding. External encodings other than UTF-8 multilingual text. We support multiple different are supported by conversion functions in encod- encodings in ISO C/SUS V2 compatible library, ing engine, which convers external encodings to and made our library code (as well as user pro- UCS4, and vice versa. To implement a correct grams) future-proven. multilingual support, we believe it is very impor- tant to not hardcode any encoding, including Uni- 1. Motivation, and multi-script frame- code. The Citrus project library does not hard- work code any existing encoding. In the next section Compared to vendor UN*X operating sys- we discuss more fully why Unicode is not tems and GNU libc, BSD-based operating sys- enough. tems has been a bit behind in multilingual pro- In order to be able to make less assump- gramming support. This may be because of lack tions about multilingual encodings, we have of manpower, difference in interest, or whatever. designed our libraries to support multiple external Anyway, because of the lack of multilingual sup- encodings, and multiple encoding engines and port, we are seeing increasing pain in porting internal encodings. In other words, each internal applications from/to BSD-based operating sys- encoding has its own encoding engine. We call tems. For example, GNOME, GTK+ and other this concept a "multi-script framework." To sup- window manager software relies heavily upon the port multiple encodings, we simply need to sup- presence of multilingual libraries. We definitely ply the appropriate encoding engines. We also need a multilingual support library for BSD-based use dynamic loading to load encoding engines, so operating systems that is based on ISO C/SUS V2 that we can add more engines on the fly. standards. inside programs

User programs

ISO C/SUS V2 functions

Files/char devices wchar_t stream encoding engine (external encoding) (internal encoding)

2. Why Unicode is not enough plaintext stream, and is a good foundation for tru- You may be still wondering why we need to ely internationalized plaintext handling. For support a multi-script framework, instead of hard- example, by using X11 ctext encoding (which is a coded Unicode support. Below we discuss why subset of ISO-2022 encoding), we can mix Unicode is not sufficient, and why hardcoding of Korean, Chinese, Taiwanese, and Japanese text encodings is not preferable. into a single plaintext stream. We specifically have chosen NOT to hard- Separately from the above direction, there code Unicode in our library suite. Since we were a couple of regional encodings (encodings started using computers for text processing, we that support single language only within a single have experienced many transitions in character set plaintext stream), including EUC (like euc-kr), encodings. Here we list our history in chronolog- MS-Kanji (Shift JIS), or Unicode. Here we list ical order: Unicode under "regional encodings", as we can- not mix Chinese/Taiwanese/Japanese text in a • First, we had a total chaos of vendor-defined plaintext - we need to annotate plaintext with font character set encodings, with 6bit, 7bit or 8bit designation to do it. encodings. Unicode cannot handle multiple Asian • ASCII-only 7bit encodings (and EBCDIC in characters in a plaintext at the same time, due to some places). "han unification". Unicode maps multiple charac- • Hardcoded 8bit encodings. For example, in ters from different Asian regions, into the same many of the european countries, ASCII + Unicode codepoint - this is called "han unifica- ISO-8859-1 (so-called Latin-1) has been used. tion". It was introduced to reduce the amount of In Japan, JIS X0201, which gives us ASCII codepoint used in Unicode (to fit characters into variant and Katakana, was widely used. UCS2 16bit region). While some of the unified • Stateful ISO2022 character encodings, with chararcters are indeed same across Asian regions, explicit character set designations. others have totally different glyphs and meanings in different Asian regions [KUBOTA,]. Suppose ISO-2022 character encodings allow us to that we have assigned the same codepoint for O, encode multiple language text into a single and O with umlaut. Some people cannot notice the difference, however, this will become a signif- 3. Gory details icant problem for people who uses O with umlaut Under our implementation based on multi- in their language (like for German language). script framework, we can switch external encod- Also note that, if we convert Asian multilingual ing, internal encoding and encoding engine as we text into Unicode and then convert it back to other wish, and we support multiple encodings/engines. multilingual encoding, the conversion will not be able to preserve the information contained in the External encoding is normally a stateful, or original multilingual text, due to the han unifica- stateless multibyte encoding, like ISO-2022-JP tion. When we convert the multilingual text into [Murai, 1993] or UTF-8. An octet stream will be Unicode, we will lose the information encoded in used for multilingual representation inside files. the source due to the han unification. Many of the existing external encodings use vari- able-length encodings; one letter will be pre- NOTE: there are proposals to perform lan- sented as a octet, two octets, or more octets. guage tagging [Whistler, 1999] in Unicode, how- When we read in external encoding representation ev er, the language tagging jeopardizes one of the into C program, we normally use array of char to very important aspects of Unicode, uniform 32bit hold it. Internal encoding is a stateless, fixed- wide-char representation, and we do not consider bitwidth encoding. It is defined internally by the it to be useful. library, depending on the current encoding engine Every time we change from one encoding we are using. We use a type called "wchar_t" to system to another, the transition is very painful. hold it. At this moment wchar_t is a 32bit integer In fact, there still are applications that are not (int32_t). 8bit-clean. Even for Unicode, there are imple- When the external encoding is 8bit (like mentations assume 16bit UCS2, and they need to Latin-1), we can just typecast external encoding migrate to 32bit UCS4. From our experience, it (char) into internal encoding (wchar_t). When makes no sense any more to pick a single encod- the external encoding is UTF-8, the natural choice ing to rely on. We do not advocate the use of for internal encoding is UCS4. RFC2279 ISO-2022, either. We believe that no encodings [Yergeau, 1998] defines starndard conversion should be hardcoded, since to do so means that between them. When the external encoding is we will have to bear painful transitions over and ISO-2022, we use a compressed representation of over again. ISO-2022 stream as the internal encoding. Here is another reason for us to avoid hard- With setlocale(3), we pick a pair of internal coded encoding, and avoid hardcoded Unicode and external encoding, and encoding engine. A support. There has been widely deployed user- programmer can convert internal encoding and base which uses non-Unicode multilingual text, external encoding, using ISO C/SUS V2-compati- including big5, euc-kr, KOI8-R, MS-Kanji and ble library calls, like mbstowcs(3). some other encodings. If someone says "all peo- ple just need to transition to Unicode", that is User programs should manipulate wchar_t wrong. Unicode imposes no pain to Latin-1 user- stream only, and should not manipulate text base, while imposing huge pain to Asian and encoded with external encoding. The details of other non-Latin-1 people. And, even if we transi- internal and external encoding are embedded into tion to Unicode, we are unsure if it is going to be the library API. Programs should not, and need enough. So we conclude that we should hardcode not to care about internal nor external encoding at no encodings. all. If a programmer hardcodes some assumption about internal/external encoding to their pro- What we should do is very similar to the grams, the programs will not be future-proven. approach with MIME [Freed, 1996] . We will We will also supply wchar_t-ready curses, regex have a way to identify encoding in a plaintext and other libraries, to keep external encoding out- stream, and have support for multiple different side of user programs. encodings. We will switch encoding engines according to the encoding identification. In C From our strategy, we hav etwo major bene- library case, we identify encoding by locale set- fits. First, our library is future-proven, as long as tings made by setlocale(3). internal encoding fits into the bitwidth of wchar_t (currently 32bit). Next benefit is that we can sim- plify encoding engine as we wish. Suppose we need to support JIS X0201 or Latin 1 as inter- nal/external encodings. In this case, the encoding engine can be a simple memory copy logic. We We still are investigating a better API to do not need to visit a slow encoding engine for abstract the manupulation of multiple encodings simple encodings. in a single program. We would like to make a Our strategy avoids problem with Unicode proposal when we are done. and han unification. Since we can use specific encoding engine that matches the external and 5. Status of implementation internal encodings, we can preserve information The Citrus project implementation is based supplied by the external data representation, into on 4.4BSD runelocale implementation [Borman, ] internal data representation. Therefore, we can . The project expanded it significantly to support have no data lossage during conversion from multi-script framework, iconv(3), BSD-licensed external to internal encoding, or vice versa. gettext library, more locale supports like If we use Unicode as internal and external LC_TIME, and expansions for simultaneous mul- encoding, we would not be able to enjoy the tiple locale handling (as presented above). We above mentioned benefits. With UCS4 as the still are missing multilingual support for stdio internal encoding, it is very hard to support exter- routines, like fgetwc function, as it require us to nal encodings other than UTF8. To support them, modify FILE structure on each of the BSD sys- we would need a huge conversion table to convert tem. Depending on the details in standard I/O the external encoding into the internal encoding, function implementation, the change to FILE and vice versa. Also, it will not be possible to structure may need a libc shared library major simplify the econding engine, even if inter- number bump. The current implementation sup- nal/external encodings are simple enough. ports NetBSD, OpenBSD and FreeBSD. We are trying to finalize our implementation and integrate 4. Multiple encodings it into BSDs. NetBSD integration is ahead of other BSDs at this moment. LC_CTYPE locale ISO C/SUS V2 multilingual programming support and BSD-licensed gettext library were API assumes that a single program uses single integrated into NetBSD development tree and will encoding throughout the program. External and be available in NetBSD 1.6. The source code is internal encodings are picked when setlocale(3) is available via anonymous CVS. The project is called, and the function usually gets called on definitely an ongoing effort. program startup time. Normally, we cannot deter- mine which encoding should be used, from a At this moment, we are using a tool called wchar_t data stream. mklocale(1) for converting LC_CTYPE locale definition files into binary representation. The There are various applications that would mklocale(1) tool and the locale definition file for- need a library support for multiple encodings dur- mat are derived from runelocale implementation. ing their runtime session. Examples would be We should migrate to more standard tool like web clients, text editors and email readers. For localedef(1), and the standard file format for these applications, we would need to pick internal locale definition files. and external encodings based on input data stream. So, ISO C/SUS V2 API is not sufficient There still are couple of issues to be for these type of applications. resolved. We picked 32bit wchar_t for now, just like many of other operating systems do. It We hav ea temporary workaround to pro- is good enough for future? It is a good question. vide a better multiple encodings support. ISO We believe 32bit is a good compromise. A good C/SUS V2 API has a data type, mbstate_t, to hold thing is that we actually are future-proven. As the intermediate state for encoding engine. Our long as people do not hardcode the current implementation includes a hidden reference to assumption that wchar_t is of 32bit quantity, we encoding engine inside mbstate_t. By holding an will be able to expand widechar representation to mbstate_t variable with a wchar_t array, we can 64bit, or something larger. It requires full recom- identify the encoding engine used to encode the pilation of and userland pro- wchar_t stream. When we convert wchar_t (inter- grams, but the transition will impose no change in nal encoding) back to external encoding, we can code. The transition will be just like transitioning automatically use the appropriate encoding from 32bit time_t to 64bit. engine. For full locale support (including LC_TIME and LC_COLLATE), we will need a large database for localized character tables, time Murai, 1993. format and others. For voluntary free software J. Murai, M. Crispin, and E. van der Poel, effort it can be way too hard to manitain. The “Japanese Character Encoding for Internet maintenance cost for the LC_CTYPE locale Messages” in RFC1468 (June 1993). databases is also high. We are wondering if we ftp://ftp.isi.edu/in-notes/rfc1468.txt. can integrate database from ICU [IBM, ] , and Borman, . reuse it in our multi-script framework. Paul Borman, 4.4BSD rune(3) implementa- tion. 6. Conclusion IBM, . We at Citrus project aim to implement a IBM, ICU: International Components for complete multilingual programming environment Unicode. http://oss.soft- for BSD-based operating system. The most ware.ibm.com/developerworks/open- important aspect of the effort is multi-script source/icu/project/. framework. We try to avoid any hardcoded Compaq, . encoding in our library, and embed conversions Compaq, “Writing software for the Interna- and encodings into ISO C/SUS V2 API. tional Market” in Tru64 UNIX Version 5.1 DEC/Compaq Tru64Unix uses a similar programming online documentation. technology as we have used [Compaq, ] , includ- http://tru64unix.compaq.com/faqs/publica- ing multiple switchable internal encoding, tions/base_doc/DOCUMENTA- dynamic library support for additional encodings TION/V51_HTML/ARH9YBTE/TITLE.HTM. support, and the use of 32bit wchar_t. As men- tioned in the abstract, vendor UNIX implementa- Author’s address tions are very ahead of free software implementa- tions, regarding to multilingual support. We wish Jun-ichiro itojun HAGINO to see more of vendor technologies to be made Research Laboratory, Internet Initiative Japan Inc. available to public consumption, under BSD or Takebashi Yasuda Bldg., GNU license. 3-13 Kanda Nishiki-cho, Chiyoda-ku, Tokyo 101-0054, JAPAN The author would like to thank Freenix Tel: +81-3-5259-6350 reviewers including Wendy Rannenberg, and Cit- Fax: +81-3-5259-6351 rus project members including Henry Nelson, for Email: [email protected] the time they gav eme improve the paper.

References Yergeau, 1998. F. Yergeau, “UTF-8, a transformation for- mat of ISO 10646” in RFC2279 (January 1998). ftp://ftp.isi.edu/in-notes/rfc2279.txt. KUBOTA, . Tomohiro KUBOTA, Most Important 1006 Ideographs for Japanese. http://www.debian.or.jp/˜kubota/unicode/. Whistler, 1999. K. Whistler and G. Adams, “Language Tag- ging in Unicode Plain Text” in RFC2482 (January 1999). ftp://ftp.isi.edu/in- notes/rfc2482.txt. Freed, 1996. N. Freed and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies” in RFC2045 (November 1996.). ftp://ftp.isi.edu/in-notes/rfc2045.txt.