An Implementation of Multilingualized Environment on Dynamic Objects

事業名: 平成 14 年度未踏ソフトウェア創造事業契約件名: An Implementation of Multilingualized Environment on Dynamic Objects 大島芳樹 Yoshiki Ohshima Twin Sun, Inc. 360 N. Sepulveda Blvd. Suite 2055 El Segundo, CA 90245 E-mail: [email protected] ABSTRACT. This paper describes the design and implementation of multilingualization (“m17n”) of a dynamic object-oriented environment called Squeak. Squeak is a highly portable implementation of a dynamic objects environment and it is a good starting point toward the future collaborative environment. However, its text related classes lack the ability to handle natural languages that require extended character sets such as Arabic, Chinese, Greek, Korean, and Japanese. We have been implementing the multilingualization extension to Squeak. The extension we wrote can be classified as follows: 1) new character and string representations for extended character sets, 2) keyboard input and the file out of multilingual text mechanism, 3) flexible text composition mechanism, 4) extended font handling mechanisms including dynamic font loading and outline font handling, 5) higher level application changes including a Japanese version of SqueakToys. The resulting environment has the following characteristics: 1) various natural languages can be used in the same context, 2) the pixels on screen, including the appearance of characters can be completely controlled by the program, 3) decent word processing facility for a mixture of multiple languages, 4) existing Squeak capability, such as remote collaborative mechanism will be integrated with it, 5) small memory footprint requirement. multilingual system. Firstly, the system should allow the user 1 Introduction to use characters from different languages (scripts) without The more people who live in the networked and comput- any burden. For instance, the system should easily support erized society, the more important the communication tools applications such as Arabic-English or Chinese-Japanese become. We envision that, in near future, non-technical peo- electric dictionaries in which different scripts are used to- ple including kids will express and exchange their ideas using gether. such tools. They may such ideas collaboratively and commu- Secondly, the displayed or printed result from the system nicate them over the network [1]. should be acceptable in terms of the appearance. The network and the tool are getting faster and ready for Thirdly, the system should be portable among the variety creating such a collaborative environment. What we need is of platforms. A multilingualized system will be used by peo- good software that is usable by people all over the world. Be- ple who use different hardware and software. The system cause the people will want to express their ideas in their own should be usable on all those platforms. natural languages, such software must be capable of not only We aim to fulfill those three requirement in this project. end-user accessible idea authoring, but also multiple natural This is an on-going project, but the system is already being languages. In the other words, it has to be a multilingualized used in several places. This fact proves that we are on the system. right track toward our goal. To achieve this goal, we started from a late-bound, dy- In this paper, we discuss the design and implementation of namic object-oriented system called Squeak[2]. Squeak is an the multilingualization of Squeak. implementation of a dynamic objects environment. Squeak The following sections are organized as follows. In sec- provides various end-user collaboration tools including an tion 2, we discuss the overall design goals and the prereq- end-user programming environment called SqueakToy[3]. uisites for understanding the issues of multilingualization. Also, the Squeak’s ability to control the every pixel on screen From section 3 to section 9, we discuss the implementation has a great potential. of the added features. Finally, in section 11, we conclude the However, Squeak has a weakness in its multilingual as- paper. pect; Squeak’s text handling classes lack the ability to handle 2 Overall Design natural languages other than English. We think that Squeak will be an ideal environment for network collaboration once In this section, we discuss the overall design issues of the it is multilingualized. multilingualized Squeak. There are three issues to consider when implementing a 2.1 Universal Character Set solution was feasible. In the original Squeak, a character, represented as an in- 2.4 MacRoman vs. ISO-8859-1 stance of Character class, holds an 8-bit quantity (“octet”). For compatibility with the original Squeak, ideally the def- Obviously, that representation is not sufficient for the ex- inition of the one-octet characters and strings should not be tended character set. changed. However, to adapt to Unicode, it is cleaner if the What kind of representation do we need for the extended first 256 characters are identical with ISO-8859-1 character character set? One might think that a 16-bit fixed represen- set, instead of the original Squeak’s MacRoman character tation would be enough, but even the “industrial standard”, set. Unicode version 3.2 [4][5], now defines a character set that We decided to modify the upper half of the first 256 char- needs as many as 21 bits per character. acters to be ISO-8859-1 characters. Because the characters in The plain Unicode has another problem; “han- the upper half are not used much, this change was easy. With unification”. The idea behind han-unification is that this change, the Character class represents the ISO-8859- the standard disregard the glyph difference of certain Kanji 1, which is equivalent to the first 256 characters in Unicode characters and let the implementation choose the actual standard. glyph. This abstraction contradicts the philosophy of 2.5 Keyboard Input Squeak; namely, it becomes impossible to ensure pixel At a layer from the OS to Squeak level code has to create identical execution and layout across the platforms. a character in the extended character set from the user key- On the other hand, Unicode seems to be good enough board input sequence. One would imagine that modifying for scripts other than CJKV (in Unicode terminology, CJKV VM to pass multi-octet characters to the Squeak level would refers to “Chinese, Japanese, Korean and Vietnamese, that be a good solution. However, this is not desirable in two use the Chinese origin characters) unified characters. They reasons. One is that such VM will be incompatible with the are well defined and contain many scripts. existing VMs installed to many computers. The other is that Obviously, we need more than 16-bit for a character. What the encoding of the multi octet character sent from the OS is is the upper limit? Actually we don’t have to decide the upper different from one platform to another. limit. Thanks to the late-bound nature of Squeak, we can In the current implementation of m17n Squeak, we decide always change the internal representation without affecting not to modify the VM. If the OS sends a multi octet character the other parts of system much, even when what is changed to the Squeak VM, the VM treats the octets as if they are is as basic as the Character class. This decision let us start separable characters and sent them to the Squeak level code with a 32-bit word for a character. We have added a kind of through the Sensor object. The Sensor then interprets the “encoding tag” to each character to discriminate the unified characters if necessary and generate appropriate character. characters, and also to switch the underlying font and scanner 2.6 Text File Export and Exchange method implementation. The requirement for the file out format is two fold. Firstly, 2.2 Memory Usage it should be understandable by the other non-Squeak soft- To represent text in any form of extended character set, ware. Secondly, the identity on the roundtrip conversion there must be a character entity that can represent more than from Squeak should be ensured. We don’t require that the an 8-bit quantity and a type of string that can store these other way of round trip doesn’t have to be identical. characters. One way to adapt this new representation is to Because the original Squeak’s file out format already uses change Character and String uniformly so that all instances the MacRoman format and uses all 8 bits of the octets in a file of Character or String represent this new wide character out, we need to introduce a mechanism to let the characters in and string (uniform approach). One of the most advanced the extended character set co-exist with the original MacRo- multilingualized system, Emacs after version 20, uses this man characters in the file out. To satisfy this goal, there are approach [6]. Another way is to add new representations and three feasible encoding schemes for this external file format. let them co-exist with the exisiting default ones (mixed ap- The first possible way is to adapt the X Compound Text proach). format of X Window System, or “ctext”[7]. The upside The uniform wide character representation is cleaner, but of ctext is compatibility with existing file outs. The 8-bit takes much space. In original version 3.2 image, The total characters in the file out remain the same semantics. Also, size of the String subinstances occupy is about 1.5MB. If many existing software can read and write this format at least we use unsatisfying 16-bit uniform representation or 32-bit partially. In fact, Japanese in ctext is essentially identical representation, the image size would grow a few megabytes. with the standard internet email format for Japanese[8]. The We decided to use the mixed approach.

Load more