<<

Product Internationalization

Digital Technical Journal

Digital Equipment Corporation

J9�,._, L T U A L .. - : ------,... ------.� WRITTEN LANGU AGE C) l:-l 1 '-" CHARACTER PLACEMENT � �: :� ttl 0 (") 1 y - I �eJ?@AL� : i:-l >-l NATIONAL CONV ENTIONS I 'tr1 DATE FORMATS I I TIME OF DAY FORMATS I·� NUMBER FORMATS

CURRENCY FORMATS

USER INTERFACE

"Not chaos·! ike, together crushed and brUJ.ted, GEOMETRY MANAGEMENT z 0 But as the world harmoniously confusc-l \'hcrc order in variety sec, 1 all And where, though things differ, all ag'fc" IMAGES

SYMBOLS

OLOR

Number 3

Summer 1993 Editorial Jane C. Blake, Managing Editor Helen L. Patterson, Editor Kathleen M. Stetson, Editor Circulation Catherine M. Phillips, Administrator Dorothea . Cassady, Secretary Production Terri Autieri, Production Editor Arme . Katzeff, Typographer Peter R. Woodbury, Illustrator Advisory Board Samuel H. Fuller, Chairman Richard Beane Donald Z. Harbert Richard]. Hollingsworth Alan . Nemeth Jeffrey H. Rudy Stan Smits Michael C. Thurk Gayn B. Winters The Digital Teclmicaljournal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJ02/DIO, Littleton, Massachusetts 01460. Subscriptions thejoumal are $40.00 (non-U.S. $60) for four issues and $75.00 (non­ U.S. $115) for eight issues and must be prepaid in U.S. funds. University and college professors and Ph.. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Teclmicaljournal at the published-by address. Inquiries can also be sent electronically to [email protected] .COM. Single copies and back issues are available for $16.00 each by calling DECdirect at 1- 800-D IGITAL (1- 800-344- 4825). Recent back issues of thejournal are also available on the Internet at gatekeeper.clec.com in the directory /pub/DEC/DECinfo/DTJ.

Digital employees may send subscription orders on the ENET to RDVAX::JOU RN AL. Orders should include badge number, site location code, and address.

Comments on the content of any are welcomed and may be sent to the managing editor at the published-by or network address. Copyright© 1993 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty mem­ bers and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved. The information in thejoun1al is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes responsibility for any errors that may appear in thejournal. ISSN 0898- 901X Documentation Number EY-P986E-DP The following are trademarks of Digital Equipment Corporation: AXP, AXP, COD/Plus, COD/Repository, DEC, DEC OSF/1 AXP, DEC Rclb, DECwindows, DECwrite, Digital, the Digital logo, EDT, OpenVMS, OpenVMS AXP, OpenVMS VAX, TeamRoute, UURIX, VAX, VMS, and VT. Apple is a registered trademark of Apple Cori1puter, Inc. AT&T is a registered trademark of American Telephone and Telegraph Company. Hewlett-Packard is a trademark of Hewlett-Packard Corporation. fBM is a registered trademark of InternationalBusiness Mad1ines Corporation.

lntel is a trademark of Intel Corporation. Cover Design Lotus 1-2-3 is a registered trademark of Lonos Development Corporation. Scripts, symbols, and writing directions Microsoft, MS-DOS, and MS Windows are registered trademarks and Win32 and are elements of written communication Windows NT are trademarks of Microsoft Corporation. that are addressed by product international­ Motif, OSF/Motif, and OSF/1 are registered trademarks and Open Software Foundation ization, the featured topic in this issue. Like is a trademark of Open Software Foundation, Inc. engineering designs and standards for inter­ Motorola is a registered trademark of Motorola, Inc. nationalization, the graphic design on the PIC is a trademark of Laboratories, Inc. cover provides a framework that accommo­ PostScript is a registered trademark of Adobe Systems Inc. dates a rich diversity of the world's written is a trademark of Unicode, Inc. languages. UNIX is a registered trademark of UNIX System Laboratories, lnc. X Window System is a trademark of the Massachusetts Institute of Technology. The cover was designed byjoe Poze1ycki,jr., of X/Open is a trademark of X/Open Company Limited. Digital's Corporate Design Group. Book production was clone by Quantic Communications, lnc. I Contents

6 Foreword Claude Henri Pesquet

Product Internationalization

8 IntertUttional Cultural Differences in Software Timothy G. Greenwood

21 Unicode: A Universal Character Code )tirgcn Settels and f Avery Bishop

32 The X/Open Internationalization Model Wendy Rannenberg and ]Urgen Bettels

43 The Ordering of Universal Character Strirzgs Rene Haentjens

53 International Distributed Systems­ Architectural and Practical Issues Gayn B. Winters

63 Supporting the Chinese, japanese, and Korean Languages in the OpenVMS Operating System Michael M. T. u

80 Character Internationalization in Databases: A Case Study Hirotaka shioka and)im Melton

97 japanese Independent of ApplicatiollS Takahide Honma, Hiroyoshi Baba, and Kuniaki Takizawa I Editor's Introduction

examine benefits and limitations of , referencing Digital's DEC OSF/1 AXP implementation, and close with proposed changes to the model. Rene Haentjens' paper is not about a standard per but about the ways in which various cultures order words and names and the methods used in computers to emulate this ordering. examines the table-driven multilevel method for ordering uni­ versal character strings, its variations and its draw­ backs. The implications of Unicode relative to

1 I ordering are also considered. The development and adaptation of software for Jane C. Blake use in local markets is the common theme of three naging Editor . Gayn Winters identifies several program­ practices for the development of distributed Engineering products for international markers is a systems and discusses the benefits of moclu larity in multifaceted undertaking, as it entails the adaptation systems and in run-time libraries to reduce reengi­ of computer technology to the unique ami varied neering effort and costs. However, as Michael Yau ways cultures communicate in written languages. notes in his paper, reengineering is necessary for Papers in this issue describe some of the cultural systems designed when English was the only ­ and teclmological challenges to software engineers guage supported in computer systems. Michael pre­ and their responses. To pics include conventions of sents an overview of the engineering challenges culture and language, internationalization stan­ encountered and resolved in the creation of local dards, and design of products for local markets. variants of the OpenVMS operating system to sup­ Product internationalization begins with identi­ port the Japanese, Chinese, and Korean languages. fying the cultural elements and user expectations A third paper, written by Hiro Yoshioka and Jim that the software must accommodate. Tim Melton, provides a case study of a coengineering Greenwood has written a tutorial that provides project, i.., a project in which engineers from the insight into the cultural differences and the com­ local environment (or market) join in the product plexities of written languages as they relate to prod­ development. The case references the internation­ uct development. Among the topics he discusses alization of the DEC Rdb database (specifically for are scripts and orthography, writing directions, key­ Asian markets) utilizing an SQL standard. board input methods, conventions for values such The concluding paper focuses on software as time, and user interfaces. designed to facilitate .Japanese keyboard input and to As a counterpoint to the complexity of languages reduce reengineering/localization effort. Takahide and cultures, industry engineers and organizations Honma, Hiroyoshi Baba, and Kuniaki Takizawa have developed standards that lend simplicity and review the methods of Japanese keyboard input uniformity. Unicode, described here by .irgen and then describe a three-layer, application­ Bertels and Avery Bishop, is a significant interna­ independent software implementation that is tionalization standard that accommodates many embedded in the operating system and offers users more complex character sets than does 8-bit ASCI!; flexibility in the choice of an input operation. software produced using Unicode character encod­ The editors are grateful to Tim Greenwood, an ing can be localized for any language. The authors architect of Unicode currently working in the discuss the principles behind the 16-bit encoding Software Development Tools Group, for his help in scheme and considerations for application pro­ coordinating the development of papers and to cessing of Unicode text. They conclude with Gayn Winters, Corporate Consulting Engineer. approaches for the support of Unicode and refer­ Note to Internet Users: Recent back issues of ence the NT implementation. the D1J are now available in ASCII and PostScript for­ We ndy Rannenberg and Jiirgen Bertels have writ­ mats on gatekeeper. dec.com in the /pub/DEC/ ten a paper on another important standard, the DECinfo/DTJ directory. X/Open internationalization modeL X/Open sup­ ports multibyte code sets ami provides a compre­ hensive set of application interfaces. The authors

2 Biographies I

Hiroyoshi Baba Hiroyoshi Baba is an engineer in the Japanese Input Method Group in Digitaljapan, Research and Development Center. He is currently devel­ oping the japanese front-end input system on OpenVMS VA X and OpenVMS AXP and the conversion server system. He received a B.S. (1989) and an M.S. (1991) in electronics engineering from Muroran Institute of Technology, Japan. He joined Digital in April 1991 .

Ji.irgen Settels Ji.irgen Bettels is an internationalization architect and the stan­ dards manager for the International Systems Engineering Group. Since 1986, he has worked on many internationalization architectures starting with DECwindows. He participated in the , ECMA, and X/Open on internation­ alization. He contributed to the IS /IEC WG2/SC2, whose work m<.:rged Unicode and ISO 10646 into a single universal . Prior to joining Digital, he was a physicist at the European particle laboratory, CERN. )i.irgen has the degree of Diplom Physiker (physicist) from the University of Aachen.

F. Avery Bishop Avery Bishop is the program manager for Windows NT/Alpha internationalization. Prior to this position, he worked in ISE as Digital's represen­ tative to the Unicode consortium and the ANSI X3L2 technical advisory group on character encoding. He worked with IS O/IEC WG2/SC2, Unicode, and others in Digital to merge Unicode and IS O 10646 into a single universal character encod­ ing. Prior to that, he managed projects at DECwest and worked as the product management manager for ISE in Japan. Avery has a Ph.D. in electrical engineering from the University of Utah.

Timothy G. Greenwood Since 1981, Tim Greenwood has held various posi­ tions relat<.:d to internationalization at Digital. He was the architect for the Japanese and Chinese versions of DECwindows. This software introduced the string technology that was incorporated into Motif. Tim conceived of, managed, and wrote much of the software section of the internal version of the handbook on Producing International Products. He also participat<.:din the d<.:sign of international support on the X Window System. Tim is curr<.:ntly responsible for guiding the introduction of Unicode into Digital.

3 Biographies

Rene Haentjens Rene Haentjens is a software consultant working for both Digital Consulting Rclgium and Corporate Standards and Consortia. lie was the Belgian local engineering manager for two years. Today, Rene is a member of the Belgian, the European (), and the ISO committees on character sels and inter­ nationalization. He contributed significantly to the ISO/IEC 10646-1: 1993 stan­ dard. He has a civil engineering degree (chemistry) from the University of Ghent and has contributed to publications on compiler portability, on software engi­ neering, and on developing international software and user information.

Takahide Honma A senior software engineer, Takahide Honma leads the Japanese Input Method Group. He joined Digital in 1985 as a soft ware service engineer. He has worked on systems such as real-time drivers, network system (PS.I.), and database on VMS and was a consultant to customers. At the same time , he also took the role of a sales advisory support engineer. Since 1990, he has been with Research and Development in Japan and has worked on the Japanese input method. He has an lvi.S. (1983) in high-energy physics from Kyoto University and is a member of the Physics Society of Japan.

Jim Melton A consulting engineer with Database Systems. Jim Melton repre­ sents Digital to the ANSI X:1H2 Technical Committee for Database. He represents the United States to the ISO/IECJTC1/SC21/\XIG3 Working Group for Database. He edited the SQL-92 standard and continues to edit the emerging SQL3 standard. Jim also represents Digital to the X/Open Data Management Working Group and to the SQL Access Group. Jim is the author of Understanding the New SQL: A Complete Guide, published in 1992, and is a regular columnist (SQL ( pdate) for Database Prugramming & Design.

Wendy Rannenberg Principal software engineer Wendy Rannenberg man­ ages the UNIX Software Group's internationalization team. is responsible for the delivery of Digital's internationalization technology on both the ULTRIX and the DEC OSF/1 AXP platforms. Prior to joining Digital in 1988, she held engineer­ ing positions with Lockheed Sanders Associates and the Naval Underwatt:r Systems Center. Wendy holds a B.S. (1980) in engineering from the University of Connecticut at Storrs and is a member of !FEE, SWE, and ACvl. She has written or contributed to numerous technical pub! ications.

Kuniaki Takizawa Kuniaki Takizawa is an engineer with Digital Japan, Research and Development Center and is a member of the japanese Input Method Group. 1 Ie joined Digital in April 1991 and is currently developing and porting the henkan module and the input method library (IMLIB) on OpenVMS, UITRIX, and OSF/1. Ik graduated from the University of Electronic Communi­ cations (Denki-Tsushin (jniversity) in Japan in 1991. His speciality area was the structure of opera ling sysLems.

4 I

Gayn B. Winters Corporate consulting engineer Gayn Winters has 25 years' experience developing compilers, operating systems, distributed systems, and PC software and hardware. He joined Digital in 1984 and managed the DECmate, Rainbow, VA Xmate, and PC integration architecture. He was appointed Technical Director fo r Software in 1989 and contributes to the Corporate software strat­ egy. From 1990 to 1992, Gayn led the internationalization systems architecture effort and is on the Board of Directors fo r Unicode, Inc. He has a B.S. from the University of Califo rnia at Berkeley and a Ph.D. from MIT.

Michael M. T. Yau Michael Ya u is a principal software engineer in the International Systems Engineering Group. Since 1984, he has worked on Asian language support in the OpenVMS operating system. He led and managed the development team in Kong from 1986 to 1991. Currently, he provides archi­ tecture and product internationalization support to U.S. engineering groups. Prior to joining Digital, Michael worked fo r GEC Marconi Av ionics (U..). Michael holds a B.Sc. (Hons) in mathematics and an M.Sc. in communication engineering from the Imperial College of Science and Technology, University of London.

HirotakaYoshioka A senior software engineer in the International Software Engineering Group, Hiro Yo shioka is the project leader of the COD/Repository/ Japanese. He is a member of the internationalization special committee of ITSCJ (Information Technology Standards Commission of Japan) and ISO/IEC JTC1 SC22,1\.\1(;2Q internationalization. During the past nine years, he has designed and implemented the Japanese COBOL, the Japanese COBOL generator, and the inter­ , nationalized DEC Rdb. Hiro joined Digital in 1984, after receiving an M.S. in engi­ neering from Keio University, Yokohama.

5 Foreword

languages. To serve these markets well, we are com­ pelled to adopt a strategy for the internationaliza­ tion'' of our products. The strategy, i.e., to devdop products that "speak" the local language, has evolved from a fas­ tidious reengineering of a product after the fact to an architectural definition that ensures products ;m: designed originally to meet local-language requirements. Digital had three goals:

• Reduce development costs.

• Shorten the time to market. ClaudeHenri Pesquet Engineering Group Ma nager, • Increase product quality. International Systems The cost of reengineering prod ucts that were Engineering designed based upon a North American paradigm is similar to the cost of maintaining an application that was designed without regard to future main­ In the late 1970s, Digit al began to ship its first office tenance. Such costs could meet, if not exceed, the products outside the u.S. We realized then that it original product development cost. This was dis­ was no longer an option to provide users with the couraging, because the markers outside the U.S. ability to input, view, edit, and print foreign lan­ were smaller and emerging; producing the local guage text; it was instead a necessity, as well as a product compared in cost to producing the original passport for Digital into world markets. U.S. product. It became obvious that it was too The foreign-language requirement came as a expensive to continually rebuild products that sold shock to the application developers who had been only to a small market. trained in the late 1960s, at a time when the U.S. Local-language products were late to market English-speaking market represented more than 70 when compared with availability of the.: same prod­ percent of the total worldwide information tech­ ucts in the LS. This presented a twofold problem. It nology market. Toclay's reality is quite different. denied our multinational customers the capability The English-speaking IT market is below 40 per­ of installing products and applications simul­ cent, and trends ind icate that it will continue to taneously in their worldwide operations. Further, decline. This is not surprising, because only 8.41 prod uct launches, training, selling, support, and percent of the world's population is native English retirement had to be addressed one country at a speaking. Moreover, each year the commoditization rime because specific local-language components of computers lowers the entry point for the acquisi­ were not available simultaneously. tion of computer products; consequently the mar­ In addressing short-term "surface" issues, we had ket is expanding to encompass a much broader utilized the brute force of reengineering to pro­ socioeconomic community. Further, starting in the duce one language version at a time. As a result, we 1980s, the creation of global markets-for labor, delayed addressing the "deep" quality issue of origi­ materials, intellectual talent, financing, and distri­ nally designing and building into our products the bution channels-has forced businesses to continu­ internationalization fe atures that would allow for ally reach tside their domestic markets. easy adaptation to any language withollt modifica­ Government mandates also have an impact, requir­ tion of a product's core. ing that products sold within country boundaries A vision on how to address the internationaliza­ have l.ocal-language capability. Together these fac­ tion of prod ucts was developed by a worldwide tors wilJ increase the demand for and requirements team of architects led by Gayn Winters. The major­ of international products-products that wil l pro­ ity of this team was located outside the u.s. and had vide users with linguistic choices.

In recent years, Digital has broadened its market • Th<:term internationalization as it is used within the context focus to include not only the scientific/technical, ofthisjourna/ include; hotll rile technologiL·s and the pro­ cesses applied to enable a product to meer rile need of any mainly English-speaking markets, but also com­ local-language marker wirhour requiring modification of the mercial markets-a large market comprising many hase functionality of rhe producr.

6 I been closely involved in Digital's reengineering engineering-resource perspective, we would start efforts for many years. The team's prime motivation with parallel internationalization development, was to eliminate the need for reengineering. The and then inject internationalization expertise into vision they developed is one in which all Digital the original product development group. The strat­ integrated systems can process electronic informa­ egy from a process perspective was to customize tion containing multiple languages and character code for specific countries. and then roll back the sets, and can satisfy end-user linguistic preferences. country-specific code into the original product Aninherent part of this vision is to make all systems code base and continue future development from available simultaneously worldwide. this unique code base. The implementation has One of the major difficulties in implementing the resulted in major achievements, for example, the vision was that internationalization was not aimed simultaneous shipment of products to which this at specific products, rather it was a pervasive approach was applied. attribute required across systems. For product To illustrate our progress, the latest version of development groups trained to develop compo­ Rdb (relational database application) was devel­ nents, this represented a difficult change in mind­ oped with the injection of internationalization set. The implementation also required a huge expertise. The approach resulted in one common paradigm shift- code base and achieved worldwide simultaneous shipment. From one character To one character Many challenges remain. Standards have to be and ... and ... defined and implemented in areas such as naming One input method Many input methods conventions. user profiles, and character attributes. One cell Multiple cells Emerging technologies such as object-oriented soft­ One collation point Several collation points ware and multimedia need to be addressed. And One geometry Many geometries real-time multilinguality (the simultaneous transla­ Ideograms tion from one language to another) must be tackled. "Frozen" alphabet User-defined characters This issue of the jo urnal provides a broad sam­ The paradigm shift led to a redefinition of the pling of our product internationalization efforts­ elements to be incorporated in the basic design from the concept of cultural differences to the of new products. The strategy from a product specific internationalization of our Rdb product. perspective was to start with the base system The papers herein represent only a few of the hun­ (CPU, peripherals, network, and operating system), dreds of projects dedicated to the internationaliza­ and then move to the application side. From an tion of Digital's products.

7 Timothy G. Greenwood I

International Cultural .fferences in Software

Throughout the world, computer users approach a computer system with a sp ecific set of cultural requirements. In all cultures, theyex pect computer systems to accom­ modate their needs. A major part of interaction with computers occurs through written language. Cultural requirements, particularly written languages, influ­ ence the way computer systems must op erate. Cultural differences concerning national conventionsfo r tbe presentation of date, time, and number and user inter­ fa ce design fo r the components of images, color, sound, and the overall layout of the screen also affectthe development of computer technology Successfu l computer systems must respond to tbe multicultural needsof users.

Not chaos-like, together crushed and bruised, multiple development teams modifying the source But, as the world harmonious!)'confuscd: code of the original product. This process also pro­ Where order in variety we scc, duces multiple code bases, which makes ture And where, though all things differ, all agree. development and maintenance more complex. - Alcxander Pope Building software that can be localized with ­ In the first years of the computer age, users adapted imal software changes is called internationaliza­ themselves to the requirements of the computer. tion, often abbreviated to 118N (the letter I followed They had to learn the language of the machine to by 18 letters and the letter ). The basis of interna­ interact with it. Now the computer is part of daily tionalization is to identify those cu ltural elements life , a tool to complete a task. Computer systems that the software must accommodate and to sim­ must be adapted to the needs of their users. plify the task of adapting the product. This paper Computer users approach a computer system with describes a set of these cultural elements. The a specific set of cultural requirements. Successful remainder of this issue of the Digital chnical systems respond to these requirements. jo urnaldetails specific aspects of building interna­ tional software.

International Adaptationof Computer Systems Cultural Differences Each nation has developed its own cu lture, and Language is the most obvious cu ltural difference some areas of the world share a cu ltural back­ among people. Written language is an important grou nd. Adaptation of computer systems to differ­ method of communication with computers. This ent cu ltures uses processes known as localization paper examines written languages and their repre­ and internationalization. sentation in computer systems. It also presents cul­ Localization is the process of changing products tural differences concerning national conventions to suit users from different cu ltural backgrounds. for the presentation of date, time, and number and Localization is achieved by taking the source code user interface design for the components of images, for a product developed for one cou ntry and modify­ color, sound, and the overall layout of the screen. ing the source code and product to satisfy the needs The base functions of a product may change in of other countries. Often teams of developers in dif­ response to different needs around the world , and ferent countries are needed to adapt products. If some examples of these differences are illustrated . the original product is not built with a view toward Finally, with a look to the future, the paper presents being localized, this can be a very expensive and deeper cu ltural differences that are only beginning time-consuming process. There is the direct cost of to be represented in software.

8 !Ji. 5 No. .3 Summer 1993 Digital Te chuica/ journal InternationalCulturctl Dtf ferences in Software

Written Language No written language uses a pure set of either The written representation of spoken language alphabetic, sy llabic, or ideographic symbols; each requires a script and an orthography. The script is does use one set of symbols predominantly. The the set of symbols that represents the sound or is primarily alphabetic, but numerals meaning of components of the language. The and certain signs such as & are ideograms-'• Other orthography consists of the rules of spelling and languagcs use a more even mix. South Korean com­ pronunciation. Specific spelling and pronunciation bines the native alphabet with , the rules may diffe r among locations or communities; Korcan name for their ideographic characters. for example, the American English orthography dif­ Japanesc combines the and syl­ fe rs from the British English orthography. A script labaries (col lectively called kana) with the ideo­ may be tied to a specific language, for example, graphic characters called in Japanese. Written Korean Hangul, but frequently a script can repre­ Japanese, especially technical and advertising mate­ sent several languages. French and Italian both use rial, also often uses the , called romaji. the Latin script. A written language may be broad ly categorized Character Placement In most European lan­ into either an ideographic, a syllabic, or an alpha­ guages, basic symbols are written in a linear stream betic . The category is determined with each character placed on a baseline. In other by examining the relation between the symbols writing systems, for example, Korean Hangul, the in the script and the unit of sound or meaning elements do not follow this I in ear layout. Rather represented. than evolving piecemeal like most writing systems, In writing systems based on ideograms, every Hangul is the result of deliberate, I inguistically has a specific meaning that is not related to informed planning. It has been ca lled " ...probably its pronunciation. The ideograms imported from the most remarkable writing system ever invented ."7 Chinese, and used in Chinese, Ja panese, and th Korean uses an alphabet of 14 and 10 Korean writing provide examples in current use. 1 . These letters, cal ledjamo, are blocked into Thus A represents a man or person, even though sy llable clusters. Ifthe same technique were applied it is pronounced in Chinese, zin in Japanese, to English, cat might be written c;._ Figure I shows and in in Korean2 "3" represents "three" even the Korean Hangul alphabet, and Figure 2 shows though it is pronounced tatu in Swahili and trwa in the jamo blockcd into clusters. French. Ideographic writing systems typically con­ Thai also uses an alphabet and is written with the tain several thousand discrete symbols with a sub­ symbols arranged in a nonlinear fa shion. Thai is a set of approximately 2,000 symbols in frequent use. tonal language; different tones distinguish words The users of this writing system continue to that would otherwise be homonyms. Thai words develop new symbols. consist of consonants, vowels, and marks. In the syllabic writing systems, each symbol Each component is an atomic unit of the language. represents a syllable. 7 in Japanese katakana A is written in front of, above, below, or cl enotes the ma sound. There is a wide variation in behind the to which it refers. A tone the number of discrete symbols in a sy llabic sys­ mark, if present, is usually placed above the conso­ tem. Japanese kana uses some 47 symbols; the nant or above the upper voweL Thai potentially has writing of the people (a minority nationality symbols at four levels, as shown in Figure 3. Level scattered through provinces in Southwest ) uses a standardized syllabary of 819 symbols.1 In alphabetic systems, each symbol or letter 7t-t:-C.P l:I.A approximately matches a phoneme (the smallest CONSONANTS OA.;i:;7E-JI-<5" unit of speech distinguishing meaning). Thus M in Latin script, ? in Hebrew, and lfin Armenian denote them sound. Most have from 30 VOWELS to 50 discrete letters.·1 The match between phonemes and letters is not exact, especially in English, which has about 40 phonemes.> Some phonemes are represented by letter sequences, Figure 1 Korean Hangul Conwmantana such illi the th in thank. Vowel Signs

Digital Te cbllicalJounwl Vol. 5 No.3 Su111mer 1')')3 Product Internationalization

ol ::l.J" -"J :J :;t� tJJ- "6\1 7} >l.J" .<� "'{£ 4- -u� ooj, .:Lei � .?J�<>ll .A\-%.;>;)-C fiJ A tJJ- "6\j� ;>1) 7-jt>� c�) 51J .R.� �;;.J � �tHo): �q .

Figure 2 rean ngul Te xt Showing Blocking of ja

LEVEL 1. TONE MARK 9J LEVEL 2. VOWEL LEVEL 3. CONSONANT OR VOWEL 1GU LEVEL 4. VOWEL \Q.I

Figure 3 Th Script

one is an optional tone mark. Level two is an be positioned directly above the consonant; how­ optional vowel. Level three is either a consonant or ever, it is also currently acceptable for all tone a vowel preceding or following a consonant. Level marks to be physically positioned at level one. Thai fo ur is another optional vowel. A consonant never mechanical typewriters position all tone marks at has a vowel at both level two and level fo ur. level one. Some level-three consonants have part of their glyph images rendered in another level. They either Diacritics and Vo wels In Arabic and Hebrew dip into level four or rise into level two. The last let­ alphabets, vowels are indicated by placing vowel ter (yo ying) in Figure 3 is a level-three consonant, points above, below, or beside the letter. (Arabic but it has a small (separate) portion written below also uses the consonant letters alij;ya, and wauJ to the baseline. When this Jetter is written, this small represent the vowels a, i, and u.) Vowels are portion is written at level fo ur. When this letter is normally used only in religious text ami in teaching fol lowed by a true level-four vowel, the vowel is materials fo r people learning to read the language; shown instead of this portion. in other texts, vowels are inferred by the reader. In Figure 3, both tone marks are shown at level Since vowel points are used, written Hebrew with one to aid understanding of the script. In high­ vowels is called pointed. Figure 4 shows pointed quality printing of Thai, if a cell does not have a Hebrew from a children's comic and the same text level-two vowel, then the tone mark falls clow n to with the vowels removed.

.. - . C'1i1it• T •: .

1'9� c� ,� .?�s� C�l? N.,, .?�·Q'? l"lN�? PWIJJ;10 �'i?�l'!?'? .:1-1!l/?tFJc� nT! P1 ,���� ,-,� ·l117 K�� N�il l:t1 v,,, :l':lN :Cl1l"l f'�Nil C'1iln c���

l'C?J CK ,:J , 71'� Cl"lC N ?1 . ?1'�? l"lN�; jilt'n l"l;-t 1'f'1l'!J? .:-l�llmil l"l11 C3.' J', 1�l,lt' 11N 111? N:lr, N1il 1::11

Figure 4 Po inted and Unpainted He brew

10 Vu l. 5 No. J Summe-r 199.> Digital Te cbnical]ournal International Cultural Diff erences in Software

Other Sy mbols Most languages written with Latin INDEPEN- DENT INITIAL MEDIAL FINAL PHONETIC letters have cJiacritical marks on some letters. In FORM FORM FORM FORM VALUE some cases, the use of a provides a stress I l a or pronunciation guideline, as in the English word ...... -:- ...... b cooperate. Removing the diacritic docs not change "'! the meaning of the word . In other languages, a mark .,;.., ; ;. �

that appears to be a diacritic is a fundamental part . . ..::..., j !...... of the letter. The Danish letter a is a separate letter .:>s. in the alphabet and is not a variant of a. In German, <: .:>:- � j three vowe ls have umlauts and are separate letters � ,.._ X in the alphabet. The deletion of an umlaut can c. C: ,.._ change the definition of a word; fo r example, c. :>- C: kh schwiil means hot or humid, and schwul means :;, ...1.. d homosexual. :;, .i dh

Presentation Va riants The characters in the J .)" r Arabic writing systems change form, depending on whether they are the first, last, or middle character J .)" z of a word , or if they stand alone. Note that the ...... r ..r s

abstract characters themselves do not change, only • • • . .... the glyph image. Figure 5, adapted from Nakanishi, ..r ..r shows the presentation variants of Arabic letters.t� v-" ..p ..a. .;A s v-" ..p ..a. ...,.a. d Writing Direction .b 1 .6. h In English and many other writing systems, the let­

ters are written from left to right, with lines pro­ b 1 1. .Ia. z gressing from top to bottom. Japanese, Chinese, .&- .A. and Korean may also be written in this form but are t. t {'] j; ;.. traditionally written vertically. The characters flow t. t gh from the top of the page to the bottom, with lines � j A. 1.-.<1. advancing from right to left. The pages are ordered f _; A. in the opposite direction to that used for English. !,.) ...,. q Mongol ian is also written vertically, but the !.I s � -.!1 k columns of text advance from left to right. Consequently, pages of Mongol ian text are ordered J J 1 J

in the same direction as in Engl ish. -4. .. m Figure 6 shows a portion of a newspaper printed i t � in iwan. The newspaper exhibits many styles of 0 .:. ,J n

for mat. Headlines may run horizontally from right It> .... + h to left, or left to right; the text of the article may run vertically; and advertisements and tables may run J J .J" w,u horizontally from left to right. -:- ..s y,i In Japanese writing, Latin characters (romaji) I.S'-:f are interspersed with vertical kanji ( script) NUMERALS Arabic figures are written from right to left, but the figure written to the left shows the characters. Romaji may be presented with each higher value: rr 23 character in a horizontal orientation running verti­ cally down, or they may be presented vertically, with each character rotated by 90 degrees. In addi­ y r t 0 i v A 4, tion, if one, two, or three Latin characters are 2 3 4 5 6 7 8 9 0 mixed with vertical Han script characters, they may be presented horizontally in the vertical stream. Figure 7 shows mixed characters in a .Japanese text. Figure 5 Arabic Presentation Forms

Digital Te c1J11ica/]o unwl V,,f. 5 .Vo. 3 /11/ller 199.3 11 Product Internationalization

NUMBER AND PERCENT READ LEFT TO RIGHT TEXT READS RIGHT TO LEFT

TEXT READS LEFT TO RIGHT MAIN CONTENTS TEXT READS VERTICALLY TOP TO BOTTOM, COLUMNS ADVANCING RIGHT TO LEFT

Hgure 6 Ta iwanese Newspaper

PARENTHESIS ROTATED IN ARABIC NUMBERS VERTICAL TEXT PRESENTED HORIZONTALLY

Figure 7 japanese Te xt Sbotl'ing Latin Characters iVl ixert withKan ji

12 "'''· 5 No. 3 Summer IY'J.� Digitttl Te chnical jo urnal International CulturalDif ferences in Software

Semitic language scripts (e.g., Hebrew and Latin keyboards typically have two possible charac­ Arabic)are written horizontally from right to left, ters available from each alphabetic key: lowercase with lines advancing from top to bottom, but any letters are displayed by depressing the key alone, numbers using Arabic numerals are written left to and uppercase letters are produced by depressing a right.9 Any fragments of text written in the Latin shift modifier or a locking shift and the letter key. script are also presented left to right. This method Some keyboards have four levels, with three or leads to nesting segments of reversed writing direc­ four characters available from each key. Figure 10 tion as shown in Figure 8. The text in this figure shows the Arabic keyboard from Digital and the reads "Attention: Kalanit (1984) Tel-Aviv, ISRAEL;" Khmer keyboard from Apple Computer. The user where "Kalanit (1984)'' is a company name. Figure 9 switches into the additional two groups of charac­ is another example of combining left-to-right and ters with an additional modifier or shift key. Note right-to-left text. It shows a portion of the contents that the Arabic keyboard uses the additional group page from the EL AL airline magazine. to support Latin characters as well as Arabic, but the Kh mer keyboard uses all four grou ps for the

Te xt Input Khmer characters only. 10 A four-group keyboard is The following section discusses techniques fo r now a national standard in Germany. 11 addressing cultural differences in computer key­ The takana and hiraga syllabaries have board input. approximately 50 characters each. These can be Alphabetic writing systems typically have no input either directly from the keyboard or through more than 50 discrete symbols. Computer key­ a mapping of the syllable typed with the phoneti­ boards contain approximately 48 keys with sym­ cally equivalent Latin characters. For example, the bols from the writing system inscribed. The character ? (ma) can be input either by typing depression of a key produces a code from the key­ the ? key on a Japanese keyboard, or by typing board that is translated into a character coding m and a and using an input method to convert to according to some predefined coding. Input of a ? character not represented directly on the keyboard Although some early keyboards had many ranks requires depression of several keys. For example, in of individual keys, input of ideographic characters terminals from Digital, the a character is input on from modern keyboards always requires a multiple­ non-German keyboards by pressing "Compose s s". input method, with some user interaction.

Attention: :l�:lN-,n (1984) n'l'� ,ISRAEL LEFT TO RIGHT RIGHT-TO-LEFT SEGMENT LEFT TO RIGHT ... (1984) IS A LEFT-TO-RIGHT SEGMENT NESTED IN A RIGHT-TO-LEFT SEGMENT

Figure 8 sted Bidirectional Te xt

� EL AL News 46

� EL AL Route Maps 48

� EL AL Services

46 ,, � JWL11n

52 ,, � 'J'I,,'t.l

Figure 9 Combined Direction Te .x:t

Digital Te chnical jourual Vo l. 5 No. 3 Sl llll/1/er I'J'J.i 1 3 Product Internationalization

Figure 10 Arabic Keyboard (a bove) and Kh mer Keyboard

Both Japanese and Korean have phonetic writing progresses from right left and includes left­ systems. sers of these languages primarily use to-right segments of numbers and non-Hebrew or phonetic methods to input ideographic characters. non-Arabic text. As shown in Figure 8, these seg­ The Chinese-language user has many diffe rent ments can nest. The order in which to read bidirec­ input methods; these are based on phonetic input tional text can be am biguous and can depend on or on strokes or shapes in the character. Almost all the of the text. Figure 11 duplicates the of these methods display a list of possible candi­ information on a pair of signs displayed at parking dates as a result of the string input, and the user lots in Te l-Aviv. Urban legend has it that at least one selects the appropriate candidate. The implementa­ parking ticket was dismissed by the court on the tion of Japanese input methods is detailed in a sepa­ argument that the sign indicated that parking was rate paper in this issue of the]ournal. '2 not allowed from 5:00 p.m. to 9:00 a.m. To some extent the correct direction can be Bidirectional Te xt assigned automatically. Hebrew and Arabic charac­ Hebrew and Arabic user interfaces have an addi­ ters have an implicit direction of right to left, and tional level of difficulty As discussed earlier, the Latintext has an impI icit left-to-right direction. Thus text is bidirectional; the primary writing direction an output method can Jay out simple combinations

PA RKING NO

1700 _ 0900 0900 _1700

(THE ) BETWEEN

Figure 11 "No Pa rking" Signs in Te l-Aviu

14 Vo l. 5 No. 3 St.II/111/Cr 1993 Digital Te cht1ical ]o urllal InternationalCultur al Diff erences in Software

of bidirectional text correctly. Beyond these char­ acters, direction can be ambiguous. Punctuation marks are common to both Hebrew and Latin text. HEISEl 2 YEAR 10 MONTH 9 DAY Thus a period or or has no implicit direction; the software must wait for the next char­ acter to determine the direction of the segment. In Fig ure 12 japanese Date Fo rmat more complex cases such as the nested directions shown in Figure 8, direction attributes must be parsing program should be able to process both explicitly assigned to the segments. 13 As discussed formats. in the paper on Unicode in this issue, the Unicode and ISO 10646 characters sets do include a rich set Time-ofday Fo rmats of directional markers. 14 Similarly, time-of-day formats vary according to per­ Insertion of text should be performed in the way sonal and, to some extent, national preference. the user finds most convenient, which is not neces­ Possible time formats include sarily in accordance with the "correct" directional 9.15am 09:15 0915 09:15:36 09 15 09h15 order of a segment. lf entering a two-digit number in the "common" direction requires too many oper­ Time-zone abbreviations also change around the ations, or if the user was trained on a manual type­ world. Two or more different abbreviations may writer, most users would use the easiest typing indicate the same time zone. Eastern Standard Time order, i.e., entering the least -significant digit first (EST) is a U.S.-specific time-zone indicator. This and the most-significant digit second. "Smart" soft­ zone is called HNE (Heure No rmale !'Est) in ware, which puts the digits in the supposedly cor­ French-speaking Canada. Central European Time is rect order, is not doing this user a service. known as HEC by the French-speaking populations and as MEZ by German speakers. The same time­ Na tional Conventions zone abbreviation may stand for different time zones. AS T is used for both Alaska Standard Time Various entities such as date, time, and numeric val­ and Atlantic Standard Time, which are five hours ues can be presented differently. Such presentation apart. Time-zone abbreviations are not standardized differences develop both from national and from and may change. Time zones are not all at one- personal or company styles. These presentation intervals. Some countries have time zones at a differences are not only tied to different writing sys­ 30-minute difference from a neighboring zone. tems. For example, dates are presented differently Certain towns in Islamic countries use solar time in the United States and in England. and thus can have time differences of several min­ utes between towns within one time zone. Date Fo rmats The ninth day of October 1990 is written 9/10/90 in Number Fo rmats Europe but 10/9/90 in the United States. The order The separators used with numerals to express of the day and month numerals is well defined for a quantities vary as part of national and personal particular culture, but there are no overall formats preferences. In the United Kingdom and the United for the separator used, or indeed for the general States, the comma is a thousands separator, and the style. The separator may be a slash, , colon, period is a decimal separator. In continental space, or another symbol, according to policy or Europe, the opposite is true. Separators include personal preference. The style may be numeric 1,234.56 1.234,56 1 234.56 1234,56 date as shown or the name of the month may be 1 '234.56 1 ,234·56 spelled out, and the year may be two or four digits. Tn Japan, dates are based on the reign of the Numbers written in Japanese or Chinese using emperor. As shown in Figure 12, 1990 was the sec­ Chinese ideograms sometimes include the unit indi­ ond year of Heisei, the reign of the current cator, as in the number 28 =+ )\ ("two","ten", emperor. (The first and last years of two eras may "eight") and sometimes omit it = J\ . coincide. Showa, the previous era, ended January 7, Positive and negative indicators differ. The plus 1989, and Heisei started on January 8, 1989.) This and minus signs may be used before or after the date fo rmat is routinely used in business in Japan. number. In accounting, negative numbers are usu­ The Western date fo rmats are also used, so a date ally enclosed in parentheses.

Digital Te chnical jo urnal Mil. 5 . 3 Summer 1993 15 Product Internationalization

Currency Fo rmats page to page. The analogy is from writing on a long In currency fo rmats, the currency symbol may be scroll of paper, which is cut into pages. For a one or several characters and may be placed before Japanese word processor, which enables the user to or at the end of the number, or used instead of the type in the traditional top-to-bottom orientation, decimal point. Some examples are: 6s 2,50 does the bottom scroll bar control page advance by (Austria); 2,50 $ (French-speaking Canada); 2$50 sliding the selection to the left' There is no one cor­ (Portugal); and $2.50 (United States). rect answer. A designer can keep consistent with the traditional horizontal scroll or with word pro­ Us er Interfa ce cessors fo r Latin-based writing systems. As the point of contact between the user and the machine, the user interface is an obvious area fo r Images potential clashes of culture between the designer Some designers may consider that using images and the recipient. The interface must be localized to instead of text creates an international, culturally fit the cultural expectations of the end user. 15 The neutral product that requires no localization. This interface designer must be aware of issues of geome­ is only the case if the image is entirely abstract and try management, images, symbols, color, and sound. chosen to be equally foreign to all cultures. This may meet the requ irements of internationalization, Geometry Ma nagement but at the expense of good user interface design. Graphical interfaces in English use menu bars Most images are chosen to provide a cultural aligned at the left, with cascading mem1s fa lling mnemonic to the action. This link may have little from left to right. Menus in Hebrew and Arabic cas­ meaning in another cu lture. The rural mailbox cade from right to left. Figure 13 shows a menu image '" chosen for certain electronic mail sys­ from the Hebrew version of DECwindows !. tems is a good example. This image is unknown out­ Although Japanese and Chinese are traditionally side the United States, and some American city read from top to bottom with columns advancing dwellers are unfamiliar with it as well. The conven­ from right to left, most technical material is pre­ tion of raising the flag on the mailbox to indicate sented with the same flow as English has. Conse­ that new mail has arrived is not common through­ quently, user interfaces have the same left-to-right out American rural communities. It can instead flow as English. This may be considered an aspect indicate the presence of outgoing mail. of new technology setting new cultural norms. In addition, a graphic may be a play on words that Japanese and Chinese do present some geometry will not translate. One personal computer product management challenges. A word processor fo r uses a musical note to indicate that a written note is English uses the right scroll bar to advance from associated with an item in its database.

Ir-- -. I '\a .. __ ... tl � ilU'7 J.il �j i1U'7J.il-'1Ul A�

Figure 13 He brew DECwindows XU/

16 14>1. 5 No. 3 Summ�r 19.93 Digital Te ch11ical]ourtlal International Cultural Differences in Software

Sy mbols ing in open offices in all cultures, well-designed Symbols commonly used in one culture may be mis­ systems allow users to eliminate them or modify interpreted by someone from another culture. For the volume. example, the cross [g] is often used to indicate prohibition. However, in Egypt it does not have Functional Differences in Software this connotation. 16 Designers should allow for Much of this paper has covered areas where the the replacement of selection symbols such as form of the information must change for different ticks (checkmarks) and crosses found in many user cultures. The software may also require fu nctional interfaces. changes for different cultures. Applications that [g]It alic [2] Bold manipulate text provide a set of operations linked to the nature of both the writing system and the code set. We have seen that typing Japanese and Co lor Chinese requires an indirect input method. The significance of color varies greatly across Applications using the Latin script provide a user cultures. Table 1, taken from Russo and Boor, gives interface to an operation to change the case of the ideas associated with colors in six cultures.l7 a character. This operation is not applicable to For example, means danger in the United States, Japanese, but a Japanese word processor has an but it has the connotation of life and creativity in operation to convert from katakana to biragana. India. Garland found that using a red "X" as a pro­ A delete operation on a Latin Jetter deletes both hibitive symbol in Egyptian pictures was not effec­ the letter and the rectangular cell, a piece of the tive because the color red is not associated with screen real estate, causing the adjacent text to close forbiddance, and the "X" is not understood as up. With the cursor to the right of a Korean syllable prohibitive. 16 cluster or Thai consonant/vowel/tone combina­ tion, the user presses the delete key. What should Sound be deleted' Thai and Korean do not have the union In the book Global Software, Dave Taylor relates between a letter and its linear space that the Latin that when Lotus localized its 1-2-3 spreadsheet alphabet has. Two separate operations with differ­ fo r use in Japan, the developers had to remove ent user interfaces may be required, whereas one all beeps from the program.111Japanese users, typi­ suffices in English. The code set used also plays a cally seated much closer together than their part in determining the nature of the operation. Western counterparts, did not appreciate the The Thai code set independently codes every letter computer broadcasting to their col leagues every and tone, so deleting a single letter or tone is practi­ time they made an error. Since beeps can be irritat- cal. The national Korean code set codes syllable

Ta ble 1 Significance of Color across Cultures

Red Blue Green Yellow White u.s. Danger Masculinity Safety Cowardice Purity France Aristocracy Freedom Criminality Te mporary Neutrality Peace Egypt Virtue Ferti lity Happiness Joy Faith Strength Prosperity Truth India Life Prosperity Success Death Creativity Fertility Purity Japan Anger Vi llainy Future Grace Death Danger Yo uth Nobility Energy China Happiness Heavens Birth Death Clouds Heavens Wealth Purity Clouds Power

Digita/ 1ecbtlical]ournal Vol. 5 No. 3 Summer 1993 17 Product Internationalization

clusters. 19 Deleting one letter from a cluster may segments to be rearranged under certain circum­ produce a combination with no code. In Digital's stances. This follows from a logical analysis of Thai and Korean products, the action of the delete ordering of the segments and was implemented in operation is as suggested by the code set. Thai an early version of Hebrew DECwrite. Studies with deletes one letter or tone mark; Korean deletes the users revealed that they found this rearrangement syllable cluster. of text disconcerting and preferred to manually In unidirectional writing systems, the right arrow rearrange segments. The program was changed in a key navigates the cursor over the logical subsequent version. Note that this resolution is order of the text as it moves smoothly over the dependent on the specific product. One should not screen. The operation of logical movement ancl geo­ conclude that automatic reordering of text is metrical movement across the screen is identical always incorrect. Other bidirectional text systems within one line. This is not the case with bidirec­ perform this .reordering. tional text. The following fragment is from a Hebrew application one twoJ�t'7�O"n� lnl'( . Pressing the Responding to andSet ting Culture left arrow key moves the cursor to the left of the New technology in computer applications must word "one" if the action means to follow reading reflect the prevalent existing culture, but it also order, or to the left of the "o" in "two" if the action pl ays a part in creating new cultural norms. An ear­ is one of navigating screen real estate. lier section described how users of a Hebrew word Functional differences may come from regula­ processor might enter digits into a stream of tory constraints. The United States has export pro­ Hebrew by reversing the order of the digits. This hibitions on certain encryption techniques. cultural behavior was introduced during the days of Non-U.S. versions of products may need to remove manual typewriters or older computer systems, them or use different techniques. Standards and which required additional keystrokes to change regulations for connection to external devices such writing direction. An older technology introduced as modems vary around tbe world. a cultural expectation. As users in Israel grow more Product features may also need to vary based on accustomed to word processors that enter the cor­ less tangible aspects of a culture. LYREis a hypertext rect order automatically, and as the base of users product developed in France. The product allows exposed to older technology shrinks, we can antici­ students to analyze a poem from various viewpoints pate that the standard expectation of the order in selected by the teacher. Students are not allowed to which to enter digits will change. add their own viewpoints. This is acceptable in The Arabic and Khmer writing systems modify France bllt not in Scandinavian countries, where the shape of the written glyph based on surround­ independent discovery is highly valued.1' ing characters. The Khmer keyboard (Figure 10) shows separate glyphs for each variation (implying Co rrect ana Incorrect Actions separate codes). This design follows the lead of ear­ Learning the rules concerning cultural sensitivity lier typewriters and is familiar to users trained on does not guarantee that a software developer from such typewriters. It adds complexity to the key­ outside, or even inside, that culture will not make board and requires the user to manually enter the errors. Two examples illustrate this. correct glyph. The Arabic keyboard is from a sys­ When Lotus localized its 1-2-3 product into tem that codes each character independently of Japanese, the developers were aware that the glyph; the renderer selects the correct glyph to

18 Vo l. 5 No. 3 Summer 1993 Digital Te chnical journal International Cultural Differences in Software

is correct; but as we have seen, formal analysis will vary not only across national boundaries, but need not yield an appropriate answer. Ultimately across companies. As the set of cultural differences the correct answer is a delicate balance between to be addressed goes deeper, the circles of people users' expectations based on the past and the sharing those cultures will shrink. quirements of innovation. The users' expecta­ Techniques exist to build products with a high tions are set by previous implementations, which level of internationalization. These are described in were derived from limitations in the technology of other papers in this issue. These techniques will the time. We have a cycle of computers adapting to continue to develop and improve, but internation­ people adapting to computers. alization will never be a fully resolved considera­ tion. The term may fa ll from use as the cultural diffe rences being addressed have a decreasing Deeper Cultural Differences relationship to national boundaries. International­ Some of the cultural diffe rences discussed in this ization is simply making software easy to localize, paper such as the presentation of dates and cur­ and the essence of localization is meeting the indi­ rency are obvious even on a superficial examina­ vidual needs of the customers. As computer sys­ tion of the culture. Others such as the cultural tems become more powerful and software more reaction to color are learned from deeper study. sophisticated, adaptation to the individual will con­ We can expect the future development of soft­ tinue. Techniques to adjust software to fit personal ware to consider as yet unexplored cultural diffe r­ preferences will continue to develop. ences. New features in user interfaces, the use of sound, , pen-based computers, and anima­ Acknowledgments tion, will tie into aspects of cultural behavior that are currently little explored by researchers. Higher­ This paper is adapted from an earlier unpubl ished resolution screens and the prevalence of color work, circulated within Digital Equipment Corpo­ bring the ability to design applications that relate ration. The author would like to thank Gayn more directly to the user's sense of beauty. Winters for initiating and driving the paper and The personal computer revolutionized personal for many valuable comments on early drafts. This productivity. Applications such as spreadsheets paper took shape as reviews from colleagues succeeded because they modeled individual user's around the world corrected errors and contributed existing work practices and extended their capabil­ many examples, some of which are seen in this ity. A current trend is toward applications for paper. The author would like to acknowledge the the work group or collaborative computing. This input from John McConnell and Michael Yau in style attempts to revolutionize the way groups the United States; Ji.irgen Bettels in Switzerland; work. Jeffrey Hsu reports that "Collaborative sys­ K. H. and Fred in ; Trin tems can meet stubborn resistance when they are Tantsetthi in Thailand; Mike Feldman, Moti introduced in a company, because they challenge Huberman, and Moshe Loterman in Israel; Hirotaka the organizational culture with a new means of Yoshioka in Japan; and Nai- in . communication."21 The difference in the business decision-making process between Japan and the References and No tes United States is well documented, with Japanese 1. South Korean writing uses two scripts. groups valuing group decision and harmony or Hangul is an alphabetic system. Hanja is the highly. We can expect the emerging "groupware" set of ideograms imported from China and applications both to model existing styles of group used as the sole script until the invention of work and to change those styles. Hangul. There exists a widespread miscon­ The fu ture will also bring software agents 22 This ception that Hangul is ideographic. The software will act as a collaborator with the user to author wishes to stress that only the Hanja process information in much the same way as a script uses ideograms. human personal assistant. As with a human assis­ tant, we can anticipate that software agents will 2. The example shown contains no phonetic ele­ adapt to the specific requirements and habits of the ment. Many, more complex characters do user, a culture of one. We can imagine an agent rec­ have phonetic components. Some scholars ommending circulation lists for memos and aiding disparage the use of the term ideograph to in correctly phrasing the mail. The fo rms of address describe Japanese and Chinese writing,

Digital Te chuical jo unwl /. 5 No. 3 Summer 1993 19 Product Internationalization

asserting that the phonetic element is pri­ of the text in figure 8 correctly without direc­ mary. (St:c.: rc.:ferences 3 and 5.) This paper tional attributes. The Unicode ami ISO 10646 uses the term ideograph since it is in common character sets do inclmk a rich set of direc­ use. tional markers.

3. ). DeFrancis, The inese Languaf!.e Fa ct and 14. J Bettels and F. Bishop, "Unicode: A Universal FcmtetS)I, Second Paperback Edition (Hon­ Character Code," Digital Te clmical jo urnal, olulu: University of Hawaii Press, 1989): 91. vol . 5, no. 3 (Summer 199:3, this issue): 21-31.

4. M. Stubbs, Language and l.iterac.;y TheSoci­ 15. J. Nielsen, 'Tsability Te sting of International olinguistics of Readi11g (London, Boston, and Interfaces," in Designing Us er InteJfaces fo r Henley: Routledge and Kegan Paul, 1980): 48. International Us e, edited by J Nielsen (New Yo rk: Elsevier, 1990). 5. J DeFrancis, Visible Sf>eech: Th e LJiuerse One­ ness of Wr iting 5)! stems (Honolulu: Univer­ 16. K. Garland, "The Use of Short Te rm Feedback sity of Hawaii Press, 1989). in the Preparation of Technical and Instruc­ tional Illustration," in Research in 1 llustra­ 6. & is an interesting character. It was originally tion: Conference Proceedings Pa rt II (1982). to rmed as a of e and t and is now used as an ideogram in many European written 17 P. Russo and S. Boor, "How Fluent Is Yo ur languages. Interface? Designing fo r International Users," paper presented at INTERCHI, Amsterdam, 7 F Coulmas, The Writing -�vstems of the rld April 1993. (Oxford: Basil Blackwell, 1989): 118. 18. D. Taylor, Global Software: Developing Appli­ 8. A. Nakanishi, Wr iting .�)'Stems of the World, cations fo r the International Market (New third printing (Rutland. VT, and To kyo : Charles Yo rk, Berlin, Heidelberg, London, Paris, E. Tu ttle Company, 1988): 112. To kyo, Hong Kong, Barcelona, Budapest: 9. The numerals I, 2, 3, etc .. are generaJJy known Springer-Verlag, 1992): 54. as, and referred to, as Arabic numerals; how­ 19. The design of the Korean code set reflects ever, by one of those quirks of language, the comprom ises made among cultural, eco­ Arabic script uses a diffe rent set of symbols fo r nomic, and technological requirements. The nu merals, sometimes called lndic numerals. structure of the writing system leads to inde­ 10. Note that the Khm�: r keyboard has fo ur regis­ pendent coding of each jm no, with the dis­ ters because it is based on a glyph encoding of play device rendering them into syllable Khmer rather than a character encoding, clusters. Coding as syllable cl usters greatly which wou ld use two registers at most. Also, simpl ified the prevalent technology of the the subscript Khmer glyphs on the kevcaps, time and reduced the cost of the display which are used in conjuncts, are not neces­ device. sary if more sophisticated display software is 20. F. Hapgood, "A Journey East-The Making of used. 1-2-3 Release 2]. " Lotus: Computing fo r Ma n­ 11. DIN 2137, German keyboard fo r typewriters, agers and Professionals (Cambridge, MA: Allocation of Characters £0 Keys, Parts 1, 2, 6, Lotus Development Corporation, 1987). and 11 (Deutsch Jnstitut fiirNormun g, 1988). 21. .J. Hsu and T. Lockwood, "Collaborative Com­ 12. T. Honma, H. Baba, and K. Ta kizawa, puting," BYTE:: Magazine (March 1993): 120.

"Japanese Input Method Independent of 22. L Te sler, "Networked Computing in the Applications," Digital Te chnical jo urnal, vol. 1990s," Scientifi c American (September 5, no. 3 (Su mmer 1993, this issue): 97-107. 1991 ) 13. There is some dispute in the industry on the need for explicit direction markers. Under certain circu mstances, corn�ct rendition of nested direction text can be computed. For example, a renderer cou ld show the structure

20 Vo l. 5 No. j Still/Iller 19'J.) Digital Te cbuical jo urnal ]ii rgen Bettels E Ave1y Bishop

Un icode: A Universal Character Code

A universal character encoding is required to produce software that can be local­ ized fo r any language or that can process and communicate data in any language. Th e Un icode standard is the product of a joint effo rt of information technology companies and individual experts; its encoding has been accepted by ISO as the internationalstandard ISO/IEC 10646 Unicode defines 16-bit codes fo r the char­ acters of most scripts used in the world's languages. Encoding fo r some missing scripts will be added over time. The Unicode standard defi nes a set of rules that help implementors build text-processing and rendering engines. Fo r Digital, Unicode represents a strategic direction in internationalization technology Many software­ producing companies have also announcedfu ture supportfo r Unicode.

A universal character encoding-the Unicode stan­ 2. Va riable-length encoding. The character width dard- has been developed to produce interna­ varied from one to fo ur bytes, making it impossi­ tional software and to process and render data in ble to know how many characters were in a most of the world's languages. In this paper, we pre­ string of a known number of bytes, without first sent the background of the development of this parsing the string. standard among vendors and by the International 3. Overloaded character codes and fo nt systems. Organization fo r Standardization (ISO). We describe Character codes tended to encode glyph variants the character encoding's design goals and princi­ such as I igatures; fo nt architectures often ples. We also discuss the issues an application han­ included characters to enable display of charac­ dles when processing Unicode text. We conclude ters from various languages simply by varying with a description of some approaches that can be the fo nt. taken to support Unicode ami a discussion of Microsoft's implementation. Microsoft's decision In the 1980s, character code experts from around to use Unicode as the native text encoding in its the wo rld began work on two initially parallel proj­ Windows NT (New Technology) operating system ects to eliminate these obstacles. In 1984, the ISO is of particular significance fo r the success of started work on a universal character encoding. Unicode. This effort placed heavy emphasis on compatibility with existing standards. The TSO/IEC committee Background published a Draft International Standard (DIS) in spring 1991.2 By that time, the work on Unicode In the 1980s, software markets grew throughout (described in the next section) was also nearing the world, and the need for a means to represent completion, and many experts were alarmed by the text in many languages became apparent. The com­ potential for confusion from two competing stan­ plexity of writing software to represent text hin­ dards. Several of the ISO national bodies therefore dered the development of global software. opposed adoption of the DIS and asked that ISO and The obstacles to writing international software Unicode work together to design a universal char­ were the fo llowing. acter code standard. 1. Stateful encoding. The character represented by a particular value in a text stream depended on The Origins of Un icode values earlier in the stream, fo r example, the In some sense Unicode is an offshoot of the ISO/IEC escape sequences of the ISO/IEC 2022 standard 1 10646 work. Peter Fenwick, one of the early

Digital Te cfJnicafjou rnal Vo l. 5 No. 3 Summer 1993 21 Product Internationalization

conveners of the ISO working group responsible fo r As a result of the merger with ISO 10646, the 10646, developed a proposal called "Alternative B," Unicode standard now includes an errata insert based on a 16-bit code with no restriction on the called Unicode 1.0.1 in both volumes of version 1.0 use of control octets. He presented his ideas to to reflect the changes to character codes in Becker of Xerox, who had also been work­ Unicode 1.0 7 The Unicode Consortium has also ing in this area.' committed to publish a technical report called In early 1988, Becker met with other experts in Unicode 1.1 that will align the Unicode standard linguistics and international software design from completely with the ISO/IEC 10646 two-octet com­ Apple Computer (notably Lee Collins and Mark paction form (the 16-bit form) also called UCS-2. Davis) to design a new character encoding. As one of the original designers, Becker gave this code the Relationship between Un icode and name Unicode, to signify the three important ele­ 150/IEC 10646 ments of its design philosophy: Unicode is a 16-bit code, and ISO/TEC 10646 defines a two-octet (UCS-2) and a four-octet (UCS-4) encod­ 1. Universal. The code was to cover all major mod­ ing fo rm. The repertoire and code values of UCS-2, ern writtenlanguages. also called the base multilingual plane (BMP), are 2. Unique. Each character was to have exactly one identical to Unicode 1.1. No characters are cur­ encoding. rently encoded beyond the BMP; the UCS-4 codes defined are the two UCS-2 octets padded with two 3. Uniform. Each character was to be represented zero octets. Although ucs-2 and Unicode are very by a fixed width in bits. close in definition, certain differences remain. The Unicode design effort was eventually joined By its scope, ISO/IEC 10646 is limited to the by other vendors, and in 1991 it was incorporated as coding aspects of the standards. Unicode includes a nonprofit consortium to design, promote, and additional specifications that help aspects of maintain the Unicode standard. To day member implementation. Unicode defines the semantics companies include Aldus, Apple Computer, of characters more explicitly than 10646 does. Borland, Digital, Hewlett-Packard, International For example, it defines the default display order Business Machines, Lotus, Microsoft, NeXT, Novell, of a stream of bidirectional text. (Hebrew text The Research Libraries Group, Microsystems, with numbers or embedded text in Latin script Symantec, Taligent, Unisys, Wo rdPerfect, and is described in the section Display of Bidirectional Xerox. Ve rsion 1.0, volume 1 of the 16 -bit Unicode Strings.) Unicode also provides tables of character standard was published in October 1991, fo llowed attributes and conversion to other character by volume 2 in june 1992 4·5 sets. It was sometimes necessary to sacrifice the three In contrast with the Unicode standard, ISO 10646 design principles outlined above to meet conflict­ defines the fo llowing three compliance levels of ing needs, such as compatibility with existing char­ support of combining characters: acter code standards. Nevertheless, the Unicode • Level 1. Combining characters are not allowed designers have made much progress toward solving (recognized) by the software. the problems faced in the past decade by designers of international software. • Level 2. This level is intended to avoid duplicate coded representations of text fo r some scripts, Th e Merger of 10646 and Un icode e.g., Latin, Greek, and Hiragana. Urged by publ ic pressure from user groups such as • Level 3. All combining characters are allowed. IBM's SHARE, as well as by industry representatives from Digital, Hewlett-Packard, IBM, and Xerox, Therefore, Unicode 1.1 can be considered a the ISO 10646 and Unicode design groups met in superset of UCS-2, level 3. August 1991 ; together they began to create a single Throughout the remainder of this paper, we refer universal character encoding. Both groups compro­ to this jointly developed standard as Unicode. mised to create a draft standard that is often Where differences exist between ISO 10646 and referred to as Unicode/10646. This draft standard Unicode standards, we describe the Unicode func­ was accepted as an international character code tionality. We also point out the fact that Unicode standard by the votes of the ISO/IEC national bodies and ISO sometimes use different terms to denote in the spring of 1992 6 the same concept. When identifying characters, we

22 Vo l. 5 No. 3 Summer 1993 Digital Teclmicaljournal Un icode: A Un iversal Character Code

use the code identification and the ISO equivalent way in these languages is assigned a character names. single code position. This principk led to the uni­ fication of the ideographs used in the Chinese, GeneralDesign of Unicode Japanese, and Korean written languages. This so-called CJ K unification was achieved with the This section discusses the design goals of Unicode cooperation of official representatives from the and its adherence to or variance from the principles countries involved. of universality, uniqueness, and unifo rmity. The principle of uniqueness was also applied to decide that certain characters shou ld not be Design Goals and Principles encoded separately. In general, the principle states The fundamental design goal of Unicode is to create that Unicode encodes characters and not glyphs or a unique encoding fo r the characters of all scripts glyph variations. A character in Unicode represents used by living languages. In addition, the intention an abstract concept rather than the manife station is to encode scripts of historic languages and as a particular fo rm or glyph. As shown in Figure 1, symbols or other characters whose use justifies the glyphs of many fo nts that render the Latin encoding. character A all correspond to the same abstract An important design principle is to encode each character "a." character with equal width, i.e., with the same number of bits. The Unicode designers deliberately resisted any calls fo r variable-length or stateful Abstract encodings. Preserving the simplicity and unifor­ Letter Glyph Style mity of the encoding was considered more impor­ a Century Schoolbook tant than considerations of optimization fo r storage �a Helvetica requirements. a a Century Gothic A Unicode character is therefore a 16-bit entity, Script and the complete code space of over 65,000 code Book Antiqua positions is available to encode characters. A text �: encoded in Unicode consists of a stream of 16-bit Unicode characters without any other embedded Figure 1 Abstract Latin Letter "a" and controls. Such a text is sometimes referred to as Style Va riants Unicode plain text. The section Processing Unicode Text discusses these concepts in more detail. Another example is the Arabic presentation Another departure from the traditional design of fo rm. An Arabic character may be written in up to code sets is Unicode's inclusion of combining char­ fo ur different shapes. Figure 2 shows an Arabic acters, i.e., characters that are rendered above, character written in its isolated fo rm, and at the below, or ot herwise in close association with the beginning, in the midd le, and at the end of a word. preceding character in the text stream. Examples According to the design principle of encoding are the accents used in the Latin scripts, as well as abstract characters, these presentation variants are the vowel marks of the Arabic script. Combining all represented by one Unicode characterY characters are allowed to combine with any other Since much existing text data is encoded using character, so it is possible to create new text ele­ historic character set standards, a means was pro­ ments out of such combinations.H This technique vided to ensure the integrity of characters upon can be used in bibliographic applications, or by ­ conversion to Unicode. Great care was taken to guists to create a script fo r a language that does not create a l nicocle character corresponding to each yet have a written representation, or to transliter­ ate one language using the script of another. An example in recent times is the conversion of some • . • • . . • • • • Central Asian writing systems from the Arabic to •• --"'41 the Latin script, fo llowing Turkey's example in the _t,...... 1920s (Kazakhstan). An additional design principle is to avoid duplica­ tion of characters. Any character that is nearly iden­ Figure 2 Isolated, Final, Initial, and Middle Fo rms tical in shape across languages and is used in an r?f'the Arabic Chamcter Sheen

Digita/ 1ecbuica l journal Vu l. 5 No. 3 Summer 1993 23 Product Internationalization

character in existing standards. Characters identical before, above, below, or after a consonant, thus in shape appearing in different standards arc identi­ fo rming one display unit. This unit becomes the text fied and mapped to a single Unicode character. For element fo r purposes of rendering. For a process characters appearing twice in the same standard, a such as advance to next character, however, the indi­ compatibility zone was created. These characters vidual vowels and consonants are the natural units are encoded as required to make round-trip conver­ of operation and are therefore the text elements. sion possible between other standards and There is no simple relationship between text ele­ Unicode. The Unicode Consortium has agreed to ments and code elements. As we have shown, this create mapping tables for this purpose. relationship varies both with the language of the text and with the operation to be performed by the application. In earlier encoding systems such as Text Elements and Combining Characters ASCII or others with a strong relationship to a lan­ When a computer application processes a text doc­ guage, this problem was not apparent. When ument, it typically breaks down text into smaller designing a universal character code, the Unicode elements that corre�pond to the smallest unit of designers acknowledged the issue and analyzed data fo r that process. These units are called text ele­ which character elements have to be encoded as ments. The composition of a text element is depen­ code elements to represent the scripts of Unicode dent on the particular process it undergoes. The across multiple languages. Rather than burden the Arabic ligature lam-alef is a text element fo r the character code with the complexity of encoding rendering process but not fo r other character oper­ a rich set of text elements, the Unicode Te chnical ations, such as sorting. Committee decided that the mapping of code ele­ In addition, the same process applied to the same ments to more complex text elements shou ld be string of text requires different text clements depend­ performed at the application level. ing on the language associated with the string. Figure 3 shows sorting applied to the string "ch." If this string is part of English text, the text elements Code Sp ace Structure fo r the process of sorting are "c" and "h." In Spanish The Unicode code space is the fu ll 16-bit space, text, however, the text element for sorting is "ch" allowing fo r 65,536 different character codes. As because it is sorted as if it w<..:rc a single character. shown in Figure 4, approximately 50 percent of this For other text-processing operations, text ele­ space is al located . This code space is logically ments might constitu te units smaller than those divided into fo ur diffe rent regions or zones. traditionally called characters. Examples are the The A-zone, or alphabetic zone, contains the accents and diacritical marks of the Latin script. alphabetic scripts. The first 256 positions in the These small text elements interact graphically with A-zone are occupied by the ISO 8859-1, or 8-bit ANSI a noncombining character called a base character. codes, in such a way that an 8-bit ASCII code maps The interacts with the base character to the corresponding 16 -bit Unicode character A to fo rm the character A acute. If a given fo nt does through padding it with one byte. The posi­ not have the character A acute, but it does have A tions corresponding to the 32 ASCII control codes and acute accent as separate glyphs, the character 0 to 31 are empty, as well as the positions Ox0080 A acute has to be divided into smaller units fo r the to Ox009F rendering process. The characters of other alphabetic scripts In Thai script, vowels and consonants combine occupy code space in the range from OxOOOO to graphically so that the vowel mark can be either Ox2000. Not all of the space is currently occupied, leaving room to encode more alphabetic scripts. The remainder of the A-zone up to Ox4000 is allo­

Spanish English cated fo r general symbols and the phonetic (i.e., nonideographic) characters in use in the Chinese, �urra �harm Japanese, and Korean languages. chasquido �urrent The second zone up to OxAOOO is the ideograph, gano gig it or 1-zone, which contains the unified Han charac­ ters. Currently about 21,000 positions have been filled, leaving virtually no room fo r expansion in Figure3 Te xt Elements and Collation the !-zone.

24 Vo l. 5 No . .) Summer 1993 Digital Te cht�ical jourt�al Un icode: A Un iversal Character Code

1- A-ZONE -----...! I-ZONE - O-ZONE --1A-ZONE I-

llJ. II I I 11 11 I I I II PRIVATE USE illJJ.lCOMPATIBILITY ZONE tj UNIFIED CHINESE, JAPANESE, AND KOREAN CHINESE, JAPANESE, AND KOREAN NONIDEOGRAPHIC SYMBOLS EXTENDED LATIN AND GREEK '------INDIC SCRIPTS '------HEBREW AND ARABIC '------LATIN, GREEK, CYRILLIC, AND ARMENIAN '------IS0-646 INTERNATIONAL REFERENCE VERSION

Figure 4 Code Sp ace Allocation fo r Scripts

The third zone, or O-zone, is a currently unallo­ data, which does not vary across the different writ­ cated space of 16K. Although several uses fo r this ing systems. space have been proposed, its most natural use To day, and fo r some time to come, however, the seems to be fo r more ideographic characters. data that the appJ ication has to process is typically However, even 16K can hold only a subset of the encoded in some code other than Unicode. A fre­ ideographic characters. quent operation to be performed is therefore the The fo urth zone, the restricted or R-zone, has conversion from the code (file code) in which data some space reserved fo r user-defined characters. It is presented to Unicode and back. also contains the area of codes that are defined fo r One of the design goals of Unicode was to allow compatibility with other standards and are not allo­ compatibility with existing data through round-trip cated elsewhere. conversion without loss of information. It was not a goal to be able to convert the codes of other char­ Processing Un icode Te xt acter sets to Unicode by simply adding an offset. This would violate the principle of uniqueness, The simplest form of Unicode text is often called since many characters are duplicated in the various plain Unicode. It is a text stream of pure Unicode character sets. Most existing character sets there­ characters without additional fo rmatting or fo re have to be mapped through a table lookup. attribute data embedded in the text stream. In this These mapping tables are currently being collected section, we discuss the issues any appl ication faces by the Unicode Consortium and will be made avail­ when processing such text. Processing in this con­ able to the public. text applies to the steps such as parsing, analyzing, It was, however, decided that the 8-bit ASCII, or and transforming that an applic;:tion perfo rms to ISO 8859-1 character set, was to be mapped into the execute its required task. In most cases, the text first 256 positions of Unicode. Other character sets processing can be divided into a number of primi­ (or subsets), such as the Thai standard TIS 620-2529, tive processing operations that are typical ly offered could also be mapped directly, since character as a toolkit service on a system. In describing uniqueness was preserved. Also, one of the blocks Unicode text processing, we discuss some of these of Korean sy llables is a direct mapping from the primitives. Korean standard KSC 5601 . Some character sets contain characters that can­ Code Co nversion not be assigned code values under the Unicode One of the goals of Unicode is to make it possible to design rules. Often these characters are different write appl ications that are capable of hand ling the shapes of encoded characters, and encoding them text of many writing systems. Such an appi ication wo uld violate the principle of uniqueness. To would typically apply a model that uses Unicode as allow round-trip conversion fo r thcsl' characters, its native process code. The application could then a special code area, the compatibility zone, was set be written in terms of text operations on Unicode aside in the R-zone to encode them and to allow

Digital Te cb11icaljon r11af 11>1. 'i No. .> Summer I')'J.) 25 Product Internationalization

interoperation with Unicode. For example, the u + ',� + ', � wide fo rms of the Latin letters in the Japanese JIS Latin capita/ letter U 208 standard were invented to simplify rendering + combining on monospacing terminals and printers. + combining

Ch aracter Tra nsformations The most uncomposed and the most precomposed A frequently used operation in text processing is fo rms of these spellings can be considered normal­ the transformation of one character into another ized fo rms. \Vhen processing Unicode text, an character. For example, Latin lowercase characters application would typically transform the charac­ are often transformed into uppercase characters to ter strings into either of these two fo rms fo r further execute a case-insensitive search. In most tradi­ processing. tional character sets, this operation would translate Note that the spellings: one code value to another. Thus, the output string of the operation would have the same number of code values as the input string, and both strings Latin capital letter U would have the same length. This assu mption is no with grave accent + combining diaeresis longer true in the case of Unicode strings. Consider the Unicode characters, Latin small and letter a + combining grave accent, i.e., a string of two Unicode characters. If this string were part of a French text (in France), transtorming a to A would Latin capita/ letter U result in one Unicode character, Latin capital letter + combining grave accent A. If the same string were part of a French Canadian + combining diaeresis text, the accent would be retained on the upper­ case character. We can therefore make two observa­ would result in a diffe rent character: tions: (1) The string resulting from a character transformation may contain a diffe rent number of u characters than the original string and (2) The result depends on other attributes of the string, in This result is due to the rule that diacritical marks, this case the language/region attribute. which stack, must be ordered from the base charac­ Another important character transformation ter outwards. operation is a normalization transformation. This operation transforms a string into either the most By te Ordering uncomposed or the most precomposed fo rm of Traditional character set encodings, which are con­ Unicode characters. As an example, we consider fo rmant to ISO 2022 and the C language multibyte the different spel lings of the combination: model, consider characters to be a stream of bytes, including cases in which a character consists of more than one byte. Unicode characters are 16-bit entities; the standard does not make any explicit Latin capita/ letter U statement about the order in which the two bytes of with diaeresis and grave accent the 16-bit characters are transmitted when the data is serialized as a stream of bytes. This letter has been encoded in precomposed fo rm The orclering of bytes becomes an issue when in the Additional Extended Latin part of Unicode. machines with diffe rent internal byte-order archi· There are two additional spellings possible to tecrure communicate. The two possible byte encode the same character shape: orders are often called little endian and big endian. In a little-endian machine, a 16-bit word is addressed as two consecutive bytes, with the low­ Latin capita/ letter U with diaeresis order byte being the first byte ; in a big-endian + combining grave accent machine, the high-order byte is first. To day all com­ puters based on the Intel 80x86 chips, as well as and Digital's VA X and AJpha AXP systems, implement a

26 \kJ/. 5 No. J Summer 1993 Digital Te chnical jo urnal Un icode: A Universal Character Code

little-endian architecture, whereas machines built dard.-•5 (It is important to consult the second on Motorola's 680x.'<, as we ll as the reduced instruc­ vo lume because it contains corrections to the algo­ tion set computers (RJSC)of Sun, Hewlett-Packard, rithm given in the first volume.) and IBM, implement a big-endian architecture. In Al l printing characters are classified as strongly blind interchange between systems of possibly dif­ LR, weakly LR, strongly RL, weaklyRL, or neutral. In fe rent byte order, Unicode-encoded text may be addition, Unicode defines the concept of a global read incorrectly. To avoid such a situation, Unicode direction associated with a block of text. A block is has implemented a byte-orderma rk that behaves as approximately equivalent to a paragraph. The first a signature. As shown in Figure 5, the byte-order task of the rendering software is to determine mark has the code value OxFEFF. It is defined as a the global direction, which becomes the default. zerc.-width, no-break space character with no Embedded strings of characters from other scripts semantic meaning other than byte-order mark. are rendered accordingto their direction attribute. The code value corresponding to the byte­ Neutral characters take on the attribute of sur­ inverted fo rm of this character, namely OxFFFE, is an rounding characters and are rendered accordingly. illegal Unicode value. If the byte-order mark is inserted into a serialized data stream and is read by Directionality Control a machine with a diffe rent byte-order architecture, Although the default algorithm gives correct ren­ it appears as OxFFFE. This fact signals to the applica­ dering in most realistic cases, extra information tion that the bytes of the data stream have been read occasionally is needed to indicate the correct ren­ in reverse order from that in which they were dering order. Therefore, Cnicode includes a num­ written and should be inverted. Applications are ber of implicit and explicit fo rmatting codes to encouraged to use the byte-order mark as the first allow fo r the embedding of bidirectional text: character of any data written to a storage medium Left-to-right mark (LRM) or transmitted over a network. Right-to-left mark (RLM) Right-to-left embedding (RLE) Display of Bidirectional Strings Left-to-right embedding (LRE) To fa cil itate internal text processing, a Unicode­ Left-to-right override (LRO) compliant application always stores characters in Right-to-left override (RLO) logical order, that is, in the order a human being Pop directional fo rmatting (PDF) would type or write them. This causes complica­ It must be pointed out that the directional codes tions in rendering when text normally displayed are to be interpreted only in the case of horizontal right to left (RL) is mixed with text displayed left to text and ignored fo r any opt:ration other than bidi­ right (LR). Hebrew or Arabic is written right to left, rectional processing. In particular, they must not but may contain characters written left to right, if be included in compare string operations. either language is mixed with Latin characters. The LRM and RLM characters are nondisplayable Numerals or punctuation mixed with Hebrew or characters with strong directionality attributes. Arabic can be written in either order. Since characters with weak or neutral directionality Th e Default Bidirectional Algorithm take their rendition directionality from the sur­ rounding characters, LRM and RL\1 are used to influ­ Unicode defines a default algorithm fo r displaying ence the directionality of neighboring characters. such text based on the direction attributes of char­ The RLE and LRE embedding characters and the acters. We outline the algorithm in this paper; fo r LRO and RLO override characters introduce sub­ details, see both volumes of the Unicode stan- strings with respect to directionality. The override characters enforce a directionality and are used to LITTLE-EN DIAN BYTE-STREAM BIG-EN DIAN enforce rendering of, fo r instance, Latin letters or MACHINE TRANSFER MACHINE numbers from right to left. Substrings can be I OxFEFF I-I FE I OxFF I-I OxFFFE nested, and conforming applications must support I+ I 15 levels of nesting. Each RLE, LRE, LRO, or RLO char­ BYTE-ORDER FIRST SECOND ILLEGAL acter introduces a new sublevel, and the next fo l­ MARK BYTE BYTE CHARACTER lowing PDF character returns to the previous level. The directionality of the uppermost level is implicit Figure 5 By te -order Mark or determined by the application.

Digital Te cbnical jo tu-nal Vo l. 5 No. 3 St/11111/el' I'J')3 27 Pn.>duct Intcn1ationalization

Only correct resolution of directionality nesting represented by two bytes each, and three bytes gives the correct result. In neral it cannot be each are usecl fo r the remaining code values. assumed that a string of text that is inserted into Originally. Tf had been proposed fo r w;e in data other bidirectional text will have the correct cl irec­ transmission and to avoid the problem that embed­ tionality attributes without special processing. ded zero bytes represent for C language character This may result in the removal of directional codes strings in the char data type. Subsequently, it has in the text or in the add ition of further controls. As been proposed to use UTF in historical operating shown in Figure 6, particular care needs to be taken systems (e.g. , UNIX) to stOre Unicode-encoded sys­ h>r cut-and-paste operations of bidirectional text. tem resources such as fi le names. �<> Modifications of UTf have therefore been pro­ 11-mzsmissiun ouer 8-bit Cha nnels posed to address other special requirements such as preserYalion of the slash (!) character. 11 It Exi�ting communication systems often require that remains to be seen which of these various transfor­ data adheres to the rules of ISO/ IEC 2022, \-V hich mation methods will be widely adopted . resnve the t>-bit code values between OxOO and Ox II; (the CO space), bet ween OxHO and Ox9F (the C l space.:), and the code position DELETE. 1 Since Ha ndling of Combining Ch aracters Unicode uses these values to encode characters, In some of the operations discussed above, we have direct transmission of Unicode data over such trans­ indicated that the presence of combining charac­ �sion systems is not possible. ters requires processing Unicode text diffe rently The Unicode designers. in collaborat ion with from text encoded in a character set without com­ ISO, have thcrd(HT proposed an algorithm that bining characters. Normalization or transformation transforms Unicode characters so that the CO and of the characters into a normalized fo rm is usually C I characters and DELETE arc avoided. This algo­ a first helpful step for fu rther processing. For exam­ thm, the ucs transformation fo rmat (UTF), is part ple, to prepare a text fo r a comparison operation, of the ISO 1064(> standard as an informative annex. one may wish to decompose any precomposed It i� expected that it will he included in the revised characters. In this way, multiple-pass comparison 11icode standard. ami sorting algorithms, which typically pass The transfurmat ion algurithm has been con­ through a level that ignores diacritical marks, can ceived in such a way that the characters corre­ be applied almost unchanged.12 spunding to the 7- hit ASCII codes and the <:l codes For simple comparison operations, the applica­ are repre�cntccl by one byte (see Figure 4). Code tion must decide on a pol icy of what constitutes pusitions OxUOAO through Ox4015 (which include equality of two strings. If the string contains char­ the re mainder of the extended Latin alphabet) are acters with a single diacritical mark, it can choose

DESTI NATION TEXT IN LOGICAL ORDER: PLEASE SEND TO : DIRECTIONALITY NESTING: ( DES TINATION TEXT IN DISPLAY ORDER: PLEASE SEND TO : -A insertion point

TEXT TO BE PASTED IN LOGICAL ORDER mr . j. smith, 12 john doe str DIRECTIONALITY NESTING ( ) TEXT TO BE PASTED IN DISPLAY ORDER: rts eod nhoj 12 ,htims .j .rm

PASTED TEXT IN LOGICAL ORDER: PLEASE SEND TO:_ mr. j. smi th, 12 john doe str DIRECTIONALITY NESTING: ( ( ( ) )) PASTED TEXT IN DISPLAY ORDER WITHOUT NESTING: PLEASE SEND TO : htims .j .rm, 12 rts eod nhoj incorrect! PASTED TEXT IN DISPLAY ORDER WITH NESTING PLEASE SEND TO :_ rts eod nhoj 12 ,htims .j .rm

Note: Capital letters signify left-to-right \v rtling. Small letters stgnily right-to-left writing.

Figure 6 Cut and Pa ste of Bidirectional Te xt

2H l-'<11. 5 No.. l Stiii//J/

either strong matching, which req uires the diacriti­ The major d isadvantage is that the data type cal marks in both strings, or weak matching, which wou ld vary from one ve ndo r or platform to ignores diacritical marks. If the text includes char­ another and would thnd(Jre have no standard acters with more than one diacritical mark for a string-processing libraries. medium-strong match, the presence of certain 2. An existing data type, such as wchar_t in C marks might be required but not of others. Strong would be used. (Note that the char data type is matching is required fo r the Greek word fo r micro­ appropriate only if char is defined as 16 hits , or material f-LLKpoiJALK& and the Greek diminuitive if the string is given some fu rther structure fo rm of small f-LLKpouALKa. Without the diacritical to define its length by means othn than null marks, the words wou ld be identical. termination. Similar issues may exist in other Unicode requires that combining characters fo l­ languages.) low the base character. This solution was chosen over the alternatives of (1) precede and (2) precede The use of an existing data type h as the al'lv:llltagc and fo llow, fo r various reasons. 1·1 Text-editing oper­ of being widely known and implemented: how­ ations must ta into account the presence and ever, it also has the disadvantage of preexisting ordering of diacritical marks. A user-friend ly appli­ assu mptions about behavior and/or semantics. cation should be consistent in its choice of text ele­ 3. An opaque object would be used. Since the data ment on which operations such as next character in these objects is not v isible to the calling pro­ or operate. This choice should fe el gram, it can only be processed by routines or by natural to the user. For example , in Latin, Greek, invo king its member functions (e.g .. in C++) ami Cyrillic, the expectation wo uld be that accented characters are the unit of operation, Use of an opaque object has the advantage of hid- whereas in Dcvanagari and Thai, where several ing much of the complexity inherent in the world's combining characters and a base character com­ writing svstems from the application writer. It has bine into a cell, th�: natural unit is the individual the disadLtntages common to object-oriented sys­ character. tems, such as the need fo r software engineers to learn a new programming pa radigm and a set of Implementation Issues class I ibraries fo r the l nicock objects. In this section we describe some of the approaches that can be taken to support Unicode. As a concrete w ndows NTImplem ents Un icode example, we describe how the Microsoft Windows The Windows NT design team started wi th several NT operating system uses Unicode as the native text goals to make an operating sy.stem that wou ld pre­ encoding and maintains compatibility with existing serve the investment of customers and ckvc lopers. appl ications based on a diffe rent encoding. These goa ls affected their decisions regardi ng the data types and migration strategit·s clescrihcd in the previous section. General Co nsiderationsin Adding The goa ls related to text processing were to Un icode Support Informal discussions with vendors planning to sup­ 1. Provide hackrward comp:.ttihility port Unicode indicate that the fo l l owing data types a) Support exist ing \IS-DOS and !()-hit MS and data access arc being considered when using Windows appl ications, including those based the C . on H-bit and double-byte character s et (DBCS) code pages. 1. A new data type wou ld be designated for b) Support the DOS fik Unicode only. It wou ld be directl y accessible by syst<:m. the application, e.g., typedef unsigned short UNICH.AR. 2. Provide worldwide characrer support in a) File names The Unicode-only data type has the advantage b) File contents of being unencumbered with preconceptions c) l ser names about semantics or usage. Also, since the appli­ cation knows that the contents are in Unicode, it As described later in this section. these <: onfl ict­ can write code-set-dependent applications. ing goa ls were met under a single Windows , 'T

Digital Te cbuical jo urnal 14,1. 5 11. j S/11111/l('r /')')) 29 Product Internationalization

architecture, if not simultaneously in the same strcmp. Details of this procedure can be fo und in application and file system, then by clever segrega­ Win32 Application Programming lnteJface.1• tion of Wimlows NT into multitasking subsystems. These goals also affect the way Microsoft recom­ Procedures fo r Developing/Migrating Applications mends clevelopns migrate their existing applica­ in t!Je Dual Pa t!J In his paper "Program Migration to Unicode," Asmus Freytag of Microsoft explains tions to Windows NT. the steps used to convert an existing application to Th e Basic App roach Microsoft's overall approach work in Unicode and retain the ability to compile it is close to that of using a standard data l ype that as a DOS or 16 -bit Windows appl ication.1' The basic accesses data mainly through string-processing idea is to remove the assumptions about bow a functions. In addition, Microsoft defined a special string is represented or processed. AJ I references to set of symbols and macros fo r application develop­ string-related objects (e.g., char data types), string ers who wish to continue to develop applications constants, and string-processing functions are based on DOS (e.g., to sciJ to those with 286 and replaced with their dual-path equivalents. The ta l­ 386sx systems), while they migrate their products lowing steps are then taken. to run as native Win32 appl ications on Windows 1. Replace all instances of char with TCHAR, char''' NT. The developer can then compile the appli­ with LPSTR, etc. (For a complete listing, see cation with or without the compiler switch "Program Migration to Unicode.")1' -DUNICODE to produce an object module compiled fo r a native Windows NT or a DOS operating envi­ 2. Replace all instances of string or character con­ ronment, respectively. stants with the equivalent using the TEXT macro.1<• For example, Dual-path Data Typ es To select the appropriate char fi Lemessage[J "Fi Lename"; compilation path, .\1icrosoft provides C language char yeschar = 'Y'; header fi les that conditionally define data types, becomes macros, and fu nction names fo r either Un icode or traditional 8-bit (and DBCS) support, depending on TCHAR fi Lemessage[] = TEXT("Fi lename"); whether or not the symbol UNICODE has been TCHAR yes char = TEXT( ' Y'); defined. An example of a data type that illustrates 3. Replace standard char based string-processing this approach is TCHAR. If UNICODE is defined, fu nctions with the Wi n32 fu nctions. (See page TCHA.R is equivalent to wchar_t. Otherwise, it is the 221 of Wi n32 Application Programming Inter­ same as char. The application writer is asked to con­ fa ce fo r a complete listing.)11 vert all instances of char to TCI-iAR to implement the 4. Normalize string-length computations using dual development strategy. sizeof( ) where appropriate. For example, direct computation using address arithmetic should String-handling Functions Similarly, the macro take the fo rm: string_lengtb = (last_address TEXT is defined to indicate that string constants are fi rst_ad dress )"'sizeof(TCHAR); wide string constants when UNICODE is defined, or ordinary string constants otherwise. Application 5. Mark all files with the byte-order mark.17 writers should surround all instances of a string 6. Make other, more substantial changes. or character constant with this macro. Thus, Most character-code-dependent processing "" becomes TEXT("Filename"), and 'Z' should be taken care of by step 3, assuming the becomes TEXT('Z'). The compiler treats these as a developer has used standard fu nctions. If the wide string or character constant if UNICODE is source code makes assumptions about the encod­ defined, and as a standard char based string or char­ ing, it will have to be replaced with a neutral func­ acter otherwise. tion call. For example, the well-known uppercasing Finally, there are symbol names fo r each of the sequence various string-processing fu nctions. For example, if cha r_u pper = char_lower + 'a' -- 'A'; UNICODE is defined, the function symbol name _tcscmp is replaced by wcscmp by the C prepro­ implicitly assumes the language and the uppercas­ cessor, indicating that the wide character function ing rules are English. These must be replaced with a of that name is to be called. Otherwise, _tcscmp function call that accesses the Windows NT Natural is replaced with the standard C I ibrary fu nction Language Services.

30 Vo l. 5 Nu. 3 Stun mer !'}') � Digital Te chnical ]ourual Unicode: A Un iversal Character Code

Summary Organization fo r Standardization/Interna­ A universal character encoding-the Unicode stan­ tional Electrotechnical Commission, 1991). dard-has been developed to produce interna­ 7 Un icode 1.0. 1 Errata lnsertfor Th e Unicode tional software and to process and render data in Sta ndard, Ve rsion 1. 0, Vo lume 1 and Vo lume most of the world's languages. The standard, often 2 (Reading, MA: Addison-Wesley Publishing as 6 6 referred to Unicode/10 4 , was jointly devel­ Company, 1992). oped by vendors and ind ividual experts and by the International Organization fo r Standardization 8. TSO/IEC 10646 restricts the use of combining and International Electrotechnical Commission characters. See the definitions of level 2 and (ISO/IEC). Unicode breaks the (incorrect) principle level 3 in the section Relationship between that one character equals one byte equals one Unicode and ISO/IEC 10646. glyph. It stipulates the use of text elements that 9. Some of the presentation variants are are dependent on the particular text operation. encoded fo r compatibility with existing stan­ A number of software vendors are now moving to dards. For a discussion, see the section Code support Unicode. Microsoft's implementation sup­ Conversion. ports Unicode as the native text encoding in its Windows NT operating system. At the sametime, it 10. R. Pike and K. Thompson, "Hello Wo rld," maintains compatibility with existing applications Us enix Conference, 1993. based on 8-bit encoding. 11. File Sy stem Safe-UCS Tra nsfo·rmation For­ mat (Reading: X/Open Company Limited, Acknowledgments 1993). The authors would like to express their thanks to 12. A. LaBonte, "Multiscript Ordering for Uni­ Asmus Freytag of Microsoft Corporation and code," Proceedings of the Fo urth Un icode Masami llasegawa (ISO/IEC 10646 editor) fo r their Implementors Wo rkshop, Sulzbach (Unicode efforts in reviewing this paper. Inc., 1992).

References and No tes 13. Private communication, joseph D. Becker, 1993. l. Information Processing-/SO 7- bit and 8-bit Coded Ch aracter Sets-Code Ex tension Te ch­ 14. Win32 Application Programming Interfa ce niques, ISO 2022: 1986 (Geneva: International (Redmond, WA : Microsoft Press, 1992). Organization to r Standardization, 1986). 15. A. Freytag, "Program Migration to Unicode," 2. Information Te chnology-Multip le-Octet Proceedings of the Second Un icode Imple­ Coded Character Set, JSO!IEC DIS 10646· mentors Wo rkshop, Merrimack (Unicode 1990 (Geneva: InternationalOrg anization for Inc., 1992). Standardization/International Electrotechni­ 16. String constants in source code should be cal Commission, 1990). avoided in all cases. They violate one of the 3. ]. Becker, "Multilingual Word Processing," fundamental design rules of software interna­ Scientific American, vol. 251 (July 1984): tionalization, i.e., that objects dependent on

96 - 107. language and/or culture should be isolated into easily accessible modules fo r the purpose 4. Th e Unicode Sta ndard, Ve rsion 1.0, Vo lume of localization. 1 (Reading, 1\1.A: Addison-Wesley Publ ishing Company, 1991 ) . 17. Unicode defined the code value OxFEFF to have the semantic byte-order mark (BOM) and 5. Th e Un icode Standard, Ve rsion 1.0, Vo lume encourages software developers to place it as 2 (Reading, MA: Addison-Wesley Publishing the first character in a Unicode file. (For Company, 1992). details, see the section Byte Ordering.) 6. Information Te ch nology- Un iversal Multiple­ Octet Coded Character Set (UCS), ISO/IEC DIS 10646-1 .2:1991 (Geneva: International

Digital Te chnicaljounwl Vr!l. 5 Nn. j Su111mer 199.3 31 We ndy Ramumberg Jii rgen Bettels

TheX/Op en Internationalization Mo del

Soft ware internationalization standards allow developers to create applications that are neutral with respect to la nguage and cultural information. X/Open adopted a model f0 1· internationalization and bas revised the model several times to e:xpand tbe range of support. Th e la test version of tbe X/Open internationaliza­ tion model, which supports multibyte code sets, provides a set of interfa ces that enables users in most of Europe and Asia to de velop portable applications indepen­ dent of the language and code set. One implementation of this model, the interna­ tionalized DEC OSF/1 AXP version 1.2 (based on OSF/1 release 1.2) supports complex Asian languages such as Ch inese andja panese.

Software internationalization standards initiatives to isolate sensitiVIty to language and culture­ began in the late 1980s. This paper provides a brief specific information. Such interfaces include func· history of internationalization standards activities tionality to fo llowed by a description and an analysis of the • Attain character attributes independent of coded X/Open model fo r internationalization. The Open character sets, i.e., code sets Software Foundation's OSF/1 release 1.2 and Digital's DEC OSF/1 AXP version 1.2 internationalization imple· • Order relationships of characters and strings mentations serve as reference software fo r the • Process culturally sensitive fo rmat conversion description. The analysis covers both the strengths (e.g., elate, time, and numbers) and the limitations of the model. The paper con­ clud<:s with a discussion of current and fu ture rela· • Maintain user messages fo r multiple languages tionships between this model and other work in the field. Standardization of internationalization interfaces began predominantly in the UNIX environment. Companies such as Hewlett-Packard and AT&T pro­ Internationalization Standards vided early proprietary solutions. 1 The InternationalOrga nization fo r Standardization When X/Open announced its intention to (ISO) is the primary group that is curren tly publish· include support fo r internationalization in Issue 2 ing or developing internationalization specifica· of its X/Open Portability Guide (XPG2), Hewlett­ tions, including code sets, programming languages, Packard submitted its Natural Language Support and frameworks. Defore the ISO adopts emerging System as a proposal fo r an internationalization specifications, much work is done by other groups. model. X/Open further developed this proposal In the case of interfaces that support the develop­ and published the guide in 19872 Some principles ment of international applications, the Uniforum developed fo r these solutions fo und their way into Internationalization Te chnical Wo rk Group, the the emerging C programming language standard X/Open Internationalization Work Group, the (ISO/IEC 9899) and the POSIX operating system Unicode Consortium, and the X Consortium have interface specification (ISO/IEC 9945·1).u been instrumental. The subsequent version of the X/Open Internationalization is generally considered to be Portability Guide, XPG3, published in 1989, demon­ the processes and tools appl ied to create software strated further improvement in internationalization that is neutral with respect to language and cultural support.' The guide was aligned with the ISO/IEC C information. This neutrality can be accomplished standard and the JSO/IEC POSIX specification, both by providing a set of application interfaces clcsignecl of which meanwhile had been fi nalized.

32 litJ/. 5 No . .) Summer 1993 Digital Te cbnical jo urnal The X/Open Internationalization Model

A major drawback of the XPG3 specification is 1. Locale announcement mechanism that it is limited to single-byte code sets. Such code sets are used primarily fo r western European lan­ 2. Locale databases guages and preclude use of the X/Open internation­ 3. Internationalization-specific library routines alization model fo r Asian and eastern European languages. 4. International ized interface definitions for stan­ The Japanese UNIX Advisory Group developed dard C language library routines specifications to extend support to character sets that are encoded in more than one byte. These code 5. Message catalog subsystem sets are generally known as multibyte code sets. The locale announcement mechanism provides a The Multibyte Support Extensions developed by way fo r an application to load, at run time, a spe­ this group are now included in an addendum to the cific set of data that describes a user's native lan­ ISO/IEC C programming language standard 6 This guage and cultural information. An application user work was also adopted by X/Open fo r inclusion in can specify a language, a territory, and a code set Issue 4 of the X/Open Portability Guide (XPG4), by means of environment variables. The locale which was published in 1992.7.H.� announcement mechanism checks the environ­ However, the underlying model used by X/Open ment variables. If the variables are set, the applica­ and POSIX does not fu lly meet the needs of dis­ tion attempts to load the locale-specific data. If the tributed and multilingual computing environ­ environment variables are not set, most applica­ ments. Therefore, in 1992 X/Open and Unifo rum tions default to the use of the PO SIX (i.e., C lan­ created a joint internationalization work group, guage) locale or an implementation-defined locale. commonly referred to as the XoJIG. This group The POSIX locale definition is based on the u.s. ASCII investigated internationalization requirements for code set and the U.S. ! ish language. distributed and multilingual environments and, in In conjunction with locale databases, the November 1992, published a revised model fo r announcement mechanism provides access to code internationalization.1o set specification data, character collation informa­ tion, date/time/numerical/monetary fo rmatting The X/open Internationalization Model information, negative/affirmative responses, and When X/Open first investigated the need fo r application-specific message catalogs. internationalization services, several needs were Figure 1 shows the relationships among the com­ identified: ponents of the X/Open internationalization model.11 Refer to Figure 1 throughout this section, • Meet the market requirements of the X/Open as the various elements of the figure are described . member companies. (Many of these require­ The locale announcement mechanism is based ments were based on the needs of the European on the set locale( ) function Economic Community [EEC].) char *set loca le( int category, • Support more than one language and cultural const char *loca le) environment, including messages and date/time. The categories correspond to components of the • Provide fo r data transparency, i.e., remove 7-bit, locale database and have a set of corresponding U.S. ASCII restrictions from the environment. user environment variables. The announcement As discussed previously, X/Open adopted a mechanism supports an order of precedence when model fo r internationalization and has updated and querying the user's environment to establ ish the revised the model many times. The next section preferred locale. Table 1 shows the environment describes the current X/Open model. variables specified by XPG4. The LC_A LL environment variable has prece­ dence over all others, whereas the LANG environ­

Overview of the X/Open Po rtability ment variable has no precedence. The other l.C_ * Mo del, Is sue 4 environment variables are of equal weight. There are five components to the current X/Open Although it does not provide a naming conven­ internationalization model, X/Open Portability tion fo r locales, the X/Opcn model cloes specify the Guide, Issue 4 (XPc;4): locale argument as a pointer to a string in the fo rm

Digital Te chnical jout·,.al Vo l. 5 No. 3 Summer 1')')3 33 Product Internationalization

r------� �------DEVELOPMENT SYSTEM INTERNATIONALIZED SYSTEM DEVELOPMENT (XPG4 ONLY) SYSTEM

APPLICATION MESSAGE TEXT FORMAT

I CODE INTERNATIONALIZATION API LOCALE MESSAGE R SET f==1 Fl LE FILE l___,-J e FILE L_,-J INTERNATIONALIZATION SERVICES I I I y y STRING AND ' LOCALE MESSAGE CHARACTER localedef HANDLING RETRIEVAL gencat I I HAN DLING I I + � I I ------1 -- j ______l 1 l .______fL-L.....J l l_J LOCALE MESSAGEl DATABASE CATALOG

Figure 1 Components of the X/Open lnternationalization Model

Ta ble 1 Locale-specific Environment to sort alphabetically according to the telephone Va riables directory rather than the dictionary. Locale databases can be provided by either Va riable Use the system vendor or an application developer. A LC_ALL For all categories description of utilities that convert a source fo rmat LC_COLLATE For collation specification of a locale to a binary file fo llows. LC_CTYPE For character classification The setlocale( ) function accesses the binary LC_MESSAGES For responses and message locale databases and provides a global locale within catalogs a given application. The global locale is similar to LC_MONETARY For monetary information a global variable in that it is shared by all of an appli­ LC_NUMERIC For numeric information cation's procedures. Locale switching can be done LC_TIME For date/time information within an appl ication, but within the scope of the LANG If no others are set XPG4 model such locale switching is unnecessarily complex and costly, in terms of performance. A later section discusses additional limitations of this mechanism. XPG3: language[_territory[. codesetJ][@mo difierJ The set of interfaces shown in Ta ble 2 supports XPG4: international application development and was language[_ territoryJ[.codesetJ[@modi fierJ first introduced as part of the ISO/IEC c and the XP(i2 and XPG3 specifications. These interfaces are Examples of environment variable settings are used primarily to access data in the locale databases LANG = en_US.IS08859-1 or to manipulate locale-sensitive data. The XPG3 specification is based on the use of and ISO/IEC 8859-1 as the transmission code set.12 Some LCCOLLATE = ja_jP.jpEUC implementations use this as an set, instead of the ASCII code set. The mocli.fier is sometimes used to specify a partic­ A limited set of functions that support multibyte ular instance of a language or cultural info rmation characters is also available: mblen( ), mbtowc( ), for a locale. For instance, if support fo r a particular mbtowcs( ), wctomb( ), ancl wcstombs( ). Each sort order is necessary, in a German locale the user of these fu nctions is based on the ISO/IEC C wide might specify character (wchar_t) data type. The size of the data LC_COLLATE = de_DE.IS08859-l @phone type is not specified by the standard and can vary

34 Vol. 5 Nu. .J Summe-r I')'J.l Digital Te chnicflljournal The X/ Op en Internationaliza tion Model

Ta ble 2 Interfaces for International sets), worldwide portability interfaces, and local Application Development language support.

Interface Use Code Set Support As mentioned in the previous localeconv( ) For retrieving locale-dependent formatting parameters section, the XPG3 specification primarily supports code sets based on the ISO/IEC 8859-1 specification. nl_langinfo( ) For extracting information from the locale database The XPG4 model goes beyond this by including setlocale( ) For locale announcement additional interface specificationsto support lti­ byte locales and internationalized commands. strcoll( ) For locale-based string collation The XPG4 model is a superset of the five basic strftime( ) For converting date/time formats based on locale components of the XPG3 model. The use of the wchar_t data type is a key fe ature of the new inter­ strxfrm () For transforming a string for collation in current locale face specifications, because this data type supports multibyte code sets. In the internationalized DEC OSF/1 AXP version 1.2 system, the size of wchar_t is 32 bits, which enables the support of complex from one implementation to the next, depending Asian languages such as Chinese. This implementa­ on the code set support offe red by a particular ven­ tion is based on the OSF/1 release 1.2, which is itself dor. This multibyte function set does not provide designed to support 8-, 16-, or 32-bit wchar_t defini­ adequate support fo r Asian language application tions. The X/Open internationalization model is development. based on the concept of process and file codes. In In addition to the mb' and we''' fu nctions, the the internationalized DEC OSF/1 version 1.2 imple­ X/Open internationalization model specifies a set mentation, the wchar_t data type is used as process of extensions fo r many library fu nctions and com­ code. That is, internal to an application, characters mands. These extensions enable the support of are converted to the wchar_t data type before use. 8-bit characters as las provide the fu nctionality File code, i.e., on-disk data, is always stored as required to meet the original go al of ensuring data multibyte characters. An application converts all transparency. For example. changes to the printf( ) internal process code (i.e., wchar_t data type) char­ and scanf( ) fa milies of functions allow the order­ acter to multibyte character prior to storing it on ing of argu ments to be specified in translated mes­ disk. This enables file compression and enforces sage catalogs. In addition, about 80 commands, the use of a constant width fo r the processing of incl uding sort and date, were modified to support character information. The mb" and we• fu nctions the locale categories. convert between the two types of data. The size of The XPG specifications include a message catalog su bsystem. Although not very sophisticated, this the wchar_t data type combined with the capability to support multiple encoding schemes provides the subsystem provides much needed fu nctionality. flexibil ity required to have a code set-independent Minor updates have been made with each new issue of the Portability Guide. The subsystem comprises implementation. only three functions: catopen( ), catclose( ), and Restrictions exist on the use of certain characters in the second and subsequent bytes of a multibyte catgets( ). A command, gencat, is used to convert a message source file into a binary message catalog character so that full code set independence is diffi­ that is accessed at run time by an application. The cult to achieve. An example of such a restriction is behavior of the catopen( ) fu nction is dependent the slash character /. The UNIX file system uses this character as a delimiter in absolute and relative on the user's chosen locale allowing selection of pathname specifications. Implementations based translated messages. on OSF/1 release I .2 restrict the use of characters in the range Ox00-0x3F to the ASCII code set. XPG4Sp ecification and the OSF/1 Release However, even with this restriction, it is possible to 1.2 Imp lementation build robust systems that support a wide range of This section discusses the XPG4 model in terms multibyte code sets. of the OSF/ 1 release 1.2 implementation. lbpics To gain the necessary flexibility, the Open Soft­ include code set support, the locale definition ware Foundation introduced an object-oriented utility (the utility fo r handl ing data in mixed code architecture fo r the internationalization subsystem.

Digital Te cb11ical jo urnal \lrJ/. 5 No. j Summer t')'Jj 35 Product Internationalization

This architecture specifies the various components (e.g., isalpha) specified in the ISO/IEC c standard. As of the X/Open model as subclasses. At run time, mentioned previously, the XP<;;i model docs not an application instantiates objects built from these include all the interfaces necess:1ry for application subclasses by means of the sctlocalc( ) function developers to handle multibyte code sets. A new call. set of interfaces, which parallels the set of ISO/IEC C. 8-bit interfaces, was developed and integrated into localedef, iconu, and Code Set Jndej)e1ldence the XPG4 specification. The final version of the XPG3 does not provide a utility fo r describing interface specification was proposed to the JSO/IEC locales. Therefore, the number of diffe rent C committee as the Multibyte Support Extensions. approaches to the problem matched the number of vendors. Introduced in the POSIX specification Cultural Data/Local Language Support Local ISO/IEC DIS 9945-2 and hence adopted by X/Open, language support is achieved through the use of the localedef utility provides a mechanism fo r spec­ locale databases and message catalogs. The catalogs

ifying a locale in a portable . I-' for each code enable translation of user messages. Locale data­ set supported in the internationalized DEC. OSF/1 bases have two components: the cbarmap file and AXP system, there is a corresponding charmap file the locale definition fi le . These databases are cre­ and one or more corresponding locale definition ated by means of the localedef command. files that adhere to the POSIX specifications. The charmap fi le contains a POSIX-compliant Combined with a set of locale-specific methods and specification of the code set, i.e., a one-to-one code set converter modules, these subclasses pro­ mapping from character to code point. The locale vide the foundation fo r the OSF internationalization definition file contains the cultural information. architecture. Va rious sections of the definition file correspond Locale-specific methods provide a way fo r the to the categories referenced by the setlocale( ) ISO/IEC c language mbtowc( ), wcromb( ) fa mily of function. The definition fi le contains col lation functions to work in a multiple code set environ­ specifications, numeric and monetary fo rmatting ment. The wchar_t encoding of a multibyte charac­ information, date/time fo rmats, affirmative/ ter in the Japanese SJ IS code set is different from negative response specifications, and character that fo r a character in the Super DEC Kanji code set. classification information. In the 051-/ 1 release 1.2 At execution time, the corn:ct merhod is instanti­ implementation, these definition files are indepen­ ated based on the user's choice of locale. An exam­ dent of the code set. For example, the definition fo r ple of such an instantiation is shown in Figure 2. Japanese (ja_JP) can be combined with multiple A user-level utility (iconv) and several library li.111C­ charmap files such as SJ IS or eucjl'. tions (iconv( ), iconv_open( ), and iconv_close( )) provide a way to handle data that may be in mixed Strengths of the X/Op en Mo del 1.2 code sets. Internationalized DEC OSF/ 1 version The greatest strength of the X/Open international­ provides an extensive set of code set conversion ization model is that it is in place today and enables modu les. New conversion methods are easily added the development of portable, language- and code to the system. set-independent appliGttions. The international­ ized DEC OSf/1 AXP version 1.2 system provides sup­ Wo rldwide Po rtability lntelfo ces The XPG4 inter­ port throughout the commands and tttilities for 20 nationalization architecture paral lels the XP<;.VISO code sets that represent major European and Asian C model. For example, XI'G4 specifies a fa mily of languages. All this is accompl ished using XPG4 isw'' functions similar in design to the is· fu nctions application programming interfaces (APis). In addi­ tion, the programming paradigm is consistent with ANSI C, making it easier fo r appl ication developers LANG = ja_JP.SJIS to modify existing applications fo r international mbtowc( ) - sjis_mbtowc( ) support. or LANG = zh_TW.eucTW Limitations of the X/op en Model mbtowc( )-eucTW_mbtowc( ) As described previously, the X/Open model fo r international ization provides a comprehensive Figure2 instantiation of mbtowc() set of application interfaces, thus enabling the

36 vb l. 5 No. 3 S11111JI/.er 19'J.i Digital Te chnical jo urnal The X/Open Internationalization Model

development of applications that can be used process includes returning locale-specific user worldwide. t, as with many standards, there are messages to the client and processing user­ limits to what can be accomplished. In this case, locale-sensitive date/time fo rmats, collation limitations manifest themselves in several areas: information, and string manipulation.

• A window manager that supports multiple • C language API clients displays menus fo r a client based on the • Distributed computing environments client's locale. The user error messages displayed

• Multithreaded applications are based on the locale of the server. When a client sends a request to a server, the • Multilingual applications'-; request parameters that are passed between the • Unicode and ISO/IEC 10646 support'' 'r' client and the server imply an associated locale. Since the global locale is not an explicit argument in Because the X/Open and POSJX specifications are any of the XPG4 fu nctions, this locale is difficult to based on UNIX implementations, the AP!s are speci­ pass to the server. Consider the specific case of fied only fo r the C programming language. For pro­ remote procedure calls (RPCs), where an interface gramming languages such as COBOL, FORTRAN, and definition language (JDL) might be used to generate Ada. it is not necessarily possi ble to match the syn­ client stubs. Because of the global nature of the tax and semantics of the API. The re mainder of this locale, insufficient information is available to the section explores generic problems with the global IDL to determine if the locale information needs to locale model and addresses specific issues in more be used as an argument to any generated functions. detail. Thus, the server may need to change its locale fo r Global Locale Issues each client request, which may be unacceptable in terms of system performance. The X/Open model is based on the concept of a Using the current XPG model, synchronizing the global locale. This aspect of the model is achieved use of a specific locale between a client and server through the use of locale data that is maintained in may not be possible. Even if a client could specify a private, process-wide global structure. The use of a locale as part of the request, the locale may not a global locale is one of the more severe drawbacks be available at the server side or may be repl i­ to using the overall model. cated incorrectly on the server side. This situation When working with this model, application exists because locale names and content are not developers typically assume that a single language­ standardized. territory-code set combination is in use at a given Although the XPG4 specification includes the time and will remain constant on a per-process localedef command fo r specifying the content of a basis. Although it is possible to use the announce­ .locale clatabase, there is no provision fo r standardiz­ ment mechanism to determine the run-time locale ing the content. The only locale fo r which an of a process, this mechanism is cumbersome. The X/Open specification exists is the POSLX or C locale. application must both save and restore the locale In addition, there is no specification fo r explicitly information. naming a locale. Locale names are composed of lan­ Another drawback of the X/Open model is that guage, territory, and code set components. Many existing APis do not include a way to share locale­ vendors use ISO/IEC 639 and iSO/IEC .'i166fo r the lan­ specific information between processes. This, com­ guage and territory components, but there is little bined with the difficulty of locale switching, 1 imits agreement on code set naming conventions l7.1H the ability to support multilingual and distributed This naming scheme is not suffi cient fo r uniquely applications. identifying locales, as is required in a client-server Distributed Processing Issues model. Another problem with the X/Open moclel that In a client-server environment, the problem of sup­ impacts application performance and the ease with porting multiple locales becomes a serious issue. which an application can be internationalizecl is Consider the fo llowing examples: related to the process code. The representation of • A server gets requests from various cliLnts. each the process code, i.e., wchar_t , is implementation running their own locale. These requests are defined, and the mapping of multibyte characters processed using the locale of the client. The to wide character codes may be locale sensirive.

Digital Te cb11ical jo unml V!JI. 5 No.. i St/111111<'1' !')V.i 37 Product Internationalization

Therefore, wchar_t-encoded data cannot be support within the ex1stmg specifications; this exchanged freely between the client-server pair. publication should be available in late 1993. Some The only exception would be if the end user guar­ of the issues that the C language, I'OSIX, and XPG4 anteed that the process code was identical fo r a are facing to support Unicode or ISO/IEC 10646 are given locale fo r each part of the client-server pair. character compatibility, code restrictions, and valid The XPG4 specification does not include function­ character strings. ality to identify or to interrogate the wchar_t Unicode characters are incompatible with the encoding scheme used. C language char• data type used in the POSIX and X/Open models. Unicode characters are 16-bit enti­ ties, whereas the POSJX and X/Open characters are Multithreaded Applications in practice 8-bit bytes, even though theoretically The problems encountered in a distributed process­ the byte size is implementation dependent. lVlost ing environment become more complex if the APis defined in the POSIX and X/Open models application is also multithreaded. Using POSIX implicitly assume 8-bit characters. This principle threads, commonly referred to as pthreads, more is extended to cover Asian multibyte characters than one thread is in the execution phase at the by considering each character to be a sequence of same time. 1� Again, a problem with the global, pro­ 8-bit char data elements. Unicode characters, bow­ cess-wide locale is evident. The application cannot ever, cannot be broken down into sequences of maintain the state of the global locale, accom­ valid 8-bit char data elements. plished by a save/restore process, without blocking The POSIX character model requires that the all other threads. Likewise, execution of locale-sen­ code values fo r char''' data protect the code ranges sitive fu nctions requires locking all threads to fo r control characters between OxOO-Ox 1 F and ensure that the global state is not altered prior Ox80-0x9F, the code position DELETE, and the slash to completion. The need to continually lock and character /. No such restrictions exist in Unicode. unlock threads, in addition to being undesirable, The C language postulates that a results in a performance problem fo r international­ terminates a char'' string. Since the Unicode string ized applications. Another approach is to make most likely contains zero bytes, these bytes wo uld locale data thread-specific. be interpreted as string terminators. In principle, the C language would al low a compiler to define Multilingual Applications the char'' data type to be of 16-bit width. However, given the prevail ing assumption in POSIX and Xl'G4 The X/Open internationalization model is oriented that one character equals one 8-bit byte, a Unicode toward the development of monolingual applica­ character string cannot be a valid char'' string. tions. Therefore, the model does not provide fu nc­ For these reasons, Unicode cannot be a valid file tions to handle data that consists of an arbitrary code as defined by the l'OSIX and X/Open specifica­ mixture of languages and code sets. tions. Unicode is not usable as an XPG4 process The fo llowing are some examples of applications code either. Unicode and ISO/IEC 10646 allow the that may require multilingual services: combining of 16-bit characters. " However, in many • Appllcations that simultaneously interact with operations the (e.g., in the a number of users (e.g., transaction processing French character set, the grave accent) and the base systems), where each user can choose a language character (e.g., the letter e) have to be processed together. This situation contradicts the XPG4 • A word processing application fo r multilingual model, where each character of the process code is texts that need language-sensitive fo rmatting, individually addressed and processed. hyphenation, etc. Using a well-defined encoding as XPG4 process code would also violate the principle that the pro­ Un icode Support cess code is opaque, implementation defined, and With the arrival of the Unicode universal character not valid outside the current process. For all these code and the adoption of ISO/IEC 10646 as its fo rm, reasons, the X/Open Joint Internationalization both POSIX and X/Open have to address the issues Group decided to propose using Unicode in a mod­ of support.1"· 16 The X/Open Internationalization ifiecl fo rm of the universal multiple-octet coded Wo rking Group is preparing a paper on Unicode character set ( ) transformation fo rmat (UTF). 1('·20

38 \k; /. 5 No. j Summer 1')')_1 Digital Te ciJnicaljo urnal The X/Open Internationalization Model

Proposed Changes to the Mo del locale can be uniquely identified, the remote code The XPG4 model limitations c\escribec\ in the previ­ can replicate the locale by obtaining it and specify­ ous sections are we.l l understood in the internation­ ing this information as part of the operation. To alization community. X/Open has published a solve the locale replication problem, the Xoj !G Snapshot specificationfor a set of distributed inter­ developed a locale naming scheme, referred to as nationalization services J0 This specification does the locale specification. not solve all the problems identified in this paper. It The locale specification is a character string that does, however, address the problems associated contains the locale name fo r each category that with the use of the global locale mechanism, locale exists within the locale. The syntax fo r locale identification, and text object manipulation. Note names is a list of keyword-value pairs, where each that these are proposed changes and have not been pair defines a locale category. Certain keywords, adopted by any standards organization. such as cocle set name, encoding name, and owner The proposed changes include or vendor name, are standardized as part of the reg­ istration process. Table 3 shows two examples of • A locale naming specification that enables the locale specifications. identification of a given locale in a distributed Although this naming scheme provides fo r environment unique identification of locales, the names are long. • Definition and support of a locale registry The specification calls fo r the use of ASCII charac­ ters to name locales. The American English locale • A new set of APls that enables application soft­ specification is over 200 bytes in length. A short­ ware to hand notation called network locale specification - Concurrently manage and use many different token has been proposed. locales The network locale specification token is an 1 - Manipulate opaque text objects2 unsigned integer value that can be represented - Support statefu l and nonstateful encodings within fo ur bytes. The two most significant bytes and fi le codes that are excluded by the cur­ represent the registration authority. nder the pro­ rent standards (e.g., nonzero byte terminators posal, national and international standards bodies, used in the Unicode code set) companies, and consortia, etc., that wish to use net­ work locale specification tokens will receive Locale Na ming and the Locale Registry unique identifiers. A block of values will be In an internationalized environment, the server reserved fo r private use between consenting sys­ must replicate the client's locale. If the client's tems. A set of new fu nctions will allow conversion

Ta ble 3 Network Locale Naming Specifications

American English Locale Using the lSO/IEC Latin-1 Code Set

CTYPE=ANSI;en_ US;01 _00;1S0-88591-1987;;/ COLLATE=ANS I ;en_ US;01 _00; IS0-88591 -1 987;;/ MESSAGES=ANSI;en_US;01 _00;1S0-88591-1987;;/ MONETARY=ANSI ;en_US;0 1 _ OO;IS0-88591 -1987;;/ NUM ERIC=ANSI;en_US;01 _00;1S0-88591-1987;;/ TIME=ANSI;en_US;01 _00;1S0-88591-1987;;/

Japanese Locale Using Japanese (EUC) Encoding

CTYPE=ISO;ja_,l P;01 _00;,11S-X0208-1987,JIS-X0201-1987,-IIS-X021 2-1991 ;EUC;/ COLLATE=ISO;ja_J P;01 _00;J IS-X0208-1987,JIS-X0201-1987,JIS-X021 2-1991 ;EUC;/ M ESSAGES=ISO;ja_J P;01 _ OO;J IS-X0208-1987,J IS-X0201 -1 987,J IS-X0212-1991 ; EUC;/ MONETA RY=ISO;ja_J P;01 _00;JIS-X0208-1987,JIS-X0201-1987,JIS-X0212-1991 ;EUC;/ NUMERIC=ISO;ja_JP;01 _00;JIS-X0208-1 987,JIS-X0201 -1 987,J IS-X0212-1991 ;EUC;/ TIME=ISO;ja_JP;01 _00;JIS-X0208-1987,JIS-X0201-1987,JIS-X0212-1991 ;EUC;/

Digita1 1e c:huical]ourual H>l. 5 No. 3 Summer I'J'J.> 39 Product Internationalization

between the full locale specification and the locale types used in the XPG4 internationalization modd. specification token. As previously defined, a text object refers to a col­ The locale specification proposal solves the lection of text characters that may or may not have problem of unique naming fo r locales. Combined metadata associated with them. Support fo r direc­ with a locale registry, this proposal overcomes tionality, as required fo r right-to-left languages some of the limitations of the current X/Open such as Hebrew, is an example of when such meta­ model. Within the registry, each locale will have data would be introduced. If a text object has a a name defined according to the new syntax. locale defined as part of the metadata (i.e., self­ Assuming vendors add these registered locales to announcing data), the locale specified as part of the their systems, language-sensitivc operations in a data supersedes the locale passed as an argument to distributed environment will obtain the same the o* functions. The locale that is passed as a func­ results across systems. This registry has been estab­ tion argument acts as a default locale fo r operations lished by X/Open, and several locales have been that require it. Al l o' functions allow a locale identi­ submitted. fier to be passed as an argument. This capability eliminates the limitations of the XPG4 global locale. Multilocale Support The support of metadata associated with text A new set of interfaces, the set of o* fu nctions, has objects is implementation defined. been proposed. These interfaces provide capabili­ A text object data type is represented by a text ties similar to those defined by the XPG4 model. pointer of type txt_ptr. A text pointer represents all These ncw fu nctions addrcss many of the model's the information associated with a particular charac­ limitations, including multithreaded applications, ter position within the text object. This informa­ distributed systems, and multilingual applications. tion is sufficient to perform any kind of operation, Most of the o* fu nctions utilize three new data such as classification, extraction, or uppercasing. types: locale object, attribute object, and text In summary, the o* functions allow text objects object. To overcome the limitation imposed by a to be classified, converted, transferred to and from global, per-process locale, the fu ndamental XPG4 files, etc. The fu nctionality of the o* functions programming paradigm is altered to define locali­ is designed to parallel the character-handling zation on a per-call rather than a per-process basis. functionality provided by the X/Open internation­ This change is accomplished by defining a new alization model. For example, fu nctions fo r manipu­ opaque data typc called a locale object. A locale lating text pointers and fo r concatenating text object identifies the locale and can be passed as an objects are tuned to the multilocale model. argument to locale-sensitive functions on a per-cal l Interfaces have also been introduced to provide basis. In this way, the basic programming paradigm management functions fo r new objects. becomes

I. Perform operation X on data Yusing locale Z Conclusions When introduced, the X/Open Portability Guide and not Issue 3 model fo r internationalization met about 90 1. Set global locale Z percent of the known requirements in the western European market. The introduction of the XPG4 2. Perform operation X on data Y worldwide portability interfaces expanded the An attribute object is a generic opaque object region to include Asia, japan, and eastern Europe. that serves as a container to other opaque objects, Consequently, application developers can write such as a locale object. Use of an attribute object in portable code that supports a variety of languages. the proposed APis provides a solution that is not The use of the worldwide portability interfaces fo r specific to solving internationalization problems. It computer-aided design applications that are dis­ is anticipated that objects, in addition to the locale tribu ted worldwide is one example of such code. object, will be identified. The additional objects However, the use of the client-server model might result from requirements in such areas as expanded greatly in the time it took to develop multimedia, network security, and X 11-specific these standards. Also, the need to support truly extensions to the locale. multilingual applications in a distributed environ­ A text object is a new data type that replaces the ment became evident. New code set specifications character (char) and wide character (wchar_t) data (i.e., Unicode) have been adopted, and systems

40 \'<;/. 5 No. 3 Surnmer l993 Digital Te chnical journal Th e X/Open Internationalization Model

supporting Unicode as both file and process code 8. X/Open CA E Sp ecification, Commands and have been implemented. Application vendors are Utilities, Issue 4, ISBN 1-872630-48-0, C203 beginning to see their markets expand into every (Reading, U.K.: X/Open Company Ltd., 1992). corner of the world. The XPG4 model will continue to provide much­ 9. X/Open Internationalisation Guide (Read­ needed interfaces fo r quite some time. Ye t, to meet ing, U.K.: X/Open Company Ltd., 1992). the challenges of the truly distributed environ­ 10. Distributed Internationalization Services ment, a new API, similar to the o* fu nctions pre­ (S napshot) (Reading, U.K.: X/Open Company sented here, must be developed and accepted. Ltd., 1992).

Acknawledg ments 11. L. Laverdure, P. Srite, and J Colonna-Romano, NA S Architecture Reference Manual (May­ Thanks to Mike Feldman, Richard Hart, and Dave nard, MA: Digital Press, 1993): 255-264. Lindner, among others, who spent their time pro­ viding comments and recommendations during the 12. Information Processing-8-bit, Single-byte writing of this paper. Coded Graphic Character Sets-Part 1: Latin Alphabet No. 1, ISO/IEC 8859-1 (Geneva: References and No tes International Organization fo r Standardiza­ tion/International Electrotechnical Commis­ 1. UNIX Sy stem V Release 4 Multi-National Lan­ sion, 1987). guage Supplement (S VR 4 MNLS) Product Overview (Japan: American Telephone and 13. Information Te chnology-Portable Op erate Te legraph Co., 1990). ing Sy stem Interfa ce (POSIX) -Shell and Utilities, ISO/IEC DIS 9945-2 (Geneva: Interna­ 2. X/Open Po rtability Guide, Issue 2 (Reading, tional Organization fo r Standardization/ U.K.: X/Open Company Ltd., 1987). International Electrotechnical Commission, 1992). 3. Programming Languages-C, ISO/IEC 9899:

1990 (Geneva: International Organization fo r 14. Multilingual applications can process multi­ Standardization/International Electrotechni­ ple languages at the same time, whereas cal Commission, 1990). implementations of the X/Open model can process several languages but only on an incli­ 4. Information Te chnology-Portable Op erat­ viclual basis. ing System Interfa ce (POSIX) -Part 1: System

Application Program Interfa ce (A ) [C Lan­ 15. ]. Bettels and F. Bishop, "Unicode: A Universal guage}, ISO/IEC 9945-1 : 1990 (Geneva: Interna­ Character Code," Digital Te chnical jo urnal, tional Organization fo r Standardization/ vol. 5, no. 3 (Summer 1993, this issue): 21-31. International Electrotechnical Commission, 1990). 16. Information Te chnology- Universal Multi­ ple-Octet Coded Character Set (UCS)-Part 1: 5. X/Open Po rtability Guide, Issue 3 (Reading, Architecture and Basic Multilingual Plane, U.K.: X/Open Company Ltd., 1989). ISO/IEC 10646-1 (Geneva: International Orga­ nization fo r Standardization/International 6. Multibyte Support Extensions, ISO/IEC Electrotechnical Commission, 1993). 9899: 1990/Amendment 3: 1993(£) (Geneva: International Organization fo r Standarcl iza­ 17. Codes fo r the Representation of Na mes and tion/International Electrotechnical Commis­ Languages, ISO/IEC 639 (Geneva: Interna­ sion, 1993). tional Organization fo r Standardization/ International Electrotechnical Commission, 7. X/Open CA E Sp ecification, System Interfa ce 1988). Definitions, Issue 4, ISBN 1-872630-46-4, C204 (Reading, U.K.: X/Opcn Company Ltd., 18. Codes fo r the Representation of Names of 1992). Countries, ISO/IEC .1 166 (Geneva: International

Digital Te clmical journal liJI. 5 No. _; Summer !')').) 41 Product Internationalization

Organization fo r Standardization/International 21. As defined in the X/Open Draft International­ Electrotechnical Commission, 1988). ization Services Snapshot: A text object is an implementation-defined representation of a 19. Information Te chnology-Portable Op erat­ fragment of text that consists of zero or more ing Sy stem Interfa ce (POSIX) -Threads text characters. Extension fo r Po rtable Op erating Sy stems, IEEE 1003.4a/D7 (New York: The Institute of Electrical and Electronics Engineers, 1993). General Reference

20. File Sy stem Safe - UCS Transformation Fo r­ S. Martin and M. Mori, Internationalization in mat (Reading, U.K.: X/Open Company Ltd., OSF/1 Release 1. 1 (C ambridge, MA: Open Software 1993). Foundation, Inc., 1992).

42 Vo l. 5 No. 3 Summe-r 1993 Digital Te clmicaljourrtal Rene Ha entjens I

Th e Ordering of Un iversal Character Strings

In the countries of the world, people have developed various methods to order wo rds and names based on their cultures. Many challenges and problems are asso­ ciated with developing ways fo r computers to emulate human ordering methods. An efficient computer method fo r obtaining a quality ordering has been devised as an extension to the single-step compare. It solves many but not all of the problems. A universal code now exists to store words and names written in many languages and scripts, but there is no universal way to order words and names. He nce, Jormal sp ecification methods are needed fo r computer users to describe culture-specific ordering rules. Th is area is still op en to research. Mea nwhile, international stan­ dardization committees endeavor tofo rmulate sensible proposals fo r multicultural contexts.

To day, when we access information stored in com­ the languages that we use. A proposal to change puters, we often ask the computer to present us ordering habits is a bit like proposing a spell ing lists of items arranged in an order that is meaningful reform. Everyone is in favor of simplification as to us and easy to use. In the future, will the com­ long as it applies to other groups of people, but we puter render obsolete the lists of words and names see no reason to change things fo r ourselves. In ordered fo r human reference' Wi ll the computer fa ct, looking back to the roots of our own culture, look up all information in our place? Will we no we fincl many good reasons why things are as they longer need the skills to find our way around in dic­ are today, so a change is seldom perceived as an tionaries, telephone directories, and the like" These improvement. things are not impossible, but we ourselves might The conclusion is, fo r the time being, that we not live to see them happen. may as well use the computer to help us organize If ordering fo r human consumption is to stay lists and to take into account that the task of order­ around fo r a while, then the next question that we ing lists is not universally the same. might ask is whether or not it would be possible This paper explores the issues involved with to harmonize the ways in which lists are ordered ordering and the ways the computer can deal around the world. Most people are aware that alpha­ with them. It describes how people order words betic order may diffe r from one country to another. and names, and consequently, how they expect The same is true fo r scripts that are not based on words and names to be ordered if a computer does an alphabet: although the Chinese Han characters the ordering. It presents examples of ordering in are used to write Japanese and Korean, lists with various cultures. This paper concentrates on the

Han characters are not in the same order in the ordering of words and names; it does not include a People's Republ ic of China, Japan, Korea, and discussion of numerical ordering. Ta iwan, Republic of China. Can we change to a universal ordering system Wo rds, Names, and Character Strings or at least make ordering the same where the Computers store words and names as character same script is being used' If the order of words strings. The symbols that we use fo r writing are were the same, life wo uld surely be easier for the mapped to bit patterns in computers, and these pat­ traveler! Unfortunately (ifthe rt"ader permits that terns are chained together. For pragmatic reasons, expression), the way in which we work with the bit patterns do not correspond to graphic sym­ ordered lists is a cultural as1Kct and is related to bols in a simple one-to-onl" fashion. Attributes such

Digital Te chnical jo unltll l·bl. 5 No. .i Summer I'J'J.> 43 Product Internationalization

as the fo nt in which the symbol is presented ami the When people order words or names or when size of the symbol are usually stored in separate they are looking fo r them in an ordered I ist, they an:as, and the bit pattern fo r the specific charactn often usc (unconsciously sometimes) the meanings thar represents the symbol re mains the same. Also, of these words or some other knowledge about the several characters or bit paltcrns can sometimes he words or names. Fo r example, when looking fo r the represented by the same graphic symboL For exam­ name :Hdli/lcm in a tele phone directory, they ple, the characters LATIN CAPITAL LETTER A ami might try to fiml it between MacLeod and (;REI'K CAPITAL lETTER ALPHA can be renden:d with LHacNeui/le, knowing that /He is the same as Mac. the same graphic symbol A. Finally, the chaining of They might even look between Melbourne and characters to strings may not completely agree with Murph]\ ignoring the of McMillan altogether. If the visual arrangement of corresponding graphic the computer has only a character string that repre­ symbols. sents the letters of the name Mc,llillan, then it lacks In other words, there arc differences berween the knowledge to look up the name any other way. how people order words and names and how com­ Lexical ordering cannot incorporate expanding or puters order the corresponding character strings. ignoring prefixes and abbreviations; there is no lex­ People com hi ne knov.·lnlgc about words and ical rule to determine what part of the character names (for example, ho\V to read and pronounce string might be a prefix or an abbreviation. them) with \'isual aspect s of the writ ten or printed As another example, in Japanese many Han words and names. Computers must work with the characters (called kanji by the Japanese) are pro­ bit patterns. nounced in a different way depending on the

\Vi th regard to character coding, the Internct­ context. Japanese dictionaries fo r general use are

t iona/ Sta11dard ISO/ILC 10646-J: 1993. Un ioersa/ ordered by pronunciation; therefore, if the com­ Multiple-Octet (:o ded Character Set. and the de puter has only the kanji character in the character facto standard. Unicode version 1.1, are considered string, it cannot order or look up in the same way state of the art. These two coding methods can con­ as people do in Japan. The character for rice, for veniently be considered as identical, and the same example, is pronounced rnai in a fo rm such as abbn.:viation, lJCS, refe rs to both of them . With liCS nuti (imported rice), but as IJei in a fo rm such as coding, words and names can be stored in many {Jeilw (America). The diffe rence is due to the his­ of the scripts of the world, and Chinese Han clurac­ torical background of the character or when, in rers can be chained toget her with Latin, Greek , irs specific context, it was borrowed from the Cyrillic, Hebrew, and Arabic letters and many Chinese. When kanji are used in proper names, more. such as names of persons and geographical names, Before discussing the complexities of UCS there may be no context information, and human coding, this paper explores some imporrant intervention might be needed to know the correct aspects of ordering of character stri ngs in the next pronunciation. section. In these cases. since the computer must mimic how people order and is limited to lexical tech­ Lexical Ordering niques, more than codes for the letters or for With lexical ordering, the computer takes into the kanji must be stored in the character strings. account only the kinds of characters that appear in For exa mple, the computer might have a character the strings and the arrangement of these characters. string that contains a kanji character plus its Apart from the ordering algo rithm and the associ­ pronunciation represented with kana characters. ated data, the computer uses no orher knowledge Or the co mputer might have strings such as th:1t it might have about the words in the character (Mc)Millan with the convention that the parenthe­ strings. For ex ample, it does not use an electron ic ses indicate parts to be ignored fo r ordering and dictionary or les about 11:1tura l language syntax. searching. phonetics. and semantics. The idea is to sec how Modern dictionaries and telephone directories computers can work with rcason:�bly efficient tech­ use lex ical techniques as much as possible, which is niques, while staying close to how people work. better in a multicultural environment. It is much Meaning-based ordering and searching with the easier to understand and apply lexical rules for computer is an interesting subject in itself, but is searching than to acquire intuitive knowledge of too broad a scope for this paper. an unfamiliar culture.

-i-1 11>1. 'i No. 3 Sl/11/lller 1993 Digital Te ciJIIical ]o ur11al Th e O rdering of Uniuersal Uwracter Strings

Wo rds, No t Individual Letters Single-step or One-level Compare It is important to understand that people order The single-step compare or one-level ordering algo­ words and names, not just the individual letters and rithm is known by most readers: symbols. Consequently, good-quality lexical order­ Compare the first characters of the two strings; if ing that comes close to how people work cannot be equal, then compare the second characters; con­ achieved by looking at all the characters in a string tinue until a difference is f(>tmd or until at least one only once, from the first one through the last one. string is exhausted. If a difference is fo und, then the This concept can best he illustrated with alphabetic character-col lating sequence determines which scripts, and some English examples are given below. string precedes the other. (Example: words pre­ W11en one looks fo r SOS in a modern English dic­ cedes working because d precedes k. ) If one of the tionary, one expects to see it between sort and two strings is exhausted, then the shorter string soul. Now, to find SUS between sort and soul, one precedes. (Example: word precedes wo rds. ) If both must ignore that SOS is in uppercase letters and sort strings are exhausted , then they are considered and soul are in lowercase. This type of lookup is equal. achievable by looking at all the letters once. Now consider the abbreviation CAT, meaning Multiple-step or Multilevel Compare clear air turbulence. CA l is listed between casual The state-of-the-art computer method fo r compar­ and catalyst. In this case, we cannot ignore the dif­ ing character strings is a generalization of the single­ fe rence between CAT and cat. The dictionary lists step compare. If, after using the above algo rithm both words, ancl some dictionaries consistently list with the first collating sequence, both strings are lowercase words before uppercase words (or vice found to be equal, then in tlw second step the algo­ versa), so the order using lowercase first would be rithm is repeated. Both strings are compared again, casual, cat, CA T, cata�yst. It is not possible to devise starting from their first characters, now using the an algorithm or method that would arrange these second col lating sequence. The second step may fo ur words in the correct order by looking at all the be fo llowed by a third srer and so on, one step fo r letters once. To guarantee the correct order in all each co llating sequence. cases, a first step is needed in which uppercase is To be precise, the one collating sequence of all considered equal to lowercase: the two words cat characters is replaced by a matrix of collating and C4 Tmust be placed in the correct order in a sec­ weights and collating weight sequences fo r each ond step, in which uppercase and lowercase make weight (W) column. Consider the fo llowing a diffe rence. example: Dealing with uppercase and lowercase is not the only issue fo r alphabetic ordering. Many languages W1 W2 W3 LATIN CAPITAL LETTER D use letters with diacritical marks such as accents. LATIN SMALL LETTER E Wo rds and names may also contain spaces or spe­ LATIN SMALL LETTER E cial symbols, such as , , and WITH ACUTE LATIN SMALL LETTER E points. Examples arc big bang, !Jest-seller, rock 'n ' WITH GRAVE roll, ami PS. When ordering is strictly alphabetic, as LATIN CAPITAL LETTER E is the case in many dictionaries, then accents on let­ LATIN CAPITAL LETTER E WITH ACUTE ters, spacing, and special symbols are ignored in the LATIN CAPITAL LETTER E first step, but they are taken into account to resolve WITH GRAVE a tie. For example, the correct order in French LA TIN SMALL LETTER F might be denier; denier, dernier; or Nb, NB, N.B., Nd, The collating sequence fo r Wl is , , , n.d., N.IJ. in English. etc. This means that, with the example matrix, al l variants of Latin letter E are equal in the first com­ Ta ble-driven Multilevel Ordering parison step. The collating sequence for W2 is The heart of ordering methods is the comparison of , , , which means that in two character strings. If we have an algorithm to the second step, the accents make a difference, bur determine whether one string should precede, fo l­ there is no distinction bet ween lowercase.: and low, or bt· considered equal to a second string, then uppercase variants. That dist inction is made in the arranging a list of strings in the correct order is third step: the collating sequence fo r W.') is . straight fo rward . .

Digital Te e/mica/ jo unwl 1.-'IJI. .5 No . .> Stntlllll:'r 19'J3 Product Internationalization

The weight matrix and the collating sequences LATIN SMALL LETT ER E can be placed in tables that are used by the ordering

algori thm, hence the name table-driven multilevel HYPHEN-MINUS IGNORE IGNORE IGNORE ordering.

If this example matrix is extended in a similar The IGNORE indicates that the character is way, then the multilevel algorithm wou ld place the skipped in the comparison algorithm in the first fo llowing words (most of which are real French three steps. A collating sequence fo r W4, in which words) in this correct order: de , DlfNil:', denier, precedes all symbols for special characters DENIER, denier, DEVIER, dimie1; dernier. such as , guarantees that words and names The method that is described here is also used without special characters precede the ones with in POSIX (!SO/IEC 9945-2. 2 Sh ell and Ut ilities, exactly the same letters, but with special characters. LC_COI.LATE Definition).' Rolf Gavare was among A fo ur-level ordering such as the one suggested the first to publish a paper on multiple-step com­ here is sufficient fo r a good-quality, complete, and parisons 2 Al ain LaBonte was the first to describe it pred ictable alphabetic ordering with the Latin as explained in this paper, and he also implemented alphabet. it as a Canadian Standard (\. 2243.4.1-1992). LaBonte devisee! a complete and predictable order­ AdditionalLett ers ing method that corn.:sponds to very fine detail For most languages written in Latin characters, the with the best examples of French and English dic­ correct order of words would be senior, seiiorita, tionary ordering. 1 sentimental, sepa rable. To achieve this order, Wl

would be . . . , , , <0>, . . . , and the matrix Generate Co mparison Key would include LATIN SMALL LETTER N \VITH TII.DE, With the multilevel method, it is also possible to where WI is , W2 is , and W3 is . have the algorithm generate a comparison key for In Spanish, the N WITH TILDE is considered a let­ a specific character string rather than always com­ ter to be ordered between N and 0 and the correct pare two strings. These comparison keys can be order is senior, sentimental, senorita, separable. To stored with the character strings; a one-level com­ achieve this type of ordering, Wl would be . . . , , parison of keys then gives the same resu lt as a multi­ , . <0>, . .., and the matrix would add level comparison of the original character strings. LATIN SMALL LETTER N WITH TILDE, where WI is For example, and again extending the example , W2 is , and W3 is . matrix given above, the comparison key for denie could be a convenient numerical representation of Ligatures The multilevel method can also handle ligatures by . allowing each matrix element to be a sequence of The precedes all other weights. Its pres­ weights, rather than one we ight. For /f. in French, ence at the end of the comparison key subfields the matrix would include LATIN SMALL LIGATl RE guarantees that shorter strings precede longer AE, where Wl is , W2 is , and W.) strings. Efficient compression techniques exist fo r is . In these languages, LIGATURE AE is such comparison keys. equivalent to two letters when orderi ng words. In

Norwegian, the /f. is a letter on its own. WI is . . . , Va riations of tbe MultilevelMe thod , , , , . For the The fo llowing section expands upon the multilevel matrix element, LATIN SMALL LIGATURE AE, WI is method and gives examples of changes necessary to , W2 is , and W3 is . accommodate cultural differences in word order.

Sp ecial Sy mbols Some special symbols, sometimes cal led logograms, Wi th a small extension, the multilevel method can can be seen as short notations fo r words: & + %. A also handle special characters such as the hyphen culture-specific ordering may replace such symbols and the to mimic traditional human by the corresponding words. If the language is alphabetic ordering. Another weight column must Engl ish, fo r example, then Research & Development be added to the matrix given above to distinguish can be ordered as Research and Development. As letters from special characters: long as a fixed rule exists fo r replacing symbols by

46 1'<>1. 5 No. J Summer f')'J3 Digital Te clmicaljourual The Ordering of Universal Character Strings

equivalent words, the extension that was intro­ the algorithm start from the end of the strings fo r duced fo r /E can be applied in a similar way to the level that deals with the special symbols. POSIX obtain the desired ordering. On the other hand, if has a small extension to the multilevel method that the replacement word depends on the language gives similar results while still moving fo rward. used in the rest of the string, then lexical ordering This extension adds the position of the symbol to cannot do the job properly without more informa­ its table weight during comparison. tion coded in the character strings. Sp ecial Sy mbols in Combination with Fi ne Tu ning fo r the Accents Up percase and Lowercase Characters The table-driven multilevel method, as explained so This section does not introduce a new extension far, would place French words in this order: cote, but reconsiders the extension fo r the special sym­ cote, cote, cote, mar;on, macon. In a traditional, bols. This method adds a fourth weight column:

correct ordering, they shou ld be in the following LAT IN SMALL order: cote, cote, cote, cote, macon, mar;on. On gen­ LETTER E eral, accents at the end of a French word are more HYPHEN-MINUS IGNORE IGNORE IGNORE important fo r understanding than other accents.) To obtain the desired ordering, another exten­ With W3 fo r uppercase and lowercase and W4 sion of the multiple-step method is needed: fo r the fo r the special characters, the distinctions between second step, the one that discriminates between uppercase and lowercase are considered more quasi- (words that diffe r only in their important than the presence or absence of spacing diacritical marks), the comparison algorithm should and special symbols. In many cultures, this is start from the end of the strings rather than from indeed the case with proper names of people. The the beginning. For the other We stern languages that following order is desired with names that diffe r in use the Latin alphabet, this reverse processing fo r use of uppercase or lowercase letters: deGroot, de the accents is not needed. On the other hand, it Groot, Degroot, De groat, DeGroot, De Groot. does not hinder either, so the French method is For some geographical names, it could be argued acceptable as well. that special symbols are more significant than the French is not the only language with such quasi­ difference between lowercase and uppercase. homographs. In new-Greek, with the modern For example, the desired order is Sanssouci, monotoniko spelling, all multisyllabic words have SANSSOUCI, Sans Souci, SANS SO UCI, Sans-Souci, one accent that indicates the stressed syllable. SA NS-SOUCI. (Sanssouci is a castle near Potsdam in New-Greek has many quasi-homographs, including Germany; Sans Souci is a city in South Carolina, the fo llowing examples, which use a simple tran­ U.S.A , and a suburb of Sydney, Australia; and Sans­ scription of Greek letters to Latin letters: arguros, Souci is a historical place on Haiti.) To obtain this argur6s, diakonia, diakonia, metro, metro, para, order, W3 and W4 must be switched. pm·a. The French method of reverse processing produces acceptable results fo r new-Greek as wel l. Some Problems with the Multilevel Method Fi ne Tu ning fo r the Sp ecial Sy mbols To obtain the correct order, changes are sometimes With the tables extended as explained in the necessary to the multilevel method. This section section Special Symbols, the multiple-step algo­ discusses cases in which it is less easy to adapt the rithm would order words as fo llows: unionized, table-driven multilevel method. union-Lzed, un-ionized. For the exceptional cases such as this one, in which two words are identical Digraphs and Collating Elements except for the placement of a special symbol, the CH and LL have special placement in the Spanish order unionized, un-ionized, union-ized may alphabet. Spanish is not unique in this respect; com­ seem more appropriate. Usual ly, the hyphen is per­ binations of letters also have special placement in ceived as a word break, not on the first level, but on the Albanian, Hungarian, Vietnamese, and We lsh a subsequent level, and with word breaks, shorter alphabets. The We lsh ordering alphabet, for exam­ words always come first. ple, is A B C CH D DD E F FF G NG H I J L LL M N 0 P PH To obtain the latter ordering, one could use the R RH s T TH U w Y, and the fo llowing list of words is same technique as fo r the diacritical marks: have correctly ordered in We lsh: acw, achos, adwy,

Digital Te cbtzicaljo urtzal Vol. 5 No. 3 Summer 1993 47 Product Internationalization

addas, agwedd, angau, almon, allan, anfv ny ch, language has several DUP characters, the weights anf/(Jdus, antw; anthem. fo r which depend on the context. For first-level Before the multilevel method can he applit:d, it ordering, a DUP character in a Japanese word or is necessary to replace the multiple-character name can be considered equivalent to the character combinations by pseudo-characters. In POSIX that precedes it. Hence, if X represents a Japanese LC_COLLATE, such a mechanism is foreseen. One character, then X fo llowed by DUP is equivalent to X can declare combinations such as LAT IN SMALL fo llowed by X in the first comparison step. Tie LE'fTER C fo llowed by LATI N SMALL LETI'ER H to be breaking is done in a subsequent step: X DUP then collating elements and give them a name that can precedes X X. If collating-element definitions are he used in the matrix. used, definitions fo r all possible combinations are At first it would seem that this solves the prob­ required. lem. One complication, however, is that the two let­ ters together do not always represent the special Added Complexity with UCS Coding alphabet letter. In Welsh, for example, the N and G The concepts discussed in this section have existed are separate letters in the We lsh words melyngoch, in other coded character sets fo r some time. For dangos, gw,vn{

48 Vu /. 5 No. 3 Summer 19')3 Digital 1ecbnical journal The Ordering of Universal Character Strings

A popular ordering is by radical first, then by For our extended matrix method, not only thou­ number of keystrokes, and finally by Chinese pro­ sands, but an unlimited number of collating ele­ nunciation. With this ordering, tera comes first (it ments would have to be defined. UCS allows any has only six strokes), and kata precedes shiro number of combining characters to fo llow a non­ because of the Chinese pronunciation. If this were combining character. the one and only way of ordering Han characters, Logical Order and Coding Order then the computer would not need to know about the radicals, pen strokes, etc. Each Han character With UCS coding, the order of the characters in a has a different code (bit pattern), so a single (but string is the logical or reading order, not the order long) col lation order fo r the corresponding codes in which the symbols have to be printed or dis­ would be sufficient. played. Hence, UCS encoded text is difficult to dis­ Significantly, each dictionary of Han characters play and print, but relatively easy to be processed, has developed its own tradition fo r ordering. e.g., fo r ordering. Depending on the application, audience, school, or In Thai, unfo rtunately, this approach was not political considerations, the preferred ordering implemented totally. The vowels and diacritics that may be diffe rent. For example, the onyomi order­ appear above or under a consonant are coded in ing is also in popular use in Japan. It is by Chinese logical (reading) order, but Thai has five so-called pronunciation first, then by stroke count. With pre-positioned vowels that are written and codeu onyomi ordering, kata comes first, then tera, and before the consonant after which they have to be shiro is the last one. pronounced. This corresponds to current comput­ Han characters are always ordered character by ing practices in Thailand and was incorporated in character, so the multilevel method that applies UCS coding as a sort of backward compatibility. For multiple weights in multiple steps involving com­ example, the word written and encoded as E + CH + plete strings is not required. Han characters require N (ignoring vowel shortener and tone mark) is pro­ multiple weights with a specific combination that nounced and ordered accordingly. To allow is dynamically selected fo r a single-step ordering. correct ordering fo r ucs-encoded Thai, some pre­ It is not evident how this dynamic single step can processing is necessary to arrange the Thai vowels be combined with the standard multiple-step in the correct position fo r the ordering step. method, which is needed fo r UCS strings containing Fo rmatting Ch aracters Han characters mixed with other ones. Many coded character sets contain characters that do not correspond to some written symbol but Combining Characters have some control fu nction, often fo r output fo r­ UCS also contains the concept of combining charac­ matting. For ordering, these fo rmatting characters ters. In the example matrices given above, it was can usually be hand led in the same ways as special assumed that letters with accents such as LATIN characters. SMALL LETIER E WITH ACUTE are coded as one char­ The characters ZERO WIDTH JOINER and ZERO acter. UCS indeed has such one-character codings, WIDTH NON-JOINER are among the ucs fo rmatting but it allows a letter with an accent to be coded as characters. Their primary purpose is to influence two characters as well. The sequence of two char­ the display of characters of a cursive script such as acters LATIN SMALL LETIER E fo llowed by COMBIN­ Arabic. Befo re UCS was finalized. some people sug­ ING ACUTE is also valid in UCS. gested that ZERO WIDTH NON-JOINER might be used UCS does not state that LATIN SMALL LE1TER to indicate the absence of special digraphs such as E WITH ACUTE is the same as LATIN Si'vlALL LETTER in the Welsh word melyngoch. It has also been pro­ E fo llowed by COMBINING ACUTE; it leaves it to posed that ZERO WIDTH JOINER might be used to applications to consider them equivalent or not. create new letters such as unusual or newly Needless to say, many application developers will invented ligatures. To day, this is no longer consid­ want users to have the possibility of considering ered a valid use of these fo rmatting characters. both fo rms equivalent, at least fo r ordering. The notion of equivalence becomes quite intri­ To ward a Fo rmal Description cate with two or more diacritical marks. Sec the of Ordering paper on Unicode in this issue fo r a discussion on Excellence fo r computer applications means not transformations between equivalent spellings. ' only that the application incorporate a different

Di�ilal Te chnical journal Vu /. 5 No. 3 Srmrmer I'J'J, ) 49 Product Internationalization

way of ordering fo r each culture, but also that it manual, and fo rmal definitions about ordering can give fn.:edom to its users to define variations and be built upon its content. use differentappr oaches to ordering. This is impor­ tant fo r some cultures. �ot so long ago, the usc of Preprocessing multiple letter fo nts was considered specialized Preprocessing a character string, transforming it work fo r professional printers; today every word into text dements or linguistic units in a logical processor must allow it. Flexibility with regard sequence, is a second concept that deserves elabo­ to ordering may also become commonplace a few ration. It was mentioned in relation to Thai with its ye:us from now. But how can such flexibility be pre-positioned vowels in a preceding section. provided in a computer-digestible yet user-friendly Breaking down a string into the smallest units to way? be processed by an ordering algorithm and arrang­ Many documents describe ordering in an infor­ ing these units in the desired processing order is mal way National standards on ordering are seldom a powerfu l mechanism. It could also be used to fo rmal definitions. They contain directives sucl1 as detect col.lating elements, to replace Japanese DUP each unbroken sequence of digits, disregarding characters, or to transform character sequences , spaces, and stops is considered as one that contain combining characters. This mecha­ character; or multiple hyphens col late as one; or ij nism wo uld then allow the table-driven multilevel is ordered as i + j; or f3 = . Such directives are method to be used to its fu ll extent on prepro­ vague fo r computers. They are imprecise: Is the cessed strings. hyphen to be understood as the character HYPH EN­ Preprocessing might change the character string: MINUS only, or also as related, but distinct clurac­ units are rearranged, characters are replaced by ters in UCS coding such as HYPHEN, MINUS SIGN, and other ones, etc. It is possible that two originally dif­ others' They are also incomplete: ij is ordered, but fe rent character strings could be preprocessed to not 1}, lj, and if They use graphic symbols, where an identical intermediate fo rm. If ordering is to be the computer wants to know things about clurac­ complete and predictable, preprocessing must gen­ ters: Does f3 stand fo r LATIN SMALL LETI"ER SHARP S erate additional tags that are taken into account by or for GREEK SMALLLET TER BETA? the multilevel method. On the other hand, the descriptions fo r POSIX Consequently, the output of the preprocessing LC_COLLArE are quite fo rmal. They are more or less phase might be more than pieces of character bound to a specific implementation, in this case the strings. The lines used in the matrices fo r the multi­ table-driven multilevel method described above. level method have (names of) characters as labels. If A more simple fo rmal description is sometimes sut� preprocessing were designed to generate an output ficient. For example, if the data to be ordered is that is easier to consume by the multilevel method, filtered and contains only uppercase Latin letters, the labels could be anything that seems suitable. then the POSIX syntax may seem an overkill. In The problem, again, is how to al low fo r the speci­ other cases, the LC_COLLATE formalism lacks fication of preprocessing in a fo rmal yet user­ expressive power, as we have seen. friendly way. Transformations based on regular Is it possible to design a formal specification expressions and finite state machines are a possible method that falls between the descriptive texts in path. These techniques allow an efficient implemen­ country standards and the almost algorithmic tation. P ). Plauger has published material about parameters such as POSIX LOCALEs' using them for ordering with the C language.,(, JSO/IEC 10646-1 :1993 may provide a first step to build fo rmal definitions. It is the most comprehen­ Conclusions sive repertoire of characters to date and a strict The evolution of computer systems is progressing superset of many earlier repertoires and coded toward a better quality interaction with people. An character sets. Moreover, it establishes a unique and aspect of that interaction is the ordering of words authoritative naming fo r characters. This paper and names. Efficient methods exist today fo r obtain­ uses character names such as 'IN CAPITAL LE'JTER ing a quality ordering. Although some software E WlTH ACUTE. ISO has decided that the 10646 uses these methods, many applications perform names will be used in all future character set stan­ computer-friendly ordering rather than human­ dards and standard updates. In a certain sense, friendly ordering. There is no technical limitation ISO/IEC 10646-1 : 1993 is a character reference to improve on that aspect; fo r example, a multilevel

50 Vo l. 5 No. 3 Su111111er 1993 Digital Te clmicaf journal The Ordering of Universal Character Strings

algorithm with user-specified tables can replace a things written in this paper. He has on many occa­ single-step bit-code ordering. sions encouraged me to continue with my explo­ For some cultures and in multicultural environ­ rations of ordering. I also owe thanks to Johan van ments, not all ordering problems are solved. Wingen, independent consu ltant in Leiden, the Research is needed, as well as fo rmal rules to allow Netherlands, who has gathered and made available users to specifyordering preferences. much background information on coded character Some useful ordering techniques are in place. sets and ordering practices. A special word of The table-driven multilevel method is an important thanks goes to Kevin P. Donne.l ly, to Denis Garneau, one. Preprocessing can solve some problems, but a and to P.). Plauger fo r reviewing this paper and fo r convenient fo rmalism is needed to specify it. UCS providing many usefu l comments and suggestions. coding provides many new challenges; but at Of the many colleagues in Digital who have the same time it offers a new fixed point, from helped me, I want to especially mention Masahiro which it may be possible to derive user-friendly fo r­ Morozumi of International Systems Engineering in mal definitions. Japan, with whom I could exchange many mails about ordering in Japanese and about Digital's Appendix: implementation of XPG4. I also want to mention InternationalSt andardization Efforts Tim Greenwood of lnternational Systems Engineer­ Many countries have developed a standard on order­ ing in the U.S., who has done a lot of coordination ing. These standards are not listed in this section. work for this issue of the Digital Te chnicaljo urnal, lSO/lEC JTCl/SC22/WG 15 (Programming Lan­ and fo r my contribution to it in particular. And I guages) is the committee and work group that is know that I'm doing an injustice by not naming the discussing the POSIX work (ISO 9945). many other colleagues who contributed. ISO/IEC JTC1/SC22/WG20 (Internationalization) is working on a Technical Report that will provide a References framework for internationalization. The work I. X/Open CAE Sp ecification, System Interface Def­ group is also preparing documents on the registry initions, Issue 4, X/Open Doc N C204 (London: of cultural elements, specification methods for X/Open Company Limited, 1992). defining string comparison, and a default-tailorable ordering fo r 10646. 2. R. Gavare, "Alphabetical Ordering in a Lexicolog­ CEN (European Standardization Committee) ical Perspective,'' Data Linguistica 18 (Almquist BTS7 (Technical Bureau on IT)/TC304 (Character Set & Wiksell, 1988). Te chnology) has a project on European character 3. A. LaBonte, Regles classement alphabetique string ordering rules. The scope is to establish pro­ en langue fra nfaise et procedure informatisee cedures for the registration of national and regional pour le tri (Ministere des Communications du ordering rules and to prepare multilingual charac­ Quebec, 1988). ter ordering rules fo r European scripts (Latin, Greek, and Cyrillic). 4. J Bettels and F. Bishop, "Un icode: A Universal ISO TC37/SC2,1\'VG 2 is currently working on multi­ Character Code," Digital Te chnical jo urnal, val. lingual ordering fo r terminological and lexico­ 5, no. 3 (Summer 1993, this issue): 21-31. graphical purposes. ISO TC46/SC9 has similar work 5. P. Plauger, "Translating Multibyte Characters," but fo r bibliographical pur poses. The approach is Th e jo urnal of C Language Tra nslation (June application oriented, whereas the other ISO and 1991). CEN efforts mentioned above are computer­ oriented approaches. 6. P. Plauger, The Standard C Library (Englewood To allow for some level of synchronization of Cliffs, NJ : Prentice Hal l, 1992). these efforts and to avoid overlaps, liaisons have been established between all these committees. General References G. Adams, "Introduction to Unicode," Proceedings

Acknowledgments of the Fo urth Unicode !mplementors Wo rkshop (Mountain View, CA: Unicode Consorti um, 1992). Alain LaBonte of the Gouvernement du Quebec, Direction Gcnerale des Te chnologies de B. Comrie, ed., The Wo rld's Major Languages !'Info rmation, has been the inspiration fo r many (Oxford: Oxford University Press, 1990).

Digital Te chnical Jo urual Vo l. 5 No. .) SI/111/JH'r 19'J3 51 Product Internationalization

F Coulmas, The Writing Systems of the Wo rld A. LaBonte, "Mu ltiscript Ordering fo r Unicode," (Oxford: Basil Blackwell, 1989). Proceedings of the Fo urth Unicode Implementors Wo rkshop (Mountain View, CA: Unicode Consor­ ). DeFrancis, The (Honolulu: tium, 1992). niversity of Hawaii Press, 1984). A. Nakanishi, Writing Systems of the Wo rld (Rut­ D. Garneau, Keys to Sort and Search fo r Culturally land, VT, and To kyo: Charles E. Tuttle, Co., 1980). E\pected Results (Ontario: 1!3M National Languagt: Te chnical Center, 1990). G. Sampson. Writing Systems (Hutchinson, 1985).

S. ]ones et al., Developing International User Info r­ STRI TS73, Nordic Cultural Requirements on Info r­ mation (Burlington, MA: Digital Press, 1992). mation Te chnology (Reykjavik: Idntaeknistofnum Islands, 1992). K. Katzner, The Languages of the Wo rld (London: U. Warotamasikkhadit and D. Londe, Computerized Routledge. 1989). Alphabetiza tion of Th ai, Technical Memo TM-- C. Kennelly. Digital Guide to Developing Interna­ 1000/000/01 (Santa Monica, CA: System Develop­ tional Software (Uurl ington, i'vlA: Digital Press. 1991 ) ment Corp., 1969).

E. Kohl, "The Art of Arranging Files," ISO Bulletin Unicode 1.0. 1, Report from the Unicode Consor­ (December 1986). tium (Mountain View, CA: Unicode, Inc., 1992).

52 Vo l. 5 No ..3 Summer 1993 Digital Te chtticaljo urnal Gayn B. Winters I

International Distributed Sy stems-Architectural and Practical Issues

Building distributed systems fo r international usage requires addressing many architectural and practical issues. Key to the efficientconstr uction of such systems, modularity in sys tems and in run-time libraries allows greater reuse of compo­ nents and thus permits incremental improvements to multilingual sys tems. Us ing safe software practices, such as banishing the use of literals and parameteri-::ing user preferences, can help minimize the costs associated with localization, reengi­ neering, maintenance, and design.

The worldwide deployment of computer systems network. The client software, e.g., a mail client and has generated the need to support multiple lan­ the local windowing system, could be completely guages, scripts, and character sets simultaneously. monolingual. Networking, database, and printing A system should fo cus on natural ease of use and services, for instance, should be multilingual in that thus allow end users to read system messages in the they support the various end users by providing language of their choice, to have natural mem1s, services independent of the natural languages, fo rms, prompts, etc., and to enter and display data scripts, or character sets used. in their preferred presentation fo rm. This paper surveys many of the architectural ami Digital envisions a computer system that not practical issues involved in the efficient construc­ only is distributed but is distributed geographically tion of international distributed systems. We begin across the world. A single site may have end users by discussing some economic issues and pitfalls with varying language and cultural preferences. For related to localization and reengineering. Many of l:xampk, a Japanese hank in Tel Aviv may have these topics can be addressed by straightforward employees whose native languages are Arabic, good engineering practices, and we explore several English, Hebrew, Japanese. or Russian, and may important techniques. The structure of application­ conduct business in one or several of these lan­ specific and system-level run-time libraries (RTLs) guages. Figure l could represent a portion of their is a key issue. We therefore devote several sections

ARABIC USER JAPANESE USER MULTILINGUAL USER

WIDE AREA NETWORK

MULTILINGUAL DATA MULTILINGUAL DATA

Figure 1 A Portion of a Multilingual Ne twork

Digita/ 7ecbnical jo urnal H>l. 5 .. 3 Sull/1111!1' I'J93 53 Product Internationalization

of this paper to preferred RTL structures, data rep­ Safe SoftwarePracti ces resentations, and key RTL services. Distributed Many well-known, straightforward programming systems cause some special problems, which we practices, ifadopted. can dramatically reduce rccngi­ briefly discuss, commenting on naming, sccurity, necring efforts. 1-7 Even fo r existing systems, the cost management, and configuration. In particular, a of incrementally rewriting software to incorporate desire fo r client software designed fo r monolingual some of tlwsc practices is often more than recov­ distributed systems to work without change in a ered in lower maintenance and reengineering multilingual distributed system led to a new system costs. This section discusses a few key practices. model. In the model, the host servers and the sys­ Probably the most fundamental and elementary tem management provide the interfaces and con­ safe software practice is to banis!J literals, i.e., versions necessary fo r these clients to interface strings, characters, and numbers, from the code. with the multilingual world. Finally, we observe Applying this practice does not simply redefine YES that all the preceding techniques can be delivered to be "yes" or THREE to be the integer 3. Rather, this incrementally with respect to both increasing func­ practice yields meaningful names, fo r instance, tionality and lowering engineering cost. affirmative_response and maximum_alternatives, to help anyone who is trying to understand how Localization andReengi neering the code functions. Thus, not only does the prac­ When a system component is productized fo r some tice make the code more maintainable, but it local market, the process of making it competitive also makes it easier to parameterize or generalize and totally acceptable to that market is cal led local­ the data representation, the user interface prefer­ ization. During this process, changes in the design ences, and the fu nctionality in ways the original and structure of the product may be required . programmer may have missed. These definitions These changes are called reengineering. For exam­ can be gathered into separate declaration files, mes­ ple, U.S. automobiles whose steering linkages, sage catalogs, resource files, or other databases to engine placement, console, etc., were not designed provide flexibility in supporting clients of different to allow the choice of left- or right-hand steering languages. were not competitive in Japan. Rcengineering The abstraction of I iterals extends to many data these automobiles fo r right-hand steering was pro­ types. In general, it is best to use op aque data ty pes hibitively expensive, so manufacturers had to to encapsulate objects such as typed numbers (e.g., redesign later models. money and weight), strings, elate and time of clay, Computer systems have problems similar to the graphics, image, audio, video, and handwriting. automobile left-hand-right-hand steering problem. Providing methods or subroutines fo r data type A good architecture and design is necessary to manipulation conceals from the application how avoid expensive reengineering during localization. these data types are manipulated. The use of poly­ The fo llowing are examples of areas in which a morphism can serve to overload common method localization effort may encounter problems: user­ and operation names l.ike create, print, and delete. defined characters and ligatures; geometry prefer­ Support fo r multiple presentation fo rms fo r each ences, such as vertical or right-to-left writing data type should allow additional ones to be added direction, screen layout, and page size; and policy easily. These presentation fo rms are typically differences, such as meeting protocols and required strings or images that are fo rmatted according paper trails. Building limiting assumptions into a to end-user preferences. Both input and output software or hardware product can often lead to should be factored first into transformations costly reengineering efforts and regional time­ between the data type and the presentation fo rm, to-market delays. and then into input and output on the presentation On the other hand, an internal survey of reengi­ fo rm. For example, to input a elate involves neering problems associated with Digital's software inputting and parsing a string that represents a pre­ indicates that simple, easy-to-avoid problems are sentation fo rm of the elate, e.g., "17 janvier 1977," strikingly frequent. In fact, it is amazing how many and computing a value whose data type is Date. ways a U.S. engineer could find to make use of the The concepts of character and of how a character (ultimately erroneous) assumption that one charac­ is encoded inside a computer vary dramatically ter fits into one 8-bit (or even more constrictive, worlclwide.u-11 In aclclition, a process that works one 7-bit) byte' with a single character in one language may need to

54 Vu l. 5 Nu. 3 Sttlltllter I'J' )J Digital Te cbnicaljournal International Distributed Sy stems-Architectural and Practical Issues

work with multiple characters in another language. be ve11' difficult to reengineer. Thus, policy descrip­ One simple rule can prevent the problems that this tions should be placed into an external script variation can cause: Banish the Character data or database. The advent of workflow controllers, typ e from applications, ami use an opaque string such as those in Digital's EARS, ECHO, and data type instead. This rule eliminates the tempting Te am Route products, makes it easy to do this. practice of making pervasive use of how a charac­ Applications should not put date fo rmatting, ter is stored and used in the programmer's native sorting, display, or input routines into their main­ system. The Array of Character data type is nearly as line code. Often such operations have been coded insidious, because it is tempting to use the ith ele­ previously, and a new application's code will prob­ ment fo r something that will not make sense in ably not be international and may well contain another natural language. One should only extract other bugs. Therefore, programmers should con­ substrings s[i.j] from a string s. Thus, when in a struct applications to use, or more precise�y reuse, given language the object being extracted is only run-time libraries, thus investing in the quality and one code point s[i:i], the extraction is obviously a the multilingual and multicultural capabilities of special case. The section Text Elements and Text these RTLs. When the underlying system is not rich Operations discusses this concept fu rther. enough and/or competition dictates, the existing Another safe software practice is to parameter­ RTL structures must be augmented. ize preferences, or better yet, to attach them to the data objects. As discussed previously, a Run-time LibrarySt ructure "hardwired" preference such as writing direction A common theme for internationalizing software invariably becomes a reengineering problem. The and for the safe programming practices discussed language represented by the string, the encoding in the previous section is to keep the main applica­ type, the presentation to rm of the object, and the tion code independent of all natural language, input method fo r the object are all preferences. In script, or character set dependencies. In particular, servers and in all kinds of databases, tagging the the code must use only RTLs with universal applica­ data with its encoding type is desirable. In general, tion programming interfaces (APis), i.e., the name the data type of the object should contain the pref of the routine and its fo rmal parameter list must erence attributes. The client that processes the accommodate all such variants. Digital's early local­ object can override the preferences. ization efforts typically made the mistake of replac­ Geometry preferences should be user selectable. ing the us.-only code with code that called RTLs Some geometry preferences affe ct the user's work­ specific to the local market. This practice gener­ ing environment, e.g., the ways in which dialog ated multiple versions of the same product, each of boxes work, windows ancl pop-up menus cascade, which needed to be changed whenever the perti­ and elevator bars work.1 These preferences are nent part of the U.S. version was changed. A better almost always determined by the end user's work­ structure fo r run-time libraries is shown in Figure 2. ing language. Other geometry preferences relate to The application illustrated in Figure 2 calls an the data on which the user is working, e.g., paper RTL routine through the routine's universal AP!s. size, vertical versus horizontal writing (for some This routine may in turn call another language­ Asian languages), how pages are oriented in a book, specific routine or method, or it may be table driven. layouts fo r tables of contents, and labels on graphs. For example, a sort routine may be implemented Computer programs, in particular groupware applications, mi.."X policy with processing. "Policy" refers to the sequence or order of processing activi­ APPLICATION ties. For example, in a meeting scheduler, can any­ one call a meeting or must the manager be notified VARIOUS RUN-TIME LIBRARIES WITH UNIVERSAL APPLICATION PROGRAMMING INTERFACES first? Is an invoice a request fo r payment or is it the administrative equivalent of delivered goods requir­ LANGUAGE 1 LANGUAGE 2 LANGUAGE N ing another document to instigate payment' Often RUN-TIME RUN-TIME ... RUN-TIME such pol icy issues are not logically forced by the LIBRARY LIBRARY LIBRARY computation, but they need to be enfo rced in cer­ tain business cultures. A sequence of processing activities that is "'hardwired" into the program can Figure 2 Modular Run-time Librmy Structure

.) Digit ttl Te cbnicttl jo urnal Vu /. 5 No. S/11111//er 19')3 55 Product Internationalization

using sort keys rather than compare functions fo r ligature such as CE may be repn:sc:nted as three: better performance. With this structure, localiza­ code:points <0> , where a "joint:r'' is tion to a new language involves only the addition of a spt:cial codepoint reserved fo r creating text cle­ the new language-specific RTL or the correspond­ ments. Less complc:x text elemt:nts, i.e., subcom­ ing new table entries. ponents of the encoded text elements, are found Note that the application must pass sufficient by using the code point and the operation name structure to the RTL to guarantee that the APis are to index into some database that contains this universal. For example, to sort a list of strings, a call information. For example, if is a single code point fo r e, then the base character e is fo und by sort_algorithm(list_pointer,sort_name,sorl_order) applying some function or table lookup to the code could be created. The sort_order parameter is of point . The same is true fo r finding a code point the type {ascending,dcscending). The sort_name fo r the acute accent. When a sequence of code parameter is necessary because in many cultu res points represents a text element, the precise term numerous methods of sorting are standard LI2 In "encoded text element" is often abbreviated as some RTL designs, notably those specified by "text element." X/Open Company Ltd., these extra parameters are An encoded character set of particular impor­ passed as global variahlcs.'·r' 7 This technique has the tance is Unicode, which addresses the encoding of advantage of simplifying the APis and making them most of the world's scripts using integers from 0 to almost identical to the APis fo r the l .S. code. Such 2 16-1.11.17 The Unicode universal character set is the RTLs, however, do not tend to be thread-safe and basis of ISO 10646, which will extend the code have other problems in a distributed environ­ point interval to 2°1-1 (without using the high­ ment.0·1l·11 An alternative.: and far more flexible order bit) 9 Unicode has a rich set of joiner code mechanism is more object orientcd-using a sub­ points, and it fo rmalizes the construction of many type of the List of String data type when alternate types of text elements as sequences of code points. sorts are meaningful. This subtype has the addi­ Processing text clements that are represented as tional information (e.g., sort_name and sort_order) sequences of code points usually requires a three­ used by its Sort method.12.1 :. step process: (l) the original text is parsed into The next three sections discuss the organization operation-specific text elements, (2) these text ele­ and extensibility of RTI.s with this structure. ments are assigned values of some type, and (3) the operation is performed on the resulting sequence Data Representation of values. Note that each step depends on the text Data representation in RTLs incorporates text operation. In particular, a run-time library must elements and text operations, user-defined text have a wide variety of parsing capabilities. The elements, and document interchange fo rmats. following discussion of rendering, sorting, and substring searching operations demonstrates this Te xt Elements and Te xt Op erations need. A text element is a component of a written script In rendering, the text must be parsed into text that is a unit of processing fo r some text operation, elements that correspond to glyphs in some fo nt such as sorting, rendering, and substring search. database. The values assigned to these text ele­ Sequences of charactcrs, digraphs, conjunct conso­ ments are indexes into this database. The rendering nants, ligatures, , words, phrases, and sen­ operation itself gets additional data from a fo nt tences are examples of common text elements. 10 '' server as it renders the text onto a logical page. An encoded character set E rcpresents some partic­ The sorting operation is more complicated ular set of text elements as integers (code points). because it involves a list of strings and multiple Typically, the range of E is extended so that code steps. A step in most sorting algorithms involves the points can represent not only £ext elements in mul­ assignment of collation values (typically integers) tiple scripts but also abstractions that may or may to various text elements in each string. The parsing not be part of a script, such as printing control step has to take into account not only that multiple codes ancl asynchronous communication cocles.1G code points may represent one character but also More complex text elements can be represented as that some languages (Spanish, fo r example) treat sequences of code points. For example, 0 may multiple characters as one, for the purposes of sort­ be represented by two code points <'>, and a ing. Thus, a sorting step parses each string into text

56 lk>l. 5 No. j Summer 1993 Digital Te chnicaljourual International Distributed Sy stems-Arcbitectural and Practical Issues

elements appropriate fo r the sort, assigns co llation encoded character set in use, needs to be extended values to these elemenrs, and then sorts the result­ so that a sequence of E's code points defines the ing sequences of values. Note that the parsing step desired user-defined text clement. The issues that takes place in a sorting operation is somewhat related to this problem are ones of registration different from the one that occurs in a rendering to prevenr one user's extensions from conflicting operation, because the sort parse must sometimes with another user's extensions and to allow data group into one text element several characters, interchange. each of which has a separate glyph. The second, more difficult problem concerns the Searching a string s fo r a substring that matches extensions of the text operations required to manip­ a given string s' involves different degrees of ulate the new text element. for each such text oper­ complexity depending on the definition of the ation, the parsing, value mapping, and operational term "matches." The trivial case is when "matches" steps discussed earlier must be extended to operate means that the substring of s equals s' as an encoded on strings that involve the additional code points of suhstring. In this case, the parse only returns code E. When tables or databases define these steps, the points, and the values assigned are the code point extensions are tedious but often straightforward. values. When the definition of "matches" is weaker Careful design of the steps can greatly simplify their than equality, the situation is more complicated. extensions. In some cases, new algorithms are For example, when "matches" is "equal after upper­ required fo r the extension. To the extent that these casing," then the parsing step is the same one as fo r tables, databases, or algorithms are shared, the uppercasing and the values are the code points of extensions must be registered and shared across the the uppercased strings. (Note that uppercasing has system. two subtle points. The code point for a German sharp s, , actually becomes two code points Document Interchange Fo rmats . Thus, sometimes the values assigned to the text elements resulting from the parse consist Compound documents (i.e., documents that con­ of more code points than in the original string. In tain data types other than text) use encoded charac­ addition, this substring match involves regional ter sets to encode simple text. Although many new preferences, fo r example, uppercasing a French e is document interchange formats (DIFs) wil I probably E in france and l in Canada.) The situation is similar use Unicode exclusively (as docs Go Computer when "matches" equals "equal after removing all Corporation's internal fo rmat fo r text), existing for­ accents or similar rendering marks." A more com­ mats should treat Unicode as merely another plex case would be when s' is a word ancl finding a encoded character set with each character set match ins means finding a word in s with the same being tagged .1H This allows links to be made to exist­ root as s'. In this case, the operation must first parse ing documents in a natural way. s into words and then do a table or dictionary Many so-called revisable Dlfs, such as Standard lookup fo r the values, i.e., the roots. Generated Mark-up Language (SGML), Digital Document Interchange Format (DDII'), Office Document Architecture (ODA), Microsoft Rich Text Us er-defi ned Te xt Elements Format (RTF), and Lotus spreadsheet fo rmat (WKS), When the user of a system wishes to represent and page description languages (POLs), such as and manipulate a text element that is not currently PostScript, Sixels, or X.ll, can be extended to pro­ represented or manipulated by the system, a mech­ vide this Unicode support by enhancing the anism is required to enable the user to extend attribute structure and extending the text import the system's capabilities. Examples of the need fo r map Strings(E) -> Dl F fo r each encoded character set such a mechanism abound. Chinese ideograms E. In doing so, however, many of the richer con­ created as new given names and as new chem­ structs in Unicode, e.g., writing direction, and ical compounds, Japanese gaiji (user-defined char­ many printing control codes are often best acters), corporate logos, and new dingbats are replaced with the DIF's constructs used fo r these often not represented or manipulated by standard features instead.19 In this way, both processing oper­ systems. ations are easier to extend and facilitate the layout User-defined text elements cause two separate functions DII' >PDL and the rendering fu nctions problems. The first problem occurs when E, the PDL->lmage.

SuJIIIIH'r Digital Te chnical journal Vo l. 5 Nu . . i /'J')J 57 Product Internationalization

Presentation Services • Strings-> Image The practice of factoring input and output of data • OIF >POL >Image types into a transformation T<- >r _Prcst·ntation_

Form and performing the 1!0 on the presentation • Encoded Image -• Image fo rm allows one to focus on each step separately. • Encoded Audio-> Audio Signal This factorization also clarifies the applicability of various user preferences, e.g., a date fo rm prefer­ • Encoded Video-> Video Signal ence applies to the transformation, anc.J a t( >nt pref­ • Encoded Handwriting-> Image erence applies to how the string is displayed. As mentioned in the section Safe Software Practices, These output services also vary with encoding, preferences such as presentation fo rm are best device, and algorithm. Figure 3 illustrates the attached to the end user's copy of the data. Data sequence DIF ->PDL-•lmage. Optional parameters types such as encoded image, encoded audio, and are permitted at each step. A viable implementation encoded video pose few international problems of Strings---> Image is to fa ctor this fu nction by means except for the exchangeability of the encodings and of the function Strings •DIF, which is discussed in the viability of some algorithms for recognizing the Data Representation section. Alternatively, the speech and handwriting. Algorithms fo r presenta­ data type Strings can be simply viewed as another tion services can be distributed, but we view them as DIF to be supported. typically residing on the client 20 In Figure 1, we pre­ A revisable document begins in some DIF such as sume that the local language PCs have this capability. plain text, Strings(Unicode), SGML, or DD!F. A lay­ out process consumes the document ami some Input logical page parameters and creates an intermedi­ ate fo rm of the document in some PDL such as Existing technology offe rs several basic input ser­ PostScript, Sixels, or even a sequence of X.ll pack­ vices, which are presented in the fo llowing partial ets. To accomplish this, the layout process needs to list of device-data type functions: get fo nt metrics from the fo nt server (to compute

• Keystrokes • Encoded Character relative glyph position, word and line breaks, etc.). In turn, the rendering process consumes the PDL • Image-• Encoded Image and some physical media parameters to create the

• Audio Signal-• Encoded Audio image that the end user actually sees. The rendering process may need to go back to the fo nt server to • Video Signal •Encoded Video get the actual glyphs fo r the image. Rendering, lay­

• Handwriting -> Encoded Handwriting out, and font services are multilingual services. The servers for these services are the multilingual The methods fo r each input service depend on servers envisioned in Figure 1. both the device and the digital encoding and often use multiple algorithms. WJ1ereas fo r some languages the mapping of one or more keystrokes ComputationServices into an encoded character (e.g., [compose] + [e] + To build systems that process multilingual data, [ '] yielding e) may be considered mundane, input such as the one shown in Figure 1, a rich variety of methods fo r characters in many Asian languages are text operations is necessary. This section catego­ complex, fascinating, and the topic of continuing rizes such operations, but a complete specification research. The introduction of user-defined text of their interfaces wo uld consume too much space elements, which is more common among the in this paper. Te xt operations require parsing, value Asian cultures, requires these input methods to mapping, and operational fu nctions, as described be easily extendable to accommodate user-defined earlier. characters.

Te xt Ma nipulation Services Output Text manipulation services, such as those speci­ The basic output services are similar to the input fied in C programming language standard ISO/IEC services listed in the previous section. 9899: 1990, System V Release 4 Multi-National

58 Vol. 5 No. 3 Summer /'J':J.) Digital Te clmical jo urnal Jntemational Distributed Sy stems-Architectural and Practical Issues

PARAMETER PARAMETER � l DOCUMENT PAGE INTERCHANGE - LAYOUT DESCRIPTION 1- RENDER - FORMAT r- LANGUAGE I J I I

FONT SERVER

FONT DATABASE

Figure 3 Layout and Rendering Services

Language Supplement (MNLS), or XPG4 run-time which is only defined on those encoded text ele­ libraries (including character and text element clas­ ments in Unicode that can be expressed as encoded sification fu nctions, string and substring opera­ text elements in E. If s is in Strings(E), then tions, and compression and encryption services) from_uni(to_uni(s)) is equal to s. Other encoding need to be extended to multilingual strings such as conversions Strings(E)->Strings(E') can be defined Strings(Unicode) and other DIFs, and to various text as a to_uni operation followed by a from_uni oper­ object class I ibraries .u.s. 13 ation, for E and E' respectively. Another class of encoding conversions arises when the character set Data Ty pe Tra mformations encoding remains fixed, but the conversion of a document in one DIF to a document in another DIF Data type transformations (e.g., speech to text, is required. A third class originates when Unicode image-to-text optical character recognition [OCR], or ISO 10646 strings sent over asynchronous and handwriting to text) are operations where the communication channels must be converted to a data is transformed from a representation of one Universal Transmission Format (UTF), thus requir­ abstract data type to a representation of another ing Strings(Unicode)<-> 1T encoding conversions. abstract data type. The presentation fo rm transfor­ mations T•--•T_Presentation_Form and the funda­ mental input and output services are data type Colla tion or Sorting Services transformations. Care needs to be taken when Another group of computation services, collation parameterizing these operations with user prefer­ or sorting services, sorts lists of strings according ences to keep the transformation thread-safe. to application-specific requirements. These ser­ Again, this is best accomplished by keeping the pre­ vices were discussed earlier in the paper. sentation fo rm preferences attached to the data. Linguistic Services Encoding Co nversions Linguistic services such as spell checking, grammar Encoding conversions (between encoded character checking, word and line breaking, content-based sets, DIFs, etc.) are operations where only the rep­ retrieval, translation (when existent), and style resentation of a single data type changes. For exam­ checking need standard AP!s. Although the imple­ ple, to support Unicode, a system must have fo r mentation of these linguistic services is natural each other encoded character set a function language-specific, most can be implemented with to_uni:Strings(E)-•Strings(Unicode), which con­ the structure shown in Figure 2. verts the code points in E to code points in Also, large character sets such as Unicode Unicode.11 The conversion fu nction to_uni has a par­ and other multilingual structures require a uni­ tial inverse from_uni:Strings(Unicode) >Strings(E), fo rm exception-handling and fallback mechanism

Digital Te cbllical ]ounwl Vol. 5 No ..i S11111mer 1993 59 Product Internationalization

because of the large number of unassigned code translations or "synonyms" fo r a name. Synonyms, points. For example, a system should he able to fo r whatever purpose, should have attributes such uniformly handle exceptions such as "glyph not as long_namc, short_name, language, etc., so that t

u. fo rm c(u) for a Unicode string c(u) would • Both multilingual and multiple monolingual u expand every combined character in to its base run-time libraries, simultaneously (see Figure 2) characters followed by their assorted marking char­ • Multilingual database servers, fo nt servers, acters in some prescri bed order. The recom­ 11·21 logging and queu ing mechanisms, and directory mended order is the Unicode "priority value:· services The canonical form should have the fo llowing prop­ erty: When c(u) is equal to c(u), the plain text rep­ • Multilingual synonym services resentations of u and I' are the same. Ideally, the • Multilingual diagnostic services converse should hold as we lL Thus, u and 11 represent the same name in the sys­ Since a system cannot provide all the services fo r tem if c(u) is equal to c(u). In any directory listing, every possible situation, registering the end users' an end user of a language sees only one name per needs and the system's capabilities in a global name object, independent of the language of the owner service is essen tial. The name service must be con­ who named the object. Further restrictions on the figured so that a multilingual server can identify the strings used fo r names are desirable, e.g. , the absence language preferences of the cl ients that request ser­ of special characters and trailing blan . In a multi­ vices. This configuration al lows the servers to tag vendor environment, both the canonical fo rm and or convert data from the client without the mono- the name restrictions should be standardized . The 1 ingual client's active participation. Therefore, the X.'500 working groups currently studying this prob­ name service database must be updated with the lem plan to achieve comparable standardization. necessary preference data at client installation Since well-chosen names convey useful informa­ time. tion, and since such names are entered ami dis­ Typically, system managers for different parts of played in the end user's writing system of choice, it the system are mono I ingual end users (see Figure is often desirable for the system to store various 1) who need to do their job from a standard PC.

llfJI. 5 No. 3 Summer 19'):) Digital Te cbnical ]O lii'IICII International Distributed Sy stems-Architectural and PracticalIss ues

Thus, both the normal and the diagnostic manage­ nance costs and help avoid costly redesign ment interfaces to the system must behave as multi­ problems. Providing multilingual services to mono­ lingual servers, sending error codes back to the PC lingual subsystems permits incremental improve­ to be interpreted in the local language. Although ments while at the same time lowers costs through the quality of the translation of an error message is increased reuse. Final ly. the registration of syn­ not an architectural issue, translations at the system onyms, user preferences, locales. and services in a management level are generally poor, and the sys­ global name service makes the system cohesive. tem design should account fo r this. Systems devel­ opers should consider giving both an English and Acknowledgments a local-language error message as well as giving I wish to thank Bob Ay ers (Aclohe). JosL·ph Bosurgi easy-to-use pointers into local-language reference (Univel), Asmus Freytag (\1icrosoft), Jim (;ray manuals. (Digital), and jan te Kidte(D igital) fo r thl'ir helpful Data errors will occur more frequently because comments on earlier drafts. A special thanks to of the mixtures of character sets in the system, and Digital's internationalization team, whose contribu­ attention to the identification of the location tions are always understated . In addition. I wo uld and error type is important. Logging to capture like to acknowledge the Unicode Te chnical offending text and the operations that generated it Committee, whose impact on the industry is pro­ is desirable. fo und and growing; I have learned a great deal from fo llowing the work of this com mittee. Incremental Internationalization Multi! ingual systems and internationalcomponents References can be bu ilt incrementally. Probably the most pow­ erful approach is to provide the services to support 1. D. Carter, Writing Localizable Software fo r tviA : multiple monolingual subsystems. Even new oper­ the Ma cintosb (Reading. Ad dison-We sley, ating systems, such as the Windows NT system, that 1991). use Un icode internally neecl mechanisms fo r such 2. Producing International Products (Maynard. support.25 Multidimensional improvements in a sys­ MA: Digital Equipment Corporation, 1989). tem's ability to support an increasing number of This internal document is unavailable to variations are possible. Some such improvements external readers. are making more servers multilingual, supporting more multilingual data and end-user preferences, 3. Digital Guide to Developing international supporting more sophisticated text elements (the Software (Burlington, MA: Digital Press, first release of the Win dows NT operating system 1991). will not support Unicode's joiners), as well as adding more character set support, locales, and 4. S. Martin, "Internationalization Made Easy," MA: user-defined text elements. The key point is that, OSF White Paper (Cambridge, Open Soft­ like safe programming practices, multilingual ware Foundation, Inc., 199l). support in a distributed system is not an ·'ali-or­ 5. S. Snyder et al., "Internationalization in the nothing" endeavor. OSF IKE-A Framework," May 1991. This doc­ ument was an electronic mail mc.ssage trans­ Summary mitted on the Internet. Customer demand fo r multilingual distributed systems is increasing. Suppliers must provide 6. X/ Open Po rtability Guide, Issue 3 (Reading, U. K. systems without incurring the costs of expen­ X/Open Company Ltd , 1989). sive reengineering. This paper gives an overview of 7. X/ Open Internationalization Guide, Draft the architectural issues and programming practices 4.3 (Reading, U. K.: X/Open Company Ltd., associated with implementing these systems. October 1990). Modularity both in systems and in run-time librarits allows greater reuse of components and 8. UNIX Sy stem V Release 4, Multi-National incremental improvements with regard to interna­ Language Supplernent (MNLS) Product tionalization. Using the suggested safe software Overview (Japan: American Te lephone and practices can lower recnginecring and mainte- Te legraph, 1990).

Digital Te ciJnicalJo unwl Vo l. S No . . I Summer 1')93 61 Product Internationalization

9. Information Te ch nology -Universal Coded International Organization for Standardiza­ Ch aracter Set (UCS) Draft International tion/International Electrotechnical Commis­ Standm·d, ISO/IEC 10646 (Geneva: Interna­ sion, 1983). tional Organization for Standardization/Inter­ national Electrotechnical Commission, 1990). 17. J Bertels and F. Bishop, "Unicode: A Universal Character Code," Digital Te chnical jo urnal, 10. A. Nakanishi, Writing Sy stems of the Wo rld, vol. 5, no. 3 (Summer 1993, this issue): 21-31. third printing (Rutland, Ve rmont. and To kyo, Japan: Charles E. Tuttle Company, 198R). 18. Go Computer Corporation, "Compaction 11. The Unicode Consortium, Th e Unicode Techniques," Second Un icode Implementors' Standard- Wo rldwide Ch aracter Encoding, Conference (1992). Version 1.0, Vo lume l (Reading, MA: Addison­ 19. J Becker, "Re: Updated [Problems with] Wesley, 1991). Unbound (Open) Repertoire Paper" (January 12. R. Haentjens, "The Ordering of Universal 18, 1991). This electronic mail message was Character Strings," Digital Te chnicaljo urnal, sent to the Unicode mailing list. vol. 5, no. 3 (Summer 1993, this issue): 43-52. 20. V Joloboff and W McMahon, X Windozu 13. Programming Lanf!.uages-C, ISO/lEC 9899: System, Ve rsion 11, Input Method Sp ecifica­ 1990(E) (Geneva: International Organization tion, Public Review Draft (Cambridge, MA: fo r Standardization/International Electrotech­ Massachusetts Institute of Technology, 1990). nical Commission, 1990). 21. M. Davis, (Taligent) correspondence to the 14. S. Martin and M. Mori, Internationalization Unicode Te chnical Com mittee, 1992. in OSF/1 Release 1.1 (Cambridge, MA: Open Software Foundation, Inc., 1992). 22. M. Gasser et a!., "Digital Distributed Security MA: 15. J. Becker, "Multilingual Wo rd Processing," Sci­ Architecture" (Maynard, Digital Equip­ entifi c American, vol. 251, no. 1 (July 1984): ment Corporation, 1988). This internal docu­ 96-107 ment is unavailable to external readers.

16. Coded Character Sets fo r Te xt Communica­ 23. H. Custer, Inside Wi ndows NT (Redmonc!,WA : tion, Pa rts 1 and 2, ISO/IEC 6937 (Geneva: Microsoft Press, 1992).

62 Vo l. 5 Nu. 3 Summer 1993 Digital Te cbnical jo nnwl Michael M. T. Ya u I

Supporting the Chinese, Japanese, and Ko rean Languages in th e Op enVMS Op erating Sy stem

The Asian language versions of the Op en VMS operating sy stem allow Asian-speak­ ing users to interact with the Op enVMS sy stem in their native languages and provide a platform fo r developing Asian applications. Since the Op en VMS variants must be able to handle multibyte character sets, the requirements fo r the internal representation, input, and output differ considerably from those fo r the standard English version. A review of the japanese, Ch inese, and Ko rean writing systems and character set standards provides the context fo r a discussion of the fe atures of the Asian Op enVi 'vlS variants. Th e localization approach adopted in developing these Asian variants was shaped by business and engineering constraints; issues related to this approach are presented.

The OpenVMS operating system was designed in native languages. These extensions also provide an era when English was the only language sup­ a platform fo r developing Asian applications. This ported in computer systems. The Digital Com mand paper presents a high-level overview of the major Language (DCL) commands and utilities, system features of Chinese, Japanese, and Korean support help and message texts, run-time libraries and sys­ in the OpenVMS operating system and discusses the tem services, and names of system objects such localization approach and techniques adopted. as file names and user names all assume English text encoded in the 7-bit American Standard Code Asian Language Va riants of the fo r Information Interchange (ASCII) character set. Op enVMS Sy stem As Digital's business began to expand into mar­ The fo llowing five separate Asian language variants kets where common end users are non-English of the OpenVMS operating system are available in speaking, the requirem ent fo r the OpenVMS system the Pacific Rimgeog raphical area: to supporr languages other than English became inevitable. In contrast to the migration to support Language Country OpenVMS Va riant single-byte, 8-bit European characters, OpenVMS localization efforts to support the Asian languages, Japanese Japan OpenVMS/Japanese namely Japanese, Chinese, and Korean, must deal 's OpenVMS/Hanzi with a more complex issue, i. e., the hand ling of Republic multibyte character sets. Requirements fo r the inter­ of China nal representation, input, and output of Asian text Chinese Ta iwan, OpenVMS/Hanyu are radically different from those fo r English text. Republic of China As a result, many traditional ASCII programming Korean Republic OpenVMS/Hangul assumptions embedded in the OpenVMS system are of Korea not valid fo r handling Asian multibyte characters. (South Korea) Since the early 1980s, Digital's engineering Thai Thai land OpenVMS/Thai groups in Asia have been localizing the OpenVMS system to support Asian languages. The resultant Asian language extensions allow As ian-speaking This paper covers the first fo ur variants, omitting users to interact with the Open VMS system in their the Thai variant because of space limitations. Each

Digital Te chrticaljountal llrJ/. 5 Nu. . ) Summer 1993 63 Product Internationalization

Asian language variant of the OpenVMS system Overview of Asian Wr iting Sy stems is designed to be installed and to run as a separate Before looking at specific features of the Asian system. Currently, no provision exists to fo rmally OpenVMS variants, this paper briefly reviews the support multiple Asian languages simultaneously Chinese, Japanese, and Korean writing systems. on a single OpenVMS system. Each variant provides For a more detailed discussion of the diffe r­ a bil ingual system environment of English and ences among these writing systems, refer to Tim one Asian language. Such an envir onment, cal led Greenwood's paper in this issue of thejourna/.1 Asian Open\'.VIS mode in this paper, supports ASCII ami one mu ltibyte Asian character set. The variants The Ch inese Wr iting Sy stem are available on the VA X and the Alpha AXP plat­ The Chinese writing system uses ideographic char­ forms with identical fe atures. Throughout the acters called Hanzi, which originated in ancient paper, the generic name Asian OpenV.\1S variant China more than 3,000 years ago. Each ideogra phic denotes any of the Asian language variants of the character (or ideogram) is a symbol made up of ele­ OpenVMS operating system, regard less of the hard­ mentary root radicals that represent ideas and ware plattorm. things. Some ideograms have very complex glyphs To achieve full downward compatibility fo r exist­ that consist of up to 30 brush strokes. Over 50,000 ing users, applications, and data from the ideograms are known to exist today; how­ OpenVMS system , each Asian OpenVMS variant is ever, a subset of 20,000 or less is typically sufficient a superset of the standard OpenVMS system. In fact, J(>r general use. Two or more ideograms are often a user can operate in the standard OpenVMS strung together to represent more complex mode, i.e., the 1-byte DEC Multinational Character thoughts. Set (DEC MCS), on an Asian OpenVMS variant with­ Ideographic writing systems have characteristics out noticing any diffe rence in the functional behav­ that are quite diffe rent from those of alphabetical ior compared to a standard OpenVMS system. The writing systems. such as the Latin languages. For components of an Asian OpenVMS variant are instance, the concept of uppercase and lowercase installed on a standard OpenVMS system in a man­ docs not apply to ideographic characters, and col la­ ner similar to that of a layered product; fi les (exe­ tion rules are built on different attributes. The cutable images and other data files) arc added and input of ideographic characters on a standard key­ replaced on the standard OpenV.\1S system. In gen­ board requires additional processing. eral, three types of components are supplied in an Two forms of are in use instal lation: today: Traditional Chinese and Simpl ified Chinese.

1. A standard OpenVMS component supplan ted Tr aditional Chinese is the original written fo rm by an Asian localized version that incl udes the and is still used in Ta iwan and Hong Kong. In the standard OpenV\1S mode as a subset. At the 1940s, the gove rnment of the People's Republic of process level, the user can set the component to China (PRC) launched a campaign to simplify the run in either standard OpenVMS mode or Asian writing of some traditional Chinese characters in Open VMS mode. The DCL and the terminal driver an effort to speed up the learning process. The are examples ofthis type of component. resu lting simpler set of Chinese characters is known as Simplified Chinese and is used in the PRC, 2. A standard OpenVMS component supplemented , and Hong Kong. by an Asian localized version that runs only mode. Both versions of the in Asian OpenVMS Th e japanese Wr iting Sy stem component run simultaneously on the system. The Japanese writing system uses three scripts: Examples are the TPU/EVE editor and the MAIL Chinese ideographic characters (called kanji in utility. Japan), k.a na (the native phonetic alphabet), and 3. A new Asian-specific component created to pro­ rrm�aji (the English alphabet used fo r foreign vide fu nctionality fo r Asian processing that does words). The kanji script commonly used in Japanese not exist in the standard Openv:.1s system. An includes about 7, 000 characters. There are two sets example of this type of component is the charac­ of ka na scripts, namely, hiragana and ka takana; ter manager (CMGR), which is discussed later in each comprises 52 characters that represent sylla­ this paper. bles in the Japanese language. Hiragana is used

64 l,'rJI. 5 No. 3 Summer 1993 Digital Te chuicaljourual Supporting the Ch inese, japanese, and Ko rean Languages in the Op en VMS Op erating Sy stem

extensively intermixed with kanji. Katakana is used COLUMN 1 94 ROW COLUMN to represent words borrowed from other languages. �1 · 1 ____1 �1 · 1 ____ ROW Ox21 - Ox7E Ox21 - Ox7E Th e Ko rean Wr iting Sy stem 10 TWO-BYTE CODE STRUCTURE The Korean writing system uses two scripts: 94 A PLANE Hangul (the native phonetic characters) and Hanja (Chinese ideographic characters). The Hangu l • Note !hal the first bit of each row and column can be either 0 or 1 script was invented in 1443 by a group of scholars

in response to a royal directive. Each Hangul char­ Fig ure 1 Code Structure in Asian Character acter is a grouping of two to five Hangul letters Set Standards (phonemes) that forms a square cluster and repre­ sents a syllable in the Korean language. The mod­ includes 6,353 kanji characters divided into two ern Hangul alphabet contains 24 basic letters-14 levels, according to frequency of usage.:i Level 1 has consonants and 10 vowels. Extended letters are 2,965 characters, and level 2 has an additional 3,388 derived by doubling or combining the basic letters. characters. This standard also includes complete sets of characters fo r hiragana and ka takana, Asian Character Sets ASCII, and the Greek and Russian scripts-a total of During the early clays of Asian language computeri­ 453 characters. The 1990 revision, JIS X 0208-1990, zation when de jure standards did not exist fo r added two characters to the standard:' An addi­ Asian character sets, individual vendors in the local tional plane of ka nji characters became standard in countries invented their own local character sets 1990 with the announcement of )IS X 0212-1990." fo r use in their Asian language products. Although Prior to the introduction of the 2-byte standards, most vendors have migrated to conform with the Japanese systems that support katakana used the national standards, a variety of local character sets JIS X 0201-1976 standard fo r a 1-byte, 8-bit character still exists today in legacy systems, thus creating set/• To day, tl1ere is still a demand to support this interoperability issues. This paper reviews only the standard, in addition to the 2-byte standards, due national standard character sets that are supported to its pervasive use primarily in legacy mainframe by the Asian Open VMS variants. systems.

Na tional Sta ndards People 's Republic of Ch ina In 1980, China National standards bodies in each of the Asian announced a 2-byte standard, Chinese Character Pacific geographies have established character set Coded Character Set fo r InformationIn terchange­ standards to fa cilitate info rmation interchange fo r Basic Set (GB 2312-1980).7 Its structure, which fo l­ their local characters. For languages that use Han lows that of the Japanese standard, includes two characters (which are large in number) in their writ­ levels of Hanzi. Level 1 has 3,755 characters, and ing scripts, the character set standards all share level 2 has an additional 3,008 characters. The stan­ a similar structure, which is illustrated in Figure 1. dare! also has 682 characters, including ASCII, Characters are assigned to a 94 -row by 94-column Greek, Russian, and the Japanese kana characters. structure called a plane. Each character is repre­ Subsequently, China has annou nced additional sented by a 2-byte (7-bit) value in the range of Ox2l character set standards. to Ox7E. A plane, therefore, has a total of 8,836 code points available. Such a structure avoids the ASCTI Ta iwan, Republic of China The Ta iwanese control code values, thus preventing conflicts with national standard, Standard Interchange Code fo r existing operating systems and communication Generally Used Chinese Characters (C NS 11643- hardware and software. 1986) was first announced in 1986.s Aga in, the structure is simi lar to the Japanese and PRC stan­ Japan Japan was the first country to announce a dards. It defines two planes of characters with a 2-byte character set stanclarcl, the Code of the total of 13,051 Hanzi, 651 symbols, and 33 control Japanese Graphic Character Set fo r Information characters. The standard was revised in 1992 and Interchange (JIS c 6226-1978).! This standard has renamed Chinese Standard Interchange Code (CNS since been revised twice, in 1983 and 1990, and 11643-1992).9 An additional five planes were renamed )IS X 0208. The JIS X 0208-1983 standard defined in this revision, adding 34,976 characters.

Digital Te cl.m ical ]m tnwl Vo l. 5 Nu. 3 Stllllll/er 199.) 65 Product lnternationali:J:ation

Republic of Ko rea (South Ko rea) The latest ver­ 1. Shift code-based representation. In this scheme, sion of the Korean 2-byte character set standard the 1-byte code is combined with a 2-byte code is the Korean Industrial Standards Association by inserting shift control codes to switch Code fo r Information Interchange (KS C 5601-1987), between the two code sets. A 1-bytc "shift out" announced in 1987 1" This standard includes 2,350 control code changes the mode from 1- to 2-byte, precomposed Hangul characters, 4,888 Hanja while a 1-byte "shift in" control changes the (Chinese characters), and 352 other characters such mode back to 1-hyte characters. This scheme is as ASCII, the Hangul alphabets, Japanese kana, in common use in mainframes. Greek, Russian, and special symbols. 2. ISO 2022-based representation. The ISO 2022 Code Extension Techniques al low a designated Us er-defined Characters character set to consist of two, three, or fo ur Character set standards do not always encode all 7-bit bytes in addition to the 7- bit sets. 11 The only known charactns of the writing scripts for which requ irement is that all bytes of a character have the standards are intended . For instance, when the the same high-order bit setting (a ll 0 or all I). total number of known characters exceeds the A simple scheme of simultaneously supporting available code space, only subsets of the most com­ ASCII and one 2-byte character set can be mon characters are included. In addition, new char­ achieved by statically designating ASCII to GO acters are invented over time to describe new ideas and invoking it to graphics left (.) and designat­ or objects, such as new chemical elements. The ing a local 2-byte set (e.g., one of the Chinese, concept of user-defined characters (UOC:s), some­ Japanese, or Korean sets) to G 1 and invoking it times known as gcnji in Japan, was introduced to to graphics right (GR). The resulting mixed address the user's need fo r characters that are not 1-byte/2-byte representation is shown in Figure 2. coded in a character set standard. Many computer The high-order bit of each 8-bit byte provides vendors, including Digital, provide extended code self-identifying information fo r the local 2-byte areas fo r assigning UDCs and vendor-defined non­ set. This scheme can be further extended to standard characters. Attributes of these characters include two additional character sets by stati­ fo r various operations such as display fo nts, colla­ cally designating them to G2 and G3 and invok­ tion weights, and input key sequence are often ing them by the single shift codes SS2 and SS3. made available, e.g., by registering them in a system The Extended UNIX Code (El "< :) scheme employs database. From an end-user and application per­ this additional extension. spective, ·ocs should be processed transparently and in the same way as standard characters. 3. Shift range-based representation. Th is scheme, a hybrid of the previous two schemes. is used by the "Shift .JIS Code" on PC-based systems in Asian Op enVMS Sy stem Overview Japan. Bytes with codes 0 to 127 arc interpreted From an operating system perspective, three basic as 1-byte ASCII, codes 160 to 191 and 192 to 223 issues need to be addressed to support Asian char­ are interpreted as 1-byte katakana (as specified acter processing, namely, internal representation, by the .JIS X 0201 standard), and codes 128 to 159 (i.e., how As ian characters are represented and and 224 to 255 are combined with the byte that stored inside the computer), basic text input, and output. FIRST BYTE SECOND BYTE

Internal Representation 7-BIT ASCII Asian OpenVMS variants support the respective national standard character sets. To achieve fu II compatibility with existing ASCII data, each Asian 14-BIT JIS/GB/KS OpenVMS variant simultaneously supports one multibyte Asian character set and ASCII. A variety of schemes can be used to combine multiple coded F(� ure 2 Example of an /SO 2022-based character sets. In general, the schemes fa ll into one Representation Th at Combines of the fo llowing three types: Multiple Coded Character Sets

66 �'{)/. 5 No . .i Summer 1993 Digital Te chnical]our11al Supporting the Ch inese,Japanese, and Ko rean Languages in the Op enVMS Op erating System

fo llows to fo rm a 2-byte code that is interpreted The DEC Asian code set internal representation as a kanji character (as specified by the JIS X corresponds to mapping a character plane (94 by 0208 standard). This scheme allows more single­ 94) to one of the (1, 1) and (0, 1) quadrants of the byte characters to be represented at the expense 2-byte code space in Figure 3. The exact mappings of the number of 2-byte characters allowed. of individual DEC Asian character sets supported by Asian OpenVMS vary. Table 1 provides a summary of Asian OpenVMS nriants employ the ISO the common code range assignments. 2022-based representation fo r Digital's Asian code sets (the DEC Asian code sets) and are named DEC Kanji The DEC Kanji (OpenVMS/japanese) respectively DEC Kanji, DEC Hanzi, DEC Hanyu, and code set currently supports ASCII, JIS X 0208-1983, DEC Hangul fo r the japanese, Simplified Chinese, and an area fo r UDCs, as shown in Ta ble I. The UDC Traditional Chinese, and Korean character sets. area is further divided into the two subareas This encoding scheme maintains full downward described in Ta ble 2. compatibility with all existing ASCII software and Recently, Super DEC Kanji, a revision ancl exten­ data. ln particular, a string or record that consists of sion to the DEC Kanji code set, has been proposed only ASCII characters has the fo rm of simple ASCI I. Because there is no need to keep state information about the data, this scheme simplifies processing, SECOND BYTE when compared to the shift code-based scheme. 80 A1 However, without explicit support fo r coded char­ 00 21 FF CONTROLS acter set designation, simultaneous support fo r 21 Chinese, japanese, and Korean is not possible. (0,1 ) To support LD<:s, each DEC Asian code set con­ 80 tains an extended code area fo r their assignment. FIRST a: f- CONTROLS BYTE The high-order bit of the second byte no longer has A1 �Gz 0 u to be set; thus, an additional 94 by 94 plane of code (1,0) (1,1) positions is available. The disadvantages of this FF approach are that synchronizing a character bound­ ary requires scanning fo rward from the beginning of the string and that the second byte can now con­ Figure 3 DEC Asian Code Set Jntemal flict with the ASCII characters. Representation

Ta ble 1 Summary of DEC Asian Code Range Assignments

Code Range DEC Kanji DEC Hanzi DEC Hanyu DEC Hangul

(Oxxxxxxx) ASCII ASCII ASCII ASCII (1 xxxxxxx 1 xxxxxxx) JIS X 0208 GB 2312-1980 CNS 11643-1986(1)* KS C 5601-1987 (1 xxxxxxx Oxxxxxxx) UDC UDC CNS 11643-1986(2)t (OOOxxxx) CO Control CO Control CO Control CO Control (100xxxx) C1 Control C1 Control C1 Control C1 Control

Notes:

• denotes plane 1 of CNS-11643-1986. r denotes plane 2 of CNS-11643-1986.

Ta ble 2 The DEC Kanji UDC Area

Number of Area Usage Quadrant Rows Characters Code Range

User Area (1 ,0) 1-31 2,914 OxA1 21-0xBF7E DEC Reserved (1 ,0) 32-94 5,922 OxC021-0xFE7E

Digital Tecbn ica f jou nwf 1-1>1. 5 Nu . .) Su/1111/er 199.3 67 Product Internationalization

to suprort additional character sets such as JIS X code (OxC2CB) combines with the following 2-byte 0201-197H and JIS X 0212-1990, ·which arc specified as code to fo rm a 4-byte code. Of course, the code fo llows: point OxC2CB is reserved fo r this purpose. This scheme makes avai.lable another two 94 by 94 Additional planes of code positions: Code Range Planes (OxC2CB lx.:xxxxxx I xxxxxx) (SS2 1 xxxxxxx) JIS X 0201 (0xC2CB lxxxxxxx Oxxx.x.Je'<) (SS3 1 xxxxxxx 1 xxxxxxx) JIS X 021 2 Table 5 shows the current definition of the DTSCS. An additional area is available fo r UDCs in the The redefined UDC area includes both a user/ C:NS planes, as defined in Ta ble 6. vendor-defined area (UDA) and a user-definable character set (UDCS), as described in hhlc .). DEC Hc1 11gul The DEC Hangul (OpenVMS/Hangul) code set supports ASCII and KS C 5601-1987 (with

DEC Hm r:i The DEC Hanzi (Open\.\IS/ llanzi for the exception of UDCs). Simrlified Chinese) code set supports ASCII, GB 2312-80, and a UDC area described in Ta ble 4. Asian Te xt Input The most widely used computer input device DEC Ha ny u Tbe DEC Hanyu (OpenVMS/Hanyu remains the keyboard. Because it is impossible fo r Traditional Chinese) code set currently sup­ to assign thousands of ideographic characters to a ports ASCII , CNS 11643-1986 (first ami second standard QW ERTY keyboard, new methods must be planes), and the Digital Ta iwan Supplemental devised to fa cil itate the Asian text input process. In Character Set (DTSCS). The DTSCS supplements the this context, an inpu t method is basically an algo­ characters defined in C:\S ll643-198(l with an addi­ rithm that takes keystroke input representing cer­ tional collection of characters that address cus­ tain attributes (e .g., phonetics) of a character or tomer needs. Currently, the DTSCS defines the 6,319 string and uses a lookup table to find characters characters recommended by the Electronic Data or strings that have thost: attribute values. Typ ical ly, Processing Center (EDPC) of the Executive , a a user inputs sevt:ral keystrokes and selects the 1;1 iwanese government body. The CNS 11613-1992 desired character or string from a candidate list by standard includes the DT SCS. means of an iterative dialog with the input method. To support the additional DTSCS, the mixed This process is sometimes referred to as preediting. 1-byte/2-byte scheme is extended to a 1-byte/ Depending on the physical location of where the 2-byt<.:/4-byte scheme. Each DTSC:S character maps dialog takes place, a preediting user interface can to a 4-bytc code, in wh ich a fixed kading 2-byte be one of three styles: off-the-spot, over-the-spot,

Ta ble 3 The Super DEC Ka nj i UDC Area

Number of Area Usage Quadrant Rows Characters Code Range

JIS X 0208 UDA (1 ,1) 85-94 940 OxF5A1-0xFEFE

JIS X 021 2 UDA SS3 (1 ,1) 78-94 1,598 (SS3 + OxEEA1 )-0xFEFE

UDCS (1 ,0) 1-94 8,836 OxA1 21 -0xFE7E

Ta ble 4 The DEC Hanzi UDC Area

Number of Area Usage Quadrant Rows Characters Code Range

DEC Reserved (1 ,1) 88-94 658 OxA1 A1-0xFEFE User Area (1 ,0) 1-87 8,178 OxA1 21-0xF77E DEC Area (1 ,0) 88-94 658 OxF821-0xFE7E

6H Vo l. 5 ;Vo. .i Stt/1111/er !')'/! Digital Te chnical jo urnal Supporting the Ch inese.japanese, and Ko rean Languages in the Op en VMS Op erating Sy stem

Ta ble 5 The DEC Hanyu DTSCS Area

Number of Area Usage Quadrant Rows Characters Code Range

EDPC Recommended Characters C2CB (1,1) 1-68 6,319 OxC2CBA1 A1-0xC2CBE485 Reserved C2CB (1,1) 68-94 2,51 7 OxC2CBBEB6-0xC2CBFEFE Reserved C2CB (1 ,0) 1-94 8,836 OxC2CBA12 1-0xC2CBFE7E

Ta ble 6 The DEC Hanyu UDC Area

Number of Area Usage Quadrant Rows Characters Code Range

UDC (1,1) 93-94 145 OxFDCC-OxFEFE UDC (1,0) 82-94 1,186 OxF245-0xFE7E

or on-the-spot. Diffe rent input methods may have tions such as text editors, which need to directly different preedit interface requirements. Usually, manage the screen display. The method requires several screen areas are needed fo r the preediting substantial reengineering of an ASCII application dialog to take place. Input methods diffe r from cul­ to support a .Japanese input method. ture to culture and from script to script. 2. Ru n-time library (RTL). Japanese applications The major difference in the implementation of cal l special text line input routines, which han­ input method support among the Asian OpenVMS dle the Japanese input method. This method is variants is in the character cell terminal envi­ suitabl · to r applications that require simple line ronment. Some input methods are suitable fo r input of text. The RTL method hides the details of programming into the terminal firmware. The the input method from the application but lacks Chinese ami Korean input methods supported the flexibility to control the preedit user inter­ on the OpenVMS/H;mzi, OpenVMS/Hanyu, and fa ce. The reengineering needed to handle the OpenVMS/Hangul systems are examples of such Japanese inpu t method is shifted from the appl i­ methods. Other input methods are too complex or cation to the RTL routines. Th is approach require so many resources as to make them too requires less application reengineering, but all costly fo r firmware implementation. This is true of standard line input routine calls in the applica­ the Japanese: input method, which needs to be tion must be replaced by Japanese line input rou­ implemented on the host. Such implementation tine cal ls. causes a number of technical issues with the tracli­ tional ASCII character cell terminal-oriented appli­ 3. Front-end input processor (FIP). The Japanese cation programming model, where an application input method is embedded as a front-end process does not have to be concernedwith input methods inside the terminal queued I/O (QIO) system ser­ and expects to receive character codes directly. The vice. This method of implementation benefits fo llowing three alternatives have been used to cxisting high-level RTL text line input routines implement host-based input methods on the and requires little application or RTL reengineer­ OpenVMS/Japanese system: ing to support the Japanese input method in the single-I ine input of]apanese text.12 I. Application. All Japanese applications directly

program the input method themselves. An appli­ The Asian OpenVMS graphical user interface on cation must call low-level routines (a set of workstations is called Asian DECwindows/Motif. kana-kanji input method routines are available Current input method support is provided through in the JSY Run-time Library) to access the input a Digital extension im plemented as an X cl ient. method dictionary and directly controls the With n.:lcasc 5 of the X l I standard, the implementa­ preedit interface in relation to its own screen tion will migrate to using the standard X input management. This method is used by appl ica- method (XIM) support in the Xlib library routines.

DipJtal Te cbuical jo unwl VrJ I. 5 No. . ) Sui/Ill/PI" I'J'J.) 69 Product lnternationalliation

Most Asian PCs have a front-end processor imple­ • Five stroke mentation of input methods resident on the PC. • Five shape Therefore, PC (f<..-sktop computers can s<.:nd Asian

characters directly when communicating with an • Asian OpenVMS host. The following is an overview of the input meth­ • Telex code ods supported by each Asian OpenVMS variant. • GB 2312 code japanese Input Method Ka na characters can be The OpenVMS/Hanyu system supports the fol­ typed directly on a standard keyboard using a !?ana lowing Chinese input methods, which are imp le­ . For kanji characters, the de facto mented by firmware on the Digital VT382 series standard input method is called the romaji/kana­ Chinese terminals: to-kanji conversion, which is based on phonetic • Ts - conversion. The process of entering a kanji string involves typing the kana (hiragana or ka takana) • Quick Ts ang-Chi or the romaji pronunciation of the string. The input • Phonetic method then looks in a conversion dictionary fo r the list of k. anji strings that have the same pronunci­ • Internal code ation. Since most japanese words have homonyms, • Phrase the user usually needs to go through a selection process to find the desired kanji string. More Ko rean Input Jllle thod Hangul characters are advanced implementations involve performing syn­ composed by directly typing the individual Hangul tactic and semantic analysis of the sentence to letters. The composition sequence always starts increase the efficiency of the input method. On with a consonant, is followed by a vowel, and fin­ the OpenVMS/Japanesc system, the kana-to-kanji ishes with a consonant, if present. The input method input method has a provision fo r separating conver­ validates the composition sequence keyed in by the sion units into word, clause, and sentence. The user at each step. The display device updates the method also has a learning capability that reorders intermediate rendering of the partially fo rmed the candidate list entries by means of a personal Hangu l character as the shape ami position of each dictionary, putting the characters selected at the letter changes during composition. Hanja charac­ top of the list so that more frequently used words ters are entered by typing their Hangul pronuncia­ appear first in the homonym list. tion. The input method displays a list of all possible Hanja characters (homonyms). More sophisticated Chinese Input Method No standard exists fo r the implementations can perform Hangul-to-Hanja Chinese input method. The large number of input conversion in word units similar to that of the methods that have been proposed over the years kana-to-kanji conversion. On the Digital VT382 can be classified into one of two major types: Korean terminal, both the Hangul and the Hanja 1. Pattern decomposition-based method. Each input methods are implemented by firmware. character is decomposed into basic strokes or patterns. Each stroke or pattern, e.g., a root racl i­ Asian Te xt Output cal, is assigned a code (mapped to a key) and Asian character fo nts are usually displayed or each character is retrieved by inputting a printed as bit-map graphics. To meet the require­ sequence of such codes. ments of specific applications such as sealing and plotting, these fo nts can also be defined as outline 2. Phonetic-based method. Each character is tran­ fo nts using vector representation. International scribed into phonemic letters and retrieved by codes of Asian language characters are mapped to this phonemic transcription. The system used in the corresponding fo nt data when needed fo r out­ Ta iwan is based on the National Phonetic Alpha­ put. Predefined character fonts are usually stored in bet (), whereas the PRC uses Roman the read-only memoq' (ROM) of terminals and print­ alphabets based on the Wad e-Giles system. ers fo r better performance. As fo r the English alpha­ The OpenVMS/Hanzi system supports the fo llow­ bet, diffe rent standards, styles, and sizes exist fo r ing Chinese input methods: Asian language character fo nts. The fo llowing list

70 VfJI. .5 No. .) S11mmer 1993 Digital Te ciJnicaljournal Supporting the Ch inese, japanese, and Ko rean Languages in the Op enVMS Op eratingSy stem

contains some of the more popular fo nt styles used • Font memories and loading protocol. The termi­ in the respective markets: nal requires additional ROM to hold the fo nts of standard characters in an Asian character set, typ­ Market Font Style ically 7, 000 to 20,000 characters. Also, fo r char­ acters outside the standard set, i.e., UDCs, the Japan Mincho, Gothic, Round-Gothic terminal requires random-access memory (RAM) Korea Myuncho, Gothic to downline load the fo nts from the host. Digital PRC , Quasi-Song, Hei (boldface), Kai Asian terminals support fo nt-loading protocols Ta iwan Sung, Hei (boldface), Kai that work with the host software to downline load fo nts into RAM either on demand or on a preloading basis. The fo nt cache in Digital's In general, Asian ideographic characters require Asian terminals can usually hold about 400 char­ high-definition fo nts, i.e., at least a 24-by-24 clot acters at once. matrix, to achieve acceptable visual quality. As a result, memory requirement is a major issue when • Input method. Implementing input methods on supporting Asian fo nts. a video terminal requires additional hardware modification. The input method algorithms Ha rdware must be programmed into the firmware together Supporting Asian language processing requires with extra memory fo r the input method lookup modifying the standard video terminals and print­ tables. In addition to the main display area, one ers. In general, software products need to recog­ extra line on the screen is needed as an input nize the different fu nctional characteristics of Asian method work area, e.g., fo r displaying candidate terminals and printers. For example, the character lists fo r user selection. Some keys must be set designation and invocation defaults diffe r from assigned permanently fo r invoking different those of standard terminals. input methods. The printing of legends on the Wo rkstations do not require any modifications tops of the keys is now more complex, because (except fo r exchanging a local language keyboard the keytops must include additional legends fo r fo r the standard one), because input and display are the input method keyboard layout. For example, directly supported by software. on Digital's Hanzi terminals, four ideograms must appear on the tiny area of one kcytop. Asian Video Te rminals The traditional character cell terminal provides certain local display and Asian Printers Digital supports a range of input fu nctions on behalf of a software program. For example, the terminal firmware preprocesses Asian printers. Similar to Asian video terminals, Asian printers must support fo nt-loading protocols scan codes generated by keyboard input and con­ to downline load fo nts fo r UDC:s by either preloacl­ verts them to character code before sending them to an application. Similarly, character fo nts are usu­ ing or on-demand-loading methods. Additional ally stored in the terminal ROM. Digital has devel­ RAM is required to hold these fo nts. Also, Digital's Asian printers generally support multiple fo nt type­ oped a variety of video terminals to support Asian faces and sizes. language processing. Some major hardware considerations fo r Asian video terminals arc Asian Op enVMS Structure The components provided by the Asian OpenVMS • High-resolution video display. Ideographic char­ variants on top of the standard OpenVMS system acters have complex glyphs, which require at can be divided into five main groups: least a 24-by-24 dot matrix cel l to be of accept­ able display quality. Such a cell would occupy 1. System support fo r transparent processing of two ASCII columns. As a result, to maintain UDC:s a 26 -line (40 ideograms per line) display requires 2. An enhanced OpenVMS terminal I/O subsystem a screen resolution of at least 960 by 780 pixels. to support Asian terminal devices Typically, Digital's Asian video terminals usc monitors that run at a 60-hcrtz noninterlaceCI 3. A set of run-time libraries to facilitate Asian mode, a mode substantially higher than that of application development on Asian OpenVMS standard ASCII terminal monitors. systems

Digital Te cbnicaljourna/ Vu /. 5 N" . .i Sltllllll<'rI'.I' J.i 71 Product Internationalization

4. A set of localized utilities and commands fo r code point serves as the key in the CMGR databases users to perform common tasks on OpenV.\15 fo r retrieving other attributes of the character. systems in their nalivl' languages The CMCR utility provides a user interface to cre­ ate and manage the UDC attribute database. The 5. A utility to set the operating modes (standard user interface includes a font editor fo r users to cre­ OpenVMS mode or Asian OpenVMS mode) of the ate the glyph image of a UDC and entries for other localized components attributes. To allow applications ro retrieve the Figure 4 summarizes the Asian OpenV.VIS system UDC attributes, the CMGR has a set of application structure. programming interfaces (AP!s) used to access the individual attribute databases. In particular, the AsianOpenV MS Components on-demand fo nt loading of UDCs supported by the Asian terminal l/0 subsystem employs the CMGR This section reviews the major components of the fo nt databases, and the SORT/MERGE uti! ity uses the Asian OpenVMS variants. collation databases for UDC sorting.

Us er-defined Character Support­ CMGR Fo nt Database To output a UDC to a dis­ Th e CharacterMa nager play or printing device, the UDC's glyph image must Attributes of characters in the standard character first be defined. The CMGR provides a screen fo nt sets supported on an Asian OpenVMS system are editor fo r users to create the glyph images. The known and fixed. Therefore, attribute support can CMGR supports multiple typefaces (e.g., Hei, Sug, be built into the system statically. In contrast, UDCs and Default) and fo nt sizes (e.g., 24 by 24, 32 by 32, usually require thl'ir attributes to be dynamically and 40 by 40) in multiple databases. There are two defined and accessed. A new utility called the char­ ways to load the UDC fo nts to Asian output devices, acter manager (CMGR) enables users to create, man­ namely, preloading and on-demand loading. age (modify and update), and retrieve UDCs and Fonts can be preloaded by sending a file that con­ their attribu tes. LDC support is currently offered tains the appropriate control sequences and font on the OpenVMS/Japanese, OpenVMS/Hanzi, and patterns, which are discussed in more detail later in OpenVMS/Hanyu systems. In the OpenVMS/Hanyu this section. The CMGR provides a command that system, the CMGR also supports Digital-defined generates a preload file from the fo nt database fo r characters, e.g., the DTSCS and DEC Recommended required UDCs. Characters (DRC). On-demand fo nt loading is a more complicated The CMGR manages a set of systemwide data­ mechanism, which involves an on-demand loading bases that store IDC attributes. Two liDC attributes protocol. Font patterns are retrieved from the font are currently supported, glyph images and col laring database through the CMGR cal lable interface by a values. fo nt-handling process. To represent the UDCs in the computer, the CMGR allows a user to assign each UDC a code point in the CMGR Collation Attribute Database To facilitate designated ! 'DC area. Currently, UDC characters are the sorting of data, including UDCs, the collation entered by directly typing their binary code. The weights of the characters must be defined .

LOCALIZED OPENVMS USER COMMANDS AND APPLICATION UTILITIES

USER- DEFINED MODE MULTIBYTE PROCESSING RTL CHARACTER SWITCHING SUPPORT LOCALIZED SCREEN MANAGEMENT RTL LOCALIZED OPENVMS CALLABLE UTILITY ROUTINES I ASIAN TERMINAL 1/0 SUBSYSTEM I

Figure 4 Asian OjJ enVMS System Structure

72 Vo l. 5 No. 3 Summer 1993 Digit�1l Te chnical journal Supporting the Ch inese, jap anese, and Ko rean Languages in the Op enVUS Op erating .\)'stem

Currently, only the OpenVMS/Hanzi and OpenVMS / request and passes it on to a process called the fo nt Hanyu systems offer this fe ature. hand ler. On behalf of the terminal, the fo nt handler retrieves the fo nt bit map of the requ�:stcd charac­ Asian Te rminal 11 0 Subsystem ter from the system fo nt database and sends it back to the terminal or printer, which in turn loads it The Asian terminal 1!0 subsystem is an extension into its RAM ancl resumes the display processing. of the standard OpenYMS terminal I/0 subsystem. Because it involves XON/XOFF flow control, which It consists mainly of the OpenVMS terminal class is done at a very low levd of the system, the process drivers/port drivers, auxil iary class drivers, and requires modifications to device drivers. The server processes, and handles both standard and amount of UDC fo nt is not limited by Wf:\1 capacity, Asian terminals simultaneously. For Asian termi­ because the terminal firmware au tomatically nals, the subsystem provides extended functions updates the memory. to support multibyte character hand l ing in the ter­ minal QIO system service, input method, code set conversion, and fo nt loading. Front-end Input Process (fJP) 12 One of the big­ gest diffe rences between Japanese and other Asian Te rminal QIO Sy stem Service/Multibyte Ch aracter language (e.g., Chinese and Korean) support on the Ha ndling The enhanced terminal QIO system ser­ OpenV\1S system is in the implementation of the vice can handle mixed ASCII and multibyte Asian input method. The nature of the kana-to-kanji characters in line input calls. Line editing (e.g., input method makes it unsuitable fo r implementa­ character echo, cursor movement, character dele­ tion in terminal firmware. The method requires a tion, character insertion, won.l delimiters, and huge input method dictionary (about 1 megabyte in character overstrike), line wrapping, uppercasing, size) and a dynamic memory work area fo r syntac­ and read verify ing will handle Asian characters tic and semantic analysis. Also, updating an input correctly. Because the QIO system service is the method dictionary that is implemented in firmware low�:st-level routine that handles terminal 1/0, all is a very costly operation. other text 1!0 routines such as UB$GET_INPUT, $GET RMS service, and the text 1!0 facility of pro­ Code Set Conversion Prior to the introduction of gramming languages such as C, Fortran, and COBOL the Asian OpenVMS variants, Digital's customers are layered on it. The enhancements automatically used video terminals and printers that support pro­ benefit all of these higher-level routines. prietary local language code sets from third-party vendors. To protect customer investments and to Fo nt Loading Asian terminal devices have ensure a smooth migration path fo r legacy equip­ writable fo nt memory (WFM), and the firmware ment, the Asian terminal l/0 subsystem provides an supports fo nt-loading sequences and logic. A text application-transparent, code set conversion facil­ file is scanned by a utility program prior to ou tput ity. This facility is bas<.:d on the terminal fallback to a terminal or printer. Tht: Asian terminal 1/0 sub­ facility (TFF) introduced in OpenVMS version 5.0, system then creates a preloading file, which con­ which provides a similar function fo r conversion tains the font-loading sequence fo r all nonresident between 7-bit National Replacement Character Sets characters fo und in the file. Next, the subsystem (NRCSs) and the 8-bit DEC MCS. TFF provides a mid­ sends this preloading file to the terminal or printer, driver that converts both incoming and outgoing causing the requ ired fo nts to be loaded in the fo nt data from one code set to another. For the Asian memory. Finally, the text file is output to the termi­ OpenVMS variants, the conversion logic is extended nal or printer. This method is limited by the size of to support 16-bit character entities. Currently, TFF the font memory, typically 300 to 500 characters. supports the conversion between the DOOSAN The fo nt preloading method is used mainly in batch code and the DEC Hangul code on the OpenVMS/ operations, such as line printers, where perfor­ Hangu l system and the MITAC TELEX code and the mance is an important factor. DEC Hanyu code on the OpenVMS/ Hanyu system. When an Asian video terminal or printer receives In addition, code set conversion is necessary an Asian character cocle and determines that it is a between heterogeneous systems because of the UDC, the terminal firmware automatically halts the proliferation of encoding schemes used by diffe r­ current processing and generates a fo nt request to ent vendors. For instance, Chinese I'Cs in Ta iwan the OpenVMS system. The terminal driver traps this use the BIG 5 code. To facilitate the communication

Digital Te cbtlicaljo u,-,a{ V/J/. 5 /1/(J. 3 Sunnner 1993 73 Product Internationalization

between the Open VMS system and PC desktop com­ ka nj i string, and getting the contents of the inter­ puters, the OpenYMS/Hanyu system supports the nal buffe r. The kana-to-kanji input method is pro­ conversion between the <; 5 code and the DEC grammed by cal ling a sequence of these routines. Hanyu code. This implementation gives the application the abil­ ity to directly control the screen management and allows flexibility in the design of the preedit user Asian Application Programming Support interface; however, the application must deal with To help software developers write Asian applica­ every detail of the input method, which is a disad­ tions on Asian OpenVMS variants, Digital provides vantage. In addition, the library JML!Fl helps the a set of common Asian multibyte character process­ application customize the keyboard mapping fo r ing RTL routines to supplement the standard ka na-to-k.anji conversion. 12 OpenVMS RTLs. In particular, our Asian localization The screen management (SMG) IH L on the effort to develop OpenVMS layered products uti­ OpenYiVIS system provides a suite of routines fo r lizes these RTLs. Functions provided by the Asian designing, composing, and keeping track of com­ language RTL (approximatdy 240 routines) are clas­ plex images on a character cell video terminal in a sified into the following categories of routines: device-independent manner. The standard SMG ver­

• Character conversion sion supports only the ASCII and DEC Special Graphics character sets and cannot correctly han­ • String dle multi byte Asian characters. For example, opera­ tions such as screen update optimization, boundary • Read/write processing (clipping on borders), and cursor move­ • Pointer ments operate on part of a multibyte Asian charac­ ter and cause screen corruption because of the • Comparison "one-character-is-equal-to-one-byte" assumption.

• Search The Asian OpenYMS variants provide an extended version of S1YIG (about 20 percent of the • Count original routines have been extended) to support

• Character type multibyte character sets and DEC MCS, in addition to ASCII and DEC Special Graphics. To maintain • Date/time downward compatibi lity, most routine entries

• Code set conversion remain identical, with an optional character set argument added at the end of the argument I ist The majority of the routine interfaces are com­ to indicate desired character set operations. mon to all Asian countries. Currently, one library Alternatively, users can define a logical name image supports the Hanzi, Hanyu, and Hangul lan­ SMG$DEfAU LT _CHARACTER_SET without ex pl icitly guage variants. Language-specific code is hidden passing the character set argument in the routine under this generic multibyte interface and call. Existing ASCII appl ications run unmodified switched at run time by a system logical name with the As ian SMG. New Asian applications that defined during system start-up. use multibyte fe atures relink with the new library. The OpenVMS/Japanese system has a set of rou­ tines fo r handling ka na-to-kanji conversion, both high level and low level. The high-level routines, Asian Commandsand Utilities such as JLB$GET_lNPUT, JLB$GET_COMMAND, and The Open VMS user interface determines the way an JLB$GET_SCREEN (Japanese versions of LIB$GET_ end user interacts with the system. The interface INPUT, LIB$GET_COMMAND, and LIB$GET_SCREl'N), includes such components as the DCL command hide the kana-to-kanji input method details from line interpreter, system help and messages, and all the application. These routines use the off-the-spot the system uti! ities provided by the OpenVMS sys­ preediting that usually takes place at the last line of tem. Selected user interface components of the the screen; however, the flexibility of the preed it OpenYMS system have been localized to support user interface is limited. A set of low-level routines Asian characters on the Asian OpenVMS variants. A performs primitive fu nctions such as opening the description of some of these local ized components conversion dictionary, finding the next candidate fo llows.

74 Vo l. 5 No. 3 Summer 1993 Digital Te cbnicaljournal Supporting the Ch inese, japanese, ana Ko rean Languages in the Op enVMS Op erating Sy stem

DCL Co mmand Line Interpreter The algorithms Sy stem Help and Messages The OpenVMS/Hanzi, in the standard DCL that assume characters to be OpenVJVIS/Hanyu, and OpenVMS/Hangul systems equal to 1 byte and interpret these characters as include a translated Asian language version of the ASCII/DEC MCS are enhanced fo r the fo llowing DCL OpenVMS system help library (accessed by typing primitives in the Asian code set modes: HELP at the $ prompt). The Asian version of the sys­ tem help library is placed in a directory that is sepa­ • Command parsing. Parsing of command input rate from the original Eng! ish one but that has the in single-byte units causes data corruption, same file name. The user can switch the language because parr of some multibyte Asian characters (English or the particular Asian language) of system can be mistaken fo r one of the special DCL ASCII help by using the ASIANGEN utility, which redefines characters such as 1, @, or ". Command parsing is the file specification logical to point to the appro­ now done in character units instead of byte priate file. units, and operations such as terminator, delim­ The OpenVMS/Japanese system provides a trans­ iter checks, and compression are lated Japanese version of the system messages skipped on Asian characters, since the DCL spe­ (SYSMSG.EXE), which is placed in a subdirectory of cial characters are all in ASCII. SYS$MESSAGE. Users can switch the language of the system messages by using the SET LANGUAGE com­ • Character uppercasing and lowercasing. Upper­ casing and lowercasing are applied only to ASCII mand, which reloads the message file into memory. characters, because the concept of uppercase/ In addition, most of the localized original utilities lowercase does not exist in Asian character sets. and Asian-specific utilities provide bilingual help Uppercasing/lowercasing in single-byte units and messages. corrupts Asian character data, because part of an Asian character can be indiscriminately SORT/MERGE Collation rules in the Asian lan­ uppercased/lowercased. guages are very diffe rent from those of the Latin languages. t:\ • Symbols and labels. Certain 8-bit values (those • Asian collation sequences. An Asian character with no character assigned in the DEC MCS) are has different collation sequences based on differ­ currently disallowed for DCL symbol names, ent attributes. The SORT/M ERGE command is symbol values, and labels. This restriction has extended as fo llows to include new subqualifiers been removed in the Asian modes to allow all fo r the Asian col lating sequences: /KEY=(POS:m, Asian characters in DCL symbols and labels. The CSIZE:n, ). The enhanced algorithms maintain separate symbol Asian OpenVMS SORT/M ERGE utility supports the tables for each of the code set modes, because of Asian collating sequences shown in Table 7. the possibility of code collision issues across dif­ fe rent code sets. • Collation weights. Unlike ASCII, the col lation weights of the Asian collating sequences cannot The Asian DCL command line interpreter is be derived by virtue of the code value. Rather, currently supplied with the OrenVIYIS/Hanzi, the string comparison fo r Asian collation OpenV:VIS/Hangul, OpenVMS/Hanyu, and OpenVMS/ sequences are driven by collation weight tables. Thai systems in the same binary image, i.e., a single For the standard characters, these tables are built image supports multiple code sets. The default into binary images that arc linked with the utility code set mode fo r DCL fo r a particular system is for fast access. establ ishecJ during system start-up by means of a defined logical name supplied with the start-up • Multibyte characters. String comparison in the procedureof each Asian OpenVMS variant. Switch­ original SORT/M ERGE operation is cl one in byte ing the code set mode between DEC MCS and the units, because a character is assumed to be equal particular Asian code set of the system is accom­ to I byte. For the Asian SORT/MERGE, a compari­ plished through a utility, e.g., HANZIGEN in the son operation must be aligned by character, i.e., OpenV.VIS/Hanzi system. The Asian DCL is not sup­ multibyte, units rather than by byte units. The pl ied with the OpenVMS/japanese system, because operation must be able to hand le the case in until only recently the japanese input method was which the start position of a sort key (specified not available at the DCL level. by a byte position) in a record is in the middle of

. SIIIIIIIIL'r Digital Te e/m ica/ joul'lwl llf>l. 5 Nu . ) I'J'J.j 75 Product Internal ionalization

Ta ble 7 Asian Collating Sequences Supported by the OpenVMS User Interface

Collation Sequence Ty pe OpenVMS/Japanese OpenVMS/Hanzi OpenVMS/Hanyu

Pronunciation Onyomi* Pinyin Phonetic_Code Kunyomit Kokugo� Kana8bit Radical Bushu Radical Radical Stroke Count Sokaku Stroke Stroke

Internal Code JIScode QuWei QuWei

Notes: ' denotes a Chinese reading.

t denotes a Japanese reading.

• denotes a Kana reading.

a multihyte charactn. Also, to avoid a truncation ported collation sequences using the new com­ problem at the key boundary. the size of the sort mand qualifier DIR/FOLD/COLLATINC;_sEQl JENCE= key (mixed ASUf and multibyte characters are (). allowed) is specified as a number of characters The MAIL utility invokes the Asian text editors instead of a number of bytes. by default instead of :nvoking the standard ones. The OpenVVIS/Japanese system incorporates tbe • Multiple passes. Sorting Asian characters by any Japanese input method to allow users to enter of the individual col lating sequences (excq)t Japanese characters. QuWei) may not produce a unique sort order. In general, multiple succ<.:ssive passes using EDT Tbe Asian OpenVN!S EDT editor was local­ diffe rent col lating sequences are needed to do ized and enhanced fo r Asian text ed iting. Much of so. Thus, the Asian O pen Vi\IS SORT/.\1 En<;E utility the work involved driving the term inal display allows a sort key specified with mull iplc passes correctly fo r Asian characters. In addition, the ed i­ of different collating sequences. In addition, if tor has a large number of new editing features. the /STA BLE qualifier is not specified, QuWei collation is always added last to the sort key to TP U/E VE Localization of TP! J and EVE deals fu rther classify records with identical co llation mainly with managing the screen update fo r mixed values. ASCll and Asian characters, such as cursor move­

• User-defined charact<.:rs. The Asian OpenVMS ment and screen boundary hand ling. Both the ·rru SORT/M ERGE utility supports collation of liDC:s. ed iting engine and the EVE interface were modi­ When a 1 JDC is encountt: red, tht· SORI/M EH<;E fied. Asian-spn:ific TPU built-in proced ures were operation ret rieves the collation weight from a added, and existing ones were enhanced . String system database maintained by the Uvl< dZ utility search is now aligned at the character boundary with tbe value defined by a user when the char­ rather that on byte units. acter was registered. For the .J apanese TPlJ/EVE, one of the most diffi­ cult tasks is to incorporate the Japanese input ;\!fA IL Most of the work involved in localizing the m<.:thod. Th is requires managing overlap windows !VIA lL utility enhanc<.:s the user interface to use in a cha racter cel l terminal between the input Asian characters. String search enhancements method working area and the background editing al low p rocess ing by character units instead of by area. byte units. St ring uppercasing is not appl ied to Asian characters. The subject fic lcl , the personal DE Cwindmus S) ·ste111 With the increasing empha­ name field, and the fo lder names can all contain �is on intcrnational ization fe atures in the Xll and Asian characters. The I isting of mail fo lders can OSF/Motif standards, OpenV MS DEC:wimlows sys­ be displayed in so rted order in any of the su p- tems provide these features and the localization

7

fe atures demanded by the market. For a description - The standard OpenVM S environment does not of the latest internationalization support in the X support the application-transparent process­ Window System standard, refer to the book by ing of liiX:s. I Schei flcr and Gettys. 1 - The writing dire ction of Asian languages can be vertical, i.e., from top to bottom. The stan­ Asian Op enVMS Localization Issues dard OpenVMS environment assumes hori­ The Asian Open VMS effort has been addressing vari­ zontal, left -to-right languages. ous technical and engineering issues.1'uc.. J- This sec­ tion discusses the major ones. Engineering Issues Historically, the Asian localization of the OpenVJVIS Te chnical Issues system has been organized as an engineering effort Localization of the OpenVMS components to sup­ that is separate from mainstream development. As port the Asian languages requ ires reengineering the a resu lt, a number of engineering constraints and program codes and text translation. The need to overhead costs exist. reengineer source code arises fo r two main reasons. • Single language support. The design goal t( )r the l. OpenVMS components make fu ndamental Asian OpenV\1S variants, as driven by the local programming assumptions and practices based market requirements, has been targeted at sup­ on the ASCII and DEC MCS character sets. For porting a single language on one system, i.e., one example, language variant per system. As a result, no spe­ - OpenVMS com ponents assume the character cial design considerations are given to support­ set to be ASCII (plus DEC MCS in some cases), ing multiple languages on one system. and blindly uppercase and lowercase charac­ • Fu ll upward compatibility. The top design ters, validate characters against the DEC MC:S, re irement is to keep fu ll downward compati­ and define printable characters according ro bility with original ASCII/DEC MCS OpenV\1S the ASCII and MC:S encodings. DEC systems. All ASC II/DEC MCS appl ications with - OpenVMS components assume characters to ex isting data must be able to ru n unchanged on be 1 byte and use string manipulation algo­ the Asian OpenVMS variants. In fa ct, an Asian rithms based on 1-byte units. OpenVMS system can, at any time, be reset to - OpenVMS components assume the display operate in the original DEC MCS mode, if desired . width of a character to be of fixed length Therefore, most localized components must be (1 byte) and use screen display management able to sw itch between the standard and Asian algo rithms based on the assumption that code paths. System mechanisms fo r determining 1 byte equals one display column. the current language variant and operating - OpenVMS components assume that the char­ mode are required. acter count, the byte count, and the display • Optimal performance. Anotherdesign goal is to width are the same. and use string manipula­ minimize any performance impact on standard tion algorithms and character cel l terminal Engl ish components. As a resu lt, Asian codes are screen display management based on this designed around standard code paths. For exam­ assu mption. ple, branches fo r Asian code are placed at the 2. Some functionality that is required to support. end of a conditional statement, and Asian code Asian languages is missing in the standard branches out from the main line code using spe­ Open VMS environment, For example, cial hooks. - Keyboard input of Asian characters re quires • Limited or no kernel changes. Since Asian code more complicated input method processing changes are not merged into the mainstream, than is ava ilable in the standard OpenVMS kernel changes in Asian code would be very diffi­ environment. cult to maintain with new OpenVMS releases. In - Collation rules of Asian languages are radi­ addition, any kernel changes in the standard cally diffe rent from English col lation rules, OpenVMS release will likely hn:ak the Asian on which the standard OpenVMS environ­ code. This puts a constraint on supporting Asian ment is based. languages in OpenVJ\IIS kernel components.

Digital Te chnical .Journal H1/. 5 No. 3 Sl//)/ll/!'r I'J'J.i 77 Product Internationalization

• Commonality. Because the Asian languages References share a lot of commonality, techniques such I. T. Greenwood, "International Cu ltural Differ­ as common source are used fo r most Asian ences in Software," Digital Te clmical.Jounwl, localized components to maximize engineering vo l. 5, no. 3 (Summer lS>93, this issue): 8-20. return by sharing common Asian localization code. 2. Code of the .Japanese Graphic Character Set fo r lnfonnation lnterchange, ]IS C 6226-1978 Conclusions (: Japanese..: Standards Association, 1978) Local language processing has become a mandatory functionality fo r computer systems sold in Asian 3. Code of the .Japanese Graphic Character Set markets. From the OpenVMS operating system per­ fo r b�(ormation Interchange, .JIS X 0208-1983 spective, the basic local Asian language processing (Tokyo: Japanese Standards Association, requirements are being addressed by its Asian 1983). language variants in a single-language-for-a-single­ 4. Code of the .Japanese Grapbic Cbaracter Set locale manner. With global trade and the technol­ fo r lnj(mnation Interchange, .TISX 0208-1990 ogy trend of distributed computing systems, the (Tokyo: Japanese Standards Association, challenge fo r the future is to be able to provide 1990). OpenVMS services simultaneously to multiple cl ients operating in different languages and code 5. Code of the Supplementary .fapa!lese sets. Such a requirement leads to the concept of a Graphic Character Setfor lnfonnation In fer­ multilingual operating system, which allows soft­ change, )IS X 0212-1990 (Tokyo: Japanese Stan­ ware applications to run irrespective of the lan­ dards Association, 1990). guage and/or code set they support. With the availability of the ISO 10646 Universal Character Set 6. Code fo r Information Interchange, ]IS X 020 1- 1976 (UCS) standard, the set of tools fo r building such a (Tokyo: Japanese Standards Association, 1976). multilingual operating system has been enhanced. tH From an engineering perspective, the current 7. Code of Ch inese Graphic Character Set for Asian localization approach of OpenVMS has been Information Interchange, GB 2312-1980 (­ adopted historically because of a number of fa ctors jing: Te chnical Standards Publishing, 1981). and constraints, such as the organization of engi­ neering resources and the initial need to bring the 8. Standard Interchange Code fo r General�y­ capability rapidly to the market. The reengineering used Ch inese Ch aracters, CNS 11643-1986 techniques are geared toward the character set (lhipei: National Bureau of Standards, 1986) encoding schemes currently supported. The 9. Chi11ese Standard Interchange Code, CNS arrangement of performing localization remotely 11643-1992 (: National Bureau of Stan­ and independently from the original mainstream dards, 1992) development has meant costly reengineering and maintenance overheads in the long term. With the 10. Code /(J r Inforrnation Interchange ( 1-Imz_r.: ul industrial trend of shipping global software simul­ and Ha nja) , KS C 5601-1987 (: Korean taneously satisfying multiple diffe rent local market Industrial Standards Association, 1989). requirements, an international product engineer­ 11. Information Processing-/SO 7- bit and 8-bit ing approach must be taken to minimize the cost Coded Ch aracter Sets-Code Extension Te ch­ of worldwide system engineering to deliver a niques, 3d eel., ISO 2022 (Geneva: Interna­ global product. In particular, the original product tional Organization fo r Standardization/ must be internationalized from the ground up, so International Electrotechnieal Commission, that no separate reengineering is needed during 1986). localization to support a local market. In addition, to achieve simultaneous worldwide delivery, con­ 12. T. Honma, H. Baba, and K. Ta kizawa, current engineering of loca lization needs to be "Japanese Input Method Independent of performed closely in parallel with the product Applications," Digital Te chnical Jo urnal, development. vol. 5, no. 3 (Summer 1993, this issue): 97- 107.

78 Vo l. 5 No. 3 Summer 19'.!3 Digital Te cfJIIical jo urnal Supporting the Ch inese, japanese, and Ko rean Languages in the Op enVMS Op erating System

13. R. Haentjens, "The Ordering of Universal ment Corporation, Order No. AD-PGOBA-TE, Character Strings," Digital Te chnicaljo urnal, December 1990). vol 5, no. 3 (Summer 1993, this issue): 43-52. 17 Addendum to Technical Guide to Asian Lan­ 14. R. Scheifler and ). Gettys, X Window System, guage Software Local ization (Maynard, MA: X 11, Release 5, 3d ed. (Burlington, MA: Digital Digital Equipment Corporation, Order No. Press, Order No. EY-.J802E-DP-EEB, 1992). AD-PGOCA-TE, December 1990). 15. Introduction to Asian Language Software 18. Information Te chnology - Un iversal Multi­ Localization (Maynard, MA: Digital Equip­ ple-Octet Coded Character Set (UCS) -Part 1: ment Corporation, Order No. AD-PGOAA-TE, Architecture and Basic Multilingual Plane, December 1990). ISO/lEC 10646-1 (Geneva: International Orga­ 16. Te chnical Guide to Asian Language Software nization fo r Standardization/International Localization (Maynard, MA: Digital Equip- Electrotechnical Commission, 1993).

Digital Te ch7lical Jn ur11a.l 11>1. 5 No. 3 Summer 19?3 79 Hirotaka Yo shioka Jim Melton

Ch aracter Internationalization in Databases: A Ca se Study

(haracter internationalimtirm: jJuses dijficultproblems fo r database management systems because they must address user (s tored) data, source code, and metadata. The revised ( 1CJI.J2) standardfo r database language SQL is one of the fi rst standards to address internationaHration in a sign ificant Wt�y. DEC Rdb is one of the fe w Digital products that has a complete internationaliza tion (A sian) implementation that is also iWIA compliant. The product is still evolving fr a noninternational­ ized product to a fu lly internationalized one; this evolution has ta ken fo ur years and provides an ex cellent example of the issues that must be resolved and the approaches to resolving them. Rdb can serve as a case study fo r the software engi­ neering community on how to build internationalized products.

International ization is the process of producing Internationalization in the specifications anc.l products that operate wel l in SQL Standard many languages and cultu res. 1 Internationalization Like most computer languages, SQL came into being has several different aspects such as character set with the minimal set of characters required by the issues, date and time representation, and currency language; vendors were free to support as many, or representation. Most of these affect many areas of as fl:w, additional characters as they perceived their information technology where the solutions are markets demanded. There was I ittle, if any, consid­ reasonably similar: for example, solutions to cur­ eration given to portability beyond the English rency representation are equally applicable to language customer base. In 1989, after work was database systems and to programming languages. completed on ISO 9075 :1989 and A!'\1 51 X3.135 -1989 Database systems, however, are affected in several (SQL-89), significant changes were proposed fo r the unique ways, all of which deal with character sets. next revision of the SQL database language to In this paper, we fo cus on the issues of character set address the requ irement fo r additional character set internationalization in database management sys­ support. (Unfortunately, this put SQL in the van­ tems (DB.\IS) and do not address the other aspects of guard, and I ittle support existed in the rest of the date and time, currency, or locales. standards community fo r this effort.) To better understand the problems and solutions associated with character internationalization of database systems, we present an overview of the Character Set Support solutions fo und in the SQL standard and report SQL must address a more complex set of require­ a case study of implementing those solutions in a ments to support character sets than other pro­ commercial product. We first discuss the character gramming languages due to the inherent nature internationalization fea tures supported in the of database systems. Whereas other program­ recently published revision of the standard for ming languages have to cover the character set used Database Language SQL (ISO/TEC 9075 :1992 and to encode the source program as wel l as the char­ ANSI X:1.135 -l992) 2 We then describe in some detail acter set fo r data processed by the program, the application of those fe atures in DEC Rd b, database systems also have to address the character Digital's re lational database product. The interna­ set of the metadata used to describe the user data. tionalization of DEC Rdb serves as a case study, or a In other words, character set information must model, for the internationalization of Digital's soft­ be known within three places in a database ware products in ge neral. environment.

80 Vo l. 5 No . .l Summe-r 19 9.! Digital Te cbnicaljounwl Ch aracter Internationalization in Databases: A Case Study

1. The user data that is stored in the database or in Japanese kanji characters? Furthermore, what if that is passed to the database system from the the name of somespeci fic department was actually application programs. expressed in Hebrew (because of a business rela­ tionship)? That means that our database would have In SQL, data is stored in tables, which are two­ to be able to handle data in Hebrew characters, dimensional representations of data. Each record metadata in Japanese characters, and source code of data is stored in a row of a table, and each field using Latin characters' in a row corresponds to a column of a table. Al l One might reasonably ask whether this level of the data in a given column of a table has the same fu nctionality is really required by the marketplace. data type and, fo r character data, the same char­ The original impetus for the character internation­ acter set. alization of the SQL standard was provided by pro­ 2. The metadata stored in the database that is used posals arisin g from the European and Japanese to describe the user data and its structure. standards participants. However, considerable (and enthusiastic) encouragement came from the In SQL databases, metadata is also stored in tabu­ X/Open Company, Ltd. and from the Nippon lar fo rm (so that it can be retrieved using the Te lephone and Te legraph/Multivendor Integration same language that retrieves user data). The Architecture (NTI/ MlA) project, where this degree metadata contains information about the struc­ of mixing was a firm requ irement. ' ture of the user data. For example, it specifies The situation is even more complex than this the names of the users' tables and columns. example indicates. In general, application pro­ grams must be able to access databases even though 3. The data management source code. the data is in a different character encoding than Data management statements (for querying and the application code' Consider a database contain­ updating the database) have to be represented as ing ASCII data and an application program written character strings in some character set. There in extended binary coded decimal interchange ·are three aspects of these statements that can be code (EBCDIC) for an IBM system, and then extend independently considered. The key words of the that image to a database containing data encoded language (like SELECT or UPDAT E) can be repre­ using the Japanese extended UNIX code (EUC) sented in one character set-one that contains encoding and an application program written in only the alphabetic characters and a few special ISO 2022 fo rm. The program must still be able to (e.g., punctuation) characters; the character access the data, yet the character representations string literals that are used fo r comparison with (of the same characters) are entirely different. database data or that represent data to be put Although the problem is relatively straightforward into the database; and the identifiers that repre­ to resolve fo r local databases (that is, databases sent the names of database tables, columns, and residing on the same computer as the application), so forth. it is extremely difficult fo r the most general case of heterogeneous distributed database environments. Consider the SQL statement

SELECT EMP_I D FROM EMPLOYEES Addressing Th ree Iss ues WHERE DEPARTMENT = 'Purchasing' To support internationalization aspects, three dis­ tinct issues have to be addressed: data representa­ In that statement, the words SELECT, FROM, and tion, data comparison, and multiple character set WHERE; the equals sign; and the two apostrophes support. are syntax elements of the SQL language itself. EMP _ID, EMPLOYEES, and DEPARTMENT are names of Data Representation How is the data (including database objects. (EMPLOYEES is a table; the other metadata and source code) actually represented? two are columns of that table.) Finally, Purchasing The answer to this question must address the actual is the contents of a character used to repertoire of characters used. (A character reper­ compare against data stored in the DEPARTM ENT toire is a collection of characters used or available column. for some particular purpose.) It must also address That seems straightforward enough, but what if the fo rm-of-usc.: of the charactc.:r strings. that is, the the database had been designed and stored in Japan ways that characters are strung together into char­ so that the names of the table and its columns were acter strings; alternatives include fixed number of

Digital Te ch"llical ]ounml Vol. 5 Nu. .i Summer 1993 81 Product Internationalization

bits per character, like 8-bit characters, or variable What if the character sets of data in the source number of bits per character, like ISO 2022 or ASN.l. program are different from those in the database? Finally, the question must deal with the character Rules must exist to provide the ability fo r programs encoding (for example, ASCII or EBCDIC). The com­ to query and modify databases with different char­ bination of these attributes is called a character set acter sets. in the SQL standard. It is also possible for the data to be represented in Co mponents of Chara cter different ways within the database and in the appli­ In ternationalization cation program. A column definition that specifies SQL recognizes fo ur components of character inter­ a character set would look like this nationalization: character sets, collations, transla­ NAME CHARACTER VARYING (6) tions, and conversions. Character sets are described CHARACTER SET IS KANJI, above; they comprise a character repertoire, a fo rm­ or of-use, and an encoding of the characters. Colla­ tions are also described above; they specify the NAME NATIONAL CHARACTER VARYING (6), rules fo r comparing character strings expressed in (which specifies the character set defined by the a given character repertoire. product to the national character set), while a state­ Translations provide a way to translate character ment that inserts data into that column might be strings from one character repertoire to a different INSERT INTO EMPSCNAME) (or potentially the same) repertoire. For example, vALuEs I t£�.1 ( ..- , _K ANJ I , •••) ; one could define a translation to convert the alpha­ betic letters in a character string to all uppercase If the name of the column were expressed in letters; a different translation might transliterate hiragana, then the user could write japanese hiragana characters to Latin characters. INSERT INTO EMPSC HIRAGANA 2/j:i ;{_) 1'11'-�- By comparison, conversions allow one to convert a vALuEs ( ..- , ::::: J I I I , • ••) ; character string in one fo rm-of-use (say, two octets Data Comparison How is data to be compared? per character) into another (for example, com­ AJ l character data has to be compared using a colla­ pound text, a fo rm-of-use defined in the X Window tion (the rules fo r comparing character strings) . System). Most computer systems use the binary values of SQL provides ways fo r users to specify character each character to compare character data 1 byte at sets, col lations, and translations based on standards a time. This method, which uses common charac­ and on vendor-provided faci lities. The current draft ter sets I ike ASCII or EBCDIC, general ly does not pro­ of the next version of the SQL standard (SQL3) also vide meaningfu l results even in English. It provides allows users to define their own character sets, col­ far less meaningful results for languages like French, lations, and translations using syntax provided in Danish, or Thai. the standard 45 If these facilities come to exist in Instead, rules have to be developed fo r language­ other places, however, they will be removed from specific collations, and these rules have to resolve the SQL standard (see below). SQL does not provide the problems of mbdng character sets and colla­ any way fo r users to specify their own conversions; tions within SQL expressions. only vendor-provided conversions can be used. Applications can choose to fo rce a specific colla­ tion to be used fo r comparisons if the default col­ Interfacing with Application Programs lation is inappropriate: Application programs are typically written in a

WHERE :hostva r = NAME COLLATE JAPANESE third-generation language (3c;L) such as Fortran, COBOL, or C, with SQL statements either embedded Multiple Ch aracte1· Set Support How is the use of in the application code or invoked in SQL-only pro­ multiple character sets hand led? The most power­ cedures by means of CALL-type statements 6 As a ful aspect of SQL is its ability to combine data from resu lt, the interface between the database system multiple tables in a single expression. What if the and 3GL programs presents an especially difficult data in those tables is represented in diffe rent char­ problem in SQL's international ization facilities. acter sets? Rules have to be devised to specify the Figure 1 illustrates the procedme to invoke SQL results fo r combining such tables with the rela· from C; Figme 2 shows SQL as it is invoked from C; tiona! join or union operations. and Figure 3 shows SQL schema.

82 vbl.5 No. :) Sum met· 1993 Digital Te clmical]ournal Character Internationalization in Databases: A Case Study

rnain () { #include #inc lude #inc lude "SQL92 .h" I* Interface to SQL-92 *I

static sq lstate char[6J; static employee_n umber char[?J; static emp loyee_n ame wchar_t [26J; static emp loyee_ contact char[13J;

I* Assume some code here to produce an appropriate employee number va lue *I

LOCATE CONTACT

I* Assume more code here to use the result *I

}

Figure 1 Invoking SQL fr om C

MODULE i18n_demo NAMES ARE Latin1 LANGUAGE C SCHEMA personnel AUTHORIZATION management

PROCEDURE locate contact ( : emp_n um CHARACTER (6) CHARACTER SET Asci i, :emp_name CHARACTER VARY ING (25) CHARACTER SET Uni code, :contact_n ame CHARACTER VARY ING (6) CHARACTER SET Shift_ jis, SQLSTATE ) SELECT name, contact_i n_j apan INTO :emp_name, :contact name FROM personne l.emp loyees

WHERE emp_i d = :emp_n um;

Figure 2 SQL Invokedfr om C

CREATE SCHEMA personnel AUTHORIZATION management DEFAULT CHARACTER SET Uni code

CREATE TABLE emp loyees ( em p_ i d CHARACTER (6) CHARACTER SET Asci i, name CHARACTER VARYING (25), depa rtment CHARACTER (10) CHARACTER SET Latin1, sa lary DECIMAL (8,2), contact_i n_j apan CHARACTER VARYING (6) CHARACTER SET ft_j is,

PRIMARY KEY (emp_i d)

Figure 3 SQL Schema

Digital Te chnical jo urnal Vo l. 5 No. 3 Summer 1993 83 Product Internationalization

In these figures, all the metadata values (that is, Remote Database Access Is sue the identifiers) are expressed in Latin characters; As mentioned, a distributed environment presents this resolves the data representation issue. The significant difficulties fo r database internationaliza­ reader shou ld compare the character sets of the tion. A simple remote database access scenario data items in the EMPLOYEES table and the con·e­ illustrates these problems. If an application pro­ sponding parameters in the SQL procedure. The dif· gram must access some (arbitrary) database via a ficulties arise when trying to achieve a correlation remote (e.g., network) cotmection, then the remote between the parameters of the SQL procedure and database access facility must be able to deal with all the arguments in the C statement that invokes that the character sets that the application and database procedure. use together; it may also have to deal with differ­ The C variable employee_number corresponds ences in available character sets. (See Figure 4.) to the SQL parameter :emp_num; the C data type An ISO standard fo r remote database access char is a good match fo r CHARACTER SET ASCII. The (ISO/IEC 9579-1 and 9579-2) uses the ASN.l notation C variable employee name corresponds to the SQL and encoding fo r transporting SQL commands and

parameter :emp_name; the C data type wchar_t database data across remote connections. 7 ASN .1 is chosen by many vendors to match CHARACTER notation, as presently standardized, provides no SET Unicode. However, CHARACTER SET Shift_jis is way to use various character sets in general. more complicated; there is no way to know exactly Recently work has begun to resolve this problem. how many bytes the character string will occupy The revised standard must allow a character set to because each character can be 1 or 2 bytes in be specified uniquely by means of a name or identi­ length. Therefore, we have allocated a C char that fier that both ends of the connection can unam­ permits up to 13 bytes. Of course, the C run-rime biguously interpret in the same way. The individual library wou ld have to include support fo r ASCII characters in ASN.l character strings must be simi­ data, Unicode data, and ShiftJIS data. larly identifiable in a unique way. Typically, 3GL languages have little or no support This problem has not yet been resolved in the fo r character sets beyond their defaults. Conse­ standards community, partly because several quently, when transferring data from an interna· groups have to coordinate their effo rts and produce tionalized SQL database into a noninternationalized compatible solutions. application program, many of the benefits are lost. Happily, that situation is changing rapidly. Program­ Hop e fo r the Future ming language C is aclcling facilities fo r handl ing In the past, programming languages, database additional character sets, and the ISO standards systems, networks, and other components of group responsible fo r programming languages information management environments had to deal (ISO/IEC JTCl/SC22) is investigating how to adcl with character sets in very awkward ways or use those capabilities to other languages as welL vendor-provided defaults. The resu lt has been an The most difficult issue to resolve concerns the incred ible mess of 7-bit (ASCII, fo r example) and differences in specific character sets (especially 8-bit (Latin-1. fo r example) code sets, PC code fo rm-of-use) supported by SQL implementations pages, and even national variants to all of these. The and 3GL implementations. As with other issues, number of code variants has made it very difficult purely local situations are easy ro resolve because fo r a database user to write an application that can a DBMS and a compiler provided by the same vendor be executed on any database system using recom· are likely ro be compatible. Distributed environ­ pilation only. Col lectively, they make too many ments, especially multivendor ones, are more com­ assumptions about the character set of all character plicated. SQL has provided one solution: it permits data. the user to write SQL code that translates and converts the data into the fo rm required by the application program as long as the appropriate con­ APPLICATION DATABASE versions and translations are available fo r use by PROGRAM SYSTEM SQL. Of course, once the clarahas been transferred REQUIRES UNICODE into the appl ication program, the question SUPPORTS LATINI re mains: What fa cilities does it have to manipulate that data? Figure 4 Remote Database Access

84 Vo l. 5 No. :) Summer 1993 Digital Te cht1icaljour71al Character Internationaliza tion in Databases: A Case Study

The fu ture outlook fo r database internationaliza­ a language fo r these specifications, it is probable tion was improved dramatically by the recent adop­ that this capability will be withdrawn from the SQL3 tion of ISO 10646, Universal Multiple-Octet Coded specification. This decision would completely align Character Set (UCS) and an industry cou nterpart, the SQL character internationalization capabilities Unicode 8 The hope is that Unicode will serve as a with the rest of the international standards efforts. "16-bit ASCII" fo r the future and that all new systems After other standards fo r these tasks are in place, will be built to use it as the default character set. however, the remote data access (RDA) standard Of course, it will be years-if not decades­ will have to be evolved to take advantage of them. before all installed computer hardware and soft­ RDA must be able to negotiate the use of character ware use Unicode. Consequently, provisions have sets for database applications and to transport the to be made to support existing character sets (as information between database clients and servers. SQL-92 has clone) and the eccentricities of existing In order for RDA to be able to do this, the ASN.l stan­ hardware and software (like networks and file sys­ dard will have to support arbitrary named character tems). As a result, several different representations sets and characters from those sets. of Unicode have been developed that permit trans­ As a result, relevant standards bodies will need to mission of its 16-bit characters across networks that provide (1) names fo r all standardized character are intolerant of the high-order bit of bytes (the sets and (2) the ability fo r vendors to register their eighth bit) and that permit Unicode data to be own character sets in a way that allows them to stored in file systems that deal poorly with all the be uniquely referenced where needed. Still other bit patterns it permits (such as octets with the bodies will need to provide language and services value zero). for defining collations and translations. Finally, In the past fe w years, many alternative character registries will need to be established fo r vendor­ representations have been considered, proposed, supplied collations, translations, and conversions. and implemented. For example, ISO 2022 specifies Of course, the greatest task will be to provide how various character sets can be combined in complete support fo r all these facilities throughout character strings with escape sequences and gives the information processing environment: operat­ instructions on �w itching between them 9 Similarly, ing systems, communication links, crus, printers, ASN.l-like structures, which provide fully tagged keyboards, windowing systems, file systems, and so text, have been used by some vendors and in some forth. Healthy starts have been made on some of standards, e.g., Open Document Arch itecture. u• these (such as the X Window System), but much None of these representations has gained total work remains to be done. acceptance. Database implementors perceive diffi­ culties with a stateful model and with the potential DEC Rdb:An Internationalization perfo rmance impact of having a varying number of Case Study bits or octets fo r each character. UCS and Unicode DEC Rdb (Rdb/VMS) is one of the few Digital prod­ appear to be likely to gain wide acceptance in the ucts that has an internationalized implementation database arena and in other areas. that is also compliant with the multivendor inte­ gration architecture (MIA) H·12 Its evolution from a Fu ture Wo rk fo r the SQL Sta ndard noninternationalized product to a fu lly internation­ One should not conclude that the job is done, that alized one has taken four years to achieve. The there is nothing left to work on. Instead, a great deal design and development of Rclb can serve as a case of work remains before the task of providing full study fo r software engineers on how to build inter­ character set internationalization fo r database sys­ nationalized products. In this half of our paper, we tems is completed. present the history of the reengineering process. At present, the working draft for SQL3 contains Then we describe some difficu lties with the reengi­ syntax that wou ld allow users to define their own neering process and our work to overcome them. character sets, collations, and translations using a Finally, we evaluate the result. nonprocedural language 4·S In general, the SQL stan­ dards groups believe that it is inappropriate for a Localiza tion and Reengineering database standard to specify language fo r such The localization process comprises all acttv1t1es widely needed facilities. Consequently, as soon as required to create a product variant of an applica­ the other responsible standards bodies provide tion that is suitable for use by some set of users with

Digital Te chnical jom·,a/ Vo l. - No. 3 Summer 1993 85 Product Internationalization

similar preferences on a particular platform. 5. The collating sequence meets the customer's Reengineering is the process of developing the set needs. of source code changes and new components 6. The messages are in the language the customer required to perform localization. DEC Rdb had to be uses. reengineered to support several capabilities that

are mandatory in Japan and other Asian countries. 7 The character set of the source code is not nec­ Our experience has shown that the n:cngi­ essarily the same as it is at run time. neering process is very expensive and should be avoided. If the original product was not designed 8. The file code is not necessarily the same as the fo r internationalization or localization, however, process code. reengineering is a necessary (and unavoidable) The reengineering process has two significant evil. Typically, reengineering is required; so we drawbacks: (I) the high cost of reengineering and decided to develop a technology that would avoid (2) the time lag between shipping the product to reengineering and to build a truly internationalized the customer in the United States and shipping product. to the customer in .Japan. The time lag can be reduced Most engineering groups fo llow the old assump­ but cannot be eliminated as long as we reengineer tions about product design. These assumptions the original product. If a local product is released include the following: simultaneously with the original, both Digital and

I. The character set is implicitly ASCII. the customers will benefit significantly In the next section, we fo llow the DEC Rdb prod­ 2. Each character is encoded in 7 bits. uct through the reengineering process required to produce the Japanese Rdb version 3.0. 3. The character count equals the byte count and equals the display width in columns.

4. The maximum number of distinct characters Reengineering Process is 128. DEC Rd b version 3.0 was a major release and conse­ quently was very important to the Japanese market. 5. The collating sequence is ASCII binary order. The International System Engineering Group was 6. The messages are in English. asked to release the Japanese version by the end of 1988, which was within six months of the date that 7 The character set of the source code is the same it was first shipped to customers in the United as it is at run time. States. 8. The file code (the code on the disk) is the same as the process code (the code in memory). japanese and Asian Language Diffe rent user environments require different Requirements to VAX Rdb/VMS product capabilities. Japanese kanji characters are Japanese and Asian language requirements apply to encoded using 2 bytes per character. If a product DEC Rdb and other products as wel L The require­ assumes that the character set is 7-bit ASCII, that ments common to Asian languages are 2-byte char­ product must be reengineered before it can be used acter handl ing, local language ed itor support, and in Japan. On the other hand, internationalized prod­ message and help file translation. ucts can operate in different environments because Japanese script uses a 2-byte code; therefore they provide the capabilities to meet global require­ 2-byte character handling is mandatory For exam­ ments. These capabilities include the fo llowing: ple, character searches must be performed on 2-byte boundaries and not on 1-byte boundaries. If a string 1. Multiple character sets ensure that the customer's has the hexadecimal value 'A1A2A3A4', then its sub­ needs are met. strings are 'AlA2' and 'A3A4'. 'A2A3' must not be 2. Each character is encoded using at least 8 bits. matched in the string. Digital's Asian text editors, e.g., the Japanese text 3. The character count does not equal the byte processing utility (JTPU) and Hanzi TPU (for China), count or the display width. must be supported as well as the original TPU, 4. The maximum number of unique characters is the standard EDT editor, and the language-sensitive unknown. editor.

86 Vo l. 5 No. 3 Summer 1993 Digital Te ch11icaljo urnal Character Internationalization in Databases: A Case Study

Messages, help files, and documentation must all BUILD ENVIRONMENT be translated into local languages. SOURCE CODE ACQUISITION The country-specific requirements include sup­ ACQUISITION port for a Japanese input method. Kana- to-kanji MAINTENANCE OF THE input methods must be supported in command REENGINEERED CREATION OF THE SYSTEM DEVELOPMENT lines. In addition, 4-byte character handl ing is ENVIRONMENT FOR NTT/MlA required fo r Ta iwan (Hanyu). Finally, SQL TEST OF THE THE JAPANESE VERSION fe atures must be added fo r Japan. RESULTS Since there are not many requirements, one STUDY OF THE might conclude that the reengineering task is not ORIGINAL CODE difficult. However, reengineering is compl icated, MODIFICATION OF expensive, and time consuming; and thus should be THE SOURCE CODE avoided. Note: Each segment of the chart represents the project time (person-weeks) required to complete each step in the reengineering process. Reengineeringjapanese Rdb Ve rsion 3.x

A database management system like DEC Rdb is very Figure 5 Reengineering Process fo r complex. The source code is more than 810,000 japanese Rdb Ve rsion 3.x lines; the build procedures are complicated; and a mere subset of the test systems consumes more than one gigabyte of disk space. Consequently, the the environment became cheaper after the first reengineering process is complicated. The process time. The other steps such as modifying the source encompasses more than modifying the source code, testing, and maintenance remained at almost code. Instead, a number of distinct steps must be the same cost. accomplished: Reengineering Metric 1. Source code acquisition We modified about 10 percent of the original 2. Build environment acquisition source modules during reengineering. Most of the modification occurred in the fro nt end, e.g., SQL 3. Test system acquisition and RDML (relational database manipulation lan­ 4. Creation of the development environment fo r guage). The engine parts, the relational database the Japanese version management system (RDMS), and KO DA (the kernel of the data access, the lowest layer of the physical 5. Study of the original code data access) were not modified very much. Ta ble l 6. Modification of the source code gives the complete reengineering metrics.

7. Te st of the results, including the new Japanese (modified modules + functionality and a regression test of the original . . . new created modules) Reengmeenng metnc = ------� functionality (original + modified + 8. Maintenance of the reengineered system new created modules)

Figure 5 shows the development cost in person­ Coengineering Process: weeks fo r each of the eight steps. Two engineers No More Reengineering stabilized the development environment-com­ To reduce and eliminate reengineering, we have pile, link/build, and run-for version 3.0 of DEC taken a conservative, evolutionary approach rather Rdb in approximately fo ur months. It is likely that than a revolutionary one. We used only proven the process required fo ur months because it was technologies. The evolution can be divided into our first development work on DEC Rdb. In addi­ three phases: tion, approximately two months were needed to be able to run the test system. It was not an easy task. 1. Jo int Development with Hong Kong. Our devel­ Each step had to be repeated fo r each version of opment goal was to merge Japanese, Chinese the original. (Project time decreased a little.) Every (People's Republic of China and Ta iwan), and version required this reengineering, even if no new Korean versions into one common Asian Rdb functionality was introduced. The cost of building source code.

Digital Te clmical journal Vo l. 5 No. 3 Summer 1993 87 Product Internationalization

Ta ble 1 Reengineering Metrics

Reengineering Modified Total Size in Facility Metric Modules Modules Kilo Lines

SQL 6.3% 8 128 226.0 RDML 11.7% 11 94 188.3 ROMS 3.1 % 4 127 154.0 KODA 0.6% 1 157 109.8 RMU 0.0% 0 41 80.5 Dispatcher 0.0% 0 30 60.9

Notes:

RMU is the Rdb management utility; it is used to monitor, back up, restore, and display DEC Rdb databases.

The reengineering metric for JCOBOL (a Digital COBOL compiler sold in Japan) is 47/258 = 18.2%; the size is 225.0 kilo lines.

2. Coengineering Phase 1. Our goal was to merge reduce reengineering costs in Asia. To achieve our Asian common Rdb into the original master goal, we devised several source code conventions. sources fo r version 4.0. The merger of]-Rdb and The purposes of the conventions were Chinese-Rdb into Rdb would eliminate reengi­ 1. To identify the module fo r each Asian version by neering and create one common executable its file name image. 2. To make it possible to create any one of the Asian 3. Coengineering Phase II. In the final phase, our versions (for Japan, the PRC, Ta iwan, or Korea) or goal was to develop the internationalized product the English version from the common source fo r version 4.2 by adding more internationaliza­ codes, using conditional compilation methods tion fu nctionality, SQL-92 support, MIA support 3. To identify the portions of codes that were mod­ fo r one common executable, and multiple char­ ified fo r the Japanese version acter set support. 4. To facilitate an engineer in Hong Kong who is Coengineering is a development process in developing versions fo r the PRC, Ta iwan, and which local engineers temporarily relocate to the Korea Central Engineering Group in the United States to develop the original product jointly with Central We developed the Japanese Rdb version 3.0 in Engineering. The engineers from a non-English­ Japan. The files were transferred to Hong Kong to speaking country provide the user requirements develop versions fo r the PRC, Ta iwan, and Korea. and the cultural-dependent technology (e.g., 2-byte The modified versions were sent back to Japan to processing and input methods), and Central be merged into one common Asian source file. Engineering provides the detailed knowledge of the Since we had one common Asian source file, product. This process promotes good experiences reengineering in Hong Kong was reduced. Reengi­ fo r both parties. For example, the local engineers neering in Japan, however, was still necessary. We learn the corporate process, and the corporate used compilation flags to create four country ver­ engineers have more dedicated time to understand sions, that is, we had fo ur sets of executable images. the requirements and difficulties of local product As a result, we needed to maintain four sets of needs, what internationalization means, and how to development environments (source codes, tests, build the internationalized product. Coengineering and so forth). We wanted to further simplify the minimizes the risks associated with building inter­ process and therefore entered the coengineering nationalized products. phases.

Asian jo int Development Coengineering Phase I Our goal fo r the Asian joint development process The integration of Asian DEC Rdb into the base DEC was to provide a common Asian source code fo r Rdb product took place in two phases. In the first Japan, People's Republic of China (PRC), Ta iwan, phase, we integrated the Asian code modifications and Korea. One common source code would into the source modules of the base product.

88 Vo l. 5 No. 3 Summer 1993 Digital Te chnical journal Character Internationalization in Databases: A Case Study

Consequently, the specific Asian versions of the ARDB_JAPAN_VA RIANT is set true. This would indi­ product can be attained by definition and then cate that all text would be treated as if it were translation of a logical name (a sort of environment encoded in DEC_KANJI. The code would behave as if variable). No conditional compilation is necessary. it were DECJRdb. This translation must occur at all In all releases of DEC Rdb version 3.x, source levels ofthe code, including the user interface, DEC modules of the base product were conditionally Rdb Executive, and KO DA. compiled fo r each Asian version, which created Since DEC Rdb checks the value of the logical separate object files and images. name at run time, we do not need the compilation The process steps in this phase were flags; that is, we can have one set of executable images. 1. Merge the source code Figure 6 shows the values that are valid fo r the a. Create one executable image ROB$CHARACfER_SET logical. b. Remove Japanese/Asian VMS dependency The DEC JRdb source contains code fragments similar to those shown in Figure 7, which were c. Remove kana-to-kanji input method taken from RDOEDIT.B32 (written in the BLISS pro­ 2. Transfer the]-Rdb/C-Rdb tests gramming language). This code was changed to use a run-time flag set as a result of translation of the Source Code Merge (R db Ve rsion 4. 0) To create logical RDB$CHARACTER_SET, as shown in Figure 8. a single set of images, we removed the compilation flags and introduced a new way of using the Asian­ Remove japanese VMS (]VMS) Dependency The specific source code. We chose to do this by using Japanese version of DEC Rdb version 3.x used a run-time logical name; the behavior of DEC Rdb the ]VMS run-time library (JSY routines). The JSY changes based on the translation of that logical routines are Japanese-specific character-handling name. routines such as "get one kanji character" and We removed the Japanese/Asian VMS dependen­ "read one kanji character." The library is available cies by using Rdb code instead of JSYSHR cal ls. only on ]VMS; native VMS does not have it, so DEC (JSYSHR is the name given to the OpenVMS system Rdb cannot use it. To remove the ]V:-.1S dependency, services in Japanese VMS.) we modified all routines that cal led JSY routines so We removed the kana-to-kanji input method: By that they contain their own code to implement the calling UB$FIND_IMAGE_SYMBOL (an OpenVMS sys­ same fu nctions. tem service to dynamically link library routines) to The JRdb/VMS source contains code fragments invoke an input method, the image need not be similar to the ones shown in Figure 9. The code was linked with ]VMS; even an end user can replace an changed to remove references to the JSY routines as input method. shown in Figure 10. This example does not use JSY routines like ]SY$CH_SIZE or]SY$CH_RCHAR. Run-time Checking We removed the compilation flags, but introduced a new logical name, the Remove Kana-to-k.anji Input Method The depen­ ROB$CHARACfER_SET logical, to switch the behavior dency on ]VMS can be eliminated by making the of the product. For example, ifRDB$CHARACfER_SET 2-byte text handling independent of ]SY routines, translates to DEC_KANJI, then the symbol but the input method still depends on JSYSHR fo r

$ DEFINE RDB$CHARACTER_S ET - { DEC KANJI I DEC HANZI DEC HANGUL DEC HANYU }

DEC_KANJ I Japanese DEC_ HANZI Chinese DEC_H ANGUL Korean DEC HANYU Taiwan

$ SET LANGUAGE JAPANESE ! If you use Japanese VMS

Fig ure 6 RDB$CHARACTER_SET Logical

Dlgittll Te cbnlcal]ourna.l Vo l. 5 No. 3 Summer 1993 89 Product Internationalization

This example switches the default TPU shareable image CTPUSHR). If the Japanese va riant is set, then the default ed itor shou ld be JTPUSHR.

%IF $ARDB- JAPAN VARIANT %THEN TPU_I MAGE_N AME = ( IF (.T NAME EQL Q) THEN $DESCRIPTOR ('T PUSHR ') ELSE $DESCRIPTOR ('JTPUSHR')); %ELSE TPU IMAGE NAME $DESCRIPTOR ('T PUSHR');

Figure 7 Compilation Flag in DEC Rdb Version3

1 This code cou ld be trans lated to the following 1 wh ich might contain redundant code but shou ld work:

IF.ARDB JAPAN VARIANT ! If ARDB_J APAN_VARIANT flag is true,

THEN then Rdb/VMS shou ld use the J-Rdb/VMS behavi or. TPU_IMAGE_NAME = ( IF C.TPU NAME EQL Q) THEN $DESCRIPTOR ('T PUSHR') ELSE $DESCRIPTOR ('JTPUSHR ')) ELSE TPU_IMAGE NAME = $DESCRIPTOR ('T PUSHR ');

Figure 8 Run-time Ch ecking in Ve rsion 4

%IF $ARDB COMMON VARIANT %THEN !+ ARDB: Advance character poi nter.

JSY$CH SIZE counts the size of the character. If it is ASCII, return 1, If it is Kanj i , return 2. CP is a character poi nter CP = CH$PLUS( .CP , JSY$CH SIZE( JSY$CH RCHAR( .CP ) ) ); 1 - %ELSE CP = CH$PLUS( .CP , 1 ); %FI !$ARDB COMMON VARIANT

Figure 9 Us ing ]S Y Routines in DEC Rdb Ve rsion 3 kana-to-kanji conversions. To remove this depen­ We created a shareable image for the input dency, we developed a new method to invoke the method, using the SYS$LANGUAG E logical to switch ka na-to-kanji conversion rou tine. Figure 11 shows to the Japanese input method or to other Asian the new input method. language input methods. Since an input method is Since LIB$FIND_IMAGE_SYMBOL is used to find the a shareable image, a user can switch input methods Japanese input at run time, JSYSHR does not need to by redefining the logical name to identify the appro­ be referenced by the SQL$ executable image. priate image.

90 Vo l. 5 No. 3 Summer 1993 Digital Te chnicaljour11al Chara cter Internationalization in Databases: A Case Study

! ******************r u n time checking

IF $RDMS$ARDB COMMON THEN ' + ARDB: Advance char acter poi nter.

If the code va lue of CP is greater than 128 , then it means the first byte of Kanji, so advance 2, else it is ASCII, advance 1.

P = CH$PLUS( .CP , (IF CH$RCHAR ( .CP) GEQ 128 THEN 2 ELSE 1 ) ) ; ,_ ELSE CP CH$PLUS ( .CP , 1 ); FI !$RDMS$ARDB COMMON

where $RDMS$ARDB COMMON is a macro .

Figtt1·e 10 Removing ]S Y Routines in Ve rsion 4

SQL$ .EXE I + (default) -> SMG$READ COMPOSED LINE

+ (if Japanese Input is selected) LIB$FIND IMAGE SYMBOL I +------> (shareable for Japanese Input ).E XE

Figure 11 Input Methodfo r Ve rsion 4: Kana-to-kanji Conversion (Japanese Input) Shareable Image

Note that the input method is a mechanism to bility to perform 2-byte processing. Japanese and convert alphabetic characters to kanji characters. other Asian language components must be pro­ It is necessary to permit input of ideographic char­ vided fo r local country variants. The localization kit acters, i.e., kanji, through the keyboard . Asian local for Japan contains Japanese documentation such as language groups would be responsible fo r creating messages and help files, an input method, and the a similar shareable image for their specific input J-Rdb license management facility (LMF). As a result, methods. we need not reengineer the original product any more. The installation procedure is also simplified. Transfe r DEC ]-Rdb and DEC C-Rdb Te sts To Users worldwide merely install DEC Rdb and then ensure the functionality of Japanese/Asian DEC install a localization kit if it is needed. Rdb, we transferred the tests into the original devel­ The localization kits contain only the user inter­ opment environment. We integrated not only the faces, so no reengineering is necessary; however, source modules but also all the tests. Consequently, translation of documentation, message files, help the Asian 2-byte processing capabilities have now files, and so on to local languages still remains nec­ been tested in the United States. essary. Nonetheless, the reengineering process is eliminated. Kit Components andj-Rdb Installation Procedure In version 4.0, we achieved the main goal, to inte­ The original DEC Rdb version 4.0 has the basic capa- grate the Asian source code into the base product

Digita/ 1echnica/ jo urnal Vo l. 5 No . .> Su m111er I'J93 91 Product Internationalization

to avoid reengi neering. The Japanese localization defined in SQL-92, to al low multiple character sets kit was released with a delay of about one month in a table. after the US. version (versus a five-month delay in version 3.0). The one-month delay between Database Character Sets The database character releases is among the best in the world fo r such sets are the character sets specified for the attached a complex product. database. Database character set attributes are default, identifier, and national. Coengineering Phase II SQL uses the database default character set fo r In the second phase of integration, we redesigned two elements: (1) database co lumns with a charac­ the work done in Phase I and developed a multi­ ter data type (CHARACTER and CHARACTER VARY­ lingual version of Rclb/YMS. ING) that do not explicitly specify a character set In version 4.0, we introduced the logical name and (2) parameters that are not qualified by a char­ RDB$CHARACTER_SET to integrate Asian function­ acter set. The user can specify the database default ality into DEC Rclb. In Phase II, we created an inter­ character set by using the DEFAULT CHARACTER SET nationalized version of DEC Rcl b. We retained the clause fo r CREATE DATABASE. one set of images and introduced new syntax and SQL uses the identifier character set fo r database semantics. We also provided support for the NIT/ object names such as table names and column MIA requirements. names. The user can specify the identifier character The fo llowing are the highlights of the release. set fo r a database by using the IDENTIFIER CHARAC­ The details are given in the Appendix. TER SET clause fo r CREATE DATABASE. SQL uses the national character set fo r the fo llow­ • NIT/MIA SQL Requirements ing e.lements. - NATIONAL CHARACTER data type • For all columns and domains with the data type - N'national' literal NATIONAL CHARACTER or NATIONAL CHARACTER - Kanji object names VA RYING and fo r the NATIONAL CHARACTER data type in a CAST function • Changes/extensions to the original DEC Rdb - Add a character set attribute • In SQL module language, all parameters with the - Multiple character set support data type NATIONAL CHARACTER or NATIONAL CHARACTER VARYING • Dependencies upon other products • For all character-string literals qualified by the - COD/Plus, COD/Repository: Add a character national character set, that is, the literal is pre­ set attribute ceded by the letter N and a single quote (N') - Programming languages: COBOL, PIC, N The user can specify the national character set Since we are no longer reengineering the original fo r a database by using the NATIONAL CHARACTER product, we now have time to develop the new SET clause fo r CREATE DATABASE. functionality that is required by NIT/.MIA. The new The fo llowing example shows the DEFAULT syntax and semantics of the character-set hand ling CHAHACTER SET, IDENTIFIER CHARACTER SET, and are conformant with the new SQL-92 standard. NATIONAL CHA.RACTER SET clauses fo r CREATE As far as we know, no competitor has this level of DATABASE. fu nctionality. If we had to continue to reengineer the original, CREATE DATABASE FILENAME ENVI RONMENT DE FAULT CHARACTER SET DEC_KANJ I we would not have had enough resources to con­ NATIONAL CHARACTER SET KANJI tinue development of important new fu nctionali­ IDENTIFIER CHARACTER SET DEC_K ANJI; ties. Coengineering not only reduces development CREATE DOMAIN DEC_KANJ I_DOM CHAR (8); cost but also improves competitiveness. CREATE DOMAIN KANJ I_DOM NCHAR(6); We introduced the RDB$CHARACTER_SET logical during Phase I to switch the character set being DEC_KANJI_DOM is a text data type with used. Since the granularity of character set support DEC_K.ANJI character set, and KANJ I_DOM is a text is on a process basis, however, a user cannot mix data type with KANJ I character set. The database different character sets in a given process. In Phase default character set is DEC_KANJI and the national II, we implemented the CHARACTER SET clause, character set is KANJI.

92 Vo l. 5 No . .3 Summer 19<)3 Digital Te chnical jo urnal Character Internationalization in Databases: A Case Study

As previously stated, the user can choose the SET CHARACTER LENGTH 'CH ARACTERS '; default and identifier character sets of a database. Consequently, users can have both text columns Multiple Ch aracter Sets Examples Users can cre­ that have character sets other than 7-bit ASCII and ate a domain using a character set other than the national character object names (i .e., kanji names, database default or national character sets with the Chinese names, and so on). fo llowing sequence: In Rdb version 3.1 and prior versions, the charac­ CREATE DOMA IN DEC KOREA_D OM CHAR(6) ASCII ter set was and could not be changed. In Rdb CHARACTER SET DEC_K OR EAN; version 4.0, users could change character sets by defining the RDB$CHARACIER_SET logical. It is CREATE TABLE TREES (TREE_C ODE TREE_C ODE_D OM, important to note that the logical name is a volati le QUANT ITY INTEGER, attribute; that is, the user must remember the char­ JAPANESE_N AME CHAR (30), acter set being used in the database in his process. FRENCH_NAME CHAR (30) CHARAC TER SET DEC_M CS, On the other hand, the database character sets ENGLISH_N AME CHAR (3Q) introduced in version 4.2 are persistent attributes, CHARACTER SET DEC_M CS, so the user is less likely to become confused about KOREAN_N AME CHAR( 30) CHARACTER SET DEC_KOREAN, the character set in use. KANJ I_NAME NCHAR (3Q));

Session Character Sets The session character sets The table TREES has multiple character sets. This are used during a session or during the execution of example assumes the default character set is procedures in a module. The session character set DEC_KANJI and the national character set is KAN.Jl. has fo ur attributes: literal, defau lt, identifier, and Users can have object names other than ASCII national. names specifying the identifier character set. The SQL uses the literal character set fo r unqualified database engine uses the specific routines to com­ character string literals. Users can specify the pare data, since the engine knows the character set literal character set only fo r a session or a module of the data. With DEC Rdb version 4.2, all three by using the SET LITERAL CHARACTER SET statement issues of data representation, multiple character­ or the LITERAL CHARACTER SET clause of the SQL set support, and data comparison have been module header, DECLARE MODULE statement, or resolved. DECLARE ALlAS statement. Session character sets are bound to modules or Conclusions an interactive SQL session, and database character sets are attributes of a database. For example, a user By replacing reengineering with coengineering, we can change the session character sets for each SQL reduced the time lag between shipping DEC Rclb to session; therefore, the user can attach to a database customers in the United States and in Japan from that has DEC_MCS names and then attach to a new five months fo r version 3.0 in July 1988 to two database that has DEC_HAI\JZI names. weeks fo r version 4.2 in February 1993. Figure 12 shows the decrease in time lag for each version Octet Length and Character Length In DEC Rclb we developed. We also eliminated expensive version 4.1 and prior versions, all string lengths reengineering and maintenance costs. Finally, we were specified in octets. In other words, the increased competitiveness. numeric values specified for the character-column It has taken more than fo ur years to evolve from a length or the start-off set and substring length noninternationalized product to an international­ within a substring expression were considered to ized one. If the product had originally been be octet lengths or offsets. designed to be internationalized, this process would DEC Rc.l b version 4.2 supports character sets of have been unnecessary. When DEC Rdb was origi­ mixed-octet and fixed-octet fo rm-of- use. For this nally created, however, we did not have an interna­ reason and to al low an upgrade path to SQL-92 tionalization model, the architecture, or mature (where lengths and offsets are specified in charac­ techniques. Reengineering is unavoidable under ters rather than octets), users are allowed to specify these circumstances. lengths and offsets in terms of characters. To By sharing our experience, we can help other change the default string-length unit from octet to product engineering groups avoid the reengineer­ characters, users may invoke the fo llowing: ing process.

Digital 1e chll ical ]ounwl Vo l. 5 No . .) Summer /')').) Product Internationalization

25 ticul ar, Don Blair, Ya suhiro Matsuda, Scott ,'vl atsu­ moto, Jim Murray, Kaz Ooiso, Lisa Maatta Smith,

20 and Ian Smith were particularly helpful during the DEC Rdb work. During the internationalization of SQL, Laurent Barnier, David Birdsall, Phil Shaw, (f) 15 :.::: Kohji Shibano, and Mani Subramanyam all made w w significant contributions. :s: 10 References

5 1. G. Winters, "International Distributed Sys­ tems-Architectural and Practical Issues," 0 Digital Te chnicaljournal, vol. 5, no. 3 (Sum­ V3.0 V3.0B V3. 1A V3. 1B V4.0 V4.0A V4.2 mer 1993, this issue): 53 -62.

2. American Na tional Standard fo r Informa­ Figure 12 me Lag between US. andja panese tion Systems-Database Language SQL, ANSI Shipment of DEC Rdb X3.135-1992 (American National Standards Institute, 1992). Also published as Informa­ Fu ture Wo rk fo r DEC Rdb tion Te chnology -Database Languages­ Coengineering has proved that an evolutionaq' SQL, ISO/IEC 9075:1992 (Geneva: International approach is not only possible, but that it is the most Organization fo r Standardization, 1992). reasonable approach. Additional work, however, 3 remains to be done for DEC Rdb. . W Rannenberg and .J. Bertels, "The X/Open DEC Rdb must support more character sets like Internationalization Model," Digital Te e/m i­ 5, 1993, ISO 10646-1 . We think that the support of new char­ ca/ jo urnal, vol. no. 3 (Summer this 32-42. acter sets would be straightforward in the DEC Rdb issue): implementation. DEC Rdb has the infrastructure fo r 4. Database Language SQL (S QL3), Wo rking supporting it. SQL-92 has the syntax fo r it, that is, Draft, ANSI X3H2-93-091 (American National the character set clause. Furthermore, the DEC Rdb Standards Institute, February 1993). implementation has the attribute fo r a character set 5. in the system tables. Database Language SQL (S QL3), Working ISO/IEC JTC 1/SC21 N6931 Collations on Han characters should be Draft, (Geneva: International Organization fo r Standardiza­ extended. The current implementation of a col la­ tion, July 1992). tion on Han characters is based on its character value, that is, its code value. We believe the user 6. ]. Melton and A. Simon, Understanding the would also like to have collations based on dictio­ Ne w SQL: A Complete Guide (San Mateo, CA: naries, radicals, and pronunciations. u Morgan Kaufmann Publishers, 1992).

7. Information Te chnology-Remote Database Summary Access-Part 1: Generic Mo del, Service, There are significant difficulties in the specification and Protocol, ISO/IEC 9579-1 :1993, and Info r­ of character internationalization fo r database sys­ mation Te chnology-Remote Database tems, but the SQL-92 standard provides a sound Access -Part 2: SQL Sp ecialization, ISO/IEC foundation fo r the internationalization of products. 9579-2:1993 (Geneva: International Orga niza­ SQL-92 DEC The application of facilities to Rdb is tion fo r Standardization, 1993). quite successful and can serve as a case study for the internationalization fo r other software products. 8. ]. Bertels and F. Bishop, "Unicode: A Universal Character Code," Digital Te chnical jo urnal, vol. 5, no. 3 (Summer 1993, this issue): 21-31. Acknawledgments The authors gratefully acknowledge the help and 9. Information Processing-/SO 7- bit and 8-bit contributions made by many people during the Coded Character Sets-Code Exte nsion Te ch­ development of DEC Rdb's internationalization niques, ISO 2022: 1986 (Geneva: International facilities and those of the SQL standard. In par- Organization fo r Standardization, 1986).

94 Vo l. 5 No. 3 Summer 1993 Digital Te chnical jo urnal Character Internationaliza tion in Databases: A Cas e Study

10. Information Processing, Op en Document 12. Multivendor Integration Architecture, Ve r­ Architecture, ISO/IEC 8613:1989 (Geneva: sion 1.2 (Tokyo: Nippon Te legraph and Tele­ International Organization fo r Standardiza­ phone Corporation, Order No. TR550001, tion, 1989). September 1992).

11. DEC Rdb, SQL Reference Ma nual (Maynard, 13. R. Haentjens, "The Ordering of Universal MA: Digital Equipment Corporation, Order Character Strings," Digital Te chnical jo urnal, No. AA-PWQPA-TE, January 1993) vol. 5, no. 3 (Summer 1993, this issue): 43-52.

Appendix: Sy ntax of Rdb Ve rsion 4.2

Fo rmat of CHARA CTER SET Clause

: := [ CHARACTER SET J I

: := CHARACTER [ VARYING J [ ( ) ] I CHAR [ VARYING J [ ( ) J I VARCHAR ( )

: := NATIONAL CHARACTER [ VARYING J [ ( ) J I NATIONAL CHAR [ VARYING J [ ( ) ] I NCHAR [VARYING J ( )

Character Set Na mes DEC_M CS KANJI HANZI KOREAN HANYU DEC KANJI DEC HANZI DEC KOREAN DEC_SICGCC DEC_H ANYU KATAKANA ISOLATINARABIC ISOLAT INHEBREW ISOLATINCYRILLIC ISOLATINGREEK DEVANAGARI

Continued on next page.

Digital Te chllicaljour11al Vo l. 5 No. 3 Summer 1993 95 Product Internationalization

Exa mple of CHA RACFER SE T

CREATE DATABASE FILENAME ENVIRONMENT DEFAULT CHARACTER SET DEC_K ANJ I NATIONAL CHARACTER SET KA NJI IDENTI FIER CHARACTER SET DEC_K ANJ I;

CREATE DOMA IN NAMES GENERAL CHAR(20>;

CREATE DOMAIN NAMES_P RC CHAR (20) CHARACTER SET IS HANZI;

CREATE DOMA IN NAMES_M CS CHAR(20) CHARACTER SET IS MCS;

CREATE DOMAIN NAMES_KOREAN CHAR(20) CHARACTER SET IS HANGUL;

CREATE DOMAIN NAMES JAPAN NCHAR (20>;

Fo rmat of Literals

- I

::= [ J [ ... ]

I

::= ! ! See the Syntax Ru les .

::=

N [ ... ]

Exa mp le of Na tional Object Na me CREATE TABLE tf�§J , tf�JiH\;i', NAT I 0 N A L CHARACTER ( 1 0) , �-'} DECIMA L<10), �-!{ 'l'f;;- D E C I M A L( 5 ) , f1.PJT NCHAR<30), PRIMARY KEY tf�JHff-jj·) ;

S E L E C T :(it �I! 'I'% FROM tf�j'l WHERE fE-fg!-IJj:{} = 100 AND �!:j- > 300000 AND te�.f.l�iJ LIKE N' .Z:fJ %';

96 Vo l. 5 No. 3 Summer 1993 Digital Te chnical jo urnal Ta kah ide Honma Hiroyosh i Baba Ku niaki Ta kizawa

Japanese Input Method Independent of Applications

The japanese input method is a complex procedure involving preediting opera­ tions. An application that accepts japanese fr om an input device must have three sys tems fo r the input method: a keybinding system, a manipulator fo r preediting, and a kana-to-kanji conversion sys tem. Va rious keybinding systems and manipula­ tors accelerate input op erations. Our implementation separates an application fr om theja panese input method in three layer·s. An application can use a fr ont-end inputprocessor to perfo rm all op erations including 1/0. An application can use the henkan (conversion) module and implement l/ 0 op eration itself An applicalion can execute all op erations except keybinding, which is handled b-y an input method library

In this paper, we first present an overview of the neering a product for. Japanese users and a summary tech.nical environment of the Japanese input method of the industry's complex techniques developed fo r implementation. Based on this overview, we briefly japanese input methods. describe the critical engineering issues fo r conver­ sion of Digital's products fo r the japanese user. Our japanese Input most critical engineering issue was the reduction of The japanese writing system uses hundreds of similar (but slightly different) work to localize ideograms called k. anji. In addition, Japanese uses a products. Another issue was to satisfy customers' phonetic system of k.ana characters (hiragana and requests fo r the ability to use the many input styles katakana) and has accepted romaji, which is the familiar to them. We describe our approach to the use of Latin letters to spell Japanese words. Figure 1 development of a japanese input method that over­ summarizes the japanese character systems. comes these issues by separating the input method japanese input requires users to operate in a from an application in three lay<:rs. "preediting" mode to convert kana or romaji into a kana-kanji string. 1·2 Overviewofthe Japanese The computer keyboard used fo r japanese input Input Method has multiple character assignments. Almost all keys In this section, we describe jap:mese input and on the japanese keyboard are assigned both a Latin string manipulation from the perspective of both alphabet character and a japanese kana character. the user and the application. Based on these The .Japanese user must first choose between kana descriptions, we present a brief overview of reengi- key input or alphabet input. A user in an engineering

JAPANESE CHARACTERS PHONOGRAM

KANA (ORIGINAL JAPANESE CHARACTERS)

L HIRAGANA t L KATAKANA ROMAJI (USING LATIN ALPHABET TO EXPRESS A PHONEME)

IDEOGRAM

L KANJI

Fig ure 1 }ajJaneseCh aracter Systems

Digital Te chuical jo urual Vo l. 5 1\"o. .) Summer 1993 97 Product Internationalization

area generally uses romaji (alphabet) key input. In japanese Application Capabilities the office environment, however, a user prefers The Japanese application has two special capabili­ kana key input because it requires half as many ties fo r japanese processing. First, the application keystrokes as romaji input. must be capable of hand ling multibyte characters. This subject itself is interesting as it involves Preediting Op era tion wchar_t and Unicode character sets; however, this The user inputs the phonetic expression in either paper fo cuses on the second capability, the imple­ kana or romaji that represents the statement the mentation of the input method. An application that user wants to input. Then the user presses the con­ accepts Japanese from an input device must have, version key to convert the phonetic expression to at least, three additional systems fo r the input a kana-kanji mL-xed string. At this time, the user method. These are the so-called keybinding system, checks the accuracy of the conversion result. a manipulator for preediting, and the ka na-to-kanji Sometimes the user needs to select the correct conversion system. word from a system-generated list of several homonyms. Moreover, a user may also need to Keybinding System This system analyzes the key determine the separation positions in the phonetic input from a user and determines which of the expression to ensure a meaningful grammatical key's fu nctions the user wants to do. It defines construction. the user interface ancl the way a user operates with Japanese has no word separator equivalent to the keystrokes. It also defines the preediting conver­ space character in English. To obtain the correct or sion key. We imagine there are as many keybinding expected separation of grammatical elements, the systems as there are word processors. user must sometimes move the separation position. After the user constructs a corrected statement, he PreeditMcmipulator Sy stem This system not only or she finishes preediting and fixes the statement. echoes the input characters on the screen but also The user repeats these complex steps to construct controls the video attribute that expresses the Japanese documents. Figure 2 shows the preediting preediting area. This capability allows the user to steps fo r the Japanese user. distinguish preediting strings from background Va rious techniques have been developed to accel­ fixed strings. The user must be able to recognize erate Japanese input operations. They include the preediting string fo r more processing (for UNDO, COPY, zip code conversion, and categorized example, to convert the input to another expres­ expert dictionary. sion such as kana to kanji ). In addition, the user can set this system to convert input to another expression dynamically (for example, automatic conversion of romaji to kana). START SET UP INPUT METHOD L,-.,-•INPUT PHONETIC EXPRESSION FOR A STATEMENT Kana-to-kanji Conversion System This system (CHANGE PHONOGRAM SYSTEM) analyzes the input string, gets the word from a dic­ CONVERT KANA TO KANJI tional)', and constructs the correct statement gram­ t:=MOVE THE SEPARATION POSITION matically. Many personal computer (PC) vendors SELECT A WORD FROM MANY HOMONYMS have invested in systems that use this input method. FIX A STATEMENT In Japan, some vendors have introduced artificial intelligence technology, but this system essentially END OF DOCUMENT has only statistical rules.1··1 Figure 3 summarizes .Japanese processing as han­ Figure 2 Preeditingfapanese Input cl led by applications.

JAPANESE PROCESSING ----,-- VARIABLE MULTI BYTE CHARACTER HANDLING L_ JAPANESE INPUT METHOD KEYBINDING SYSTEM � PREEDIT MANIPULATOR SYSTEM KANA-TO-KANJ I CONVERSION SYSTEM

Figure 3 japanese Processing byAp plications

98 Vo l. 5 No. 3 Su.111mer !')').) Digilal Te chnical jo urnal japanese Input Method Independent of Applications

Me tbod of japanese Co nversion interfaces of several Japanese engines and thus As mentioned above, to convert a product fo r use capture the free software capabilities. in Japanese, we must implement both a Japanese­ string manipulator and an input method. To retain App lication-independentAp proach the "look-and-feel" of the original Digital product, As described in the previous section, the Japanese the interface is designed so the Japanese user does input method includes complex techniques. Many not need to explicitly enter the preediting session PC software vendors (but not manufacturers) with the special-purpose key (Enter Preedit) but is decided against developing their own methods and automatically en tered. With most applications on incorporated a popu lar input method fo r their other systems, a user must explicitly enter the applications. This decision, of course, reduces preed iting session by using the special keystroke. their development cost. Our approach also seeks to This implementation has the advantage that it com­ reduce development cost. We separated the input pletely separates the input method from the appli­ method from the application to the greatest extent cation, but it requ ires the user to remember to possible, as long as the separation did not adversely perform an extra step. affect the application. To eliminate the conflict between the original The PC system is designed as a single-task system, product's key fu nction and the additional Japanese but Digital's operating systems (OpenVMS, ULTRIX, input fu nction, each product has to have a slightly and DEC OSF/1 AXP) are designed as multitasking diffe rent keybincling system fo r Japanese. As a systems. Therefore we cou ld not adopt many of the resu lt, a user must learn more than one Japanese PC techniques that were implemented in the driver input operation when using multiple products. level. For example, access to dictionary and gram­ matical analysis of the input string are too expen­ Us er Environments sive in the driver level of a multitasking system because they use system resources that are com­ res are widely used in many offices and are popular mon to all tasks on the system. devices fo r Japanese input. Naturally a user wants Our approach divides the input method into to operate with a fa miliar PC keystroke fo r Japanese three layers. Each layer is dependent on any lower input even in integrated systems (in some servers). layer. Consequently, any application using the high­ When PCs, which use front-end processors, are est layer also uses the functionalities of the other integrated into environments with VMS and UNIX two layers. systems, a user often prefers the PC's interface. The more integrated a user's environment is, the more requirements a user has. Stra tegy of Tb ree Layers In addition, a distribution kit for the X Window The criteria of our layering strategy were (1) to System in a UNIX environment has some sample reduce the cost of reengineering products for the implementations of the Japanese input method. Japanese user, (2) to unify the input method user This kit gives a user more choices fo r input at no interface, and (3) to protect the user's operational cost. investment in a keybinding system. The market fo r the Japanese input method sepa­ These criteria lecl us to set the keybinding system rates vendors into two main groups. One is the PC into the lowest layer. We named our system the front-end processor manufacturer who imple­ input method library (IMLIB) and released it on ments more advanced techniques but at a high VMS/Japanese version 5.5 and ULTRJX/Japanese ver­ price. The other is the UNIX system vendor who sion 4.3. We also ported it to the Alpha AXP system, supplies input implementations free (without guar­ and this facility is available on any Japanese plat­ antee) and thus reduces the maintenance cost. fo rm. Any application using our method needs to In the next section, we present our approach to use IMLIB. the development of an application-independent In essence, this keybinding system allows a user Japanese input method. The goals of our design to change the input method of operation to any were (1) to include the PC keybinding system in style by changing the IMLIB definition files. If an integrated environments so users could select their application supports IMLIB, a user can change the preferred input method, (2) to supply a tool that appl ication's input operation by changing IMLIB would easily convert products for the Japanese once. As a resu lt, an application's key customiza­ user, and (3) to provide a way to access the tion function can move into IMLIB.

Digital Te ch11i£'a/ journal V!;/. 5 No. 3 Stmnner 1993 99 Product Internationalization

At this point, we considered the simplest method PCs or some word processors, and some text edi­ of separating the input method from applications. tors on various operating systems. If a user needs to The intermediate process, also called the front-end usc a different editor, he or she needs to learn method, processes all the input and then passes it another operation. Our method unifies the input to an application. Many front-end implementations operation. We studied several types of input styles use the pseudo-terminal driver (pty in UNIX or FT in and recognized that we could build the general OpenVMS ). The intermediate process gets all l/0 to model fo r this input operation. The IMLIB manual and from an application, processes it, and finally describes this model in detaiJ.".C' In this paper, we passes it to an application or a device. This imple­ discuss it briefly. mentation cannot recognize the application input requests, but works only by a user's operation. To KEYBIND Database change this operation, we set the inside the In the Japanese input operation, entering the key terminal driver to get all application-request infor­ input causes several conversion actions and state mation. Our front-end process recognizes applica­ transitions. Figure 4 shows the multiple transitions tion requests and works ·without conflict. incurred during input. We needed to define the One advantage of this front-end implementation conversion actions and some state transitions as a is a complete independence of applications. This single key input action. We implemented this func­ can also be a disadvantage since an application can­ tion through the KEYBIND database and language. not control the input method closely. for example, Figure 5 is an example of the KEYBIND database. A this implementation can alter the user interface of user builds an input style by changing the KEYBIND an editor system. database with the KEYBIND language. We continued to study another layer fo r separa­ IMLIB allows the user to change the keybinding tion. The preecliting operation, that is, all the input and to choose a different input sequence with a string manipulations except 1/0 to devices, was a diffe rent state transition vector. For the user's con­ candidate. Al l applications pass the input from venience, IMLIB provides some KEYBIND databases input devices into the .Japanese input manipulator of the major Japanese input styles in default. and then pass the output from this manipulator When an application calls the ImSetKeybind onto output devices. By using this system, we fu nction, it loads a KEY13IND binary file into mem­ can unify the input operation except fo r device­ ory. Each time the application gets the key input, dependent operations and reduce the cost to imple­ it queries the key's fu nction from IMLIB. IMLIB ment this kind of functionality. searches the KEYBIND file fo r the key's definition Our development process started at the lowest and returns that information, called an action, to layer (IMLIB). proceeded to the highest layer (front­ the application. Each action is a set of orders that end), and finished at the middle layer (preediting has a diffe rent procedure fo r Japanese conversion. manipulator). In the fo llowing sections, we describe For example, the action CONVERT means to convert the functionalities in ead1 layer from the lowest to an input string to a kanji string. At that time, IMLIB the highest layer. also maintains Japanese input states and, if neces­ sary, changes the state.

Implementation of IMLIB PROFILE Database IMLIB is a utility that supplies the keybinding defini­ The Japanese input operation bas many parameters tion hmction and other info rmation fo r customiz­ to determine its look-and-feel, such as the video ing the Japanese input operation. This capability enables us to supply user-friendly keybinding sys­ tems. A user can change the input sequences and 1 N ITI AL STATE �------... KANJI CONVERTED L-- � 1 I the look-and-feel of the user interface by modifying ----_..,..._____ databases. We introduced two databases, KEYBIND fo r keybinding and PROf'ILEfor look-and-feel and an application's usage. We also supplied the KEYBIND compiler fo r improved performance and the elimi­ INPUTTING STATE I· ,. KANA CONVERTED nation of the grammatical error at run time. I As mentioned in the introduction, there are many implementations of Japanese input styles on Figure 4 State Tra nsition

100 vbl. 5 No. 3 Summer 19'13 Digital Te cbuicaljow·nal japanese Input Method Independent of Applications

JVMS ��of-J£ )i 7 7 1 !v ('/7. T b. T // v- r) v1 .0 (JVMS conversion key definition fi le (system tem plate) ver 1 .0)

gold CTRL_G ; Go ld key; used as a PREFIX key kakutei CTRL_N ; Finish wi thout any conversion kanj i_henkan NULL, gold + CTRL_K ; Convert to Kanji I next candidate hi ragana_henkan CTRL_L; (Convert clause) to Hi ragana katakana_henkan CTRL_K ; (Convert clause) to Ka takana zenka ku_henkan CTRL_F ; Convert to full width characters hankaku_henkan gold + CTRL_F ; Convert to ha lf width characters kigou_henkan GS; Symbo lic code conversion oomo j i VOl D; Convert to upper characters komo j i VOID; Convert to Lower characters j i_bunsetsu CTRL_P; Move to next clause zen_b unsetsu gold + CTRL_P; Move to previous clause tansyuku US; Shrink the clause sintyou gold + US; Extend the clause zen_kouho gold + (NULL, CTRL_L);! Previous clause candidate

kai jo = CTRL_N ; Cancel the conversion and go into input state sakujo DEL; Delete previ ous character hidari LEFT; Move the cursor Left migi RIGHT; Move the cursor right space_fi rst "� "; Finish by space (space at initial state) space_ input "� "; Finish by space (space at other states)

STATE "initial" = space_first NONE; kanj i_henkan START_SELECTED, CONVERT, GOTO "kk_c onverting"; hi ragana_henkan START_SELECTED, HIRAGANA, GOTO "converting"; katakana_henkan START_SELECTED, KATAKANA, GOTO "converting"; zenka ku_henkan START_SELECTED, ZENKAKU, GOTO "converting"; hanka ku_henkan START_SELECTED, HANKAKU, GOTO "converting"; kigou_henkan START_SELECTED, SYMBOL, GOTO "converting"; oomoji START_SELECTED, UPPER, GOTO "converting"; komo j i START_SELECTED, LOWER, GOTO "converting"; TYPING_KEYS START, ECHO, GOTO "inputting"; END;

Figure 5 Po rtion of KEYBIND Database

attribute fo r the preediting string, preediting KEYBIND Compiler exception handling, and application-specific pro­ The KEYBIND compiler analyzes the KEYBIND text cessing. The PROFILE database stores these addi­ file and creates the KEYBIND binary file. IMLIB ser­ tional parameters the same way as the resource file vices reads the PROFILE database and the KEYBIND does in the X Window System. binary file and maintains them in memory. As a The PROFILE database is a text file. It contains sev­ response to an application's query, IMLIB services eral records that represent each environment. This sends it the actions in KEYBIND and the data in record fo rmat has the style of INDEX : value. The PROFILE and at this time maintains the KEYBIND application predefines the INDEX fo r its purpose; states. Figure 7 shows the relationship among the however, IMLIB defines some INDEXes related to IMLIBcomponents. Japanese input operation because it requires some IMLIB is available on the OpenVMS VA X, OpenVMS common environment definitions. The range or AXP, ULTRIX, and DEC OSF/1 AXP operating systems. value corresponding to the INDEX is placed in the The major applications arc DECwindows/Motif, right-hand side of the record. Figure 6 shows a DECwrite, the front-end input process. and screen record from a PROFILE database. management (S\IG).

Digital Te chnical jo urnal Vo l. 5 No. 3 Summer I'J').) 101 Product Internationalization

DEC-J APANESE.KEY . keybi n d : im_key_j vms_leve l2 DEC-J APANESE .KEY . keybi nd_1 : im_k ey_j vms DEC-JAPANESE.DISP.preEdi tRow : current DEC-JAPANESE.DISP.preEdi tColumn : current DEC-JAPANE SE.DISP.inputRendition : bold DEC-JAPANESE.DISP. kanaRendition : bold DEC-JAPANESE.DISP. currentClauseRendi tion : reverse DEC-JAPANESE.DISP. leadi ngC lauseRendi tion : none DEC-JAPANESE.DISP. trai lingC lauseRendi tion : none DEC-JAPANESE .ECHO.asci i : hanka ku DEC-J APANESE . ECHO. kana : hiragana DEC-J APANESE.ECHO. autoRomanKana : off DEC-JAPANESE.OUTRANGE .clauseSi ze : none DEC-JAPANESE. OUTRANGE .cla useNumber : rotate DEC-JAPANESE . OUTRANGE .cursorPo sition : done

Figure 6 PROFILE Database Record

PREEDITING APPLICATION STRING

ACTIONS (KEY BIND) QUERY APPLICATION KEYWORDS (PROFILE) r--..L.----1.---, KEYBIND IMLIB COMPILER SERVICES Figure 8 He nkan Module Function

In addition, HM can access the resources ofiMLJB. This fe ature helps the unification of the Japanese input user interface and reduces the number of sim·

KEYBIND KEYBIND PROFILE ilar product conversions. HM has another significant TEXT BINARY DATABASE capability. We defined the common (minimum) FILE FILE FILE application programming interface to potentially accept all Japanese conversion engines and imple­ Figure 7 Relationship among !MLIB mented "PL GGS'' in HM. Therefore HM can use one Co mponents or more engines fo r kana-to-kanji conversion.

Implementation ofthe HenkanMo dule HM Mechanism Overview The second layer is part of the Japanese input HM is a tool that any appl ication can use. An manipulator and is called the henkan module or application passes key input to HM by a normal pro­ HM. (Henkan means conversion in Japanese.) It cedure call. After HM processes it, HM calls applica· does not handle 1/0 operation but accepts key tion routines with the processed result. Because input from the application and converts it to a HM handles large string buffe rs, it dynamically allo­ string in preediting. cates/deal locates memory. To ensure that memory Figure 8 summarizes the function of HM. An is retained, we used a callback technique. (These application passes the key input to HM stroke by techniques are described later in the Callback stroke. HM performs all Japanese preediting opera­ Routines section.) tions; the application has no direct manipulation of HM operates by key input as fo llows: the input. Then the application gets the preediting 1. HM gets a keycode from an appl ication with pro­ string from HM. Because HM does no 1/0, it is inde­ cedure arguments. pendent of any specific device. As a result, all appli­ cations, including windowing systems, can use HM. 2. HM gets the actions assigned to the key from IMUB.

102 Vo l. 5 No. . ) Summer 1993 Digital Te chnical journal japanese Input Method Independent of Applications

3. If the key is not assigned to the Japanese input • HMinitialize. This routine creates a context fo r operation, HM tells the application to process it HM. It accepts three callback entries, a user­ separately. defined data pointer that would be passed to the callbacks, and an item list fo r initial information 4. If the key is assigned to the Japanese input oper­ as its arguments. ation, HM processes it according to the actions. • HMConvert. This routine sends a key to HM. The 5. HM modifies the information to be displayed key is represented as a 32-bit data (longword) according to the action and calls a registered call­ that is generated by a function HMEncodeKey back routine to update the screen. from an that the keyboard HM passes the information that should be dis­ sends or by a fu nction HMKeysymToKeycode played on the screen in an argument of the callback from a keysym of the X Window System. IMLIB routines. The callback routines are prepared by the interprets the keycode, and HMConvert per­ application and registered into HM context at the fo rms a conversion in accordance with the infor­ initialization of HM. This callback method makes mation. (A summary of what is executed was the application interface and data flow more easily. given in the Mechanism Overview section.)

• HMEndConversion. This routine aborts the con­ Co mponents version and resets an internal status. It is used Figure 9 shows the composition of HM. The applica­ when the application has to stop the input fo r a tion interfaces include both the C and the VMS bind­ particular reason, fo r example, if an application ing interfaces for the OpenVMS operating system. issues the cancel request. The Japanese input manipulator performs all Japanese input operation by using lMUB, the Callback Routines romaji-to-kana converter, and the kana-to-kanji HM requires three call back routines: start_conver­ converter. After it processes the input key, it cal ls sion, fo rmat_output, and end_ conversion. They are back the application routines. There are several used as fo llows. types of romaji-to-kana converters. We imple­ mented a submodule rornaj i-to-ka na converter • start_conversion. This routine is called when driven by a conversion table; a user can change this the conversion string input is started. The table to another. application memorizes where the cursor is The kana-to-kanji converter module is a general­ positioned. ized Japanese conversion library. Many Japanese • conversion engines exist, and each one is used dif­ fo rmat_output. This routine is called whenever fe rently. The kana-to-kanji converter loads the the information to be displayed bas been interface routine that absorbs these differences changed. The application updates the screen. dynamically at the initialization of the HM context. • end_conversion. This routine is called when the It then processes the conversion request with any input string is determined. As a result, the appli­ engine. cation takes the string passed in the argument of the last call of fo rmat_output into its input Services buffer. HM provides 17 library entries. In this section, we describe three basic routines: HMinitialize, The user-defined data pointer, one of the argu­ HMConvert, and HMEndConversion. ments fo r HMinitialize, is always passed to these callbacks. Since HM is not concerned with its contents, the user can put any kind of information into it. SEVERAL APPLICATION INTERFACES HM is available on the OpenVMS VA X, OpenVMS JAPANESE INPUT MANIPULATOR AXP, ULTRIX, and DEC OSF/1 AXP operating systems. This portability is due to the module's indepen­ ROMAJI-TO·KANA KANA-TO-KANJI IMLIB CONVERTER CONVERTER dence from physical 1/0. The major client applica­ I I tions working on these operating systems are DECwindows/Motif, Japanese SMG, and the front­ Figure 9 HM Co mponent Structure end input process.

Digital Te clmicaljournal Vu l. 5 No. 3 Summer I'J'J3 103 Product Internationalization

Implementationof the Front-end PIP Mechanism Overview Input Processor FIP processes all Japanese input operations using The front-end input process (FIP) fo r a dumb termi­ HM. We supplied the Digital Command Language nal supports fu ll operations fo r the Japanese string (DCL) command, INPUT START/STOP fo r activating/ manipulation. FIP is implemented on the fo llowing deactivating FIP. Once a user activates FIP from operating systems: OpenVMS/Japanese/VA X version DCL, it is available until the user logs out or the 5.5-2 or later versions and OpenVMS/Japanese/AXP system is deactivated. version 1.0 or later versions. Figure 10 shows FIP and its environment fo r the manipulation of Japanese input. An applica­ Full Op eration Support tion issues 1/0 requests to the Tidriver to get The original product can use FIP if the product's user inputs, but FIP fetches the requests from mechanism, particularly its 110 operation and the Tidriver through the Fldriver. Then FIP starts preediting fu nction, does not conflict with the FIP to communicate with the drivers and the Japanese implementation. Some applications conflict with string conversion engine to pass the resultant the design of FIP due to the limitations of FIP and its string as well as preedits to a screen. environment. For example, FIP does not detect the The sequence of the front-end input process read request that includes the NOECHO item code, fo llows. so the application that issues such a read request 1. An application creates a front-end input to the terminal driver (Tid river) cannot use FIP as process. a Japanese front-end input process. Also FIP does not step into a process fo r the termination of a read 2. A front-end input process exchanges packets request simply because a read buffer that is defined with an application through its mailbox. by an application has overflowed . FIP continues to 3. An application issues a queued 110 ($QIO) communicate with the Tidriver and a conversion read request to the Tid river. engine to get the Japanese string unless the termi­ 4. The Fldriver intercepts the request and passes nate key is explicitly input. To overcome these con­ the information to FIP as a packet. flicts, we implemented a pseudo-driver named F!driver to intercept 110 requests from the applica­ 5. FIP issues a $QIO read request to the Tidriver tion before they are processed by the Tidriver. to get input strings fo r conversion.

THE THE INPUT RESULT STRING STRING

SEND START REQUEST MAILBOX

FIP

MAILBOX SEND START CONFIRMATION

$QIO AST $QIO READ/WRITE REQUEST

WRITE BACK THE RESULT

FIDRIVER INTERCEPT OF READ $QIO

Figure 10 PIP Environmentfo r Manipulation of japanese Input

104 Vo l. 5 No. 3 Summe-r1993 Digital Te chnical jo urnal japanese Input Method Independent of Applications

6. A user inputs a key from a terminal. FIP FIP EVENT MANAGEMENT receives the input and decides whether or not to call a routine of the conversion engine. If KEY EVENT an input key is recognized as one of the con­ ACTION CONTROL FIDRIVER MAILBOX version keys, FIP calls the routine and passes EVENT EVENT the input strings. If not, FlP issues a $QIO TERMINAL HENKAN ACTION MODULE write request to the ITdri ver to echo an input I character.

7. A conversion engine receives a string and Figure lJ FIP Fu nctional Structure converts it to the Japanese string.

8. A conversion engine returns the result to FIP. a read request to its own mailbox. The mailbox 9. FIP issues a $QIO write request to the Tid river event notifies FIP of the arrival of a message from to display the resultant string from the engine an application. When an application sends a and arranges the current editing line. start request to the FIP mailbox, the mailbox event is set so FIP starts to initialize its environ­ 10. Steps 5 to 9 are repeated. ment. Also FIP terminates itself at the time a stop 11. Once a user inputs the Te rminate key of an request message is delivered to its mailbox. application's request, FIP recognizes it as a ter­ • Fldriver Event. The F!driver event provides com­ minator and returns the entire resultant string munication with the Fldriver. The F.ldriver inter­ to the Fldriver as a write packet. cepts a request from an application to the 12. The Fldriver sends the result string and 1/0 Tidriver and creates a packet fo r FIP. FIP issues a status to an application. read request to the Fidriver, and this event is set when a packet is delivered. A request is catego­ 13. An application accepts the converted string. rized in three types: read request, cancel Afterexec uting its internal process, it issues request, and disconnect request. another $QIO read request to the Tidriver. (Return to step 3.) • Key Event. The key event provides commu­ nication with the Tid river. FIP issues a $QJO Fidriver The Fldriver is a pseudo-driver that inter­ read request to the Tidriver byte by byte. All cepts $QJO read requests from an application to the input from a keyboard is recognized as a the Tidriver. Functioning as a bridge between ter­ key event in FIP. Once a key event is set in minal read requests and FIP, the FTdriver gets a read FIP, FIP examines the key sequence in a read request, passes its information to FIP, and maintains buffer. it. When FIP returns the completion message with If the input is in the range of a terminator mask, its processed Japanese string, the Fldriver validates FIP terminates a read operation from the Tfclriver it and completes a user's read request as if the and writes back the resultant string and 110 status Tidriver had returned it. Thus the user/application block to the Fldriver as a write packet. (A termina­ can get the Japanese string without modification tor mask is defined in the $QIO read request from fo r Japanese input method. an application.) The F!driver has other notification functions fo r If the input key is a conversion key, FIP calls a exception handling such as logout, cancel, or abort. conversion engine and gets the resultant converted string. Then FIP issues a write request to the Front-end Input Process Op erations All the oper­ Tidriver to display the updated string. ations in the front-end input process are driven by If the input key is a printable character, FIP the mailbox event, the Fldriver event, and the key updates the contents of its internal buffe rs defined event. Figure 11 shows the functional structure in the context and issues a write request to the of FIP. Tidriver to echo the character. The fo llowing operations in the front-end input If the input key is for line editing, fo r example, to process correspond to these three events. delete a line or a word or to refresh a line, FIP emu­ • Mailbox Event. The mailbox event provides lates the line-editing functionof the Tid river so its communication with an application. FIP issues editing fu nction is executed.

Digital Te cbtlicaljournal Vo l. 5 No. 3 Summer 199:3 105 Product Internationalization

FIP stores all user input and read-request info r­ and operations for the Japanese conversion. The mation from an application in its internal buffers Clserver client library is located between an appli­ and database. The buffe rs contain the codes of user cation and the Clserver body. input and corresponding video attributes to display. The database contains item codes in a read request, Input Method Control Program (IMCP) IMCP is channel numbers to connect other devices, and a command line interface to customize the Clserver so on. environment. A user sets proxy to a Japanese sys­ FIP creates a new database when the updated tem dictionary at a remote node on the network, read request from an application is delivered, in and IMCP administrates a proxy database. A user other words, when the Fldriver event is set. Also, can confirm the status of the server at a command FIP adds the ASCII code and an attribute of the line and can shut down the server from the I.MCP updated user input into buffe rs when a user inputs, interface. that is, when the key event is set.

Other Servers HM has a conversion engine dis­ Cli ent/Server Conversion patcher that can dynamically select from several The use of a client/server conversion has two Japanese conversion engines. HM now serves the advantages: (I) It reduces the required resources CIS (Ciserver, Digital Japan), the Wnn (Omron fo r language conversion by distributing some com­ Company), the Canna (NEC), and the JSY (Digital ponents to other systems, and (2) It presents an Japan) engines. Therefore, an application that uses environment that shares a common dictionary. HM as the Japanese conversion interface can select All procedures fo r the Japanese conversion its preferred engine. require large system resources such as CPU power. A user can place the conversion information server Extension in the Future (Ciserver) and a dictionary on a remote node and In this section, we describe the possibilities fo r call some functions of the Clserver client library to internationalization of FIP, HM, IMLIB, the Clserver, get the resultant string. In this way, a local system and the Fldriver. Although our approach does not saves its resources while the remote server pro­ provide a multilingual input method, it does pro­ cesses the conversions. vide an architecture that can be used fo r any In addition, many users can access a common language. dictionary on the specific remote node. It is possi­ FIP has a mu!tibyte VO operation that can be ble fo r any local user to access a dictionary on a applied to other 2-byte languages. In addition, all remote node if the Ciserver on the node is active. the read/write communications among FIP, the Fldriver, and the Tidriver proved able to handle Clserver one-byte languages such as English. Also, IMLIB can expand its keybinding system fo r conversion of The object name is "IM$CISERVER". The Clserver other languages, and HM can add the interfaces fo r initializes itself by finding the name of a transport conversion engines of other languages if such protocol in a logical table. It then creates corre­ engines are prepared. sponding shareable images, maps its required rou­ tines, and waits fo r a connect request from a client. The Clserver communicates with its client via a Summary mailbox at the transport level. The server sets the The Japanese input method is a complex procedure asynchronous system trap to the mailbox and reads involving preediting operations. Va rious keybind­ a message in it such as a connect request, a discon­ ing systems and manipulators accelerate input nect request, a connect abortion, or a client's image operations. Our approach fo r the Japanese input termination. The Clserver can identify the connec­ method allows an application three choices: (1) An tion to a client and specify a conversion stream in application can use a front-end input processor to the connection. perform all operations including 1/0 . (2) An appli­ cation can use the henkan module and implement Clserver Client Library The client library pre­ 1/0 operation itself. (3) An application can execute sents programming interfaces. These are callable all operations except keybinding, which is handled routines that execute various string manipulations by an input method library.

106 Vo l. 5 No. 3 Summe-r 1993 Digital Te chnical journal japanese Input Method Independent of Applications

Acknowledgments 2. Standard X, Ve rsion 11, Release 5 (Cam­ We want to express our appreciation to Katsushi bridge, MA: MIT X Consortium, 1988). Takeuchi of the XTPU development team fo r his ini­ 3. K. Yo shimura, T. Hitaka, and S. Yo shida, "Mor­ tial designing and prototyping of IMLIB and some phological Analysis of Non-marked-off implementation of f!P, and Junji Morimitsu on the Japanese Sentences by the Least BUNSETSU's same team fo r his initial implementation of IMLIB Number Method," Transactions of Informa­ and its compiler. AJso, we wish to thank Makoto tion Processing Society of japan, vo l. 24 lnada on the DECwindows team fo r his implementa­ (1983). tion of HM; J-litosbi lzumida, Tsutomu Saito, and Jun Yo shida from the JVMS driver team fo r their 4. K. Shirai, Y. Hayashi, Y. Hirata, ancl ]. Kubota, contribution toward creating the Fldriver; and "Database Formulation and Learning Proce­ Naoki Okuclera for his implementation to the dure fo r Kakari-Uke Dependency Analysis," entire Clserver environment. As a final remark, we Tra nsactions of lnfomzation Processing acknowledge Eiichi Aoki, an engineering manager Society of japan, vol. 26 (1985). of ISEJapan, and Hirotaka Yo shioka in the !SA group 5. IMLIB/OpenVMS Library Reference Manual for their encouragement in writing this paper. (in Japanese) (Tokyo : Digital Equipment Cor­ poration Japan, Order No. AA-PU8TA-TE, 1993). References 6. Us er's Manual fo r Defining Us er Keys in 1. Guides to the X Window Sy stem Program­ JMLIB (in Japanese) (Tokyo: Digital Equipment mer's Supplementfo r Release 5 (Sebastopol, Corporation Japan, Order No. AA-PUSUA-TE, CA: O'Reilly & Associates, Inc., 1991). 1993).

Digital Te chnical jo urnal l'IJ/. 5 No. 3 Summer I'J93 107 I Further Rea dings

The Digital Te cbnical]ournal VA X 9000 Series publishes papers that explore Vo l. 2, No. 4, Fa ll 1990, EY-E762E-DP the teclmological fo undations of Digital �· nu�jorpmducts. J::a ch DECwindows Program Journaljucuses on at least one Vo l. 2, No. 3, Summer 1990, EY-E756E-DP product area and presents a compilatioll of refereed papers VA X 6000 Model 400 System written by the engineers u•ho Vo l. 2, No. 2, Sp ring 1990, EY-C l97E-DP developed tbe products. The con­ tentj(Jr the Journal is selected Compound Document Architecture bythe journal Advisory Board Vo l. 2, No. I, Wi nter I990, EY-C196E-DP Digital engineers who would

like to contribute a paper to the Distributed Systems Journalshoul d contact the editor Vo l. I, No . 9, }u ne 1989, EY-C179E-DP at RDVAX:.RI.A I\.E.

Storage Technology Vo l. 1, No. 8, Fe bruary 1989, EY-Cl66E-DP

1bpics covered in previous issues of rbe Digital Te ch nical jo urnalare as fo llows: CVAX-based Systems Vo l. I, No. 7, August 1988, EY-6742£-DP Multimedia/Appl ication Control Vo l. 5, No. 2, -�Pring JCJ93, EY-P963E-DP Software Productivity To ols Vo l. I, No. 6, Fe bruary 1988, EY-8259£-DP DECnet Open Networking Vo l. 5, No. 1, Wi nter 1993, EY-M770E-DP VA Xcluster Systems Vo l. I, No. 5, September 1987, EY-8258£-DP Alpha AXP Architecture and Systems Vo l. 4, No. 4, Sp ecial Issue 1992, EY-)886£-DP VAX 8800 Vo l. 1, No. 4, Fe bruary 1987, EY-6711E-DP NVAX-microprocessor VA X Systems Vo l. 4, No. 3, Summer 1992, EY:J884E-DP Networking Products Vo l. 1, No. 3, September I986, EY-6715E-DP Semiconductor Te chnologies

Vo l. 4, No. 2, Sp ring 1992, EY-l521E-DP MicroVAX ll System Vo l. 1, No. 2, March 1986, EY-3474E-DP PATHWO RKS: PC Integration Software Vo l. 4, No. 1, Wi nter 1992, EY-J825E-DP VA X 8600 Processor Vo l. l, No. I, August 1985, EY-3435£-DP Image Processing, Video Terminals, and Printer Te chnologies Vo l. 3, No. 4, Fall 1991, EY-H8R9E-DP Subscrip tions and BackIss ues Subscriptions to the Digital Te chnicaljo urnalare Availability in VA Xcluster Systems/Network available on a prepaid basis. The subscription rate Performance and Adapters is $40.00 (non-u.s. $60.00) fo r four issues and $75.00 Vo l. 3, No. 3. Summer 1991 , EY-H890E-DP (non-u.s. $ 115.00) for eight issues. Orders should be sent to Cathy Phillips, Digital Equipment Corpora­ Fiber Distributed Data Interface tion, 30 Porter Road LJ 02/D 10, Little£On, Massa­ Vo l. 3, No. 2, Sp ring 1991, EY+I876E-DP chusetts 01460, U.S.A., Te lephone: (508) 486 -2538, FA X: (508) 486-2444. Inquiries can be sent electron­ Transaction Processing, Databases, and ically £O [email protected]. Subscriptions mus£ be Fault-tolerant Systems paid in U.S. dollars, andchecks should be made Vo l. 3, No. 1, Wi nter I99I, EY-F'588E-DP payable to Digital Equipment Corporation.

lOR Vo l. 5 No . .3 Summer 19'J.) Digital Te chnical journal Single copies and past issues of the Digital Te chnical Te chnical Pap ers byDi gital Authors jo urnalare available for $16.00 each by calling R. Abugov, "From Trenclcharts to Control Charts: DECdirect at 1-800-DIGITAL (1-800-344-4825). Setup Te sts fo r Making the Leap," IHFUSF:"Ml Recent back issues of the.J ournalare available on International Semiconductor Manujltcturing the Internet at gatekeeper.dec.com in the directory Science Sy mposium (June 1992). /pub/DEC/DECinfo/DTJ . R. Al-Jaar, "Performance Evaluation of Real-Time Recommended Reading on Decision-Making Architectures fo r Computer­ Internationalization Top ics Integrated Manufacturing," Ro botics and Computer­ B. Comrie, editor, Th e Wo r"ld's Major Languages integrated Manufacturing (January 1992). (New Yo rk: Oxford University Press, 1987). P Anick and S. Artemieff, "A High-Level Morpholog­ ical Description Language Exploiting Morphological F. Coulmas, Th e Wr iting Systems of the Wo rld Paradigms," Proceedings of the 15th International (Oxford: Basil Blackwell, 1989). Conference on Computational Linguistics (August 1992). J. DeFrancis, The Ch inese Language Fa ct and Fa ntasy, Second Paperback Edition (Honolulu: P. Anick and R. Flynn, ·· versioning a Full-text University of Hawaii Press, 1989). Information Retrieval System," Fifteenth Annual International ACM 5/GIR Conference on Research ]. De Francis, Vi sible Sp eech: Th e Diverse Oneness and Development in Information Retrieval of Wr iting Sy stems (Honolulu: University of Hawaii (June 1992). Press, 1989).

B. Archambeault, "A New Standard Radiator fo r Digital Guide to Developing International Shielding Effectiveness !\1easurcments," JF/:1:: Soft -ware (Burlington, MA: Digital Press, Order International Sy mposium on Electromagnetic No. EY-F577E-DP, 1991). CompatihUity (August 1992).

S. Martin, Internationalization Explored A. Berti and V Bolkhovsky, "A Manufacturable (UniForum, 1992). Process fo r the Formation of Self AI igned Cobalt Silicide in a Sub Micrometer CMOS Technology," MultiLingual Computing is a publication Proceedings of the Ninth International VLS! of Wo rldwide Publ ishing Group, Clark Fork, Multilevel Interconnection Conference (VMIC) Idaho, U.S.A. and is available on the Internet : (June 1992). [email protected]

G. Bock and D. Marca, "GROUPWARE: Software for A. Nakanishi, Writing Systems of the Wo rld, third Computer-Supported Cooperative Wo rk," IEEE printing (Rutland, VT. and To kyo: Charles E. Tuttle Co mputer Society Press Tu torial (January 1992). Company, 1988).

C. Brench, "A Method to Improve E:\1 1 Shielding D. Ta ylor, Global Softwa re: Developing Applica­ Predictions," IEEE Internal ional ::,:ymposium on tions fo r the International Ma rket (New York, Electromagnetic Compatibility (August 1992). Berlin, Heidelburg, London, Paris, To kyo, Hong Kong, Barcelona, Budapest: Springer-Verlag, 1992). D. Byrne, "Accurate Simulation of Multifn:qucncy Semiconductor Laser Dynamics { lnder Gigabits­ Th e Un icode Standard, Ve rsion 1. 0, Vo lume 1 Per-Second Modulation," IF.H]ournalof (Reading, MA: Addison-Wesley Publishing Lightwave Te chnology (August 1992). Company, 1991). R. Collica, "The Effect of the Numher of Defect The Unicode Standard, Ve rsion l.O, Vo lume 2 Mechanisms on Fault Clustering and its Detection (Reading, MA: Addison-We sley Publishing Using Yield Model Parameters;· IEEF Transactions Company, 1992). on Semiconductor Manujctcturing (August 1992).

Digital Te chnical jo unwl V

D. Davies and). Pazaris. ''Requirements fo r Optical K. Ramakrishnan, "Effe ctiveness of Congestion Interconnects in Future Computer Systems,'' SPIE Avoidance: A Measurement Study,'' IEEE Infocom International Sy mposium on Op tical Applied '92 (May 1992). Science and Engineering (July 1<)92). K. Ramakrisbnan, P Biswas, and R. Karedla. ·'Anal­ D. Dossa, "Above-Barrier Quasi-Bound Electronic ysis of File I/O Traces in Commercial Computing States in Asymmetric Quantum We lls,'' Physics Environments," AC!H Sigmetrics (June 1992) Review (March 1991). Y Raz, "The Principle of Commitment Ordering, D. Dossa, "Observation of Above-Barrier Quasi­ or Guaranteeing Serializability in a Heterogeneous Bound States in Asymmetric Single Quantum Wells Environment of Multiple Autonomous Resource by Piezomodulated Reflectivity," Applied Physics Managers Using Atomic Commitment," Proceed­ Letters (November 1991). ings of the 18th International Conference on Ve zv Large Databases (August 1992). B. Doyle, C. Conran, and B. Fishbein, "Thermal Instability in P-channel Transistors with Reoxi­ A. Rewari, "AI fo r Customer Service and Support,'' dized Nitrided Oxide Gate Dielectrics," IEEE IEEE Expert (June 1992). Fiftieth Device Research Conference (June 1992). ). Rose, "Fatal Electromigration Vo ids in Narrow B. Doyle and K. Mistry, "Hot Carrier Stress Damage Aluminum-Copper Interconnect," Applied Phsyics in the Gate 'Off' State in n-Channel Transistors," Letters (November 1992). IEEE Transactions on Electron Devices (July 1992). \V Samaras, "Futurebus+ Electrical Behavior fo r R. Dunlop, "Design fo r Electronic Assembly,'' High Performance," Bl!SCON '92 We st Conference Design fo r Manufacturability (vol. 6 of the SME Proceedings (Februaqr 1992). To ol and Manufacturing Engineers Ha ndbook series) (January 1992). M. Sayani, "DC-DC Converter Using AllSu rface­ Mount Components and Insulated-Metal M. Good, "Participatory Design of a Portable Substrate," IEEE Seventh Annual Applied Po wer To rque-Feedback Device;· CHI '92 Conference Electronics Conference (February 1992). Proceedinp,s(AC:\1 Conference on Human Factors Computing Systems) (May 1992). H. Smith and W Harris, "SIMS Quantification of AsCs+ at CoSi2/Si Interfaces,'' Proceedings of the D. Krakauer and K. Mistry, "ESD Protection in a Eighth International Conference on Secondary Jon 3.3V Sub-Micron Silicided CMOS Technology,'' IEEE Mass Sp ectrometry (SIMS Vffl) (September 1991). Electrical Over Stress/Electrostatic Discharge Sy mp osium Proceedings (July 1992). M. Stick, "Matrices and Ve ctors,'' Six Sigma Research Institute (April 1992). P Martino, "Analysis of Complex Geometric To lerances by Linear Programming," ASk!£ R. Ulichney, "The Construction and Evaluation of Computer in Engineering (August 1992). Halftone Patterns with Manipulated Power Spectra," Raster Imaging and Digital Ty pography (RIDT) ). Oparowski and P Te rranova, "Material (October 1991). and Design Considerations of Flexible Signal Connectors fo r the VA X 9000 MCU,'' ASM G. Wa llace, "The .JPEG Still Picture Compression International 7th Electronic Materials Standard," Co mmunications of the ACJJ (April and Processing Congress (August 1992). 1991).

A. Philipossian and D. Jackson, '· Kineticsof Oxide G. Wa llace, "Overview of the JPEG (ISO/CCITT) Growth during Reoxidation of Lightly 7'l"itricled Still Image Compression Standard," SPJE Image Oxides," jo urnal of the Electrochemical Society Processing Algorithms and Te chniques (September 1992). (February 1990).

110 Vol. 5 No. 3 Summer 1993 Digital Te chnical jourrwl I Recent Digital US. Pa tents

The fo llowing patents were recently issued to Digital Equipment Corporation. Titles and names supplied to us by the US. Pa tent and Trademark Office are reproduced exactly as they appear on the original published patent.

D337, 761 M. Hetfield and S. K. Morgan Electronic Device Module D338,001 M.]. Falkner, M. R. Wiesenhahn, Positioning Device and M. D. Good D338,653 M. Hetfield and S. K. Morgan Power Supply Module 5,185,877 W Bruckert Protocol for Transfer of DMAData 5,220,661 A. H. Mason, W- M. , System and Method fo r Reducing Timing Channels C. Kahn, ancl j. C. R. Wray Digital Data Processing Systems 5,224,263 W Hamburgen Gentle Package Extraction To ol and Method 5,225,790 P. Esling,J. M. Rinaldis, and Tunable Wideband Active Filter R. W Noguchi 5,226,092 K. Chen Method and Apparatus fo r Learning in a Neutral Network 5,226, 170 P. Ru binfelcl Interface between Processor and Special Instruction Processor in Digital Data Processing System 5,227, 041 B. Brogden, L. Brown, and Dry Contact Electroplating Apparatus S. Husain 5,227, 582 ]. Copeland and D. Robinson Video Amplifier Assembly Mount 5,227,604 G. M. Freedman Atmospheric Pressure Gaseous-Flux-Assisted Laser Reflow Soldering 5,227,778 G. Vacon Service Name to Network Address Translation in Communications Network 5,228,066 C. ]. Devane System and Method for Measuring Computer System Time Intervals 5,229, 575 L. Colella, R. Pacheco, and Thermode Structure Having an Elongated, Thermally D. Wa ller Stable Blade 5,229,901 M. L. Mailal)' Side-by-side Read/Write Heads with Rotal)' Positioner 5,229,914 D. A. Bailey Cooling Device that Creates Longitudinal Vo rtices 5,229,926 D. Donaldson and D. Wissel! Power Supply Interlock fo r Distributed Power Systems 5,230,044 N. Quaynor and X. Arbitration Apparatus for Shared Bus 5,231 ,246 D. Alessandrini,]. M. Benson, Apparatus fo r Securing Shielding or the Like and W Rett 5,232,570 C. Byun, B. Haines, E. Johns, Nitrogen-Containing Materials fo r Wear Protection and Q. Ng, G. C. Rauch, R. M. Friction Reduction Raymond, and D. Ravipati 5,233,616 M. Callander Write-back Cache with ECC Protection 5,233,684 R. Ulichney Method and Apparatus for Mapping a Digital Color Image from a First Color Space to a Second Color Space 5,235,211 W Hamburgen Semiconductor Package Having Wraparound Metallization 5,235,642 M. Abacl i, A. Birrell, Access Control Subsystem and Method for B. W Lampson, and Distributed Computer System using Locally Cached E. Wo bber Authentication Credentials 5,235,644 B. W Lampson, C. Kaufman, Probabilistic Cryptographic Processing Method W Hawe, M. F Kempf, ]. Tarclo, and A. Gupta 5,235,693 M. Gagliardo, JJ Lynch, Method and Apparatus fo r Red ucing Buffe r Storage in and P. M. Goodwin a Read-Modify-Write Operation

Digital Te chnical jounwl 1�1/. 5 No. 3 Summer !')'). ) 111 Recent Digital US. Pa tents

5,23';,697 S. C. Steely and). H. Zurawski Set Prediction Cache Memory System using Bits of the Main Memory Address 5,237, 662 T. L. Carruthers, K. Green, System and Method with a Procedure Oriented and S. Jenness Input/Output Mechanism 5,239,260 D. llingleband D. C. Widder Semiconductor Probe and Alignment System (SPAS) 5,239,493 S. K. Sherman Method and Apparatus fo r Interpreting and Organizing Timing Specification Information 5,239,630 R. F Lary and X. Cao Shared Bus Arbitration Apparatus Havinga Deaf Node 5,239,634 B. Buch and C. MacGregor Memory Controller for Engineering/Dequeuing Process 5,239,637 D. W Thiel, W Goleman, Digital Data Management System fo r Maintaining Consistency and S. H. Davis of Data in a Shadow Set 5,240,549 ). E. Fitch and W Hamburgen Fixture and Method for Attaching Components 5,240,740 K. A. Frey and M. L. Mallary Method of Making a Thin Film Head with Minimized Secondary Pulses 5,241 ,564 ). L. Ya ng Low Noise, High Performance Data Bus System and Method 5,241,621 R. Smart Management Issue Recognition and Resolution Knowledge Processor 5,241,652 W Barabash and W Ye razunis System fo r Performing Rule Partitioning in a RETE Network

112 Vo l. 5 No. 3 Summer (993 Digital Te chuical jom.,r al e p .... . n 0 0.... (I) ; ��CU'U� b��'J�1W Vl� � !') 0.... (I) Pl" ::::- n

ISSN 0898-901X

Printed in U.S.A. EY-P986E-DP/93 10 02 17.0 Copyright © Digital Equipment Corporation. All Rights Reserved.