WHITE PAPER, APRIL 2009
Multilingualization | April 2009

Multilingualization
An answer to companies marketing their products globally.

TABLE OF CONTENTS
What is Multilingualization?
Introduction
Key considerations while multilingualizing a product
Spoken language categorization
Multilingualization framework
Multilingualization methodology
Design Considerations
Tools and utilities
Deciding on the kind of internationalization support

© 2009, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.

What is Multilingualization?
Multilingualization is the process of making digital products and systems accessible to people in different geographic regions in their own native language. A truly global product is based on multilingual character tags, so that all displayed characters appear fully in the local language. Approximately 80% of the world’s population consists of non-English speakers. If consumer product companies with global manufacturing networks do not multilingualize their products, the digital divide will limit the global reach of those products.

Introduction
Unicode is inherently multilingual, and it is the only plausible way to deal with text that combines characters from many different languages. Multilingualization comprises two aspects, Internationalization (I18N) and Localization (L10N), which can be achieved independently.

[Figure: Multilingualization, Internationalization, Localization]

Internationalization makes software portable between languages and regions. It reduces code maintenance by eliminating language-specific or platform-specific code modifications and product versions. This in turn reduces version control, bug fixing, bug porting, hot site maintenance, and customer support costs.
Key considerations while multilingualizing a product
The first step towards multilingualization is to develop a single international Unicode code base (or code stream) that includes double-byte, Unicode / UTF-8, and bi-directional text support. When designing for internationalization, the following functional questions play a key role in decision making:

• Does only the UI need to be internationalized and displayed in localized mode?
• Does the file I/O need to be internationalized?
• Does the system perform heavy string processing or byte-level processing?

The answers to these questions guide the multilingualization strategy.

Spoken language categorization
From a multilingualization perspective, all spoken languages can be divided into three categories:

• English: This is the default working language and is normally supported on all platforms and by all software. Source code needs to support only single-byte characters.
• Latin 1: This includes many Western European languages, which originated from Latin and Greek and can be understood as “English-like”. The character sets of these languages can also be represented with a single byte per character. Minimal code changes are required to support these languages.
• All others: These languages require more than one byte to represent a character and thus introduce the need for multibyte character support. The changes may be substantial, depending on the existing design and complexity of the system.

Multilingualization framework
Digital products are designed to perform some function, usually data processing. Users request the function through an interface that consists of messages, windows, buttons, form fields, drop-down lists, and other screen elements.
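As a rough illustration of the three language categories described above, the following Python sketch classifies a string by the narrowest encoding that can hold it. The function name `categorize` and the sample strings are assumptions made for this example, not part of the original paper. Note that the single-byte property of Latin 1 applies to a legacy encoding such as ISO-8859-1; in UTF-8, every non-ASCII character occupies more than one byte.

```python
def categorize(text):
    """Return the language category (per the list above) that a string fits.

    Hypothetical helper: it tries the narrowest encoding first.
    """
    try:
        text.encode("ascii")        # plain English text: one byte per character
        return "English"
    except UnicodeEncodeError:
        pass
    try:
        text.encode("latin-1")      # Western European: still one byte in ISO-8859-1
        return "Latin 1"
    except UnicodeEncodeError:
        return "All others"         # needs a multibyte encoding


for sample in ["Hello", "Grüße", "こんにちは"]:
    utf8_len = len(sample.encode("utf-8"))
    print(sample, "->", categorize(sample),
          "| characters:", len(sample), "| UTF-8 bytes:", utf8_len)
```

A check of this kind can help estimate, early in code analysis, how much of a product's existing text data already falls outside the single-byte range.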
A proven approach to customize the product or solution for usage in different …

Internationalization (I18N)
For internationalization, software is broken down into the following building blocks:

• Functionality - What the product does. This is the highest priority for internationalization.
• Messages - Text-based information from the product for the user. This is the second-highest priority for internationalization.
• Interface elements - Text containers and graphics, such as windows, buttons, and icons. This is the lowest priority for internationalization.

Internationalized software can be localized to different languages and cultures with minimal effort.

Localization (L10N)
Localization of a product or software solution can be achieved in two ways:

• The source code first undergoes internationalization and is then localized to specific regions.
• The source code is enhanced directly to support the local languages of different regions.

Many companies still believe in traditional bricks-and-mortar R&D and the idea that their innovation must principally reside in-house. These companies labor R&D departments with acquisitions, alliances, licensing, and selective component outsourcing.

[Figure: Product Multilingualization. Source Code (English); Framework (Internationalized); Product (Japanese); Product (Arabic); Product (Chinese)]

Multilingualization methodology
Code analysis: Internationalization strives for a single generic code base that can support characters in multiple regional languages. Source code that currently supports only single-byte characters has to be extended to support double-byte characters (Unicode, UTF-8). Beyond the considerations listed in the previous section, some languages need custom modules for support. Target markets should be analyzed when designing the road map for product internationalization.
Product internationalization: Extract all hard-coded strings into a resource file. This includes the user interface and all textual and non-textual elements, such as dialog boxes, error messages, and command-line help. All strings are placed in a resource file such as Eng.rc.

Generic code base: Each string in the source code is replaced with a unique ID, which maps a number to the original string. Custom messages such as status messages, error messages, exception messages, and help messages are also given unique IDs. Output such as log files and error messages is normally produced in the system default encoding, so that the editors and display mechanisms provided by the operating system can display it correctly.

Translation: The character strings are translated into the local languages. The translated strings are stored in separate resource files such as Chinese.rc, Japanese.rc, etc.

Addressing the cultural issues: Some product items have different formats in different cultures. This differs from localization in that the formatting is based on the user's locale and not on the language of the localized product. For example, in the U.S., dates are displayed in month/day/year format; in the U.K., they are displayed in day/month/year format. The same English product may be used in both regions, but the date format needs to change. Product functionality may also require I18N, depending on what the product does.

Plug-in: The resource files are used as plug-in executable files.

Design Considerations
The following design aspects play a key role in decision making for I18N and L10N porting:

• Resource translation: In any I18N effort, localization of resources (message strings, images, menu bars, etc.) is an important task and the most visible one.
Generic steps for doing it are:
  – Identify the resources that are to be localized.
  – Replace each occurrence of these resources in the code with an identifier.
  – Arrange the resources in an external data source that is catalogued by language.
  – Build a mechanism to pick the correct resource according to the locale setting.
• Scalability: Localization-specific code should always be segregated from the common code and should preferably be kept in a separate library. This keeps the design scalable and pluggable for future support of other languages.
• 3rd-party I18N library: An example is ICU, a freely available library from IBM that compiles on various flavors of Windows, Unix, mainframes, etc. Other alternatives are also available, and the choice should be made according to the requirements.
• Collation sequence: Sorting is one of the basic operations performed on data, and the sorting order of a language's alphabet is determined by regional conventions. For example, the letter Ö comes at the end of the alphabet (after Z) in Swedish, whereas in German it is sorted near O. From a programming perspective, sorting depends on a function such as strcoll(), which takes two input strings and decides which of them comes first in the given locale. A locale has an associated character set and collation sequence, and the comparison function follows the collation sequence defined in that locale. Because locale definitions vary between platforms, the collation sequence can vary even when the character set is the same. For example, the Japanese collation sequence can be based on the number of strokes in an ideograph or on phonetics.
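To make the locale dependence of collation concrete, here is a minimal Python sketch that sorts the same word list under two hand-written orderings, using the Swedish and German treatment of Ö as the example. The tables and the functions `german_key` and `swedish_key` are simplified assumptions for illustration only; they stand in for the collation sequence that a real locale would supply through strcoll() or a library such as ICU, and they do not reproduce the full rules of either language.

```python
import unicodedata

BASE = "abcdefghijklmnopqrstuvwxyz"

# Simplified, assumed rules (not complete language definitions):
# German folds umlauts onto their base letters; Swedish puts å, ä, ö after z.
GERMAN_FOLD = {"ä": "a", "ö": "o", "ü": "u"}
SWEDISH_ALPHABET = BASE + "åäö"


def _ranks(text, alphabet):
    """Map each character to its position in the alphabet (unknown ones last)."""
    return [alphabet.index(ch) if ch in alphabet else len(alphabet) for ch in text]


def german_key(word):
    """Sort key for the simplified German-style ordering."""
    text = unicodedata.normalize("NFC", word).lower()
    folded = "".join(GERMAN_FOLD.get(ch, ch) for ch in text)
    # The unfolded text breaks ties between words that fold to the same letters.
    return (_ranks(folded, BASE), text)


def swedish_key(word):
    """Sort key for the simplified Swedish-style ordering."""
    text = unicodedata.normalize("NFC", word).lower()
    return (_ranks(text, SWEDISH_ALPHABET), text)


WORDS = ["Zebra", "Öl", "Ofen", "Apfel"]
print(sorted(WORDS, key=german_key))   # ['Apfel', 'Ofen', 'Öl', 'Zebra']
print(sorted(WORDS, key=swedish_key))  # ['Apfel', 'Ofen', 'Zebra', 'Öl']
```

In production code the ordering would normally come from the platform locale or a collation library instead of a hand-maintained table, which is exactly why results can differ between platforms.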
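The generic resource translation steps listed under Design Considerations (identify resources, replace them with identifiers, catalogue them by language, pick the right one by locale) can be sketched in Python as follows. Everything here is hypothetical: the message IDs, the `RESOURCES` dictionary standing in for per-language resource files, and the `get_message` function are invented for illustration, and a real product would load its catalogues from external files.

```python
# Hypothetical message identifiers that replace hard-coded strings in the code.
MSG_FILE_NOT_FOUND = 1001
MSG_SAVE_OK = 1002

# In-memory stand-in for external, per-language resource files.
RESOURCES = {
    "en": {
        MSG_FILE_NOT_FOUND: "File not found: {name}",
        MSG_SAVE_OK: "Saved.",
    },
    "ja": {
        MSG_FILE_NOT_FOUND: "ファイルが見つかりません: {name}",
        # MSG_SAVE_OK is deliberately untranslated to show the fallback.
    },
}
DEFAULT_LANG = "en"


def get_message(msg_id, lang, **params):
    """Look up a message by ID for the given language, falling back to English."""
    template = RESOURCES.get(lang, {}).get(msg_id)
    if template is None:
        template = RESOURCES[DEFAULT_LANG][msg_id]
    return template.format(**params)


print(get_message(MSG_FILE_NOT_FOUND, "ja", name="report.txt"))
print(get_message(MSG_SAVE_OK, "ja"))  # falls back to the English string
```

Falling back to the default language is one possible design choice; it keeps a partially translated product usable, at the cost of mixing languages on screen.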
