Multilingualization: An answer to companies marketing their products globally

WHITE PAPER | April 2009

TABLE OF CONTENTS

What is Multilingualization?
Introduction
Key considerations while multilingualizing a product
Spoken language categorization
Multilingualization framework
Multilingualization methodology
Design Considerations
Tools and utilities
Deciding on the kind of internationalization support

© 2009, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.

What is Multilingualization?

Multilingualization is the process of making digital products and systems accessible to people in various geographic regions in their own native language. A truly global product is based on multilingual character tags, so that all displayed characters appear fully in the local language. Approximately 80% of the world’s population does not speak English. If consumer product companies with a global manufacturing network do not multilingualize their products, the digital divide will limit the global reach of those products.

Introduction

Unicode is inherently multilingual and is the only plausible way to deal with text that combines characters from many different languages. Multilingualization comprises two aspects, Internationalization (I18N) and Localization (L10N), which can be achieved independently.

[Figure: Multilingualization comprises Internationalization and Localization]

Internationalization makes software portable between languages and regions. It reduces code maintenance by eliminating language-specific or platform-specific code modifications and product versions, which in turn reduces version control, bug fixing, bug porting, hot site maintenance, and customer support costs.

Key considerations while multilingualizing a product

The first step towards multilingualizing a product is to develop a single international code base (code stream) that includes double-byte, Unicode, and bidirectional text support for Unicode / UTF-8. While designing for internationalization, the following functional questions play a key role in decision making:

• Does only the UI need to be internationalized and displayed in localized mode?

• Does file I/O need to be internationalized?

• Does the system perform heavy string processing or byte-level processing?

The answers to these questions guide the multilingualization strategy.

Spoken language categorization

From a multilingualization perspective, all spoken languages can be divided into three categories:

• English: This is the default working language and is normally supported on all platforms and by all software. Source code needs to support only single-byte characters.

• Latin 1: This includes many Western European languages, which originated from Latin and Greek and can be understood as “English-like”. The character sets of these languages can also be represented with a single byte per character. Minimal code changes are required to support these languages.

• All others: These languages require more than 1 byte to represent a character and thus introduce the need for multibyte character support. The changes may be substantial, depending on the existing design and the complexity of the system.
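
As a rough illustration of these three categories, the following Python sketch classifies a string by the narrowest encoding that can hold it. The function name `encoding_category` and the category labels are hypothetical, and ISO 8859-1 (Latin-1) is assumed as the single-byte character set for the second category.

```python
def encoding_category(text: str) -> str:
    """Return a (hypothetical) category label for the narrowest fitting encoding."""
    for label, codec in (("english", "ascii"), ("latin1", "latin-1")):
        try:
            # Succeeds only if every character fits this single-byte codec.
            text.encode(codec)
            return label
        except UnicodeEncodeError:
            continue
    # Anything else needs more than one byte per character.
    return "multibyte"
```

Under these assumptions, "Hello" falls in the first category, "Grüße" in the second, and "日本語" in the third.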

Multilingualization framework

Digital products are designed to perform some sort of function, usually data processing. Users request the function through an interface that consists of messages, windows, buttons, form fields, drop-down lists, and other screen elements. A proven approach to customizing the product or solution for use in different regions is to internationalize it and then localize it, as described in the following subsections.

Internationalization (I18N)

For internationalization, software is broken down into the following building blocks:

• Functionality - What the product does. This is the highest priority for internationalization.

• Messages - Text-based information from the product for the user. This is the second highest priority for internationalization.

• Interface elements - Text containers and graphics, such as windows, buttons, and icons. This is the lowest priority for internationalization.

Internationalized software can be localized to different languages and cultures with minimal effort.

Localization (L10N)

Localization of a product or a software solution can be achieved in two ways:

• The source code is first internationalized and then localized to specific regions.

• The source code is enhanced directly to support the local languages of different regions.

Many companies still believe in traditional bricks-and-mortar R&D and the idea that their innovation must principally reside in-house. These companies labor R&D departments with acquisitions, alliances, licensing, and selective component outsourcing.

[Figure: Product multilingualization. Source Code (English), Framework (Internationalized), Product (Japanese), Product (Arabic), Product (Chinese)]

Multilingualization methodology

Code analysis: Internationalization strives for a single generic code base that can support characters in multiple regional languages. Source code that currently supports only single-byte characters has to support multibyte characters (Unicode, UTF-8). Beyond the considerations listed in the previous sections, some languages need custom modules for support. The target markets should be analyzed when designing the road map for product internationalization.

Product internationalization: Extract all hard-coded strings into a resource file. This includes user interfaces and all the textual and non-textual elements, such as dialog boxes, error messages, and command line help. All the strings are placed into a resource file, for example Eng.rc.

Generic code base: The strings are replaced with a unique ID in the source code. This maps each ID to the original string it represents. Custom messages such as status messages, error messages, exception messages, and help messages are also given a unique ID. Output such as log files or error messages is normally produced in the system default encoding so that the editors or display mechanisms provided by the operating system can display it correctly.

Translation: The character strings are translated to local languages. The translated strings are stored in various resource files such as Chinese.rc, Japanese.rc, etc.
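
The string extraction, ID mapping, and translation steps above can be sketched as a small lookup table. This is a minimal illustration, not the format of an actual .rc file: the catalog layout, the IDs (`IDS_FILE_NOT_FOUND`, `IDS_SAVE`), and the function `get_message` are all hypothetical, and the fallback to English is an assumed policy.

```python
# Hypothetical in-memory catalogs standing in for resource files such as
# Eng.rc and Japanese.rc; each maps a unique string ID to its text.
CATALOGS = {
    "en": {"IDS_FILE_NOT_FOUND": "File not found", "IDS_SAVE": "Save"},
    "ja": {"IDS_FILE_NOT_FOUND": "ファイルが見つかりません"},
}
DEFAULT_LANGUAGE = "en"  # assumed fallback when a translation is missing


def get_message(message_id: str, language: str) -> str:
    """Look up a string by ID, falling back to the default language."""
    catalog = CATALOGS.get(language, {})
    if message_id in catalog:
        return catalog[message_id]
    return CATALOGS[DEFAULT_LANGUAGE][message_id]
```

Source code then refers only to the ID, for example `get_message("IDS_FILE_NOT_FOUND", "ja")`, and adding a language means adding a catalog rather than changing code.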

Addressing the cultural issues: Some product items have different formats in different cultures. This differs from localization in that the formatting depends on the user's locale rather than on the language the product was translated into. For example, in the U.S., dates are displayed in month/day/year format. In the U.K., dates are displayed in day/month/year format. The English product may be used in both regions, but the date format needs to change. Product functionality may also require I18N, depending on what the product does.
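
The date example can be sketched as follows. The pattern table and the function `format_date` are hypothetical; a real product would normally take such patterns from the platform's locale data or from an I18N library rather than hard-coding them.

```python
from datetime import date

# Hypothetical per-locale date patterns, hard-coded only for illustration.
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y",  # month/day/year
    "en_GB": "%d/%m/%Y",  # day/month/year
}


def format_date(value: date, locale_name: str) -> str:
    """Format a date with the pattern registered for the user's locale."""
    return value.strftime(DATE_PATTERNS[locale_name])
```

The same English-language product can then show 30 April 2009 as 04/30/2009 to a U.S. user and as 30/04/2009 to a U.K. user.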

Plug-in: The resource files are used as plug-in executable files.

Design Considerations

The following design aspects play a key role in decision making for I18N and L10N porting:

• Resource translation: In any I18N effort, localization of resources (message strings, images, menu bars, etc.) is an important and highly visible task. The generic steps are:

– Identify the resources that are to be localized

– Replace the occurrence of these resources in code by an identifier.

– Arrange the resources in an external data source which is catalogued as per language

– Build up a mechanism to pick correct resource as per locale setting

• Scalability: Localization-specific code should always be segregated from the common code and should preferably be kept in a separate library. This keeps the design scalable and pluggable for future support of other languages.

• Third-party I18N library: An example is ICU, a freely available library provided by IBM that compiles on various flavors of Windows, Unix, mainframes, etc. Other alternatives are also available, and the library should be chosen as per the requirement.

• Collation sequence: Sorting is one of the basic operations performed on data, and the sorting order of a language is determined by regional conventions. For example, in Swedish Ö comes near the end of the alphabet (after Z), whereas in German it is sorted near O. From a programming perspective, sorting depends on functions like strcoll(), which takes two input strings and decides which of them comes first in a given locale. Each locale has an associated character set and collation sequence, and the comparison function follows the collation sequence defined in that locale. Because locale definitions vary between platforms, the collation sequence can vary even when the character set is the same. For example, Japanese collation can be based on the number of strokes in an ideograph or on phonetics. If sorting plays a major role in the business logic, the choice of sorting order is a design-level decision. The Unicode Collation Algorithm defines a default collation order that can serve as a common baseline. If consistent results are desired across platforms, the collation service of an I18N library (like ICU) should be used.
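
To make the locale dependence of sorting concrete, here is a toy Python sketch with two hand-written sort keys. It is not ICU and not strcoll(): the alphabet table, the folding table, and both key functions are simplified assumptions that cover only lowercase letters with umlauts, and they exist only to show that the same words sort differently under different conventions.

```python
# Simplified Swedish alphabet: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
# Simplified German dictionary convention: umlauts sort like base vowels.
GERMAN_FOLDING = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word: str) -> list:
    """Sort key that ranks each letter by its position in the Swedish alphabet."""
    size = len(SWEDISH_ALPHABET)
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET else size + ord(ch)
        for ch in word.lower()
    ]


def german_key(word: str) -> str:
    """Sort key that folds umlauts onto their base letters."""
    return word.lower().translate(GERMAN_FOLDING)


words = ["zebra", "öl", "ost"]
swedish_order = sorted(words, key=swedish_key)  # ['ost', 'zebra', 'öl']
german_order = sorted(words, key=german_key)    # ['öl', 'ost', 'zebra']
```

A production system would delegate this to the platform's collation function or to a library collation service instead of maintaining such tables by hand.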

• Locale-based processing: While internationalizing software, the required changes can be listed as:

– Extracting the hard coded resources

– Replacing the functions whose behavior is locale sensitive

Any book on internationalization gives a fair idea of which functions are locale sensitive. They fall into one of the following categories:

– Character classification (toupper, isdigit, isalnum, etc.)

– Sorting / collation (strcoll, memcoll, and byte-wise comparisons such as strcmp that need replacing)

– Language Parsing

– Date / time

– Numeric or monetary formatting

• Conversion from variable-width encoding to another encoding: If you use two different encodings, you may have to tackle issues arising from converting a data stream in a variable-width encoding (e.g. UTF-8) to another encoding (e.g. UTF-16). The issue arises because it is not trivial to find the number of bytes consumed by n characters in a variable-width encoding. A solution can be devised based on the bit structure of the encoding scheme used.
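
For UTF-8, the bit structure mentioned above is the pattern of leading bits in the first byte of each sequence. The following Python sketch uses it to find how many bytes the first n characters occupy. The function names are hypothetical, and the input is assumed to be well-formed UTF-8 (continuation bytes are not validated).

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if (lead_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx: single byte (ASCII)
        return 1
    if (lead_byte & 0b11100000) == 0b11000000:  # 110xxxxx: two-byte sequence
        return 2
    if (lead_byte & 0b11110000) == 0b11100000:  # 1110xxxx: three-byte sequence
        return 3
    if (lead_byte & 0b11111000) == 0b11110000:  # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead_byte)


def bytes_for_chars(data: bytes, n: int) -> int:
    """Return how many bytes of data are consumed by its first n characters."""
    offset = 0
    for _ in range(n):
        if offset >= len(data):
            raise ValueError("data holds fewer than n characters")
        offset += utf8_sequence_length(data[offset])
    return offset
```

For the four-character string "aé日𝄞" encoded as UTF-8, the first 1, 2, 3, and 4 characters consume 1, 3, 6, and 10 bytes respectively.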

• Abstraction using wrappers: It is very important to design a layer of abstraction between the existing code and the third-party I18N library. The wrapper confines library-specific calls to one place, so the library can be upgraded or replaced without changes to the rest of the code.
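
One possible shape for such a wrapper layer is sketched below. All names (`TextServices`, `StdlibTextServices`, `make_heading`) are hypothetical, and Python's standard library stands in for the third-party I18N library; an ICU-backed class could implement the same interface.

```python
import unicodedata
from abc import ABC, abstractmethod


class TextServices(ABC):
    """Hypothetical interface that application code depends on."""

    @abstractmethod
    def normalize(self, text: str) -> str:
        ...

    @abstractmethod
    def to_upper(self, text: str) -> str:
        ...


class StdlibTextServices(TextServices):
    """Backend built on the standard library, used here as a stand-in."""

    def normalize(self, text: str) -> str:
        return unicodedata.normalize("NFC", text)

    def to_upper(self, text: str) -> str:
        return text.upper()


def make_heading(services: TextServices, title: str) -> str:
    """Application code: it calls only the wrapper, never the library."""
    return services.to_upper(services.normalize(title))
```

Swapping the backend then means passing a different `TextServices` implementation, with no change to functions such as `make_heading`.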

• Normalization: Unlike English, many languages have characters with diacritical marks or accents. Depending on the requirement, such a character sometimes needs to be treated as a combination of a base character and an additional combining character. For this purpose, these characters should be normalized. ICU also provides a normalization feature.
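
The base-plus-combining-character view can be shown with Python's standard `unicodedata` module, used here as a stand-in for a library normalization service. The helper names are hypothetical; the normalization forms NFC (composed) and NFD (decomposed) are defined by the Unicode standard.

```python
import unicodedata


def decompose(text: str) -> str:
    """Split accented characters into base + combining marks (form NFD)."""
    return unicodedata.normalize("NFD", text)


def compose(text: str) -> str:
    """Merge base + combining marks into precomposed characters (form NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text: str) -> str:
    """Drop combining marks after decomposition, keeping only base characters."""
    return "".join(ch for ch in decompose(text) if not unicodedata.combining(ch))
```

With these helpers, the single character é (U+00E9) decomposes into e followed by the combining acute accent (U+0301), and two strings that look identical compare equal once both are brought to the same form.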

• Digit identification: The digits 0 to 9 are the international standard for number representation; however, there are also many local numeral systems that are popular in many countries. IsDigit()-style functions (and their equivalents) should recognize these as well, which can be done by knowing the code points of those numerals.
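
As a small sketch of digit recognition across numeral systems, the function below maps any Unicode decimal digit to its ASCII equivalent using the decimal value recorded in the Unicode character database. The function name `to_ascii_digits` is hypothetical.

```python
import unicodedata


def to_ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit with the matching ASCII digit."""
    converted = []
    for ch in text:
        # decimal() returns the digit value, or the default for non-digits.
        value = unicodedata.decimal(ch, None)
        converted.append(ch if value is None else str(value))
    return "".join(converted)
```

For example, the Arabic-Indic digits U+0662 U+0660 U+0660 U+0669 and the Devanagari digits U+0968 U+0966 U+0966 U+096F both convert to "2009", while non-digit characters pass through unchanged.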

• Surrogate range: Characters in the surrogate range are very rarely used in any language and can usually be ignored. This implies that every remaining Unicode character can be represented by a single 2-byte UTF-16 code unit, which makes UTF-16 effectively a fixed-width encoding. This greatly simplifies the design. However, the decision should be taken based on the requirement.
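
Whether a given text stays within the fixed-width assumption can be checked directly. In this sketch the function names are hypothetical; the threshold U+FFFF is the upper end of the range that UTF-16 encodes in a single 2-byte unit.

```python
def needs_surrogate_pair(ch: str) -> bool:
    """True if the character lies above U+FFFF and needs two UTF-16 units."""
    return ord(ch) > 0xFFFF


def utf16_code_units(text: str) -> int:
    """Count 2-byte UTF-16 code units (little-endian, no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2


def is_fixed_width_safe(text: str) -> bool:
    """True if every character fits in exactly one UTF-16 code unit."""
    return not any(needs_surrogate_pair(ch) for ch in text)
```

A design that relies on fixed-width UTF-16 could run a check like `is_fixed_width_safe` on its input at the boundary and reject or specially handle anything that fails it.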

• Performance: Encoding conversion of data should be avoided where possible, as it is a costly operation. Keeping data in a fixed-width encoding saves CPU cycles that would otherwise be needed to calculate character boundaries.

• Memory consumption: This decision is governed by the languages that must be supported. For English and Latin 1 languages, extended ASCII or UTF-8 needs the least memory, whereas UTF-16 always needs at least 2 bytes per character. On the other hand, Far East language characters may take 2 bytes in UTF-16 but 3 or more bytes in UTF-8. Similar trade-offs apply to other encodings. The trade-off is always between performance and memory consumption, and the choice is governed purely by the project requirements.

• Backward compatibility: Almost all multibyte encodings (e.g. UTF-8) provide backward compatibility with ASCII, i.e. the first 128 characters have the same code points. If the design demands that old data (which is in English) be supported by the new internationalized binary, such an encoding scheme may be used for the I18N data.
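
The memory and backward-compatibility points in the last two items can be checked with a few lines of Python. The function names are hypothetical, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def is_ascii_compatible(text: str, codec: str = "utf-8") -> bool:
    """True if ASCII-only text has identical bytes in ASCII and in codec."""
    return text.encode("ascii") == text.encode(codec)
```

For "hello" this gives 5 bytes in UTF-8 against 10 in UTF-16, while for "日本語" it gives 9 bytes in UTF-8 against 6 in UTF-16. Old ASCII data is byte-for-byte valid UTF-8, which is not true of UTF-16.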

Tools and utilities

The following tools help analyze the code that needs to be internationalized:

• Understand for C: An excellent tool for understanding the structure of C and C++ code, including its classes, files, functions, and macros. It provides search, view, and edit facilities, and it generates statistical data for the whole code base in various formats.

• Source Insight: Another good tool for gaining insight into C / C++ source code.

Deciding on the kind of internationalization support

One major point of analysis is what kind of internationalization support is to be added. The choices are locale-based internationalization and Unicode-based internationalization. In locale-based internationalization, the software behaves as per the system (or user) locale. In Unicode-based internationalization, the working character set is Unicode, and data is processed according to the language of the data itself. This means that the software can process data according to its type, independent of the system locale.
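
As a rough sketch of the Unicode-based approach, the code below inspects the data itself rather than the system locale. It uses a heuristic that is an assumption of this example: the first word of a character's Unicode name is treated as its script, which is approximate and not a substitute for a library's script detection. The function names are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Approximate script of a character from its Unicode name (heuristic)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text: str) -> set:
    """Set of approximate scripts used by the letters in text."""
    return {script_of(ch) for ch in text if ch.isalpha()}
```

Processing can then branch on the result, for example choosing Arabic-specific handling for text whose letters report "ARABIC", whatever locale the host system is configured with.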

Today, global R&D networks have to thrive on a complex innovation ecosystem that includes captives, vendors, and partners spread across geographies. Does this affect the performance of your R&D? While offshore outsourcing forces a distributed effort around the globe, do you have a partner that seamlessly integrates into your R&D network? HCL, India’s largest independent R&D services partner, delivers just that. Powered by experience, expertise, and scale, HCL delivers the most complex turnkey product engineering projects while ensuring integration into your existing R&D network.

HCL is helping more than 200 global organizations achieve their business goals.

Talk to us today to know more about the new business aligned R&D network.

CUSTOM APPLICATION SERVICES

ENGINEERING AND R&D SERVICES

ENTERPRISE APPLICATION SERVICES

ENTERPRISE TRANSFORMATION SERVICES

IT INFRASTRUCTURE MANAGEMENT

BUSINESS PROCESS OUTSOURCING

About HCL

HCL Technologies

HCL Technologies is a leading global IT services company, working with clients in the areas that impact and redefine the core of their businesses. Since its inception into the global landscape after its IPO in 1999, HCL has focused on ‘transformational outsourcing’, underlined by innovation and value creation, and offers an integrated portfolio of services including software-led IT solutions, remote infrastructure management, engineering and R&D services, and BPO. HCL leverages its extensive global offshore infrastructure and network of offices in 20 countries to provide holistic, multi-service delivery in key industry verticals including Financial Services, Manufacturing, Aerospace & Defense, Telecom, Retail & CPG, Life Sciences & Healthcare, Media & Entertainment, Travel, Transportation & Logistics, Automotive, Government, Energy & Utilities. HCL takes pride in its philosophy of ‘Employee First’, which empowers its 54,026 transformers to create real value for customers. HCL Technologies, along with its subsidiaries, had consolidated revenues of US$ 2.0 billion (Rs. 9,842 crores), as on 31st March 2009. For more information, please visit www.hcltech.com

HCL Enterprise

HCL is a $5 billion leading Global Technology and IT Enterprise that comprises two companies listed in India – HCL Technologies & HCL Infosystems. The 3-decade-old Enterprise, founded in 1976, is one of India’s original IT garage start-ups. Its range of offerings spans Product Engineering, Custom & Package Applications, BPO, IT Infrastructure Services, IT Hardware, Systems Integration, and distribution of ICT products. The HCL team comprises over 59,000 professionals of diverse nationalities, who operate from 20 countries including 500 points of presence in India. HCL has global partnerships with several leading Fortune 1000 firms, including leading IT and Technology firms. For more information, please visit www.hcl.in