
Understanding Internationalization

May 2006 [Revision number: V2.1-1]
Copyright 2006 Sun Microsystems, Inc.

This tutorial provides background information to help you develop an internationalized web application with the Sun Java Studio Creator application development tool, using JavaServer Faces components. It provides an example of localizing an application for a specific locale.

The Sun Java Studio Creator IDE enables you to develop localized web applications specific to a single locale or supporting many locales. The easiest way to develop an application for a single locale is to have the IDE inherit its default settings from the locale settings of your operating system. The IDE also lets you develop internationalized web applications that can be automatically localized to many different locales. The recommended way to do so is to use a properties bundle that segregates your human-readable text into one set of files for easy maintenance.

Contents

- What Is Internationalization
- Locale, Character Sets, and Encoding
- Unicode and UTF-8
- Locale Settings and Character Encodings in the IDE
- Localized Project Development
- Internationalization Features in the IDE
- Source Encoding Settings
- Supported Locales
- Encoding Component
- Default Locale and Response Encoding
- UrlLang Property
- Lang Property
- Language Property
- The Charset Property
- The Hreflang Property
- The Dir Property
- Bundled Properties
- Load Bundle Component
- Internationalization Properties in JavaServer Faces
- Files
- The faces-config.xml File
- Example

Example used in this tutorial: internationalizingapps_ex.zip (zip)

What Is Internationalization

Internationalization is the process of making program code generic and flexible so that it can easily accommodate specifications from markets around the world. These specifications include, but are not limited to, language, character sets, date and time formats, and currency symbols.
Readers familiar with issues involving internationalization, localization, and character encoding may wish to skip directly to the topic Localized Project Development.

Internationalization enables localization, the ability to develop a user interface for a specific locale (a geographic, political, or cultural region). A locale is usually identified by a two-letter ISO 639-1 language code and a two-letter ISO 3166 country code. For example, en_US represents the locale United States English. In some cases, only a language identifier is provided.

The Sun Java Studio Creator application development tool (the IDE) enables you to create localized versions of your program, where each localized version supports a target locale. It generates internationalized code, so that your program can behave as a localized application in different locales.

Locale, Character Sets, and Encoding

Character encoding issues are not exclusive to internationalization; they affect any developer whose application will run in a specified locale, even if the application doesn't display messages specific to that locale. However, it is difficult to understand the process of internationalization without understanding character encoding.

As already mentioned, a locale is usually identified by a two-letter language code and a two-letter country code. Each language (and in some cases, each locale) may require a different character set. The character set is the set of glyphs (the alphabet, numbers, and punctuation) used in the language. Character sets are comparatively small for European languages, numbering in the hundreds of glyphs. For other locales, especially those in Asia, character sets can number in the tens of thousands of glyphs.

Character encoding is the mapping of code points, represented as integers, to the glyphs in a character set. The mapping between a code point and a symbol is typically set by a standards committee such as the International Organization for Standardization (ISO).
Encodings can also be defined by manufacturers who want to define special glyphs of their own. Locales with large character sets require code points of at least two bytes (16 bits) to represent all of the glyphs, but variable byte-length encoding schemes have been designed to use as many as 6 bytes for some glyphs.

Any single language can have more than one encoding. Although two different character encodings may provide the same set of characters, the same code point can refer to a different glyph in each. To add to the confusion, there can be several names for the same encoding. To read more on encoding, you can go to several sources, starting with the ISO web site.

Unicode and UTF-8

Unicode is the international standard whose goal is to specify a code that maps every character needed by every written human language to a single unique integer code point. Despite difficulties, Unicode has emerged as the dominant encoding scheme in multilingual environments.

The Java Studio Creator IDE defaults to the UTF-8 encoding (8-bit Unicode Transformation Format) as the most flexible implementation of the Unicode standard. UTF-8 is a lossless, variable-length character encoding that uses groups of bytes to represent the Unicode standard for the alphabets of many of the world's languages. It is the default encoding for the XML format.

UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. For example, only one UTF-8 byte is needed to encode each of the 128 US-ASCII characters. An ASCII file encoded in UTF-8 is exactly the same as the original ASCII file. Every byte of a non-ASCII character is guaranteed to have its most significant bit set, whereas the bytes of ASCII characters do not. This means that existing tools used with ASCII text (searching, text editing, and so on) work as expected, and that legacy systems such as emailers can transmit UTF-8.

UTF-8 guarantees that no byte sequence of one character is contained within a longer byte sequence of another character.
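
The byte-length and high-bit properties described above can be checked with a few lines of Java. The following is a minimal sketch, not part of the tutorial's example project: the class name Utf8Properties, its helper methods, and the sample characters are chosen here purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Properties {
    /** Returns the number of bytes UTF-8 needs to encode the given text. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Returns true if every UTF-8 byte of the text has its most significant bit set. */
    static boolean allHighBitsSet(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return false; // this byte is in the ASCII range
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative samples: U+0041, U+00E9, U+4E2D, and U+1F600 (a surrogate pair in Java).
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(String.format("U+%04X -> %d byte(s), all high bits set: %b",
                    s.codePointAt(0), utf8Length(s), allHighBitsSet(s)));
        }

        // Pure ASCII text produces identical bytes in US-ASCII and UTF-8.
        String ascii = "Hello";
        boolean identical = Arrays.equals(
                ascii.getBytes(StandardCharsets.US_ASCII),
                ascii.getBytes(StandardCharsets.UTF_8));
        System.out.println("ASCII bytes identical in UTF-8: " + identical);
    }
}
```

The four samples need 1, 2, 3, and 4 bytes respectively, and only the ASCII letter contains a byte whose most significant bit is clear.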
The guarantee that no character's byte sequence appears inside another's ensures that byte-wise substring matching can be applied to search for words or phrases within UTF-8 text. Most other variable-length 8-bit encodings do not have this property, making string matching difficult.

Locale Settings and Character Encodings in the IDE

The Java Studio Creator IDE gives you control of locale and character encoding in several ways. As the preceding discussion points out, locale and character encoding are separate settings. However, you will usually find an association between the two settings because of font availability.

NOTE: A typeface is a design of a set of letters, numbers, and special symbols that share a particular appearance, such as the Helvetica or Times Roman typefaces. A font is an implementation of a typeface's character set. At one time, fonts were expressed in hardware. However, modern fonts, which are often dynamically generated, are available as software associated with an operating system.

As a developer, you must consciously target the platforms and character encodings that will support your application. For web applications, the target platform for your development effort is not a computer or even an operating system, but a web browser. Popular browsers, in turn, are implemented for several operating systems. Because fonts are distributed with operating systems, popular browsers must accommodate each system's available character encodings. A modern browser can handle many different font encodings.

In the Java Studio Creator IDE, you will typically select a specific character encoding, such as Shift_JIS encoding whenever you specify the locale ja_JP, or Big5 encoding whenever you specify a traditional Chinese locale such as zh_TW.

Browsers provide preference settings to inform internationalized sites of the user's preferred language. The web server receiving this information attempts to serve pages in the preferred language of the client browser.
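
Browsers usually send the language preference in the HTTP Accept-Language header. The sketch below shows one way a server could match that header against the locales an application supports, using the standard java.util.Locale lookup API (Java 8 or later). It is not how the IDE or JavaServer Faces performs the match; the class name PreferredLocale, the SUPPORTED list, and the fallback rule are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PreferredLocale {
    // Hypothetical set of locales for which the application has localized pages.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("en-US"),
            Locale.forLanguageTag("ja-JP"),
            Locale.forLanguageTag("zh-TW"));

    /**
     * Picks the best supported locale for an Accept-Language header value.
     * Falls back to the first supported locale when the header is missing,
     * malformed, or matches nothing (an assumed policy, not a standard one).
     */
    static Locale choose(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return SUPPORTED.get(0);
        }
        try {
            // parse() orders the ranges by their q-values, highest first.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : SUPPORTED.get(0);
        } catch (IllegalArgumentException malformedHeader) {
            return SUPPORTED.get(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(choose("ja-JP,ja;q=0.8,en;q=0.5").toLanguageTag());
        System.out.println(choose("fr-FR,fr;q=0.9").toLanguageTag());
    }
}
```

In a servlet-based application the header value would typically come from the incoming request; it is passed in as a plain string here so that the sketch stays self-contained.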
The ability to specify character encodings as well as locales is important because not all encodings for a specific language are the same. For example, consider Figure 1, which shows a sample of Chinese text displayed in three different encodings on the same browser. The encoding in each case is the response encoding, the encoding specified by the web server when it composes a page and sends it to the client browser. When the browser receives a page specified in, for example, UTF-8 encoding, it attempts to render the page using its UTF-8 fonts.

Figure 1: Rendering Differences in Chinese Character Encodings

The circled characters in Figure 1 show that UTF-8, Big5, and GB2312 can have significant differences when displayed on the same browser. GB2312 encoding is typically used in mainland China; Big5 in Taiwan. One character code at the bottom of the Big5 encoding sample is not rendered as a Chinese glyph at all, indicating that the passage was probably intended for GB2312 (simplified Chinese) encoding. Fonts used for UTF-8 encoding are similar to those used for Big5 in this sample, although different punctuation spacings produce different line breaks.

To further complicate the situation, different browsers may behave differently. For this reason, some web applications, especially those adapted for Chinese, give the user a choice of encodings.

Localized Project Development

The Java Studio Creator IDE enables you to develop web-based applications in many languages. To successfully deploy and test while developing a web-based application in a specific language, you must start up the IDE on a development computer with the same character encoding as the target computer where you plan to run the server. The browser must also support the target language.

As you develop your application, all JavaServer Pages (JSP) files and Java source files may be maintained in the native encoding of your operating system, as is true of many other popular IDEs.
This feature allows the IDE to work interchangeably with other tools. Alternatively, you may choose to store all JSP and Java source files in UTF-8 encoding. The use of UTF-8 circumvents encoding problems if you need to share your files in an internationally distributed development team. In such cases, you can maintain a central source code repository because the operating system encodings used to write files will not differ among locales.
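
The kind of problem that a shared UTF-8 convention avoids can be reproduced in a few lines. The sketch below encodes a string with one charset and decodes the bytes with another, which is roughly what happens when a file written on one developer's system is read with a different default encoding on another. The class name EncodingMismatch and the choice of ISO-8859-1 as the mismatched encoding are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String writeThenRead(String text, Charset writeAs, Charset readAs) {
        byte[] stored = text.getBytes(writeAs); // the bytes that would end up in the file
        return new String(stored, readAs);      // the text a reader would see
    }

    public static void main(String[] args) {
        // Three CJK characters (U+65E5 U+672C U+8A9E), written as escapes so that
        // this source file itself stays pure ASCII.
        String original = "\u65E5\u672C\u8A9E";

        String sameEncoding = writeThenRead(original,
                StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = writeThenRead(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("Round trip intact with UTF-8 on both sides: "
                + sameEncoding.equals(original));
        System.out.println("Round trip intact with mismatched encodings: "
                + mismatched.equals(original));
        System.out.println("Characters seen by the mismatched reader: "
                + mismatched.length());
    }
}
```

With UTF-8 on both sides the text survives unchanged. With the mismatch, each of the nine stored bytes is read as a separate character, so the reader sees nine unrelated characters instead of three.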