Download the Web Internationalization Tutorial

Web Internationalization – Standards and Practice

Objectives

• Describe the standards that define the Web Internationalization architecture & principles for I18N on the web Standards and Practice • Scope limited to markup languages • Provide practical advice for working with international data on the web, including the design and implementation of Internationalization and Unicode Conference 34 multilingual web sites and localization

Web Internationalization – Standards and Practice Slide 2

Legend For This Presentation Web Internationalization Agenda

Icons used to indicate current product support: • Emphasis on Character Processing • Updates for HTML5 Internet Firefox 3.6 Opera 10 Explorer 8 This presentation and part 2 and example code are Supported: available at:

Partially supported: www.xencraft.com/training/webstandards.html

Not supported: Richard Ishida and W3C test I18n features for Caution numerous browsers and versions (X)HTML: Highlights a note for users or developers to be careful. www.w3.org/International/tests/

Web Internationalization Slide 3 Web Internationalization – Standards and Practice Slide 4

Internationalization and Unicode Conference 1 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Web I18n Part 1- Character Processing A Simple HTML Example Page

Character Encodings Character Encoding Negotiation Reference Processing Model Character Escaping Unicode in Markup Normalization Identifiers

Web Internationalization – Standards and Practice Slide 6 Web Internationalization Slide 7

A Simple HTML Example Page A Simple HTML Example Page

Here is how the same HTML looks in Japan Here is how the same HTML looks in Japan The browser has no information about the encoding of the web page. It uses a default value which in this case, is very wrong and even confuses the markup (see beauté).

Web Internationalization Slide 8 Web Internationalization Slide 9

Internationalization and Unicode Conference 2 Tex Texin, XenCraft Web Internationalization – Standards and Practice

A Simple HTML Example Page Character Encodings

Here is how the same HTML looks in Japan Encoding disagreement is one problem for text. We also consider the following problems and solutions. Some of the problems may (See: Character Model for the World Wide Web www.w3.org/TR/charmod/ ) not be obvious to the reader. Problem Solution Changing the euro symbol Encoding disagreement Encoding negotiation to a bullet, might cause a Encoding diversity Reference processing model Encoding limitations Character escaping significant financial error. Unicode vs. markup Markup preferred on the web

String matching Early uniform normalization String indexing Character counting guidelines

Web Internationalization Slide 10 Web Internationalization Slide 11

Character Encodings Start by identifying needed symbols

Use Example Symbols First though: What are character encodings? Letters ABCDEF…abcdef…ÂÃÄ Æ ß Ñ ACR ≠ A + ˚ Punctuation : . , ; ? ! ( ) - … Numeric, 0 1 2 3 4 5 6 7 8 9 + − ± * × / † < = Arithmetic > % ‰ # ¼ ½ ¾ ⅓ ⅔ ⅣⅲⅫ CCS U+233B4 U+2260 U+0041 U+030A $ ¢ £ ¥ € ₣ ₤ ₧ ₫ ₪₩ © ® ™ ℅ ° Business ℃ ℉ ∞ CEF D84C DFB4 2260 0041 030A Mathematics ∀ ∂ ∃ ∇ ∈ ∉ ∋∏ ∑ ≦ ≥ ≡ ≠ ≈ ⊆ ⊇ Other applications: ¶ § ♀ ♂ ♠ ♣ ♥ ♦ ← ↑ ↔ ↵ CES D8 4C DF B4 22 60 00 41 03 0A Proofreading, ♧ ♨ ♬ ♭♯ Å ☞ ☜ games, music… The Character Encoding Model – Unicode Tech. Report 17 Web Internationalization Slide 12 Web Internationalization Slide 13

Internationalization and Unicode Conference 3 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Character Encodings Character Encodings

ACR = Abstract Character Repertoire CCS = Coded Character Set

ACR ≠ A + ˚ ACR ≠ A + ˚

CCS U+233B4 U+2260 U+0041 U+030A Maps each character to a non-negative unique number. • The set of characters you need to represent – Note this example uses hexadecimal numbers. – (aka Character Set). – The “U+” indicates use of Unicode‟s numbering. • Characters may be composable. E.g. Å = A + ˚ – The grapheme Å consists of two characters A + ˚ – Unicode calls these “Unicode Scalar Values”

Web Internationalization Slide 14 Web Internationalization Slide 15

Character Encodings Character Encodings

CEF = Character Encoding Form CEF = Character Encoding Form

ACR ≠ A + ˚ ACR ≠ A + ˚

CCS U+233B4 U+2260 U+0041 U+030A CCS U+233B4 U+2260 U+0041 U+030A

CEF D84C DFB4 2260 0041 030A CEF D84C DFB4 2260 0041 030A 16-bit 16-bit

CEF F0 A3 8E B4 E2 89 A0 41 CC 8A Note the relationship between the CEFs is not so simple. 8-bit Map CCS to fixed width units (e.g. 32, 16, or 8 –bit) Map CCS to fixed width units (e.g. 32, 16, or 8 –bit) Note the relationship between the CEFs is not so simple. Web Internationalization Slide 16 Web Internationalization Slide 17

Internationalization and Unicode Conference 4 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Character Encodings Character Encodings

CES = Character Encoding Scheme • Many character sets exist and in popular use • Many encoding schemes, even for 1 character ACR ≠ A + ˚ set ISO 8859-1 ≈ IBM 850 ISO-2022-JP, Shift_JIS, EUC-JP (JIS X-0208-1997) CCS U+233B4 U+2260 U+0041 U+030A UTF-8 = UTF-16 = UTF-32

CEF 16-bit D84C DFB4 2260 0041 030A • Given just bytes, the character set and the encoding scheme can be indeterminate. CES D8 4C DF B4 22 60 00 41 03 0A UTF-16BE How can a browser know how to decode a web page? CES: Mapping the CEF(s) to serialization of bytes Web Internationalization Slide 18 Web Internationalization Slide 19

Encoding Identification Character Encoding Names

• Given just bytes, encoding is IANA (Internet Assigned Numbers Authority) indeterminate. – Maintains registry of official names for character • How can an encoding be identified? sets (actually encodings) used on the internet and in MIME (mail) • Registry Names • There are 2 requirements: – ASCII, printable characters – Agreement on names for encodings – Case-insensitive – Mechanisms for labeling text with encoding – Maximum length 40 characters – Aliases (alternative names) are also registered – The preferred name is indicated www.iana.org/assignments/character-sets

Web Internationalization Slide 20 Web Internationalization Slide 21

Internationalization and Unicode Conference 5 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Unregistered Encoding Names Character Encoding Names

• Conventions for Unregistered Character • IANA Name and Alias Examples Encoding Names – ISO_8859-1:1987 (ISO_8859-1, ISO-8859-1, latin1, – Name begins with “x-” L1, IBM819, CP819, csISOLatin1) – Example: “x-Tex-Yves-encoding” – Windows-1252, GB2312, BIG5, BIG5-HKSCS – Useful for private encodings or very new – SHIFT_JIS, HP-Legal encodings – Extended_UNIX_Code_Packed_Format_for_Japanese – Adobe-standard-encoding – Not useful on the web, except for private – UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 exchange • Note- Registry contains many useless names • Note- Preferred names indicated. Use them.

Web Internationalization Slide 22 Web Internationalization Slide 23

Markup and Encoding Names HTTP and Encoding Names

• HTTP Mechanism for labeling HTTP with encoding • HTML • XML HTTP Response • CSS

• Links 200 OK HTTP/1.1 – HTML Content-Type: text/html; charset=UTF-8 --- Blank Line – HTML <… HREF> document – XML <… HREF> ...

Web Internationalization Slide 24 Web Internationalization – Standards and Practice Slide 25

Internationalization and Unicode Conference 6 Tex Texin, XenCraft Web Internationalization – Standards and Practice

HTML, XML & Encoding Names New in HTML 5:

HTML – Use “preferred MIME name” in IANA registry – HTML does not specify a default. – Use BOM instead for UTF-16. – Supported by most browsers XML – Simpler and less error-prone then CONTENT="text/html; charset=UTF-8"> – Alternative declaration: Begin with Byte Order Mark (U+FEFF), for UTF-16 or UTF-8 – Byte Order Mark now recognized – Note UTF-16 MUST begin with a BOM – UTF-32, EBCDIC, others not recommended – The default encoding is UTF-8. – UTF-7, SCSU, et al must not be supported

Web Internationalization – Standards and Practice Slide 26 Web Internationalization – Standards and Practice Slide 27

CSS2 and Encoding Name LINKs and Encoding Name

CSS2 Declaring the charset of a LINKed document – Only used in the first line of external style • HTML sheets

– CSS 2.1 added Unicode Byte Order Mark (BOM, U+FEFF) as an encoding indicator. Unicode – Encoding is unspecified if BOM and @charset • XML conflict.

Web Internationalization – Standards and Practice Slide 28 Web Internationalization Slide 29

Internationalization and Unicode Conference 7 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Notes: Declaring Encoding Names HTML5: LINKs and Encoding Name

• Charset on links can be incorrect if the Declaring the charset of a LINKed document document‟s encoding on the server changes • Deprecated in HTML5 • The encoding for the – Place it is as early as possible in the document. – Else, prior statements may be decoded incorrectly. Unicode • Note: – Transcoders do not generally correct charset ID

Web Internationalization Slide 30 Web Internationalization X Slide 31

HTML4 Encoding Priorities HTML5 Encoding Priorities

Prioritization is used to resolve conflicts. • From high to low priority, HTML5 uses the • From high to low priority, HTML uses the encoding encoding of: of: 1. User override for charset 1. HTTP “Content-Type” charset 2. HTTP “Content-Type” charset 2. 3. Byte Order Mark 3. LINK or other syntax for external documents 4. Either (Must only be one) 4. Charset-detecting heuristics • • • Many user agents (browsers) support a user override 5. Character set-detecting heuristics for charset (highest priority)

Web Internationalization Slide 32 Web Internationalization Slide 33

Internationalization and Unicode Conference 8 Tex Texin, XenCraft Web Internationalization – Standards and Practice

CSS2 Encoding Priorities XML Encoding Priorities

Prioritization is used to resolve conflicts. • Encoding name processing is more carefully • From high to low priority, CSS 2.1 external specified for XML. style sheets use the encoding of: • As with HTML, protocol or external information can supercede declaration, BOM or default of 1. HTTP “Content-Type” charset UTF-8. 2. BOM/@charset rule in the style sheet • XML Appendix E (non-normative): Prioritization 3. LINK or other syntax in referencing should be specified by protocols. document – Recommends use of BOM or encoding declaration for files (rather than an external source). 4. Charset of the referencing document – Refers to RFC 3023 • RFC 3023 specifies several encoding scenarios based on 5. Assume UTF-8 MIME media type: text/xml, application/xml, etc.

Web Internationalization Slide 34 Web Internationalization – Standards and Practice Slide 35

Web I18n Part 1- Character Processing Character Encoding Negotiation

Character Encodings Unix user Windows user Character Encoding Negotiation Reference Processing Model Character Escaping GB2312 1252 Unicode in Markup Normalization html html Identifiers

Web Internationalization Slide 36 Web Internationalization Slide 37

Internationalization and Unicode Conference 9 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Typical Browser-Server HTTP Sequence Character Encoding Negotiation

Accept Charset Get URL HTML Browser Server 1. Browser issues GET URL x,y,z Pages 2. Server sends RESPONSE 3. Browser displays document in RESPONSE GET / HTTP/1.1 Accept-Language: en-us,en,hr;q=0.5 4. Browser POSTs Form with user data (text) Accept-Charset: iso-8859-1,utf-8;q=0.75,*;q=0.5 5. Web Server receives data, database The browser‟s HTTP GET request can list the languages and application stores text. the encodings it can make use of, to guide the server. •“q” is a relative measure of the usefulness (quality) of an entry.

Which encoding is sent by the server? The above example indicates: •US English preferred, other English, Croatian are also ok. Which encoding is returned by the browser? •ISO 8859-1 preferred, then UTF-8, then anything else.

Web Internationalization Slide 38 Web Internationalization Slide 39

Character Encoding Negotiation Character Encoding Negotiation

Accept Charset Get URL HTML Browser Server • Most browsers let you set your language x,y,z Pages preferences and priorities Response Browser Server CHARSET=x • Encoding capabilities are not settable (since they are software dependent). 200 OK HTTP/1.1 Content-Type: text/html; charset=iso-8859-1 – Microsoft IE doesn‟t send ACCEPT-CHARSET. --- Blank Line – (U.S.) NS 7: ISO-8859-1, UTF-8;q=0.66, *;q=0.66 HTML document ...

– Opera 6.0 sends: The server returns a document. Windows-1252;q=1.0, UTF-8;q=1.0, UTF-16; q=1.0, The encoding is declared in the RESPONSE header. iso-8859-1;q=0.6, *;q=0.1 (Web administrators or content authors need to inform the server about document encodings.)

Web Internationalization Slide 40 Web Internationalization Slide 41

Internationalization and Unicode Conference 10 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Character Encoding Negotiation Character Encoding Negotiation

Accept Charset Get URL HTML Accept Charset Get URL HTML Browser Server Browser Server x,y,z Pages x,y,z Pages Response Browser Server CHARSET=x Response Browser Server CHARSET=x Form Data Set Browser Server O/S Charset =z The browser adapts the document for operating System display. The browser also accepts user data in HTML

• POST + HTTP URI Form Data Set sent in body, encoded as either

Form Data Set = Control Name/Current Value Pairs 1) “application/x-www-form-urlencoded“ or Name/Tex 2) “multipart/form-data” (MIME, RFC 2045) sex/m

Web Internationalization Slide 44 Web Internationalization Slide 45

Internationalization and Unicode Conference 11 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Form Data Set- GET Method Submission Form Data Set Encoding

– Non-alphanumeric and non-ASCII characters and ‘+’, ‘&’, ‘=’, are replaced by %HH – Browsers map current encoding byte values to %HH – If the server doesn‟t know browser‟s character This simple form will submit a an HTTP GET with: encoding, it may decode form data incorrectly. http://www.xencraft.com/cgitest?Name=Tex&sex=m

Web Internationalization Slide 46 Web Internationalization Slide 47

Form Data Set Encoding Character Encoding Negotiation

Accept Charset Get URL HTML Application/x-www-form-urlencoded Browser Server x,y,z Pages Response Browser Server CHARSET=x

Submit Form Example comparing two character encodings: (GET or POST) Browser Server O/S Charset =z Charset=ISO-8859-1 CHARSET=x Name=Fran%E7ois+Ren%E9+Strau%DF encoding=x-www-form-urlencoded

Charset=UTF-8 Modern browsers send x-www-form-urlencoded data to Name=Fran%C3%A7ois+Ren%C3%A9+Strau%C3%9F the server in the CHARSET that was determined to be that of the *form*, however that determination was made (HTTP, , default, user override).

Web Internationalization Slide 48 Web Internationalization Slide 49

Internationalization and Unicode Conference 12 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Character Encoding Negotiation Character Encoding Negotiation

Accept Charset Get URL HTML Returning data in the encoding received Browser Server x,y,z Pages • Generally works in principle Response Browser Server CHARSET=x • Document „charset” must be correctly Submit Form identified (and has often been wrong) (POST) Browser Server • Fails with multiple encodings handled by a O/S Charset =z CHARSET=x single CGI multipart/form-data (MIME) • Fails with transcoding proxies (not allowed to Each control name/current value pair is a separate change URIs). part. Each part can be a different charset or content-type encoding. • Recommend using UTF-8 in both directions Supports file uploading (RFC1867).

Web Internationalization Slide 50 Web Internationalization Slide 51

Character Encoding Negotiation Character Encoding Negotiation

Multipart/form-data Other solutions to identifying encodings: • More efficient than x-www-form-urlencode for • XFORMS fixes the failure cases: non-ASCII data, binary data, and files http://www.w3.org/MarkUp/Forms/ • Does not have the length limit that browsers http://www.w3.org/TR/xforms/ impose on URLs (can be as low as 250 for • TIP: Use with older browsers: some devices) • Hidden fields containing encoding name or • Is now well supported carefully chosen text (tracks transcodings). • Recommended for POST of all form data CGI script performs analysis. – e.g. Microsoft‟s _CHARSET_

Web Internationalization Slide 52 Web Internationalization Slide 53

Internationalization and Unicode Conference 13 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Web I18n Part 1- Character Processing Reference Processing Model

Character Encodings • Different encoding schemes require different Character Encoding Negotiation decoding/parsing/processing methods Reference Processing Model – Single, and Multi-byte character sets (e.g. EUC) – Character encoding-switching schemes (ISO 2022) Character Escaping – Forward combining (accent-base letter) Unicode in Markup – Backward combining (base letter-accent) Normalization – Logical ordering/Visual ordering Identifiers • Variety bothers implementers and spec writers • Adopting a single universal encoding obsoletes most of the existing data • Instead, use a character abstraction Web Internationalization Slide 54 Web Internationalization Slide 55

Reference Processing Model Reference Processing Model

• Logically, characters are Unicode characters – Specifications are in terms of Unicode characters – Implementations do NOT have to use Unicode, In/out Any encoding only behave as if they did on the wire HTML • Benefits Abstraction – Removes ambiguity, simplifies specifications C Layer using internal S – Allows flexibility for common local encodings Unicode S – Backward compatible for older HTML browsers – Supports internationalization (large character set) Any encoding XML for internal – Removes dependencies/orientation on byte values implementation

Web Internationalization Slide 56 Web Internationalization Slide 57

Internationalization and Unicode Conference 14 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Reference Processing Model Web I18n Part 1- Character Processing

• Examples using Reference Processing Model Character Encodings – HTML 4.0 declares Unicode as its SGML Document Character Encoding Negotiation Character Set Reference Processing Model – CSS “sequence of characters from UCS” – XML “A character is an atomic unit of text as specified by Character Escaping ISO/IEC 10646” Unicode in Markup • Any encoding can be used internally, but Unicode often makes the most sense. Normalization • XML requires parsers to accept UTF-8 and UTF-16, Identifiers making Unicode best internal choice • Some Recommendations require Unicode – e.g. DOM requires UTF-16 Web Internationalization Slide 58 Web Internationalization Slide 59

Character Escaping Character Escaping

Mechanisms to represent characters • Useful for: • Numeric Character References (NCRs) – syntax-significant characters – HTML and XML – e.g. < (<), > (>), & (&), " (") Hexadecimal: &#xhhhhhh; – characters outside current encoding Decimal &#dddd; – eliminating visual or other ambiguity – CSS2 (soft-hyphen), “\hh ” (note terminating space), \hhhhhh - (hyphen-minus) • Character Entity References (HTML only) å Å (note case-sensitivity) (space) (no-break space)

Web Internationalization Slide 60 Web Internationalization Slide 61

Internationalization and Unicode Conference 15 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Character Escaping Character Escaping

• Relies on Reference Processing Model • Don‟t use Windows 1252 code points instead – Always references Unicode scalar value of Unicode, for values 128-159 (0x80-0x9F) • Same value regardless of encoding • e.g. Euro is € or € not € • Same value for UTF-8, UTF-16, UTF-32 • www.i18nguy.com/markup/ncrs.html • One value for supplementary characters, not two • Don‟t simulate characters with special fonts E.g. 𒍅 not �� (e.g. Symbol), or you can get erroneous: – Simplifies transcoding (no parsing or conversion) • Display, depending on font availability – Allows any Unicode character in any document (if • Font fallbacks it is legal in the language of the document) • Searches by Search engines • Behavior from Style sheets • Database contents

Web Internationalization Slide 62 Web Internationalization Slide 63

Windows 1252 vs Unicode & ISO 8859-1 Selecting A Character Encoding

1252 is identical to Unicode and ISO 8859-1 except in 80-9F. • Choose an encoding that minimizes the need Unicode and ISO 8859-1 assigns control codes. Windows-1252 assigns Euro, Smart quotes, TM, and others to escape characters. – Unicode is always a candidate. – Unicode is supported by all but the oldest browsers. – Is the largest character set, and can be expanded. – Therefore it is often the best choice both for minimizing escapes and anticipating future character requirements. • e.g. New currency symbols

Web Internationalization Slide 65

Internationalization and Unicode Conference 16 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Web I18n Part 1- Character Processing Unicode Vs. Markup

Character Encodings • +100,000 characters as of Unicode 5.0 Character Encoding Negotiation – Should we use them all? Reference Processing Model – Are there any we shouldn‟t use? Character Escaping – Does Unicode‟s capabilities, needed for plain text, Unicode in Markup interfere with markup? • Markup can do some things better than Normalization character codes. Not all Unicode characters are Identifiers needed.

Web Internationalization Slide 66 Web Internationalization Slide 67

Unicode Vs. Markup Unicode Vs. Markup

Potential problem areas Solution types • Redundancies impact searching • Restrict characters so they cannot be used – “Å” A-ring “A+˚ ” A+ring “Å” Angstrom • Replace redundancies (normalization) • Formatting characters vs. Markup • Replace with Markup – E.g. Bidi controls, interlinear annotation characters – Extensible – presentation can be separate from content • Characters with style vs. Markup – E.g. Superscript, subscript Joint W3C and Unicode recommendations in: • Object Replacement Character vs. Markup “Unicode in XML and other Markup Languages” http://www.w3.org/TR/unicode-xml/ – Better to use markup to include an image http://www.unicode.org/unicode/reports/tr20/

Web Internationalization Slide 68 Web Internationalization Slide 69

Internationalization and Unicode Conference 17 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Web I18n Part 1- Character Processing String Indexing

Character Encodings • How long is this string? ≠ Å Character Encoding Negotiation Reference Processing Model Character Escaping Unicode in Markup String Indexing Normalization Identifiers

Web Internationalization Slide 70 Web Internationalization Slide 71

String Indexing String Indexing

• Which units should be used for counting? Character Model recommendations Graphemes • Character counting is recommended for most ≠ A + ˚ 3 programming interfaces (e.g. XML Path)

Characters • Code unit counting may be used for internal U+233B4 U+2260 U+0041 U+030A 4 efficiency (e.g. DOM) • Graphemes may be useful for user interaction, Code units D84C DFB4 2260 0041 030A once a suitable definition exists 5 • Avoid creating API with single unit arguments Bytes D8 4C DF B4 22 60 00 41 03 0A e.g. “SS” = Uppercase(“ß”) 10

Web Internationalization Slide 72 Web Internationalization Slide 73

Internationalization and Unicode Conference 18 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Normalization Early Uniform Normalization

• Representing data in Unicode characters can have more than 1 more than 1 way leads representation to errors • Canonical equivalence • E.g. The Mars Climate – Indistinguishable, fundamental equivalence Orbiter mission was – E.g. combining sequences, singletons disastrous. Information – “Å” U+00C5 (A-ring pre-composed) expected to be metric, – “A+˚ ” U+0041 + U+030A (A + combining ring above) was sent in English units – “Å” U+212B (Angstrom) • Solution- Adopt a • Compatibility equivalence standard representation- – E.g. Formatting differences, ligatures Normalize – “ｶ” U+FF76 “カ” U+30AB (KA half and full width) – “ﬁ” U+FB01 (ligature fi)

Web Internationalization Slide 74 Web Internationalization Slide 75

Early Uniform Normalization Early Uniform Normalization

• Unicode Consortium has defined canonical and Text on the web SHOULD be Fully Normalized. compatibility decomposition formats and 4 Fully Normalized text is either: different sets of rules for normalization: 1. Unicode text in Normalization Form NFC, and “ Unicode Normalization Forms” 2. Does not contain character escapes or includes http://www.unicode.org/unicode/reports/tr15/ that upon expansion would undo point 1, and • The W3C Character Model recommends 3. Does not begin with a composing character. Normalization Form C (NFC) or: – Brings canonical equivalences to composed form – Leaves compatibility forms as distinct 1. Legacy encoded text, which transcoded to Unicode satisfies the above. – Most legacy text is composed, and is unchanged

Web Internationalization Slide 76 Web Internationalization Slide 80

Internationalization and Unicode Conference 19 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Early Uniform Normalization New “International” Features in HTML5

• Examples of Fully Normalized Text • Some attributes that now apply to all elements: “suçon”, “suçon”, – dir (Direction) “sub¸on”, “sub̧on” – lang (Language) Note- Unicode does not have a composed b-cedilla. • lang attribute now supports empty string • Examples that are not Fully Normalized indicating the primary language is unknown “suc¸on”, “suçon” • xml:lang supported, provided it has same value Reason: should use composed character “ç” as lang “¸on”, “̧on” • hreflang attribute added to Reason: should not begin with combining character – For consistency with and

Web Internationalization Slide 81 Web Internationalization Slide 88

New “International” Features in HTML5 New “International” Features in HTML5

• Ruby annotation support: , , • Native support for IRI and IDNA – For IRI, document encoding must be UTF-8 or UTF-16 or the query component must %hh escape 2010 ( yyyy ) 10 ( mm ) any non-ASCII characters 16 ( dd ) • Encodings – New encoding declaration – Support for BOM – Charset attribute removed on and

Web Internationalization Slide 89 Web Internationalization Slide 90

Internationalization and Unicode Conference 20 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Questions Coffee Break

Web Internationalization Slide 91 Web Internationalization Slide 92

Web Internationalization Agenda Language Identification

• Part 1 – Character Processing Same mechanism for all: • Coffee Break • Tags (identifiers) defined by RFC 3066. • Part 2 2-letter and 3-letter language codes (ISO- – Layout and Typography 639) with optional 2-letter country codes – Designing International Web Sites (ISO-3166) separated by a character „-‟ (not „_‟). • Tags are case insensitive (even in XML). • In mark-up: the language attribute is inherited by the children of the element where the attribute is defined.

Web Internationalization Slide 93 Web Internationalization Slide 94

Internationalization and Unicode Conference 21 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Language Identification Language Identification

RFC 3066 Rules: • RFC 3066 does not cover all needs. • 3-letter codes should be used only for the – e.g. Latin-Amer. Spanish, Script distinctions languages that have no 2-letter code. – Addressed case-by case through registrations • Always use the Terminological form of the • No clear distinction of the identifiers of a 3-letter codes, not the Bibliographical “language” and a “locale”. form. – (See past IUC locales talks for more information.) • Avoid user-defined codes (x-myCode) • Standards groups considering these issues: IETF, ISO TC37, SIL, W3C, et al

Web Internationalization Slide 95 Web Internationalization Slide 96

Language Identification: BCP47 Language Identification

• BCP47 evolution • HTTP: Content-Language header – RFC 4646, 4647 • HTML: LANG attribute (e.g. in ) • RFC 4646 replaces RFC 3066. • XML: xml:lang attribute • language-country becomes language-script-country • Registry expanded to include all valid entries • XHTML 1.0: Both lang and xml:lang • New matching rules in RFC 4647 – RFC 5646 just released

Verba.

• 7,000 three-letter ISO 639-3, ISO 639-5 language codes, 7 region codes • 220 'extended language' subtags, for backwards • XHTML 1.1: xml:lang attribute compatibility.

Web Internationalization Slide 97 Web Internationalization Slide 98

Internationalization and Unicode Conference 22 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Language Identification Language Identification – Input file

The lang() function in XPath:

• True if the selected node has xml:lang set to the given language code. Message 100 in English. • Match is done as a sub-string from the start of Message 200 [insertion in French] in American the value: English. Message 200 en Québecquois. 'en' matches 'en', and 'en-us'. • Match is case insensitive: Message 300 en français. 'en' matches 'EN', 'En-us', etc. Message 400 in British English.  Example: Input, Languages.xsl, Output.

Web Internationalization Slide 99 Web Internationalization Slide 100

Language Identification – Style-sheet Language Identification – IE Output

Message 100 in English. (en) Message 200 [insertion in French] in American English. (en-us) en Message 400 in British English. (EN-GB)

()

Web Internationalization Slide 101 Web Internationalization Slide 102

Internationalization and Unicode Conference 23 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Language Identification – FF Output Language Identification- Opera Output

Message 100 in English. (en)Message 200 [insertion in French] Message 100 in English. Message 200 [insertion in French] in in American English. (en-us)Message 400 in British English. American English. Message 200 en Québecquois. Message 300 (EN-GB) en français. Message 400 in British English.

Web Internationalization Slide 103 Web Internationalization Slide 104

Language Identification – CSS Language Identification – CSS

There are two methods to refer to the language attribute in CSS: Test Language and CSS *:lang(zh) { font-family:SimSun }

Text in English.

• The attribute selector.

Text in American English.

Text in generic English.

*[lang|=fr] { font-weight:bold }

Texte en québecquois.

Texte en français.

Text in British English.

• Both use the same matching mechanism as the lang() function in XPath.  Example: LanguagesCSS.htm

Web Internationalization Slide 105 Web Internationalization Slide 106

Internationalization and Unicode Conference 24 Tex Texin, XenCraft Web Internationalization – Standards and Practice

Language Identification – FF Output Quotes – HTML

Text in English. • The element for in-line quotations (auto- quotation marks expected). Text in American English. • The

element for paragraph- Text in generic English. type quotations (indented, and no auto- Texte en québecquois. quotation marks expected).
Texte en français.  Example: Input, Output: Quotes.htm. Text in British English.
Web Internationalization Slide 107 Web Internationalization Slide 108
Quotes – Using CSS Quotes – HTML Input
...
English text with English quoted • CSS allows control of the type of quote to use text.

Text en Français avec English quoted according to the language. text.

Text en Français avec English *[lang|=fr] { quote:'\ab\a0' '\a0\bb' } quoted text containing a quote itself.
qo:before { content:open-quote }
Quotes in Finnish.

Quotes in Polish.
qo:after { content:close-quote }
Quotes in Japanese.

Quotes in German.

Quotes in Dutch.

A paragraph using • Examples blockquote.
 HTML: Input, CSS, Output: QuotesWithCSS.htm.  XML: Input, CSS File, Output: Quotes.xml.
Web Internationalization Slide 109 Web Internationalization Slide 110
Internationalization and Unicode Conference 25 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Some Unicode Characters Quotes – CSS Style-sheet
q:before { content: open-quote; } U+2018 „ Left Single Quotation Mark q:after { content: close-quote; } blockquote:before { content: open-quote; } U+2019 ‟ Right Single Quotation Mark blockquote:after { content: close-quote; } U+201C “ Left Double Quotation Mark [lang|='en'] > * { /* English */ quotes: "\201C" "\201D" } U+201D ” Right Double Quotation Mark U+201E „ Double Low 9 Quotation Mark [lang|='fr'] > * { /*guillemets*/ quotes: "\AB\A0" "\A0\BB" } U+201F ‟ Double High Reversed 9 Q. M. [lang|='fi'] > * {/*same direction*/ quotes: "\201D" "\201D" } U+300C 「 Left Corner Bracket [lang|='de'] > * { /* German */ quotes: "\201E" "\201C" } U+300D 」 Right Corner Bracket [lang|='ja'] > * { /* Japanese */ quotes: "\300C" "\300D" }
U+00AB « Left Pointing Double Angle Q. M. [lang|='nl'] > * { /* Dutch */ quotes: "\2018" "\2019" }
U+00BB » Right Pointing Double Angle Q. M. [lang|='pl'] > * { /* Polish */ quotes: "\201E" "\201D" } U+00A0 No Break Space
Web Internationalization Slide 111 Web Internationalization Slide 112
Quotes – Firefox 3 Output Quotes – Opera Output
English text with “English quoted text”. English text with “English quoted text”. Text en Français avec « English quoted text ». Text en Français avec « English quoted text ». Text en Français avec « English quoted text containing a Text en Français avec « English quoted text containing a “ quote” itself ». ”quote" itself ». ”Quotes” in Finnish. ”Quotes” in Finnish. „Quotes” in Polish. „Quotes” in Polish. 「Quotes」 in Japanese. 「Quotes」 in Japanese. „Quotes“ in German. „Quotes“ in German. „Quotes‟ in Dutch. „Quotes‟ in Dutch. “A paragraph using blockquote.” “A paragraph using blockquote.”
Web Internationalization Slide 113 Web Internationalization Slide 114
Internationalization and Unicode Conference 26 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Casing Casing TextTransform.htm
• CSS2 provides the property Style: text-transform with 5 values: uppercase, lowercase, capitalize, conversion (making it useless from an i18n viewpoint). CSS3 (working draft) forces Unicode casing conformance. This property is deprecated in XSL 1.0.  Example: Source, Output: TextTransform.htm.
Web Internationalization Slide 115 Web Internationalization Slide 116
Casing TextTransform.htm Casing – IE and Opera Output
Original = This text should be all uppercased.
Original = This text should be all uppercased. Transformed = This text should be all Transformed = THIS TEXT SHOULD BE ALL UPPERCASED. uppercased.

Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Original = THIS TEXT SHOULD BE ALL LOWERCASED. Transformed = THIS TEXT SHOULD BE ALL Transformed = this text should be all lowercased. LOWERCASED.

Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = tHIS tEXT sHOULD bE Original 1 = tHIS tEXT sHOULD bE cAPITALIZED. cAPITALIZED.
Transformed = THIS TEXT SHOULD BE CAPITALIZED. Original 2 = this text should be capitalized.
Original 2 = this text should be capitalized. Transformed = this text should be Transformed = This Text Should Be Capitalized. capitalized.

[de] Original = ß (sharp-s), ö (o-diaeresis)
[de] Original = ß (sharp-s), ö (o-diaeresis) Transformed = ß (sharp-s), ö (o- Transformed = ß (SHARP-S), Ö (O-DIAERESIS) diaeresis)

[tr] Original = i (i-with-dot)
Transformed = i (i-with-dot)
[tr] Original = i (i-with-dot) Transformed = I (I-WITH-DOT)
Web Internationalization Slide 117 Web Internationalization Slide 118
Internationalization and Unicode Conference 27 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Casing – Firefox Output Casing – Clipboard Copy (unchanged)
Original = This text should be all uppercased. Original = This text should be all uppercased. Transformed = THIS TEXT SHOULD BE ALL UPPERCASED. Transformed = This text should be all uppercased.
Original = THIS TEXT SHOULD BE ALL LOWERCASED. Original = THIS TEXT SHOULD BE ALL LOWERCASED. Transformed = this text should be all lowercased. Transformed = THIS TEXT SHOULD BE ALL LOWERCASED.
Original 1 = tHIS tEXT sHOULD bE cAPITALIZED. Original 1 = tHIS tEXT sHOULD bE cAPITALIZED. Transformed = THIS TEXT SHOULD BE CAPITALIZED. Transformed = tHIS tEXT sHOULD bE cAPITALIZED. Original 2 = this text should be capitalized. Original 2 = this text should be capitalized. Transformed = This Text Should Be Capitalized. Transformed = this text should be capitalized.
[de] Original = ß (sharp-s), ö (o-diaeresis) [de] Original = ß (sharp-s), ö (o-diaeresis) Transformed = SS (SHARP-S), Ö (O-DIAERESIS) Transformed = ß (sharp-s), ö (o-diaeresis)
[tr] Original = i (i-with-dot) [tr] Original = i (i-with-dot) Transformed = I (I-WITH-DOT) Transformed = i (i-with-dot)
Web Internationalization Slide 119 Web Internationalization Slide 120
Numbered Lists Numbered Lists NumberedLists.htm
With CSS2 • CSS2 offers the list-style-type ...  Example NumberedLists.htm ...
Web Internationalization Slide 121 Web Internationalization Slide 122
Internationalization and Unicode Conference 28 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Numbered Lists NumberedLists.htm Numbered Lists – Firefox Output
...
List numbered in Hebrew:

Item 1
...
Item 6

List numbered in Georgian:

Item 1
...
Item 6

List numbered in Armenian:

Item 1
...
Item 6

List numbered in Han character (cjk-ideographic):

Item 1
...
Item 6

Web Internationalization Slide 123 Web Internationalization Slide 124
Numbered Lists Number Formatting
With XSL The function format-number() • XSL provides more flexibility as the format in XSL allows the formatting of numbers based and the type of the numbers can be changed on a given pattern. using . • Uses same patterns as Java 1.1 java.text.DecimalFormat patterns. • Use to  Example: Input, ListNumbers.xsl, Output. overwrite the default symbols (i.e. decimal separator, grouping separator, etc.).
 Example: Input, XSL File, Output.
Web Internationalization Slide 125 Web Internationalization Slide 126
Internationalization and Unicode Conference 29 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Text Flow Text Flow
Bi-directional Text in HTML Bi-directional Text for XML (CSS2) • The dir attribute: • Use the direction and unicode-bidi – dir="ltr" (default), dir="rtl" properties. The unicode-bidi property – Affects the default value of align. specifies the behavior for inline levels elements – Inherited (use it in to set the base for the (15 maximum levels of embedding). whole document). • Based on Unicode bidi algorithm (UAX#9) • The element: para.bidi { direction:rtl; – Overrides implicit directional properties of content. unicode-bidi:embed } – Requires the dir attribute.  Example: BidiText.htm
Web Internationalization Slide 127 Web Internationalization Slide 128
Text Flow – Bidi Example Source (1/2) Text Flow – Bidi Example Source (2/2)

-Pepper Creek LLC, Using dir="ltr חברת span":
.שנוסדה זה-עתה, מונה יותר מ550- עובדים ,Pepper Creek LLC חברת
.שנוסדה זה-עתה, מונה יותר מ550- עובדים <"p dir="rtl> Using dir="rtl":
Using dir="ltr" (wrong):
.שנוסדה זה-עתה, מונה יותר מ550- עובדים
.שנוסדה זה-עתה, מונה יותר מ550- עובדים
Web Internationalization Slide 129 Web Internationalization Slide 130
Internationalization and Unicode Conference 30 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Text Flow – Bidi Output Text Flow
Vertical Text
• Use the writing-mode property (CSS3). • For example, to display top-to-bottom, and right-to-left text use:
div.vertical { writing-mode:tb-rl }
 Example in HTML, and in SVG.
Web Internationalization Slide 131 Web Internationalization Slide 132
Text Flow – Vertical, HTML Text Flow – Vertical, HTML Output
Example of horizontal text (rl-tb).

Example of vertical text (tb-rl).

Example of vertical text with horizontal insert.
Web Internationalization Slide 133 Web Internationalization Slide 134
Internationalization and Unicode Conference 31 Tex Texin, XenCraft Web Internationalization – Standards and Practice
Text Flow – Vertical, SVG Text Flow – Vertical, SVG in HTML