Internationalization
Content authoring that supports worldwide use of the Web
Richard Ishida

Objectives

Discuss various aspects of content authoring that support universal access to the Web.

You will understand
• how to declare and use characters & character encodings in X/HTML & CSS
• how to declare the language of the document or parts of the document
• how to use Chinese or other scripts in URIs according to the latest standards

You will see some of the new CSS style capabilities that may soon be available for Asian languages.
Outline

• Character sets & encoding
  • Character set vs. character encoding
  • The Document Character Set
  • Choosing an encoding
  • Serving HTML & XHTML
  • Declaring the document encoding
  • Entities & Numeric Character References
  • Care & feeding of characters
• Identifying language
• IDN & IRIs
• East Asian typography
Character set vs. character encoding

An important initial distinction

• Character set: the set of atomic text elements you will use for a particular purpose.
• Character encoding: the way these abstract characters are mapped to numbers for manipulation in a computer.

A single character set, such as Unicode, may have more than one character encoding.
Character   | A           | א           | 好          | 𣎴
Code point  | U+0041      | U+05D0      | U+597D      | U+233B4
UTF-8       | 41          | D7 90       | E5 A5 BD    | F0 A3 8E B4
UTF-16      | 00 41       | 05 D0       | 59 7D       | D8 4C DF B4
UTF-32      | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4
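The byte values in the table can be reproduced with a short Python sketch. It uses only the standard library; the helper name `hex_bytes` is made up for this illustration, and the big-endian codecs are chosen so that no byte order mark is added.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))

# One abstract character (code point), three different byte sequences.
for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
    ch = chr(cp)
    # The "-be" codecs give big-endian bytes without a byte order mark.
    print(f"U+{cp:04X}",
          hex_bytes(ch, "utf-8"), "|",
          hex_bytes(ch, "utf-16-be"), "|",
          hex_bytes(ch, "utf-32-be"))
```

The character set (the code points) stays the same in every row; only the encoding changes.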
The Document Character Set

What is the Document Character Set?

• the logical model that describes how XML and HTML are processed
• for XML and HTML (from version 4.0): it is the Universal Character Set (UCS) defined by both ISO/IEC 10646 and Unicode standards
• values of numeric character references (such as &#x01F5; and &#501; for ǵ) are interpreted as Unicode characters, no matter what encoding you use for your document

What this means:

• documents can only contain characters defined by Unicode
• any encoding can be used for your document as long as it is properly declared and a subset of the Unicode repertoire
• it does not mean that all HTML and XML documents have to be encoded as Unicode!
See GEO FAQ: Document character set http://www.w3.org/International/questions/qa-doc-charset.html
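As a small illustration of the point about numeric character references, the following Python sketch resolves the two references for ǵ with the standard `html.unescape` function. The variable names are illustrative only.

```python
import html

# Both numeric character references name code point U+01F5, whatever
# encoding the surrounding document happens to be stored in.
hex_ncr = html.unescape("&#x01F5;")
dec_ncr = html.unescape("&#501;")
print(hex_ncr, dec_ncr, hex_ncr == dec_ncr == "\u01F5")

# The character has different byte forms in different encodings,
# but the reference itself never changes.
print("\u01F5".encode("utf-8"), "\u01F5".encode("utf-16-be"))
```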
Choosing an encoding

Consider using Unicode

• supports many languages, enabling the use of a single encoding across all pages and forms, regardless of language
• eliminates the need for server-side logic to determine the character encoding for each page served or each incoming form submission
• allows many more languages to be mixed on a single page than almost any other choice

Although there are other multi-script approaches (such as ISO-2022 and GB18030), Unicode generally provides the best combination of user agent and script support.
If you don't use Unicode

• select an encoding that maximizes the opportunity to directly represent characters / minimizes the need to represent characters by character escapes
• select commonly supported encodings & check that user agents adequately support the encoding selected
• consider a solution that minimizes complexity when dealing with multiple languages and scripts
• note that support for a given encoding does not necessarily imply support for all writing systems that that encoding supports
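One way to check how well a candidate encoding can "directly represent characters" is to list the characters in a sample text that it cannot encode. This is a rough sketch using Python's built-in codecs; the function name and the sample string are illustrative, and Python's codec support says nothing about what a given user agent supports.

```python
def unsupported_characters(text: str, encoding: str) -> list[str]:
    """Return the distinct characters of text that encoding cannot represent directly."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

sample = "pr\u016fb\u011bhu z\u00e1\u0159\u00ed"  # Czech: "průběhu září"
for name in ("iso-8859-1", "iso-8859-2", "utf-8"):
    # Every character reported here would need a character escape.
    print(name, unsupported_characters(sample, name))
```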
Serving HTML & XHTML

XHTML & MIME types

HTML   | text/html
XHTML  | text/html (use the compatibility guidelines in XHTML Spec, Appendix C!)
XHTML  | application/xhtml+xml, application/xml, text/xml (XML)
'Standards' vs 'Quirks' modes

[Example: "Test file for Standards Mode". Sample text: "Here is some text in a p in a div." Table rows: "Here is some text... | ...in a p tag" and "Here is some ... | ... that's not."]

Summary of assumptions & recommendations

• XHTML served as XML is still not widely supported
• XHTML served as XML should be served as
Declaring the document encoding

Basic scenarios

                    | HTTP
HTML                |
XHTML (text/html)   |
XHTML (XML)         |
Where appropriate, declare the page's character encoding by setting the charset parameter in the HTTP Content-Type header.

+ user agents can easily find the information
+ highest priority in case of conflict, so should be used where transcoding is done by the server
- more difficult for content authors to change the setting, especially when dealing with an ISP
- server settings may get out of sync with the document

• method depends on the server
• server may have default settings
• users may be able to override default settings
• Eg .htaccess files in Apache
  • AddType 'text/html; charset=UTF-8' html
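To see what a user agent reads from such a header, the charset parameter can be extracted from a Content-Type value. The sketch below borrows the standard library's `email.message.Message` class as a header parser; the function name is illustrative and this is not a description of how any particular browser works.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value it finds.
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the second call returns None, the header gives no encoding information and the user agent has to fall back on in-document declarations.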
For XHTML served as text/html, where practical use an XML declaration with an encoding attribute.

+ useful when editing or processing the file as XML, eg. using XSLT
+ helps developers, testers, or translation production managers who want to perform a visual check of a document
+ allows the document to be read correctly when not on the server
+ the XHTML spec says you should
- knocks Internet Explorer documents into Quirks mode
- not actually needed for HTML documents

Character set name
• not required for UTF-8 or UTF-16, but useful anyway
• find a name at http://www.iana.org/assignments/character-sets
• use the preferred name when aliases are available
• avoid using unregistered names (ie. x-… )
For XHTML served as application/xhtml+xml, always use an XML declaration with an encoding attribute.

• useful when processing the file as XML, eg. using XSLT
• developers, testers, or translation production managers may want to perform a visual check of a document
• the XHTML spec says you should

For HTML documents and XHTML documents served as text/html, always use the meta element to explicitly declare the document's character encoding.

• enables the document to be edited or read from CD or hard disk
• helps developers, testers, or translation production managers who want to perform a visual check of a document
• the XHTML spec says you should
• there is likely to be no other in-document alternative

Character set name
• required for all encodings, including UTF-8 or UTF-16
• put as near as possible to the top of the file (ie. before
• avoid using unregistered names (ie. x-… )

Basic scenarios

                    | HTTP | XML declaration | meta
HTML                | (✓)  | ✗               | ✓
XHTML (text/html)   | (✓)  | (✓)             | ✓
XHTML (XML)         | (✓)  | ✓               | ✗
Declare encoding for your CSS style sheets too

@charset "utf-8";

• must be the first thing in the file
• beware of the UTF-8 signature/BOM
• only necessary for external, linked style sheets
• likely to become increasingly important in the future
• particularly important if your style sheet:
  • contains non-ASCII values for the content property, or
  • refers to non-ASCII element or attribute names or values

Precedence rules for XHTML/HTML encoding

1. HTTP Content-Type
2. XML declaration
3. meta statement
4. link charset attribute

HTTP as a priority supports transcoding
Use the link charset attribute with care
Precedence rules for CSS

1. HTTP Content-Type
2. @charset rule
3. link charset attribute

HTTP as a priority supports transcoding
Use the link charset attribute with care
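The precedence rules amount to "take the first declaration that is present, in priority order". Here is a minimal sketch of that idea. The dictionary keys and function name are invented for this illustration, and the third CSS entry (the link charset attribute) is an assumption based on the note about that attribute above.

```python
# Sources of encoding information, highest priority first.
HTML_PRECEDENCE = ("http_content_type", "xml_declaration", "meta", "link_charset")
# Assumption: the third CSS source is the link charset attribute.
CSS_PRECEDENCE = ("http_content_type", "at_charset_rule", "link_charset")

def effective_encoding(declared, precedence):
    """Return (source, encoding) for the highest-priority declaration present, or None."""
    for source in precedence:
        encoding = declared.get(source)
        if encoding:
            return source, encoding
    return None

# A page whose meta element disagrees with the server's HTTP header:
page = {"meta": "iso-8859-1", "http_content_type": "utf-8"}
print(effective_encoding(page, HTML_PRECEDENCE))
```

This shows why a server setting that is out of sync with the document matters: the HTTP value overrides whatever the document says about itself.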
Entities and Numeric Character References (NCRs)

Only use escapes for characters in exceptional circumstances

Create pages using an encoding that supports all the characters you need

Example text using escapes for ě and ř:

Jako efektivn&#x11B;jší se nám jeví po&#x159;ádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

Ways of escaping á:

&#xE1;    hex NCR
&#225;    decimal NCR
&aacute;  character entity
\E1       CSS escape

NCR = Numeric Character Reference
Only use escapes for characters in exceptional circumstances

Create pages using an encoding that supports all the characters you need

Jako efektivnĕjší se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovanǽch dealerů v Čechách a na Moravě, které proběhnou v průbůhu zá ří a října.

Exceptional circumstances:

• syntax-related characters: < (&lt;) > (&gt;) & (&amp;) " (&quot;)
• characters not supported by the document encoding (it may be better to change the encoding!)
• characters not supported by the input tools
• characters that are invisible or ambiguous
Also bear in mind…

• NCRs always reference the Unicode code point, no matter what encoding you used!
• suggest you use Hex values rather than Decimal – easier to look up
• character entities can cause problems for processing as XML
• use a single value for supplementary characters, not two
  ie. &#x12345; not two surrogate code units
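The advice on hex values and supplementary characters can be sketched as follows. The helper `hex_ncr` is hypothetical; it builds the reference from the code point, so a supplementary character always produces a single NCR.

```python
import html

def hex_ncr(ch: str) -> str:
    """Return a hexadecimal NCR for a single character, using its code point."""
    return f"&#x{ord(ch):X};"

ch = "\U00012345"  # a supplementary character, outside the Basic Multilingual Plane
print(hex_ncr(ch))                       # one reference for the whole code point
print(html.unescape(hex_ncr(ch)) == ch)  # it resolves back to the same character

# UTF-16 stores this character as two code units, but those are an
# encoding detail and must not be written as two separate NCRs.
print(ch.encode("utf-16-be").hex(" ").upper())
```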
Care and feeding of characters

Some Unicode characters are not suitable for use with markup

Names/Description | Short | Comment
Line and paragraph separator | | use

Other Unicode characters are OK

Names/Description | Short | Comment

Unicode in XML & Other Markup Languages http://www.w3.org/TR/unicode-xml/
'Compatibility characters' vary in appropriateness

Names/Description | Examples | Verdict
Circled letters and digits used for list item markers | ①②③ⒶⒷⒸ㊂㊃㊄㊓㊔㊕㋝㋞㋟ | OK
Parenthesized or dotted number used as list item marker | ⑴⑵⑶ | use list item marker style
Arabic Presentation forms | ﻌﻋﻊﻉ | normalize
Half-width and full-width characters | ヤユヨラabcd | OK
Superscripted and subscripted characters | ¹²³₁₂₃ | use markup
Etc…

moral – use markup where possible

Unicode in XML & Other Markup Languages http://www.w3.org/TR/unicode-xml/

Normalization

NFC: Ízelítőül
NFD: I ◌́ z e l i ◌́ t o ◌̋ u ◌̈ l

Ha a világ beszélni akarna, Unicode-ul szólalna meg. Regisztráljon már most a Tizedik Nemzetközi Unicode Konferenciára, melyet 1997. március 10-12-én rendeznek Meinz-ban, Németországban. Ezen a konferencián az iparág több neves szakértője is részt vesz. Ízelítőül a témákból: a világháló és a Unicode nemzetköziesítése és lokalizálása, a Unicode alkalmazása működő rendszerekben és alkalmazásokban, szövegelrendezésnél, és többnyelvű számítógépeken.
When creating Unicode text that has the potential for normalization, use NFC

Normalization Form C

• closest to actual practice
• compact
• the Character Model for the World Wide Web recommends that text on the Web be in NFC form
• facilitates clarity and checking
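The difference between the two forms can be seen with Python's standard `unicodedata` module, using the Hungarian word from the normalization example. This is only an illustrative sketch; the variable names are arbitrary.

```python
import unicodedata

word = "\u00cdzel\u00edt\u0151\u00fcl"  # "Ízelítőül" typed as precomposed characters
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# NFC keeps one code point per accented letter; NFD splits each accented
# letter into a base letter followed by a combining mark.
print(len(nfc), len(nfd))

# The two strings look identical on screen but do not compare equal
# until both are brought to the same normalization form.
print(nfc == nfd, unicodedata.normalize("NFC", nfd) == nfc)
```

Comparing or searching text reliably therefore depends on agreeing on one form, which is why NFC is recommended for the Web.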
Outline

• Character sets & encoding
• Identifying language
  • Declaring the language of text
  • Language attribute values
  • Negotiating language with the server
  • Styling using the language attribute
• IDN & IRIs
• East Asian typography

Declaring the language of text
Always declare the language of the document as a whole in the html tag

• HTML: use the lang attribute in the html tag
• XHTML as text/html: use both the lang attribute and the xml:lang attribute in the html tag
• XHTML as XML: use the xml:lang attribute in the html tag

Always declare changes in the language of the text

• HTML: use the lang attribute
  Example: The French for Cat is chat.
• XHTML as text/html: use both the lang attribute and the xml:lang attribute (xml:lang takes precedence)
  Example: The title in Chinese is 中国科学院文献情报中心.
• use a span element if there is nothing to hang the information on

Why declare the language?
• extremely important for screen readers and accessibility
• allows for stylistic variations
  • some browsers use language to determine appropriate fonts for Simplified vs Traditional Chinese vs Japanese
• allows for extraction of language-specific elements, eg. using XSLT's lang() function in XPath
• etc.

Language attribute values
Use RFC 3066 rules

• HTML 4.01 specification still says RFC 1766
• RFC 3066 replaced RFC 1766, so use this!

Structure of a language tag: language, followed by optional subcode(s) (country, dialect, etc.)

language
• case insensitive, A-Z, a-z, 0-9, up to eight letters
• usually written lowercase, by convention
• all 2-letter and 3-letter codes must be ISO 639 codes
• don't use 3-letter ISO 639 codes if 2-letter code exists
Use RFC 3066 rules

language
• i-… reserved for IANA-defined registrations
• x-… user-defined codes, eg. x-ishidic

optional subtag(s) (dialect, country, etc.)
• case insensitive, A-Z, a-z, 0-9, two to eight letters
• usually written uppercase, by convention
• 2-letter codes must be ISO 3166 codes
• 3- to 8-letter codes can be registered with IANA
• more than 2 subtags possible, eg. en-UK-geordie – no special rules apply
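The syntax rules can be turned into a rough checker. The sketch below assumes the RFC 3066 grammar (a primary subtag of 1 to 8 letters, then subtags of 1 to 8 letters or digits) and checks syntax only. It does not consult the ISO 639, ISO 3166 or IANA registries, and the function name is invented for this illustration.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, then any number of
# further subtags of 1-8 letters or digits, separated by hyphens.
_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

def parse_language_tag(tag: str):
    """Split a syntactically valid tag into (language, [subtags]); return None otherwise."""
    if not _TAG.fullmatch(tag):
        return None
    language, *subtags = tag.split("-")
    # Tags are case insensitive; apply the usual conventions for display:
    # lowercase language, uppercase for a two-letter (country) second subtag.
    language = language.lower()
    if subtags and len(subtags[0]) == 2:
        subtags[0] = subtags[0].upper()
    return language, subtags

for tag in ("EN-gb", "zh-min-nan", "x-ishidic", "en_GB"):
    print(tag, parse_language_tag(tag))
```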
Use IANA assigned language tags

• IANA defines codes for specific combinations, eg. zh-guoyu, zh-hakka, zh-min, zh-min-nan, zh-wuu, …
• includes codes for Simplified vs Traditional Chinese
  当世界需要沟通时,请用Unicode!
  當世界需要溝通時,請用統一碼(Unicode)
• register your needed codes with IANA!

Other points about language tags

• can be applied to other media (eg. audio objects), not just text
• used at the beginning of a document, it identifies the intended audience, rather than all the languages in the document
• 'en-GB' should also match 'en' – although this is not always the case for all implementations (eg. Apache language negotiation)
• xml now provides a means to prevent inheritance of language using the null string, ie. xml:lang=""

Issues with language tags
• doesn't cover all needs, eg. generic Latin-American Spanish
• some lack of clarity between use for language vs locale designation
• many more language codes needed for the approximately 6,000 world languages
• people working on improvements include ISO TC37, SIL, W3C, etc.
• register tags you need with IANA

Negotiating language with the server
The server may be set up to serve alternative versions of a document in a particular language based on HTTP information sent by the user agent

REQUEST
  GET /Press/1998/CSS2-REC HTTP/1.1
  Host: www.w3.org
  User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
  Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,video/x-mng,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
  Accept-Language: en-GB,en;q=0.7,ch-Hans;q=0.3
  Accept-Encoding: gzip,deflate
  Accept-Charset: UTF-8,*
  Keep-Alive: 300
  Connection: keep-alive
  Referer: http://www.w3.org/International/questions/qa-lang-priorities.html
  Cookie: absence_stuff=mult&0&group_id&2&chunksize&6
  If-Modified-Since: Tue, 12 May 1998 22:18:49 GMT
  If-None-Match: "3558cac9;36f99e2b"
  Cache-Control: max-age=0

REPLY
  HTTP/1.1 200 OK
  Date: Wed, 05 Nov 2003 10:46:04 GMT
  Server: Apache/1.3.28 (Unix) PHP/4.2.3
  Content-Location: CSS2-REC.en.html
  Vary: negotiate,accept-language,accept-charset
  TCN: choice
  P3P: policyref=http://www.w3.org/2001/05/P3P/p3p.xml
  Cache-Control: max-age=21600
  Expires: Wed, 05 Nov 2003 16:46:04 GMT
  Last-Modified: Tue, 12 May 1998 22:18:49 GMT
  ETag: "3558cac9;36f99e2b"
  Accept-Ranges: bytes
  Content-Length: 10734
  Connection: close
  Content-Type: text/html; charset=iso-8859-1
  Content-Language: en

Exercise caution in acting upon language information sent by the user agent

• the user may have never checked the Accept-Language setting
• user agent may send a request that specifies only a language and not a region (a particular issue if trying to establish locale for information like date formats)
• people borrow machines from friends, they use them in internet cafes - in these cases the inferred locale may be inappropriate
• always allow the user to select the appropriate language (and locale) from whatever page they are looking at!
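To make the Accept-Language header in the request above concrete, here is a simplified sketch of how a server-side script might order the user's language preferences by their q values. The function name is illustrative, the parsing is deliberately simplified, and real servers such as Apache apply further negotiation rules not shown here.

```python
def parse_accept_language(header: str):
    """Return (language-range, q) pairs sorted from most to least preferred."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # a range with no q parameter has quality 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q value as "not wanted"
        prefs.append((parts[0], q))
    # sorted() is stable, so ranges with equal q keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

print(parse_accept_language("en-GB,en;q=0.7,ch-Hans;q=0.3"))
```

In line with the cautions above, the result is best treated as a default that the user can override, not as a reliable statement of the user's language or locale.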
How to set up language preferences in your browser

In Internet Explorer:
Tools > Internet Options > General > Languages

• Note: if using a language like fr-CH, add fr too!
• Could add zh-hans and zh-hant

For other user agents, see http://www.w3.org/International/questions/qa-lang-priorities.html

Styling using the language attribute
Four possible approaches, in theoretically preferred order

1. the :lang() pseudo-class selector
2. a [lang |= "..."] selector that matches the beginning of the value of a language attribute
3. a [lang = "..."] selector that exactly matches the value of a language attribute
4. a generic class or id selector

Using :lang()

body {font-family: "Times New Roman", serif;}
:lang(ar) {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
:lang(zh-Hant) {font-family: PMingLiU,MingLiU, serif;}
:lang(zh-Hans) {font-family: SimSun-18030, SimHei, serif;}
:lang(din) {font-family: "Doulos SIL", serif;}

It is polite to welcome people in their own language:
- 欢迎
- 歡迎
- Καλοσωρίσατε
- <li xml:lang="ar" lang="ar">اه و
- Добро пожаловать
- Kudual

• en-GB in lang attribute matches en in :lang()
• ideal approach, but not yet supported by IE6
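The difference between prefix matching (as in [lang |= "..."], and the way en-GB matches en in :lang()) and exact matching (as in [lang = "..."]) can be sketched in a few lines. The function names are invented for this illustration, and comparing case-insensitively is an assumption made here because language tags are case insensitive.

```python
def dash_match(attribute_value: str, selector_value: str) -> bool:
    """[lang|="x"]: the value is exactly x, or starts with x followed by a hyphen."""
    value, wanted = attribute_value.lower(), selector_value.lower()
    return value == wanted or value.startswith(wanted + "-")

def exact_match(attribute_value: str, selector_value: str) -> bool:
    """[lang="x"]: the value is exactly x."""
    return attribute_value.lower() == selector_value.lower()

for lang in ("en", "en-GB", "english", "zh-Hant-TW"):
    print(lang, dash_match(lang, "en"), exact_match(lang, "en"))
```

With exact matching, a style rule written for en would miss text marked en-GB, which is one reason the prefix-based approaches come earlier in the preferred order.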
Using [lang |= "..."]

body {font-family: "Times New Roman", serif;}
*[lang|="ar"] {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
*[lang|="zh-Hant"] {font-family: PMingLiU,MingLiU, serif;}
*[lang|="zh-Hans"] {font-family: SimSun-18030, SimHei, serif;}
*[lang|="din"] {font-family: "Doulos SIL", serif;}

Using [lang = "..."]

body {font-family: "Times New Roman", serif;}
*[lang="ar"] {font-family: "Traditional Arabic", serif; font-size: 1.2em;}
*[lang="zh-Hant"] {font-family: PMingLiU,MingLiU, serif;}
*[lang="zh-Hans"] {font-family: SimSun-18030, SimHei, serif;}
*[lang="din"] {font-family: "Doulos SIL", serif;}
It is polite to welcome people in their own language:
- 欢迎
- 歡迎
- Καλοσωρίσατε
- اه و