A Sample of the Complexities Associated with Special Processing of Unicode Codepoints Used in Domain Names

INTRODUCTION

In early 2013, Verisign and other generic Top-Level Domain (gTLD) registries began participating in a procedure developed by the Internet Corporation for Assigned Names and Numbers (ICANN) to test gTLD operational stability and security prior to the delegation and launch of a new gTLD. This procedure is known as Pre-Delegation Testing (PDT) [1].

This document describes some of the complexities that Verisign has encountered with Unicode codepoint processing during the pre-delegation testing process. These complexities arise from ambiguities associated with local language communities having different practices for normalization of combining characters and composed characters. This need for context introduces a risk of ambiguity, adds complexity to pre-delegation testing, and makes it more difficult to develop consistent Internationalized Domain Name (IDN) implementations. This ambiguity and complexity may ultimately have profound effects on universal resolvability and deterministic navigation on the Internet, particularly as internationalization is more widely deployed and other opportunities arise for a multitude of interpretations and confusability across the various systems that are used to derive URIs.

Certain Unicode codepoints have rules or handling that require special processing. Occasionally, Verisign and Internetstiftelsen i Sverige (IIS) (the Swedish company selected by ICANN to run the tests) had different interpretations of the rules found in the Internationalized Domain Names for Applications (IDNA, RFCs 5890 [2] and 5891 [3]) standard or other applicable standards. The PDT system designed by IIS provides the “ability of the software and system to allow for interaction with the applicants” [4], and Verisign took advantage of this ability to discuss ambiguities with IIS. Below is a sample of the codepoints that required discussion with IIS.

CODEPOINT SAMPLES

1. Hyphen-Minus (U+002D)

This codepoint is allowed in all scripts. It is a COMMON [5] script character and is PROTOCOL VALID (PVALID, RFC 5892 [6]) for IDN registration. It has a Bidirectional (Bidi) property [7] of ES (European Number Separator). This property is allowed in both Left-To-Right (LTR) and Right-To-Left (RTL) labels. This codepoint commonly occurs across all scripts and is accepted for registration.

Why this matters: The PDT system flagged common script characters as potential issues, but the Hyphen-Minus character is PVALID and widely used in domain names.

2. Digits (U+0030 – U+0039)

Digit characters (U+0030 – U+0039) are allowed in all scripts. These are COMMON script characters and are PVALID for IDN registration. They have a Bidi property of EN (European Number). This property is allowed in both LTR and RTL labels. These codepoints commonly occur across all scripts and are accepted for registration.

Why this matters: The PDT system flagged common script characters as potential issues, but the Digit characters are PVALID and widely used in domain names.

3. The Middle Dot (U+00B7)

This PVALID codepoint is PROHIBITED by Verisign in all registrations. Originally it was included in Verisign’s Latin table, but the PDT tester wanted to restrict this codepoint to the Catalan language only. Unicode does not recognize Catalan as a script separate from Latin, and so Verisign provides no special support for this language.

Why this matters: Verisign prohibited use of this character in accordance with guidance received from the PDT tester. A contextual rule is required to make this codepoint an allowable exception per RFC 5892.

4. Arabic-Indic Digits (U+0660 – U+0669)

These codepoints are limited to the Arabic and Thaana script tables by a Unicode Script Extension. Further, a Contextual Rule prevents these codepoints from mixing with the EXTENDED ARABIC-INDIC DIGITS (U+06F0 – U+06F9). The ARABIC-INDIC DIGITS have a Bidi property of AN (Arabic Number), whereas both the EXTENDED ARABIC-INDIC DIGITS and the DIGITS (U+0030 – U+0039) have a Bidi property of EN. A Bidi Rule specified in RFC 5893 [8] (Section 2, Rule #4) prevents EN and AN characters from appearing in the same label, so the ARABIC-INDIC DIGITS cannot mix with EXTENDED ARABIC-INDIC DIGITS or with the DIGIT characters. The Verisign Registry implements these rules.

Why this matters: Verisign implemented use of these characters in accordance with guidance received from the PDT tester. Contextual rules apply, and simple inclusion or exclusion is not protocol-compliant.

5. Hebrew Punctuation Geresh and Gershayim (U+05F3, U+05F4)

These are Hebrew codepoints. A Contextual Rule asserts that these codepoints must be preceded by another Hebrew codepoint, so they cannot immediately follow a Digit character. These Hebrew characters are Right-To-Left codepoints, and a Bidi Rule (RFC 5893, Section 2, Rule #1) prevents an RTL label from beginning with an EN character such as a Digit. A label containing U+05F3 or U+05F4 therefore cannot begin with a Digit, and it cannot place a Digit immediately before U+05F3 or U+05F4, but there is no public standard that would generally prevent these codepoints from appearing with DIGITS. Verisign’s implementation allows these characters to be followed by a digit.

Why this matters: Verisign implemented use of these characters in accordance with guidance received from the PDT tester. Contextual rules apply, and simple inclusion or exclusion is not protocol-compliant.

6. Katakana Middle Dot (U+30FB)

This is a COMMON codepoint with a Script Extension rule. This codepoint can only appear with characters from the Bopomofo, Hangul, Han, Hiragana, Katakana or Yi scripts. Verisign also has a Japanese Language table that includes this codepoint as well as other scripts such as Latin. When using the Japanese Language table, Verisign’s implementation prevents the Katakana Middle Dot from appearing with Latin characters.

Why this matters: Verisign implemented use of this character in accordance with guidance received from the PDT tester. The PDT tester recommended that this character not be combined with Latin characters in a Japanese language context, and simple inclusion or exclusion is not protocol-compliant.

7. Katakana-Hiragana Prolonged Sound Mark (U+30FC)

This is a COMMON codepoint with a Script Extension rule. This codepoint can only appear with characters from the Hiragana or Katakana scripts. Verisign has a Japanese Language table that includes this codepoint as well as other scripts such as Latin. When using the Japanese Language table, Verisign’s implementation prevents the Prolonged Sound Mark from appearing with Latin characters.

Why this matters: Verisign implemented use of this character in accordance with guidance received from the PDT tester. The PDT tester recommended that this character not be combined with Latin characters in a Japanese language context, and simple inclusion or exclusion is not protocol-compliant.

VERISIGN POSITION

Verisign reviewed its PDT test results with the PDT testing agent to resolve issues of interpretation as noted in this paper. As a result of these reviews, multiple changes were made to Verisign’s IDNA implementation. We also discussed Verisign’s concerns with embedding language rules in Label Generation Rulesets with the PDT testing agent, and as a result rules about consonant/vowel patterns in the Thai language were not implemented.

Verisign believes that additional work is required in this area to explore the role of context in resolution and to reduce implementation complexity.

[1] https://newgtlds.icann.org/en/applicants/pdt
[2] https://tools.ietf.org/html/rfc5890
[3] https://tools.ietf.org/html/rfc5891
[4] https://newgtlds.icann.org/en/applicants/pdt/vendor-selection-summary-02sep16-en.pdf
[5] http://unicode.org/reports/tr24/#Special_Explicit
[6] https://tools.ietf.org/html/rfc5892
[7] http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types
[8] https://tools.ietf.org/html/rfc5893

Verisign Public
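The interaction between the Bidi rules and the contextual rules described in samples 4 and 5 may be easier to follow as code. The following is a minimal Python sketch, not Verisign's or IIS's implementation. The function names and the problem tags are hypothetical. It also makes two simplifying assumptions: Bidi Rule #1 is applied only to labels that contain an R, AL or AN character, and "is a Hebrew codepoint" is approximated by the Unicode character name, because the Python standard library does not expose the Script property.

```python
import unicodedata

# Codepoint ranges named in the text.
ARABIC_INDIC = {chr(c) for c in range(0x0660, 0x066A)}           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = {chr(c) for c in range(0x06F0, 0x06FA)}  # U+06F0..U+06F9
GERESH_GERSHAYIM = {"\u05F3", "\u05F4"}


def is_hebrew(ch):
    # Assumption: approximate the Hebrew script test by the character name.
    return unicodedata.name(ch, "").startswith("HEBREW ")


def check_label(label):
    """Return a list of hypothetical problem tags for one U-label."""
    if not label:
        return []
    problems = []
    classes = {unicodedata.bidirectional(ch) for ch in label}
    is_rtl = bool(classes & {"R", "AL", "AN"})

    # RFC 5893, Section 2, Rule 1: the first character must be L, R or AL.
    # Simplification: only checked here for RTL labels.
    if is_rtl and unicodedata.bidirectional(label[0]) not in {"L", "R", "AL"}:
        problems.append("bidi-rule-1")

    # RFC 5893, Section 2, Rule 4: EN and AN must not appear together.
    if "EN" in classes and "AN" in classes:
        problems.append("bidi-rule-4")

    # Contextual rule: the two Arabic digit sets must not be mixed.
    chars = set(label)
    if chars & ARABIC_INDIC and chars & EXTENDED_ARABIC_INDIC:
        problems.append("digit-mixing")

    # Contextual rule: Geresh and Gershayim must follow a Hebrew codepoint.
    for i, ch in enumerate(label):
        if ch in GERESH_GERSHAYIM and (i == 0 or not is_hebrew(label[i - 1])):
            problems.append("hebrew-context")
            break

    return problems
```

Under these assumptions, a label such as Alef, Geresh, digit (`"\u05D0\u05F31"`) produces no problem tags, which matches the statement that these characters may be followed by a digit. Moving the digit to the front or placing it immediately before the Geresh is reported. A production implementation would need the real Script and Script_Extensions data and the remaining RFC 5893 conditions.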
