Sorting Unicode Tibetan using a Multi-Weight Collation Algorithm

Robert R. Chilton
Technical Director, The Asian Classics Input Project (ACIP)
www.asianclassics.org

Why do we need a sorting algorithm for Unicode Tibetan?

• Tibetan script encoded in Unicode and ISO/IEC 10646
• Full support of Tibetan within a computer environment also requires:
– Keyboard(s) or other input methods
– Rendering: readable, printable display of the encoded Tibetan-script data (fonts, etc.)
– Collation rules for generating culturally acceptable sorting

Previous efforts to sort Tibetan data

• Utilized a single-weight sorting model
• Generally adequate for sorting native Tibetan orthographies within a specific application/environment
• Not able to robustly handle foreign transcriptions and other non-standard orthographies
• Designed for Romanized or font-encoded Tibetan, not Unicode Tibetan
• Treated Tibetan-script sorting in an exclusive, special-case fashion; such proprietary sorting methods are not likely to be widely implemented

Features of multi-weight sorting methodology

• Well-understood and widely implemented*
• Uses a collation element table to achieve culturally acceptable sorting
• Enables searching at different degrees of precision (e.g., case-sensitive searches)

*References:

• ISO/IEC 14651 (2001-04) Ed. 1.0 Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering.

• Unicode Technical Standard #10: Unicode Collation Algorithm (UCA).

Advantages for implementers and users of Unicode Tibetan

• Collation element table for Tibetan can “plug into” existing sort logic at the operating system level
• Robust searching and sorting of Tibetan data thus becomes automatically available to all compliant applications running within that operating system environment
• The same collation element table can be used across multiple platforms, resulting in consistent sorting of Tibetan data within different operating system environments

Moving from methodology to algorithm: an overview

• A look at dictionary sorting
• Understanding the multi-weight sorting (“international string ordering”) model and extending this model to Tibetan
• Determining the collation elements needed for sorting Unicode Tibetan

Dictionary sorting of Tibetan

• General agreement (since 1900) on dictionary order of native orthographies
– Main exception: treatment of wazur
• Lack of consensus on details of sorting foreign-origin orthographies
• Universal agreement that all entries must appear under one of the 30 letters (ཀ་ to ཨ་)
– Example of སྐནྡྷ་ (Sanskrit: “skandha”): sorts under the collation slot for སྐ

How foreign words are sorted in dictionaries generally

• Words from foreign languages are sorted according to the sort rules of the dictionary’s language (and not the sort rules of the origin language)
– a Danish word beginning with å appears after letter Z in a Danish dictionary
– the same word is sorted under letter A in an English dictionary
• Extending this convention to Tibetan, all words in a Tibetan dictionary, including foreign words, are sorted under 30 letters
• Extending this convention still further, all vowel signs are treated in terms of the 5 standard Tibetan vowels
– implicit vowel ཨ་
– 4 explicit vowel signs

The multi-weight sorting model for international string ordering

• Weights are generally assigned at three (or more) levels
• In Latin scripts these levels correspond to:
1. alphabetic ordering = primary level
2. diacritic ordering = secondary level
3. case ordering = tertiary level
• (Additional levels may be used for tie-breaking between strings not distinguished at the first three levels)

Examples in Latin script

• role and rule differ at the primary (alphabetic) level
• role and rôle differ at the secondary (diacritic) level
• role and ROLE differ at the tertiary (case) level
• role and RÔLE differ at both the secondary level and the tertiary level

Extending the multi-weight model to Tibetan

• K- and M- differ at the primary level
• K- and Kı- differ at the secondary level
• K- and f- differ at the tertiary level
• K- and gÛ- differ at both the secondary level and the tertiary level

Determining collation elements for Unicode Tibetan: an overview

• Prescripts in Tibetan orthographies
• The Unicode model for encoding Tibetan script
• What is a collation element?
• 167 primary-weighted collation elements
• 9 secondary-weighted collation elements

Prescripts in Tibetan orthographies

• In English, all 26 letters always have primary weight; thus “at” sorts far away from “vat”
• In Tibetan, letters written before the radical letter have less than primary weight; thus ཀན, སྐན, བཀན, and བསྐན sort relatively near to each other, under letter ཀ་
• 11 possible prescripts (or “pre-radicals”) might occur before the radical letter:
– 5 prefix letters: ག ད བ མ འ
– 3 head letters: ར ལ ས
– 3 two-letter sequences of བ prefix followed by one of the head letters, i.e.: བར བལ བས
• Grammar rules define which radical letters can take which prescripts
– For letter ཀ་ there are 7 possible prescripts: དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ

• No radical letter can take all 11 prescript forms (and some take none at all)

The Unicode encoding model for Tibetan script

• 193 distinct characters defined in Unicode
• The 30 letters (along with conjunct and reversed forms) are encoded twice: in nominal position and in orthographic-subjoined position
– reflects the fact that the Tibetan script is written from top to bottom as well as from left to right
• Example of བརྟེན་ encoded as 6 characters:

བ + ར + ◌ྟ + ◌ེ + ན + ་
0F56 0F62 0F9F 0F7A 0F53 0F0B

What is a collation element?

• A collation element enables clustering of multiple Unicode characters such that they can be treated together as a single item for determining sort weights
• Single characters also function as collation elements
• The weights assigned to the collation elements determine their sort (or collation) order relative to one another

Defining Tibetan pre-scribed radical sequences as collation elements

• For letter ཀ་ we can define each of the 7 prescript+radical clusters (དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ) as a collation element (also called a “collation grapheme”)
• We can then assign sort weights to these collation elements such that they sort in a culturally acceptable relative order

Primary-weighted collation elements

• 30 nominal letters: ཀ་ to ཨ་, which may be either radical letters (མིང་གཞི་) or suffix letters
• 103 multi-letter pre-scribed radical forms
– In many of these 103 forms the prescript is written at the head line (encoded as 1 or 2 nominal characters) and the radical letter is encoded as a subjoined character
– In the example of བརྟེན་, the radical letter ཏ is encoded in subjoined position

Defining the 4 explicit vowels as collation elements

• As collation elements, suffix letters cannot be distinguished from bare radicals*
• Because a nominal letter serving as a radical letter carries the implicit vowel ཨ་, the 4 explicit vowels must be given primary weights, and must be weighted heavier than the nominal letters, since a radical letter marked with an explicit vowel will sort after the same letter not marked by an explicit vowel

* in a stateless implementation

Defining the 30 post-radical letters as collation elements

• Post-radicals = the 30 letters in subjoined position (when not functioning as the radical letter in a pre-scribed radical form)
– Requires maximum-length substring matching
• Only 4 post-radical (subscribed) letters occur in native Tibetan orthographies: ◌ྭ ◌ྱ ◌ྲ ◌ླ
• Remaining 26 are required to treat non-native orthographies in a consistent manner
• Must be given primary weights, and heavier than the 4 explicit vowels

Relative order of the 167 primary-weighted collation elements

• First: 30 nominal letters and 103 multi-letter pre-scribed radical forms (= 133 collation elements)
– given sort weights such that the 103 pre-scribed radical forms are interleaved as appropriate with the 30 nominal letters
• Next: 4 explicit vowels
• Next: 30 post-radical letters (i.e., in orthographic subscribed position)
• Thus, total collation slots at the primary-weight level: 133 + 4 + 30 = 167

Secondary-weighted collation elements

• These 9 have no primary weight
– 4 combining marks: ◌྄ ◌ཱ ◌༹ ◌ཿ
– 5 signs: ྅ ྈ ྉ ྊ ྋ

The remaining 120 Unicode Tibetan characters

• 30 + 4 + 30 + 9 = 73 of the 193 Unicode Tibetan characters have been treated above, leaving 120
• 59 of these 120 have a primary weight (in addition to a secondary and/or tertiary weight):
– 19 can be decomposed into simple elements and thus need not be treated in the collation element table
– 9 are variants (primary and tertiary weighted) of certain of the 30 nominal letters
– 3 are variants (primary and tertiary weighted) of certain of the 4 explicit vowels
– 8 are variants (primary and tertiary weighted) of certain of the 30 subscribed letters
– 20 are the digits and half-digits
• The remaining 61 characters are punctuation marks and other symbols which generally have no impact on dictionary sort order and thus have no primary, secondary or tertiary weight

Appendices

• The Unicode (and ISO/IEC 10646) character-encoding chart for Tibetan, highlighting characters in example: བརྟེན་
• An ordered list of collation elements for Unicode Tibetan

[Code charts: Tibetan, 0F00 to 0F7F and 0F80 to 0FFF. The Unicode Standard 3.0, Copyright © 1991-2000, Unicode, Inc. All rights reserved.]

An Ordered List of Collation Elements for Unicode Tibetan

[*Note: a comprehensive Collation Element Table for Tibetan script will include additional collation elements, such as ྋྙ, ྉྤ, and ྉྥ, beyond those listed here.]

A. Primary Weighted Collation Elements

A.1. The 133 radical-initial sequences (also covers the suffix letters):

ཀ དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ ཁ མཁ འཁ ག དག བག མག འག རྒ ལྒ སྒ བརྒ བསྒ ང དང མང རྔ ལྔ སྔ བརྔ བསྔ ཅ གཅ བཅ ལྕ བལྕ ཆ མཆ འཆ ཇ མཇ འཇ རྗ ལྗ བརྗ ཉ གཉ མཉ རྙ སྙ བརྙ བསྙ ཏ གཏ བཏ རྟ ལྟ སྟ བརྟ བལྟ བསྟ ཐ མཐ འཐ ད གད བད མད འད རྡ ལྡ སྡ བརྡ བལྡ བསྡ ན གན མན རྣ སྣ བརྣ བསྣ པ དཔ ལྤ སྤ ཕ འཕ བ དབ འབ རྦ ལྦ སྦ མ དམ རྨ སྨ ཙ གཙ བཙ རྩ སྩ བརྩ བསྩ ཚ མཚ འཚ ཛ མཛ འཛ རྫ བརྫ ཝ ཞ གཞ བཞ ཟ གཟ བཟ འ ཡ གཡ ར བར(seen in བརླ) ལ ཤ གཤ བཤ ས གས བས ཧ ལྷ ཨ

Key:
– Black for the 30 nominal letters. Note that whereas any of these 30 can serve as a bare radical, 10 of these can also appear in suffix position in native orthographies.
– Blue for (relatively) unambiguous cases of prefixed and/or superscribed radical letters. Note that certain unavoidable ambiguities arise between native orthographies and transcriptions from foreign languages.
– Red for ambiguous cases where a 3rd codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a root letter followed by a suffix). Note that certain cases (in Dzongkha) require a 4th codepoint in order to distinguish a case of a prefixed radical letter from a case of a suffix letter followed by a secondary syllable that involves a vowel (i.e., ནི or མོ).
– Magenta for an ambiguous case (in Dzongkha) where a 3rd (or possibly 4th) codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a suffix letter ད followed by a secondary syllable པ or པོ).
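The A.1 list can also be regenerated from a compact statement of which prescripts each radical letter takes. The following Python sketch is illustrative only: the ASCII prescript codes (g, d, b, m, ', r, l, s, br, bl, bs) and the function names are hypothetical, and the per-letter prescript sets are simply read off the list above. It relies on each subjoined letter being encoded 0x50 above its nominal counterpart.

```python
# Sketch: rebuild the 133 radical-initial sequences of A.1 from a table of
# prescripts per radical letter. Codes and names are illustrative assumptions.

# The 30 nominal letters in dictionary order (ka ... a), as code points.
NOMINAL = [0x0F40, 0x0F41, 0x0F42, 0x0F44, 0x0F45, 0x0F46, 0x0F47, 0x0F49,
           0x0F4F, 0x0F50, 0x0F51, 0x0F53, 0x0F54, 0x0F55, 0x0F56, 0x0F58,
           0x0F59, 0x0F5A, 0x0F5B, 0x0F5D, 0x0F5E, 0x0F5F, 0x0F60, 0x0F61,
           0x0F62, 0x0F63, 0x0F64, 0x0F66, 0x0F67, 0x0F68]

# Hypothetical ASCII codes for the 5 prefix letters and the 3 head letters.
PREFIX = {"g": "\u0F42", "d": "\u0F51", "b": "\u0F56", "m": "\u0F58", "'": "\u0F60"}
HEAD = {"r": "\u0F62", "l": "\u0F63", "s": "\u0F66"}

# Prescripts taken by each radical, in the order they appear in A.1.
PRESCRIPTS = [
    "d b r l s br bs",           # ka
    "m '",                       # kha
    "d b m ' r l s br bs",       # ga
    "d m r l s br bs",           # nga
    "g b l bl",                  # ca
    "m '",                       # cha
    "m ' r l br",                # ja
    "g m r s br bs",             # nya
    "g b r l s br bl bs",        # ta
    "m '",                       # tha
    "g b m ' r l s br bl bs",    # da
    "g m r s br bs",             # na
    "d l s",                     # pa
    "'",                         # pha
    "d ' r l s",                 # ba
    "d r s",                     # ma
    "g b r s br bs",             # tsa
    "m '",                       # tsha
    "m ' r br",                  # dza
    "",                          # wa
    "g b",                       # zha
    "g b",                       # za
    "",                          # 'a
    "g",                         # ya
    "b",                         # ra (the form seen in brla)
    "",                          # la
    "g b",                       # sha
    "g b",                       # sa
    "l",                         # ha
    "",                          # a
]


def build_form(code, radical_cp):
    """Return the code point sequence for one prescript + radical cluster."""
    if code in PREFIX:
        # Prefix letter and radical are both written in nominal position.
        return PREFIX[code] + chr(radical_cp)
    subjoined = chr(radical_cp + 0x50)  # radical drops to subjoined position
    if code in HEAD:
        return HEAD[code] + subjoined
    # Two-letter prescript: b prefix + head letter + subjoined radical.
    return PREFIX["b"] + HEAD[code[1]] + subjoined


def radical_initial_sequences():
    """List each nominal letter followed by its pre-scribed forms."""
    forms = []
    for cp, codes in zip(NOMINAL, PRESCRIPTS):
        forms.append(chr(cp))
        forms.extend(build_form(code, cp) for code in codes.split())
    return forms


forms = radical_initial_sequences()
print(len(forms))            # 30 nominal letters + 103 pre-scribed forms
print(" ".join(forms[:8]))   # the ka group
```

Listing position in `forms` can then serve directly as the relative primary order for this part of the table.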

A.2. The 4 explicit vowels:

◌ི ◌ུ ◌ེ ◌ོ

A.3. The 30 post-radicals:

◌ྐ ◌ྑ ◌ྒ ◌ྔ ◌ྕ ◌ྖ ◌ྗ ◌ྙ ◌ྟ ◌ྠ ◌ྡ ◌ྣ ◌ྤ ◌ྥ ◌ྦ ◌ྨ ◌ྩ ◌ྪ ◌ྫ ◌ྭ ◌ྮ ◌ྯ ◌ྰ ◌ྱ ◌ྲ ◌ླ ◌ྴ ◌ྶ ◌ྷ ◌ྸ
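As a rough illustration of how section A could drive sorting, the Python sketch below performs maximum-length matching against a deliberately partial table and compares primary weights only. The table contents (the ka and ta groups, the letter na, the 4 vowels, and the 4 native post-radicals), the weight numbering, and the names `tokenize` and `sort_key` are assumptions made for the example. It is stateless, so it does not resolve the prefix-versus-suffix ambiguities described in the key to A.1.

```python
# Sketch: longest-match tokenization into collation elements, primary level only.
# Assumption: a partial table in the order of section A; a real table would
# hold all 133 + 4 + 30 primary-weighted elements.

KA_GROUP = ["\u0F40", "\u0F51\u0F40", "\u0F56\u0F40", "\u0F62\u0F90",
            "\u0F63\u0F90", "\u0F66\u0F90", "\u0F56\u0F62\u0F90",
            "\u0F56\u0F66\u0F90"]                      # ka dka bka rka lka ska brka bska
TA_GROUP = ["\u0F4F", "\u0F42\u0F4F", "\u0F56\u0F4F", "\u0F62\u0F9F",
            "\u0F63\u0F9F", "\u0F66\u0F9F", "\u0F56\u0F62\u0F9F",
            "\u0F56\u0F63\u0F9F", "\u0F56\u0F66\u0F9F"]  # ta gta bta rta lta sta brta blta bsta
NA = ["\u0F53"]                                        # na (here used as a suffix)
VOWELS = ["\u0F72", "\u0F74", "\u0F7A", "\u0F7C"]      # i u e o
POST_RADICALS = ["\u0FAD", "\u0FB1", "\u0FB2", "\u0FB3"]  # subjoined wa ya ra la

# Primary weight = position in the ordered list: radical forms first,
# then explicit vowels, then post-radicals.
ORDERED = KA_GROUP + TA_GROUP + NA + VOWELS + POST_RADICALS
PRIMARY = {element: rank for rank, element in enumerate(ORDERED, start=1)}
MAX_LEN = max(len(element) for element in PRIMARY)

IGNORED = {"\u0F0B"}  # tsheg: punctuation carries no weight


def tokenize(text):
    """Split text into collation elements, always taking the longest match."""
    elements = []
    i = 0
    while i < len(text):
        if text[i] in IGNORED:
            i += 1
            continue
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in PRIMARY:
                elements.append(chunk)
                i += n
                break
        else:
            raise ValueError(f"no collation element for U+{ord(text[i]):04X}")
    return elements


def sort_key(text):
    """Primary-level sort key: the list of primary weights."""
    return [PRIMARY[element] for element in tokenize(text)]


WORDS = [
    "\u0F56\u0F66\u0F90\u0F53",              # bskan
    "\u0F40\u0F53",                          # kan
    "\u0F56\u0F62\u0F9F\u0F7A\u0F53\u0F0B",  # brten + tsheg
    "\u0F66\u0F90\u0F53",                    # skan
    "\u0F40\u0F7A\u0F53",                    # ken
    "\u0F56\u0F40\u0F53",                    # bkan
]

for word in sorted(WORDS, key=sort_key):
    print(word, sort_key(word))
```

With this table the four ka-based words stay together ahead of the ta-based word, and the form with an explicit vowel follows the same letter without one, because every vowel weight is heavier than every radical-form weight.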

B. Secondary Weighted Collation Elements (have no primary weight)

B.1. The 4 combining marks:

◌྄ ◌ཱ ◌༹ ◌ཿ

B.2. The 5 signs (used in transliteration):

྅ ྈ ྉ ྊ ྋ
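One way to picture "no primary weight" is a two-level key in which these 9 elements are skipped at the first level and contribute only at the second. The Python sketch below assumes a toy table with just two nominal letters and assigns the secondary weights in the listed order; the numeric values and the names `Weights`, `TABLE`, and `sort_key` are hypothetical, not taken from any published table.

```python
from typing import NamedTuple


class Weights(NamedTuple):
    primary: int    # 0 means "ignored at the primary level"
    secondary: int


COMMON = 1  # assumed default secondary weight for primary-weighted elements

# Assumption: a toy table with two nominal letters only.
TABLE = {
    "\u0F40": Weights(1, COMMON),  # ka
    "\u0F41": Weights(2, COMMON),  # kha
}

# The 9 secondary-weighted elements of section B, in listed order.
SECONDARY_ONLY = ["\u0F84", "\u0F71", "\u0F39", "\u0F7F",            # B.1
                  "\u0F85", "\u0F88", "\u0F89", "\u0F8A", "\u0F8B"]  # B.2
for offset, char in enumerate(SECONDARY_ONLY):
    TABLE[char] = Weights(0, COMMON + 1 + offset)


def sort_key(text, levels=2):
    """Build a key of one or two levels; level 1 skips zero primary weights."""
    weights = [TABLE[char] for char in text]
    primary = tuple(w.primary for w in weights if w.primary)
    if levels == 1:
        return (primary,)
    secondary = tuple(w.secondary for w in weights)
    return (primary, secondary)


ka = "\u0F40"
ka_long = "\u0F40\u0F71"   # ka + the combining mark U+0F71
kha = "\u0F41"

# Equal at the primary level, so a primary-strength search treats them alike.
print(sort_key(ka, levels=1) == sort_key(ka_long, levels=1))

# The secondary level only breaks the tie; it never overrides a primary difference.
print(sorted([kha, ka_long, ka], key=sort_key) == [ka, ka_long, kha])
```

Comparing keys built with `levels=1` corresponds to the lower-precision searching mentioned earlier; adding the second level restores the distinction.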

C. The 120 Remaining Unicode Tibetan Characters

The characters listed above (in items A and B) account for 73 of the 193 Tibetan characters defined in Unicode. This leaves 120 characters, of which 19 can be decomposed into simple elements and thus need not be treated in the collation element table. There is also no need to assign primary, secondary, or tertiary weights to the 61 characters that function as punctuation marks and other symbols, since these generally have no impact on dictionary sort order. [Note that "Syllable OM" at U+0F00 is here treated as an ornamental symbol rather than as having any lexical value, because no canonical or compatibility decomposition is specified for this character.]

The digits and half digits account for 20 further characters. The remaining 20 characters are variations (i.e., having both primary and tertiary weights) of certain of the 30 nominal letters, 4 vowels, and 30 subjoined post-initial letters listed previously.

9 Nominal letter variants:

ཊ ཋ ཌ ཎ ◌ཾ ◌ྂ ◌ྃ ར(fixed form) ཥ

3 Vowel variants:

◌ྀ ◌ཻ ◌ཽ

8 Subjoined letter variants:

◌ྚ ◌ྛ ◌ྜ ◌ྞ ◌ྺ ◌ྻ ◌ྼ ◌ྵ
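The character accounting in section C can be checked mechanically. The short Python sketch below only restates the counts given in this document; the dictionary labels and variable names are chosen for the example.

```python
# Sketch: verify that the counts quoted in sections A to C are consistent.
TOTAL_TIBETAN_CHARACTERS = 193  # as defined in Unicode at the time of writing

treated_in_a_and_b = {
    "nominal letters": 30,
    "explicit vowels": 4,
    "post-radical (subjoined) letters": 30,
    "secondary-weighted marks and signs": 9,
}

primary_weighted_remainder = {
    "decomposable characters": 19,
    "nominal letter variants": 9,
    "vowel variants": 3,
    "subjoined letter variants": 8,
    "digits and half-digits": 20,
}

unweighted_punctuation_and_symbols = 61

treated = sum(treated_in_a_and_b.values())
remaining = TOTAL_TIBETAN_CHARACTERS - treated
primary_weighted = sum(primary_weighted_remainder.values())
variants = 9 + 3 + 8

# Collation slots at the primary level: radical forms + vowels + post-radicals.
primary_slots = (30 + 103) + 4 + 30

print(treated, remaining, primary_weighted, variants, primary_slots)
print(primary_weighted + unweighted_punctuation_and_symbols == remaining)
```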