The Unicode Standard Version 3.0
ADDISON–WESLEY
An Imprint of Addison Wesley Longman, Inc. Reading, Massachusetts · Harlow, England · Menlo Park, California Berkeley, California · Don Mills, Ontario · Sydney Bonn · Amsterdam · Tokyo · Mexico City Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If these files have been purchased on computer-readable media, the sole remedy for any claim will be exchange of defective media within ninety days of receipt. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. Printed in the United States of America. Published simultaneously in Canada. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The camera-ready master of the text and code charts was printed on a Hewlett-Packard LaserJet 4000T. The Han radical-stroke index was printed by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. Cover motifs: Rosetta Stone (front), KangXi dictionary (back). CD-ROM: Phaistos Disk. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: www.awl.com/cseng/ 1 2 3 4 5 6 7 8 9—CRW—0403020100 First printing, January 2000. This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Preface
This book, The Unicode Standard, Version 3.0, is the authoritative source of information on the Unicode character encoding standard, the international character code for information processing that includes all major scripts of the world and is the foundation for develop- ment of software for worldwide use. As well as encoding characters used for written com- munication in a simple and consistent manner, the Unicode Standard defines character properties and algorithms for use in implementations. Version 3.0 expands on material from Versions 2.0 and 2.1 and supersedes all other previ- ous versions. The previous versions of the Unicode Standard are: • The Unicode Standard, Version 1.0, Volume 1 (1991) • The Unicode Standard, Version 1.0, Volume 2 (1992) • The Unicode Standard, Version 1.1, Unicode Technical Report #4 (1993) • The Unicode Standard, Version 2.0 (1996) • The Unicode Standard, Version 2.1, Unicode Technical Report #8 (1998) Major additions to Version 3.0 include: • conformance rules for transformation formats • new scripts including Ethiopic, Khmer, Mongolian, Myanmar, and Sinhala • restructured and enhanced character block descriptions • clarified bidirectional algorithm • updated implementation guidelines •a Shift-JIS index The Unicode Standard maintains consistency with the international standard ISO/IEC 10646. Version 3.0 of the Unicode Standard corresponds to ISO/IEC 10646-1:2000.
0.1 About the Unicode Standard This book defines Version 3.0 of the Unicode Standard. The general principles and archi- tecture of the Unicode Standard, requirements for conformance, and guidelines for imple- menters precede the actual coding information. Useful ancillary information is given in the appendices. The accompanying CD-ROM contains tables of use to implementers and all technical reports published to date.
Concepts, Architecture, Conformance, and Guidelines The first five chapters of Version 3.0 introduce the Unicode Standard and provide the information an engineer needs to produce a conforming implementation. Basic text pro- cessing, working with combining marks, encoding forms, and doing bidirectional text
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxv 0.1 About the Unicode Standard Preface layout are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode. Chapter 1 introduces the standard’s basic concepts, design basis, and coverage, and discusses basic text handling requirements. Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall char- acter properties, and the use of combining marks. Chapter 3 constitutes the formal statement of conformance. This chap- ter also presents the normative algorithms for three processes: the canonical ordering of combining marks, the encoding of Korean Hangul syllables by conjoining jamo, and the formatting of bidirectional text. Chapter 4 describes character properties in detail, both normative (required) and informative. Tables giving additional character property information appear on the CD-ROM. Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.
Character Block Descriptions Chapters 6 through 13 contain the character block descriptions that give basic information about each script or collection and may discuss specific characters or pertinent layout information. Chapter 6 describes the general punctuation characters. Chapter 7 presents the European Alphabetic scripts, including Latin, Greek, Cyrillic, Armenian, Georgian, Runic, Ogham, and associated combining marks. Chapter 8 presents the Middle Eastern, right-to-left scripts: Hebrew, Arabic, Syriac, and Thaana. Chapter 9 covers the South and Southeast Asian scripts, including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, Thai, Lao, Khmer, and Myan- mar. Chapter 10 presents the East Asian scripts, including Han, Hiragana, Katakana, Hangul, Bopomofo, and Yi. Chapter 11 presents other scripts, including Ethiopic, Cherokee, Cana- dian Aboriginal Syllabics, and Mongolian. Chapter 12 presents symbols, including currency, letterlike and technical symbols, and mathematical operators. Chapter 13 describes special characters such as the Private Use Area, sur- rogates, and specials.
Charts and Index The next two chapters document the Unicode Standard’s character code assignments, their names and important descriptive information, and Han indices that aid in locating specific ideographs encoded in Unicode. xxvi Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Preface 0.1 About the Unicode Standard
Chapter 14 gives the code charts and the Character Names List. The code charts contain the normative character encoding assignments, and the names list contains normative information as well as useful cross refer- ences and informational notes. Chapter 15 provides a radical-stroke index to East Asian ideographs, as well as a Shift-JIS index.
Appendices and Tables The appendices contain detailed background information on important topics: character encoding systems, submission of proposals, and the history of Unicode and its relationship to ISO/IEC 10646. Appendix A describes the history of Han Unification in the Unicode Standard. Appendix B gives instructions on how to submit characters for consider- ation as additions to the Unicode Standard. Appendix C details the relationship between the Unicode Standard and ISO/IEC 10646. Appendix D lists the changes to the Unicode Standard since Version 2.0. The appendices are followed by a glossary of terms, a bibliography, and two indices: an index to Unicode characters and an index to the text of Chapters 1 through 15.
The Unicode Character Database and Technical Reports The Unicode Character Database is the name for a collection of files that contain character code values, character names, and character property data. It is described more fully in the file UnicodeCharacterDatabase.html. Version 3.0.0 of the database is provided on the accompanying CD-ROM. Updates and revisions will be made available online. For infor- mation on the latest available version see: http://www.unicode.org/unicode/standard/versions/ The following Unicode Technical Reports are formally part of this standard: • UTR #11: East Asian Width, Version 5.0 • UTR #13: Unicode Newline Guidelines, Version 5.0 • UTR #14: Line Breaking Properties, Version 6.0 • UTR #15: Unicode Normalization Forms, Version 18.0 The latest available version of these reports is provided on the CD-ROM. Updates and revi- sions will be made available online. For information on the latest available version, see http://www.unicode.org/unicode/standard/versions/.
On the CD-ROM The CD-ROM contains the Unicode Character Database, which gives character codes, char- acter names, character properties, and decompositions for decomposable or compatibility characters. In addition to the Unicode Character Database and Unicode Technical Reports that are part of this standard, the CD-ROM also contains additional technical reports (cov- ering topics such as compression, collation, and transformation formats), as well as property-based mapping tables (for example, tables for case) and transcoding tables for
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxvii 0.2 Notational Conventions Preface international, national, and industry character sets (including the Han cross-reference table). For the complete contents of the CD-ROM, see its READ ME file. Please consult the Unicode Consortium’s online resources (see Section 0.3, Resources) to obtain the most up- to-date versions of the materials on the CD-ROM.
0.2 Notational Conventions Throughout this book, certain typographic conventions are used. In running text, an indi- vidual Unicode value is expressed as U+nnnn, where nnnn is a four-digit number in hexa- decimal notation, using the digits 0–9 and the letters A–F (for 10 through 15, respectively). • U+0416 is the Unicode value for the character named - . In tables, the U+ may be omitted for brevity. A range of Unicode values is expressed as U+xxxx→U+yyyy, or U+xxxx—U+yyyy, or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the arrow, long dash, or two dots indicate a contiguous range inclusive of the endpoints. • The range U+0900→U+097F contains 128 character values. All Unicode characters have unique names, which are identical to those of the English- language edition of International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A through Z, digits, space, and hyphen-minus; this convention makes it easy to generate computer-language identifiers automatically from the names. Unified East Asian ideographs are named -X, where X is replaced with the hexadecimal Unicode value—for example, - 4E00. The names of Hangul syllables are generated algorithmically; for details, see Hangul Syllable Names in Section 3.11, Conjoining Jamo Behavior. In running text, a formal Unicode name is shown in small capitals (for example, ), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a foreign word (for example, the Welsh word ynghyd). Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/. The symbols used in the character names list are described at the beginning of Chapter 14, Code Charts. In the text of this book, the word “Unicode” when used alone as a noun refers to the Uni- code Standard. In this book, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
Extended BNF The Unicode Standard and technical reports use an extended BNF format for describing syntax. As different conventions are used for BNF, Table 0-1, Extended BNF, lists the nota- tion used here. A sequence of characters is sometimes listed in text with angle brackets, such as or . Character Classes. A character class is constructed from one or two base sets. It is either a single base set, the negation of a base set, or the (set) difference between two base sets. The
xxviii Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Preface 0.2 Notational Conventions
Table 0-1. Extended BNF
Symbols Meaning x := ... production rule x y the sequence consisting of x then y x* zero or more occurrences of x x? zero or one occurrence of x x+ one or more occurrences of x x | y either x or y ( x ) for grouping x || y equivalent to (x | y | (x y)) { x } equivalent to (x)? "abc" string literals ( “_” is sometimes used to denote space for clarity) ’abc’ string literals (alternative form) \u1234 Unicode characters within string literals or character classes \v00101234 Unicode scalar values within string literals or character classes U+HHHH Unicode character literal: equivalent to ‘\uHHHH’ U-HHHHHHHH Unicode character literal: equivalent to ‘\vHHHHHHHH’ charClass character class (syntax below) base sets themselves are bounded by brackets, and contain lists of characters, ranges of characters, general categories, or negations of general categories. The syntax follows: charClass := baseSet | '¬' baseSet | baseSet '-' baseSet baseSet := '[' item (','? item)* ']' item := char | char '-' char | '{' '¬'? category '}' General categories are defined in Chapter 4, Character Properties, such as [{Uppercase Let- ter}] for uppercase letter. Main categories such as [{Mark}] are the equivalent of a list of multiple subcategories: [{Non-Spacing Mark}{Spacing Combining Mark}{Enclosing Mark}]. Examples are found in Table 0-2, Character Class Examples. Table 0-2. Character Class Examples
Syntax Matches [a-z] English lowercase letters [a-z]-[c] English lowercase letters except for c ¬[c] all characters but c [0-9] European decimal digits [\u0030-\u0039] (same as above, using Unicode escapes) [0-9,A-F,a-f] hexadecimal digits [{Letter},{Non-Spacing Mark}] all letters and non-spacing marks [{L},{Mn}] (same as above, using abbreviated notation) [{¬Cn}] all assigned Unicode characters [\u0600-\u06FF]-[{Cn}] all assigned Arabic characters
Operators Operators used in this standard are listed in Table 0-3, Operators.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. xxix 0.3 Resources Preface
Table 0-3. Operators ÷ allow break here (see Section 5.15, Locating Text Element Boundaries) × do not allow a break here ò is transformed to, or behaves like / integer division (rounded down) % modulo operation; equivalent to the integer remainder for positive numbers.
0.3 Resources The Unicode Consortium provides a number of online resources for obtaining informa- tion and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.
Unicode Web Site • http://www.unicode.org
Unicode Anonymous FTP Site • ftp://ftp.unicode.org
Unicode Public Mailing List • [email protected] Subscription instructions for the public mailing list are posted on the Unicode Web site.
How to Contact the Unicode Consortium Contact the Consortium for membership information and to order publications (includ- ing additional copies of this book). • Electronic mail address: [email protected] • Postal address: P.O. Box 391476 Mountain View, CA 94039-1476 USA Please check the Web site for up-to-date contact information, including telephone, fax, and courier delivery address.
xxx Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Acknowledgments
The production of The Unicode Standard, Version 3.0, is due to the dedication of many individuals for over a decade. We would like to acknowledge those who were involved in Version 3.0, and particularly the following individuals, whose major contributions were central to the design, authorship, and review of this book. Ken Whistler has been the driving force behind Version 3.0. As managing editor, he had responsibility for all aspects of production; in particular, he designed the new chapter structure. Ken meticulously updated the Unicode Character Database, including adding all the new characters and their properties. He also maintained the Character Names List and supplied many of the annotations. Ken co-authored three Unicode Technical Reports, and contributed to the descriptions for many of the new scripts, such as Myanmar and Khmer. He was responsible for the vast majority of the original character block descriptions and made major contributions to the encoding of Indic scripts, IPA, and symbols. Julie D. Allen coordinated the editorial review meetings. As editor, she kept track of edits and text from multiple sources and in various formats, and formatted the text in FrameMaker. She worked with the authors to improve the wording and clarity of the text, and also helped resolve font problems resulting from platform conversion. Joan Aliprand chaired the Unicode Technical Committee while Version 3.0 was under development. She is responsible for the front matter, Glossary, References, and General Index. She also contributed to Middle Eastern Scripts. Joe Becker is acknowledged for his initial work and promotion of a 16-bit full encoding and his overall contributions to the creation of the Unicode Standard and editing of this book. Mark Davis has been essential to the development of Version 3.0. Mark has led the way in many aspects of overall design and implementation of the Unicode Standard. He has been the chief architect for the definition of conformance and the bidirectional algorithm. He contributed significant revisions and enhancements to Implementation Guidelines, Bidirec- tional Behavior, Syriac, and the Unicode Character Database. Five of the Unicode Technical Reports are Mark’s work, and he co-wrote two others. His previous contributions include algorithms for canonical equivalence, conjoining Hangul, shaping for specific scripts, and the development of the Unicode Character Database. Michael Everson collected or created many of the fonts used in this standard, and provided typographical review and assistance. He also reviewed the code charts and Character Names List, and provided many annotations. Block descriptions for several of the new scripts are based on drafts prepared by Michael. Asmus Freytag was responsible for the chapters Punctuation, Symbols, and Special Areas and Format Characters. He made significant contributions to Syriac, Bidirectional Behavior, and to the chapters General Structure, European Alphabetic Scripts, and Implementation Guidelines. Asmus created custom formatting software, negotiated with the font houses for donated fonts, and oversaw production of the code charts. He also created the Shift-JIS Index, and he wrote two Unicode Technical Reports and co-authored another. Previously, Asmus contributed substantial text on implementation aspects of the Unicode Standard.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. iii Acknowledgments
John H. Jenkins was responsible for East Asian Scripts and contributed substantial new material for that chapter. He maintained and extended the Unihan database, and created the Han Radical-Stroke Index to the ideographic content of the standard. John converted all of the Version 2.0 illustrations to Adobe Illustrator format. He created the fonts for the KangXi and CJK radicals. In addition, he collected fonts from many sources and fine-tuned them for use. John was responsible for the original Han cross-reference tables. Mike Ksar led the effort to synchronize Version 3.0 and the second edition of ISO/IEC 10646-1. He contributed to Middle Eastern Scripts, as well as to Appendix C and Appendix D. Mike also thoroughly reviewed the code charts and the text. Rick McGowan coordinated work on new scripts, and was involved with many of the sub- stantial additions published in Version 3.0. He organized a group of Tibetan experts who developed the extensions for Tibetan script. Rick contributed to the sections on Ogham, Thaana, Tibetan, and Canadian Aboriginal Syllabics. He was responsible for mastering and producing the CD-ROM. Rick contributed to development of the original algorithm for canonical equivalence. Lisa Moore rewrote many parts of the text, including the Preface and Introduction, the chapter General Structure, and Appendix D. She wrote the section on Cherokee, and contrib- uted to the explanations on encoding forms and transformation formats. As current Chair of the Unicode Technical Committee, Lisa oversaw the finalization of the content of Ver- sion 3.0. She previously contributed substantially to character properties. Michel Suignard painstakingly recoded and converted fonts for use in this book. He also contributed to several Unicode Technical Reports. Michel previously contributed to the development of the Unicode Character Database and mapping tables. Fonts were essential for the production of this book. In addition to the individuals men- tioned above and the companies and organizations named in the colophon, fonts were contributed by Michael Stone (Armenian), Al Webster (Cherokee), Oliver Corff (Yi), Yan- nis Haralambous (Syriac), and George Kiraz and Paul Nelson (Syriac). John M. Fiscella designed the fonts for symbols and many of the alphabetic scripts. Thomas Milo designed the Arabic font. The font for CJK Extension A was designed by Technical Supervisor Zheng Long and Hua Weicang. Cora Chang created the font for Braille patterns. Critical com- ments and work on fonts are due to Heidi Jenkins, and Dr. Virach Sornlertlamvanich (for Thai). Ralph Buchalter transferred the Version 2.0 master files to the Windows environment, and implemented the new chapter structure. Phil Huston provided FrameMaker expertise. Steve Mehallo designed the interchapter separators, incorporating examples of writing from numerous sources. Kamal Mansour (Monotype) initiated and coordinated this graphic design process. Both the text and the code charts were reviewed critically by experts. Edwin Hart and Bruce Paterson were principal reviewers of this book. The Editorial Committee appreciates the expert feedback on the code charts provided by Keiko Nagano and Tetsuji Orita of IBM Japan, Hugh McGregor Ross, and, for specific scripts, Yannis Haralambous (Greek and Thai), Constantine Stathopoulos (Greek), and Dr. Virach Sornlertlamvanich (Thai). Bar- bara Beeton of the American Mathematical Society reviewed the math symbols, and Donald E. Knuth gave additional input. Cora Chang thoroughly checked the data for CJK ideographs. Mark Davis and Laura Werner did extensive work to verify the consistency of character properties in the Unicode Character Database. Shen Fucheng of RLG helped with the bibliography. The development of this book would not have been possible without the support of the office staff of Unicode, Inc., and the hard work of Mike Kernaghan, as operational manager of the office. We thank Magda Danish, Dee Domingo, Steve Greenfield, Cheryl Morton, iv Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Acknowledgments and Julia Oesterle, all of whom assisted with Version 3.0 in numerous ways. Sarasvati minded the mailing lists. The technical content of the Unicode Standard is determined by the Unicode Technical Committee. Contributors to the work of the UTC include representatives of Full and Asso- ciate Members, Specialist and Individual Members, and Unicode Officers, as well as invited experts and liaisons. Version 3.0 would not have been possible without their creative work and critical thinking over the past three years, augmented by the support of all member representatives: Dean Abramson (Gamma), Glenn Adams (Spyglass), Joan Aliprand (RLG), Joshua Arai (Justsystem), Joe Becker (Xerox), Barbara Beeton (American Mathe- matical Society), Chris Boyle (Novell), Stefan Buchta (Oracle), Richard C. Buhler (Novell), Don Carroll (Hewlett-Packard), Steve Cohen (Basis Technology), Lee Collins (Apple), Ken Cortese (Booz, Allen), David O. Craig (NCR), Frank da Cruz (Columbia University), Mark Davis (Unicode President), Martin Dürst (W3C), Peter Edberg (Apple), John M. Fiscella (Production First), Asmus Freytag (Unicode Vice-President), Deborah Goldsmith (Apple), Edwin F. Hart (SHARE), Hideki Hiura (Sun), Paul Hoffman (Internet Mail Consortium), Lloyd Honomichl (Novell), John Jenkins (Apple), Nailong Jiang (Quark), Robert Keeble (Quark), Susan Kline (Hewlett-Packard), Tatsuo Kobayashi (Justsystem), Mike Ksar (Hewlett-Packard), Michael Kung (Oracle, then Microsoft), Wai Man Long (Compaq), Kamal Mansour (Monotype), Benson Margulies (Basis Technology), Gianni Mariani (SGI), Rick McGowan (NeXT, then Apple), Mike McKenna (Sybase), Thomas Milo (DecoType), Matthias Mittelstein (SAP), Lisa Moore (IBM), Nobuyoshi Mori (SAP), Rudolf Munz (SAP), Brendan Murray (Lotus), Nelson Ng (Oracle), Thomas Nielsen (MGI), Cleo O’Brien (SGI), Sandra Martin O’Donnell (Compaq), Stephen P. Oksala (Uni- sys), Shripad Patki (Sun), Nathan Potechin (MGI), Mirko Raner (MATHEMA), Wendy Rannenberg (Compaq), Gary Roberts (NCR), Murray Sargent (Microsoft), Bernhard Schilling (SAP), Karen Smith-Yoshimura (RLG), Neil Soiffer (Wolfram Research), Mark Son-Bell (Sun), Michel Suignard (Microsoft), Tex Texin (Progress Software), V. S. Umama- heswaran (IBM), Marc von Holzen (SGI), Ken Whistler (Sybase), Michael Wiedeking (MATHEMA), Richard A. Willis (Reuters), Arnold F. Winkler (Unisys), Bill Wolf (Booz, Allen), Misha Wolf (Reuters), Eugene Wong (Hewlett-Packard), Jianping Yang (Oracle), Michael Yau (Oracle), Douglas-Val Ziegler (Xerox). A significant part of the UTC’s work was devoted to refining the Unicode Bidirectional Algorithm. The task was carried out by the Bidirectional Subcommittee, consisting of Asmus Freytag (ASMUS, Inc.), Mike Ksar (Hewlett-Packard), Mati Allouche, Mark Davis, Doug Felt, Israel Gidali, Lisa Moore (all from IBM), F. Avery Bishop, David Brown, John McConnell, Paul Nelson (all from Microsoft), and Jony Rosenne (QSM Programming). The reference implementations on the CD-ROM were developed by Doug Felt and Asmus Freytag. The Arabic Ad Hoc group reviewed the existing Arabic script repertoire and decomposi- tions. Members of the Ad Hoc group were as follows: Joe Becker (Xerox), Asmus Freytag (ASMUS, Inc.), Mike Ksar (Hewlett-Packard), Kamal Mansour (Monotype), Thomas Milo (DecoType), Khaled Sherif (IBM), Ken Whistler (Sybase). The Unicode Technical Committee has worked closely with Technical Committee L2 of the National Committee for Information Technology Standards (NCITS). We appreciate the cooperation of Chair Arnold Winkler and Vice-Chairs Joan Aliprand and Lisa Moore. Arnold Winkler established and efficiently maintained the Web-based archive of L2 docu- ments that was crucial to the development of Version 3.0. The growth of the synchronized character repertoires of the Unicode Standard and Inter- national Standard ISO/IEC 10646 reflects a worldwide effort conducted over many years. With Version 3.0, the Unicode Standard encodes all of the major modern scripts of the
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. v Acknowledgments world. We express deep appreciation to the following experts who shared their specialized knowledge to bring about this achievement: • For Runic: the ISORUNES project of the Swedish Central Board of National Antiquities (Helmer Gustavson, leader, Olle Järnefors, James Knirk, Svante Lagman, Ray Page, Marie Stoklund, and Klaus Düwel). • For Ogham: Michael Everson and Marion Gunn. • For Syriac: George Kiraz and Paul Nelson, who provided the draft block intro- duction for Syriac with the assistance of their colleague Sargon Hasso. Also those who advised them: Dr. Sebastian Brock (Oxford University), Dr. J. Coak- ley (Harvard University), Dr. Jan Wilson (Brigham Young University), and Dr. Edward Odisho (North Eastern Illinois University). • For Thaana: the Maldivian Students’ Association and Husine Zahid. • For Sinhala: Lloyd Anderson, Lee Collins, Camillus Jayewardena (University of Ruhuna, Matara, Sri Lanka), Vasantha Karunaratne, Hugh McGregor Ross, and Mettavihari (IBRIC—International Buddhist Research & Information Centre, Sri Lanka). • For Tibetan extensions: Lloyd Anderson, Robert Chilton, Michael Everson, Christopher Fynn, Peter Lofting, Sam Sirlin, Valeriy E. Ushakov, and especially Tony Duff, who drafted the block introduction for Tibetan. • For Myanmar: the Myanmar Information Technology Standardization Com- mittee (represented by Dr. Aung Maw, Khin Maung Lwin, Dr. Kyaw Thien, Thaung Tin, and Thien Htut), and also F. Avery Bishop, Lee Collins, Andy Daniels, Zaw Htut, and Hugh McGregor Ross. • For Khmer: Maurice Bauhahn, F. Avery Bishop, Andy Daniels, Anshuman Pan- dey, and Cathy Wissink. • For Ethiopic: the Committee for the Standardization of Ethiopian Script (Abass Alamnehe, Fesseha Atlaw, Yitna Firdyiwek, Yonas Fisseha, Samuel Kinde, Ter- refe Ras-Work, and especially Daniel Yacob). • For Cherokee: Charles Gourd of the Cherokee Nation of Oklahoma. • For Canadian Aboriginal Syllabics: the Canadian Aboriginal Syllabic Encoding Committee, as well as Alain LaBonté, Henry Leperlier, and especially Dirk Ver- meulen. • For Mongolian: Chen Zhuang, Professor Choijingzhab, Oliver Corff, Myatav Erdenchimeg, Ji Mu Yan, Yumbayariin Namsarai, and, in particular, Richard Moore, who drafted the block introduction for Mongolian. • For the ideographs of Vertical Extension A, the Kangxi Radicals, and the CJK Radicals Supplement: all of the members of the IRG. Major contributors included Zhang Zhoucai (rapporteur), Wang Xiaoming, Chen Zhuang, Fu Yonghe, Bao Huiyun, Zheng Long, Liu Jiayu; Takayuki Sato, Satoshi Yamamoto, Tateo Koike, Eiji Matsuoka, Tatsuo Kobayashi; Lee Joonsuk, Lee Jaehoon, Suh Kyung Ho; Ngo Trung Viet, Hong Nguyen Quang; Lu Qin, Lee Kin Hong; C. C. Hsu, Emily Yu-Chi Hsu, T. C. Kao; Michael Kung, Hideki Hiura, and John Jen- kins. • For advice on Hebrew: Dr. Bella Hass Weinberg. • For advice on Indic scripts: James Agenbroad, Jeroen Hellingman, and Antoine Lecault. vi Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Acknowledgments
• For assistance with character properties of Thai: Ranat Thopunya and Suwit Srivilairith (IBM Thailand). • For contributions to the encoding of North American scripts: Eric Brunner. The Unicode Standard, Version 3.0, would not have been possible without those who made important contributions to earlier versions: Glenn Adams for his work on the character/ glyph model, Devanagari shaping behavior, and the definition of conformance, and for the initial glossary; Lee Collins, for his painstaking work in the integration and extension of mappings from many different sources to create the original Unicode Han unification; Edwin Hart for his work on the character/glyph model; Isai Scheinberg, for his pivotal work in guiding the Unicode Standard through the intricacies of the ISO standards devel- opment process and his contributions to the bidirectional algorithm; and Layne Cannon and Stuart Smith for the Byte Order Mark. Past contributors to the development of the character databases and mapping tables include K. D. Chang, Tim Greenwood, Lori Hoerth, and Liao Huan-Mei. Other significant contributors to earlier versions were John Bennett, Joe Bosurgi, Andy Daniels, Burwell Davis, Bill English, Masami Hasegawa, Greg Hullander, Mike Kernaghan, Eric Mader, Dave Opstad, Bill Tuthill, J. G. Van Stee, and Glenn Wright. The Unicode Consortium continues to maintain mutually beneficial relationships with international standards organizations. We appreciate the efforts and support of the mem- bers of ISO/IEC JTC1/SC2/WG2 and the members of the Ideographic Rapporteur Group toward the common goal of keeping both standards synchronized. We would particularly like to thank the Convener of WG2, Mike Ksar, the Rapporteur of the IRG, Zhang Zhoucai, and WG2’s Editors and Contributing Editors, especially Bruce Paterson, Michel Suignard, and Michael Everson. We also thank Asmus Freytag for his effective representation of the Unicode Consortium at WG2 meetings. We would like to thank the members of ISO/IEC JTC1/SC22/WG20, especially Alain LaBonté and the Convener, Arnold F. Winkler, for their work with the Consortium on a common collation algorithm and properties related to identifiers. The support of member companies has been crucial to The Unicode Standard, Version 3.0. Particular thanks for facilities, equipment, and resources are owed to Adobe Systems, Inc., Apple Computer, Inc., Hewlett-Packard Company, IBM Corporation, Microsoft Corpora- tion, and Monotype Typography, Inc. While we gratefully acknowledge the contributions of all persons named in this section, any errors or omissions in this work are the responsibility of the Unicode Consortium.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. vii Unicode Consortium Members and Directors
Full Members While Version 3.0 of the Unicode Standard was under development, the following compa- nies were Full Members of the Unicode Consortium: Apple Computer, Inc. NeXT Software, Inc. Basis Technology Corporation Novell, Inc. Booz, Allen & Hamilton, Inc. Oracle Corporation Compaq Computer Corporation The Research Libraries Group, Inc. Gamma Productions, Inc. Reuters, Ltd. Hewlett-Packard Company SAP AG IBM Corporation Silicon Graphics, Inc. Justsystem Corporation Spyglass, Inc. MATHEMA Software GmbH Sun Microsystems, Inc. MGI Software Corporation Sybase, Inc. Microsoft Corporation Unisys Corporation NCR Corporation Xerox Corporation
Current Associate Members Adobe Systems, Inc. Libronix Corporation Arphic Technology Co. Ltd. Logos Research Systems, Inc. ASMUS, Inc. Lycos, Inc. Automated Wagering International, Inc. Monotype Typography, Inc. Beijing Zhong Yi Electronics Co. Netscape Communications Corporation Bitstream, Inc. OCLC, Inc. BMC Software, Inc. OneRealm, Inc. The Church of Jesus Christ of Latter-day Production First Software Saints Progress Software Corporation Columbia University QUALCOMM Incorporated Data Research Associates Quark, Inc. DecoType, Inc. Rogue Wave Software Endeavor Information Systems, Inc. The Royal Library, Sweden Ericsson Mobile Communications RWS Group, LLC eTranslate, Inc. SHARE, Inc. GE Information Services, Inc. Siebel Systems, Inc. The Getty Information Institute Software AG Global System for Sustainable Development Software Engineering Australia, Ltd. Government of Tamil Nadu, India StarTV—Satellite Television Asia Region Hong Kong Telecom CSL Ltd. Innovative Interfaces, Inc. Symbian, Ltd. Internet Mail Consortium Uniscape, Inc. Language Analysis Systems, Inc. VTLS, Inc.
viii Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Current Liaison Members Association for Font Information Interchange Korea, Kongju National Library, Chung-nam (AFII) Viet Nam, Technical Committee on Information China, Center of Computer and Information Technology (TCVN/TC1), Hanoi Development (CCID), Beijing World Wide Web Consortium (W3C) I18N The Internet Engineering Task Force (IETF) Working Group
Current Specialist Members James Agenbroad, Rolf Bourges, G. David Butler II, James Caldwell, Peter Constable, Tony Graham, Stephen Holmes, Kent Karlsson
Current Individual Members Lloyd Anderson, Stephanos Antoniades, Dr. Yves Arrouye, Stella Ayzinger, Hasan Bateson, Charles Bishop, Philip Blair, Brad Bodkin, François-Paul Briand, Brian M. Carr, Shang-Che Cheng, Marco Cimarosti, John Dlugosz, George Foot, Richard A. Gard, Dr. Harry Gaylord, Dan Glisson, Jennifer Goodman, Tim Greenwood, Phillip H. Griffin, Bob Hallisy, James Harlow, Nyr Indictor, Jens K. Jensen, Bohdan Kantor, James Kass, Tim Kimball, W. J. Kurmey, Rob Latousek, Kevin Leake, Jonathan Lewis, Giovanni Lussu, Pankaj Madia, Dr. Karnati Mallesham, Greg Melahn, Dr. Tag Young Moon, Robert J. Morris, Victor Nguyen, Paul Pavlik, Phan Phan, Gabriel Plumlee, Christopher Pullen, Patrick P. Reilly, Daniel Rybowski, Tetsuo Seto, CathyAnn Swindlehurst, Lang-Fen Hu Tang, Mituhiko Toho, Tom Twiddy, L. Segond von Banchet, Joan Winters
Current Members of the Board of Directors Wayne R. Boyle (NCR Corporation) Brian E. Carpenter (IBM Corporation & Internet Architecture Board) Kevin Cavanaugh (Lotus Development Corporation) Tatsuo L. Kobayashi (Scholex Company, Ltd.) Mike Ksar (Hewlett-Packard Company) Elizabeth G. Nichols (IBM Corporation) Wendy Rannenberg (Compaq Computer Corporation) Franz G. Rau (Microsoft Corporation) David R. Richards (The Research Libraries Group, Inc.)
Former Members of the Board of Directors Jerry Barber (Aldus Corporation) Paul Maritz (Microsoft Corporation) Janet Buschert (Hewlett-Packard Company) Stephen P. Oksala (Unisys Corporation) Robert M. Carr (Go Corporation) Mike Potel (Taligent, Inc.) John Gage (Sun Microsystems, Inc.) Rick Spitz (Apple Computer, Inc.) Paul Hegarty (NeXT Software, Inc.) Lawrence Tesler (Apple Computer, Inc.) Gary C. Hendrix (Symantec Corporation) Guy “Bud” Tribble (NeXT Software, Inc.) Richard J. Holleman (IBM Corporation) Kazuya Watanabe (Novell KK) Charles Irby (Metaphor, Inc.) Gayn B. Winters (Digital Equipment Corpora- Jay E. Israel (Novell, Inc.) tion) Ilene H. Lang (Digital Equipment Corporation)
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. ix x Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 1 Introduction
The Unicode Standard is the universal character encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. As the default encoding of HTML and XML, the Unicode Standard provides a sound underpinning for the World Wide Web and new methods of business in a networked world. Required in new Internet protocols and implemented in all modern operating systems and computer languages such as Java, Unicode is the basis of software that must function all around the world.
With Unicode, the information technology industry gains data stability instead of proliferating character sets; greater global interoperability and data interchange; and simplified software and reduced development costs.
While modeled on the ASCII character set, the Unicode Standard goes far beyond ASCII's limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world--more than 1 million characters can be encoded. No escape sequence or control code is required to specify any character in any language. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility (see Figure 1-1, Wide ASCII ).
Figure 1-1. Wide ASCII
The Unicode Standard specifies a numeric value and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward. In addition to character codes and names, other information is crucial to ensure legible text: a character's case, directionality, and alphabetic properties must be well defined. The Unicode Standard defines this and other semantic information, and includes application data such as case mapping tables and mappings to the repertoires of international, national, and industry character sets. The Unicode Consortium provides this additional information to ensure consistency in the implementation and interchange of Unicode data.
Unicode provides for two encoding forms: a default 16-bit form and a byte-oriented form called UTF-8 that has been designed for ease of use with existing ASCII-based systems. The Unicode Standard, Version 3.0, is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646.
Using a 16-bit encoding means that code values are available for more than 65,000 characters. While this number is sufficient for coding the characters used in the major languages of the world, the Unicode Standard and ISO/IEC 10646 provide the UTF-16 extension mechanism (called surrogates in the Unicode Standard), which allows for the encoding of as many as 1 million additional characters without any use of escape codes. This capacity is sufficient for all known character encoding requirements, including full coverage of all historic scripts of the world.
1.1 Coverage
The Unicode Standard, Version 3.0, contains 49,194 characters from the world's scripts. These characters are more than sufficient not only for modern communication, but also for the classical forms of many languages. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia. The unified Han subset contains 27,484 ideographic characters defined by national and industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore. In addition, the Unicode Standard includes punctuation marks, mathematical symbols, technical symbols, geometric shapes, and dingbats.
Many new scripts and characters have been added in Version 3.0, including Ethiopic, Canadian Aboriginal Syllabics, Cherokee, Sinhala, Syriac, Myanmar, Khmer, Mongolian, Braille, and additional ideographs. Overall character allocation and code ranges are detailed in Chapter 2, General Structure .
Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel, rarely exchanged, or private-use characters, nor does it encode logos or graphics. Graphologies unrelated to text, such as dance notations, are likewise outside the scope of the Unicode Standard. Font variants are explicitly not encoded. The Unicode Standard reserves 6,400 code values in the basic 16-bit encoding for the Private Use Area, which may be used to assign codes to characters not included in the repertoire of the Unicode Standard.
There are 7,827 unused code values, which will permit future expansion in the basic encoding space. Provision has also been made for another 917,476 code points through the use of surrogate pairs. Surrogate pairs make another 131,068 private-use code points available should 6,400 prove insufficient for particular applications.
Standards Coverage
The Unicode Standard is a superset of all characters in widespread use today. It contains the characters from major international and national standards as well as prominent industry character sets. For example, Unicode incorporates the ISO/IEC 6937 and ISO/IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426. Important national standards contained within Unicode include ANSI Z39.64, KS C 5601, JIS X 0209, JIS X 0212, GB 2312, and CNS 11643. Industry code pages and character sets from Adobe, Apple, Fujitsu, Hewlett-Packard, IBM, Lotus, Microsoft, NEC, and Xerox are fully represented as well.
For a complete list of ISO and national standards used as sources, see the References section of The Unicode Standard, Version 3.0.
New Characters
The Unicode Standard continues to respond to new and changing industry demands by encoding important new characters. As an example, when the need to support the euro sign arose, The Unicode Standard, Version 2.1, with euro support was issued to ensure a conforming version.
As the universal character encoding scheme, the Unicode Standard must also respond to scholarly needs. To preserve world cultural heritage, important archaic scripts are encoded as proposals are developed.
For instructions on how to submit proposals for new characters to the Unicode Consortium, see Submitting New Characters.
1.2 Design Basis
The primary goal of the development effort for the Unicode Standard was to remedy two serious problems common to most multilingual computer programs. The first problem was the overloading of the font mechanism when encoding characters. Fonts have often been indiscriminately mapped to the same set of bytes. For example, the bytes 0x00 to 0xFF are often used for both characters and dingbats. The second major problem was the use of multiple, inconsistent character codes because of conflicting national and industry character standards. In Western European software environments, for example, one often finds confusion between the Windows Latin 1 code page 1252 and ISO/IEC 8859-1. In software for East Asian ideographs, the same set of bytes used for ASCII may also be used as the second byte of a double-byte character. In these situations, software must be able to distinguish between ASCII and double-byte characters.
The ASCII 7-bit code space and its 8-bit extensions, although used in most computing systems, are limited to 128 and 256 code positions, respectively. These 7- and 8-bit code spaces are inefficient and completely inadequate in the global computing environment.
When the Unicode project began in 1988, the groups most affected by the lack of a consistent international character standard included publishers of scientific and mathematical software, newspaper and book publishers, bibliographic information services, and academic researchers. More recently, the computer industry has adopted an increasingly global outlook, building international software that can be easily adapted to meet the needs of particular locations and cultures. The explosive growth of the Internet has merely added to the demand for a character set standard that can be used all over the world.
The designers of the Unicode Standard envisioned a uniform method of character identification that would be more efficient and flexible than previous encoding systems. The new system would satisfy the needs of technical and multilingual computing and would encode a broad range of characters for professional-quality typesetting and desktop publishing worldwide.
The Unicode Standard was designed to be:
• Universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange, including those in major international, national, and industry character sets. • Efficient. Plain text is simple to parse: software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. • Uniform. A fixed character code allows for efficient sorting, searching, display, and editing of text. • Unambiguous. Any given 16-bit value always represents the same character.
Figure 1-2, Universal, Efficient, and Unambiguous demonstrates some of these features, contrasting the Unicode encoding with mixtures of single-byte character sets with escape sequences to shift the meanings of bytes.
Figure 1-2. Universal, Efficient, and Unambiguous
1.3 Text Handling
Computer text handling involves processing and encoding. When a word processor user types in text via a keyboard, the computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054. The word processor stores the number in memory and also passes it on to the display software responsible for putting the character on the screen. This display software, which may be a windows manager or part of the word processor itself, then uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.
The Unicode Standard directly addresses only the encoding and semantics of text and not any other actions performed on the text. In the preceding scenario, the word processor might check the typist's input after it has been encoded to look for misspelled words, and then highlight any errors it finds. Alternatively, the word processor might insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that the standard does not specify how to carry out these processes as long as the character encoding and decoding is performed properly and the character semantics are maintained.
Interpreting Characters
The difference between identifying a code value and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code value is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper, called a glyph, is a visual representation of the character.
The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. Ultimately, the software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.
Text Elements
The successful encoding, processing, and interpretation of text requires appropriate definition of useful elements of text and the basic rules for interpreting text. The definition of text elements often changes depending on the process handling the text. For example, when searching for a particular word or character written with the Latin script, one often wishes to ignore differences of case. However, correct spelling within a document requires case sensitivity.
The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements of encoding, called code values. A code value, commonly called a character, is fundamental and useful for computer text processing. For the most part, code values correspond to the most commonly used text elements.
1.4 The Unicode Standard and ISO/IEC 10646
The Unicode Standard is fully compatible with the international standard ISO/IEC 10646-1:2000, Information Technology--Universal Multiple-Octet Coded Character Set (UCS)--Part 1: Architecture and Basic Multilingual Plane, which is also known as the Universal Character Set (UCS). During 1991, the Unicode Consortium and the International Organization for Standardization (ISO) recognized that a single, universal character code was highly desirable. A formal convergence of the two standards was negotiated, and their repertoires were merged into a single character encoding in January 1992. Since then, close cooperation and formal liaison between the committees have ensured that all additions to either standard are coordinated and kept synchronized, so that the two standards maintain exactly the same character repertoire and encoding.
Version 3.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646-1:2000. This code-for-code identity holds true for all encoded characters in the two standards, including the East Asian (Han) ideographic characters. ISO/IEC 10646 provides character names and code values; the Unicode Standard provides the same names and code values plus important implementation algorithms, properties, and other useful semantic information.
For details about the Unicode Standard and ISO/IEC 10646, see Appendix C, Relationship to ISO/IEC 10646 , and Appendix D, Changes from Unicode Version 2.0 in The Unicode Standard, 3.0.
1.5 The Unicode Consortium
The Unicode Consortium was incorporated in January 1991, under the name Unicode, Inc., to promote the Unicode Standard as an international encoding system for information interchange, to aid in its implementation, and to maintain quality control over future revisions.
To further these goals, the Unicode Consortium cooperates with the International Organization for Standardization (ISO/IEC/JTC1). It holds a Class C liaison membership with ISO/IEC JTC1/SC2; it participates in the work of both JTC1/SC2/WG2 (the technical working group for the subcommittee within JTC1 responsible for character set encoding) and the Ideographic Rapporteur Group (IRG) of WG2. The Consortium is a member company of the National Committee for Information Technology Standards, Technical Committee L2 (NCITS/L2), an accredited U.S. standards organization. In addition, Full Member companies of the Unicode Consortium have representatives in many countries who also work with other national standards bodies.
A number of organizations are Liaison Members of the Unicode Consortium: the Center for Computer & Information Development (CCID, China), the Internet Engineering Task Force (IETF), the Kongju National Library (Chung-nam, Korea), the Technical Committee on Information Technology (TCVN/TC1, Viet Nam), and the World Wide Web Consortium (W3C) I18N Working Group.
Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and who would like to assist in its extension and widespread implementation. Full and Associate Members represent a broad spectrum of corporations and organizations in the computer and information processing industry. The Consortium is supported financially solely through membership dues.
The Unicode Technical Committee The Unicode Technical Committee (UTC) is the working group within the Consortium responsible for the creation, maintenance, and quality of the Unicode Standard. The UTC controls all technical input to the standard and makes associated content decisions. Full Members of the Consortium vote on UTC decisions. Associate and Specialist Members and Officers of the Unicode Consortium are nonvoting UTC participants. Other attendees may participate in UTC discussions at the discretion of the Chair, as the intent of the UTC is to act as an open forum for the free exchange of technical ideas.
Chapter 2 General Structure 2
This chapter discusses the fundamental principles governing the design of the Unicode Standard and presents an overview of its main features. It includes discussion of text pro- cesses, unification principles, allocation of codespace, character properties, writing direc- tion, and a description of combining marks and how they are employed in Unicode character encoding. This chapter also gives general requirements for creating a text- processing system that conforms to the Unicode Standard. Formal requirements for conformance appear in Chapter 3, Conformance. Character properties, both normative and informative, are given in Chapter 4, Character Properties. A set of guidelines for implement- ers is provided in Chapter 5, Implementation Guidelines.
2.1 Architectural Context A character code standard such as the Unicode Standard enables the implementation of useful processes operating on textual data. The interesting end products are not the charac- ter codes but the text processes, because these directly serve the needs of a system’s users. Character codes are like nuts and bolts—minor, but essential and ubiquitous components used in many different ways in the construction of computer software systems. No single design of a character set can be optimal for all uses, so the architecture of the Unicode Stan- dard strikes a balance among several competing requirements.
Basic Text Processes Most computer systems provide low-level functionality for a small number of basic text processes from which more sophisticated text-processing capabilities are built. The follow- ing text processes are supported by most computer systems to some degree: • Rendering characters visible (including ligatures, contextual forms, and so on) • Breaking lines while rendering (including hyphenation) • Modifying appearance, such as point size, kerning, underlining, slant, and weight (light, demi, bold, and so on) • Determining units such as “word” and “sentence” • Interacting with users in processes such as selecting and highlighting text • Modifying keyboard input and editing stored text through insertion and dele- tion • Comparing text in operations such as determining the sort order of two strings, or filtering or matching strings • Analyzing text content in operations such as spell-checking, hyphenation, and parsing morphology (that is, determining word roots, stems, and affixes)
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 9 2.1 Architectural Context General Structure
• Treating text as bulk data for operations such as compressing and decompress- ing, truncating, transmitting, and receiving
Text Elements, Code Values, and Text Processes One of the more profound challenges in designing a worldwide character encoding stems from the fact that, for each text process, written languages differ in what is considered a fundamental unit of text, or a text element. For example, in traditional German orthography, the letter combination “ck” is a text ele- ment for the process of hyphenation (where it appears as “k-k”), but not for the process of sorting; in Spanish, the combination “ll” may be a text element for the traditional process of sorting (where it is sorted between “l” and “m”), but not for the process of rendering; and in English, the objects “A” and “a” are usually distinct text elements for the process of rendering, but generally not distinct for the process of searching text. The text elements in a given language depend upon the specific text process; a text element for spell-checking may have different boundaries from a text element for sorting purposes. A character encoding standard provides the fundamental units of encoding (that is, the abstract characters), which must exist in a unique relationship to the assigned numerical code values. These code values are the smallest addressable units of stored text. An important class of text elements is called a grapheme, which typically corresponds to what a user thinks of as a “character.” Figure 2-1 illustrates the relationship between abstract characters and graphemes. Figure 2-1. Text Elements and Characters Text Elements Characters
C ü
Grapheme: ch c h (Spanish)
çS · ÷ á ç
Word: cat c a t
The design of the character encoding must provide precisely the set of code values that allows programmers to design applications capable of implementing a variety of text pro- cesses in the desired languages. These code values may not map directly to any particular set of text elements that is used by one of these processes.
10 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.1 Architectural Context
Text Processes and Encoding In the case of English text using an encoding scheme such as ASCII, the relationships between the encoding and the basic text processes built on it are seemingly straightforward: characters are generally rendered visible one by one in distinct rectangles from left to right in linear order. Thus one character code inside the computer corresponds to one logical character in a process such as simple English rendering. When designing an international and multilingual text encoding such as the Unicode Stan- dard, the relationship between the encoding and implementation of basic text processes must be considered explicitly, for several reasons: • Many assumptions about character rendering that hold true for English fail for other writing systems. Unlike in English, characters in other writing systems are not necessarily rendered visible one by one in rectangles from left to right. In many cases, character positioning is quite complex and does not proceed in a linear fashion. See Section 8.2, Arabic, in Chapter 8, Middle Eastern Scripts, and Section 9.1, Devanagari, in Chapter 9, South and Southeast Asian Scripts, for detailed examples of this situation. • It is not always obvious that one set of text characters is an optimal encoding for a given language. For example, two approaches exist for the encoding of accented characters commonly used in French or Swedish: ISO/IEC 8859 defines letters such as “ä” and “ö” as individual characters, whereas ISO 5426 represents them by composition instead. In the Swedish language, both are considered distinct letters of the alphabet, following the letter “z”. In French, the diaeresis on a vowel merely marks it as being pronounced in isolation. In practice, both approaches can be used to implement either language. • No encoding can support all basic text processes equally well. As a result, some trade-offs are necessary. For example, ASCII defines separate codes for upper- case and lowercase letters. This choice causes some text processes, such as ren- dering, to be carried out more easily, but other processes, such as comparison, to become more difficult. A different encoding design for English, such as case- shift control codes, would have the opposite effect. In designing a new encoding scheme for complex scripts, such trade-offs must be evaluated and decisions made explicitly, rather than unconsciously. For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. (See Section 5.17, Sorting and Searching, for implementation guidelines.) Text processes supporting many languages are often more complex than they are for English. The character encoding design of the Unicode Standard strives to minimize this additional complexity, enabling modern computer systems to interchange, render, and manipulate text in a user’s own script and language—and possibly in other languages as well.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 11 2.2 Unicode Design Principles General Structure
2.2 Unicode Design Principles The design of the Unicode Standard reflects the 10 fundamental principles stated in Table 2-1. Not all of these principles can be satisfied simultaneously. The design strikes a balance between maintaining consistency for the sake of simplicity and efficiency and maintaining compatibility for interchange with existing standards. Table 2-1. The 10 Unicode Design Principles
Principle Statement Sixteen-bit character codes Unicode character codes have a width of 16 bits. Efficiency Unicode text is simple to parse and process. Characters, not glyphs The Unicode Standard encodes characters, not glyphs. Semantics Characters have well-defined semantics. Plain text The Unicode Standard encodes plain text. Logical order The default for memory representation is logical order. Unification The Unicode Standard unifies duplicate characters within scripts across languages. Dynamic composition Accented forms can be dynamically composed. Equivalent sequence Static precomposed forms have an equivalent dynamically composed sequence of characters. Convertibility Accurate convertibility is guaranteed between the Unicode Standard and other widely accepted standards.
Sixteen-Bit Character Codes Plain Unicode text consists of sequences of 16-bit Unicode character codes. Sequences of 16-bit codes are easy to parse and allow for efficient sorting, searching, display, and editing of text. From the full range of 65,536 code values, 63,486 are available to represent charac- ters with single 16-bit code values, and 2,048 code values are available to represent an addi- tional 1,048,544 characters through paired 16-bit code values. These paired code values, or surrogates, will allow implementations access to additional characters in the future. None of these surrogate pairs has been assigned in this version of the standard. The unrestricted range of 256 byte values is used for either of the two bytes that compose a Unicode 16-bit code value. An 8-bit-oriented process that expects certain ranges of byte values to be reserved may fail if unexpectedly presented with Unicode’s 16-bit character codes. For compatibility with existing environments, a lossless transformation for convert- ing 16-bit Unicode values into a form appropriate for 8-bit environments has been defined; it is called UTF-8. UTF-8 (Unicode Transformation Format-8) is the standard method for transforming Uni- code values into a sequence of 8-bit codes. It is intended to be used where 8-bit codes are needed and/or when the ASCII range of values need to be preserved—for example, when transmitting data through byte-oriented protocols or when using byte-oriented APIs. For further information on transformation formats of Unicode, see Section 2.3, Encoding Forms, and Appendix C.3, UCS Transformation Formats, in this book. Also see ISO/IEC 10646 Transformation Formats. For a format to be used on EBCDIC-based systems, see Unicode Technical Report #16, “UTF-EBCDIC,” on the CD-ROM or the up-to-date ver- sion on the Unicode Web site.
12 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles
Efficiency The Unicode Standard is designed to make efficient implementation possible. There are no escape characters or shift states in the Unicode character encoding scheme. Each character code has the same status as any other character code; all codes are equally accessible. All encoding forms are self-synchronizing. When randomly accessing a string, a program can find the boundary of a character with limited backup. In UTF-16, if a pointer points to a low-surrogate, a single backup is required. In UTF-8, if a pointer points to a byte starting with 10xxxxxx (in binary), one to three backups are required to find the beginning of the character. By convention, characters of a script are grouped together as far as is practical. Not only is this practice convenient for looking up characters in the code charts, but it makes imple- mentations more compact. The common punctuation characters are shared. Formatting characters are given specific and unambiguous function in the Unicode Stan- dard. This design simplifies the support of subsets. To keep implementations simple and efficient, stateful controls and formatting characters are avoided wherever possible.
Characters, Not Glyphs The Unicode Standard draws a distinction between characters, which are the smallest com- ponents of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed. Various relationships may exist between character and glyph: a single glyph may correspond to a single character, or to a number of characters, or multiple glyphs may result from a single character. The distinction between characters and glyphs is illustrated in Figure 2-2. Unicode characters represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. Characters are rep- resented by code values that reside only in a memory representation, as strings in memory, or on disk. The Unicode Standard deals only with character codes. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and meth- ods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard. Figure 2-2. Characters Versus Glyphs
Glyphs Unicode Characters A ei_`gA c U+0041 LATIN CAPITAL LETTER A a fjabh a d U+0061 LATIN SMALL LETTER A fi fi U+0066 LATIN SMALL LETTER F + U+0069 LATIN SMALL LETTER I klnm U+0647 ARABIC LETTER HEH
For certain scripts, such as Arabic and the various Indic scripts, the number of glyphs needed to display a given script may be significantly larger than the number of characters encoding the basic units of that script. The number of glyphs may also depend on the orthographic style supported by the font. For example, an Arabic font intended to support the Nastaliq style of Arabic script may possess many thousands of glyphs. However, the
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 13 2.2 Unicode Design Principles General Structure character encoding employs the same few dozen letters regardless of the font style used to depict the character data in context. A font and its associated rendering process define an arbitrary mapping from Unicode val- ues to glyphs. Some of the glyphs in a font may be independent forms for individual char- acters; others may be rendering forms that do not directly correspond to any single character. The process of mapping from characters in the memory representation to glyphs is one aspect of text rendering. The final appearance of rendered text may also depend on context (neighboring characters in the memory representation), variations in typographic design of the fonts used, and formatting information (point size, superscript, subscript, and so on). The results on screen or paper can differ considerably from the prototypical shape of a letter or character, as shown in Figure 2-3. Figure 2-3. Unicode Character Code to Rendered Glyphs Text Character Sequence ① Ò 0000 1001 0010 1010 ② @ê 0000 1001 0100 0010 ③ Ú 0000 1001 0011 0000 Font ④ @÷ 0000 1001 0100 1101 (Glyph Source) ⑤ Ì 0000 1001 0010 0100 ç@ 0000 1001 0011 0100
Text Rendering Process
③④ Ò① êçÌü ⑤ ②
For all scripts, an archetypical relation exists between character code sequences and result- ing glyphic appearance. For the Latin script, this relationship is simple and well known; for several other scripts, it is documented in this standard. However, in all cases, fine typogra- phy requires a more elaborate set of rules than given here. The Unicode Standard docu- ments the default relationship between character sequences and glyphic appearance solely for the purpose of ensuring that the same text content is always stored with the same, and therefore interchangeable, sequence of character codes.
14 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles
Semantics Characters have well-defined semantics. Character property tables are provided for use in parsing, sorting, and other algorithms requiring semantic knowledge about the code points. See Section 5.15, Locating Text Element Boundaries, Section 5.16, Identifiers, and Section 5.17, Sorting and Searching, for suggested implementations. The properties identi- fied by the Unicode Standard include numeric, spacing, combination, and directionality properties (see Chapter 4, Character Properties). Additional properties may be defined as needed from time to time. By itself, neither the character name nor its location in the code table designates its properties. For an exception, see Section 4.1, Case—Normative.
Plain Text Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes. In contrast, fancy text, also known as rich text, is any text representation consisting of plain text plus added information such as a language iden- tifier, font size, color, hypertext links, and so on. For example, the text of this book, a mul- tifont text as formatted by a desktop publishing system, is fancy text. Many kinds of data structures can be built into fancy text. For example, in fancy text con- taining ideographs an application may store the phonetic readings of ideographs some- where in the fancy text structure. The simplicity of plain text gives it a natural role as a major structural element of fancy text. SGML, HTML, XML, or TEX are examples of fancy text fully represented as plain text streams, interspersing plain text data with sequences of characters that represent the addi- tional data structures. Many popular word processing packages rely on a buffer of plain text to represent the content and to implement links to a parallel store of formatting data. The relative functional roles of both plain and fancy text are well established: • Plain text is the underlying content stream to which formatting can be applied. • Plain text is public, standardized, and universally readable. • Fancy text representation may be implementation-specific or proprietary. Although some fancy text formats have been standardized or made public, the majority of fancy text designs are vehicles for particular implementations and are not necessarily read- able by other implementations. Given that fancy text equals plain text plus added informa- tion, the extra information in fancy text can always be stripped away to reveal the “pure” text underneath. This operation is often employed, for example, in word processing sys- tems that use both their own private fancy format and plain text file format as a universal, if limited, means of exchange. Thus, by default, plain text represents the basic, interchange- able content of text. Standards for markup language, such as XML and HTML, use plain text for the entire file. They use special conventions embedded within the plain text file such as “
” to distin- guish the markup from the “real” content. XML, in particular, uses Unicode as a base-level encoding: “All XML processors must be able to read entities in either UTF-8 or UTF-16.”— W3C Recommendation: Extensible Markup Language (XML) 1.0, §4.3.3. Files not using these character sets in XML must have an encoding declaration to indicate any other char- acter set. Because plain text represents character content, it has no inherent appearance. It requires a rendering process to make it visible. If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 15 2.2 Unicode Design Principles General Structure make the text legible according to the intended reading. Therefore, the relationship between appearance and content of plain text may be stated as follows: Plain text must contain enough information to permit the text to be rendered legibly, and nothing more. The Unicode Standard encodes plain text. The distinction between data encoded in the Unicode Standard and other forms of data in the same data stream is the function of a higher-level protocol and is not specified by the Unicode Standard itself. The 64 control code positions of ISO/IEC 6429 (commonly used with ISO/IEC 646 and ISO/IEC 8859) are retained for compatibility and may be used to implement such protocols. (See Section 2.8, Controls and Control Sequences.)
Logical Order For all scripts, Unicode text is stored in logical order in the memory representation, roughly corresponding to the order in which text is typed in via the keyboard. In some circum- stances, the order of characters differs from this logical order when the text is displayed or printed. Where needed to ensure consistent legibility, the Unicode Standard defines the conversion of Unicode text from the memory representation to readable (displayed) text. The distinction between logical order and display order for reading is shown in Figure 2-4. Figure 2-4. Bidirectional Ordering
When the text in Figure 2-4 is ordered for display, the glyph that represents the first charac- ter of the English text appears at the left. The logical start character of the Hebrew text, however, is represented by the Hebrew glyph closest to the right margin. The succeeding Hebrew glyphs are laid out to the left. Logical order applies even when characters of different dominant direction are mixed: left- to-right (Greek, Cyrillic, Latin) with right-to-left (Arabic, Hebrew), or with vertical script. Properties of directionality inherent in characters generally determine the correct display order of text. This inherent directionality is occasionally insufficient to render plain text legibly, however. This situation can arise when scripts of different directionality are mixed. For this reason, the Unicode Standard includes characters to specify changes in direction. Chapter 3, Conformance, provides rules for the correct presentation of text containing left- to-right and right-to-left scripts. For the most part, logical order corresponds to phonetic order. The only current exceptions are the Thai and Lao scripts, which employ visual ordering; in these two scripts, users tra- ditionally type in visual order rather than phonetic order. Characters such as the short i in Devanagari are displayed before the characters that they logically follow in the memory representation. (See Section 9.1, Devanagari, for further explanation.) Combining marks (accent marks in the Greek, Cyrillic, and Latin scripts, vowel marks in Arabic and Devanagari, and so on) do not appear linearly in the final rendered text. In a Unicode character code string, all such characters follow the base character that they
16 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.2 Unicode Design Principles modify (for example, Roman “ã” is stored as “a” followed by combining “þ”ì when not stored in a precomposed form).
Unification The Unicode Standard avoids duplicate encoding of characters by unifying them within scripts across languages; characters that are equivalent in form are given a single code. Common letters, punctuation marks, symbols, and diacritics are given one code each, regardless of language, as are common Chinese/Japanese/Korean (CJK) ideographs. (See Section 10.1, Han.) Care has been taken not to make artificial distinctions among characters. Users may become confused when they see an Å on the screen but their search dialog does not find it. The reason that this situation occurs is that what they see on the screen is not an Å (A- ring)—it is an Å (Ångström). This phenomenon is called visual ambiguity. It is quite normal for many characters to have different usages, such as comma “,” for either thousands-separator (English) or decimal-separator (French). The Unicode Standard avoids duplication of characters due to specific usage in different languages; rather, it duplicates characters only to support compatibility with base standards. The Unicode Standard does not attempt to encode features such as language, font, size, positioning, glyphs, and so forth. For example, it does not preserve language as a part of character encoding: just as French i grecque, German ypsilon, and English wye are all repre- sented by the same character code, U+0057 “Y”, so too are Chinese zi, Japanese ji, and Korean ja all represented as the same character code, U+5B57 . In determining whether to unify variant ideograph forms across standards, the Unicode Standard follows the principles described in Section 10.1, Han. Where these principles determine that two forms constitute a trivial (wazukana) difference, the Unicode Standard assigns a single code. Otherwise, separate codes are assigned. Compatibility Characters. Compatibility characters are those that would not have been encoded (except for compatibility) because they are in some sense variants of characters that have already been coded. The prime examples are the glyph variants in the Compati- bility Area: halfwidth characters, Arabic contextual form glyphs, Arabic ligatures, and so on. The Compatibility Area contains a large number of compatibility characters, but the Uni- code Standard also contains many compatibility characters that do not appear in the Com- patibility Area. Examples include Roman numerals, such as the IV “character.” It is important to be able to identify which characters are compatibility characters so that Uni- code-based systems can treat them in a uniform way. Identifying one character as a compatibility variant of another character implies that gen- erally the first can be remapped to the other without the loss of any information other than formatting. Such remapping cannot always take place because many of the compatibility characters are in place just to allow systems to maintain one-to-one mappings to existing code sets. In such cases, a remapping would lose information that is felt to be important in the original set. Compatibility mappings are called out in Section 14.1, Character Names List. Because replacing a character by its compatibly equivalent character or character sequence may change the information in the text, implementation must proceed with cau- tion. A good use of these mappings may not be in transcoding, but rather in providing the correct equivalence for searching and sorting.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 17 2.2 Unicode Design Principles General Structure
Dynamic Composition The Unicode Standard allows for the dynamic composition of accented forms and Hangul syllables. Combining characters used to create composite forms are productive. Because the process of character composition is open-ended, new forms with modifying marks may be created from a combination of base characters followed by combining characters. For example, the diaeresis, “¨”, may be combined with all vowels and a number of consonants in languages using the Latin script and several other scripts.
Equivalent Sequence Some text elements can be encoded either as static precomposed forms or by dynamic composition. Common precomposed forms such as U+00DC “Ü” are included for compatibility with current standards. For static pre- composed forms, the standard provides a mapping to the canonically equivalent dynami- cally composed sequence of characters. (See also Section 3.6, Decomposition.) In many cases, different sequences of Unicode characters are considered equivalent. For example, a precomposed character may be represented as a composed character sequence (see Figure 2-5 and Figure 2-10). Figure 2-5. Equivalent Sequences B + Ä B + A + @¨ LJ + A L + J + A
In such cases, the Unicode Standard does not prescribe one particular sequence; all of the sequences in the examples are equivalent. Systems may choose to normalize Unicode text to one particular sequence, such as by normalizing composed character sequences into pre- composed characters, or vice versa. Therefore, any distinctions made between nonidentical equivalent sequences by applications or users are not guaranteed to be interchangeable. (For implementation guidelines, see Section 5.7, Normalization.)
Convertibility Character identity is preserved for interchange with a number of different base standards, including national, international, and vendor standards. Where variant forms (or even the same form) are given separate codes within one base standard, they are also kept separate within the Unicode Standard. This choice guarantees the existence of a mapping between the Unicode Standard and base standards. Accurate convertibility is guaranteed between the Unicode Standard and other standards in wide usage as of May 1993. In general, a single code value in another standard will corre- spond to a single code value in the Unicode Standard. Sometimes, however, a single code value in another standard corresponds to a sequence of code values in the Unicode Stan- dard, or vice versa. Conversion between Unicode text and text in other character codes must in general be done by explicit table-mapping processes. (See also Section 5.1, Transcoding to Other Standards.)
18 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.3 Encoding Forms
2.3 Encoding Forms The Unicode Standard character encoding model includes a repertoire of characters, their mapping to a set of integers, and their encoding forms. Character encoding forms specify the representation of characters as actual data in a computer. The Unicode Standard uses two encoding forms: 16-bit and 8-bit. For a formal definition of these encoding forms, see Section 3.8, Transformations. Figure 2-6 illustrates the relationship between abstract characters, encoded characters, and character encoding schemes. In Figure 2-6: • The solid arrows connect encoded characters with the abstract characters that they represent and encode. • The dotted arrow illustrates a hypothetical private-use variant of the A-ring character (the value F000016 is the Unicode scalar value for a surrogate pair in the Private Use Area). • The hollow arrow shows where an encoded character sequence represents an abstract character, but does not directly encode it. • The serialization column shows two of the possible serializations of the encoded characters. Figure 2-6. Character Encoding Schemes Abstract Encoded Serialized
UTF-16BE UTF- 8
C5 00 C5 C3 85
212B 21 2B E2 84 AB F0000 DB 80 DC 00 F3 B0 80 80
A 61 30A 00 61 03 0A 61 CC 8A
¡
UTF-16 The default encoding form of the Unicode Standard is 16-bit: characters are assigned a unique 16-bit value, except that characters encoded by surrogates consist of a pair of 16-bit values. The Unicode 16-bit encoding form is identical to the ISO/IEC 10646 transforma- tion format UTF-16. In UTF-16, characters mapped up to 65535 are encoded as single 16- bit values; characters mapped above 65535 are encoded as pairs of 16-bit values (surro- gates). Surrogates are defined in Section 3.7, Surrogates.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 19 2.3 Encoding Forms General Structure
By using surrogates, UTF-16 provides a coded representation for more than 1 million graphic characters in a form that is compatible with 16-bit characters. This scheme permits the coexistence of a very large number of characters in systems that define characters as 16- bit entities. Surrogates were designed to be a simple and efficient extension mechanism that works well with older implementations and avoids many of the problems of multibyte encodings. See Section 5.4, Handling Surrogate Pairs, for more information about surrogate characters.
UTF-8 To meet the requirements of byte-oriented, ASCII-based systems, a second encoding form is specified by the Unicode Standard: UTF-8. It is a variable-length encoding that preserves ASCII transparency. UTF-8 usage is fully conformant with the Unicode Standard and ISO/ IEC 10646. Existing software and practices in information technology frequently depend on character data being represented as a sequence of bytes. To use Unicode character data in such sys- tems, Unicode character values and pairs of surrogates must be transformed into a sequence of one or more bytes that represent the same information, but which are restricted in their numerical range. UTF-8 is a variable-length encoding of byte sequences, where the high bits indicate the part of the sequence to which a byte belongs. Table 3-1 on page 47 shows how the bits in a Uni- code value (or a surrogate pair) are distributed among the bytes in the UTF-8 encoding. Any byte sequence that does not follow this pattern is an ill-formed byte sequence. See Section 3.8, Transformations, for a definition of ill-formed byte sequences. The UTF-8 encoding form maintains transparency for all of the ASCII code values (0..127). Further- more, the values 0..127 do not appear in any byte of a transformed result except as the direct representation of these ASCII values. The ASCII range (U+0000..U+007F) is repre- sented by single bytes; many of the non-ideographic scripts are represented by two bytes; the remaining Unicode scalar values are represented as three bytes; and surrogate pairs require four bytes. Other important characteristics of UTF-8 are as follows: • It is efficient to convert to and from 16-bit Unicode text. • The first byte indicates the number of bytes to follow in a multibyte sequence, allowing for efficient forward parsing. • It is efficient to find the start of a character when beginning from an arbitrary location in a byte stream. Programs need to search at most four bytes back- ward, and it is a simple task to recognize an initial byte. • UTF-8 is reasonably compact in terms of the number of bytes used for encod- ing. • A binary sort of UTF-8 strings gives the same ordering as a binary sort of Uni- code scalar values. • If surrogates are not used, a binary sort of UTF-8 strings gives the same order- ing as the binary sort of their UTF-16 counterparts. Sample code for transforming Unicode data (UTF-16) into UTF-8 is provided on the CD- ROM.
20 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.4 Unicode Allocation
Character Encoding Schemes A character encoding scheme consists of an encoding form plus byte serialization. Thus there are four character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16LE, and UTF-16BE. Often, encoding forms and character encoding schemes have the same name. The names of the Unicode encoding forms are the same as those used in other contexts. Notably, the IANA maintains a registry of charset names used on the Internet, which includes the same names as the encoding forms used here. Although the encoding forms here and the descriptions of the charsets registered with IANA are very similar, some important differences may arise in terms of the requirements for each. For more informa- tion, please see Unicode Technical Report #17, “Character Encoding Model,” on the CD- ROM or the up-to-date version on the Unicode Web site.
2.4 Unicode Allocation All code values in the Unicode Standard are equally accessible electronically; the exact assignment of character codes is of minor consequence for information processing. Never- theless, for the convenience of people who will use them, the codes are grouped by linguis- tic and functional categories. Such grouping offers the additional benefit of supporting various space-saving techniques in actual implementations.
Allocation Areas Figure 2-7 provides an overview of the Unicode codespace allocation. For convenience, the Unicode Standard codespace is divided into several areas, which are then subdivided into character blocks: • The General Scripts Area, consisting of alphabetic and syllabic scripts that have relatively small character sets, such as Latin, Cyrillic, Greek, Hebrew, Arabic, Devanagari, and Thai • The Symbols Area, including a large variety of symbols and dingbats, for punc- tuation, mathematics, chemistry, technical, and other specialized usage • The CJK Phonetics and Symbols Area, including punctuation, symbols, radicals, and phonetics for Chinese, Japanese, and Korean • The CJK Ideographs Area, consisting of 27,484 unified CJK ideographs • The Yi Syllables Area, consisting of 1,165 syllables and 50 Yi radicals • The Hangul Syllables Area, consisting of 11,172 precomposed Korean Hangul syllables • The Surrogates Area, consisting of 1,024 low-half surrogates and 1,024 high-half surrogates that are used in the surrogate extension method to access more than 1 million codes for future expansion • The Private Use Area, containing 6,400 code positions used for defining user- or vendor-specific characters • The Compatibility and Specials Area, containing many of the characters from widely used corporate and national standards that have other representations in Unicode encoding, as well as several special-use characters The allocation of characters into areas reflects the evolution of the Unicode Standard and is not intended to define the usage of characters in implementations. For example, many
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 21 2.4 Unicode Allocation General Structure characters are included in the standard solely for reasons of compatibility with other stan- dards but are not coded in the Compatibility Area; there are many general-purpose sym- bols and punctuation in the CJK Phonetics and Symbols Area, while the Hangul conjoining jamo are in the General Scripts Area. A Private Use Area gives the Unicode Standard the necessary flexibility and matches wide- spread practice in existing standards. Successful interchange requires agreement between sender and receiver regarding interpretation of private-use codes. Figure 2-7. Unicode Allocation
U+0000 U+0000 U+0100 Latin U+1000 General U+0200 Scripts U+0300 U+2000 U+0400 Greek Cyrillic Symbols U+0500 Armenian/Hebrew U+3000 U+0600 CJK Misc. Arabic U+0700 Syriac/Thaana U+4000 U+0800 U+0900 Devanagari/Bengali U+5000 U+0A00 Gurmukhi/Gujarati U+0B00 Oriya/Tamil U+6000 CJKV U+0C00 Telugu/Kannada Ideographs U+0D00 Malayalam/Sinhala U+7000 U+0E00 Thai/Lao U+0F00 Tibetan U+8000 U+1000 Myanmar/Georgian U+1100 Hangul Jamo U+9000 U+1200 U+1300 Ethiopic U+A000 U+1400 Cherokee Yi U+1500 Canadian Aboriginal U+B000 U+1600 Syllabics U+1700 Ogham/Runic U+C000 U+1800 Khmer Mongolian Hangul U+1900 U+D000 U+1A00 U+1B00 U+E000 Surrogates U+1C00 U+1D00 U+F000 Private Use U+1E00 Extended Latin U+1F00 Extended Greek Compatibility U+2000 Primary Private Use Compatibility Reserved
22 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.4 Unicode Allocation
Codespace Assignment for Graphic Characters The predominant characteristics of codespace assignment in the Unicode Standard are as follows: • Where there is a single accepted standard for a script, the Unicode Standard generally follows it for the relative order of characters within that script. • The first 256 codes follow precisely the arrangement of ISO/IEC 8859-1 (Latin 1), of which 7-bit ASCII (ISO/IEC 646 IRV) accounts for the first 128 code positions. • Characters with common characteristics are located together contiguously. For example, the primary Arabic character block was modeled after ISO/IEC 8859- 6. The Arabic script characters used in Persian, Urdu, and other languages, but not included in ISO/IEC 8859-6, are allocated after the primary Arabic charac- ter block. Right-to-left scripts are grouped together. • To the extent possible, scripts do not cross 128-byte boundaries or 1,024-byte boundaries. • Codes that represent letters, punctuation, symbols, and diacritics that are gen- erally shared by multiple languages or scripts are grouped together in several locations. • The Unicode Standard makes no pretense to correlate character code allocation with language-dependent collation or case. • Unified CJK ideographs are laid out in two sections, each of which is arranged according to the Han ideograph arrangement defined in Section 10.1, Han. This ordering is roughly based on a radical-stroke count order.
Nongraphic Characters, Reserved and Unassigned Codes All code points except those mentioned below are reserved for graphic characters. Code points unassigned in this version of the Unicode Standard are available for assignment in later versions of the Unicode Standard to characters of any script. • Sixty-five codes (U+0000..U+001F and U+007F..U+009F) are reserved specifi- cally as control codes. Of the control codes, null (U+0000) can be used as a string terminator as in the C language, tab (U+0009) retains its customary meaning, and the others may be interpreted according to ISO/IEC 6429. (See Section 2.8, Controls and Control Sequences, and Section 13.1, Control Codes.) • Two codes are not used to encode characters: U+FFFF is reserved for internal use (as a sentinel) and should not be transmitted or stored as part of plain text. U+FFFE is also reserved; its presence indicates byte-swapped Unicode data. (See Section 13.6, Specials.) • A contiguous area of codes has been set aside for private use. Characters in this area will never be defined by the Unicode Standard. These codes can be freely used for characters of any purpose, but successful interchange requires an agreement between sender and receiver on their interpretation. • In addition, 2,048 codes have been allocated for surrogates, which are used in the extension mechanism.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 23 2.5 Writing Direction General Structure
2.5 Writing Direction Individual writing systems have different conventions for arranging characters into lines on a page or screen. Such conventions are referred to as a script’s directionality. For example, in the Latin script, characters run horizontally from left to right to form lines, and lines run from top to bottom. In Semitic scripts such as Hebrew and Arabic, characters are arranged from right to left into lines, although digits run the other way, making the scripts inherently bidirectional. Left-to-right and right-to-left scripts are frequently used together. In such a case, arranging characters into lines becomes more complex. The Unicode Standard defines an algorithm to determine the layout of a line. See Section 3.12, Bidirectional Behavior, for more informa- tion. East Asian scripts are frequently written in vertical lines that run from top to bottom. Lines are arranged from right to left, except for Mongolian. Most characters have the same shape and orientation when displayed horizontally or vertically, but many punctuation characters change their shape when displayed vertically. In a vertical context, letters and words from other scripts are generally rotated through 90-degree angles so that they, too, read from top to bottom. That is, letters from left-to-right scripts will be rotated clockwise and letters from right-to-left scripts will be rotated counterclockwise. In contrast to the bidirectional case, the choice to lay out text either vertically or horizon- tally is treated as a formatting style. Therefore, the Unicode Standard does not provide directionality controls to specify that choice. Other script directionalities are found in historical writing systems. For example, some ancient Numidian texts are written bottom to top, and Egyptian hieroglyphics can be writ- ten with varying directions for individual lines. Early Greek used a system called boustrophedon (literally, “ox-turning”). In boustrophedon writing, characters are arranged into horizontal lines, but the individual lines alternate between running right to left and running left to right, the way an ox goes back and forth when plowing a field. The letter images are mirrored in accordance with the direction of each individual line. Boustrophedon writing is of interest almost exclusively to scholars intent on reproducing the exact visual content of ancient texts. The Unicode Standard does not provide direct support for boustrophedon. Fixed texts can, however, be written in boustrophedon by using hard line breaks and directionality overrides.
2.6 Combining Characters Combining Characters. Characters intended to be positioned relative to an associated base character are depicted in the character code charts above, below, or through a dotted circle. They are also annotated in the names list or in the character property lists as “combining” or “nonspacing” characters. When rendered, the glyphs that depict these characters are intended to be positioned relative to the glyph depicting the preceding base character in some combination. The Unicode Standard distinguishes two types of combining charac- ters: spacing and nonspacing. Nonspacing combining characters do not occupy a spacing position by themselves. In rendering, the combination of a base character and a nonspacing character may have a different advance width than the base character by itself. For example, an “Ó” may be slightly wider than a plain “i”. The spacing or nonspacing properties of a combining character are defined in the Unicode Character Database.
24 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.6 Combining Characters
Diacritics. Diacritics are the principal class of nonspacing combining characters used with European alphabets. In the Unicode Standard, the term “diacritic” is defined very broadly to include accents as well as other nonspacing marks. All diacritics can be applied to any base character and are available for use with any script. A separate block is provided for symbol diacritics, generally intended to be used with sym- bol base characters. The blocks contain additional combining characters for particular scripts with which they are primarily used. As with other characters, the allocation of a combining character to one block or another identifies only its primary usage; it is not intended to define or limit the range of characters to which it may be applied. In the Uni- code Standard, all sequences of character codes are permitted. Other Combining Characters. Some scripts, such as Hebrew, Arabic, and the scripts of India and Southeast Asia, have spacing or nonspacing combining characters. Many of these combining marks encode vowel letters; as such, they are not generally referred to as “dia- critics.”
Sequence of Base Characters and Diacritics In the Unicode Standard, all combining characters are to be used in sequence following the base characters to which they apply. The sequence of Unicode characters U+0061 “a” + U+0308 “þ”í + U+0075 “u” unambiguously encodes “äu” not “aü”. The ordering convention used by the Unicode Standard is consistent with the logical order of combining characters in Semitic and Indic scripts, the great majority of which (logically or phonetically) follow the base characters with respect to which they are positioned. This convention conforms to the way modern font technology handles the rendering of non- spacing graphical forms (glyphs) so that mapping from character memory representation order to font rendering order is simplified. It is different from the convention used in the bibliographic standard ISO 5426. A sequence of a base character plus one or more combining characters generally has the same properties as the base character. For example, “A” followed by “ ”ó has the same prop- erties as “”. In some contexts, enclosing diacritics confer a symbol property to the charac- ters they enclose. This idea is discussed more fully in Section 3.10, Canonical Ordering Behavior, but see also Section 3.12, Bidirectional Behavior. In the charts for Indic scripts, some vowels are depicted to the left of dotted circles (see Figure 2-8). This special case must be carefully distinguished from that of general combin- ing diacritical mark characters. Such vowel signs are rendered to the left of a consonant let- ter or consonant cluster, even though their logical order in the Unicode encoding follows the consonant letter. The coding of these vowels in pronunciation order and not in visual order is consistent with the ISCII standard. Figure 2-8. Indic Vowel Signs Ó +ç@ çÓ
Multiple Combining Characters In some instances, more than one diacritical mark is applied to a single base character (see Figure 2-9). The Unicode Standard does not restrict the number of combining characters
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 25 2.6 Combining Characters General Structure that may follow a base character. The following discussion summarizes the treatment of multiple combining characters. (For the formal algorithm, see Chapter 3, Conformance.) Figure 2-9. Stacking Sequences
2 Characters Glyphs a @¬÷@ @. @ ä÷. ö ö If the combining characters can interact typographically—for example, a U+0304 - and a U+0308 —then the order of graphic display is determined by the order of coded characters (see Figure 2-10). The diacritics or other com- bining characters are positioned from the base character’s glyph outward. Combining char- acters placed above a base character will be stacked vertically, starting with the first encountered in the logical store and continuing for as many marks above as are required by the character codes following the base character. For combining characters placed below a base character, the situation is reversed, with the combining characters starting from the base character and stacking downward. Figure 2-10. Interacting Combining Characters
LATIN SMALL LETTER A WITH TILDE LATIN SMALL LETTER A + COMBINING TILDE á a LATIN SMALL LETTER A + COMBINING DOT ABOVE
LATIN SMALL LETTER A WITH TILDE + COMBINING DOT BELOW . LATIN SMALL LETTER A + COMBINING TILDE + COMBINING DOT BELOW LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING TILDE á a LATIN SMALL LETTER A + COMBINING DOT BELOW + COMBINING DOT ABOVE . LATIN SMALL LETTER A + COMBINING DOT ABOVE + COMBINING DOT BELOW
LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE « LATIN SMALL LETTER A WITH CIRCUMFLEX + COMBINING ACUTE LATIN SMALL LETTER A + COMBINING CIRCUMFLEX + COMBINING ACUTE
ö LATIN SMALL LETTER A ACUTE + COMBINING CIRCUMFLEX LATIN SMALL LETTER A + COMBINING ACUTE + COMBINING CIRCUMFLEX
An example of multiple combining characters above the base character is found in Thai, where a consonant letter can have above it one of the vowels U+0E34 through U+0E37 and, above that, one of four tone marks U+0E48 through U+0E4B. The order of character codes that produces this graphic display is base consonant character + vowel character + tone mark character. Some specific combining characters override the default stacking behavior by being posi- tioned horizontally rather than stacking or by ligature with an adjacent nonspacing mark (see Figure 2-11). When positioned horizontally, the order of codes is reflected by
26 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.6 Combining Characters positioning in the predominant direction of the script with which the codes are used. For example, in a left-to-right script, horizontal accents would be coded left to right. In Figure 2-11, the top example is correct and the bottom example is incorrect. Figure 2-11. Nondefault Stacking
GREEK SMALL LETTER ALPHA This is αÕ« + COMBINING COMMA ABOVE (psili) + COMBINING ACUTE ACCENT (oxia) correct GREEK SMALL LETTER ALPHA This is Õ + COMBINING ACUTE ACCENT (oxia) α« + COMBINING COMMA ABOVE (psili) incorrect
Prominent characters that show such override behavior are associated with specific scripts or alphabets. For example, when used with the Greek script, the “breathing marks” U+0313 (psili) and U+0314 (dasia) require that, when used together with a following acute or grave accent, they be rendered side-by-side above their base letter rather than the accent marks being stacked above the breathing marks. The order of codes here is base character code + breathing mark code + accent mark code. This example demonstrates the script-dependent nature of ren- dering combining diacritical marks.
Multiple Base Characters When the glyphs representing two base characters merge to form a ligature, then the com- bining characters must be rendered correctly in relation to the ligated glyph (see Figure 2-12). Internally, the software must distinguish between the nonspacing marks that apply to positions relative to the first part of the ligature glyph and those that apply to the second. (For a discussion of general methods of positioning nonspacing marks, see Section 5.13, Strategies for Handling Nonspacing Marks.) Figure 2-12. Multiple Base Characters @ f @÷ i . fi˜.
Multiple base characters do not commonly occur in most scripts. However, in some scripts, such as Arabic, this situation occurs quite often when vowel marks are used. It arises because of the large number of ligatures in Arabic, where each element of a ligature is a consonant, which in turn can have a vowel mark attached to it. Ligatures can even occur with three or more characters merging; vowel marks may be attached to each part.
Spacing Clones of European Diacritical Marks By convention, diacritical marks used by the Unicode Standard may be exhibited in (appar- ent) isolation by applying them to U+0020 or to U+00A0 . This tac- tic might be employed, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. The Unicode Standard separately encodes clones of many common European diacritical marks that are spacing characters, largely to provide compatibility with existing character set standards. These related charac- ters are cross-referenced in the names list in Chapter 14, Code Charts.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 27 2.7 Special Character and Noncharacter Values General Structure
2.7 Special Character and Noncharacter Values The Unicode Standard includes a small number of important characters with special behavior; some of them are introduced in this section. It is important that implementa- tions treat these characters properly. For a list of these and similar characters, see Section 3.9, Special Character Properties; for more information about such characters, see Section 13.1, Control Codes, Section 13.2, Layout Controls, and Section 13.6, Specials.
Byte Order Mark (BOM) The canonical encoding form of Unicode plain text as a sequence of 16-bit codes is sensitive to the byte ordering that is used when serializing text into a sequence of bytes, such as when writing to a file or transferring across a network. Some processors place the least significant byte in the initial position; others place the most significant byte in the initial position. Ide- ally, all implementations of the Unicode Standard would follow only one set of byte order rules, but this scheme would force one class of processors to swap the byte order on reading and writing plain text files, even when the file never leaves the system on which it was created. To have an efficient way to indicate which byte order is used in a text, the Unicode Standard contains two code values, U+FEFF - (byte order mark) and U+FFFE (not a character code), which are the byte-ordered mirror images of one another. The byte order mark is not a control character that selects the byte order of the text; rather, its function is to notify recipients as to which byte ordering is used in a file. Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as containing Unicode text. The sequence FE16 FF16 (or its byte-reversed counterpart, FF16 FE16) is exceedingly rare at the outset of text files that use other character encodings. It is therefore not likely to be confused with real text data. The same is true for both single-byte and multibyte encodings. Data streams (or files) that begin with U+FEFF byte order mark are likely to contain Uni- code values. It is recommended that applications sending or receiving untyped data streams of coded characters use this signature. If other signaling methods are used, signa- tures should not be employed. Conformance to the Unicode Standard does not requires the use of the BOM as such a sig- nature. See Section 13.6, Specials, for more information on byte order mark and its use as an encoding signature.
Special Noncharacter Values U+FFFF and U+FFFE. These code values are not used to represent Unicode characters. U+FFFF is reserved for private program use as a sentinel or other signal. (Notice that U+FFFF is a 16-bit representation of –1 in two’s-complement notation.) Programs receiv- ing this code are not required to interpret it in any way. It is good practice, however, to rec- ognize this code as a noncharacter value and to take appropriate action, such as by indicating possible corruption of the text. U+FFFE is similar in all respects to U+FFFF, except that it is also the mirror image of U+FEFF - (byte order mark). The presence of U+FFFE constitutes a strong hint that the text in question is byte- reversed. See Section 13.6, Specials.
28 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.8 Controls and Control Sequences
Separators Line and Paragraph Separators. For historical reasons, carriage return and line feed are not used consistently across different systems. The Unicode Standard provides (and encourages the use of) the line and paragraph separator characters to provide clear informa- tion about where line and paragraph boundaries occur. Paragraph separation is also required for use with the bidirectional algorithm (see Chapter 3, Conformance). As these entities are separator codes, it is not necessary to start the first line or paragraph or to end the last line or paragraph with them. Also see Section 13.2, Layout Controls. Interaction with CR/LF. The Unicode Standard does not prescribe specific semantics for U+000D (CR) and U+000A (LF). These codes are provided to represent any CR or LF characters employed by a higher-level protocol or retained in text translated from other standards. It is left to each application to interpret these codes, to decide whether to require their use, and to determine whether CR/LF pairs or single codes are needed. See also Section 5.9, Line Handling, and Unicode Technical Report #13, “Uni- code Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.
Layout and Format Control Characters The Unicode Standard defines several characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display. These characters are explicitly defined as not affecting line-breaking behavior. Unlike space characters or other delimiters, they do not serve to indicate word, line, or other unit boundaries. Their specific use in layout and formatting is described in Section 13.2, Layout Controls.
The Replacement Character U+FFFD is the general substitute character in the Unicode Standard. It can be substituted for any “unknown” character in another encoding that can- not be mapped in terms of known Unicode values (see Section 5.3, Unknown and Missing Characters, and Section 13.6, Specials).
2.8 Controls and Control Sequences
Control Characters The Unicode Standard provides 65 code values for the representation of control characters. These ranges are U+0000..U+001F and U+007F..U+009F, which correspond to the 8-bit controls 0016 to 1F16 (C0 controls) and 7F16 to 9F16 (delete and C1 controls). For example, the 8-bit version of horizontal tab (HT) is at 0916; the Unicode Standard encodes tab at U+0009. When converting control codes from existing 8-bit text, they are merely zero- extended to the full 16 bits of Unicode characters. Programs that conform to the Unicode Standard may treat these 16-bit control codes in exactly the same way as they treat their 7- and 8-bit equivalents in other protocols, such as ISO/IEC 2022 and ISO/IEC 6429. Such usage constitutes a higher-level protocol and is beyond the scope of the Unicode Standard. Similarly, the use of ISO/IEC 6429:1992 control sequences (extended to 16 bits) for controlling bidirectional formatting is a legitimate higher-level protocol layered on top of the plain text of the Unicode Standard. As with all higher-level protocols, both the sender and the receiver must agree upon a common proto- col beforehand.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 29 2.9 Conforming to the Unicode Standard General Structure
Escape Sequences. In converting text containing escape sequences to the Unicode character encoding, text must be converted to the equivalent Unicode characters. Converting escape sequences into Unicode characters on a character-by-character basis (for instance, ESC–A turns into U+001B , U+0041 ) allows the reverse conver- sion to be performed without forcing the conversion program to recognize the escape sequence as such. Control Code Sequences Encoding Additional Information about Text. If a system uses sequences beginning with control codes to embed additional information about text (such as formatting attributes or structure), then such sequences form a higher-level protocol outside the scope of the Unicode Standard. Such higher-level protocols are not specified by the Unicode Standard; their existence cannot be assumed without a separate agreement between the parties interchanging such data.
Representing Control Sequences Control sequences can be represented in the Unicode encoding design but must then be represented in terms of 16-bit characters. For example, suppose that an application allows embedded font information to be transmitted by means of an 8-bit sequence. In the follow- ing, the notation ^A refers to the C0 control code 0116, ^B refers to the C0 control code 0216, and so on: ^ATimes^B = 01,54,69,6D,65,73,02 Then the corresponding sequence of Unicode character codes would be ^ATimes^B = 0001,0054,0069,006D,0065,0073,0002 That is, each Unicode character code is a 16-bit zero-extended code value of the corre- sponding 8-bit code value. Where the embedded data are not interpreted as a sequence of characters by the protocol, the information could be encoded as follows: ^ATimes^B = 0001,5469,6D65,7300,0002 The data could never be encoded as ^ATimes^B = 0154,696D,6573,0200 because in the Unicode character encoding this sequence represents four characters— (U+0154), two Han characters (U+696D and U+6573, respectively), and (U+0200). None of these characters is a control character. If a control sequence contains embedded binary data, then the data bytes do not necessarily need to be zero-extended because the control sequence constitutes a higher protocol. However, doing so allows code conversion algorithms to suc- ceed even in the absence of explicit knowledge of employed control sequences.
2.9 Conforming to the Unicode Standard Chapter 3, Conformance, specifies the set of unambiguous criteria to which a Unicode- conformant implementation must adhere so that it can interoperate with other confor- mant implementations. The following section gives examples of conformance and non- conformance to complement the formal statement of conformance. An implementation that conforms to the Unicode Standard has the following characteris- tics:
30 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.9 Conforming to the Unicode Standard
• It treats characters as 16-bit units. U+2020 (that is, 202016) is the single Unicode character ‘†’, not two ASCII spaces. • It interprets characters according to the identities, properties, and rules defined for them in this standard. U+2423 is ‘’ , not ‘¨’ hiragana small i (which is the meaning of the bytes 242316 in JIS). U+00D4 ‘ô’ is equivalent to U+004F ‘o’ followed by U+0302 ‘þ’ó, but not equivalent to U+0302 followed by U+004F. U+05D0 ‘µ’ followed by U+05D1 ‘¶’ looks like ‘¶µ’, not ‘µ¶’ when dis- played. When an implementation supports Arabic or Hebrew characters and displays those characters, they must be ordered according to the bidirec- tional algorithm described in Section 3.12, Bidirectional Behavior. When an implementation supports Arabic, Devanagari, Tamil, or other shaping characters and displays those characters, at a minimum the characters are shaped according to the appropriate character block descriptions given in Section 8.2, Arabic, Section 9.1, Devanagari, or Section 9.6, Tamil. (More sophisticated shaping can be used if available.) • It does not use unassigned codes. U+2073 is unassigned and not usable for ‘3’ (superscript 3) or any other character. • It does not corrupt unknown characters. U+2029 is and should not be dropped by appli- cations that do not yet support it. U+03A1 “P” should not be changed to U+00A1 (first byte dropped), U+0050 (mapped to Latin letter P), U+A103 (bytes reversed), or anything other than U+03A1. However, it is acceptable for a conforming implementation: • To support only a subset of the Unicode characters. An application might not provide mathematical symbols or the Thai script, for example. • To transform data knowingly. Uppercase conversion: ‘a’ transformed to ‘A’ Romaji to kana: ‘kyo’ transformed to ²ø U+247D ‘(10)’ decomposed to 0028 0031 0030 0029 • To build higher-level protocols on the character set. Compression of characters Use of rich text file formats • To define characters in the Private Use Area.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 31 2.10 Referencing Versions of the Unicode Standard General Structure
Examples of characters that might be encoded in the Private Use Area include supplementary ideographic characters (gaiji) or existing corpo- rate logo characters. • To not support the bidirectional algorithm or character shaping in implemen- tations that do not support complex scripts, such as Arabic and Devanagari. • To not support the bidirectional algorithm or character shaping in implemen- tations that do not display characters, such as on servers or in programs that simply parse or transcode text, such as an XML parser. Code conversion from other standards to the Unicode Standard will be considered confor- mant if the matching table produces accurate conversions in both directions.
Characters Not Used in a Subset The Unicode Standard does not require that an application be capable of interpreting and rendering all Unicode characters so as to be conformant. Many systems will have fonts only for some scripts, but not for others; sorting and other text-processing rules may be imple- mented only for a limited set of languages. As a result, an implementation is able to inter- pret a subset of characters. The Unicode Standard provides no formalized method for identifying this subset. Further- more, this subset is typically different for different aspects of an implementation. For example, an application may be able to read, write, and store any 16-bit character, and to sort one subset according to the rules of one or more languages (and the rest arbitrarily), but have access only to fonts for a single script. The same implementation may be able to render additional scripts as soon as additional fonts are installed in its environment. There- fore, the subset of interpretable characters is typically not a static concept. Conformance to the Unicode Standard implies that whenever text purports to be unmodi- fied, uninterpretable characters must not be removed or altered. (See also Section 3.1, Con- formance Requirements.)
2.10 Referencing Versions of the Unicode Standard For most character encodings, the character repertoire is fixed (and often small). Once the repertoire is decided upon, it is never changed. Addition of a new abstract character to a given repertoire is conceived of as creating a new repertoire, which then will be given its own catalog number, constituting a new object. For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is potentially a member of the actual set to be encoded, regardless of whether the character is currently known. Each new version of the Unicode Standard replaces the previous one and makes it obsolete, but implementations—and more significantly, data—are not updated instantly. In general, major and minor version changes include new characters, which do not create particular problems with old data. The Unicode Technical Committee will neither remove nor move characters, but characters may be deprecated. This approach does not remove them from the standard or from existing data. The code point will never be used for a different charac- ter, but its use is strongly discouraged. Implementations should be prepared to be forward-compatible with respect to Unicode versions. That is, they should accept text that may be expressed in future versions of this
32 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard General Structure 2.10 Referencing Versions of the Unicode Standard standard, recognizing that new characters may be assigned in those versions. Thus they should handle incoming unassigned code points as they do unsupported characters. (See Section 5.3, Unknown and Missing Characters.) A version change may also involve changes to the properties of existing characters. When this situation occurs, modifications are made to UnicodeData.txt and other relevant con- tributing data files, and a new update version is issued for the standard. Changes to the data files may alter program behavior that depends on them. Version Numbering. Version numbers for the standard consist of three fields: the major version, the minor version, and the update version. The differences among them are as fol- lows: • Major—significant additions to the standard, published as a book. • Minor—character additions or more significant normative changes, published as a technical report on the Unicode Web site. • Update—any other changes to normative or important informative portions of the standard that could change program behavior. These changes are reflected in a new UnicodeData.txt file and other contributing data files of the Unicode Character Database. For additional information on the current and past versions of the Unicode standard, see http://www.unicode.org/unicode/standard/versions/ on the Unicode Web site.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 33 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 3 Conformance 3
This chapter defines conformance to the Unicode Standard in terms of the principles and encoding architecture it embodies. The first section consists of the conformance clauses, followed by sections that define more precisely the technical terms used in those clauses. The remaining sections contain the formal algorithms that are part of conformance and referred to by the conformance clause. These algorithms specify the required results rather than the specific implementation; all implementations that produce results identical to the results of the formal algorithms are conformant. In this chapter, conformance subclauses are identified with a letter C. Definitions are iden- tified with the letter D. Bulleted items are explanatory comments regarding definitions or subclauses. Except for Section 3.12, Bidirectional Behavior, the numbering of rules and definitions matches that of The Unicode Standard, Version 2.0. Where new rules and definitions were added, letters are used with numbers—for example, D7a.
3.1 Conformance Requirements This section specifies the formal conformance requirements for processes implementing Version 3.0 of the Unicode Standard. Note that this clause has been revised from the previ- ous versions of the Unicode Standard. These revisions do not change the substance of the conformance requirements previously set forth, but rather are formalized and extended to allow for the use of transformation formats. Implementations that satisfied the confor- mance clause of the previous versions of the Unicode Standard will satisfy this revised clause.
Byte Ordering C1 A process shall interpret Unicode code values as 16-bit quantities. • Unicode values can be stored in native 16-bit machine words. • For information on use of wchar_t or other programming language types to rep- resent Unicode values, see Section 5.2, ANSI/ISO C wchar_t. C2 The Unicode Standard does not specify any order of bytes inside a Unicode value. • Machine architectures differ in ordering in terms of whether the most significant byte or the least significant byte comes first. These sequences are known as “big- endian” and “little-endian” orders, respectively. C3 A process shall interpret a Unicode value that has been serialized into a sequence of bytes by most significant byte first, in the absence of higher-level protocols. • The majority of all interchange occurs with processes running on the same or a sim- ilar configuration. As a result, intradomain interchange of Unicode text in the
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 37 3.1 Conformance Requirements Conformance
domain-specific byte order is fully conformant and limits the role of the canonical byte order to interchange of Unicode text across domain, or where the nature of the originating domain is unknown. (For a discussion of the use of byte order mark to indicate byte orderings, see Section 2.7, Special Character and Noncharacter Values.)
Invalid Code Values C4 A process shall not interpret an unpaired high- or low-surrogate as an abstract character. C5 A process shall not interpret either U+FFFE or U+FFFF as an abstract character. C6 A process shall not interpret any unassigned code value as an abstract character. • These clauses do not preclude the assignment of certain generic semantics (for example, rendering with a glyph to indicate the character block) that allow for graceful behavior in the presence of code values that are outside a supported subset or code values that are unpaired surrogates. • Private-use code values are assigned, but can be given any interpretation by confor- mant processes.
Interpretation C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded charac- ter representation. • This restriction does not preclude internal transformations that are never visible external to the process. C8 A process shall not assume that it is required to interpret any particular coded character representation. • Any means for specifying a subset of characters that a process can interpret is out- side the scope of this standard. • The semantics of a code value in the Private Use Area is outside the scope of this standard. • Although these clauses are not intended to preclude enumerations or specifications of the characters that a process or system is able to interpret, they do separate sup- ported subset enumerations from the question of conformance. In real life, any sys- tem may occasionally receive an unfamiliar character code that it is unable to interpret. C9 A process shall not assume that the interpretations of two canonical-equivalent charac- ter sequences are distinct. • Ideally, an implementation would always interpret two canonical-equivalent char- acter sequences identically. There are practical circumstances under which imple- mentations may reasonably distinguish them. • Even processes that normally do not distinguish between canonical-equivalent character sequences can have reasonable exception behavior. Some examples of this behavior include graceful fallback processing by processes unable to support correct positioning of nonspacing marks; “Show Hidden Text” modes that reveal memory representation structure; and the choice of ignoring collating behavior of
38 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.1 Conformance Requirements
combining sequences that are not part of the repertoire of a specified language (see Section 5.13, Strategies for Handling Nonspacing Marks).
Modification C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences, if that process purports not to modify the interpretation of that coded character represen- tation. • Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text. • Replacement or deletion of a character sequence that the process cannot or does not interpret does modify the interpretation of the text. • Changing the bit or byte ordering when transforming between different machine architectures does not modify the interpretation of the text. • Transforming to a different encoding form does not modify the interpretation of the text.
Transformations C11 When a process interprets a byte sequence in a Unicode Transformation Format, it shall interpret that byte sequence in accordance with the character semantics established by this standard for the corresponding Unicode character sequence. C12 When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition.
Bidirectional Text C13 A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the bidirectional algorithm had been applied to the text, in the absence of higher-level protocols (see Section 3.12, Bidirectional Behavior).
Unicode Technical Reports The following technical reports are approved and considered part of Version 3.0 of the Uni- code Standard. These reports may contain either normative or informative material, or both. Any reference to Version 3.0 of the standard automatically includes these technical reports. • UTR #11: East Asian Width, Version 5.0 • UTR #13: Unicode Newline Guidelines, Version 5.0 • UTR #14: Line Breaking Properties, Version 6.0 • UTR #15: Unicode Normalization Forms, Version 18.0
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 39 3.2 Semantics Conformance
3.2 Semantics This and the following sections more precisely define the terms that are used in the con- formance clauses. D1 Normative properties and behavior: The following are normative character properties and normative behavior of the Unicode Standard: 1. Simple properties 2. Character combination 3. Canonical decomposition 4. Compatibility decomposition 5. Surrogate property 6. Canonical ordering behavior 7. Bidirectional behavior, as interpreted according to the Unicode bidirectional algorithm 8. Conjoining jamo behavior, as interpreted according to Section 3.11, Conjoining Jamo Behavior D2 Character semantics: The semantics of a character are established by its character name, representative glyph, and normative properties and behavior. • A character may have a broader range of use than the most literal interpretation of its name might indicate; the coded representation, name, and representative glyph need to be taken in context when establishing the semantics of a character. For example, U+002E can represent a sentence period, an abbreviation period, a decimal number separator in English, a thousands number separator in German, and so on. • Consistency with the representative glyph does not require that the images be iden- tical or even graphically similar; rather, it means that both images are generally rec- ognized to be representations of the same character. Representing the character U+0061 by the glyph “X” would violate its character iden- tity. • Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics. • The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols.
3.3 Characters and Coded Representations D3 Abstract character: a unit of information used for the organization, control, or repre- sentation of textual data. • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, numeric, aural, or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
40 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.3 Characters and Coded Representations
• An abstract character has no concrete form and should not be confused with a glyph. • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme. • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters. • Abstract characters not directly encoded by the Unicode Standard can be repre- sented by the use of combining character sequences. D4 Abstract character sequence: an ordered sequence of abstract characters. Each Unicode abstract character is encoded one or more times. Each encoding, which con- sists of the relationship between an abstract character and its scalar value (see D28), is called an encoded character. Each scalar value is represented in either of two ways: as a single code value or as a sequence of two surrogate code values. D5 Code value: the minimal bit combination that can represent a unit of encoded text for processing or interchange. • Other character encoding standards typically use code values defined as 8-bit units. The same is true for the UTF-8 transformation format of the Unicode Standard. However, the code values used in the UTF-16 form of the Unicode Standard are 16- bit units. These 16-bit code values are also known simply as Unicode values. • A code value is also referred to as a code unit in the information industry. • A single abstract character may correspond to more than one code value—for example, U+00C5 Å and U+212B Å . • Multiple code values may be required to represent a single abstract character. For example, a byte is the code unit in SJIS: characters such as “a” can be represented with a single code value, whereas ideographs require two code values. • In some encodings, specific code units cannot be used to represent an encoded character in isolation. This restriction includes a single surrogate (high or low) in Unicode, the bytes 80–FF in UTF-8, or the bytes 81–9F, E0–EF in SJIS. D6 Coded character representation: an ordered sequence of one or more code values that is associated with an abstract character in a given character repertoire. • A Unicode abstract character is generally encoded by a single Unicode code value; the only exception involves surrogate pairs (which are provided for future exten- sion, but are not currently used to represent any abstract characters). D7 Coded character sequence: an ordered sequence of coded character representations. Unless specified otherwise for clarity, in the text of the Unicode Standard the term character alone generally designates a coded character representation. Similarly, the term character sequence alone generally designates a coded character sequence. D7a Deprecated character: a coded character whose use is strongly discouraged. Such characters are retained in the standard, but should not be used. • Deprecated characters are retained in the standard so that previously conforming data stay conformant in future versions of the standard. Deprecated characters are to be distinguished from obsolete characters. • Obsolete characters are historical. They do not occur in modern text, but they are not deprecated; their use is not discouraged.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 41 3.4 Simple Properties Conformance
D8 Higher-level protocol: any agreement on the interpretation of Unicode characters that extends beyond the scope of this standard. Such an agreement need not be for- mally announced in data; it may be implicit in the context.
3.4 Simple Properties The Unicode Standard, Version 3.0, defines the normative simple character properties of case, numeric value, directionality, and mirrored. Chapter 4, Character Properties, contains explicit mappings of characters to character properties. These mappings represent the default properties for conformant processes in the absence of explicit, overriding, higher- level protocols. Additional properties that are specific to particular characters (such as the definition and use of the right-to-left override character or zero-width spaces) are discussed in the relevant sections of this standard. • The Unicode Character Database contains additional properties, such as category and case mappings, that are informative rather than normative. The interpretation of some properties (such as the case of a character) is independent of context, whereas the interpretation of others (such as directionality) is applicable to a char- acter sequence as a whole, rather than to the individual characters that compose the sequence. D9 Directionality property: a property of every graphic character that determines its horizontal ordering as specified in Section 3.12, Bidirectional Behavior. • Interpretation of directional properties according to the Unicode bidirectional algo- rithm is needed for layout of right-to-left scripts such as Arabic and Hebrew. D10 Mirrored property: the property of characters whose images are mirrored horizon- tally in text that is laid out from right to left (versus left to right). (See Section 4.7, Mirrored—Normative.) • In other words, U+0028 is interpreted as an opening parenthesis; in a left-to-right context, this character will appear as “(”; in a right-to-left context, it will be mirrored and appear as “)”. • This is the default behavior in Unicode text. (For more information, see the “Semantics of Paired Punctuation” subsection in Section 6.1, General Punctuation.) D10a Case property: a property of characters in certain alphabets whereby certain charac- ters are considered variants of a single letter. (See Section 4.1, Case—Normative.) D10b Numeric value property: a property of characters used to represent numbers. (See Section 4.6, Numeric Value—Normative.) D11 Special character properties: The behavior of most characters does not require special attention in this standard. However, certain characters exhibit special behavior, which is described in the character block descriptions. These characters are listed in Section 3.9, Special Character Properties. D12 Private use: Unicode values from U+E000 to U+F8FF and surrogate pairs (see Section 3.7, Surrogates) whose high-surrogate is from U+DB80 to U+DBFF are available for private use.
42 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.5 Combination
3.5 Combination D13 Base character: a character that does not graphically combine with preceding charac- ters, and that is neither a control nor a format character. • Most Unicode characters are base characters. This sense of graphic combination does not preclude the presentation of base characters from adopting different con- textual forms or participating in ligatures. D14 Combining character: a character that graphically combines with a preceding base character. The combining character is said to apply to that base character. • These characters are not used in isolation (unless they are being described). They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. • Even though a combining character is intended to be presented in graphical combi- nation with a base character, circumstances may arise where either (1) no base char- acter precedes the combining character or (2) a process is unable to perform graphical combination. In both cases, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character. • The representative images of combining characters are depicted with a dotted circle in the code charts; when presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle. • Combining characters generally take on the properties of their base character, while retaining their combining property. • Control and format characters, such as tab or right-left mark, are not base charac- ters. Combining characters do not apply to them. D15 Nonspacing mark: a combining character whose positioning in presentation is dependent on its base character. It generally does not consume space along the visual baseline in and of itself. • Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an “i” may affect spacing (“î”), as might the character U+20DD - . D16 Spacing mark: a combining character that is not a nonspacing mark. • Examples include U+093F . In general, the behavior of spacing marks does not differ greatly from that of base characters. D17 Combining character sequence: a character sequence consisting of either a base char- acter followed by a sequence of one or more combining characters, or a sequence of one or more combining characters. • A combining character sequence is also referred to as a composite character sequence. D17a Defective combining character sequence: a combining character sequence that does not start with a base character. • Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 43 3.6 Decomposition Conformance
3.6 Decomposition D18 Decomposable character: a character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the names list of Section 14.1, Character Names List. It may also be known as a precomposed charac- ter or composite character. D19 Decomposition: a sequence of one or more characters that is equivalent to a decom- posable character. A full decomposition of a character sequence results from decom- posing each of the characters in the sequence until no characters can be further decomposed.
Compatibility Decomposition D20 Compatibility decomposition: the decomposition of a character that results from recursively applying both the compatibility mappings and the canonical mappings found in the names list of Section 14.1, Character Names List, and those described in Section 3.11, Conjoining Jamo Behavior, until no characters can be further decom- posed, and then reordering nonspacing marks according to Section 3.10, Canonical Ordering Behavior. • A compatibility decomposition may remove formatting information. D21 Compatibility character: a character that has a compatibility decomposition. • Compatibility characters are included in the Unicode Standard to represent distinc- tions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data. • Replacing a compatibility character by its decomposition may lose round-trip con- vertibility with a base standard. D22 Compatibility equivalent: Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.
Canonical Decomposition D23 Canonical decomposition: the decomposition of a character that results from recur- sively applying the canonical mappings found in the names list of Section 14.1, Character Names List, and those described in Section 3.11, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.10, Canonical Ordering Behavior. • The canonical mappings are a subset of the compatibility mappings; however, a canonical decomposition does not remove formatting information. D24 Canonical equivalent: Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. • For example, the sequences
44 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.7 Surrogates
Note: For definitions of canonical composition and compatibility composition, see Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site.
3.7 Surrogates D25 High-surrogate: a Unicode code value in the range U+D800 through U+DBFF. D26 Low-surrogate: a Unicode code value in the range U+DC00 through U+DFFF. D27 Surrogate pair: a coded character representation for a single abstract character that consists of a sequence of two Unicode values, where the first value of the pair is a high-surrogate and the second is a low-surrogate. • Unlike combining characters, which have independent semantics and properties, high- and low-surrogates have no interpretation when they do not appear as part of a surrogate pair. • Surrogate pairs are designed to allow representation of characters in future exten- sions of the Unicode Standard. There are no such currently assigned characters in this version of the standard, but it is widely expected that such characters will be added in the not too distant future. For more information, see Section 13.4, Surro- gates Area, and Section 5.4, Handling Surrogate Pairs.
D28 Unicode scalar value: a number N from 0 to 10FFFF16 defined by applying the fol- lowing algorithm to a character sequence S:
N = U If S is a single, nonsurrogate value
N = (H – D80016) * 40016 + If S is a surrogate pair
• Unicode scalar values are defined for use by standards such as SGML, XML, and HTML, which require a scalar value associated with abstract characters. • This algorithm is identical with the ISO/IEC 10646 algorithm used to transform UTF-16 into UCS-4 (for more information, see Appendix C, Relationship to ISO/ IEC 10646). • The reverse mapping from the Unicode scalar value to a surrogate pair is given by
H = (S – 1000016) / 40016 + D80016
L = (S – 1000016) % 40016 + DC0016 The operators “/” and “%” are as defined in Section 0.2, Notational Conventions.
• A Unicode scalar value is also referred to as a code position or a code point in the information industry.
3.8 Transformations More than one representation of Unicode data can be conformant to the Unicode Stan- dard. Chief among them is UTF-8, discussed in Section 2.3, Encoding Forms, and Appendix C.3, UCS Transformation Formats. In addition, there are compression transfor- mations such as the one described in the Unicode Technical Report #6, “A Standard
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 45 3.8 Transformations Conformance
Compression Scheme for Unicode” on the CD-ROM or the up-to-date version on the Uni- code Web site. D29 A Unicode (or UCS) transformation format (UTF) transforms each Unicode scalar value into a unique sequence of code values. A UTF may specify a byte order for the serialization of the code values into bytes. A UTF may also specify the use of a byte order mark. • Code values are particular units of computer storage specified by the transforma- tion format—for example, 16-bit integers or bytes. In the latter case, a code value sequence can be referred to as a byte sequence. • Any sequence of code values that would correspond to a scalar value greater than 10FFFF16 is illegal. Because every Unicode coded character sequence maps to a unique sequence of code values in a given UTF, a reverse mapping can be derived. Thus every UTF supports lossless round- trip transcoding: mapping from any Unicode coded character sequence S to a sequence of code values and back will produce S again. To ensure that round-trip transcoding is possi- ble, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE16, FFFF16, and unpaired surrogates. D30 For a given UTF, a code value sequence that cannot be produced from any sequence of Unicode scalar values is called an ill-formed code value sequence. D31 For a given UTF, a code value sequence that cannot be mapped back to a sequence of Unicode scalar values is called an illegal code value sequence.
• For example, in UTF-8 every code value of the form 110xxxxx2 must be followed with a code value of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxx2 is illegal and must never be generated. When faced with this illegal code value sequence while transforming or interpreting, a UTF-8 conformant process must treat the first code value 110xxxxx2 as an illegal termination error—for example, by signaling an error, filtering the code value out, or representing the code value with a marker such as U+FFFD . In the latter two cases, it will continue processing at the second code value 0xxxxxx2. D32 For a given UTF, an ill-formed code value sequence that is not illegal is called an irregular code value sequence. • To make implementations simpler and faster, some transformation formats may allow irregular code value sequences without requiring error handling. For exam- ple, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 con- formant process may map the code value sequence C0 80 (110000002 100000002) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence—it shall generate the sequence 00 (000000002) instead. A conformant process shall not use irregular code value sequences to encode out-of-band information. D33 UTF-16BE is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in big-endian format. An initial sequence corresponding to U+FEFF is interpreted as a - . • In UTF-16BE, <004D 0061 0072 006B> is serialized as <00 4D 00 61 00 72 00 6B>. D34 UTF-16LE is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a - . • In UTF-16LE, <004D 0061 0072 006B> is serialized as <4D 00 61 00 72 00 6B 00>.
46 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.9 Special Character Properties
D35 UTF-16 is the Unicode Transformation Format that serializes a Unicode value as a sequence of two bytes, in either big-endian or little-endian format. An initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark: it is used to distinguish between the two byte orders. The byte order mark is not considered part of the content of the text. A serialization of Unicode values into UTF-16 may or may not begin with a byte order mark. • In UTF-16, <004D 0061 0072 006B> is serialized as
Scalar Value UTF-16 1st Byte 2nd Byte 3rd Byte 4th Byte 000000000xxxxxxx 000000000xxxxxxx 0xxxxxxx 00000yyyyyxxxxxx 00000yyyyyxxxxxx 110yyyyy 10xxxxxx zzzzyyyyyyxxxxxx zzzzyyyyyyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx uuuuuzzzzyyyyyyxxxxxx 110110wwwwzzzzyy+ 11110uuua 10uuzzzz 10yyyyyy 10xxxxxx 110111yyyyxxxxxx
a. Where uuuuu = wwww + 1 (to account for addition of 1000016 as in Section 3.7, Sur- rogates).
When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irreg- ular UTF-8 sequences shall not be used for encoding any other information. When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.
3.9 Special Character Properties The behavior of most characters does not require special attention in this standard. How- ever, the following characters exhibit special behavior, as described in Chapter 13, Special Areas and Format Characters, and in Chapter 4, Character Properties.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 47 3.9 Special Character Properties Conformance
• Line boundary control 0009 HORIZONTAL TAB 000A LINE FEED 000C FORM FEED 000D CARRIAGE RETURN 0020 SPACE 00A0 NO-BREAK SPACE 0F0B TIBETAN MARK INTERSYLLABIC TSHEG 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR 2000 EN QUAD 2002 EN SPACE 2003 EM SPACE 2004 THREE-PER-EM SPACE 2005 FOUR-PER-EM SPACE 2006 SIX-PER-EM SPACE 2007 FIGURE SPACE 2008 PUNCTUATION SPACE 2009 THIN SPACE 200A HAIR SPACE 200B ZERO WIDTH SPACE 2011 NON-BREAKING HYPHEN 2028 LINE SEPARATOR 2029 PARAGRAPH SEPARATOR 202F NARROW NO-BREAK SPACE FEFF ZERO WIDTH NO-BREAK SPACE • Hyphenation control 002D HYPHEN-MINUS 00AD SOFT HYPHEN 058A ARMENIAN HYPHEN 1806 MONGOLIAN TODO SOFT HYPHEN 2010 HYPHEN 2011 NON-BREAKING HYPHEN 2027 HYPHENATION POINT • Fraction formatting 2044 FRACTION SLASH • Special behavior with nonspacing marks 0020 SPACE 0069 LATIN SMALL LETTER I 006A LATIN SMALL LETTER J 00A0 NO-BREAK SPACE 0131 LATIN SMALL LETTER DOTLESS I • Double nonspacing marks 0360 COMBINING DOUBLE TILDE 0361 COMBINING DOUBLE INVERTED BREVE 0362 COMBINING DOUBLE RIGHTWARDS ARROW BELOW • Joining 200C ZERO WIDTH NON-JOINER 200D ZERO WIDTH JOINER • Bidirectional ordering 200E LEFT-TO-RIGHT MARK 200F RIGHT-TO-LEFT MARK 202A LEFT-TO-RIGHT EMBEDDING 202B RIGHT-TO-LEFT EMBEDDING 202C POP DIRECTIONAL FORMATTING 202D LEFT-TO-RIGHT OVERRIDE
48 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.9 Special Character Properties
202E RIGHT-TO-LEFT OVERRIDE • Alternate formatting 206A INHIBIT SYMMETRIC SWAPPING 206B ACTIVATE SYMMETRIC SWAPPING 206C INHIBIT ARABIC FORM SHAPING 206D ACTIVATE ARABIC FORM SHAPING 206E NATIONAL DIGIT SHAPES 206F NOMINAL DIGIT SHAPES • Syriac abbreviation 070F SYRIAC ABBREVIATION MARK • Indic dead-character formation 094D DEVANAGARI SIGN VIRAMA 09CD BENGALI SIGN VIRAMA 0A4D GURMUKHI SIGN VIRAMA 0ACD GUJARATI SIGN VIRAMA 0B4D ORIYA SIGN VIRAMA 0BCD TAMIL SIGN VIRAMA 0C4D TELUGU SIGN VIRAMA 0CCD KANNADA SIGN VIRAMA 0D4D MALAYALAM SIGN VIRAMA 0DCA SINHALA SIGN AL-LAKUNA 0F84 TIBETAN SIGN HALANTA 1039 MYANMAR SIGN VIRAMA 17D2 KHMER SIGN COENG • Mongolian variant selectors 180B MONGOLIAN FREE VARIATION SELECTOR ONE 180C MONGOLIAN FREE VARIATION SELECTOR TWO 180D MONGOLIAN FREE VARIATION SELECTOR THREE 180E MONGOLIAN VOWEL SEPARATOR • Ideographic variation indication 303E IDEOGRAPHIC VARIATION INDICATOR • Ideographic description 2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT 2FF1 IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW 2FF2 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT 2FF3 IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW 2FF4 IDEOGRAPHIC DESCRIPTION CHARACTER FULL SUR- ROUND 2FF5 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE 2FF6 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW 2FF7 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT 2FF8 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT 2FF9 IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT 2FFA IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT 2FFB IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 49 3.10 Canonical Ordering Behavior Conformance
• Interlinear annotation FFF9 INTERLINEAR ANNOTATION ANCHOR FFFA INTERLINEAR ANNOTATION SEPARATOR FFFB INTERLINEAR ANNOTATION TERMINATOR • Object replacement FFFC OBJECT REPLACEMENT CHARACTER • Code conversion fallback FFFD REPLACEMENT CHARACTER • Byte order signature FEFF ZERO WIDTH NO-BREAK SPACE
3.10 Canonical Ordering Behavior The purpose of this section is to provide unambiguous interpretation of a combining char- acter sequence. In the Unicode Standard, the order of characters in a combining character sequence is interpreted according to the following principles: • In the Unicode Standard, all combining characters are encoded following the base characters to which they apply. Thus the Unicode sequence U+0061 “a” + U+0308 “þ”í + U+0075 “u” is unambiguously interpreted (and displayed) as “äu”, not “aü”. • Enclosing nonspacing marks surround all previous characters up to and including the base character (see Figure 3-1). They thus successively surround previous enclosing nonspacing marks. Figure 3-1. Enclosing Marks
a + @ + @¨ + @ ➠ a¨
• Double diacritics always bind more loosely than other nonspacing marks. When rendering, the double diacritic will float above other diacritics, excluding enclosing diacritics (see Figure 3-2). Figure 3-2. Positioning of Double Diacritics
~ ~ o + @ˆ + @ + o + @¨ ➠ ôö ~ ~ o + @ + @ˆ + o + @¨ ➠ ôö
• Combining marks with the same combining class are generally positioned graphi- cally outward from the base character they modify. Some specific nonspacing marks override the default stacking behavior by being positioned side-by-side rather than stacking or by ligaturing with an adjacent nonspacing mark. When positioned side- by-side, the order of codes is reflected by positioning in the dominant order of the script with which they are used.
50 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.10 Canonical Ordering Behavior
• If combining characters have different combining classes—for example, when one nonspacing mark is above a base character form and another is below it—then no distinction of graphic form or semantic will result. The following subsections formalize these principles in terms of a normative list of com- bining classes and an algorithmic statement of how to use those combining classes to unambiguously interpret a combining character sequence.
Combining Classes The Unicode Standard treats sequences of nonspacing marks as equivalent if they do not typographically interact. The canonical ordering algorithm defines a method for determin- ing which sequences interact and gives a canonical ordering of these sequences for use in equivalence comparisons. D37 Combining class: a numeric value given to each combining Unicode character that determines with which other combining characters it typographically interacts. • See Section 4.2, Combining Classes—Normative, for a list of the combining classes for Unicode characters. Characters have the same class if they interact typographically, and different classes if they do not. • Enclosing characters and spacing combining characters have the class of their base characters. • The particular numeric value of the combining class does not have any special sig- nificance; the intent of providing the numeric values is only to distinguish the com- bining classes as being different, for use in equivalence comparisons.
Canonical Ordering The canonical ordering of a decomposed character sequence results from a sorting process that acts on each sequence of combining characters according to their combining class. Characters with combining class zero never sort relative to other characters, so the amount of work in the algorithm depends on the number of non-class-zero characters in a row. An implementation of this algorithm will be extremely fast for typical text. The algorithm described here represents a logical description of the process. Optimized algorithms can be used in implementations as long as they are equivalent—that is, as long as they produce the same result. More explicitly, the canonical ordering of a decomposed character sequence D results from the following algorithm. R1 For each character x in D, let p(x) be the combining class of x. R2 Whenever any pair (A, B) of adjacent characters in D is such that p(B) Å 0 & p(A) > p(B), exchange those characters. R3 Repeat step R2 until no exchanges can be made among any of the characters in D.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 51 3.11 Conjoining Jamo Behavior Conformance
Examples of this ordering appear in Table 3-2. Table 3-2. Sample Combining Classes
Combining Abbreviation Code Unicode Name Class 0a 0061 220 underdot 0323 230 diaeresis 0308 230 breve 0306 0a-underdot1EA1 0 a-diaeresis 00E4 0a-breve0103
a + underdot + diaeresis Æ a + underdot + diaeresis a + diaeresis + underdot Æ a + underdot + diaeresis Because underdot has a lower combining class than diaeresis, the algorithm will return the a, then the underdot, then the diaeresis. However, because diaeresis and breve have the same combining class (because they interact typographically), they do not rearrange. a + breve + diaeresis Ç a + diaeresis + breve a + diaeresis + breve Ç a + breve + diaeresis Applying the algorithm gives the results shown in Ta ble 3-3. Table 3-3. Canonical Ordering Results
Original Decompose Sort Result a-diaeresis + underdot a + diaeresis + underdot a + underdot + diaeresis a + underdot + diaeresis a + diaeresis + underdot a + underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a-underdot + diaeresis a + underdot + diaeresis a + underdot + diaeresis a-diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + diaeresis + breve a + breve + diaeresis a + breve + diaeresis a-breve + diaeresis a + breve + diaeresis a + breve + diaeresis
Use with Collation When collation processes do not require correct sorting outside of a given domain, they are not required to invoke the canonical ordering algorithm for excluded characters. For exam- ple, a Greek collation process may not need to sort Cyrillic letters properly; in that case, it does not have to maximally decompose and reorder Cyrillic letters and may just choose to sort them according to Unicode order. For the complete treatment of collation, see Unicode Technical Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site.
3.11 Conjoining Jamo Behavior The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic syllable blocks as well as modern syllable blocks. This section describes how to
52 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.11 Conjoining Jamo Behavior
• Determine the syllable boundaries in a sequence of conjoining jamo characters • Compose jamo characters into Hangul syllables • Determine the canonical decomposition of Hangul syllables • Algorithmically determine the names of the Hangul syllable characters (For more information, see the “Hangul Syllables” and “Hangul Jamo” subsections in Section 10.4, Hangul.) The jamo characters can be classified into three sets of characters: choseong (leading conso- nants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). In the following discussion, these jamo are abbreviated as L (leading consonant), V (vowel), and T (trailing conso- nant); syllable breaks are shown by middle dots “·”; and non-jamo are shown by X.
Syllable Boundaries In rendering, a sequence of jamos is displayed as a series of syllable blocks. The following rules specify how to divide up an arbitrary sequence of jamos (including nonstandard sequences) into these syllable blocks. In these rules, a choseong filler (L ) is treated as a cho- f seong character, and a jungseong filler (V ) is treated as a jungseong. f Within any sequence of characters, a syllable break occurs between the pairs of characters shown in Table 3-4. All other sequences of Hangul jamos are considered to be part of the same syllable. Note that like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from forming a syllable. Table 3-4. Hangul Syllable Break Rules
Condition Example Any conjoining jamo and any non-jamo L·X, V·X, T·X, X·L, X·V, X·T A jongseong (trailing) and choseong (leading) T·L A jungseong (vowel) and a choseong (leading) V·L A jongseong (trailing) and jungseong (vowel) T·V
Standard Syllables A standard syllable block is composed of a sequence of choseong followed by a sequence of jungseong and optionally a sequence of jongseong (for example, S = LV or LVT ). A sequence of nonstandard syllable blocks can be transformed into a sequence of standard syllable blocks by inserting choseong fillers and jungseong fillers. Examples. In Table 3-5, row (1) shows syllable breaks in a standard sequence, row (2) shows syllable breaks in a nonstandard sequence, and row (3) shows how the sequence in (2) could be transformed into standard form by inserting fillers into each syllable. Table 3-5. Syllable Break Examples
No. Sequence Sequence with Syllable Breaks Marked 1 LVTLVLVLV L VL V T → LVT · LV · LV · LV · L V · L V T f f f f f f f f 2 LLTVLTLTVVLL → LLT · V · LT · LT · VV · LL 3 LLTVLTLTVVLL → LLV T · L V · LV T · LV T · L VV · LLV f f f f f f
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 53 3.11 Conjoining Jamo Behavior Conformance
Hangul Syllable Composition The following algorithm describes how to take a sequence of canonically decomposed char- acters D and compose Hangul syllables. Hangul composition and decomposition are sum- marized here, but for a more complete description, implementers must consult Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site. Note that, like other non-jamo characters, any combining mark between two conjoining jamos prevents the jamos from composing. First, define the following constants: SBase = AC0016 LBase = 110016 VBase = 116116 TBase = 11A716 SCount = 11172 LCount = 19 VCount = 21 TCount = 28 NCount = VCount * TCount 1. Process D by composing the conjoining jamo wherever possible, according to the compatibility decomposition rules in Chapter 14, Code Charts. (Typical interchange of conjoining jamo will be in precomposed forms. In such cases, this step may not be necessary. Raw keyboard data, on the other hand, may be in the form of a compatibility decomposition.) 2. Let i represent the current position in the sequence D. Compute the following indices, which represent the ordinal number (zero-based) for each of the com- ponents of a syllable, and the index j, which represents the index of the last character in the syllable. LIndex = D[i] - LBase VIndex = D[i+1] - VBase TIndex = D[i+2] - TBase j = i + 2 3. If either of the first two characters is out of bounds (LIndex < 0 OR LIndex >= LCount OR VIndex < 0 OR VIndex >= VCount), then increment i, return to step 2, and continue from there. 4. If the third character is out of bounds (TIndex <= 0 or TIndex >= TCount), then it is not part of the syllable. Reset the following: TIndex = 0 j = i + 1 5. Replace the characters D[i] through D[j] by the Hangul syllable S, and set i to be j+1. S = (LIndex * VCount + VIndex) * TCount + TIndex + SBase. Example. The first three characters are U+1111 ñ U+1171 Ö U+11B6 -
Compute the following indices, LIndex = 17 VIndex = 16 TIndex = 15
54 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior and replace the three characters by S = [(17 * 21) + 16] * 28 + 15 + SBase = D4DB16 = /
Hangul Syllable Decomposition The following describes the reverse mapping—how to take Hangul syllable S and derive the canonical decomposition D. This normative mapping for these characters is equivalent to the canonical mapping in the character charts for other characters. 1. Compute the index of the syllable: SIndex = S - SBase 2. If SIndex is in the range (0 <= SIndex < SCount), then compute the compo- nents as follows: L = LBase + SIndex / NCount V = VBase + (SIndex % NCount) / TCount T = TBase + SIndex % TCount The operators “/” and “%” are as defined in Section 0.2, Notational Conventions. 3. If T = TBase, then there is no trailing character, so replace S by the sequence
Hangul Syllable Names The character names for Hangul syllables are derived from the decomposition by starting with the string , and appending the short name of each decomposition component in order. (See Section 4.4, Jamo Short Names—Normative.) For example, for U+D4DB, derive the decomposition, as shown in the preceding example. It produces the following three-character sequence: U+1111 U+1171 U+11B6 - The character name for U+D4DB is then generated as . This character name is a normative property of the character.
3.12 Bidirectional Behavior The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, in several scripts (such as Arabic or Hebrew), the natural ordering of horizontal text in display is from right to left. If all of the text has the same horizontal direction, then the ordering of the display text is unambiguous. However, when bidirectional text (a mix- ture of left-to-right and right-to-left horizontal text) is present, some ambiguities can arise in determining the ordering of the displayed characters.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 55 3.12 Bidirectional Behavior Conformance
This section describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit format codes for special circumstances. In most cases, there is no need to include additional information with the text to obtain cor- rect display ordering. In the case of bidirectional text, there are circumstances where an implicit bidirectional ordering will not suffice to produce comprehensible text. To deal with these cases, a mini- mal set of directional formatting codes is defined to control the ordering of characters when rendered. This allows for exact control of the display ordering for legible interchange and also ensures that plain text used for simple items like file names or labels can always be correctly ordered for display. The directional formatting codes are used only to influence the display ordering of text. In all other respects, they should be ignored—they have no effect on the comparison of text, word breaks, parsing, or numeric analysis. When working with bidirectional text, the characters are still interpreted in logical order— only the display is affected. The display ordering of bidirectional text depends upon the directional properties of the characters in the text.
Directional Formatting Codes Two types of explicit codes are used to modify the standard implicit Unicode bidirectional algorithm. In addition, there are implicit ordering codes, the right-to-left and left-to-right marks. All of these codes are limited to the current paragraph; thus their effects are termi- nated by a paragraph separator. The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters. Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside the embedding, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm. Explicit Directional Embedding. The following codes signal that a piece of text is to be treated as embedded. For example, an English quotation in the middle of an Arabic sen- tence could be marked as being embedded left-to-right text. If a Hebrew phrase occurred in the middle of the English quotation, then that phrase could be marked as being embedded right-to-left. These codes allow for nested embeddings. RLE Right-to-Left Embedding Treat the following text as embedded right- to-left. LRE Left-to-Right Embedding Treat the following text as embedded left-to- right. The precise meaning of these codes will be made clear in the discussion of the algorithm. The effect of right-left line direction, for example, can be accomplished by simply embed- ding the text with RLE...PDF. Explicit Directional Overrides. The following codes allow the bidirectional character types to be overridden when required for special cases, such as for part numbers. These codes allow for nested directional overrides.
56 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
RLO Right-to-Left Override Force following characters to be treated as strong right-to-left characters. LRO Left-to-Right Override Force following characters to be treated as strong left-to-right characters. The precise meanings of these codes will be made clear in the discussion of the algorithm. The right-to-left override, for example, can be used to force a part number made of mixed English, digits, and Hebrew letters to be written from right to left. Terminating Explicit Directional Code. The following code terminates the effects of the last explicit code (either embedding or override) and restores the bidirectional state to what it was before that code was encountered. PDF Pop Directional Format Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO. Implicit Directional Marks. These characters are very lightweight codes. They act exactly like right-to-left or left-to-right characters, except that they are not displayed and do not have any other semantic effect. Their use is generally more convenient than the explicit embeddings or overrides because their scope is much more local. RLM Right-to-Left Mark Right-to-left zero-width character LRM Left-to-Right Mark Left-to-right zero-width character There is no special mention of the implicit directional marks in the following algorithm. That omission occurs because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
Basic Display Algorithm The bidirectional algorithm takes a stream of text as input and proceeds in three main phases: • Separation of the input text into paragraphs. The rest of the algorithm affects only the text between paragraph separators. • Resolution of the embedding levels of the text. In this phase, the directional charac- ter types, plus the explicit format codes, are used to produce resolved embedding levels. • Reordering the text for display on a line-by-line basis using the resolved embedding levels, once the text has been broken into lines. The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by U+2029 - or the appropriate Newline Function (see Section 4.3, Directionality— Normative and Unicode Technical Report #13, “Unicode Newline Guidelines,” found on the CD-ROM or the up-to-date version on the Unicode Web site for information on the handling of CR, LF, and CRLF). Paragraphs may also be determined by higher-level proto- cols; for example, the text in two different cells of a table will be in different paragraphs. Combining characters are always attached to the preceding base character in the memory representation. Even after reordering for display and performing character shaping, the glyph representing a combining character will be attached to the glyph representing its base character in memory. Depending on the line orientation and the placement direction of base letterform glyphs, it may, for example, become attached to the glyph on the left, or on the right, or above.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 57 3.12 Bidirectional Behavior Conformance
In the following subsections, the normative definitions and rules are distinguished by the numbering given in Table 3-6. Table 3-6. Normative Definitions and Rules
Numbering Section BDn Definitions Pn Paragraph levels Xn Explicit levels and directions Wn Weak types Nn Neutral types In Implicit levels Ln Resolved levels
Definitions BD1 The bidirectional character types are values assigned to each Unicode character, including unassigned characters. BD2 Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61. • Embedding levels are explicitly set by both override format codes and embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guar- antee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings. BD3 The default direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd. • For example, in a particular piece of text, level 0 is plain English text, level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embed- ding level will become clear when the reordering algorithm is discussed, but this section provides an example of how the algorithm works. BD4 The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph. BD5 The direction of the paragraph embedding level is called the paragraph direction. • In some contexts, the paragraph direction is also known as the base direction. BD6 The directional override status determines whether the bidirectional type of charac- ters is reset with explicit directional controls. This status has three states, shown in Table 3-7. BD7 A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level.
58 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
Table 3-7. Directional Override Status
Status Interpretation Neutral No override is currently active Right-to-left Characters are to be reset to R Left-to-right Characters are to be reset to L
Example. In this and the following examples, case is used to indicate different implicit character types for those unfamiliar with right-to-left letters. Uppercase letters stand for right-to-left characters (such as Arabic or Hebrew), whereas lowercase letters stand for left- to-right characters (such as English or Russian). Memory: car is THE CAR in arabic Character types: LLL-LL-RRR-RRR-LL-LLLLLL Resolved levels: 000000011111110000000000 Notice that the neutral character (space) between THE and CAR gets the level of the sur- rounding characters. In this way, the implicit directional marks have an effect. By inserting appropriate directional marks around neutral characters, the level of the neutral characters can be changed. Bidirectional Character Types. The normative bidirectional character types for each char- acter are specified in the Unicode Character Database and summarized in Table 3-8. Table 3-8. Bidirectional Character Types
Category Type Description Scope L Left-to-Right LRM, most alphabetic, syllabic, Han ideo- graphic characters, digits that are neither European nor Arabic, all unassigned charac- ters except in the ranges (0590–05FF, FB1D– FB4F) and (0600–07BF, FB50–FDFF, FE70– FEFF) LRE Left-to-Right Embedding LRE LRO Left-to-Right Override LRO Strong R Right-to-Left RLM, Hebrew alphabet, most punctuation specific to that script, all unassigned charac- ters in the ranges (0590–05FF, FB1D–FB4F) AL Right-to-Left Arabic Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, all unassigned characters in the ranges (0600– 07BF, FB50–FDFF, FE70–FEFF) RLE Right-to-Left Embedding RLE RLO Right-to-Left Override RLO
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 59 3.12 Bidirectional Behavior Conformance
Table 3-8. Bidirectional Character Types (Continued)
Category Type Description Scope PDF Pop Directional Format PDF EN European Number European digits, Eastern Arabic-Indic digits, ... ES European Number Separator Solidus (slash) ET European Number Terminator Plus sign, minus sign, degree, currency sym- bols, ... AN Arabic Number Arabic-Indic digits, Arabic decimal and thou- Weak sands separators, ... CS Common Number Separator Colon, comma, full stop (period), - , ... NSM Nonspacing Mark Characters marked Mn (nonspacing mark) and Me (enclosing mark) in the Unicode Character Database BN Boundary Neutral Formatting and control characters, other than those explicitly given types above (to be ignored in processing bidirectional text) B Paragraph Separator Paragraph separator, appropriate newline functions, higher-protocol paragraph deter- mination S Segment Separator Tab Neutral WS Whitespace Space, figure space, line separator, form feed, general punctuation spaces, ... ON Other Neutrals All other characters, including
• The term European digits is used to refer to decimal forms common in Europe and elsewhere, and the term Arabic-Indic digits refers to the native Arabic forms. (See Section 8.2, Arabic, for more details on naming digits.) • Unassigned characters are given strong types in the algorithm. This convention is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirec- tional types may change. • Private-use characters can be assigned different values by a conformant implemen- tation. • For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are an (U+FFFC). Table 3-9 lists additional abbreviations used in the examples and internal character types used in the algorithm. Table 3-9. Abbreviations for Examples and Internal Types
Symbol Description N Neutral or separator (B, S, WS, ON). e The text ordering type (L or R) that matches the embedding level direction (even or odd) for the current run. Cf. X10 below. sor The text ordering type (L or R) assigned to the position before a level run. eor The text ordering type (L or R) assigned to the position after a level run.
60 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
Resolving Embedding Levels The body of the bidirectional algorithm uses character types and explicit codes to produce a list of resolved levels. This resolution process consists of five steps: (1) determining the paragraph level; (2) determining explicit embedding levels and directions; (3) resolving weak types; (4) resolving neutral types; and (5) resolving implicit embedding levels. Paragraph Level. The bidirectional algorithm applies to paragraphs. P1. Split the text into separate paragraphs. A paragraph separator is kept with the previous paragraph. Within each paragraph, apply all other rules of this algorithm. P2. In each paragraph, find the first character that is a strong directional type (L, AL, R). Because paragraph separators delimit text in this algorithm, they will generally be the first strong character after a paragraph separator or at the very beginning of the text. P3. If a character is found in rule P2 and it is of type AL or R, then set the paragraph embedding level to 1; otherwise, set it to zero. Note that when a higher-level protocol specifies the paragraph level, it is not necessary to apply rules P2 and P3. Explicit Levels and Directions. All explicit embedding levels are determined from the embedding and override codes, by applying the explicit level rules X1 through X9. These rules are applied as part of the same logical pass over the input. Explicit Embeddings. An explicit embedding code sets the level of the text, but does not change the directional character type of affected characters. X1. Begin by setting the current embedding level to the paragraph embedding level. Set the directional override status to neutral. Process each character iteratively, applying rules X2 through X9. Only embedding levels from 0 to 61 are valid in this phase. In the resolution of levels in rules I1 and I2, the maximum embedding level of 62 can be reached. X2. With each RLE, compute the least greater odd embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. For example, level 0 ò 1; levels 1, 2 ò 3; levels 3, 4 ò 5; ...59, 60 ò 61; above 60, no change (don’t change levels with RLE if the new level would be invalid). X3. With each LRE, compute the least greater even embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to neutral. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. For example, levels 0, 1 ò 2; levels 2, 3 ò 4; levels 4, 5 ò 6; ...58, 59 ò 60; above 59, no change (don’t change levels with LRE if the new level would be invalid). Explicit Overrides. An explicit directional override sets the embedding level in the same way that the explicit embedding codes do, but also changes the directional character type of affected characters to the override direction.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 61 3.12 Bidirectional Behavior Conformance
X4. With each RLO, compute the least greater odd embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to right-to-left. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. X5. With each LRO, compute the least greater even embedding level. a. If this new level would be valid, then this embedding code is valid. Remember (push) the current embedding level and override status. Reset the current level to this new level, and reset the override status to left-to-right. b. If the new level would not be valid, then this code is invalid. Don’t change the current level or override status. X6. For all types besides RLE, LRE, RLO, LRO, and PDF: a. Set the level of the current character to the current embedding level. b. Whenever the directional override status is not neutral, reset the current character type to the directional override status. If the directional override status is neutral, then characters retain their normal types: Ara- bic characters stay AL, Latin characters stay L, neutrals stay N, and so on. If the directional override status is R, then characters become R. If the directional override status is L, then characters become L. Terminating Embeddings and Overrides. A single code terminates the scope of the current explicit code, whether an embedding or a directional override. All codes and pushed states are completely popped at the end of paragraphs. X7. With each PDF, determine the matching embedding or override code. If a valid match- ing code was found, restore (pop) the last remembered (pushed) embedding level and directional override. X8. All explicit directional embeddings and overrides are completely terminated at the end of each paragraph. Paragraph separators are not included in the embedding. X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes. • Note that an implementation does not have to actually remove the codes, it just has to behave as though the codes were not present for the remainder of the algorithm. Conformance does not require any particular placement of these codes as long as all other characters are ordered correctly. See “Implementation Notes” later in this section for information on implementing the algorithm without removing the formatting codes. X10. The remaining rules are applied to each run of characters at the same level. For each run, determine the start-of-level-run (sor) and end-of-level-run (eor) type, either L or R. This determination depends on the higher of the two levels on either side of the boundary (at the start or end of the paragraph, the level of the “other” run is the base embedding level). If the higher level is odd, the type is R; otherwise, it is L. For example: Levels: 000111 2 Runs: Ä 1 ÅÄ2 Å <3>
62 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
Run 1 is at level 0, sor is L, eor is R. Run 2 is at level 1, sor is R, eor is L. Run 3 is at level 2, sor is L, eor is L. For two adjacent runs, the eor of the first run is the same as the sor of the second. Resolving Weak Types. Weak types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used. Nonspacing marks are now resolved based on the previous characters. W1. Examine each nonspacing mark (NSM) in the level run, and change the type of the NSM to the type of the previous character. If the NSM is at the start of the level run, it will get the type of sor. Assume in this example that sor is R: AL NSM NSM ò AL AL AL sor NSM ò sor R The text is next parsed for numbers. This pass will change the directional types European Number Separator, European Number Terminator, and Common Number Separator to be European Number text, Arabic Number text, or Other Neutral text. The text to be scanned may have already had its type altered by directional overrides. If so, then it will not parse as numeric. W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sor) is found. If an AL is found, change the type of the European number to Arabic number. AL EN ò AL AN AL N EN ò AL N AN sor N EN ò sor N EN L N EN ò L N EN R N EN ò R N EN W3. Change all ALs to R. W4. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type: EN ES EN ò EN EN EN EN CS EN ò EN EN EN AN CS AN ò AN AN AN W5. A sequence of European terminators adjacent to European numbers changes to all Euro- pean numbers: ET ET EN ò EN EN EN EN ET ET ò EN EN EN AN ET EN ò AN EN EN W6. Otherwise, separators and terminators change to Other Neutral: AN ET ò AN ON L ES EN ò L ON EN EN CS AN ò EN ON AN ET AN ò ON AN
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 63 3.12 Bidirectional Behavior Conformance
W7. Search backward from each instance of a European number until the first strong type (R, L, or sor) is found. If an L is found, then change the type of the European number to L. L N EN ò L N L R N EN ò R N EN Resolving Neutral Types. Neutral types are now resolved one level run at a time. At level run boundaries where the type of the character on the other side of the boundary is required, the type assigned to sor or eor is used. The next phase resolves the direction of the neutrals. The results of this phase are that all neutrals become either R or L. Generally, neutrals take on the direction of the surrounding text. In case of a conflict, they take on the embedding direction. N1. A sequence of neutrals takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers are treated as though they were R. Start-of-level-run (sor) and end-of-level-run (eor) are used at level run boundaries. R N R ò R R R L N L ò L L L R N AN ò R R AN AN N R ò AN R R R N EN ò R R EN EN N R ò EN R R EN N AN ò EN R AN • Any European numbers after an L have already been turned into L by this time, so only the European numbers after R are treated as R. N2. Any remaining neutrals take the embedding direction. N ò e Assume in this example that eor is L, and sor is R: L N eor ò L L eor R N eor ò R e eor sor N L ò sor e L sor N R ò sor R R Examples. A list of numbers separated by neutrals and embedded in a directional run will come out in the run’s order. Storage: he said “THE VALUES ARE 123, 456, 789, OK”. Display: he said “KO ,789 ,456 ,123 ERA SEULAV EHT”. In this case, both the comma and the space between the numbers take on the direction of the surrounding text (uppercase = right-to-left), ignoring the numbers. The commas are not considered part of the number because they are not surrounded on both sides. If an adjacent left-to-right sequence is present, then European numbers will adopt that direc- tion. Storage: he said “IT IS A bmw 500, OK.” Display: he said “.KO ,bmw 500 A SI TI” Resolving Implicit Levels. In the final phase, the embedding level of text may be increased, based upon the resolved character type. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level. (Note that it is
64 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior possible for text to end up at levels higher than 61 as a result of this process.) This situation results in the following rules: I1. For all characters with an even (left-to-right) embedding direction, those of type R go up one level and those of type AN or EN go up two levels. I2. For all characters with an odd (right-to-left) embedding direction, those of type L, EN, or AN go up one level. Table 3-10 summarizes the results of the implicit algorithm. Table 3-10. Resolving Implicit Levels
Embedding Level Type Even Odd LELEL+1 REL+1EL AN EL+2 EL+1 EN EL+2 EL+1
Reordering Resolved Levels The following algorithm describes the logical process of finding the correct display order. As noted earlier, this logical process is not necessarily the actual implementation, which may diverge for efficiency as long as it produces the same results. As opposed to resolution phases, this algorithm acts on a per-line basis, and is applied after any line wrapping is applied to the paragraph. The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see Section 8.2, Arabic). Logically, there are the following steps: • The levels of the text are determined according to the bidirectional algorithm. • The characters are shaped into glyphs according to their context (taking the embed- ding levels into account for mirroring!). • The accumulated widths of those glyphs (in logical order) are used to determine line breaks. • For each line, rules L1–L4 are used to reorder the characters on that line. • The glyphs corresponding to the characters on the line are displayed in that order. L1. On each line, reset the embedding level of the following characters to the paragraph embedding level: 1. segment separators, 2. paragraph separators, 3. any sequence of whitespace characters preceding a segment separator or paragraph separator, and 4. any sequence of whitespace characters at the end of the line. • The types of characters used here are the original types, not those modified by the previous phase.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 65 3.12 Bidirectional Behavior Conformance
• Because a paragraph separator breaks lines, there will be at most one per line, at the end of that line. In combination with the following rule, this requirement means that trailing whitespace will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph. L2. From the highest level found in the text to the lowest odd level on each line, reverse any contiguous sequence of characters that is at that level or higher. This process reverses a progressively larger series of substrings. The following four exam- ples illustrate this rule. Memory: car means CAR. Resolved levels: 00000000001110 Reverse level 1: car means RAC. Memory: car MEANS CAR. Resolved levels: 22211111111111 Reverse level 2: rac MEANS CAR. Reverse levels 1, 2: .RAC SNAEM car Memory: he said “car MEANS CAR.” Resolved levels: 000000000222111111111100 Reverse level 2: he said “rac MEANS CAR.” Reverse levels 1, 2: he said “RAC SNAEM car.” Memory: DID YOU SAY ‘he said “car MEANS CAR”’? Resolved levels: 11111111111112222222224443333333333211 Reverse level 4: DID YOU SAY ‘he said “rac MEANS CAR”’? Reverse levels 3, 4: DID YOU SAY ‘he said “RAC SNAEM car”’? Reverse levels 2–4: DID YOU SAY ’”rac MEANS CAR“ dias eh‘? Reverse levels 1–4: ?‘he said “RAC SNAEM car”’ YAS UOY DID L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed. Many font designers provide default metrics for combining marks that support rendering by simple overhang. Because of the reordering for right-to-left characters, it is common practice to make the glyphs for most combining characters overhang to the left (thereby assuming the characters will be applied to left-to-right base characters) and make the glyphs for combining characters in right-to-left scripts overhang to the right (thereby assuming that the characters will be applied to right-to-left base characters). With such fonts, the display ordering of the marks and base glyphs may need to be adjusted when combining marks are applied to “unmatching” base characters. See Section 5.14, Rendering Nonspacing Marks, for more information. L4. A character that possesses the mirrored property as specified by Section 4.7, Mirrored— Normative, must be depicted by a mirrored glyph if the resolved directionality of that character is R. For example, U+0028 —which is interpreted in the Unicode Standard as an opening parenthesis—appears as “(” when its resolved level is even, and as the mirrored glyph “)” when its resolved level is odd.
66 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
Bidirectional Conformance The bidirectional algorithm specifies part of the intrinsic semantics of right-to-left charac- ters. In the absence of a higher-level protocol that specifically supersedes the interpretation of directionality, systems that interpret these characters must achieve results identical to the implicit bidirectional algorithm when rendering. Boundary Neutrals. The goal in marking a format or control character as BN is to ensure that it has no effect on the rest of the algorithm. Because the precise ordering of format characters with respect to others is not required for conformance, implementations are free to handle them in different ways for efficiency as long as the ordering of the other charac- ters is preserved. Explicit Formatting Codes. As with any Unicode characters, systems do not have to sup- port any particular explicit directional formatting code (although it is not generally useful to include a terminating code without including the initiator). Generally, conforming sys- tems will fall into three classes: • No bidirectional formatting. This choice implies that the system does not visually interpret characters from right-to-left scripts. • Implicit bidirectionality. The implicit bidirectional algorithm and the directional marks RLM and LRM are supported. • Full bidirectionality. The implicit bidirectional algorithm, the implicit directional marks, and the explicit directional embedding codes are supported: RLM, LRM, LRE, RLE, LRO, RLO, PDF. Higher-Level Protocols. The following are permissible ways for systems to apply higher- level protocols to the ordering of bidirectional text: • Override the paragraph embedding level. A higher-level protocol should provide for overriding the paragraph embedding level, such as on a table cell, paragraph, docu- ment, or system level. • Override the number handling to use information provided by a broader context. For example, information from other paragraphs in a document could be used to con- clude that the document was fundamentally Arabic and that EN should generally be converted to AN. • Replace, supplement, or override the directional overrides or embedding codes. This task is accomplished by providing information via additional stylesheet or markup information about the embedding level or character direction. The interpretation of such information must always be defined by reference to the behavior of the equiva- lent explicit codes as given in the algorithm. • Override the bidirectional character types assigned to control codes to match the inter- pretation of the control codes within the protocol. (See also Section 13.1, Control Codes.) • Remap the number shapes to match those of another set. For example, remap the Ara- bic number shapes to have the same appearance as the European numbers. When text using a higher-level protocol is to be converted to Unicode plain text, formatting codes should be inserted to ensure that the order matches that of the higher-level protocol or that (as in the last example) the appropriate characters can be substituted.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 67 3.12 Bidirectional Behavior Conformance
Implementation Notes Reference Implementations. Source code for two reference implementations of the bidi- rectional algorithm is provided as part of Unicode Technical Report #9, “The Bidirectional Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site. Imple- menters are encouraged to use this resource to test their implementations. Retaining Format Codes. Some implementations may wish to retain the format codes when running the algorithm. This goal may be accomplished in the following manner: • In rule X9, instead of removing the format codes, set all format codes to BN. For each sequence of BN codes, set the levels of the codes to the level of the preceding non-BN code (or if the sequence is at the beginning of the paragraph, to the para- graph level). • In rule W1, search backward from each NSM to the first character in the level run whose type is not BN, and set the NSM to its type. • In rule W4, scan past BN types that are adjacent to ES or CS. • In rule W5, change all appropriate sequences of ET and BN, not just ET. • In rule W6, change all BN types to ON as well. • In rule L1, include format codes and BN together with whitespace characters in the sequences whose levels are reset before a separator or line break. Implementations that display visible representations of format characters will want to adjust this mechanism so as to position the format characters optimally for editing. Vertical Text. In the case of vertical line orientation, the bidirectional algorithm is still used to determine the levels of the text. However, these levels are not used to reorder the text, as the characters are usually ordered uniformly from top to bottom. Instead, the levels are used to determine the rotation of the text. Sometimes vertical lines follow a vertical base- line in which each character is oriented as normal (with no rotation), with characters ordered from top to bottom whether they are Hebrew, numbers, or Latin. When setting text using the Arabic script in vertical lines, it is more common to employ a horizontal baseline that is rotated by 90 degrees counterclockwise so that the characters are ordered from top to bottom. Latin text and numbers may be rotated 90 degrees clockwise so that the characters become ordered from top to bottom as well. The bidirectional algorithm also comes into play when some characters are ordered from bottom to top. For example, this situation arises with a mixture of Arabic and Latin glyphs when all of the glyphs are rotated uniformly 90 degrees clockwise. (The choice of whether text is to be presented horizontally or vertically, or whether text is to be rotated, is not spec- ified by the Unicode Standard, and is left to higher-level protocols.) Usage. Because of the implicit character types and the heuristics for resolving neutral and numeric directional behavior, the implicit bidirectional ordering will generally produce the correct display without any further work. However, problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or nested segments of different-direction text are present, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the correct display. Part numbers may also require directional overrides. The most common problematic case involves neutrals on the boundary of an embedded language. This issue can be addressed by setting the level of the embedded text correctly. For example, with all text at level 0, the following occurs:
68 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Conformance 3.12 Bidirectional Behavior
Memory: he said "I NEED WATER!", and expired. Display: he said "RETAW DEEN I!", and expired. If the exclamation mark is to be part of the Arabic quotation, then the user can select the text I NEED WATER! and explicitly mark it as embedded Arabic, which produces the fol- lowing result: Memory: he said "
Characters New Bidirectional Type All characters with General Category Me, Mn NSM All characters of type R in the Arabic ranges (0600..06FF, AL FB50..FDFF, FE70..FEFE) (Letters in the Thaana and Syriac ranges also have this value.) The explicit embedding characters: LRO, RLO, LRE, RLE, PDF LRO, RLO, LRE, RLE, PDF, respectively Formatting characters and controls (General Category Cf and BN Cc) that were of bidirectional type ON Zero Width Space BN
Implementations that use older property tables can be adjusted to the modifications in the bidirectional algorithm by algorithmically remapping the characters listed in Table 3-11 to the new types.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 69 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 4 Character Properties 4
Disclaimer The content of all character property tables has been verified as far as possible by the Unicode Consortium. However, the Unicode Consortium does not guarantee that the tables printed in this volume or on the CD-ROM are correct in every detail, and it is not responsible for errors that may occur either in the character property tables or in software that implements these tables. The contents of all the tables in this chapter may be superseded or augmented by information on the Uni- code Web site.
This chapter describes the attributes of character properties defined by the Unicode Stan- dard and gives mappings of characters to specific character properties. Full listings for all Unicode properties are provided in the Unicode Character Database on the CD-ROM. While the Unicode Consortium strives to minimize changes to character property data, occasionally character properties must be updated. When this situation occurs, the relevant data files of the Unicode Character Database are revised. The revised data files are posted on the Unicode Web site as an update version of the standard. Normative and Informative Properties As specified in Chapter 3, Conformance, the Unicode Standard defines both normative and informative properties. Normative Properties. Normative means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant. The term normative when applied to a character property does not mean that the value of the prop- erty will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes. Informative Properties. If a character property is only informative, a conformant imple- mentation is free to use or change such values as it may require, while still remaining con- formant to the standard. Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a pro- tocol to convey that information. The normative character properties of the Unicode Standard are given in Table 4-1, and the informative character properties are given in Table 4-2. Also included in the tables are the locations of each property’s description and of the list of characters with that property.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 73 Character Properties
Table 4-1. Normative Character Properties
Property Description List Case Chapter 4 Chapter 14a Combining Classes Chapter 3 Ta ble 4 -3 Conjoining Jamo (1100–11FF) Chapter 3 Chapter 14 Decomposition (Canonical and Compatibility) Chapter 3 Chapter 14 Directionality Section 3.12 CD-ROM Jamo Short Name Chapter 3 Ta ble 4 -4 Numeric Value Chapter 4 Ta ble 4 -6 Private Use Chapter 3 Section 13.5 Special Character Properties Chapter 13 Section 3.9 Surrogate Chapter 3 Section 13.4 Mirrored Chapter 3 Ta ble 4 -9 Unicode Character Names Chapter 14 Chapter 14 a. This property is listed by individual character where it is not obvious. Table 4-2. Informative Character Properties
Property Description List Case Mapping Chapter 4 CD-ROM Dashes Chapter 6 Ta ble 6 -2 East Asian Width Section 10.3 CD-ROM UTR #11 Letters (Alphabetic and Ideographic) Chapter 4 CD-ROM Line Breaking Chapter 13 Section 13.1 and Section 13.2 Mathematical Property Chapter 4 Section 4.9 Spaces Chapter 6 Ta ble 6 -1 Unicode 1.0 Names Chapter 4 CD-ROM
Default Properties. Implementations need specific properties for all code points, including those that are unassigned. To meet this need, the Unicode standard assigns default proper- ties to ranges of unassigned code points. For example, see Table 3-8. Consistency of Properties. The Unicode Standard is the product of many compromises. It has to strike a balance between uniformity of treatment for similar characters and compat- ibility with existing practice for characters inherited from legacy encodings. Because of this balancing act, one can expect a certain number of anomalies in character properties. For example, some pairs of characters might have been treated as canonical equivalents but are left unequivalent for compatibility with legacy differences. This situation pertains to U+00B5 (cf. U+03BC ¡ ) as well as to certain Korean jamo. In addition, some characters might have had properties differing in some ways from those assigned in this standard, but whose properties are left as is for compatibility with existing practice. This situation can be seen with the halfwidth voicing marks for Japanese (U+FF9E and U+FF9F - - ), which might have been better analyzed as spacing com- bining marks, and with the conjoining Hangul jamo, which might have been better analyzed as an initial base character, followed by formally combining medial and final char- acters. In the interest of efficiency and uniformity in algorithms, implementations may take advantage of such reanalyses of character properties, as long as the results they produce do not overtly conflict with those specified by the normative properties of this standard.
74 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.1 Case—Normative
4.1 Case—Normative Case is a normative property of characters in certain alphabets whereby characters are con- sidered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lower- case letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Because of the inclusion of certain composite characters for compatibility, such as U+01F1 , a third case, called titlecase, is used where the first character of a word must be capitalized. An example of such a character is U+01F2 . The three case forms are UPPERCASE, Titlecase, lowercase. For those scripts that have case (Latin, Greek, Cyrillic, Armenian, and archaic Georgian), the case of a Unicode character can usually be obtained from the character’s name. This statement is true for only these five scripts. Uppercase characters typically contain the word capital in their names. Lowercase characters typically contain the word small. The word small in the names of characters from scripts other than the five just listed has nothing to do with case. (Note that while the archaic Georgian script contained upper- and lowercase pairs, they are rarely used in modern Georgian. See Section 7.5, Georgian.) Case Mappings. The lowercase letter default case mapping occurs between the small char- acter and the capital character. The Unicode Standard case mapping tables, which are informative, are on the CD-ROM. Exceptions to the normal casing rules can be found in the data file SpecialCasing.txt. For more information on case mappings, see Section 5.18, Case Mappings, and Unicode Technical Report #21, “Case Mappings,” on the CD-ROM or the up-to-date version on the Unicode Web site.
4.2 Combining Classes—Normative Each combining character has a normative canonical combining class. This class is used with the canonical ordering algorithm to determine which combining characters interact typographically and to determine how the canonical ordering of sequences of combining characters takes place. Class zero combining characters act like base letters for the purpose of determining canonical order. Combining characters with non-zero classes participate in reordering for the purpose of determining the canonical form of sequences of characters. (See Section 3.10, Canonical Ordering Behavior, for a description of the algorithm.) The list of combining characters and their canonical combining class appears in Table 4-3. Most combining characters are nonspacing. The spacing, class zero, combining characters are so noted. Table 4-3. Combining Classes Spacing and Nonspacing (Class 0) U+07A6 0 THAANA ABAFILI U+07A7 0 THAANA AABAAFILI U+07A8 0 THAANA IBIFILI U+07A9 0 THAANA EEBEEFILI U+07AA 0 THAANA UBUFILI U+07AB 0 THAANA OOBOOFILI U+07AC 0 THAANA EBEFILI U+07AD 0 THAANA EYBEYFILI U+07AE 0 THAANA OBOFILI U+07AF 0 THAANA OABOAFILI
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 75 4.2 Combining Classes—Normative Character Properties
Table 4-3. Combining Classes (Continued) U+07B0 0 THAANA SUKUN U+0901 0 DEVANAGARI SIGN CANDRABINDU U+0902 0 DEVANAGARI SIGN ANUSVARA U+0903 0 DEVANAGARI SIGN VISARGA U+093E 0 DEVANAGARI VOWEL SIGN AA U+0940 0 DEVANAGARI VOWEL SIGN II U+0941 0 DEVANAGARI VOWEL SIGN U U+0942 0 DEVANAGARI VOWEL SIGN UU U+0943 0 DEVANAGARI VOWEL SIGN VOCALIC R U+0944 0 DEVANAGARI VOWEL SIGN VOCALIC RR U+0945 0 DEVANAGARI VOWEL SIGN CANDRA E U+0946 0 DEVANAGARI VOWEL SIGN SHORT E U+0947 0 DEVANAGARI VOWEL SIGN E U+0948 0 DEVANAGARI VOWEL SIGN AI U+0949 0 DEVANAGARI VOWEL SIGN CANDRA O U+094A 0 DEVANAGARI VOWEL SIGN SHORT O U+094B 0 DEVANAGARI VOWEL SIGN O U+094C 0 DEVANAGARI VOWEL SIGN AU U+0962 0 DEVANAGARI VOWEL SIGN VOCALIC L U+0963 0 DEVANAGARI VOWEL SIGN VOCALIC LL U+0981 0 BENGALI SIGN CANDRABINDU U+0982 0 BENGALI SIGN ANUSVARA U+0983 0 BENGALI SIGN VISARGA U+09BE 0 BENGALI VOWEL SIGN AA U+09C0 0 BENGALI VOWEL SIGN II U+09C1 0 BENGALI VOWEL SIGN U U+09C2 0 BENGALI VOWEL SIGN UU U+09C3 0 BENGALI VOWEL SIGN VOCALIC R U+09C4 0 BENGALI VOWEL SIGN VOCALIC RR U+09D7 0 BENGALI AU LENGTH MARK U+09E2 0 BENGALI VOWEL SIGN VOCALIC L U+09E3 0 BENGALI VOWEL SIGN VOCALIC LL U+0A02 0 GURMUKHI SIGN BINDI U+0A3E 0 GURMUKHI VOWEL SIGN AA U+0A40 0 GURMUKHI VOWEL SIGN II U+0A41 0 GURMUKHI VOWEL SIGN U U+0A42 0 GURMUKHI VOWEL SIGN UU U+0A47 0 GURMUKHI VOWEL SIGN EE U+0A48 0 GURMUKHI VOWEL SIGN AI U+0A4B 0 GURMUKHI VOWEL SIGN OO U+0A4C 0 GURMUKHI VOWEL SIGN AU U+0A70 0 GURMUKHI TIPPI U+0A71 0 GURMUKHI ADDAK U+0A81 0 GUJARATI SIGN CANDRABINDU U+0A82 0 GUJARATI SIGN ANUSVARA U+0A83 0 GUJARATI SIGN VISARGA U+0ABE 0 GUJARATI VOWEL SIGN AA U+0AC0 0 GUJARATI VOWEL SIGN II U+0AC1 0 GUJARATI VOWEL SIGN U U+0AC2 0 GUJARATI VOWEL SIGN UU U+0AC3 0 GUJARATI VOWEL SIGN VOCALIC R U+0AC4 0 GUJARATI VOWEL SIGN VOCALIC RR U+0AC5 0 GUJARATI VOWEL SIGN CANDRA E U+0AC7 0 GUJARATI VOWEL SIGN E U+0AC8 0 GUJARATI VOWEL SIGN AI U+0AC9 0 GUJARATI VOWEL SIGN CANDRA O U+0ACB 0 GUJARATI VOWEL SIGN O U+0ACC 0 GUJARATI VOWEL SIGN AU U+0B01 0 ORIYA SIGN CANDRABINDU U+0B02 0 ORIYA SIGN ANUSVARA U+0B03 0 ORIYA SIGN VISARGA U+0B3E 0 ORIYA VOWEL SIGN AA
76 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative
Table 4-3. Combining Classes (Continued) U+0B3F 0 ORIYA VOWEL SIGN I U+0B40 0 ORIYA VOWEL SIGN II U+0B41 0 ORIYA VOWEL SIGN U U+0B42 0 ORIYA VOWEL SIGN UU U+0B43 0 ORIYA VOWEL SIGN VOCALIC R U+0B56 0 ORIYA AI LENGTH MARK U+0B57 0 ORIYA AU LENGTH MARK U+0B82 0 TAMIL SIGN ANUSVARA U+0B83 0 TAMIL SIGN VISARGA U+0BBE 0 TAMIL VOWEL SIGN AA U+0BBF 0 TAMIL VOWEL SIGN I U+0BC0 0 TAMIL VOWEL SIGN II U+0BC1 0 TAMIL VOWEL SIGN U U+0BC2 0 TAMIL VOWEL SIGN UU U+0BD7 0 TAMIL AU LENGTH MARK U+0C01 0 TELUGU SIGN CANDRABINDU U+0C02 0 TELUGU SIGN ANUSVARA U+0C03 0 TELUGU SIGN VISARGA U+0C3E 0 TELUGU VOWEL SIGN AA U+0C3F 0 TELUGU VOWEL SIGN I U+0C40 0 TELUGU VOWEL SIGN II U+0C41 0 TELUGU VOWEL SIGN U U+0C42 0 TELUGU VOWEL SIGN UU U+0C43 0 TELUGU VOWEL SIGN VOCALIC R U+0C44 0 TELUGU VOWEL SIGN VOCALIC RR U+0C46 0 TELUGU VOWEL SIGN E U+0C47 0 TELUGU VOWEL SIGN EE U+0C48 0 TELUGU VOWEL SIGN AI U+0C4A 0 TELUGU VOWEL SIGN O U+0C4B 0 TELUGU VOWEL SIGN OO U+0C4C 0 TELUGU VOWEL SIGN AU U+0C82 0 KANNADA SIGN ANUSVARA U+0C83 0 KANNADA SIGN VISARGA U+0CBE 0 KANNADA VOWEL SIGN AA U+0CBF 0 KANNADA VOWEL SIGN I U+0CC1 0 KANNADA VOWEL SIGN U U+0CC2 0 KANNADA VOWEL SIGN UU U+0CC3 0 KANNADA VOWEL SIGN VOCALIC R U+0CC4 0 KANNADA VOWEL SIGN VOCALIC RR U+0CC6 0 KANNADA VOWEL SIGN E U+0CCC 0 KANNADA VOWEL SIGN AU U+0CD5 0 KANNADA LENGTH MARK U+0CD6 0 KANNADA AI LENGTH MARK U+0D02 0 MALAYALAM SIGN ANUSVARA U+0D03 0 MALAYALAM SIGN VISARGA U+0D3E 0 MALAYALAM VOWEL SIGN AA U+0D3F 0 MALAYALAM VOWEL SIGN I U+0D40 0 MALAYALAM VOWEL SIGN II U+0D41 0 MALAYALAM VOWEL SIGN U U+0D42 0 MALAYALAM VOWEL SIGN UU U+0D43 0 MALAYALAM VOWEL SIGN VOCALIC R U+0D57 0 MALAYALAM AU LENGTH MARK U+0D82 0 SINHALA SIGN ANUSVARAYA U+0D83 0 SINHALA SIGN VISARGAYA U+0DCF 0 SINHALA VOWEL SIGN AELA-PILLA U+0DD0 0 SINHALA VOWEL SIGN KETTI AEDA-PILLA U+0DD1 0 SINHALA VOWEL SIGN DIGA AEDA-PILLA U+0DD2 0 SINHALA VOWEL SIGN KETTI IS-PILLA U+0DD3 0 SINHALA VOWEL SIGN DIGA IS-PILLA U+0DD4 0 SINHALA VOWEL SIGN KETTI PAA-PILLA U+0DD6 0 SINHALA VOWEL SIGN DIGA PAA-PILLA U+0DD8 0 SINHALA VOWEL SIGN GAETTA-PILLA
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 77 4.2 Combining Classes—Normative Character Properties
Table 4-3. Combining Classes (Continued) U+0DDF 0 SINHALA VOWEL SIGN GAYANUKITTA U+0DF2 0 SINHALA VOWEL SIGN DIGA GAETTA-PILLA U+0DF3 0 SINHALA VOWEL SIGN DIGA GAYANUKITTA U+0E31 0 THAI CHARACTER MAI HAN-AKAT U+0E34 0 THAI CHARACTER SARA I U+0E35 0 THAI CHARACTER SARA II U+0E36 0 THAI CHARACTER SARA UE U+0E37 0 THAI CHARACTER SARA UEE U+0E47 0 THAI CHARACTER MAITAIKHU U+0E4C 0 THAI CHARACTER THANTHAKHAT U+0E4D 0 THAI CHARACTER NIKHAHIT U+0E4E 0 THAI CHARACTER YAMAKKAN U+0EB1 0 LAO VOWEL SIGN MAI KAN U+0EB4 0 LAO VOWEL SIGN I U+0EB5 0 LAO VOWEL SIGN II U+0EB6 0 LAO VOWEL SIGN Y U+0EB7 0 LAO VOWEL SIGN YY U+0EBB 0 LAO VOWEL SIGN MAI KON U+0EBC 0 LAO SEMIVOWEL SIGN LO U+0ECC 0 LAO CANCELLATION MARK U+0ECD 0 LAO NIGGAHITA U+0F3E 0 TIBETAN SIGN YAR TSHES U+0F3F 0 TIBETAN SIGN MAR TSHES U+0F73 0 TIBETAN VOWEL SIGN II U+0F75 0 TIBETAN VOWEL SIGN UU U+0F76 0 TIBETAN VOWEL SIGN VOCALIC R U+0F77 0 TIBETAN VOWEL SIGN VOCALIC RR U+0F78 0 TIBETAN VOWEL SIGN VOCALIC L U+0F79 0 TIBETAN VOWEL SIGN VOCALIC LL U+0F7E 0 TIBETAN SIGN RJES SU NGA RO U+0F7F 0 TIBETAN SIGN RNAM BCAD U+0F81 0 TIBETAN VOWEL SIGN REVERSED II U+102C 0 MYANMAR VOWEL SIGN AA U+102D 0 MYANMAR VOWEL SIGN I U+102E 0 MYANMAR VOWEL SIGN II U+102F 0 MYANMAR VOWEL SIGN U U+1030 0 MYANMAR VOWEL SIGN UU U+1032 0 MYANMAR VOWEL SIGN AI U+1036 0 MYANMAR SIGN ANUSVARA U+1038 0 MYANMAR SIGN VISARGA U+1056 0 MYANMAR VOWEL SIGN VOCALIC R U+1057 0 MYANMAR VOWEL SIGN VOCALIC RR U+1058 0 MYANMAR VOWEL SIGN VOCALIC L U+1059 0 MYANMAR VOWEL SIGN VOCALIC LL U+17B4 0 KHMER VOWEL INHERENT AQ U+17B5 0 KHMER VOWEL INHERENT AA U+17B6 0 KHMER VOWEL SIGN AA U+17B7 0 KHMER VOWEL SIGN I U+17B8 0 KHMER VOWEL SIGN II U+17B9 0 KHMER VOWEL SIGN Y U+17BA 0 KHMER VOWEL SIGN YY U+17BB 0 KHMER VOWEL SIGN U U+17BC 0 KHMER VOWEL SIGN UU U+17BD 0 KHMER VOWEL SIGN UA U+17C6 0 KHMER SIGN NIKAHIT U+17C7 0 KHMER SIGN REAHMUK U+17C8 0 KHMER SIGN YUUKALEAPINTU U+17C9 0 KHMER SIGN MUUSIKATOAN U+17CA 0 KHMER SIGN TRIISAP U+17CB 0 KHMER SIGN BANTOC U+17CC 0 KHMER SIGN ROBAT U+17CD 0 KHMER SIGN TOANDAKHIAT
78 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative
Table 4-3. Combining Classes (Continued) U+17CE 0 KHMER SIGN KAKABAT U+17CF 0 KHMER SIGN AHSDA U+17D0 0 KHMER SIGN SAMYOK SANNYA U+17D1 0 KHMER SIGN VIRIAM U+17D3 0 KHMER SIGN BATHAMASAT Split (Class 0) U+09CB 0 BENGALI VOWEL SIGN O U+09CC 0 BENGALI VOWEL SIGN AU U+0B48 0 ORIYA VOWEL SIGN AI U+0B4B 0 ORIYA VOWEL SIGN O U+0B4C 0 ORIYA VOWEL SIGN AU U+0BCA 0 TAMIL VOWEL SIGN O U+0BCB 0 TAMIL VOWEL SIGN OO U+0BCC 0 TAMIL VOWEL SIGN AU U+0CC0 0 KANNADA VOWEL SIGN II U+0CC7 0 KANNADA VOWEL SIGN EE U+0CC8 0 KANNADA VOWEL SIGN AI U+0CCA 0 KANNADA VOWEL SIGN O U+0CCB 0 KANNADA VOWEL SIGN OO U+0D4A 0 MALAYALAM VOWEL SIGN O U+0D4B 0 MALAYALAM VOWEL SIGN OO U+0D4C 0 MALAYALAM VOWEL SIGN AU U+0DDC 0 SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA U+0DDD 0 SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA U+0DDE 0 SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA U+17BF 0 KHMER VOWEL SIGN YA U+17C0 0 KHMER VOWEL SIGN IE U+17C4 0 KHMER VOWEL SIGN OO U+17C5 0 KHMER VOWEL SIGN AU Reordrant (Class 0) U+093F 0 DEVANAGARI VOWEL SIGN I U+09BF 0 BENGALI VOWEL SIGN I U+09C7 0 BENGALI VOWEL SIGN E U+09C8 0 BENGALI VOWEL SIGN AI U+0A3F 0 GURMUKHI VOWEL SIGN I U+0ABF 0 GUJARATI VOWEL SIGN I U+0B47 0 ORIYA VOWEL SIGN E U+0BC6 0 TAMIL VOWEL SIGN E U+0BC7 0 TAMIL VOWEL SIGN EE U+0BC8 0 TAMIL VOWEL SIGN AI U+0D46 0 MALAYALAM VOWEL SIGN E U+0D47 0 MALAYALAM VOWEL SIGN EE U+0D48 0 MALAYALAM VOWEL SIGN AI U+0DD9 0 SINHALA VOWEL SIGN KOMBUVA U+0DDA 0 SINHALA VOWEL SIGN DIGA KOMBUVA U+0DDB 0 SINHALA VOWEL SIGN KOMBU DEKA U+1031 0 MYANMAR VOWEL SIGN E U+17BE 0 KHMER VOWEL SIGN OE U+17C1 0 KHMER VOWEL SIGN E U+17C2 0 KHMER VOWEL SIGN AE U+17C3 0 KHMER VOWEL SIGN AI Tibetan Subjoined (Class 0) Letters U+0F90 0 TIBETAN SUBJOINED LETTER KA U+0F91 0 TIBETAN SUBJOINED LETTER KHA U+0F92 0 TIBETAN SUBJOINED LETTER GA U+0F93 0 TIBETAN SUBJOINED LETTER GHA U+0F94 0 TIBETAN SUBJOINED LETTER NGA U+0F95 0 TIBETAN SUBJOINED LETTER CA U+0F96 0 TIBETAN SUBJOINED LETTER CHA U+0F97 0 TIBETAN SUBJOINED LETTER JA
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 79 4.2 Combining Classes—Normative Character Properties
Table 4-3. Combining Classes (Continued) U+0F99 0 TIBETAN SUBJOINED LETTER NYA U+0F9A 0 TIBETAN SUBJOINED LETTER TTA U+0F9B 0 TIBETAN SUBJOINED LETTER TTHA U+0F9C 0 TIBETAN SUBJOINED LETTER DDA U+0F9D 0 TIBETAN SUBJOINED LETTER DDHA U+0F9E 0 TIBETAN SUBJOINED LETTER NNA U+0F9F 0 TIBETAN SUBJOINED LETTER TA U+0FA0 0 TIBETAN SUBJOINED LETTER THA U+0FA1 0 TIBETAN SUBJOINED LETTER DA U+0FA2 0 TIBETAN SUBJOINED LETTER DHA U+0FA3 0 TIBETAN SUBJOINED LETTER NA U+0FA4 0 TIBETAN SUBJOINED LETTER PA U+0FA5 0 TIBETAN SUBJOINED LETTER PHA U+0FA6 0 TIBETAN SUBJOINED LETTER BA U+0FA7 0 TIBETAN SUBJOINED LETTER BHA U+0FA8 0 TIBETAN SUBJOINED LETTER MA U+0FA9 0 TIBETAN SUBJOINED LETTER TSA U+0FAA 0 TIBETAN SUBJOINED LETTER TSHA U+0FAB 0 TIBETAN SUBJOINED LETTER DZA U+0FAC 0 TIBETAN SUBJOINED LETTER DZHA U+0FAD 0 TIBETAN SUBJOINED LETTER WA U+0FAE 0 TIBETAN SUBJOINED LETTER ZHA U+0FAF 0 TIBETAN SUBJOINED LETTER ZA U+0FB0 0 TIBETAN SUBJOINED LETTER -A U+0FB1 0 TIBETAN SUBJOINED LETTER YA U+0FB2 0 TIBETAN SUBJOINED LETTER RA U+0FB3 0 TIBETAN SUBJOINED LETTER LA U+0FB4 0 TIBETAN SUBJOINED LETTER SHA U+0FB5 0 TIBETAN SUBJOINED LETTER SSA U+0FB6 0 TIBETAN SUBJOINED LETTER SA U+0FB7 0 TIBETAN SUBJOINED LETTER HA U+0FB8 0 TIBETAN SUBJOINED LETTER A U+0FB9 0 TIBETAN SUBJOINED LETTER KSSA U+0FBA 0 TIBETAN SUBJOINED LETTER FIXED-FORM WA U+0FBB 0 TIBETAN SUBJOINED LETTER FIXED-FORM YA U+0FBC 0 TIBETAN SUBJOINED LETTER FIXED-FORM RA Enclosing (Class 0) U+0488 0 CYRILLIC HUNDRED THOUSANDS SIGN U+0489 0 CYRILLIC MILLIONS SIGN U+06DD 0 ARABIC END OF AYAH U+06DE 0 ARABIC START OF RUB EL HIZB U+20DD 0 COMBINING ENCLOSING CIRCLE U+20DE 0 COMBINING ENCLOSING SQUARE U+20DF 0 COMBINING ENCLOSING DIAMOND U+20E0 0 COMBINING ENCLOSING CIRCLE BACKSLASH U+20E2 0 COMBINING ENCLOSING SCREEN U+20E3 0 COMBINING ENCLOSING KEYCAP Overlays/Interior (Class 1) U+0334 1 COMBINING TILDE OVERLAY U+0335 1 COMBINING SHORT STROKE OVERLAY U+0336 1 COMBINING LONG STROKE OVERLAY U+0337 1 COMBINING SHORT SOLIDUS OVERLAY U+0338 1 COMBINING LONG SOLIDUS OVERLAY U+20D2 1 COMBINING LONG VERTICAL LINE OVERLAY U+20D3 1 COMBINING SHORT VERTICAL LINE OVERLAY U+20D8 1 COMBINING RING OVERLAY U+20D9 1 COMBINING CLOCKWISE RING OVERLAY U+20DA 1 COMBINING ANTICLOCKWISE RING OVERLAY Nuktas (Class 7) U+093C 7 DEVANAGARI SIGN NUKTA U+09BC 7 BENGALI SIGN NUKTA
80 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative
Table 4-3. Combining Classes (Continued) U+0A3C 7 GURMUKHI SIGN NUKTA U+0ABC 7 GUJARATI SIGN NUKTA U+0B3C 7 ORIYA SIGN NUKTA U+1037 7 MYANMAR SIGN DOT BELOW Kana Voicing Marks (Class 8) U+3099 8 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK U+309A 8 COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK Viramas (Class 9) U+094D 9 DEVANAGARI SIGN VIRAMA U+09CD 9 BENGALI SIGN VIRAMA U+0A4D 9 GURMUKHI SIGN VIRAMA U+0ACD 9 GUJARATI SIGN VIRAMA U+0B4D 9 ORIYA SIGN VIRAMA U+0BCD 9 TAMIL SIGN VIRAMA U+0C4D 9 TELUGU SIGN VIRAMA U+0CCD 9 KANNADA SIGN VIRAMA U+0D4D 9 MALAYALAM SIGN VIRAMA U+0DCA 9 SINHALA SIGN AL-LAKUNA U+0E3A 9 THAI CHARACTER PHINTHU U+0F84 9 TIBETAN MARK HALANTA U+1039 9 MYANMAR SIGN VIRAMA U+17D2 9 KHMER SIGN COENG Fixed Position Classes (Classes 10..199) U+05B0 10 HEBREW POINT SHEVA U+05B1 11 HEBREW POINT HATAF SEGOL U+05B2 12 HEBREW POINT HATAF PATAH U+05B3 13 HEBREW POINT HATAF QAMATS U+05B4 14 HEBREW POINT HIRIQ U+05B5 15 HEBREW POINT TSERE U+05B6 16 HEBREW POINT SEGOL U+05B7 17 HEBREW POINT PATAH U+05B8 18 HEBREW POINT QAMATS U+05B9 19 HEBREW POINT HOLAM U+05BB 20 HEBREW POINT QUBUTS U+05BC 21 HEBREW POINT DAGESH OR MAPIQ U+05BD 22 HEBREW POINT METEG U+05BF 23 HEBREW POINT RAFE U+05C1 24 HEBREW POINT SHIN DOT U+05C2 25 HEBREW POINT SIN DOT U+FB1E 26 HEBREW POINT JUDEO-SPANISH VARIKA U+064B 27 ARABIC FATHATAN U+064C 28 ARABIC DAMMATAN U+064D 29 ARABIC KASRATAN U+064E 30 ARABIC FATHA U+064F 31 ARABIC DAMMA U+0650 32 ARABIC KASRA U+0651 33 ARABIC SHADDA U+0652 34 ARABIC SUKUN U+0670 35 ARABIC LETTER SUPERSCRIPT ALEF U+0711 36 SYRIAC LETTER SUPERSCRIPT ALAPH U+0C55 84 TELUGU LENGTH MARK U+0C56 91 TELUGU AI LENGTH MARK U+0E38 103 THAI CHARACTER SARA U U+0E39 103 THAI CHARACTER SARA UU U+0E48 107 THAI CHARACTER MAI EK U+0E49 107 THAI CHARACTER MAI THO U+0E4A 107 THAI CHARACTER MAI TRI U+0E4B 107 THAI CHARACTER MAI CHATTAWA U+0EB8 118 LAO VOWEL SIGN U U+0EB9 118 LAO VOWEL SIGN UU
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 81 4.2 Combining Classes—Normative Character Properties
Table 4-3. Combining Classes (Continued) U+0EC8 122 LAO TONE MAI EK U+0EC9 122 LAO TONE MAI THO U+0ECA 122 LAO TONE MAI TI U+0ECB 122 LAO TONE MAI CATAWA U+0F71 129 TIBETAN VOWEL SIGN AA U+0F72 130 TIBETAN VOWEL SIGN I U+0F7A 130 TIBETAN VOWEL SIGN E U+0F7B 130 TIBETAN VOWEL SIGN EE U+0F7C 130 TIBETAN VOWEL SIGN O U+0F7D 130 TIBETAN VOWEL SIGN OO U+0F80 130 TIBETAN VOWEL SIGN REVERSED I U+0F74 132 TIBETAN VOWEL SIGN U Below Left Attached (Class 200) Below Attached (Class 202) U+0321 202 COMBINING PALATALIZED HOOK BELOW U+0322 202 COMBINING RETROFLEX HOOK BELOW U+0327 202 COMBINING CEDILLA U+0328 202 COMBINING OGONEK Below Right Attached (Class 204) Left Attached Reordrant (Class 208) Right Attached (Class 210) Above Left Attached (Class 212) Above Attached (Class 214) Above Right Attached (Class 216) U+031B 216 COMBINING HORN U+0F39 216 TIBETAN MARK TSA -PHRU Below Left (Class 218) U+302A 218 IDEOGRAPHIC LEVEL TONE MARK Below (Class 220) U+0316 220 COMBINING GRAVE ACCENT BELOW U+0317 220 COMBINING ACUTE ACCENT BELOW U+0318 220 COMBINING LEFT TACK BELOW U+0319 220 COMBINING RIGHT TACK BELOW U+031C 220 COMBINING LEFT HALF RING BELOW U+031D 220 COMBINING UP TACK BELOW U+031E 220 COMBINING DOWN TACK BELOW U+031F 220 COMBINING PLUS SIGN BELOW U+0320 220 COMBINING MINUS SIGN BELOW U+0323 220 COMBINING DOT BELOW U+0324 220 COMBINING DIAERESIS BELOW U+0325 220 COMBINING RING BELOW U+0326 220 COMBINING COMMA BELOW U+0329 220 COMBINING VERTICAL LINE BELOW U+032A 220 COMBINING BRIDGE BELOW U+032B 220 COMBINING INVERTED DOUBLE ARCH BELOW U+032C 220 COMBINING CARON BELOW U+032D 220 COMBINING CIRCUMFLEX ACCENT BELOW U+032E 220 COMBINING BREVE BELOW U+032F 220 COMBINING INVERTED BREVE BELOW U+0330 220 COMBINING TILDE BELOW U+0331 220 COMBINING MACRON BELOW U+0332 220 COMBINING LOW LINE U+0333 220 COMBINING DOUBLE LOW LINE U+0339 220 COMBINING RIGHT HALF RING BELOW U+033A 220 COMBINING INVERTED BRIDGE BELOW U+033B 220 COMBINING SQUARE BELOW U+033C 220 COMBINING SEAGULL BELOW
82 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.2 Combining Classes—Normative
Table 4-3. Combining Classes (Continued) U+0347 220 COMBINING EQUALS SIGN BELOW U+0348 220 COMBINING DOUBLE VERTICAL LINE BELOW U+0349 220 COMBINING LEFT ANGLE BELOW U+034D 220 COMBINING LEFT RIGHT ARROW BELOW U+034E 220 COMBINING UPWARDS ARROW BELOW U+0591 220 HEBREW ACCENT ETNAHTA U+0596 220 HEBREW ACCENT TIPEHA U+059B 220 HEBREW ACCENT TEVIR U+05A3 220 HEBREW ACCENT MUNAH U+05A4 220 HEBREW ACCENT MAHAPAKH U+05A5 220 HEBREW ACCENT MERKHA U+05A6 220 HEBREW ACCENT MERKHA KEFULA U+05A7 220 HEBREW ACCENT DARGA U+05AA 220 HEBREW ACCENT YERAH BEN YOMO U+0655 220 ARABIC HAMZA BELOW U+06E3 220 ARABIC SMALL LOW SEEN U+06EA 220 ARABIC EMPTY CENTRE LOW STOP U+06ED 220 ARABIC SMALL LOW MEEM U+0731 220 SYRIAC PTHAHA BELOW U+0734 220 SYRIAC ZQAPHA BELOW U+0737 220 SYRIAC RBASA BELOW U+0738 220 SYRIAC DOTTED ZLAMA HORIZONTAL U+0739 220 SYRIAC DOTTED ZLAMA ANGULAR U+073B 220 SYRIAC HBASA BELOW U+073C 220 SYRIAC HBASA/ESASA DOTTED U+073E 220 SYRIAC ESASA BELOW U+0742 220 SYRIAC RUKKAKHA U+0744 220 SYRIAC TWO VERTICAL DOTS BELOW U+0746 220 SYRIAC THREE DOTS BELOW U+0748 220 SYRIAC OBLIQUE LINE BELOW U+0952 220 DEVANAGARI STRESS SIGN ANUDATTA U+0F18 220 TIBETAN ASTROLOGICAL SIGN -KHYUD PA U+0F19 220 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS U+0F35 220 TIBETAN MARK NGAS BZUNG NYI ZLA U+0F37 220 TIBETAN MARK NGAS BZUNG SGOR RTAGS U+0FC6 220 TIBETAN SYMBOL PADMA GDAN Below Right (Class 222) U+059A 222 HEBREW ACCENT YETIV U+05AD 222 HEBREW ACCENT DEHI U+302D 222 IDEOGRAPHIC ENTERING TONE MARK Left (Class 224) U+302E 224 HANGUL SINGLE DOT TONE MARK U+302F 224 HANGUL DOUBLE DOT TONE MARK Right (Class 226) Above Left (Class 228) U+05AE 228 HEBREW ACCENT ZINOR U+18A9 228 MONGOLIAN LETTER AG DAGALGA U+302B 228 IDEOGRAPHIC RISING TONE MARK Above (Class 230) U+0300 230 COMBINING GRAVE ACCENT U+0301 230 COMBINING ACUTE ACCENT U+0302 230 COMBINING CIRCUMFLEX ACCENT U+0303 230 COMBINING TILDE U+0304 230 COMBINING MACRON U+0305 230 COMBINING OVERLINE U+0306 230 COMBINING BREVE U+0307 230 COMBINING DOT ABOVE U+0308 230 COMBINING DIAERESIS U+0309 230 COMBINING HOOK ABOVE U+030A 230 COMBINING RING ABOVE
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 83 4.2 Combining Classes—Normative Character Properties
Table 4-3. Combining Classes (Continued) U+030B 230 COMBINING DOUBLE ACUTE ACCENT U+030C 230 COMBINING CARON U+030D 230 COMBINING VERTICAL LINE ABOVE U+030E 230 COMBINING DOUBLE VERTICAL LINE ABOVE U+030F 230 COMBINING DOUBLE GRAVE ACCENT U+0310 230 COMBINING CANDRABINDU U+0311 230 COMBINING INVERTED BREVE U+0312 230 COMBINING TURNED COMMA ABOVE U+0313 230 COMBINING COMMA ABOVE U+0314 230 COMBINING REVERSED COMMA ABOVE U+033D 230 COMBINING X ABOVE U+033E 230 COMBINING VERTICAL TILDE U+033F 230 COMBINING DOUBLE OVERLINE U+0340 230 COMBINING GRAVE TONE MARK U+0341 230 COMBINING ACUTE TONE MARK U+0342 230 COMBINING GREEK PERISPOMENI U+0343 230 COMBINING GREEK KORONIS U+0344 230 COMBINING GREEK DIALYTIKA TONOS U+0346 230 COMBINING BRIDGE ABOVE U+034A 230 COMBINING NOT TILDE ABOVE U+034B 230 COMBINING HOMOTHETIC ABOVE U+034C 230 COMBINING ALMOST EQUAL TO ABOVE U+0483 230 COMBINING CYRILLIC TITLO U+0484 230 COMBINING CYRILLIC PALATALIZATION U+0485 230 COMBINING CYRILLIC DASIA PNEUMATA U+0486 230 COMBINING CYRILLIC PSILI PNEUMATA U+0592 230 HEBREW ACCENT SEGOL U+0593 230 HEBREW ACCENT SHALSHELET U+0594 230 HEBREW ACCENT ZAQEF QATAN U+0595 230 HEBREW ACCENT ZAQEF GADOL U+0597 230 HEBREW ACCENT REVIA U+0598 230 HEBREW ACCENT ZARQA U+0599 230 HEBREW ACCENT PASHTA U+059C 230 HEBREW ACCENT GERESH U+059D 230 HEBREW ACCENT GERESH MUQDAM U+059E 230 HEBREW ACCENT GERSHAYIM U+059F 230 HEBREW ACCENT QARNEY PARA U+05A0 230 HEBREW ACCENT TELISHA GEDOLA U+05A1 230 HEBREW ACCENT PAZER U+05A8 230 HEBREW ACCENT QADMA U+05A9 230 HEBREW ACCENT TELISHA QETANA U+05AB 230 HEBREW ACCENT OLE U+05AC 230 HEBREW ACCENT ILUY U+05AF 230 HEBREW MARK MASORA CIRCLE U+05C4 230 HEBREW MARK UPPER DOT U+0653 230 ARABIC MADDAH ABOVE U+0654 230 ARABIC HAMZA ABOVE U+06D6 230 ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA U+06D7 230 ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA U+06D8 230 ARABIC SMALL HIGH MEEM INITIAL FORM U+06D9 230 ARABIC SMALL HIGH LAM ALEF U+06DA 230 ARABIC SMALL HIGH JEEM U+06DB 230 ARABIC SMALL HIGH THREE DOTS U+06DC 230 ARABIC SMALL HIGH SEEN U+06DF 230 ARABIC SMALL HIGH ROUNDED ZERO U+06E0 230 ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO U+06E1 230 ARABIC SMALL HIGH DOTLESS HEAD OF KHAH U+06E2 230 ARABIC SMALL HIGH MEEM ISOLATED FORM U+06E4 230 ARABIC SMALL HIGH MADDA U+06E7 230 ARABIC SMALL HIGH YEH
84 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.3 Directionality—Normative
Table 4-3. Combining Classes (Continued) U+06E8 230 ARABIC SMALL HIGH NOON U+06EB 230 ARABIC EMPTY CENTRE HIGH STOP U+06EC 230 ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE U+0730 230 SYRIAC PTHAHA ABOVE U+0732 230 SYRIAC PTHAHA DOTTED U+0733 230 SYRIAC ZQAPHA ABOVE U+0735 230 SYRIAC ZQAPHA DOTTED U+0736 230 SYRIAC RBASA ABOVE U+073A 230 SYRIAC HBASA ABOVE U+073D 230 SYRIAC ESASA ABOVE U+073F 230 SYRIAC RWAHA U+0740 230 SYRIAC FEMININE DOT U+0741 230 SYRIAC QUSHSHAYA U+0743 230 SYRIAC TWO VERTICAL DOTS ABOVE U+0745 230 SYRIAC THREE DOTS ABOVE U+0747 230 SYRIAC OBLIQUE LINE ABOVE U+0749 230 SYRIAC MUSIC U+074A 230 SYRIAC BARREKH U+0951 230 DEVANAGARI STRESS SIGN UDATTA U+0953 230 DEVANAGARI GRAVE ACCENT U+0954 230 DEVANAGARI ACUTE ACCENT U+0F82 230 TIBETAN SIGN NYI ZLA NAA DA U+0F83 230 TIBETAN SIGN SNA LDAN U+0F86 230 TIBETAN SIGN LCI RTAGS U+0F87 230 TIBETAN SIGN YANG RTAGS U+20D0 230 COMBINING LEFT HARPOON ABOVE U+20D1 230 COMBINING RIGHT HARPOON ABOVE U+20D4 230 COMBINING ANTICLOCKWISE ARROW ABOVE U+20D5 230 COMBINING CLOCKWISE ARROW ABOVE U+20D6 230 COMBINING LEFT ARROW ABOVE U+20D7 230 COMBINING RIGHT ARROW ABOVE U+20DB 230 COMBINING THREE DOTS ABOVE U+20DC 230 COMBINING FOUR DOTS ABOVE U+20E1 230 COMBINING LEFT RIGHT ARROW ABOVE U+FE20 230 COMBINING LIGATURE LEFT HALF U+FE21 230 COMBINING LIGATURE RIGHT HALF U+FE22 230 COMBINING DOUBLE TILDE LEFT HALF U+FE23 230 COMBINING DOUBLE TILDE RIGHT HALF Above Right (Class 232) U+0315 232 COMBINING COMMA ABOVE RIGHT U+031A 232 COMBINING LEFT ANGLE ABOVE U+302C 232 IDEOGRAPHIC DEPARTING TONE MARK Double Below (Class 233) U+0362 233 COMBINING DOUBLE RIGHTWARDS ARROW BELOW Double Above (Class 234) U+0360 234 COMBINING DOUBLE TILDE U+0361 234 COMBINING DOUBLE INVERTED BREVE Iota Subscript (Class 240) U+0345 240 COMBINING GREEK YPOGEGRAMMENI
4.3 Directionality—Normative Directional behavior is interpreted according to the Unicode bidirectional algorithm (see Section 3.12, Bidirectional Behavior). For this purpose, all characters of the Unicode Stan- dard possess a normative directional type. The directional types left-to-right and right-to- left are called strong types, and characters of these types are called strong directional charac- ters. Left-to-right types include most alphabetic and syllabic characters, as well as all Han
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 85 4.4 Jamo Short Names—Normative Character Properties ideographic characters. Right-to-left types include Arabic, Hebrew, Syriac, and Thaana, and most punctuation specific to those scripts. In addition, the Unicode bidirectional algo- rithm also uses weak types and neutrals. For the directional types of Unicode characters, see the Unicode Character Database on the CD-ROM.
4.4 Jamo Short Names—Normative The jamo short name is a normative property of the Unicode conjoining Hangul jamo char- acters. These short names, which are listed in Table 4-4, are used to determine the character names that are derived when decomposing Hangul syllables into their decomposition sequence. Table 4-4. Jamo Short Names
Short Value Glyph Unicode Name Name U+1100 G þ HANGUL CHOSEONG KIYEOK U+1101 GG HANGUL CHOSEONG SSANGKIYEOK U+1102 N ó HANGUL CHOSEONG NIEUN U+1103 D ì HANGUL CHOSEONG TIKEUT U+1104 DD ö HANGUL CHOSEONG SSANGTIKEUT U+1105 R ú HANGUL CHOSEONG RIEUL U+1106 M ÷ HANGUL CHOSEONG MIEUM U+1107 B ø HANGUL CHOSEONG PIEUP U+1108 BB í HANGUL CHOSEONG SSANGPIEUP U+1109 S û HANGUL CHOSEONG SIOS U+110A SS ç HANGUL CHOSEONG SSANGSIOS U+110B ü HANGUL CHOSEONG IEUNG U+110C J å HANGUL CHOSEONG CIEUC U+110D JJ HANGUL CHOSEONG SSANGCIEUC U+110E C ê HANGUL CHOSEONG CHIEUCH U+110F K HANGUL CHOSEONG KHIEUKH U+1110 T HANGUL CHOSEONG THIEUTH U+1111 P ñ HANGUL CHOSEONG PHIEUPH U+1112 H ò HANGUL CHOSEONG HIEUH U+1161 A Æ HANGUL JUNGSEONG A U+1162 AE Ç HANGUL JUNGSEONG AE U+1163 YA È HANGUL JUNGSEONG YA U+1164 YAE É HANGUL JUNGSEONG YAE U+1165 EO Ê HANGUL JUNGSEONG EO U+1166 E Ë HANGUL JUNGSEONG E U+1167 YEO Ì HANGUL JUNGSEONG YEO U+1168 YE Í HANGUL JUNGSEONG YE U+1169 O Î HANGUL JUNGSEONG O U+116A WA Ï HANGUL JUNGSEONG WA U+116B WAE Ð HANGUL JUNGSEONG WAE U+116C OE Ñ HANGUL JUNGSEONG OE U+116D YO Ò HANGUL JUNGSEONG YO U+116E U Ó HANGUL JUNGSEONG U U+116F WEO Ô HANGUL JUNGSEONG WEO U+1170 WE Õ HANGUL JUNGSEONG WE
86 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.5 General Category—Normative in Part
Table 4-4. Jamo Short Names (Continued)
Short Value Glyph Unicode Name Name U+1171 WI Ö HANGUL JUNGSEONG WI U+1172 YU × HANGUL JUNGSEONG YU U+1173 EU Ø HANGUL JUNGSEONG EU U+1174 YI Ù HANGUL JUNGSEONG YI U+1175 I Ú HANGUL JUNGSEONG I U+11A8 G HANGUL JONGSEONG KIYEOK U+11A9 GG HANGUL JONGSEONG SSANGKIYEOK U+11AA GS HANGUL JONGSEONG KIYEOK-SIOS U+11AB N HANGUL JONGSEONG NIEUN U+11AC NJ HANGUL JONGSEONG NIEUN-CIEUC U+11AD NH HANGUL JONGSEONG NIEUN-HIEUH U+11AE D HANGUL JONGSEONG TIKEUT U+11AF L HANGUL JONGSEONG RIEUL U+11B0 LG HANGUL JONGSEONG RIEUL-KIYEOK U+11B1 LM HANGUL JONGSEONG RIEUL-MIEUM U+11B2 LB HANGUL JONGSEONG RIEUL-PIEUP U+11B3 LS HANGUL JONGSEONG RIEUL-SIOS U+11B4 LT HANGUL JONGSEONG RIEUL-THIEUTH U+11B5 LP HANGUL JONGSEONG RIEUL-PHIEUPH U+11B6 LH HANGUL JONGSEONG RIEUL-HIEUH U+11B7 M HANGUL JONGSEONG MIEUM U+11B8 B HANGUL JONGSEONG PIEUP U+11B9 BS HANGUL JONGSEONG PIEUP-SIOS U+11BA S HANGUL JONGSEONG SIOS U+11BB SS HANGUL JONGSEONG SSANGSIOS U+11BC NG ¡ HANGUL JONGSEONG IEUNG U+11BD J ¢ HANGUL JONGSEONG CIEUC U+11BE C £ HANGUL JONGSEONG CHIEUCH U+11BF K ¤ HANGUL JONGSEONG KHIEUKH U+11C0 T ¥ HANGUL JONGSEONG THIEUTH U+11C1 P ¦ HANGUL JONGSEONG PHIEUPH U+11C2 H § HANGUL JONGSEONG HIEUH
4.5 General Category—Normative in Part The Unicode Character Database on the CD-ROM defines a General Category for all Uni- code characters. This General Category constitutes a partition of the characters into several major classes, such as letters, punctuation, and symbols, and further subclasses for each of the major classes. Each Unicode character is assigned a General Category value. Each value of the General Category is defined as a two-letter abbreviation, where the first letter gives information about a major class and the second letter designates a subclass of that major class. In each class, the subclass “other” merely collects the remaining characters of the major class. For example, the subclass “No” (Number, other) includes all characters of the Number class that are not a decimal digit or letter. These characters may have little in common besides their membership in the same major class.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 87 4.5 General Category—Normative in Part Character Properties
Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. The General Category is unique, in that some of the values are correlated directly with other normative properties of Unicode characters (numeric status, combining status, or some special properties). Other values are simply informative, provided as an aid to com- mon behavior between Unicode implementations. Table 4-5 enumerates the values of Gen- eral Category, with a short description of each value; it is divided between normative and informative values. Table 4-5. General Category Normative Lu = Letter, uppercase Ll = Letter, lowercase Lt = Letter, titlecase
Mn = Mark, nonspacing Mc = Mark, spacing combining Me = Mark, enclosing
Nd = Number, decimal digit Nl = Number, letter No = Number, other
Zs = Separator, space Zl = Separator, line Zp = Separator, paragraph
Cc = Other, control Cf = Other, format Cs = Other, surrogate Co = Other, private use Cn = Other, not assigned
Informative Lm = Letter, modifier Lo = Letter, other
Pc = Punctuation, connector Pd = Punctuation, dash Ps = Punctuation, open Pe = Punctuation, close Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage) Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage) Po = Punctuation, other
Sm = Symbol, math Sc = Symbol, currency Sk = Symbol, modifier So = Symbol, other
A common use of the General Category of a Unicode character is to assist in determination of boundaries in text, as in Section 5.15, Locating Text Element Boundaries. Another com- mon use is in determining language identifiers for programming, scripting, and markup, as in Section 5.16, Identifiers. This property is also used to support common APIs such as isLetter(), isUppercase(), and so on.
88 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative
4.6 Numeric Value—Normative Numeric value is a normative property of characters that represent numbers. This group includes characters such as fractions, subscripts, superscripts, Roman numerals, currency numerators, encircled numbers, and script-specific digits. In many traditional numbering systems, letters are used with a numeric value. Examples include Greek and Hebrew letters as well as Latin letters used in outlines (II.A.1.b). These special cases are not included here as numbers. Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, not characters such as Roman numerals (1 + 5 = 15 = fifteen, but I + V = IV = four), subscripts, or super- scripts. Numbers other than decimal digits can be used in numerical expressions, but it is up to the user to determine the specialized uses. The Unicode Standard assigns distinct codes to the forms of digits that are specific to a given script or language. Examples are the digits used with the Arabic script, Chinese num- bers, or those of the Indic languages. For naming conventions, see the introduction to Sec- tion 8.2, Arabic. Table 4-6 gives the numeric values of Unicode characters that can represent numbers. Some CJK ideographs also have numeric values; those are not included in Table 4-6, but are dis- cussed following the table. Table 4-6. Numeric Properties Value Decimal Number Name U+0030 á 0 DIGIT ZERO U+0031 á 1 DIGIT ONE U+0032 á 2 DIGIT TWO U+0033 á 3 DIGIT THREE U+0034 á 4 DIGIT FOUR U+0035 á 5 DIGIT FIVE U+0036 á 6 DIGIT SIX U+0037 á 7 DIGIT SEVEN U+0038 á 8 DIGIT EIGHT U+0039 á 9 DIGIT NINE U+00B2 2 SUPERSCRIPT TWO U+00B3 3 SUPERSCRIPT THREE U+00B9 1 SUPERSCRIPT ONE U+00BC 1/4 VULGAR FRACTION ONE QUARTER U+00BD 1/2 VULGAR FRACTION ONE HALF U+00BE 3/4 VULGAR FRACTION THREE QUARTERS U+0660 á 0 ARABIC-INDIC DIGIT ZERO U+0661 á 1 ARABIC-INDIC DIGIT ONE U+0662 á 2 ARABIC-INDIC DIGIT TWO U+0663 á 3 ARABIC-INDIC DIGIT THREE U+0664 á 4 ARABIC-INDIC DIGIT FOUR U+0665 á 5 ARABIC-INDIC DIGIT FIVE U+0666 á 6 ARABIC-INDIC DIGIT SIX U+0667 á 7 ARABIC-INDIC DIGIT SEVEN U+0668 á 8 ARABIC-INDIC DIGIT EIGHT U+0669 á 9 ARABIC-INDIC DIGIT NINE U+06F0 á 0 EXTENDED ARABIC-INDIC DIGIT ZERO U+06F1 á 1 EXTENDED ARABIC-INDIC DIGIT ONE U+06F2 á 2 EXTENDED ARABIC-INDIC DIGIT TWO U+06F3 á 3 EXTENDED ARABIC-INDIC DIGIT THREE
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 89 4.6 Numeric Value—Normative Character Properties
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+06F4 á 4 EXTENDED ARABIC-INDIC DIGIT FOUR U+06F5 á 5 EXTENDED ARABIC-INDIC DIGIT FIVE U+06F6 á 6 EXTENDED ARABIC-INDIC DIGIT SIX U+06F7 á 7 EXTENDED ARABIC-INDIC DIGIT SEVEN U+06F8 á 8 EXTENDED ARABIC-INDIC DIGIT EIGHT U+06F9 á 9 EXTENDED ARABIC-INDIC DIGIT NINE U+0966 á 0 DEVANAGARI DIGIT ZERO U+0967 á 1 DEVANAGARI DIGIT ONE U+0968 á 2 DEVANAGARI DIGIT TWO U+0969 á 3 DEVANAGARI DIGIT THREE U+096A á 4 DEVANAGARI DIGIT FOUR U+096B á 5 DEVANAGARI DIGIT FIVE U+096C á 6 DEVANAGARI DIGIT SIX U+096D á 7 DEVANAGARI DIGIT SEVEN U+096E á 8 DEVANAGARI DIGIT EIGHT U+096F á 9 DEVANAGARI DIGIT NINE U+09E6 á 0 BENGALI DIGIT ZERO U+09E7 á 1 BENGALI DIGIT ONE U+09E8 á 2 BENGALI DIGIT TWO U+09E9 á 3 BENGALI DIGIT THREE U+09EA á 4 BENGALI DIGIT FOUR U+09EB á 5 BENGALI DIGIT FIVE U+09EC á 6 BENGALI DIGIT SIX U+09ED á 7 BENGALI DIGIT SEVEN U+09EE á 8 BENGALI DIGIT EIGHT U+09EF á 9 BENGALI DIGIT NINE U+09F4 1 BENGALI CURRENCY NUMERATOR ONE U+09F5 2 BENGALI CURRENCY NUMERATOR TWO U+09F6 3 BENGALI CURRENCY NUMERATOR THREE U+09F7 4 BENGALI CURRENCY NUMERATOR FOUR U+09F8 - BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR U+09F9 16 BENGALI CURRENCY DENOMINATOR SIXTEEN U+0A66 á 0 GURMUKHI DIGIT ZERO U+0A67 á 1 GURMUKHI DIGIT ONE U+0A68 á 2 GURMUKHI DIGIT TWO U+0A69 á 3 GURMUKHI DIGIT THREE U+0A6A á 4 GURMUKHI DIGIT FOUR U+0A6B á 5 GURMUKHI DIGIT FIVE U+0A6C á 6 GURMUKHI DIGIT SIX U+0A6D á 7 GURMUKHI DIGIT SEVEN U+0A6E á 8 GURMUKHI DIGIT EIGHT U+0A6F á 9 GURMUKHI DIGIT NINE U+0AE6 á 0 GUJARATI DIGIT ZERO U+0AE7 á 1 GUJARATI DIGIT ONE U+0AE8 á 2 GUJARATI DIGIT TWO U+0AE9 á 3 GUJARATI DIGIT THREE U+0AEA á 4 GUJARATI DIGIT FOUR U+0AEB á 5 GUJARATI DIGIT FIVE U+0AEC á 6 GUJARATI DIGIT SIX U+0AED á 7 GUJARATI DIGIT SEVEN U+0AEE á 8 GUJARATI DIGIT EIGHT U+0AEF á 9 GUJARATI DIGIT NINE U+0B66 á 0 ORIYA DIGIT ZERO U+0B67 á 1 ORIYA DIGIT ONE U+0B68 á 2 ORIYA DIGIT TWO
90 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+0B69 á 3 ORIYA DIGIT THREE U+0B6A á 4 ORIYA DIGIT FOUR U+0B6B á 5 ORIYA DIGIT FIVE U+0B6C á 6 ORIYA DIGIT SIX U+0B6D á 7 ORIYA DIGIT SEVEN U+0B6E á 8 ORIYA DIGIT EIGHT U+0B6F á 9 ORIYA DIGIT NINE U+0BE7 á 1 TAMIL DIGIT ONE U+0BE8 á 2 TAMIL DIGIT TWO U+0BE9 á 3 TAMIL DIGIT THREE U+0BEA á 4 TAMIL DIGIT FOUR U+0BEB á 5 TAMIL DIGIT FIVE U+0BEC á 6 TAMIL DIGIT SIX U+0BED á 7 TAMIL DIGIT SEVEN U+0BEE á 8 TAMIL DIGIT EIGHT U+0BEF á 9 TAMIL DIGIT NINE U+0BF0 10 TAMIL NUMBER TEN U+0BF1 100 TAMIL NUMBER ONE HUNDRED U+0BF2 1,000 TAMIL NUMBER ONE THOUSAND U+0C66 á 0 TELUGU DIGIT ZERO U+0C67 á 1 TELUGU DIGIT ONE U+0C68 á 2 TELUGU DIGIT TWO U+0C69 á 3 TELUGU DIGIT THREE U+0C6A á 4 TELUGU DIGIT FOUR U+0C6B á 5 TELUGU DIGIT FIVE U+0C6C á 6 TELUGU DIGIT SIX U+0C6D á 7 TELUGU DIGIT SEVEN U+0C6E á 8 TELUGU DIGIT EIGHT U+0C6F á 9 TELUGU DIGIT NINE U+0CE6 á 0 KANNADA DIGIT ZERO U+0CE7 á 1 KANNADA DIGIT ONE U+0CE8 á 2 KANNADA DIGIT TWO U+0CE9 á 3 KANNADA DIGIT THREE U+0CEA á 4 KANNADA DIGIT FOUR U+0CEB á 5 KANNADA DIGIT FIVE U+0CEC á 6 KANNADA DIGIT SIX U+0CED á 7 KANNADA DIGIT SEVEN U+0CEE á 8 KANNADA DIGIT EIGHT U+0CEF á 9 KANNADA DIGIT NINE U+0D66 á 0 MALAYALAM DIGIT ZERO U+0D67 á 1 MALAYALAM DIGIT ONE U+0D68 á 2 MALAYALAM DIGIT TWO U+0D69 á 3 MALAYALAM DIGIT THREE U+0D6A á 4 MALAYALAM DIGIT FOUR U+0D6B á 5 MALAYALAM DIGIT FIVE U+0D6C á 6 MALAYALAM DIGIT SIX U+0D6D á 7 MALAYALAM DIGIT SEVEN U+0D6E á 8 MALAYALAM DIGIT EIGHT U+0D6F á 9 MALAYALAM DIGIT NINE U+0E50 á 0 THAI DIGIT ZERO U+0E51 á 1 THAI DIGIT ONE U+0E52 á 2 THAI DIGIT TWO U+0E53 á 3 THAI DIGIT THREE U+0E54 á 4 THAI DIGIT FOUR U+0E55 á 5 THAI DIGIT FIVE U+0E56 á 6 THAI DIGIT SIX
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 91 4.6 Numeric Value—Normative Character Properties
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+0E57 á 7 THAI DIGIT SEVEN U+0E58 á 8 THAI DIGIT EIGHT U+0E59 á 9 THAI DIGIT NINE U+0ED0 á 0 LAO DIGIT ZERO U+0ED1 á 1 LAO DIGIT ONE U+0ED2 á 2 LAO DIGIT TWO U+0ED3 á 3 LAO DIGIT THREE U+0ED4 á 4 LAO DIGIT FOUR U+0ED5 á 5 LAO DIGIT FIVE U+0ED6 á 6 LAO DIGIT SIX U+0ED7 á 7 LAO DIGIT SEVEN U+0ED8 á 8 LAO DIGIT EIGHT U+0ED9 á 9 LAO DIGIT NINE U+0F20 á 0 TIBETAN DIGIT ZERO U+0F21 á 1 TIBETAN DIGIT ONE U+0F22 á 2 TIBETAN DIGIT TWO U+0F23 á 3 TIBETAN DIGIT THREE U+0F2D á 4 TIBETAN DIGIT FOUR U+0F25 á 5 TIBETAN DIGIT FIVE U+0F26 á 6 TIBETAN DIGIT SIX U+0F27 á 7 TIBETAN DIGIT SEVEN U+0F28 á 8 TIBETAN DIGIT EIGHT U+0F29 á 9 TIBETAN DIGIT NINE U+0F2A 1/2 TIBETAN DIGIT HALF ONE U+0F2B 3/2 TIBETAN DIGIT HALF TWO U+0F2C 5/2 TIBETAN DIGIT HALF THREE U+0F2D 7/2 TIBETAN DIGIT HALF FOUR U+0F2E 9/2 TIBETAN DIGIT HALF FIVE U+0F2F 11/2 TIBETAN DIGIT HALF SIX U+0F30 13/2 TIBETAN DIGIT HALF SEVEN U+0F31 15/2 TIBETAN DIGIT HALF EIGHT U+0F32 17/2 TIBETAN DIGIT HALF NINE U+0F33 –1/2 TIBETAN DIGIT HALF ZERO U+1040 á 0 MYANMAR DIGIT ZERO U+1041 á 1 MYANMAR DIGIT ONE U+1042 á 2 MYANMAR DIGIT TWO U+1043 á 3 MYANMAR DIGIT THREE U+1044 á 4 MYANMAR DIGIT FOUR U+1045 á 5 MYANMAR DIGIT FIVE U+1046 á 6 MYANMAR DIGIT SIX U+1047 á 7 MYANMAR DIGIT SEVEN U+1048 á 8 MYANMAR DIGIT EIGHT U+1049 á 9 MYANMAR DIGIT NINE U+1369 á 1 ETHIOPIC DIGIT ONE U+136A á 2 ETHIOPIC DIGIT TWO U+136B á 3 ETHIOPIC DIGIT THREE U+136C á 4 ETHIOPIC DIGIT FOUR U+136D á 5 ETHIOPIC DIGIT FIVE U+136E á 6 ETHIOPIC DIGIT SIX U+136F á 7 ETHIOPIC DIGIT SEVEN U+1370 á 8 ETHIOPIC DIGIT EIGHT U+1371 á 9 ETHIOPIC DIGIT NINE U+1372 10 ETHIOPIC NUMBER TEN U+1373 20 ETHIOPIC NUMBER TWENTY U+1374 30 ETHIOPIC NUMBER THIRTY U+1375 40 ETHIOPIC NUMBER FORTY U+1376 50 ETHIOPIC NUMBER FIFTY
92 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+1377 60 ETHIOPIC NUMBER SIXTY U+1378 70 ETHIOPIC NUMBER SEVENTY U+1379 80 ETHIOPIC NUMBER EIGHTY U+137A 90 ETHIOPIC NUMBER NINETY U+137B 100 ETHIOPIC NUMBER HUNDRED U+137C 10,000 ETHIOPIC NUMBER TEN THOUSAND U+16EE 17 RUNIC ARLAUG SYMBOL U+16EF 18 RUNIC TVIMADUR SYMBOL U+16F0 19 RUNIC BELGTHOR SYMBOL U+17E0 á 0 KHMER DIGIT ZERO U+17E1 á 1 KHMER DIGIT ONE U+17E2 á 2 KHMER DIGIT TWO U+17E3 á 3 KHMER DIGIT THREE U+17E4 á 4 KHMER DIGIT FOUR U+17E5 á 5 KHMER DIGIT FIVE U+17E6 á 6 KHMER DIGIT SIX U+17E7 á 7 KHMER DIGIT SEVEN U+17E8 á 8 KHMER DIGIT EIGHT U+17E9 á 9 KHMER DIGIT NINE U+1810 á 0 MONGOLIAN DIGIT ZERO U+1811 á 1 MONGOLIAN DIGIT ONE U+1812 á 2 MONGOLIAN DIGIT TWO U+1813 á 3 MONGOLIAN DIGIT THREE U+1814 á 4 MONGOLIAN DIGIT FOUR U+1815 á 5 MONGOLIAN DIGIT FIVE U+1816 á 6 MONGOLIAN DIGIT SIX U+1817 á 7 MONGOLIAN DIGIT SEVEN U+1818 á 8 MONGOLIAN DIGIT EIGHT U+1819 á 9 MONGOLIAN DIGIT NINE U+2070 0 SUPERSCRIPT ZERO U+2074 4 SUPERSCRIPT FOUR U+2075 5 SUPERSCRIPT FIVE U+2076 6 SUPERSCRIPT SIX U+2077 7 SUPERSCRIPT SEVEN U+2078 8 SUPERSCRIPT EIGHT U+2079 9 SUPERSCRIPT NINE U+2080 0 SUBSCRIPT ZERO U+2081 1 SUBSCRIPT ONE U+2082 2 SUBSCRIPT TWO U+2083 3 SUBSCRIPT THREE U+2084 4 SUBSCRIPT FOUR U+2085 5 SUBSCRIPT FIVE U+2086 6 SUBSCRIPT SIX U+2087 7 SUBSCRIPT SEVEN U+2088 8 SUBSCRIPT EIGHT U+2089 9 SUBSCRIPT NINE U+2153 1/3 VULGAR FRACTION ONE THIRD U+2154 2/3 VULGAR FRACTION TWO THIRDS U+2155 1/5 VULGAR FRACTION ONE FIFTH U+2156 2/5 VULGAR FRACTION TWO FIFTHS U+2157 3/5 VULGAR FRACTION THREE FIFTHS U+2158 4/5 VULGAR FRACTION FOUR FIFTHS U+2159 1/6 VULGAR FRACTION ONE SIXTH U+215A 5/6 VULGAR FRACTION FIVE SIXTHS U+215B 1/8 VULGAR FRACTION ONE EIGHTH U+215C 3/8 VULGAR FRACTION THREE EIGHTHS U+215D 5/8 VULGAR FRACTION FIVE EIGHTHS U+215E 7/8 VULGAR FRACTION SEVEN EIGHTHS U+215F 1 FRACTION NUMERATOR ONE
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 93 4.6 Numeric Value—Normative Character Properties
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+2160 1 ROMAN NUMERAL ONE U+2161 2 ROMAN NUMERAL TWO U+2162 3 ROMAN NUMERAL THREE U+2163 4 ROMAN NUMERAL FOUR U+2164 5 ROMAN NUMERAL FIVE U+2165 6 ROMAN NUMERAL SIX U+2166 7 ROMAN NUMERAL SEVEN U+2167 8 ROMAN NUMERAL EIGHT U+2168 9 ROMAN NUMERAL NINE U+2169 10 ROMAN NUMERAL TEN U+216A 11 ROMAN NUMERAL ELEVEN U+216B 12 ROMAN NUMERAL TWELVE U+216C 50 ROMAN NUMERAL FIFTY U+216D 100 ROMAN NUMERAL ONE HUNDRED U+216E 500 ROMAN NUMERAL FIVE HUNDRED U+216F 1,000 ROMAN NUMERAL ONE THOUSAND U+2170 1 SMALL ROMAN NUMERAL ONE U+2171 2 SMALL ROMAN NUMERAL TWO U+2172 3 SMALL ROMAN NUMERAL THREE U+2173 4 SMALL ROMAN NUMERAL FOUR U+2174 5 SMALL ROMAN NUMERAL FIVE U+2175 6 SMALL ROMAN NUMERAL SIX U+2176 7 SMALL ROMAN NUMERAL SEVEN U+2177 8 SMALL ROMAN NUMERAL EIGHT U+2178 9 SMALL ROMAN NUMERAL NINE U+2179 10 SMALL ROMAN NUMERAL TEN U+217A 11 SMALL ROMAN NUMERAL ELEVEN U+217B 12 SMALL ROMAN NUMERAL TWELVE U+217C 50 SMALL ROMAN NUMERAL FIFTY U+217D 100 SMALL ROMAN NUMERAL ONE HUNDRED U+217E 500 SMALL ROMAN NUMERAL FIVE HUNDRED U+217F 1,000 SMALL ROMAN NUMERAL ONE THOUSAND U+2180 1,000 ROMAN NUMERAL ONE THOUSAND C D U+2181 5,000 ROMAN NUMERAL FIVE THOUSAND U+2182 10,000 ROMAN NUMERAL TEN THOUSAND U+2183 - ROMAN NUMERAL REVERSED ONE HUNDRED U+2460 1 CIRCLED DIGIT ONE U+2461 2 CIRCLED DIGIT TWO U+2462 3 CIRCLED DIGIT THREE U+2463 4 CIRCLED DIGIT FOUR U+2464 5 CIRCLED DIGIT FIVE U+2465 6 CIRCLED DIGIT SIX U+2466 7 CIRCLED DIGIT SEVEN U+2467 8 CIRCLED DIGIT EIGHT U+2468 9 CIRCLED DIGIT NINE U+2469 10 CIRCLED NUMBER TEN U+246A 11 CIRCLED NUMBER ELEVEN U+246B 12 CIRCLED NUMBER TWELVE U+246C 13 CIRCLED NUMBER THIRTEEN U+246D 14 CIRCLED NUMBER FOURTEEN U+246E 15 CIRCLED NUMBER FIFTEEN U+246F 16 CIRCLED NUMBER SIXTEEN U+2470 17 CIRCLED NUMBER SEVENTEEN U+2471 18 CIRCLED NUMBER EIGHTEEN U+2472 19 CIRCLED NUMBER NINETEEN U+2473 20 CIRCLED NUMBER TWENTY U+2474 1 PARENTHESIZED DIGIT ONE U+2475 2 PARENTHESIZED DIGIT TWO U+2476 3 PARENTHESIZED DIGIT THREE U+2477 4 PARENTHESIZED DIGIT FOUR U+2478 5 PARENTHESIZED DIGIT FIVE
94 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.6 Numeric Value—Normative
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+2479 6 PARENTHESIZED DIGIT SIX U+247A 7 PARENTHESIZED DIGIT SEVEN U+247B 8 PARENTHESIZED DIGIT EIGHT U+247C 9 PARENTHESIZED DIGIT NINE U+247D 10 PARENTHESIZED NUMBER TEN U+247E 11 PARENTHESIZED NUMBER ELEVEN U+247F 12 PARENTHESIZED NUMBER TWELVE U+2480 13 PARENTHESIZED NUMBER THIRTEEN U+2481 14 PARENTHESIZED NUMBER FOURTEEN U+2482 15 PARENTHESIZED NUMBER FIFTEEN U+2483 16 PARENTHESIZED NUMBER SIXTEEN U+2484 17 PARENTHESIZED NUMBER SEVENTEEN U+2485 18 PARENTHESIZED NUMBER EIGHTEEN U+2486 19 PARENTHESIZED NUMBER NINETEEN U+2487 20 PARENTHESIZED NUMBER TWENTY U+2488 1 DIGIT ONE FULL STOP U+2489 2 DIGIT TWO FULL STOP U+248A 3 DIGIT THREE FULL STOP U+248B 4 DIGIT FOUR FULL STOP U+248C 5 DIGIT FIVE FULL STOP U+248D 6 DIGIT SIX FULL STOP U+248E 7 DIGIT SEVEN FULL STOP U+248F 8 DIGIT EIGHT FULL STOP U+2490 9 DIGIT NINE FULL STOP U+2491 10 NUMBER TEN FULL STOP U+2492 11 NUMBER ELEVEN FULL STOP U+2493 12 NUMBER TWELVE FULL STOP U+2494 13 NUMBER THIRTEEN FULL STOP U+2495 14 NUMBER FOURTEEN FULL STOP U+2496 15 NUMBER FIFTEEN FULL STOP U+2497 16 NUMBER SIXTEEN FULL STOP U+2498 17 NUMBER SEVENTEEN FULL STOP U+2499 18 NUMBER EIGHTEEN FULL STOP U+249A 19 NUMBER NINETEEN FULL STOP U+249B 20 NUMBER TWENTY FULL STOP U+24EA 0 CIRCLED DIGIT ZERO U+2776 1 DINGBAT NEGATIVE CIRCLED DIGIT ONE U+2777 2 DINGBAT NEGATIVE CIRCLED DIGIT TWO U+2778 3 DINGBAT NEGATIVE CIRCLED DIGIT THREE U+2779 4 DINGBAT NEGATIVE CIRCLED DIGIT FOUR U+277A 5 DINGBAT NEGATIVE CIRCLED DIGIT FIVE U+277B 6 DINGBAT NEGATIVE CIRCLED DIGIT SIX U+277C 7 DINGBAT NEGATIVE CIRCLED DIGIT SEVEN U+277D 8 DINGBAT NEGATIVE CIRCLED DIGIT EIGHT U+277E 9 DINGBAT NEGATIVE CIRCLED DIGIT NINE U+277F 10 DINGBAT NEGATIVE CIRCLED NUMBER TEN U+2780 1 DINGBAT CIRCLED SANS-SERIF DIGIT ONE U+2781 2 DINGBAT CIRCLED SANS-SERIF DIGIT TWO U+2782 3 DINGBAT CIRCLED SANS-SERIF DIGIT THREE U+2783 4 DINGBAT CIRCLED SANS-SERIF DIGIT FOUR U+2784 5 DINGBAT CIRCLED SANS-SERIF DIGIT FIVE U+2785 6 DINGBAT CIRCLED SANS-SERIF DIGIT SIX U+2786 7 DINGBAT CIRCLED SANS-SERIF DIGIT SEVEN U+2787 8 DINGBAT CIRCLED SANS-SERIF DIGIT EIGHT U+2788 9 DINGBAT CIRCLED SANS-SERIF DIGIT NINE U+2789 10 DINGBAT CIRCLED SANS-SERIF NUMBER TEN U+278A 1 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE U+278B 2 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT TWO U+278C 3 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT THREE U+278D 4 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT FOUR
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 95 4.6 Numeric Value—Normative Character Properties
Table 4-6. Numeric Properties (Continued) Value Decimal Number Name U+278E 5 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT FIVE U+278F 6 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT SIX U+2790 7 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT SEVEN U+2791 8 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT EIGHT U+2792 9 DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT NINE U+2793 10 DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN U+3007 0 IDEOGRAPHIC NUMBER ZERO U+3021 1 HANGZHOU NUMERAL ONE U+3022 2 HANGZHOU NUMERAL TWO U+3023 3 HANGZHOU NUMERAL THREE U+3024 4 HANGZHOU NUMERAL FOUR U+3025 5 HANGZHOU NUMERAL FIVE U+3026 6 HANGZHOU NUMERAL SIX U+3027 7 HANGZHOU NUMERAL SEVEN U+3028 8 HANGZHOU NUMERAL EIGHT U+3029 9 HANGZHOU NUMERAL NINE U+3038 10 HANGZHOU NUMERAL TEN U+3039 20 HANGZHOU NUMERAL TWENTY U+303A 30 HANGZHOU NUMERAL THIRTY U+3192 1 IDEOGRAPHIC ANNOTATION ONE MARK U+3193 2 IDEOGRAPHIC ANNOTATION TWO MARK U+3194 3 IDEOGRAPHIC ANNOTATION THREE MARK U+3195 4 IDEOGRAPHIC ANNOTATION FOUR MARK U+3220 1 PARENTHESIZED IDEOGRAPH ONE U+3221 2 PARENTHESIZED IDEOGRAPH TWO U+3222 3 PARENTHESIZED IDEOGRAPH THREE U+3223 4 PARENTHESIZED IDEOGRAPH FOUR U+3224 5 PARENTHESIZED IDEOGRAPH FIVE U+3225 6 PARENTHESIZED IDEOGRAPH SIX U+3226 7 PARENTHESIZED IDEOGRAPH SEVEN U+3227 8 PARENTHESIZED IDEOGRAPH EIGHT U+3228 9 PARENTHESIZED IDEOGRAPH NINE U+3229 10 PARENTHESIZED IDEOGRAPH TEN U+3280 1 CIRCLED IDEOGRAPH ONE U+3281 2 CIRCLED IDEOGRAPH TWO U+3282 3 CIRCLED IDEOGRAPH THREE U+3283 4 CIRCLED IDEOGRAPH FOUR U+3284 5 CIRCLED IDEOGRAPH FIVE U+3285 6 CIRCLED IDEOGRAPH SIX U+3286 7 CIRCLED IDEOGRAPH SEVEN U+3287 8 CIRCLED IDEOGRAPH EIGHT U+3288 9 CIRCLED IDEOGRAPH NINE U+3289 10 CIRCLED IDEOGRAPH TEN U+FF10 á 0 FULLWIDTH DIGIT ZERO U+FF11 á 1 FULLWIDTH DIGIT ONE U+FF12 á 2 FULLWIDTH DIGIT TWO U+FF13 á 3 FULLWIDTH DIGIT THREE U+FF14 á 4 FULLWIDTH DIGIT FOUR U+FF15 á 5 FULLWIDTH DIGIT FIVE U+FF16 á 6 FULLWIDTH DIGIT SIX U+FF17 á 7 FULLWIDTH DIGIT SEVEN U+FF18 á 8 FULLWIDTH DIGIT EIGHT U+FF19 á 9 FULLWIDTH DIGIT NINE
CJK ideographs from the Unified Repertoire and Ordering also may have numeric values. The primary numeric ideographs are shown in Table 4-7. When used to represent numbers
96 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.7 Mirrored—Normative in decimal notation, zero is represented by U+3007. Otherwise, zero is represented by U+96F6. Table 4-7. Primary Numeric Ideographs U+96F6 0 U+4E00 1 U+4E8C 2 U+4E09 3 U+56DB 4 U+4E94 5 U+516D 6 U+4E03 7 U+516B 8 U+4E5D 9 U+5341 10 U+767E 100 U+5343 1,000 U+4E07 10,000 U+5104 100,000,000 (10,000 × 10,000) U+5146 1,000,000,000,000 (10,000 × 10,000 × 10,000)
Ideographic accounting numbers are commonly used on checks and other financial instru- ments to minimize the possibilities of misinterpretation or fraud in the representation of numerical values. The set of accounting numbers varies somewhat between Japanese, Chi- nese, and Korean usage. Table 4-8 gives a fairly complete listing of the known accounting characters. Some of these characters are ideographs with other meanings pressed into ser- vice as accounting numbers; others are used only as accounting numbers. Table 4-8. Ideographs Used as Accounting Numbers 1 U+58F9, U+58F1, U+5F0Ca 2U+8CAEa, U+8D30a, U+5F10a, U+5F0Da 3 U+53C3, U+53C2, U+53C1a, U+5F0Ea 4 U+8086 5U+4F0D 6 U+9678, U+9646 7 U+67D2b 8 U+634C 9 U+7396 10 U+62FE 100 U+4F70a, U+964C 1,000 U+4EDF 10,000 U+842C
a. These characters are used only as accounting numbers, and have no other meaning. b. In Japan, U+67D2 is also pronounced urusi, meaning “lacquer,” and is treated as a variant of the standard character for “lacquer” U+6F06.
4.7 Mirrored—Normative Mirrored is a normative property of characters such as parentheses, whose images are mir- rored horizontally in text that is laid out from right to left. For example, U+0028 is interpreted as opening parenthesis; in a left-to-right context it will appear as “(”, while in a right-to-left context it will appear as the mirrored glyph “)”. The list of mir- rored characters appears in Table 4-9. Note that mirroring is not limited to paired charac- ters, but that any character with the mirrored property will need two mirrored glyphs. This requirement is necessary to render the character properly in a bidirectional context.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 97 4.7 Mirrored—Normative Character Properties
Table 4-9. Mirrored Characters Value Name U+0028 LEFT PARENTHESIS U+0029 RIGHT PARENTHESIS U+003C LESS-THAN SIGN U+003E GREATER-THAN SIGN U+005B LEFT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+007B LEFT CURLY BRACKET U+007D RIGHT CURLY BRACKET U+2045 LEFT SQUARE BRACKET WITH QUILL U+2046 RIGHT SQUARE BRACKET WITH QUILL U+207D SUPERSCRIPT LEFT PARENTHESIS U+207E SUPERSCRIPT RIGHT PARENTHESIS U+208D SUBSCRIPT LEFT PARENTHESIS U+208E SUBSCRIPT RIGHT PARENTHESIS U+2201 COMPLEMENT U+2202 PARTIAL DIFFERENTIAL U+2203 THERE EXISTS U+2204 THERE DOES NOT EXIST U+2208 ELEMENT OF U+2209 NOT AN ELEMENT OF U+220A SMALL ELEMENT OF U+220B CONTAINS AS MEMBER U+220C DOES NOT CONTAIN AS MEMBER U+220D SMALL CONTAINS AS MEMBER U+2211 N-ARY SUMMATION U+2215 DIVISION SLASH U+2216 SET MINUS U+221A SQUARE ROOT U+221B CUBE ROOT U+221C FOURTH ROOT U+221D PROPORTIONAL TO U+221F RIGHT ANGLE U+2220 ANGLE U+2221 MEASURED ANGLE U+2222 SPHERICAL ANGLE U+2224 DOES NOT DIVIDE U+2226 NOT PARALLEL TO U+222B INTEGRAL U+222C DOUBLE INTEGRAL U+222D TRIPLE INTEGRAL U+222E CONTOUR INTEGRAL U+222F SURFACE INTEGRAL U+2230 VOLUME INTEGRAL U+2231 CLOCKWISE INTEGRAL U+2232 CLOCKWISE CONTOUR INTEGRAL U+2233 ANTICLOCKWISE CONTOUR INTEGRAL U+2239 EXCESS U+223B HOMOTHETIC U+223C TILDE OPERATOR U+223D REVERSED TILDE U+223E INVERTED LAZY S U+223F SINE WAVE U+2240 WREATH PRODUCT U+2241 NOT TILDE U+2242 MINUS TILDE U+2243 ASYMPTOTICALLY EQUAL TO U+2244 NOT ASYMPTOTICALLY EQUAL TO U+2245 APPROXIMATELY EQUAL TO U+2246 APPROXIMATELY BUT NOT ACTUALLY EQUAL TO U+2247 NEITHER APPROXIMATELY NOR ACTUALLY EQUAL TO U+2248 ALMOST EQUAL TO
98 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.7 Mirrored—Normative
Table 4-9. Mirrored Characters (Continued) Value Name U+2249 NOT ALMOST EQUAL TO U+224A ALMOST EQUAL OR EQUAL TO U+224B TRIPLE TILDE U+224C ALL EQUAL TO U+2252 APPROXIMATELY EQUAL TO OR THE IMAGE OF U+2253 IMAGE OF OR APPROXIMATELY EQUAL TO U+2254 COLON EQUALS U+2255 EQUALS COLON U+225F QUESTIONED EQUAL TO U+2260 NOT EQUAL TO U+2262 NOT IDENTICAL TO U+2264 LESS-THAN OR EQUAL TO U+2265 GREATER-THAN OR EQUAL TO U+2266 LESS-THAN OVER EQUAL TO U+2267 GREATER-THAN OVER EQUAL TO U+2268 LESS-THAN BUT NOT EQUAL TO U+2269 GREATER-THAN BUT NOT EQUAL TO U+226A MUCH LESS-THAN U+226B MUCH GREATER-THAN U+226E NOT LESS-THAN U+226F NOT GREATER-THAN U+2270 NEITHER LESS-THAN NOR EQUAL TO U+2271 NEITHER GREATER-THAN NOR EQUAL TO U+2272 LESS-THAN OR EQUIVALENT TO U+2273 GREATER-THAN OR EQUIVALENT TO U+2274 NEITHER LESS-THAN NOR EQUIVALENT TO U+2275 NEITHER GREATER-THAN NOR EQUIVALENT TO U+2276 LESS-THAN OR GREATER-THAN U+2277 GREATER-THAN OR LESS-THAN U+2278 NEITHER LESS-THAN NOR GREATER-THAN U+2279 NEITHER GREATER-THAN NOR LESS-THAN U+227A PRECEDES U+227B SUCCEEDS U+227C PRECEDES OR EQUAL TO U+227D SUCCEEDS OR EQUAL TO U+227E PRECEDES OR EQUIVALENT TO U+227F SUCCEEDS OR EQUIVALENT TO U+2280 DOES NOT PRECEDE U+2281 DOES NOT SUCCEED U+2282 SUBSET OF U+2283 SUPERSET OF U+2284 NOT A SUBSET OF U+2285 NOT A SUPERSET OF U+2286 SUBSET OF OR EQUAL TO U+2287 SUPERSET OF OR EQUAL TO U+2288 NEITHER A SUBSET OF NOR EQUAL TO U+2289 NEITHER A SUPERSET OF NOR EQUAL TO U+228A SUBSET OF WITH NOT EQUAL TO U+228B SUPERSET OF WITH NOT EQUAL TO U+228C MULTISET U+228F SQUARE IMAGE OF U+2290 SQUARE ORIGINAL OF U+2291 SQUARE IMAGE OF OR EQUAL TO U+2292 SQUARE ORIGINAL OF OR EQUAL TO U+2298 CIRCLED DIVISION SLASH U+22A2 RIGHT TACK U+22A3 LEFT TACK U+22A6 ASSERTION U+22A7 MODELS U+22A8 TRUE U+22A9 FORCES
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 99 4.7 Mirrored—Normative Character Properties
Table 4-9. Mirrored Characters (Continued) Value Name U+22AA TRIPLE VERTICAL BAR RIGHT TURNSTILE U+22AB DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE U+22AC DOES NOT PROVE U+22AD NOT TRUE U+22AE DOES NOT FORCE U+22AF NEGATED DOUBLE VERTICAL BAR DOUBLE RIGHT TURNSTILE U+22B0 PRECEDES UNDER RELATION U+22B1 SUCCEEDS UNDER RELATION U+22B2 NORMAL SUBGROUP OF U+22B3 CONTAINS AS NORMAL SUBGROUP U+22B4 NORMAL SUBGROUP OF OR EQUAL TO U+22B5 CONTAINS AS NORMAL SUBGROUP OR EQUAL TO U+22B6 ORIGINAL OF U+22B7 IMAGE OF U+22B8 MULTIMAP U+22BE RIGHT ANGLE WITH ARC U+22BF RIGHT TRIANGLE U+22C9 LEFT NORMAL FACTOR SEMIDIRECT PRODUCT U+22CA RIGHT NORMAL FACTOR SEMIDIRECT PRODUCT U+22CB LEFT SEMIDIRECT PRODUCT U+22CC RIGHT SEMIDIRECT PRODUCT U+22CD REVERSED TILDE EQUALS U+22D0 DOUBLE SUBSET U+22D1 DOUBLE SUPERSET U+22D6 LESS-THAN WITH DOT U+22D7 GREATER-THAN WITH DOT U+22D8 VERY MUCH LESS-THAN U+22D9 VERY MUCH GREATER-THAN U+22DA LESS-THAN EQUAL TO OR GREATER-THAN U+22DB GREATER-THAN EQUAL TO OR LESS-THAN U+22DC EQUAL TO OR LESS-THAN U+22DD EQUAL TO OR GREATER-THAN U+22DE EQUAL TO OR PRECEDES U+22DF EQUAL TO OR SUCCEEDS U+22E0 DOES NOT PRECEDE OR EQUAL U+22E1 DOES NOT SUCCEED OR EQUAL U+22E2 NOT SQUARE IMAGE OF OR EQUAL TO U+22E3 NOT SQUARE ORIGINAL OF OR EQUAL TO U+22E4 SQUARE IMAGE OF OR NOT EQUAL TO U+22E5 SQUARE ORIGINAL OF OR NOT EQUAL TO U+22E6 LESS-THAN BUT NOT EQUIVALENT TO U+22E7 GREATER-THAN BUT NOT EQUIVALENT TO U+22E8 PRECEDES BUT NOT EQUIVALENT TO U+22E9 SUCCEEDS BUT NOT EQUIVALENT TO U+22EA NOT NORMAL SUBGROUP OF U+22EB DOES NOT CONTAIN AS NORMAL SUBGROUP U+22EC NOT NORMAL SUBGROUP OF OR EQUAL TO U+22ED DOES NOT CONTAIN AS NORMAL SUBGROUP OR EQUAL U+22F0 UP RIGHT DIAGONAL ELLIPSIS U+22F1 DOWN RIGHT DIAGONAL ELLIPSIS U+2308 LEFT CEILING U+2309 RIGHT CEILING U+230A LEFT FLOOR U+230B RIGHT FLOOR U+2320 TOP HALF INTEGRAL U+2321 BOTTOM HALF INTEGRAL U+2329 LEFT-POINTING ANGLE BRACKET U+232A RIGHT-POINTING ANGLE BRACKET U+3008 LEFT ANGLE BRACKET U+3009 RIGHT ANGLE BRACKET U+300A LEFT DOUBLE ANGLE BRACKET
100 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Character Properties 4.8 Unicode 1.0 Names
Table 4-9. Mirrored Characters (Continued) Value Name U+300B RIGHT DOUBLE ANGLE BRACKET U+300C LEFT CORNER BRACKET U+300D RIGHT CORNER BRACKET U+300E LEFT WHITE CORNER BRACKET U+300F RIGHT WHITE CORNER BRACKET U+3010 LEFT BLACK LENTICULAR BRACKET U+3011 RIGHT BLACK LENTICULAR BRACKET U+3014 LEFT TORTOISE SHELL BRACKET U+3015 RIGHT TORTOISE SHELL BRACKET U+3016 LEFT WHITE LENTICULAR BRACKET U+3017 RIGHT WHITE LENTICULAR BRACKET U+3018 LEFT WHITE TORTOISE SHELL BRACKET U+3019 RIGHT WHITE TORTOISE SHELL BRACKET U+301A LEFT WHITE SQUARE BRACKET U+301B RIGHT WHITE SQUARE BRACKET
4.8 Unicode 1.0 Names The Unicode 1.0 character name is an informative property of the characters defined in Ver- sion 1.0 of the Unicode Standard. The names of Unicode characters were changed in the process of merging the standard with ISO/IEC 10646. The Version 1.0 character names can be obtained from the CD-ROM accompanying the standard or from the ftp site. See also Appendix D, Changes from Unicode Version 2.0. Where the Version 1.0 character name pro- vides additional useful information, it is listed in Chapter 14, Code Charts. For example, U+00B6 has its Version 1.0 name, , listed for clarity.
4.9 Mathematical Property The mathematical property is an informative property of characters that are used as opera- tors in mathematical formulas. The mathematical property may be useful in algorithms that deal with the display of mathematical text and formulas. However, a number of these characters have multiple usages and may occur with nonmathematical semantics. For example, U+002D - may also be used as a hyphen—and not as a mathemat- ical minus sign. Other characters, including some alphabetic, numeric, punctuation, spaces, arrows, and geometric shapes, are used in mathematical expressions as well, but are even more dependent on the context for their identification. The Unicode characters in the following lists have the mathematical property. Characters with the math property and the Sm General Category: 002B, 003C..003E, 007C, 007E, 00AC, 00B1, 00D7, 00F7, 2044, 207A..207C, 208A..208C, 2190..2194, 219A..219B, 21A0, 21A3, 21A6, 21AE, 21CE..21CF, 21D2, 21D4, 2200..22F1, 2308..230B, 2320..2321, 25B7, 25C1, 266F, FB29, FE62, FE64..FE66, FF0B, FF1C..FF1E, FF5C, FF5E, FFE2, FFE9..FFEC Characters with the math property and other General Category values: 0028..002A, 002D, 002F, 005B..005E, 007B, 007D, 2016, 2032..2034, 207D..207E, 208D..208E, 20D0..20DC, 20E1, 2329..232A, 300A..300B, 301A..301B, FE35..FE38, FE59..FE5C, FE61, FE63, FE68, FF08..FF0A, FF0D, FF0F, FF3B..FF3E, FF5B, FF5D
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 101 4.10 Letters and Other Useful Properties Character Properties
4.10 Letters and Other Useful Properties The CD-ROM that accompanies the Unicode Standard contains data files that list other useful, informative properties of Unicode characters. The full list of those properties can be found in the data files; see, in particular, PropList.txt. This section highlights some of those properties that have a bearing on such implementation issues as parsing of identifiers. (See also Section 5.16, Identifiers.) Computer language standards often characterize identifiers as consisting of letters, sylla- bles, ideographs, and digits, but do not specify exactly what a “letter,” “syllable,” “ideo- graph,” or “digit” is, leaving the definitions implicitly either to a character encoding standard or to a locale specification. The large scope of the Unicode Standard means that it includes many writing systems for which these distinctions are not as self-evident as they may once have been for systems designed to work primarily for Western European lan- guages and Japanese. In particular, while the Unicode Standard includes various “alpha- bets” and “syllabaries,” it also includes writing systems that fall somewhere in between. As a result, no attempt is made to draw a sharp property distinction between letters and sylla- bles. Letter. This informative property applies to characters that are used to write words. This group includes characters such as capital letters, small letters, ideographs, hangul, and spacing modifier letters. Combining marks generally assume the letter property of the pre- ceding base character. For example, when searching for word boundaries, combining char- acters don’t break from previous letters. The letter property mappings can be obtained from the CD-ROM accompanying the standard. Alphabetic. The alphabetic property is an informative property of the primary units of alphabets and/or syllabaries, whether combining or noncombining. Included in this group would be composite characters that are canonical equivalents to a combining character sequence of an alphabetic base character plus one or more combining characters; letter digraphs; contextual variants of alphabetic characters; ligatures of alphabetic characters; contextual variants of ligatures; modifier letters; letterlike symbols that are compatibility equivalents of single alphabetic letters; and miscellaneous letter elements. Notably, U+00AA and U+00BA are simply abbreviatory forms involving a Latin letter and should be considered alphabetic rather than nonalphabetic symbols. Ideographic. The ideographic property is an informative property of the Unified CJK Ideo- graph set (U+4E00..U+9FA5); the CJK Ideograph Extension A set (U+3400..U+4DB5); the CJK Compatibility Ideograph set (U+F900..U+FA2D); U+3007 ; U+3006 ; and the Hangzhou-style numerals (U+3021.. U+3029, U+3038..U+303A).
102 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 5 Implementation Guidelines 5
It is possible to implement a substantial subset of the Unicode Standard as “wide-ASCII” with little change to existing programming practice. However, the Unicode Standard also provides for languages and writing systems that have more complex behavior than English. Whether implementing a new operating system from the ground up or enhancing existing programming environments or applications, it is necessary to examine many aspects of current programming practice and conventions to deal with this more complex behavior. This chapter covers a series of short, self-contained topics that are useful for implementers. The information and examples presented here are meant to help implementers understand and apply the design and features of the Unicode Standard. That is, they are meant to pro- mote good practice in implementations conforming to the Unicode Standard. These recommended guidelines are not normative and are not binding on the imple- menter.
5.1 Transcoding to Other Standards The Unicode Standard exists in a world of other text and character encoding standards— some private, some national, some international. A major strength of the Unicode Stan- dard is the number of other important standards that it incorporates. In many cases, the Unicode Standard included duplicate characters to guarantee round-trip transcoding to established and widely used standards. Conversion of characters between standards is not always a straightforward proposition. Many characters have mixed semantics in one standard and may correspond to more than one character in another. Sometimes standards give duplicate encodings for the same char- acter; at other times the interpretation of a whole set of characters may depend on the application. Finally, there are subtle differences in what a standard may consider a charac- ter.
Issues The Unicode Standard can be used as a pivot to transcode among n different standards. This process, which is sometimes called triangulation, reduces the number of mapping tables an implementation needs from O(n2) to O(n). Generally, tables—as opposed to algorithmic transformation—are required to map between the Unicode Standard and another standard. Table lookup often yields much better performance than even simple algorithmic conversions, as can be implemented between JIS and Shift-JIS.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 105 5.1 Transcoding to Other Standards Implementation Guidelines
Multistage Tables Tables require space. Even small character sets often map to characters from several differ- ent blocks in the Unicode Standard, and thus may contain up to 64K entries in at least one direction. Several techniques exist to reduce the memory space requirements for mapping tables. Such techniques apply not only to transcoding tables, but also to many other tables needed to implement the Unicode Standard, including character property data, collation tables, and glyph selection tables. Flat Tables. If diskspace is not at issue, virtual memory architectures yield acceptable working set sizes even for flat tables because frequency of usage among characters differs widely and even small character sets contain many infrequently used characters. In addi- tion, data intended to be mapped into a given character set generally does not contain char- acters from all blocks of the Unicode Standard (usually, only a few blocks at a time need to be transcoded to a given character set). This situation leaves large sections of the 64K-sized reverse mapping tables (containing the default character, or unmappable character entry) unused—and therefore paged to disk. Ranges. It may be tempting to “optimize” these tables for space by providing elaborate pro- visions for nested ranges or similar devices. This practice leads to unnecessary performance penalties on modern, highly pipelined processor architectures because of branch penalties. A faster solution is to use an optimized two-stage table, which can be coded without any test or branch instructions. Hash tables can also be used for space optimization, although they are not as fast as multistage tables. Two-Stage Tables. Two-stage (high-byte) tables are a commonly employed mechanism to reduce table size (see Figure 5-1). They use an array of 256 pointers and a default value. If a pointer is NULL, the returned value is the default. Otherwise, the pointer references a block of 256 values. Figure 5-1. Two-Stage Tables
Optimized Two-Stage Table. Wherever any blocks are identical, the pointers just point to the same block. For transcoding tables, this case occurs generally for a block containing only mappings to the “default” or “unmappable” character. Instead of using NULL pointers and a default value, one “shared” block of 256 default entries is created. This block is pointed to by all first-stage table entries, for which no character value can be mapped. By
106 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.2 ANSI/ISO C wchar_t avoiding tests and branches, this strategy provides access time that approaches the simple array access, but at a great savings in storage. Given an arbitrary 64K table, it is a simple matter to write a small utility that can calculate the optimal number of stages and their width.
7-Bit or 8-Bit Transmission Some transmission protocols use ASCII control codes for flow control. Others, including some UNIX mailers, are notorious for their restrictions to 7-bit ASCII. In these cases, transmissions of Unicode-encoded text must be encapsulated. A number of encapsulation protocols exist today, such as uuencode and BinHex. These protocols can be combined with compression in the same pass, thereby reducing the transmission overhead. The UTF-8 Transformation Format described in Section 2.3, Encoding Forms, and in Section 3.8, Transformations, may be used to transmit Unicode-encoded data through 8-bit transmission paths. The UTF-7 Transformation Format, which is defined in RFC-2152, also exists for use with MIME.
Mapping Table Resources To assist and guide implementers, the Unicode Standard provides a series of mapping tables on the accompanying CD-ROM. Each table consists of one-to-one mappings from the Unicode Standard to another published character standard. The tables include occa- sional multiple mappings. Their primary function is to identify the characters in these standards in the context of the Unicode Standard. In many cases, data conversion between the Unicode Standard and other standards will be application-dependent or context- sensitive. Many vendors maintain mapping tables for their own character standards.
Disclaimer The content of all mapping tables has been verified as far as possible by the Uni- code Consortium. However, the Unicode Consortium does not guarantee that the tables are correct in every detail. The mapping tables are provided for informa- tional purposes only. The Unicode Consortium is not responsible for errors that may occur either in the mapping tables printed in this volume or found on the CD-ROM, or in software that implements those tables. All implementers should check the relevant international, national, and vendor standards in cases where ambiguity of interpretation may occur.
5.2 ANSI/ISO C wchar_t With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed- width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension. The Unicode characters in the ASCII range U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses ASCII to code the portable C execution set, the use of the Unicode character set for the wchar_t type, with a width of 16 bits, fulfills the requirement.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 107 5.3 Unknown and Missing Characters Implementation Guidelines
The width of wchar_t is compiler-specific and can be as little as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers. However, programmers can use a macro or typedef (for example, UNICHAR) that can be compiled as unsigned short or wchar_t depending on the target compiler and platform. This choice enables correct compilation on different platforms and compilers. Where a 16-bit implementation of wchar_t is guaranteed, such macros or typedefs may be predefined (for example, TCHAR on the Win32 API). On systems where the native character type or wchar_t is implemented as a 32-bit quan- tity, an implementation may transiently use 32-bit quantities to represent Unicode charac- ters during processing. The internal workings of this representation are treated as a black box and are not Unicode-conformant. In particular, any API or runtime library interfaces that accept strings of 32-bit characters are not Unicode-conformant. If such an implemen- tation interchanges 16-bit Unicode characters with the outside world, then this interchange can be conformant as long as the interface for this interchange complies with the require- ments of Chapter 3, Conformance.
5.3 Unknown and Missing Characters This section briefly discusses how users or implementers might deal with characters that are not supported, or that, although supported, are unavailable for legible rendering.
Unassigned and Private Use Character Codes There are two classes of character code values that even a “complete” implementation of the Unicode Standard cannot necessarily interpret correctly: • Character code values that are unassigned • Character code values in the Private Use Area for which no private agreement exists An implementation should not attempt to interpret such code values. Options for render- ing such unknown code values include printing the character code value as four hexadeci- mal digits, printing a black or white box, using appropriate glyphs such as for unassigned and @ for private use, or simply displaying nothing. In no case should an implementation assume anything else about the character’s properties, nor should it blindly delete such characters. It should not unintentionally transform them into some- thing else.
Interpretable but Unrenderable Characters An implementation may receive a character that is an assigned character in the Unicode character encoding, but be unable to render it because it does not have a font for it or is otherwise incapable of rendering it appropriately. In this case, an implementation might be able to provide further limited feedback to the user’s queries such as being able to sort the data properly, show its script, or otherwise dis- play it in a default manner. An implementation can distinguish between unrenderable (but assigned) characters and unassigned code values by printing the former with distinctive glyphs that give some general indication of their type, such as , , , , , , , , , , , and so on.
108 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.4 Handling Surrogate Pairs
Reassigned Characters Some character code values, primarily Hangul, were assigned in versions of the Unicode Standard earlier than 2.0, but have been reassigned because the characters were moved (see the Transcoding Tables on the CD-ROM). Such code values should be recognized and con- verted into the correct Version 3.0 character code values where possible. In some cases, a Version 3.0 application may still need to emit Version 1.1 character codes to communicate with some Version 1.1 applications. This issue is limited to particular early Unicode appli- cations and data sets and is not a general problem. The Unicode Consortium has the policy to never reassign characters in the future.
5.4 Handling Surrogate Pairs Surrogate pairs provide a mechanism for encoding 917,476 characters without requiring the use of 32-bit characters. Because predominantly infrequently used characters will be assigned to surrogate pairs, not all implementations need to handle these pairs initially. It is widely expected that surrogate pairs will be assigned in the not too distant future. High-surrogates and low-surrogates are assigned to disjoint ranges of code positions. Non- surrogate characters can never be assigned to these ranges. Because the high- and low- surrogate ranges are disjoint, determining character boundaries requires at most scanning one preceding or following Unicode code value, without regard to any other context. This approach enables efficient random access, which is not possible with encodings such as Shift-JIS. In well-formed text, a low-surrogate can be preceded only by a high-surrogate and not by a low-surrogate or nonsurrogate. A high-surrogate can be followed only by a low-surrogate and not by a high-surrogate or nonsurrogate. Surrogates are also designed to work well with implementations that do not recognize them. For example, the valid sequence of Unicode characters [0048] [0069] [0020] [D800] [DC00] [0021] [0021] would be interpreted by a Version 1.0-conformant implementation as “Hi
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 109 5.5 Handling Numbers Implementation Guidelines
The decisions on these issues give rise to three reasonable levels of support for surrogates as shown in Table 5-1. Table 5-1. Surrogate Support Levels
Support Level Interpretation Integrity of Pairs None No pairs Does not guarantee Weak Non-null subset of pairs Does not guarantee Strong Non-null subset of pairs Guarantees
Example. The following sentence could be displayed in three different ways, assuming that both the weak and strong implementations have Phoenician fonts but no hieroglyphics: “The Greek letter α corresponds to
Many implementations that handle advanced features of the Unicode Standard can easily be modified to support a weak surrogate implementation. For example, • Text collation can be handled by treating those surrogate pairs as “grouped characters,” much as “ij” in Dutch or “ll” in traditional Spanish. • Text entry can be handled by having a keyboard generate two Unicode values with a single keypress, much as an Arabic keyboard can have a “lam-alef ” key that generates a sequence of two characters, lam and alef. • Character display and measurement can be handled by treating specific surro- gate pairs as ligatures, in the same way as “f ” and “i” are joined to form the sin- gle glyph “”. • Truncation can be handled with the same mechanism as used to keep combin- ing marks with base characters. (For more information, see Section 5.15, Locat- ing Text Element Boundaries.) Users are prevented from damaging the text if a text editor keeps insertion points (also known as carets) on character boundaries. As with text-element boundaries, the lowest- level string-handling routines (such as wcschr) do not necessarily need to be modified to prevent surrogates from being damaged. In practice, it is sufficient that only certain higher- level processes (such as those just noted) be aware of surrogate pairs; the lowest-level rou- tines can continue to function on sequences of 16-bit Unicode code values without having to treat surrogates specially.
5.5 Handling Numbers There are many sets of characters that represent decimal digits in different scripts. Systems that interpret those characters numerically should provide the correct numerical values. For example, the sequence U+0968 , U+0966 when numerically interpreted has the value twenty.
110 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.6 Handling Properties
When converting binary numerical values to a visual form, digits can be chosen from dif- ferent scripts. For example, the value twenty can be represented either by U+0032 , U+0030 or by U+0968 , U+0966 , or by U+0662 - , U+0660 - . It is recommended that systems allow users to choose the format of the resulting digits by replacing the appropriate occurrence of U+0030 with U+0660 - , and so on. (See Chapter 4, Character Properties, for tables providing the infor- mation needed to implement formatting and scanning numerical values.) Fullwidth variants of the ASCII digits are simply compatibility variants of regular digits and should be treated as regular Western digits. The Roman numerals and East Asian ideographic numerals are decimal numeral writing systems, but they are not formally decimal radix digit systems. That is, it is not possible to do a one-to-one transcoding to forms such as 123456.789. Both of them are appropriate only for positive integer writing. It is also possible to write numbers in two ways with ideographic digits. For example, Figure 5-2 shows how the number 1,234 can be written. Figure 5-2. Ideographic Numbers
or
Supporting these digits for numerical parsing means that implementations must be smart about distinguishing between these two cases. Digits often occur in situations where they need to be parsed, but are not part of numbers. One such example is alphanumeric identifiers (see Section 5.16, Identifiers). It is only at a second level (for example, when implementing a full mathematical formula parser) that considerations such as superscripting become crucial for interpretation.
5.6 Handling Properties The Unicode Standard provides detailed information on character properties (see Chapter 4, Character Properties, and the Unicode Character Database on the accompanying CD- ROM). These properties can be used by implementers to implement a variety of low-level processes. Fully language-aware and higher-level processes will need additional informa- tion. A two-stage table, as described in Section 5.1, Transcoding to Other Standards, can also be used to handle mapping to character properties or other information indexed by character code. For example, the data from the Unicode Character Database on the accompanying CD-ROM can be represented in memory very efficiently as a set of two-stage tables. Individual properties are common to large sets of characters and therefore lend themselves to implementations using the shared blocks. Many popular implementations are influenced by the POSIX model, which provides func- tions for separate properties, such as isalpha, isdigit, and so on. Implementers of Unicode-based systems and internationalization libraries need to take care to extend these concepts to the full set of Unicode characters correctly.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 111 5.7 Normalization Implementation Guidelines
In Unicode-encoded text, combining characters participate fully. In addition to providing callers with information about which characters have the combining property, implement- ers and writers of language standards need to provide for the fact that combining charac- ters assume the property of the preceding base character (see also Section 3.5, Combination, and Section 5.16, Identifiers). Other important properties, such as sort weights, may also depend on a character’s context. Because the Unicode Standard provides such a rich set of properties, implementers will find it useful to allow access to several properties at a time, possibly returning a string of bit-fields, one bit-field per character in the input string. In the past, many existing standards, such as the C language standard, assumed very mini- malist “portable character sets” and geared their functions to operations on such sets. As the Unicode encoding itself is increasingly becoming the portable character set, imple- menters are advised to distinguish between historical limitations and true requirements when implementing specifications for particular text processes.
5.7 Normalization Alternative Spellings. The Unicode Standard contains explicit codes for the most fre- quently used accented characters. These characters can also be composed; in the case of accented letters, characters can be composed from a base character and nonspacing mark(s). The Unicode Standard provides a table of normative spellings (decompositions) of charac- ters that can be composed using a base character plus one or more nonspacing marks. These tables can be used to unify spelling in a standard manner in accordance with Section 3.10, Canonical Ordering Behavior. Implementations that are “liberal” in what they accept, but “conservative” in what they issue, will have the fewest compatibility problems. • The decomposition mappings are specific to a particular version of the Unicode Standard. Changes may occur as the result of character additions in the future. When new precomposed characters are added, mappings between those char- acters and corresponding composed character sequences will be added. Simi- larly, if a new combining mark is added to this standard, it may allow decompositions for precomposed characters that did not have decompositions before. Normalization. Systems may normalize Unicode-encoded text to one particular sequence, such as normalizing composite character sequences into precomposed characters, or vice versa (see Figure 5-3). Figure 5-3. Normalization Unnormalized a ¨ · ë ˜ ò
˜ ä· ëò a ¨¨· e ˜ o ` Precomposed Decomposed
112 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.8 Compression
Compared to the number of possible combinations, only a relatively small number of pre- composed base character plus nonspacing marks have independent Unicode character val- ues; most existed in dominant standards. Systems that cannot handle nonspacing marks can normalize to precomposed characters; this option can accommodate most modern Latin-based languages. Such systems can use fallback rendering techniques to at least visually indicate combinations that they cannot handle (see the “Fallback Rendering” subsection of Section 5.14, Rendering Nonspacing Marks). In systems that can handle nonspacing marks, it may be useful to normalize so as to elimi- nate precomposed characters. This approach allows such systems to have a homogeneous representation of composed characters and maintain a consistent treatment of such char- acters. However, in most cases, it does not require too much extra work to support mixed forms, which is the simpler route. The standard forms for normalization and for conformance to those forms are defined in Unicode Technical Report #15, “Unicode Normalization Forms,” on the CD-ROM or the up-to-date version on the Unicode Web site. For further information see Chapter 3, Con- formance; Chapter 4, Character Properties; and Section 2.6, Combining Characters.
5.8 Compression Using the Unicode character encoding may increase the amount of storage or memory space dedicated to the text portion of files. Compressing Unicode-encoded files or strings can therefore be an attractive option. Compression always constitutes a higher-level protocol and makes interchange dependent on knowledge of the compression method employed. For a detailed discussion on compression and a standard compression scheme for Unicode, see Unicode Technical Report #6, “A Standard Compression Scheme for Uni- code,” on the CD-ROM or the up-to-date version on the Unicode Web site. Encoding forms defined in Section 2.3, Encoding Forms, have different storage characteris- tics. For example, as long as text contains only characters from the Basic Latin (ASCII) block, it occupies the same amount of space whether it is encoded with the UTF-8 transfor- mation format or with ASCII codes. On the other hand, text consisting of ideographs encoded with UTF-8 will require more space than equivalent Unicode-encoded text.
5.9 Line Handling Newlines are represented on different platforms by CR, LF, NL, or CRLF. Not only are new- lines represented by different characters on different platforms, but they also have ambigu- ous behavior even on the same platform. With the advent of the Web, where text on a single machine can arise from many sources, this inconsistency causes a significant problem. Unfortunately, these characters are often transcoded directly into the corresponding Uni- code code values when a character set is transcoded. For this reason, even programs han- dling pure Unicode text must deal with the problems. For detailed guidelines, see Unicode Technical Report #13, “Unicode Newline Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 113 5.10 Regular Expressions Implementation Guidelines
5.10 Regular Expressions Byte-oriented regular expression engines require extensions to successfully handle Uni- code. The following issues are involved in such extensions: • Unicode is a large character set—regular expression engines that are adapted to handle only small character sets may not scale well. • Unicode encompasses a wide variety of languages that can have very different characteristics than English or other Western European text. For detailed information on the requirements of Unicode Regular Expressions, see Unicode Technical Report #18, “Unicode Regular Expression Guidelines,” on the CD-ROM or the up-to-date version on the Unicode Web site.
5.11 Language Information in Plain Text
Requirements for Language Tagging The requirement for language information embedded in plain text data is often overstated. Many commonplace operations such as collation seldom require this extra information. In collation, for example, foreign language text is generally collated as if it were not in a for- eign language. (See Unicode Technical Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site for more information.) For example, an index in an English book would not sort the Spanish word “churo” after “czar,” where it would be collated in traditional Spanish, nor would an English atlas put the Swedish city of Östersjö after Zanzibar, where it would appear in Swedish. However, language information is very useful in certain operations, such as spell-checking or hyphenating a mixed-language document. It is also useful in choosing the default font for a run of unstyled text; for example, the ellipsis character may have a very different appearance in Japanese fonts than in European fonts. Although language information is useful in performing text-to-speech operations, modern software for doing acceptable text- to-speech must be so sophisticated in performing grammatical analysis of text that the extra work in determining the language is not significant. Language information can be presented as out-of-band information or inline tags. In inter- nal implementations, it is quite common to use out-of-band information, which is stored in data structures that are parallel to the text, rather than embedded in it. Out-of-band information does not interfere with the normal processing of the text (comparison, search- ing, and so on) and more easily supports manipulation of the text. For interchange purposes, it is becoming common to use tagged information, which is embedded in the text. Unicode Technical Report #7, “Plane 14 Characters for Language Tags,” which is found on the CD-ROM or in its up-to-date version on the Unicode Web site, provides a proposed mechanism for representing language tags. Like most tagging mechanisms, these language tags are stateful: a start tag establishes an attribute for the text, and an end tag concludes it.
Working with Language Tags Avoiding Language Tags. Because of the extra implementation burden, language tags should be avoided in plain text unless language information is required and it is known that the receivers of the text will properly recognize and maintain the tags. However, where
114 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.11 Language Information in Plain Text language tags must be used, implementers should consider the following implementation issues involved in supporting language information with tags and decide how to handle tags where they are not fully supported. This discussion applies to any mechanism for pro- viding language tags in a plain text environment. Display. Language tags themselves are not displayed. This choice may not require modifi- cation of the displaying program, if the fonts on that platform have the language tag char- acters mapped to zero-width, invisible glyphs. Language tags may need to be taken into account for special processing, such as hyphenation or choice of font. Processing. Sequential access to the text is generally straightforward. If language codes are not relevant to the particular processing operation, then they should be ignored. Random access to stateful tags is more problematic. Because the current state of the text depends upon tags previous to it, the text must be searched backward, sometimes all the way to the start. With these exceptions, tags pose no particular difficulties as long as no modifications are made to the text. Editing and Modification. Inline tags present particular problems for text changes, because they are stateful. Any modifications of the text are more complicated, as those modifications need to be aware of the current language status and the
Language Tags and Han Unification Han Unification. A common misunderstanding about Unicode Han Unification is the mistaken belief that Han characters cannot be rendered properly without language infor- mation. This idea might lead an implementer to conclude that language information must always be added to plain text using the tags. However, this implication is incorrect. The goal and methods of Han Unification were to ensure that the text remained legible. Although font, size, width, and other format specifications need to be added to produce precisely the same appearance on the source and target machines, plain text remains legible in the absence of these specifications. There should never be any confusion in Unicode, because the distinctions between the uni- fied characters are all within the range of stylistic variations that exist in each country. No unification in Unicode should make it impossible for a reader to identify a character if it appears in a different font. Where precise font information is important, it is best conveyed in a rich-text format. Typical Scenarios. The following e-mail scenarios illustrate that the need for language information with Han characters is often overstated:
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 115 5.12 Editing and Selection Implementation Guidelines
• Scenario 1. A Japanese user sends out untagged Japanese text. Readers are Japa- nese (with Japanese fonts). Readers see no differences from what they expect. • Scenario 2. A Japanese user sends out an untagged mixture of Japanese and Chinese text. Readers are Japanese (with Japanese fonts) and Chinese (with Chinese fonts). Readers see the mixed text with only one font, but the text is still legible. Readers recognize the difference between the languages by the con- tent. • Scenario 3. A Japanese user sends out a mixture of Japanese and Chinese text. Text is marked with font, size, width, and so on because the exact format is important. Readers have the fonts and other display support. Readers see the mixed text with different fonts for different languages. They recognize the dif- ference between the languages by the content, and see the text with glyphs that are more typical for the particular language. It is common even in printed matter to render passages of foreign language text in native- language fonts, just for familiarity. For example, Chinese text in a Japanese document can commonly be rendered in a Japanese font.
5.12 Editing and Selection
Consistent Text Elements A user interface for editing is most intuitive when the text elements are consistent (see Figure 5-4). In particular, the editing actions of deletion, selection, mouse-clicking, and cursor-key movement should act as though they have a consistent set of boundaries. For example, hitting a leftward-arrow should result in the same cursor location as delete. This synchronization gives a consistent, single model for editing characters. Figure 5-4. Consistent Character Boundaries
Cluster ·æÕÐü Rôle s* Stack ·æÕÐü Rôle s* Atomic ·æÕÐü Rôle s*
Three types of boundaries are generally useful in editing and selecting within words. Cluster Boundaries. Cluster boundaries occur in scripts such as Devanagari. Selection or deletion using cluster boundaries means that an entire cluster (such as ka + vowel sign a) or a composed character (o + circumflex) is selected or deleted as a single unit. Stacked Boundaries. Stacked boundaries are generally somewhat finer than cluster bound- aries. Free-standing elements (such as vowel sign a) can be independently selected and deleted, but any elements that “stack” (such as o + circumflex, or vertical ligatures such as Arabic lam + meem) can be selected only as a single unit. Stacked boundaries treat all com- posed character sequences as single entities, much like precomposed characters.
116 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.13 Strategies for Handling Nonspacing Marks
Atomic Character Boundaries. The use of atomic character boundaries is closest to selec- tion of individual Unicode characters. However, most modern systems indicate selection with some sort of rectangular highlighting. This approach places restrictions on the consis- tency of editing because some sequences of characters do not linearly progress from the start of the line. When characters stack, two mechanisms are used to visually indicate par- tial selection: linear and nonlinear boundaries. Linear Boundaries. Use of linear boundaries treats the entire width of the resultant glyph as belonging to the first character of the sequence, and the remaining characters in the backing-store representation as having no width and being visually afterward. This option is the simplest mechanism and one that is currently in use on the Macintosh and some other systems. The advantage of this system is that it requires very little addi- tional implementation work. The disadvantage is that it is never easy to select narrow char- acters, let alone a zero-width character. Mechanically, it requires the user to select just to the right of the nonspacing mark and drag just to the left. It also does not allow the selec- tion of individual nonspacing marks if more than one are present. Nonlinear Boundaries. Use of linear boundaries divides any stacked element into parts. For example, picking a point halfway across a lam + meem ligature can represent the divi- sion between the characters. One can either allow highlighting with multiple rectangles or use another method such as coloring the individual characters. Notice that with more work, a precomposed character can behave in deletion as if it were a composed character sequence with atomic character boundaries. This procedure involves deriving the character’s decomposition on the fly to get the components to be used in sim- ulation. For example, deletion occurs by decomposing, removing the last character, then recomposing (if more than one character remains). However, this technique does not work in general editing and selection. In most systems, the character is the smallest addressable item in text, so the selection and assignment of properties (such as font, color, letterspacing, and so on) are done on a per- character basis. There is no good way to simulate this addressability with precomposed characters. Systematically modifying all text editing to address parts of characters would be quite inefficient. Just as there is no single notion of text element, so there is no single notion of editing char- acter boundaries. At different times, users may want different degrees of granularity in the editing process. Two methods suggest themselves. First, the user may set a global preference for the character boundaries. Second, the user may have alternative command mecha- nisms, such as Shift-Delete, which give more (or less) fine control than the default mode.
5.13 Strategies for Handling Nonspacing Marks By following these guidelines, a programmer should be able to implement systems and routines that provide for the effective and efficient use of nonspacing marks in a wide variety of applications and systems. The programmer also has the choice between minimal techniques that apply to the vast majority of existing systems and more sophisticated tech- niques that apply to more demanding situations, such as higher-end DTP (desktop pub- lishing). In this section and the following section, the terms nonspacing mark and combining charac- ter are used interchangeably. The terms diacritic, accent, stress mark, Hebrew point, Arabic vowel, and others are sometimes used instead of nonspacing mark. (They refer to particular types of nonspacing marks.)
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 117 5.13 Strategies for Handling Nonspacing Marks Implementation Guidelines
A relatively small number of implementation features are needed to support nonspacing marks. Different possible levels of implementation are also possible. A minimal system yields good results and is relatively simple to implement. Most of the features required by such a system are modifications of existing software. As nonspacing marks are required for a number of languages such as Arabic, Hebrew, and the languages of the Indian subcontinent, many vendors already have systems capable of dealing with these characters and can use their experience to produce general-purpose soft- ware for handling these characters in the Unicode Standard. Rendering. A fixed set of composite character sequences can be rendered effectively by means of fairly simple substitution. Wherever a sequence of base character plus one or more nonspacing combining marks occurs, a glyph representing the combined form can be substituted. In simple, monospaced character rendering, a nonspacing combining mark has a zero advance width, and a composite character sequence will have the same width as the base character. When truncating strings, it is always easiest to truncate starting from the end and working backward. A trailing nonspacing mark will then not be separated from the preceding base character. A more sophisticated rendering system can take into account more subtle variations in widths and kerning with nonspacing marks or account for those cases where the composite character sequence has a different advance width than the base character. Such rendering systems are not necessary for the large majority of applications. They can, however, also supply more sophisticated truncation routines. (See also Section 5.14, Rendering Nonspac- ing Marks.) Other Processes. Correct multilingual comparison routines must already be able to com- pare a sequence of characters as one character, or one character as if it were a sequence. Such routines can also handle composite character sequences when supplied the appropri- ate data. When searching strings, remember to check for additional nonspacing marks in the target string that may affect the interpretation of the last matching character. Line-break algorithms generally use state machines for determining word breaks. Such algorithms can be easily adapted to prevent separation of nonspacing marks from base characters. (See also the discussion in Section 5.17, Sorting and Searching; Section 5.7, Nor- malization; and Section 5.15, Locating Text Element Boundaries.)
Keyboard Input A common implementation for the input of composed character sequences is the use of so- called dead keys. These keys match the mechanics used by typewriters to generate such sequences through overtyping the base character after the nonspacing mark. In computer implementations, keyboards enter a special state when a dead key is pressed for the accent and emit a precomposed character only when one of a limited number of “legal” base char- acters is entered. It is straightforward to adapt such a system to emit composed character sequences or precomposed characters as needed. Although typists, especially in the Latin script, are trained on systems working in this way, many scripts in the Unicode Standard (including the Latin script) may be implemented according to the handwriting sequence, in which users type the base character first, followed by the accents or other nonspacing marks (see Figure 5-5). In the case of handwriting sequence, each keystroke produces a distinct, natural change on the screen; there are no hidden states. To add an accent to any existing character, the user positions the insertion point (caret) after the character and types the accent.
118 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.13 Strategies for Handling Nonspacing Marks
Figure 5-5. Dead Keys Versus Handwriting Sequence Dead Key Handwriting Zrich Zrich ¨ u Zrich Zurich u ¨ Zürich Zürich
Truncation There are two types of truncation: truncation by character count and truncation by dis- played width. Truncation by character count can entail loss (be lossy) or be lossless. Truncation by character count is used where, due to storage restrictions, a limited number of characters can be entered into a field; it is also used where text is broken into buffers for transmission and other purposes. The latter case can be lossless if buffers are recombined seamlessly before processing or if lookahead is performing for possible combining charac- ter sequences straddling buffers. When fitting data into a field of limited length, some information will be lost. Truncating at a text element boundary (for example, on the last composite character sequence boundary or even last word boundary) is often preferable to truncating after the last code value (see Figure 5-6). (See Section 5.15, Locating Text Element Boundaries.) Figure 5-6. Truncating Composed Character Sequences On CCS Jose« boundaries
Clipping Jos
Ellipsis Jo...
Truncation by displayed width is used for visual display in a narrow field. In this case, trun- cation occurs on the basis of the width of the resulting string rather than on the basis of a character count. In simple systems, it is easiest to truncate by width, starting from the end and working backward by subtracting character widths as one goes. Because a trailing non- spacing mark does not contribute to the measurement of the string, the result will not sep- arate nonspacing marks from their base characters. If the textual environment is more sophisticated, the widths of characters may depend on their context, due to effects such as kerning, ligatures, or contextual formation. For such systems, the width of a composed character, such as an ï, may be different than the width of a narrow base character alone. To handle these cases, a final check should be made on any truncation result derived from successive subtractions.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 119 5.14 Rendering Nonspacing Marks Implementation Guidelines
A different option is simply to clip the characters graphically. However, the result may look ugly. Also, if the clipping occurs between characters, it may not give any visual feedback that characters are being omitted. A graphic or ellipsis can be used to give this visual feed- back.
5.14 Rendering Nonspacing Marks This discussion assumes the use of proportional fonts, where the widths of individual char- acters can vary. Various techniques can be used with monospaced fonts, but in general, it is possible to get only a semblance of a correct rendering in these fonts, especially with inter- national characters. When rendering a sequence consisting of more than one nonspacing mark, the nonspacing marks should, by default, be stacked outward from the base character. That is, if two non- spacing marks appear over a base character, then the first nonspacing mark should appear on top of the base character, and the second nonspacing mark should appear on top of the first. If two nonspacing marks appear under a base character, then the first nonspacing mark should appear beneath the base character, and the second nonspacing mark should appear below the first (see Section 2.6, Combining Characters). This default treatment of multiple, potentially interacting nonspacing marks is known as the inside-out rule (see Figure 5-7). Figure 5-7. Inside-Out Rule Characters Glyphs a @¬÷@ @. @ ä÷. ö ö This default behavior may be altered based on typographic preferences or on knowledge of the specific orthographic treatment to be given to multiple nonspacing marks in the con- text of a particular writing system. For example, in the modern Vietnamese writing system, an acute or grave accent (serving as a tone mark) may be positioned slightly to one side of a circumflex accent rather than directly above it. If the text to be displayed is known to employ a different typographic convention (either implicitly through knowledge of the language of the text or explicitly through rich-text style bindings), then an alternative posi- tioning may be given to multiple nonspacing marks instead of that specified by the default inside-out rule. Fallback Rendering. Several methods are available to deal with an unknown composed character sequence that is outside of a fixed, renderable set (see Figure 5-8). One method (Show Hidden) indicates the inability to draw the sequence by drawing the base character first and then rendering the nonspacing mark as an individual unit—with the nonspacing mark positioned on a dotted circle. (This convention is used in Chapter 14, Code Charts.) Figure 5-8. Fallback Rendering ˆ @ @ Ggˆ Gˆˆg xGˆ xgˆ x “Ideal”“Show “Simple Hidden” Overlap”
120 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.14 Rendering Nonspacing Marks
Another method (Simple Overlap) uses default fixed positioning for an overlapping zero- width nonspacing mark, generally placed far away from possible base characters. For exam- ple, the default positioning of a circumflex can be above the ascent, which will place it above capital letters. Even though the result will not be particularly attractive for letters such as g-circumflex, the result should generally be recognizable in the case of single non- spacing marks. In a degenerate case, a nonspacing mark occurs as the first character in the text or is sepa- rated from its base character by a line separator, paragraph separator, or other formatting character that causes a positional separation. This result is called a defective combining character sequence (see Section 3.5, Combination). Defective combining character sequences should be rendered as if they had a space as a base character. Bidirectional Positioning. In bidirectional text, the nonspacing marks are reordered with their base characters; that is, they visually apply to the same base character after the algo- rithm is used (see Figure 5-9). There are a few ways to accomplish this positioning. Figure 5-9. Bidirectional Placement Backing Store g @ö U @V Screen Order g @ö @V U Glyph Metrics ö V xxg x x x x xUx
Aligned Glyphs ö xgx xUV x
The simplest method is similar to the Simple Overlap fallback method. In the bidirectional algorithm, combining marks take the level of their base character. In that case, Arabic and Hebrew nonspacing marks would come to the left of their base characters. The font is designed so that instead of overlapping to the left, the Arabic and Hebrew nonspacing marks overlap to the right. In Figure 5-9, the “glyph metrics” line shows the pen start and end for each glyph with such a design. After aligning the start and end points, the final result shows each nonspacing mark attached to the corresponding base letter. More sophis- ticated rendering could then apply the positioning methods outlined in the next section. With some rendering software, it may be necessary to keep the nonspacing mark glyphs consistently ordered to the right of the base character glyphs. In that case, a second pass can be done after producing the “screen order” to put the odd-level nonspacing marks on the right of their base characters. As the levels of nonspacing marks will be the same as their base characters, this pass can swap the order of nonspacing mark glyphs and base character glyphs in right-left (odd) levels. (See Section 3.12, Bidirectional Behavior.)
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 121 5.14 Rendering Nonspacing Marks Implementation Guidelines
Justification. Typically, full justification of text adds extra space at space characters so as to widen a line; however, if there are too few (or no) space characters, some systems add extra letterspacing between characters (see Figure 5-10). This process needs to be modified if zero-width nonspacing marks are present in the text. Otherwise, the nonspacing marks will be separate from their base characters. Figure 5-10. Justification Zürich 66 points/6 positions Zrichu¨ = 11 points per position 66 points/5 positions Z ü rich = 13.2 points per position
Because nonspacing marks always follow their base character, proper justification adds let- terspace between characters only if the second character is a base character.
Positioning Methods A number of different methods are available to position nonspacing marks so that they are in the correct location relative to the base character and previous nonspacing marks. Positioning with Ligatures. A fixed set of composed character sequences can be rendered effectively by means of fairly simple substitution (see Figure 5-11). Wherever the glyphs representing a sequence of
Positioning with ligatures is perhaps the simplest method of supporting nonspacing marks. Whenever there is a small, fixed set, such as those corresponding to the precomposed char- acters of 8859-1 (Latin1), this method is straightforward to apply. Because the composed character sequence almost always has the same width as the base character, rendering, mea- surement, and editing of these characters are much easier than for the general case of liga- tures. If a composed character sequence does not form a ligature, then one of the two following methods can be applied. If they are not available, then a fallback method can be used.
122 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.14 Rendering Nonspacing Marks
Positioning with Contextual Forms. A more general method of dealing with positioning of nonspacing marks is to use contextual formation (see Figure 5-12). In this case, several different glyphs correspond to different positions of the accents. Base glyphs generally fall into a fairly small number of classes, based upon their general shape and width. According to the class of the base glyph, a particular glyph is chosen for a nonspacing mark. Figure 5-12. Positioning with Contextual Forms ·ü ➠ ·F Åü ➠ ÅZ Ðü = Ðü
In general cases, a number of different heights of glyphs can be chosen to allow stacking of glyphs, at least for a few deep (when these bounds are exceeded, then the fallback methods can be used). This method can be combined with the ligature method so that in specific cases ligatures can be used to produce fine variations in position and shape. Positioning with Enhanced Kerning. A third technique for positioning diacritics is an extension of the normal process of kerning to be both horizontal and vertical (see Figure 5-13). Typically, kerning maps from pairs of glyphs to a positioning offset. For example, in the word “To” the “o” should nest slightly under the “T”. An extension of this system maps to both a vertical and a horizontal offset, allowing glyphs to be arbitrarily posi- tioned. Figure 5-13. Positioning with Enhanced Kerning To w« To w«
For effective use in the general case, the kerning process must also be extended to handle more than simple kerning pairs, as multiple diacritics may occur after a base letter. Positioning with enhanced kerning can be combined with the ligature method so that in specific cases ligatures can be used to produce fine variations in position and shape.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 123 5.15 Locating Text Element Boundaries Implementation Guidelines
5.15 Locating Text Element Boundaries A string of Unicode-encoded text often needs to be broken up into text elements program- matically. Common examples of text elements include what users think of as characters, words, lines, and sentences. The precise determination of text elements may vary according to locale, even as to what constitutes a character. The goal of matching user perceptions cannot always be met because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E ) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, or at least not surprise the user. Rather than concentrate on algorithmically searching for text elements themselves, a sim- pler computation looks instead at detecting the boundaries between those text elements. The determination of those boundaries is often critical to the performance of general soft- ware, so it is important to be able to make such a determination as quickly as possible. The following boundary determination mechanism provides a straightforward and effi- cient way to determine word boundaries. It builds upon the uniform character representa- tion of the Unicode Standard, while handling the large number of characters and special features such as combining marks and surrogates in an effective manner. As this boundary determination mechanism lends itself to a completely data-driven implementation, it can be customized to particular locales according to special language or user requirements without recoding. In particular, word, line, and sentence boundaries will need to be cus- tomized according to locale and user preference. In Korean, for example, lines may be bro- ken either at spaces (as in Latin text) or on ideograph boundaries (as in Chinese). For some languages, this simple method is not sufficient. For example, Thai linebreaking requires the use of dictionary lookup, analogous to English hyphenation. An implementa- tion therefore may need to provide means to override or subclass the standard, fast mecha- nism described in the “Boundary Specification” subsection in this section. The large character set of the Unicode Standard and its representational power place requirements on both the specification of text element boundaries and the underlying implementation. The specification needs to allow for the designation of large sets of char- acters sharing the same characteristics (for example, uppercase letters), while the imple- mentation must provide quick access and matches to those large sets. The mechanism also must handle special features of the Unicode Standard, such as com- bining or nonspacing marks, conjoining jamo, and surrogate characters. The following discussion looks at two aspects of text element boundaries: the specification and the underlying implementation. Specification means a way for programmers and localizers to specify programmatically where boundaries can occur.
Boundary Specification A boundary specification defines different classes, then lists the rules for boundaries in terms of those classes. Remember that characters may be represented by a sequence of two surrogates. The character classes are specified as a list, where each element of the list is • A literal character • A range of characters • A property of a Unicode character, as defined in the Unicode Character Data- base
124 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries
• A removal of the following elements from the list For the formal description of the notation used in this section, see Section 0.2, Notational Conventions. Additional notational conventions used in this section are as follows: ÷ Allow break here. × Do not allow break here. An underscore (“_”) is used to indicate a space in examples. There is always a boundary at the beginning and at the end of a string of text. As with typi- cal regular expressions, the longest match possible is used. For example, in the following, the boundary is placed after as many Y’s as possible: X Y* ÷ ¬ X (In this book, the rules are numbered for reference.) Some additional constraints are reflected in the specification. These constraints make the implementation significantly simpler and more efficient and have not been found to be limitations for natural language use. 1. Limited context. Given boundaries at positions X and Y, then the positions of any other boundaries between X and Y do not depend on characters outside of X and Y (so long as X and Y remain boundary positions). For example, with boundaries at “ab÷cde”, changing “a” to “A” cannot intro- duce a new boundary, such as at “Ab÷cd÷e”. 2. Single boundaries. Each rule has exactly one boundary position. Because of rule (1), this restriction is more a limitation on the specification methods, because a rule with two boundaries could generally be expressed as two rules.
For example, “ab÷cd÷ef” could be broken into “ab÷cd” and “cd÷ef”.
3. No conflicts. Two rules cannot have initial portions that match the same text, but with different boundary positions. For example, “x÷abc” and “a÷bc” cannot be part of the same boundary specifi- cation. 4. No overlapping sets. For efficiency, two character sets in a specification cannot intersect. A later character set definition will override a previous one, removing its characters from the previous set.
For example, the second set specification removes “AEIOUaeiou” from the first:
Let = [Lu][Ll][Lt][Lm][Lo] EngVowel = AEIOUaeiou 5. No more than 256 sets. This restriction is purely an implementation detail to save on storage. 6. Ignore degenerates. Implementations need not make special provisions to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark. These rules can also be approximated with pair tables, where only one character before, and one character after, a possible break point are considered. Although this approach does not generally give the correct results, it is often sufficient if languages are restricted. How- ever, the grapheme boundaries are designed to be fully expressed using pair tables.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 125 5.15 Locating Text Element Boundaries Implementation Guidelines
Example Specifications Different issues are present with different types of boundaries, as the following discussion and examples should make clear. In this section, the rules are somewhat simplified, and not all edge cases are included. In particular, characters such as control characters and format characters do not cause breaks. Permitting such cases would complicate each of the exam- ples (and is left as an exercise for the reader). In addition, it is intended that the rules them- selves be localizable; the examples provided here are not valid for all locales. Rather than listing the contents of the character sets in these examples, the contents are simply explained. If an implementation uses a state table, the performance does not depend on the complex- ity or number of rules. The only feature that does affect performance is the number of characters that may match after the boundary position in a rule that is matched.
Grapheme Boundaries A grapheme is a minimal unit of a writing system, just as a phoneme is a minimal unit of a sound system. In some instances, a grapheme is represented by more than one Unicode character, such as in Indic languages. As far as a user is concerned, the underlying represen- tation of text is not important, but it is paramount that an editing interface present a uni- form implementation of what the user thinks of as graphemes. Graphemes should behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. For example, if an accented character is represented by a combining character sequence, then using the right arrow key should skip from the start of the base character to the end of the last combining character. This situation is analogous to a system using conjoining jamo to represent Hangul syllables, where they are treated as single graphemes for editing. In those circumstances where end users need character counts (which is actually rather rare), the counts need to correspond to the users’ perception of what constitutes a grapheme. The principal requirements for general character boundaries are the handling of combin- ing marks, Hangul conjoining jamo, and Indic and Tibetan character clusters. See Table 5-3. Table 5-3. Grapheme Boundaries
Character Classes
CR Carriage Return LF Line Feed Format All other control or format characters Virama Indic viramas Joining All combining characters, plus Tibetan subjoined characters L Hangul leading jamo V Hangul vowel jamo T Hangul trailing jamo Lo Other letters Other All other characters
126 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries
Table 5-3. Grapheme Boundaries (Continued)
Rules Always break before an LF, unless it is preceded by a CR. Always break before and after all other format or control characters. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷( CR | Format ) (3) ( Format | LF )÷ (4)
Break before viramas and joining characters only if they are preceded by con- trols or formats. ( Format | CR | LF ) ÷ ( Virama | Joining ) (5)
Break before and after Hangul jamo unless they are in an allowable sequence. ¬ L ÷ L (6) ¬ ( L | V )÷V (7) ¬ ( L | V | T )÷T (8) ( L | V | T )÷¬ ( L | V | T )(9)
Don’t break between viramas and other letters. This rule provides for Indic graphemes, where virama will link character clusters together. Virama × Lo (10)
Break before any other characters. ÷ Other (11)
Word Boundaries Word boundaries are used in a number of different contexts. The most familiar ones are double-click mouse selection, “move to next word,” and detection of whole words for search and replace. For the search and replace option of “find whole word,” the rules are fairly clear. The boundaries are between letters and nonletters. Trailing spaces cannot be counted as part of a word, because searching for “abc” would then fail on “abc_”. For word selection, the rules are somewhat less clear. Some programs include trailing spaces, whereas others include neighboring punctuation. Where words do not include trailing spaces, sometimes programs treat the individual spaces as separate words; other times they treat an entire string of spaces as a single word. (The latter fits better with usage in search and replace.) • Word boundaries can also be used in so-called intelligent cut and paste. With this feature, if the user cuts a piece of text on word boundaries, adjacent spaces are collapsed to a single space. For example, cutting “quick” from “The_quick_fox” would leave “The_ _fox”. Intelligent cut and paste collapses this text to “The_fox”.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 127 5.15 Locating Text Element Boundaries Implementation Guidelines
This discussion outlines the case where boundaries occur between letters and nonletters, and there are no boundaries between nonletters. It also includes Japanese words (for word selection). In Japanese, words are not delimited by spaces. Instead, a heuristic rule is used in which strings of either kanji (ideographs) or katakana characters (optionally followed by strings of hiragana characters) are considered words. See Table 5-4. Table 5-4. Word Boundaries
Character Classes
CR Carriage Return LF Line Feed Sep LS, PS TAB Tab character Let Letter Com Combining mark Hira Hiragana Kata Katakana Han Han ideograph (Kanji)
Rules Always break before an LF, unless it is preceded by a CR. Always break before and after all other format or control characters, including TAB. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷ ( CR | Sep | TAB )(3) ( Sep | TAB | LF ) ÷ (4)
Break between letters and nonletters. Include trailing nonspacing marks as part of a letter. ¬ ( Let | Com ) ÷ Let (5) Let Com* ÷¬ Let (6)
Handle Japanese specially for word selection. Treat clusters of kanji or kata- kana (with or without following hiragana) as single words. Break when pre- ceded by other characters (such as punctuation). Include nonspacing marks. ¬ ( Hira | Kata | Han | Com ) ÷ Hira | Kata | Han (7) Hira Com* ÷¬ Hira (8) Kata Com* ÷¬ ( Hira | Kata ) (9) Han Com* ÷¬ ( Hira | Han ) (10)
• One could also generally break between any letters of different scripts. In prac- tice—except for languages that do not use spaces—this possibility is a degener- ate case.
128 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries
Line Boundaries Line boundaries determine acceptable locations for line-wrap to occur without hyphen- ation. (More sophisticated line-wrap also makes use of hyphenation, but generally only in cases where the natural line-wrap yields inadequate results.) Note that this approach is very similar to word boundaries but not generally identical. For the purposes of line-break, a composed character sequence should generally be treated as though it had the same properties as the base character. The nonspacing marks should not be separated from the base character. Nonspacing marks may be exhibited in isolation—that is, over a space or nonbreaking space. In that case, the entire composed character sequence is treated as a unit. If the com- posed character sequence consists of a no-break space followed by nonspacing marks, then it does not generally allow line-breaks before or after the sequence. If the composed charac- ter sequence consists of any other space followed by nonspacing marks, then it generally does allow line-breaks before or after the sequence. There is a degenerate case where a nonspacing mark occurs as the first character in the text or after a line or paragraph separator. In that case, the most consistent treatment for the line-break is to treat the nonspacing mark as though it were applied to a space. See Table 5-5. Table 5-5. Line Boundaries
Character Classes
CR Carriage Return LF Line Feed ZWSP Zero-width space ZWNBSP Zero-width no-break space Sp Spaces Break Mandatory break (for example, Paragraph Separator) Com All combining characters (including Tibetan subjoined characters), plus medial and final conjoining Hangul jamo Ideographic Ideographic characters Alphabetic Alphabetic characters and most symbols Exclam Terminating characters like exclamation point Syntax Solidus (‘/’) Open Open punctuation, such as ‘(’ Close Closing punctuations, such as ‘)’ Quote Ambiguous quotations marks, such as " and ' NonStarter Small hiragana and katakana characters HyphenMinus U+002D Insep Ellipsis characters and leaders Number Digits NumericPrefix Characters such as ‘$’ NumericPostfix Characters such as ‘%’ NumericPrefix Characters such as ‘$’ NumericPostfix Characters such as ‘%’ NumericInfix Characters such as period and comma that occur with- in European numbers
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 129 5.15 Locating Text Element Boundaries Implementation Guidelines
Table 5-5. Line Boundaries (Continued)
Base Base characters NonBase Nonbase characters (control characters and format characters) All All Unicode characters Note: For a precise specification of these classes, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Unicode Web site. (The classes have slightly different names here for consistency.) Rules Explicit breaks and nonbreaks: • Always break after hard line-breaks (but never between CR and LF). There is a break opportunity after every ZWSP, but not a hard break. • Do not break before spaces or hard line-breaks. • Do not break before or after ZWNBSP. CR × LF (1) ( CR | LF | Break | ZWSP ) ÷ (2) × ( SP | CR | LF | Break | ZWSP | ZWNBSP )(3) ZWNBSP × (4) Combining marks: • Do not break graphemes (before combining marks). Virama and jamos are merged with the proper classes so they work correctly. In all of the following rules: • If a space is the base character for a combining mark, the space is changed to type AL. • At any possible break opportunity between Com and a following character, Com behaves as if it had the type of its base character. If there is no base, the Com behaves like AL. × Com (5) Sp Com ò Alphabetic Com (6) Base Com ò Base Base (7) NonBase Com ò NonBase Alphabetic (8) Handle opening and closing: These have special behavior with respect to spaces. • Do not break before exclamations, syntax characters, or closing punctuation, even after spaces.
130 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.15 Locating Text Element Boundaries
Table 5-5. Line Boundaries (Continued)
• Do not break after opening punctuations, even after spaces. • Do not break within quotations and opening punctuations, or between closing punctuation and ideographs, even with inter- vening spaces. × ( Exclam | Syntax | Close ) (9) Open × (10) Quote × Open (11) Close × Ideographic (12) Once opening and closing are handled, spaces can be finished up. • Break after spaces Sp ÷ (13) Special case rules: • Do not break before or after ambiguous quotation marks. • Do not break before no-starts or HyphenMinus. • Do not break between two ellipses, or between letters or numbers and ellipsis, such as in ‘9...’, ‘a...’, ‘H...’. • Do not break within alphabetics and numbers, or numbers and alphabetics, or ideographs and numeric suffixes. × ( Quote | NonStarter | HyphenMinus ) (14) Quote × (15) ( Insep | Number | Ideographic | Alphabetic ) × Insep (16) Alphabetic × Number (17) Number × Alphabetic (18) Ideographic × NumericPostfix (19) Numbers are of various forms that should not be broken across lines—for example, $(12.35), 12, (12)¢, 12.54¢, and so on. These are approximated with the following rule. (Some cases were already handled, such as ‘9,’ and ‘[9’.) NumericPrefix × ( Number | Open | HyphenMinus ) (20) ( HyphenMinus | Syntax | Number | NumericInfix ) × Number (21) Number × NumericPostfix (22) Close × NumericPostfix (23)
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 131 5.15 Locating Text Element Boundaries Implementation Guidelines
Table 5-5. Line Boundaries (Continued)
Join alphabetic letters and break everywhere else. Alphabetic × Alphabetic (24) ÷ All (25) All ÷ (26)
Sentence Boundaries Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. Plain text provides inadequate information for determining good sentence boundaries. Periods, for example, can either signal the end of a sentence, indicate abbreviations, or be used for decimal points. Without analyzing the text semantically, it is impossible to be cer- tain which of these usages is intended (and sometimes ambiguities still remain). See Table 5-6. Table 5-6. Sentence Boundaries
Character Classes
CR Carriage Return LF Line Feed Sep LS, PS Sp Space separator Term !? Dot Period Cap Uppercase, titlecase, and noncased letters Lower Lowercase Open Open punctuation Close Close punctuation, period, comma, ...
Rules Always break before an LF, unless it is preceded by a CR. Always break before and after separating control or format characters. ¬ CR ÷ LF (1) CR ÷¬ LF (2) ÷ ( CR | Sep ) (3) ( Sep | LF ) ÷ (4)
Break after sentence terminators, but include nonspacing marks, closing punctuation, trailing spaces, and (optionally) a paragraph separator. Term Close* Sp* {Sep} ÷ (5)
132 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.16 Identifiers
Table 5-6. Sentence Boundaries (Continued)
Handle a period specially, as it may be an abbreviation or numeric period— and not the end of a sentence. Don’t break if it is followed by a lowercase let- ter instead of an uppercase letter. Dot Close* Sp+ ÷ Open* ¬ Lower (6)
Random Access A further complication is introduced by random access (see Figure 5-14). When iterating through a string from beginning to end, the preceding approach works well. It guarantees a limited context, and it allows a fresh start at each boundary to find the next boundary. By constructing a state table for the reverse direction from the same specification of the rules, reverse searches are possible. However, suppose that the user wants to iterate starting at a random point in the text. If the starting point does not provide enough context to allow the correct set of rules to be applied, then one could fail to find a valid boundary point. For example, suppose a user clicked after the first space in “?_ _A”. On a forward iteration searching for a sentence boundary, one would fail to find the boundary before the “A”, because the “?” hadn’t been seen yet. A second set of rules to determine a “safe” starting point provides a solution. Iterate back- ward with this second set of rules until a safe starting point is located, then iterate forward from there. Iterate forward to find boundaries that were located between the starting point and the safe point; discard these. The desired boundary is the first one that is not less than the starting point. Figure 5-14. Random Access
? _ _ A ...
This process would represent a significant performance cost if it had to be performed on every search. However, this functionality could be wrapped up in an iterator object, which preserves the information regarding whether it currently is at a valid boundary point. Only if it is reset to an arbitrary location in the text is this extra backup processing performed.
5.16 Identifiers A common task facing an implementer of the Unicode Standard is the provision of a pars- ing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of guidelines is provided for the definition of identi- fier syntax. Note that the task of parsing for identifiers and the task of locating word boundaries are related, and it is a straightforward procedure to restate the sample syntax provided here in
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 133 5.16 Identifiers Implementation Guidelines the form just discussed for locating text element boundaries. In this section, a more tradi- tional BNF-style syntax is presented to facilitate incorporation into existing standards. The formal syntax provided here is intended to capture the general intent that an identifier consists of a string of characters that begins with a letter or an ideograph, and then includes any number of letters, ideographs, digits, or underscores. Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters from the ASCII range ($, @, #, _) in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, imple- menters need only combine these specific rules with the sample syntax provided here. These rules are no more complex than current rules in the common programming languages, except that they include more characters of different types. The innovations in the sample identifier syntax to cover the Unicode Standard correctly include the following: 1. Incorporation of proper handling of combining marks 2. Allowance for layout and format control characters, which should be ignored when parsing identifiers Combining Marks. Combining marks must be accounted for in identifier syntax. A com- posed character sequence consisting of a base character followed by any number of com- bining marks must be valid for an identifier. This requirement results from the conformance rules in Chapter 3, Conformance, regarding interpretation of canonical- equivalent character sequences. Enclosing combining marks (for example, U+20DD..U+20E0) are excluded from the syn- tactic definition of
Syntactic Rule
134 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching
Table 5-7. Syntactic Classes for Identifiers
Syntactic Class Equivalent Category Set Coverage
5.17 Sorting and Searching Sorting and searching overlap in that both implement degrees of equivalence of terms to be compared. In the case of searching, equivalence defines when terms match (for example, it determines when case distinctions are meaningful). In the case of sorting, equivalence affects the proximity of terms in a sorted list. These determinations of equivalence always depend on the application and language, but for an implementation supporting the Uni- code Standard, sorting and searching must also take into account the Unicode character equivalence and canonical ordering defined in Chapter 3, Conformance. This section also discusses issues of adapting sublinear text searching algorithms, providing for fast text searching while still maintaining language-sensitivity, and using the same ordering algorithms that are used for collation. For more information, see Unicode Techni- cal Report #10, “Unicode Collation Algorithm,” on the CD-ROM or the up-to-date version on the Unicode Web site.
Culturally Expected Sorting Sort orders vary from culture to culture, and many specific applications require variations. Sort order can be by word or sentence, case sensitive or insensitive, ignoring accents or not; it can also be either phonetic or based on the appearance of the character, such as ordering by stroke and radical for East Asian ideographs. Phonetic sorting of Han characters requires use of either a lookup dictionary of words or special programs to maintain an associated phonetic spelling for the words in the text. Languages vary not only regarding which types of sorts to use (and in which order they are to be applied), but also in what constitutes a fundamental element for sorting. For exam- ple, Swedish treats U+00C4 as an individual let- ter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of ä following a. Spanish traditionally sorted the digraph ll as if it were a let- ter between l and m. Examples from other languages (and scripts) abound. As a result, it is neither possible to arrange characters in an encoding in an order so that simple binary string comparison produces the desired collation order, nor is it possible to provide single-level sort-weight tables. The latter implies that character encoding details have only an indirect influence on culturally expected sorting. To address the complexities of culturally expected sorting, a multilevel comparison algo- rithm is typically employed.1 Each character in string is given several categories of sort weights. Categories can include alphabetic, case, and diacritic weights, among others.
1. A good example can be found in Denis Garneau, Keys to Sort and Search for Culturally- Expected Results (IBM document number GG24-3516, June 1, 1990), which addresses the problem for Western European languages, Arabic, and Hebrew.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 135 5.17 Sorting and Searching Implementation Guidelines
In a first pass, these weights are accumulated into a sort key for the string. At the end of the first pass, the sort key contains a string of alphabetic weights, followed by a string of case weights, and so on. In a second pass, these substrings are compared by order of importance so that case and accent differences can either be fully ignored or applied only where needed to differentiate otherwise identical sort strings. The first pass looks very similar to the decomposition of Unicode characters into base char- acter and accent. The fact that the Unicode Standard permits multiple spellings (composed and composite) of the same accented letter turns out not to matter at all. If anything, a completely decomposed text stream can simplify the first implementation of sorting. To provide a powerful, table-based approach to natural-language collation using Unicode characters, implementers need to consider providing full functionality for these features of language-sensitive algorithmic sorting: • Four collation levels • French or normal orientation • Contracting or expanding characters • Ordering of unmapped characters • More than one level of ignorable characters
Unicode Character Equivalence Section 3.6, Decomposition, and Section 3.10, Canonical Ordering Behavior, define equiva- lent sequences and provide an exact algorithm for determining when two sequences are equivalent. Equivalent sequences of Unicode characters should be collated in exactly the same way, no matter what the underlying storage is. Figure 5-15 gives two examples of character equivalence. Figure 5-15. Character Equivalence ä = a @¨ a @¨ @. = a @. @¨ a @¨ @` ≠ a @` @¨
Compatibility characters—especially where they have the same appearance—should also be collated in exactly the same way (for example, U+00C5 Å and U+212B Å ).
Similar Characters Languages differ in what they consider similar characters, but users generally want charac- ters that are similar (such as upper- and lowercase) to sort close to each other but not to be collated exactly the same way. If upper- and lowercase collated identically, words differing
136 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching only in case would appear in random order (see Figure 5-16). The same is true for accented characters. Typically, there is an ordering among these similar characters. In English dictionaries, for example, lowercase precedes uppercase (the reverse of what happens with a naïve ASCII comparison). However, this ordering applies only when the strings are the same in all other respects. Otherwise, Aachen would sort after azure. Characters with accents often sort close to the base character, but different accents on the same base character always sort in a given order. Figure 5-16. Naïve Comparison
pat pat É pet Pat É PAT pit É É pet É pot pit É put É É pot Pat É É put PAT É Naïve Comparison Priority Level Comparison
Levels of Comparison The way to handle these problems is to use multiple levels of comparison and attach only a secondary or tertiary difference to the letters based on their case or accents (see Figure 5-17). Thus we get the following rules: R1 Count secondary differences—only if there are no primary differences. R2 Count tertiary differences—only if there are no primary or secondary differences. Figure 5-17. Levels of Comparison resume resume résumé Resume resumes résumé résumés Résumé Secondary Difference Tertiary Difference
In English and similar languages, accents make only a secondary difference, and case differ- ences make only a tertiary difference. Ignorable characters are counted as secondary or ter- tiary differences. In other languages and scripts, other features map to secondary or tertiary differences.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 137 5.17 Sorting and Searching Implementation Guidelines
French presents an interesting special case. In French sorting, the differences in accents later in the string are more important than those earlier in the string. In Figure 5-18, the two pairs of words are identical in the first five base characters. In English, the first accents are the most significant ones; in French, the first accents from the end are as the boxes show. Figure 5-18. Orientation
Primary Sequence Primary Sequence Secondary Sequence Sequence Secondary péché pêche pêche péché pécher pécher pêcher pêcher Forward Sequence Backward Sequence (French)
Ignorable Characters Another class of interesting cases involves ignorable characters (see Figure 5-19), such as spaces, hyphens, and some other punctuation. In this case, the character itself is ignored unless there are no stronger differences in the string. Figure 5-19. Ignorable Characters blackbird role black-bird roˆ@le blackbirds roles black-birds roˆ@les Ignorable punctuation Ignorable accents
The general rule is as follows: R3 Treat ignorable characters as having no primary difference.
Multiple Mappings With many language collations such as traditional Spanish or German, one character may compare as if it were two, or two characters may compare as if they were one character (see Figure 5-20). In traditional Spanish orthography, for example, “Ch” sorts as a single let- ter—that is, after “Cz”; otherwise, it would come before “Ci”. In traditional German, “ö” sorts as if it were “oe”, putting it after “od” and before “of”.
138 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.17 Sorting and Searching
Figure 5-20. Multiple Mappings Cza r Tod QS Chico Tn e QU Hermos a Tofu QT Za pa t a Ton RS Grouped (n:1) Split (1:n) RT RU General (n:m)
In Japanese, a á length mark lengthens the vowel of the preceding character; depending on the vowel, the result will sort in a different order. For example, after the character “ka”, the á length mark indicates a long “a” and comes after “a”; after the character “ki”, the á length mark indicates a long “i” and comes after “i”. Look at the second character in each word in the third column of Figure 5-20. There are three different characters in the second position: “a”, á length mark, and “i”. The gen- eral rule is as follows: R4 N characters may compare as if they were M.
Collating Out-of-Scope Characters Collation implements an ordering that matches the expectations of the user, based on rules of the user’s language. Lists of terms encoded using the Unicode Standard may easily come from many different languages. These terms are all sorted according to the custom of the user’s language. For scripts and characters outside the use of a particular language, explicit rules may not exist. For example, Swedish and French have clear and different rules on sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines a particular sorting order for the Han ideographs. Implementations supporting the Uni- code Standard therefore typically provide a default ordering (like the culturally neutral ordering for ideographs used in this standard). Sorting for a Japanese user would still sort upper- and lowercase Latin letters in proximity. The relative ordering of scripts is typically configurable. R5 Default to a common or culturally neutral ordering for out-of-scope characters.
Unmapped Characters Another option is to treat out-of-scope characters as irrelevant. Such characters can include box forms, dingbats, and perhaps also alphabets that are not of concern for the user base of an implementation. Characters irrelevant to a collation sequence are usually not assigned weights; this choice saves space in the collation sequence. However, to provide a definitive sorting order, a position needs to be specified in the collation sequence for any unassigned character. For efficiency, collate any unassigned characters in Unicode bit order. R6 Collate irrelevant characters in Unicode bit order, in a specified position.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 139 5.17 Sorting and Searching Implementation Guidelines
Parameterization For effective use of collation, programmers need to have a certain level of control and need to be able to get back sufficient information. One output needs to be the place in the string where a difference occurs, so that common initial substrings can be found. Notice that this position may be different in the two different strings. Because of multiple mappings, the first three characters in one string might be equivalent to the first four in the other. Input • Specify language rule • Relative order of scripts • Allows specification of precision Output • First point in each string where different • Direction and precision of difference Processing • Can preprocess each string for fast comparison • Process just as much as needed for a stand-alone comparison
Optimizations Multiple-level comparison requires a bit more work than binary comparison. Although real-life studies put the overhead at around 50 percent, it often pays to first transform terms to be sorted into equivalent sort keys, which result in the same sorted list when subjected to a simple and fast binary comparison. (In the standard C library, the function wcsxfrm provides such a transformation.) These sort keys might consist of a string of base weights followed by strings for weights used for secondary and tertiary differences, as just dis- cussed. Unlike Unicode code values, sort keys don’t need to be 16-bit-based. Thus highly optimized functions, such as the strcmp function from the standard C library, can be used. Sort keys can also be stored, obviating recomputation when a list needs to be re-sorted. Another straightforward optimization is to compare as you go. For each string, sort weights are assembled into sort keys only until a difference is located. This approach reduces the computation necessary when a difference is found early in the string.
Searching Searching is subject to many of the same issues as comparison, such as a choice of a weak, strong, or exact match. Other features are often added, such as only matching words (that is, where a word boundary appears on each side of the match). One technique is to code a fast search for a weak match. When a candidate is found, additional tests can be made for other criteria (such as matching diacriticals, word match, case match, and so on). When searching strings, it is necessary to check for trailing nonspacing marks in the target string that may affect the interpretation of the last matching character. That is, a search for “San Jose” may find a match in the string “Visiting San José, Costa Rica is a...”. If an exact (diacritic) match is desired, then this match should be rejected. If a weak match is sought, then the match should be accepted, but any trailing nonspacing marks should be included when returning the location and length of the target substring. The mechanisms discussed in Section 5.15, Locating Text Element Boundaries, can be used for this purpose.
140 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.18 Case Mappings
One important application of weak equivalence is case-insensitive searching. Many tradi- tional implementations map both the search string and the target text to uppercase. How- ever, case mappings are language-dependent and not unambiguous (see Section 4.1, Case— Normative, and Section 7.1, Latin). The preferred method of implementing case insensitiv- ity uses the same mechanisms and tables described in the sorting discussions in the begin- ning of this section. In particular, it is advisable for many applications (for example, file systems) to treat a particular set of characters (i, I, dotless-i,capital i with dot) as a single equivalency class to guarantee reasonable results for Turkish. A related issue can arise because of inaccurate mappings from external character sets. To deal with this problem, characters that are easily confused by users can be kept in a weak equivalency class (ñ d-bar, Õ eth, capital d-bar, µ capital eth). This approach tends to do a better job of meeting users’ expectations when searching for named files or other objects.
Sublinear Searching International searching is clearly possible using the information in the collation, just by using brute force. However, this tactic requires an O(m*n) algorithm in the worst case and O(m) in common cases, where n is the number of characters in the pattern that is being searched for and m is the number of characters in the target to be searched. A number of algorithms allow for fast searching of simple text, using sublinear algorithms. These algorithms use only O(m/n) in common cases, by skipping over characters in the tar- get. Several implementers have adapted one of these algorithms to search text pre- transformed according to a collation algorithm, which allows for fast searching with native-language matching (see Figure 5-21). Figure 5-21. Sublinear Searching The_quick_brown… quick qui ck qu i ck quick quick
The main problems with adapting a language-aware collation algorithm for sublinear searching relate to multiple mappings and ignorables. Additionally, sublinear algorithms precompute tables of information. Mechanisms like the two-stage tables introduced in Figure 5-1 are efficient tools in reducing memory requirements.
5.18 Case Mappings The vast majority of case mappings are uniform across languages. In a few instances, upper- and lowercase mappings may differ from language to language between writing sys- tems that employ the same letters. The principal example is Turkish, where U+0131 “” maps to U+0049 “I” and U+0069 “i” maps to U+0130 “” , as shown in Figure 5-22.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 141 5.18 Case Mappings Implementation Guidelines
Figure 5-22. Case Mapping for Turkish I
õ I i õ
The process of case mapping has important exceptions. See the file SpecialCasing.html on the CD-ROM for more about these exceptions. For more information on case mapping data, see Section 4.1, Case—Normative, and Unicode Technical Report #21, “Case Map- pings,” on the CD-ROM or the up-to-date version on the Unicode Web site. Case Mappings Not Reversible. It is important to note that casing operations do not always provide a round-trip mapping. Also, because many characters are really caseless (most of the IPA block, for example), uppercasing a string does not mean that it will no longer contain any lowercase letters. Case correspondences are not always one-to-one: the result of case folding may be a differ- ent character length than in the source string. For example, U+00DF ß - becomes “SS” in uppercase. As discussed in Section 7.2, Greek, the iota-subscript characters used to represent ancient text can be viewed as having special case mappings. Normally, the uppercase and lowercase forms of alpha-iota-subscript will map back and forth. In some instances, where uppercase words should be transformed into their older spellings by removing accents and changing the iota-subscript into a capital iota (and perhaps even removing spaces). Note that case transformations are not reversible. For example, upper(lower(“John Brown”)) Å “JOHN BROWN” lower(upper(“John Brown”)) Å “john brown” There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper-, lower-, nor titlecase. This format is sometimes called inner-caps. Also, some single characters do not have reversible mappings. For example, U+03C2 § uppercases to U+03A3 , but the capital sigma lowercases to (nonfinal) U+03C3 ¨ . For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string, and then return to it in the sequence of keys. The user interface would produce the following results in response to a series of command keys. Notice that the original string is restored every fourth time. 1. The quick brown 2. THE QUICK BROWN 3. the quick brown 4. The Quick Brown 5. The quick brown Uppercase, titlecase, and lowercase can be represented in a word processor by using a char- acter style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.
142 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Implementation Guidelines 5.18 Case Mappings
For information on case conversion, detecting when strings are of a given case, and per- forming caseless matching of strings, see Unicode Technical Report #21, “Case Mappings,” on the CD-ROM or the up-to-date version on the Unicode Web site.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 143 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 6 Punctuation 6
The appearance and usage of punctuation marks varies between languages and scripts, but punctuation marks have a common function. They separate units of text, such as sentences and phrases, thus clarifying the meaning of the text. The use of punctuation marks is not limited to prose; they are also used in mathematical and scientific formulae, for example. In the Unicode Standard, punctuation marks are called punctuation characters. In the standard, characters are organized into related groups called blocks. Character blocks generally contain characters from a single script. In many cases, a script is fully rep- resented in its character block. There are, however, important exceptions, most notably in the area of punctuation characters. Punctuation characters occur in several widely sepa- rated places in the character blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, and CJK Symbols and Punctuation, as well as occasional characters in charac- ter blocks for specific scripts. Punctuation characters—for example, U+002C or U+2022 —are encoded only once, rather than being encoded again and again for particular scripts; such general- purpose punctuation may be used for any script or mixture of scripts. In contrast, punctu- ation characters that are encoded in a given script block—for example, U+058A or U+060C —are intended primarily for use in the context of that script. They are unique in function, have different directionality, or are distinct in appear- ance or usage from their generic counterparts. The use and interpretation of punctuation characters can be heavily context-dependent. For example, U+002E can be used as sentence-ending punctuation, an abbrevi- ation indicator, a decimal point, and so on. Punctuation characters vary in appearance with the font style, just like the surrounding text characters. In some cases, where used in the context of a particular script, a specific glyph style is preferred. For example, U+002E should appear square when used with Armenian, but is typically circular when used with Latin. For mixed Latin/Armenian text, two fonts (or one font allowing for context-dependent glyph variation) may need to be used to faithfully render the character. In a bidirectional context (see Section 3.12, Bidirectional Behavior), shared punctuation characters have no inherent directionality, but resolve according to the Unicode bidirec- tional algorithm. Where the image of a punctuation character is not bilaterally symmetric, the mirror image is used when the character is part of the right-to-left text stream (see Section 4.7, Mirrored—Normative). In vertical writing, many punctuation characters have special vertical glyphs. A number of characters in the blocks described in this chapter are not graphic punctuation characters, but nevertheless affect the operation of layout algorithms. For a description of these characters, see Section 13.2, Layout Controls.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 147 6.1 General Punctuation Punctuation
6.1 General Punctuation
Punctuation: U+0020–U+00BF Standards. The Unicode Standard adapts the ASCII (ISO 646) 7-bit standard by retaining the semantics and numeric code values, merely supplying enough leading zeros to convert them into 16-bit values. The content and arrangement of the ASCII standard are far from optimal in the context of a 16-bit space, but the Unicode Standard retains it without change because of its prevalence in existing usage. The ASCII (ANSI X3.4) standard is identical to ISO/IEC 646:1991-IRV. ASCII Graphic Characters. Some of the nonletter characters in this range suffer from overburdened usage as a result of the limited number of codes in a 7-bit space. Some cod- ing consequences of this problem are discussed in this section under “Encoding Characters with Multiple Semantic Values” and “General Punctuation,” but also see “Language-Based Use of Quotation Marks.” The rather haphazard ASCII collection of punctuation and mathematical signs are isolated from the larger body of Unicode punctuation, signs, and symbols (which are encoded in ranges starting at U+2000) only because the relative loca- tions within ASCII are so widely used in standards and software. Typographic Variations. Code values in the ASCII range are well established and used in widely varying implementations. The Unicode Standard therefore provides only minimal specifications on the typographic appearance of corresponding glyphs. For example, the value U+0024 ($) (derived from ASCII 24) has the semantic dollar sign, leaving open the question of whether the dollar sign is to be rendered with one vertical stroke or two. The Unicode value U+0024 refers to the dollar sign semantic, not to its precise appearance. Likewise, in old-style numerals, where numbers vary in placement above and below the baseline, a decimal or thousands separator may be displayed with a dot that is raised above the baseline. Because it would be inadvisable to have a stylistic variation between old-style and new-style numerals that actually changes the underlying representation of text, the Unicode Standard considers this raised dot to be merely a glyphic variant of U+002E “.” . For other characters in this range that have alternative glyphs, the Unicode character is displayed with the basic or most common glyph; rendering software may present any other graphical form of that character. Encoding Characters with Multiple Semantic Values. Some ASCII characters have multi- ple uses, either through ambiguity in the original standards or through accumulated rein- terpretations of a limited code set. For example, 27 hex is defined in ANSI X3.4 as apostrophe (closing single quotation mark; acute accent), and 2D hex is defined as hyphen- minus. In general, the Unicode Standard provides the same interpretation for the equiva- lent code values, without adding to or subtracting from their semantics. The Unicode Stan- dard supplies unambiguous codes elsewhere for the most useful particular interpretations of these ASCII values; the corresponding unambiguous characters are cross-referenced in the character names list for this block. For a complete list of space characters and dash characters in the Unicode Standard, see “General Punctuation” later in this section. For historical reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apos- trophe modifier or acute accent). Punctuation marks generally break words; modifier let- ters generally are considered part of a word.
148 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation
The preferred character for apostrophe is U+2019, but U+0027 is commonly present on keyboards. In modern software, it is therefore common to substitute U+0027 by the appro- priate character in input. In these systems, a U+0027 in the data stream is always repre- sented as a straight vertical line and can never represent a curly apostrophe or a right quotation mark. For more information, see “Apostrophes” later in this section. Semantics of Paired Punctuation. Paired punctuation marks such as parentheses (U+0028, U+0029), square brackets (U+005B, U+005D), and braces (U+007B, U+007D) are interpreted semantically rather than graphically in the context of bidirectional or verti- cal texts; that is, these characters have consistent semantics but alternative glyphs depending upon the directional flow rendered by a given software program. The software must ensure that the rendered glyph is the correct one. When interpreted semantically rather than graphically, characters containing the qualifier “” are taken to denote opening; charac- ters containing the qualifier “” are taken to denote closing. For example, U+0028 and U+0029 are interpreted as opening and closing parentheses, respectively, in the context of bidirectional or vertical texts. In a right-to-left directional flow, U+0028 is rendered as “)”. In a left-to-right flow, the same character is rendered as “(”. See also “Language-Based Use of Quotation Marks” later in this section. Tilde. U+007E can be used either as a Spacing Clone of Combining Tilde (see “Spacing Clones of Diacritics” in Section 7.1, Latin) or more often as a center line tilde, similar in appearance to U+223C “~” . Two common uses are to indicate an approximate value or in dictionaries to repeat the defined term in the definition of the ~. Although U+007E “~” is ambiguous in its rendering, modern fonts generally render it with a center line glyph, as shown in the code charts.
General Punctuation: U+2000–U+206F General punctuation combines punctuation characters and characterlike elements used to achieve certain text layout effects. Some punctuation characters can be used with many dif- ferent scripts. Many general punctuation characters can also be found in the Basic Latin (ASCII) and Latin-1 Supplement blocks. In many cases, current standards include generic characters for punctuation instead of the more precisely specified characters used in printing. Examples include the single and dou- ble quotes, period, dash, and space. The Unicode Standard includes these generic charac- ters, but also encodes the unambiguous characters independently: various forms of quotation mark, decimal period, em dash, en dash, minus, hyphen, em space, en space, hair space, zero-width space, and so on. Punctuation principally used with a specific script is found in the block corresponding to that script, such as U+061B “ý” or the punctuation used with ideo- graphs in the CJK Symbols and Punctuation Block.
Space Characters The most commonly used space character is U+0020 . Also often used is its non- breaking counterpart, U+00A0 - . These two characters have the same width, but behave differently for line breaking. U+00A0 - behaves like a numeric separator for the purposes of bidirectional layout. (See Section 3.12, Bidirectional Behavior, for a detailed discussion of the Unicode bidirectional algorithm.) In ideographic text, U+3000 is commonly used because its width matches that of the ideographs.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 149 6.1 General Punctuation Punctuation
The main difference among other space characters is their width. U+2000..U+2006 are standard quad widths used in typography. U+2007 has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 is a space defined to be the same width as a period. U+2009 and U+200A are successively smaller-width spaces used for narrow word gaps and for justi- fication of type. The fixed-width space characters (U+2000..U+200A) are derived from conventional (hot lead) typography. Algorithmic kerning and justification in computerized typography do not use these characters. However, where they are used, as, for example, in typesetting mathematical formulae, their width is generally font-specified, and they typi- cally do not expand during justification. The exception is U+2009 , which sometimes gets adjusted. Space characters with special behavior in word or line breaking are described in “Line and Word Breaking” in Section 13.2, Layout Controls. Space characters may also be found in other character blocks in the Unicode Standard. The list of space characters appears in Table 6-1. Table 6-1. Unicode Space Characters Code Name U+0020 U+00A0 - U+2000 U+2001 U+2002 U+2003 U+2004 -- U+2005 -- U+2006 -- U+2007 U+2008 U+2009 U+200A U+200B U+202F - U+3000
U+200B , as well as several spacelike, zero-width characters with special properties, are described in Section 13.2, Layout Controls.
Dashes and Hyphens Because of its prevalence in legacy encodings, U+002D - is the most com- mon of the dash characters used to represent a hyphen. It has ambiguous semantic value and is rendered with an average width. U+2010 represents the hyphen as found in words such as “left-to-right.” It is rendered with a narrow width. When typesetting text, U+2010 is preferred over U+002D -. U+2011 - is present for compatibility with existing standards. It has the same semantic value as U+2010 , but should not be broken across lines. U+2012 is present for compatibility with existing standards; it has the same (ambiguous) semantic as the U+002D -, but has the same width as digits (if they are monospaced). U+2013 is used to indicate a range of values, such as 1973– 1984. It should be distinguished from the U+2122 , which is an arithmetic operator; however, typographers have typically used U+2013 in typesetting to represent the minus sign. For general compatibility in interpreting formulas, U+002D -,
150 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation
U+2012 , and U+2212 should each be taken as indicating a minus sign, as in “x = a - b.” U+2014 is used to make a break—like this—in the flow of a sentence. It is com- monly represented with a typewriter as a double-hyphen. In older mathematical typogra- phy, U+2014 is also used to indicate a binary minus sign. U+2015 is used to introduce quoted text in some typographic styles. Dashes and hyphen characters may also be found in other character blocks in the Unicode Standard. A list of dash and hyphen characters appears in Ta ble 6-2. For a description of the line-breaking behavior of dashes and hyphens, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Unicode Web site. Table 6-2. Unicode Dash Characters Code Name U+002D - U+007E (= swung dash) U+00AD U+058A U+1806 U+2010 U+2011 - U+2012 U+2013 U+2014 U+2015 (= quotation dash) U+207B U+208B U+2212 U+301C U+3030
Language-Based Usage of Quotation Marks U+0022 is the most commonly used character for quotation mark. How- ever, it has ambiguous semantics and direction. Word processors commonly offer a facility for automatically converting the U+0022 to a contextually-selected curly quote glyph. Low Quotation Marks. U+201A - and U+201E - are unambiguously opening quotation marks. All other quotation marks have heterogeneous semantics. They may represent opening or closing quotation marks depending on the usage. European Usage. The use of quotation marks differs systematically by language and by medium. In European typography, it is common to use guillemets (single or double angle quotation marks) for books and, except for some languages, curly quotation marks in office automation. Single guillemets can be found for quotes inside quotes. The following description does not attempt to be complete, but intends to document a range of known usages of quotation mark characters. Some of these usages are also illustrated in Figure 6-1. In this section, the words single and double are omitted from character names where there is no conflict or both are meant. Dutch, English, Italian, Portugese, Spanish, and Turkish use a left quotation mark and a right quotation mark for opening and closing quotations, respectively. It is typical to alter- nate single and double quotes for quotes within quotes. Whether single or double quotes are used for the outer quotes depends on local and stylistic conventions.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 151 6.1 General Punctuation Punctuation
Czech, German, and Slovak use the low-9 style of quotation mark for opening instead of the standard open quotes. They employ the left quotation mark style of quotation mark for closing instead of the more common right quotation mark forms. When guillemets are used in German books, they point to the quoted text. This style is the inverse of French usage. Danish, Finnish, Norwegian, and Swedish use the same right quotation mark character for both the opening and closing quotation character. This usage is employed both for office automation purposes and for books. Books sometimes use the guillemet, U+00BB - , for both opening and closing. Hungarian and Polish usage of quotation marks is similar to the Scandinavian usage, except that they use low double quotes for opening quotations. Presumably, these lan- guages avoid the low single quote so as to prevent confusion with the comma. French, Greek, Russian and Slovenian, among others, use the guillemets, but Slovenian usage is the same as German usage in their direction. Of these languages, at least French inserts space between text and quotation marks. In the French case, U+00A0 - can be used for the space that is enclosed between quotation mark and text; this choice helps line-breaking algorithms. Figure 6-1. European Quotation Marks
Single right quote = apostrophe ‘quote’ don’t
Usage depends on language “English”«French» „German“»Slovenian« ”Swedish”»Swedish books»
East Asian Usage. The glyph for each quotation mark character for an Asian character set occupies predominantly a single quadrant of the character cell. The quadrant used depends on whether the character is opening or closing and whether the glyph is for use with hori- zontal or vertical text. The pairs of quotation characters are listed in Table 6-3. Table 6-3. East Asian Quotation Marks
Style Opening Closing Corner bracket 300C 300D White corner bracket 300E 300F Double prime 301D 301F
Glyph Variation. The glyphs for “double-prime” quotation marks consist of a pair of wedges, slanted either forward or backward, with the tips of the wedges pointing either up or down. In a pair of double-prime quotes, the closing and the opening character of the pair slant in opposite directions. Two common variations exist as shown in Figure 6-2. To
152 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation confuse matters more, another form of double-prime quotation marks is used with West- ern style, horizontal text, in addition to the curly single or double quotes. Figure 6-2. Asian Quotation Marks
Horizontal and vertical glyphs
Glyphs for overloaded character codes “Text” Font style based glyph alternates
Three pairs of quotation marks are used with Western-style, horizontal text, as shown in Table 6-4. Table 6-4. Opening and Closing Forms
Style Opening Closing Comment Single 2018 2019 Rendered as “wide” character Double 201C 201D Rendered as “wide” character Double prime 301D 301E
Overloaded Character Codes. The character codes for standard quotes can refer to regular narrow quotes from a Latin font used with Latin text, as well as wide quotes from an Asian font used with other wide characters. This situation can be handled with some success where the text is marked up with language tags. Consequences for Semantics. The semantics of U+00AB, U+00BB (double guillemets), and U+201D (right double quotation) are context-dependent. The semantics of U+201A and U+201B (low-9 quotation marks) are always opening; this usage is distinct from the usage of U+301E , which is unambiguously closing.
Apostrophes U+0027 is the most commonly used character for apostrophe. However, it has ambiguous semantics and direction. When text is set, U+2019 - is preferred as apostrophe. Word processors commonly offer a facility for auto- matically converting the U+0027 to a contextually-selected curly quotation glyph. Letter Apostrophe. U+02BC is preferred where the apos- trophe is to represent a modifier letter (for example, in transliterations to indicate a glottal stop). In the latter case, it is also referred to as a letter apostrophe.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 153 6.1 General Punctuation Punctuation
Punctuation Apostrophe. U+2019 is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.” In this latter case, U+2019 is also referred to as a punctuation apostrophe. An implementation cannot assume that users’ text always adheres to the distinction between these characters. The text may come from different sources, including mapping from other character sets that do not make this distinction between the letter apostrophe and the punctuation apostrophe/right single quotation mark. In that case, all of them will generally be represented by U+2019. The semantics of U+2019 are therefore context-dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.
Other Punctuation Hyphenation Point. U+2027 is a raised dot used to indicate correct word breaking, as in dic·tion·ar·ies. It is a punctuation mark, to be distinguished from U+00B7 , which has multiple semantics. Fraction Slash. U+2044 is used between digits to form numeric fractions, such as 2/3, 3/9, and so on. The standard form of a fraction built using the fraction slash is defined as follows: Any sequence of one or more decimal digits, followed by the fraction slash, followed by any sequence of one or more decimal digits. Such a fraction should be 3 displayed as a unit, such as ¾ or as --- . The precise choice of display can depend upon addi- 4 tional formatting information. If the displaying software is incapable of mapping the fraction to a unit, then it can also be displayed as a simple linear sequence as a fallback (for example, 3/4). If the fraction is to be separated from a previous number, then a space can be used, choosing the appropriate width (normal, thin, zero width, and so on). For example, 1 + + 3 + + 4 is displayed as 1¾. Spacing Overscore. U+203E is the above-the-line counterpart to U+005F . It is a spacing character, not to be confused with U+0305 . As with all over- or underscores, a sequence of these characters should connect in an unbroken line. The overscoring characters also must be distinguished from U+0304 , which does not connect horizontally in this way. Doubled Punctuation. Several doubled punctuation characters that have compatibility decompositions into a sequence of two punctuation marks are also encoded as single char- acters: U+203C , U+2048 , and U+2049 . These doubled punctuation marks are included as an implementation convenience for East Asian and Mongolian text, which is rendered ver- tically. Bullets. U+2022 is the typical character for a bullet. Within the general punctua- tion, several alternative forms for bullets are separately encoded: U+2023 , U+204C , and so on. U+00B7 also often functions as a small bullet. Bullets mark the head of specially formatted paragraphs, often occurring in lists, and may in fact use arbitrary graphics or dingbat forms as well as more conventional bullet forms. U+261E , for example, is often used to highlight a note in text, as a kind of gaudy bullet. Paragraph Marks. U+00A7 and U+00B6 are often used as visible indications of sections or paragraphs of text, in editorial markup, to show format modes, and so on. Which character indicates sections and which character indicates
154 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Punctuation 6.1 General Punctuation paragraphs may vary by convention. U+204B is a fairly common alternate representation of the paragraph mark.
CJK Symbols and Punctuation: U+3000–U+303F This block encodes punctuation marks and symbols used primarily by writing systems that employ Han ideographs. Most of these characters are found in East Asian standards. U+3000 is provided for compatibility with legacy character sets. It is a fixed-width space appropriate for use with an ideographic font. U+301C and U+3030 are special forms of dashes found in East Asian character standards. (For a list of other space and dash characters in the Unicode Standard, see Table 6-1 and Table 6-2.) U+3037 is a visible indicator of the line feed separator symbol used in the Chinese telegraphic code; it is comparable to the pictures of control codes found in the Control Pictures block. U+3005 is used to stand for the second of a pair of identical ideographs occurring in adjacent positions within a document. U+3006 is used frequently on signs to indicate that a store or booth is closed for business. The Japanese pronunciation is shime, most often encountered in the compound shime-kiri. U+3012 is used in Japanese addresses immediately preceding the numerical postal code. It is also used on forms and applications to indicate the blank space in which a postal code is to be entered. U+3020 and U+3016 are properly glyphic variants of U+3012, and are included for compatibility. U+3031 and U+3032 are used only in vertically written Japanese to repeat pairs of kana characters occurring immediately prior in a document. The voiced variety U+3032 is used in cases where the repeated kana are to be voiced. For instance, a repetitive phrase like toki- doki could be expressed in vertical writing. Both of these characters are intended to be represented by “double height” glyphs requiring two ideo- graphic “cells” to print; this intention also explains the existence in source standards of the characters representing the top and bottom halves of these (that is, the characters U+3033, U+3034, and U+3035). In horizontal writing, similar characters are used, and they are sep- arately encoded. In Hiragana, the equivalent repeat marks are encoded at U+309D and U+309E; in Katakana, they are U+30FD and U+30FE.
Unknown or Unavailable Ideographs U+3013 is used to indicate the presence of, or to hold a place for, an ideograph that is not available when a document is printed. It has no other use. Its name comes from its resemblance to the mark left by traditional Japanese sandals (geta); a variety of light and heavy glyphic variants occur. U+303E is a graphic character that is to be rendered visibly. It alerts the user that the intended character is similar to, but not equal to, the char- acter that follows. Its use is similar to the existing character U+3013 . A substitutes for the unknown or unavailable character, but does not identify it. The is the head of a two-character sequence that gives some indication about the intended glyph or intended character. Ultimately, the - and the character following it are intended to be replaced
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 155 6.1 General Punctuation Punctuation by the correct character, once it has been identified or a font resource or input resource has been provided for it. U+303F - is a visible indicator of a display cell filler used when ideographic characters have been split during display on systems using a double-byte character encoding. It is included in the Unicode Standard for compatibility. See also “Ideographic Description Sequences” in Section 10.1, Han.
CJK Compatibility Forms: U+FE30–U+FE4F A number of presentation forms encoded in this block are found in the Republic of China (Taiwan) national standard CNS 11643. These vertical forms of punctuation characters are provided for compatibility with those legacy implementations that encode these characters explicitly when Chinese text is being set in vertical rather than horizontal lines. The pre- ferred Unicode encoding is to encode the nominal characters that correspond to these ver- tical variants. Then, at display time, the appropriate glyph is selected according to the line orientation.
Small Form Variants: U+FE50–U+FE6F The Republic of China (Taiwan) national standard CNS 11643 also encodes a number of small variants of ASCII punctuation. The characters of this block, while construed as fullwidth characters, are nevertheless depicted using small forms that are set in a fullwidth display cell. (See the discussion in Section 10.3, Katakana.) These characters are provided for compatibility with legacy imple- mentations. Unifications. Two small form variants from CNS 11643/plane 1 were unified with other characters outside the ASCII block: 213116 was unified with U+00B7 , and 226116 was unified with U+2215 .
156 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 7 European Alphabetic Scripts 7
Modern European alphabetic scripts are derived from or influenced by the Greek script. The word alphabet is derived from the Greek word alphabetos, itself derived from the names of the first two Greek letters, alpha and beta. The Greek script itself is an adaptation of the Phoenician alphabet. A Greek innovation was writing the letters from left to right, which is the writing direction for all the scripts derived from or inspired by Greek. The European alphabetic scripts in the Unicode Standard, •Latin • Greek • Cyrillic •Armenian •Georgian •Ogham •Runic are written from left to right. Many have separate lowercase and uppercase forms of the alphabet. Spaces are used to separate words. Accents and diacritical marks are used to indi- cate phonetic features and to extend the use of base scripts to additional languages. The process of applying diacritical marks is potentially open-ended—one of the reasons com- bining marks are encoded in the Unicode Standard. Latin and Cyrillic are used to write or transliterate texts in a wide variety of languages. The Latin alphabet is derived from the alphabet used by the Etruscans, who had adopted a west- ern variant of the classical Greek alphabet. Originally it contained only 24 capital letters. The modern Latin alphabet as it is found in the Basic Latin block owes its appearance to innovations of scribes during the Middle Ages and practices of the early Renaissance print- ers. The Cyrillic script was developed in the ninth century and is also based on Greek. The Georgian and Armenian scripts were invented in the fifth century and are influenced by Greek. Modern Georgian does not have separate upper- and lowercase forms. The International Phonetic Alphabet is an extension of the Latin alphabet, enabling it to represent the phonetics of all languages. The two historic scripts of northwestern Europe, Runic and Ogham, have a distinct appear- ance owing to their primary use in carving inscriptions in stone and wood. They are con- ventionally rendered left to right in scholarly literature, but on the original stone carvings often proceeded in an arch tracing the outline of the stone.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 159 7.1 Latin European Alphabetic Scripts
7.1 Latin The Latin script was derived from the Greek script. Today it is used to write a wide variety of languages all over the world. In the process of adapting it to other languages, numerous extensions have been devised. The most common is the addition of diacritical marks. Fur- thermore, the creation of digraphs, inverse or reverse forms, and outright new characters have all been used to extend the Latin script. The Latin script is written in linear sequence from left to right. Spaces are used to separate words and provide the primary line-breaking opportunities. Hyphens are used where lines are broken in the middle of a word. (For more information, see Unicode Technical Report #14, “Line Breaking Properties,” on the CD-ROM or the up-to-date version on the Uni- code Web site.) Latin letters come in upper- and lowercase pairs. Diacritical Marks. Speakers of different languages treat the addition of a diacritical mark to a base letter differently. In some languages, the combination is treated as a letter in the alphabet for the language. In others, such as English, the same words can often be spelled with and without the diacritical mark without implying any difference. Most languages that use the Latin script treat letters with diacritical marks as variations of the base letter, but do not accord the combination the full status of an independent letter in the alphabet. The encoding for the Latin script in the Unicode standard is sufficiently flexible to allow implementations to support these letters according to the users’ expectation, as long as the language is known. Widely used accented character combinations are provided as single characters to accommodate interoperation with pervasive practice in legacy encodings. Combining diacritical marks can express these and all other accented letters as combining character sequences. In the Unicode Standard, all diacritical marks are encoded in sequence after the base char- acters to which they apply. For more details, see subsection on “Combining Diacritical Marks” in Section 7.9, Combining Marks, and also Section 2.6, Combining Characters. Standards. Unicode follows ISO 8859-1 in the layout of Latin letters up to U+00FF. ISO 8859-1, in turn, is based on older standards, among others ASCII (ANSI X3.4), which is identical to ISO/IEC 646:1991-IRV. Like ASCII, ISO 8859-1 contains Latin letters, punctu- ation signs, and mathematical symbols. The use of the additional characters is not restricted to the context of Latin script usage. The description of these characters is found in Chapter 6, Punctuation. Related Characters. For other Latin or Latin-derived characters, see letterlike symbols (U+2100..U+214F), currency symbols (U+20A0..U+20CF), miscellaneous symbols (U+2600..U+26FF), enclosed alphanumberics (U+2460..U+24FF), and fullwidth forms (U+FF21..U+FF5A).
Letters of Basic Latin: U+0041–U+007A Only a small fraction of the languages written with the Latin script can be written entirely with the basic set of 26 uppercase and 26 lowercase Latin letters contained in this block. The 26 basic letter pairs form the core of the alphabets used by all the other languages that use the Latin script. A stream of text using one of these alphabets would therefore intermix characters from the Basic Latin block and other Latin blocks. Occasionally a few of the basic letter pairs are omitted, such as in Italian, which does not use “j” or “w”.
160 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin
Alternative Graphics. Common typographical variations include the open- and closed- loop form of the lowercase letters “a” and “g”. Phonetic transcription systems, such as IPA and Pinyin, make a distinction between such forms.
Letters of the Latin-1 Supplement: U+00C0–U+00FF The Latin-1 supplement extends the basic 26 letter pairs of ASCII by providing additional letters for major languages of Europe (listed in the next paragraph). Like ASCII, the Latin- 1 set also includes a miscellaneous set of punctuation and mathematical signs. Punctua- tion, signs, and symbols not included in the Basic Latin and Latin-1 Supplement blocks are encoded in character blocks starting with the General Punctuation block. Languages. The languages supported by the Latin-1 supplement include Danish, Dutch, Faroese, Finnish, Flemish, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Span- ish, and Swedish. Ordinals. U+00AA and U+00BA can be depicted with an underscore, but many modern fonts show them as superscripted Latin letters with no underscore. In sorting and searching, these characters should be treated as weakly equivalent to their Latin character equivalents. Spacing Clones of Diacritics. ISO 8859-1 contains eight characters that are ambiguous regarding whether they denote combining characters or separate spacing characters. In the Unicode Standard, the corresponding codepoints (U+005E ^ , U+005F _ , U+0060 Å , U+007E ~ , U+00A8 ¨ , U+00AF ¯ , U+00B4 ´ , and U+00B8 ¸ ) are restricted to use as spacing characters. The Unicode Standard provides unambiguous combining characters in the character block for Combining Diacritical Marks, which can be used to represent accented Latin letters by means of composed character sequences. U+00B0 ° is also occasionally used ambiguously by implementations of ISO 8859-1 to denote a spac- ing form of a diacritic ring above a letter; in the Unicode Standard, that spacing diacritical mark is denoted unambiguously by U+02DA ° . U+007E “~” is ambigu- ous between usage as a spacing form of a diacritic and as an operator or other punctuation; it is generally rendered with a center line glyph, rather than as a diacritic raised tilde. The spacing form of the diacritic tilde is denoted unambiguously by U+02DC “Á” .
Latin Extended-A: U+0100–U+017F The Latin Extended-A block contains a collection of letters that, when added to the letters contained in the Basic Latin and Latin-1 Supplement blocks, allow for the representation of most European languages that employ the Latin script. Many other languages can also be written with the characters in this block. Most of these characters are equivalent to precom- posed combinations of base character forms and combining diacritical marks. These com- binations may also be represented by means of composed character sequences. See Section 2.6, Combining Characters. Standards. This block includes characters contained in International Standard ISO 8859— Part 2. Latin alphabet No. 2, Part 3. Latin alphabet No. 3, Part 4. Latin alphabet No. 4, and Part 9. Latin alphabet No. 5. Many of the other graphic characters contained in these stan- dards, such as punctuation, signs, symbols, and diacritical marks, are already encoded in the Latin-1 Supplement block. Other characters from these parts of ISO 8859 are encoded in other blocks, primarily in the Spacing Modifier Letters block (U+02B0..U+02FF) and in the character blocks starting at and following the General Punctuation block.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 161 7.1 Latin European Alphabetic Scripts
Languages. Most languages supported by this block also require the concurrent use of characters contained in the Basic Latin and Latin-1 Supplement blocks. When combined with these two blocks, the Latin Extended-A block supports Afrikaans, Basque, Breton, Catalan, Croatian, Czech, Esperanto, Estonian, French, Frisian, Greenlandic, Hungarian, Latin, Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany, Sami, Slovak, Slovenian, Sorbian, Turkish, Welsh, and many others. Alternative Glyphs. Some characters have alternative representations, although they have a common semantic. In such cases, a preferred glyph is chosen to represent the character in the code charts, even though it may not be the form used under all circumstances. Some examples to illustrate this point are provided in Figure 7-1 and discussed in the text that follows. Figure 7-1. Alternative Glyphs @AU BCD, EF LR ST MN VW XYZ GH
When Czech is printed in books, U+010F and U+0165 letter forms with apostrophe are often used instead of letter forms with caron (hacek) over the base forms. In Slovak, this use also applies to U+013E . The use of an apostrophe can avoid some line crashes over the ascenders of those letters and so result in better typography. In typewritten or handwritten documents, or in didactic and pedagogical material, on the other hand, let- ter forms with haceks are preferred. Languages other than Czech or Slovak that make use of these characters may simply choose to always use the forms with haceks. A similar situation can be seen in the Latvian letter U+0123 . In good Latvian typography, this character is always shown with a rotated comma over the g, rather than a cedilla below the g, because of the typographical design and layout issues resulting from trying to place a cedilla below the descender loop of the g. Poor Latvian fonts may substitute an acute accent for the rotated comma, and handwritten or other printed forms may actually show the cedilla below the g. The uppercase form of the letter is always shown with a cedilla, as the rounded bottom of the G poses no problems for attachment of the cedilla. Other Latvian letters with cedilla below (U+0137 , U+0146 , and U+0157 ) always prefer a glyph with a floating comma below as there is no proper attach- ment point for a cedilla proper at the bottom of the base form. In Turkish and Romanian, a cedilla and a comma below can replace one another depending on the font style. The letters U+015F and U+0163 (and their uppercase counterparts) have been dupli- cated at U+0219 and U+021B . The duplicated characters with explicit commas below are provided solely for compatibility with sociopolitical practices. Legacy encodings for these
162 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin characters, including ISO/IEC 8859-2, contain only a single form of each of these charac- ters, which is mapped to the form with cedilla. In general, characters with cedillas or ogoneks below are subject to variable typographical usage, depending on the availability and quality of fonts used, the technology, and the geo- graphic area. Various hooks, commas, and squiggles may be substituted for the nominal forms of these diacritics below, and even the direction of the hooks may be reversed. Imple- menters should take care to become familiar with particular typographical traditions before assuming that characters are missing or are wrongly represented in the code charts in the Unicode Standard. Exceptional Case Pairs. The characters U+0130 and U+0131 (used primarily in Turkish) are assumed to take ASCII “i” and “I” as their case alternates, respectively. This mapping makes the corre- sponding reverse mapping language-specific; mapping in both directions requires special attention from the implementor (see Section 5.17, Sorting and Searching). See SpecialCas- ing.txt on the CD-ROM for more information. Diacritics on i and j. A dotted (normal) i or j followed by a top nonspacing mark loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with i + diaeresis. Just as Cyrillic A is not equivalent to Latin A, a dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases of accented dotted-i equivalent to accented dotless-i (for exam- ple, i + ¨Å + ¨). The same pattern is used for j. To express the forms sometimes used in the Baltic (where the dot is retained under a top accent), use i + overdot + accent (see Figure 7-2). Figure 7-2. Diacritics on i and j
i + Y ➠ i + \ + Z ➠ W j + [ ➠ X i + Z + \ ➠ V
Latin Extended-B: U+0180–U+024F The Latin Extended-B block contains letter forms used to extend Latin scripts to represent additional languages. It also contains phonetic symbols not included in the International Phonetic Alphabet (see the IPA Extensions block, U+0250..U+02AF). Standards. This block covers, among others, characters in ISO 6438 Documentation— African coded character set for bibliographic information interchange, Pinyin Latin tran- scription characters from the People’s Republic of China national standard GB 2312 and from the Japanese national standard JIS X 0212, and Sami characters from ISO 8859 Part 10. Latin alphabet No. 6. Arrangement. The characters are arranged in a nominal alphabetical order, followed by a small collection of Latinate forms. Upper- and lowercase pairs are placed together where possible, but in many instances the other case form is encoded at some distant location and so is cross-referenced. Variations on the same base letter are arranged in the following order: turned, inverted, hook attachment, stroke extension or modification, different style (script), small cap, modified basic form, ligature, and Greek-derived. Croatian Digraphs Matching Serbian Cyrillic Letters. Serbo-Croatian is a single language with paired alphabets: a Latin script (Croatian) and a Cyrillic script (Serbian). A set of
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 163 7.1 Latin European Alphabetic Scripts compatibility digraph codes is provided for one-to-one transliteration. There are two potential uppercase forms for each digraph, depending on whether only the initial letter is to be capitalized (titlecase), or both (all uppercase). The Unicode Standard offers both forms so that software can convert one form to the other without changing font sets. The appropriate cross-references are given for the lowercase letters. For more information about canonical equivalence, see Chapter 3, Conformance. Pinyin Diacritic-Vowel Combinations. The Chinese standard GB 2312, as well as the Japa- nese standard JIS X 0212, includes a set of codes for Pinyin, used for Latin transcription of Mandarin Chinese. Most of the letters used in Pinyin romanization (even those with com- bining diacritical marks) are already covered in the preceding Latin blocks. The group of 16 characters provided here completes the Pinyin character set specified in GB 2312 and JIS X 0212. Case Pairs. A number of characters in this block are uppercase forms of characters whose lowercase form is part of some other grouping. Many of these characters came from the International Phonetic Alphabet; they acquired novel uppercase forms when they were adopted into Latin script-based writing systems. Occasionally, however, alternative upper- case forms arose in this process. In some instances, research has shown that alternative uppercase forms are merely variants of the same character. If so, such variants are assigned a single Unicode value, as is the case of U+01B7 . But when research has shown that two uppercase forms are actually used in different ways, then they are given different codes; such is the case for U+018E and U+018F . In this instance, the shared lowercase form is copied to enable unique case-pair mappings if desired: U+01DD is a copy of U+0259 . For historical reasons, the names of some case pairs differ. For example, U+018E is the uppercase of U+01DD —not of U+0258 . (For default case mappings of Unicode characters, see Chapter 4, Character Properties.) Languages. Some indication of language or other usage is given for most characters within the names lists accompanying the character charts.
IPA Extensions: U+0250–U+02AF The IPA Extensions block contains primarily the unique symbols of the International Pho- netic Alphabet (IPA), which is a standard system for indicating specific speech sounds. The IPA was first introduced in 1886 and has undergone occasional revisions of content and usage since that time. The Unicode Standard covers all single symbols and all diacritics in the last published IPA revision (1989), as well as a few symbols in former IPA usage that are no longer currently sanctioned. A few symbols have been added to this block that are part of the transcriptional practices of Sinologists, Americanists, and other linguists. Some of these practices have usages independent of the IPA and may use characters from other Latin blocks rather than IPA forms. Note also that a few nonstandard or obsolete phonetic symbols are encoded in the Latin Extended-B block. An essential feature of IPA is the use of combining diacritical marks. IPA diacritical mark characters are coded in the Combining Diacritical Marks block, U+0300..U+036F. In IPA, diacritical marks can be freely applied to base form letters to indicate fine degrees of pho- netic differentiation required for precise recording of different languages. Standards. The characters in this block are taken from the 1989 revision of the Interna- tional Phonetic Alphabet, published by the International Phonetic Association. The Inter- national Phonetic Association standard considers IPA to be a separate alphabet, so it
164 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.1 Latin includes the entire Latin lowercase alphabet a–z, a number of extended Latin letters such as U+0153 œ, and a few Greek letters and other symbols as sepa- rate and distinct characters. In contrast, the Unicode Standard does not duplicate either the Latin lowercase letters a–z or other Latin or Greek letters in encoding IPA. Note that unlike other character standards referenced by the Unicode Standard, IPA constitutes an extended alphabet and phonetic transcriptional standard, rather than a character encoding standard. Unifications. The IPA symbols are unified as much as possible with other letters, albeit not with nonletter symbols like U+222B . The IPA symbols have also been adopted into the Latin-based alphabets of many written languages, such as some used in Africa. It is futile to attempt to distinguish a transcription from an actual alphabet in such cases. Therefore, many IPA symbols are found outside the IPA Extensions block. IPA symbols that are not found in the IPA Extensions block are listed as cross-references at the begin- ning of the character names list for this block. IPA Alternates. In a few cases IPA practice has, over time, produced alternate forms, such as U+0269 “ι” versus U+026A “.” The Unicode Standard provides separate encodings for the two alternate forms because they are used in a meaningfully distinct fashion. Case Pairs. IPA does not sanction case distinctions; in effect, its phonetic symbols are all lowercase. When IPA symbols are adopted into a particular alphabet and used by a given written language (as has occurred, for example, in Africa) they acquire uppercase forms. Because these uppercase forms are not themselves IPA symbols, they are generally encoded in the Latin Extended-B block (or other Latin extension blocks) and are cross-referenced with the IPA names list. Typographic Variants. IPA includes typographic variants of certain Latin and Greek letters that would ordinarily be considered variations of font style rather than of character iden- tity, such as letter forms. Examples include a typographic variant of the Greek letter phi φ, as well as the borrowed letter Greek iota ι, which has a unique Latin uppercase form. These forms are encoded as separate characters in the Unicode Standard because they have distinct semantics in plain text. Affricate Digraph Ligatures. IPA officially sanctions six digraph ligatures used in tran- scription of coronal affricates. These are encoded at U+02A3..U+02A8. The IPA digraph ligatures are explicitly defined in IPA and also have possible semantic values that make them not simply rendering forms. Thus, for example, while U+02A6 is a transcription for the sounds that could also be transcribed in IPA as “ts” U+0074 U+0073, the choice of the digraph ligature may be the result of a deliberate dis- tinction made by the transcriber regarding the systematic phonetic status of the affricate. The choice of whether to ligate cannot be left to rendering software based on the font avail- able. This ligature also differs in typographical design from the ts ligature found in some old-style fonts. Encoding Structure. The IPA Extensions block is arranged in approximate alphabetical order according to the Latin letter that is graphically most similar to each symbol. This order has nothing to do with a phonetic arrangement of the IPA letters.
Latin Extended Additional: U+1E00–U+1EFF The characters in this block constitute a number of precomposed combinations of Latin letters with one or more general diacritical marks. Each of the characters contained in this block may be alternatively represented with a base letter followed by one or more general diacritical mark characters found in the Combining Diacritical Marks block. A canonical form for such alternative representations is specified in Chapter 3, Conformance.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 165 7.1 Latin European Alphabetic Scripts
Vietnamese Vowel Plus Tone Mark Combinations. A portion of this block (U+1EA0.. U+1EF9) comprises vowel letters of the modern Vietnamese alphabet (qu¶c ngÔ) com- bined with a diacritic mark that denotes the phonemic tone that applies to the syllable. In the modern Vietnamese alphabet, there are 12 vowel letters and five tone marks (see Figure 7-3). Figure 7-3. Vietnamese Letters and Tone Marks a ì â e ê i o ô u y þþþôþìþ
Some implementations of Vietnamese systems prefer storing the combination of vowel let- ter and tone mark as a singly encoded element; other implementations prefer storing the vowel letter and tone mark separately. The former implementations will use characters defined in this block along with combination forms defined in the Latin-1 Supplement and Latin Extended-A character blocks; the latter implementations will use the basic vowel let- ters in the Basic Latin, Latin-1 Supplement, and Latin Extended-A blocks along with char- acters from the Combining Diacritical Marks block. For these latter implementations, the characters U+0300 , U+0309 c , U+0303 - , U+0301 , and U+0323 c should be used in representing the Vietnamese tone marks. The characters U+0340 and U+0341 are deprecated and should not be used.
Latin Ligatures: FB00–FB06 This section of the Alphabetic Presentation forms block (U+FB00..U+FB4F) contains sev- eral common Latin ligatures, which occur in legacy encodings. By design, the Unicode standard does not provide a general mechanism to indicate where ligatures should be dis- played. Where to place a Latin ligature is a matter of typographical style as well as the orthographical rules of the language. Some languages prohibit ligatures across word boundaries. In these cases, it is preferable for the implementations to use unligated charac- ters in the backing store and provide out-of-band information to the display layer where ligatures may be placed.
166 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.2 Greek
7.2 Greek
Greek: U+0370–U+03FF The Greek script is used for writing the Greek language and (in an extended variant) the Coptic language. The Greek script had a strong influence in the development of the Latin and Cyrillic scripts. The Greek script is written in linear sequence from left to right with the occasional use of nonspacing marks. Greek letters come in upper- and lowercase pairs. Standards. The Unicode encoding of Greek is based on ISO 8859-7, which is equivalent to the Greek national standard ELOT 928. The Unicode Standard encodes Greek characters in the same relative positions as in ISO 8859-7. A number of variant and archaic characters are taken from the bibliographic standard ISO 5428. Polytonic Greek. Polytonic Greek, used for ancient Greek (classical and Byzantine), may be encoded using either combining character sequences or precomposed base plus diacritic combinations. For the latter, see the following subsection, “Greek Extended: U+1F00– U+1FFF.” Nonspacing Marks. Several nonspacing marks commonly used with the Greek script are found in the Combining Diacritical Marks range (see Table 7-1). Table 7-1. Nonspacing Marks Used with Greek
Code Name Alternative Names U+0300 varia U+0301 tonos, oxia U+0302 U+0303 U+0304 U+0306 U+0308 dialytika U+0313 psili U+0314 dasia U+0342 circumflex, tilde U+0343 comma above U+0345 iota subscript
Because the characters in the Combining Diacritical Marks block are encoded by shape, not by meaning, they are appropriate for use in Greek where applicable. However, the character U+0344 is deprecated and should not be used. The combination of dialytika plus tonos is instead represented by the sequence U+0308 - plus U+0301 . Multiple nonspacing marks applied to the same baseform character are encoded in inside- out sequence. See the general rules for applying nonspacing marks in Section 2.6, Combin- ing Characters. The basic Greek accent written in modern Greek is called tonos. It is represented by an acute accent (U+0301). The shape that the acute accent takes over Greek letters is generally steeper than that shown over Latin letters in Western European typographic traditions, and
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 167 7.2 Greek European Alphabetic Scripts in earlier editions of this standard was mistakenly shown as a vertical line over the vowel. Polytonic Greek has several contrastive accents, and the accent, or tonos, written with an acute accent is referred to as oxia, in contrast to the varia, which is written with a grave accent. U+0342 may appear as either a circumflex or a tilde: 1 versus 2. Because of this variation in form, the perispomeni was encoded distinctly from U+0303 . U+0313 and U+0343 both take the form of a raised comma over a baseform letter. U+0343 was included for compatibility reasons; U+0313 is the preferred form for general use. The nonspacing mark ypogegrammeni (also known as iota subscript in English) can be applied to the vowels alpha, eta, and omega to represent historic diphthongs. This mark appears as a small iota below the vowel. When applied to a single uppercase vowel, the iota does not appear as a subscript, but is instead normally rendered as a regular lowercase iota to the right of the uppercase vowel. This form of the iota is called prosgegrammeni (also known as iota adscript in English). In completely uppercased words, the iota subscript should be replaced by a capital iota. See SpecialCasing.txt on the CD-ROM. Archaic repre- sentations of Greek words (which did not have lowercase or accents) use the Greek capital letter iota following the vowel for these diphthongs. Such archaic representations require special case mapping. Variant Letterforms. U+03A5 has two common forms— one looks essentially like the Latin capital Y, and the other has two symmetric upper branches that curl like rams’ horns, “:”. The Y-form glyph has been chosen consistently for use in the code charts, both for monotonic and polytonic Greek. For mathematical usage, the rams’ horn form of the glyph is required. Variant forms of certain other Greek letters are encoded as separate characters in ISO/IEC 8859-7 and ISO 5428; therefore, those forms are also included as separate characters in the Unicode Standard. They include U+03C2 and U+03D0 , as well as additional forms of the capital letter upsilon that have an asymmetric hook—for example, U+03D2 . Greek Letters as Symbols. For compatibility purposes, a few Greek letters are separately encoded as symbols in other character blocks. Examples include U+00B5 µ in the Latin-1 Supplement character block and U+2126 Ω in the Letterlike Symbols character block. The use of Greek letters for mathematical variables and operators is well established. Characters from the Greek block may be used for these symbols. Punctuation-like Characters. The question of which punctuation-like characters are uniquely Greek and which ones can be unified with generic Western punctuation has no definitive answer. The Greek question mark U+037E erotimatiko “;” is encoded for compatibility. The preferred character is U+003B . Historic Letters. Historic Greek letters have been retained from ISO 5428. Coptic-Unique Letters. The Coptic script is regarded primarily as a stylistic variant of the Greek alphabet. The letters unique to Coptic are encoded in a separate range at the end of the Greek character block. Those characters may be used together with the basic Greek characters to represent the complete Coptic alphabet. Coptic text may be rendered using a font that contains the Coptic style of depicting the characters it shares with the Greek alphabet. Texts that mix Greek and Coptic languages together must employ appropriate font style associations.
168 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.2 Greek
Related Characters. For math symbols, see mathematical operators (U+2200..U+22FF). For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).
Greek Extended: U+1F00–U+1FFF The characters in this block constitute a number of precomposed combinations of Greek letters with one or more general diacritical marks; in addition, a number of spacing forms of Greek diacritical marks are provided here. In particular, these characters facilitate the representation of Polytonic Greek texts without the use of combining marks. Each of the characters contained in this block may be alternatively represented with a base letter from the Greek block followed by one or more general diacritical mark characters found in the Combining Diacritical Marks block. A canonical form for such alternative representations is specified in Chapter 3, Conformance. Spacing Diacritics. Sixteen additional spacing diacritic marks are provided in this charac- ter block for use in the representation of Polytonic Greek texts. Each has an alternative rep- resentation for use with systems that support nonspacing marks. The Unicode Standard considers the nonspacing alternative forms to be the canonical Unicode representation of the information represented by the spacing forms. The nonspacing alternatives appear in Table 7-2. Table 7-2. Greek Spacing and Nonspacing Pairs
Spacing Form Nonspacing Form 1FBD GREEK KORONIS 0313 COMBINING COMMA ABOVE 037A GREEK YPOGEGRAMMENI 0345 COMBINING YPOGEGRAMMENI 1FBF GREEK PSILI 0313 COMBINING COMMA ABOVE 1FC0 GREEK PERISPOMENI 0342 COMBINING GREEK PERISPOMENI 1FC1 GREEK DIALYTIKA AND 0308 COMBINING DIAERESIS PERISPOMENI + 0342 COMBINING GREEK PERISPOMENI 1FCD GREEK PSILI AND VARIA 0313 COMBINING COMMA ABOVE + 0300 COMBINING GRAVE ACCENT 1FCE GREEK PSILI AND OXIA 0313 COMBINING COMMA ABOVE + 0301 COMBINING ACUTE ACCENT 1FCF GREEK PSILI AND PERISPOMENI 0313 COMBINING COMMA ABOVE + 0342 COMBINING GREEK PERISPOMENI 1FDD GREEK DASIA AND VARIA 0314 COMBINING REVERSED COMMA ABOVE + 0300 COMBINING GRAVE ACCENT 1FDE GREEK DASIA AND OXIA 0314 COMBINING REVERSED COMMA ABOVE + 0301 COMBINING ACUTE ACCENT 1FDF GREEK DASIA AND PERISPOMENI 0314 COMBINING REVERSED COMMA ABOVE + 0342 COMBINING GREEK PERISPOMENI 1FED GREEK DIALYTIKA AND VARIA 0308 COMBINING DIAERESIS + 0300 COMBINING GRAVE ACCENT 1FEE GREEK DIALYTIKA AND OXIA 0308 COMBINING DIAERESIS + 0301 COMBINING ACUTE ACCENT 1FEF GREEK VARIA 0300 COMBINING GRAVE ACCENT 1FFD GREEK OXIA 0301 COMBINING ACUTE ACCENT 1FFE GREEK DASIA 0314 COMBINING REVERSED COMMA ABOVE
Decomposition of Spacing Forms. When decomposing the spacing forms, the spacing sta- tus of the implied usage must be taken into account. Unless information is present to the
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 169 7.2 Greek European Alphabetic Scripts contrary, these spacing forms would be decomposed to U+0020 followed by the nonspacing form equivalents shown in Table 7-2. In archaic forms of Greek, U+0345 and the precom- posed forms that contain it have special case mappings.
170 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.3 Cyrillic
7.3 Cyrillic
Cyrillic: U+0400–U+04FF The Cyrillic script is a member of the family of scripts strongly influenced by the Greek script. Cyrillic has traditionally been used for writing various Slavic languages, among which Russian is predominant. In the nineteenth and early twentieth centuries, Cyrillic was extended to write the non-Slavic minority languages of the former Soviet Union. The Cyrillic script is written in linear sequence from left to right with the occasional use of non- spacing marks. Cyrillic letters come in upper- and lowercase pairs. Standards. The Cyrillic block of the Unicode Standard is based on ISO 8859-5. The Uni- code Standard encodes Cyrillic characters in the same relative positions as in ISO 8859-5. Unifications. Latin characters included in those alphabets that use both Latin and Cyrillic letters are not given duplicate Cyrillic encodings. Examples include q and w for Kurdish. Historic Letters. The historic form of the Cyrillic alphabet is treated as a font style varia- tion of modern Cyrillic because the historic forms are relatively close to the modern appearance and because some of them are still in modern use in languages other than Rus- sian (for example, U+0406 “I”is used in modern Ukrainian and Byelorussian). Since the historic Cyrillic characters encoded in Unicode (U+0460.. U+0486) rarely occur in modern form, these letters are shown in the charts in an archaic font. A complete Old Cyrillic set would be obtained by rendering the whole Cyrillic section (that is, U+0400..U+0486) in that same style. Extended Cyrillic. These letters are used in alphabets for minority languages of the former Soviet Union. The scripts of some of these languages have often been revised in the past; the Unicode Standard includes only the alphabets in current use, not the rejected old letter- forms. Glagolitic. The history of the creation of the Slavic scripts and their relationship has been lost. The Unicode Standard regards Glagolitic as a separate script from Cyrillic, not as a font change from Cyrillic. This position is taken primarily because Glagolitic appears unrecog- nizably different from Cyrillic, and secondarily because Glagolitic has not grown to match the expansion of Cyrillic. The Glagolitic script is not currently supported by the Unicode Standard.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 171 7.4 Armenian European Alphabetic Scripts
7.4 Armenian
Armenian: U+0530–U+058F The Armenian script is used primarily for writing the Armenian language. The script is written from left to right and generally does not use diacritics (except for the modifier let- ters specified below). It does have upper- and lowercase pairs. Modifier Letters. In modern Armenian typography, the small marks in the group called Armenian modifier letters are placed above and to the right of other letters so that they occupy a letter position of their own. For example, the emphasis mark, the exclamation mark, and the interrogation mark are positioned to the right of the vowel in the syllable that is stressed. As the rendering of these modifiers may involve horizontal and vertical adjustments, it should ideally be done with enhanced kerning as described in Section 5.15, Locating Text Element Boundaries. Because these modifiers generally have their own width, they are treated as spacing letters in the Unicode Standard, rather than as nonspacing marks. There appears to be no recorded usage of U+0559 in Armenian text; its presence in this block is thus probably spurious. Punctuation Letters. The Armenian writing system uses many punctuation letters from other blocks, such as U+002C and U+00B7 . However, when used in Armenian, such punctuation letters should be rendered in a style compatible with the Armenian font style. In addition to U+055D , which is already included in the modifier letters, two punctuation letters are specific to Armenian: U+0589 - and U+058A . The behavior of the latter character is similar to U+00AD . It is used to indicate a line-breaking opportunity within a polysyllabic Armenian word. Its shape distinguishes it from the soft hyphen. Ligatures. Five Armenian ligatures are encoded in the Alphabetic Presentation Forms block in the range U+FB13..U+FB17. By design, the Unicode Standard does not provide a general mechanism to indicate where ligatures should be displayed. For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).
172 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.5 Georgian
7.5 Georgian
Georgian: U+10A0–U+10FF The Georgian script is used primarily for writing the Georgian language. Upper- and low- ercase pairs exist primarily in archaic forms of the script. Script Forms. The modern Georgian script is a lowercase style called mkhedruli (soldier’s). It originated as the secular derivative of a form called khutsuri (ecclesiastical) that had both uppercase and lowercase pairs. Although no longer used in most modern texts, the khutsuri style is still used for liturgical purposes; the Unicode Standard encodes the uppercase form of khutsuri as well as the lowercase letters of modern Georgian. Georgian Paragraph Separator. The Georgian paragraph separator has a distinct repre- sentation, so it has been separately encoded at U+10FB. It visually marks a paragraph end, but it must be followed by a formatting character such as U+2029 to cause a paragraph break. See the discussion of the paragraph separator in Section 13.2, Layout Controls. Other Punctuation. For the Georgian full stop, use U+0589 . For additional punctuation to be used with this script, see C0 Controls and ASCII Punctu- ation (U+0000..U+007F).
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 173 7.6 Runic European Alphabetic Scripts
7.6 Runic
Runic: U+16A0–U+16F0 The Runic script was historically used to write the languages of the early and medieval soci- eties in the German, Scandinavian, and Anglo-Saxon areas. Use of the Runic script in vari- ous forms covers a period from the first century to the nineteenth century. Some 6,000 Runic inscriptions are known. They form an indispensable source of information about the development of the Germanic languages. Historical Script. The Runic script is one of the first “historical” or “extinct” scripts to be incorporated into the Unicode Standard. The only important use of runes today is in schol- arly and popular works about the old Runic inscriptions and their interpretation. The Runic script illustrates many technical problems that are typical for this kind of script. Unlike other scripts in the Unicode Standard, which predominantly serve the needs of the modern user community—with occasional extensions for historic forms—the encoding of the Runic script attempts to suit the needs of texts from different periods of time and from distinct societies that had little contact with each other. Direction. Like other early writing systems, runes could be written either from left to right or from right to left, or moving first in one direction, then the other (boustrophedon), or following the outlines of the inscribed object. At times, characters appear in mirror image, or upside down, or both. In modern scholarly literature, Runic is written left to right. Therefore, the letters of the Runic script have a default directionality of strong left-to-right in this standard. The Runic Alphabet. Present-day knowledge about runes is incomplete. The set of graphe- mically distinct units shows greater variation in its graphical shapes than most modern scripts. The Runic alphabet changed several times during its history, both in the number and shapes of the letters contained in it. The shape of most runes can be related to some Latin capital letter, but not necessarily with a letter representing the same sound. The most conspicuous difference between the Latin and the Runic alphabets is the order of the let- ters. The Runic alphabet is known as the futhark from the name of its first six letters. The origi- nal old futhark contained 24 runes: ¡£¦ ¨¬®¯ ´·»¼¿ÁÃÄ They are usually transliterated in this way: f u ã arkgw hnij Ô pz s t be ml ° do
In England and Friesland, seven additional runes were added from the fifth century to the ninth century. In the Scandinavian countries, the futhark changed in a different way; in the eighth century, the simplified younger futhark appeared. It consists of only 16 runes, some of which are used in two different forms. The long-branch form is shown here: ¡£¦ ª° ´·½¿ Ë f u ã or k hni as t bml þ
174 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.6 Runic
The use of runes continued in Scandinavia during the Middle Ages. During that time, the futhark was influenced by the Latin alphabet and new runes were invented so that there was full correspondence with the Latin letters. Representative Glyphs. The known inscriptions can include considerable variations of shape for a given rune, sometimes to the point where the nonspecialist will mistake the shape for a different rune. There is no dominant main form for some runes, particularly for many runes added in the Anglo-Friesian and medieval Nordic systems. When transcribing a Runic inscription into its Unicode-encoded form, one cannot rely on the idealized refer- ence glyph shape in the character charts alone. One must take into account to which of the four Runic systems an inscription belongs, and be knowledgeable about the permitted form variations within each system. The reference glyphs were chosen to provide an image that distinguishes each rune visually from all other runes in the same system. For actual use, it might be advisable to use a separate font for each Runic system. Of particular note is the fact that the glyph for U+16C4 © is actually a rare form, as the more common form is already used for U+16E1 Æ . Unifications. When a rune in an earlier writing system evolved into several different runes in a later system, the unification of the earlier rune with one of the later runes was based on similarity in graphic form rather than similarity in sound value. In cases where a substan- tial change in the typical graphical form has occurred, though the historical continuity is undisputed, unification has not been attempted. When runes from different writing sys- tems have the same graphic form but a different origin and denote different sounds, they have been coded as separate characters. Long-Branch and Short-Twig. Two sharply different graphic forms, the long-branch and the short-twig form, were used for nine of the 16 Viking Age Nordic runes. Although only one form is used in a given inscription, there are runologically important exceptions. In some cases, the two forms were used to convey different meanings in later use in the medi- eval system. Therefore the two forms have been separated in the Unicode Standard. Staveless Runes. Staveless runes are a third form of the Viking Age Nordic runes, a kind of runic shorthand. The number of known inscriptions is small and the graphic forms of many of the runes show great variability between inscriptions. For this reason, staveless runes have been unified with the corresponding Viking Age Nordic runes. The correspond- ing Viking Age Nordic runes must be used to encode these characters, specifically the short-twig characters, where both short-twig and long-branch characters exist. Punctuation Marks. The wide variety of Runic punctuation marks has been reduced to three distinct characters based on simple aspects of their graphical form, as very little is known about any difference in intended meaning between marks that look different. Any other punctuation marks have been unified with shared punctuation marks elsewhere in the Unicode Standard. Golden Numbers. Runes were used as symbols for Sunday letters and golden numbers on calendar staves used in Scandinavia during the Middle Ages. To complete the number series 1–19, three additional calendar runes were added. They are included after the punctuation marks. Encoding. A total of 81 characters of the Runic script are included in the Unicode Standard. Of these, 75 are Runic letters, 3 are punctuation marks, and 3 are Runic symbols. The order of the Runic characters follows the traditional futhark order, with variants and derived runes inserted directly after the corresponding ancestor. Runic character names are based as much as possible on the sometimes several traditional names for each rune, often with the Latin transliteration at the end of the name.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 175 7.7 Ogham European Alphabetic Scripts
7.7 Ogham
Ogham: U+1680–U+169F Ogham is an alphabetic script devised to write a a very early form of Irish. Monumental Ogham inscriptions are found in Ireland, Wales, Scotland, England, and on the Isle of Man. Many of the Scottish inscriptions are undeciphered and may be in Pictish. It is probable that Ogham (Old Irish “Ogam”) was widely written in wood in early times. The main flow- ering of “classical” Ogham, rendered in monumental stone, was in the fifth and sixth cen- turies. Such inscriptions were mainly employed as territorial markers and memorials; the more ancient examples are standing stones. The script was originally written along the edges of stone where two faces meet; when writ- ten on paper, the central “stemlines” of the script can be said to represent the edge of the stone. Inscriptions written on stemlines cut into the face of the stone, instead of along its edge, are known as “scholastic” and are of a later date (post-seventh century). Notes were also commonly written in Ogham in manuscripts as recently as the sixteenth century. Structure. The Ogham alphabet consists of 26 distinct characters (feda), the first 20 of which are considered to be primary, the last 6 (forfeda) supplementary. The four primary series are called aicmí (plural of aicme, meaning “family”). Each aicme was named after its first character, (Aicme Beithe, Aicme Uatha, meaning “the B Family,” “the H Family,” and so forth). The character names used in this standard reflect the spelling of the names in modern Irish Gaelic, except that the acute accent is stripped from Úr, Éabhadh, Ór, and Ifín, and the mutation of nGéadal is not reflected. Rendering. Ogham text is read beginning from the bottom left side of a stone, continuing upward, across the top and down the right side (in the case of long inscriptions). Monu- mental Ogham was incised chiefly in a bottom-to-top direction, though there are examples of left-to-right bilingual inscriptions in Irish and Latin. Manuscript Ogham accommo- dated the horizontal left-to-right direction of the Latin script, and the vowels were written as vertical strokes as opposed to the incised notches of the inscriptions. Ogham should therefore be rendered on computers from left-to-right or from bottom-to-top (never start- ing from top-to-bottom). Forfeda (Supplementary Characters). In printed and in manuscript Ogham, the fonts are conventionally designed with a central stemline, but this convention is not necessary. In implementations without the stemline, the character U+1680 should be given its conventional width and simply left blank like U+0020 . U+169B and U+169C are used at the beginning and the end of Ogham text, particularly in manuscript Ogham. In some cases, only the Ogham feather mark is used, which can indicate the direction of the text.
176 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.8 Modifier Letters
7.8 Modifier Letters
Spacing Modifier Letters: U+02B0–U+02FF Modifier letters are an assorted collection of small signs that are generally used to indicate modifications of a preceding letter. A few may modify the following letter, and some may serve as independent letters. These signs are distinguished from diacritical marks in that modifier letters are treated as free-standing, spacing characters. They are distinguished from similar or identical appearing punctuation or symbols by the fact that the members of this block are considered to be letter characters that do not break up a word. They have the “letter” character property (see Chapter 4, Character Properties). The majority of these signs are phonetic modifiers, including the characters required for coverage of the Interna- tional Phonetic Alphabet (IPA). Phonetic Usage. Modifier letters have relatively well-defined phonetic interpretations. Their usage generally indicates a specific articulatory modification of a sound represented by another letter or intended to convey a particular level of stress or tone. In phonetic usage, the modifier letters are sometimes called “diacritics,” which is correct in the logical sense that they are modifiers of the preceding letter. However, in the Unicode Standard, the term “diacritical marks” refers specifically to nonspacing marks, whereas the codes in this block specify spacing characters. For this reason, many of the modifier letters in this block correspond to separate diacritical mark codes, which are cross-referenced in Section 14.1, Character Names List. Encoding Principles. This block includes characters that may have different semantic val- ues attributed to them in different contexts. It also includes multiple characters that may represent the same semantic values—there is no necessary one-to-one relationship. The intention of the Unicode encoding is not to resolve the variations in usage, but merely to supply implementers with a set of useful forms from which to choose. The list of usages given for each modifier letter should not be considered exhaustive. For example, the glottal stop (Arabic hamza) in Latin transliteration has been variously represented by the charac- ters U+02BC , U+02BE , and U+02C0 . Conversely, an apostrophe can have several uses; for a list, see the entry for U+02BC in the character names list. There are also instances where an IPA modifier letter is explicitly equated in semantic value to an IPA nonspacing diacritic form. Latin Superscripts. Graphically, some of the phonetic modifier signs are raised or super- scripted, some are lowered or subscripted, and some are vertically centered. Only those few forms that have specific usage in IPA or other major phonetic systems are encoded. Spacing Clones of Diacritics. Some corporate standards explicitly specify spacing and nonspacing forms of combining diacritical marks, and the Unicode Standard provides matching codes for these interpretations when practical. A number of the spacing forms are covered in the Basic Latin and Latin-1 Supplement blocks. The six common European diacritics that do not have encodings there are added as spacing characters. These forms can have multiple semantics, such as U+02D9 , which is used as an indicator of the Mandarin Chinese fifth tone. Rhotic Hook. U+02DE is defined in IPA as a free-standing modifier letter. However, in common usage, it is treated as a ligated hook on a baseform let- ter. Hence, U+0259 + U+02DE may be treated as equivalent to U+025A .
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 177 7.8 Modifier Letters European Alphabetic Scripts
Tone Letters. U+02E5..U+02E9 comprise a set of basic tone letters, defined in IPA and commonly used in detailed tone transcriptions of African and other languages. Each tone letter refers to one of five distinguishable tone levels. To represent contour tones, the tone letters are used in combinations. The rendering of contour tones follows a regular set of ligation rules that results in a graphic image of the contour (see Figure 7-4). Figure 7-4. Tone Letters +=
178 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.9 Combining Marks
7.9 Combining Marks
Combining Diacritical Marks: U+0300–U+036F The combining diacritical marks in this block are intended for general use with any script. Diacritical marks specific to a particular script are encoded with the alphabet for that script. Diacritical marks that are primarily used with symbols are defined in the Combin- ing Diacritical Marks for Symbols character block (U+20D0..U+20FF). Standards. The combining diacritical marks are derived from a variety of sources, includ- ing IPA, ISO 5426, and ISO 6937. Sequence of Base Letters and Diacritics. In the Unicode character encoding, all nonspac- ing marks, including diacritics, are encoded after the base character. For example, the Uni- code character sequence U+0061 “a” , U+0308 “þ”í , U+0075 “u” unambiguously encodes “äu”, not “aü”. The Unicode Standard convention is consistent with the logical order of other nonspacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. This convention is also in line with the way mod- ern font technology handles the rendering of nonspacing glyphic forms, so that mapping from character memory representation to rendered glyphs is simplified. (For more infor- mation on the use of diacritical marks, see Chapter 2, General Structure, and Chapter 3, Conformance.) Diacritics Positioned Over Two Base Characters. IPA and a few languages such as Tagalog use diacritics that are applied to two baseform characters. These marks apply to the previ- ous base character—just like all other combining nonspacing marks—but hang over the following letter as well. The two characters U+0360 and U+0361 are intended to be displayed as depicted in Figure 7-5. Figure 7-5. Double Diacritics ~ ~ o + @ ➠ o ~ ~ o + @ + o ➠ oo
These double diacritics always bind more loosely than all other nonspacing marks except U+0345 iota subscript, and thus sort near the end in the canonical representation. In ren- dering, the double diacritic will float above other diacritics (excluding surrounding diacrit- ics), as in Figure 7-6. Figure 7-6. Ordering of Double Diacritics ~ ~ o + @ˆ + @ + o + @¨ ➠ ôö ~ ~ o + @@ + @ˆ + o + ¨ ➠ ôö
Marks as Spacing Characters. By convention, combining marks may be exhibited in (apparent) isolation by applying them to U+0020 or to U+00A0 - .
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 179 7.9 Combining Marks European Alphabetic Scripts
This approach might be taken, for example, when referring to the diacritical mark itself as a mark, rather than using it in its normal way in text. The use of U+0020 versus U+00A0 - affects line-breaking behavior. In charts and illustrations in this standard, the combining nature of these marks is illus- trated by applying them to U+25CC , as shown in the examples throughout this standard. The Unicode Standard separately encodes clones of many common European diacritical marks as spacing characters. These related characters are cross-referenced in the character names list. Encoding Principles. Because nonspacing marks have such a wide variety of applications, the characters in this block may have multiple semantic values. For example, U+0308 = diaeresis = umlaut = double derivative. There are also cases of several different Unicode characters for equivalent semantic values; variants of include at least U+0312 , U+0326 , and U+0327 . (For more information about the difference between nonspacing marks and combining characters, see Chapter 2, General Structure.) Glyphic Variation. When rendered in the context of a language or script, like ordinary let- ters, combining marks may be subjected to systematic stylistic variation. For example, when used in Polish, U+0301 appears at a steeper angle than when it is used in French. When it is used for Greek (as oxia), it can appear nearly upright. U+030C is commonly rendered as an apostrophe when used with cer- tain letterforms. U+0326 is sometimes rendered as U+0312 on a lowercase “g” to avoid conflict with the decender. In many fonts, there is no clear distinction made between and U+0327 . Combining accents above the base glyph are usually adjusted in height for use with upper- case versus lowercase forms. In the absence of specific font protocols, combining marks are often designed as if they were applied to typical base characters in the same font. For more information, see Section 5.14, Rendering Nonspacing Marks.
Combining Marks for Symbols: U+20D0–U+20FF Diacritical marks for symbols are generally applied to mathematical or technical symbols. They can be used to extend the range of the symbol set. For example, U+20D3 can be used to express negation. Its presentation may change in those circumstances, changing length or slant. That is, U+2261 followed by U+20D3 is equivalent to U+2262 . In this case, there is a pre- composed form for the negated symbol. However, this statement does not always hold true, and U+20D3 can be used with other symbols to form the negation. For example, U+2258 followed by U+20D3 can be used to express does not correspond to, with- out requiring that a precomposed form be part of the Unicode Standard. Other nonspacing characters can also be used in mathematical expressions, of course. For example, a U+0304 is commonly used in propositional logic to indi- cate logical negation. Enclosing Diacritics. These nonspacing characters are supplied for compatibility with existing standards, allowing individual base characters to be enclosed in several ways. For example, U+2460 Å can be expressed as U+0030 “1” + U+20DD O. As with other combining characters, this one can also be applied productively; circled letter alef can be produced by the sequence:
180 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard European Alphabetic Scripts 7.9 Combining Marks
U+05D0 µ + U+20DD O. The com- bining enclosing diacritics cannot be used to enclose a sequence of base characters in plain text. For example, there is no way to represent U+246A with the , as there is no single character .
Combining Half Marks: U+FE20–U+FE2F This block consists of a number of presentation form (glyph) encodings that may be used to visually encode certain combining marks that apply to multiple base letterforms. These characters are intended to facilitate the support of such combining marks in legacy imple- mentations. Unlike the other compatibility characters, these characters do not correspond to a single nominal character or a sequence of nominal characters; rather, a discontiguous sequence of these combining half marks corresponds to a single combining mark, as depicted in Figure 7-7. The preferred forms are the double diacritics, U+0360 and U+0361. Figure 7-7. Combining Half Marks
Combining Half Marks~ ~ ~ ng+ þ + + þ → ng
U+006E U+FE22 U+0067 U+FE23
Single Combining Mark~ ~ n + þ + g → ng
U+006E U+0360 U+0067
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 181 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 8 Middle Eastern Scripts 8
The scripts in this category have a common origin in the ancient Phoenician alphabet. They include: •Hebrew •Arabic •Syriac • Thaana The Hebrew script is used in Israel and for languages of the Diaspora. The Arabic script is used to write many languages throughout the Middle East, North Africa, and certain parts of Asia. The Syriac script is used to write a number of Middle Eastern languages. These three scripts also function as major liturgical scripts, and therefore are used worldwide by various religious groups. The Thaana script is used to write Dhivehi, the language of the Republic of Maldives, an island nation in the middle of the Indian Ocean. The Middle Eastern scripts are all alphabetic, with small character sets. Words are demar- cated by spaces. Except for Thaana, these scripts include a number of distinctive punctua- tion marks. In addition, the Arabic script includes traditional forms for digits, called “Arabic-Indic digits” in the Unicode Standard. Text in these scripts is written from right to left. Implementations of these scripts must conform to the Unicode bidirectional algorithm (see Section 3.12, Bidirectional Behavior). Arabic and Syriac are cursive scripts even when typeset, unlike Hebrew and Thaana, where letters are unconnected. Most letters in Arabic and Syriac assume different forms depend- ing on their position in a word. Shaping rules for the rendering of text are specified in the block descriptions for Arabic and Syriac. Shaping rules are not required for Hebrew because only five letters have position-dependent final forms, and these forms are sepa- rately encoded. Historically, Middle Eastern scripts did not include short vowels. Nowadays, short vowels are represented by marks positioned above or below a consonantal letter. Vowels and other marks of pronunciation (“vocalization”) are encoded as combining characters, so support for vocalized text necessitates use of composed character sequences. Syriac, Thaana, and Yiddish are normally written with vocalization; Hebrew and Arabic are usually written unvocalized.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 185 8.1 Hebrew Middle Eastern Scripts
8.1 Hebrew
Hebrew: U+0590–U+05FF The Hebrew script is used for writing the Hebrew language as well as Yiddish, Judezmo (Ladino), and a number of other languages. Vowels and various other marks are written as points, which are applied to consonantal base letters; these marks are usually omitted in Hebrew, except for liturgical texts and other special applications. Five Hebrew letters assume a different graphic form when last in a word. Directionality. The Hebrew script is written from right to left. Conformant implementa- tions of Hebrew script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior.) Cursive. The Unicode Standard uses the term cursive to refer to writing where the letters of a word are connected. A handwritten form of Hebrew is known as cursive, but its rounded letters are generally unconnected, so the Unicode definition does not apply. Fonts based on cursive Hebrew exist: they are used not only to show examples of Hebrew handwriting, but also for display purposes. Standards. ISO 8859-8—Part 8. Latin/Hebrew Alphabet. The Unicode Standard encodes the Hebrew alphabetic characters in the same relative positions as in ISO 8859-8; however, there are no points or Hebrew punctuation characters in the ISO standard. Vowels and Other Marks of Pronunciation. These combining marks, generically called points in the context of Hebrew, indicate vowels or other modifications of consonantal let- ters. General rules for applying combining marks are given in Chapter 2.6, Combining Characters, and Section 3.10, Canonical Ordering Behavior. Additional Hebrew-specific behavior is described below. Hebrew points can be separated into four classes: dagesh, shin dot and sin dot, vowels, and other marks of punctuation. Dagesh, U+05BC , has the form of a dot that appears inside the let- ter that it affects. It is not a vowel, but a diacritic that affects the pronunciation of a conso- nant. The same base consonant can also have a vowel and/or other diacritics. Dagesh is the only element that goes inside a letter. The dotted Hebrew consonant shin is explicitly encoded as the sequence U+05E9, followed by U+05C1, . The shin dot is positioned on the upper-right side of the undotted base letter. Similarly, the dotted consonant sin is explicitly encoded as the sequence U+05E9, followed by U+05C2, . The sin dot is positioned on the upper-left side of the base letter. The two dots are mutually exclusive. The base letter shin can also have a dagesh, a vowel, and other diacritics. Use of the two dots with any other base character is an error. Vowels all appear below the base character that they affect, except for holam, U+05B9 , which appears above left. The following points represent vowels: U+05B0–U+05B9, U+05BB. The remaining three points are marks of pronunciation: U+05BD , U+05BF , and U+FB1E - . Meteg, also known as siluq, goes below the base character; rafe and varika go above. The varika, used in Judezmo, is a glyphic variant of rafe.
186 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.1 Hebrew
Shin and Sin. Separate characters for the dotted letters shin and sin are not included in this block. When it is necessary to distinguish between the two forms, they should be encoded as U+05E9 followed by the appropriate dot (U+05C1 or U+05C2). This practice is consistent with Israeli standard encoding. Final (Contextual Variant) Letterforms. Variant forms of five Hebrew letters are encoded as separate characters in this block, as in Hebrew standards including ISO 8859-8. These variant forms are generally used in place of the nominal letter forms at the end of words. Certain words, however, are spelled with nominal rather than final forms, particularly names and foreign borrowings in Hebrew, and some words in Yiddish. Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly from the codeset and rendered one-to-one via their own glyphs—that is, without contextual analysis. Yiddish Digraphs. The digraphs are considered to be independent characters in Yiddish. The Unicode Standard has included them as separate characters so as to distinguish certain letter combinations in Yiddish text—for example, to distinguish the digraph double vav from an occurrence of a consonantal vav followed by a vocalic vav. The use of digraphs is consistent with standard Yiddish orthography. Other letters of the Yiddish alphabet, such as pasekh alef, can be composed from other characters, although alphabetic presentation forms are also encoded. Punctuation. Most punctuation marks used with the Hebrew script are not given indepen- dent codes (that is, they are unified with Latin punctuation), except for the few cases where the mark has a unique form in Hebrew—namely, U+05BE , U+05C0 (also known as legarmeh), U+05C3 - , U+05F3 , and U+05F4 - . Note that for paired punctuation such as parentheses, the glyphs chosen to represent U+0028 and U+0029 will depend upon the direction of the rendered text. See Section 4.7, Mirrored—Normative, for more information. Cantillation Marks. Cantillation marks are used in publishing liturgical texts, including the Bible. There are various historical schools of cantillation marking; the set of marks included in the Unicode Standard follows the Israeli standard SI 1311.2. Positioning. Marks may combine with vowels and other points, and there are complex typographic rules for positioning these combinations. The vertical placement (meaning above, below, or inside) of points and marks is very well defined. The horizontal placement (meaning left, right, or center) of points is also very well defined. The horizontal placement of marks, on the other hand, is not well defined, and convention allows for the different placement of marks relative to their base character. When points and marks are located below the same base letter, the point always comes first (on the right) and the mark after it (on the left), except for the marks yetiv, U+059A , and dehi, U+05AD , which come first (on the right) and are followed (on the left) by the point. These rules are followed when points and marks are located above the same base letter: • If the point is holam, all cantillation marks precede it (on the right), except pashta, U+0599 . • Pashta always follows (goes to the left of) points.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 187 8.1 Hebrew Middle Eastern Scripts
• Holam on a sin consonant (shin base + sin dot) follows (goes to the left of) the sin dot. However, the two combining marks are sometimes rendered as a single assimilated dot. • Shin dot and sin dot are generally represented closer vertically to the base letter than other points and marks that go above it.
Alphabetic Presentation Forms: U+FB1D–U+FB4F The Hebrew characters in this block are chiefly of two types: variants of letters and marks encoded in the main Hebrew block, and precomposed combinations of a Hebrew letter or digraph with one or more vowels or pronunciation marks. This block contains all of the vocalized letters of the Yiddish alphabet. The alef lamed ligature and a Hebrew variant of the plus sign are also included. The Hebrew plus sign variant, U+FB29 , is more often used in handwriting than in print, but does occur in school textbooks. It is used by those who wish to avoid cross symbols, which can have reli- gious and historical connotations. Use of Wide Letters. Wide letterforms are used in handwriting and in print to achieve even margins. The wide-form letters in the Unicode Standard are those that are most commonly “stretched” in justification. If Hebrew text is to be rendered with even margins, justification should be left to the text formatting software. These alphabetic presentation forms are included for compatibility purposes. For the pre- ferred encoding, see Hebrew Presentation Forms, U+FB1D–U+FB4F, in the names list. For letterlike symbols, see U+2135..U+2138. For additional punctuation to be used with this script, see Section 6.1, General Punctuation. The (U+20AA) is encoded in the currency block.
188 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic
8.2 Arabic
Arabic: U+0600–U+06FF The Arabic script is used for writing the Arabic language and has been extended for repre- senting a number of other languages, such as Persian, Urdu, Pashto, Sindhi, and Kurdish. Urdu is often written with the ornate Nastaliq script variety. Some languages, such as Indo- nesian/Malay, Turkish, and Ingush, formerly used the Arabic script and now employ the Latin or Cyrillic scripts. The Arabic script is cursive, even in its printed form (see Figure 8-1). As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vow- els and various other marks may be written as combining marks called harakat, which are applied to consonantal base letters. In normal writing, however, these harakat are omitted. Directionality. The Arabic script is written from right to left. Conformant implementa- tions of Arabic script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior.) Figure 8-1. Directionality and Cursive Connection
Standards. ISO 8859-6—Part 6. Latin/Arabic Alphabet. The Unicode Standard encodes the basic Arabic characters in the same relative positions as in ISO 8859-6. ISO 8859-6, in turn, is based on ECMA-114, which was based on ASMO 449. Encoding Principles. The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different con- textual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it. Punctuation. Most punctuation marks used with the Arabic script are not given indepen- dent codes (that is, they are unified with Latin punctuation), except for the few cases where the mark has a significantly different appearance in Arabic—namely, U+060C , U+061B , U+061F , and U+066A . For paired punctuation such as parentheses, the glyphs chosen to represent U+0028 and U+0029 will depend upon the direction of the rendered text. The Non-joiner and the Joiner. The Unicode Standard provides two user-selectable for- matting codes: U+200C - and U+200D (see Figure 8-2, Figure 8-3, and Figure 8-4). The use of a non-joiner between two letters prevents them from forming a cursive connection with each other when rendered. Examples include
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 189 8.2 Arabic Middle Eastern Scripts the Persian plural suffix, some Persian proper names, and Ottoman Turkish vowels. For further discussion of joiners and non-joiners, see Section 13.2, Layout Controls. Figure 8-2. Using Joiner
Figure 8-3. Using Non-joiner
Figure 8-4. Combinations of Joiner and Non-joiner
Harakat (Vowel) Nonspacing Marks. Harakat are marks that indicate vowels or other modifications of consonant letters. The occurrence of a character in the harakat range and its depiction in relation to a dashed circle constitute an assertion that this character is intended to be applied via some process to the character that precedes it in the text stream, the base character. General rules for applying nonspacing marks are given in the Combin- ing Diacritical Marks block description section. The few marks that are placed after (to the left of) the base character are treated as ordinary spacing characters in the Unicode Stan- dard. The Unicode Standard does not specify a sequence order in case of multiple harakat applied to the same Arabic base character as there is no possible ambiguity of interpreta- tion. (For more information about the canonical ordering of nonspacing marks, see Chap- ter 2, General Structure, and Chapter 3, Conformance.) Arabic-Indic Digits. The names for the forms of decimal digits vary widely across different languages. The decimal numbering system originated in India (Devanagari vwx…) and was subsequently adopted in the Arabic world with a different appearance (Arabic …). The Europeans adopted decimal numbers from the Arabic world, although once again the forms of the digits changed greatly (European 0123…). The European forms were later adopted widely around the world and are used even in many Arabic- speaking countries in North Africa. In each case, the interpretation of decimal numbers remained the same. However, the forms of the digits changed to such a degree that they are no longer recognizably the same characters. Because of the origin of these characters, the European decimal numbers are widely known as “Arabic numerals” or “Hindi-Arabic
190 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic numerals,” whereas the decimal numbers in use in the Arabic world are widely known there as “Hindi numbers.” The Unicode Standard includes both Indic digits (including forms used with different Indic scripts), Arabic digits (with forms used in most of the Arabic world), and European digits (now used internationally). Because of this decision, the traditional names could not be retained without confusion. In addition, there are two main variants of the Arabic digits— those used in Iran and Pakistan (here called Eastern Arabic-Indic) and those used in other parts of the Arabic world. The Persian and Urdu variant digits are given separate codes in the Unicode Standard to account for the differences in appearance and directional treatment when rendering them. The Unicode Standard provides one sequence of digits for both Persian and Urdu. Their presentation depends on the font. (For a complete discussion of directional formatting in the Unicode Standard, see Section 3.12, Bidirectional Behavior.) In summary, the Unicode Standard uses the names shown in Table 8-1 . These names have been chosen to reduce the confusion involved in the use of the decimal number forms. They do not have any normative content; as with the choice of any other names, they are meant to be unique distinguishing labels and should not be viewed as favoring one culture over another. Table 8-1. Digit Names
Name Code Points Forms European U+0030..U+0039 0123456789 Arabic-Indic U+0660..U+0669 Eastern Arabic-Indic U+06F0..U+06F9 ÕÖ×ØÙÚÛÜÝÞ Indic (Devanagari) U+0966..U+096F vwxyz{|}~
Extended Arabic Letters. Arabic script is used to write major languages, such as Persian and Urdu, but it has also been used to transcribe some relatively obscure languages, such as Baluchi and Lahnda, which have little tradition in printed typography. As a result, the set of characters encoded in this section unavoidably contains spurious forms. The Unicode Standard encodes multiple forms of the Extended Arabic letters and variant digits because the character forms and usages are not well documented for a number of languages. This approach was felt to be the most practical in the interest of minimizing the risk of omitting valid characters. Koranic Annotation Signs. These characters are used in the Koran to mark pronunciation and other annotation. The two enclosing marks, U+06DD and U+06DE, are used to enclose digits. When rendered, these digits appear in a smaller size. Languages. The languages using a given character are occasionally indicated, even though this information is incomplete. When such an annotation ends with an ellipsis (…), then the languages cited are merely the known principal ones among many. Minimum Rendering Requirements. The cursive nature of the Arabic script imposes spe- cial requirements on display or rendering processes that are not typically found in Latin script-based systems. A display process must convert between the logical order in which Arabic characters are placed in backing store and the visual (or physical) order required by the display device. (See Section 3.12, Bidirectional Behavior, for a description of the conver- sion between logical and visual orders.) At a minimum, a display process must also select an appropriate glyph to depict each Ara- bic letter according to its immediate joining context; furthermore, it must substitute certain
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 191 8.2 Arabic Middle Eastern Scripts ligature glyphs for sequences of Arabic characters. The remainder of this section specifies a minimum set of rules that provide legible Arabic joining and ligature substitution behavior.
Cursive Joining Joining Classes. Each Arabic letter must be depicted by one of a number of possible con- textual glyph forms. The appropriate form is determined on the basis of its joining class and the joining class of adjacent characters. Each Arabic character falls into one of the classes shown in Table 8-2 (see ArabicShaping.txt on the CD-ROM for a complete list). Table 8-3 defines derived superclasses of the primary Arabic joining classes; those super- classes are used in the cursive joining rules. In both tables, right and left refer to visual order. Table 8-2. Primary Arabic Joining Classes
Joining Class Symbols Members Right-joining R , , , , Left-joining L None Dual-joining D , , , … Join-causing C , (kashida) These characters are distinguished from the dual-joining characters in that they do not change shape. Non-joining U - and all spacing characters (other than the above), including , , spaces, digits, punctuation, non-Arabic letters, and so on Transparent T All combining marks and format marks, including - , , , , , , , , - , and so on Table 8-3. Derived Arabic Joining Classes
Joining Class Members Right join-causing Superset of dual-joining, left-joining, and join-causing Left join-causing Superset of dual-joining, right-joining, and join-causing
Joining Rules. The following rules describe the joining behavior of Arabic letters in terms of their display (visual) order. In other words, the positions of letterforms in the included examples are presented as they would appear on the screen after the bidirectional algorithm has reordered the characters of a line of text. • An implementation may choose to restate the following rules according to logi- cal order so as to apply before the bidirectional algorithm’s reordering phase. In this case, the words right and left as used in this section would become preceding and following. In the following rules, if X refers to a character, then various glyph types representing that character are referred to as shown in Table 8-4. Table 8-4. Arabic Glyph Types
Glyph Types Description
Xn Nominal glyph form as it appears in the code charts Xr Right-joining glyph form (both right-joining and dual-joining characters may employ this form)
192 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic
Table 8-4. Arabic Glyph Types (Continued)
Glyph Types Description
Xl Left-joining glyph form (both left-joining and dual-joining characters may employ this form) Xm Dual-joining (medial) glyph form that joins on both left and right (only dual- joining characters employ this form)
R1 Transparent characters do not affect the joining behavior of base (spacing) charac- ters. For example: MEEM.N + SHADDAH.N + LAM.N Å MEEM.R + SHADDAH.N + LAM.L
R2 A right-joining character X that has a right join-causing character on the right will adopt the form Xr . For example: ALEF.N + TATWEEL.N Å ALEF.R + TATWEEL.N
R3 A left-joining character X that has a left join-causing character on the left will adopt the form Xl. R4 A dual-joining character X that has a right join-causing character on the right and a left join-causing character on the left will adopt the form Xm. For example: TATWEEL.N + MEEM.N + TATWEEL.N Å TATWEEL.N + MEEM.M + TATWEEL.N
R5 A dual-joining character X that has a right join-causing character on the right and no left join-causing character on the left will adopt the form Xr. For example: MEEM.N + TATWEEL.N Å MEEM.R + TATWEEL.N
R6 A dual-joining character X that has a left join-causing character on the left and no right join-causing character on the right will adopt the form Xl. For example: TATWEEL.N + MEEM.N Å TATWEEL.N + MEEM.L
R7 If none of the above rules applies to a character X, then it will adopt the nominal form Xn. As just noted, the - may be used to prevent joining, as in the Per- sian (Farsi) plural suffix or Ottoman Turkish vowels.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 193 8.2 Arabic Middle Eastern Scripts
Ligatures Ligature Classes. Certain types of ligatures are obligatory in Arabic script regardless of font design. Many other optional ligatures are possible, depending on font design. Because they are optional, those ligatures are not covered. For the purpose of describing the obligatory Arabic ligatures, certain Unicode characters fall into the following classes (see the end of this block description for a complete list): Alef-types: --, … Lam-types: , , … These two classes are designated below as ALEF and LAM, respectively. Ligature Rules. The following rules describe the formation of ligatures. They are applied after the preceding joining rules. Like the joining rules just discussed, the following rules describe ligature behavior of Arabic letters in terms of their display (visual) order. In the following rules, if X and Y refer to characters, then various glyph types representing combinations of these characters are referred to as shown in Table 8-5. Table 8-5. Ligature Notation
Symbol Description
(X.Y)n Nominal ligature glyph form representing a combination of an Xr form and a Yl form (X.Y)r Right-joining ligature glyph form representing a combination of an Xr form and a Ym form (X.Y)l Left-joining ligature glyph form representing a combination of an Xm form and a Yl form (X.Y)m Dual-joining (medial) ligature glyph form representing a combination of an Xm form and a Ym form
L1 Transparent characters do not affect the ligating behavior of base (nontranspar- ent) characters. For example: ALEF.R + FATHAH.N + LAM.L Å LAM-ALEF.N + FATHAH.N
L2 Any sequence with ALEFr on the left and LAMm on the right will form the ligature (ALEF.LAM)l. For example:
L3 Any sequence with ALEFr on the left and LAMl on the right will form the ligature (ALEF.LAM)n. For example:
• From the perspective of logical (or reading) order, the preceding (ALEF.LAM) ligature forms are referred to as - forms. The difference reflects the use of visual order rather than logical order to state ligature rules. Optional Features. Many other ligatures and contextual forms are optional—depending on the font and application. Some of these presentation forms are encoded in the ranges FB50..FDFB and FE70..FEFE. However, these forms should not be used in general inter- change. Moreover, it is not expected that every Arabic font will contain all of these forms, nor that these forms will include all presentation forms used by every font.
194 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic
More sophisticated rendering systems will use additional shaping and placement. For example, contextual placement of the nonspacing vowels such as fatha will provide better appearance. The justification of Arabic tends to stretch words instead of adding width to spaces. Basic stretching can be done by inserting tatweel between characters shaped by rules R2, R4..R6, L2, and L3; the best places for inserting tatweel will depend on the font and rendering software. More powerful systems will choose different shapes for characters such as kaf to fill the space in justification. Arabic Character Joining Types. Table 8-6, Table 8-7, and Table 8-8 provide a detailed list of the Arabic characters that are either right-joining or dual-joining. All other Arabic char- acters (aside from ) are non-joining, including U+06D5 . For brevity in the names, the words , , and are abbreviated using digits. Most of the extended Arabic characters are merely variations on the basic Arabic shapes, with additional or different marks. For compatibility, many precomposed forms are included. In some cases, characters occur only at the end of words in correct spelling; they are called trailing characters. Examples include , , and . When trailing characters are joining (such as ), they are classed as right-join- ing, even when similarly shaped characters are dual-joining. When characters do not join or cause joining (such as ), they are classified as transparent. • In the case of U+0647 , the glyph HEHl is shown in the code charts. This form is often used to reduce the chance of misidentifying as U+0665 - , which has a very similar shape. The nominal form of is the isolate form, which looks like U+06D5 . In the case of U+06C1 , the nominal form is not even listed; it also resembles - . The characters in these tables are grouped by shape and not by standard Arabic alphabeti- cal order. For a machine readable version of the information in these tables, see Arabic- Shaping.txt on the CD-ROM. Table 8-6. Dual-Joining Arabic Characters
Other Characters with Similar Group X X X X n r m l Shaping Behavior All letters based on the BEH form, includ- BEH ò ñing TEH, THEH, and diacritic variants of these. 0628, 062A, 062B, 0679..0680. All letters based on the NOON form. 0646, NOON ÊËÍ Ì 06B9..06BD. All letters based on the YEH form, includ- YEH Ö× Ù Øing ALEF MAKSURA. 0626, 0649, 064A, 0678, 06CC, 06CE, 06D0, 06D1. All letters based on the HAH form, includ- HAH ing KHAH, JEEM, and diacritic variants of these. 062C..062E, 0681..0687, 06BF. All letters based on the SEEN form, includ- SEEN ing SHEEN and diacritic variants of these. 0633, 0634, 069A..069C, 06FA. All letters based on the SAD form, includ- SAD ¡ ing DAD and diacritic variants of these. 0635, 0636, 069D, 069E, 06FB.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 195 8.2 Arabic Middle Eastern Scripts
Table 8-6. Dual-Joining Arabic Characters (Continued)
Other Characters with Similar Group X X X X n r m l Shaping Behavior All letters based on the TAH form, includ- TAH ¦§©¨ing ZAH and diacritic variants of these. 0637, 0638, 069F. All letters based on the AIN form, includ- AIN ®¯±°ing GHAIN and diacritic variants of these. 0639, 063A, 06A0, 06FC. All letters based on the FEH form. 0641, FEH ¶· ¹ ¸ 06A1..06A6.
All letters based on the QAF form. 0642, QAF § »½¼ 06A7, 06A8.
MEEM ÆÇÉÈAll letters based on the MEEM form. 0645.
HEH ÎÏÑÐAll letters based on the HEH form. 0647.
KNOTTED All letters based on the KNOTTED HEH HEH form. 06BE. All letters based on the HEH GOAL form, HEH GOAL but excluding HAMZA ON HEH GOAL. 06C1. All letters based on the KAF form. 0643, KAF ¾¿ÁÀ 06AC..06AE. SWASH >?@ AAll letters based on the SWASH KAF form. KAF 06AA.
All letters based on the GAF form. 06A9, GAF òôù õ 06AB, 06AF..06B4.
All letters based on the LAM form. 0644, LAM ÂÃÅ Ä 06B5..06B8.
Table 8-7. Right-Joining Arabic Characters
Group Xn Xr Other Characters with Similar Shaping Behavior All letters based on the ALEF form. 0622, 0623, 0625, ALEF ê 0627, 0671, 0672, 0673, 0675.
All letters based on the WAW form. 0624, 0648, 0676, WAW ÒÓ 0677, 06C4..06CB, 06CF.
All letters based on the DAL form, including THAL and DAL diacritic variants of these. 062F, 0630, 0688..0690.
All letters based on the REH form, including ZAIN and REH diacritic variants of these. 0631, 0632, 0691..0699. All letters based on the HEH form that show a knotted TEH MARB- ôõform on right-joining, including HAMZA ON HEH. UTA 0629, 06C0.
196 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.2 Arabic
Table 8-7. Right-Joining Arabic Characters (Continued)
Group Xn Xr Other Characters with Similar Shaping Behavior All letters based on the HEH form that show a goal form HAMZA ON ¥§on right-joining (Urdu), including HAMZA ON HEH HEH GOAL GOAL. 06C2, 06C3. YEH WITH BCThe tailed form of YEH. 06CD. TAIL YEH All letters based on the YEH BARREE form (Urdu). 06D2, BARREE 06D3.
Table 8-8. Other Arabic Character Joining Classes
Class Members Join-causing ZERO WIDTH JOINER (200D) and TATWEEL (0640). ZERO WIDTH NON-JOINER (200C), and all spacing characters (other than those mentioned in Table 8-6 or Table 8-7) are non-joining, includ- Non-joining ing HAMZA (0621), HIGH HAMZA (0674), spaces, digits, punctuation, non-Arabic letters, and so on. All combining marks and other format controls are irrelevant, including FATHATAN (064B) and other points, MADDAH ABOVE (0653), Transparent HAMZA ABOVE (0654), HAMZA BELOW (0655), SUPERSCRIPT ALEF (0670), combining Koranic annotation signs, RIGHT-TO-LEFT MARK (200F), and so on.
Arabic Presentation Forms-A: U+FB50–U+FDFF This block contains a list of presentation forms (glyphs) encoded as characters for compat- ibility. At the time of publication, there are no known implementations of all of these pre- sentation forms. As with all other compatibility encodings, these characters have a preferred encoding that makes use of noncompatibility characters. The presentation forms in this block consist of contextual (positional) variants of Extended Arabic letters, contextual variants of Arabic letter ligatures, spacing forms of Arabic dia- critic combinations, contextual variants of certain Arabic letter/diacritic combinations, and Arabic phrase ligatures. The ligatures include a large set of presentation forms for com- patibility with existing systems. However, the set of ligatures appropriate for any given Ara- bic font will generally not match this set precisely. Fonts will often include only a subset of these glyphs, and they may also include glyphs outside of this set. These glyphs are gener- ally not accessible as characters and are used only by rendering engines. The alternative (ornate) forms of parentheses for use with the Arabic script are not com- patibility characters.
Arabic Presentation Forms-B: U+FE70–U+FEFF This block contains additional Arabic presentation forms consisting of spacing or tatweel forms of Arabic diacritics, contextual variants of primary Arabic letters, and the obligatory - ligature. They are included here for compatibility with preexisting standards and legacy implementations that use these forms as characters. They can be replaced by letters from the Arabic block (U+0600..U+06FF). Implementations can handle contextual glyph
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 197 8.2 Arabic Middle Eastern Scripts shaping by rendering rules when accessing glyphs from fonts, rather than by encoding con- textual shapes as characters. Spacing and Tatweel Forms of Arabic Diacritics. For compatibility with certain imple- mentations, a set of spacing forms of the Arabic diacritics is provided here. The tatweel forms are combinations of the joining connector tatweel and a diacritic. Zero Width No-Break Space. This character (U+FEFF), which is not an Arabic presenta- tion form, is described in the Specials block (U+FFF0..U+FFFF).
198 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac
8.3 Syriac
Syriac: U+0700–U+074F The Syriac language belongs to the Aramaic branch of the family of Semitic languages. The earliest datable Syriac writing dates from the year 6 . Syriac is the active liturgical lan- guage of many communities in the Middle East (Syrian Orthodox, Assyrian, Maronite, Syr- ian Catholic, and Chaldaean) and Southeast India (Syro-Malabar and Syro-Malankara). It is also the native language of a considerable population in these communities. Syriac is divided into two dialects. West Syriac is used by the Syrian Orthodox, Maronites, and Syrian Catholics. East Syriac is used by the Assyrians (that is, Ancient Church of the East) and Chaldaeans. The two dialects are very similar with almost no difference in gram- mar and vocabulary. They differ in pronunciation and use different dialectal forms of the Syriac script. Syriac Languages. A number of modern languages and dialects employ the Syriac script in one form or another. They include the following: 1. Literary Syriac. The primary usage of Syriac script. 2. Neo-Aramaic dialects. The Syriac script is widely used for modern Aramaic lan- guages, next to Hebrew, Cyrillic, and Latin. A number of Eastern Modern Ara- maic dialects known as Swadaya (also called vernacular Syriac, modern Syriac, modern Assyrian, and so on, and spoken mostly by the Assyrians and Chaldae- ans of Iraq, Turkey, and Iran), and the Central Aramaic dialect, Turoyo (spoken mostly by the Syrian Orthodox of the Tur Abdin region in southeast Turkey), belong to this category of languages. 3. Garshuni (Arabic written in the Syriac script). It is currently used for writing Arabic liturgical texts by Syriac-speaking Christians. Garshuni employs the Arabic set of vowels and overstrike marks. 4. Christian Palestinian Aramaic (known also as Palestinian Syriac). This dialect is no longer spoken. 5. Other languages. The Syriac script was used in various historical periods for writing Armenian and some Persian dialects. Syriac speakers employed it for writing Arabic, Ottoman Turkish, and Malayalam. The Syriac Script. The Syriac script is cursive and has shaping rules that are similar to those for Arabic. The Unicode Standard does not include any presentation form characters for Syriac characters. Directionality. The Syriac script is written from right to left. Conformant implementations of Syriac script must use the Unicode bidirectional algorithm (see Section 3.12, Bidirec- tional Behavior). Syriac Type Styles. Syriac texts employ several type styles. Because all type styles use the same Syriac characters, even though their shapes vary to some extent, the Unicode Stan- dard encodes only a single Syriac script. 1. Estrangela type style. Estrangela (a word derived from Greek strongulos, mean- ing “rounded”) is the oldest type style. Ancient manuscripts use this writing style exclusively. Estrangela is used today in West and East Syriac texts for writ- ing headers, titles, and subtitles. It is the current standard in writing Syriac texts in Western scholarship.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 199 8.3 Syriac Middle Eastern Scripts
2. Serto or West Syriac type style. This type style is the most cursive of all Syriac type styles. It emerged around the eighth century and is used today in West Syr- iac texts, as well as Turoyo (Central Neo-Aramaic) and Garshuni. 3. East Syriac type style. Its early features appear as early as the sixth century; it developed into its own type style by the twelfth or thirteenth centuries. It is used today for writing East Syriac texts, as well as Swadaya (Eastern Neo-Ara- maic). It is also used today in West Syriac texts for headers, titles, and subtitles alongside the Estrangela type style. 4. Christian Palestinian Aramaic. Manuscripts of this dialect employ a script that is akin to Estrangela. It can be considered a subcategory of Estrangela. The Unicode Standard provides for usage of the type styles mentioned above. Additionally, it accommodates letters and diacritics used in Neo-Aramaic languages, Christian Palestin- ian Aramaic, and Garshuni languages. Examples are supplied in the Serto type style, except where otherwise noted. Character Names. Character names follow the East Syriac convention for naming the let- ters of the alphabet. Diacritical points use a descriptive naming—for example, . The Syriac Abbreviation Mark. U+070F (SAM) is a user- selectable zero-width formatting code that has no effect on the shaping process of Syriac characters. The SAM specifies the beginning point of a Syriac abbreviation, which is a line drawn horizontally above one or more characters, at the end of a word or of a group of characters followed by a character other than a Syriac letter or diacritic mark. A Syriac abbreviation may contain Syriac diacritics. Ideally, the Syriac abbreviation is rendered by a line that has a dot at each end and the cen- ter as in the examples. While not preferable, it has become acceptable for computers to ren- der the Syriac abbreviation as a line without the dots. The line is acceptable for the presentation of Syriac in plain text, but the presence of dots is recommended in liturgical texts. The Syriac abbreviation is used for letter numbers and contractions. A Syriac abbreviation generally extends from the last tall character in the word until the end of the word. A com- mon exception to this rule is found with letter numbers that are preceded by a preposition character, as seen in Figure 8-5. Figure 8-5. Syriac Abbreviation
= 15 (number in letters) = on the 15th (number with prefix) = =
200 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac
A SAM is placed before the character where the abbreviation begins. The Syriac abbrevia- tion begins over the character following the SAM and continues until the end of the word. Use of the SAM is demonstrated in Figure 8-6. Figure 8-6. Use of SAM
Backing Store: 3£K% Reversed: %K£3 Rendered:
Note: Modern East Syriac texts employ a punctuation mark for contractions of this sort. Ligatures and Combining Characters. Only one ligature is included in the Syriac block— U+071E . This combination is used as a unique character in the same manner as an æ ligature. A number of combining diacritics unique to Syriac are encoded, but combining characters from other blocks are also used, especially from the Arabic block. Diacritic Marks and Vowels. The function of the diacritic marks varies: they indicate vow- els (like Arabic and Hebrew), mark grammatical attributes (for example, verb versus noun, interjection), or guide the reader in the pronunciation and/or reading of the given text. “The reader of the average Syriac manuscript or book, is confronted with a bewildering profusion of points. They are large, of medium size and small, arranged singly or in twos and threes, placed above the word, below it, or upon the line.” There are two vocalization systems. The first, attributed to Jacob of Edessa (633–708 ), utilizes letters derived from Greek that are placed above (or below) the characters they modify. The second is the more ancient dotted system, which employs dots in various shapes and locations to indicate vowels. East Syriac texts exclusively employ the dotted sys- tem, whereas West Syriac texts (especially later ones and in modern times) employ a mix- ture of the two systems. Diacritic marks are nonspacing and are normally centered above or below the character. Exceptions to this rule follow: 1. U+0741 and U+0742 are used only with the letters beth, gamal (in its Syriac and Garshuni forms), galath, kaph, pe, and taw. • The qushshaya indicates that the letter is pronounced hard and unaspirated. • The rukkakha indicates that the letter is pronounced soft and aspirated. When the rukkakha is used in conjunction with the dalath, it is printed slightly to the right of the dalath’s dot below. 2. In Modern Syriac usage, when a word contains a rish and a seyame, the dot of the rish and the seyame are replaced by a rish with two dots above it. 3. The feminine dot is usually used to the left of a final taw. Punctuation. Most punctuation marks used with Syriac are found in the Latin-1 and Ara- bic blocks. The other ones are encoded in this block.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 201 8.3 Syriac Middle Eastern Scripts
Digits. Modern Syriac employs European numerals, as does Hebrew. The ordering of num- bers follows the same scheme as in Hebrew. Harklean Marks. The Harklean marks are used in the Harklean translation of the New Tes- tament. U+070B and U+070D mark the beginning of a phrase, word, or morpheme that has a marginal note. U+070C marks the end of such sections. Dalath and Rish. Prior to the development of pointing, early Syriac texts did not distin- guish between a dalath and a rish, distinguished in later periods with a dot below the former and a dot above the latter. Unicode provides U+0716 as an ambiguous character. Semkath. Unlike other letters, the joining mechanism of semkath varies through the course of history from right-joining to dual-joining. It is necessary to enter a U+200C - character after the semkath to obtain the right-joining form where required. There are two common variants of this character, U+0723 and U+0724 . They occur interchangeably in the same document, similar to the case of Greek sigma. Vowel Marks. As shown in the preceding examples, the so-called Greek vowels may be used above or below letters. As West Syriac texts employ a mixture of the Greek and dotted sys- tems, both versions are accounted for here. Miscellaneous Diacritics. Miscellaneous general diacritics are also used in Syriac text. Their usage is explained in Table 8-9. Table 8-9. Miscellaneous Diacritic Use U+0303, U+0330 These are used in Swadaya to indicate letters not found in Syriac. U+0304, U+0320 These are used for various purposes ranging from phonological to grammatical to orthographic markers. U+0307, U+0323 These points are used for various purposes—grammatical, phono- logical, and otherwise. They differ typographically and semantically from the qushshaya and rukkakha points, as well as the dotted vowel points. U+0308 This is the plural marker. It is also used in Garshuni for the Arabic teh marbuta. U+030A, U+0325 These are two other forms for the indication of qushshaya and rukka- kha. They are used interchangeably with U+0741 - and U+0742 , especially in West Syriac grammar books. U+0324 This diacritic mark is found in ancient manuscripts. It has a gram- matical and phonological function. U+032D This is one of the digit markers. U+032E This is a mark used in late and modern East Syriac texts, as well as Swadaya, to indicate a fricative pe.
Use of Characters of the Arabic Block. Syriac makes use of several characters from the Ara- bic block, including tatweel (U+0640). Modern texts use U+060C , U+061B , and U+061F . The shadda (U+0651) is also used in the core part of literary Syriac on top of a waw in the word “O”. Arabic har- akat are used in Garshuni to indicate the corresponding Arabic vowels and diacritics.
202 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac
Syriac Shaping Minimum Rendering Requirements. Rendering requirements for Syriac are similar to those for Arabic. The remainder of this section specifies a minimum set of rules that pro- vides legible Syriac joining and ligature substitution behavior. Joining Classes. Each Syriac character is represented by up to four possible contextual glyph forms. The form used is determined by its joining class and the joining class of the letter on each side. These classes are identical in behavior to those outlined for Arabic, with the addition of three extra classes that determine the behavior of final alaphs. See Table 8-10. Table 8-10. Additional Syriac Joining Classes
Joining Class Description Afj Final joining (alaph only) Afn Final non-joining except following dalath and rish (alaph only) Afx Final non-joining following dalath and rish (alaph only)
R1 An alaph that has a left-joining character to its right and a word-breaking charac- ter to its left will take the form of Afj.
R2 An alaph that has a non-left-joining character to its right, except for a dalath or rish, and a word-breaking character to its left will take the form of Afn.
R3 An alaph that has a dalath or rish to its right and a word-breaking character to its left will take the form of Afx.
The above example is in the East Syriac font style.
Syriac Cursive Joining Table 8-11, Table 8-12, and Table 8-13 provide listings of how each character is shaped in the appropriate joining type. Syriac characters not shown are non-joining. The shaping classes are included in ArabicShaping.txt on the CD-ROM. Table 8-11. Right-Joining Syriac Characters
Character Xn Xr
DALATH
DOTLESS DALATH RISH
HE
WAW
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 203 8.3 Syriac Middle Eastern Scripts
Table 8-11. Right-Joining Syriac Characters (Continued)
Character Xn Xr
ZAIN
YUDH HE
SADHE
RISH
TAW
Table 8-12. Dual-Joining Syriac Characters
Character Xn Xr Xm Xl BETH
GAMAL
GAMAL GARSHUNI !
HETH "#$%
TETH &'()
TETH GARSHUNI *+,-
YUDH +/01
KAPH 2345
LAMADH 67 8 9
MIM :;<=
NUN >?@A
SEMKATH BCDE
SEMKATH FINAL FGHI
E JKLM
PE NOPQ
PE REVERSED RSTU
QAPH VWXY
SHIN Z[\]
204 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard Middle Eastern Scripts 8.3 Syriac
Table 8-13. Alaph-Joining Syriac Characters
Type Style An Ar Afj Afn Afx Estrangela hiihh
Serto (West Syriac) ^__``
East Syriac 3 ag43
Ligatures Ligature Classes. As in other scripts, ligatures in Syriac vary depending upon the font style. Table 8-14 identifies the principal valid ligatures for each font style. When applicable, these ligatures are obligatory, unless denoted with an asterisk (*). Table 8-14. Syriac Ligatures
Serto (West Characters Estrangela East Syriac Sources Syriac) ALAPH LAMADH N/A Dual-joining N/A Beth Gazo GAMAL LAMADH N/A Dual-joining* N/A Armalah GAMAL E N/A Dual-joining* N/A Armalah HE YUDH N/A N/A Right-joining* Qdom YUDH TAW N/A Right-joining* N/A Armalah* KAPH LAMADH N/A Dual-joining* N/A Shhimo KAPH TAW N/A Right-joining* N/A Armalah LAMADH SPACE ALAPH N/A Right-joining* N/A Nomocanon LAMADH ALAPH Right-joining* Right-joining Right-joining* BFBS LAMADH LAMADH N/A Dual-joining* N/A Shhimo NUN ALAPH N/A Right-joining* N/A Shhimo SEMKATH TETH N/A Dual-joining* N/A Qurobo SADHE NUN Right-joining* Right-joining* Right-joining* Mushhotho RISH SEYAME Right-joining Right-joining Right-joining BFBS TAW ALAPH Right-joining* N/A Right-joining* Qdom TAW YUDH N/A N/A Right-joining*
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 205 8.4 Thaana Middle Eastern Scripts
8.4 Thaana
Thaana: U+0780–U+07BF The Thaana script is used to write the modern Dhivehi language of the Republic of Maldives, a group of atolls in the Indian Ocean. Like the Arabic script, Thaana is written from right to left and uses vowel signs, but it is not cursive. The basic Thaana letters have been extended by a small set of dotted letters used to transcribe Arabic. The use of modified Thaana letters to write Arabic dates from the middle of the twentieth century. Loan words from Arabic may be written in the Arabic script, although this custom is not very prevalent today. (See Section 8.2, Arabic.) Directionality. The Thaana script is written from right to left. Conformant implementa- tions of Thaana script must use the Unicode bidirectional algorithm (see Section 3.12, Bidi- rectional Behavior). Vowels. Consonants are always written with either a vowel sign (U+07A6..U+07AF) or the null vowel sign (U+07B0 ). U+0787 with the null vowel sign denotes a glottal stop. Numerals. Both European (U+0030..U+0039) and Arabic digits (U+0660..U+0669) are used. European numbers are used more commonly, and have left-to-right display direc- tionality in Thaana. Arabic numeric punctuation is used with digits, whether Arabic or European. Punctuation. The Thaana script uses spaces between words. It makes use of a mixture of Arabic and European punctuation, with rules of usage not clearly defined. Sentence-final punctuation is now generally shown with a single period (U+002E “.” ), but may also use a sequence of two periods (U+002E followed by U+002E). Phrases may be sepa- rated with a comma (usually U+060C ) or with a single period (U+002E). Colons, dashes, and double quotation marks are also used in the Thaana script. In addi- tion, Thaana makes use of U+061F and U+061B - . Character Names and Arrangement. The character names are based on the names used in the Republic of Maldives. The character name at U+0794, yaa, is found in some sources as yaviyani, but the former is more common today. Characters are listed in Thaana alphabet- ical order from haa to ttaa for the Thaana letters, followed by the extended characters in Arabic alphabetical order from hhaa to waavu.
206 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 9 South and Southeast Asian Scripts 9
The following scripts are encoded in this group: • Devanagari •Bengali •Gurmukhi •Gujarati •Oriya •Tamil •Telugu •Kannada • Malayalam •Sinhala •Thai •Lao •Tibetan •Myanmar •Khmer The scripts of South and Southeast Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letter- forms. With minor historical exceptions, they are written from left to right; many use no interword spacing but use spaces or marks between phrases. They are all essentially alpha- betic scripts in which most symbols stand for a consonant plus an inherent vowel (usually the sound “ah”). Word-initial vowels in many of these scripts have special symbols, but word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign. Most of the scripts of South and Southeast Asia, from the Himalayas in the north to Sri Lanka in the south, from Pakistan in the west to the eastern-most islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the Asoka edicts of the third century , were written in two scripts, Kharosthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 209 South and Southeast Asian Scripts important administrative language of the Middle East at that time. Kharosthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendents of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of San- skrit literature and the archetypal representative of the northern branch of the script fam- ily. This northern branch includes such modern scripts as Myanmar (Burmese), Gurmukhi, and Tibetan; the southern branch includes scripts such as Malayalam. The major scripts of India proper, including Devanagari, are all encoded according to a common plan, so that comparable characters are in the same order and relative location. This structural arrangement, which facilitates transliteration to some degree, is based upon the Indian national standard (ISCII) encoding for these scripts. Sinhala, Myanmar, and Khmer have virama-based models that do not map to ISCII. Thai and Lao share a different model, based on the Thai Industrial Standard encoding for Thai, which uses visual order- ing and is not laid out compatibly to ISCII. Tibetan stands apart from all of these models, reflecting its somewhat different structure and usage. Many of the character names in this group of scripts represent the same sounds, and naming conventions are similar across the range.
210 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari
9.1 Devanagari
Devanagari: U+0900–U+097F The Devanagari script is used for writing classical Sanskrit and its modern historical deriv- ative, Hindi. Extensions to Devanagari are used to write other related languages of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattis- garhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali. All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the Southeast Asian scripts (Thai, Lao, Khmer, and Myanmar), are historically connected with the Devanagari script as descendants of the ancient Brahmi script. The entire family of scripts shares a large number of structural features. The principles of the Indic scripts are covered in some detail in this introduction to the Devanagari script. The remaining introductions to the Indic scripts are abbreviated but highlight any differences from Devanagari where appropriate. Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian Standard Code for Information Interchange). The ISCII standard of 1988 differs from and is an update of earlier ISCII standards issued in 1983 and 1986. The Unicode Standard encodes Devanagari characters in the same relative position as those coded in positions A0–F416 in the ISCII-1988 standard. The same character code layout is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi, Guja- rati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout emphasizes the structural similarities of the Brahmi scripts and follows the stated intention of the Indian coding standards to enable one-to-one mappings between analogous coding posi- tions in different scripts in the family. Sinhala, Thai, Lao, Khmer, and Myanmar depart to a greater extent from the Devanagari structural pattern, so the Unicode Standard does not attempt to provide any direct mappings for these scripts to the Devanagari order. In November 1991, at the time The Unicode Standard, Version 1.0, was published, the Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS) 13194:1991. This new version partially modified the layout and repertoire of the ISCII- 1988 standard. Because of these events, the Unicode Standard does not precisely follow the layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a super- set of the ISCII-1991 repertoire except for a number of new Vedic extension characters defined in IS 13194:1991 Annex G—Extended Character Set for Vedic. Modern, non-Vedic texts encoded with ISCII-1991 may be automatically converted to Unicode code values and back to their original encoding without loss of information. Encoding Principles. The writing systems that employ Devanagari and other Indic scripts constitute a cross between syllabic writing systems and phonemic writing systems (alpha- bets). The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV ) core and, optionally, one or more preceding consonants, with a canonical structure of ((C)C)CV. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved, but the writing system is built on phonological principles and tends to correspond quite closely to pronunciation.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 211 9.1 Devanagari South and Southeast Asian Scripts
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devana- gari script. These pieces consist of three distinct character types: consonant letters, inde- pendent vowels, and dependent vowel signs. In a text sequence, these characters are stored in logical (phonetic) order.
Principles of the Script Rendering Devanagari Characters. Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. A character’s appearance is affected by its ordering with respect to other characters, the font used to ren- der the character, and the application or system environment. These variables can cause the appearance of Devanagari characters to differ from their nominal glyphs (used in the code charts). Additionally, a few Devanagari characters cause a change in the order of the displayed char- acters. This reordering is not commonly seen in non-Indic scripts and occurs indepen- dently of any bidirectional character reordering that might be required. Consonant Letters. Each consonant letter represents a single consonantal sound but also has the peculiarity of having an inherent vowel, generally the short vowel /a/ in Devanagari and the other Indic scripts. Thus U+0915 represents not just /k/ but also /ka/. In the presence of a dependent vowel, however, the inherent vowel associated with a consonant letter is overridden by the dependent vowel. Consonant letters may also be rendered as half-forms, which are presentation forms used to depict the initial consonant in consonant clusters. These half-forms do not have an inher- ent vowel. Their rendered forms in Devanagari often resemble the full consonant but are missing the vertical stem, which marks a syllabic core. (The stem glyph is graphically and historically related to the sign denoting the inherent /a/ vowel.) Some Devanagari consonant letters have alternative presentation forms whose choice depends upon neighboring consonants. This variability is especially notable for U+0930 , which has numerous different forms, both as the initial element and as the final element of a consonant cluster. Only the nominal forms, rather than the contextual alternatives, are depicted in the code chart. The traditional Sanskrit/Devanagari alphabetic encoding order for consonants follows articulatory phonetic principles, starting with velar consonants and moving forward to bilabial consonants, followed by liquids and then fricatives. ISCII and the Unicode Stan- dard both observe this traditional order. Independent Vowel Letters. The independent vowels in Devanagari are letters that stand on their own. The writing system treats independent vowels as orthographic CV syllables in which the consonant is null. The independent vowel letters are used to write syllables that start with a vowel. Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner of writing noninherent vowels and are generally referred to as vowel signs, or as matras in San- skrit. The dependent vowels do not stand alone; rather, they are visibly depicted in combi- nation with a base letterform. A single consonant, or a consonant cluster, may have a dependent vowel applied to it to indicate the vowel quality of the syllable, when it is differ- ent from the inherent vowel. Explicit appearance of a dependent vowel in a syllable over- rides the inherent vowel of a single consonant letter. The greatest variation among different Indic scripts is found in the way that the dependent vowels are applied to base letterforms. Devanagari has a collection of nonspacing depen- dent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or
212 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari consonant cluster. Other Indic scripts generally have one or more of these forms, but what is a nonspacing mark in one script may be a spacing mark in another. Also, some of the Indic scripts have single dependent vowels that are indicated by two or more glyph compo- nents—and those glyph components may surround a consonant letter both to the left and right or may occur both above and below it. The Devanagari script has only one character denoting a left-side dependent vowel sign: U+093F . Other Indic scripts either have no such vowel signs (Telugu and Kannada) or include as many as three of these signs (Bengali, Tamil, and Malayalam). A one-to-one correspondence exists between the independent vowels and the dependent vowel signs. Independent vowels are sometimes represented by a sequence consisting of the independent form of the vowel /a/ followed by a dependent vowel sign. For example, Figure 9-1 illustrates this relationship (see the notation formally described in the “Rules for Rendering” later in this section). Figure 9-1. Dependent Versus Independent Vowels
/a/ + Dependent Vowel Independent Vowel
+→+ ≈ An Ivs Ivs An In Ì+ → Ì ≈ +→+ ≈ An Uvs An Uvs Un + Î → Î ≈
The combination of the independent form of the default vowel /a/ (in the Devanagari script, U+0905 ) with a dependent vowel sign may be viewed as an alternative spelling of the phonetic information normally represented by an isolated inde- pendent vowel form. However, these two representations should not be considered equiva- lent for the purposes of rendering. Higher-level text processes may choose to consider these alternative spellings equivalent in terms of information content, but such an equivalence is not stipulated by this standard. Virama. Devanagari and other Indic scripts employ a sign known as the virama, halant, or vowel omission sign. A virama sign (for example, U+094D ) nominally serves to cancel (or kill) the inherent vowel of the consonant to which it is applied. The virama functions as a combining character, with its shape varying from script to script. When a consonant has lost its inherent vowel by the application of virama, it is known as a dead consonant; in contrast, a live consonant is one that retains its inherent vowel or is written with an explicit dependent vowel sign. In the Unicode Standard, a dead consonant is defined as a sequence consisting of a consonant letter followed by a virama. The default rendering for a dead consonant is to position the virama as a combining mark bound to the consonant letterform.
For example, if Cn denotes the nominal form of consonant C, and Cd denotes the dead consonant form, then a dead consonant is encoded as shown in Figure 9-2. Consonant Conjuncts. The Indic scripts are noted for a large number of consonant con- junct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 213 9.1 Devanagari South and Southeast Asian Scripts
Figure 9-2. Dead Consonants
+ → TAn VIRAMAn TAd ±±+→Ü Ü letterforms. This abbreviation takes place only in the context of a consonant cluster. An orthographic consonant cluster is defined as a sequence of characters that represents one or more dead consonants (denoted Cd) followed by a normal, live consonant letter (denoted Cl) or an independent vowel letter. Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if such a glyph is available in the current font(s). In the absence of a conjunct glyph, the one or more dead consonants that form part of the cluster are depicted using half-form glyphs. In the absence of half-form glyphs, the dead consonants are depicted using the nominal con- sonant forms combined with visible virama signs (see Figure 9-3). Figure 9-3. Conjunct Formations
₍₎ + → + ₍₎ + → GAd DHAl GAh DHAn KAd SSHAl K.SSHAn Ü +→´ó´ Ü +→ÆS
₍₎ + → ₍₎ + → + KAd KAl K.KAn RAd RIn RIn RAsup Ü +→P ¿Ü +→F
A number of types of conjunct formations appear in these examples: (1) a half-form of GA in its combination with the full form of DHA; (2) a vertical conjunct K.KA; (3) a fully ligated conjunct K.SSHA, in which the components are no longer distinct; and (4) a rare conjunct formed with an independent vowel letter, in this case the vowel letter RI (also known as vocalic r). Note that in example (4) in Figure 9-3, the dead consonant RAd is depicted with the nonspacing combining mark RAsup (repha). A well-designed Indic script font may contain hundreds of conjunct glyphs, but they are not encoded as Unicode characters because they are the result of ligation of distinct letters. Indic script rendering software must be able to map appropriate combinations of charac- ters in context to the appropriate conjunct glyphs in fonts. When an independent vowel appears as the terminal element of a consonant cluster, as in example (4) in Figure 9-3, the independent vowel should not be depicted as a dependent vowel sign, but as an independent vowel letterform. Explicit Virama. Normally a virama character serves to create dead consonants that are, in turn, combined with subsequent consonants to form conjuncts. This behavior usually results in a virama sign not being depicted visually. Occasionally, however, this default behavior is not desired when a dead consonant should be excluded from conjunct forma- tion, in which case the virama sign is visibly rendered. To accomplish this goal, the Unicode Standard adopts the convention of placing the character U+200C - immediately after the encoded dead consonant that is to be excluded from conjunct
214 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari formation. In this case, the virama sign is always depicted as appropriate for the consonant to which it is attached. For example, in Figure 9-4,S the use of - prevents the default forma- tion of the conjunct form (K.SSHAn). Figure 9-4. Preventing Conjunct Forms
+ + → + KAd ZWNJ SSHAl KAd SSHAn
Ü ++ZW Æ → ÆÜ NJ
Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, the dead consonant form is often absorbed into the conjunct form, such that it is no longer dis- tinctly visible. In other contexts, however, the dead consonant may remain visible as a half- consonant form. In general, a half-consonant form is distinguished from the nominal consonant form by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the consonant form. In other cases, the vertical stem remains but some part of its right-side geometry is missing. In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct formation yet still not appear with an explicit virama. In these cases, the half-form of the consonant is used. To explicitly encode a half-consonant form, the Unicode Standard adopts the convention of placing the character U+200D immediately after the encoded dead consonant. The denotes a nonvisible letter that presents linking or cursive joining behavior on either side (that is, to the previous or fol- lowing letter). Therefore, in the present context, the may be consid- ered to present a context to which a preceding dead consonant may join so as to create the half-form of the consonant.
For example, if Ch denotes the half-form glyph of consonant C, then a half-consonant form is encoded as shown in Figure 9-5. Figure 9-5. Half-Consonants
+ + → + KAd ZWJ SSHAl KAh SSHAn
Ü ++ZW Æ → þÆ J
• In the absence of the S , this sequence would normally pro- duce the full conjunct form (K.SSHAn). This encoding of half-consonant forms also applies in the absence of a base letterform. That is, this technique may also be used to encode independent half-forms, as shown in Figure 9-6. Consonant Forms. In summary, each consonant may be encoded such that it denotes a live consonant, a dead consonant that may be absorbed into a conjunct, or the half-form of a dead consonant (see Figure 9-7).
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 215 9.1 Devanagari South and Southeast Asian Scripts
Figure 9-6. Independent Half-Forms
+ → GAd ZWJ GAh
Ü + ZW → ó J
Figure 9-7. Consonant Forms → KAl
+→ Ü Ü KAd
+→Ü + ZW þ J KAh
Rendering Rules for Rendering. The following provides more formal and detailed rules for minimal rendering of Devanagari as part of a plain text sequence. It describes the mapping between Unicode characters and the glyphs in a Devanagari font. It also describes the combining and ordering of those glyphs. These rules provide minimal requirements for legibly rendering interchanged Devanagari text. As with any script, a more complex procedure can add rendering characteristics, depending on the font and application. It is important to emphasize that in a font that is capable of rendering Devanagari, the set of glyphs is greater than the number of Devanagari Unicode characters. Notation. In the next set of rules, the following notation applies:
Cn Nominal glyph form of consonant C as it appears in the code charts.
Cl A live consonant, depicted identically to Cn.
Cd Glyph depicting the dead consonant form of consonant C.
Ch Glyph depicting the half-consonant form of consonant C.
Ln Nominal glyph form of a conjunct ligature consisting of two or more component consonants. A conjunct ligature composed of two consonants X and Y is also denoted X.Yn.
RAsup A nonspacing combining mark glyph form of the U+0930 positioned above or attached to the upper part of a base glyph form. This form is also known as repha.
RAsub A nonspacing combining mark glyph form of the U+0930 positioned below or attached to the lower part of a base glyph form.
Vvs Glyph depicting the dependent vowel sign form of a vowel V.
216 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari
VIRAMAn The nominal glyph form nonspacing combining mark depict- ing U+094D . • A virama character is not always depicted; when it is depicted, it adopts this nonspacing mark form. Dead Consonant Rule. The following rule logically precedes the application of any other rule to form a dead consonant. Once formed, a dead consonant may be subject to other rules described next.
R1 When a consonant Cn precedes a VIRAMAn , it is considered to be a dead conso- nant Cd . A consonant Cn that does not precede VIRAMAn is considered to be a live consonant Cl . + → TAn VIRAMAn TAd ±±+→Ü Ü
Consonant RA Rules. The character U+0930 takes one of a num- ber of visual forms depending on its context in a consonant cluster. By default, this letter is depicted with its nominal glyph form (as shown in the code charts). In two contexts, it is depicted using a nonspacing glyph form that combines with a base letterform.
R2 If the dead consonant RAd precedes either a consonant or an independent vowel, then it is replaced by the superscript nonspacing mark RAsup , which is positioned so that it applies to the logically subsequent element in the memory representation. + → + Displayed RAd KAl KAl RAsup Output ¿Ü +→ + F → F
12+ → 2 + 1 RAd RAd RAd RAsup ¿Ü +→¿Ü ¿Ü + F → ¿Ü F
R3 If the superscript mark RAsup is to be applied to a dead consonant and that dead consonant is combined with another consonant to form a conjunct ligature, then the mark is positioned so that it applies to the conjunct ligature form as a whole. + + → + Displayed RAd JAd NYAl J.NYAn RAsup Output ¿Ü +→¦Ü + © + F → F
R4 If the superscript mark RAsup is to be applied to a dead consonant that is subse- quently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster. + + → + + Displayed RAd GAd GHAl GAh GHAl RAsup Output ¿Ü +→ Ü + ¢ ó + ¢ + F → ó¢ F
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 217 9.1 Devanagari South and Southeast Asian Scripts
R5 In conformance with the ISCII standard, the half-consonant form RRAh is repre- sented as eyelash-RA. This form of RA is commonly used in writing Marathi. + → RRAn VIRAMAn RRAh ¿ +→Ü :
R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant RAd precedes , then the half -consonant form RAh , depicted as eyelash-RA, is used instead of RAsup . + → RAd ZWJ RAh
¿Ü + ZW → : J
R6 Except for the dead consonant RAd , when a dead consonant Cd precedes the live consonant RAl , then Cd is replaced with its nominal form Cn , and RA is replaced by the subscript nonspacing mark RAsub , which is positioned so that it applies to Cn .
THA + RA → THA + RA Displayed d l n sub Output «Ü +→¿ « + ã → «ã
R7 For certain consonants, the mark RAsub may graphically combine with the conso- nant to form a conjunct ligature form. These combinations, such as the one shown here, are further addressed by the ligature rules described shortly.
PHA + RA → PHA + RA Displayed d l n sub Output ¸Ü +→¿ ¸ + ã → p
R8 If a dead consonant (other than RAd ) precedes RAd , then the substitution of RA for RAsub is performed as described above; however, the VIRAMA that formed RAd remains so as to form a dead consonant conjunct form. + → + + → TAd RAd TAn RAsub VIRAMAn T.RA d ±Ü +→¿Ü ± + ã + Ü → dÜ
A dead consonant conjunct form that contains an absorbed RAd may subsequently combine to form a multipart conjunct form. + → T.RA d YAl T.R.YA n dÜ +→½î½
218 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari
Modifier Mark Rules. In addition to vowel signs, three other types of combining marks may be applied to a component of an orthographic syllable or to the syllable as a whole: nukta, bindus, and svaras. R9 The nukta sign, which modifies a consonant form, is placed immediately after the consonant in the memory representation and is attached to that consonant in ren- dering. If the consonant represents a dead consonant, then NUKTA should precede VIRAMA in the memory representation. + + → KAn NUKTAn VIRAMAn QAd +→ + Ü Ü
R10 The other modifying marks, bindus and svaras, apply to the orthographic syllable as a whole and should follow (in the memory representation) all other characters that constitute the syllable. In particular, the bindus should follow any vowel signs, and the svaras should come last. The relative placement of these marks is horizontal rather than vertical; the horizontal rendering order may vary according to typo- graphic concerns. + + KAn AAvs CANDRABINDUn +→Ë + Ë
Ligature Rules. Subsequent to the application of the rules just described, a set of rules gov- erning ligature formation apply. The precise application of these rules depends on the availability of glyphs in the current font(s) being used to display the text. R11 If a dead consonant immediately precedes another dead consonant or a live conso- nant, then the first dead consonant may join the subsequent element to form a two- part conjunct ligature form. + → + → JAd NYAl J.NYAn TTAd TTHAl TT.TTHAn ¦Ü + © → ªÜ + « → _
R12 A conjunct ligature form can itself behave as a dead consonant and enter into further, more complex ligatures. + + → + → SAd TAd RAn SAd T.R.A n S.T.RAn ÇÜ + ±Ü + ¿ → ÇÜ + d →
A conjunct ligature form can also produce a half-form. + → + K.SSHAd YAl K.SSHh YAn Sð½Ü + ½ →
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 219 9.1 Devanagari South and Southeast Asian Scripts
R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of the application of rule R2, then the consonant or ligature form may join with RAsub to form a multipart conjunct ligature (see rule R2 for more information). + → + → KAn RAsub K.RAn PHAn RAsub PH.RAn + ã → R ¸ + ã → p
R14 In some cases, other combining marks will also combine with a base consonant, either attaching at a nonstandard location or changing shape. In minimal rendering there are only two cases, RAl with Uvs or UUvs . + → + → RAl Uvs RUn RAl UUvs RUUn ¿ + G → L ¿ + H → M
Memory Representation and Rendering Order. The order for storage of plain text in Devanagari and all other Indic scripts generally follows phonetic order; that is, a CV sylla- ble with a dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in the memory representation. This order is employed by the ISCII standard and corresponds with both the phonetic and keying order of textual data (see Figure 9-8). Figure 9-8. Rendering Order
Character Order Glyph Order + → + KAn Ivs Ivs KAn Ì+→Ì
Because Devanagari and other Indic scripts have some dependent vowels that must be depicted to the left side of their consonant letter, the software that renders the Indic scripts must be able to reorder elements in mapping from the logical (character) store to the pre- sentational (glyph) rendering. For example, if Cn denotes the nominal form of consonant C, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering of glyphs with respect to encoded characters occurs as just shown.
R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it is always written to the extreme left of the orthographic syllable. If the orthographic syl- lable contains a consonant cluster, then this vowel is always depicted to the left of that cluster. For example: + + → + → + TAd RAl Ivs T.RA n Ivs Ivs T.RA d ±Ü +→+→¿ + Ì d Ì Ìd
Sample Half-Forms. Table 9-1 shows examples of half-consonant forms that are com- monly used with the Devanagari script. These forms are glyphs, not characters. They may be encoded explicitly using as shown; in normal conjunct formation, they may be used spontaneously to depict a dead consonant in combination with subse- quent consonant forms.
220 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari
Table 9-1. Sample Half-Forms
þµüÜ ZW Ü ZW J J ·åÜ ZW Ü ZW J J ó¸Ü ZW Ü ZW J J ¢ìºêÜ ZW Ü ZW J J ¤ö»Ü ZW Ü ZW J J ¦ú¼Ü ZW Ü ZW J J ¨½ñÜ ZW Ü ZW J J ©÷ÁòÜ ZW Ü ZW J J °øÄôÜ ZW Ü ZW J J ±íÅõÜ ZW Ü ZW J J ²ûÆùÜ ZW Ü ZW J J ´çÇïÜ ZW Ü ZW J J
Sample Ligatures. Table 9-2 shows examples of conjunct ligature forms that are commonly used with the Devanagari script. These forms are glyphs, not characters. Not every writing system that employs this script uses all of these forms; in particular, many of these forms are used only in writing Sanskrit texts. Furthermore, individual fonts may provide fewer or more ligature forms than are depicted here. Table 9-2. Sample Ligatures Pª«_Ü Ü ±Q««nÜ Ü ¿R¬ `Ü Ü ÆS¬¬aÜ Ü £V¬®bÜ Ü £W±Ü Ü ±c £ X±Ü Ü ¿d £¢Yµµ¾Ü Ü ©¦¸¿pÜ Ü ¦©ÅÜ Ü ¿o ³¢fȼrÜ Ü ³³gȽsÜ Ü ³´hÈÁtÜ Ü
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 221 9.1 Devanagari South and Southeast Asian Scripts
Table 9-2. Sample Ligatures (Continued) ³ºiÈÄuÜ Ü ³»jÈÜ C N ³¼k¿Ü A L ³½l¿Ü B M ³ÄmÇdÜ Ü ªª^Ü
Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants, half-forms are also used to depict conjunct ligature forms. A sample of such forms is shown in Table 9-3. These forms are glyphs, not characters. They may be encoded explicitly using as shown; in normal conjunct formation, they may be used spontane- ously to depict a conjunct ligature in combination with subsequent consonant forms. Table 9-3. Sample Half-Ligature Forms
ÆÜ Ü ZW ð J ¦©Ü Ü ZW J ±±Ü Ü ZW ë J ±¿Ü Ü ZW î J Å¿Ü Ü ZW é J
Combining Marks. Devanagari and other Indic scripts have a number of combining marks that could be considered diacritic. One class of these marks, known as bindus, is repre- sented by U+0901 and U+0902 - . These marks indicate nasalization or final nasal closure of a syllable. U+093C is a true diacritic. It is used to extend the basic set of consonant letters by modifying them (with a subscript dot in Devanagari) to create new letters. U+0951..U+0954 are a set of combining marks used in transcription of Sanskrit texts. Digits. Each Indic script has a distinct set of digits appropriate to that script. These digits may or may not be used in ordinary text in that script. European digits have displaced the Indic script forms in modern usage in many of the scripts. Some Indic scripts—notably Tamil—lack a distinct digit for zero. Punctuation and Symbols. U+0964 is similar to a full stop. Corre- sponding forms occur in many other Indic scripts. U+0965 marks the end of a verse in traditional texts. Many modern languages written in the Devanagari script intersperse punctuation derived from the Latin script. Thus U+002C and U+002E are freely used in writing Hindi, and the danda is usually restricted to more traditional texts. Encoding Structure. The Unicode Standard organizes the nine principal Indic scripts in blocks of 128 encoding points each. The first six columns in each script are isomorphic with the ISCII-1988 encoding, except that the last 11 positions (U+0955..U+095F in
222 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.1 Devanagari
Devanagari, for example), which are unassigned or undefined in ISCII-1988, are used in the Unicode encoding. The seventh column in each of these scripts, along with the last 11 positions in the sixth column, represent additional character assignments in the Unicode Standard that are matched across all nine scripts. For example, positions U+xx66..U+xx6F and U+xxE6.. U+xxEF code the Indic script digits for each script. The eighth column for each script is reserved for script-specific additions that do not cor- respond from one Indic script to the next.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 223 9.2 Bengali South and Southeast Asian Scripts
9.2 Bengali
Bengali: U+0980–U+09FF The Bengali script is a North Indian script closely related to Devanagari. It is used to write the Bengali language primarily in West Bengal state (India) and in the nation of Ban- gladesh. It is also used to write Assamese in Assam (India) and a number of other minority languages (Daphla, Garo, Hallam, Khasi, Manipuri, Mizo, Munda, Naga, Rian, and Santali) in northeastern India. Two-Part Vowel Signs. The Bengali script, along with a number of other Indic scripts, makes use of two-part vowel signs; in these vowels one-half of the vowel is placed on each side of a consonant letter or cluster—for example, U+09CB and U+09CC . The vowel signs are coded in each case in the position in the charts isomorphic with the corresponding vowel in Devanagari. Hence U+09CC - is isomorphic with U+094C . To provide compatibility with existing implementations of the scripts that use two-part vowel signs, the Unicode Standard explicitly encodes the right half of these vowel signs; for example, U+09D7 represents the right-half glyph component of U+09CC . Special Characters. U+09F2..U+09F9 are a series of Bengali additions for writing currency and fractions. Rendering Behavior. For rendering of the Bengali script, see the rules for rendering in Section 9.1, Devanagari.
224 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.3 Gurmukhi
9.3 Gurmukhi
Gurmukhi: U+0A00–U+0A7F The Gurmukhi script is a North Indian script historically derived from an older script called Lahnda. It is quite closely related to Devanagari structurally. Gurmukhi is used to write the Punjabi language in the Punjab in India. Rendering Behavior. For rendering of the Gurmukhi script, see the rules for rendering in Section 9.1, Devanagari.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 225 9.4 Gujarati South and Southeast Asian Scripts
9.4 Gujarati
Gujarati: U+0A80–U+0AFF The Gujarati script is a North Indian script closely related to Devanagari. It is most obvi- ously distinguished from Devanagari by not having a horizontal bar for its letterforms, a characteristic of the older Kaithi script to which Gujarati is related. The Gujarati script is used to write the Gujarati language of the Gujarat state in India. Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering in Section 9.1, Devanagari.
226 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.5 Oriya
9.5 Oriya
Oriya: U+0B00–U+0B7F The Oriya script is a North Indian script structurally similar to Devanagari, but with semi- circular lines at the top of most letters instead of the straight horizontal bars of Devanagari. The actual shapes of the letters, particularly for vowel signs, show similarities to Tamil. The Oriya script is used to write the Oriya language, of Orissa state, India, as well as minority languages such as Khondi and Santali. Special Characters. U+0B57 is provided as an encoding for the right side of the surroundrant vowel U+0B4C . Rendering Behavior. For rendering of the Oriya script, see the rules for rendering in Section 9.1, Devanagari.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 227 9.6 Tamil South and Southeast Asian Scripts
9.6 Tamil
Tamil: U+0B80–U+0BFF The Tamil script is a South Indian script. South Indian scripts are structurally related to the North Indian scripts, but they are used to write the Dravidian languages of southern India and of Sri Lanka, which are genetically unrelated to the North Indian languages such as Hindi, Bengali, and Gujarati. The shapes of letters in the South Indian scripts are generally quite distinct from the shapes of letters in Devanagari and its related scripts. This difference is partly a result of the fact that the South Indian scripts were originally carved with needles on palm leaves, a technology that apparently favored rounded letter shapes rather than square, blocklike shapes. The Tamil script is used to write the Tamil language of Tamil Nadu state in India as well as minority languages such as Badaga. Tamil is also used in Sri Lanka, Singapore, and parts of Malaysia. This script has fewer consonants than the other Indic scripts. It also lacks con- junct consonant forms. Instead of conjunct consonant forms, the virama (U+0BCD) is normally fully depicted in Tamil text. Naming Conventions for Mid Vowels. The Unicode character encoding for Tamil uses a distinct set of naming conventions for mid vowels in the South Indian (Dravidian) scripts. These conventions are illustrated by U+0B8E and U+0B8F , to be contrasted with the isomorphic positions in Devanagari: U+090E and U+090F . The Dravidian languages have a reg- ular length distinction in the mid vowels that is not reflected in normal Devanagari. U+090E is an addition to Devanagari to enable transcription of the Dravidian short vowel forms. The naming conventions are chosen to best reflect the actual nature of the vowels in question in the Dravidian scripts, as well as in Devanagari and the other North Indian scripts. Special Characters. U+0BD7 is provided as an encoding for the right side of the surroundrant (or two-part) vowel U+0BCC . Rendering of Tamil Script. The South Indic scripts function in much the same way as Devanagari, with the additional feature of two-part vowels. As in the Devanagari example, the words “ ” and “ ” will be omitted where this deletion does not cause ambiguity. It is important to emphasize that in a font that is capable of rendering Tamil, the set of glyphs is greater than the number of Tamil characters. Table 9-4 is a summary of the Tamil letters.
228 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.6 Tamil
Table 9-4. Tamil Letter Summary ¡£¤¨©®¯ KA NGA CA JA NYA TTA NNA TA NA NNNA PA ³´µ¶·¸¹º¼ ½¾ MA YA RA RRA LA LLA LLLA VA SSA SA HA A AA I II U UU E EE AI O OO AU Ãw þÅ þ|ý Ü þ|æ ÝËþ Ìþ Íþ ËþÃÌþÃËþ¸ A AA I II U UU E EE AI O OO AU þ | ¸ VIRAMA AU LENGTH
Independent Versus Dependent Vowels. As with Devanagari, the dependent vowel signs are not equivalent to a sequence of virama + independent vowel. For example: ® + w ≠ ® + þ| +
As in the case of Devanagari, a consonant cluster is any sequence of one or more consonants separated by viramas, possibly terminated with a virama. Two-Part Vowels. Certain Indic vowels consist of two discontiguous elements. As in other cases of discontiguous elements, two sequences of Unicode values can be used to express equivalent spellings. This representation is similar to the case of letters such as “â”, which can be spelled either with “a” followed by a nonspacing “ ”ó or with a single Unicode char- acter “â”.
ËþÃ (0BCA) ≈ Ëþ + Ã (0BC6 + 0BBE) ÌþÃ (0BCB) ≈ Ìþ + Ã (0BC7 + 0BBE)
Ëþ¸ (0BCC) ≈ Ëþ + ¸ (0BC7 + 0BD7)
Note that the ¸ in the third example is not U+0BB3 ; it is U+0BD7 . If the precomposed forms are used in the memory representation instead of the separate characters, then a similar transformation occurs in the rendering process. The precom- posed form on the left is transformed into the two separate forms equivalent to those on the right, which are then subject to vowel reordering, as shown below. Thus in rendering: ËþÃ → Ëþ + Ã ÌþÃ → Ìþ + Ã Ëþ¸ → Ëþ + ¸
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 229 9.6 Tamil South and Southeast Asian Scripts
Vowel Reordering. As shown in Table 9-5, the following vowels are always reordered in front of the previous consonant cluster, similar to the rendering behavior of the - :
Ëþ (0BC6) Ìþ (0BC7) Íþ (0BC8)
Table 9-5. Vowel Reordering
Memory Representation Display Ëþ → Ë Ìþ → Ì Íþ → Í
The same effect occurs with the results of vowel splitting (see Ta ble 9-6). Table 9-6. Vowel Splitting and Reordering
Memory Representation Display ËþÃ → ËÃ Ëþ Ã → ËÃ ÌþÃ → ÌÃ Ìþ Ã → ÌÃ
Ëþ¸ → ˸
Ëþ ¸ → ˸
In both cases, the ordering of the elements is unambiguous: the consonant (cluster) occurs first in the memory representation. The vowel also has two discontinuous parts and can be composed using the . Ligatures. The following examples illustrate the range of ligatures available in Tamil. These changes take place after vowel reordering and vowel splitting. Unlike Devanagari, Tamil includes very few conjunct consonants; most ligatures are located between a vowel and a neighboring consonant. 1. Conjunct consonants. + þ | + ¼ → a
As with Devanagari, vowel reordering occurs around conjunct consonants. For example: + þ | + ¼ + Ëþ + Ã → ËaÃ
230 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.6 Tamil
2. The vowel à optionally ligates with ¨, ®, or ¶ on its left: ¨ + à → @ ® + à → A ¶ + à → B
Because this process takes place after reordering and splitting, the following lig- atures may also occur: Separate Vowels Precomposed Vowels ¨ + Ëþ + Ã → Ë@ ¨ +Ëþ Ã → Ë@ ¨ + Ìþ + Ã → Ì@ ¨ + ÌþÃ → Ì@ ® + Ëþ + Ã → ËA ® + ËþÃ → ËA ® + Ìþ + Ã → ÌA ® + ÌþÃ → ÌA ¶ + Ëþ+ Ã → ËB ¶ + ËþÃ → ËB ¶ + Ìþ + Ã → ÌB ¶ + ÌþÃ → ÌB
3. The vowel signs w and þ Å form ligatures with ¤ on their left. ¤ + w → C ¤ + þÅ→ D
These vowels often change shape or position slightly to link up with the appro- priate shape of the consonant on their left: · + w → í · + þ Å→ û
4. The vowel signs þý|Ü and þæ|Ý typically change form or ligate (see Table 9-7). Table 9-7. Ligating Vowel Signs
xx + þ ý |Ü x + þ æ |Ý xx + þ ý |Ü x + þ æ |Ý EF ¯ k g je ³PQ GH ´lh £ I I× µ R S ¤JK ¶TT× ¨LL× · U U× ©MM׸VW
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 231 9.6 Tamil South and Southeast Asian Scripts
Table 9-7. Ligating Vowel Signs (Continued)
xx + þ ý |Ü x + þ æ |Ý xx + þ ý |Ü x + þ æ |Ý NN×¹XWX ®OO× º mi
To the right of ¡, ¼, ½, ¾, or a, these forms have a spacing form (see Figure 9-9). Figure 9-9. Spacing Forms of Vowels ¡ + þý → ¡Ü ¡ + þ æ → ¡Ý
5. The vowel sign Íþ changes to Î þto the left of ¨, ®, ·, or ¸. Íþ + ¨ → Ψ Íþ + ® → ή Íþ + · → η Íþ + ¸ → θ
Remember that this change takes place after the vowel reordering. In the first example, the vowel Íþ follows ¨ in the memory representation. After vowel reordering, it is on the left of ¨, and thus changes form. The com- plete process is ¨ + Íþ → Íþ + ¨ → Ψ
6. The consonant µ changes shape to Ã. This change occurs when the à form of µ U+0BB0 would not be confused with the nominal form à of U+0BBE (for example, when – is combined withþ,| w , orþ Å ). µ + þ | → à + þ | µ + w → à + w µ + þÅ → à + þÅ
232 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.7 Telugu
9.7 Telugu
Telugu: U+0C00–U+0C7F The Telugu script is a South Indian script used to write the Telugu language of Andhra Pradesh state in India, as well as minority languages such as Gondi (Adilabad and Koi dia- lects) and Lambadi. Rendering Behavior. For rendering of the Telugu script, see the rules for rendering in Section 9.6, Tamil. Take note that, unlike Tamil, the Telugu script writes conjunct conso- nants with subscripted letters. There are also numerous consonant letters with contextual shape changes when used in conjuncts. Some vowel signs also change their shape in speci- fied combinations. Special Characters. U+0C55 is provided as an encoding for the sec- ond element of the vowel U+0C47 . U+0C56 is provided as an encoding for the second element of the surroundrant vowel U+0C48 . The length marks are both nonspacing characters.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 233 9.8 Kannada South and Southeast Asian Scripts
9.8 Kannada
Kannada: U+0C80–U+0CFF The Kannada script is a South Indian script used to write the Kannada (or Kanarese) lan- guage of Karnataka state, as well as minority languages such as Tulu. It is very closely related to the Telugu script both with regard to the shapes of the letters and in the way that conjunct consonants behave. Special Characters. U+0CD5 is provided as an encoding for the right side of the two-part vowel U+0CC7 . U+0CD6 is provided as an encoding for the right side of the two-part vowel U+0CC8 . The Kannada two-part vowels actually consist of a nonspacing element above the consonant letter and one or more spacing letters to the right of the con- sonant letter. Kannada Letter LLLA. U+0CDE is actually an obsolete Kannada let- ter that is transliterated in Dravidian scholarship as an “r” marked with the diacritic U+0324 . This form ought to have been given the letter name “”, rather than “”, so the Unicode name is simply a mistake. The letter in question has not been actively used in Kannada since the end of the tenth century. Collations should treat U+0CDE as following U+0CB3 .
234 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.9 Malayalam
9.9 Malayalam
Malayalam: U+0D00–U+0D7F The Malayalam script is a South Indian script used to write the Malayalam language of Ker- ala state. The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has a very full and complex set of conjunct consonant forms. Special Characters. U+0D57 is provided as an encoding for the right side of the two-part vowel U+0D4C .
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 235 9.10 Sinhala South and Southeast Asian Scripts
9.10 Sinhala
Sinhala: U+0D80–U+0DFF The Sinhala script, also known as Sinhalese, is used to write the Sinhala language, the majority language of Sri Lanka, formerly Ceylon. It is also used to write the Pali and San- skrit languages. The script is a descendant of Brahmi and resembles the scripts of South India in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently—for example: $%[U+0D85 U+0DAC] / a.Nda/ “sound” versus $&'( [U+0D85 U+0DAB U+0DCA U+0DA9] /a.n.da/ “egg”. In addition, Sinhala has separate distinct signs for both a short and long low front vowel sounding similar to the initial vowel of the English word “apple,” usually represented in IPA as U+00E6 æ (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88; the corresponding dependent forms are U+0DD0 and U+0DD1. Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern established for the other Indic scripts (for example, Devanagari), but does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA -,) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures. Other Letters for Tamil. The Sinhala script may also be used to write Tamil. In this case, some additional combinations may be required. Some letters, such as U+0DBB and U+0DB1 , may be modified by adding the equiva- lent of a nukta. There is, however, no nukta presently encoded in the Sinhala block. Historical Symbols. Neither the Sinhala numerals nor the punctuation sign Kunddaliya (U+0DF4) is in general use today, having been replaced by Western digits and Western- style punctuation. The Kunddaliya was formerly used as a full stop; it is included for schol- arly use. The Sinhala numerals are not presently encoded.
236 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.11 Thai
9.11 Thai
Thai: U+0E00–U+0E7F The Thai script is used to write Thai and other Southeast Asian languages, such as Kuy, Lavna, and Pali. It is a member of the Indic family of scripts descended from Brahmi. Thai modifies the original Brahmi letter shapes and extends the number of letters to accommo- date features of the Thai language, including tone marks derived from superscript digits. On the other hand, Thai script lacks the conjunct consonant mechanism and independent vowel letters found in most other Bhrami-derived scripts. As in all scripts of this family, the predominant writing direction is left to right. The Lao script is closely related to Thai, and the encoding principles described in this sec- tion apply to the Lao encoding as well. Standards. Thai layout in the Unicode Standard is based on the Thai Industrial Standard 620-2529, and its updated version 620-2533. Encoding Principles. In common with the Indic scripts, each Thai letter is a consonant possessing an inherent vowel sound. Thai letters further feature inherent tones. The inher- ent vowel and tone can be modified by means of vowel signs and tone marks attached to the base consonant letter. Some of the vowel signs and all of the tone marks are rendered in the script as diacritics attached above or below the base consonant. These combining signs and marks are encoded after the modified consonant in the memory representation. Most of the Thai vowel signs are rendered by full letter-sized in-line glyphs placed either before (that is, to the left of) or after (to the right of) or around (on both sides of) the glyph for the base consonant letter. In the Thai encoding, the letter-sized glyphs that are placed before (left of) the base consonant letter, in full or partial representation of a vowel sign, are in fact encoded as separate characters that are typed and stored before the base consonant character. This encoding for left-side Thai vowel sign glyphs (and similarly in Lao) differs from the conventions for all other Indic scripts, which uniformly encode all vowels after the base consonant. The difference is necessitated by the encoding practice commonly employed with Thai character data as represented by the Thai Industrial Stan- dard. Thai Punctuation. Thai uses a variety of punctuation marks particular to this script. U+0E4F is the Thai bullet, used to mark items in lists, or appearing at the beginning of a verse, sentence, paragraph, or other textual segment. U+0E46 is used to mark repetition of preceding letters. U+0E2F is used to indicate ellision or abbreviation of letters; it is itself viewed as a kind of letter, however, and is used with considerable frequency because of its appearance in such words as the Thai name for Bangkok. Paiyannoi is also used in combination (U+0E2F U+0E25 U+0E2F) to create a construct called paiyanyai, which means “et cetera, and so forth.” The Thai paiyanyai is comparable to the isomorphic analogue in the Khmer script: U+17D8 . U+0E5A is used to mark the end of a long segment of text. It can be combined with a following U+0E30 to mark a larger seg- ment of text; typically this usage can be seen at the end of a verse in poetry. U+0E5B marks the end of a chapter or document, where it always follows the angkhankhu + sara a combination. The Thai angkhankhu and its combination with sara a to mark breaks in text have analogues in many other Brahmi-derived scripts. For example, they are closely related to U+17D4 and U+17D5
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 237 9.11 Thai South and Southeast Asian Scripts
, which are themselves ultimately related to the danda and double danda of Devanagari. Thai words are not separated by spaces. Text is laid out with spaces introduced at text seg- ments where Western typography would typically make use of commas or periods. How- ever, Latin-based punctuation such as comma, period, and colon are also used in text, particularly in conjunction with Latin letters, or in formatting numbers, addresses, and so forth. If word boundary indications are desired—for example, for the use of automatic line layout algorithms—the character U+200B should be used to place invisible marks for such breaks. The can grow to have a visible width when justified. Thai Transcription of Pali and Sanskrit. The Thai script is frequently used to write Pali and Sanskrit. When so used, consonant clusters are represented by the explicit use of U+0E3A (virama) to mark the removal of the inherent vowel. There is no conjoining behavior, unlike other Indic scripts. U+0E4D is the Pali nigghahita and Sanskrit anusvara. U+0E30 is the Sanskrit visarga. U+0E24 and U+0E26 are vocalic /r/ and /l/, with U+0E45 used to indicate their lengthening.
238 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.12 Lao
9.12 Lao
Lao: U+0E80–U+0EFF The Lao language and script are closely related to Thai. The Unicode Standard encodes the Lao script in the same relative order as Thai. A few additional letters in Lao have no match in Thai: U+0EBB U+0EBC U+0EBD The preceding two semivowel signs are the last remnants of the system of subscript medials, which in Myanmar also includes original “rw”. Myanmar and Khmer include a full set of subscript consonant forms used for conjuncts. Thai no longer uses any of these forms; Lao has just the two. There are also two ligatures in the Unicode character encoding for Lao: U+0EDC and U+0EDD . They correspond to sequences of [h] plus [n] or [h] plus [m] without ligating. Their function in Lao is to provide versions of the [n] and [m] consonants with a different inherent tonal implication.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 239 9.13 Tibetan South and Southeast Asian Scripts
9.13 Tibetan
Tibetan: U+0F00–U+0FBF The Tibetan script is used for writing Tibetan in several countries and regions throughout the Himalayas. Aside from Tibet itself, the script is used in Ladakh, Nepal, and northern areas of India bordering Tibet where large Tibetan-speaking populations now reside. The Tibetan script is also used in Bhutan to write Dzongkha, the official language of that coun- try. Tibetan is also used as the language of philosophy and liturgy by Buddhist traditions spread from Tibet into the Mongolian cultural area that encompasses Mongolia, Buriatia, Kalmykia, and Tuva. The Tibetan scripting and grammatical systems were originally defined together in the sixth century by royal decree when the Tibetan King Songtsen Gampo sent 16 men to India to study Indian languages. One of those men, Thumi Sambhota, is credited with creating the Tibetan writing system upon his return, having studied various Indic scripts and gram- mars. The king’s primary purpose was to bring Buddhism from India to Tibet. The new script system was therefore designed with compatibility extensions for Indic (principally Sanskrit) transliteration so that Buddhist texts could be properly represented. Because of this origin, over the last 1,500 years the Tibetan script has also been widely used to repre- sent Indic words, a number of which have been adopted into the Tibetan language retain- ing their original spelling. A note on Latin transliteration: Tibetan spelling is traditional, and does not generally reflect modern pronunciation. Throughout this section, Tibetan words are represented in italics when transcribed as spoken, followed at first occurrence by a parenthetical translit- eration; in these transliterations, presence of the tsek (tsheg) character is expressed with a hyphen. Thumi Sambhota’s original grammar treatise defined two script styles. The first, called uchen (dbu-can, “with head”), is a formal “inscriptional capitals” style said to be based on an old form of Devanagri. It is the script used in Tibetan xylograph books and the one used in the coding tables. The second style, called u-mey (dbu-med, or “headless”), is more cur- sive and said to be based on the Wartu script. Numerous styles of u-mey have evolved since then, including both formal calligraphic styles used in manuscripts and running handwrit- ing styles. All Tibetan scripts follow the same lettering rules, though there is a slight differ- ence in the way that certain compound stacks are formed in uchen and u-mey. General Principles of the Tibetan Script. Tibetan grammar divides letters into consonants and vowels. There are 30 consonants, and each consonant is represented by a discrete writ- ten character. There are five vowel sounds, only four of which are represented by written marks. The four vowels that are explicitly represented in writing are each represented with a single mark that is applied above or below a consonant to indicate the application of that vowel to that consonant. The absence of one of the four marks implies that the first vowel sound (like a short “ah” in English) is present and is not modified to one of the four other possibilities. Three of the four marks are written above the consonants; one is written below. Each word in Tibetan has a base or root consonant. The base consonant can be written sin- gly or it can have other consonants added above or below it to make a vertically “stacked” letter. Tibetan grammar contains a very complete set of rules regarding letter gender, and these rules dictate which letters can be written in adjacent positions. The rules therefore dictate which combinations of consonants can be joined to make stacks. Any combination not allowed by the gender rules does not occur in native Tibetan words. However, when
240 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan transcribing other languages (for example, Sanskrit, Chinese) into Tibetan, these rules do not operate. In certain instances, other than transliteration, any consonant may be com- bined with any other subjoined consonant. Implementations should therefore be prepared to accept and display any combinations. The model adopted to encode the Tibetan lettering set described above contains the follow- ing groups of items: Tibetan consonants, vowels, numerals, punctuation, ornamental signs and marks, and Tibetan-transliterated Sanskrit consonants and vowels. Each of these will be described below. Both in this description and in Tibetan, the terms “subjoined” (-btags) and “head” (-mgo) are used in different senses. In the structural sense, they indicate specific slots defined in native Tibetan orthography. In spatial terms, they refer to the position in the stack; any- thing in topmost position is “head,” anything not in topmost position is “subjoined.” Unless explicitly qualified, the terms “subjoined” and “head” are used here in their spatial sense. For example, in a conjunct like “rka” the letter in the root slot is “KA”, but, because it is not the topmost letter of the stack, it is expressed with a subjoined character code, while “RA”, which is structurally in the head slot, is expressed with a nominal character code. On the other hand, in a conjunct “kra” in which the root slot is also occupied with “KA”, the “KA” is encoded with a nominal character code, because it is in the topmost position in the stack. The Tibetan script has its own system of formatting and details of that system relevant to the characters encoded in this standard are explained herein. However, an increasing num- ber of publications in Tibetan do not strictly adhere to this original formatting system. This change is due to the partial move from publishing on long, horizontal, loose-leaf folios, to publishing in vertically oriented, bound books. The Tibetan script also has a punctuation set designed to meet needs quite different from the punctuation that has evolved for West- ern scripts. With the appearance of Tibetan newspapers, magazines, school textbooks, and Western style reference books in the last 20 or 30 years, Tibetans have begun using things like columns, indented blocks of text, Western-style headings, and footnotes. Some West- ern punctuation marks, including brackets, parentheses, and quotation marks, are also becoming commonplace in these kinds of publication. With the introduction of more sophisticated electronic publishing systems, there is also a renaissance in the publication of voluminous religious and philosophical works in the traditional horizontal, loose-leaf for- mat—many set in digital typefaces closely conforming to the proportions of traditional hand-lettered text. Consonants. The system devised to encode the Tibetan system of writing consonants in both single and stacked forms as described above is as follows: All of the consonants are encoded a first time from U+0F40 through U+0F69. There are the basic Tibetan consonants and, in addition, six compound consonants used to represent the Indic consonants gha, jha, d.ha, dha, bha, and ksh.a. These codes are used to represent occurrences of either a stand-alone consonant or a consonant in the head position of a ver- tical stack. Glyphs generated from these codes will always sit in the normal position starting at and dropping down from the design baseline. All of the consonants are then encoded a second time. These second encodings from U+0F90 through U+0FB9 represent conso- nants in subjoined stack position. To represent a single consonant in a text stream, one of the first, “nominal,” set of codes is placed. To represent a stack of consonants in the text stream, a “nominal” consonant code is followed directly by one or more of the subjoined consonant codes. The stack so formed continues for as long as subjoined consonant codes are contiguously placed. This encoding method was chosen over an alternative method that would have involved a virama-based encoding, like Devanagari. There were two main reasons for this choice.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 241 9.13 Tibetan South and Southeast Asian Scripts
First, the virama is not normally used in the Tibetan writing system to create letter combi- nations. (There is a virama in the Tibetan script, but only because of the need to represent Devanagari; it is called “srog-med” and encoded at U+0F84. The virama is never used in writing Tibetan words and can be, but is almost never, used as a substitute for stacking in writing Sanskrit mantras in the Tibetan script.) Second, there is a prevalence of stacking in native Tibetan, and the model chosen specifically results in decreased data storage require- ments. Furthermore, in languages other than Tibetan, there are many cases where stacks occur that do not occur in Tibetan-language texts; it is thus imperative to have a model that allows for any consonant to be stacked with any subjoined consonant(s). Thus a model for stack building was chosen that follows the Tibetan approach to creating letter combina- tions, but is not limited to a specific set of the possible combinations. Vowels. Each of the four basic Tibetan vowel marks mentioned above is coded as a separate entity. They are U+0F72, U+0F74, U+0F7A, and U+0F7C. For compatibility, a set of sev- eral compound vowels for Sanskrit transcription is also provided in the other codepoints between U+0F71 and U+0F7D. Most Tibetan users do not view these compound vowels as single characters, and their use is limited to Sanskrit words. It is acceptable for users to enter these compounds as a series of simpler elements and have software render them appropriately. Canonical equivalences are specified for all except U+0F77 and U+0F79. All vowel signs are nonspacing marks above or below a stack of consonants, sometimes on both sides. A stand-alone consonant or a stack of consonants can have a vowel sign applied to it. In accordance with the rules of Tibetan writing, a code for a vowel sign applied to a consonant should always be placed after the bare consonant or the stack of consonants formed by the method just described. All of the symbols and punctuation marks have straightforward encodings. Further infor- mation about many of them is included below. Coding Order. In general, the correct coding order for a stream of text will be the same as the order in which Tibetans spell and in which the characters of the text would be written by hand. For example, the correct coding order for the most complex Tibetan stack would be: head position consonant first subjoined consonant ... (intermediate subjoined consonants if any) last subjoined consonant subjoined vowel a-chung (U+0F71) standard or compound vowel sign, or virama Where used, the character U+0F39 - occurs immediately after the consonant it modifies. Allographical Considerations. When consonants are combined to form a stack, one of them retains the status of being the principal consonant in the stack. The principal conso- nant always retains its stand-alone form. However, consonants placed in the “head” and “subjoined” positions to the main consonant sometimes retain their stand-alone form and sometimes are given a new, special form. Because of this fact, certain of the consonants are given a further, special encoding treatment. The affected consonants are “wa” (U+0F5D), “ya” (U+0F61), and “ra” (U+0F62). Head Position “ra”. When the consonant “ra” is written in the “head” position (ra-mgo) at the top of a stack in the normal Tibetan-defined lettering set, the shape of the consonant can change. It can either be a full-form shape or the full-form shape but with the bottom stroke removed (looking like a short-stemmed letter “T”). This requirement of “ra” in the
242 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan head position where the glyph representing it can change shape is correctly coded by using the stand-alone “ra” consonant (U+0F62) followed by the appropriate subjoined conso- nant(s). For example, in the normal Tibetan ra-mgo combinations, the “ra” in the head position is mostly written as the half-ra but in the case of “ra + sub-joined nya” must be written as the full-form “ra”. Thus the normal Tibetan ra-mgo combinations are correctly encoded with the normal “ra” consonant (U+0F62) because it can change shape as required. It is the responsibility of the font developer to provide the correct glyphs for rep- resenting the characters where the “ra” in the head position will change shape—for exam- ple, as in “ra + subjoined nya”. Full-Form “ra” in Head Position. Some instances of “ra” in the head position require that the consonant be represented as a full-form “ra” that never changes. This is not standard usage for the Tibetan language itself, but occurs in transliteration and transcription. In these cases, only the code U+0F6A - should be used. This “ra” will always be represented as a full-form “ra consonant” and will never change shape to the form where the lower stroke has been cut off. For example, the letter combination “ra + ya” when appearing in transliterated Sanskrit works is correctly written with a full- form “ra” followed by either a modified subjoined “ya” form or a full-form subjoined “ya” form. Note that the fixed-form “ra” should be used only in combinations where “ra” would normally transform into a short form but the user specifically wants to prevent that change. For example, the combination “ra + subjoined nya” never requires the use of fixed- form “ra”, because “ra” normally retains its full glyph form over “nya”. It is the responsibil- ity of the font developer to provide the appropriate glyphs to represent the encodings. Subjoined Position “wa”, “ya”, and “ra”. All three of these consonants can be written in subjoined position to the main consonant according to normal Tibetan grammar. In this position, all of them change to a new shape. The “wa” consonant when written in sub- joined position is not a full “wa” letter any longer but is literally the bottom-right corner of the “wa” letter cut off and appended below. For that reason it is called a wazur (wa-zur, or “corner of a wa”) or less frequently, but just as validly, wa-ta (wa-btags) to indicate that it is a subjoined “wa”. The consonants “ya” and “ra” when in the subjoined position are called ya-ta (ya-btags) and ra-ta (ra-btags), respectively. To encode these subjoined consonants that follow the rules of normal Tibetan grammar, the shape-changed, subjoined forms U+0F5D , U+0F61 , and U+0F62 should be used. All three of these subjoined consonants also have full-form non-shape-changing counter- parts for the needs of transliterated and transcribed text. For this purpose, the full sub- joined consonants that do not change shape (encoded at U+0FBA, U+0FBB, and U+0FBC, respectively) are used where necessary. The combinations of “ra + ya” are a good example because they include instances of “ra” taking a short (ya-btags) form and “ra” taking a full- form subjoined “ya”. U+0FB0 - (a-chung) should be used only in the very rare cases where a full-sized subjoined a-chung letter is required. The small vowel lengthening a-chung encoded as U+0F71 is far more frequently used in Tibetan text, and it is therefore recommended that implementations treat this character rather than U+0FB0 as the normal subjoined a-chung. Line-Breaking Considerations. Tibetan text separates units called natively “tsheg-bar”, an inexact translation of which is “syllable.” Tsheg-bar is literally the unit of text between tseks, and is generally a consonant cluster with all of its prefixes, suffixes, and vowel signs. It is not a “syllable” in the English sense. Tibetan script has two break characters only. The primary break character is the standard interword tsek (“tsheg”), which is encoded at U+0F0B. The other break character is the space. Space or tsek characters in a stream of Tibetan text are not always break characters
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 243 9.13 Tibetan South and Southeast Asian Scripts and so need proper contextual handling. Issues surrounding these two potential break characters will now be discussed. The primary delimiter character in Tibetan text is the tsek (U+0F0B - ). In general, automatic line-breaking processes may break after any occur- rence of this tsek, except where it follows a U+0F44 (with or without a vowel sign) and precedes a shay (U+0F0D), or where Tibetan grammatical rules do not permit a break. (Normally, tsek is not written before shay except after “nga”. This type of tsek-after-nga is called “nga-phye-tsheg”, and may be expressed by U+0F0B, or by the spe- cial character U+0F0C, a nonbreaking form of tsek.) The Unicode names for these two types of tsek are misnomers, retained for compatibility. The standard tsek U+0F0B is always required to be a potentially breaking char- acter, whereas the “nga-phye-tsheg” is always required to be a nonbreaking tsek. U+0F0C is specifically not a “delimiter” and is not for gen- eral use. There are no other break characters in Tibetan text. Unlike English, Tibetan has no system for hyphenating or otherwise breaking a word within the group of letters making up the word. Tibetan text formatting does not allow text to be broken within a word. Whitespace appears in Tibetan text, although it should be represented by U+00A0 - instead of U+0020 . Tibetan text breaks lines after tsek instead of at whitespace. Complete Tibetan text formatting is best handled by a formatter in the application and not just by the code stream. If the interword and nonbreaking tseks are properly employed as breaking and nonbreaking characters, respectively, and if all spaces are nonbreaking spaces, then any application will still wrap lines correctly on that basis, even though the breaks might be sometimes inelegant. Tibetan Punctuation. The punctuation apparatus of Tibetan is relatively limited. The principle punctuation characters are the tsek already mentioned, the shay (transliterated “shad”), which is a vertical stroke used to mark the end of a section of text, the space used sparingly as a space, and two of several variant forms of the shay that are used in specialized situations requiring a shay. There are also several other marks and signs but they are spar- ingly used. The shay at U+0F0D marks the end of a piece of text called “tshig-grub”. The mode of marking bears no commonality with English phrases or sentences and should not be described as a delimiter of phrases. In Tibetan grammatical terms, a shay is used to mark the end of an expression (“brjod.pa”) and a complete expression. Two shays are used at the end of whole topics (“don.tshan”). Because some writers use the double shay with a differ- ent spacing than would be obtained by coding two adjacent occurrences of U+0F0D, the double shay has been coded at U+0F0E with the intent that it would have a larger spacing between component shays than if two shays were simply written together. However, most writers do not use an unusual spacing between the double shay, so the application should allow the user to write two U+0F0D codes one after the other. Additionally, font designers will have to decide whether to implement these shays with a larger than normal gap. The U+0F11 rin-chen-pung-shay (rin-chen-spung-shad) is a variant shay used in a specific “new-line” situation. Its use was not defined in the original grammars but Tibetan tradi- tion gives it a highly defined use. The drul-shay (“sbrul-shad”) is likewise not defined by the original grammars but Tibetan tradition gives it a highly defined use. It is used for sep- arating sections of meaning that are equivalent to topics (“don.tshan”) and subtopics. A drul-shay is usually surrounded on both sides by the equivalent of about three spaces (though there is no rule specified). Hard spaces will be needed for these because the
244 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan drul-shay should not appear at the beginning of a new line and the whole structure of spac- ing-plus-shay should not be broken up, if possible. Tibetan texts use a yig-go (“head mark,” yig-mgo) to indicate the beginning of the front of a folio, there being no other certain way, in the loose-leaf style of traditional Tibetan books, to tell which is the front of a page. The head mark can and does vary from text to text; there are many different ways to write it. The common type of head mark has been provided for with U+0F04 and its extension U+0F05 . An initial mark yig-go can be written alone or combined with as many as three closing marks following it. When the initial mark is written in com- bination with one or more closing marks, the individual parts of the whole must stay in proper registration with each other to appear authentic. Therefore, it is strongly recom- mended that font developers create precomposed ligature glyphs to represent the various combinations of these two characters. The less common head marks mainly appear in Nyingmapa and Bonpo literature. Three of these head marks have been provided for with U+0F01, U+0F02, and U+0F03; however, many others have not been encoded. Font devel- opers will have to deal with the fact that many types of head marks in use in this literature have not been encoded, cannot be represented by a replacement that has been encoded, and will be required by some users. Two characters, U+0F3C and U+0F3D , are paired punctuation, typically used together forming a roof over one or more digits or words. In this case, kerning or special ligatures may be required for proper rendering. The right ang khang may also be used much as a single closing parenthesis is used in forming lists; again, special kerning may be required for proper rendering. The marks U+0F3E and U+0F3F are paired signs used to combine with digits; special glyphs or compositional metrics are required for their use. A set of frequently occurring astrological and religious signs specific to Tibetan is encoded between U+0FBE and U+0FCF. U+0F34 means “et cetera” or “and so on” and is used after the first few tsek-bar of a recur- ring phrase. U+0FBE (often three times) indicates a refrain. U+0F36 and U+0FBF are used to indicate where text should be inserted within other text or as references to footnotes or marginal notes. Other Characters. The Wheel of Dharma, which occurs sometimes in Tibetan texts, is encoded in the Miscellaneous Symbols block at U+2638. Left-facing and right-facing swastika symbols are likewise used. They are found among the Chinese ideographs at U+534D (“yung-drung-chi-khor”) and U+5350 (“yung-drung- nang-khor”). The marks U+0F35 and U+0F37 conceptually attach to a tsek-bar rather than to an individual character and function more like attributes than characters—like underlining to mark or emphasize text. In Tibetan interspersed commentaries, they may be used to tag the tsek-bar belonging to the root text that is being commented on. The same thing is often accomplished by set- ting the tsek-bar belonging to the root text in large type and the commentary in small type. Correct placement of these glyphs may be problematic. If they are treated as normal com- bining marks, they can be entered into the text following the vowel signs in a stack; if used, their presence will need to be accounted for by searching algorithms, and so forth. Tibetan Half-Numbers. The half-number forms (U+0F2A..U+0F33) are peculiar to Tibetan, though other scripts (for example, Bengali) have similar fractional concepts. The value of each half-number is 0.5 less than the number within which it appears. These forms
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 245 9.13 Tibetan South and Southeast Asian Scripts are used only in some traditional contexts and appear as the last digit of a multidigit num- ber. The sequence of digits “U+0F24 U+0F2C” represents the number 42.5 or forty-two and one-half. Tibetan Transliteration and Transcription of Other Languages. Tibetan traditions are in place for transliterating other languages. Most commonly, Sanskrit has been the language being transliterated, though Chinese has become more common in modern times. Addi- tionally, Mongolian has a transliterated form. There are even some conventions for translit- erating English. One feature of Tibetan script/grammar is that it allows for totally accurate transliteration of Sanskrit. The basic Tibetan letterforms and punctuation marks contain most of what is needed, although a few extra things are required. With these additions, Sanskrit can be transliterated perfectly into Tibetan, and the Tibetan transliteration can be rendered backward perfectly into Sanskrit with no ambiguities or difficulties. The six Sanskrit retroflex letters are interleaved among the other consonants. The compound Sanskrit consonants are not included in normal Tibetan. They could be made using the method described earlier for Tibetan stacked consonants, generally by sub- joining “ha”. However, to maintain consistency in transliterated texts and for ease in trans- mission and searching, it is recommended that implementations of Sanskrit in the Tibetan script use the precomposed forms of aspirated letters (and U+0F69 “ka + reversed sha”) whenever possible, rather than implementing these consonants as completely decomposed stacks. Note that implementations must ensure that decomposed stacks and precomposed forms are interpreted equivalently (see Section 3.6, Decomposition). The compound conso- nants are explicitly coded as follows: U+0F93 , U+0F9D , U+0FA2 , U+0FA7 , U+0FAC , and U+0FB9 . The vowel signs of Sanskrit not included in Tibetan are encoded with other vowel signs between U+0F70 and U+0F7D. U+0F7F (nam chay) is the visarga, and U+0F7E (ngaro) is the anusvara. See Section 9.1, Devanagari, for more information on these two characters. The characters encoded in the range U+0F88..U+0F8B are used in transliterated text and are most commonly found in Kalachakra literature. When the Tibetan script is used to transliterate Sanskrit, consonants are sometimes stacked in ways that are not allowed in native Tibetan stacks. Even complex forms of this stacking behavior are catered for properly by the method described earlier for coding Tibetan stacks. Other Signs. U+0F09 is a list enumerator used at the start of administrative letters in Bhutan, as is the petition honorific U+0F0A - . U+0F3A and U+0F3B are paired punctuation marks (brackets). The sign U+0F39 - (tsa-’phru, which is a lenition mark) is the ornamental flaglike mark that is an integral part of the three consonants U+0F59 , U+0F5A , and U+0F5B . Although those consonants are not decomposable, this mark has been abstracted and may by itself be applied to “pha” and other consonants to make new letters for use in transliteration and transcription of other languages. For example, in modern literary Tibetan, it is one of the ways used to transcribe the Chinese “fa” and “va” sounds not represented by the normal Tibetan consonants. Tsa-’phru is also used to represent tsa, tsha, or dza in abbreviations. Traditional Text Formatting and Line Justification. Native Tibetan texts (“pecha”) are written and printed using a justification system that is, strictly speaking, right-ragged but
246 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.13 Tibetan with an attempt to right justify. Each page has a margin. That margin is usually demarcated with visible border lines required of a pecha. In modern times, as Tibetan text is produced in Western-style books, the margin lines may be dropped and an invisible margin used. When writing the text within the margins, an attempt is made to have the lines of text jus- tified up to the right margin. To do so, writers keep an eye on the overall line length as they fill lines with text and try manually to justify to the right margin. Even then, there is often a gap at the right margin that cannot be filled. If the gap is short, it will be left as is and the line will be said to be justified enough, even though by machine-justification standards the line is not truly flush on the right. If the gap is large, the intervening space will be filled with as many tseks as are required to justify the line. Again, the justification is not done perfectly in the way that English text might be perfectly right-justified; as long as the last tsek is more or less at the right margin, that will do. The net result is that of a right-justified, blocklike look to the text, but the actual lines are always a little right-ragged. Justifying tseks are nearly always used to pad the end of a line when the preceding character is a tsek—in other words, when the end of a line arrives in the middle of tshig-grub (see the previous definition under “Tibetan Punctuation”). However, it is unusual for a line that ends at the end of tshig-grub to have justifying tseks added to the shay at the end of the tshig-grub. That is, xxxx| | is not usually padded like this (though it is allowable): xxxx| |..... In this case, instead of justifying the line with tseks, the space between shays is enlarged and/or the whitespace following the final shay is usually left as is. Padding is never applied following an actual space character. For example, given the existence of a space after a shay, the following may not be written with the padding, xxxxxxx| ...... because the final shay should have a space after it, and padding is never applied after spaces. The same applies where the final consonant of a tshig-grub that ends a line is a “ka” or “ga”. In that case, the ending shay is dropped but a space is still required after the consonant and that space must not be padded. For example, the following is not acceptable: xxxxxga ...... Tibetan text has two rules regarding the formatting of text at the beginning of a new line. There are severe constraints on which characters can start a new line, and the rule is tradi- tionally stated as follows: a shay of any description may never start a new line. Nothing but actual words of text can start a new line, the only exception being a go-yig (yig-mgo) at the head of a front page or a da-tshe (zla-tshe, meaning “crescent moon”—for example, U+0F05) or one of its variations, which is effectively an “in-line” go-yig (yig-mgo), on any other line. One of two or three ornamental shays is also commonly used in short pieces of prose in place of the more formal da-tshe. This rule also means that a space may not start a new line in the flow of text. If there is a major break in a text, a new line might be indented. A syllable (tsheg-bar) that comes at the end of a tshig-grub and that starts a new line must have the shay that would normally follow it replaced by a rin-chen-spung-shad (U+0F11). (The reason for this rule is that the presence of the rin-chen-spung-shad makes the end of tshig-grub more visible and hence makes the text easier to read.) In verse, the second shay following the first rin-chen-spung-shad is also replaced some- times with a rin-chen-spung-shad, though the practice is formally incorrect. It is a writer’s trick done to make a particular scribing of a text more elegant. It is moderately popular device but does breaks the rule. Not only is rin-chen-spung-shad used as the replacement
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 247 9.13 Tibetan South and Southeast Asian Scripts for the shay but a whole class of “ornamental shays” are used for the same purpose. All are scribal variants on a rin-chen-spung-shad, which is correctly written with three dots above it. Tibetan Shorthand Abbreviations (bskungs-yig) and Limitations of the Encoding. A con- sonant functioning as the word-base (ming-gzhi) is allowed to take only one vowel sign according to Tibetan grammar. The Tibetan shorthand writing technique called bskungs- yig does allow one or more words to be contracted into a single, very unusual combination of consonants and vowels. This construction frequently entails the application of more than one vowel sign to a single consonant or stack, and the composition of the stacks them- selves can break the rules of normal Tibetan grammar. For this reason, vowel signs do sometimes interact typographically, which accounts for their particular combining classes (see Section 4.2, Combining Classes—Normative). The Unicode Standard accounts for plain text compounds of Tibetan that contain at most one base consonant, any number of subjoined consonants, followed by any number of vowel signs. This coverage constitues the vast majority of Tibetan text. Rarely, stacks are seen that contain more than one such consonant-vowel combination in a vertical arrange- ment. These stacks are highly unusual and are considered beyond the scope of plain text rendering. They may be handled by higher-level mechanisms.
248 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.14 Myanmar
9.14 Myanmar
Myanmar: U+1000–U+109F The Myanmar script is used to write Burmese, the majority language of Myanmar (for- merly called Burma). Variations and extensions of the script are used to write other lan- guages of the region, such as Shan and Mon, as well as Pali and Sanskrit. The Myanmar script was formerly known as the Burmese script, but the term “Myanmar” is now pre- ferred. The Myanmar writing system derives from a Brahmi-related script borrowed from South India in about the eighth century for the Mon language. The first inscription in the Myan- mar script dates from the eleventh century, using an alphabet almost identical to that of the Mon inscriptions. Aside from rounding of the originally square characters, this script has remained largely unchanged to the present. It is said that the rounder forms were devel- oped to permit writing on palm leaves without tearing the writing surface of the leaf. Because of its Brahmi origins, the Myanmar script shares the structural features of its Indic relatives: consonant symbols include an inherent “a” vowel; various signs are attached to a consonant to indicate a different vowel; ligatures and conjuncts are used to indicate conso- nant clusters; and the overall writing direction is left to right. Thus, despite great differ- ences in appearance and detail, the Myanmar script follows the same basic principles as, for example, Devanagari. Standards. There is not yet an official national standard for the encoding of Myanmar/ Burmese. The current encoding was prepared with the consultation of experts from the Myanmar Information Technology Standardization Committee (MITSC) in Yangon (Rangoon). The MITSC, formed by the government in 1997, consists of experts from the Myanmar Computer Scientists’ Association, Myanmar Language Commission, and Myan- mar Historical Commission. Encoding Principles. As with Indic scripts, the Myanmar encoding represents only the basic underlying characters; multiple glyphs and rendering transformations are required to assemble the final visual form for each syllable. Even some single characters, such as U+102C , may assume variant forms depending on the other characters with which they combine. Conversely, characters or combinations that may appear visually identical in some fonts, such as U+101D and U+1040 , are distinguished by their underlying encoding. Composite Characters. As is the case in Extended Latin and many other scripts, some Myanmar letters or signs may be analyzed as composites of two or more other characters, and are not encoded separately. The following is an example of a Myanmar letter repre- sented by a combining character sequence: myanmar vowel sign o = U+1031 + U+102C Encoding Subranges. The basic consonants, independent vowels, and dependent vowel signs required for writing the Myanmar language are encoded at the beginning of the Myanmar range. Extensions of each of these categories for use in writing other languages, such as Pali and Sanskrit, are appended at the end of the range. In between these two sets lie the script-specific signs, punctuation, and digits. Conjunct and Medial Consonants. As in other Indic-derived scripts, conjunction of two consonant letters is indicated by the insertion of a virama U+1039
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 249 9.14 Myanmar South and Southeast Asian Scripts between them; it causes ligation or other rendered combination of the consonants, although the virama itself is not rendered visibly. The conjunct form of U+1004 is rendered as a superscript sign called kinzi. Kinzi is encoded in logical order as a conjunct consonant before the syllable to which the mark applies, such as the Devanagari ra. (See Section 9.1, Devanagari, Rule R2.) For example, kinzi applied to U+1000 would be written via the fol- lowing sequence: U+1004 U+1039 U+1000 The Myanmar script traditionally distinguishes a set of subscript “medial” consonants: forms of , , , and that are considered to be modifiers of the syllable’s vowel. In the Myanmar encoding, the medial consonants are treated as conjuncts; that is, they are coded using the virama. So, for example, the word “kywe” (“to drop off”) would be written via the following sequence: U+1000 U+1039 U+101A U+1039 U+101D U+1031 Explicit Virama. The virama U+1039 also participates in some common constructions where it appears as a visible sign, commonly termed killer. In this usage where it appears as a visible diacritic, U+1039 is followed by a U+200C -, as with Devanagari (see Figure 9-7). Signs After Consonants. Dependent vowels and other signs (except kinzi, as noted above) are encoded in logical order after the consonant to which they apply, regardless of where the glyph for the sign happens to be rendered relative to the glyph for the consonant. In particular, U+1031 is encoded after its consonant (as in the previ- ous example), although in visual presentation it is reordered to appear before (to the left of) the consonant form. Spacing. Myanmar does not use any whitespace between words. If word boundary indica- tions are desired—for example, for the use of automatic line layout algorithms—the char- acter U+200B should be used to place invisible marks for such breaks. The can grow to have a visible width when justified.
250 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.15 Khmer
9.15 Khmer
Khmer: U+1780–U+17FF Khmer, also known as Cambodian, is the official language of Kampuchea. Mutually intelli- gible dialects are also spoken in northeastern Thailand and the Mekong Delta region of Vietnam. Although Khmer is not an Indo-European language, it has borrowed much vocabulary from Sanskrit and Pali, and religious texts in those languages have been translit- erated as well as translated into Khmer. The Khmer script, called a’saa kmae (“Khmer letters”), is descended from the Brahmi script of South India, as are Thai, Lao, Myanmar, Old Mon, and others. The exact sources have not been determined, but there is a great similarity between the earliest inscriptions in the region and the Pallawa script of the Coromandel coast of India. Modern Khmer has two basic styles of script: the a’saa criang (“slanted script”) and the a’saa muul (“round script”). There is no fundamental structural difference between the two. The slanted script (“stand- ing” variant) is chosen as representative in Chapter 14, Code Charts. Structurally, the Khmer script stays close to its southern Brahmi origins; in modern terms, it shares features with both Thai and Myanmar. Consonant letters bear an inherent vowel sound, with additional signs placed before, above, below, and after the consonants to indi- cate vowels other than the inherent one. Consonant clusters are represented by conjuncts, where the first consonant of the cluster maintains its full form and succeeding consonants are written as subscripts. The overall writing direction is left to right. The Khmer language has a richer set of vowels than the languages for which the ancestral script was used, but it has a smaller set of consonant sounds. The Khmer script takes advantage of this situation by assigning different symbols to the same consonant but with different inherent vowels. This usage is analogous to the situation in Thai, where different consonant symbols have the same sound but encode different tones. The Khmer consonant letters are organized into two series or registers, whose inherent vowels are nominally “a” and “o”. Two “shifter” signs convert a consonant from one series to the other. The depen- dent vowel signs then do not have a single phonetic value, but rather are interpreted in the context of the consonant to which they are attached. Encoding Principles. As with other related scripts, the Khmer encoding represents only the basic underlying characters; multiple glyphs and rendering transformations are required to assemble the final visual form for each syllable. Even some single characters, such as U+1789 , may assume variant forms depending on the other characters with which they combine. Conversely, characters or combinations that may appear visually identical in some fonts, such as U+17A2 and U+17A3 - , are distinguished by their underlying encoding. Independent Vowels. In Khmer, as in most related scripts (but in contrast to Thai), inde- pendent vowels have their own letterforms. Conjunct Consonants. U+17D2 plays the conjunct formation role of Indic virama, killing the vowel of its preceding consonant, and indicating that the following consonant should be treated as a subscript. Sign coeng should not be confused with U+17D1 , which has a name similar to virama but an unrelated func- tion—indicating that the base character is part of the previous word. Note that a subjoined U+179A is typed normally—that is, after the pre- ceding consonant and sign coeng, even though the glyph for the subjoint letter ro is ren- dered before (to the left of) the full-size consonant. U+17CC
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 251 9.15 Khmer South and Southeast Asian Scripts historically corresponds to the Devanagari repha (that is, an initial /r/), but it has lost this function in Khmer and instead is treated as a simple diacritical mark rather than as part of a conjunct. Consonant Shifters. The marks U+17C9 and U+17CA are used to shift the base consonant between registers. In the presence of other superscript glyphs, both of these signs may be rendered via the same glyph shape as that of U+17BB . Selection of the proper rendering form is left to the display software. Dependent Vowel Signs. Structurally, the Khmer dependent vowel signs are close to those of Thai, but the encoding principle is different. Khmer follows the general model for Brahmi-derived scripts, in which each dependent vowel sign is represented by a single code that occurs after the code for the base consonant. This principle reflects the fact that a vowel sign has its own identity, regardless of the number and placement of glyph fragments that may be used to render the sign, and regardless of its contextual phonetic interpreta- tion. Each Khmer consonant either expresses its implicit vowel or is followed by a single dependent vowel sign character. Ordering of Syllable Components. The standard order of components in consonantal syl- lables as expressed in BNF is C ( SC C )* { S } { V } { O } where C is a consonant SC is a sign coeng (= virama) S is a consonant shift V is a dependent vowel sign O is any other sign For example, the word /khnyom/, “I”, would represented by the following sequence: U+1781 U+17D2 U+1789 U+17B9 U+17C6 Except for the presence of consonant shifters, this structure is analogous to Devanagari (see Section 9.1, Devanagari). Special Character Usage. Although U+17A3 and U+17A2 have identical glyphs, only U+17A2 should be used for modern text. U+17A3 and U+17A4 are solely for Pali/Sanskrit transliteration. U+1789 may have two different subscript forms. Furthermore, the glyph part of the base letter U+1789, which normally occurs under the baseline, is omitted whenever a subscript is conjoined. U+17B1 bears a close relationship to U+17B2 . The very commonly used word meaning “give” should be spelled U+17B2 U+17D2 U+1799, with rendering software optionally replacing the glyph for U+17B1 with the glyph for U+17B2. U+17D3 is an exceedingly rare sign used in historic lunar dates; it should not be confused with U+17C6 , which carries
252 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard South and Southeast Asian Scripts 9.15 Khmer an “m” sound and occurs frequently in modern Khmer text. U+17C6 - , U+17C7 , and U+17C8 are signs that occur in many permutations with vowels; they are not vowels themselves. Spacing. Khmer does not use any whitespace between words. If word boundary indications are desired—for example, for the use of automatic line layout algorithms—the character U+200B should be used to place invisible marks for such breaks. The can grow to have a visible width when justified.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 253 This PDF file is an excerpt from The Unicode Standard, Version 3.0, issued by the Unicode Consor- tium and published by Addison-Wesley. The material has been modified slightly for this online edi- tion, however the PDF files have not been modified to reflect the corrections found on the Updates and Errata page (see http://www.unicode.org/unicode/uni2errata/UnicodeErrata.html). More recent versions of the Unicode standard exist (see http://www.unicode.org/unicode/standard/versions/). Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial capital letters. However, not all words in initial capital letters are trademark designations. The authors and publisher have taken care in preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode®, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. Dai Kan-Wa Jiten used as the source of reference Kanji codes was written by Tetsuji Morohashi and published by Taishukan Shoten. ISBN 0-201-61633-5 Copyright © 1991-2000 by Unicode, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or other- wise, without the prior written permission of the publisher or Unicode, Inc. This book is set in Minion, designed by Rob Slimbach at Adobe Systems, Inc. It was typeset using FrameMaker 5.5 running under Windows NT. ASMUS, Inc. created custom software for chart layout. The Han radical-stroke index was typeset by Apple Computer, Inc. The following companies and organizations supplied fonts:
Apple Computer, Inc. Atelier Fluxus Virus Beijing Zhong Yi (Zheng Code) Electronics Company DecoType, Inc. IBM Corporation Monotype Typography, Inc. Microsoft Corporation Peking University Founder Group Corporation Production First Software Additional fonts were supplied by individuals as listed in the Acknowledgments. The Unicode® Consortium is a registered trademark, and Unicode™ is a trademark of Unicode, Inc. The Unicode logo is a trademark of Unicode, Inc., and may be registered in some jurisdictions. All other company and product names are trademarks or registered trademarks of the company or manufacturer, respectively. The publisher offers discounts on this book when ordered in quantity for special sales. For more information please contact: Corporate, Government, and Special Sales Addison Wesley Longman, Inc. One Jacob Way Reading, Massachusetts 01867 Visit A-W on the Web: http://www.awl.com/cseng/ First printing, January 2000. Chapter 10 East Asian Scripts 10
All of the writing systems of East Asia have as their root the ideographic script developed in China in the second millennium . Originally intended to write the various Chinese dia- lects, this script was exported throughout China’s sphere of influence and was used for cen- turies to write even non-Chinese languages such as Japanese, Korean, and Vietnamese. Speakers of other languages, such as Yi, created their own ideographs in imitation of Chinese. The Chinese language is largely monosyllabic and noninflecting, and an ideographic writ- ing system suits it well. Ideographs are less well suited for unrelated languages. In Japan, this problem was solved by creating two syllabic scripts, hiragana and katakana. In Korea, an alphabetic system was invented, with letters grouped into ideograph-like syllabic blocks called hangul. Ideographs continued to be the exclusive way of writing Vietnamese until the twentieth century, when they were replaced with an alphabet based on the Latin script. They are still extensively used in Japanese, where they are called kanji, and somewhat more rarely in Korean, where they are called hanja. In mainland China, the government has promoted the use of more modern, simplified forms of the ideographs over the older, more traditional forms used in Taiwan and overseas Chinese communities. Appendix A, Han Unification History, describes how the diverse typographic traditions of mainland China, Taiwan, Japan, Korea, and Vietnam have been reconciled to provide a common set of ideographs in the Unicode Standard for all of these languages and regions. The Unicode Standard includes a complete set of Korean hangul syllables, as well as the individual letters (jamos), which can be also be used to write Korean. Section 3.11, Conjoin- ing Jamo Behavior, describes how to use the conjoining jamos and how to convert between two methods for representing Korean. In all East Asian scripts, individual characters are written within uniformly sized squares. Diacritical marks are rarely used, although phonetic annotations are not uncommon. Tra- ditionally, East Asian scripts were written from the top of the page downward, with col- umns running right to left across the page. Under the influence of Western practices, a horizontal left-to-right writing sequence is now common. Many older character sets include characters intended to simplify the implementation of East Asian scripts, such as variant punctuation forms for text written vertically, halfwidth forms (which occupy only half a square), and fullwidth forms (which allow Latin letters to occupy a full square). These characters are included in the Unicode Standard for compati- bility with these older standards.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 257 10.1 Han East Asian Scripts
10.1 Han
CJK Unified Ideographs These blocks contain a set of Unified Han ideographic characters used in the written Chi- nese, Japanese, and Korean languages.1 The term Han, derived from the Chinese Han Dynasty, refers generally to Chinese traditional culture. The Han ideographic characters make up a coherent script, which was traditionally written vertically, with the vertical lines ordered from right to left. In modern usage, especially in technical works and in computer- rendered text, the Han script is written horizontally from left to right and is freely mixed with Latin or other scripts. When used in writing Japanese or Korean, the Han characters are interspersed with other scripts unique to those languages (Hiragana and Katakana for Japanese; Hangul syllables for Korean). The Han ideographic characters constitute a very large set, numbering in the tens of thou- sands. They have a long history of use in East Asia. Enormous compendia of Han ideo- graphic characters exist because of a continuous, millennia-long scholarly tradition of collecting all Han character citations, including variant, mistaken, and nonce forms, into annotated character dictionaries. Because of the large size of the Han ideographic character repertoire, and because of the particular problems the characters pose for standardizing their encoding, this character block description is more extended than that for other scripts and is divided into subsec- tions. The first three subsections (CJK Standards, Blocks, and Mapping to Standards) describe the character set standards used as sources, the way in which the Unicode Stan- dard divides ideographs into blocks, and some of the issues involved in mapping these characters to other character sets. These subsections are followed by an extended discus- sion of the characteristics of Han characters, with particular attention being paid to the problem of unification of encoding for characters used for different languages. Then there is a formal statement of the principles behind the Unified Han character encoding adopted in the Unicode Standard and order of their arrangement. For a detailed account of the background and history of development of the Unified Han character encoding, see also Appendix A, Han Unification History.
CJK Standards The Unicode Standard draws its Han character repertoire of 27,484 characters from a num- ber of character set standards. These standards are grouped into six sources, as indicated in Table 10-1. The primary work of unifying and ordering the characters from these sources was done by the Ideographic Rapporteur Group (IRG), a subgroup of ISO/IEC JTC1/SC2/ WG2. The G, T, J, K, and V sources represent the characters submitted to the IRG by its member bodies. The G source consists of submissions from mainland China, the Hong Kong SAR, and Singapore. The other four are the submissions from Taiwan, Japan, Korea, and Viet- nam, respectively. The U source represents character set standards that were not submitted to the IRG by any member body but which were used by the Unicode Consortium.
1. Although the term “CJK”—Chinese, Japanese, and Korean—is used throughout this text to describe the languages that currently use Han ideographic characters, it should be noted that earlier Vietnamese writing systems were based on Han ideographs. Conse- quently, the term “CJKV” would be more accurate in a historical sense. Han ideographs are still used for historical, religious, and pedagogical purposes in Vietnam.
258 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han
Table 10-1. Sources for Unified Han G source: G0 GB2312-80 G1 GB12345-90 with 58 Hong Kong and 92 Korean “Idu” characters G3 GB7589-87 unsimplified forms G5 GB7590-87 unsimplified forms G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified Hanzi GS Singapore Characters G8 GB8565-88 GE GB16500-95 T source: T1 CNS 11643-1992 1st plane T2 CNS 11643-1992 2nd plane T3 CNS 11643-1992 3rd plane with some additional characters T4 CNS 11643-1992 4th plane T5 CNS 11643-1992 5th plane T6 CNS 11643-1992 6th plane T7 CNS 11643-1992 7th plane TF CNS 11643-1992 15th plane J source: J0 JIS X 0208-1990 J1 JIS X 0212-1990 JA Unified Japanese IT Vendors Contemporary Ideographs, 1993 K source: K0 KS C 5601-1987 (unique ideographs) K1 KS C 5657-1991 K2 PKS C 5700-1 1994 K3 PKS C 5700-2 1994 V source: V0 TCVN 5773:1993 V1 TCVN 6056:1995 U source: KS C 5601-1987 (duplicate ideographs) ANSI Z39.64-1989 (EACC) Big-5 (Taiwan) CCCII, level 1 GB 12052-89 (Korean) JEF (Fujitsu) PRC Telegraph Code Taiwan Telegraph Code (CCDC) Xerox Chinese Han Character Shapes Permitted for Personal Names (Japan) IBM Selected Japanese and Korean Ideographs
In some cases, the entire ideographic repertoire of the original character set standards was not included in the corresponding source. Three reasons explain this decision: 1. Where the repertoires of two of the character set standards within a single source have considerable overlap, the characters in the overlap might be included only once in the source. This approach is used, for example, with GB 2312-80 and GB 12345-90, which have many ideographs in common. Charac- ters in GB 12345-90 that are duplicates of characters in GB 2312-80 are not included in the G source. 2. Where a character set standard is based on unification rules that differ substan- tially from those used by the IRG, many variant characters found in the charac- ter set standard will not be included in the source. This situation is the case with CNS 11643-1992, EACC, and CCCII. It is the only case where full round- trip compatibility with the Han ideograph repertoire of the relevant character set standards is not guaranteed. 3. KS C 5601-1987 contains numerous duplicate ideographs included because they have multiple pronunciations in Korean. These multiply-encoded ideo- graphs are not included in the K source but are included in the U source to pro- vide full round-trip compatibility with KS C 5601-1987.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 259 10.1 Han East Asian Scripts
Blocks Ideographs are found in three blocks of the Unicode Standard: the CJK Unified Ideographs block (U+4E00–U+9FFF, common ideographs), the CJK Unified Ideographs Extension A block (U+3400–U+4DFF, rare ideographs), and the CJK Compatibility Ideographs block (U+F900–U+FAFF, duplicates, unifiable variants, and corporate characters). Characters in the CJK Unified Ideographs and CJK Unified Ideographs Extension A blocks are defined by the IRG and are derived entirely from the G, T, J, K, and V sources. The CJK Unified Ideographs block represents characters submitted to the IRG prior to 1992 and consists of commonly used characters. Characters in the CJK Unified Ideographs Extension A block are rarer, were submitted to the IRG between 1992 and 1998, and are not unifiable with characters in the CJK Unified Ideographs block. The only difference in the unification work done by the IRG on these two blocks is that the source separation rule was applied to the CJK Unified Ideographs block and not to the CJK Unified Ideographs Extension A block. This rule states that ideographs that are distinct in a source must not be unified. (For further discussion, see the subsection “Principles” later in this section.) Characters unique to the U source are found in the CJK Compatibility Ideographs block. There are 12 of these characters: U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29. The remaining characters in the CJK Compatibility Ideographs block are either duplicates or unifiable variants of char- acters in the one of the other blocks and are included in the Unicode Standard for reasons of round-trip compatibility.
General Characteristics of Han Ideographs The authoritative Japanese dictionary Kouzien defines Han characters to be characters that originated among the Chinese to write the Chinese lan- guage. They are now used in China, Japan, and Korea. They are logo- graphic (each character represents a word, not just a sound) characters that developed from pictographic and ideographic principles. They are also used phonetically. In Japan they are generally called kanzi (Han, that is, Chinese, characters) including the “national characters” (kokuzi) such as touge (mountain pass), which have been created using the same prin- ciples. They are also called mana (true names, as opposed to kana, false or borrowed names).1 For many centuries, written Chinese was the accepted written standard throughout East Asia. The influence of the Chinese language and its written form on the modern East Asian languages is similar to the influence of Latin on the vocabulary and written forms of lan- guages in the West. This influence is immediately visible in the mixture of Han characters and native phonetic scripts (kana in Japan, hangul in Korea) as now used in the orthogra- phies of Japan and Korea (see Table 10-2). The evolution of character shapes and semantic drift over the centuries has resulted in changes to the original forms and meanings. For example, the Chinese character tang (Japanese tou or yu, Korean thang), which originally meant “hot water,” has come to mean “soup” in Chinese. “Hot water” remains the primary meaning in Japanese and Korean, whereas “soup” appears in more recent borrowings from Chinese, such as “soup noodles”
1. Lee Collins’ translation from the Japanese, Kouzien, Izuru, Shinmura, ed. (Tokyo: Iwan- ami Syoten, 1983).
260 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han
Table 10-2. Common Han Characters
English Han Character Chinesea Japanese Korean Translation tian1 ten, ame chen heaven, sky
di4 ti, tuti ci earth, ground
ren2 zin, hito in man, person
shan1 san, yama san mountain
shui3 sui, mizu swu water
shang4 zyou, ue sang above
xia4 ka, sita ha below
a. The superscripted numbers in this table represent Chinese (Mandarin) tone marks.
(Japanese tanmen; Korean thangmyen). Still, the identical appearance and similarities in meaning are dramatic and more than justify the concept of a unified Han script that tran- scends language. The “nationality” of the Han characters became an issue only when each country began to create coded character sets (for example, China’s GB 2312-80, Japan’s JIS X 0208-1978, and Korea’s KS C 5601-87) based on purely local needs. This problem appears to have arisen more from the priority placed on local requirements and lack of coordination with other countries, rather than out of conscious design. Nevertheless, the identity of the Han char- acters is fundamentally independent of language, as shown by dictionary definitions, vocabulary lists, and encoding standards. Terminology. Several standard romanizations of the term used to refer to East Asian ideo- graphic characters are commonly used. They include hanzi (Chinese), kanzi (Japanese), kanji (colloquial Japanese), hanja (Korean), and ChÔ hán (Vietnamese). The standard English translations for these terms are interchangeable: Han character, Han ideographic character, East Asian ideographic character, or CJK ideographic character. For the purpose of clarity, the Unicode Standard uses some subset of the English terms when referring to these characters. The term Kanzi is used in reference to a specific Japanese government publication. The unrelated term KangXi (which is a Chinese reign name, rather than another romanization of “Han character”) is used only when referring to the dictionary on which the Unified Repertoire and Ordering, Version 2.0, was based. Distinguishing Han Character Usage Between Languages. There is some concern that unifying the Han characters may lead to confusion because they are sometimes used differ- ently by the various East Asian languages. Computationally, Han character unification pre- sents no more difficulty than employing a single Latin character set that is used to write languages as different as English and French. Programmers do not expect the characters “c”, “h”, “a”, and “t” alone to tell us whether chat is a French word for cat or an English word meaning “informal talk.” Likewise, we depend on context to identify the American hood (of a car) with the British bonnet. Few computer users are confused by the fact that ASCII can also be used to represent such words as the Welsh word ynghyd, which are strange look- ing to English eyes. Although it would be convenient to identify words by language for pro- grams such as spell-checkers, it is neither practical nor productive to encode a separate Latin character set for every language that uses it.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 261 10.1 Han East Asian Scripts
Similarly, the Han characters are often combined to “spell” words whose meaning may not be evident from the constituent characters. For example, the two characters “to cut” and “hand” mean “postal stamp” in Japanese, but the compound may appear to be nonsense to a speaker of Chinese or Korean (see Figure 10-1). Figure 10-1. Han Spelling
+ 1. Japanese “stamp” = to cut hand 2. Chinese “cut hand”
Even within one language, a computer requires context to distinguish the meanings of words represented by coded characters. The word chuugoku in Japanese, for example, may refer to China or to a district in central west Honshuu (see Figure 10-2). Figure 10-2. Context for Characters
+ 1. China = middle country 2. Chuugoku district of Honshuu
Coding these two characters as four so as to capture this distinction would probably cause more confusion and still not provide a general solution. The Unicode Standard leaves the issues of language tagging and word recognition up to a higher level of software and does not attempt to encode the language of the Han characters. Sorting Han Ideographs. The Unicode Standard does not define a method by which ideo- graphic characters are sorted; the requirements for sorting differ by locale and application. Possible collating sequences include phonetic, radical-stroke (KangXi, Xinhua Zidian, and so on), four-corner, and total stroke count. Raw character codes alone are seldom sufficient to achieve a usable ordering in any of these schemes; ancillary data are usually required. (See Table 10-5.) Character Glyphs. In form, Han characters are monospaced. Every character takes the same vertical and horizontal space, regardless of how simple or complex its particular form is. This practice follows from the long history of printing and typographical practice in China, which traditionally placed each character in a square cell. When written vertically, there are also a number of named cursive styles for Han characters, but the cursive forms of the characters tend to be quite idiosyncratic and are not implemented in general-purpose Han character fonts for computers. There may be a wide variation in the glyphs used in different countries and for different applications. The most commonly used typefaces in one country may not be used in others. The types of glyphs used to depict characters in the Han ideographic repertoire of the Uni- code Standard have been constrained by available fonts. Users are advised to consult authoritative sources for the appropriate glyphs for individual markets and applications. It is assumed that most Unicode implementations will provide users with the ability to select the font (or mixture of fonts) that is most appropriate for a given locale.
Principles Three-Dimensional Conceptual Model. To develop the explicit rules for unification, a conceptual framework was developed to model the nature of Han ideographic characters. This model expresses written elements in terms of three primary attributes: semantic
262 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han
(meaning, function), abstract shape (general form), and actual shape (instantiated, type- face form). These attributes are graphically represented in three dimensions according to the X, Y, and Z axes (see Figure 10-3). Figure 10-3. Three-Dimensional Conceptual Model
1 2
Y (abstract shape) Z (type face)
X (semantic)
The semantic attribute (represented along the X axis) distinguishes characters by meaning and usage. Distinctions are made between entirely unrelated characters such as ! (marsh) and (machine) as well as extensions or borrowings beyond the original semantic cluster such as 1 (a phonetic borrowing used as a simplified form of ) and 2 (table, the orig- inal meaning). The abstract shape attribute (the Y axis) distinguishes the variant forms of a single charac- ter with a single semantic attribute (that is, a character with a single position on the X axis). The actual shape (typeface) attribute (the Z axis) is for differences of type design (the actual shape used in imaging) of each variant form. Only characters that have the same abstract shape (that is, occupy a single point on the X and Y axes) are potential candidates for unification. Z axis typeface and stylistic differences are generally ignored. Unification Rules. The following rules were applied during the process of merging Han characters from the different source character sets: R1 Source Separation Rule. If two ideographs are distinct in a primary source stan- dard, then they are not unified. • This rule is sometimes called the round-trip rule because its goal is to facilitate a round-trip conversion of character data between an IRG source standard and the Unicode Standard without loss of information. • This rule was applied only for the work on the original Unified Repertoire and Ordering (URO). The IRG dropped this rule in 1992 and will not use it in future work. For example, the ideographs from the URO shown in Figure 10-4 would normally be sub- ject to unification by rule R3; however, their unification is prevented because they are dis- tinct in the primary source standard J0 (JIS X 0208-1990). R2 Noncognate Rule. In general, if two ideographs are unrelated in historical deriva- tion (noncognate characters), then they are not unified.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 263 10.1 Han East Asian Scripts
Figure 10-4. Preserving Variants
“sword”
For example, the following ideographs (in Figure 10-5), although visually quite similar, are nevertheless not unified because they are historically unrelated and have distinct meanings. Figure 10-5. Not Cognates, Not Unified ≠ earth warrior, scholar
R3 By means of a two-level classification (described next), the abstract shape of each ideograph is determined. Any two ideographs that possess the same abstract shape are then unified provided that their unification is not disallowed by either the source separation rule or the noncognate rule. Two-Level Classification. Using the three-dimensional model, characters are analyzed in a two-level classification. The two-level classification distinguishes characters by abstract shape (Y axis) and actual shape of a particular typeface (Z axis). Variant forms are identi- fied based on the difference of abstract shapes. To determine differences in abstract shape and actual shape, the structure and features of each component of an ideograph are analyzed as follows. Ideograph Component Structure. The component structure of each ideograph is exam- ined. A component is a geometrical combination of primitive elements. Various ideo- graphs can be configured with these components used in conjunction with other components. Some components can be combined to make a component more complicated in its structure. Therefore, an ideograph can be defined as a component tree with the entire ideograph as the root node and with the bottom nodes consisting of primitive elements (see Figure 10-6 and Figure 10-7). Figure 10-6. Component Structure
Figure 10-7. The Most Superior Node of a Component
vs.
vs.
vs.
264 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han
Ideograph Features. The following features of each ideograph to be compared are exam- ined: • Number of components • Relative position of components in each complete ideograph • Structure of a corresponding component • Treatment in a source character set • Radical contained in a component Uniqueness. If one or more of these features are different between the ideographs com- pared, the ideographs are considered to have different abstract shapes and therefore are considered unique characters and are not unified. Unification. If all of these features are identical between the ideographs, the ideographs are considered to have the same abstract shape and are therefore unified. The examples in Table 10-3 represent some typical differences in abstract character shape. The ideographs are therefore not unified. Table 10-3. Ideographs Not Unified
Characters Reason D Å E Different Number of Components F Å G Same Number of Components Placed in Different Relative Position H Å I Same Number and Same Relative Position of Components, Corresponding Components Structure Differently J Å K Characters Treated Differently in a Source Character Set L Å M Characters with Different Radical in a Component N Å O Same Abstract Shape, Difference in Actual Shape
Differences in actual shape of ideographs that have been unified are illustrated in Table 10-4. Table 10-4. Ideographs Unified
Characters Reason P Q Different Writing Sequence T U Differences in Overshoot at the Stroke Termination V W Differences in Contact of Strokes X Y Differences in Protrusion at the Folded Corner of Strokes Z [ Differences in Bent Strokes \ ] Differences in Stroke Termination 3 4 Differences in Accent at the Stroke Initiation a 7 Difference in Rooftop Modification R S Difference in Rotated Strokes/Dotsa a. These ideographs (having the same abstract shape) would have been unified except for the source separation rule.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 265 10.1 Han East Asian Scripts
Han Ideograph Arrangement. The arrangement of the Unicode Han characters is based on the position of characters as they are listed in four major dictionaries. The KangXi Zidian was chosen as primary because it contains most of the source characters and because the dictionary itself and the principles of character ordering it employs are commonly used throughout East Asia. The Han ideograph arrangement follows the index (page and position) of the dictionaries listed in Table 10-5 with their priorities. Table 10-5. Han Ideograph Arrangement
Priority Dictionary City Publisher Version 1 KangXi Zidian Beijing Zhonghua Bookstore, 1989 7th edition 2 Dai Kan-Wa Jiten Tokyo Taishuukan Shoten, 1986 Revised edition 3 Hanyu Da Zidian Chengdu Sichuan Cishu Publishing, 1986 1st edition 4 Dae Jaweon Seoul Samseong Publishing Co. Ltd, 1988 1st edition
When a character is found in the KangXi Zidian, it follows the KangXi Zidian order. When it is not found in the KangXi Zidian and it is found in Dai Kan-Wa Jiten, it is given a posi- tion extrapolated from the KangXi position of the preceding character in Dai Kan-Wa Jiten. When it is not found in either KangXi or Dai Kan-Wa, then the Hanyu Da Zidian and Dae Jaweon dictionaries are consulted in a similar manner. Ideographs with simplified KangXi radicals are placed in a group following the traditional KangXi radical from which the simplified radical is derived. For example, characters with the simplified radical corresponding to KangXi radical follow the last nonsimplified character having as a radical. The arrangement for these simplified characters is that of the Hanyu Da Zidian. The few characters that are not found in any of the four dictionaries are placed following characters with the same KangXi radical and stroke count. The radical-stroke order that results is a culturally neutral order. It does not exactly match the order found in common dictionaries. Information for sorting all CJK ideographs by the radical-stroke method is found on the CD-ROM. It should be used if characters from the CJK Unified Ideographs and CJK Unified Ideographs Extension A blocks and compatibility ideographs are to be properly interleaved. The form of the charts for the CJK Unified Ideographs block is described in the introduc- tion to Chapter 14, Code Charts. A full radical-stroke index is also provided in Section 15.1, Han Radical-Stroke Index, to help users locate characters in the main charts.
Mapping to Standards The mappings defined by the IRG between the ideographs in the Unicode Standard and the IRG sources are included on the CD-ROM. These mappings are considered to be norma- tive parts of ISO/IEC 10646-1; that is, the characters are defined to be the targets for conver- sion of these characters in these character set standards. These mappings have as their source editions of the relevant national standards that were provided directly to the IRG by its member bodies. In some cases, these editions differ from the published editions generally available. The mappings defined by the IRG are considered normative for ISO/IEC 10646-1, and must be considered normative for the Unicode Standard as well. Because they may not match the mappings derived from published editions of the standards, developers of indi- vidual mapping tables, which are intended to handle real-life intercharacter set mappings,
266 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han may choose to use alternative mappings more directly correlated with published character set editions. The Unicode Consortium provides tables indicating where the variant map- pings may be desirable. Specialized conversion systems may also choose more sophisticated mapping mecha- nisms—for example, semantic conversion, variant normalization, or conversion between simplified and traditional Chinese. The Unicode Consortium also provides mapping information that extends beyond the normative mappings defined by the IRG. These additional mappings include mappings to character set standards included in the U source, including duplicate characters from KS C 5601-1987, mappings to portions of character set standards omitted from IRG sources, ref- erences to standard dictionaries, and suggested character/stroke counts.
CJK Compatibility Ideographs: U+F900–U+FAFF The Korean national standard KS C 5601-1987, which served as one of the primary source sets for the Unified CJK Ideograph Repertoire and Ordering, Version 2.0, contains 268 duplicate encodings of identical ideograph forms to denote alternative pronunciations. That is, in certain cases, the standard encoded a single character multiple times to denote different linguistic uses. This approach is like encoding the letter “a” five times to denote the different pronunciations it has in the words hat, able, art, father, and adrift. They are in all ways identical in shape to their nominal counterparts, and so were excluded by the IRG from its sources. For round-trip conversion with KS C 5601-1987, they are encoded sepa- rately from the primary CJK Unified Ideographs block. In addition, another 34 ideographs from various regional and industry standards were encoded in this block, primarily to achieve round-trip conversion compatibility. Twelve of these 34 ideographs (U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, and U+FA29) are not encoded in the CJK Unified Ideographs Areas. These 12 characters are not duplicates and should be treated as a small extension to the set of unified ideographs.
Kanbun: U+3190–U+319F This block contains a set of Kanbun marks used in Japanese texts to indicate the Japanese reading order of classical Chinese texts. They are not encoded in any current character encoding standards but are widely used in literature. They are typically written in an anno- tation style to the left of each line of vertically rendered Chinese text. See also enclosed CJK letters and months (U+3200..U+32FF) and CJK compatibility (U+3300..U+33FF).
CJK and KangXi Radicals: U+2E80–U+2FD5 East Asian ideographic radicals are ideographs or fragments of ideographs used to index dictionaries and word lists, and as the basis for creating new ideographs. The term radical comes from the Latin radix, meaning root, and refers to the part of the character under which the character is classified in dictionaries. Section 15.1, Han Radical-Stroke Index, pro- vides information on how to use radical-stroke lookup to locate ideographs encoded in the Unicode Standard.
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 267 10.1 Han East Asian Scripts
There is no single radical set in general use throughout East Asia; however, the set of 214 radicals used in the eighteenth-century KangXi dictionary is universally recognized. The visual appearance of radicals is often very different when they are used as radicals from what it is when they are stand-alone ideographs. Indeed, many radicals have multiple graphic forms when used as parts of characters. A standard example is the water radical, which is written $ when an ideograph and generally % when part of an ideograph. The Unicode Standard includes two blocks of encoded radicals: The KangXi Radicals block (U+2F00 through U+2FD5), which contains the base forms for the 214 radicals, and the CJK Radicals Extension block (U+2E80 through U+2EF3), which contains a set of variant shapes taken by the radicals either when they occur as parts of characters or are used for simplified Chinese. These variant shapes are commonly found as independent and distinct characters in dictionary indices—such as for the radical-stroke charts in the Unicode Stan- dard. As such, they have not been subject to the usual unification rules used for other char- acters in the standard. Most of the radicals in the CJK and KangXi Radicals blocks are also part of the CJK Unified Ideographs block of the Unicode Standard. Radicals that have one graphic form as an ideo- graph and another as part of an ideograph are generally encoded in both forms in the CJK Unified Ideographs block (such as U+6C34 and U+6C35 for the water radical). Standards. CNS 11643-1992 includes a block of radicals separate from its ideograph block. This block includes of 212 of the 214 KangXi radicals. These characters are included in the KangXi Radicals block. Those radicals that are ideographs in their own right have a definite meaning and are usu- ally referred to by that meaning. Accordingly, most of the characters in the KangXi Radicals block have been assigned names reflecting their meaning. The other radicals have been given names based on their shape. Semantics. Characters in the CJK and KangXi Radicals blocks should never be used as ideographs. They have different properties and meaning. U+2F00 is not equivalent to U+4E00 4E00. The former is to be treated as a symbol, the latter as a word or part of a word. It is necessary to make a semantic distinction between a character used as an ideograph and the same character used as a radical. To emphasize this difference, radicals may be given a distinct font style from their ideographic counterparts.
Ideographic Description: U+2FF0–U+2FFB Although the Unicode Standard includes over 27,000 ideographs, many thousands of extremely rare ideographs were nevertheless left unencoded. As an example, nearly half of the characters in the KangXi dictionary are unencoded. Research into cataloging additional ideographs for encoding continues, but it is anticipated that at no point will the entire set of potential, encodable ideographs be completely exhausted. In particular, ideographs con- tinue to be coined and such new coinages will invariably be unencoded. The 12 characters in the Ideographic Description block provide a mechanism for the stan- dard interchange of text that must reference unencoded ideographs. Unencoded ideo- graphs can be described using these characters and encoded ideographs; the reader can then create a mental picture of the ideographs from the description. This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideo- graphs; there is no equivalence defined for described ideographs. Conceptually, ideograph
268 Copyright © 1991-2000 by Unicode, Inc. The Unicode Standard East Asian Scripts 10.1 Han descriptions are more akin to the English phrase, “an ‘e’ with an acute accent on it,” than to the character sequence “U+006E U+0301.” In particular, support for the characters in the Ideographic Description block does not require the rendering engine to recreate the graphic appearance of the described character. Note also that many of the ideographs that users might represent using the Ideographic Description characters will be formally encoded in future versions of the Unicode Stan- dard. The Ideographic Description algorithm depends on the fact that virtually all CJK ideo- graphs can be broken down into smaller pieces which are themselves ideographs. The broad coverage of the ideographs already encoded in the Unicode Standard implies that the vast majority of unencoded ideographs can be represented using the Ideographic Descrip- tion characters. Ideographic Description Sequences. Ideographic Description Sequences are defined by the following grammar. See Section 0.2, Notational Conventions, for the notational conventions used here. IDS ::= UnifiedIdeograph | Radical | BinaryDescriptionOperator IDS IDS | TrinaryDescriptionOperator IDS IDS IDS BinaryDescriptionOperator ::= U+2FF0 | U+2FF1 | U+2FF4 | U+2FF5 | U+2FF6 | U+2FF7 | U+2FF8 | U+2FF9 | U+2FFA | U+2FFB TrinaryDescriptionOperator::= U+2FF2 | U+2FF3 Radical ::= U+2E80 | U+2E81 | ... | U+2EF2 | U+2EF3 | U+2F00 | U+2F01 | ... | U+2FD4 | U+2FD5 UnifiedIdeograph ::= U+3400 | U+3401 | ... | U+4DB4 | U+4DB5 | U+4E00 | U+4E01 | ... | U+9FA4 | U+9FA5 | U+FA0E |U+FA0F | U+FA11 | U+FA13 | U+FA14 | U+FA1F |U+FA21 | U+FA23 | U+FA24 | U+FA27 | U+FA28 |U+FA29 In addition to the above grammar, Ideographic Description Sequences have two additional length constraints: • No sequence can be longer than 16 Unicode scalar values in length. • No sequence can contain more than six UnifiedIdeographs in a row without an intervening Ideographic Description character. A sequence of characters that includes Ideographic Description characters but does not conform to the above grammar and length constraints is not an Ideographic Description Sequence. The operators indicate the relative graphic positions of the operands running from left to right and from top to bottom. Note that non-unique compatibility ideographs (U+F900 through U+FA2D, but not U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, or U+FA29) are not counted as unified ideographs for the purposes of this grammar, although they do have the ideographic property (see Section 4.10, Letters and Other Useful Properties). Non-unique compatibility ideographs are excluded from Ideo- graphic Description Sequences to incrementally reduce the ambiguity of such sequences. Figure 10-8 illustrates the use of this grammar to provide descriptions of unencoded ideo- graphs. A user wishing to represent an unencoded ideograph will need to analyze its structure to determine how to describe it using an Ideographic Description Sequence. As a rule, it is best to use the natural radical-phonetic division for an ideograph if it has one and to use as
The Unicode Standard 3.0 Copyright © 1991-2000 by Unicode, Inc. 269 10.1 Han East Asian Scripts
Figure 10-8. Using the Ideographic Description Characters